Build your first decision tree model with tidymodels and caret
Machine Learning for Everyone
Welcome to interactive exploratory data analysis and machine learning
R Machine Learning Essential Training: Part 2
The Heart Disease dataset: Decision Trees
All the necessary packages are installed for you!
We will be examining the Heart disease data set in this laboratory.
- save the data into an object
Check for missing data
- it looks like we don't have any; let's keep exploring
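A quick sketch of the missing-data check, assuming the data has been saved into an object called `heart` (the object and file names are assumptions, not from the notes):

```r
# Read the data (file name is illustrative)
heart <- read.csv("heart.csv")

colSums(is.na(heart))  # NA count per column
sum(is.na(heart))      # total NA count; 0 means no missing data
```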
- look at the first few rows of the data using head()
- we observe that the dataset is a mixture of numerical and categorical data
- look in more detail at the structure of the dataset
- we don't want to work with characters but rather with factors
- we use mutate_if() from dplyr - fill the blank with as.factor
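The filled-in conversion step could look like this (assuming the data object is called `heart`):

```r
library(dplyr)

# Convert every character column to a factor
heart <- heart %>% mutate_if(is.character, as.factor)
str(heart)  # the former character columns now show as Factor
```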
define a custom function for categorical data plots
- the function takes 4 arguments: mydata, variable, ylab and xlab
test the function
- now do a plot for RestingECG and Angina using the same function
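One possible sketch of such a helper; the function name `plot_cat` and its body are assumptions, and the column names are taken from the notes:

```r
library(ggplot2)

# Bar plot of a categorical column, passed by name as a string
plot_cat <- function(mydata, variable, ylab, xlab) {
  ggplot(mydata, aes(x = .data[[variable]])) +
    geom_bar() +
    labs(y = ylab, x = xlab)
}

plot_cat(heart, "RestingECG", "Count", "Resting ECG")
plot_cat(heart, "Angina", "Count", "Angina")
```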
Histograms
- the trick I always use is to melt the numeric data (change it from wide to long, then plot a faceted histogram)
- so you select the numeric columns (is.numeric) - one could also use a custom function or a loop
- set bins = 30
- we use the function gather(), which returns value and key columns - so you will facet by key and set x = value
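Putting those steps together, assuming the data object is called `heart`:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

heart %>%
  select_if(is.numeric) %>%                  # keep only numeric columns
  gather(key = "key", value = "value") %>%   # wide -> long
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ key, scales = "free")         # one panel per original column
```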
Classes for our target response
- our target variable is HeartDisease; visualise this
- change the response to a factor, since this is a requirement
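Both steps in one sketch (assuming the data object is called `heart`):

```r
library(dplyr)
library(ggplot2)

# The response must be a factor for classification
heart <- heart %>% mutate(HeartDisease = as.factor(HeartDisease))

ggplot(heart, aes(x = HeartDisease)) +
  geom_bar()   # class counts for the target
```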
Some inferential statistics
- at this point, one would want to know whether there is any association between the response and the nominal variables
- the chi-squared test comes to the rescue - include the categorical variables (HeartDisease, Angina, sex and RestingECG)
- fill in the gap for the missing variables
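A sketch of the tests; the column names are copied from the notes and their exact spelling/case may differ in the actual data:

```r
# Chi-squared test of independence between the response and each nominal variable
chisq.test(table(heart$HeartDisease, heart$Angina))
chisq.test(table(heart$HeartDisease, heart$sex))
chisq.test(table(heart$HeartDisease, heart$RestingECG))
```

A small p-value suggests an association between that variable and HeartDisease.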
Determine correlations
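One minimal way to get a correlation matrix for the numeric columns (assuming the data object is called `heart`):

```r
library(dplyr)

heart %>%
  select_if(is.numeric) %>%
  cor(use = "pairwise.complete.obs") %>%  # tolerate any NAs
  round(2)
```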
Model building
- split your data into a 70%/30% proportion - set the proportion to 0.7
- set strata to HeartDisease, ensuring stratified sampling
- set the seed to 1234 for reproducibility
- use the testing() and training() functions to pull the required datasets from the split object
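The split described above, as a sketch (object names `split`, `train` and `test` follow the notes):

```r
library(rsample)  # part of tidymodels

set.seed(1234)                               # reproducibility
split <- initial_split(heart,
                       prop   = 0.7,         # 70%/30% split
                       strata = HeartDisease) # stratified sampling
train <- training(split)
test  <- testing(split)
```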
Create a recipe (feature engineering)
- use step_dummy() to turn all categorical data into dummy variables - set data = train
apply to training data
- to apply the recipe to the original train data, set new_data = NULL
apply to testing data
- set new_data = test
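A sketch of the recipe; the object names (`rec`, `train_baked`, `test_baked`) are assumptions:

```r
library(recipes)

rec <- recipe(HeartDisease ~ ., data = train) %>%
  step_dummy(all_nominal_predictors()) %>%  # categorical -> dummy variables
  prep()

train_baked <- bake(rec, new_data = NULL)  # recipe applied to the train data
test_baked  <- bake(rec, new_data = test)  # same recipe applied to test
```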
Fitting models using 2 different approaches
- HeartDisease ~ . : the dot implies we want to include all variables; otherwise we could use HeartDisease ~ var1 + var2 + ...
- fit the model using the train data - set the maximum tree depth to 8 (tweak it to see the effects of changing it)
- set cp = 0.001, or cost_complexity (also tweak it to see the effects of changing it)
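The first approach, sketched with rpart directly (the object name `fit` is an assumption; the notes drive rpart through caret, which passes these same control parameters along):

```r
library(rpart)

fit <- rpart(
  HeartDisease ~ .,        # dot = include all predictors
  data    = train,
  method  = "class",
  control = rpart.control(maxdepth = 8, cp = 0.001)
)
```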
One way around this would be to use the parsnip package, where you would use decision_tree() to create a decision tree model specification.
- initialise a model specification
- set the mode to classification
- set the engine to rpart
- here, though, we will use the rpart function through caret rather than parsnip
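For reference, the parsnip approach described above would look roughly like this (object names are assumptions):

```r
library(parsnip)

tree_spec <- decision_tree(tree_depth = 8, cost_complexity = 0.001) %>%
  set_mode("classification") %>%  # classification, not regression
  set_engine("rpart")             # rpart does the actual fitting

tree_fit <- fit(tree_spec, HeartDisease ~ ., data = train)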
Plot the tree
- set cex = 0.8
- for the rest of the following parts, we will use the results from caret
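A minimal plotting sketch, assuming `fit` is the rpart tree fitted earlier (for a caret object, the underlying tree is in `$finalModel`):

```r
library(rpart.plot)

rpart.plot(fit, cex = 0.8)  # cex controls the label text size
```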
Prune the tree
- pruning is a data compression technique that reduces the size of a decision tree by removing non-critical and redundant sections
- the purpose is to reduce the complexity of the classifier
- change cp = 0.05
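A sketch of the pruning step, assuming `fit` is the rpart tree from before:

```r
# A larger cp prunes more aggressively, giving a smaller tree
pruned <- rpart::prune(fit, cp = 0.05)
rpart.plot::rpart.plot(pruned, cex = 0.8)
```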
Check model performance
- predict on the test data
Combine with the test set
- bind_cols() with the predictions
Look at the confusion matrix
- truth = HeartDisease, our target variable - set the estimate to estimate, which is our predicted variable
A well-performing model would ideally have high numbers along the diagonal (top-left to bottom-right) with small numbers on the off-diagonal.
If you want a more visual representation of the confusion matrix, you can pipe the result of conf_mat() into autoplot() to generate a ggplot2 chart.
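The whole evaluation flow, sketched with yardstick; it assumes `fit` is the rpart tree fitted earlier and the predicted column is named `estimate` as in the notes:

```r
library(dplyr)
library(yardstick)

# Predict on the test data and combine with the test set
predictions <- test %>%
  mutate(estimate = predict(fit, newdata = test, type = "class"))

# Confusion matrix, plotted version, and accuracy
predictions %>% conf_mat(truth = HeartDisease, estimate = estimate)
predictions %>% conf_mat(truth = HeartDisease, estimate = estimate) %>% autoplot()
predictions %>% accuracy(truth = HeartDisease, estimate = estimate)
```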
We can also calculate various performance metrics. One of the most common metrics is accuracy, which is how often the model predicted correctly, as a percentage.
- our model performed well, with 85% accuracy. Great!
- lung is a dataset in the survival package - explore the data
- determine the missing values
- impute the missing values however you like
- use status as your response, then find the proportion in each category
- plot the proportions
- now do the same for sex and plot it
- fit a decision tree with caret
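A starting point for this exercise (exploration only, so the remaining steps are left to you):

```r
library(survival)

data(lung)                    # load the lung dataset
str(lung)                     # explore the structure
colSums(is.na(lung))          # missing values per column

# Proportion of each status category
prop.table(table(lung$status))
```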