Build your first decision tree model with tidymodels and caret
Machine Learning for Everyone
Welcome to interactive exploratory data analysis and machine learning
R Machine Learning Essential Training: Part 2
The Heart Disease dataset: Decision Trees
All necessary packages are installed for you!
We will be examining the Heart data set for this laboratory.
- save the data into an object
Check for missing data
It looks like we don't have any! Let's keep exploring.
- look at the first few rows of the data using head()
- we observe here that the dataset is a mixture of numerical and categorical data
- look in more detail at the structure of the dataset
- we don't want to work with characters but rather factors
- we use mutate_if() from dplyr
- fill the blank with as.factor
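A minimal sketch of this step, assuming the data has been saved in an object called heart:

```r
library(dplyr)

# Convert every character column to a factor
heart <- heart %>%
  mutate_if(is.character, as.factor)
```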
define a custom function for categorical data plots
- the function takes four arguments: mydata, variable, ylab and xlab
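One possible sketch of such a helper (the name plot_categorical is my own; the argument names follow the lab):

```r
library(ggplot2)

# Hypothetical helper: bar plot of a single categorical variable
plot_categorical <- function(mydata, variable, ylab, xlab) {
  ggplot(mydata, aes(x = {{ variable }})) +
    geom_bar() +
    labs(y = ylab, x = xlab)
}

# Example usage (assuming a `heart` data frame with a Sex column):
# plot_categorical(heart, Sex, ylab = "Count", xlab = "Sex")
```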
test the function
- now do a plot for RestingECG and Angina using the same function
Histograms
- the trick I always use is to melt the numeric data (change from wide to long, then plot a facetted histogram)
- so you select the columns that are numeric (is.numeric) - one could also use a custom function or a loop
- set bins = 30
- we use the function gather(), which returns value and key columns
- so you will facet by key and set x = value
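The steps above can be sketched as follows, assuming the data is stored in heart:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

heart %>%
  select_if(is.numeric) %>%                 # keep only numeric columns
  gather(key = "key", value = "value") %>%  # wide -> long
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ key, scales = "free")        # one panel per variable
```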
Classes for our target response
- our target variable is HeartDisease
visualise this
- change the response to a factor, since this is a requirement
Some inferential statistics
- at this point, one would want to know if there is any association between the response and the nominal variables
- the chi-squared test comes to the rescue
- include the categorical variables (HeartDisease, Angina, Sex and RestingECG)
- fill in the gap for missing variables
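As a sketch, a single chi-squared test between the response and one nominal variable could look like this (assuming both columns exist in heart):

```r
# Test for association between HeartDisease and Sex
chisq.test(table(heart$HeartDisease, heart$Sex))
```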
Determine correlations
Model building
- split your data into a 70%/30% proportion
- set the proportion to 0.7
- set strata to HeartDisease, ensuring stratified sampling
- set the seed to 1234 for reproducibility
- use the testing() and training() functions to pull the required datasets from the split object
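A sketch of the split, assuming the rsample package (part of tidymodels) and a heart data frame:

```r
library(rsample)

set.seed(1234)                                   # reproducibility
heart_split <- initial_split(heart, prop = 0.7,  # 70%/30% split
                             strata = HeartDisease)
train <- training(heart_split)
test  <- testing(heart_split)
```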
Create a recipe (feature engineering)
- use step_dummy() to turn all categorical data into dummy variables
- set data = train
apply to training data
- to apply the recipe to the original train data set, set new_data = NULL
apply to testing data
- set new_data = test
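A sketch of the recipe, assuming train and test from the split above (all_nominal_predictors() requires a recent version of recipes):

```r
library(recipes)

heart_rec <- recipe(HeartDisease ~ ., data = train) %>%
  step_dummy(all_nominal_predictors()) %>%  # categorical -> dummy variables
  prep()

train_baked <- bake(heart_rec, new_data = NULL)  # original training data
test_baked  <- bake(heart_rec, new_data = test)  # apply to testing data
```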
Fitting models using 2 different approaches
- HeartDisease ~ . , where the dot implies we want to include all variables; otherwise we could use HeartDisease ~ var1 + var2 + ...
- fit the model using the train data
- set the maximum tree depth to 8 (tweak it to see the effects of changing it)
- set cp = 0.001, i.e. the cost_complexity (also tweak it to see the effects of changing it)
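The first approach, fitting rpart directly, might be sketched like this (the object names are my own):

```r
library(rpart)

tree_fit <- rpart(
  HeartDisease ~ .,                # the dot includes all predictors
  data    = train,
  method  = "class",               # classification tree
  control = rpart.control(maxdepth = 8, cp = 0.001)
)
```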
An alternative approach is to use the parsnip package, where decision_tree() creates a decision tree model specification.
- initialise a model specification
- set the mode to classification
- set the engine to rpart
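A sketch of the parsnip specification described above:

```r
library(parsnip)

tree_spec <- decision_tree(tree_depth = 8, cost_complexity = 0.001) %>%
  set_mode("classification") %>%
  set_engine("rpart")

# tree_fit <- fit(tree_spec, HeartDisease ~ ., data = train)
```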
- we will rather use the rpart function through caret
Plot the tree
- set cex = 0.8
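With base graphics this could be sketched as follows, assuming a fitted rpart object tree_fit:

```r
plot(tree_fit, margin = 0.1)  # draw the tree skeleton
text(tree_fit, cex = 0.8)     # label the nodes with smaller text
```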
- for the rest of the following parts we will use the results from caret
Prune the tree
- pruning is a data compression technique that reduces the size of a decision tree by removing non-critical and redundant sections
- the purpose is to reduce the complexity of the classifier
- change cp to 0.05
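A sketch of the pruning step, assuming a fitted rpart object tree_fit:

```r
# Prune the tree back at a higher complexity parameter
pruned_fit <- prune(tree_fit, cp = 0.05)
```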
Check model performance
- predict on the test data
Combine with the test set
- use bind_cols() to attach the predictions
Look at the confusion matrix
- set truth to HeartDisease, our target variable
- set the estimate to estimate, which is our predicted variable
A well-performing model would ideally have high numbers along the diagonal (top-left to bottom-right) and small numbers on the off-diagonal.
If you want a more visual representation of the confusion matrix, you can pipe the result of conf_mat() into autoplot() to generate a ggplot2 chart.
We can also calculate various performance metrics. One of the most common metrics is accuracy, which is how often the model predicted correctly as a percentage.
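With yardstick this could be sketched as follows (the results data frame and the estimate column name are assumptions):

```r
library(yardstick)

conf_mat(results, truth = HeartDisease, estimate = estimate)
accuracy(results, truth = HeartDisease, estimate = estimate)
```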
- our model performed well, with 85% accuracy
GREAT!!!
- lung is a dataset in the survival package
- explore the data
- determine the missing values
- impute the missing values however you like
- use status as your response, then find the proportion in each category
- plot the proportions
- now do the same for sex and plot it
- fit a decision tree using caret
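A starting point for the exercise, assuming the survival package is installed:

```r
library(survival)  # provides the lung dataset

str(lung)                       # explore the data
colSums(is.na(lung))            # missing values per column
prop.table(table(lung$status))  # proportion in each response category
```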