Build your first decision tree model with tidymodels and caret
Machine Learning for Everyone
Welcome to interactive exploratory data analysis and machine learning
R Machine Learning Essential Training: Part 2
The Heart Disease dataset: Decision Trees
All the necessary packages are installed for you!
We will be examining the Heart disease data set in this laboratory.
- save the data into an object
Check for missing data
- it looks like we don't have any; let's keep exploring
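A quick sketch of the missing-data check, assuming the data has been saved into an object called `heart` (the object and file names are assumptions, not from the notes):

```r
# Read the data (file name is illustrative)
heart <- read.csv("heart.csv")

colSums(is.na(heart))  # NA count per column
sum(is.na(heart))      # total NA count; 0 means no missing data
```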
- look at the first few rows of the data using head()
- we observe that the dataset is a mixture of numerical and categorical data
- look in more detail at the structure of the dataset
- we don't want to work with characters but rather with factors
- we use mutate_if() from dplyr - fill the blank with as.factor
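The filled-in conversion step could look like this (assuming the data object is called `heart`):

```r
library(dplyr)

# Convert every character column to a factor
heart <- heart %>% mutate_if(is.character, as.factor)
str(heart)  # the former character columns now show as Factor
```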
define a custom function for categorical data plots
- the function takes 4 arguments: mydata, variable, ylab and xlab
test the function
- now do a plot for RestingECG and Angina using the same function
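One possible sketch of such a helper; the function name `plot_cat` and its body are assumptions, and the column names are taken from the notes:

```r
library(ggplot2)

# Bar plot of a categorical column, passed by name as a string
plot_cat <- function(mydata, variable, ylab, xlab) {
  ggplot(mydata, aes(x = .data[[variable]])) +
    geom_bar() +
    labs(y = ylab, x = xlab)
}

plot_cat(heart, "RestingECG", "Count", "Resting ECG")
plot_cat(heart, "Angina", "Count", "Angina")
```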
Histograms
- the trick I always use is to melt the numeric data (change it from wide to long, then plot a faceted histogram)
- so you select the numeric columns (is.numeric) - one could also use a custom function or a loop
- set bins = 30
- we use the function gather(), which returns value and key columns - so you will facet by key and set x = value
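Putting those steps together, assuming the data object is called `heart`:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

heart %>%
  select_if(is.numeric) %>%                  # keep only numeric columns
  gather(key = "key", value = "value") %>%   # wide -> long
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ key, scales = "free")         # one panel per original column
```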
Classes for our target response
- our target variable is HeartDisease; visualise this
- change the response to a factor, since this is a requirement
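Both steps in one sketch (assuming the data object is called `heart`):

```r
library(dplyr)
library(ggplot2)

# The response must be a factor for classification
heart <- heart %>% mutate(HeartDisease = as.factor(HeartDisease))

ggplot(heart, aes(x = HeartDisease)) +
  geom_bar()   # class counts for the target
```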
Some inferential statistics
- at this point, one would want to know whether there is any association between the response and the nominal variables
- the chi-squared test comes to the rescue - include the categorical variables (HeartDisease, Angina, sex and RestingECG)
- fill in the gap for the missing variables
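A sketch of the tests; the column names are copied from the notes and their exact spelling/case may differ in the actual data:

```r
# Chi-squared test of independence between the response and each nominal variable
chisq.test(table(heart$HeartDisease, heart$Angina))
chisq.test(table(heart$HeartDisease, heart$sex))
chisq.test(table(heart$HeartDisease, heart$RestingECG))
```

A small p-value suggests an association between that variable and HeartDisease.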
Determine correlations
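One minimal way to get a correlation matrix for the numeric columns (assuming the data object is called `heart`):

```r
library(dplyr)

heart %>%
  select_if(is.numeric) %>%
  cor(use = "pairwise.complete.obs") %>%  # tolerate any NAs
  round(2)
```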
Model building
- split your data into a 70%/30% proportion - set the proportion to 0.7
- set strata to HeartDisease, ensuring stratified sampling
- set the seed to 1234 for reproducibility
- use the testing() and training() functions to pull the required datasets from the split object
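The split described above, as a sketch (object names `split`, `train` and `test` follow the notes):

```r
library(rsample)  # part of tidymodels

set.seed(1234)                               # reproducibility
split <- initial_split(heart,
                       prop   = 0.7,         # 70%/30% split
                       strata = HeartDisease) # stratified sampling
train <- training(split)
test  <- testing(split)
```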
Create a recipe (feature engineering)
- use step_dummy() to turn all categorical data into dummy variables - set data = train
apply to training data
- to apply the recipe to the original train data, set new_data = NULL
apply to testing data
- set new_data = test
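A sketch of the recipe; the object names (`rec`, `train_baked`, `test_baked`) are assumptions:

```r
library(recipes)

rec <- recipe(HeartDisease ~ ., data = train) %>%
  step_dummy(all_nominal_predictors()) %>%  # categorical -> dummy variables
  prep()

train_baked <- bake(rec, new_data = NULL)  # recipe applied to the train data
test_baked  <- bake(rec, new_data = test)  # same recipe applied to test
```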
Fitting models using 2 different approaches
- HeartDisease ~ . : the dot implies we want to include all variables; otherwise we could use HeartDisease ~ var1 + var2 + ...
- fit the model using the train data - set the maximum tree depth to 8 (tweak it to see the effects of changing it)
- set cp = 0.001, or cost_complexity (also tweak it to see the effects of changing it)
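The first approach, sketched with rpart directly (the object name `fit` is an assumption; the notes drive rpart through caret, which passes these same control parameters along):

```r
library(rpart)

fit <- rpart(
  HeartDisease ~ .,        # dot = include all predictors
  data    = train,
  method  = "class",
  control = rpart.control(maxdepth = 8, cp = 0.001)
)
```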
One way around this would be to use the parsnip package, where you would use decision_tree() to create a decision tree model specification.
- initialise a model specification
- set the mode to classification
- set the engine to rpart
- here, though, we will use the rpart function through caret rather than parsnip
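For reference, the parsnip approach described above would look roughly like this (object names are assumptions):

```r
library(parsnip)

tree_spec <- decision_tree(tree_depth = 8, cost_complexity = 0.001) %>%
  set_mode("classification") %>%  # classification, not regression
  set_engine("rpart")             # rpart does the actual fitting

tree_fit <- fit(tree_spec, HeartDisease ~ ., data = train)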
Plot the tree
- set cex = 0.8
- for the rest of the following parts, we will use the results from caret
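A minimal plotting sketch, assuming `fit` is the rpart tree fitted earlier (for a caret object, the underlying tree is in `$finalModel`):

```r
library(rpart.plot)

rpart.plot(fit, cex = 0.8)  # cex controls the label text size
```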
Prune the tree
- pruning is a data compression technique that reduces the size of a decision tree by removing non-critical and redundant sections
- the purpose is to reduce the complexity of the classifier
- change cp = 0.05
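A sketch of the pruning step, assuming `fit` is the rpart tree from before:

```r
# A larger cp prunes more aggressively, giving a smaller tree
pruned <- rpart::prune(fit, cp = 0.05)
rpart.plot::rpart.plot(pruned, cex = 0.8)
```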
Check model performance
- predict on the test data
Combine with the test set
- bind_cols() with the predictions
Look at the confusion matrix
- truth = HeartDisease, our target variable - set the estimate to estimate, which is our predicted variable
A well-performing model would ideally have high numbers along the diagonal (top-left to bottom-right) with small numbers on the off-diagonal.
If you want a more visual representation of the confusion matrix, you can pipe the result of conf_mat() into autoplot() to generate a ggplot2 chart.
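The whole evaluation flow, sketched with yardstick; it assumes `fit` is the rpart tree fitted earlier and the predicted column is named `estimate` as in the notes:

```r
library(dplyr)
library(yardstick)

# Predict on the test data and combine with the test set
predictions <- test %>%
  mutate(estimate = predict(fit, newdata = test, type = "class"))

# Confusion matrix, plotted version, and accuracy
predictions %>% conf_mat(truth = HeartDisease, estimate = estimate)
predictions %>% conf_mat(truth = HeartDisease, estimate = estimate) %>% autoplot()
predictions %>% accuracy(truth = HeartDisease, estimate = estimate)
```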
We can also calculate various performance metrics. One of the most common metrics is accuracy, which is how often the model predicted correctly, as a percentage.
- our model performed well, with 85% accuracy. Great!
- lung is a dataset in the survival package - explore the data
- determine the missing values
- impute the missing values however you like
- use status as your response, then find the proportion in each category
- plot the proportions
- now do the same for sex and plot it
- fit a decision tree with caret
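A starting point for this exercise (exploration only, so the remaining steps are left to you):

```r
library(survival)

data(lung)                    # load the lung dataset
str(lung)                     # explore the structure
colSums(is.na(lung))          # missing values per column

# Proportion of each status category
prop.table(table(lung$status))
```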