Build your first decision tree model with Tidymodels and Caret

Machine learning for Everyone

Author

Bongani Ncube

Published

2 November 2025

Welcome to interactive exploratory data analysis and machine learning

R Machine Learning Essential Training : Part 2

The Heart disease dataset : Decision Trees

All the necessary packages are installed for you!

The data

We will be examining the Heart data set for this laboratory.

Check for missing data

It looks like we don't have any, so let's keep exploring.
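A minimal sketch of the missing-data check. The data frame below is made up to stand in for the Heart data (column names and values are assumptions), and it deliberately contains a couple of NA values so the output has something to show.

```r
# Made-up stand-in for the Heart data (values are for illustration only)
heart <- data.frame(
  Age = c(40, 49, NA, 54),
  Sex = c("M", "F", "M", NA),
  HeartDisease = c(0, 1, 0, 1)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(heart))
print(missing_per_column)

# TRUE if at least one value in the whole data frame is missing
print(anyNA(heart))
```

A column total of zero everywhere means there is nothing to impute.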

  • look at the first few rows of the data using head()
Points to note
  • we observe here that the dataset is a mixture of numerical and categorical data
  • look in more detail at the structure of the dataset
Note
  • we want to work with factors rather than character columns
  • we use mutate_if() from dplyr for the conversion
  • fill in the blank with as.factor
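The steps above can be sketched as follows. The small data frame is a made-up stand-in for the Heart data, so the column names are assumptions.

```r
library(dplyr)

# Made-up stand-in for the Heart data
heart <- data.frame(
  Age = c(40, 49, 37, 54),
  Sex = c("M", "F", "M", "F"),
  RestingECG = c("Normal", "ST", "Normal", "LVH"),
  stringsAsFactors = FALSE
)

head(heart)  # first few rows
str(heart)   # Sex and RestingECG show up as chr

# Convert every character column to a factor
heart <- heart %>% mutate_if(is.character, as.factor)
str(heart)   # Sex and RestingECG are now factors
```

In newer dplyr code the same conversion is usually written as mutate(across(where(is.character), as.factor)).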

define a custom function for categorical data plots

  • the function takes the arguments mydata, variable, ylab and xlabel

test the function

  • now do a plot for RestingECG and Angina using the same function
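One possible shape for such a helper is sketched below. The function name plot_categorical and the toy data are hypothetical; only the argument names mydata, variable, ylab and xlabel come from the text above.

```r
library(ggplot2)

# Made-up stand-in for the Heart data
heart <- data.frame(
  RestingECG = factor(c("Normal", "ST", "Normal", "LVH", "Normal", "ST")),
  Angina = factor(c("N", "Y", "N", "Y", "Y", "N"))
)

# Bar chart of counts for one categorical column, given by name as a string
plot_categorical <- function(mydata, variable, ylab = "Count", xlabel = variable) {
  ggplot(mydata, aes(x = .data[[variable]], fill = .data[[variable]])) +
    geom_bar() +
    labs(x = xlabel, y = ylab) +
    theme(legend.position = "none")
}

p_ecg <- plot_categorical(heart, "RestingECG", ylab = "Number of patients", xlabel = "Resting ECG")
p_angina <- plot_categorical(heart, "Angina", ylab = "Number of patients", xlabel = "Angina")
print(p_ecg)
print(p_angina)
```

Passing the column name as a string and looking it up with .data[[variable]] keeps the function reusable for any categorical column.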

Histograms

Tip
  • the trick I always use is to melt the numeric data (reshape it from wide to long) and then plot a faceted histogram
  • first select the columns that are numeric (is.numeric)
  • one could also use a custom function or a loop
  • set bins = 30
  • we use the function gather(), which returns a key column and a value column
  • then facet by key and set x = value
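A sketch of the melt-then-facet trick, using a few made-up numeric columns in place of the Heart data (the column names are assumptions).

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up stand-in for the Heart data
heart <- data.frame(
  Age = c(40, 49, 37, 54, 61, 58),
  MaxHR = c(172, 156, 98, 122, 110, 130),
  Sex = c("M", "F", "M", "F", "M", "F")
)

# Keep the numeric columns, then stack them into key/value pairs
heart_long <- heart %>%
  select_if(is.numeric) %>%
  gather()

# One histogram panel per original column
p <- ggplot(heart_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ key, scales = "free")
print(p)
```

gather() still works but has been superseded in tidyr by pivot_longer(), which does the same reshaping.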

Classes for our target response

  • our target variable is HeartDisease

visualise this

  • convert the response to a factor, since classification models require a factor outcome
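A short sketch of counting and plotting the classes. The response values below are made up; only the column name HeartDisease comes from the data set.

```r
library(dplyr)
library(ggplot2)

# Made-up response values standing in for the Heart data
heart <- data.frame(HeartDisease = c(0, 1, 1, 0, 1, 1, 0, 1))

# The response must be a factor for classification
heart <- heart %>% mutate(HeartDisease = as.factor(HeartDisease))

# Count and proportion of each class
class_counts <- heart %>%
  count(HeartDisease) %>%
  mutate(prop = n / sum(n))
print(class_counts)

# Bar chart of the class counts
p <- ggplot(class_counts, aes(x = HeartDisease, y = n, fill = HeartDisease)) +
  geom_col()
print(p)
```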

Some inferential statistics

  • at this point, one would want to know whether there is any association between the response and the nominal variables
  • the chi-squared test comes to the rescue
  • include the categorical variables (HeartDisease, Angina, sex and RestingECG)
  • fill in the gaps for the missing variables
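As a rough illustration, a chi-squared test of association between the response and one nominal variable could look like this. The counts are made up, so the resulting p-value says nothing about the real Heart data.

```r
# Made-up stand-in for the Heart data: 100 patients
heart <- data.frame(
  Angina = factor(c(rep("N", 40), rep("Y", 10), rep("N", 15), rep("Y", 35))),
  HeartDisease = factor(rep(c(0, 1), each = 50))
)

# Cross-tabulate the two categorical variables
tab <- table(heart$HeartDisease, heart$Angina)
print(tab)

# Chi-squared test of independence (base R)
res <- chisq.test(tab)
print(res)
```

A small p-value suggests the two variables are associated. The same call can be repeated with the other nominal variables in place of Angina.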

Determine correlations
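One simple way to do this is a correlation matrix over the numeric columns, sketched here with base R. The columns and values are made up for illustration.

```r
# Made-up stand-in for the Heart data
heart <- data.frame(
  Age = c(40, 49, 37, 54, 61, 58),
  MaxHR = c(172, 156, 98, 122, 110, 130),
  RestingBP = c(140, 160, 130, 150, 145, 120),
  Sex = c("M", "F", "M", "F", "M", "F")
)

# Pick out the numeric columns and compute pairwise Pearson correlations
num_cols <- sapply(heart, is.numeric)
cor_mat <- cor(heart[, num_cols])
print(round(cor_mat, 2))
```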

Model building

  • split your data into 70%/30% proportion
  • set proportion to 0.7
  • set strata to HeartDisease ensuring stratified sampling
  • set the seed to 1234 for reproducibility
  • use the testing() and training() functions to pull required datasets from the split object
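The split described above can be sketched with rsample (part of Tidymodels). The data frame is a made-up stand-in, and the names heart_split, heart_train and heart_test are my own choices.

```r
library(rsample)

# Made-up stand-in for the Heart data: 100 patients
heart <- data.frame(
  Angina = factor(c(rep("N", 40), rep("Y", 10), rep("N", 15), rep("Y", 35))),
  Age = rep(c(42, 48, 51, 55, 58, 61, 64, 67, 53, 59), times = 10),
  HeartDisease = factor(rep(c(0, 1), each = 50))
)

set.seed(1234)  # for reproducibility

# 70% training, 30% testing, stratified on the response
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

# Class proportions should be similar in both sets
print(prop.table(table(heart_train$HeartDisease)))
print(prop.table(table(heart_test$HeartDisease)))
```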

Create a recipe (feature engineering)

  • use step_dummy() to turn all categorical data to dummy variables
  • set data = train

apply to training data

  • to apply the recipe to the original training data, set new_data = NULL

apply to testing data

  • Set new_data=test
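Putting the recipe steps together might look like the sketch below. The data is made up, the train/test split is a plain index split to keep the example short, and the object names are assumptions.

```r
library(recipes)

# Made-up stand-in for the Heart data
heart <- data.frame(
  Angina = factor(c(rep("N", 40), rep("Y", 10), rep("N", 15), rep("Y", 35))),
  Age = rep(c(42, 48, 51, 55, 58, 61, 64, 67, 53, 59), times = 10),
  HeartDisease = factor(rep(c(0, 1), each = 50))
)
test_rows <- seq(1, nrow(heart), by = 3)
train <- heart[-test_rows, ]
test <- heart[test_rows, ]

# Define the recipe: turn all categorical predictors into dummy variables
heart_recipe <- recipe(HeartDisease ~ ., data = train) %>%
  step_dummy(all_nominal_predictors())

# Estimate the recipe on the training data
heart_prep <- prep(heart_recipe, training = train)

# new_data = NULL returns the processed training data
train_baked <- bake(heart_prep, new_data = NULL)

# Apply the same transformation to the test data
test_baked <- bake(heart_prep, new_data = test)
print(head(train_baked))
```

The recipe is estimated on the training data only, so the test data never influences the preprocessing.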

Fitting models using 2 different approaches

Tip
  • in HeartDisease ~ ., the dot means we want to include all variables as predictors; otherwise we could write HeartDisease ~ var1 + var2 + ....
  • fit the model using the training data
  • set the maximum tree depth to 8 (tweak it to see the effect of changing it)
  • set cp = 0.001, called cost_complexity in Tidymodels (tweak this too to see the effect of changing it)
Decision Tree model using Tidymodels

One approach is to use the parsnip package, where decision_tree() creates a decision tree model specification.

  • initialise a model specification
  • set mode to classification
  • set engine to rpart
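A sketch of the parsnip version, assuming the rpart package is installed. The data is made up and the object names tree_spec and tree_fit are my own.

```r
library(parsnip)

# Made-up stand-in for the training data
train <- data.frame(
  Angina = factor(c(rep("N", 40), rep("Y", 10), rep("N", 15), rep("Y", 35))),
  Age = rep(c(42, 48, 51, 55, 58, 61, 64, 67, 53, 59), times = 10),
  HeartDisease = factor(rep(c(0, 1), each = 50))
)

# Model specification: a classification tree fitted by rpart
tree_spec <- decision_tree(tree_depth = 8, cost_complexity = 0.001) %>%
  set_mode("classification") %>%
  set_engine("rpart")

# Fit on the training data using all predictors
tree_fit <- fit(tree_spec, HeartDisease ~ ., data = train)
print(tree_fit)
```

The specification is separate from the data, so the same tree_spec can be refitted on a different training set without rewriting it.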
Using Caret instead
  • here we fit the tree with caret's rpart method instead
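One way this could look with caret is sketched below, assuming the tree is fitted through train() with method = "rpart". Resampling is switched off so that a single cp value is used as given; the data and object names are made up.

```r
library(caret)

# Made-up stand-in for the training data
heart_train <- data.frame(
  Angina = factor(c(rep("N", 40), rep("Y", 10), rep("N", 15), rep("Y", 35))),
  Age = rep(c(42, 48, 51, 55, 58, 61, 64, 67, 53, 59), times = 10),
  HeartDisease = factor(rep(c(0, 1), each = 50))
)

# Fit one rpart tree with fixed settings (no resampling, no tuning)
tree_caret <- train(
  HeartDisease ~ .,
  data = heart_train,
  method = "rpart",
  trControl = trainControl(method = "none"),
  tuneGrid = data.frame(cp = 0.001),
  control = rpart::rpart.control(maxdepth = 8)
)

# The underlying rpart object, used for plotting and pruning
print(tree_caret$finalModel)
```

With the default trainControl(), caret would instead try several cp values and keep the best one by resampling.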

Plot the tree

  • set cex= 0.8
  • for the rest of the following parts we will use results from caret
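A minimal plotting sketch with base graphics. For brevity it fits the tree with rpart directly on made-up data; an rpart object taken from a caret fit can be plotted the same way.

```r
library(rpart)

# Made-up stand-in for the training data
heart_train <- data.frame(
  Angina = factor(c(rep("N", 40), rep("Y", 10), rep("N", 15), rep("Y", 35))),
  Age = rep(c(42, 48, 51, 55, 58, 61, 64, 67, 53, 59), times = 10),
  HeartDisease = factor(rep(c(0, 1), each = 50))
)

tree <- rpart(
  HeartDisease ~ ., data = heart_train, method = "class",
  control = rpart.control(maxdepth = 8, cp = 0.001)
)

# Draw the branches, then add the split labels at 80% text size
plot(tree, uniform = TRUE, margin = 0.1)
text(tree, cex = 0.8, use.n = TRUE)
```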

Prune the tree

  • pruning is a data compression technique that reduces the size of a decision tree by removing non-critical and redundant sections of the tree
  • the purpose is to reduce the complexity of the classifier, which also helps against overfitting
  • change cp to 0.05
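Pruning can be sketched with prune() from rpart. The data is made up, so the exact tree sizes here are not meaningful; the point is that the pruned tree is never larger than the full one.

```r
library(rpart)

# Made-up stand-in for the training data
heart_train <- data.frame(
  Angina = factor(c(rep("N", 40), rep("Y", 10), rep("N", 15), rep("Y", 35))),
  Age = rep(c(42, 48, 51, 55, 58, 61, 64, 67, 53, 59), times = 10),
  HeartDisease = factor(rep(c(0, 1), each = 50))
)

# Grow a fairly large tree first
tree <- rpart(
  HeartDisease ~ ., data = heart_train, method = "class",
  control = rpart.control(maxdepth = 8, cp = 0.001)
)

# Cut back every split that does not improve the fit by at least cp = 0.05
pruned <- prune(tree, cp = 0.05)

# Compare the number of nodes before and after pruning
print(nrow(tree$frame))
print(nrow(pruned$frame))
```

printcp(tree) lists the cp values at which the tree would lose a split, which helps when choosing a pruning threshold.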

Check model performance

  • predict on test data

Combine with the test set

  • bind_cols with predictions

Look at the confusion matrix

  • set truth = HeartDisease, our target variable
  • set estimate = estimate, the column holding our predictions

A well-performing model would ideally have high numbers along the diagonal (top-left to bottom-right) and small numbers off the diagonal.
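The predict, combine and confusion-matrix steps can be sketched end to end as below. The data and the index-based split are made up, and the tree is fitted with rpart directly to keep the example short.

```r
library(rpart)
library(dplyr)
library(yardstick)

# Made-up stand-in for the Heart data, split by row index
heart <- data.frame(
  Angina = factor(c(rep("N", 40), rep("Y", 10), rep("N", 15), rep("Y", 35))),
  Age = rep(c(42, 48, 51, 55, 58, 61, 64, 67, 53, 59), times = 10),
  HeartDisease = factor(rep(c(0, 1), each = 50))
)
test_rows <- seq(1, nrow(heart), by = 3)
heart_train <- heart[-test_rows, ]
heart_test <- heart[test_rows, ]

tree <- rpart(HeartDisease ~ ., data = heart_train, method = "class")

# Predicted class for each test row
pred <- predict(tree, newdata = heart_test, type = "class")

# Attach the predictions to the test set
results <- bind_cols(heart_test, tibble(estimate = unname(pred)))

# Rows are predictions, columns are the true classes
cm <- conf_mat(results, truth = HeartDisease, estimate = estimate)
print(cm)
```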

Tip

If you want a more visual representation of the confusion matrix, you can pipe the result of conf_mat() into autoplot() to generate a ggplot2 chart.

Tip

We can also calculate various performance metrics. One of the most common metrics is accuracy, which is how often the model predicted correctly as a percentage.
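A small sketch of computing accuracy and plotting the confusion matrix with yardstick. The truth and estimate columns are made up, so the number it prints is unrelated to the model in this tutorial.

```r
library(yardstick)

# Made-up truth and predictions: 8 patients
results <- data.frame(
  HeartDisease = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  estimate     = factor(c(0, 0, 0, 1, 1, 1, 1, 0))
)

# Share of predictions that match the truth
acc <- accuracy(results, truth = HeartDisease, estimate = estimate)
print(acc)

# Confusion matrix drawn as a ggplot2 heatmap
cm <- conf_mat(results, truth = HeartDisease, estimate = estimate)
p <- ggplot2::autoplot(cm, type = "heatmap")
print(p)
```

Accuracy alone can mislead when one class is much more common than the other, so it is worth reading it next to the confusion matrix.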

Performance
  • our model performed well, with 85% accuracy

GREAT!!!

Your turn
  • lung is a dataset in the survival package
  • explore the data
  • determine the missing values
  • impute the missing values using any method you like
  • use status as your response, then find the proportion in each category
  • plot the proportions
  • now do the same for sex and plot it
  • fit a decision tree using caret