Build your first Logistic model with Tidymodels

Machine learning for Everyone

Author

Bongani Ncube

Welcome to interactive learning

R Machine Learning Essential Training

The Heart disease dataset

We load MLDataR for the dataset, parsnip for modeling functions, rsample for splitting, yardstick for performance metrics, and dplyr and magrittr for wrangling.

All necessary packages are installed for you!

The data

We will be examining the Heart data set for this lab.

Note
  • we want categorical columns stored as factors rather than character strings
  • we use mutate_if() from dplyr to convert them
  • fill the blank with as.factor
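As a minimal sketch of the conversion described in the note, the snippet below uses a tiny made-up tibble called heart_toy (a hypothetical stand-in, not the lab's Heart data) and converts every character column to a factor.

```r
library(dplyr)

# Hypothetical stand-in for the Heart data: two character columns, two numeric
heart_toy <- tibble(
  Age = c(40, 49, 37, 54),
  Sex = c("M", "F", "M", "F"),
  Angina = c("N", "N", "Y", "N"),
  HeartDisease = c(0, 1, 0, 1)
)

# Convert every character column to a factor; other columns are left untouched
heart_fct <- heart_toy %>%
  mutate_if(is.character, as.factor)

glimpse(heart_fct)
```

Only the columns for which is.character() returns TRUE are changed, so the numeric columns keep their type.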

Classes for our target response

  • our target variable is HeartDisease

visualise this
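One way to count the classes and plot them is sketched below. The eight-row heart_toy tibble is invented for illustration, and using ggplot2 for the bar chart is an assumption, since the lab does not say which plotting function it uses.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the response column of the Heart data
heart_toy <- tibble(HeartDisease = factor(c(0, 1, 1, 0, 1, 1, 0, 1)))

# Count each class and convert the counts to proportions
class_props <- heart_toy %>%
  count(HeartDisease) %>%
  mutate(prop = n / sum(n))

class_props

# Bar chart of the class proportions (print class_plot to display it)
class_plot <- ggplot(class_props, aes(x = HeartDisease, y = prop)) +
  geom_col() +
  labs(x = "HeartDisease", y = "Proportion")
```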

Check for missing data

  • it looks like there are none, so let's start building
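A missing-data check can be as short as the sketch below. The heart_toy data frame is hypothetical and deliberately contains two NA values so the check has something to report; on a complete dataset every count would be zero.

```r
# Hypothetical data frame with two missing values planted on purpose
heart_toy <- data.frame(
  Age = c(40, NA, 37),
  Sex = c("M", "F", NA),
  HeartDisease = c(0, 1, 0)
)

# Number of missing values in each column
na_counts <- colSums(is.na(heart_toy))
na_counts

# TRUE if there is at least one missing value anywhere in the data
anyNA(heart_toy)
```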

Model building

  • convert the response to a factor, since parsnip classification models require a factor outcome
  • split your data into 70% training and 30% testing
  • set prop to 0.7
  • set strata to HeartDisease to ensure stratified sampling
  • use the training() and testing() functions to pull the required datasets
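The steps above can be sketched as follows. The data are simulated (heart_toy and its columns Age and MaxHR are hypothetical stand-ins for the Heart data), and the seed is arbitrary.

```r
library(dplyr)
library(rsample)

set.seed(2024)
n_rows <- 200

# Simulated stand-in for the Heart data, with the response stored as a factor
heart_toy <- tibble(
  Age = round(rnorm(n_rows, mean = 54, sd = 9)),
  MaxHR = round(rnorm(n_rows, mean = 137, sd = 25)),
  HeartDisease = rbinom(n_rows, size = 1, prob = 0.5)
) %>%
  mutate(HeartDisease = as.factor(HeartDisease))

# 70%/30% split, stratified on the response so both sets keep similar class balance
heart_split <- initial_split(heart_toy, prop = 0.7, strata = HeartDisease)

heart_train <- training(heart_split)
heart_test <- testing(heart_split)

nrow(heart_train)
nrow(heart_test)
```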
Logistic Regression

Now we will fit a logistic regression model. We will again use the parsnip package, and we will use logistic_reg() to create a logistic regression model specification.

  • initialise a model specification
  • set mode to classification
  • set engine to glm
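A minimal sketch of such a specification is shown here; the object name lr_spec is an assumption chosen for illustration.

```r
library(parsnip)
library(magrittr)

# Model specification only: no data is involved and nothing is fitted yet
lr_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

lr_spec
```

Keeping the specification separate from the data means the same lr_spec can later be fitted to different training sets.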
Tip
  • in HeartDisease ~ . the dot means we want to include all other variables as predictors; otherwise we could write HeartDisease ~ var1 + var2 + ....
  • fit the model using train data
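Fitting with the dot formula might look like the sketch below. It runs on simulated data (heart_toy, Age and MaxHR are hypothetical), so the coefficients it prints say nothing about the real Heart data.

```r
library(dplyr)
library(rsample)
library(parsnip)

set.seed(2024)
n_rows <- 300

# Simulated stand-in for the Heart data; risk increases with Age by construction
heart_toy <- tibble(
  Age = round(rnorm(n_rows, mean = 54, sd = 9)),
  MaxHR = round(rnorm(n_rows, mean = 137, sd = 25))
) %>%
  mutate(HeartDisease = factor(rbinom(n_rows, size = 1, prob = plogis((Age - 54) / 9))))

heart_split <- initial_split(heart_toy, prop = 0.7, strata = HeartDisease)
heart_train <- training(heart_split)

lr_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

# The dot uses every column except HeartDisease as a predictor
lr_fit <- lr_spec %>%
  fit(HeartDisease ~ ., data = heart_train)

lr_fit
```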

This fit is done using the glm() function, which comes with a very handy summary() method as well.

This lets us see several things, such as parameter estimates, standard errors, p-values, and model fit statistics. We can use the tidy() function on lr_fit to extract some of these model attributes for further analysis or presentation.
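Both views of the fitted model can be sketched as below, again on simulated data with hypothetical column names. Loading broom is an assumption of this sketch: it supplies the tidy() method for the underlying glm object.

```r
library(dplyr)
library(parsnip)
library(broom)

set.seed(2024)
n_rows <- 300

# Simulated stand-in for the training data
heart_toy <- tibble(
  Age = round(rnorm(n_rows, mean = 54, sd = 9)),
  MaxHR = round(rnorm(n_rows, mean = 137, sd = 25))
) %>%
  mutate(HeartDisease = factor(rbinom(n_rows, size = 1, prob = plogis((Age - 54) / 9))))

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = heart_toy)

# Classic glm summary: pull out the underlying glm object first
summary(extract_fit_engine(lr_fit))

# Same coefficient table as a tibble, one row per term
lr_coefs <- tidy(lr_fit)
lr_coefs
```

The tibble returned by tidy() is easier to filter, join, or plot than the printed summary.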

Predictions are done in much the same way. Here we use the model to predict on the test data.

The result is a tibble with a single column, .pred_class, which is a factor with the same levels as the response in the original training data set.

We can also get back probability predictions, by specifying type = "prob".
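The two kinds of prediction can be sketched as follows. The data are simulated and the names heart_toy, heart_train and heart_test are hypothetical; because the toy response has levels 0 and 1, the probability columns are named .pred_0 and .pred_1.

```r
library(dplyr)
library(rsample)
library(parsnip)

set.seed(2024)
n_rows <- 300

# Simulated stand-in for the Heart data
heart_toy <- tibble(
  Age = round(rnorm(n_rows, mean = 54, sd = 9)),
  MaxHR = round(rnorm(n_rows, mean = 137, sd = 25))
) %>%
  mutate(HeartDisease = factor(rbinom(n_rows, size = 1, prob = plogis((Age - 54) / 9))))

heart_split <- initial_split(heart_toy, prop = 0.7, strata = HeartDisease)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = heart_train)

# Hard class predictions: a tibble with one column, .pred_class
pred_class <- predict(lr_fit, new_data = heart_test)

# Class probabilities: one column per level of the response
pred_prob <- predict(lr_fit, new_data = heart_test, type = "prob")

head(pred_class)
head(pred_prob)
```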

Tip

Using augment() we can add the predictions to the data frame and then use that to look at model performance metrics. The truth argument should be set to our response variable, which in our case is HeartDisease.
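A sketch of augment() followed by a confusion matrix from yardstick is given below. It uses simulated data with hypothetical names, so the counts it produces are not the lab's results.

```r
library(dplyr)
library(rsample)
library(parsnip)
library(yardstick)

set.seed(2024)
n_rows <- 300

# Simulated stand-in for the Heart data
heart_toy <- tibble(
  Age = round(rnorm(n_rows, mean = 54, sd = 9)),
  MaxHR = round(rnorm(n_rows, mean = 137, sd = 25))
) %>%
  mutate(HeartDisease = factor(rbinom(n_rows, size = 1, prob = plogis((Age - 54) / 9))))

heart_split <- initial_split(heart_toy, prop = 0.7, strata = HeartDisease)

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = training(heart_split))

heart_test <- testing(heart_split)

# Test data plus .pred_class and the class probability columns
heart_aug <- augment(lr_fit, new_data = heart_test)

# Cross-tabulate the true classes against the predicted classes
heart_cm <- heart_aug %>%
  conf_mat(truth = HeartDisease, estimate = .pred_class)

heart_cm
```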

A good performing model would ideally have high numbers along the diagonal (up-left to down-right) with small numbers on the off-diagonal.

Tip

If you want a more visual representation of the confusion matrix, you can pipe the result of conf_mat() into autoplot() to generate a ggplot2 chart.

Tip

We can also calculate various performance metrics. One of the most common metrics is accuracy, which is the proportion of observations the model predicted correctly.
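Accuracy can be computed with yardstick as sketched below. The data are simulated with hypothetical names, so the value printed here will differ from the figure reported for the Heart data.

```r
library(dplyr)
library(rsample)
library(parsnip)
library(yardstick)

set.seed(2024)
n_rows <- 300

# Simulated stand-in for the Heart data
heart_toy <- tibble(
  Age = round(rnorm(n_rows, mean = 54, sd = 9)),
  MaxHR = round(rnorm(n_rows, mean = 137, sd = 25))
) %>%
  mutate(HeartDisease = factor(rbinom(n_rows, size = 1, prob = plogis((Age - 54) / 9))))

heart_split <- initial_split(heart_toy, prop = 0.7, strata = HeartDisease)

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = training(heart_split))

heart_aug <- augment(lr_fit, new_data = testing(heart_split))

# Share of test rows where the predicted class matches the true class
heart_acc <- heart_aug %>%
  accuracy(truth = HeartDisease, estimate = .pred_class)

heart_acc
```

The .estimate column is a proportion between 0 and 1; multiply by 100 to read it as a percentage.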

Performance
  • our model performed well, with 81% accuracy

GREAT!!!

Your turn
  • lung is a dataset in the survival package
  • explore the data
  • determine missing values
  • impute missing values using any method you like
  • use status as your response, then find the proportion in each category
  • plot the proportion
  • now do it for sex and plot
  • fit logistic model using tidymodels and determine accuracy