Build your first Logistic model with Tidymodels

Machine learning for Everyone

Author

Bongani Ncube

Welcome to interactive learning

R Machine Learning Essential Training

The Heart disease dataset

We load MLDataR for the dataset, parsnip for modeling functions, rsample for splitting, yardstick for performance metrics, and dplyr and magrittr for wrangling.

All necessary packages are being installed for you!
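A minimal sketch of the setup chunk, assuming the packages listed above are already installed in your R session:

```r
# Packages named in the text above
library(MLDataR)    # example datasets
library(parsnip)    # model specifications and fitting
library(rsample)    # train/test splitting
library(yardstick)  # performance metrics
library(dplyr)      # data wrangling
library(magrittr)   # the %>% pipe
```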

The data

We will be examining the Heart data set for this lab.

Save the data into an object.

Look at the structure of the data.
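One way these two steps might look. This assumes the Heart data is the `heartdisease` data frame shipped with MLDataR; the object name `heart` is only illustrative.

```r
library(dplyr)

# Assumption: the lab's Heart data is MLDataR::heartdisease
heart <- MLDataR::heartdisease

# Compact overview: one line per column with its type and first values
glimpse(heart)
```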

Note

We don't want to work with character columns but rather with factors.
We use mutate_if() from dplyr.
Fill the blank with as.factor.
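With the blank filled in, the conversion could be sketched as follows (again assuming the data is `MLDataR::heartdisease` stored in an illustrative object called `heart`):

```r
library(dplyr)

# Assumption: the lab's Heart data is MLDataR::heartdisease
heart <- MLDataR::heartdisease %>%
  # Convert every character column to a factor
  mutate_if(is.character, as.factor)

glimpse(heart)
```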

Classes for our `target` response

Our target variable is HeartDisease.

Visualise this.
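A possible sketch of counting and plotting the classes. ggplot2 is not among the packages listed earlier and is an assumption here, as are the object names `heart` and `class_counts`.

```r
library(dplyr)
library(ggplot2)  # assumption: used here only for the bar chart

# Assumption: the lab's Heart data is MLDataR::heartdisease
heart <- MLDataR::heartdisease

# Number and proportion of observations in each class of the response
class_counts <- heart %>%
  count(HeartDisease) %>%
  mutate(prop = n / sum(n))
class_counts

# Bar chart of the class proportions
ggplot(class_counts, aes(x = factor(HeartDisease), y = prop)) +
  geom_col() +
  labs(x = "HeartDisease", y = "Proportion")
```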

Check for missing data
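A quick base R check might look like this (assuming the data is `MLDataR::heartdisease`; `missing_per_column` is an illustrative name):

```r
# Assumption: the lab's Heart data is MLDataR::heartdisease
heart <- MLDataR::heartdisease

# Number of missing values in each column
missing_per_column <- colSums(is.na(heart))
missing_per_column

# TRUE if there is at least one missing value anywhere in the data
anyNA(heart)
```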

There appear to be no missing values, so let's start building.

Model building

Change the response to a factor, since parsnip classification models require a factor outcome.

Split your data into 70%/30% proportions.
Set prop to 0.7.
Set strata to HeartDisease to ensure stratified sampling.

Use the training() and testing() functions to pull out the required datasets.
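The steps above can be sketched as follows. The seed value and the object names `heart_split`, `heart_train` and `heart_test` are arbitrary choices for illustration, and the data is assumed to be `MLDataR::heartdisease`.

```r
library(dplyr)
library(rsample)

set.seed(123)  # arbitrary seed so the split is reproducible

# Assumption: the lab's Heart data is MLDataR::heartdisease
heart <- MLDataR::heartdisease %>%
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))  # response must be a factor

# 70% training / 30% testing, stratified on the response
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
heart_train <- training(heart_split)
heart_test  <- testing(heart_split)

dim(heart_train)
dim(heart_test)
```

Stratifying on HeartDisease keeps the class proportions roughly the same in both pieces.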

Logistic Regression

Now we will fit a logistic regression model. We will again use the parsnip package, and we will use logistic_reg() to create a logistic regression model specification.

Initialise a model specification.
Set the mode to classification.
Set the engine to glm.
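A minimal sketch of such a specification (the name `lr_spec` is illustrative):

```r
library(parsnip)
library(magrittr)  # for the %>% pipe

# Logistic regression specification: nothing is fitted at this point
lr_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

lr_spec
```

The specification only describes the model; it is not tied to any data until fit() is called.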

Tip

In HeartDisease ~ . the dot means we want to include all other variables as predictors; otherwise we could write HeartDisease ~ var1 + var2 + ....
Fit the model using the training data.

This fit is done using the glm() function, and it comes with a very handy summary() method as well.

This lets us see several things, such as parameter estimates, standard errors, p-values, and model fit statistics. We can use the tidy() function on lr_fit to extract some of these model attributes for further analysis or presentation.
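Fitting and inspecting the model could look like the sketch below. The setup lines are repeated so the chunk runs on its own; the data source (`MLDataR::heartdisease`), the seed, and every object name other than `lr_fit` are assumptions. broom is loaded explicitly because it supplies the tidy() method for glm objects.

```r
library(dplyr)
library(rsample)
library(parsnip)
library(broom)  # assumption: provides the tidy() method for glm fits

# Setup repeated so this chunk is self-contained
set.seed(123)
heart <- MLDataR::heartdisease %>%
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
heart_train <- training(heart_split)

lr_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

# Fit on the training data using all other columns as predictors
lr_fit <- lr_spec %>%
  fit(HeartDisease ~ ., data = heart_train)

# The underlying glm object is stored in the `fit` element
summary(lr_fit$fit)

# Coefficient table as a tibble
coef_table <- tidy(lr_fit)
coef_table
```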

Predictions are done in much the same way. Here we use the model to predict on the test data.

The result is a tibble with a single column, .pred_class, which is a factor with the same levels as the response in the training data set.

We can also get back probability predictions, by specifying type = "prob".
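Both kinds of prediction can be sketched as follows. The setup is repeated so the chunk runs on its own; the data source, the seed, and the names `class_preds` and `prob_preds` are assumptions.

```r
library(dplyr)
library(rsample)
library(parsnip)

# Setup repeated so this chunk is self-contained
set.seed(123)
heart <- MLDataR::heartdisease %>%
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
heart_train <- training(heart_split)
heart_test  <- testing(heart_split)
lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = heart_train)

# Hard class predictions: a tibble with one column, .pred_class
class_preds <- predict(lr_fit, new_data = heart_test)
class_preds

# Class probabilities: one column per level of the response
prob_preds <- predict(lr_fit, new_data = heart_test, type = "prob")
prob_preds
```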

Tip

Using augment() we can add the predictions to the data frame and then use that to look at model performance metrics. The truth argument should be set to our response variable, which in our case is HeartDisease.

A well-performing model would ideally have high numbers along the diagonal (top-left to bottom-right) and small numbers off the diagonal.

Tip

If you want a more visual representation of the confusion matrix, you can pipe the result of conf_mat() into autoplot() to generate a ggplot2 chart.
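A sketch of building and plotting the confusion matrix. The setup is repeated so the chunk runs on its own; the data source, the seed, and the names `heart_aug` and `cm` are assumptions.

```r
library(dplyr)
library(rsample)
library(parsnip)
library(yardstick)
library(ggplot2)  # needed for autoplot()

# Setup repeated so this chunk is self-contained
set.seed(123)
heart <- MLDataR::heartdisease %>%
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
heart_train <- training(heart_split)
heart_test  <- testing(heart_split)
lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = heart_train)

# Add prediction columns to the test data
heart_aug <- augment(lr_fit, new_data = heart_test)

# Cross-tabulate predicted class against the true class
cm <- heart_aug %>%
  conf_mat(truth = HeartDisease, estimate = .pred_class)
cm

# Heatmap version of the same table
autoplot(cm, type = "heatmap")
```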

Tip

We can also calculate various performance metrics. One of the most common is accuracy, the proportion of observations the model classified correctly.
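Accuracy can be computed with yardstick along these lines. The setup is repeated so the chunk runs on its own; the data source, the seed, and the name `acc` are assumptions, and the exact value you get depends on the random split.

```r
library(dplyr)
library(rsample)
library(parsnip)
library(yardstick)

# Setup repeated so this chunk is self-contained
set.seed(123)
heart <- MLDataR::heartdisease %>%
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
heart_train <- training(heart_split)
heart_test  <- testing(heart_split)
lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = heart_train)

# Share of test observations whose predicted class matches the truth
acc <- augment(lr_fit, new_data = heart_test) %>%
  accuracy(truth = HeartDisease, estimate = .pred_class)
acc
```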

Performance

Our model performed well, with 81% accuracy.

GREAT!!!

Your turn

lung is a dataset in the survival package.
Explore the data.
Determine the missing values.
Impute the missing values using any method you like.

Use status as your response, and find the proportion in each category.

Plot the proportions.

Now do the same for sex and plot the result.

Fit a logistic model using tidymodels and determine its accuracy.
