Build your first Logistic model with Tidymodels
Machine learning for Everyone
Welcome to interactive learning
R Machine Learning Essential Training
The Heart disease dataset
We load MLDataR for the dataset, parsnip for modeling functions, rsample for splitting, yardstick for performance metrics, and dplyr and magrittr for wrangling.
All necessary packages are installed for you!
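The packages listed above can be attached as follows. This is a minimal sketch; outside the lab environment each package would first need to be installed with install.packages().

```r
# Attach the packages named in the text.
library(MLDataR)    # source of the heart disease data
library(parsnip)    # model specifications and fitting
library(rsample)    # train/test splitting
library(yardstick)  # performance metrics
library(dplyr)      # data wrangling verbs
library(magrittr)   # the %>% pipe operator
```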
We will be examining the Heart data set for this lab.
- save the data into an object
- look at the structure of the data
- we don't want to work with characters but rather factors
- we use mutate_if() from dplyr
- fill the blank with as.factor
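The steps above can be sketched as follows. The object name `heart` is illustrative, and the code assumes MLDataR exports the heart data as `heartdisease`; adjust the name if your version of the package differs.

```r
library(MLDataR)
library(dplyr)

# Assumption: the heart data is exported by MLDataR as `heartdisease`.
heart <- MLDataR::heartdisease

# Look at the structure: column names, types, and the first few values.
glimpse(heart)

# Convert every character column to a factor, leaving other columns untouched.
heart <- heart %>%
  mutate_if(is.character, as.factor)

glimpse(heart)
```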
Classes for our target response
- our target variable is HeartDisease
- visualise this
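One possible way to count and plot the classes is sketched below. It assumes the dataset name `heartdisease` and uses ggplot2 for the chart, which is not in the package list above; the object names are illustrative.

```r
library(MLDataR)
library(dplyr)
library(ggplot2)

heart <- MLDataR::heartdisease  # assumed dataset name

# Count and proportion of each class of the target variable.
class_counts <- heart %>%
  count(HeartDisease) %>%
  mutate(prop = n / sum(n))
class_counts

# Bar chart of the class counts.
p <- ggplot(class_counts, aes(x = factor(HeartDisease), y = n)) +
  geom_col() +
  labs(x = "HeartDisease", y = "Count")
p
```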
Check for missing data
- it looks like we don't have any! Let's start building.
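A quick check for missing values might look like this. It is a sketch using base R only and again assumes the dataset name `heartdisease`.

```r
library(MLDataR)

heart <- MLDataR::heartdisease  # assumed dataset name

# Number of missing values in each column.
missing_per_column <- colSums(is.na(heart))
missing_per_column

# Total number of missing values across the whole data set.
total_missing <- sum(missing_per_column)
total_missing
```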
Model building
- change response to a factor since this is a requirement
- split your data into a 70%/30% proportion
- set proportion to 0.7
- set strata to HeartDisease, ensuring stratified sampling
- use the testing() and training() functions to pull required datasets
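The steps above can be sketched with rsample as follows. The seed value and the object names `heart_split`, `train`, and `test` are assumptions chosen for illustration, as is the dataset name `heartdisease`.

```r
library(MLDataR)
library(dplyr)
library(rsample)

set.seed(123)  # assumed seed, only to make the split reproducible

# Prepare the data: characters to factors, and the response to a factor.
heart <- MLDataR::heartdisease %>%  # assumed dataset name
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))

# 70%/30% split, stratified on the response.
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)

# Pull out the two data sets.
train <- training(heart_split)
test  <- testing(heart_split)
```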
Now we will fit a logistic regression model. We will again use the parsnip package, and we will use logistic_reg() to create a logistic regression model specification.
- initialise a model specification
- set mode to classification
- set engine to glm
- use the formula HeartDisease ~ . where the dot implies we want to include all variables; otherwise we could use HeartDisease ~ var1 + var2 + ...
- fit the model using train data
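Put together, the specification and fit might look like the sketch below. The name `lr_fit` follows the text; `lr_spec`, the seed, the split objects, and the dataset name `heartdisease` are assumptions.

```r
library(MLDataR)
library(dplyr)
library(rsample)
library(parsnip)

set.seed(123)  # assumed seed

# Data preparation and split, repeated so this block runs on its own.
heart <- MLDataR::heartdisease %>%  # assumed dataset name
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
train <- training(heart_split)

# Model specification: logistic regression, classification mode, glm engine.
lr_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

# Fit on the training data using all other columns as predictors.
lr_fit <- lr_spec %>%
  fit(HeartDisease ~ ., data = train)

lr_fit
```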
This fit is done using the glm() function, and it comes with a very handy summary() method as well.
This lets us see several things, such as parameter estimates, standard errors, p-values, and model fit statistics. We can use the tidy() function on lr_fit to extract some of these model attributes for further analysis or presentation.
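A sketch of both calls is shown here. It assumes the broom package is installed, since it supplies the tidy() method for glm objects; the seed, object names, and the dataset name `heartdisease` are also assumptions.

```r
library(MLDataR)
library(dplyr)
library(rsample)
library(parsnip)
library(broom)  # assumption: provides the tidy() method for glm fits

set.seed(123)  # assumed seed

# Data preparation, split, and fit, repeated so this block runs on its own.
heart <- MLDataR::heartdisease %>%  # assumed dataset name
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
train <- training(initial_split(heart, prop = 0.7, strata = HeartDisease))

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = train)

# The underlying glm object is stored in the `fit` element of the parsnip fit.
summary(lr_fit$fit)

# One row per coefficient: estimate, standard error, statistic, p-value.
coefs <- tidy(lr_fit)
coefs
```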
Predictions are done in much the same way. Here we use the model to predict on the test data.
The result is a tibble with a single column .pred_class which will be a factor variable of the same labels as the original training data set.
We can also get back probability predictions, by specifying type = "prob".
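Both kinds of prediction can be sketched as follows. The seed, the object names `class_preds` and `prob_preds`, and the dataset name `heartdisease` are assumptions for illustration.

```r
library(MLDataR)
library(dplyr)
library(rsample)
library(parsnip)

set.seed(123)  # assumed seed

# Data preparation, split, and fit, repeated so this block runs on its own.
heart <- MLDataR::heartdisease %>%  # assumed dataset name
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
train <- training(heart_split)
test  <- testing(heart_split)

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = train)

# Hard class predictions: a tibble with the single column .pred_class.
class_preds <- predict(lr_fit, new_data = test)
class_preds

# Class probabilities: one .pred_<level> column per class of the response.
prob_preds <- predict(lr_fit, new_data = test, type = "prob")
prob_preds
```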
Using augment() we can add the predictions to the data.frame and then use that to look at model performance metrics. In conf_mat(), truth should be set to our response variable, which in our case is HeartDisease, and estimate to the predicted class.
A well-performing model would ideally have high numbers along the diagonal (up-left to down-right) with small numbers on the off-diagonal.
:::{.callout-tip}
If you want a more visual representation of the confusion matrix, you can pipe the result of conf_mat() into autoplot() to generate a ggplot2 chart.
:::
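A sketch of the confusion matrix and its plot is given below. The seed, the object name `cm`, the heatmap plot type, and the dataset name `heartdisease` are assumptions; the counts you get depend on the split.

```r
library(MLDataR)
library(dplyr)
library(rsample)
library(parsnip)
library(yardstick)
library(ggplot2)  # supplies the autoplot() generic

set.seed(123)  # assumed seed

# Data preparation, split, and fit, repeated so this block runs on its own.
heart <- MLDataR::heartdisease %>%  # assumed dataset name
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
train <- training(heart_split)
test  <- testing(heart_split)

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = train)

# Add predictions to the test data, then cross-tabulate truth vs. prediction.
cm <- augment(lr_fit, new_data = test) %>%
  conf_mat(truth = HeartDisease, estimate = .pred_class)
cm

# Visual version of the same table.
autoplot(cm, type = "heatmap")
```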
We can also calculate various performance metrics. One of the most common metrics is accuracy, which is the proportion of observations the model predicted correctly.
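Accuracy can be computed with yardstick as in this sketch. The seed, the object name `acc`, and the dataset name `heartdisease` are assumptions, and the exact value will vary with the random split.

```r
library(MLDataR)
library(dplyr)
library(rsample)
library(parsnip)
library(yardstick)

set.seed(123)  # assumed seed

# Data preparation, split, and fit, repeated so this block runs on its own.
heart <- MLDataR::heartdisease %>%  # assumed dataset name
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
train <- training(heart_split)
test  <- testing(heart_split)

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = train)

# Proportion of test observations whose predicted class matches the truth.
acc <- augment(lr_fit, new_data = test) %>%
  accuracy(truth = HeartDisease, estimate = .pred_class)
acc
```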
- our model performed well, with 81% accuracy
GREAT!!!
- lung is a dataset in the survival package
- explore the data
- determine missing values
- impute missing values using any method you like
- use status as your response, then find the proportion in each category
- plot the proportion
- now do it for sex and plot
- fit a logistic model using tidymodels and determine accuracy