Build your first Logistic model with Tidymodels
Machine learning for Everyone
Welcome to interactive learning
R Machine Learning Essential Training
The Heart disease dataset
We load MLDataR for the dataset, parsnip for modeling functions, rsample for splitting, yardstick for performance metrics, and dplyr and magrittr for wrangling.
All necessary packages are installed for you!
We will be examining the Heart data set for this lab.
- Save the data into an object.
- Look at the structure of the data.
- We don't want to work with characters but rather factors.
- We use `mutate_if()` from dplyr; fill the blank with `as.factor`.
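The steps above might look like the following sketch. It assumes the data ships in MLDataR under the name `heartdisease`, and it uses `heart` as a hypothetical object name; neither name is stated in the lab text.

```r
# Packages named in this lab
library(MLDataR)   # data
library(parsnip)   # modeling functions
library(rsample)   # data splitting
library(yardstick) # performance metrics
library(dplyr)     # wrangling
library(magrittr)  # pipe operator

# Save the data into an object
# (assumption: the dataset is called `heartdisease` in MLDataR)
heart <- MLDataR::heartdisease

# Look at the structure of the data
glimpse(heart)

# Convert every character column to a factor
heart <- heart %>%
  mutate_if(is.character, as.factor)
```

In current dplyr, `mutate(across(where(is.character), as.factor))` does the same job; `mutate_if()` is kept here because it is the function the lab names.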
Classes for our target response
- Our target variable is `HeartDisease`; visualise this.
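One way to look at the class balance is sketched below. It assumes the dataset name `heartdisease`, the hypothetical object name `heart`, and ggplot2 for plotting, which is not among the packages the lab lists.

```r
library(MLDataR)
library(dplyr)
library(ggplot2)  # assumption: used here only for the bar chart

# Assumed dataset name; `heart` is a hypothetical object name
heart <- MLDataR::heartdisease

# Count and proportion of each class of the target
class_counts <- heart %>%
  count(HeartDisease) %>%
  mutate(prop = n / sum(n))
class_counts

# Bar chart of the class counts
ggplot(class_counts, aes(x = factor(HeartDisease), y = n)) +
  geom_col() +
  labs(x = "HeartDisease", y = "Count")
```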
Check for missing data
- It appears we have none, so let's start building.
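A quick check for missing values could be written as follows. This is a minimal sketch in base R, again assuming the dataset name `heartdisease` and the hypothetical object name `heart`.

```r
library(MLDataR)

# Assumed dataset name; `heart` is a hypothetical object name
heart <- MLDataR::heartdisease

# Number of missing values in each column
missing_per_column <- colSums(is.na(heart))
missing_per_column

# TRUE if any value anywhere in the data is missing
anyNA(heart)
```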
Model building
- Change the response to a factor, since a classification model requires a factor outcome.
- Split your data in a `70%/30%` proportion: set the proportion (`prop`) to `0.7`.
- Set `strata` to `HeartDisease`, ensuring stratified sampling.
- Use the `testing()` and `training()` functions to pull the required datasets.
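The splitting steps above can be sketched as below. The seed value, the object names `heart` and `heart_split`, and the dataset name `heartdisease` are assumptions for illustration; `train` and `test` follow the names used later in the lab.

```r
library(MLDataR)
library(rsample)
library(dplyr)

set.seed(123)  # hypothetical seed, only to make the split reproducible

# Assumed dataset name; characters and the response become factors
heart <- MLDataR::heartdisease %>%
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))

# 70%/30% split, stratified on the response
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)

# Pull out the two datasets
train <- training(heart_split)
test  <- testing(heart_split)
```

Stratifying on `HeartDisease` keeps the class proportions roughly the same in both parts, which matters most when one class is rare.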
Now we will fit a logistic regression model. We will again use the parsnip package, and we will use `logistic_reg()` to create a logistic regression model specification.
- Initialise a model specification.
- Set the mode to `classification`.
- Set the engine to `glm`.
- `HeartDisease ~ .`: the dot implies we want to include all variables; otherwise we could use `HeartDisease ~ var1 + var2 + ...`.
- Fit the model using the `train` data.
This fit is done using the `glm()` function, and it comes with a very handy `summary()` method as well.
This lets us see several things, such as parameter estimates, standard errors, p-values, and model fit statistics. We can use the `tidy()` function on `lr_fit` to extract some of these model attributes for further analysis or presentation.
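Specifying, fitting, and inspecting the model might look like the sketch below. The data preparation repeats the earlier steps so the block runs on its own; the seed, the name `lr_spec`, the dataset name `heartdisease`, and the use of broom (which supplies the `tidy()` method for `glm` objects but is not in the lab's package list) are assumptions.

```r
library(MLDataR)
library(parsnip)
library(rsample)
library(dplyr)
library(broom)  # assumption: provides the tidy() method for glm fits

set.seed(123)  # hypothetical seed

# Prepare and split the data (assumed dataset name)
heart <- MLDataR::heartdisease %>%
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
train <- training(heart_split)

# Model specification: logistic regression, classification mode, glm engine
lr_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

# Fit on the training data using all predictors
lr_fit <- lr_spec %>%
  fit(HeartDisease ~ ., data = train)

# The underlying glm object supports summary()
summary(lr_fit$fit)

# Coefficient table as a tibble
coefs <- tidy(lr_fit)
coefs
```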
Predictions are done much the same way. Here we use the model to predict on the `test` data.
The result is a tibble with a single column, `.pred_class`, which will be a factor variable with the same labels as the original training data set.
We can also get back probability predictions by specifying `type = "prob"`.
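Both kinds of prediction can be sketched as follows. The setup repeats the earlier steps so the block is self-contained; the seed, the dataset name `heartdisease`, and the names `class_preds` and `prob_preds` are hypothetical.

```r
library(MLDataR)
library(parsnip)
library(rsample)
library(dplyr)

set.seed(123)  # hypothetical seed

# Prepare, split, and fit as before (assumed dataset name)
heart <- MLDataR::heartdisease %>%
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
train <- training(heart_split)
test  <- testing(heart_split)

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = train)

# Hard class predictions: a tibble with one column, .pred_class
class_preds <- predict(lr_fit, new_data = test)
class_preds

# Class probabilities: one .pred_<level> column per class
prob_preds <- predict(lr_fit, new_data = test, type = "prob")
prob_preds
```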
Using `augment()` we can add the predictions to the data frame and then use that to look at model performance metrics. `truth` should be set to our response variable, which in our case is `HeartDisease`.
A good performing model would ideally have high numbers along the diagonal (upper-left to lower-right) with small numbers on the off-diagonal.
:::{.callout-tip}
If you want a more visual representation of the confusion matrix, you can pipe the result of `conf_mat()` into `autoplot()` to generate a ggplot2 chart.
:::
We can also calculate various performance metrics. One of the most common metrics is accuracy, which is how often the model predicted correctly as a percentage.
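The confusion matrix and accuracy can be computed along these lines. This is a sketch under the same assumptions as before (dataset name `heartdisease`, hypothetical seed, hypothetical names `test_preds`, `cm`, and `acc`); broom is loaded to make sure the `augment()` generic is available. The exact figures depend on the random split.

```r
library(MLDataR)
library(parsnip)
library(rsample)
library(yardstick)
library(dplyr)
library(broom)  # assumption: ensures the augment() generic is attached

set.seed(123)  # hypothetical seed

# Prepare, split, and fit as before (assumed dataset name)
heart <- MLDataR::heartdisease %>%
  mutate_if(is.character, as.factor) %>%
  mutate(HeartDisease = as.factor(HeartDisease))
heart_split <- initial_split(heart, prop = 0.7, strata = HeartDisease)
train <- training(heart_split)
test  <- testing(heart_split)

lr_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(HeartDisease ~ ., data = train)

# Add prediction columns to the test data
test_preds <- augment(lr_fit, new_data = test)

# Confusion matrix: truth is the response, estimate the predicted class
cm <- test_preds %>%
  conf_mat(truth = HeartDisease, estimate = .pred_class)
cm

# Share of test rows predicted correctly
acc <- test_preds %>%
  accuracy(truth = HeartDisease, estimate = .pred_class)
acc
```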
- Our model performed well, with `81%` accuracy.
GREAT!!!
- `lung` is a dataset in the survival package; explore the data.
- Determine the missing values.
- Impute the missing values using any method you like.
- Use `status` as your response, then find the proportion in each category.
- Plot the proportions.
- Now do the same for `sex` and plot.
- Fit a logistic model using tidymodels and determine its accuracy.