library(MASS) # For coronary data set
library(tidymodels)
coronary<-foreign::read.dta("coronary.dta")Introduction to Regression in Tidymodels
Data Science for Actuaries
Introduction
Linear regression is one of the most common statistical analyses in medical and health sciences. Linear regression models the linear (i.e. straight line) relationship between:
- outcome: numerical variable (e.g. blood pressure, BMI, cholesterol level).
- predictors/independent variables: numerical variables and categorical variables (e.g. gender, race, education level).
In simple words, we might be interested in knowing the relationship between the cholesterol level and its associated factors, for example gender, age, BMI and lifestyle. This can be explored by a Linear regression analysis.
Linear regression is a type of Generalized linear models (GLMs), which also includes other outcome types, for example categorical and count. In subsequent chapters, we will cover these outcome types in form of logistic regression and Poisson regression. Basically, the relationship between the outcome and predictors in a linear regression is structured as follows,
In this tutorial we will go over how to perform linear regression. This will include simple linear regression and multiple linear regression in addition to how you can apply transformations to the predictors. Assumption is you are already familiar with how linear regression works in Base R .
Libraries
We load tidymodels
Simple linear regression
The coronary data set contains various statistics medical patients. We will build a simple linear regression model that related the body mass index (bmi) as the response with a variable indicating the cholestrol level (chol) as the predictor.
- id: Subjects’ ID.
- cad: Coronary artery disease status (categorical) {no cad, cad}.
- sbp : Systolic blood pressure in mmHg (numerical).
- dbp : Diastolic blood pressure in mmHg (numerical).
- chol: Total cholesterol level in mmol/L (numerical).
- age: Age in years (numerical).
- bmi: Body mass index (numerical).
- race: Race of the subjects (categorical) {malay, chinese, indian}.
- gender: Gender of the subjects (categorical) {woman, man}.
We start by creating a parsnip specification for a linear regression model.
Regression in Detail
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")While it is unnecessary to set the mode for a linear regression since it can only be regression, we continue to do it in these labs to be explicit.
The specification doesn’t perform any calculations by itself. It is just a specification of what we want to do.
lm_specLinear Regression Model Specification (regression)
Computational engine: lm Once we have the specification we can fit it by supplying a formula expression and the data we want to fit the model on. The formula is written on the form y ~ x where y is the name of the response and x is the name of the predictors. The names used in the formula should match the names of the variables in the data set passed to data.
lm_fit <- lm_spec %>%
  fit(chol ~ age, data = coronary)
lm_fitparsnip model object
Call:
stats::lm(formula = chol ~ age, data = data)
Coefficients:
(Intercept)          age  
    3.49016      0.05723  The result of this fit is a parsnip model object. This object contains the underlying fit as well as some parsnip-specific information. If we want to look at the underlying fit object we can access it with lm_fit$fit or with
lm_fit %>% 
  pluck("fit")
Call:
stats::lm(formula = chol ~ age, data = data)
Coefficients:
(Intercept)          age  
    3.49016      0.05723  The lm object has a nice summary() method that shows more information about the fit, including parameter estimates and lack-of-fit statistics.
lm_fit %>% 
  pluck("fit") %>%
  summary()
Call:
stats::lm(formula = chol ~ age, data = data)
Residuals:
     Min       1Q   Median       3Q      Max 
-2.98115 -0.74813  0.01165  0.71885  2.88392 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.49016    0.51320   6.801 1.20e-10 ***
age          0.05723    0.01072   5.340 2.53e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.109 on 198 degrees of freedom
Multiple R-squared:  0.1259,    Adjusted R-squared:  0.1215 
F-statistic: 28.52 on 1 and 198 DF,  p-value: 2.534e-07We can use packages from the broom package to extract key information out of the model objects in tidy formats.
the tidy() function returns the parameter estimates of a lm object
tidy(lm_fit)# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   3.49      0.513       6.80 1.20e-10
2 age           0.0572    0.0107      5.34 2.53e- 7and glance() can be used to extract the model statistics.
glance(lm_fit)# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.126         0.121  1.11      28.5 0.000000253     1  -304.  613.  623.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>Suppose that we like the model fit and we want to generate predictions, we would typically use the predict() function like so:
predict(lm_fit)Error in predict.model_fit(lm_fit): argument "new_data" is missing, with no defaultBut this produces an error when used on a parsnip model object. This is happening because we need to explicitly supply the data set that the predictions should be performed on via the new_data argument
predict(lm_fit, new_data = coronary)# A tibble: 200 × 1
   .pred
   <dbl>
 1  6.92
 2  5.44
 3  5.55
 4  6.07
 5  6.52
 6  5.95
 7  6.01
 8  6.35
 9  5.95
10  6.24
# ℹ 190 more rowsNotice how the predictions are returned as a tibble. This will always be the case for parsnip models, no matter what engine is used. This is very useful since consistency allows us to combine data sets easily.
We can also return other types of predicts by specifying the type argument. Setting type = "conf_int" return a 95% confidence interval.
predict(lm_fit, new_data = coronary, type = "conf_int")# A tibble: 200 × 2
   .pred_lower .pred_upper
         <dbl>       <dbl>
 1        6.61        7.23
 2        5.11        5.76
 3        5.27        5.84
 4        5.90        6.23
 5        6.33        6.72
 6        5.77        6.13
 7        5.84        6.18
 8        6.19        6.52
 9        5.77        6.13
10        6.08        6.39
# ℹ 190 more rowsNot all engines can return all types of predictions.
If you want to evaluate the performance of a model, you might want to compare the observed value and the predicted value for a data set. You
# A tibble: 200 × 2
    chol .pred
   <dbl> <dbl>
 1  6.57  6.92
 2  6.32  5.44
 3  5.97  5.55
 4  7.04  6.07
 5  6.66  6.52
 6  5.97  5.95
 7  4.48  6.01
 8  5.47  6.35
 9  7.45  5.95
10  6.44  6.24
# ℹ 190 more rowsYou can get the same results using the augment() function to save you a little bit of typing.
augment(lm_fit, new_data = coronary) %>% 
  select(chol, .pred)# A tibble: 200 × 2
    chol .pred
   <dbl> <dbl>
 1  6.57  6.92
 2  6.32  5.44
 3  5.97  5.55
 4  7.04  6.07
 5  6.66  6.52
 6  5.97  5.95
 7  4.48  6.01
 8  5.47  6.35
 9  7.45  5.95
10  6.44  6.24
# ℹ 190 more rowsMultiple linear regression
The multiple linear regression model can be fit in much the same way as the simple linear regression model. The only difference is how we specify the predictors. We are using the same formula expression y ~ x, but we can specify multiple values by separating them with +s.
lm_fit2 <- lm_spec %>% 
  fit(chol ~ age+bmi+sbp, data = coronary)
lm_fit2parsnip model object
Call:
stats::lm(formula = chol ~ age + bmi + sbp, data = data)
Coefficients:
(Intercept)          age          bmi          sbp  
    3.82042      0.03676     -0.04212      0.01702  Everything else works the same. From extracting parameter estimates
tidy(lm_fit2)# A tibble: 4 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   3.82     1.28         2.98 0.00326  
2 age           0.0368   0.0113       3.25 0.00135  
3 bmi          -0.0421   0.0284      -1.48 0.140    
4 sbp           0.0170   0.00422      4.03 0.0000785to predicting new values
predict(lm_fit2, new_data = coronary)# A tibble: 200 × 1
   .pred
   <dbl>
 1  6.19
 2  5.69
 3  5.75
 4  6.24
 5  6.03
 6  5.93
 7  5.87
 8  5.84
 9  6.01
10  5.63
# ℹ 190 more rowsA shortcut when using formulas is to use the form y ~ . which means; set y as the response and set the remaining variables as predictors. This is very useful if you have a lot of variables and you don’t want to type them out.
lm_fit3 <- lm_spec %>% 
  fit(chol ~ .-id, data = coronary)
lm_fit3parsnip model object
Call:
stats::lm(formula = chol ~ . - id, data = data)
Coefficients:
(Intercept)       cadcad          sbp          dbp          age          bmi  
  4.5728723    0.2588297   -0.0002273    0.0278889    0.0051094   -0.0343331  
racechinese   raceindian    genderman  
  0.3303000    0.6557015    0.0870278  For more formula syntax look at ?formula.
Interaction terms
Adding interaction terms is quite easy to do using formula expressions. However, the syntax used to describe them isn’t accepted by all engines so we will go over how to include interaction terms using recipes as well.
There are two ways on including an interaction term; x:y and x * y
- 
x:ywill include the interaction betweenxandy,
- 
x * ywill include the interaction betweenxandy,x, andy, i.e. it is short forx:y + x + y.
with that out of the way let expand lm_fit2 by adding an interaction term
modeling
lm_fit4 <- lm_spec %>%
  fit(chol ~ age*bmi, data = coronary)
lm_fit4parsnip model object
Call:
stats::lm(formula = chol ~ age * bmi, data = data)
Coefficients:
(Intercept)          age          bmi      age:bmi  
   2.660975     0.118513     0.023089    -0.001658  note that the interaction term is named lstat:age.
Sometimes we want to perform transformations, and we want those transformations to be applied, as part of the model fit as a pre-processing step. We will use the recipes package for this task.
We use the step_interact() to specify the interaction term. Next, we create a workflow object to combine the linear regression model specification lm_spec with the pre-processing specification rec_spec_interact which can then be fitted much like a parsnip model specification.
rec_spec_interact <- recipe(chol ~ bmi + age, data = coronary) %>%
  step_interact(~ bmi:age)
lm_wf_interact <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_interact)
lm_wf_interact %>% fit(coronary)══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_interact()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept)          bmi          age    bmi_x_age  
   2.660975     0.023089     0.118513    -0.001658  Notice that since we specified the variables in the recipe we don’t need to specify them when fitting the workflow object. Furthermore, take note of the name of the interaction term. step_interact() tries to avoid special characters in variables.
Non-linear transformations of the predictors
Much like we could use recipes to create interaction terms between values are we able to apply transformations to individual variables as well. If you are familiar with the dplyr package then you know how to mutate() which works in much the same way using step_mutate().
You would want to keep as much of the pre-processing inside recipes such that the transformation will be applied consistently to new data.
rec_spec_pow2 <- recipe(chol ~ bmi, data = coronary) %>%
  step_mutate(bmi2 = bmi ^ 2)
lm_wf_pow2 <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_pow2)
lm_wf_pow2 %>% fit(coronary)══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_mutate()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept)          bmi         bmi2  
  20.866309    -0.732174     0.009046  You don’t have to hand-craft every type of linear transformation since recipes have a bunch created already here such as step_log() to take logarithms of variables.
rec_spec_log <- recipe(chol ~ bmi, data = coronary) %>%
  step_log(bmi)
lm_wf_log <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_log)
lm_wf_log %>% fit(coronary)══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_log()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept)          bmi  
     15.405       -2.543