Build your first linear regression model

Machine Learning for Everyone

Author

Bongani Ncube

Published

5 June 2024

Load in required packages
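The original code chunk is not shown here; a minimal sketch, assuming the walkthrough relies on the tidyverse for data handling and plotting (and optionally broom for tidying model output):

    library(tidyverse)   # dplyr, ggplot2 and friends for data wrangling and plotting
    library(broom)       # tidy summaries of fitted models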

Note

Machine learning (ML) develops algorithms to identify patterns in data (unsupervised ML) or make predictions and inferences (supervised ML).

Supervised ML trains the machine to learn from prior examples to predict either a categorical outcome (classification) or a numeric outcome (regression), or to infer the relationships between the outcome and its explanatory variables.

Use cases for linear regression

Linear regression is particularly suited to a problem where the outcome of interest is on some sort of continuous scale (for example, quantity, money, height, weight). For outcomes of this type, it can be a first port of call before trying more complex modeling approaches. It is simple and easy to explain, and analysts will often accept a somewhat poorer fit using linear regression in order to avoid having to interpret a more complex model.

Here are some illustrative examples of questions that could be tackled with a linear regression approach:

Simple linear regression

In order to visualize our approach and improve our intuition, we will start with simple linear regression: the case where there is a single input variable and a single outcome variable.

Linear relationship between a single input and an outcome

Let our input variable be \(x\) and our outcome variable be \(y\), and assume that:

\[ y = f(x) + \epsilon \]

where \(f\) is some mathematical transformation of \(x\) and \(\epsilon\) is some random, uncontrollable error. Put another way, we assume that \(f(x)\) represents the expected (mean) value of \(y\) over many observations of \(x\), and that any single observation of \(y\) can vary from \(f(x)\) by an error \(\epsilon\). Since we are assuming that \(f(x)\) represents the mean of \(y\), our errors \(\epsilon\) are distributed around a mean of zero. We further assume that the distribution of our errors \(\epsilon\) is a normal distribution¹.
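To make these assumptions concrete, here is a small simulation (the intercept, slope, and error standard deviation are illustrative choices, not values from the original post) in which \(f(x)\) is a straight line and \(\epsilon\) is drawn from a normal distribution with mean zero:

    set.seed(2024)

    n       <- 200
    x       <- runif(n, min = 0, max = 10)    # a single input variable
    epsilon <- rnorm(n, mean = 0, sd = 2)     # random errors, normal with mean zero
    y       <- 3 + 1.5 * x + epsilon          # here f(x) = 3 + 1.5x

    sim_data <- data.frame(x, y)
    head(sim_data)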

OLS model

The population regression model \(E(Y) = X \beta\) summarizes the trend between the predictors and the mean response. The individual responses vary about the population regression line, \(y_i = X_i \beta + \epsilon_i\), with \(y_i \sim N(\mu_i, \sigma^2)\) and constant variance \(\sigma^2\). Equivalently, the model presumes a linear relationship between \(y\) and \(X\) with residuals \(\epsilon\) that are independent normal random variables with mean zero and constant variance \(\sigma^2\). The fitted model is \(\hat{y} = X \hat{\beta}\), with \(\hat{\sigma}^2\) estimating the population variance. The most common method of estimating the \(\beta\) coefficients and \(\sigma\) is ordinary least squares (OLS), which minimizes the sum of squared residuals from a random sample. The individual observed values vary about their predictions, giving residuals \(e_i = y_i - \hat{y}_i\), where \(\hat{y}_i = X_i \hat{\beta}\).

Parameter Estimation

There are two model parameters to estimate: \(\hat{\beta}\) estimates the coefficient vector \(\beta\), and \(\hat{\sigma}\) estimates the standard deviation of the residuals about the regression line.

Derive the coefficient estimators by minimizing the sum of squared residuals \(SSE = (y - X \hat{\beta})' (y - X \hat{\beta})\). The result is

\[\hat{\beta} = (X'X)^{-1}X'y.\]

The residual standard error (RSE) estimates the sample deviation around the population regression line. (Think of each value of \(X\) along the regression line as defining a subpopulation with mean \(\mu_i\) and variance \(\sigma^2\); this variance is assumed to be the same for all \(X\).)

\[\hat{\sigma} = \sqrt{(n-k-1)^{-1} e'e}.\]

The standard errors of the coefficient estimators are the square roots of the diagonal of the error variance scaled by \((X'X)^{-1}\).

\[SE(\hat{\beta}) = \sqrt{\hat{\sigma}^2 (X'X)^{-1}}.\]

Consider the following example
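The data behind the original example is not reproduced here; as a stand-in, the sketches that follow use R's built-in mtcars data and regress fuel economy (mpg) on vehicle weight (wt):

    data(mtcars)

    # simple linear regression: a single input (wt) and a numeric outcome (mpg)
    model <- lm(mpg ~ wt, data = mtcars)
    model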

Visualise the relationship graphically
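Continuing the mtcars stand-in, a scatter plot with the fitted least-squares line (using ggplot2):

    ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point() +                             # the raw observations
      geom_smooth(method = "lm", se = FALSE) +   # the fitted least-squares line
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon")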

Look at the coefficients

The summary() function shows \(\hat{\beta}\) as Estimate, \(SE({\hat{\beta}})\) as Std. Error, and \(\hat{\sigma}\) as Residual standard error. You can verify this by manually performing these calculations using matrix algebra.
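With the stand-in model, that output comes from:

    summary(model)   # Estimate, Std. Error, and the Residual standard error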

Here are the coefficient estimators, \(\hat{\beta} = (X'X)^{-1}X'y\).
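A sketch of that calculation for the stand-in model, using base matrix algebra:

    X <- model.matrix(model)          # design matrix (intercept column plus wt)
    y <- mtcars$mpg

    beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
    beta_hat                          # matches coef(model)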

Note

Here is the residual standard error, \(\hat{\sigma} = \sqrt{(n-k-1)^{-1} \hat{e}'\hat{e}}\).
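Continuing the sketch:

    e <- y - X %*% beta_hat           # residuals
    n <- nrow(X)                      # number of observations
    k <- ncol(X) - 1                  # number of predictors, excluding the intercept

    sigma_hat <- sqrt(sum(e^2) / (n - k - 1))
    sigma_hat                         # matches sigma(model)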

Use the residual standard errors to derive the standard errors of the coefficients, \(SE(\hat{\beta}) = \sqrt{\hat{\sigma}^2 (X'X)^{-1}}\).
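And the same check for the standard errors:

    var_beta_hat <- sigma_hat^2 * solve(t(X) %*% X)   # variance-covariance matrix of the coefficients
    se_beta_hat  <- sqrt(diag(var_beta_hat))          # standard errors are the square roots of the diagonal
    se_beta_hat                                       # matches the Std. Error column of summary(model)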

Model Assumptions

  • The linear regression model assumes the relationship between the predictors and the response is linear, with residuals that are independent, normally distributed random variables with mean zero and constant variance.

  • Additionally, you will want to check for multicollinearity in the predictors because it can produce unreliable coefficient estimates and predicted values.

  • Use a residuals vs fits plot \(\left( e \sim \hat{Y} \right)\) to detect non-linearity, unequal error variances, and outliers. A smoothed trend line through the residuals should be roughly flat around \(e = 0\) (linearity), and the spread of the residuals should be of constant width (in particular, no fan shape at the low or high ends). All four diagnostic plots can be produced with a single call, as sketched after this list.

  • Use a residuals normal probability plot to compare the theoretical percentiles of the normal distribution with the observed sample percentiles. It should be approximately linear.

  • A scale-location plot \(\left( \sqrt{|e / \mathrm{sd}(e)|} \sim \hat{y} \right)\) checks the homogeneity of variance of the residuals (homoscedasticity). The square root of the absolute value of the standardized residuals should be spread evenly along a horizontal line.

  • A residuals vs leverage plot identifies influential observations. A plot of the standardized residuals vs the leverage should fall within the 95% probability band.
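Base R draws these four diagnostic plots directly from the fitted model object; a minimal sketch with the stand-in model:

    par(mfrow = c(2, 2))   # arrange the four diagnostics in a single panel
    plot(model)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
    par(mfrow = c(1, 1))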

Coefficient confidence

Intuitively, these coefficients appear too precise for comfort. After all, we are attempting to estimate a relationship based on a limited set of data. As in any statistical estimation, the coefficients calculated for our model have a margin of error. Typically, we seek a 95% confidence interval to set a standard of certainty around the values we are interpreting.
The summary() function is a useful way to gather critical information about your model, including important statistics on your coefficients:

The 95% confidence interval corresponds to approximately two standard errors above or below the estimated value. For a given coefficient, if this confidence interval includes zero, you cannot reject the hypothesis that the variable has no relationship with the outcome. Another indicator of this is the Pr(>|t|) column of the coefficient summary, which represents the p-value of the null hypothesis that the input variable has no relationship with the outcome. If this value is less than a certain threshold (usually 0.05), you can conclude that this variable has a statistically significant relationship with the outcome. To see the precise confidence intervals for your model coefficients, you can use the confint() function.
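With the stand-in model:

    confint(model)                 # default 95% confidence intervals for the coefficients
    confint(model, level = 0.99)   # a wider interval for a stricter standard of certainty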

OLS Reference

Penn State University. "STAT 501, Lesson 12: Multicollinearity & Other Regression Pitfalls." https://newonlinecourses.science.psu.edu/stat501/lesson/12.

STHDA. "Bootstrap Resampling Essentials in R." http://www.sthda.com/english/articles/38-regression-model-validation/156-bootstrap-resampling-essentials-in-r/

Molnar, Christoph. “Interpretable machine learning. A Guide for Making Black Box Models Explainable”, 2019. https://christophm.github.io/interpretable-ml-book/.

Footnotes

  1. This assumption is what justifies the standard inference on the model parameters, such as the confidence intervals and hypothesis tests reported for the coefficients.