R for Public Health and Epidemiology
Logistic Regression R : Interactive Learning
My name is Bongani Ncube. I am a researcher with a passion for applying biostatistical, demographic, and epidemiological methods in evidence-based research. I was recently involved in the EPIMS study in Swaziland. The highlight of that work was investigating the MTCT (mother-to-child transmission) study outcomes among lactating women and their children at ages 6-8 weeks, 9 months, 18 months, and 24 months.
- Identify a binomial random variable and assess the validity of the binomial assumptions.
- Write a generalized linear model for binomial responses in two forms, one as a function of the logit and one as a function of \(p\).
- Explain how fitting a logistic regression differs from fitting a linear least squares regression (LLSR) model.
- Interpret estimated coefficients in logistic regression.
- Differentiate between logistic regression models with binary and binomial responses.
- Use the residual deviance to compare models, to test for lack-of-fit when appropriate, and to check for unusual observations or needed transformations.
Introduction to Logistic Regression
Logistic regression is characterized by research questions with binary (yes/no or success/failure) or binomial (number of yesses or successes in \(n\) trials) responses:
- Is dying from influenza related to certain variables?
- Is exposure to a particular chemical associated with a cancer diagnosis?
Binary Responses:
- Binary responses take on only two values: success (Y=1) or failure (Y=0), Yes (Y=1) or No (Y=0), etc. Binary responses are ubiquitous; they are one of the most common types of data that statisticians encounter. We are often interested in modeling the probability of success \(p\) based on a set of covariates, although sometimes we wish to use those covariates to classify a future observation as a success or a failure.
- Binomial responses are the number of successes in \(n\) identical, independent trials with constant probability \(p\) of success. A sequence of independent trials like this with the same probability of success is called a Bernoulli process. As with binary responses, our objective in modeling binomial responses is to quantify how the probability of success, \(p\), is associated with relevant covariates.
Much like ordinary least squares (OLS), using logistic regression to make inferences requires model assumptions.
- Binary Response The response variable is dichotomous (two possible responses) or the sum of dichotomous responses.
- Independence The observations must be independent of one another.
- Variance Structure By definition, the variance of a binomial random variable is \(np(1-p)\), so that variability is highest when \(p=.5\).
- Linearity The log of the odds, \(\log\left(\frac{p}{1-p}\right)\), must be a linear function of \(x\).
Logistic Regression vs Linear Regression : Pictorial representation
- The figure above illustrates a data set with a binary (0 or 1) response (Y) and a single continuous predictor (X). The solid line is a linear regression fit with least squares to model the probability of a success (Y=1) for a given value of X. With a binary response, the line doesn’t fit the data well, and it produces predicted probabilities below 0 and above 1. On the other hand, the logistic regression fit (dashed curve) with its typical “S” shape follows the data closely and always produces predicted probabilities between 0 and 1.
\[\begin{equation*} log(\frac{p_i}{1-p_i})=\beta_0+\beta_1 x_i \end{equation*}\] where the observed values \(Y_i \sim\) binomial with \(p=p_i\) for a given \(x_i\) and \(n=1\) for binary responses.
Mathematical Idea
We first attempt to model a binary (0 or 1) response (Y) with a single continuous predictor (X). The first guess is to use linear regression... but wait, let's see how this fails.
\[p_i =\beta_0+\beta_1X_i\]
- This is obviously a bad idea, because \(p_i\) is a probability, with values in \([0,1]\), but for certain values of \(X\) the linear model above produces values of \(p_i\) that are out of bounds.
- In layman's terms, we want an equation such that for any value of \(X\), \(p_i\) is always in the range \([0,1]\).
Transformation 1
\[p_i =\exp(\beta_0+\beta_1X_i)\]
- Let's see... for any \(X\) in the range \([-\infty,\infty]\) we get
\[p_i \in [e^{-\infty},e^{\infty}] = [0,\infty)\]
which keeps \(p_i\) non-negative but still allows values greater than 1, so this is not quite what we want.
Transformation 2
\[ p_i = \frac{\exp(\mathbf{x_i'\beta})}{1 + \exp(\mathbf{x_i'\beta})} \]
- The above gives us what we really want: for any value of \(X\), \(p_i\) now lies in \([0,1]\). Try it yourself.
So basically,
\[ p_i = f(\mathbf{x}_i ; \beta) = \frac{\exp(\mathbf{x_i'\beta})}{1 + \exp(\mathbf{x_i'\beta})} \]
Equivalently,
\[ logit(p_i) = \log\left(\frac{p_i}{1-p_i}\right) = \mathbf{x_i'\beta} \]
where \(\frac{p_i}{1-p_i}\) is the odds.
In this form, the model is specified such that a function of the mean response is linear in the predictors; hence the name Generalized Linear Models.
The likelihood function
\[L(p_i) = \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{1-Y_i}\]
where \(p_i = \frac{\exp(\mathbf{x'_i \beta})}{1+\exp(\mathbf{x'_i \beta})}\) and \(1-p_i = (1+ \exp(\mathbf{x'_i \beta}))^{-1}\)
Hence, our objective function is
\[ Q(\beta) = log(L(\beta)) = \sum_{i=1}^n Y_i \mathbf{x'_i \beta} - \sum_{i=1}^n log(1+ exp(\mathbf{x'_i \beta})) \]
We can maximize this function numerically to obtain the MLE \(\hat{\beta}\), and then use the standard asymptotic properties of MLEs to make inference.
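As a minimal sketch of this numerical approach, we can hand the negative log-likelihood to a general-purpose optimizer such as `optim()`. The data below are simulated for illustration only; they are not the influenza data.

```r
# Minimal sketch: numerical MLE for logistic regression via optim()
set.seed(123)
x <- runif(200, -0.5, 2.5)                                  # simulated predictor
y <- rbinom(200, size = 1, prob = plogis(0.5 + 0.75 * x))   # simulated binary response

neg_loglik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x             # linear predictor x_i' beta
  -sum(y * eta - log(1 + exp(eta)))        # negative of Q(beta) above
}

fit <- optim(par = c(0, 0), fn = neg_loglik, x = x, y = y,
             method = "BFGS", hessian = TRUE)
fit$par                                    # numerical MLEs for beta_0, beta_1
solve(fit$hessian)                         # approximate variance-covariance matrix

# Compare with glm(), which uses iteratively reweighted least squares
coef(glm(y ~ x, family = binomial))
```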
A key property of MLEs is that they are asymptotically unbiased, with asymptotic variance-covariance matrix given by the inverse of the Fisher information matrix
\[ \hat{\beta} \dot{\sim} AN(\beta,[\mathbf{I}(\beta)]^{-1}) \]
where the Fisher Information matrix, \(\mathbf{I}(\beta)\) is
\[ \begin{aligned} \mathbf{I}(\beta) &= E\left[\frac{\partial \log(L(\beta))}{\partial \beta}\frac{\partial \log(L(\beta))}{\partial \beta'}\right] \\ &= E\left[\left(\frac{\partial \log(L(\beta))}{\partial \beta_i} \frac{\partial \log(L(\beta))}{\partial \beta_j}\right)_{ij}\right] \end{aligned} \]
Under regularity conditions, this is equivalent to the negative of the expected value of the Hessian Matrix
\[ \begin{aligned} \mathbf{I}(\beta) &= -E[\frac{\partial^2 \log(L(\beta))}{\partial \beta \partial \beta'}] \\ &= -E[(\frac{\partial^2 \log(L(\beta))}{\partial \beta_i \partial \beta_j})_{ij}] \end{aligned} \]
Example:
\[ x_i' \beta = \beta_0 + \beta_1 x_i \]
\[ - \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_0} = \sum_{i=1}^n \frac{\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - [\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n p_i (1-p_i) \\ - \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_1} = \sum_{i=1}^n \frac{x_i^2\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - [\frac{x_i\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_i^2p_i (1-p_i) \\ - \frac{\partial^2 \ln(L(\beta))}{\partial \beta_0 \partial \beta_1} = \sum_{i=1}^n \frac{x_i\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - x_i[\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_ip_i (1-p_i) \\ \]
Hence,
\[ \mathbf{I} (\beta) = \left[ \begin{array} {cc} \sum_i p_i(1-p_i) & \sum_i x_i p_i(1-p_i) \\ \sum_i x_i p_i(1-p_i) & \sum_i x_i^2 p_i(1-p_i) \end{array} \right] \]
Likelihood Ratio Tests
To formulate the test, let \(\beta = [\beta_1', \beta_2']'\). If we are interested in testing a hypothesis about \(\beta_1\), we leave \(\beta_2\) unspecified (these are called nuisance parameters). \(\beta_1\) and \(\beta_2\) can each be a vector or a scalar, or \(\beta_2\) can be null.
Example: \(H_0: \beta_1 = \beta_{1,0}\) (where \(\beta_{1,0}\) is specified), and let \(\hat{\beta}_{2,0}\) be the MLE of \(\beta_2\) under the restriction that \(\beta_1 = \beta_{1,0}\). The likelihood ratio test statistic is
\[ -2\log\Lambda = -2[\log(L(\beta_{1,0},\hat{\beta}_{2,0})) - \log(L(\hat{\beta}_1,\hat{\beta}_2))] \]
where
- the first term is the log-likelihood of the fitted restricted model
- the second term is the log-likelihood of the fitted unrestricted model
Under the null,
\[ -2 \log \Lambda \sim \chi^2_{\upsilon} \]
where \(\upsilon\) is the dimension of \(\beta_1\)
We reject the null when \(-2\log \Lambda > \chi_{\upsilon,1-\alpha}^2\)
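In R, a likelihood ratio test for nested logistic models can be sketched as follows, continuing with the simulated `x` and `y` from the optimization sketch above (the model formulas are illustrative assumptions).

```r
# Sketch: likelihood ratio test for nested logistic regression models
fit_reduced <- glm(y ~ 1, family = binomial)        # restricted model
fit_full    <- glm(y ~ x, family = binomial)        # unrestricted model

# -2 log Lambda = difference in deviances
lrt_stat <- deviance(fit_reduced) - deviance(fit_full)
df       <- df.residual(fit_reduced) - df.residual(fit_full)
pchisq(lrt_stat, df = df, lower.tail = FALSE)       # p-value

# Equivalent built-in comparison
anova(fit_reduced, fit_full, test = "Chisq")
```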
Wald Statistics
Based on
\[ \hat{\beta} \sim AN (\beta, [\mathbf{I}(\beta)]^{-1}) \]
we can test
\[ H_0: \mathbf{L}\beta = 0 \]
where \(\mathbf{L}\) is a \(q \times p\) matrix with \(q\) linearly independent rows. Then
\[ W = (\mathbf{L\hat{\beta}})'(\mathbf{L[I(\hat{\beta})]^{-1}L'})^{-1}(\mathbf{L\hat{\beta}}) \sim \chi^2_q \]
under the null hypothesis.
Confidence interval
\[ \hat{\beta}_i \pm 1.96\, \hat{s}_{ii} \]
where \(\hat{s}_{ii}^2\) is the \(i\)-th diagonal element of \(\mathbf{[I(\hat{\beta})]}^{-1}\)
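A hedged sketch of how these Wald quantities map onto a fitted `glm` object, reusing `fit_full` from the likelihood-ratio sketch above:

```r
# Sketch: Wald statistics and confidence intervals from the fitted model
beta_hat <- coef(fit_full)
se_hat   <- sqrt(diag(vcov(fit_full)))     # sqrt of the diagonal of [I(beta_hat)]^{-1}

beta_hat / se_hat                          # Wald z statistics
cbind(lower = beta_hat - 1.96 * se_hat,    # 95% Wald confidence intervals
      upper = beta_hat + 1.96 * se_hat)

confint.default(fit_full)                  # same Wald-type intervals
summary(fit_full)$coefficients             # z values match the Wald tests
```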
If you have
- large sample size, the likelihood ratio and Wald tests have similar results.
- small sample size, the likelihood ratio test is better.
For single regressor, the model is
\[ logit\{\hat{p}_{x_i}\} \equiv logit (\hat{p}_i) = \log(\frac{\hat{p}_i}{1 - \hat{p}_i}) = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
When \(x= x_i + 1\)
\[ logit\{\hat{p}_{x_i +1}\} = \hat{\beta}_0 + \hat{\beta}_1(x_i + 1) = logit\{\hat{p}_{x_i}\} + \hat{\beta}_1 \]
Then,
\[ logit\{\hat{p}_{x_i +1}\} - logit\{\hat{p}_{x_i}\} = log\{odds[\hat{p}_{x_i +1}]\} - log\{odds[\hat{p}_{x_i}]\} \\ = log(\frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]}) = \hat{\beta}_1 \]
and
\[ exp(\hat{\beta}_1) = \frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]} \]
\(\exp(\hat{\beta}_1)\) is the estimated odds ratio. More generally, the estimated odds ratio for a difference of \(c\) units in the regressor \(x\) is \(\exp(c\hat{\beta}_1)\). When there are multiple covariates, \(\exp(\hat{\beta}_k)\) is the estimated odds ratio for the variable \(x_k\), holding all of the other variables constant.
Inference on the Mean Response
Let \(x_h = (1, x_{h1}, ...,x_{h,p-1})'\). Then
\[ \hat{p}_h = \frac{exp(\mathbf{x'_h \hat{\beta}})}{1 + exp(\mathbf{x'_h \hat{\beta}})} \]
and the estimated variance of the linear predictor \(\mathbf{x'_h\hat{\beta}}\) is \(s^2(\mathbf{x'_h\hat{\beta}}) = \mathbf{x'_h[I(\hat{\beta})]^{-1}x_h}\); a confidence interval for \(p_h\) follows by transforming the endpoints of the interval for \(\mathbf{x'_h\hat{\beta}}\).
For a new observation, we can choose a cutoff on \(\hat{p}_h\) to classify it as \(y = 1\) or \(y = 0\).
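For example, a common (though not always optimal) choice is a 0.5 cutoff on the predicted probability; the sketch below reuses `fit_full` from above and a hypothetical set of new x values.

```r
# Sketch: classifying new observations with a probability cutoff
newdat <- data.frame(x = c(-0.2, 1.0, 2.3))           # hypothetical new x values

p_hat  <- predict(fit_full, newdata = newdat, type = "response")
y_pred <- ifelse(p_hat > 0.5, 1L, 0L)                 # cutoff of 0.5
data.frame(newdat, p_hat, y_pred)
```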
Logistic Regression in action
We simulate \(x \sim Unif(-0.5,2.5)\) and set \(\eta = 0.5 + 0.75 x\).
Passing \(\eta\)’s into the inverse-logit function, we get
\[ p = \frac{\exp(\eta)}{1+ \exp(\eta)} \]
where \(p \in [0,1]\)
Then, we generate \(y \sim Bernoulli(p)\)
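A minimal sketch of this simulation and its model fit; the sample size of 500 and the random seed are my choices, and the exact estimates will vary from run to run.

```r
# Simulate data as described: x ~ Unif(-0.5, 2.5), eta = 0.5 + 0.75 x
set.seed(2023)
n   <- 500
x   <- runif(n, min = -0.5, max = 2.5)
eta <- 0.5 + 0.75 * x
p   <- exp(eta) / (1 + exp(eta))        # inverse-logit, equivalently plogis(eta)
y   <- rbinom(n, size = 1, prob = p)    # Bernoulli(p) responses

sim_fit <- glm(y ~ x, family = binomial)
summary(sim_fit)
exp(coef(sim_fit))                      # estimated odds at x = 0 and odds ratio per unit x
```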
Model Fit
Based on the estimated odds ratio:
- at \(x = 0\), the estimated odds of success are 1.59;
- for each one-unit increase in \(x\), the estimated odds of success increase by a factor of 2.19 (i.e., a 119.29% increase).
Deviance Tests
\[ H_0: \text{No variables are related to the response (i.e., model with just the intercept)} \\ H_1: \text{at least one variable is related to the response} \]
Since the p-value is essentially 0, we reject the null hypothesis that no variables are related to the response.
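A sketch of this overall deviance test for a fitted model such as `sim_fit` from the simulation above:

```r
# Sketch: overall deviance (likelihood ratio) test against the intercept-only model
dev_diff <- sim_fit$null.deviance - sim_fit$deviance
df_diff  <- sim_fit$df.null - sim_fit$df.residual
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)    # p-value for H0

# Equivalent: compare against the null model explicitly
anova(update(sim_fit, . ~ 1), sim_fit, test = "Chisq")
```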
Deviance residuals
However, this plot is not very informative on its own. Instead, we can look at residual plots in which the residuals are grouped into bins based on the predicted values (see the sketch below).
We can also plot the predicted values against the residuals.
We can also look at a binned plot of the logistic predictions versus the true category.
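One convenient way to produce such binned residual plots is `binnedplot()` from the arm package (the choice of package is an assumption; manual binning of the fitted values works just as well).

```r
# Sketch: binned residual plot for a fitted logistic regression
# install.packages("arm")   # if not already installed
library(arm)

binnedplot(x = fitted(sim_fit),
           y = residuals(sim_fit, type = "response"),
           xlab = "Predicted probability",
           main = "Binned residuals vs predicted values")
```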
Formal deviance test
Hosmer-Lemeshow test
Null hypothesis: the observed events match the expected events.
\[ X^2_{HL} = \sum_{j=1}^{J} \frac{(y_j - m_j \hat{p}_j)^2}{m_j \hat{p}_j(1-\hat{p}_j)} \]
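One implementation of this test is `hoslem.test()` in the ResourceSelection package (the package choice is an assumption), which groups observations into g bins of predicted probability:

```r
# Sketch: Hosmer-Lemeshow goodness-of-fit test
# install.packages("ResourceSelection")   # if not already installed
library(ResourceSelection)

hoslem.test(x = sim_fit$y,           # observed 0/1 responses
            y = fitted(sim_fit),     # predicted probabilities
            g = 10)                  # number of bins
```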
Odds versus probability
It is extremely common for these two terms to be used synonymously, and this can lead to serious misunderstandings when interpreting a logistic regression model.
If a certain event has a probability of 0.1, then this means that its odds are 1:9, or 0.111. If the probability is 0.5, then the odds are 1; if the probability is 0.9, then the odds are 9; and if the probability is 0.99, the odds are 99. As we approach a probability of 1, the odds grow without bound, as illustrated in the figure below:
The consequence of this is that a given increase in odds can have a different effect on probability depending on what the original probability was in the first place. If the probability was already quite low, for example 0.1, then a 4% increase in odds translates to odds of 0.116, which translates to a new probability of 0.103586, representing an increase in probability of 3.59%, which is very close to the increase in odds. If the probability was already high, say 0.9, then a 4% increase in odds translates to odds of 9.36, which translates to a new probability of 0.903475 representing an increase in probability of 0.39%, which is very different from the increase in odds. Figure shows the impact of a 4% increase in odds according to the original probability of the event.
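The arithmetic in this paragraph is easy to verify directly:

```r
# Verify the odds-vs-probability arithmetic above
odds_from_p <- function(p) p / (1 - p)
p_from_odds <- function(o) o / (1 + o)

# Starting at p = 0.1: a 4% increase in odds
p_from_odds(odds_from_p(0.1) * 1.04)   # ~0.1036, about a 3.6% increase in probability

# Starting at p = 0.9: the same 4% increase in odds
p_from_odds(odds_from_p(0.9) * 1.04)   # ~0.9035, only about a 0.4% increase
```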
The data we’re using here is from the 2013 outbreak of influenza A H7N9 in China, analyzed by Kucharski et al., published in 2014.
- case_id: the sample identifier
- date_onset: date of onset of symptoms
- date_hospitalization: date the patient was hospitalized, if available
- date_outcome: date the outcome (recovery, death) was observed, if available
- outcome: "Death" or "Recover," if available
- gender: male (m) or female (f)
- age: age of the individual, if known
- province: either Shanghai, Jiangsu, Zhejiang, or Other (lumps together less common provinces)
Let's take a glimpse of the data at hand.
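A sketch, assuming the Kucharski et al. data are available as `fluH7N9_china_2013` in the outbreaks package (the object and column names may differ slightly in your setup):

```r
# Sketch: load and glimpse the 2013 H7N9 influenza data
# install.packages(c("outbreaks", "dplyr"))
library(outbreaks)
library(dplyr)

flu_raw <- outbreaks::fluH7N9_china_2013
glimpse(flu_raw)
```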
Data cleaning
a dumb assumption
- Assume our data has a weird naming convention, as follows.
- The variable names are rather awkward, so we should adopt a better naming convention.
- We can use clean_names() from the janitor package (see the sketch after this list).
- The data type for age is wrong (factor instead of numeric/integer).
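A sketch of the renaming step with janitor (`flu_raw` is the object loaded in the glimpse sketch above):

```r
# Sketch: standardize the variable names with janitor::clean_names()
# install.packages("janitor")
library(janitor)
library(dplyr)

flu <- flu_raw %>%
  clean_names()        # converts mixed or odd names into consistent snake_case

names(flu)
```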
Missing data
- Do we have any missing data in the sheet?
- There are 2 observations missing on the gender variable; interesting. Let us remove them.
- Let us also make sure age is of the right data type (see the sketch after this list).
- That's rather weird!
- Lump all but the three most common provinces together to make the names easier to deal with (see the sketch below).
- Make the ids more explicitly a string/factor.
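A sketch of this recoding using forcats (the package choice is an assumption):

```r
# Sketch: lump all but the three most common provinces and recode the ids
# install.packages("forcats")
library(forcats)
library(dplyr)

flu <- flu %>%
  mutate(province = fct_lump(province, n = 3, other_level = "Other"),
         case_id  = paste0("case_", case_id))    # make the id explicitly a string
```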
Was someone ever hospitalized?
Number of days till hospitalization
Experienced early outcome?
Imputation of missing data
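One option, sketched below under the assumption that the mice package is acceptable tooling, is to impute the remaining missing age values from gender and province; the original analysis may have used a different strategy.

```r
# Sketch: impute missing age values with mice
# install.packages("mice")
library(mice)
library(dplyr)

flu_imp_in <- flu %>%
  mutate(gender = as.factor(gender), province = as.factor(province)) %>%
  select(age, gender, province)

imp <- mice(flu_imp_in, m = 5, seed = 42, printFlag = FALSE)
flu$age <- complete(imp, 1)$age    # take age from the first completed data set
summary(flu$age)
```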
Data for this exercise
Just as we have said before:
We would like to model the odds of success (death maybe); however, odds are strictly positive. Therefore, similar to modeling log(\(\lambda\)) in Poisson regression, which allowed the response to take on values from \(-\infty\) to \(\infty\), we will model the log(odds), the logit, in logistic regression. Logits will be suitable for modeling with a linear function of the predictors:
\[\begin{equation*} \log\left(\frac{p}{1 - p}\right)=\beta_0+\beta_1X \end{equation*}\] Models of this form are referred to as binomial regression models, more generally as logistic regression models. I will provide intuition for using and interpreting logistic regression models and then present rationale for these models using GLM theory.
In our example we could define \(X=0\) for recovery and \(X=1\) for death and fit the model:
\[\begin{equation} \log\left(\frac{p_X}{1-p_X}\right)=\beta_0 +\beta_1X \end{equation}\]
\[\begin{equation*} \beta_1 = (\beta_0 + \beta_1) - \beta_0 = \log\left(\frac{p_1}{1-p_1}\right) - \log\left(\frac{p_0}{1-p_0}\right) = \log\left(\frac{p_1/(1-p_1)}{p_0/{(1-p_0)}}\right). \end{equation*}\]
The logit model can also be re-written in a probability form:
\[\begin{equation*} p_X=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}} \end{equation*}\] which can be re-written for when someone dies as:
\[\begin{equation} p_1=\frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}} \end{equation}\] and for when someone recovers as:
\[\begin{equation} p_0=\frac{e^{\beta_0}}{1+e^{\beta_0}} \end{equation}\]
We use likelihood methods to estimate \(\beta_0\) and \(\beta_1\).
If \(Y\) follows a binomial distribution with \(n\) trials and probability of success \(p\), we can write:
\[\begin{align*} P(Y=y)&= \binom{n}{y}p^y(1-p)^{(n-y)} \\ &=e^{y\log(p) + (n-y)\log(1-p) + \log\binom{n}{y}} \end{align*}\] However, this probability mass function is not quite in one-parameter exponential family form. Note that there are two terms in the exponent which consist of a product of functions of \(y\) and \(p\). So more simplification is in order:
\[\begin{equation*} P(Y=y) = e^{y\log\left(\frac{p}{1-p}\right) + n\log(1-p)+ \log\binom{n}{y}} \end{equation*}\] Don’t forget to consider the support; we must make sure that the set of possible values for this response is not dependent upon \(p\). For fixed \(n\) and any value of \(p\), \(0<p<1\), all integer values from \(0\) to \(n\) are possible, so the support is indeed independent of \(p\).
The one-parameter exponential family form for binomial responses shows that the canonical link is \(\log\left(\frac{p}{1-p}\right)\). Thus, GLM theory suggests that constructing a model using the logit, the log odds of a score, as a linear function of covariates is a reasonable approach.
Preliminary analysis
primary response
covariates
covariates vs outcome
About 60% of our sample recovered, 27.3% of the sample are female, and 36.4% are from Shanghai.
Weird Intuition
let’s plot the logistic function \[ P(y = 1) = \frac{1}{1 + e^{-k(x - x_{0})}} \]
on this data, where we set \(x_0\) to the mean of age
and \(k\) to be some maximum gradient value. we can see these logistic functions for different values of \(k\). All of these seem to reflect the pattern we are observing to some extent, but how do we determine the best-fitting logistic function?
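A sketch of such a plot, with hypothetical values of \(k\) and assuming a data frame `flu` with a numeric `age` column:

```r
# Sketch: overlay logistic curves with different maximum gradients k,
# centred at x0 = mean age
library(ggplot2)

x0   <- mean(flu$age, na.rm = TRUE)
ks   <- c(0.05, 0.1, 0.3)                         # hypothetical gradient values
ages <- seq(min(flu$age, na.rm = TRUE), max(flu$age, na.rm = TRUE), length.out = 200)

curves <- do.call(rbind, lapply(ks, function(k) {
  data.frame(age = ages,
             p   = 1 / (1 + exp(-k * (ages - x0))),
             k   = factor(k))
}))

ggplot(curves, aes(x = age, y = p, colour = k)) +
  geom_line() +
  labs(y = "P(y = 1)", title = "Logistic functions for different values of k")
```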
Let’s look more carefully at the index of the exponential constant \(e\) in the denominator of our logistic function. Note that, because \(x_{0}\) is a constant, we have:
\[ -k(x - x_{0}) = -(-kx_{0} + kx) = -(\beta_{0} + \beta_1x) \] where \(\beta_0 = -kx_0\) and \(\beta_{1} = k\). Therefore,
\[ P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x)}} \]
This equation makes intuitive sense. As the value of \(x\) increases, the value \(e^{-(\beta_0 + \beta_1x)}\) gets smaller and smaller towards zero, and thus \(P(y = 1)\) approaches its theoretical maximum value of 1. As the value of \(x\) decreases towards zero, we see that the value of \(P(y = 1)\) approaches a minimum value of \(\frac{1}{1 + e^{-\beta_0}}\). Referring back to our influenza dataset example, we can thus see that \(\beta_0\) helps determine the baseline probability of dying assuming you have zero age. If \(\beta_0\) has an extremely negative value, this baseline probability will approach its theoretical minimum of zero.
Let’s formalize the role of \(\beta_0\) and \(\beta_1\) in the likelihood of a positive outcome. We know that for any binary event \(y\), \(P(y = 0)\) is equal to \(1 - P(y = 1)\), so
\[ \begin{aligned} P(y = 0) &= 1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1x)}} \\ &= \frac{1 + e^{-(\beta_0 + \beta_1x)} - 1}{1 + e^{-(\beta_0 + \beta_1x)}} \\ &= \frac{e^{-(\beta_0 + \beta_1x)}}{1 + e^{-(\beta_0 + \beta_1x)}} \end{aligned} \]
Putting these together, we find that
\[ \begin{aligned} \frac{P(y = 1)}{P(y = 0)} &= \frac{\frac{1}{1 + e^{-(\beta_0 + \beta_1x)}}}{\frac{e^{-(\beta_0 + \beta_1x)}}{1 + e^{-(\beta_0 + \beta_1x)}}} \\ &= \frac{1}{e^{-(\beta_0 + \beta_1x)}} \\ &= e^{\beta_0 + \beta_1x} \end{aligned} \]
or alternatively, if we apply the natural logarithm to both sides
\[ \ln\left(\frac{P(y = 1)}{P(y = 0)}\right) = \beta_0 + \beta_1x \]
The right-hand side should look familiar from the previous chapter on linear regression, meaning there is something here we can model linearly. But what is the left-hand side?
\(P(y = 1)\) is the probability that the event will occur, while \(P(y = 0)\) is the probability that the event will not occur. You may be familiar from sports like horse racing or other gambling situations that the ratio of these two represents the odds of an event. For example, if a given horse has odds of 1:4, this means that there is a 20% probability it will win and an 80% probability it will not.
Therefore we can conclude that the natural logarithm of the odds of \(y\), usually termed the log odds of \(y\), is linear in \(x\), and so we can model the log odds of \(y\) using methods similar to linear regression.
Let’s start Modeling
Our strategy for modeling is to use our questions of interest and what we have learned in the exploratory data analysis. For each model we interpret the coefficient of interest, look at the corresponding Wald test and, as a final step, compare the deviances for the different models we considered.
We first use a model where age is our only predictor.
Our estimated binomial regression model is:
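A sketch of this fit, assuming the outcome has been recoded as a binary indicator `death` (1 = Death, 0 = Recover); the variable names are assumptions based on the data description above, and dplyr is assumed loaded from the cleaning sketches.

```r
# Sketch: logistic regression of the outcome on age
flu_model <- flu %>%
  mutate(death = ifelse(outcome == "Death", 1L, 0L)) %>%
  filter(!is.na(death))

fit_age <- glm(death ~ age, data = flu_model, family = binomial)
summary(fit_age)
```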
We can interpret the coefficients as follows:
- The (Intercept) coefficient is the value of the log odds when \(x = 0\): the log odds of dying at age zero.
- The age coefficient represents the increase in the log odds of dying associated with each unit increase in age.
We can convert these coefficients from log odds to odds by applying the exponent function, to return to the identity we had previously.
\[ \frac{P(y = 1)}{P(y = 0)} = e^{\beta_0 + \beta_1x} = e^{\beta_0}(e^{\beta_1})^x \] From this, we can interpret that \(e^{\beta_0}\) represents the base odds of dying assuming you have just been born, and that for every additional unit age, those base odds are multiplied by \(e^{\beta_1}\). Given this multiplicative effect that \(e^{\beta_1}\) has on the odds, it is known as an odds ratio.
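In R, this conversion is just exponentiation of the fitted coefficients (and of the endpoints of their Wald confidence intervals), e.g. for the age-only model sketched above:

```r
# Sketch: convert log-odds coefficients to odds ratios
exp(coef(fit_age))               # e^{beta_0} (baseline odds) and e^{beta_1} (odds ratio per year)
exp(confint.default(fit_age))    # Wald-type confidence intervals on the odds-ratio scale
```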
Age and Province
Age, province and gender
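A sketch of the larger models implied by these headings, reusing the assumed variable names from the age-only sketch:

```r
# Sketch: add province, then gender, to the age-only model
fit_age_prov        <- glm(death ~ age + province,
                           data = flu_model, family = binomial)
fit_age_prov_gender <- glm(death ~ age + province + gender,
                           data = flu_model, family = binomial)

summary(fit_age_prov_gender)
exp(coef(fit_age_prov_gender))    # adjusted odds ratios
```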
Drop-in-Deviance Tests
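A drop-in-deviance (likelihood ratio) comparison of the nested models above could then look like this:

```r
# Sketch: drop-in-deviance tests comparing the nested models
anova(fit_age, fit_age_prov, fit_age_prov_gender, test = "Chisq")
```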
Work in Progress