Do your first Time to event analysis with R and tidymodels

Data Science for Everyone

Author

Bongani Ncube

Welcome to interactive learning

Survival analysis Essential Training

The Brain cancer dataset

We load the following packages

The data

We will be examining the Braincancer data set for this lab.

Note

The core functions we’ll use out of the survival package include:

  • Surv(): Creates a survival object.
  • survfit(): Fits a survival curve using either a formula, of from a previously fitted Cox model.
  • coxph(): Fits a Cox proportional hazards regression model.

Other optional functions you might use include:

  • cox.zph(): Tests the proportional hazards assumption of a Cox regression model.
  • survdiff(): Tests for differences in survival between two groups using a log-rank / Mantel-Haenszel test.1

Surv() creates the response variable, and typical usage takes the time to event,[^time2] and whether or not the event occured (i.e., death vs censored). survfit() creates a survival curve that you could then display or plot. coxph() implements the regression analysis, and models specified the same way as in regular linear models, but using the coxph() function.

Classes for our target response

  • our target variable is status and time

visualise this

Check for missing data

  • i guess we only have a missing value in the `diagnosis variable
lets start analysing

Some key components of this survfit object that will be used to create survival curves include:

  • time: the timepoints at which the curve has a step, i.e. at least one event occurred
  • event: the estimate of survival at the corresponding time
  • the variable corresponding to event must be numeric and in our case the event is status to imply someone has died or not
  • time should be numeric as well
  • we can see the times that have censored and those that did not
Estimating the Survival Function: Kaplan-Meier Estimator

The Kaplan-Meier (KM) Estimator is a non-parametric method that estimates the survival probabilities at each time an event occurs. We will use the function survival::survfit(), which uses the KM Estimator, to estimate survival probabilities from our data.

survfit requires two arguments:

  1. A formula object where the outcome is a survival object
  2. A data frame
  • the above results in the survival probabilities for all patients
Plot the survival probability

The KM estimate provides the survival probabilities. We can plot these probabilities to look at the trend of survival over time. The plot provides

  1. survival probability on the \(y-axis\)
  2. time on the \(x-axis\)

Plotting the Survival Function

Doing it for groups

  • at times we want to compare groups
  • lets compare survival times between sex
  • we replace the 1 with sex
Note
  • the p-value here is from a log rank test which tests if there is any significant differences in survival times

we can do a logrank test like so:

  • we use the survdiff() function and specify the source of our data
  • we want to compare sex groups
Tip
  • the p-value is above 5% and generally suggests that there is no significant difference in survival times between men and women with brain cancer

Cox Model building

Compute the Cox Model
  • The coxph() function uses the same syntax as lm(), glm(), etc. The response variable you create with Surv() goes on the left hand side of the formula, specified with a ~. Explanatory variables go on the right side.
  • create a model with sex alone
Note

Recall that the Cox model is expressed by the hazard function \(h(t)\):

\(h(t) = h_0(t) \times exp(\beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p)\)

The quantity of interest from a Cox regression model is the hazard ratio. The quantities \(exp(\beta_i)\) are the hazard ratios.

The Hazard Ratio (HR) can be interpreted as follows:

  • HR = 1: No effect
  • HR < 1: indicates a decreased risk of death
  • HR > 1: indicates an increased risk of death

How to Interpret Results

The estimate column in the summary above is the regression parameter \(\beta_i\) of the Cox model.

The estimate column quantifies the effect size (the impact) that the covariate has on the patient’s survival time.

The expression is \(exp(\beta_i)\) is the hazard ratio – this is the blah column of the summary above.

So for example, we obtained a regression parameter \(\beta_1 = 0.4077\) for the diagnosisLG glioma vs other type diagnosis. The hazard ratio for this covariate is

\(HR = exp(\beta_1) = 1.5\).

A HR > 1 indicates increased hazard of death of males as compared to females

we can add another variable

  • add the variable gtv and see what happens

add all variables

  • we use the dot method to include all variables

we can use tidymodels as well

  • we use the proportional_hazards() parsnip object
  • set engine to survival
  • include the name of our data object
Note
  • Well done !!! you have successfully run a survival analysis in R
Your turn
30:30
  • lung is a dataset in survival package
  • explore the data
Tip
  • recode the variable status to only (1,0) values
  • find overall survival probabilities and plot
  • now do it for sex and plot
  • fit two cox models , one with sex alone and the other with ph.ecog compare the models using AIC()

Footnotes

  1. Cox regression and the logrank test from survdiff are going to give you similar results most of the time. The log-rank test is asking if survival curves differ significantly between two groups. Cox regression is asking which of many categorical or continuous variables significantly affect survival.β†©οΈŽ