30:30
Do your first Time to event analysis with R and tidymodels
Data Science for Everyone
Welcome to interactive learning
Survival analysis Essential Training
The Brain cancer dataset
We load the following packages
We will be examining the Braincancer
data set for this lab.
- save the data into an object
- look at the structure of the data
The core functions weβll use out of the survival package include:
-
Surv()
: Creates a survival object. -
survfit()
: Fits a survival curve using either a formula, of from a previously fitted Cox model. -
coxph()
: Fits a Cox proportional hazards regression model.
Other optional functions you might use include:
-
cox.zph()
: Tests the proportional hazards assumption of a Cox regression model. -
survdiff()
: Tests for differences in survival between two groups using a log-rank / Mantel-Haenszel test.1
Surv()
creates the response variable, and typical usage takes the time to event,[^time2] and whether or not the event occured (i.e., death vs censored). survfit()
creates a survival curve that you could then display or plot. coxph()
implements the regression analysis, and models specified the same way as in regular linear models, but using the coxph()
function.
Classes for our target
response
- our target variable is
status
andtime
visualise this
Check for missing data
- i guess we only have a missing value in the `diagnosis variable
Some key components of this survfit
object that will be used to create survival curves include:
-
time
: the timepoints at which the curve has a step, i.e. at least one event occurred -
event
: the estimate of survival at the correspondingtime
- the variable corresponding to event must be numeric and in our case the event is
status
to imply someone has died or not - time should be numeric as well
- we can see the times that have censored and those that did not
The Kaplan-Meier (KM) Estimator is a non-parametric method that estimates the survival probabilities at each time an event occurs. We will use the function survival::survfit()
, which uses the KM Estimator, to estimate survival probabilities from our data.
survfit
requires two arguments:
- A formula object where the outcome is a survival object
- A data frame
- the above results in the survival probabilities for all patients
The KM estimate provides the survival probabilities. We can plot these probabilities to look at the trend of survival over time. The plot provides
- survival probability on the \(y-axis\)
- time on the \(x-axis\)
Plotting the Survival Function
Doing it for groups
- at times we want to compare groups
- lets compare survival times between
sex
- we replace the
1
withsex
- the
p-value
here is from a log rank test which tests if there is any significant differences in survival times
we can do a logrank
test like so:
- we use the
survdiff()
function and specify the source of our data - we want to compare
sex
groups
- the p-value is above 5% and generally suggests that there is no significant difference in survival times between men and women with brain cancer
Cox Model building
- The
coxph()
function uses the same syntax as lm(), glm(), etc. The response variable you create with Surv() goes on the left hand side of the formula, specified with a ~. Explanatory variables go on the right side.
- create a model with
sex
alone
Recall that the Cox model is expressed by the hazard function \(h(t)\):
\(h(t) = h_0(t) \times exp(\beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p)\)
The quantity of interest from a Cox regression model is the hazard ratio. The quantities \(exp(\beta_i)\) are the hazard ratios.
The Hazard Ratio (HR) can be interpreted as follows:
- HR = 1: No effect
- HR < 1: indicates a decreased risk of death
- HR > 1: indicates an increased risk of death
How to Interpret Results
The estimate
column in the summary above is the regression parameter \(\beta_i\) of the Cox model.
The estimate
column quantifies the effect size (the impact) that the covariate has on the patientβs survival time.
The expression is \(exp(\beta_i)\) is the hazard ratio β this is the blah
column of the summary above.
So for example, we obtained a regression parameter \(\beta_1 = 0.4077\) for the diagnosisLG glioma vs other type diagnosis. The hazard ratio for this covariate is
\(HR = exp(\beta_1) = 1.5\).
A HR > 1 indicates increased hazard of death of males as compared to females
we can add another variable
- add the variable
gtv
and see what happens
add all variables
- we use the
dot
method to include all variables
we can use tidymodels as well
- we use the
proportional_hazards()
parsnip object - set engine to
survival
- include the name of our data object
- Well done !!! you have successfully run a
survival analysis
in R
-
lung
is a dataset insurvival
package - explore the data
- recode the variable
status
to only (1,0) values
- find overall survival probabilities and plot
- now do it for
sex
and plot
- fit two
cox models
, one with sex alone and the other withph.ecog
compare the models usingAIC()
Footnotes
Cox regression and the logrank test from
survdiff
are going to give you similar results most of the time. The log-rank test is asking if survival curves differ significantly between two groups. Cox regression is asking which of many categorical or continuous variables significantly affect survival.β©οΈ