Do your first Time to event analysis with R and tidymodels

Data Science for Everyone

Author

Bongani Ncube

WebR Status

Installing package 1 out of 6: parsnip

Welcome to interactive learning

Survival analysis Essential Training

The Brain cancer dataset

We load the following packages

The data

We will be examining the Braincancer data set for this lab.

save the data into an object

look at the structure of the data

Note

The core functions we’ll use out of the survival package include:

Surv(): Creates a survival object.
survfit(): Fits a survival curve using either a formula, of from a previously fitted Cox model.
coxph(): Fits a Cox proportional hazards regression model.

Other optional functions you might use include:

cox.zph(): Tests the proportional hazards assumption of a Cox regression model.
survdiff(): Tests for differences in survival between two groups using a log-rank / Mantel-Haenszel test.¹

Surv() creates the response variable, and typical usage takes the time to event,[^time2] and whether or not the event occured (i.e., death vs censored). survfit() creates a survival curve that you could then display or plot. coxph() implements the regression analysis, and models specified the same way as in regular linear models, but using the coxph() function.

Classes for our `target` response

our target variable is status and time

visualise this

Check for missing data

i guess we only have a missing value in the `diagnosis variable

lets start analysing

Some key components of this survfit object that will be used to create survival curves include:

time: the timepoints at which the curve has a step, i.e. at least one event occurred
event: the estimate of survival at the corresponding time

the variable corresponding to event must be numeric and in our case the event is status to imply someone has died or not
time should be numeric as well

we can see the times that have censored and those that did not

Estimating the Survival Function: Kaplan-Meier Estimator

The Kaplan-Meier (KM) Estimator is a non-parametric method that estimates the survival probabilities at each time an event occurs. We will use the function survival::survfit(), which uses the KM Estimator, to estimate survival probabilities from our data.

survfit requires two arguments:

A formula object where the outcome is a survival object
A data frame

the above results in the survival probabilities for all patients

Plot the survival probability

The KM estimate provides the survival probabilities. We can plot these probabilities to look at the trend of survival over time. The plot provides

survival probability on the $y - a x i s$
time on the $x - a x i s$

Plotting the Survival Function

Doing it for groups

at times we want to compare groups
lets compare survival times between sex
we replace the 1 with sex

Note

the p-value here is from a log rank test which tests if there is any significant differences in survival times

we can do a `logrank` test like so:

we use the survdiff() function and specify the source of our data
we want to compare sex groups

Tip

the p-value is above 5% and generally suggests that there is no significant difference in survival times between men and women with brain cancer

Cox Model building

Compute the Cox Model

The coxph() function uses the same syntax as lm(), glm(), etc. The response variable you create with Surv() goes on the left hand side of the formula, specified with a ~. Explanatory variables go on the right side.

create a model with sex alone

Note

Recall that the Cox model is expressed by the hazard function $h (t)$ :

$h (t) = h_{0} (t) \times e x p (β_{1} x_{1} + β_{2} x_{2} + \dots + β_{p} x_{p})$

The quantity of interest from a Cox regression model is the hazard ratio. The quantities $e x p (β_{i})$ are the hazard ratios.

The Hazard Ratio (HR) can be interpreted as follows:

HR = 1: No effect
HR < 1: indicates a decreased risk of death
HR > 1: indicates an increased risk of death

How to Interpret Results

The estimate column in the summary above is the regression parameter $β_{i}$ of the Cox model.

The estimate column quantifies the effect size (the impact) that the covariate has on the patient’s survival time.

The expression is $e x p (β_{i})$ is the hazard ratio – this is the blah column of the summary above.

So for example, we obtained a regression parameter $β_{1} = 0.4077$ for the diagnosisLG glioma vs other type diagnosis. The hazard ratio for this covariate is

$H R = e x p (β_{1}) = 1.5$ .

A HR > 1 indicates increased hazard of death of males as compared to females

we can add another variable

add the variable gtv and see what happens

add all variables

we use the dot method to include all variables

we can use tidymodels as well

we use the proportional_hazards() parsnip object
set engine to survival
include the name of our data object

Note

Well done !!! you have successfully run a survival analysis in R

Your turn

30:30

lung is a dataset in survival package
explore the data

Tip

recode the variable status to only (1,0) values

find overall survival probabilities and plot

now do it for sex and plot

fit two cox models , one with sex alone and the other with ph.ecog compare the models using AIC()

Footnotes

Cox regression and the logrank test from survdiff are going to give you similar results most of the time. The log-rank test is asking if survival curves differ significantly between two groups. Cox regression is asking which of many categorical or continuous variables significantly affect survival.↩︎

Welcome to interactive learning

Survival analysis Essential Training

The Brain cancer dataset

Classes for our target response

visualise this

Check for missing data

Plotting the Survival Function

Doing it for groups

we can do a logrank test like so:

Cox Model building

How to Interpret Results

we can add another variable

add all variables

we can use tidymodels as well

Footnotes

Classes for our `target` response

we can do a `logrank` test like so: