
Survival analysis

Chapter 11 Survival Analysis: Kaplan-Meier and Cox Proportional Hazard (PH) Regression | Data Analysis in Medicine and Health using R
https://bookdown.org/drki_musa/dataanalysis/survival-analysis-kaplan-meier-and-cox-proportional-hazard-ph-regression.html#objectives-10

https://web.stanford.edu/~lutian/coursepdf/

The log-rank Test Assumes More Than the Cox Model – Statistical Thinking
It is well known that the score test for comparing survival distributions between two groups without covariate adjustment, using a Cox proportional hazards (PH) model, is identical to the log-rank \(\chi^2\) test when there are no tied failure times. Yet there persists a belief that the log-rank test is somehow completely nonparametric and does not assume PH. Log-rank and Cox approaches can only disagree if the more commonly used likelihood ratio (LR) statistic from the Cox model disagrees with the log-rank statistic (and the Cox score statistic). This article shows that in fact the log-rank and Cox LR statistics agree to a remarkable degree, and furthermore the hazard ratio arising from the log-rank test also has remarkable agreement with the Cox model counterpart. Since both methods assume PH and the log-rank test assumes within-group heterogeneity (because it doesn’t allow for covariate adjustment), the Cox model actually makes fewer assumptions than log-rank.
https://www.fharrell.com/post/logrank/
t-Test, Chi-Square, ANOVA, Regression, Correlation...
Webapp for statistical data analysis.
https://datatab.net/tutorial/log-rank-test


Log-rank test with a simple example

Let’s start with a simple example. Suppose we have Group 1 and Group 2 and we want to test whether the two groups have the same survival function or not.

Group 1:

time  status
 2      1
 3      1
 5      0
 7      1
 7      1
 8      1

Group 2:

time  status
 2      1
 2      1
 4      1
 4      1
 6      1
 8      1

In the status column, 1 means an event and 0 means censored.

There were 6 distinct time points where events were recorded: {2, 3, 4, 6, 7, 8}. (The censoring at time 5 contributes no time point of its own; see below.) We do another round of bookkeeping as follows.

        Group 1        Group 2
time   m1  q1  n1    m2  q2  n2
  2     1   0   6     2   0   6
  3     1   0   5     0   0   4
  4     0   1   4     2   0   4
  6     0   0   3     1   0   2
  7     2   0   3     0   0   1
  8     1   0   1     1   0   1

\(m\) counts events, \(q\) counts censored observations, and \(n\) is the at-risk count. The interpretation of the first row is as follows: just before time point 2, there were 6 individuals at risk in each of Group 1 and Group 2; at time 2, there was 1 event in Group 1 and 2 events in Group 2, and no one was censored in either group. Likewise, the interpretation of the third row is as follows: four individuals were at risk in each group; at time point 4, one individual in Group 1 was censored, and two events occurred in Group 2. However, if we refer to the status table above, we see that the censoring actually occurred at time point 5, not 4. Censored individuals are never counted as events, so we simply record them against the immediately preceding event time: they are still in the risk set at time 4 and have left it by the next event time, 6.
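This bookkeeping can be reproduced in base R. A hand-rolled sketch (the helper name risk_counts is my own, not a library function):

```r
# Raw data from the two status tables above
time1 <- c(2, 3, 5, 7, 7, 8); status1 <- c(1, 1, 0, 1, 1, 1)
time2 <- c(2, 2, 4, 4, 6, 8); status2 <- c(1, 1, 1, 1, 1, 1)

# Distinct event times across both groups (censoring alone adds no row)
event_times <- sort(unique(c(time1[status1 == 1], time2[status2 == 1])))

# At each event time t: m = events exactly at t, n = still at risk (time >= t)
risk_counts <- function(t, time, status) {
  c(m = sum(time == t & status == 1), n = sum(time >= t))
}

tab <- t(sapply(event_times, function(t)
  c(time = t,
    risk_counts(t, time1, status1),
    risk_counts(t, time2, status2))))
colnames(tab) <- c("time", "m1", "n1", "m2", "n2")
tab
```

Running this reproduces the m and n columns of the table above (the q columns can be derived the same way from status == 0).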

If the two groups follow the same hazard function, we estimate the hazard rate at time 2 as 3/12, i.e. \((m_1+m_2)/(n_1+n_2)\). So, we calculate the expected number of events by

time   m1  n1  m2  n2    e1    e2
  2     1   6   2   6   1.50  1.50
  3     1   5   0   4   0.56  0.44
  4     0   4   2   4   1.00  1.00
  6     0   3   1   2   0.60  0.40
  7     2   3   0   1   1.50  0.50
  8     1   1   1   1   1.00  1.00

Note that \(m_1+m_2=e_1+e_2\) at every time point.

We also see that if the hazard rates are the same, how the events split between the two groups is just a matter of random sampling. That is, \(m_1\) follows \(H(n_1+n_2, m_1+m_2, n_1)\), the hypergeometric distribution with a population size of \(n_1+n_2\), a success count of \(m_1+m_2\), and a draw size of \(n_1\).

So the expected value of m1m_1 is

E(m_1)=n_1\cdot(m_1+m_2)/(n_1+n_2).

The variance of m1m_1 is

Var(m_1)=\frac{n_1(m_1+m_2)}{n_1+n_2}\cdot\frac{n_1+n_2-m_1-m_2}{n_1+n_2}\cdot\frac{n_2}{n_1+n_2-1}
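As a sanity check, at the first event time (t = 2, where n1 = n2 = 6 and m1 + m2 = 3) this closed form can be compared against the moments of the hypergeometric distribution computed directly from its pmf with dhyper:

```r
n1 <- 6; n2 <- 6; m <- 3   # m = m1 + m2 at time 2

# Closed-form mean and variance from the formulas above
E <- n1 * m / (n1 + n2)
V <- n1 * m / (n1 + n2) * (n1 + n2 - m) / (n1 + n2) * n2 / (n1 + n2 - 1)

# Same moments from the hypergeometric pmf:
# dhyper(x, m = successes, n = failures, k = draws)
x_vals <- 0:m
p <- dhyper(x_vals, m = m, n = n1 + n2 - m, k = n1)
E_pmf <- sum(x_vals * p)
V_pmf <- sum(x_vals^2 * p) - E_pmf^2

c(E = E, V = V)          # 1.5 and about 0.6136, matching the table below
c(E = E_pmf, V = V_pmf)  # identical up to floating point
```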

By the Lyapunov central limit theorem,

z=\frac{1}{s}\sum_{\text{time points}}\left(m_1-E(m_1)\right) \sim\mathcal{N}(0,1)

where

s^2=\sum_{\text{time points}}Var(m_1)

In fact, there is a symmetry in \(Var(m_1)\), so \(Var(m_1)=Var(m_2)\).

time   m1  n1  m2  n2  Var(m1)    e1   m1-e1
  2     1   6   2   6   0.613   1.50  -0.50
  3     1   5   0   4   0.246   0.56   0.44
  4     0   4   2   4   0.428   1.00  -1.00
  6     0   3   1   2   0.240   0.60  -0.60
  7     2   3   0   1   0.250   1.50   0.50
  8     1   1   1   1   0.000   1.00   0.00

So, \(s^2=1.777\) and \(s=1.333\).

\(z=-1.16/1.333=-0.8702\). Since \(|z| < 1.96\), we won’t be able to reject the null hypothesis at the 5% level.

By the way, \(z^2\) follows \(\chi^2(1)\).

Let’s redo the calculation without rounding, using R code.

df <- data.frame(m1 = c(1, 1, 0, 0, 2, 1),
                 n1 = c(6, 5, 4, 3, 3, 1),
                 m2 = c(2, 0, 2, 1, 0, 1),
                 n2 = c(6, 4, 4, 2, 1, 1))

df$E <- df$n1 * (df$m1 + df$m2) / (df$n1 + df$n2)
df$V <- (df$n1 * df$n2) / (df$n1 + df$n2)^2 * (df$m1 + df$m2) *
  (df$n1 + df$n2 - df$m1 - df$m2) / (df$n1 + df$n2 - 1)

s_square   <- sum(df$V)
OmE_square <- (sum(df$m1 - df$E))^2

z          <- sqrt(OmE_square / s_square)
chisq_stat <- OmE_square / s_square

cat('z =', z)
cat('\nchisq =', chisq_stat)
cat('\np_val_z = ', 2 * (1 - pnorm(z)))
cat('\np_val_chisq = ', 1 - pchisq(chisq_stat, df = 1))
z = 0.8663394
chisq = 0.7505439
p_val_z =  0.3863041
p_val_chisq =  0.3863041

If we use R libraries, we can see familiar numbers.
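For example, survdiff() from the survival package performs the log-rank test. Reconstructing the two groups from the status tables above:

```r
library(survival)

d <- data.frame(
  time   = c(2, 3, 5, 7, 7, 8,  2, 2, 4, 4, 6, 8),
  status = c(1, 1, 0, 1, 1, 1,  1, 1, 1, 1, 1, 1),
  group  = rep(c(1, 2), each = 6)
)

# Log-rank test; Chisq should agree with the hand calculation (about 0.75)
survdiff(Surv(time, status) ~ group, data = d)
```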

More to add with a complex example

Dig-ressions

Cox proportional hazards regression

The hazard function is a rate, and because of that it must be strictly positive.

One form of a regression model for the hazard function that addresses the study goal is:

h(t,x,\beta)=h_0(t)r(x,\beta)

Here, \(x\) is a vector of covariates, such as indicators of certain medical conditions, and \(\beta\) is a vector of parameters.

The hazard ratio (HR) depends only on the function \(r(x,\beta)\).

If the model hazard function is

h(t,x,\beta)=h_0(t)e^{x\beta}

the hazard ratio is

HR(t,x_1,x_0)=e^{\beta(x_1-x_0)}

This model is referred to in the literature by a variety of terms, such as the Cox model, the Cox proportional hazards model or simply the proportional hazards model.

So, for example, if we have a covariate which is dichotomous (binary), such as stroke type, coded as \(x_0=0\) for ischaemic stroke (IS) and \(x_1=1\) for haemorrhagic stroke (HS), then the hazard ratio becomes

HR(t,x_1,x_0)=e^\beta.

If the value of the coefficient is \(\beta=\ln(2)\), then the interpretation is that patients with HS are dying at twice (\(e^\beta=2\)) the rate of patients with IS.
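The arithmetic is trivial to check in R, for any binary covariate whose two codings are one unit apart:

```r
beta <- log(2)
HR <- exp(beta * (1 - 0))  # e^{beta * (x1 - x0)} for a 0/1 covariate
HR                          # 2: one group dies at twice the rate of the other
```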

Some disadvantages of KM


We also acknowledge that the fully parametric regression models in survival analysis come with stringent assumptions and distributional requirements. So, to overcome the limitations of both the KM analysis and the fully parametric analysis, we can model our survival data using the semi-parametric Cox proportional hazards regression.


Let’s apply this to the data generated in ⛏️Digression - simulate survival function.

The hazard ratio is \(h_Y(t)/h_X(t)=2.75\).
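The digression itself isn’t reproduced here; as a stand-in, here is a minimal sketch that generates data of the same shape. The object name simulated_data and the columns time, event, group match the coxph() call below; exponential survival times with no censoring and a baseline rate of 0.10 are my own assumptions:

```r
# Sketch: exponential survival times with a true hazard ratio of 2.75
# between group Y and group X, i.e. h_Y(t) / h_X(t) = 2.75, no censoring.
set.seed(1)
n <- 500
group <- rep(c("X", "Y"), each = n)
rate  <- ifelse(group == "X", 0.10, 0.10 * 2.75)
simulated_data <- data.frame(
  time  = rexp(2 * n, rate = rate),
  event = 1,        # every subject experiences the event
  group = group
)
head(simulated_data)
```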

sim_data_coxph <- coxph(Surv(time = time, event = event == 1) ~ group,
                        data = simulated_data)
summary(sim_data_coxph)

It looks like exp(coef) is very close to 2.75.

Estimation from Cox proportional hazards regression

Simple Cox PH regression

Using our stroke dataset, we will estimate the parameters using the Cox PH regression. Remember, in our data we have

  1. the time variable: time2
  2. the event variable: status, where the event of interest is dead; any status other than dead is treated as censored
  3. date variables: date of admission (doa) and date of discharge (dod)
  4. all other covariates
stroke_stype <- 
  coxph(Surv(time = time2, 
             event = status == 'dead') ~ stroke_type,
                     data = stroke)
summary(stroke_stype)
tidy(stroke_stype,
     conf.int = TRUE)

The simple Cox PH model with the covariate stroke type shows that patients with IS have a crude log hazard for death of −0.662 compared to patients with HS (p-value = 0.0368).

This means that

h(t,HS)=h_0(t),

and

h(t,IS)=h_0(t)e^{-0.662}\approx h_0(t)\times 0.5157

So, patients with IS have better survival than patients with HS.
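To translate the coefficient into hazard ratios by hand (taking the log hazard estimate of about −0.662 from the summary above):

```r
b_IS <- -0.662   # estimated log hazard for IS vs HS, from the model summary
exp(b_IS)        # about 0.516: the IS hazard is roughly half the HS hazard
1 / exp(b_IS)    # about 1.94: equivalently, the HS hazard is roughly double
```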