 | R | Stata | Python
Strength | | Simple functionality; powerful quasi-experimental/Meta-analysis. U of U MSCI uses. | Best NLP, machine learning libraries
Weakness | Clunky syntax; many ‘dialects’ | Simple syntax | Moderately Complex Syntax
Explainable Programming* | Quarto | No options | Jupyter
*The idea that code should be readable by consumers of science has caught on in more quantitative fields (Math, CS), but will be coming to medicine. Long overdue. Learn now.
Today’s data
stroke

doa | dod | status | sex | dm | gcs | sbp | dbp | wbc | time2 | stroke_type | referral_from
17/2/2011 | 18/2/2011 | alive | male | no | 15 | 151 | 73 | 12.5 | 1 | IS | non-hospital
20/3/2011 | 21/3/2011 | alive | male | no | 15 | 196 | 123 | 8.1 | 1 | IS | non-hospital
9/4/2011 | 10/4/2011 | dead | female | no | 11 | 126 | 78 | 15.3 | 1 | HS | hospital
12/4/2011 | 13/4/2011 | dead | male | no | 3 | 170 | 103 | 13.9 | 1 | IS | hospital
12/4/2011 | 13/4/2011 | alive | female | yes | 15 | 103 | 62 | 14.7 | 1 | IS | non-hospital
4/5/2011 | 5/5/2011 | dead | female | no | 3 | 91 | 55 | 14.2 | 1 | HS | hospital
Today’s data
stroke
These data come from patients who were admitted to a tertiary hospital for acute stroke. They were treated in the ward, and their status at discharge (dead or alive) was recorded. The variables are:
doa : date of admission
dod : date of discharge
status : event at discharge (alive or dead)
sex : male or female
dm : diabetes (yes or no)
gcs : Glasgow Coma Scale (value from 3 to 15)
sbp : Systolic blood pressure (mmHg)
Today’s data
stroke
dbp : Diastolic blood pressure (mmHg)
wbc : Total white cell count
time2 : days in ward
stroke_type : stroke type (Ischaemic stroke or Haemorrhagic stroke)
referral_from : whether the patient was referred from a hospital or not
Today’s data
stroke
The outcome of interest is time from admission to death. The time variable is time2 (in days) and the event variable is status; the event of interest is dead. We also note that the date variables and the other categorical variables are in character format, while the rest are numeric.
Today’s data
Cancer

marital | sex | age | surgery | race | first | status | survivaltime
Single | Male | 4 | Yes | White | Yes | 0 | 290
Single | Female | 4 | Yes | White | Yes | 1 | 9
Single | Female | 4 | Yes | Black | Yes | 1 | 10
Single | Female | 5 | Yes | White | Yes | 0 | 141
Single | Female | 5 | No | White | Yes | 1 | 12
Single | Male | 5 | No | White | Yes | 1 | 54
Today’s data
Cancer
This study used the latest release of the 2017 submission of the SEER database (1973 to 2015 data) from the National Cancer Institute. Patients with a diagnosis of AO were selected from the SEER database using the International Classification of Diseases for Oncology, Third Edition (ICD-O-3) histology code 9451. There are 1824 patients diagnosed with AO in this dataset.
The dataset contains all the variables of interest. For the purpose of this session, we want to explore the prognostic association of age, surgery status and marital status with the survival of AO patients. The variables in the dataset are as follows:
Today’s data
Cancer
age : biological age at the beginning of the study (numeric)
surgery : whether the patient has undergone surgery, coded as “Yes” and “No”
marital : marital status at diagnosis, coded as “single”, “married”, or “separated/divorced/widowed”
survivaltime : survival time in months (numeric)
status : survival status, coded as “1 = died” and “0 = censored”
Basic data manipulation
Useful operators

<-   “save as”   (shortcut: Opt + -)
|>   “and then”   (shortcut: Cmd + Shift + M)
Common functions
filter keeps or discards rows (aka observations)
select keeps or discards columns (aka variables)
arrange sorts data set by certain variable(s)
count tallies data set by certain variable(s)
mutate creates new variables
group_by/summarize aggregates data (pivot tables!)
str_* functions work easily with text
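A short, hedged sketch of a few of these verbs applied to the stroke data (output omitted; column names follow the dataset shown above, and the tidyverse is assumed to be loaded):

library(tidyverse)

# tally patients by stroke type
stroke |> count(stroke_type)

# sort patients by Glasgow Coma Scale, lowest first
stroke |> arrange(gcs)

# create a new variable: pulse pressure = systolic minus diastolic
stroke |> mutate(pulse_pressure = sbp - dbp)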
Syntax of a function
function(data, argument(s))
is the same as
data |>
function(argument(s))
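A concrete, hedged illustration with the stroke data (both calls do the same thing):

# classic form: the data is the first argument
filter(stroke, sex == "male")

# piped form: read as “take stroke, and then filter it”
stroke |> filter(sex == "male")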
Filter
filter keeps or discards rows (aka observations)
the == operator tests for equality
stroke |> filter(sex == "male")

doa | dod | status | sex | dm | gcs | sbp | dbp | wbc | time2 | stroke_type | referral_from
17/2/2011 | 18/2/2011 | alive | male | no | 15 | 151 | 73 | 12.5 | 1 | IS | non-hospital
20/3/2011 | 21/3/2011 | alive | male | no | 15 | 196 | 123 | 8.1 | 1 | IS | non-hospital
12/4/2011 | 13/4/2011 | dead | male | no | 3 | 170 | 103 | 13.9 | 1 | IS | hospital
22/5/2011 | 23/5/2011 | alive | male | no | 11 | 171 | 80 | 8.7 | 1 | IS | hospital
21/10/2011 | 22/10/2011 | alive | male | no | 4 | 230 | 120 | 12.7 | 1 | IS | hospital
28/11/2011 | 29/11/2011 | dead | male | no | 10 | 207 | 128 | 10.8 | 1 | HS | non-hospital
Filter
the | operator signifies “or”
cancer |> filter(marital == "Single" | marital == "Married")

marital | sex | age | surgery | race | first | status | survivaltime
Single | Male | 4 | Yes | White | Yes | 0 | 290
Single | Female | 4 | Yes | White | Yes | 1 | 9
Single | Female | 4 | Yes | Black | Yes | 1 | 10
Single | Female | 5 | Yes | White | Yes | 0 | 141
Single | Female | 5 | No | White | Yes | 1 | 12
Single | Male | 5 | No | White | Yes | 1 | 54

marital | n | percent
Married | 1137 | 0.7316602
Single | 417 | 0.2683398
Filter
the %in% operator allows for multiple options in a list
cancer |> filter(marital %in% c("Separated/divorced/widowed", "Single"))

marital | sex | age | surgery | race | first | status | survivaltime
Single | Male | 4 | Yes | White | Yes | 0 | 290
Single | Female | 4 | Yes | White | Yes | 1 | 9
Single | Female | 4 | Yes | Black | Yes | 1 | 10
Single | Female | 5 | Yes | White | Yes | 0 | 141
Single | Female | 5 | No | White | Yes | 1 | 12
Single | Male | 5 | No | White | Yes | 1 | 54

marital | n | percent
Separated/divorced/widowed | 270 | 0.3930131
Single | 417 | 0.6069869
Filter
the & operator combines conditions
cancer |> filter(marital %in% c("Separated/divorced/widowed", "Single") & status == 1)

marital | sex | age | surgery | race | first | status | survivaltime
Single | Female | 4 | Yes | White | Yes | 1 | 9
Single | Female | 4 | Yes | Black | Yes | 1 | 10
Single | Female | 5 | No | White | Yes | 1 | 12
Single | Male | 5 | No | White | Yes | 1 | 54
Single | Male | 6 | Yes | Others | Yes | 1 | 11
Single | Male | 8 | Yes | Others | Yes | 1 | 8
Select
select keeps or discards columns (aka variables)
stroke |> select(sex, status, gcs)

sex | status | gcs
male | alive | 15
male | alive | 15
female | dead | 11
male | dead | 3
female | alive | 15
female | dead | 3
Select
can drop columns with -column
stroke |> select(-sex)

doa | dod | status | dm | gcs | sbp | dbp | wbc | time2 | stroke_type | referral_from
17/2/2011 | 18/2/2011 | alive | no | 15 | 151 | 73 | 12.5 | 1 | IS | non-hospital
20/3/2011 | 21/3/2011 | alive | no | 15 | 196 | 123 | 8.1 | 1 | IS | non-hospital
9/4/2011 | 10/4/2011 | dead | no | 11 | 126 | 78 | 15.3 | 1 | HS | hospital
12/4/2011 | 13/4/2011 | dead | no | 3 | 170 | 103 | 13.9 | 1 | IS | hospital
12/4/2011 | 13/4/2011 | alive | yes | 15 | 103 | 62 | 14.7 | 1 | IS | non-hospital
4/5/2011 | 5/5/2011 | dead | no | 3 | 91 | 55 | 14.2 | 1 | HS | hospital
Select
the pipe |> (or the older %>%) chains multiple functions together
We can combine dplyr for data wrangling and then ggplot2 (both are packages inside the tidyverse metapackage) to plot the data. For example, the dplyr part for data wrangling:

pep_age <- cancer |>
  group_by(Survival) |>
  summarize(mean_age = mean(age))
pep_age
# A tibble: 2 × 2
Survival mean_age
<chr> <dbl>
1 Died 52.2
2 censored 45.1
A distribution of a categorical variable
And the ggplot2 part to make the plot:
ggplot(pep_age, mapping = aes(x = Survival, y = mean_age)) +
  geom_col() +
  theme_stata()   # theme_stata() comes from the ggthemes package
A distribution of a categorical variable
We can combine the dplyr and ggplot2 steps in a single chain, which saves time:
library(ggthemes)

cancer |>
  mutate(Survival = as.factor(Survival)) |>
  group_by(Survival) |>
  summarize(mean_age = mean(age)) |>
  ggplot(mapping = aes(x = Survival, y = mean_age, fill = Survival)) +
  geom_col() +
  ylab("Mean age (Years)") +
  xlab("Survival Status") +
  theme_stata()
Histogram
To plot the distribution of a numerical variable, we can plot a histogram. We can control the bins with binwidth and add some customization.
By overlaying histograms, we can examine the distribution of a numerical variable (age) across the levels of survival status. First, we create a histogram object called hist_surv. Next, we create a boxplot object named box_surv. After that, we combine the two objects side by side using a vertical bar; a sketch of how these objects might be built is shown below.
You can read more about the placement of multiple plots in the patchwork package documentation, which covers arranging multiple plots in a single figure.
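A minimal sketch of how the two objects might be created (the object names follow the text above; the aesthetics and the use of Survival as the grouping variable are assumptions, not the original code):

library(ggplot2)

# overlaid histograms of age by survival status
hist_surv <- ggplot(cancer, aes(x = age, fill = Survival)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.5)

# boxplot of age by survival status
box_surv <- ggplot(cancer, aes(x = Survival, y = age)) +
  geom_boxplot()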
library(patchwork)
hist_surv | box_surv
Overlaying histograms and boxplot
library(patchwork)
hist_surv | box_surv
Faceting the plots
It is hard to visualize three variables in a single histogram, but we can use the facet_*() functions (facet_wrap() or facet_grid()) to split the plot into panels.
The plots are easier to read if we split the histogram by a grouping variable. In this example, we stratify the distribution of the variable age (a numerical variable) by survival status and sex (both categorical variables).
Faceting the plots
ggplot(data = cancer, aes(x = survivaltime, fill = sex)) +
  geom_histogram(binwidth = 5, aes(y = ..density..), position = "identity", alpha = 0.45) +
  geom_density(aes(linetype = sex), alpha = 0.65) +
  scale_fill_grey() +
  xlab("survival time") +
  ylab("Density") +
  labs(title = "Density distribution of survival time for status and gender",
       caption = "Source : Survival Analysis data") +
  theme_bw() +
  facet_wrap(~ Survival)
How to do a statistical analysis
Biostatistics is the branch of statistics that applies statistical methods and techniques to biological, medical, and health sciences.
Step 0: Save yourself a headache and collect your data in a processable format
Step 1: Data Wrangling
Each row is an observation (usually a patient)
Each column contains only 1 type of data (more below)
No free text (if you need to, categorize responses)
How to do a statistical analysis
Step 2: For each data element, consider the data type
Binary (aka dichotomous scale): e.g. Yes or No, 0 or 1
Unordered Categorical (nominal scale): e.g. Utah, Colorado, Nevada, Idaho
Ordered Categorical (ordinal scale): e.g. Room air, nasal cannula, HFNC, intubated, ECMO, dead
Continuous (interval & ratio scales - differ by whether 0 is special): e.g. Temperature (Celsius or Kelvin, respectively)
How to choose a statistical test
Statistic | dichotomous | nominal | ordinal | interval
a.k.a. | binary | categorical | ordered categorical | continuous
n | X | X | X | X
% | X | X | X | X
min | | | X | X
max | | | X | X
range | | | X | X
mode | X | X | X | X
mean | | | | X
median | | | X | X
IQR | | | X | X
Std. dev. | | | | X
Std. err. | | | | X
From: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine.
How to choose a statistical test
How do you choose the right test?
What type of variables? How many groups? Are the samples correlated (e.g. observation from the same patient at two different times)?
How do you choose the right test?
Level of measurement of outcome variable | Two Independent Groups | Three or more Independent Groups | Two Correlated* Samples | Three or more Correlated* Samples
Dichotomous | chi-square or Fisher’s exact test | chi-square or Fisher-Freeman-Halton test | McNemar test | Cochran Q test
Unordered categorical | chi-square or Fisher-Freeman-Halton test | chi-square or Fisher-Freeman-Halton test | Stuart-Maxwell test | Multiplicity adjusted Stuart-Maxwell tests#
Ordered categorical | Wilcoxon-Mann-Whitney (WMW) test | Old school***: Kruskal-Wallis analysis of variance (ANOVA); New school***: multiplicity adjusted WMW tests | Wilcoxon signed-rank test | Old school#: Friedman two-way ANOVA by ranks; New school#: multiplicity adjusted Wilcoxon signed-rank tests
Continuous | independent groups t-test | Old school***: one-way ANOVA; New school***: multiplicity adjusted independent groups t-tests | paired t-test | mixed effects linear regression
Censored: time to event | log-rank test | Multiplicity adjusted log-rank test | Shared-frailty Cox regression | Shared-frailty Cox regression
Examples AO cancer
What test would we use to assess whether:
“surgery” and “Survival” are associated beyond what’s attributable to chance?
“sex” and “Survival” are associated?
“survivaltime” and “surgery” are associated?
“age” and “Survival” are associated?
Test for Associations
Does survival outcome depend on sex?
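The call that produced the output below is not shown on the slide; a minimal sketch of how it might have been generated (assuming sex and Survival are character or factor columns):

# chi-square test of independence between sex and survival status
chisq.test(cancer$sex, cancer$Survival)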
Pearson's Chi-squared test with Yates' continuity correction
data: cancer$sex and cancer$Survival
X-squared = 0.6967, df = 1, p-value = 0.4039
Test for Associations
Does survival time depend on surgery status?
t_test_result <- t.test(survivaltime ~ surgery, data = cancer)
print(t_test_result)
Welch Two Sample t-test
data: survivaltime by surgery
t = -3.4353, df = 271.52, p-value = 0.0006845
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-23.234358 -6.305457
sample estimates:
mean in group No mean in group Yes
45.22326 59.99316
Test for Associations
Does survival outcome depend on age?
t_test_result <- t.test(age ~ Survival, data = cancer)
print(t_test_result)
Welch Two Sample t-test
data: age by Survival
t = -10.293, df = 1817.5, p-value < 0.00000000000000022
alternative hypothesis: true difference in means between group censored and group Died is not equal to 0
95 percent confidence interval:
-8.419480 -5.724381
sample estimates:
mean in group censored mean in group Died
45.10299 52.17492
Test for Associations (Neat Tables)
library(gtsummary)

# keep only the variables we want to summarize
p <- cancer |> select(-first, -marital, -status, -race)

# descriptive statistics by survival status (mean and SD for continuous variables,
# counts and percentages for categorical variables), with ANOVA p-values
theme_gtsummary_journal("jama")

table1 <- p |>
  tbl_summary(
    by = Survival,
    type = list(surgery ~ "categorical"),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    )
  ) |>
  add_p(
    test = all_continuous() ~ "aov",
    pvalue_fun = function(x) style_pvalue(x, digits = 2)
  ) |>
  modify_header(statistic ~ "**Test Statistic**") |>
  bold_labels() |>
  modify_fmt_fun(statistic ~ style_sigfig)

table1
Characteristic | Died, N = 989 | censored, N = 835 | Test Statistic | p-value¹
sex, n / N (%) | | | 0.78 | 0.38
Female | 425 / 989 (43%) | 376 / 835 (45%) | |
Male | 564 / 989 (57%) | 459 / 835 (55%) | |
surgery, n / N (%) | 835 / 989 (84%) | 774 / 835 (93%) | 30 | <0.001
survivaltime, Mean (SD) | 38 (42) | 83 (65) | 321 | <0.001
¹ Pearson’s Chi-squared test; One-way ANOVA
Statistical Modeling
Understand the logic of regression analysis
Recall: if there is an association between an ‘exposure’ and an ‘outcome’, there are 4 possible explanations
Chance
Confounding (some other factor influences both the exposure and the outcome)
Bias
Or, causation (a real effect)
Understand the logic of regression analysis
There are at least 3 uses of regression models:
Inferential Statistics: Hypothesis testing with confounding control
Descriptive Statistics: Summarize the strength of association
Prediction of an outcome (e.g. statistical machine learning)
Understand the logic of regression analysis
Regression comes with additional assumptions:
Independent observations (special “mixed models” can relax this)
The form of the output variable is correct*
The forms of the predictor variables are correct
The relationships between the predictors are properly specified.**
Additional constraints (e.g. constant variance)
Understand the logic of regression analysis
Thus the logic is: if the assumptions of the models hold in reality, then the described relationships are valid
No model is perfect, but some models are useful
Morris moment(TM)
Output variable (aka the dependent variable, predicted variable) form determines the type of regression:
EXAMPLES
Output variable form | Test | Regression
Dichotomous | Chi2 Test | logistic regression
Unordered categorical | Chi2 Test | multinomial logistic regression
Ordered categorical | Wilcoxon-Mann-Whitney | ordinal logistic regression
Continuous (normally distributed) | T-test | linear regression
Censored: time to event | Log-rank test | Cox regression

From: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine.
Interpretation:
Regression coefficient = the change in the outcome you expect if you change the predictor by 1 unit, holding all other variables constant
For linear regression: additive change in the outcome
For logistic regression: multiplicative change in the odds of the outcome
For Cox regression: multiplicative change in the hazard of the outcome.
Example:
For example, if we want to test whether ‘surgery’ and ‘Survival’ are associated, we could use a chi2 test:
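The call is not shown on the slide; a minimal sketch of how the output below might have been produced:

# chi-square test of independence between surgery and survival status
chisq.test(cancer$surgery, cancer$Survival)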
Pearson's Chi-squared test with Yates' continuity correction
data: cancer$surgery and cancer$Survival
X-squared = 28.961, df = 1, p-value = 0.00000007386
model development
Alternatively you could specify a logistic regression
(“GLM” stands for ‘generalised linear model’. Logistic regression is a type of GLM where the family is binomial.)
logistic_model <- glm(status ~ surgery, data = cancer, family = binomial())

# Output the summary of the model to see coefficients and statistics
summary(logistic_model)
Call:
glm(formula = status ~ surgery, family = binomial(), data = cancer)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9261 0.1513 6.121 0.000000000927 ***
surgeryYes -0.8502 0.1593 -5.337 0.000000094385 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2515.6 on 1823 degrees of freedom
Residual deviance: 2484.7 on 1822 degrees of freedom
AIC: 2488.7
Number of Fisher Scoring iterations: 4
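The summary reports log-odds coefficients. A minimal, hedged sketch (not on the original slide) of converting them to odds ratios with base R:

# odds ratios: exponentiate the log-odds coefficients
exp(coef(logistic_model))

# Wald 95% confidence intervals on the odds-ratio scale
exp(confint.default(logistic_model))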
model development
logistic_model <- glm(status ~ age, data = cancer, family = binomial())

# Output the summary of the model to see coefficients and statistics
summary(logistic_model)
Call:
glm(formula = status ~ age, family = binomial(), data = cancer)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.377002 0.167291 -8.231 <0.0000000000000002 ***
age 0.031789 0.003311 9.600 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2515.6 on 1823 degrees of freedom
Residual deviance: 2416.3 on 1822 degrees of freedom
AIC: 2420.3
Number of Fisher Scoring iterations: 4
model development
logistic_model <- glm(status ~ sex, data = cancer, family = binomial())

# Output the summary of the model to see coefficients and statistics
summary(logistic_model)
Call:
glm(formula = status ~ sex, family = binomial(), data = cancer)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.12250 0.07080 1.730 0.0836 .
sexMale 0.08350 0.09468 0.882 0.3778
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2515.6 on 1823 degrees of freedom
Residual deviance: 2514.8 on 1822 degrees of freedom
AIC: 2518.8
Number of Fisher Scoring iterations: 3
model development
logistic_model <- glm(status ~ surgery, data = cancer, family = binomial())

# Output the summary of the model to see coefficients and statistics
summary(logistic_model)
Call:
glm(formula = status ~ surgery, family = binomial(), data = cancer)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9261 0.1513 6.121 0.000000000927 ***
surgeryYes -0.8502 0.1593 -5.337 0.000000094385 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2515.6 on 1823 degrees of freedom
Residual deviance: 2484.7 on 1822 degrees of freedom
AIC: 2488.7
Number of Fisher Scoring iterations: 4
Survival analysis - survival estimates
The Kaplan-Meier (KM) estimator is a non-parametric estimator of survival: it provides survival probability estimates at different times. Using survfit(), we can estimate the survival probabilities with the Kaplan-Meier method.
Let’s estimate the survival probabilities for
overall
Surgery
sex
race
The survival probabilities for all patients:
Kaplan-Meier survival estimates
library(survival)

KM <- survfit(Surv(time = survivaltime, event = Survival == "Died") ~ 1, data = cancer)
summary(KM)
There are a number of available tests to compare the survival estimates between groups based on KM. The tests include:
log-rank (default)
peto-peto test
Log-rank test
From the Kaplan-Meier survival curves, we can see a graphical representation of the survival probabilities in different groups over time. To answer whether the survival estimates differ between levels or groups, we can use statistical tests, for example the log-rank and the peto-peto tests.
For all of these tests, the null hypothesis is that the survival estimates of the levels or groups being compared are not different. For example, to compare survival between the sexes:
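The call is not shown on the slide; a minimal sketch with the survival package (the event coding is an assumption based on the dataset description):

# log-rank test (default rho = 0) comparing survival between sexes
survdiff(Surv(time = survivaltime, event = Survival == "Died") ~ sex, data = cancer)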
The survival estimates between Male and female patients are not different (p-value = 0.1).
peto-peto test
We will be more confident in our results if we obtain similar findings from other tests. So, now let’s compare survival estimates using the peto-peto test.
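The call is not shown here; a hedged sketch using survdiff() with rho = 1 (the Peto & Peto modification), comparing surgery groups:

# peto-peto test comparing survival estimates between surgery groups
survdiff(Surv(time = survivaltime, event = Survival == "Died") ~ surgery, data = cancer, rho = 1)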
This is the result for comparing survival estimates for surgery status using peto-peto test.
One advantage of time-to-event data (from a cohort study) is the ability to estimate the hazard, or risk, of developing the event (outcome) of interest. However, a challenge in cohort studies is the presence of censoring. Censoring can happen because
patients leave the study (loss to follow-up) at random
patients do not experience the event even at the termination of the study
patients are withdrawn from the study
In censored patients, we do not know the exact time at which they develop the event.
Semi-parametric in survival analysis
To explore how to incorporate a regression model-like structure into the hazard function, we can model the hazard function using:
\[h(t) = \theta_0\]
The hazard function is a rate, and because of that it must be strictly positive. To constrain \(\theta_0\) to be greater than zero, we can parameterize the hazard function as:
\[h(t) = e^{\beta_0}\]
Semi-parametric models in survival analysis
So for a covariate \(x\) the log-hazard function is:
\[\ln[h(t,x)] = \beta_0 + \beta_1 x\]
and the hazard function is
\[h(t,x) = e^{\beta_0 + \beta_1 x}\]
This is the exponential distribution, which is one example of a fully parametric hazard function. Fully parametric models accomplish two goals simultaneously:
It describes the basic underlying distribution of survival time (error component)
It characterizes how that distribution changes as a function of the covariates (systematic component).
Semi-parametric models in survival analysis
However, even though fully parametric models can be used to accomplish the above goals, the assumptions required for their error components may be unnecessarily stringent or unrealistic. One option is to keep a fully parametric regression structure but leave the dependence on time unspecified. Models that take this approach are called semiparametric regression models.
Cox proportional hazards regression
If we want to compare the survival experience of cancer patients based on surgery status, one form of a regression model for the hazard function that addresses the study goal is:
\[h(t,x,\beta) = h_0(t)\,r(x,\beta)\]
We can see that the hazard function is the product of two functions:
The function, \(h_0(t)\), characterizes how the hazard function changes as a function of survival time.
The function, \(r(x,\beta)\), characterizes how the hazard function changes as a function of subject covariates.
The \(h_0(t)\) is frequently referred to as the baseline hazard function.
Cox proportional hazards regression
The hazard ratio (HR) depends only on the function \(r(x,\beta)\). If the ratio function \(HR(t,x_1,x_0)\) has a clear clinical interpretation, then the actual form of the baseline hazard function is of little importance.
With this parameterization the hazard function is
\[h(t,x,\beta) = h_0(t)e^{x\beta}\]
and the hazard ratio is
\[HR(t,x_1, x_0) = e^{\beta(x_1 - x_0)}\]
Cox proportional hazards regression
This model is referred to in the literature by a variety of terms, such as the Cox model, the Cox proportional hazards model, or simply the proportional hazards model.
So, for example, if we have a covariate that is dichotomous (binary), such as surgery, coded as \(x_1 = 1\) and \(x_0 = 0\) for yes and no respectively, then the hazard ratio becomes
\[HR(t,x_1, x_0) = e^{\beta}\]
Advantages of the Cox proportional hazards regression
Recall that using Kaplan-Meier (KM) analysis we could estimate the survival probability, and using the log-rank or peto-peto test we could compare survival between categorical covariates. However, the disadvantages of KM include:
Need to categorize numerical variable to compare survival
It is a univariable analysis
It is a non-parametric analysis
Advantages of the Cox proportional hazards regression
We also acknowledge that fully parametric regression models in survival analysis have stringent assumptions and distributional requirements. So, to overcome the limitations of KM analysis and of fully parametric analysis, we can model our survival data using the semi-parametric Cox proportional hazards regression.
Estimation from Cox proportional hazards regression
Using our cancer dataset, we will estimate the parameters using the Cox PH regression. Remember, in our data we have
the time variable : survivaltime
the event variable : Survival, where the event of interest is Died; anything other than death is considered censored
all other covariates
Estimation from Cox proportional hazards regression
Now let’s take surgery as the covariate of interest:
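The model call is not shown on the slide; a minimal sketch (the object name cox_surgery is a hypothetical label reused in the snippets below):

library(survival)

# simple (univariable) Cox PH model with surgery as the covariate
cox_surgery <- coxph(Surv(time = survivaltime, event = Survival == "Died") ~ surgery, data = cancer)
summary(cox_surgery)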
Estimation from Cox proportional hazards regression
The simple Cox PH model with surgery shows that patients who underwent surgery have a crude log hazard for death that is \(0.5928356\) lower than that of patients who did not (p-value \(\approx 1.5 \times 10^{-11}\)).
The \(95\%\) confidence interval for the crude log hazard ratio can be calculated as shown below:
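A hedged sketch, assuming the model above was stored as cox_surgery:

# 95% confidence interval for the log hazard ratio (coefficient scale)
confint(cox_surgery)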
Estimation from Cox proportional hazards regression
Or we can get the crude hazard ratio (HR) by exponentiating the log HR. In this example, the simple Cox PH model with covariate surgery shows that patients who underwent surgery have about \(45\%\) lower hazard of death (HR \(\approx 0.55\)) compared with patients who did not (p-value \(\approx 1.5 \times 10^{-11}\), \(95\%\) CI \(0.46\) to \(0.66\)).
The \(95\%\) confidence interval for the crude HR can be calculated as shown below:
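Again a hedged sketch, assuming the fitted model is cox_surgery:

# hazard ratio and its 95% confidence interval (exponentiate the log-scale results)
exp(coef(cox_surgery))
exp(confint(cox_surgery))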
Call:
coxph(formula = Surv(time = survivaltime, event = Survival ==
"Died") ~ age, data = cancer)
n= 1824, number of events= 989
coef exp(coef) se(coef) z Pr(>|z|)
age 0.039872 1.040678 0.002431 16.4 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.041 0.9609 1.036 1.046
Concordance= 0.667 (se = 0.01 )
Likelihood ratio test= 272.6 on 1 df, p=<0.0000000000000002
Wald test = 269 on 1 df, p=<0.0000000000000002
Score (logrank) test = 270.1 on 1 df, p=<0.0000000000000002
Estimation from Cox proportional hazards regression
The simple Cox PH model with covariate age shows that with each one-unit increase in age, the crude log hazard for death increases by \(0.039872\).
# A tibble: 1 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age 1.04 0.00243 16.4 1.89e-60 1.04 1.05
Estimation from Cox proportional hazards regression
When we exponentiate the log HR, the simple Cox PH model shows that with each one-unit increase in age, the crude hazard of death increases by about \(4\%\). The relationship between cancer death and age is highly significant (p-value \(< 0.0001\)) when not adjusting for other covariates.
By using tbl_uvregression() we can generate a simple univariable model for every covariate in one call, and obtain the crude HR for each covariate of interest; a sketch is shown below.
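A hedged sketch of such a call (the variable selection is illustrative; column names follow the dataset description above):

library(gtsummary)
library(survival)

# one univariable Cox model per covariate, reported as hazard ratios
cancer |>
  select(age, sex, surgery, survivaltime, Survival) |>
  tbl_uvregression(
    method = coxph,
    y = Surv(time = survivaltime, event = Survival == "Died"),
    exponentiate = TRUE
  )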
Estimation from Cox proportional hazards regression
There are two primary reasons to include more than one covariate in the model. The first is to adjust statistically for possible imbalances in the observed data before making statistical inferences; in traditional statistical applications this is called analysis of covariance, while in clinical and epidemiological investigations it is often called control of confounding. The other reason is the inclusion of higher-order terms in the model representing interactions between covariates; these are also called effect modifiers.
Multiple Cox PH regression
Based on our clinical expertise and statistical significance, we will fit a Cox PH model with these covariates:
age
sex
surgery
Multiple Cox PH regression
The reason is that we found both age and surgery to be statistically significant, and we believe that sex may also influence the outcome.
Multiple Cox PH regression
To estimate the Cox PH model with age, sex and surgery:

library(broom)

cancer_mv <- coxph(Surv(time = survivaltime, event = Survival == "Died") ~ age + sex + surgery, data = cancer)
tidy(cancer_mv, exponentiate = TRUE, conf.int = TRUE)