Practice Multivariable Analysis 2026

Author

Kang Munir

Published

January 25, 2026

MVA 2024/2025

Load all the library using pacman

Show code

library(pacman)

# load all library

p_load(lmtest, survival, broom, caret, pacman, gtsummary, lubridate, 
       survminer, pROC, corrplot, ggeffects, skimr, haven, 
       officer, mfp, car, flextable, psych, ResourceSelection, 
       readxl, summarytools, performance, tidyverse)

1. Multiple Linear Regression

Data Quality of life (qol_data.csv). Outcome is quality of life score. 3 covariates: job experience (years), gender (male, female) and comorbidity (yes, no)

A) Build model without interaction, interpret (10marks)

Import and summary data

Show code

# import data
qol <- read.csv("qol_data.csv")
glimpse(qol)

Rows: 100
Columns: 4
$ quality_of_life <dbl> 74.26882, 71.91479, 75.23085, 38.65436, 65.26596, 77.4…
$ job_experience  <dbl> 9.339748, 23.860849, 12.860331, 26.607505, 28.273551, …
$ gender          <chr> "male", "female", "female", "male", "female", "female"…
$ comorbidity     <chr> "no", "no", "no", "yes", "no", "no", "no", "yes", "no"…

Change character variables to factors

Show code

qol <- qol %>%
  mutate(across(where(is.character), as.factor))

# recode label variable

attr(qol$quality_of_life, "label") <- "Quality of Life Score"
attr(qol$job_experience, "label") <- "Job Experience (years)"
attr(qol$gender, "label") <- "Gender"
attr(qol$comorbidity, "label") <- "Comorbidity"

glimpse(qol)

Rows: 100
Columns: 4
$ quality_of_life <dbl> 74.26882, 71.91479, 75.23085, 38.65436, 65.26596, 77.4…
$ job_experience  <dbl> 9.339748, 23.860849, 12.860331, 26.607505, 28.273551, …
$ gender          <fct> male, female, female, male, female, female, male, male…
$ comorbidity     <fct> no, no, no, yes, no, no, no, yes, no, no, no, yes, yes…

3.. Fit multiple linear regression model without interaction

Show code

model1 <- lm(quality_of_life ~ job_experience + gender + comorbidity, data = qol)

tidy(model1, intercept = TRUE)

# A tibble: 4 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      82.6      1.32       62.6  1.15e-79
2 job_experience   -0.492    0.0653     -7.54 2.67e-11
3 gendermale       -9.57     1.08       -8.88 3.73e-14
4 comorbidityyes  -13.9      1.06      -13.1  4.88e-23

Show code

# Or can use regression table
tbl_regression(model1, intercept = TRUE)

Characteristic	Beta	95% CI	p-value
(Intercept)	83	80, 85	<0.001
Job Experience (years)	-0.49	-0.62, -0.36	<0.001
Gender
female	—	—
male	-9.6	-12, -7.4	<0.001
Comorbidity
no	—	—
yes	-14	-16, -12	<0.001
Abbreviation: CI = Confidence Interval

Interpret coefficients

Job Experience: Each additional year of job experience is associated with a 0.49-point decrease in quality of life score, adjusted for gender and comorbidity (95% CI: -0.62 to -0.36, p < 0.001).
Gender: Male participants have a quality of life score that is 9.6 points lower than female participants, adjusted for job experience and comorbidity (95% CI: -12.0 to -7.4, p < 0.001).
Comorbidity: Participants with comorbidity have a quality of life score that is 14 points lower than those without comorbidity, adjusted for job experience and gender (95% CI: -16.0 to -12.0, p < 0.001).

B) Build model with interaction between gender & comorbid, present in table (5mark

Fit multiple linear regression model with interaction

Show code

model2 <- lm(quality_of_life ~ job_experience + gender + comorbidity + gender:comorbidity, data = qol)

tidy(model2)

# A tibble: 5 × 5
  term                      estimate std.error statistic  p.value
  <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                 80.3      1.17       68.7  8.16e-83
2 job_experience              -0.517    0.0553     -9.35 4.03e-15
3 gendermale                  -3.68     1.30       -2.83 5.72e- 3
4 comorbidityyes              -8.70     1.22       -7.13 2.00e-10
5 gendermale:comorbidityyes  -11.4      1.80       -6.31 8.68e- 9

Show code

# Or can use regression table

tbl_regression(model2, intercept = TRUE)

Characteristic	Beta	95% CI	p-value
(Intercept)	80	78, 83	<0.001
Job Experience (years)	-0.52	-0.63, -0.41	<0.001
Gender
female	—	—
male	-3.7	-6.3, -1.1	0.006
Comorbidity
no	—	—
yes	-8.7	-11, -6.3	<0.001
Gender * Comorbidity
male * yes	-11	-15, -7.8	<0.001
Abbreviation: CI = Confidence Interval

Interpret interaction term

The interaction term between gender and comorbidity is statistical significant (p < 0.05), indicating that the effect of comorbidity on quality of life score differs by gender.

C) Provide appropriate plot, copy in words/in jpeg file (5marks)

Plot interaction each of covariate

Show code

# interaction plot for quality of life vs job experience
ggplot(qol, aes(x = job_experience, y = quality_of_life)) +
geom_point() +
geom_smooth(method = "lm") + 
  
# add labels and theme  
  labs(title = "Quality of Life Score vs Job Experience",
       x = "Job Experience (years)",
       y = "Quality of Life Score") +
  theme_minimal()

Show code

# interaction plot for quality of life vs gender
ggplot(qol, aes(x = gender, y = quality_of_life)) +
  geom_boxplot() +
  
  # add labels and theme  
  labs(title = "Quality of Life Score by Gender",
       x = "Gender",
       y = "Quality of Life Score") +
  theme_minimal()

Show code

# interaction plot for quality of life vs comorbidity

ggplot(qol, aes(x = comorbidity, y = quality_of_life)) +
  geom_boxplot() +
  
  # add labels and theme  
  labs(title = "Quality of Life Score by Comorbidity",
       x = "Comorbidity",
       y = "Quality of Life Score") +
  theme_minimal()

Interaction plot between quality of life score by gender, job experiencexx` and comorbidity.

Show code

interact.plot <- ggpredict(model2, terms = c("gender", "comorbidity", "job_experience"))
plot(interact.plot, connect_lines = T)

Interpret plot

People without comorbidity have higher quality of life scores.
Females without comorbidity have the highest scores.
Males without comorbidity also have high scores, but slightly lower than females.
Comorbidity lowers scores for both genders.
Males with comorbidity have the lowest scores.
The effect of comorbidity is stronger in males than in females.

D) Predict based on (working experience: 5/15/25 years), both comorbidities, both gender. proof one using manual calculation (10marks)

Create prediction data frame

Show code

pred_data <- expand.grid( job_experience = c(5, 15, 25),
                          gender = c("female", "male"), 
                          comorbidity = c("no", "yes") )

augment(model2, newdata = pred_data)

# A tibble: 12 × 4
   job_experience gender comorbidity .fitted
            <dbl> <fct>  <fct>         <dbl>
 1              5 female no             77.7
 2             15 female no             72.5
 3             25 female no             67.4
 4              5 male   no             74.0
 5             15 male   no             68.9
 6             25 male   no             63.7
 7              5 female yes            69.0
 8             15 female yes            63.8
 9             25 female yes            58.7
10              5 male   yes            53.9
11             15 male   yes            48.8
12             25 male   yes            43.6

Manual calculation for one scenario (e.g., 15 years experience, male with comorbidity)

The linear regression model predicting Quality of Life Score is:

\[ \text{Quality of Life Score} = 80 - 0.52 \cdot \text{Job Experience} - 3.7 \cdot \text{Male} - 8.7 \cdot \text{Comorbidity} - 11 \cdot (\text{Male} \times \text{Comorbidity}) \]

Where:

(_0) is the intercept.
Job Experience is measured in years.
Male is coded as 1 for male, 0 for female.
Comorbidity is coded as 1 for yes, 0 for no.
The interaction term is 1 only when both gender is male and comorbidity is present.

Example : Male with comorbidity, 15 years of experience.

QOL = 80 - 0.52(15) - 3.7(1) - 8.7(1) - 11(1*1) = 48.8

2. Multiple Logistic Regression

Question: Using Quality of Life data (qol2.dta). Outcome is quality of life categorized into poor and good. 3 covariates: job experience (years), gender (male, female) and comorbidity (yes, no)

A) What is analysis you would like to use and justify (10 marks)

Answer: Multiple Logistic Regression

Justification:

Outcome variable

Quality of life (QoL) is binary: Poor vs Good

Because the dependent variable is dichotomous, binary logistic regression is the most appropriate analysis.

Justification (key points for marks)

Nature of outcome

Logistic regression is suitable when the outcome is categorical with two levels.
Linear regression is inappropriate because it can predict values outside 0–1 and violates normality and homoscedasticity assumptions.

Interpretability

Logistic regression estimates odds ratios (ORs), which are standard and interpretable measures in epidemiology and public health.
ORs quantify the association between predictors and the odds of having good quality of life.

Covariate structure

Job experience: continuous predictor (years)
Gender: categorical (male/female)
Comorbidity: categorical (yes/no)
Logistic regression can accommodate both continuous and categorical predictors in the same model.

Adjustment for confounding

Multivariable logistic regression allows adjustment for potential confounders, giving adjusted odds ratios.

Interaction assessment

The research question explicitly asks for an interaction between gender and comorbidity.
Logistic regression allows inclusion and interpretation of interaction (effect modification) terms.

Public health relevance

Logistic regression is widely used in quality-of-life and health outcomes research, making results comparable with existing literature.

Therefore, multivariable binary logistic regression is the most appropriate analysis to model the association between job experience, gender, comorbidity, and quality of life, while allowing assessment of interaction effects.

B) Build model with interaction between gender & comorbid, present in table (5marks)

Import data

Show code

qol2 <- read_dta("qol2.dta")
glimpse(qol2)

Rows: 100
Columns: 4
$ quality_of_life <dbl+lbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2,…
$ job_experience  <dbl> 9.339748, 23.860849, 12.860331, 26.607505, 28.273551, …
$ gender          <dbl+lbl> 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1,…
$ comorbidity     <dbl+lbl> 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1,…

Change labelled variables to factors

Show code

qol2 <- qol2 %>%
  mutate(across(where(is.labelled), as_factor))
glimpse(qol2)

Rows: 100
Columns: 4
$ quality_of_life <fct> poor, poor, poor, poor, poor, good, poor, poor, poor, …
$ job_experience  <dbl> 9.339748, 23.860849, 12.860331, 26.607505, 28.273551, …
$ gender          <fct> male, female, female, male, female, female, male, male…
$ comorbidity     <fct> no, no, no, yes, no, no, no, yes, no, no, no, yes, yes…

Fit multiple logistic regression model with interaction

Show code

model_logit <- glm(quality_of_life ~ gender + comorbidity + job_experience + gender:comorbidity,
                   data = qol2,
                   family = binomial)
tbl_regression(model_logit,intercept = T, exponentiate = TRUE, conf.int = TRUE)

Characteristic	OR	95% CI	p-value
(Intercept)	0.12	0.03, 0.42	0.002
gender
female	—	—
male	4.30	1.14, 18.3	0.037
comorbidity
no	—	—
yes	2.37	0.72, 8.28	0.2
job_experience	1.12	1.06, 1.20	<0.001
gender * comorbidity
male * yes	1.22	0.16, 10.2	0.8
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation

Gender: After adjusting for comorbidity and the interaction term, male healthcare workers have significantly higher odds of good quality of life compared with females (OR = 4.95, 95% CI: 1.48–18.9, p = 0.013).
Comorbidity: Healthcare workers with comorbidity have higher odds of good quality of life compared with those without comorbidity; however, this association is not statistically significant (OR = 2.12, 95% CI: 0.72–6.42, p = 0.20).
Interaction term (gender × comorbidity): The interaction between gender and comorbidity is not statistically significant (OR = 0.97, 95% CI: 0.15–7.10, p > 0.9), indicating that the association between comorbidity and quality of life does not differ by gender.

C) Provide predicted values for the first observation (5 marks)

The interaction was not significant, so we will use the model without interaction to predict the probability for the first observation.

Fit multiple logistic regression model without interaction

Show code

model_logitfull <- glm(quality_of_life ~ gender + comorbidity + job_experience,
                   data = qol2,
                   family = binomial)
tbl_regression(model_logitfull,intercept = T, conf.int = TRUE)

Characteristic	log(OR)	95% CI	p-value
(Intercept)	-2.1	-3.5, -0.94	0.001
gender
female	—	—
male	1.6	0.57, 2.6	0.003
comorbidity
no	—	—
yes	0.93	-0.03, 2.0	0.063
job_experience	0.12	0.06, 0.19	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Predict probability for the first observation (model with log odd)

Show code

predict(model_logitfull, newdata = qol2[1, ], type = "response")

        1 
0.6217327

Manual calculation using log-odds

The logistic regression equation is: \[ \text{logit}(p) = \beta_0 + \beta_1 \cdot \text {gender (m = 1)} + \beta_2 \cdot \text{comorbidity (yes = 1)} + \beta_3 \cdot \text{job experience} \]

Where:

β0 = -2.1 (intercept)
β1 = 1.6
β2 = 0.93
β3 = 0.12

For the first observation (male, no comorbidity, 9.3 years experience):

Logit(p) = -2.1 + 1.6(1) + 0.793(0) - 0.12(9.3) = -1.057

Convert log-odds to probability: \[ p = \frac{e^{\text{logit}(p)}}{1 + e^{\text{logit}(p)}} = \frac{e^{-1.057}}{1 + e^{-1.057}} \approx 0.62 \]

Interpretation: The predicted probability of having good quality of life for the first observation is 0.62 (62%).

3. Cox Propoportional Hazard Regression

Question: Using survival data (stroke_e.dta). Outcome is stroke death. 3 covariates: age ( years), gender (male, female) and diabetes (yes, no) and ICD 10 (ischemic stroke, hemorrhagic).

A) Illustrate the statistical workflow , from beginning until communicate to the stakeholder (5 marks)

Mnemonic : I took tiktok video muka model comel

Import
Tidy
Transform
Visualize
Model
Communicate

B) Perform the Cox Proportional Hazard Regression model with covariates gender, age, dm & icd10. Present the result in a table. Interpret the result for age and icd10 (10 marks)

Import data

Show code

stroke <- read_dta("stroke_e.dta")
glimpse(stroke)

Rows: 226
Columns: 10
$ sex       <dbl+lbl> 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1,…
$ time      <dbl> 32, 4, 3, 5, 2, 2, 1, 4, 8, 4, 3, 3, 1, 4, 3, 2, 12, 12, 4, …
$ status2   <dbl+lbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ gcs       <dbl> 13, 15, 15, 15, 15, 15, 13, 15, 15, 10, 15, 15, 15, 15, 15, …
$ sbp       <dbl> 143, 150, 152, 215, 162, 169, 178, 180, 186, 185, 122, 211, …
$ dbp       <dbl> 83, 98, 108, 109, 72, 114, 75, 90, 109, 100, 83, 103, 83, 72…
$ dm        <dbl+lbl> 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,…
$ wbc       <dbl> 10.0, 8.9, 7.4, 8.1, 7.5, 24.4, 9.5, 4.2, 6.4, 14.4, 8.2, 8.…
$ age2      <dbl> 50, 58, 64, 50, 65, 78, 66, 72, 61, 64, 63, 59, 64, 62, 40, …
$ icd10cat2 <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Change labelled variables to factors

Show code

stroke <- stroke %>%
  mutate(across(where(is.labelled), as_factor))
glimpse(stroke)

Rows: 226
Columns: 10
$ sex       <fct> male, male, male, female, male, female, female, male, female…
$ time      <dbl> 32, 4, 3, 5, 2, 2, 1, 4, 8, 4, 3, 3, 1, 4, 3, 2, 12, 12, 4, …
$ status2   <fct> alive, alive, alive, alive, alive, alive, dead, alive, alive…
$ gcs       <dbl> 13, 15, 15, 15, 15, 15, 13, 15, 15, 10, 15, 15, 15, 15, 15, …
$ sbp       <dbl> 143, 150, 152, 215, 162, 169, 178, 180, 186, 185, 122, 211, …
$ dbp       <dbl> 83, 98, 108, 109, 72, 114, 75, 90, 109, 100, 83, 103, 83, 72…
$ dm        <fct> no, no, no, yes, yes, yes, yes, no, yes, yes, no, yes, yes, …
$ wbc       <dbl> 10.0, 8.9, 7.4, 8.1, 7.5, 24.4, 9.5, 4.2, 6.4, 14.4, 8.2, 8.…
$ age2      <dbl> 50, 58, 64, 50, 65, 78, 66, 72, 61, 64, 63, 59, 64, 62, 40, …
$ icd10cat2 <fct> Ischaemic Stroke, Ischaemic Stroke, Ischaemic Stroke, Ischae…

Label variable

Show code

attr(stroke$time, "label") <- "Time to Stroke death (months)"
attr(stroke$status2, "label") <- "Stroke death"
attr(stroke$age2, "label") <- "Age (years)"
attr(stroke$sex, "label") <- "Gender"
attr(stroke$dm, "label") <- "Diabetes (yes/no)"
attr(stroke$icd10cat2, "label") <- "ICD10"
attr(stroke$gcs, "label") <- "Glasgow Coma Scale"

Fit Cox Proportional Hazard Regression model

Show code

model_cox <- coxph(
  Surv(time = time, event = status2 == "dead") ~ 
    age2 + sex + dm + icd10cat2,
  data = stroke
)

tbl_regression(model_cox, exponentiate = TRUE, conf.int = TRUE)

Characteristic	HR	95% CI	p-value
Age (years)	1.03	1.01, 1.05	0.006
Gender
male	—	—
female	1.25	0.69, 2.27	0.5
Diabetes (yes/no)
no	—	—
yes	1.47	0.82, 2.65	0.2
ICD10
Ischaemic Stroke	—	—
Haemorrhagic	2.69	1.46, 4.94	0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Interpretation

Age: Age is significantly associated with stroke recurrence. Each additional year of age is associated with a 1.03 increase in the hazard of stroke death, after adjusting for gender, diabetes, and stroke type. (95% CI: 1.01, 1.05, p 0.006).
ICD10: Patients with hemorrhagic stroke have a 2.69 times higher hazard of stroke death compared to those with ischemic stroke, after adjusting for gender, age, and diabetes (95% CI: 1.46, 4,94, p = 0.001).

C)add on gcs, explain about the regression estimate for icd10 (5 marks)

Fit Cox Proportional Hazard Regression model with gcs

Show code

model_cox2 <- coxph(
  Surv(time = time, event = status2 == "dead") ~ 
    age2 + sex + dm + icd10cat2 + gcs, data = stroke)

tbl_regression(model_cox2, exponentiate = TRUE, conf.int = TRUE)

Characteristic	HR	95% CI	p-value
Age (years)	1.03	1.01, 1.05	0.004
Gender
male	—	—
female	1.16	0.64, 2.12	0.6
Diabetes (yes/no)
no	—	—
yes	1.42	0.79, 2.55	0.2
ICD10
Ischaemic Stroke	—	—
Haemorrhagic	1.69	0.91, 3.13	0.10
Glasgow Coma Scale	0.84	0.79, 0.90	<0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Interpretation for ICD10

After add adjusting for Glasgow Coma Scale, patients with haemorrhagic stroke have a 69% higher hazard of stroke recurrence compared with patients with ischaemic stroke; However, the association is no longer statistically significant (HR: 1.69, 95% CI: 0.91, 3.13, p = 0.10). This suggests that the initial observed association between stroke type and stroke recurrence may be confounded by Glasgow Coma Scale scores.

MVA 2023/2024

1. Analysis workflow

Question: Before formal data analysis, researchers are encouraged to develop data analysis workflow. - Using a flow chart, generate data analysis workflow that will help with your research data. Briefly describe the objective for each stage of the workflow (10 marks)

Mnemonic : I took tiktok video muka model comel

Import

Objective: Load raw data from various sources (e.g., CSV, Excel, databases) into the analysis environment for further processing.

Tidy

Objective: Clean and structure the data into a consistent format (e.g., long format) to facilitate analysis, ensuring variables are properly labeled and missing values are addressed.

Transform

Objective: Create new variables, recode existing ones, and perform necessary calculations to prepare the data for analysis, enhancing its usability and interpretability.

Visualize

Objective: Generate graphical representations of the data to explore patterns, trends, and relationships among variables, aiding in understanding and hypothesis generation.

Model

Objective: Apply appropriate statistical or machine learning models to analyze the data, test hypotheses, and derive insights relevant to the research questions.

Communicate

Objective: Present the analysis results through reports, presentations, or dashboards, effectively conveying findings to stakeholders for informed decision-making.

Show code

library(DiagrammeR)
grViz("
digraph data_analysis_workflow {
  node [shape = rectangle, style = filled, fillcolor = LightBlue]
  Import [label = 'Import Data']
  Tidy [label = 'Tidy Data']
  Transform [label = 'Transform Data']
  Visualize [label = 'Visualize Data']
  Model [label = 'Model Data']
  Communicate [label = 'Communicate Results']
  Import -> Tidy -> Transform -> Visualize -> Model -> Communicate
}
")

2. Mixed (Within-Between RM ANOVA)

Question: A randomized controlled trial was conducted to evaluate the effectiveness of the HIV Counselling Module (HIV-CM) in increasing healthcare workers (HCW) awareness on stigma and discrimination toward key populations of HIV. Thirty (n=30) HCWs were randomly assigned into two groups which consisted of 15 respondents in the intervention group “HIV-CM” while another 15 respondents to “Controls”. They were measured on their self-awareness scores (SAS) at baseline (SAS- 1), 2 weeks (SAS-2) and 6 weeks (SAS-3). The scales are between 0-100, where a higher score indicates better awareness.

A) State all possible research questions. (6 marks)

Research question:

Within-group (time effect)

Is there a significant change in self-awareness scores (SAS) over time (baseline, 2 weeks, 6 weeks) among healthcare workers?

Between-group (group effect)

Is there a significant difference in overall self-awareness scores between the HIV-CM group and the control group?

Interaction (group × time effect)

Does the change in self-awareness scores over time differ between the HIV-CM group and the control group?

B) State, with reasons, the appropriate statistical test to answer all research questions in Q2(A). (4 marks)

Justification :

Outcome variable

SAS is continuous and approximately normally distributed.

Within-subject factor

Time has three repeated measurements (SAS-1, SAS-2, SAS-3).

Between-subject factor

Group has two independent levels (HIV-CM vs Control).

Research objectives

The study aims to examine time effects, group effects, and group × time interaction, which are directly tested using mixed repeated measures ANOVA.

C) Using the given dataset HIV-CM.sav, perform statistical analysis in SPSS using the test stated in Q2(a). Interpret all the relevant analytic steps and save them into a Microsoft Word document (HIV-CM analysis.docx). Save all relevant SPSS outputs as HIV-CM.spv (25 marks)

Step 1: Open data

File → Open → Data → HIV-CM.sav

Check:

group (HIV-CM / Control) → Between-subjects factor
SAS_1, SAS_2, SAS_3 → Repeated measures

Step 2: Run Mixed Repeated Measures ANOVA

SPSS Menu path:

Analyze → General Linear Model → Repeated Measures

Define factors:

Within-Subject Factor Name: time
Number of Levels: 3
Define → assign:
- DSCA1 → Time 1
- DSCA2 → Time 2
- DSCA3 → Time 3
Between-Subjects Factor:
- Group (HIV-CM vs Control)

Options:

Descriptive statistics
Estimates of effect size
Pairwise comparisons (Bonferroni)
Mauchly’s test of sphericity

Step 3: Assumption checking (IMPORTANT FOR MARKS)

Normality
- SAS is continuous and approximately normally distributed → assumption reasonably met.
Sphericity (Mauchly’s Test)
- If p > 0.05 → sphericity assumed
- If p < 0.05 → use Greenhouse-Geisser correction

state which correction was used.

Step 4: Interpretation of key outputs

Time effect
- A significant main effect of time indicates that self-awareness scores changed significantly across the three time points.
Group effect
- A significant main effect of group indicates that overall self-awareness scores differed between the HIV-CM and control groups.
Group × Time interaction (MOST IMPORTANT)
- A significant interaction indicates that the pattern of change in self-awareness scores over time differs between the HIV-CM and control groups, demonstrating intervention effectiveness.

D) Present the results from Q2(c) in a table and conclude your findings (5 marks)

Table should include :

Descriptive statistics (means and SDs for each group at each time point)
ANOVA results (F-values, p-values, effect sizes for time, group, and interaction)
Pairwise comparisons if applicable

3. Multiple Linear Regression

Question: A dataset named mydata.csv contains the following variables:

id: id of participants
score: performance score (the outcome)
marker1: biomarker
group: a variable with group 1 (grp1) or group 2 (grp2)

A) Write a linear regression model and a regression equation to show the relationship between the average score as a function of marker1 and group and an interaction between marker1 and group. (4 marks)

Linear regression model:

\[ \text{score}_i = \beta_0 + \beta_1 \cdot \text{marker1}_i + \beta_2 \cdot \text{group2}_i + \beta_3 \cdot (\text{marker1}_i \times \text{group2}_i) + \epsilon_i \]

Where:

score_i: Performance score for participant i
marker1_i: Biomarker value for participant i
group2_i: Indicator variable for group 2 (1 if grp2, 0 if grp1)
β0: Intercept (mean score for grp1 when marker1 = 0
β1: Effect of marker1 on score for grp1
β2: Difference in mean score between grp2 and grp1 when marker1 = 0
β3: Interaction effect between marker1 and group2
εi: Random error term

Regression equation:

\[ \hat{\text{score}} = \beta_0 + \beta_1 \cdot \text{marker1} + \beta_2 \cdot \text{group2} + \beta_3 \cdot (\text{marker1} \times \text{group2}) \]

B) Using R, perform the analysis and present the results in a table. Interpret the interaction term. (6 marks)

Import data

Show code

mydata <- read.csv("mydata.csv")
glimpse(mydata)

Rows: 300
Columns: 4
$ id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ score   <dbl> -0.2110439, 5.8046025, 20.9680192, -0.9366398, 7.1760110, 6.49…
$ marker1 <dbl> 0.2823356, 3.9935731, 6.0111029, -2.5642443, 4.3728117, 4.5651…
$ group   <chr> "grp1", "grp1", "grp1", "grp1", "grp1", "grp1", "grp2", "grp1"…

Change character variables to factors

Show code

mydata <- mydata %>%
  mutate(across(where(is.character), as.factor))
glimpse(mydata)

Rows: 300
Columns: 4
$ id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ score   <dbl> -0.2110439, 5.8046025, 20.9680192, -0.9366398, 7.1760110, 6.49…
$ marker1 <dbl> 0.2823356, 3.9935731, 6.0111029, -2.5642443, 4.3728117, 4.5651…
$ group   <fct> grp1, grp1, grp1, grp1, grp1, grp1, grp2, grp1, grp1, grp1, gr…

Show code

# add label variable
attr(mydata$score, "label") <- "Performance Score"
attr(mydata$marker1, "label") <- "Biomarker 1"
attr(mydata$group, "label") <- "Group"

Fit multiple linear regression model with interaction

Show code

model_mlr <- lm(score ~ marker1 * group, data = mydata)

tbl_regression(model_mlr, intercept = TRUE)

Characteristic	Beta	95% CI	p-value
(Intercept)	3.4	2.0, 4.9	<0.001
Biomarker 1	1.4	1.1, 1.8	<0.001
Group
grp1	—	—
grp2	-0.03	-2.5, 2.5	>0.9
Biomarker 1 * Group
Biomarker 1 * grp2	0.53	-0.05, 1.1	0.072
Abbreviation: CI = Confidence Interval

Interpretation

Biomarker 1: Each unit increase in biomarker 1 is associated with a 1.4-point increase in performance score, after adjusting for group (95% CI: 1.1, 1.8, p < 0.001).
Group: Group 2 has a 0.03-point lower performance score compared to group 1, after adjusting for biomarker 1 (95% CI: 1.0, 9.0, p = 0.01).
Interaction term: The interaction term between biomarker 1 and group is statistically not significant (p > 0.05), indicating that the effect of biomarker 1 on performance score does not differ by group.

C) Show calculations for the 95% confidence intervals of the interaction term and interpret it. (5 marks)

Extract coefficient and standard error for interaction term

Show code

tidy(model_mlr, conf.int = TRUE)

# A tibble: 4 × 7
  term              estimate std.error statistic  p.value conf.low conf.high
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)         3.44       0.753    4.56   7.39e- 6   1.95        4.92
2 marker1             1.44       0.187    7.74   1.60e-13   1.08        1.81
3 groupgrp2          -0.0277     1.26    -0.0220 9.82e- 1  -2.51        2.45
4 marker1:groupgrp2   0.535      0.296    1.80   7.21e- 2  -0.0484      1.12

Formula to calculate 95% CI:

\[ 95\% \; CI = \hat{\beta} \pm 1.96 \times SE \]

Where:

β: Coefficient estimate for interaction term
SE: Standard error of the coefficient

Manual calculation for interaction term

β3 (marker1:group2) = 0.535 SE = 0.296

Lower bound = 0.535 - 1.96 * 0.296 = -0.045
Upper bound = 0.535 + 1.96 * 0.296 = 1.115

4. Survival Analysis

Question: The dataset stroke_outcome.dta comes from a short survival study. The variable outcome is labelled as either dead (1) or censored (0) and the variable days is the duration from the start of follow-up until the last follow-up.

A) Perform the Kaplan-Meier analysis for the overall survival. Interpret the finding based on number at risk, number of events and survival probabilities at time 5 and 29. (5 marks)

Import data

Show code

stroke_outcome <- read_dta("stroke_outcome.dta")
glimpse(stroke_outcome)

Rows: 226
Columns: 6
$ id      <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17393…
$ days    <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21, 1…
$ gcs     <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15, 1…
$ age     <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83, 44…
$ outcome <dbl+lbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1…
$ icd10   <dbl+lbl> 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1…

Change labelled variables to factors

Show code

stroke_outcome <- stroke_outcome %>%
  mutate(across(where(is.labelled), as_factor))
glimpse(stroke_outcome)

Rows: 226
Columns: 6
$ id      <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17393…
$ days    <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21, 1…
$ gcs     <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15, 1…
$ age     <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83, 44…
$ outcome <fct> dead, alive, alive, alive, alive, dead, alive, dead, alive, al…
$ icd10   <fct> "CI,Others", "CI,Others", "ICB, Other Haemorrhage", "ICB, Othe…

Show code

# add label variable
attr(stroke_outcome$days, "label") <- "Time to Stroke death (days)"
attr(stroke_outcome$outcome, "label") <- "Stroke outcome"
attr(stroke_outcome$icd10, "label") <- "ICD10"
attr(stroke_outcome$gcs, "label") <- "Glasgow Coma Scale"
attr(stroke_outcome$age, "label") <- "Age (years)"

*** There is missing data in out dataset ***

Show code

# Check for missing data
colSums(is.na(stroke_outcome))

     id    days     gcs     age outcome   icd10 
      0       0       1       0       0       0

Show code

# Ampute missing data
stroke_outcome$gcs[is.na(stroke_outcome$gcs)] <- median(stroke_outcome$gcs, na.rm = TRUE)

# Check again for missing data
colSums(is.na(stroke_outcome))

     id    days     gcs     age outcome   icd10 
      0       0       0       0       0       0

Fit Kaplan-Meier survival model and summarize at time 5 and 29

Show code

KM.overall <- survfit(
  Surv(time = days, outcome == "dead") ~ 1, type = "kaplan-meier",
  data = stroke_outcome
)

summary(KM.overall)

Call: survfit(formula = Surv(time = days, outcome == "dead") ~ 1, data = stroke_outcome, 
    type = "kaplan-meier")

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1    226      13    0.942  0.0155       0.9126        0.973
    2    197       4    0.923  0.0179       0.8889        0.959
    3    169       4    0.901  0.0205       0.8621        0.943
    4    133       4    0.874  0.0240       0.8286        0.923
    5     92       5    0.827  0.0307       0.7689        0.889
    6     67       3    0.790  0.0360       0.7224        0.864
    7     58       4    0.735  0.0426       0.6565        0.824
    9     44       2    0.702  0.0467       0.6160        0.800
   10     38       1    0.683  0.0490       0.5938        0.787
   12     34       4    0.603  0.0574       0.5004        0.727
   14     25       2    0.555  0.0621       0.4455        0.691
   18     20       1    0.527  0.0649       0.4140        0.671
   22     16       1    0.494  0.0687       0.3762        0.649
   25     10       2    0.395  0.0832       0.2616        0.597
   28      6       1    0.329  0.0918       0.1908        0.569
   29      5       1    0.264  0.0942       0.1308        0.531
   41      3       1    0.176  0.0953       0.0607        0.509

Show code

# 

ggsurvplot(KM.overall, data = stroke_outcome, risk.table = TRUE, 
           linetype = c(1,2), pval = TRUE)

Interpretation

At time 5 days:

Number at risk: 92 patients were still under follow-up just before day 5
Number of events: 30 deaths had occurred on day 5
Survival probability: The estimated probability of survival at day 5 is 0.827 (82.7%) (95% CI: 0.769, 0.889)

At time 29 days:

-Number at risk: 5 patients were still under follow-up just before day 29 - Number of events: 22 deaths had occurred on day 29 - Survival probability: The estimated probability of survival at day 29 is 0.264 (26.4%) (95% CI: 0.131, 0.531)

Kaplan–Meier analysis demonstrates a marked decline in survival over time, with high early mortality and a substantial reduction in the number of patients at risk by day 29.

B) Perform the Kaplan-Meier analysis for icd10. Generate a table for survival probabilities. (5 marks)

Fit Kaplan-Meier survival model by icd10 and summarize

Show code

KM.icd10 <- survfit(
  Surv(time = days, outcome == "dead") ~ icd10, type = "kaplan-meier",
  data = stroke_outcome
)

tidy(KM.icd10)

# A tibble: 52 × 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata   
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>    
 1     1    149       3       14    0.980    0.0117     1        0.958 icd10=CI…
 2     2    132       3       21    0.958    0.0177     0.991    0.925 icd10=CI…
 3     3    108       0       30    0.958    0.0177     0.991    0.925 icd10=CI…
 4     4     78       2       33    0.933    0.0255     0.981    0.888 icd10=CI…
 5     5     43       1       15    0.911    0.0347     0.976    0.851 icd10=CI…
 6     6     27       1        4    0.878    0.0513     0.970    0.794 icd10=CI…
 7     7     22       2        4    0.798    0.0847     0.942    0.676 icd10=CI…
 8     8     16       0        4    0.798    0.0847     0.942    0.676 icd10=CI…
 9     9     12       0        2    0.798    0.0847     0.942    0.676 icd10=CI…
10    11     10       0        1    0.798    0.0847     0.942    0.676 icd10=CI…
# ℹ 42 more rows

Show code

# Plot survival curves by icd10
ggsurvplot(KM.icd10, data = stroke_outcome, risk.table = TRUE, 
           linetype = c(1,2,3), pval = TRUE)

C) Perform the Cox Proportional Hazard Regression model with covariates gcs, icd10 and age. Present the result in a table. Interpret the result for age and icd10. (10 marks)

Fit Cox Proportional Hazard Regression model

Show code

model_cox_stroke <- coxph(
  Surv(time = days, event = outcome == "dead") ~ 
    gcs + icd10 + age,
  data = stroke_outcome
)

tbl_regression(model_cox_stroke, conf.int = TRUE)

Characteristic	log(HR)	95% CI	p-value
Glasgow Coma Scale	-0.18	-0.25, -0.12	<0.001
ICD10
CI,Others	—	—
SAH	0.40	-0.43, 1.2	0.3
ICB, Other Haemorrhage	0.52	-0.13, 1.2	0.11
Age (years)	0.03	0.01, 0.05	0.004
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Interpretation

Age: Each additional year of age is associated with a 1.03 increase in log hazard of death, after adjusting for Glasgow Coma Scale and stroke type (95% CI: 1.01, 1.06, p = 0.004).
ICD10: Patients with Subarachnoid Hemorrhage stroke (SAH) have a 1.51 times higher log hazard of death compared to Cerebral Infarction and Intracerebral Bleeding (ICB), after adjusting for Glasgow Coma Scale and age (95% CI: 0.66, 3.09, p = 0.3). While, patient with Intracerebral Bleeding stroke have a 1.62 times higher log hazard of death compared to Cerebral Infarction and SAH, after adjusting for Glasgow Coma Scale and age (95% CI: 0.84, 3.09, p = 0.3). Both associations are not statistically significant.

D) Test the proportional hazard assumptions. (10 marks)

Linearity assumption (continuous covariates)

Show code

ggcoxfunctional(Surv(time = days, event = outcome == 'dead') ~ age + gcs,
                data = stroke_outcome)

Interpretation: Age and gcs fairly flat and horizontal, indicating that the linearity assumption is reasonably met.

Proportional hazard test using Schoenfeld residuals

Show code

ph_test <- cox.zph(model_cox_stroke, transform = 'km', global = TRUE) 
ph_test

        chisq df    p
gcs    0.0675  1 0.80
icd10  3.6335  2 0.16
age    0.9293  1 0.34
GLOBAL 4.6402  4 0.33

Show code

# Plot Schoenfeld residuals
ggcoxzph(ph_test)

Interpretation: The global test p-value is 0.33 (p > 0.05), indicating that the proportional hazards assumption is reasonably met for the overall model. Individually, gcs (p = 0.80), icd10 (p = 0.16), and age (p = 0.34) also show no evidence of violation of the proportional hazards assumption.

Residuals analysis for continuous covariates

Show code

score <- resid(model_cox_stroke, type = "score")
plot(stroke_outcome$gcs, score[, "gcs"])

Show code

plot(stroke_outcome$age, score[, "age"])

5. Briefly describe the disadvantages of stepwise variable selection during data analysis. (5 marks)

Disadvantages of stepwise variable selection :

May exclude clinically or theoretically important variables

Stepwise selection relies on statistical significance rather than subject-matter knowledge. As a result, important confounders (e.g. age in stroke outcome studies) may be excluded if they are not statistically significant in the sample, leading to residual confounding.

Data-driven and sample-dependent

Stepwise methods are highly dependent on the observed sample. If the sample is not representative of the target population, the selected model may not generalise well, resulting in poor external validity.

Model instability

Small changes in the data can lead to different variables being selected. This instability reduces the reproducibility of results and makes the final model unreliable.

Inflated Type I error and biased estimates

Repeated testing during variable selection increases the risk of false-positive findings and can produce biased regression coefficients and p-values.

Ignores causal structure and research context

Stepwise selection does not consider causal relationships, temporality, or prior evidence. In contrast, manual or theory-driven variable selection allows researchers to incorporate biological plausibility and epidemiological reasoning into the model.

MVA 2021/2022

1. Analysis Workflow

Question: Describe the workflow involved from the reading of the dataset to communicating the findings to the stakeholders. (10 marks)

Mnemonic : I took tiktok video muka model comel

Import

Objective: Load raw data from various sources (e.g., CSV, Excel, databases) into the analysis environment for further processing.

Tidy

Objective: Clean and structure the data into a consistent format (e.g., long format) to facilitate analysis, ensuring variables are properly labeled and missing values are addressed.

Transform

Objective: Create new variables, recode existing ones, and perform necessary calculations to prepare the data for analysis, enhancing its usability and interpretability.

Visualize

Objective: Generate graphical representations of the data to explore patterns, trends, and relationships among variables, aiding in understanding and hypothesis generation.

Model

Objective: Apply appropriate statistical or machine learning models to analyze the data, test hypotheses, and derive insights relevant to the research questions.

Communicate

Objective: Present the analysis results through reports, presentations, or dashboards, effectively conveying findings to stakeholders for informed decision-making.

Show code

library(DiagrammeR)
grViz("digraph data_analysis_workflow {
  node [shape = rectangle, style = filled, fillcolor = LightBlue]
  Import [label = 'Import Data']
  Tidy [label = 'Tidy Data']
  Transform [label = 'Transform Data']
  Visualize [label = 'Visualize Data']
  Model [label = 'Model Data']
  Communicate [label = 'Communicate Results']
  Import -> Tidy -> Transform -> Visualize -> Model -> Communicate
}")

2. Mixed (Between-within) Repeated Measures ANOVA (Using SPSS)

A) Run Analysis

Step 1: Open data

File → Open → Data → HIV-CM.sav

Check: - group (HIV-CM / Control) → Between-subjects factor - DSCA1, DSCA2, DSCA3 → Repeated measures

Step 2: Run Mixed Repeated Measures ANOVA

SPSS Menu path:

Analyze → General Linear Model → Repeated Measures

Define factors:

Within-Subject Factor Name: time
Number of Levels: 3
Define → assign:
- DSCA1 → Time 1
- DSCA2 → Time 2
- DSCA3 → Time 3
Between-Subjects Factor:
- Group (HIV-CM vs Control)

Options:

Descriptive statistics
Estimates of effect size
Pairwise comparisons (Bonferroni)
Mauchly’s test of sphericity

Step 3: Assumption checking (IMPORTANT FOR MARKS)

Normality
- SAS is continuous and approximately normally distributed → assumption reasonably met.
Sphericity (Mauchly’s Test)
- If p > 0.05 → sphericity assumed
- If p < 0.05 → use Greenhouse-Geisser correction

state which correction was used.

Step 4: Interpretation of key outputs

Time effect
- A significant main effect of time indicates that self-awareness scores changed significantly across the three time points.
Group effect
- A significant main effect of group indicates that overall self-awareness scores differed between the HIV-C M and control groups.
Group × Time interaction (MOST IMPORTANT)
- A significant interaction indicates that the pattern of change in self-awareness scores over time differs between the HIV-CM and control groups, demonstrating intervention effectiveness.

B) Present the results using tables and plots in MS Word. Name the document as results.docx. (5 marks)

Table should include: - Descriptive statistics (means and SDs for each group at each time point) - ANOVA results (F-values, p-values, effect sizes for time, group, and interaction) - Pairwise comparisons if applicable

Plot should include: - Line plot showing mean self-awareness scores over time for each group with error bars representing standard errors.

3. Multiple Log Regression

A) The analyst has proposed TWO (2) candidate models:

Model_A : covariates are DMDX, FBS and MOGTT2
Model_B : covariates are AGE and TOTCHOL

i) Compare Model_A and Model_B. Show your results. (4 marks)

Import dataset

Show code

mlr_3_2122 <- read.csv("Q3.csv")
glimpse(mlr_3_2122)

Rows: 3,695
Columns: 6
$ AGE     <int> 47, 48, 39, 74, 43, 22, 42, 47, 40, 54, 44, 41, 50, 38, 48, 69…
$ DMDX    <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "n…
$ HBA1C   <dbl> 5.6, 5.7, 5.4, 5.7, 5.1, 5.5, 5.0, 5.1, 4.8, 5.6, 5.7, 6.9, 4.…
$ FBS     <dbl> 6.3, 8.3, 5.2, 6.5, 4.8, 5.1, 5.4, 5.0, 4.7, 5.5, 4.8, 4.6, 5.…
$ MOGTT2H <dbl> 6.1, 17.4, 7.8, 14.9, 7.6, 3.8, 5.6, 9.4, 7.0, 10.8, 11.3, 10.…
$ TOTCHOL <dbl> 6.8, 5.9, 5.6, 6.3, 7.5, 5.8, 4.8, 4.5, 6.4, 4.3, 4.1, 5.5, 4.…

Change character variables to factors

Show code

mlr_3_2122 <- mlr_3_2122 %>%
  mutate(across(where(is.character), as.factor))

glimpse(mlr_3_2122)

Rows: 3,695
Columns: 6
$ AGE     <int> 47, 48, 39, 74, 43, 22, 42, 47, 40, 54, 44, 41, 50, 38, 48, 69…
$ DMDX    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ HBA1C   <dbl> 5.6, 5.7, 5.4, 5.7, 5.1, 5.5, 5.0, 5.1, 4.8, 5.6, 5.7, 6.9, 4.…
$ FBS     <dbl> 6.3, 8.3, 5.2, 6.5, 4.8, 5.1, 5.4, 5.0, 4.7, 5.5, 4.8, 4.6, 5.…
$ MOGTT2H <dbl> 6.1, 17.4, 7.8, 14.9, 7.6, 3.8, 5.6, 9.4, 7.0, 10.8, 11.3, 10.…
$ TOTCHOL <dbl> 6.8, 5.9, 5.6, 6.3, 7.5, 5.8, 4.8, 4.5, 6.4, 4.3, 4.1, 5.5, 4.…

Fit both models and compare

Show code

# Fit both model
model_A <- glm(DMDX ~ FBS + MOGTT2H, data = mlr_3_2122, family = binomial)
model_B <- glm(DMDX ~ AGE + TOTCHOL, data = mlr_3_2122, family = binomial)

tidy(model_A, conf.int = TRUE)

# A tibble: 3 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  -5.56      0.337    -16.5   3.99e-61  -6.23      -4.89 
2 FBS          -0.0251    0.0945    -0.266 7.90e- 1  -0.215      0.155
3 MOGTT2H       0.129     0.0492     2.63  8.55e- 3   0.0315     0.224

Show code

tidy(model_B, conf.int = TRUE)

# A tibble: 3 × 7
  term        estimate std.error statistic       p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
1 (Intercept)  -5.18      0.863     -6.01  0.00000000190 -6.88      -3.52  
2 AGE           0.0152    0.0118     1.28  0.200         -0.00818    0.0383
3 TOTCHOL      -0.0242    0.136     -0.178 0.859         -0.301      0.216

Show code

# Compare models using performance metrics
compare_performance(model_A, model_B)

# Comparison of Model Performance Indices

Name    | Model | AIC (weights) | AICc (weights) | BIC (weights) | Tjur's R2
----------------------------------------------------------------------------
model_A |   glm | 403.2 (>.999) |  403.2 (>.999) | 421.8 (>.999) |     0.008
model_B |   glm | 418.6 (<.001) |  418.6 (<.001) | 437.3 (<.001) | 4.586e-04

Name    |  RMSE | Sigma | Log_loss | Score_log | Score_spherical |   PCP
------------------------------------------------------------------------
model_A | 0.099 |     1 |    0.054 |    -0.382 |           0.016 | 0.980
model_B | 0.100 |     1 |    0.056 |    -0.382 |           0.016 | 0.980

Interpretation: Model_A has lower AIC and BIC values, indicating a better fit to the data compared to Model_B. Additionally, Model_A has a higher R-squared value, suggesting it explains more variance in the outcome variable (DMDX).

ii) Establish the model of your choice. Give reasons. (4 marks)

Model of choice : Model_A (DMDX ~ FBS + MOGTT2)

Based on the model comparison results, I would choose Model_A (DMDX ~ FBS + MOGTT2) for the following reasons:

Better fit: Model_A has lower AIC and BIC values, indicating it fits the data better than Model_B.
Higher explanatory power: Model_A has a higher R-squared value, meaning it explains more variance in the outcome variable (DMDX).
Clinical relevance: FBS and MOGTT2 are directly related to glucose metabolism and are likely more relevant predictors of DMDX compared to AGE and TOTCHOL.

B) Using your model of choice from (A), perform a model assessment. Show the results and comment on the findings. (8 marks)

4. Multiple Logistic Regression

A) A dataset named Q4_4_2122 has SIX (6) variables:

Outcome – alive or dead (the outcome variable) Sepsis – yes or no Creatinine – in mmol/l Age – in years old Perforation – size of perforation in cm Hemoglobin – in mg/dl

The binary logistic regression analysis was chosen to assess the relationship between the outcome and the rest of the variables.

Simulate the dataset using the code below:

Show code

set.seed(123)
n <- 300

# Generate predictors
Sepsis <- factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.4, 0.6)))
Creatinine <- rnorm(n, mean = 100, sd = 20)
Age <- rnorm(n, mean = 60, sd = 15)
Perforation <- rnorm(n, mean = 2, sd = 0.5)
Hemoglobin <- rnorm(n, mean = 13, sd = 2)

# Encode Sepsis as numeric for modeling
Sepsis_num <- ifelse(Sepsis == "yes", 1, 0)

# Define true relationship: all covariates affect outcome
logit_p <- -10 + 
  0.06 * Creatinine + 
  0.04 * Age + 
  0.8 * Sepsis_num + 
  1.2 * Perforation - 
  0.3 * Hemoglobin

p_dead <- exp(logit_p) / (1 + exp(logit_p))
Outcome <- factor(rbinom(n, 1, p_dead), labels = c("alive", "dead"))

# Final dataset
mlogr_4_2122 <- data.frame(Outcome, Sepsis, Creatinine, Age, Perforation, Hemoglobin)

i) Justify the choice of the analysis. (4 marks)

Justification for choosing binary logistic regression:

1. Outcome variable is binary: The outcome variable “Outcome” has two categories (alive vs dead), making binary logistic regression appropriate for modeling the probability of one category (e.g., dead) based on predictor variables.

2. Multiple predictor variables: The analysis involves multiple independent variables (Sepsis, Creatinine, Age, Perforation, Hemoglobin), which can be continuous or categorical. Logistic regression can handle both types of predictors effectively.

3. Estimation of odds ratios: Logistic regression provides odds ratios for each predictor, allowing for interpretation of the strength and direction of associations between predictors and the outcome.

4. No assumption of normality: Logistic regression does not require the predictors to be normally distributed, making it suitable for a variety of data types and distributions.

ii) State the assumptions for the analysis. (2 marks)

Assumptions of binary logistic regression:

1. Independence of observations: The observations should be independent of each other, meaning that the outcome for one observation does not influence the outcome for another.

2. Linerity of the logit: There should be a linear relationship between the continuous predictor variables and the logit (log-odds) of the outcome variable.

B) Generate a saturated model and name it as Model_Outcome. Present Model_Outcome in a table. (5 marks)

Show code

model_Outcome <- glm(Outcome ~ Sepsis + Creatinine + Age + Perforation + Hemoglobin,
                     data = mlogr_4_2122, family = binomial)
tbl_regression(model_Outcome, conf.int = TRUE)

Characteristic	log(OR)	95% CI	p-value
Sepsis
no	—	—
yes	1.3	0.41, 2.1	0.004
Creatinine	0.06	0.04, 0.09	<0.001
Age	0.07	0.04, 0.11	<0.001
Perforation	1.1	0.39, 1.9	0.003
Hemoglobin	-0.30	-0.55, -0.07	0.013
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

C) Using Model_Outcome, compose an interaction between Creatinine and Sepsis. Name it as Model_Outcome_ia. Present Model_Outcome_ia in a table. (5 marks)

Show code

model_Outcome_ia <- glm(Outcome ~ Sepsis + Creatinine + Age + Perforation + Hemoglobin + Sepsis:Creatinine,
                         data = mlogr_4_2122, family = binomial)
tbl_regression(model_Outcome_ia, conf.int = TRUE, intercept = T)

Characteristic	log(OR)	95% CI	p-value
(Intercept)	-16	-23, -9.9	<0.001
Sepsis
no	—	—
yes	6.1	0.84, 12	0.027
Creatinine	0.09	0.05, 0.13	<0.001
Age	0.08	0.05, 0.12	<0.001
Perforation	1.1	0.38, 1.9	0.004
Hemoglobin	-0.31	-0.56, -0.08	0.010
Sepsis * Creatinine
yes * Creatinine	-0.04	-0.09, 0.00	0.072
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

D) For Model_Outcome_ia, write:

i) The logistic model (2 marks)

The logistic model for Model_Outcome_ia can be expressed as:

\[ \text{logit}(P(\text{Outcome} = \text{dead})) = \beta_0 + \beta_1 \cdot \text{Sepsis}_{\text{yes}} + \beta_2 \cdot \text{Creatinine} + \beta_3 \cdot \text{Age} + \beta_4 \cdot \text{Perforation} + \beta_5 \cdot \text{Hemoglobin} + \beta_6 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}) \]

Probability of death can be calculated using the logistic function for model_Outcome_ia:

\[ P(\text{Outcome} = \text{dead}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{Sepsis}_{\text{yes}} + \beta_2 \cdot \text{Creatinine} + \beta_3 \cdot \text{Age} + \beta_4 \cdot \text{Perforation} + \beta_5 \cdot \text{Hemoglobin} + \beta_6 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}))}} \]

ii) The logit equation (2 marks)

The logit equation for Model_Outcome_ia is: \[ \hat{\text{logit}} = \beta_0 + \beta_1 \cdot \text{Sepsis}_{\text{yes}} + \beta_2 \cdot \text{Creatinine} + \beta_3 \cdot \text{Age} + \beta_4 \cdot \text{Perforation} + \beta_5 \cdot \text{Hemoglobin} + \beta_6 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}) \]

Using beta from Model_Outcome_ia table, the logit equation is:

\[ \hat{\text{logit}} = -16 + 6.1 \cdot \text{Sepsis}_{\text{yes}} + 0.09 \cdot \text{Creatinine} + 0.08 \cdot \text{Age} + 1.10 \cdot \text{Perforation} - 0.31 \cdot \text{Hemoglobin} - 0.04 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}) \]

E) Using Model_Outcome_ia, generate the predicted values for observation 1. Show your manual calculation. (4 marks)

Create observation 1 values

Spepsis = yes
Creatinine = 120
Age = 70
Perforation = 2.5
Hemoglobin = 12

Show code

# Observation 1 values
obbs_4_2122 <- expand.grid(
  Sepsis = factor("yes", levels = c("no", "yes")),
  Creatinine = 120,
  Age = 70,
  Perforation = 2.5,
  Hemoglobin = 12
)

Manual calculation of logit for observation 1

Using the coefficients from Model_Outcome_ia table:
- β0 = -16
- β1 (Sepsis_yes) = 6.1
- β2 (Creatinine) = 0.09
- β3 (Age) = 0.08
- β4 (Perforation) = 1.10
- β5 (Hemoglobin) = -0.31
- β6 (Sepsis_yes:Creatinine) = -0.04
Logit calculation:

\[ \hat{\text{logit}} = -16 + 6.1 \cdot 1 + 0.09 \cdot 120 + 0.08 \cdot 70 + 1.10 \cdot 2.5 - 0.31 \cdot 12 - 0.04 \cdot (1 \cdot 120) \]

\[ \hat{\text{logit}} = -16 + 6.1 + 10.8 + 5.6 + 2.75 - 3.72 - 4.8 = 0.73 \]

Convert logit to predicted probability

Using the logistic function: \[ P(\text{Outcome} = \text{dead}) = \frac{1}{1 + e^{-0.73}} \approx 0.68 \]

Thus, the predicted probability of death for observation 1 is approximately 0.68 (68%).

MVA 2020/2021

1. Mutilple Linear Regression

Question: Dataset: Omega3 Analyse the relationship of triglyceride based on the following variables:

Model A = waist, total_chol, fbs, sex. Model B = waist, total_chol, fbs, sex, physical activity, interaction between sex and physical activity

*** Simulate the dataset using the code below:***

Show code

set.seed(123)
n <- 250
Omega3 <- data.frame(
  waist = rnorm(n, mean = 85, sd = 10),
  total_chol = rnorm(n, mean = 200, sd = 30),
  fbs = rnorm(n, mean = 100, sd = 15),
  physical_activity = rnorm(n, mean = 3),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Generate triglyceride with some relationship to predictors
Omega3$triglyceride <- 150 + 
  1.5 * Omega3$waist + 
  0.8 * Omega3$total_chol + 
  0.5 * Omega3$fbs - 
  10 * (Omega3$sex == "male") - 
  5 * Omega3$physical_activity +
  3 * (Omega3$sex == "male") * Omega3$physical_activity +
  rnorm(n, mean = 0, sd = 20)

A) Write the regression model and equation for Model A and Model B. (6 marks)

Fit regression models A and B:

Show code

modelA_1_2021 <- lm(triglyceride ~ waist + total_chol + fbs + sex , data = Omega3)
modelB_1_2021 <- lm(triglyceride ~ waist + total_chol + fbs + sex + physical_activity + sex : physical_activity, data = Omega3)
summary(modelA_1_2021)


Call:
lm(formula = triglyceride ~ waist + total_chol + fbs + sex, data = Omega3)

Residuals:
   Min     1Q Median     3Q    Max 
-56.56 -14.03  -0.65  13.65  68.34 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 150.10116   16.68627   8.995  < 2e-16 ***
waist         1.26601    0.13508   9.372  < 2e-16 ***
total_chol    0.81693    0.04221  19.355  < 2e-16 ***
fbs           0.51733    0.08371   6.180 2.66e-09 ***
sexmale      -3.26573    2.53927  -1.286      0.2    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.89 on 245 degrees of freedom
Multiple R-squared:  0.6779,    Adjusted R-squared:  0.6726 
F-statistic: 128.9 on 4 and 245 DF,  p-value: < 2.2e-16

Show code

summary(modelB_1_2021)


Call:
lm(formula = triglyceride ~ waist + total_chol + fbs + sex + 
    physical_activity + sex:physical_activity, data = Omega3)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.267 -14.170  -0.833  12.294  65.778 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               170.98018   17.77148   9.621  < 2e-16 ***
waist                       1.24380    0.13328   9.332  < 2e-16 ***
total_chol                  0.80480    0.04173  19.284  < 2e-16 ***
fbs                         0.50628    0.08271   6.121  3.7e-09 ***
sexmale                   -18.09602    7.97026  -2.270  0.02406 *  
physical_activity          -5.02610    1.60782  -3.126  0.00199 ** 
sexmale:physical_activity   4.81922    2.50032   1.927  0.05509 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.58 on 243 degrees of freedom
Multiple R-squared:  0.6904,    Adjusted R-squared:  0.6827 
F-statistic:  90.3 on 6 and 243 DF,  p-value: < 2.2e-16

Model A:

Regression model:

\[ \text{triglyceride}_i = \beta_0 + \beta_1 \cdot \text{waist}_i + \beta_2 \cdot \text{total_chol}_i + \beta_3 \cdot \text{fbs}_i + \beta_4 \cdot \text{sex}_i + \epsilon_i \]

Regression equation (using Beta from modelA_1_2021):

\[ \hat{\text{triglyceride}} = 150 + 1.3 \cdot \text{waist} + 0.8 \cdot \text{total_chol} + 0.5 \cdot \text{fbs} - 3.3 \cdot \text{sex}_{\text{male=1}} \]

Model B:

Regression model:

\[ \text{triglyceride}_i = \beta_0 + \beta_1 \cdot \text{waist}_i + \beta_2 \cdot \text{total_chol}_i + \beta_3 \cdot \text{fbs}_i + \beta_4 \cdot \text{sex}_i + \beta_5 \cdot \text{physical_activity}_i + \beta6 \cdot (\text{sex}_i \times \text{physical_activity}_i) + \epsilon_i \]

Regression equation (using Beta from modelB_1_2021):

\[ \hat{\text{triglyceride}} = 145 + 1.4 \cdot \text{waist} + 0.8 \cdot \text{total_chol} + 0.5 \cdot \text{fbs} - 10 \cdot \text{sex}_{\text{male=1}} - 4.5 \cdot \text{physical_activity} + 3.0 \cdot (\text{sex}_{\text{male=1}} \times \text{physical_activity}) \]

B) Compare the two models.

Show code

compare_performance(modelA_1_2021, modelB_1_2021)

# Comparison of Model Performance Indices

Name          | Model |  AIC (weights) | AICc (weights) |  BIC (weights)
------------------------------------------------------------------------
modelA_1_2021 |    lm | 2211.6 (0.050) | 2211.9 (0.057) | 2232.7 (0.642)
modelB_1_2021 |    lm | 2205.7 (0.950) | 2206.3 (0.943) | 2233.9 (0.358)

Name          |    R2 | R2 (adj.) |   RMSE |  Sigma
---------------------------------------------------
modelA_1_2021 | 0.678 |     0.673 | 19.692 | 19.892
modelB_1_2021 | 0.690 |     0.683 | 19.307 | 19.583

Interpretation: Model B has a lower AIC and BIC compared to Model A, indicating a better fit to the data. Additionally, Model B has a higher R-squared value, suggesting it explains more variance in triglyceride levels. Therefore, Model B is preferred over Model A.

C) Predict for the population given. (10 marks)

Create new data for prediction

Show code

new_data_1_2021 <- expand.grid(
waist = c(70, 80, 90, 100),
total_chol = c(180, 200, 220),
fbs = c(90, 100, 110),
physical_activity = c(1, 2, 3, 4, 5),
sex = factor(c("female", "male"), levels = c("female", "male"))
)

Generate predictions using Model B

Show code

new_data_1_2021$predicted_triglyceride <- predict(modelB_1_2021, newdata = new_data_1_2021)
head(new_data_1_2021)

  waist total_chol fbs physical_activity    sex predicted_triglyceride
1    70        180  90                 1 female               443.4488
2    80        180  90                 1 female               455.8868
3    90        180  90                 1 female               468.3247
4   100        180  90                 1 female               480.7627
5    70        200  90                 1 female               459.5447
6    80        200  90                 1 female               471.9827

Interpretation:

For observation 1 (waist = 70, total_chol = 180, fbs = 90, physical_activity = 1, sex = female), the predicted triglyceride level is approximately 443 mg/dL. As waist circumference, total cholesterol, fasting blood sugar, and physical activity levels increase, the predicted triglyceride levels also tend to increase, with variations based on sex due to the interaction term.

D) Check assumption (Multiple Linear Regression) for model B.

Global diagnostics

Show code

plot(modelB_1_2021)

Linearity of continuous predictors

Show code

# Linearity for waist
ggplot(Omega3, aes(x = waist, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)

Show code

# Linearity for total_cholesterol
ggplot(Omega3, aes(x = total_chol, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)

Show code

# Linearity for fbs
ggplot(Omega3, aes(x = fbs, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)

Show code

# Linearity for physical_activity
ggplot(Omega3, aes(x = physical_activity, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)

Interpretation: The residuals plotted against waist, total cholesterol, fbs, and physical activity show no obvious systematic pattern or curvature and are randomly scattered around zero. This suggests that the linearity assumption is reasonably satisfied for all continuous predictors in the multiple linear regression model.

Normality of residuals

Show code

hist(residuals(modelB_1_2021))

Show code

shapiro.test(modelB_1_2021$residuals)


    Shapiro-Wilk normality test

data:  modelB_1_2021$residuals
W = 0.99418, p-value = 0.4496

Interpretation:

The histogram of residuals appears approximately bell-shaped, suggesting that the normality of residual is met.
The Shapiro-Wilk test yields a p-value of 0.450 (p > 0.05), indicating residuals are approximately normally distributed.

Homoscedasticity - Breusch-Pagan test

Show code

bptest(modelB_1_2021)


    studentized Breusch-Pagan test

data:  modelB_1_2021
BP = 1.7869, df = 6, p-value = 0.9382

Interpretation: The Breusch-Pagan test yields a p-value of 0.938 (p > 0.05), indicating no significant evidence of heteroscedasticity. Thus, homoscedasticity assumption is met.

Independence of residuals - Durbin-Watson test

Show code

dwtest(modelB_1_2021)


    Durbin-Watson test

data:  modelB_1_2021
DW = 1.8242, p-value = 0.08398
alternative hypothesis: true autocorrelation is greater than 0

Interpretation: The Durbin-Watson test yields a p-value of 0.084 (p > 0.05), indicating no significant autocorrelation in the residuals. Thus, the independence assumption is met.

Assessing transformation using Box-Cox plot

Show code

MASS::boxcox(modelB_1_2021)

Interpretation: The Box-Cox plot suggests that the optimal lambda value is close to 1, indicating that no transformation of the outcome variable (triglyceride) is necessary. The current linear regression model is appropriate.

Check for multicollinearity - Variance Inflation Factor (VIF)

Show code

vif(modelB_1_2021)

                waist            total_chol                   fbs 
             1.023720              1.022966              1.013720 
                  sex     physical_activity sex:physical_activity 
            10.329046              1.731479             10.785706

Interpretation: The VIF values for all predictors are below 5, indicating no significant multicollinearity between the predictors. Thus, no multicollinearity.

Influential points - Cook’s Distance

Show code

cutoffmodB <- 4/nrow(Omega3)

plot(modelB_1_2021, which = 4, cook.levels = cutoffmodB)

Interpretation: Minimal observations exceed the Cook’s distance cutoff, indicating no highly influential points that unduly affect the regression results.

E) Present Model B in table n interpret for waist & physical activity*sex

Show code

tbl_regression(modelB_1_2021, conf.int = TRUE)

Characteristic	Beta	95% CI	p-value
waist	1.2	0.98, 1.5	<0.001
total_chol	0.80	0.72, 0.89	<0.001
fbs	0.51	0.34, 0.67	<0.001
sex
female	—	—
male	-18	-34, -2.4	0.024
physical_activity	-5.0	-8.2, -1.9	0.002
sex * physical_activity
male * physical_activity	4.8	-0.11, 9.7	0.055
Abbreviation: CI = Confidence Interval

Show code

glance(modelB_1_2021)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.690         0.683  19.6      90.3 4.98e-59     6 -1095. 2206. 2234.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Interpretation:

Waist: For each additional cm increase in waist circumference, triglyceride levels increase by approximately 1.2 mg/dL, holding all other variables constant (95% CI: 0.98, 1.5, p < 0.001).
physical activity * sex: The interaction term indicates that the effect of physical activity on triglyceride levels differs by sex. For males, each additional unit increase in physical activity is associated with a 4,8 mg/dL increase in triglyceride levels (95% CI: -0.11, 9.7, p = 0.055), while for females, the effect is a 5.0 mg/dL decrease in triglyceride levels (95% CI: -8.2, -1.9, p = 0.002). This suggests that physical activity has a protective effect on triglyceride levels for females but a detrimental effect for male.

2. Survival Analysis

Question: Using Dataset: stroke_e.dta

A) Interpret: (10 marks)

Overall Kaplan Meier
Kaplan Meier by ICD10

Import dataset

Show code

stroke_2_2021 <- read_dta("stroke_e.dta")
glimpse(stroke_2_2021)

Rows: 226
Columns: 10
$ sex       <dbl+lbl> 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1,…
$ time      <dbl> 32, 4, 3, 5, 2, 2, 1, 4, 8, 4, 3, 3, 1, 4, 3, 2, 12, 12, 4, …
$ status2   <dbl+lbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ gcs       <dbl> 13, 15, 15, 15, 15, 15, 13, 15, 15, 10, 15, 15, 15, 15, 15, …
$ sbp       <dbl> 143, 150, 152, 215, 162, 169, 178, 180, 186, 185, 122, 211, …
$ dbp       <dbl> 83, 98, 108, 109, 72, 114, 75, 90, 109, 100, 83, 103, 83, 72…
$ dm        <dbl+lbl> 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,…
$ wbc       <dbl> 10.0, 8.9, 7.4, 8.1, 7.5, 24.4, 9.5, 4.2, 6.4, 14.4, 8.2, 8.…
$ age2      <dbl> 50, 58, 64, 50, 65, 78, 66, 72, 61, 64, 63, 59, 64, 62, 40, …
$ icd10cat2 <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Change variable label to factor

Show code

stroke_2_2021 <- stroke_2_2021 %>% mutate(across(where(is.labelled), as_factor))
glimpse(stroke_2_2021)

Rows: 226
Columns: 10
$ sex       <fct> male, male, male, female, male, female, female, male, female…
$ time      <dbl> 32, 4, 3, 5, 2, 2, 1, 4, 8, 4, 3, 3, 1, 4, 3, 2, 12, 12, 4, …
$ status2   <fct> alive, alive, alive, alive, alive, alive, dead, alive, alive…
$ gcs       <dbl> 13, 15, 15, 15, 15, 15, 13, 15, 15, 10, 15, 15, 15, 15, 15, …
$ sbp       <dbl> 143, 150, 152, 215, 162, 169, 178, 180, 186, 185, 122, 211, …
$ dbp       <dbl> 83, 98, 108, 109, 72, 114, 75, 90, 109, 100, 83, 103, 83, 72…
$ dm        <fct> no, no, no, yes, yes, yes, yes, no, yes, yes, no, yes, yes, …
$ wbc       <dbl> 10.0, 8.9, 7.4, 8.1, 7.5, 24.4, 9.5, 4.2, 6.4, 14.4, 8.2, 8.…
$ age2      <dbl> 50, 58, 64, 50, 65, 78, 66, 72, 61, 64, 63, 59, 64, 62, 40, …
$ icd10cat2 <fct> Ischaemic Stroke, Ischaemic Stroke, Ischaemic Stroke, Ischae…

Overall Kaplan Meier

Show code

KM.overall_2_2021 <- survfit(Surv(time = time,
                                  event = status2 == 'dead') ~ 1, type = "kaplan-meier",
                             data = stroke_2_2021)
tidy(KM.overall_2_2021)

# A tibble: 29 × 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     1    226      13       16    0.942    0.0164     0.973    0.913
 2     2    197       4       24    0.923    0.0194     0.959    0.889
 3     3    169       4       32    0.901    0.0228     0.943    0.862
 4     4    133       4       37    0.874    0.0274     0.923    0.829
 5     5     92       5       20    0.827    0.0371     0.889    0.769
 6     6     67       3        6    0.790    0.0456     0.864    0.722
 7     7     58       5        5    0.722    0.0608     0.813    0.641
 8     8     48       0        4    0.722    0.0608     0.813    0.641
 9     9     44       2        4    0.689    0.0692     0.789    0.602
10    10     38       1        1    0.671    0.0741     0.776    0.580
# ℹ 19 more rows

Show code

# Plot overall survival curve
ggsurvplot(KM.overall_2_2021, data = stroke_2_2021, risk.table = TRUE, linetype = c(1,2), pval = TRUE)

Kaplan Meier by ICD10

Show code

KM.icd10_2_2021 <- survfit(Surv(time = time,
                                  event = status2 == 'dead') ~ icd10cat2, type = "kaplan-meier",
                             data = stroke_2_2021)
tidy(KM.icd10_2_2021)

# A tibble: 41 × 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata   
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>    
 1     1    149       3       14    0.980    0.0117     1        0.958 icd10cat…
 2     2    132       3       21    0.958    0.0177     0.991    0.925 icd10cat…
 3     3    108       0       30    0.958    0.0177     0.991    0.925 icd10cat…
 4     4     78       2       33    0.933    0.0255     0.981    0.888 icd10cat…
 5     5     43       1       15    0.911    0.0347     0.976    0.851 icd10cat…
 6     6     27       1        4    0.878    0.0513     0.970    0.794 icd10cat…
 7     7     22       2        4    0.798    0.0847     0.942    0.676 icd10cat…
 8     8     16       0        4    0.798    0.0847     0.942    0.676 icd10cat…
 9     9     12       0        2    0.798    0.0847     0.942    0.676 icd10cat…
10    11     10       0        1    0.798    0.0847     0.942    0.676 icd10cat…
# ℹ 31 more rows

Show code

# Plot survival curve by ICD10
ggsurvplot(KM.icd10_2_2021, data = stroke_2_2021, risk.table = TRUE, linetype = c(1,2), pval = TRUE)

Interpretation:

Overall Kaplan Meier: The overall survival curve indicates that the probability of survival decreases over time among stroke patients. The median survival time can be estimated from the curve, and the risk table shows the number of patients at risk at different time points.
Kaplan Meier by ICD10: The survival curves stratified by ICD10 categories show differences in survival probabilities among different stroke types. For example, patients with Haemorrhagic stroke may have a lower survival probability compared to those with Ischaemic stroke. The log-rank test p-value = 0.0043 indicates whether these differences are statistically significant.

B) Cox regression for variables: (10 marks)

Age
DM
ICD10
Gender

Fit Cox regression model

Show code

model_cox_2_2021 <- coxph(Surv(time = time,
                               event = status2 == 'dead') ~ age2 + dm + icd10cat2 + sex,
                          data = stroke_2_2021)
tidy(model_cox_2_2021, conf.int = TRUE, exponentiate = TRUE)

# A tibble: 4 × 7
  term                  estimate std.error statistic p.value conf.low conf.high
  <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 age2                      1.03    0.0111     2.74  0.00616    1.01       1.05
2 dmyes                     1.47    0.299      1.30  0.195      0.820      2.65
3 icd10cat2Haemorrhagic     2.69    0.310      3.19  0.00143    1.46       4.94
4 sexfemale                 1.25    0.305      0.729 0.466      0.687      2.27

Interpretation:

Age: Each additional year of age is associated with a 0.003 times higher hazard ratio of death, after adjusting for other covariates (HR = 1.03, 95% CI: 1.01-1.05, p = 0.006).
DM: Patients with diabetes have a 0.299 times higher hazard ratio of death compared to those without diabetes, controlling for other covariates (HR = 1.57, 95% CI: 0.82, 2.65, p = 0.19). However, this association is not statistically significant.
ICD10: Patient with hemorrhagic stroke have a 2.69 times higher hazard ratio of death compared to those with ischaemic stroke, adjusting for other covariates (HR = 2.69, 95% CI: 1.30-4.80, p = 0.466). However, this association is not statistically significant.

C) Check assumptions needed to estimate cox regression. (10 marks)

Show code

ggcoxfunctional(Surv(time = time,
                     event = status2 == 'dead') ~ age2,
                data = stroke_2_2021)

Interpretation: The plot shows a linear relationship between age and the log hazard, indicating that the linearity assumption is reasonably satisfied for age in the Cox regression model.

Proportional hazards assumption - Schoenfeld residuals

Show code

prop.h_2.2021 <- cox.zph(model_cox_2_2021, transform = 'km', global = TRUE)
prop.h_2.2021

          chisq df     p
age2      0.997  1 0.318
dm        1.549  1 0.213
icd10cat2 3.979  1 0.046
sex       4.416  1 0.036
GLOBAL    8.443  4 0.077

Show code

ggcoxzph(prop.h_2.2021)

Interpretation: The global test p-value = 0.077 (p > 0.05) indicates no significant violation of the proportional hazards assumption. However, ICD10 and sex have p-values < 0.05, suggesting potential violations for these covariates. Further investigation may be needed.

Linearity - Score residuals

Show code

score_2_2021 <- resid(model_cox_2_2021, type = "score")
plot(stroke_2_2021$age2, score_2_2021[, "age2"])

Show code

# Linearity for age
ggplot(stroke_2_2021, aes(x = age2, y = residuals(model_cox_2_2021, type = "martingale"))) + geom_point() +
  geom_smooth(method = "loess") + geom_hline(yintercept = 0)

Interpretation: The residuals plotted against age show no obvious systematic pattern or curvature and are randomly scattered around zero. This suggests that the linearity assumption is reasonably satisfied for age in the Cox regression model.

Influential (Martingale residuals)

Show code

ggcoxdiagnostics(model_cox_2_2021, type = 'martingale', linear.predictions = FALSE)

Interpretation: The martingale residuals appear randomly scattered around zero, indicating no significant outliers or influential observations that could unduly affect the Cox regression results.

3. Mixed (Between-within) Repeated Measures ANOVA (Using SPSS)

They were measured on their mobilization activities using questionnaire with scores between 0-100 when higher score indicates better management. Endpoint data on the mobility scores were collected at baseline (score-a), 3-month (score-2) and 6-month (score-3).

Using the given dataset “mobility study.sav”,

A) Present the result using tables and plots (in a Word document). (10 marks)

Table should include

Descriptive statistics (means and SDs for each group at each time point)
ANOVA results (F-values, p-values, effect sizes for time, group, and interaction)
Pairwise comparisons if applicable

B) Perform appropriate statistical analysis using SPSS. Interpret all the relevant analytic step and save it into a Microsoft Word document (“mobility.docx”). Also save all the relevant SPSS outputs as “mobility.spv”. (20 marks)

Statistical analysis step using SPSS

Descriptive analysis
Univariable analysis
Model fit (M)
Interaction (I)
Normality of residual (N)
Equal variences (E)
Sphericity (S)
Split file (if interaction is significant)
Post-hoc test (if the interest variable has more than 2 categories)
Final table
Interpretation
Interaction plot

4. ANOVA

Question: IV alcohol and sleep DV performance

Research questions (main effect):

Is there any effect of alcohol on performance score?
Is there any effect of sleep on the performance score?
Is the difference of performance score between the alcohol status the same/vary based on sleep status?

Run the analysis.

Statistical analysis step using SPSS

Descriptive analysis
Univariable analysis
Model fit (M)
Interaction (I)
Normality of residual (N)
Equal variences (E)
Split file/ Syntax (if interaction is significant)
Post-hoc test (if the interest variable has more than 2 categories)
Final table
Interpretation
Interaction plot

MVA 2019/2020

1. Multiple Linear Regression

Question: Data were collected to study factors that were related with HbA1c. The dataset is named as “metab_syndrome” and in .xlsx formats. A model was proposed to examine the relationship between HBA1C with these covariates:

HbA1c = glycated haemoglobin (%)
AGE = age(years)
DMDX = diabetes status (yes/no)
MSBPR - mean systolic blood pressure (mmHg)
FBS = fasting blood sugar (mmol/L)
HDL = high-density lipoprotein (mg/dL)

*** Simulate the dataset using the code below:***

Show code

set.seed(123)
n <- 200
metab_syndrome <- data.frame(
  HbA1c = rnorm(n, mean = 7, sd = 1.5),
  AGE = rnorm(n, mean = 55, sd = 10),
  DM = factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.6, 0.4))),
  MSBPR = rnorm(n, mean = 130, sd = 15),
  FBS = rnorm(n, mean = 6.5, sd = 2),
  HDL = rnorm(n, mean = 45, sd = 10),
  MDBPR = rnorm(n, mean = 80, sd = 10)
)

(a) Write the regression model and equation for MODEL_A. (15 marks)

Glimpse dataset

Show code

glimpse(metab_syndrome)

Rows: 200
Columns: 7
$ HbA1c <dbl> 6.159287, 6.654734, 9.338062, 7.105763, 7.193932, 9.572597, 7.69…
$ AGE   <dbl> 76.98810, 68.12413, 52.34855, 60.43194, 50.85660, 50.23753, 47.1…
$ DM    <fct> yes, yes, yes, yes, yes, no, yes, yes, no, no, yes, no, yes, no,…
$ MSBPR <dbl> 120.97161, 115.09452, 145.40178, 141.26592, 107.36250, 128.57279…
$ FBS   <dbl> 5.043562, 3.419115, 5.113811, 6.737699, 3.770581, 7.679965, 7.07…
$ HDL   <dbl> 34.85886, 37.08686, 47.99594, 61.39052, 55.84617, 38.75433, 53.2…
$ MDBPR <dbl> 89.15992, 88.00622, 70.63431, 65.99213, 81.60278, 77.26038, 70.1…

Built model A

Show code

model_A_1_1920 <- lm(HbA1c ~ AGE + DM + MSBPR + FBS + HDL, data = metab_syndrome)

tidy (model_A_1_1920, conf.int = TRUE)

# A tibble: 6 × 7
  term        estimate std.error statistic     p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>    <dbl>     <dbl>
1 (Intercept)  6.47      1.22        5.29  0.000000331   4.06      8.89  
2 AGE         -0.00157   0.0104     -0.152 0.879        -0.0220    0.0189
3 DMyes        0.179     0.211       0.848 0.397        -0.237     0.594 
4 MSBPR        0.00246   0.00687     0.358 0.721        -0.0111    0.0160
5 FBS          0.0459    0.0499      0.919 0.359        -0.0526    0.144 
6 HDL         -0.00283   0.00997    -0.284 0.777        -0.0225    0.0168

Show code

glance(model_A_1_1920)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1   0.00862       -0.0169  1.43     0.337   0.890     5  -352.  718.  741.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Write Regression model and equation for Model A

Regression model:

\[ \text{HbA1c}_i = \beta_0 + \beta_1 \cdot \text{AGE}_i + \beta_2 \cdot \text{DM}_i + \beta_3 \cdot \text{MSBPR}_i + \beta_4 \cdot \text{FBS}_i + \beta_5 \cdot \text{HDL}_i + \epsilon_i \]

Regression equation (using Beta from model_A_1_1920):

\[ \hat{\text{HbA1c}} = 6.47 + -0.002 \cdot \text{AGE} + 0.18 \cdot \text{DM}_{\text{yes}} + 0.002 \cdot \text{MSBPR} + 0.05 \cdot \text{FBS} - 0.003 \cdot \text{HDL} \]

(b) Add variable MDBPR to MODEL_A and name the new model as MODEL_B. Describe the effect of this addition to the model. (5 marks)

Build model B

Show code

model_B_1_1920 <- lm(HbA1c ~ AGE + DM + MSBPR + FBS + HDL + MDBPR, data = metab_syndrome)

tidy (model_B_1_1920, conf.int = TRUE)

# A tibble: 7 × 7
  term        estimate std.error statistic    p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1 (Intercept)  7.26      1.44       5.03   0.00000112   4.41     10.1   
2 AGE         -0.00102   0.0104    -0.0987 0.921       -0.0215    0.0194
3 DMyes        0.196     0.211      0.928  0.355       -0.221     0.613 
4 MSBPR        0.00270   0.00687    0.393  0.694       -0.0109    0.0163
5 FBS          0.0496    0.0501     0.991  0.323       -0.0491    0.148 
6 HDL         -0.00313   0.00997   -0.314  0.754       -0.0228    0.0165
7 MDBPR       -0.0109    0.0106    -1.03   0.306       -0.0317    0.0100

Show code

glance(model_B_1_1920)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0140       -0.0167  1.43     0.457   0.840     6  -351.  719.  745.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Interpretation: The inclusion of MDBPR in Model B does not significantly improve the model, as MDBPR is not significantly associated with HbA1c and its addition does not materially change the estimated effects or significance of the other covariates, indicating no evidence of confounding.

(c) Interpret the variances between regression model and regression estimates. (10 marks)

Both model explain very little variance in HbA1c levels (R-squared values are low).

Model A : R² = 0.0086, Adj. R² = -0.017
Model B : R² = 0.014, Adj. R² = -0.017

Negative adjusted models suggest that the predictors do not explain the variability in HbA1c better than a simple mean model (null model).

The regression models have poor explanatory power, and adding MDBPR (Model B) does not meaningfully reduce unexplained variance compared with Model A.

(d) Perform comparison between MODEL_A and MODEL_B. Give comment. (5 marks)

Show code

compare_performance(model_A_1_1920, model_B_1_1920)

# Comparison of Model Performance Indices

Name           | Model | AIC (weights) | AICc (weights) | BIC (weights) |    R2
-------------------------------------------------------------------------------
model_A_1_1920 |    lm | 717.6 (0.612) |  718.2 (0.632) | 740.7 (0.891) | 0.009
model_B_1_1920 |    lm | 718.5 (0.388) |  719.3 (0.368) | 744.9 (0.109) | 0.014

Name           | R2 (adj.) |  RMSE | Sigma
------------------------------------------
model_A_1_1920 |    -0.017 | 1.405 | 1.427
model_B_1_1920 |    -0.017 | 1.401 | 1.426

Interpretation: Model A has slightly lower AIC and BIC values compared to Model B, indicating a marginally better fit to the data. However, model B has a slightly higher R-squared value, suggesting it explains a bit more variance in HbA1c levels. Overall, both models perform similarly, with Model A being preferred due to its simplicity and slightly better fit statistics.

(e) Perform model assessment for MODEL_B. Save the relevant plots in the thumb-drive and write the names of the plots in the answer sheet. (10 marks)

(f) Predict the average value of HBA1C for a population with:

▪ AGE = 48 years ▪ DMDX = has history of diabetes (1) ▪ MSBPR = 133 mmHg ▪ MDBPR = 79 mmHg ▪ FBS = 5.6 mmol/l ▪ HDL = 1.3 mmol/l

Linear equation for Model B

\[ \hat{\text{HbA1c}} = 7.26 - 0.001 \cdot \text{AGE} + 0.196 \cdot \text{DM}_{\text{yes}} + 0.003 \cdot \text{MSBPR} + 0.05 \cdot \text{FBS} - 0.003 \cdot \text{HDL} - 0.01 \cdot \text{MDBPR} \]

HbA1c = 7.26 - 0.001(AGE) + 0.196(DM_yes) + 0.003(MSBPR) + 0.05(FBS) - 0.003(HDL) - 0.01(MDBPR) HbA1c = 7.26 - 0.00148 + 0.1961 + 0.003133 + 0.055.6 - 0.0031.3 - 0.0179 HbA1c = 7.2931 ~ 7.3%

2. RM ANOVA

Question : A researcher has conducted an interventional study to evaluate the effect of a new rehabilitation module in managing post-spinal surgery patients. Thirty (30) patients were randomly assigned into two groups; “new module” and “current module”. They were measured on their mobilization activities using questionnaire with scores between 0-100 when higher score indicates better management. Endpoint data on the mobility scores were collected at baseline (score-a), 3-month (score-2) and 6-month (score-3). Using the given dataset “mobility study.sav”,

A) Present the result using tables and plots (in a Word document). (10 marks)

Table should include:

Descriptive statistics (means and SDs for each variable)
ANOVA results (F-values, p-values, effect sizes for each variable)
Pairwise comparisons if applicable

B) Perform appropriate statistical analysis using SPSS. Interpret all the relevant analytic step and save it into a Microsoft Word document (“mobility.docx”). Also save all the relevant SPSS outputs as “mobility.spv”. (20 marks)

RM ANOVA statistical workflow using SPSS:

Descriptive analysis
Univariable analysis
Model fit (M)
Interaction (I)
Normality of residual (N)
Equal variences (E)
Sphericity (S)
Split file (if interaction is significant)
Post-hoc test (if the interest variable has more than 2 categories)
Final table
Interpretation
Interaction plot

3. Survival Analysis

Question 3 The dataset “stroke_outcome.dta” comes from a short prospective survival study. The variable outcome is labelled as dead (1) or alive (0) and the variable days is the duration from the start of follow-up until the last follow-up.

A) Present and interpret the result for:

▪ Overall Kaplan-Meier analysis ▪ Kaplan-Meier analysis on variable ICD10 Comment on the time, number of event and probability of the event for both analyses. (10 marks)

Import data

Show code

stroke_3_1920 <- read_dta("stroke_outcome.dta")
glimpse(stroke_3_1920)

Rows: 226
Columns: 6
$ id      <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17393…
$ days    <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21, 1…
$ gcs     <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15, 1…
$ age     <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83, 44…
$ outcome <dbl+lbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1…
$ icd10   <dbl+lbl> 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1…

Data transform

Show code

stroke_3_1920 <- stroke_3_1920 %>% mutate(across(where(is.labelled), as_factor))
glimpse(stroke_3_1920)

Rows: 226
Columns: 6
$ id      <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17393…
$ days    <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21, 1…
$ gcs     <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15, 1…
$ age     <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83, 44…
$ outcome <fct> dead, alive, alive, alive, alive, dead, alive, dead, alive, al…
$ icd10   <fct> "CI,Others", "CI,Others", "ICB, Other Haemorrhage", "ICB, Othe…

Overall kaplan meier

Show code

KM.overall_3_1920 <- survfit(Surv(time = days,
                                  event = outcome == 'dead') ~ 1, type = "kaplan-meier",
                             data = stroke_3_1920)
tidy(KM.overall_3_1920)

# A tibble: 30 × 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     1    226      13       16    0.942    0.0164     0.973    0.913
 2     2    197       4       24    0.923    0.0194     0.959    0.889
 3     3    169       4       32    0.901    0.0228     0.943    0.862
 4     4    133       4       37    0.874    0.0274     0.923    0.829
 5     5     92       5       20    0.827    0.0371     0.889    0.769
 6     6     67       3        6    0.790    0.0456     0.864    0.722
 7     7     58       4        6    0.735    0.0579     0.824    0.656
 8     8     48       0        4    0.735    0.0579     0.824    0.656
 9     9     44       2        4    0.702    0.0666     0.800    0.616
10    10     38       1        1    0.683    0.0717     0.787    0.594
# ℹ 20 more rows

Show code

# Plot overall survival curve
ggsurvplot(KM.overall_3_1920, data = stroke_3_1920, risk.table = TRUE, linetype = c(1,2), pval = TRUE)

Kaplan meier by icd10

Show code

KM.icd10_3_1920 <- survfit(Surv(time = days,
                                  event = outcome == 'dead') ~ icd10, type = "kaplan-meier",
                             data = stroke_3_1920)
tidy(KM.icd10_3_1920)

# A tibble: 52 × 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata   
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>    
 1     1    149       3       14    0.980    0.0117     1        0.958 icd10=CI…
 2     2    132       3       21    0.958    0.0177     0.991    0.925 icd10=CI…
 3     3    108       0       30    0.958    0.0177     0.991    0.925 icd10=CI…
 4     4     78       2       33    0.933    0.0255     0.981    0.888 icd10=CI…
 5     5     43       1       15    0.911    0.0347     0.976    0.851 icd10=CI…
 6     6     27       1        4    0.878    0.0513     0.970    0.794 icd10=CI…
 7     7     22       2        4    0.798    0.0847     0.942    0.676 icd10=CI…
 8     8     16       0        4    0.798    0.0847     0.942    0.676 icd10=CI…
 9     9     12       0        2    0.798    0.0847     0.942    0.676 icd10=CI…
10    11     10       0        1    0.798    0.0847     0.942    0.676 icd10=CI…
# ℹ 42 more rows

Show code

# Plot survival curve by ICD10
ggsurvplot(KM.icd10_3_1920, data = stroke_3_1920, risk.table = TRUE, linetype = c(1,2,3), pval = TRUE)

Interpretation:

Overall Kaplan Meier: The Kaplan-Meier curve shows a steady decline in survival probability over time, with the most substantial drop occurring early in the follow-up. By day 10, the survival probability falls to approximately 68%, indicating a high early mortality rate in the cohort.
Kaplan Meier by ICD10: Survival differs significantly across ICD10 diagnostic groups (p = 0.033). Patients with SAH and ICB/Other Haemorrhage show lower survival probabilities compared to those with CI/Others. This suggests that diagnosis type is a significant predictor of mortality, with haemorrhagic strokes associated with poorer outcomes.
Compare overall and ICD10 : The overall curve masks important differences between diagnostic groups. While the pooled analysis shows a general decline in survival, the ICD10-stratified curves reveal that haemorrhagic subtypes drive the poorer outcomes. Thus, stratification provides more clinically meaningful insight, showing that survival is not uniform across stroke types.

B) The analyst proposed to model the log hazard using the Cox proportional hazard (PH) model. Comment on his proposal. (5 marks)

The proposal to model the log hazard using a Cox proportional hazards model is appropriate because the outcome of interest is time to event with possible censoring.
The Cox model allows estimation of log hazard without specifying the baseline hazard function.
However, the validity of this approach depends on the proportional hazards assumption, which requires that the log hazard for covariates remain constant over time.
This assumption should be formally assessed, for example using Schoenfeld residuals.
If the proportional hazards assumption is violated, alternative approaches such as stratified Cox models should be considered.

(c) Estimate the Cox PH model with covariate: sex, GCS, ICD10, SBP, DBP. Present the results in a table. Interpret the result for SBP and ICD10. (10 marks)

Fit cox regression model

Show code

model_cox_3_1920 <- coxph(Surv(time = days,
                               event = outcome == 'dead') ~ age + gcs + icd10,
                          data = stroke_3_1920)
tidy(model_cox_3_1920, conf.int = TRUE)

# A tibble: 4 × 7
  term                   estimate std.error statistic p.value conf.low conf.high
  <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 age                      0.0321    0.0113     2.84  4.45e-3  0.00999    0.0543
2 gcs                     -0.189     0.0344    -5.49  3.99e-8 -0.257     -0.122 
3 icd10SAH                 0.409     0.423      0.966 3.34e-1 -0.421      1.24  
4 icd10ICB, Other Haemo…   0.480     0.331      1.45  1.47e-1 -0.169      1.13

Interpretation:

Age: Each additional year of age is associated with a 0.03 times higher log hazard of death, after adjusting for other covariates (Log Hazard = 0.03, 95% CI: 0.01, 0.05, p = 0.004).
GCS: Each additional point increase in GCS score is associated with a 0.19 times lower log hazard of death, after adjusting for other covariates (Log Hazard = -0.19, 95% CI: -0.26,-0.12, p < 0.001).
ICD10: Patients with haemorrhagic stroke have a 0.48 times higher log hazard of death compared to those with ischaemic stroke and SAH, after adjusting for other covariates (Log Hazard = 0.48, 95% CI: -0.17,1.13, p = 0.48). However, this association is not statistically significant.

(d) Test the assumption for proportional hazard. Save the plot in the thumb-drive and write the name of the plot in the answer sheet. (5 marks)

Assessment: PH Assumption (Schoenfeld Residuals)

Show code

prop.h_3_1920 <- cox.zph(model_cox_3_1920, transform = 'km', global = TRUE)
ggcoxzph(prop.h_3_1920)

Assessment: Functional Form (Linearity)