Practice Multivariable Analysis 2026

Author

Kang Munir

Published

January 25, 2026

MVA 2024/2025

Load all the library using pacman

Show code
library(pacman)

# load all library

p_load(lmtest, survival, broom, caret, pacman, gtsummary, lubridate, 
       survminer, pROC, corrplot, ggeffects, skimr, haven, 
       officer, mfp, car, flextable, psych, ResourceSelection, 
       readxl, summarytools, performance, tidyverse)

1. Multiple Linear Regression

Data Quality of life (qol_data.csv). Outcome is quality of life score. 3 covariates: job experience (years), gender (male, female) and comorbidity (yes, no)

A) Build model without interaction, interpret (10marks)

  1. Import and summary data
Show code
# import data
qol <- read.csv("qol_data.csv")
glimpse(qol)
Rows: 100
Columns: 4
$ quality_of_life <dbl> 74.26882, 71.91479, 75.23085, 38.65436, 65.26596, 77.4…
$ job_experience  <dbl> 9.339748, 23.860849, 12.860331, 26.607505, 28.273551, …
$ gender          <chr> "male", "female", "female", "male", "female", "female"…
$ comorbidity     <chr> "no", "no", "no", "yes", "no", "no", "no", "yes", "no"…
  1. Change character variables to factors
Show code
qol <- qol %>%
  mutate(across(where(is.character), as.factor))

# recode label variable

attr(qol$quality_of_life, "label") <- "Quality of Life Score"
attr(qol$job_experience, "label") <- "Job Experience (years)"
attr(qol$gender, "label") <- "Gender"
attr(qol$comorbidity, "label") <- "Comorbidity"

glimpse(qol)
Rows: 100
Columns: 4
$ quality_of_life <dbl> 74.26882, 71.91479, 75.23085, 38.65436, 65.26596, 77.4…
$ job_experience  <dbl> 9.339748, 23.860849, 12.860331, 26.607505, 28.273551, …
$ gender          <fct> male, female, female, male, female, female, male, male…
$ comorbidity     <fct> no, no, no, yes, no, no, no, yes, no, no, no, yes, yes…

3.. Fit multiple linear regression model without interaction

Show code
model1 <- lm(quality_of_life ~ job_experience + gender + comorbidity, data = qol)

tidy(model1, intercept = TRUE)
# A tibble: 4 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      82.6      1.32       62.6  1.15e-79
2 job_experience   -0.492    0.0653     -7.54 2.67e-11
3 gendermale       -9.57     1.08       -8.88 3.73e-14
4 comorbidityyes  -13.9      1.06      -13.1  4.88e-23
Show code
# Or can use regression table
tbl_regression(model1, intercept = TRUE)
Characteristic Beta 95% CI p-value
(Intercept) 83 80, 85 <0.001
Job Experience (years) -0.49 -0.62, -0.36 <0.001
Gender


    female
    male -9.6 -12, -7.4 <0.001
Comorbidity


    no
    yes -14 -16, -12 <0.001
Abbreviation: CI = Confidence Interval
  1. Interpret coefficients
  • Job Experience: Each additional year of job experience is associated with a 0.49-point decrease in quality of life score, adjusted for gender and comorbidity (95% CI: -0.62 to -0.36, p < 0.001).

  • Gender: Male participants have a quality of life score that is 9.6 points lower than female participants, adjusted for job experience and comorbidity (95% CI: -12.0 to -7.4, p < 0.001).

  • Comorbidity: Participants with comorbidity have a quality of life score that is 14 points lower than those without comorbidity, adjusted for job experience and gender (95% CI: -16.0 to -12.0, p < 0.001).

B) Build model with interaction between gender & comorbid, present in table (5mark

  1. Fit multiple linear regression model with interaction
Show code
model2 <- lm(quality_of_life ~ job_experience + gender + comorbidity + gender:comorbidity, data = qol)

tidy(model2)
# A tibble: 5 × 5
  term                      estimate std.error statistic  p.value
  <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                 80.3      1.17       68.7  8.16e-83
2 job_experience              -0.517    0.0553     -9.35 4.03e-15
3 gendermale                  -3.68     1.30       -2.83 5.72e- 3
4 comorbidityyes              -8.70     1.22       -7.13 2.00e-10
5 gendermale:comorbidityyes  -11.4      1.80       -6.31 8.68e- 9
Show code
# Or can use regression table

tbl_regression(model2, intercept = TRUE)
Characteristic Beta 95% CI p-value
(Intercept) 80 78, 83 <0.001
Job Experience (years) -0.52 -0.63, -0.41 <0.001
Gender


    female
    male -3.7 -6.3, -1.1 0.006
Comorbidity


    no
    yes -8.7 -11, -6.3 <0.001
Gender * Comorbidity


    male * yes -11 -15, -7.8 <0.001
Abbreviation: CI = Confidence Interval
  1. Interpret interaction term

The interaction term between gender and comorbidity is statistical significant (p < 0.05), indicating that the effect of comorbidity on quality of life score differs by gender.

C) Provide appropriate plot, copy in words/in jpeg file (5marks)

  1. Plot interaction each of covariate
Show code
# interaction plot for quality of life vs job experience
ggplot(qol, aes(x = job_experience, y = quality_of_life)) +
geom_point() +
geom_smooth(method = "lm") + 
  
# add labels and theme  
  labs(title = "Quality of Life Score vs Job Experience",
       x = "Job Experience (years)",
       y = "Quality of Life Score") +
  theme_minimal()

Show code
# interaction plot for quality of life vs gender
ggplot(qol, aes(x = gender, y = quality_of_life)) +
  geom_boxplot() +
  
  # add labels and theme  
  labs(title = "Quality of Life Score by Gender",
       x = "Gender",
       y = "Quality of Life Score") +
  theme_minimal()

Show code
# interaction plot for quality of life vs comorbidity

ggplot(qol, aes(x = comorbidity, y = quality_of_life)) +
  geom_boxplot() +
  
  # add labels and theme  
  labs(title = "Quality of Life Score by Comorbidity",
       x = "Comorbidity",
       y = "Quality of Life Score") +
  theme_minimal()

  1. Interaction plot between quality of life score by gender, job experiencexx` and comorbidity.
Show code
interact.plot <- ggpredict(model2, terms = c("gender", "comorbidity", "job_experience"))
plot(interact.plot, connect_lines = T)

  1. Interpret plot
  • People without comorbidity have higher quality of life scores.
  • Females without comorbidity have the highest scores.
  • Males without comorbidity also have high scores, but slightly lower than females.
  • Comorbidity lowers scores for both genders.
  • Males with comorbidity have the lowest scores.
  • The effect of comorbidity is stronger in males than in females.

D) Predict based on (working experience: 5/15/25 years), both comorbidities, both gender. proof one using manual calculation (10marks)

  1. Create prediction data frame
Show code
pred_data <- expand.grid( job_experience = c(5, 15, 25),
                          gender = c("female", "male"), 
                          comorbidity = c("no", "yes") )

augment(model2, newdata = pred_data)
# A tibble: 12 × 4
   job_experience gender comorbidity .fitted
            <dbl> <fct>  <fct>         <dbl>
 1              5 female no             77.7
 2             15 female no             72.5
 3             25 female no             67.4
 4              5 male   no             74.0
 5             15 male   no             68.9
 6             25 male   no             63.7
 7              5 female yes            69.0
 8             15 female yes            63.8
 9             25 female yes            58.7
10              5 male   yes            53.9
11             15 male   yes            48.8
12             25 male   yes            43.6
  1. Manual calculation for one scenario (e.g., 15 years experience, male with comorbidity)

The linear regression model predicting Quality of Life Score is:

\[ \text{Quality of Life Score} = 80 - 0.52 \cdot \text{Job Experience} - 3.7 \cdot \text{Male} - 8.7 \cdot \text{Comorbidity} - 11 \cdot (\text{Male} \times \text{Comorbidity}) \]

Where:

  • (_0) is the intercept.
  • Job Experience is measured in years.
  • Male is coded as 1 for male, 0 for female.
  • Comorbidity is coded as 1 for yes, 0 for no.
  • The interaction term is 1 only when both gender is male and comorbidity is present.

Example : Male with comorbidity, 15 years of experience.

QOL = 80 - 0.52(15) - 3.7(1) - 8.7(1) - 11(1*1) = 48.8

2. Multiple Logistic Regression

Question: Using Quality of Life data (qol2.dta). Outcome is quality of life categorized into poor and good. 3 covariates: job experience (years), gender (male, female) and comorbidity (yes, no)

A) What is analysis you would like to use and justify (10 marks)

Answer: Multiple Logistic Regression

Justification:

Outcome variable

  • Quality of life (QoL) is binary: Poor vs Good

Because the dependent variable is dichotomous, binary logistic regression is the most appropriate analysis.

Justification (key points for marks)

  1. Nature of outcome
  • Logistic regression is suitable when the outcome is categorical with two levels.
  • Linear regression is inappropriate because it can predict values outside 0–1 and violates normality and homoscedasticity assumptions.
  1. Interpretability
  • Logistic regression estimates odds ratios (ORs), which are standard and interpretable measures in epidemiology and public health.
  • ORs quantify the association between predictors and the odds of having good quality of life.
  1. Covariate structure
  • Job experience: continuous predictor (years)
  • Gender: categorical (male/female)
  • Comorbidity: categorical (yes/no)
  • Logistic regression can accommodate both continuous and categorical predictors in the same model.
  1. Adjustment for confounding
  • Multivariable logistic regression allows adjustment for potential confounders, giving adjusted odds ratios.
  1. Interaction assessment
  • The research question explicitly asks for an interaction between gender and comorbidity.
  • Logistic regression allows inclusion and interpretation of interaction (effect modification) terms.
  1. Public health relevance
  • Logistic regression is widely used in quality-of-life and health outcomes research, making results comparable with existing literature.

Therefore, multivariable binary logistic regression is the most appropriate analysis to model the association between job experience, gender, comorbidity, and quality of life, while allowing assessment of interaction effects.

B) Build model with interaction between gender & comorbid, present in table (5marks)

  1. Import data
Show code
qol2 <- read_dta("qol2.dta")
glimpse(qol2)
Rows: 100
Columns: 4
$ quality_of_life <dbl+lbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2,…
$ job_experience  <dbl> 9.339748, 23.860849, 12.860331, 26.607505, 28.273551, …
$ gender          <dbl+lbl> 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1,…
$ comorbidity     <dbl+lbl> 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
  1. Change labelled variables to factors
Show code
qol2 <- qol2 %>%
  mutate(across(where(is.labelled), as_factor))
glimpse(qol2)
Rows: 100
Columns: 4
$ quality_of_life <fct> poor, poor, poor, poor, poor, good, poor, poor, poor, …
$ job_experience  <dbl> 9.339748, 23.860849, 12.860331, 26.607505, 28.273551, …
$ gender          <fct> male, female, female, male, female, female, male, male…
$ comorbidity     <fct> no, no, no, yes, no, no, no, yes, no, no, no, yes, yes…
  1. Fit multiple logistic regression model with interaction
Show code
model_logit <- glm(quality_of_life ~ gender + comorbidity + job_experience + gender:comorbidity,
                   data = qol2,
                   family = binomial)
tbl_regression(model_logit,intercept = T, exponentiate = TRUE, conf.int = TRUE)
Characteristic OR 95% CI p-value
(Intercept) 0.12 0.03, 0.42 0.002
gender


    female
    male 4.30 1.14, 18.3 0.037
comorbidity


    no
    yes 2.37 0.72, 8.28 0.2
job_experience 1.12 1.06, 1.20 <0.001
gender * comorbidity


    male * yes 1.22 0.16, 10.2 0.8
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  1. Interpretation
  • Gender: After adjusting for comorbidity and the interaction term, male healthcare workers have significantly higher odds of good quality of life compared with females (OR = 4.95, 95% CI: 1.48–18.9, p = 0.013).

  • Comorbidity: Healthcare workers with comorbidity have higher odds of good quality of life compared with those without comorbidity; however, this association is not statistically significant (OR = 2.12, 95% CI: 0.72–6.42, p = 0.20).

  • Interaction term (gender × comorbidity): The interaction between gender and comorbidity is not statistically significant (OR = 0.97, 95% CI: 0.15–7.10, p > 0.9), indicating that the association between comorbidity and quality of life does not differ by gender.

C) Provide predicted values for the first observation (5 marks)

The interaction was not significant, so we will use the model without interaction to predict the probability for the first observation.

  1. Fit multiple logistic regression model without interaction
Show code
model_logitfull <- glm(quality_of_life ~ gender + comorbidity + job_experience,
                   data = qol2,
                   family = binomial)
tbl_regression(model_logitfull,intercept = T, conf.int = TRUE)
Characteristic log(OR) 95% CI p-value
(Intercept) -2.1 -3.5, -0.94 0.001
gender


    female
    male 1.6 0.57, 2.6 0.003
comorbidity


    no
    yes 0.93 -0.03, 2.0 0.063
job_experience 0.12 0.06, 0.19 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  1. Predict probability for the first observation (model with log odd)
Show code
predict(model_logitfull, newdata = qol2[1, ], type = "response")
        1 
0.6217327 
  1. Manual calculation using log-odds

The logistic regression equation is: \[ \text{logit}(p) = \beta_0 + \beta_1 \cdot \text {gender (m = 1)} + \beta_2 \cdot \text{comorbidity (yes = 1)} + \beta_3 \cdot \text{job experience} \]

Where:

  • β0 = -2.1 (intercept)
  • β1 = 1.6
  • β2 = 0.93
  • β3 = 0.12

For the first observation (male, no comorbidity, 9.3 years experience):

Logit(p) = -2.1 + 1.6(1) + 0.793(0) - 0.12(9.3) = -1.057

Convert log-odds to probability: \[ p = \frac{e^{\text{logit}(p)}}{1 + e^{\text{logit}(p)}} = \frac{e^{-1.057}}{1 + e^{-1.057}} \approx 0.62 \]

Interpretation: The predicted probability of having good quality of life for the first observation is 0.62 (62%).

3. Cox Propoportional Hazard Regression

Question: Using survival data (stroke_e.dta). Outcome is stroke death. 3 covariates: age ( years), gender (male, female) and diabetes (yes, no) and ICD 10 (ischemic stroke, hemorrhagic).

A) Illustrate the statistical workflow , from beginning until communicate to the stakeholder (5 marks)

Mnemonic : I took tiktok video muka model comel

  1. Import
  2. Tidy
  3. Transform
  4. Visualize
  5. Model
  6. Communicate

B) Perform the Cox Proportional Hazard Regression model with covariates gender, age, dm & icd10. Present the result in a table. Interpret the result for age and icd10 (10 marks)

  1. Import data
Show code
stroke <- read_dta("stroke_e.dta")
glimpse(stroke)
Rows: 226
Columns: 10
$ sex       <dbl+lbl> 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1,…
$ time      <dbl> 32, 4, 3, 5, 2, 2, 1, 4, 8, 4, 3, 3, 1, 4, 3, 2, 12, 12, 4, …
$ status2   <dbl+lbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ gcs       <dbl> 13, 15, 15, 15, 15, 15, 13, 15, 15, 10, 15, 15, 15, 15, 15, …
$ sbp       <dbl> 143, 150, 152, 215, 162, 169, 178, 180, 186, 185, 122, 211, …
$ dbp       <dbl> 83, 98, 108, 109, 72, 114, 75, 90, 109, 100, 83, 103, 83, 72…
$ dm        <dbl+lbl> 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,…
$ wbc       <dbl> 10.0, 8.9, 7.4, 8.1, 7.5, 24.4, 9.5, 4.2, 6.4, 14.4, 8.2, 8.…
$ age2      <dbl> 50, 58, 64, 50, 65, 78, 66, 72, 61, 64, 63, 59, 64, 62, 40, …
$ icd10cat2 <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
  1. Change labelled variables to factors
Show code
stroke <- stroke %>%
  mutate(across(where(is.labelled), as_factor))
glimpse(stroke)
Rows: 226
Columns: 10
$ sex       <fct> male, male, male, female, male, female, female, male, female…
$ time      <dbl> 32, 4, 3, 5, 2, 2, 1, 4, 8, 4, 3, 3, 1, 4, 3, 2, 12, 12, 4, …
$ status2   <fct> alive, alive, alive, alive, alive, alive, dead, alive, alive…
$ gcs       <dbl> 13, 15, 15, 15, 15, 15, 13, 15, 15, 10, 15, 15, 15, 15, 15, …
$ sbp       <dbl> 143, 150, 152, 215, 162, 169, 178, 180, 186, 185, 122, 211, …
$ dbp       <dbl> 83, 98, 108, 109, 72, 114, 75, 90, 109, 100, 83, 103, 83, 72…
$ dm        <fct> no, no, no, yes, yes, yes, yes, no, yes, yes, no, yes, yes, …
$ wbc       <dbl> 10.0, 8.9, 7.4, 8.1, 7.5, 24.4, 9.5, 4.2, 6.4, 14.4, 8.2, 8.…
$ age2      <dbl> 50, 58, 64, 50, 65, 78, 66, 72, 61, 64, 63, 59, 64, 62, 40, …
$ icd10cat2 <fct> Ischaemic Stroke, Ischaemic Stroke, Ischaemic Stroke, Ischae…
  1. Label variable
Show code
attr(stroke$time, "label") <- "Time to Stroke death (months)"
attr(stroke$status2, "label") <- "Stroke death"
attr(stroke$age2, "label") <- "Age (years)"
attr(stroke$sex, "label") <- "Gender"
attr(stroke$dm, "label") <- "Diabetes (yes/no)"
attr(stroke$icd10cat2, "label") <- "ICD10"
attr(stroke$gcs, "label") <- "Glasgow Coma Scale"
  1. Fit Cox Proportional Hazard Regression model
Show code
model_cox <- coxph(
  Surv(time = time, event = status2 == "dead") ~ 
    age2 + sex + dm + icd10cat2,
  data = stroke
)

tbl_regression(model_cox, exponentiate = TRUE, conf.int = TRUE)
Characteristic HR 95% CI p-value
Age (years) 1.03 1.01, 1.05 0.006
Gender


    male
    female 1.25 0.69, 2.27 0.5
Diabetes (yes/no)


    no
    yes 1.47 0.82, 2.65 0.2
ICD10


    Ischaemic Stroke
    Haemorrhagic 2.69 1.46, 4.94 0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
  1. Interpretation
  • Age: Age is significantly associated with stroke recurrence. Each additional year of age is associated with a 1.03 increase in the hazard of stroke death, after adjusting for gender, diabetes, and stroke type. (95% CI: 1.01, 1.05, p 0.006).
  • ICD10: Patients with hemorrhagic stroke have a 2.69 times higher hazard of stroke death compared to those with ischemic stroke, after adjusting for gender, age, and diabetes (95% CI: 1.46, 4,94, p = 0.001).

C)add on gcs, explain about the regression estimate for icd10 (5 marks)

  1. Fit Cox Proportional Hazard Regression model with gcs
Show code
model_cox2 <- coxph(
  Surv(time = time, event = status2 == "dead") ~ 
    age2 + sex + dm + icd10cat2 + gcs, data = stroke)

tbl_regression(model_cox2, exponentiate = TRUE, conf.int = TRUE)
Characteristic HR 95% CI p-value
Age (years) 1.03 1.01, 1.05 0.004
Gender


    male
    female 1.16 0.64, 2.12 0.6
Diabetes (yes/no)


    no
    yes 1.42 0.79, 2.55 0.2
ICD10


    Ischaemic Stroke
    Haemorrhagic 1.69 0.91, 3.13 0.10
Glasgow Coma Scale 0.84 0.79, 0.90 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
  1. Interpretation for ICD10

After add adjusting for Glasgow Coma Scale, patients with haemorrhagic stroke have a 69% higher hazard of stroke recurrence compared with patients with ischaemic stroke; However, the association is no longer statistically significant (HR: 1.69, 95% CI: 0.91, 3.13, p = 0.10). This suggests that the initial observed association between stroke type and stroke recurrence may be confounded by Glasgow Coma Scale scores.

MVA 2023/2024

1. Analysis workflow

Question: Before formal data analysis, researchers are encouraged to develop data analysis workflow. - Using a flow chart, generate data analysis workflow that will help with your research data. Briefly describe the objective for each stage of the workflow (10 marks)

Mnemonic : I took tiktok video muka model comel

  1. Import
  • Objective: Load raw data from various sources (e.g., CSV, Excel, databases) into the analysis environment for further processing.
  1. Tidy
  • Objective: Clean and structure the data into a consistent format (e.g., long format) to facilitate analysis, ensuring variables are properly labeled and missing values are addressed.
  1. Transform
  • Objective: Create new variables, recode existing ones, and perform necessary calculations to prepare the data for analysis, enhancing its usability and interpretability.
  1. Visualize
  • Objective: Generate graphical representations of the data to explore patterns, trends, and relationships among variables, aiding in understanding and hypothesis generation.
  1. Model
  • Objective: Apply appropriate statistical or machine learning models to analyze the data, test hypotheses, and derive insights relevant to the research questions.
  1. Communicate
  • Objective: Present the analysis results through reports, presentations, or dashboards, effectively conveying findings to stakeholders for informed decision-making.
Show code
library(DiagrammeR)
grViz("
digraph data_analysis_workflow {
  node [shape = rectangle, style = filled, fillcolor = LightBlue]
  Import [label = 'Import Data']
  Tidy [label = 'Tidy Data']
  Transform [label = 'Transform Data']
  Visualize [label = 'Visualize Data']
  Model [label = 'Model Data']
  Communicate [label = 'Communicate Results']
  Import -> Tidy -> Transform -> Visualize -> Model -> Communicate
}
")

2. Mixed (Within-Between RM ANOVA)

Question: A randomized controlled trial was conducted to evaluate the effectiveness of the HIV Counselling Module (HIV-CM) in increasing healthcare workers (HCW) awareness on stigma and discrimination toward key populations of HIV. Thirty (n=30) HCWs were randomly assigned into two groups which consisted of 15 respondents in the intervention group “HIV-CM” while another 15 respondents to “Controls”. They were measured on their self-awareness scores (SAS) at baseline (SAS- 1), 2 weeks (SAS-2) and 6 weeks (SAS-3). The scales are between 0-100, where a higher score indicates better awareness.

A) State all possible research questions. (6 marks)

Research question:

Within-group (time effect)

  • Is there a significant change in self-awareness scores (SAS) over time (baseline, 2 weeks, 6 weeks) among healthcare workers?

Between-group (group effect)

  • Is there a significant difference in overall self-awareness scores between the HIV-CM group and the control group?

Interaction (group × time effect)

  • Does the change in self-awareness scores over time differ between the HIV-CM group and the control group?

B) State, with reasons, the appropriate statistical test to answer all research questions in Q2(A). (4 marks)

Justification :

  1. Outcome variable
  • SAS is continuous and approximately normally distributed.
  1. Within-subject factor
  • Time has three repeated measurements (SAS-1, SAS-2, SAS-3).
  1. Between-subject factor
  • Group has two independent levels (HIV-CM vs Control).
  1. Research objectives
  • The study aims to examine time effects, group effects, and group × time interaction, which are directly tested using mixed repeated measures ANOVA.

C) Using the given dataset HIV-CM.sav, perform statistical analysis in SPSS using the test stated in Q2(a). Interpret all the relevant analytic steps and save them into a Microsoft Word document (HIV-CM analysis.docx). Save all relevant SPSS outputs as HIV-CM.spv (25 marks)

Step 1: Open data

File → Open → Data → HIV-CM.sav

Check:

  • group (HIV-CM / Control) → Between-subjects factor
  • SAS_1, SAS_2, SAS_3 → Repeated measures

Step 2: Run Mixed Repeated Measures ANOVA

SPSS Menu path:

Analyze → General Linear Model → Repeated Measures

Define factors:

  • Within-Subject Factor Name: time

  • Number of Levels: 3

  • Define → assign:

    • DSCA1 → Time 1
    • DSCA2 → Time 2
    • DSCA3 → Time 3
  • Between-Subjects Factor:

    • Group (HIV-CM vs Control)

Options:

  • Descriptive statistics
  • Estimates of effect size
  • Pairwise comparisons (Bonferroni)
  • Mauchly’s test of sphericity

Step 3: Assumption checking (IMPORTANT FOR MARKS)

  1. Normality
    • SAS is continuous and approximately normally distributed → assumption reasonably met.
  2. Sphericity (Mauchly’s Test)
    • If p > 0.05 → sphericity assumed
    • If p < 0.05 → use Greenhouse-Geisser correction

state which correction was used.

Step 4: Interpretation of key outputs

  1. Time effect
    • A significant main effect of time indicates that self-awareness scores changed significantly across the three time points.
  2. Group effect
    • A significant main effect of group indicates that overall self-awareness scores differed between the HIV-CM and control groups.
  3. Group × Time interaction (MOST IMPORTANT)
    • A significant interaction indicates that the pattern of change in self-awareness scores over time differs between the HIV-CM and control groups, demonstrating intervention effectiveness.

D) Present the results from Q2(c) in a table and conclude your findings (5 marks)

Table should include :

  • Descriptive statistics (means and SDs for each group at each time point)
  • ANOVA results (F-values, p-values, effect sizes for time, group, and interaction)
  • Pairwise comparisons if applicable

3. Multiple Linear Regression

Question: A dataset named mydata.csv contains the following variables:

  • id: id of participants
  • score: performance score (the outcome)
  • marker1: biomarker
  • group: a variable with group 1 (grp1) or group 2 (grp2)

A) Write a linear regression model and a regression equation to show the relationship between the average score as a function of marker1 and group and an interaction between marker1 and group. (4 marks)

Linear regression model:

\[ \text{score}_i = \beta_0 + \beta_1 \cdot \text{marker1}_i + \beta_2 \cdot \text{group2}_i + \beta_3 \cdot (\text{marker1}_i \times \text{group2}_i) + \epsilon_i \]

Where:

  • score_i: Performance score for participant i
  • marker1_i: Biomarker value for participant i
  • group2_i: Indicator variable for group 2 (1 if grp2, 0 if grp1)
  • β0: Intercept (mean score for grp1 when marker1 = 0
  • β1: Effect of marker1 on score for grp1
  • β2: Difference in mean score between grp2 and grp1 when marker1 = 0
  • β3: Interaction effect between marker1 and group2
  • εi: Random error term

Regression equation:

\[ \hat{\text{score}} = \beta_0 + \beta_1 \cdot \text{marker1} + \beta_2 \cdot \text{group2} + \beta_3 \cdot (\text{marker1} \times \text{group2}) \]

B) Using R, perform the analysis and present the results in a table. Interpret the interaction term. (6 marks)

  1. Import data
Show code
mydata <- read.csv("mydata.csv")
glimpse(mydata)
Rows: 300
Columns: 4
$ id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ score   <dbl> -0.2110439, 5.8046025, 20.9680192, -0.9366398, 7.1760110, 6.49…
$ marker1 <dbl> 0.2823356, 3.9935731, 6.0111029, -2.5642443, 4.3728117, 4.5651…
$ group   <chr> "grp1", "grp1", "grp1", "grp1", "grp1", "grp1", "grp2", "grp1"…
  1. Change character variables to factors
Show code
mydata <- mydata %>%
  mutate(across(where(is.character), as.factor))
glimpse(mydata)
Rows: 300
Columns: 4
$ id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ score   <dbl> -0.2110439, 5.8046025, 20.9680192, -0.9366398, 7.1760110, 6.49…
$ marker1 <dbl> 0.2823356, 3.9935731, 6.0111029, -2.5642443, 4.3728117, 4.5651…
$ group   <fct> grp1, grp1, grp1, grp1, grp1, grp1, grp2, grp1, grp1, grp1, gr…
Show code
# add label variable
attr(mydata$score, "label") <- "Performance Score"
attr(mydata$marker1, "label") <- "Biomarker 1"
attr(mydata$group, "label") <- "Group"
  1. Fit multiple linear regression model with interaction
Show code
model_mlr <- lm(score ~ marker1 * group, data = mydata)

tbl_regression(model_mlr, intercept = TRUE)
Characteristic Beta 95% CI p-value
(Intercept) 3.4 2.0, 4.9 <0.001
Biomarker 1 1.4 1.1, 1.8 <0.001
Group


    grp1
    grp2 -0.03 -2.5, 2.5 >0.9
Biomarker 1 * Group


    Biomarker 1 * grp2 0.53 -0.05, 1.1 0.072
Abbreviation: CI = Confidence Interval
  1. Interpretation
  • Biomarker 1: Each unit increase in biomarker 1 is associated with a 1.4-point increase in performance score, after adjusting for group (95% CI: 1.1, 1.8, p < 0.001).
  • Group: Group 2 has a 0.03-point lower performance score compared to group 1, after adjusting for biomarker 1 (95% CI: 1.0, 9.0, p = 0.01).
  • Interaction term: The interaction term between biomarker 1 and group is statistically not significant (p > 0.05), indicating that the effect of biomarker 1 on performance score does not differ by group.

C) Show calculations for the 95% confidence intervals of the interaction term and interpret it. (5 marks)

  1. Extract coefficient and standard error for interaction term
Show code
tidy(model_mlr, conf.int = TRUE)
# A tibble: 4 × 7
  term              estimate std.error statistic  p.value conf.low conf.high
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)         3.44       0.753    4.56   7.39e- 6   1.95        4.92
2 marker1             1.44       0.187    7.74   1.60e-13   1.08        1.81
3 groupgrp2          -0.0277     1.26    -0.0220 9.82e- 1  -2.51        2.45
4 marker1:groupgrp2   0.535      0.296    1.80   7.21e- 2  -0.0484      1.12
  1. Formula to calculate 95% CI:

\[ 95\% \; CI = \hat{\beta} \pm 1.96 \times SE \]

Where:

  • β: Coefficient estimate for interaction term
  • SE: Standard error of the coefficient
  1. Manual calculation for interaction term

β3 (marker1:group2) = 0.535 SE = 0.296

  • Lower bound = 0.535 - 1.96 * 0.296 = -0.045
  • Upper bound = 0.535 + 1.96 * 0.296 = 1.115

4. Survival Analysis

Question: The dataset stroke_outcome.dta comes from a short survival study. The variable outcome is labelled as either dead (1) or censored (0) and the variable days is the duration from the start of follow-up until the last follow-up.

A) Perform the Kaplan-Meier analysis for the overall survival. Interpret the finding based on number at risk, number of events and survival probabilities at time 5 and 29. (5 marks)

  1. Import data
Show code
stroke_outcome <- read_dta("stroke_outcome.dta")
glimpse(stroke_outcome)
Rows: 226
Columns: 6
$ id      <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17393…
$ days    <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21, 1…
$ gcs     <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15, 1…
$ age     <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83, 44…
$ outcome <dbl+lbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1…
$ icd10   <dbl+lbl> 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1…
  1. Change labelled variables to factors
Show code
stroke_outcome <- stroke_outcome %>%
  mutate(across(where(is.labelled), as_factor))
glimpse(stroke_outcome)
Rows: 226
Columns: 6
$ id      <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17393…
$ days    <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21, 1…
$ gcs     <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15, 1…
$ age     <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83, 44…
$ outcome <fct> dead, alive, alive, alive, alive, dead, alive, dead, alive, al…
$ icd10   <fct> "CI,Others", "CI,Others", "ICB, Other Haemorrhage", "ICB, Othe…
Show code
# add label variable
attr(stroke_outcome$days, "label") <- "Time to Stroke death (days)"
attr(stroke_outcome$outcome, "label") <- "Stroke outcome"
attr(stroke_outcome$icd10, "label") <- "ICD10"
attr(stroke_outcome$gcs, "label") <- "Glasgow Coma Scale"
attr(stroke_outcome$age, "label") <- "Age (years)"

*** There is missing data in out dataset ***

Show code
# Check for missing data
colSums(is.na(stroke_outcome))
     id    days     gcs     age outcome   icd10 
      0       0       1       0       0       0 
Show code
# Ampute missing data
stroke_outcome$gcs[is.na(stroke_outcome$gcs)] <- median(stroke_outcome$gcs, na.rm = TRUE)

# Check again for missing data
colSums(is.na(stroke_outcome))
     id    days     gcs     age outcome   icd10 
      0       0       0       0       0       0 
  1. Fit Kaplan-Meier survival model and summarize at time 5 and 29
Show code
KM.overall <- survfit(
  Surv(time = days, outcome == "dead") ~ 1, type = "kaplan-meier",
  data = stroke_outcome
)

summary(KM.overall)
Call: survfit(formula = Surv(time = days, outcome == "dead") ~ 1, data = stroke_outcome, 
    type = "kaplan-meier")

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1    226      13    0.942  0.0155       0.9126        0.973
    2    197       4    0.923  0.0179       0.8889        0.959
    3    169       4    0.901  0.0205       0.8621        0.943
    4    133       4    0.874  0.0240       0.8286        0.923
    5     92       5    0.827  0.0307       0.7689        0.889
    6     67       3    0.790  0.0360       0.7224        0.864
    7     58       4    0.735  0.0426       0.6565        0.824
    9     44       2    0.702  0.0467       0.6160        0.800
   10     38       1    0.683  0.0490       0.5938        0.787
   12     34       4    0.603  0.0574       0.5004        0.727
   14     25       2    0.555  0.0621       0.4455        0.691
   18     20       1    0.527  0.0649       0.4140        0.671
   22     16       1    0.494  0.0687       0.3762        0.649
   25     10       2    0.395  0.0832       0.2616        0.597
   28      6       1    0.329  0.0918       0.1908        0.569
   29      5       1    0.264  0.0942       0.1308        0.531
   41      3       1    0.176  0.0953       0.0607        0.509
Show code
# 

ggsurvplot(KM.overall, data = stroke_outcome, risk.table = TRUE, 
           linetype = c(1,2), pval = TRUE)

  1. Interpretation

At time 5 days:

  • Number at risk: 92 patients were still under follow-up just before day 5
  • Number of events: 30 deaths had occurred on day 5
  • Survival probability: The estimated probability of survival at day 5 is 0.827 (82.7%) (95% CI: 0.769, 0.889)

At time 29 days:

-Number at risk: 5 patients were still under follow-up just before day 29 - Number of events: 22 deaths had occurred on day 29 - Survival probability: The estimated probability of survival at day 29 is 0.264 (26.4%) (95% CI: 0.131, 0.531)

Kaplan–Meier analysis demonstrates a marked decline in survival over time, with high early mortality and a substantial reduction in the number of patients at risk by day 29.

B) Perform the Kaplan-Meier analysis for icd10. Generate a table for survival probabilities. (5 marks)

  1. Fit Kaplan-Meier survival model by icd10 and summarize
Show code
KM.icd10 <- survfit(
  Surv(time = days, outcome == "dead") ~ icd10, type = "kaplan-meier",
  data = stroke_outcome
)

tidy(KM.icd10)
# A tibble: 52 × 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata   
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>    
 1     1    149       3       14    0.980    0.0117     1        0.958 icd10=CI…
 2     2    132       3       21    0.958    0.0177     0.991    0.925 icd10=CI…
 3     3    108       0       30    0.958    0.0177     0.991    0.925 icd10=CI…
 4     4     78       2       33    0.933    0.0255     0.981    0.888 icd10=CI…
 5     5     43       1       15    0.911    0.0347     0.976    0.851 icd10=CI…
 6     6     27       1        4    0.878    0.0513     0.970    0.794 icd10=CI…
 7     7     22       2        4    0.798    0.0847     0.942    0.676 icd10=CI…
 8     8     16       0        4    0.798    0.0847     0.942    0.676 icd10=CI…
 9     9     12       0        2    0.798    0.0847     0.942    0.676 icd10=CI…
10    11     10       0        1    0.798    0.0847     0.942    0.676 icd10=CI…
# ℹ 42 more rows
Show code
# Plot survival curves by icd10
ggsurvplot(KM.icd10, data = stroke_outcome, risk.table = TRUE, 
           linetype = c(1,2,3), pval = TRUE)

C) Perform the Cox Proportional Hazard Regression model with covariates gcs, icd10 and age. Present the result in a table. Interpret the result for age and icd10. (10 marks)

  1. Fit Cox Proportional Hazard Regression model
Show code
model_cox_stroke <- coxph(
  Surv(time = days, event = outcome == "dead") ~ 
    gcs + icd10 + age,
  data = stroke_outcome
)

tbl_regression(model_cox_stroke, conf.int = TRUE)
Characteristic log(HR) 95% CI p-value
Glasgow Coma Scale -0.18 -0.25, -0.12 <0.001
ICD10


    CI,Others
    SAH 0.40 -0.43, 1.2 0.3
    ICB, Other Haemorrhage 0.52 -0.13, 1.2 0.11
Age (years) 0.03 0.01, 0.05 0.004
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
  1. Interpretation
  • Age: Each additional year of age is associated with a 1.03 increase in log hazard of death, after adjusting for Glasgow Coma Scale and stroke type (95% CI: 1.01, 1.06, p = 0.004).
  • ICD10: Patients with Subarachnoid Hemorrhage stroke (SAH) have a 1.51 times higher log hazard of death compared to Cerebral Infarction and Intracerebral Bleeding (ICB), after adjusting for Glasgow Coma Scale and age (95% CI: 0.66, 3.09, p = 0.3). While, patient with Intracerebral Bleeding stroke have a 1.62 times higher log hazard of death compared to Cerebral Infarction and SAH, after adjusting for Glasgow Coma Scale and age (95% CI: 0.84, 3.09, p = 0.3). Both associations are not statistically significant.

D) Test the proportional hazard assumptions. (10 marks)

  1. Linearity assumption (continuous covariates)
Show code
ggcoxfunctional(Surv(time = days, event = outcome == 'dead') ~ age + gcs,
                data = stroke_outcome)

Interpretation: Age and gcs fairly flat and horizontal, indicating that the linearity assumption is reasonably met.

  1. Proportional hazard test using Schoenfeld residuals
Show code
ph_test <- cox.zph(model_cox_stroke, transform = 'km', global = TRUE) 
ph_test
        chisq df    p
gcs    0.0675  1 0.80
icd10  3.6335  2 0.16
age    0.9293  1 0.34
GLOBAL 4.6402  4 0.33
Show code
# Plot Schoenfeld residuals
ggcoxzph(ph_test)

Interpretation: The global test p-value is 0.33 (p > 0.05), indicating that the proportional hazards assumption is reasonably met for the overall model. Individually, gcs (p = 0.80), icd10 (p = 0.16), and age (p = 0.34) also show no evidence of violation of the proportional hazards assumption.

  1. Residuals analysis for continuous covariates
Show code
score <- resid(model_cox_stroke, type = "score")
plot(stroke_outcome$gcs, score[, "gcs"])

Show code
plot(stroke_outcome$age, score[, "age"])

5. Briefly describe the disadvantages of stepwise variable selection during data analysis. (5 marks)

Disadvantages of stepwise variable selection :

  • May exclude clinically or theoretically important variables

Stepwise selection relies on statistical significance rather than subject-matter knowledge. As a result, important confounders (e.g. age in stroke outcome studies) may be excluded if they are not statistically significant in the sample, leading to residual confounding.

  • Data-driven and sample-dependent

Stepwise methods are highly dependent on the observed sample. If the sample is not representative of the target population, the selected model may not generalise well, resulting in poor external validity.

  • Model instability

Small changes in the data can lead to different variables being selected. This instability reduces the reproducibility of results and makes the final model unreliable.

  • Inflated Type I error and biased estimates

Repeated testing during variable selection increases the risk of false-positive findings and can produce biased regression coefficients and p-values.

  • Ignores causal structure and research context

Stepwise selection does not consider causal relationships, temporality, or prior evidence. In contrast, manual or theory-driven variable selection allows researchers to incorporate biological plausibility and epidemiological reasoning into the model.

MVA 2021/2022

1. Analysis Workflow

Question: Describe the workflow involved from the reading of the dataset to communicating the findings to the stakeholders. (10 marks)

Mnemonic : I took tiktok video muka model comel

  1. Import
  • Objective: Load raw data from various sources (e.g., CSV, Excel, databases) into the analysis environment for further processing.
  1. Tidy
  • Objective: Clean and structure the data into a consistent format (e.g., long format) to facilitate analysis, ensuring variables are properly labeled and missing values are addressed.
  1. Transform
  • Objective: Create new variables, recode existing ones, and perform necessary calculations to prepare the data for analysis, enhancing its usability and interpretability.
  1. Visualize
  • Objective: Generate graphical representations of the data to explore patterns, trends, and relationships among variables, aiding in understanding and hypothesis generation.
  1. Model
  • Objective: Apply appropriate statistical or machine learning models to analyze the data, test hypotheses, and derive insights relevant to the research questions.
  1. Communicate
  • Objective: Present the analysis results through reports, presentations, or dashboards, effectively conveying findings to stakeholders for informed decision-making.
Show code
library(DiagrammeR)
grViz("digraph data_analysis_workflow {
  node [shape = rectangle, style = filled, fillcolor = LightBlue]
  Import [label = 'Import Data']
  Tidy [label = 'Tidy Data']
  Transform [label = 'Transform Data']
  Visualize [label = 'Visualize Data']
  Model [label = 'Model Data']
  Communicate [label = 'Communicate Results']
  Import -> Tidy -> Transform -> Visualize -> Model -> Communicate
}")

2. Mixed (Between-within) Repeated Measures ANOVA (Using SPSS)

A) Run Analysis

Step 1: Open data

File → Open → Data → HIV-CM.sav

Check: - group (HIV-CM / Control) → Between-subjects factor - DSCA1, DSCA2, DSCA3 → Repeated measures

Step 2: Run Mixed Repeated Measures ANOVA

SPSS Menu path:

Analyze → General Linear Model → Repeated Measures

Define factors:

  • Within-Subject Factor Name: time
  • Number of Levels: 3
  • Define → assign:
    • DSCA1 → Time 1
    • DSCA2 → Time 2
    • DSCA3 → Time 3
  • Between-Subjects Factor:
    • Group (HIV-CM vs Control)

Options:

  • Descriptive statistics
  • Estimates of effect size
  • Pairwise comparisons (Bonferroni)
  • Mauchly’s test of sphericity

Step 3: Assumption checking (IMPORTANT FOR MARKS)

  1. Normality
    • SAS is continuous and approximately normally distributed → assumption reasonably met.
  2. Sphericity (Mauchly’s Test)
    • If p > 0.05 → sphericity assumed
    • If p < 0.05 → use Greenhouse-Geisser correction

state which correction was used.

Step 4: Interpretation of key outputs

  1. Time effect
    • A significant main effect of time indicates that self-awareness scores changed significantly across the three time points.
  2. Group effect
    • A significant main effect of group indicates that overall self-awareness scores differed between the HIV-C M and control groups.
  3. Group × Time interaction (MOST IMPORTANT)
    • A significant interaction indicates that the pattern of change in self-awareness scores over time differs between the HIV-CM and control groups, demonstrating intervention effectiveness.

B) Present the results using tables and plots in MS Word. Name the document as results.docx. (5 marks)

Table should include: - Descriptive statistics (means and SDs for each group at each time point) - ANOVA results (F-values, p-values, effect sizes for time, group, and interaction) - Pairwise comparisons if applicable

Plot should include: - Line plot showing mean self-awareness scores over time for each group with error bars representing standard errors.

3. Multiple Log Regression

A) The analyst has proposed TWO (2) candidate models:

  • Model_A : covariates are DMDX, FBS and MOGTT2
  • Model_B : covariates are AGE and TOTCHOL

i) Compare Model_A and Model_B. Show your results. (4 marks)

  1. Import dataset
Show code
mlr_3_2122 <- read.csv("Q3.csv")
glimpse(mlr_3_2122)
Rows: 3,695
Columns: 6
$ AGE     <int> 47, 48, 39, 74, 43, 22, 42, 47, 40, 54, 44, 41, 50, 38, 48, 69…
$ DMDX    <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "n…
$ HBA1C   <dbl> 5.6, 5.7, 5.4, 5.7, 5.1, 5.5, 5.0, 5.1, 4.8, 5.6, 5.7, 6.9, 4.…
$ FBS     <dbl> 6.3, 8.3, 5.2, 6.5, 4.8, 5.1, 5.4, 5.0, 4.7, 5.5, 4.8, 4.6, 5.…
$ MOGTT2H <dbl> 6.1, 17.4, 7.8, 14.9, 7.6, 3.8, 5.6, 9.4, 7.0, 10.8, 11.3, 10.…
$ TOTCHOL <dbl> 6.8, 5.9, 5.6, 6.3, 7.5, 5.8, 4.8, 4.5, 6.4, 4.3, 4.1, 5.5, 4.…
  1. Change character variables to factors
Show code
mlr_3_2122 <- mlr_3_2122 %>%
  mutate(across(where(is.character), as.factor))

glimpse(mlr_3_2122)
Rows: 3,695
Columns: 6
$ AGE     <int> 47, 48, 39, 74, 43, 22, 42, 47, 40, 54, 44, 41, 50, 38, 48, 69…
$ DMDX    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ HBA1C   <dbl> 5.6, 5.7, 5.4, 5.7, 5.1, 5.5, 5.0, 5.1, 4.8, 5.6, 5.7, 6.9, 4.…
$ FBS     <dbl> 6.3, 8.3, 5.2, 6.5, 4.8, 5.1, 5.4, 5.0, 4.7, 5.5, 4.8, 4.6, 5.…
$ MOGTT2H <dbl> 6.1, 17.4, 7.8, 14.9, 7.6, 3.8, 5.6, 9.4, 7.0, 10.8, 11.3, 10.…
$ TOTCHOL <dbl> 6.8, 5.9, 5.6, 6.3, 7.5, 5.8, 4.8, 4.5, 6.4, 4.3, 4.1, 5.5, 4.…
  1. Fit both models and compare
Show code
# Fit both model
model_A <- glm(DMDX ~ FBS + MOGTT2H, data = mlr_3_2122, family = binomial)
model_B <- glm(DMDX ~ AGE + TOTCHOL, data = mlr_3_2122, family = binomial)

tidy(model_A, conf.int = TRUE)
# A tibble: 3 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  -5.56      0.337    -16.5   3.99e-61  -6.23      -4.89 
2 FBS          -0.0251    0.0945    -0.266 7.90e- 1  -0.215      0.155
3 MOGTT2H       0.129     0.0492     2.63  8.55e- 3   0.0315     0.224
Show code
tidy(model_B, conf.int = TRUE)
# A tibble: 3 × 7
  term        estimate std.error statistic       p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
1 (Intercept)  -5.18      0.863     -6.01  0.00000000190 -6.88      -3.52  
2 AGE           0.0152    0.0118     1.28  0.200         -0.00818    0.0383
3 TOTCHOL      -0.0242    0.136     -0.178 0.859         -0.301      0.216 
Show code
# Compare models using performance metrics
compare_performance(model_A, model_B)
# Comparison of Model Performance Indices

Name    | Model | AIC (weights) | AICc (weights) | BIC (weights) | Tjur's R2
----------------------------------------------------------------------------
model_A |   glm | 403.2 (>.999) |  403.2 (>.999) | 421.8 (>.999) |     0.008
model_B |   glm | 418.6 (<.001) |  418.6 (<.001) | 437.3 (<.001) | 4.586e-04

Name    |  RMSE | Sigma | Log_loss | Score_log | Score_spherical |   PCP
------------------------------------------------------------------------
model_A | 0.099 |     1 |    0.054 |    -0.382 |           0.016 | 0.980
model_B | 0.100 |     1 |    0.056 |    -0.382 |           0.016 | 0.980

Interpretation: Model_A has lower AIC and BIC values, indicating a better fit to the data compared to Model_B. Additionally, Model_A has a higher R-squared value, suggesting it explains more variance in the outcome variable (DMDX).

ii) Establish the model of your choice. Give reasons. (4 marks)

Model of choice : Model_A (DMDX ~ FBS + MOGTT2)

Based on the model comparison results, I would choose Model_A (DMDX ~ FBS + MOGTT2) for the following reasons:

  1. Better fit: Model_A has lower AIC and BIC values, indicating it fits the data better than Model_B.
  2. Higher explanatory power: Model_A has a higher R-squared value, meaning it explains more variance in the outcome variable (DMDX).
  3. Clinical relevance: FBS and MOGTT2 are directly related to glucose metabolism and are likely more relevant predictors of DMDX compared to AGE and TOTCHOL.

B) Using your model of choice from (A), perform a model assessment. Show the results and comment on the findings. (8 marks)

4. Multiple Logistic Regression

A) A dataset named Q4_4_2122 has SIX (6) variables:

Outcome – alive or dead (the outcome variable) Sepsis – yes or no Creatinine – in mmol/l Age – in years old Perforation – size of perforation in cm Hemoglobin – in mg/dl

The binary logistic regression analysis was chosen to assess the relationship between the outcome and the rest of the variables.

Simulate the dataset using the code below:

Show code
set.seed(123)
n <- 300

# Generate predictors
Sepsis <- factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.4, 0.6)))
Creatinine <- rnorm(n, mean = 100, sd = 20)
Age <- rnorm(n, mean = 60, sd = 15)
Perforation <- rnorm(n, mean = 2, sd = 0.5)
Hemoglobin <- rnorm(n, mean = 13, sd = 2)

# Encode Sepsis as numeric for modeling
Sepsis_num <- ifelse(Sepsis == "yes", 1, 0)

# Define true relationship: all covariates affect outcome
logit_p <- -10 + 
  0.06 * Creatinine + 
  0.04 * Age + 
  0.8 * Sepsis_num + 
  1.2 * Perforation - 
  0.3 * Hemoglobin

p_dead <- exp(logit_p) / (1 + exp(logit_p))
Outcome <- factor(rbinom(n, 1, p_dead), labels = c("alive", "dead"))

# Final dataset
mlogr_4_2122 <- data.frame(Outcome, Sepsis, Creatinine, Age, Perforation, Hemoglobin)

i) Justify the choice of the analysis. (4 marks)

Justification for choosing binary logistic regression:

1. Outcome variable is binary: The outcome variable “Outcome” has two categories (alive vs dead), making binary logistic regression appropriate for modeling the probability of one category (e.g., dead) based on predictor variables.

2. Multiple predictor variables: The analysis involves multiple independent variables (Sepsis, Creatinine, Age, Perforation, Hemoglobin), which can be continuous or categorical. Logistic regression can handle both types of predictors effectively.

3. Estimation of odds ratios: Logistic regression provides odds ratios for each predictor, allowing for interpretation of the strength and direction of associations between predictors and the outcome.

4. No assumption of normality: Logistic regression does not require the predictors to be normally distributed, making it suitable for a variety of data types and distributions.

ii) State the assumptions for the analysis. (2 marks)

Assumptions of binary logistic regression:

1. Independence of observations: The observations should be independent of each other, meaning that the outcome for one observation does not influence the outcome for another.

2. Linerity of the logit: There should be a linear relationship between the continuous predictor variables and the logit (log-odds) of the outcome variable.

B) Generate a saturated model and name it as Model_Outcome. Present Model_Outcome in a table. (5 marks)

Show code
model_Outcome <- glm(Outcome ~ Sepsis + Creatinine + Age + Perforation + Hemoglobin,
                     data = mlogr_4_2122, family = binomial)
tbl_regression(model_Outcome, conf.int = TRUE)
Characteristic log(OR) 95% CI p-value
Sepsis


    no
    yes 1.3 0.41, 2.1 0.004
Creatinine 0.06 0.04, 0.09 <0.001
Age 0.07 0.04, 0.11 <0.001
Perforation 1.1 0.39, 1.9 0.003
Hemoglobin -0.30 -0.55, -0.07 0.013
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

C) Using Model_Outcome, compose an interaction between Creatinine and Sepsis. Name it as Model_Outcome_ia. Present Model_Outcome_ia in a table. (5 marks)

Show code
model_Outcome_ia <- glm(Outcome ~ Sepsis + Creatinine + Age + Perforation + Hemoglobin + Sepsis:Creatinine,
                         data = mlogr_4_2122, family = binomial)
tbl_regression(model_Outcome_ia, conf.int = TRUE, intercept = T)
Characteristic log(OR) 95% CI p-value
(Intercept) -16 -23, -9.9 <0.001
Sepsis


    no
    yes 6.1 0.84, 12 0.027
Creatinine 0.09 0.05, 0.13 <0.001
Age 0.08 0.05, 0.12 <0.001
Perforation 1.1 0.38, 1.9 0.004
Hemoglobin -0.31 -0.56, -0.08 0.010
Sepsis * Creatinine


    yes * Creatinine -0.04 -0.09, 0.00 0.072
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

D) For Model_Outcome_ia, write:

i) The logistic model (2 marks)

The logistic model for Model_Outcome_ia can be expressed as:

\[ \text{logit}(P(\text{Outcome} = \text{dead})) = \beta_0 + \beta_1 \cdot \text{Sepsis}_{\text{yes}} + \beta_2 \cdot \text{Creatinine} + \beta_3 \cdot \text{Age} + \beta_4 \cdot \text{Perforation} + \beta_5 \cdot \text{Hemoglobin} + \beta_6 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}) \]

Probability of death can be calculated using the logistic function for model_Outcome_ia:

\[ P(\text{Outcome} = \text{dead}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{Sepsis}_{\text{yes}} + \beta_2 \cdot \text{Creatinine} + \beta_3 \cdot \text{Age} + \beta_4 \cdot \text{Perforation} + \beta_5 \cdot \text{Hemoglobin} + \beta_6 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}))}} \]

ii) The logit equation (2 marks)

The logit equation for Model_Outcome_ia is: \[ \hat{\text{logit}} = \beta_0 + \beta_1 \cdot \text{Sepsis}_{\text{yes}} + \beta_2 \cdot \text{Creatinine} + \beta_3 \cdot \text{Age} + \beta_4 \cdot \text{Perforation} + \beta_5 \cdot \text{Hemoglobin} + \beta_6 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}) \]

Using beta from Model_Outcome_ia table, the logit equation is:

\[ \hat{\text{logit}} = -16 + 6.1 \cdot \text{Sepsis}_{\text{yes}} + 0.09 \cdot \text{Creatinine} + 0.08 \cdot \text{Age} + 1.10 \cdot \text{Perforation} - 0.31 \cdot \text{Hemoglobin} - 0.04 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}) \]

E) Using Model_Outcome_ia, generate the predicted values for observation 1. Show your manual calculation. (4 marks)

  1. Create observation 1 values
  • Spepsis = yes
  • Creatinine = 120
  • Age = 70
  • Perforation = 2.5
  • Hemoglobin = 12
Show code
# Observation 1 values
obbs_4_2122 <- expand.grid(
  Sepsis = factor("yes", levels = c("no", "yes")),
  Creatinine = 120,
  Age = 70,
  Perforation = 2.5,
  Hemoglobin = 12
)
  1. Manual calculation of logit for observation 1
  • Using the coefficients from Model_Outcome_ia table:

    • β0 = -16

    • β1 (Sepsis_yes) = 6.1

    • β2 (Creatinine) = 0.09

    • β3 (Age) = 0.08

    • β4 (Perforation) = 1.10

    • β5 (Hemoglobin) = -0.31

    • β6 (Sepsis_yes:Creatinine) = -0.04

  • Logit calculation:

\[ \hat{\text{logit}} = -16 + 6.1 \cdot 1 + 0.09 \cdot 120 + 0.08 \cdot 70 + 1.10 \cdot 2.5 - 0.31 \cdot 12 - 0.04 \cdot (1 \cdot 120) \]

\[ \hat{\text{logit}} = -16 + 6.1 + 10.8 + 5.6 + 2.75 - 3.72 - 4.8 = 0.73 \]

  • Convert logit to predicted probability

Using the logistic function: \[ P(\text{Outcome} = \text{dead}) = \frac{1}{1 + e^{-0.73}} \approx 0.68 \]

Thus, the predicted probability of death for observation 1 is approximately 0.68 (68%).

MVA 2020/2021

1. Mutilple Linear Regression

Question: Dataset: Omega3 Analyse the relationship of triglyceride based on the following variables:

Model A = waist, total_chol, fbs, sex. Model B = waist, total_chol, fbs, sex, physical activity, interaction between sex and physical activity

*** Simulate the dataset using the code below:***

Show code
set.seed(123)
n <- 250
Omega3 <- data.frame(
  waist = rnorm(n, mean = 85, sd = 10),
  total_chol = rnorm(n, mean = 200, sd = 30),
  fbs = rnorm(n, mean = 100, sd = 15),
  physical_activity = rnorm(n, mean = 3),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Generate triglyceride with some relationship to predictors
Omega3$triglyceride <- 150 + 
  1.5 * Omega3$waist + 
  0.8 * Omega3$total_chol + 
  0.5 * Omega3$fbs - 
  10 * (Omega3$sex == "male") - 
  5 * Omega3$physical_activity +
  3 * (Omega3$sex == "male") * Omega3$physical_activity +
  rnorm(n, mean = 0, sd = 20)

A) Write the regression model and equation for Model A and Model B. (6 marks)

Fit regression models A and B:

Show code
modelA_1_2021 <- lm(triglyceride ~ waist + total_chol + fbs + sex , data = Omega3)
modelB_1_2021 <- lm(triglyceride ~ waist + total_chol + fbs + sex + physical_activity + sex : physical_activity, data = Omega3)
summary(modelA_1_2021)

Call:
lm(formula = triglyceride ~ waist + total_chol + fbs + sex, data = Omega3)

Residuals:
   Min     1Q Median     3Q    Max 
-56.56 -14.03  -0.65  13.65  68.34 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 150.10116   16.68627   8.995  < 2e-16 ***
waist         1.26601    0.13508   9.372  < 2e-16 ***
total_chol    0.81693    0.04221  19.355  < 2e-16 ***
fbs           0.51733    0.08371   6.180 2.66e-09 ***
sexmale      -3.26573    2.53927  -1.286      0.2    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.89 on 245 degrees of freedom
Multiple R-squared:  0.6779,    Adjusted R-squared:  0.6726 
F-statistic: 128.9 on 4 and 245 DF,  p-value: < 2.2e-16
Show code
summary(modelB_1_2021)

Call:
lm(formula = triglyceride ~ waist + total_chol + fbs + sex + 
    physical_activity + sex:physical_activity, data = Omega3)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.267 -14.170  -0.833  12.294  65.778 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               170.98018   17.77148   9.621  < 2e-16 ***
waist                       1.24380    0.13328   9.332  < 2e-16 ***
total_chol                  0.80480    0.04173  19.284  < 2e-16 ***
fbs                         0.50628    0.08271   6.121  3.7e-09 ***
sexmale                   -18.09602    7.97026  -2.270  0.02406 *  
physical_activity          -5.02610    1.60782  -3.126  0.00199 ** 
sexmale:physical_activity   4.81922    2.50032   1.927  0.05509 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.58 on 243 degrees of freedom
Multiple R-squared:  0.6904,    Adjusted R-squared:  0.6827 
F-statistic:  90.3 on 6 and 243 DF,  p-value: < 2.2e-16

Model A:

Regression model:

\[ \text{triglyceride}_i = \beta_0 + \beta_1 \cdot \text{waist}_i + \beta_2 \cdot \text{total_chol}_i + \beta_3 \cdot \text{fbs}_i + \beta_4 \cdot \text{sex}_i + \epsilon_i \]

Regression equation (using Beta from modelA_1_2021):

\[ \hat{\text{triglyceride}} = 150 + 1.3 \cdot \text{waist} + 0.8 \cdot \text{total_chol} + 0.5 \cdot \text{fbs} - 3.3 \cdot \text{sex}_{\text{male=1}} \]

Model B:

Regression model:

\[ \text{triglyceride}_i = \beta_0 + \beta_1 \cdot \text{waist}_i + \beta_2 \cdot \text{total_chol}_i + \beta_3 \cdot \text{fbs}_i + \beta_4 \cdot \text{sex}_i + \beta_5 \cdot \text{physical_activity}_i + \beta6 \cdot (\text{sex}_i \times \text{physical_activity}_i) + \epsilon_i \]

Regression equation (using Beta from modelB_1_2021):

\[ \hat{\text{triglyceride}} = 145 + 1.4 \cdot \text{waist} + 0.8 \cdot \text{total_chol} + 0.5 \cdot \text{fbs} - 10 \cdot \text{sex}_{\text{male=1}} - 4.5 \cdot \text{physical_activity} + 3.0 \cdot (\text{sex}_{\text{male=1}} \times \text{physical_activity}) \]

B) Compare the two models.

Show code
compare_performance(modelA_1_2021, modelB_1_2021)
# Comparison of Model Performance Indices

Name          | Model |  AIC (weights) | AICc (weights) |  BIC (weights)
------------------------------------------------------------------------
modelA_1_2021 |    lm | 2211.6 (0.050) | 2211.9 (0.057) | 2232.7 (0.642)
modelB_1_2021 |    lm | 2205.7 (0.950) | 2206.3 (0.943) | 2233.9 (0.358)

Name          |    R2 | R2 (adj.) |   RMSE |  Sigma
---------------------------------------------------
modelA_1_2021 | 0.678 |     0.673 | 19.692 | 19.892
modelB_1_2021 | 0.690 |     0.683 | 19.307 | 19.583

Interpretation: Model B has a lower AIC and BIC compared to Model A, indicating a better fit to the data. Additionally, Model B has a higher R-squared value, suggesting it explains more variance in triglyceride levels. Therefore, Model B is preferred over Model A.

C) Predict for the population given. (10 marks)

  1. Create new data for prediction
Show code
new_data_1_2021 <- expand.grid(
waist = c(70, 80, 90, 100),
total_chol = c(180, 200, 220),
fbs = c(90, 100, 110),
physical_activity = c(1, 2, 3, 4, 5),
sex = factor(c("female", "male"), levels = c("female", "male"))
)
  1. Generate predictions using Model B
Show code
new_data_1_2021$predicted_triglyceride <- predict(modelB_1_2021, newdata = new_data_1_2021)
head(new_data_1_2021)
  waist total_chol fbs physical_activity    sex predicted_triglyceride
1    70        180  90                 1 female               443.4488
2    80        180  90                 1 female               455.8868
3    90        180  90                 1 female               468.3247
4   100        180  90                 1 female               480.7627
5    70        200  90                 1 female               459.5447
6    80        200  90                 1 female               471.9827
  1. Interpretation:

For observation 1 (waist = 70, total_chol = 180, fbs = 90, physical_activity = 1, sex = female), the predicted triglyceride level is approximately 443 mg/dL. As waist circumference, total cholesterol, fasting blood sugar, and physical activity levels increase, the predicted triglyceride levels also tend to increase, with variations based on sex due to the interaction term.

D) Check assumption (Multiple Linear Regression) for model B.

  1. Global diagnostics
Show code
plot(modelB_1_2021)

  1. Linearity of continuous predictors
Show code
# Linearity for waist
ggplot(Omega3, aes(x = waist, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)

Show code
# Linearity for total_cholesterol
ggplot(Omega3, aes(x = total_chol, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)

Show code
# Linearity for fbs
ggplot(Omega3, aes(x = fbs, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)

Show code
# Linearity for physical_activity
ggplot(Omega3, aes(x = physical_activity, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)

Interpretation: The residuals plotted against waist, total cholesterol, fbs, and physical activity show no obvious systematic pattern or curvature and are randomly scattered around zero. This suggests that the linearity assumption is reasonably satisfied for all continuous predictors in the multiple linear regression model.

  1. Normality of residuals
Show code
hist(residuals(modelB_1_2021))

Show code
shapiro.test(modelB_1_2021$residuals)

    Shapiro-Wilk normality test

data:  modelB_1_2021$residuals
W = 0.99418, p-value = 0.4496

Interpretation:

  • The histogram of residuals appears approximately bell-shaped, suggesting that the normality of residual is met.
  • The Shapiro-Wilk test yields a p-value of 0.450 (p > 0.05), indicating residuals are approximately normally distributed.
  1. Homoscedasticity - Breusch-Pagan test
Show code
bptest(modelB_1_2021)

    studentized Breusch-Pagan test

data:  modelB_1_2021
BP = 1.7869, df = 6, p-value = 0.9382

Interpretation: The Breusch-Pagan test yields a p-value of 0.938 (p > 0.05), indicating no significant evidence of heteroscedasticity. Thus, homoscedasticity assumption is met.

  1. Independence of residuals - Durbin-Watson test
Show code
dwtest(modelB_1_2021)

    Durbin-Watson test

data:  modelB_1_2021
DW = 1.8242, p-value = 0.08398
alternative hypothesis: true autocorrelation is greater than 0

Interpretation: The Durbin-Watson test yields a p-value of 0.084 (p > 0.05), indicating no significant autocorrelation in the residuals. Thus, the independence assumption is met.

  1. Assessing transformation using Box-Cox plot
Show code
MASS::boxcox(modelB_1_2021)

Interpretation: The Box-Cox plot suggests that the optimal lambda value is close to 1, indicating that no transformation of the outcome variable (triglyceride) is necessary. The current linear regression model is appropriate.

  1. Check for multicollinearity - Variance Inflation Factor (VIF)
Show code
vif(modelB_1_2021)
                waist            total_chol                   fbs 
             1.023720              1.022966              1.013720 
                  sex     physical_activity sex:physical_activity 
            10.329046              1.731479             10.785706 

Interpretation: The VIF values for all predictors are below 5, indicating no significant multicollinearity between the predictors. Thus, no multicollinearity.

  1. Influential points - Cook’s Distance
Show code
cutoffmodB <- 4/nrow(Omega3)

plot(modelB_1_2021, which = 4, cook.levels = cutoffmodB)

Interpretation: Minimal observations exceed the Cook’s distance cutoff, indicating no highly influential points that unduly affect the regression results.

E) Present Model B in table n interpret for waist & physical activity*sex

Show code
tbl_regression(modelB_1_2021, conf.int = TRUE)
Characteristic Beta 95% CI p-value
waist 1.2 0.98, 1.5 <0.001
total_chol 0.80 0.72, 0.89 <0.001
fbs 0.51 0.34, 0.67 <0.001
sex


    female
    male -18 -34, -2.4 0.024
physical_activity -5.0 -8.2, -1.9 0.002
sex * physical_activity


    male * physical_activity 4.8 -0.11, 9.7 0.055
Abbreviation: CI = Confidence Interval
Show code
glance(modelB_1_2021)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.690         0.683  19.6      90.3 4.98e-59     6 -1095. 2206. 2234.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Interpretation:

  • Waist: For each additional cm increase in waist circumference, triglyceride levels increase by approximately 1.2 mg/dL, holding all other variables constant (95% CI: 0.98, 1.5, p < 0.001).
  • physical activity * sex: The interaction term indicates that the effect of physical activity on triglyceride levels differs by sex. For males, each additional unit increase in physical activity is associated with a 4,8 mg/dL increase in triglyceride levels (95% CI: -0.11, 9.7, p = 0.055), while for females, the effect is a 5.0 mg/dL decrease in triglyceride levels (95% CI: -8.2, -1.9, p = 0.002). This suggests that physical activity has a protective effect on triglyceride levels for females but a detrimental effect for male.

2. Survival Analysis

Question: Using Dataset: stroke_e.dta

A) Interpret: (10 marks)

  • Overall Kaplan Meier
  • Kaplan Meier by ICD10
  1. Import dataset
Show code
stroke_2_2021 <- read_dta("stroke_e.dta")
glimpse(stroke_2_2021)
Rows: 226
Columns: 10
$ sex       <dbl+lbl> 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1,…
$ time      <dbl> 32, 4, 3, 5, 2, 2, 1, 4, 8, 4, 3, 3, 1, 4, 3, 2, 12, 12, 4, …
$ status2   <dbl+lbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ gcs       <dbl> 13, 15, 15, 15, 15, 15, 13, 15, 15, 10, 15, 15, 15, 15, 15, …
$ sbp       <dbl> 143, 150, 152, 215, 162, 169, 178, 180, 186, 185, 122, 211, …
$ dbp       <dbl> 83, 98, 108, 109, 72, 114, 75, 90, 109, 100, 83, 103, 83, 72…
$ dm        <dbl+lbl> 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,…
$ wbc       <dbl> 10.0, 8.9, 7.4, 8.1, 7.5, 24.4, 9.5, 4.2, 6.4, 14.4, 8.2, 8.…
$ age2      <dbl> 50, 58, 64, 50, 65, 78, 66, 72, 61, 64, 63, 59, 64, 62, 40, …
$ icd10cat2 <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
  1. Change variable label to factor
Show code
stroke_2_2021 <- stroke_2_2021 %>% mutate(across(where(is.labelled), as_factor))
glimpse(stroke_2_2021)
Rows: 226
Columns: 10
$ sex       <fct> male, male, male, female, male, female, female, male, female…
$ time      <dbl> 32, 4, 3, 5, 2, 2, 1, 4, 8, 4, 3, 3, 1, 4, 3, 2, 12, 12, 4, …
$ status2   <fct> alive, alive, alive, alive, alive, alive, dead, alive, alive…
$ gcs       <dbl> 13, 15, 15, 15, 15, 15, 13, 15, 15, 10, 15, 15, 15, 15, 15, …
$ sbp       <dbl> 143, 150, 152, 215, 162, 169, 178, 180, 186, 185, 122, 211, …
$ dbp       <dbl> 83, 98, 108, 109, 72, 114, 75, 90, 109, 100, 83, 103, 83, 72…
$ dm        <fct> no, no, no, yes, yes, yes, yes, no, yes, yes, no, yes, yes, …
$ wbc       <dbl> 10.0, 8.9, 7.4, 8.1, 7.5, 24.4, 9.5, 4.2, 6.4, 14.4, 8.2, 8.…
$ age2      <dbl> 50, 58, 64, 50, 65, 78, 66, 72, 61, 64, 63, 59, 64, 62, 40, …
$ icd10cat2 <fct> Ischaemic Stroke, Ischaemic Stroke, Ischaemic Stroke, Ischae…
  1. Overall Kaplan Meier
Show code
KM.overall_2_2021 <- survfit(Surv(time = time,
                                  event = status2 == 'dead') ~ 1, type = "kaplan-meier",
                             data = stroke_2_2021)
tidy(KM.overall_2_2021)
# A tibble: 29 × 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     1    226      13       16    0.942    0.0164     0.973    0.913
 2     2    197       4       24    0.923    0.0194     0.959    0.889
 3     3    169       4       32    0.901    0.0228     0.943    0.862
 4     4    133       4       37    0.874    0.0274     0.923    0.829
 5     5     92       5       20    0.827    0.0371     0.889    0.769
 6     6     67       3        6    0.790    0.0456     0.864    0.722
 7     7     58       5        5    0.722    0.0608     0.813    0.641
 8     8     48       0        4    0.722    0.0608     0.813    0.641
 9     9     44       2        4    0.689    0.0692     0.789    0.602
10    10     38       1        1    0.671    0.0741     0.776    0.580
# ℹ 19 more rows
Show code
# Plot overall survival curve
ggsurvplot(KM.overall_2_2021, data = stroke_2_2021, risk.table = TRUE, linetype = c(1,2), pval = TRUE)

  1. Kaplan Meier by ICD10
Show code
KM.icd10_2_2021 <- survfit(Surv(time = time,
                                  event = status2 == 'dead') ~ icd10cat2, type = "kaplan-meier",
                             data = stroke_2_2021)
tidy(KM.icd10_2_2021)
# A tibble: 41 × 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata   
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>    
 1     1    149       3       14    0.980    0.0117     1        0.958 icd10cat…
 2     2    132       3       21    0.958    0.0177     0.991    0.925 icd10cat…
 3     3    108       0       30    0.958    0.0177     0.991    0.925 icd10cat…
 4     4     78       2       33    0.933    0.0255     0.981    0.888 icd10cat…
 5     5     43       1       15    0.911    0.0347     0.976    0.851 icd10cat…
 6     6     27       1        4    0.878    0.0513     0.970    0.794 icd10cat…
 7     7     22       2        4    0.798    0.0847     0.942    0.676 icd10cat…
 8     8     16       0        4    0.798    0.0847     0.942    0.676 icd10cat…
 9     9     12       0        2    0.798    0.0847     0.942    0.676 icd10cat…
10    11     10       0        1    0.798    0.0847     0.942    0.676 icd10cat…
# ℹ 31 more rows
Show code
# Plot survival curve by ICD10
ggsurvplot(KM.icd10_2_2021, data = stroke_2_2021, risk.table = TRUE, linetype = c(1,2), pval = TRUE)

Interpretation:

  • Overall Kaplan Meier: The overall survival curve indicates that the probability of survival decreases over time among stroke patients. The median survival time can be estimated from the curve, and the risk table shows the number of patients at risk at different time points.
  • Kaplan Meier by ICD10: The survival curves stratified by ICD10 categories show differences in survival probabilities among different stroke types. For example, patients with Haemorrhagic stroke may have a lower survival probability compared to those with Ischaemic stroke. The log-rank test p-value = 0.0043 indicates whether these differences are statistically significant.

B) Cox regression for variables: (10 marks)

  • Age
  • DM
  • ICD10
  • Gender
  1. Fit Cox regression model
Show code
model_cox_2_2021 <- coxph(Surv(time = time,
                               event = status2 == 'dead') ~ age2 + dm + icd10cat2 + sex,
                          data = stroke_2_2021)
tidy(model_cox_2_2021, conf.int = TRUE, exponentiate = TRUE)
# A tibble: 4 × 7
  term                  estimate std.error statistic p.value conf.low conf.high
  <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 age2                      1.03    0.0111     2.74  0.00616    1.01       1.05
2 dmyes                     1.47    0.299      1.30  0.195      0.820      2.65
3 icd10cat2Haemorrhagic     2.69    0.310      3.19  0.00143    1.46       4.94
4 sexfemale                 1.25    0.305      0.729 0.466      0.687      2.27

Interpretation:

  • Age: Each additional year of age is associated with a 0.003 times higher hazard ratio of death, after adjusting for other covariates (HR = 1.03, 95% CI: 1.01-1.05, p = 0.006).
  • DM: Patients with diabetes have a 0.299 times higher hazard ratio of death compared to those without diabetes, controlling for other covariates (HR = 1.57, 95% CI: 0.82, 2.65, p = 0.19). However, this association is not statistically significant.
  • ICD10: Patient with hemorrhagic stroke have a 2.69 times higher hazard ratio of death compared to those with ischaemic stroke, adjusting for other covariates (HR = 2.69, 95% CI: 1.30-4.80, p = 0.466). However, this association is not statistically significant.

C) Check assumptions needed to estimate cox regression. (10 marks)

Show code
ggcoxfunctional(Surv(time = time,
                     event = status2 == 'dead') ~ age2,
                data = stroke_2_2021)

Interpretation: The plot shows a linear relationship between age and the log hazard, indicating that the linearity assumption is reasonably satisfied for age in the Cox regression model.

  1. Proportional hazards assumption - Schoenfeld residuals
Show code
prop.h_2.2021 <- cox.zph(model_cox_2_2021, transform = 'km', global = TRUE)
prop.h_2.2021
          chisq df     p
age2      0.997  1 0.318
dm        1.549  1 0.213
icd10cat2 3.979  1 0.046
sex       4.416  1 0.036
GLOBAL    8.443  4 0.077
Show code
ggcoxzph(prop.h_2.2021)

Interpretation: The global test p-value = 0.077 (p > 0.05) indicates no significant violation of the proportional hazards assumption. However, ICD10 and sex have p-values < 0.05, suggesting potential violations for these covariates. Further investigation may be needed.

  1. Linearity - Score residuals
Show code
score_2_2021 <- resid(model_cox_2_2021, type = "score")
plot(stroke_2_2021$age2, score_2_2021[, "age2"])

Show code
# Linearity for age
ggplot(stroke_2_2021, aes(x = age2, y = residuals(model_cox_2_2021, type = "martingale"))) + geom_point() +
  geom_smooth(method = "loess") + geom_hline(yintercept = 0)

Interpretation: The residuals plotted against age show no obvious systematic pattern or curvature and are randomly scattered around zero. This suggests that the linearity assumption is reasonably satisfied for age in the Cox regression model.

  1. Influential (Martingale residuals)
Show code
ggcoxdiagnostics(model_cox_2_2021, type = 'martingale', linear.predictions = FALSE)

Interpretation: The martingale residuals appear randomly scattered around zero, indicating no significant outliers or influential observations that could unduly affect the Cox regression results.

3. Mixed (Between-within) Repeated Measures ANOVA (Using SPSS)

Question: A researcher has conducted an interventional study to evaluate the effect of a new rehabilitation module in managing post-spinal surgery patients. Thirty (30) patients were randomly assigned into two groups; “new module” and “current module”.

They were measured on their mobilization activities using questionnaire with scores between 0-100 when higher score indicates better management. Endpoint data on the mobility scores were collected at baseline (score-a), 3-month (score-2) and 6-month (score-3).

Using the given dataset “mobility study.sav”,

A) Present the result using tables and plots (in a Word document). (10 marks)

Table should include

  • Descriptive statistics (means and SDs for each group at each time point)
  • ANOVA results (F-values, p-values, effect sizes for time, group, and interaction)
  • Pairwise comparisons if applicable

B) Perform appropriate statistical analysis using SPSS. Interpret all the relevant analytic step and save it into a Microsoft Word document (“mobility.docx”). Also save all the relevant SPSS outputs as “mobility.spv”. (20 marks)

Statistical analysis step using SPSS

  1. Descriptive analysis
  2. Univariable analysis
  3. Model fit (M)
  4. Interaction (I)
  5. Normality of residual (N)
  6. Equal variences (E)
  7. Sphericity (S)
  8. Split file (if interaction is significant)
  9. Post-hoc test (if the interest variable has more than 2 categories)
  10. Final table
  11. Interpretation
  12. Interaction plot

4. ANOVA

Question: IV alcohol and sleep DV performance

Research questions (main effect):

  • Is there any effect of alcohol on performance score?
  • Is there any effect of sleep on the performance score?
  • Is the difference of performance score between the alcohol status the same/vary based on sleep status?

Run the analysis.

Statistical analysis step using SPSS

  1. Descriptive analysis
  2. Univariable analysis
  3. Model fit (M)
  4. Interaction (I)
  5. Normality of residual (N)
  6. Equal variences (E)
  7. Split file/ Syntax (if interaction is significant)
  8. Post-hoc test (if the interest variable has more than 2 categories)
  9. Final table
  10. Interpretation
  11. Interaction plot

MVA 2019/2020

1. Multiple Linear Regression

Question: Data were collected to study factors that were related with HbA1c. The dataset is named as “metab_syndrome” and in .xlsx formats. A model was proposed to examine the relationship between HBA1C with these covariates:

  • HbA1c = glycated haemoglobin (%)
  • AGE = age(years)
  • DMDX = diabetes status (yes/no)
  • MSBPR - mean systolic blood pressure (mmHg)
  • FBS = fasting blood sugar (mmol/L)
  • HDL = high-density lipoprotein (mg/dL)

*** Simulate the dataset using the code below:***

Show code
set.seed(123)
n <- 200
metab_syndrome <- data.frame(
  HbA1c = rnorm(n, mean = 7, sd = 1.5),
  AGE = rnorm(n, mean = 55, sd = 10),
  DM = factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.6, 0.4))),
  MSBPR = rnorm(n, mean = 130, sd = 15),
  FBS = rnorm(n, mean = 6.5, sd = 2),
  HDL = rnorm(n, mean = 45, sd = 10),
  MDBPR = rnorm(n, mean = 80, sd = 10)
)

(a) Write the regression model and equation for MODEL_A. (15 marks)

  1. Glimpse dataset
Show code
glimpse(metab_syndrome)
Rows: 200
Columns: 7
$ HbA1c <dbl> 6.159287, 6.654734, 9.338062, 7.105763, 7.193932, 9.572597, 7.69…
$ AGE   <dbl> 76.98810, 68.12413, 52.34855, 60.43194, 50.85660, 50.23753, 47.1…
$ DM    <fct> yes, yes, yes, yes, yes, no, yes, yes, no, no, yes, no, yes, no,…
$ MSBPR <dbl> 120.97161, 115.09452, 145.40178, 141.26592, 107.36250, 128.57279…
$ FBS   <dbl> 5.043562, 3.419115, 5.113811, 6.737699, 3.770581, 7.679965, 7.07…
$ HDL   <dbl> 34.85886, 37.08686, 47.99594, 61.39052, 55.84617, 38.75433, 53.2…
$ MDBPR <dbl> 89.15992, 88.00622, 70.63431, 65.99213, 81.60278, 77.26038, 70.1…
  1. Built model A
Show code
model_A_1_1920 <- lm(HbA1c ~ AGE + DM + MSBPR + FBS + HDL, data = metab_syndrome)

tidy (model_A_1_1920, conf.int = TRUE)
# A tibble: 6 × 7
  term        estimate std.error statistic     p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>    <dbl>     <dbl>
1 (Intercept)  6.47      1.22        5.29  0.000000331   4.06      8.89  
2 AGE         -0.00157   0.0104     -0.152 0.879        -0.0220    0.0189
3 DMyes        0.179     0.211       0.848 0.397        -0.237     0.594 
4 MSBPR        0.00246   0.00687     0.358 0.721        -0.0111    0.0160
5 FBS          0.0459    0.0499      0.919 0.359        -0.0526    0.144 
6 HDL         -0.00283   0.00997    -0.284 0.777        -0.0225    0.0168
Show code
glance(model_A_1_1920)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1   0.00862       -0.0169  1.43     0.337   0.890     5  -352.  718.  741.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
  1. Write Regression model and equation for Model A
  • Regression model:

\[ \text{HbA1c}_i = \beta_0 + \beta_1 \cdot \text{AGE}_i + \beta_2 \cdot \text{DM}_i + \beta_3 \cdot \text{MSBPR}_i + \beta_4 \cdot \text{FBS}_i + \beta_5 \cdot \text{HDL}_i + \epsilon_i \]

  • Regression equation (using Beta from model_A_1_1920):

\[ \hat{\text{HbA1c}} = 6.47 + -0.002 \cdot \text{AGE} + 0.18 \cdot \text{DM}_{\text{yes}} + 0.002 \cdot \text{MSBPR} + 0.05 \cdot \text{FBS} - 0.003 \cdot \text{HDL} \]

(b) Add variable MDBPR to MODEL_A and name the new model as MODEL_B. Describe the effect of this addition to the model. (5 marks)

  1. Build model B
Show code
model_B_1_1920 <- lm(HbA1c ~ AGE + DM + MSBPR + FBS + HDL + MDBPR, data = metab_syndrome)

tidy (model_B_1_1920, conf.int = TRUE)
# A tibble: 7 × 7
  term        estimate std.error statistic    p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1 (Intercept)  7.26      1.44       5.03   0.00000112   4.41     10.1   
2 AGE         -0.00102   0.0104    -0.0987 0.921       -0.0215    0.0194
3 DMyes        0.196     0.211      0.928  0.355       -0.221     0.613 
4 MSBPR        0.00270   0.00687    0.393  0.694       -0.0109    0.0163
5 FBS          0.0496    0.0501     0.991  0.323       -0.0491    0.148 
6 HDL         -0.00313   0.00997   -0.314  0.754       -0.0228    0.0165
7 MDBPR       -0.0109    0.0106    -1.03   0.306       -0.0317    0.0100
Show code
glance(model_B_1_1920)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0140       -0.0167  1.43     0.457   0.840     6  -351.  719.  745.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Interpretation: The inclusion of MDBPR in Model B does not significantly improve the model, as MDBPR is not significantly associated with HbA1c and its addition does not materially change the estimated effects or significance of the other covariates, indicating no evidence of confounding.

(c) Interpret the variances between regression model and regression estimates. (10 marks)

Both model explain very little variance in HbA1c levels (R-squared values are low).

  • Model A : R² = 0.0086, Adj. R² = -0.017
  • Model B : R² = 0.014, Adj. R² = -0.017

Negative adjusted models suggest that the predictors do not explain the variability in HbA1c better than a simple mean model (null model).

The regression models have poor explanatory power, and adding MDBPR (Model B) does not meaningfully reduce unexplained variance compared with Model A.

(d) Perform comparison between MODEL_A and MODEL_B. Give comment. (5 marks)

Show code
compare_performance(model_A_1_1920, model_B_1_1920)
# Comparison of Model Performance Indices

Name           | Model | AIC (weights) | AICc (weights) | BIC (weights) |    R2
-------------------------------------------------------------------------------
model_A_1_1920 |    lm | 717.6 (0.612) |  718.2 (0.632) | 740.7 (0.891) | 0.009
model_B_1_1920 |    lm | 718.5 (0.388) |  719.3 (0.368) | 744.9 (0.109) | 0.014

Name           | R2 (adj.) |  RMSE | Sigma
------------------------------------------
model_A_1_1920 |    -0.017 | 1.405 | 1.427
model_B_1_1920 |    -0.017 | 1.401 | 1.426

Interpretation: Model A has slightly lower AIC and BIC values compared to Model B, indicating a marginally better fit to the data. However, model B has a slightly higher R-squared value, suggesting it explains a bit more variance in HbA1c levels. Overall, both models perform similarly, with Model A being preferred due to its simplicity and slightly better fit statistics.

(e) Perform model assessment for MODEL_B. Save the relevant plots in the thumb-drive and write the names of the plots in the answer sheet. (10 marks)

(f) Predict the average value of HBA1C for a population with:

▪ AGE = 48 years ▪ DMDX = has history of diabetes (1) ▪ MSBPR = 133 mmHg ▪ MDBPR = 79 mmHg ▪ FBS = 5.6 mmol/l ▪ HDL = 1.3 mmol/l

  1. Linear equation for Model B

\[ \hat{\text{HbA1c}} = 7.26 - 0.001 \cdot \text{AGE} + 0.196 \cdot \text{DM}_{\text{yes}} + 0.003 \cdot \text{MSBPR} + 0.05 \cdot \text{FBS} - 0.003 \cdot \text{HDL} - 0.01 \cdot \text{MDBPR} \]

HbA1c = 7.26 - 0.001(AGE) + 0.196(DM_yes) + 0.003(MSBPR) + 0.05(FBS) - 0.003(HDL) - 0.01(MDBPR) HbA1c = 7.26 - 0.00148 + 0.1961 + 0.003133 + 0.055.6 - 0.0031.3 - 0.0179 HbA1c = 7.2931 ~ 7.3%

2. RM ANOVA

Question : A researcher has conducted an interventional study to evaluate the effect of a new rehabilitation module in managing post-spinal surgery patients. Thirty (30) patients were randomly assigned into two groups; “new module” and “current module”. They were measured on their mobilization activities using questionnaire with scores between 0-100 when higher score indicates better management. Endpoint data on the mobility scores were collected at baseline (score-a), 3-month (score-2) and 6-month (score-3). Using the given dataset “mobility study.sav”,

A) Present the result using tables and plots (in a Word document). (10 marks)

Table should include:

  • Descriptive statistics (means and SDs for each variable)
  • ANOVA results (F-values, p-values, effect sizes for each variable)
  • Pairwise comparisons if applicable

B) Perform appropriate statistical analysis using SPSS. Interpret all the relevant analytic step and save it into a Microsoft Word document (“mobility.docx”). Also save all the relevant SPSS outputs as “mobility.spv”. (20 marks)

RM ANOVA statistical workflow using SPSS:

  1. Descriptive analysis
  2. Univariable analysis
  3. Model fit (M)
  4. Interaction (I)
  5. Normality of residual (N)
  6. Equal variences (E)
  7. Sphericity (S)
  8. Split file (if interaction is significant)
  9. Post-hoc test (if the interest variable has more than 2 categories)
  10. Final table
  11. Interpretation
  12. Interaction plot

3. Survival Analysis

Question 3 The dataset “stroke_outcome.dta” comes from a short prospective survival study. The variable outcome is labelled as dead (1) or alive (0) and the variable days is the duration from the start of follow-up until the last follow-up.

A) Present and interpret the result for:

▪ Overall Kaplan-Meier analysis ▪ Kaplan-Meier analysis on variable ICD10 Comment on the time, number of event and probability of the event for both analyses. (10 marks)

  1. Import data
Show code
stroke_3_1920 <- read_dta("stroke_outcome.dta")
glimpse(stroke_3_1920)
Rows: 226
Columns: 6
$ id      <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17393…
$ days    <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21, 1…
$ gcs     <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15, 1…
$ age     <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83, 44…
$ outcome <dbl+lbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1…
$ icd10   <dbl+lbl> 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1…
  1. Data transform
Show code
stroke_3_1920 <- stroke_3_1920 %>% mutate(across(where(is.labelled), as_factor))
glimpse(stroke_3_1920)
Rows: 226
Columns: 6
$ id      <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17393…
$ days    <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21, 1…
$ gcs     <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15, 1…
$ age     <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83, 44…
$ outcome <fct> dead, alive, alive, alive, alive, dead, alive, dead, alive, al…
$ icd10   <fct> "CI,Others", "CI,Others", "ICB, Other Haemorrhage", "ICB, Othe…
  1. Overall kaplan meier
Show code
KM.overall_3_1920 <- survfit(Surv(time = days,
                                  event = outcome == 'dead') ~ 1, type = "kaplan-meier",
                             data = stroke_3_1920)
tidy(KM.overall_3_1920)
# A tibble: 30 × 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     1    226      13       16    0.942    0.0164     0.973    0.913
 2     2    197       4       24    0.923    0.0194     0.959    0.889
 3     3    169       4       32    0.901    0.0228     0.943    0.862
 4     4    133       4       37    0.874    0.0274     0.923    0.829
 5     5     92       5       20    0.827    0.0371     0.889    0.769
 6     6     67       3        6    0.790    0.0456     0.864    0.722
 7     7     58       4        6    0.735    0.0579     0.824    0.656
 8     8     48       0        4    0.735    0.0579     0.824    0.656
 9     9     44       2        4    0.702    0.0666     0.800    0.616
10    10     38       1        1    0.683    0.0717     0.787    0.594
# ℹ 20 more rows
Show code
# Plot overall survival curve
ggsurvplot(KM.overall_3_1920, data = stroke_3_1920, risk.table = TRUE, linetype = c(1,2), pval = TRUE)

  1. Kaplan meier by icd10
Show code
KM.icd10_3_1920 <- survfit(Surv(time = days,
                                  event = outcome == 'dead') ~ icd10, type = "kaplan-meier",
                             data = stroke_3_1920)
tidy(KM.icd10_3_1920)
# A tibble: 52 × 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata   
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>    
 1     1    149       3       14    0.980    0.0117     1        0.958 icd10=CI…
 2     2    132       3       21    0.958    0.0177     0.991    0.925 icd10=CI…
 3     3    108       0       30    0.958    0.0177     0.991    0.925 icd10=CI…
 4     4     78       2       33    0.933    0.0255     0.981    0.888 icd10=CI…
 5     5     43       1       15    0.911    0.0347     0.976    0.851 icd10=CI…
 6     6     27       1        4    0.878    0.0513     0.970    0.794 icd10=CI…
 7     7     22       2        4    0.798    0.0847     0.942    0.676 icd10=CI…
 8     8     16       0        4    0.798    0.0847     0.942    0.676 icd10=CI…
 9     9     12       0        2    0.798    0.0847     0.942    0.676 icd10=CI…
10    11     10       0        1    0.798    0.0847     0.942    0.676 icd10=CI…
# ℹ 42 more rows
Show code
# Plot survival curve by ICD10
ggsurvplot(KM.icd10_3_1920, data = stroke_3_1920, risk.table = TRUE, linetype = c(1,2,3), pval = TRUE)

  1. Interpretation:
  • Overall Kaplan Meier: The Kaplan-Meier curve shows a steady decline in survival probability over time, with the most substantial drop occurring early in the follow-up. By day 10, the survival probability falls to approximately 68%, indicating a high early mortality rate in the cohort.
  • Kaplan Meier by ICD10: Survival differs significantly across ICD10 diagnostic groups (p = 0.033). Patients with SAH and ICB/Other Haemorrhage show lower survival probabilities compared to those with CI/Others. This suggests that diagnosis type is a significant predictor of mortality, with haemorrhagic strokes associated with poorer outcomes.
  • Compare overall and ICD10 : The overall curve masks important differences between diagnostic groups. While the pooled analysis shows a general decline in survival, the ICD10-stratified curves reveal that haemorrhagic subtypes drive the poorer outcomes. Thus, stratification provides more clinically meaningful insight, showing that survival is not uniform across stroke types.

B) The analyst proposed to model the log hazard using the Cox proportional hazard (PH) model. Comment on his proposal. (5 marks)

  • The proposal to model the log hazard using a Cox proportional hazards model is appropriate because the outcome of interest is time to event with possible censoring.
  • The Cox model allows estimation of log hazard without specifying the baseline hazard function.
  • However, the validity of this approach depends on the proportional hazards assumption, which requires that the log hazard for covariates remain constant over time.
  • This assumption should be formally assessed, for example using Schoenfeld residuals.
  • If the proportional hazards assumption is violated, alternative approaches such as stratified Cox models should be considered.

(c) Estimate the Cox PH model with covariate: sex, GCS, ICD10, SBP, DBP. Present the results in a table. Interpret the result for SBP and ICD10. (10 marks)

Fit cox regression model

Show code
model_cox_3_1920 <- coxph(Surv(time = days,
                               event = outcome == 'dead') ~ age + gcs + icd10,
                          data = stroke_3_1920)
tidy(model_cox_3_1920, conf.int = TRUE)
# A tibble: 4 × 7
  term                   estimate std.error statistic p.value conf.low conf.high
  <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 age                      0.0321    0.0113     2.84  4.45e-3  0.00999    0.0543
2 gcs                     -0.189     0.0344    -5.49  3.99e-8 -0.257     -0.122 
3 icd10SAH                 0.409     0.423      0.966 3.34e-1 -0.421      1.24  
4 icd10ICB, Other Haemo…   0.480     0.331      1.45  1.47e-1 -0.169      1.13  

Interpretation:

  • Age: Each additional year of age is associated with a 0.03 times higher log hazard of death, after adjusting for other covariates (Log Hazard = 0.03, 95% CI: 0.01, 0.05, p = 0.004).
  • GCS: Each additional point increase in GCS score is associated with a 0.19 times lower log hazard of death, after adjusting for other covariates (Log Hazard = -0.19, 95% CI: -0.26,-0.12, p < 0.001).
  • ICD10: Patients with haemorrhagic stroke have a 0.48 times higher log hazard of death compared to those with ischaemic stroke and SAH, after adjusting for other covariates (Log Hazard = 0.48, 95% CI: -0.17,1.13, p = 0.48). However, this association is not statistically significant.

(d) Test the assumption for proportional hazard. Save the plot in the thumb-drive and write the name of the plot in the answer sheet. (5 marks)

  1. Assessment: PH Assumption (Schoenfeld Residuals)
Show code
prop.h_3_1920 <- cox.zph(model_cox_3_1920, transform = 'km', global = TRUE)
ggcoxzph(prop.h_3_1920)

  1. Assessment: Functional Form (Linearity)