---
title: "Practice Multivariable Analysis 2026"
author: "Kang Munir"
date: "`r Sys.Date()`"
format:
html:
theme: cerulean
toc: true
toc-expand: 1
toc-location: left
code-fold: true
code-summary: "Show code"
code-tools: true
number-sections: false
number-depth: 3
execute:
echo: true
warning: false
message: false
comment: NA
---
# MVA 2024/2025
## Load all the library using pacman
```{r}
library(pacman)
# load all library
p_load(lmtest, survival, broom, caret, pacman, gtsummary, lubridate,
survminer, pROC, corrplot, ggeffects, skimr, haven,
officer, mfp, car, flextable, psych, ResourceSelection,
readxl, summarytools, performance, tidyverse)
```
# 1. Multiple Linear Regression
Data Quality of life (qol_data.csv). Outcome is quality of life score. 3 covariates: job experience (years), gender (male, female) and comorbidity (yes, no)
## A) Build model without interaction, interpret (10marks)
1. Import and summary data
```{r}
# import data
qol <- read.csv("qol_data.csv")
glimpse(qol)
```
2. Change character variables to factors
```{r}
qol <- qol %>%
mutate(across(where(is.character), as.factor))
# recode label variable
attr(qol$quality_of_life, "label") <- "Quality of Life Score"
attr(qol$job_experience, "label") <- "Job Experience (years)"
attr(qol$gender, "label") <- "Gender"
attr(qol$comorbidity, "label") <- "Comorbidity"
glimpse(qol)
```
3.. Fit multiple linear regression model without interaction
```{r}
model1 <- lm(quality_of_life ~ job_experience + gender + comorbidity, data = qol)
tidy(model1, intercept = TRUE)
# Or can use regression table
tbl_regression(model1, intercept = TRUE)
```
4. Interpret coefficients
- Job Experience: Each additional year of job experience is associated with a 0.49-point decrease in quality of life score, adjusted for gender and comorbidity (95% CI: -0.62 to -0.36, p \< 0.001).
- Gender: Male participants have a quality of life score that is 9.6 points lower than female participants, adjusted for job experience and comorbidity (95% CI: -12.0 to -7.4, p \< 0.001).
- Comorbidity: Participants with comorbidity have a quality of life score that is 14 points lower than those without comorbidity, adjusted for job experience and gender (95% CI: -16.0 to -12.0, p \< 0.001).
## B) Build model with interaction between gender & comorbid, present in table (5mark
1. Fit multiple linear regression model with interaction
```{r}
model2 <- lm(quality_of_life ~ job_experience + gender + comorbidity + gender:comorbidity, data = qol)
tidy(model2)
# Or can use regression table
tbl_regression(model2, intercept = TRUE)
```
2. Interpret interaction term
The interaction term between gender and comorbidity is statistical significant (p \< 0.05), indicating that the effect of comorbidity on quality of life score differs by gender.
## C) Provide appropriate plot, copy in words/in jpeg file (5marks)
1. Plot interaction each of covariate
```{r}
# interaction plot for quality of life vs job experience
ggplot(qol, aes(x = job_experience, y = quality_of_life)) +
geom_point() +
geom_smooth(method = "lm") +
# add labels and theme
labs(title = "Quality of Life Score vs Job Experience",
x = "Job Experience (years)",
y = "Quality of Life Score") +
theme_minimal()
# interaction plot for quality of life vs gender
ggplot(qol, aes(x = gender, y = quality_of_life)) +
geom_boxplot() +
# add labels and theme
labs(title = "Quality of Life Score by Gender",
x = "Gender",
y = "Quality of Life Score") +
theme_minimal()
# interaction plot for quality of life vs comorbidity
ggplot(qol, aes(x = comorbidity, y = quality_of_life)) +
geom_boxplot() +
# add labels and theme
labs(title = "Quality of Life Score by Comorbidity",
x = "Comorbidity",
y = "Quality of Life Score") +
theme_minimal()
```
2. Interaction plot between quality of life score by gender, job experiencexx\` and comorbidity.
```{r}
interact.plot <- ggpredict(model2, terms = c("gender", "comorbidity", "job_experience"))
plot(interact.plot, connect_lines = T)
```
3. Interpret plot
- People without comorbidity have higher quality of life scores.
- Females without comorbidity have the highest scores.
- Males without comorbidity also have high scores, but slightly lower than females.
- Comorbidity lowers scores for both genders.
- Males with comorbidity have the lowest scores.
- The effect of comorbidity is stronger in males than in females.
## D) Predict based on (working experience: 5/15/25 years), both comorbidities, both gender. proof one using manual calculation (10marks)
1. Create prediction data frame
```{r}
pred_data <- expand.grid( job_experience = c(5, 15, 25),
gender = c("female", "male"),
comorbidity = c("no", "yes") )
augment(model2, newdata = pred_data)
```
2. Manual calculation for one scenario (e.g., 15 years experience, male with comorbidity)
The linear regression model predicting Quality of Life Score is:
$$
\text{Quality of Life Score} = 80 - 0.52 \cdot \text{Job Experience} - 3.7 \cdot \text{Male} - 8.7 \cdot \text{Comorbidity} - 11 \cdot (\text{Male} \times \text{Comorbidity})
$$
Where:
- (\beta\_0) is the intercept.
- Job Experience is measured in years.
- Male is coded as 1 for male, 0 for female.
- Comorbidity is coded as 1 for yes, 0 for no.
- The interaction term is 1 only when both gender is male and comorbidity is present.
Example : Male with comorbidity, 15 years of experience.
QOL = 80 - 0.52(15) - 3.7(1) - 8.7(1) - 11(1\*1) = 48.8
# 2. Multiple Logistic Regression
Question: Using Quality of Life data (qol2.dta). Outcome is quality of life categorized into poor and good. 3 covariates: job experience (years), gender (male, female) and comorbidity (yes, no)
## A) What is analysis you would like to use and justify (10 marks)
Answer: Multiple Logistic Regression
Justification:
**Outcome variable**
- Quality of life (QoL) is **binary**: *Poor* vs *Good*
Because the dependent variable is **dichotomous**, **binary logistic regression** is the most appropriate analysis.
**Justification (key points for marks)**
1. **Nature of outcome**
- Logistic regression is suitable when the outcome is categorical with two levels.
- Linear regression is inappropriate because it can predict values outside 0–1 and violates normality and homoscedasticity assumptions.
2. **Interpretability**
- Logistic regression estimates **odds ratios (ORs)**, which are standard and interpretable measures in epidemiology and public health.
- ORs quantify the association between predictors and the odds of having **good quality of life**.
3. **Covariate structure**
- Job experience: continuous predictor (years)
- Gender: categorical (male/female)
- Comorbidity: categorical (yes/no)
- Logistic regression can accommodate **both continuous and categorical predictors** in the same model.
4. **Adjustment for confounding**
- Multivariable logistic regression allows adjustment for potential confounders, giving **adjusted odds ratios**.
5. **Interaction assessment**
- The research question explicitly asks for an **interaction between gender and comorbidity**.
- Logistic regression allows inclusion and interpretation of interaction (effect modification) terms.
6. **Public health relevance**
- Logistic regression is widely used in quality-of-life and health outcomes research, making results comparable with existing literature.
Therefore, multivariable binary logistic regression is the most appropriate analysis to model the association between job experience, gender, comorbidity, and quality of life, while allowing assessment of interaction effects.
## B) Build model with interaction between gender & comorbid, present in table (5marks)
1. Import data
```{r}
qol2 <- read_dta("qol2.dta")
glimpse(qol2)
```
2. Change labelled variables to factors
```{r}
qol2 <- qol2 %>%
mutate(across(where(is.labelled), as_factor))
glimpse(qol2)
```
3. Fit multiple logistic regression model with interaction
```{r}
model_logit <- glm(quality_of_life ~ gender + comorbidity + job_experience + gender:comorbidity,
data = qol2,
family = binomial)
tbl_regression(model_logit,intercept = T, exponentiate = TRUE, conf.int = TRUE)
```
4. Interpretation
- Gender: After adjusting for comorbidity and the interaction term, male healthcare workers have significantly higher odds of good quality of life compared with females (OR = 4.95, 95% CI: 1.48–18.9, p = 0.013).
- Comorbidity: Healthcare workers with comorbidity have higher odds of good quality of life compared with those without comorbidity; however, this association is not statistically significant (OR = 2.12, 95% CI: 0.72–6.42, p = 0.20).
- Interaction term (gender × comorbidity): The interaction between gender and comorbidity is not statistically significant (OR = 0.97, 95% CI: 0.15–7.10, p \> 0.9), indicating that the association between comorbidity and quality of life does not differ by gender.
## C) Provide predicted values for the first observation (5 marks)
The interaction was not significant, so we will use the model without interaction to predict the probability for the first observation.
1. Fit multiple logistic regression model without interaction
```{r}
model_logitfull <- glm(quality_of_life ~ gender + comorbidity + job_experience,
data = qol2,
family = binomial)
tbl_regression(model_logitfull,intercept = T, conf.int = TRUE)
```
2. Predict probability for the first observation (model with log odd)
```{r}
predict(model_logitfull, newdata = qol2[1, ], type = "response")
```
3. Manual calculation using log-odds
The logistic regression equation is: $$ \text{logit}(p) = \beta_0 + \beta_1 \cdot \text {gender (m = 1)} + \beta_2 \cdot \text{comorbidity (yes = 1)} + \beta_3 \cdot \text{job experience} $$
Where:
- β0 = -2.1 (intercept)
- β1 = 1.6
- β2 = 0.93
- β3 = 0.12
For the first observation (male, no comorbidity, 9.3 years experience):
Logit(p) = -2.1 + 1.6(1) + 0.793(0) - 0.12(9.3) = -1.057
Convert log-odds to probability: $$ p = \frac{e^{\text{logit}(p)}}{1 + e^{\text{logit}(p)}} = \frac{e^{-1.057}}{1 + e^{-1.057}} \approx 0.62 $$
Interpretation: The predicted probability of having good quality of life for the first observation is 0.62 (62%).
# 3. Cox Propoportional Hazard Regression
Question: Using survival data (stroke_e.dta). Outcome is stroke death. 3 covariates: age ( years), gender (male, female) and diabetes (yes, no) and ICD 10 (ischemic stroke, hemorrhagic).
## A) Illustrate the statistical workflow , from beginning until communicate to the stakeholder (5 marks)
Mnemonic : **I took tiktok video muka model comel**
1. Import
2. Tidy
3. Transform
4. Visualize
5. Model
6. Communicate
## B) Perform the Cox Proportional Hazard Regression model with covariates gender, age, dm & icd10. Present the result in a table. Interpret the result for age and icd10 (10 marks)
1. Import data
```{r}
stroke <- read_dta("stroke_e.dta")
glimpse(stroke)
```
2. Change labelled variables to factors
```{r}
stroke <- stroke %>%
mutate(across(where(is.labelled), as_factor))
glimpse(stroke)
```
3. Label variable
```{r}
attr(stroke$time, "label") <- "Time to Stroke death (months)"
attr(stroke$status2, "label") <- "Stroke death"
attr(stroke$age2, "label") <- "Age (years)"
attr(stroke$sex, "label") <- "Gender"
attr(stroke$dm, "label") <- "Diabetes (yes/no)"
attr(stroke$icd10cat2, "label") <- "ICD10"
attr(stroke$gcs, "label") <- "Glasgow Coma Scale"
```
4. Fit Cox Proportional Hazard Regression model
```{r}
model_cox <- coxph(
Surv(time = time, event = status2 == "dead") ~
age2 + sex + dm + icd10cat2,
data = stroke
)
tbl_regression(model_cox, exponentiate = TRUE, conf.int = TRUE)
```
4. Interpretation
- Age: Age is significantly associated with stroke recurrence. Each additional year of age is associated with a 1.03 increase in the hazard of stroke death, after adjusting for gender, diabetes, and stroke type. (95% CI: 1.01, 1.05, p 0.006).
- ICD10: Patients with hemorrhagic stroke have a 2.69 times higher hazard of stroke death compared to those with ischemic stroke, after adjusting for gender, age, and diabetes (95% CI: 1.46, 4,94, p = 0.001).
## C)add on gcs, explain about the regression estimate for icd10 (5 marks)
1. Fit Cox Proportional Hazard Regression model with gcs
```{r}
model_cox2 <- coxph(
Surv(time = time, event = status2 == "dead") ~
age2 + sex + dm + icd10cat2 + gcs, data = stroke)
tbl_regression(model_cox2, exponentiate = TRUE, conf.int = TRUE)
```
2. Interpretation for ICD10
After add adjusting for Glasgow Coma Scale, patients with haemorrhagic stroke have a 69% higher hazard of stroke recurrence compared with patients with ischaemic stroke; However, the association is no longer statistically significant (HR: 1.69, 95% CI: 0.91, 3.13, p = 0.10). This suggests that the initial observed association between stroke type and stroke recurrence may be confounded by Glasgow Coma Scale scores.
# MVA 2023/2024
# 1. Analysis workflow
Question: Before formal data analysis, researchers are encouraged to develop data analysis workflow. - Using a flow chart, generate data analysis workflow that will help with your research data. Briefly describe the objective for each stage of the workflow (10 marks)
Mnemonic : **I took tiktok video muka model comel**
1. Import
- Objective: Load raw data from various sources (e.g., CSV, Excel, databases) into the analysis environment for further processing.
2. Tidy
- Objective: Clean and structure the data into a consistent format (e.g., long format) to facilitate analysis, ensuring variables are properly labeled and missing values are addressed.
3. Transform
- Objective: Create new variables, recode existing ones, and perform necessary calculations to prepare the data for analysis, enhancing its usability and interpretability.
4. Visualize
- Objective: Generate graphical representations of the data to explore patterns, trends, and relationships among variables, aiding in understanding and hypothesis generation.
5. Model
- Objective: Apply appropriate statistical or machine learning models to analyze the data, test hypotheses, and derive insights relevant to the research questions.
6. Communicate
- Objective: Present the analysis results through reports, presentations, or dashboards, effectively conveying findings to stakeholders for informed decision-making.
```{r}
library(DiagrammeR)
grViz("
digraph data_analysis_workflow {
node [shape = rectangle, style = filled, fillcolor = LightBlue]
Import [label = 'Import Data']
Tidy [label = 'Tidy Data']
Transform [label = 'Transform Data']
Visualize [label = 'Visualize Data']
Model [label = 'Model Data']
Communicate [label = 'Communicate Results']
Import -> Tidy -> Transform -> Visualize -> Model -> Communicate
}
")
```
# 2. Mixed (Within-Between RM ANOVA)
Question: A randomized controlled trial was conducted to evaluate the effectiveness of the HIV Counselling Module (HIV-CM) in increasing healthcare workers (HCW) awareness on stigma and discrimination toward key populations of HIV. Thirty (n=30) HCWs were randomly assigned into two groups which consisted of 15 respondents in the intervention group "HIV-CM" while another 15 respondents to "Controls". They were measured on their self-awareness scores (SAS) at baseline (SAS- 1), 2 weeks (SAS-2) and 6 weeks (SAS-3). The scales are between 0-100, where a higher score indicates better awareness.
## A) State all possible research questions. (6 marks)
Research question:
**Within-group (time effect)**
- Is there a significant change in self-awareness scores (SAS) over time (baseline, 2 weeks, 6 weeks) among healthcare workers?
**Between-group (group effect)**
- Is there a significant difference in overall self-awareness scores between the HIV-CM group and the control group?
**Interaction (group × time effect)**
- Does the change in self-awareness scores over time differ between the HIV-CM group and the control group?
## B) State, with reasons, the appropriate statistical test to answer all research questions in Q2(A). (4 marks)
**Justification :**
1. **Outcome variable**
- SAS is continuous and approximately normally distributed.
2. **Within-subject factor**
- Time has **three repeated measurements** (SAS-1, SAS-2, SAS-3).
3. **Between-subject factor**
- Group has **two independent levels** (HIV-CM vs Control).
4. **Research objectives**
- The study aims to examine **time effects**, **group effects**, and **group × time interaction**, which are directly tested using mixed repeated measures ANOVA.
## C) Using the given dataset HIV-CM.sav, perform statistical analysis in SPSS using the test stated in Q2(a). Interpret all the relevant analytic steps and save them into a Microsoft Word document (HIV-CM analysis.docx). Save all relevant SPSS outputs as HIV-CM.spv (25 marks)
**Step 1: Open data**
> File → Open → Data → HIV-CM.sav
Check:
- group (HIV-CM / Control) → Between-subjects factor
- SAS_1, SAS_2, SAS_3 → Repeated measures
**Step 2: Run Mixed Repeated Measures ANOVA**
SPSS Menu path:
> Analyze → General Linear Model → Repeated Measures
Define factors:
- Within-Subject Factor Name: time
- Number of Levels: 3
- Define → assign:
- DSCA1 → Time 1
- DSCA2 → Time 2
- DSCA3 → Time 3
- Between-Subjects Factor:
- Group (HIV-CM vs Control)
Options:
- Descriptive statistics
- Estimates of effect size
- Pairwise comparisons (Bonferroni)
- Mauchly’s test of sphericity
**Step 3: Assumption checking (IMPORTANT FOR MARKS)**
1. Normality
- SAS is continuous and approximately normally distributed → assumption reasonably met.
2. Sphericity (Mauchly’s Test)
- If p \> 0.05 → sphericity assumed
- If p \< 0.05 → use Greenhouse-Geisser correction
*state which correction was used.*
**Step 4: Interpretation of key outputs**
1. Time effect
- A significant main effect of time indicates that self-awareness scores changed significantly across the three time points.
2. Group effect
- A significant main effect of group indicates that overall self-awareness scores differed between the HIV-CM and control groups.
3. Group × Time interaction (MOST IMPORTANT)
- A significant interaction indicates that the pattern of change in self-awareness scores over time differs between the HIV-CM and control groups, demonstrating intervention effectiveness.
## D) Present the results from Q2(c) in a table and conclude your findings (5 marks)
Table should include :
- Descriptive statistics (means and SDs for each group at each time point)
- ANOVA results (F-values, p-values, effect sizes for time, group, and interaction)
- Pairwise comparisons if applicable
# 3. Multiple Linear Regression
Question: A dataset named mydata.csv contains the following variables:
- id: id of participants
- score: performance score (the outcome)
- marker1: biomarker
- group: a variable with group 1 (grp1) or group 2 (grp2)
## A) Write a linear regression model and a regression equation to show the relationship between the average score as a function of marker1 and group and an interaction between marker1 and group. (4 marks)
**Linear regression model:**
$$ \text{score}_i = \beta_0 + \beta_1 \cdot \text{marker1}_i + \beta_2 \cdot \text{group2}_i + \beta_3 \cdot (\text{marker1}_i \times \text{group2}_i) + \epsilon_i $$
Where:
- score_i: Performance score for participant i
- marker1_i: Biomarker value for participant i
- group2_i: Indicator variable for group 2 (1 if grp2, 0 if grp1)
- β0: Intercept (mean score for grp1 when marker1 = 0
- β1: Effect of marker1 on score for grp1
- β2: Difference in mean score between grp2 and grp1 when marker1 = 0
- β3: Interaction effect between marker1 and group2
- εi: Random error term
**Regression equation:**
$$ \hat{\text{score}} = \beta_0 + \beta_1 \cdot \text{marker1} + \beta_2 \cdot \text{group2} + \beta_3 \cdot (\text{marker1} \times \text{group2}) $$
## B) Using R, perform the analysis and present the results in a table. Interpret the interaction term. (6 marks)
1. Import data
```{r}
mydata <- read.csv("mydata.csv")
glimpse(mydata)
```
2. Change character variables to factors
```{r}
mydata <- mydata %>%
mutate(across(where(is.character), as.factor))
glimpse(mydata)
# add label variable
attr(mydata$score, "label") <- "Performance Score"
attr(mydata$marker1, "label") <- "Biomarker 1"
attr(mydata$group, "label") <- "Group"
```
3. Fit multiple linear regression model with interaction
```{r}
model_mlr <- lm(score ~ marker1 * group, data = mydata)
tbl_regression(model_mlr, intercept = TRUE)
```
4. Interpretation
- Biomarker 1: Each unit increase in biomarker 1 is associated with a 1.4-point increase in performance score, after adjusting for group (95% CI: 1.1, 1.8, p \< 0.001).
- Group: Group 2 has a 0.03-point lower performance score compared to group 1, after adjusting for biomarker 1 (95% CI: 1.0, 9.0, p = 0.01).
- Interaction term: The interaction term between biomarker 1 and group is statistically not significant (p \> 0.05), indicating that the effect of biomarker 1 on performance score does not differ by group.
## C) Show calculations for the 95% confidence intervals of the interaction term and interpret it. (5 marks)
1. Extract coefficient and standard error for interaction term
```{r}
tidy(model_mlr, conf.int = TRUE)
```
2. Formula to calculate 95% CI:
$$ 95\% \; CI = \hat{\beta} \pm 1.96 \times SE $$
Where:
- β: Coefficient estimate for interaction term
- SE: Standard error of the coefficient
3. Manual calculation for interaction term
β3 (marker1:group2) = 0.535 SE = 0.296
- Lower bound = 0.535 - 1.96 \* 0.296 = -0.045
- Upper bound = 0.535 + 1.96 \* 0.296 = 1.115
# 4. Survival Analysis
Question: The dataset stroke_outcome.dta comes from a short survival study. The variable outcome is labelled as either dead (1) or censored (0) and the variable days is the duration from the start of follow-up until the last follow-up.
## A) Perform the Kaplan-Meier analysis for the overall survival. Interpret the finding based on number at risk, number of events and survival probabilities at time 5 and 29. (5 marks)
1. Import data
```{r}
stroke_outcome <- read_dta("stroke_outcome.dta")
glimpse(stroke_outcome)
```
2. Change labelled variables to factors
```{r}
stroke_outcome <- stroke_outcome %>%
mutate(across(where(is.labelled), as_factor))
glimpse(stroke_outcome)
# add label variable
attr(stroke_outcome$days, "label") <- "Time to Stroke death (days)"
attr(stroke_outcome$outcome, "label") <- "Stroke outcome"
attr(stroke_outcome$icd10, "label") <- "ICD10"
attr(stroke_outcome$gcs, "label") <- "Glasgow Coma Scale"
attr(stroke_outcome$age, "label") <- "Age (years)"
```
\*\*\* There is missing data in out dataset \*\*\*
```{r}
# Check for missing data
colSums(is.na(stroke_outcome))
# Ampute missing data
stroke_outcome$gcs[is.na(stroke_outcome$gcs)] <- median(stroke_outcome$gcs, na.rm = TRUE)
# Check again for missing data
colSums(is.na(stroke_outcome))
```
3. Fit Kaplan-Meier survival model and summarize at time 5 and 29
```{r}
KM.overall <- survfit(
Surv(time = days, outcome == "dead") ~ 1, type = "kaplan-meier",
data = stroke_outcome
)
summary(KM.overall)
#
ggsurvplot(KM.overall, data = stroke_outcome, risk.table = TRUE,
linetype = c(1,2), pval = TRUE)
```
4. Interpretation
At time 5 days:
- Number at risk: 92 patients were still under follow-up just before day 5
- Number of events: 30 deaths had occurred on day 5
- Survival probability: The estimated probability of survival at day 5 is 0.827 (82.7%) (95% CI: 0.769, 0.889)
At time 29 days:
-Number at risk: 5 patients were still under follow-up just before day 29 - Number of events: 22 deaths had occurred on day 29 - Survival probability: The estimated probability of survival at day 29 is 0.264 (26.4%) (95% CI: 0.131, 0.531)
Kaplan–Meier analysis demonstrates a marked decline in survival over time, with high early mortality and a substantial reduction in the number of patients at risk by day 29.
## B) Perform the Kaplan-Meier analysis for icd10. Generate a table for survival probabilities. (5 marks)
1. Fit Kaplan-Meier survival model by icd10 and summarize
```{r}
KM.icd10 <- survfit(
Surv(time = days, outcome == "dead") ~ icd10, type = "kaplan-meier",
data = stroke_outcome
)
tidy(KM.icd10)
# Plot survival curves by icd10
ggsurvplot(KM.icd10, data = stroke_outcome, risk.table = TRUE,
linetype = c(1,2,3), pval = TRUE)
```
## C) Perform the Cox Proportional Hazard Regression model with covariates gcs, icd10 and age. Present the result in a table. Interpret the result for age and icd10. (10 marks)
1. Fit Cox Proportional Hazard Regression model
```{r}
model_cox_stroke <- coxph(
Surv(time = days, event = outcome == "dead") ~
gcs + icd10 + age,
data = stroke_outcome
)
tbl_regression(model_cox_stroke, conf.int = TRUE)
```
2. Interpretation
- Age: Each additional year of age is associated with a 1.03 increase in log hazard of death, after adjusting for Glasgow Coma Scale and stroke type (95% CI: 1.01, 1.06, p = 0.004).
- ICD10: Patients with Subarachnoid Hemorrhage stroke (SAH) have a 1.51 times higher log hazard of death compared to Cerebral Infarction and Intracerebral Bleeding (ICB), after adjusting for Glasgow Coma Scale and age (95% CI: 0.66, 3.09, p = 0.3). While, patient with Intracerebral Bleeding stroke have a 1.62 times higher log hazard of death compared to Cerebral Infarction and SAH, after adjusting for Glasgow Coma Scale and age (95% CI: 0.84, 3.09, p = 0.3). Both associations are not statistically significant.
## D) Test the proportional hazard assumptions. (10 marks)
1. Linearity assumption (continuous covariates)
```{r}
ggcoxfunctional(Surv(time = days, event = outcome == 'dead') ~ age + gcs,
data = stroke_outcome)
```
Interpretation: Age and gcs fairly flat and horizontal, indicating that the linearity assumption is reasonably met.
2. Proportional hazard test using Schoenfeld residuals
```{r}
ph_test <- cox.zph(model_cox_stroke, transform = 'km', global = TRUE)
ph_test
# Plot Schoenfeld residuals
ggcoxzph(ph_test)
```
Interpretation: The global test p-value is 0.33 (p \> 0.05), indicating that the proportional hazards assumption is reasonably met for the overall model. Individually, gcs (p = 0.80), icd10 (p = 0.16), and age (p = 0.34) also show no evidence of violation of the proportional hazards assumption.
4. Residuals analysis for continuous covariates
```{r}
score <- resid(model_cox_stroke, type = "score")
plot(stroke_outcome$gcs, score[, "gcs"])
plot(stroke_outcome$age, score[, "age"])
```
# 5. Briefly describe the disadvantages of stepwise variable selection during data analysis. (5 marks)
**Disadvantages of stepwise variable selection :**
- **May exclude clinically or theoretically important variables**
Stepwise selection relies on statistical significance rather than subject-matter knowledge. As a result, important confounders (e.g. age in stroke outcome studies) may be excluded if they are not statistically significant in the sample, leading to residual confounding.
- **Data-driven and sample-dependent**
Stepwise methods are highly dependent on the observed sample. If the sample is not representative of the target population, the selected model may not generalise well, resulting in poor external validity.
- **Model instability**
Small changes in the data can lead to different variables being selected. This instability reduces the reproducibility of results and makes the final model unreliable.
- **Inflated Type I error and biased estimates**
Repeated testing during variable selection increases the risk of false-positive findings and can produce biased regression coefficients and p-values.
- **Ignores causal structure and research context**
Stepwise selection does not consider causal relationships, temporality, or prior evidence. In contrast, manual or theory-driven variable selection allows researchers to incorporate biological plausibility and epidemiological reasoning into the model.
# MVA 2021/2022
# 1. Analysis Workflow
Question: Describe the workflow involved from the reading of the dataset to communicating the findings to the stakeholders. (10 marks)
Mnemonic : **I took tiktok video muka model comel**
1. Import
- Objective: Load raw data from various sources (e.g., CSV, Excel, databases) into the analysis environment for further processing.
2. Tidy
- Objective: Clean and structure the data into a consistent format (e.g., long format) to facilitate analysis, ensuring variables are properly labeled and missing values are addressed.
3. Transform
- Objective: Create new variables, recode existing ones, and perform necessary calculations to prepare the data for analysis, enhancing its usability and interpretability.
4. Visualize
- Objective: Generate graphical representations of the data to explore patterns, trends, and relationships among variables, aiding in understanding and hypothesis generation.
5. Model
- Objective: Apply appropriate statistical or machine learning models to analyze the data, test hypotheses, and derive insights relevant to the research questions.
6. Communicate
- Objective: Present the analysis results through reports, presentations, or dashboards, effectively conveying findings to stakeholders for informed decision-making.
```{r}
library(DiagrammeR)
grViz("digraph data_analysis_workflow {
node [shape = rectangle, style = filled, fillcolor = LightBlue]
Import [label = 'Import Data']
Tidy [label = 'Tidy Data']
Transform [label = 'Transform Data']
Visualize [label = 'Visualize Data']
Model [label = 'Model Data']
Communicate [label = 'Communicate Results']
Import -> Tidy -> Transform -> Visualize -> Model -> Communicate
}")
```
# 2. Mixed (Between-within) Repeated Measures ANOVA (Using SPSS)
## A) Run Analysis
**Step 1: Open data**
> File → Open → Data → HIV-CM.sav
Check: - group (HIV-CM / Control) → Between-subjects factor - DSCA1, DSCA2, DSCA3 → Repeated measures
**Step 2: Run Mixed Repeated Measures ANOVA**
SPSS Menu path:
> Analyze → General Linear Model → Repeated Measures
Define factors:
- Within-Subject Factor Name: time
- Number of Levels: 3
- Define → assign:
- DSCA1 → Time 1
- DSCA2 → Time 2
- DSCA3 → Time 3
- Between-Subjects Factor:
- Group (HIV-CM vs Control)
Options:
- Descriptive statistics
- Estimates of effect size
- Pairwise comparisons (Bonferroni)
- Mauchly’s test of sphericity
**Step 3: Assumption checking (IMPORTANT FOR MARKS)**
1. Normality
- SAS is continuous and approximately normally distributed → assumption reasonably met.
2. Sphericity (Mauchly’s Test)
- If p \> 0.05 → sphericity assumed
- If p \< 0.05 → use Greenhouse-Geisser correction
*state which correction was used.*
**Step 4: Interpretation of key outputs**
1. Time effect
- A significant main effect of time indicates that self-awareness scores changed significantly across the three time points.
2. Group effect
- A significant main effect of group indicates that overall self-awareness scores differed between the HIV-C M and control groups.
3. Group × Time interaction (MOST IMPORTANT)
- A significant interaction indicates that the pattern of change in self-awareness scores over time differs between the HIV-CM and control groups, demonstrating intervention effectiveness.
## B) Present the results using tables and plots in MS Word. Name the document as results.docx. (5 marks)
Table should include: - Descriptive statistics (means and SDs for each group at each time point) - ANOVA results (F-values, p-values, effect sizes for time, group, and interaction) - Pairwise comparisons if applicable
Plot should include: - Line plot showing mean self-awareness scores over time for each group with error bars representing standard errors.
# 3. Multiple Log Regression
## A) The analyst has proposed TWO (2) candidate models:
- Model_A : covariates are DMDX, FBS and MOGTT2
- Model_B : covariates are AGE and TOTCHOL
### i) Compare Model_A and Model_B. Show your results. (4 marks)
1. Import dataset
```{r}
mlr_3_2122 <- read.csv("Q3.csv")
glimpse(mlr_3_2122)
```
2. Change character variables to factors
```{r}
mlr_3_2122 <- mlr_3_2122 %>%
mutate(across(where(is.character), as.factor))
glimpse(mlr_3_2122)
```
3. Fit both models and compare
```{r}
# Fit both model
model_A <- glm(DMDX ~ FBS + MOGTT2H, data = mlr_3_2122, family = binomial)
model_B <- glm(DMDX ~ AGE + TOTCHOL, data = mlr_3_2122, family = binomial)
tidy(model_A, conf.int = TRUE)
tidy(model_B, conf.int = TRUE)
# Compare models using performance metrics
compare_performance(model_A, model_B)
```
Interpretation: Model_A has lower AIC and BIC values, indicating a better fit to the data compared to Model_B. Additionally, Model_A has a higher R-squared value, suggesting it explains more variance in the outcome variable (DMDX).
### ii) Establish the model of your choice. Give reasons. (4 marks)
**Model of choice : Model_A (DMDX \~ FBS + MOGTT2)**
Based on the model comparison results, I would choose Model_A (DMDX \~ FBS + MOGTT2) for the following reasons:
1. Better fit: Model_A has lower AIC and BIC values, indicating it fits the data better than Model_B.
2. Higher explanatory power: Model_A has a higher R-squared value, meaning it explains more variance in the outcome variable (DMDX).
3. Clinical relevance: FBS and MOGTT2 are directly related to glucose metabolism and are likely more relevant predictors of DMDX compared to AGE and TOTCHOL.
## B) Using your model of choice from (A), perform a model assessment. Show the results and comment on the findings. (8 marks)
1.
# 4. Multiple Logistic Regression
## A) A dataset named Q4_4_2122 has SIX (6) variables:
Outcome – alive or dead (the outcome variable)
Sepsis – yes or no Creatinine – in mmol/l
Age – in years old
Perforation – size of perforation in cm
Hemoglobin – in mg/dl
The binary logistic regression analysis was chosen to assess the relationship between the outcome and the rest of the variables.
**Simulate the dataset using the code below:**
```{r}
set.seed(123)
n <- 300
# Generate predictors
Sepsis <- factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.4, 0.6)))
Creatinine <- rnorm(n, mean = 100, sd = 20)
Age <- rnorm(n, mean = 60, sd = 15)
Perforation <- rnorm(n, mean = 2, sd = 0.5)
Hemoglobin <- rnorm(n, mean = 13, sd = 2)
# Encode Sepsis as numeric for modeling
Sepsis_num <- ifelse(Sepsis == "yes", 1, 0)
# Define true relationship: all covariates affect outcome
logit_p <- -10 +
0.06 * Creatinine +
0.04 * Age +
0.8 * Sepsis_num +
1.2 * Perforation -
0.3 * Hemoglobin
p_dead <- exp(logit_p) / (1 + exp(logit_p))
Outcome <- factor(rbinom(n, 1, p_dead), labels = c("alive", "dead"))
# Final dataset
mlogr_4_2122 <- data.frame(Outcome, Sepsis, Creatinine, Age, Perforation, Hemoglobin)
```
### i) Justify the choice of the analysis. (4 marks)
**Justification for choosing binary logistic regression:**
1\. **Outcome variable is binary:** The outcome variable "Outcome" has two categories (alive vs dead), making binary logistic regression appropriate for modeling the probability of one category (e.g., dead) based on predictor variables.
2\. **Multiple predictor variables:** The analysis involves multiple independent variables (Sepsis, Creatinine, Age, Perforation, Hemoglobin), which can be continuous or categorical. Logistic regression can handle both types of predictors effectively.
3\. **Estimation of odds ratios:** Logistic regression provides odds ratios for each predictor, allowing for interpretation of the strength and direction of associations between predictors and the outcome.
4\. **No assumption of normality:** Logistic regression does not require the predictors to be normally distributed, making it suitable for a variety of data types and distributions.
### ii) State the assumptions for the analysis. (2 marks)
**Assumptions of binary logistic regression:**
1\. **Independence of observations:** The observations should be independent of each other, meaning that the outcome for one observation does not influence the outcome for another.
2\. **Linerity of the logit:** There should be a linear relationship between the continuous predictor variables and the logit (log-odds) of the outcome variable.
## B) Generate a saturated model and name it as Model_Outcome. Present Model_Outcome in a table. (5 marks)
```{r}
model_Outcome <- glm(Outcome ~ Sepsis + Creatinine + Age + Perforation + Hemoglobin,
data = mlogr_4_2122, family = binomial)
tbl_regression(model_Outcome, conf.int = TRUE)
```
## C) Using Model_Outcome, compose an interaction between Creatinine and Sepsis. Name it as Model_Outcome_ia. Present Model_Outcome_ia in a table. (5 marks)
```{r}
model_Outcome_ia <- glm(Outcome ~ Sepsis + Creatinine + Age + Perforation + Hemoglobin + Sepsis:Creatinine,
data = mlogr_4_2122, family = binomial)
tbl_regression(model_Outcome_ia, conf.int = TRUE, intercept = T)
```
## D) For Model_Outcome_ia, write:
### i) The logistic model (2 marks)
The logistic model for Model_Outcome_ia can be expressed as:
$$ \text{logit}(P(\text{Outcome} = \text{dead})) = \beta_0 + \beta_1 \cdot \text{Sepsis}_{\text{yes}} + \beta_2 \cdot \text{Creatinine} + \beta_3 \cdot \text{Age} + \beta_4 \cdot \text{Perforation} + \beta_5 \cdot \text{Hemoglobin} + \beta_6 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}) $$
Probability of death can be calculated using the logistic function for model_Outcome_ia:
$$ P(\text{Outcome} = \text{dead}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{Sepsis}_{\text{yes}} + \beta_2 \cdot \text{Creatinine} + \beta_3 \cdot \text{Age} + \beta_4 \cdot \text{Perforation} + \beta_5 \cdot \text{Hemoglobin} + \beta_6 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}))}} $$
### ii) The logit equation (2 marks)
The logit equation for Model_Outcome_ia is: $$ \hat{\text{logit}} = \beta_0 + \beta_1 \cdot \text{Sepsis}_{\text{yes}} + \beta_2 \cdot \text{Creatinine} + \beta_3 \cdot \text{Age} + \beta_4 \cdot \text{Perforation} + \beta_5 \cdot \text{Hemoglobin} + \beta_6 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}) $$
Using beta from Model_Outcome_ia table, the logit equation is:
$$ \hat{\text{logit}} = -16 + 6.1 \cdot \text{Sepsis}_{\text{yes}} + 0.09 \cdot \text{Creatinine} + 0.08 \cdot \text{Age} + 1.10 \cdot \text{Perforation} - 0.31 \cdot \text{Hemoglobin} - 0.04 \cdot (\text{Sepsis}_{\text{yes}} \times \text{Creatinine}) $$
## E) Using Model_Outcome_ia, generate the predicted values for observation 1. Show your manual calculation. (4 marks)
1. Create observation 1 values
- Spepsis = yes
- Creatinine = 120
- Age = 70
- Perforation = 2.5
- Hemoglobin = 12
```{r}
# Observation 1 values
obbs_4_2122 <- expand.grid(
Sepsis = factor("yes", levels = c("no", "yes")),
Creatinine = 120,
Age = 70,
Perforation = 2.5,
Hemoglobin = 12
)
```
2. Manual calculation of logit for observation 1
- Using the coefficients from Model_Outcome_ia table:
- β0 = -16
- β1 (Sepsis_yes) = 6.1
- β2 (Creatinine) = 0.09
- β3 (Age) = 0.08
- β4 (Perforation) = 1.10
- β5 (Hemoglobin) = -0.31
- β6 (Sepsis_yes:Creatinine) = -0.04
- Logit calculation:
$$ \hat{\text{logit}} = -16 + 6.1 \cdot 1 + 0.09 \cdot 120 + 0.08 \cdot 70 + 1.10 \cdot 2.5 - 0.31 \cdot 12 - 0.04 \cdot (1 \cdot 120) $$
$$ \hat{\text{logit}} = -16 + 6.1 + 10.8 + 5.6 + 2.75 - 3.72 - 4.8 = 0.73 $$
- Convert logit to predicted probability
Using the logistic function: $$ P(\text{Outcome} = \text{dead}) = \frac{1}{1 + e^{-0.73}} \approx 0.68 $$
Thus, the predicted probability of death for observation 1 is approximately 0.68 (68%).
# MVA 2020/2021
# 1. Mutilple Linear Regression
Question: Dataset: Omega3
Analyse the relationship of triglyceride based on the following variables:
Model A = waist, total_chol, fbs, sex.
Model B = waist, total_chol, fbs, sex, physical activity, interaction between sex and physical activity
*** Simulate the dataset using the code below:***
```{r}
set.seed(123)
n <- 250
Omega3 <- data.frame(
waist = rnorm(n, mean = 85, sd = 10),
total_chol = rnorm(n, mean = 200, sd = 30),
fbs = rnorm(n, mean = 100, sd = 15),
physical_activity = rnorm(n, mean = 3),
sex = factor(sample(c("female", "male"), n, replace = TRUE))
)
# Generate triglyceride with some relationship to predictors
Omega3$triglyceride <- 150 +
1.5 * Omega3$waist +
0.8 * Omega3$total_chol +
0.5 * Omega3$fbs -
10 * (Omega3$sex == "male") -
5 * Omega3$physical_activity +
3 * (Omega3$sex == "male") * Omega3$physical_activity +
rnorm(n, mean = 0, sd = 20)
```
## A) Write the regression model and equation for Model A and Model B. (6 marks)
Fit regression models A and B:
```{r}
modelA_1_2021 <- lm(triglyceride ~ waist + total_chol + fbs + sex , data = Omega3)
modelB_1_2021 <- lm(triglyceride ~ waist + total_chol + fbs + sex + physical_activity + sex : physical_activity, data = Omega3)
summary(modelA_1_2021)
summary(modelB_1_2021)
```
**Model A:**
Regression model:
$$ \text{triglyceride}_i = \beta_0 + \beta_1 \cdot \text{waist}_i + \beta_2 \cdot \text{total_chol}_i + \beta_3 \cdot \text{fbs}_i + \beta_4 \cdot \text{sex}_i + \epsilon_i $$
Regression equation (using Beta from modelA_1_2021):
$$ \hat{\text{triglyceride}} = 150 + 1.3 \cdot \text{waist} + 0.8 \cdot \text{total_chol} + 0.5 \cdot \text{fbs} - 3.3 \cdot \text{sex}_{\text{male=1}} $$
**Model B:**
Regression model:
$$ \text{triglyceride}_i = \beta_0 + \beta_1 \cdot \text{waist}_i + \beta_2 \cdot \text{total_chol}_i + \beta_3 \cdot \text{fbs}_i + \beta_4 \cdot \text{sex}_i + \beta_5 \cdot \text{physical_activity}_i + \beta6 \cdot (\text{sex}_i \times \text{physical_activity}_i) + \epsilon_i $$
Regression equation (using Beta from modelB_1_2021):
$$ \hat{\text{triglyceride}} = 145 + 1.4 \cdot \text{waist} + 0.8 \cdot \text{total_chol} + 0.5 \cdot \text{fbs} - 10 \cdot \text{sex}_{\text{male=1}} - 4.5 \cdot \text{physical_activity} + 3.0 \cdot (\text{sex}_{\text{male=1}} \times \text{physical_activity}) $$
## B) Compare the two models.
```{r}
compare_performance(modelA_1_2021, modelB_1_2021)
```
Interpretation: Model B has a lower AIC and BIC compared to Model A, indicating a better fit to the data. Additionally, Model B has a higher R-squared value, suggesting it explains more variance in triglyceride levels. Therefore, Model B is preferred over Model A.
## C) Predict for the population given. (10 marks)
1. Create new data for prediction
```{r}
new_data_1_2021 <- expand.grid(
waist = c(70, 80, 90, 100),
total_chol = c(180, 200, 220),
fbs = c(90, 100, 110),
physical_activity = c(1, 2, 3, 4, 5),
sex = factor(c("female", "male"), levels = c("female", "male"))
)
```
2. Generate predictions using Model B
```{r}
new_data_1_2021$predicted_triglyceride <- predict(modelB_1_2021, newdata = new_data_1_2021)
head(new_data_1_2021)
```
3. Interpretation:
For observation 1 (waist = 70, total_chol = 180, fbs = 90, physical_activity = 1, sex = female), the predicted triglyceride level is approximately 443 mg/dL. As waist circumference, total cholesterol, fasting blood sugar, and physical activity levels increase, the predicted triglyceride levels also tend to increase, with variations based on sex due to the interaction term.
## D) Check assumption (Multiple Linear Regression) for model B.
1. Global diagnostics
```{r}
plot(modelB_1_2021)
```
2. Linearity of continuous predictors
```{r}
# Linearity for waist
ggplot(Omega3, aes(x = waist, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)
# Linearity for total_cholesterol
ggplot(Omega3, aes(x = total_chol, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)
# Linearity for fbs
ggplot(Omega3, aes(x = fbs, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)
# Linearity for physical_activity
ggplot(Omega3, aes(x = physical_activity, y = residuals(modelB_1_2021))) + geom_point() +
geom_hline(yintercept = 0)
```
Interpretation: The residuals plotted against waist, total cholesterol, fbs, and physical activity show no obvious systematic pattern or curvature and are randomly scattered around zero. This suggests that the linearity assumption is reasonably satisfied for all continuous predictors in the multiple linear regression model.
2. Normality of residuals
```{r}
hist(residuals(modelB_1_2021))
shapiro.test(modelB_1_2021$residuals)
```
Interpretation:
- The histogram of residuals appears approximately bell-shaped, suggesting that the normality of residual is met.
- The Shapiro-Wilk test yields a p-value of 0.450 (p > 0.05), indicating residuals are approximately normally distributed.
3. Homoscedasticity - Breusch-Pagan test
```{r}
bptest(modelB_1_2021)
```
Interpretation: The Breusch-Pagan test yields a p-value of 0.938 (p > 0.05), indicating no significant evidence of heteroscedasticity. Thus, homoscedasticity assumption is met.
4. Independence of residuals - Durbin-Watson test
```{r}
dwtest(modelB_1_2021)
```
Interpretation: The Durbin-Watson test yields a p-value of 0.084 (p > 0.05), indicating no significant autocorrelation in the residuals. Thus, the independence assumption is met.
5. Assessing transformation using Box-Cox plot
```{r}
MASS::boxcox(modelB_1_2021)
```
Interpretation: The Box-Cox plot suggests that the optimal lambda value is close to 1, indicating that no transformation of the outcome variable (triglyceride) is necessary. The current linear regression model is appropriate.
6. Check for multicollinearity - Variance Inflation Factor (VIF)
```{r}
vif(modelB_1_2021)
```
Interpretation: The VIF values for all predictors are below 5, indicating no significant multicollinearity between the predictors. Thus, no multicollinearity.
7. Influential points - Cook's Distance
```{r}
cutoffmodB <- 4/nrow(Omega3)
plot(modelB_1_2021, which = 4, cook.levels = cutoffmodB)
```
Interpretation: Minimal observations exceed the Cook's distance cutoff, indicating no highly influential points that unduly affect the regression results.
## E) Present Model B in table n interpret for waist & physical activity*sex
```{r}
tbl_regression(modelB_1_2021, conf.int = TRUE)
glance(modelB_1_2021)
```
Interpretation:
- Waist: For each additional cm increase in waist circumference, triglyceride levels increase by approximately 1.2 mg/dL, holding all other variables constant (95% CI: 0.98, 1.5, p < 0.001).
- physical activity * sex: The interaction term indicates that the effect of physical activity on triglyceride levels differs by sex. For males, each additional unit increase in physical activity is associated with a 4,8 mg/dL increase in triglyceride levels (95% CI: -0.11, 9.7, p = 0.055), while for females, the effect is a 5.0 mg/dL decrease in triglyceride levels (95% CI: -8.2, -1.9, p = 0.002). This suggests that physical activity has a protective effect on triglyceride levels for females but a detrimental effect for male.
# 2. Survival Analysis
Question: Using Dataset: stroke_e.dta
## A) Interpret: (10 marks)
- Overall Kaplan Meier
- Kaplan Meier by ICD10
1. Import dataset
```{r}
stroke_2_2021 <- read_dta("stroke_e.dta")
glimpse(stroke_2_2021)
```
2. Change variable label to factor
```{r}
stroke_2_2021 <- stroke_2_2021 %>% mutate(across(where(is.labelled), as_factor))
glimpse(stroke_2_2021)
```
3. Overall Kaplan Meier
```{r}
KM.overall_2_2021 <- survfit(Surv(time = time,
event = status2 == 'dead') ~ 1, type = "kaplan-meier",
data = stroke_2_2021)
tidy(KM.overall_2_2021)
# Plot overall survival curve
ggsurvplot(KM.overall_2_2021, data = stroke_2_2021, risk.table = TRUE, linetype = c(1,2), pval = TRUE)
```
4. Kaplan Meier by ICD10
```{r}
KM.icd10_2_2021 <- survfit(Surv(time = time,
event = status2 == 'dead') ~ icd10cat2, type = "kaplan-meier",
data = stroke_2_2021)
tidy(KM.icd10_2_2021)
# Plot survival curve by ICD10
ggsurvplot(KM.icd10_2_2021, data = stroke_2_2021, risk.table = TRUE, linetype = c(1,2), pval = TRUE)
```
Interpretation:
- Overall Kaplan Meier: The overall survival curve indicates that the probability of survival decreases over time among stroke patients. The median survival time can be estimated from the curve, and the risk table shows the number of patients at risk at different time points.
- Kaplan Meier by ICD10: The survival curves stratified by ICD10 categories show differences in survival probabilities among different stroke types. For example, patients with Haemorrhagic stroke may have a lower survival probability compared to those with Ischaemic stroke. The log-rank test p-value = 0.0043 indicates whether these differences are statistically significant.
## B) Cox regression for variables: (10 marks)
- Age
- DM
- ICD10
- Gender
1. Fit Cox regression model
```{r}
model_cox_2_2021 <- coxph(Surv(time = time,
event = status2 == 'dead') ~ age2 + dm + icd10cat2 + sex,
data = stroke_2_2021)
tidy(model_cox_2_2021, conf.int = TRUE, exponentiate = TRUE)
```
Interpretation:
- Age: Each additional year of age is associated with a 0.003 times higher hazard ratio of death, after adjusting for other covariates (HR = 1.03, 95% CI: 1.01-1.05, p = 0.006).
- DM: Patients with diabetes have a 0.299 times higher hazard ratio of death compared to those without diabetes, controlling for other covariates (HR = 1.57, 95% CI: 0.82, 2.65, p = 0.19). However, this association is not statistically significant.
- ICD10: Patient with hemorrhagic stroke have a 2.69 times higher hazard ratio of death compared to those with ischaemic stroke, adjusting for other covariates (HR = 2.69, 95% CI: 1.30-4.80, p = 0.466). However, this association is not statistically significant.
## C) Check assumptions needed to estimate cox regression. (10 marks)
1.
```{r}
ggcoxfunctional(Surv(time = time,
event = status2 == 'dead') ~ age2,
data = stroke_2_2021)
```
Interpretation: The plot shows a linear relationship between age and the log hazard, indicating that the linearity assumption is reasonably satisfied for age in the Cox regression model.
2. Proportional hazards assumption - Schoenfeld residuals
```{r}
prop.h_2.2021 <- cox.zph(model_cox_2_2021, transform = 'km', global = TRUE)
prop.h_2.2021
ggcoxzph(prop.h_2.2021)
```
Interpretation: The global test p-value = 0.077 (p > 0.05) indicates no significant violation of the proportional hazards assumption. However, ICD10 and sex have p-values < 0.05, suggesting potential violations for these covariates. Further investigation may be needed.
3. Linearity - Score residuals
```{r}
score_2_2021 <- resid(model_cox_2_2021, type = "score")
plot(stroke_2_2021$age2, score_2_2021[, "age2"])
```
```{r}
# Linearity for age
ggplot(stroke_2_2021, aes(x = age2, y = residuals(model_cox_2_2021, type = "martingale"))) + geom_point() +
geom_smooth(method = "loess") + geom_hline(yintercept = 0)
```
Interpretation: The residuals plotted against age show no obvious systematic pattern or curvature and are randomly scattered around zero. This suggests that the linearity assumption is reasonably satisfied for age in the Cox regression model.
4. Influential (Martingale residuals)
```{r}
ggcoxdiagnostics(model_cox_2_2021, type = 'martingale', linear.predictions = FALSE)
```
Interpretation: The martingale residuals appear randomly scattered around zero, indicating no significant outliers or influential observations that could unduly affect the Cox regression results.
# 3. Mixed (Between-within) Repeated Measures ANOVA (Using SPSS)
Question: A researcher has conducted an interventional study to evaluate the effect of a new rehabilitation module in managing post-spinal surgery patients. Thirty (30) patients were randomly assigned into two groups; “new module” and “current module”.
They were measured on their mobilization activities using questionnaire with scores between 0-100 when higher score indicates better management. Endpoint data on the mobility scores were collected at baseline (score-a), 3-month (score-2) and 6-month (score-3).
Using the given dataset “mobility study.sav”,
## A) Present the result using tables and plots (in a Word document). (10 marks)
Table should include
- Descriptive statistics (means and SDs for each group at each time point)
- ANOVA results (F-values, p-values, effect sizes for time, group, and interaction)
- Pairwise comparisons if applicable
## B) Perform appropriate statistical analysis using SPSS. Interpret all the relevant analytic step and save it into a Microsoft Word document (“mobility.docx”). Also save all the relevant SPSS outputs as “mobility.spv”. (20 marks)
Statistical analysis step using SPSS
1. Descriptive analysis
2. Univariable analysis
3. Model fit (M)
4. Interaction (I)
5. Normality of residual (N)
6. Equal variences (E)
7. Sphericity (S)
8. Split file (if interaction is significant)
9. Post-hoc test (if the interest variable has more than 2 categories)
10. Final table
11. Interpretation
12. Interaction plot
# 4. ANOVA
Question:
IV alcohol and sleep
DV performance
Research questions (main effect):
- Is there any effect of alcohol on performance score?
- Is there any effect of sleep on the performance score?
- Is the difference of performance score between the alcohol status the same/vary based on sleep status?
Run the analysis.
Statistical analysis step using SPSS
1. Descriptive analysis
2. Univariable analysis
3. Model fit (M)
4. Interaction (I)
5. Normality of residual (N)
6. Equal variences (E)
8. Split file/ Syntax (if interaction is significant)
9. Post-hoc test (if the interest variable has more than 2 categories)
10. Final table
11. Interpretation
12. Interaction plot
# MVA 2019/2020
# 1. Multiple Linear Regression
Question:
Data were collected to study factors that were related with HbA1c. The dataset is named as “metab_syndrome” and in .xlsx formats. A model was proposed to examine the relationship between HBA1C with these covariates:
- HbA1c = glycated haemoglobin (%)
- AGE = age(years)
- DMDX = diabetes status (yes/no)
- MSBPR - mean systolic blood pressure (mmHg)
- FBS = fasting blood sugar (mmol/L)
- HDL = high-density lipoprotein (mg/dL)
*** Simulate the dataset using the code below:***
```{r}
set.seed(123)
n <- 200
metab_syndrome <- data.frame(
HbA1c = rnorm(n, mean = 7, sd = 1.5),
AGE = rnorm(n, mean = 55, sd = 10),
DM = factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.6, 0.4))),
MSBPR = rnorm(n, mean = 130, sd = 15),
FBS = rnorm(n, mean = 6.5, sd = 2),
HDL = rnorm(n, mean = 45, sd = 10),
MDBPR = rnorm(n, mean = 80, sd = 10)
)
```
## (a) Write the regression model and equation for MODEL_A. (15 marks)
1. Glimpse dataset
```{r}
glimpse(metab_syndrome)
```
2. Built model A
```{r}
model_A_1_1920 <- lm(HbA1c ~ AGE + DM + MSBPR + FBS + HDL, data = metab_syndrome)
tidy (model_A_1_1920, conf.int = TRUE)
glance(model_A_1_1920)
```
3. Write Regression model and equation for Model A
- Regression model:
$$ \text{HbA1c}_i = \beta_0 + \beta_1 \cdot \text{AGE}_i + \beta_2 \cdot \text{DM}_i + \beta_3 \cdot \text{MSBPR}_i + \beta_4 \cdot \text{FBS}_i + \beta_5 \cdot \text{HDL}_i + \epsilon_i $$
- Regression equation (using Beta from model_A_1_1920):
$$ \hat{\text{HbA1c}} = 6.47 + -0.002 \cdot \text{AGE} + 0.18 \cdot \text{DM}_{\text{yes}} + 0.002 \cdot \text{MSBPR} + 0.05 \cdot \text{FBS} - 0.003 \cdot \text{HDL} $$
## (b) Add variable MDBPR to MODEL_A and name the new model as MODEL_B. Describe the effect of this addition to the model. (5 marks)
1. Build model B
```{r}
model_B_1_1920 <- lm(HbA1c ~ AGE + DM + MSBPR + FBS + HDL + MDBPR, data = metab_syndrome)
tidy (model_B_1_1920, conf.int = TRUE)
glance(model_B_1_1920)
```
Interpretation: The inclusion of MDBPR in Model B does not significantly improve the model, as MDBPR is not significantly associated with HbA1c and its addition does not materially change the estimated effects or significance of the other covariates, indicating no evidence of confounding.
## (c) Interpret the variances between regression model and regression estimates. (10 marks)
Both model explain very little variance in HbA1c levels (R-squared values are low).
- Model A : R² = 0.0086, Adj. R² = -0.017
- Model B : R² = 0.014, Adj. R² = -0.017
Negative adjusted models suggest that the predictors do not explain the variability in HbA1c better than a simple mean model (null model).
The regression models have poor explanatory power, and adding MDBPR (Model B) does not meaningfully reduce unexplained variance compared with Model A.
## (d) Perform comparison between MODEL_A and MODEL_B. Give comment. (5 marks)
```{r}
compare_performance(model_A_1_1920, model_B_1_1920)
```
Interpretation: Model A has slightly lower AIC and BIC values compared to Model B, indicating a marginally better fit to the data. However, model B has a slightly higher R-squared value, suggesting it explains a bit more variance in HbA1c levels. Overall, both models perform similarly, with Model A being preferred due to its simplicity and slightly better fit statistics.
## (e) Perform model assessment for MODEL_B. Save the relevant plots in the thumb-drive and write the names of the plots in the answer sheet. (10 marks)
## (f) Predict the average value of HBA1C for a population with:
▪ AGE = 48 years
▪ DMDX = has history of diabetes (1)
▪ MSBPR = 133 mmHg
▪ MDBPR = 79 mmHg
▪ FBS = 5.6 mmol/l
▪ HDL = 1.3 mmol/l
1. Linear equation for Model B
$$ \hat{\text{HbA1c}} = 7.26 - 0.001 \cdot \text{AGE} + 0.196 \cdot \text{DM}_{\text{yes}} + 0.003 \cdot \text{MSBPR} + 0.05 \cdot \text{FBS} - 0.003 \cdot \text{HDL} - 0.01 \cdot \text{MDBPR} $$
HbA1c = 7.26 - 0.001(AGE) + 0.196(DM_yes) + 0.003(MSBPR) + 0.05(FBS) - 0.003(HDL) - 0.01(MDBPR)
HbA1c = 7.26 - 0.001*48 + 0.196*1 + 0.003*133 + 0.05*5.6 - 0.003*1.3 - 0.01*79
HbA1c = 7.2931 ~ 7.3%
# 2. RM ANOVA
Question :
A researcher has conducted an interventional study to evaluate the effect of a new rehabilitation module in managing post-spinal surgery patients. Thirty (30) patients were randomly assigned into two groups; “new module” and “current module”.
They were measured on their mobilization activities using questionnaire with scores between 0-100 when higher score indicates better management. Endpoint data on the mobility scores were collected at baseline (score-a), 3-month (score-2) and 6-month (score-3). Using the given dataset “mobility study.sav”,
## A) Present the result using tables and plots (in a Word document). (10 marks)
Table should include:
- Descriptive statistics (means and SDs for each variable)
- ANOVA results (F-values, p-values, effect sizes for each variable)
- Pairwise comparisons if applicable
## B) Perform appropriate statistical analysis using SPSS. Interpret all the relevant analytic step and save it into a Microsoft Word document (“mobility.docx”). Also save all the relevant SPSS outputs as “mobility.spv”. (20 marks)
RM ANOVA statistical workflow using SPSS:
1. Descriptive analysis
2. Univariable analysis
3. Model fit (M)
4. Interaction (I)
5. Normality of residual (N)
6. Equal variences (E)
7. Sphericity (S)
8. Split file (if interaction is significant)
9. Post-hoc test (if the interest variable has more than 2 categories)
10. Final table
11. Interpretation
12. Interaction plot
# 3. Survival Analysis
Question 3
The dataset “stroke_outcome.dta” comes from a short prospective survival study. The variable outcome is labelled as dead (1) or alive (0) and the variable days is the duration from the start of follow-up until the last follow-up.
## A) Present and interpret the result for:
▪ Overall Kaplan-Meier analysis
▪ Kaplan-Meier analysis on variable ICD10
Comment on the time, number of event and probability of the event for both analyses.
(10 marks)
1. Import data
```{r}
stroke_3_1920 <- read_dta("stroke_outcome.dta")
glimpse(stroke_3_1920)
```
2. Data transform
```{r}
stroke_3_1920 <- stroke_3_1920 %>% mutate(across(where(is.labelled), as_factor))
glimpse(stroke_3_1920)
```
3. Overall kaplan meier
```{r}
KM.overall_3_1920 <- survfit(Surv(time = days,
event = outcome == 'dead') ~ 1, type = "kaplan-meier",
data = stroke_3_1920)
tidy(KM.overall_3_1920)
# Plot overall survival curve
ggsurvplot(KM.overall_3_1920, data = stroke_3_1920, risk.table = TRUE, linetype = c(1,2), pval = TRUE)
```
4. Kaplan meier by icd10
```{r}
KM.icd10_3_1920 <- survfit(Surv(time = days,
event = outcome == 'dead') ~ icd10, type = "kaplan-meier",
data = stroke_3_1920)
tidy(KM.icd10_3_1920)
# Plot survival curve by ICD10
ggsurvplot(KM.icd10_3_1920, data = stroke_3_1920, risk.table = TRUE, linetype = c(1,2,3), pval = TRUE)
```
5. Interpretation:
- Overall Kaplan Meier: The Kaplan-Meier curve shows a steady decline in survival probability over time, with the most substantial drop occurring early in the follow-up. By day 10, the survival probability falls to approximately 68%, indicating a high early mortality rate in the cohort.
- Kaplan Meier by ICD10: Survival differs significantly across ICD10 diagnostic groups (p = 0.033). Patients with SAH and ICB/Other Haemorrhage show lower survival probabilities compared to those with CI/Others. This suggests that diagnosis type is a significant predictor of mortality, with haemorrhagic strokes associated with poorer outcomes.
- Compare overall and ICD10 : The overall curve masks important differences between diagnostic groups. While the pooled analysis shows a general decline in survival, the ICD10-stratified curves reveal that haemorrhagic subtypes drive the poorer outcomes. Thus, stratification provides more clinically meaningful insight, showing that survival is not uniform across stroke types.
## B) The analyst proposed to model the log hazard using the Cox proportional hazard (PH) model. Comment on his proposal. (5 marks)
- The proposal to model the log hazard using a Cox proportional hazards model is appropriate because the outcome of interest is time to event with possible censoring.
- The Cox model allows estimation of log hazard without specifying the baseline hazard function.
- However, the validity of this approach depends on the proportional hazards assumption, which requires that the log hazard for covariates remain constant over time.
- This assumption should be formally assessed, for example using Schoenfeld residuals.
- If the proportional hazards assumption is violated, alternative approaches such as stratified Cox models should be considered.
## (c) Estimate the Cox PH model with covariate: sex, GCS, ICD10, SBP, DBP. Present the results in a table. Interpret the result for SBP and ICD10. (10 marks)
Fit cox regression model
```{r}
model_cox_3_1920 <- coxph(Surv(time = days,
event = outcome == 'dead') ~ age + gcs + icd10,
data = stroke_3_1920)
tidy(model_cox_3_1920, conf.int = TRUE)
```
Interpretation:
- Age: Each additional year of age is associated with a 0.03 times higher log hazard of death, after adjusting for other covariates (Log Hazard = 0.03, 95% CI: 0.01, 0.05, p = 0.004).
- GCS: Each additional point increase in GCS score is associated with a 0.19 times lower log hazard of death, after adjusting for other covariates (Log Hazard = -0.19, 95% CI: -0.26,-0.12, p < 0.001).
- ICD10: Patients with haemorrhagic stroke have a 0.48 times higher log hazard of death compared to those with ischaemic stroke and SAH, after adjusting for other covariates (Log Hazard = 0.48, 95% CI: -0.17,1.13, p = 0.48). However, this association is not statistically significant.
## (d) Test the assumption for proportional hazard. Save the plot in the thumb-drive and write the name of the plot in the answer sheet. (5 marks)
1. Assessment: PH Assumption (Schoenfeld Residuals)
```{r}
prop.h_3_1920 <- cox.zph(model_cox_3_1920, transform = 'km', global = TRUE)
ggcoxzph(prop.h_3_1920)
```
2. Assessment: Functional Form (Linearity)
-