Pearson’s assumptions: - Ideally both variables to follow a normal distribution - Continuous - Random sample
Spearman’s assumptions: - Monotonic relationship (goes up/ down together, but not necessarily linear) - Less sensitive to extremes and able to work with smaller sample size - Continuous and ordinal variable.
“Developing a model to predict Hospital Anxiety Depression(HAD) scores”
COPD <- read.csv("data/COPD_student_dataset.csv")
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.3.2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.3.2
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
head(COPD)
## X ID AGE PackHistory COPDSEVERITY MWT1 MWT2 MWT1Best FEV1 FEV1PRED FVC
## 1 1 58 77 60 SEVERE 120 120 120 1.21 36 2.40
## 2 2 57 79 50 MODERATE 165 176 176 1.09 56 1.64
## 3 3 62 80 11 MODERATE 201 180 201 1.52 68 2.30
## 4 4 145 56 60 VERY SEVERE 210 210 210 0.47 14 1.14
## 5 5 136 65 68 SEVERE 204 210 210 1.07 42 2.91
## 6 6 84 67 26 MODERATE 216 180 216 1.09 50 1.99
## FVCPRED CAT HAD SGRQ AGEquartiles copd gender smoking Diabetes muscular
## 1 98 25 8 69.55 4 3 1 2 1 0
## 2 65 12 21 44.24 4 2 0 2 1 0
## 3 86 22 18 44.09 4 2 0 2 1 0
## 4 27 28 26 62.04 1 4 1 2 0 0
## 5 98 32 18 75.56 1 3 1 2 0 1
## 6 60 29 21 73.82 2 2 0 1 1 0
## hypertension AtrialFib IHD
## 1 0 1 0
## 2 0 1 1
## 3 0 1 0
## 4 1 1 0
## 5 1 0 0
## 6 0 1 0
dim(COPD)
## [1] 101 24
colnames(COPD)
## [1] "X" "ID" "AGE" "PackHistory" "COPDSEVERITY"
## [6] "MWT1" "MWT2" "MWT1Best" "FEV1" "FEV1PRED"
## [11] "FVC" "FVCPRED" "CAT" "HAD" "SGRQ"
## [16] "AGEquartiles" "copd" "gender" "smoking" "Diabetes"
## [21] "muscular" "hypertension" "AtrialFib" "IHD"
my_data <- COPD[,c("AGE", "PackHistory", "FEV1", "FEV1PRED", "FVC", "CAT")]
cor_matrix <- cor(my_data)
round(cor_matrix,2)
## AGE PackHistory FEV1 FEV1PRED FVC CAT
## AGE 1.00 0.00 -0.10 0.07 -0.15 0.08
## PackHistory 0.00 1.00 -0.13 -0.13 -0.09 -0.14
## FEV1 -0.10 -0.13 1.00 0.78 0.82 -0.06
## FEV1PRED 0.07 -0.13 0.78 1.00 0.52 0.01
## FVC -0.15 -0.09 0.82 0.52 1.00 -0.16
## CAT 0.08 -0.14 -0.06 0.01 -0.16 1.00
pairs(~AGE + PackHistory + FEV1 + FEV1PRED + FVC + CAT, data = COPD)
Lung function variables (FEV1, FEV1PRED, FVC) have a high correlation, as to be expected. Linear regression models showed no significant differences between each variable. Given that it measures the same thing, FEV1 will be chosen.
hist(COPD$AGE)
summary(COPD$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 44.0 65.0 71.0 70.1 75.0 88.0
Assumptions of normal distribution are met.
table(COPD$smoking)
##
## 1 2
## 16 85
COPD$smoking <- COPD$smoking -1
table(COPD$smoking)
##
## 0 1
## 16 85
hist(COPD$PackHistory)
summary(COPD$PackHistory)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 20.0 36.0 39.7 54.0 109.0
min_value <- min(COPD$PackHistory, na.rm = TRUE)
max_value <- max(COPD$PackHistory, na.rm = TRUE)
width <- (max_value - min_value) / 3
breakpoints <- seq(min_value, max_value, by = width)
COPD$PackHistoryGroup <- cut(COPD$PackHistory, breaks = breakpoints, include.lowest = TRUE, labels = c("low", "med", "high"))
COPD$PackHistoryGroup <- factor(COPD$PackHistoryGroup, levels = c("low", "med", "high"), ordered = TRUE)
Equal width binning method used: - Pack History has a skewed (left) distribution, and i want to capture differences in groupings - Additional argument added to make it an ordinal variable
COPD$smoking <- factor(COPD$smoking)
table(COPD$smoking, COPD$PackHistoryGroup)
##
## low med high
## 0 8 7 1
## 1 44 30 11
table(COPD$COPDSEVERITY)
##
## MILD MODERATE SEVERE VERY SEVERE
## 23 43 27 8
COPD$COPDSEVERITY_new <- ifelse(COPD$COPDSEVERITY %in% c("SEVERE", "VERY SEVERE"), "Severe", COPD$COPDSEVERITY)
table(COPD$COPDSEVERITY_new)
##
## MILD MODERATE Severe
## 23 43 35
Categories for ‘Very Severe’ and ‘Severe’ was combined - ‘Very Severe’ was underrepresented had few observations, and may result in insufficient statistical power to detect an effect/ unreliable results - Models with few observations are not generalizable. Insufficient sample size.
hist(COPD$CAT)
COPD$CAT[COPD$CAT > 40] <- NA
hist(COPD$CAT)
CAT ranges from 0-40, and outlier is likely a mistake. Observation to be removed.
quantile_breaks <- quantile(COPD$CAT, probs = seq(0, 1, length.out = 4), na.rm = TRUE)
COPD$CATGroup <- cut(COPD$CAT, breaks = quantile_breaks, include.lowest = TRUE, labels = c("Low", "Medium", "High"))
table(COPD$CATGroup)
##
## Low Medium High
## 37 30 33
table(COPD$CATGroup, COPD$COPDSEVERITY_new)
##
## MILD MODERATE Severe
## Low 10 20 7
## Medium 7 15 8
## High 5 8 20
c1 <- lm(COPD$HAD ~ COPD$CATGroup, data = COPD)
summary(c1)
##
## Call:
## lm(formula = COPD$HAD ~ COPD$CATGroup, data = COPD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.976 -5.919 -1.233 3.767 39.224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.919 1.243 5.568 2.30e-07 ***
## COPD$CATGroupMedium 3.314 1.857 1.785 0.0774 .
## COPD$CATGroupHigh 10.057 1.810 5.556 2.41e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.559 on 97 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2459, Adjusted R-squared: 0.2303
## F-statistic: 15.81 on 2 and 97 DF, p-value: 1.139e-06
c2 <- lm(COPD$HAD ~ COPD$COPDSEVERITY_new)
summary(c2)
##
## Call:
## lm(formula = COPD$HAD ~ COPD$COPDSEVERITY_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.029 -6.029 -2.029 3.971 45.843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.3565 1.7864 5.797 8.21e-08 ***
## COPD$COPDSEVERITY_newMODERATE -0.2402 2.2132 -0.109 0.914
## COPD$COPDSEVERITY_newSevere 2.6720 2.2996 1.162 0.248
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.567 on 98 degrees of freedom
## Multiple R-squared: 0.02492, Adjusted R-squared: 0.005024
## F-statistic: 1.252 on 2 and 98 DF, p-value: 0.2903
Equal width binning approach is chosen instead of previous quantile binning. It divides the range of data points into equal bins, and allow for easy comparison with COPDSeverity.
CATgroup has a higher adjusted R^2 value, and explains for 20% of variability in the dataset. CAT will be used as predictor variable
plot(COPD$HAD ~ COPD$CATGroup)
Further investigations show that 1. There is an overlap in ranges,
indicating similarity in HAD score between groups 2. ‘Severe’ groups
still tend to have a higher score (this is further confirmed with a
statistically significant P value in the linear regression model) 3.
Median scores are generally higher for the high, compared to the low
group.
CAT score will be used as predictor variable
# GENDER
```r
class(COPD$gender)
## [1] "integer"
COPD$gender <- as.factor(COPD$gender)
class(COPD$gender)
## [1] "factor"
R will treat variables with numbers as integers, it is important to treat it as a category
comorbid <- rep(0, nrow(COPD))
comorbid[COPD$Diabetes == 1 | COPD$muscular == 1 | COPD$AtrialFib == 1 | COPD$IHD == 1] <- 1
comorbid[is.na(comorbid)] <- 0
comorbid <- factor(comorbid)
COPD$comorbid <- comorbid
table(comorbid)
## comorbid
## 0 1
## 53 48
After inspecting variables: the following predictor variables will be used - AGE - GENDER - Pack History - CAT Group - FEV1 - comorbid
m1 <- lm(COPD$HAD ~ COPD$AGE + COPD$gender + COPD$PackHistoryGroup + COPD$CATGroup + COPD$FEV1 + COPD$comorbid )
summary(m1)
##
## Call:
## lm(formula = COPD$HAD ~ COPD$AGE + COPD$gender + COPD$PackHistoryGroup +
## COPD$CATGroup + COPD$FEV1 + COPD$comorbid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.069 -4.717 -1.157 3.186 38.254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.03568 7.45209 3.225 0.00175 **
## COPD$AGE -0.22428 0.09966 -2.250 0.02683 *
## COPD$gender1 -2.57008 1.70130 -1.511 0.13434
## COPD$PackHistoryGroup.L -0.28857 1.76139 -0.164 0.87023
## COPD$PackHistoryGroup.Q -2.25847 1.42164 -1.589 0.11561
## COPD$CATGroupMedium 4.43108 1.90195 2.330 0.02203 *
## COPD$CATGroupHigh 10.00743 1.85792 5.386 5.59e-07 ***
## COPD$FEV1 -0.37612 1.22535 -0.307 0.75959
## COPD$comorbid1 0.70113 1.57131 0.446 0.65651
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.389 on 91 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3241, Adjusted R-squared: 0.2646
## F-statistic: 5.453 on 8 and 91 DF, p-value: 1.27e-05
par(mfrow = c(2,2))
plot(m1)
vif_result <- vif(m1)
vif_result
## GVIF Df GVIF^(1/(2*Df))
## COPD$AGE 1.117043 1 1.056903
## COPD$gender 1.221541 1 1.105234
## COPD$PackHistoryGroup 1.186271 2 1.043629
## COPD$CATGroup 1.256965 2 1.058841
## COPD$FEV1 1.241114 1 1.114053
## COPD$comorbid 1.126585 1 1.061407
VIF results show low colinearity between my variables
Exploring further into interaction effects
m2 <- lm(COPD$HAD ~ COPD$PackHistoryGroup * COPD$CATGroup)
summary(m2)
##
## Call:
## lm(formula = COPD$HAD ~ COPD$PackHistoryGroup * COPD$CATGroup)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.111 -5.111 -1.333 3.481 39.619
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 6.0231 1.7094 3.524
## COPD$PackHistoryGroup.L -1.2571 3.3657 -0.373
## COPD$PackHistoryGroup.Q -4.4113 2.4907 -1.771
## COPD$CATGroupMedium 4.0602 2.2538 1.801
## COPD$CATGroupHigh 11.6520 2.4713 4.715
## COPD$PackHistoryGroup.L:COPD$CATGroupMedium 0.3732 4.3140 0.087
## COPD$PackHistoryGroup.Q:COPD$CATGroupMedium 4.1052 3.4450 1.192
## COPD$PackHistoryGroup.L:COPD$CATGroupHigh 3.2033 4.7355 0.676
## COPD$PackHistoryGroup.Q:COPD$CATGroupHigh 5.1021 3.7707 1.353
## Pr(>|t|)
## (Intercept) 0.000668 ***
## COPD$PackHistoryGroup.L 0.709651
## COPD$PackHistoryGroup.Q 0.079887 .
## COPD$CATGroupMedium 0.074942 .
## COPD$CATGroupHigh 8.68e-06 ***
## COPD$PackHistoryGroup.L:COPD$CATGroupMedium 0.931253
## COPD$PackHistoryGroup.Q:COPD$CATGroupMedium 0.236499
## COPD$PackHistoryGroup.L:COPD$CATGroupHigh 0.500477
## COPD$PackHistoryGroup.Q:COPD$CATGroupHigh 0.179375
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.633 on 91 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2787, Adjusted R-squared: 0.2152
## F-statistic: 4.394 on 8 and 91 DF, p-value: 0.0001609
Results indicate the interaction does not have a significant effect. however, the main effect of the predictor is still significant, especially at the higher groups.
m3 <- lm(formula = COPD$HAD ~ COPD$CATGroup * COPD$AGE + COPD$PackHistory + COPD$comorbid, data = COPD)
m4 <- lm(formula = COPD$HAD ~ COPD$CATGroup * COPD$AGE + COPD$FEV1, data = COPD)
summary(m4)
##
## Call:
## lm(formula = COPD$HAD ~ COPD$CATGroup * COPD$AGE + COPD$FEV1,
## data = COPD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.738 -5.261 -0.910 3.465 38.767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.29147 10.42208 1.275 0.2054
## COPD$CATGroupMedium 30.09140 16.92094 1.778 0.0786 .
## COPD$CATGroupHigh 23.47690 15.91317 1.475 0.1435
## COPD$AGE -0.07225 0.14159 -0.510 0.6111
## COPD$FEV1 -0.78878 1.16524 -0.677 0.5001
## COPD$CATGroupMedium:COPD$AGE -0.37215 0.23732 -1.568 0.1202
## COPD$CATGroupHigh:COPD$AGE -0.19937 0.22759 -0.876 0.3833
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.407 on 93 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3058, Adjusted R-squared: 0.261
## F-statistic: 6.827 on 6 and 93 DF, p-value: 5.05e-06
plot(m4)
Overall Linear regression model: - Despite not having a significant correlation value, m4 showed the highest r^2 squared value, with the model explaing 26% of the variance. - F statistic has a p value <0.05, indicating the model as a whole is still statistically significant
Linear equation: (given values from model) HAD=13.29147+30.09140×MediumCATGroup+23.47690×HighCATGroup−0.07225×AGE−0.78878×FEV1−0.37215×(MediumCATGroup×AGE)−0.19937×(HighCATGroup×AGE)
Hypothetical example of 60 year old man, Med CAT category, 2.5FEV1 score: HAD= 13.29147+(30.09140×1)+(23.47690×0)−(0.07225×60)−(0.78878×2.5)+(−0.37215×1×60)+(−0.19937×0×60) HAD= 15.75
Plots: 1. Residuals vs Fitted: no clear pattern which is good, with some outliers 2. Q-Q: data points follow the line, indicating normal distribution 3. Scale- location: spread increasing for larger values, possible indicating skewed/ hetrodascity in the dataset 4. Residuals vs Leverage: generally few points with high leverage, indicating there are few observations explaining variances in the model.
Confidence intervals: - Values crosses 0, and…… basically a better moel needs to be built.
In all, variables showing direct indication to severity of COPD (as opposed to contributing factors such as smoking/ comorbidities) has a higher predictive value to HAD scores. Interesting to note that the model improves on interaction with age, meaning that- As one ages, there is a higher impact on the severity of disease to their HAD scores.
To improve: - Further refinement to model can be made. Possible to play around with more interactions/ different categorical groupings?