Pearson VS Spearman

Pearson’s assumptions: - Ideally both variables to follow a normal distribution - Continuous - Random sample

Spearman’s assumptions: - Monotonic relationship (goes up/ down together, but not necessarily linear) - Less sensitive to extremes and able to work with smaller sample size - Continuous and ordinal variable.

Linear regression

Multiple regression/ multi variate model

Methods for fitting a model

  1. Stepwise regression -> identifying predictors with greatest influence on outcome
  2. Backwards -> start with a full stack model and remove predictors

Good practice steps:

  1. Inspect variables
  1. Examine relationship
  1. Fit a simple LM with each variable

COPD Case Study

“Developing a model to predict Hospital Anxiety Depression(HAD) scores”

Loading Data and necessary libraries

COPD <- read.csv("data/COPD_student_dataset.csv")
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.3.2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.3.2
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2

Inspect variables

  • to view the first five rows
  • to view an overall description of all the variables- describe(COPD)
  • to view the dimensions (no. of rows/ columns)
  • to view column names
head(COPD)
##   X  ID AGE PackHistory COPDSEVERITY MWT1 MWT2 MWT1Best FEV1 FEV1PRED  FVC
## 1 1  58  77          60       SEVERE  120  120      120 1.21       36 2.40
## 2 2  57  79          50     MODERATE  165  176      176 1.09       56 1.64
## 3 3  62  80          11     MODERATE  201  180      201 1.52       68 2.30
## 4 4 145  56          60  VERY SEVERE  210  210      210 0.47       14 1.14
## 5 5 136  65          68       SEVERE  204  210      210 1.07       42 2.91
## 6 6  84  67          26     MODERATE  216  180      216 1.09       50 1.99
##   FVCPRED CAT HAD  SGRQ AGEquartiles copd gender smoking Diabetes muscular
## 1      98  25   8 69.55            4    3      1       2        1        0
## 2      65  12  21 44.24            4    2      0       2        1        0
## 3      86  22  18 44.09            4    2      0       2        1        0
## 4      27  28  26 62.04            1    4      1       2        0        0
## 5      98  32  18 75.56            1    3      1       2        0        1
## 6      60  29  21 73.82            2    2      0       1        1        0
##   hypertension AtrialFib IHD
## 1            0         1   0
## 2            0         1   1
## 3            0         1   0
## 4            1         1   0
## 5            1         0   0
## 6            0         1   0
dim(COPD)
## [1] 101  24
colnames(COPD)
##  [1] "X"            "ID"           "AGE"          "PackHistory"  "COPDSEVERITY"
##  [6] "MWT1"         "MWT2"         "MWT1Best"     "FEV1"         "FEV1PRED"    
## [11] "FVC"          "FVCPRED"      "CAT"          "HAD"          "SGRQ"        
## [16] "AGEquartiles" "copd"         "gender"       "smoking"      "Diabetes"    
## [21] "muscular"     "hypertension" "AtrialFib"    "IHD"

Predictor variables

  • Continuous: lung function (multiple), Age, Comorbid?
  • Categorical: COPDSeverity/ CAT

Examining relationships between continuous variables

my_data <- COPD[,c("AGE", "PackHistory", "FEV1", "FEV1PRED", "FVC", "CAT")]
cor_matrix <- cor(my_data) 
round(cor_matrix,2)
##               AGE PackHistory  FEV1 FEV1PRED   FVC   CAT
## AGE          1.00        0.00 -0.10     0.07 -0.15  0.08
## PackHistory  0.00        1.00 -0.13    -0.13 -0.09 -0.14
## FEV1        -0.10       -0.13  1.00     0.78  0.82 -0.06
## FEV1PRED     0.07       -0.13  0.78     1.00  0.52  0.01
## FVC         -0.15       -0.09  0.82     0.52  1.00 -0.16
## CAT          0.08       -0.14 -0.06     0.01 -0.16  1.00
pairs(~AGE + PackHistory + FEV1 + FEV1PRED + FVC + CAT, data = COPD)

Lung function variables (FEV1, FEV1PRED, FVC) have a high correlation, as to be expected. Linear regression models showed no significant differences between each variable. Given that it measures the same thing, FEV1 will be chosen.

AGE

hist(COPD$AGE)

summary(COPD$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    44.0    65.0    71.0    70.1    75.0    88.0

Assumptions of normal distribution are met.

PACK HISTORY/ SMOKING

table(COPD$smoking)
## 
##  1  2 
## 16 85
COPD$smoking <- COPD$smoking -1
table(COPD$smoking)
## 
##  0  1 
## 16 85
hist(COPD$PackHistory)

summary(COPD$PackHistory)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    20.0    36.0    39.7    54.0   109.0
min_value <- min(COPD$PackHistory, na.rm = TRUE)
max_value <- max(COPD$PackHistory, na.rm = TRUE)
width <- (max_value - min_value) / 3
breakpoints <- seq(min_value, max_value, by = width)
COPD$PackHistoryGroup <- cut(COPD$PackHistory, breaks = breakpoints, include.lowest = TRUE, labels = c("low", "med", "high"))
COPD$PackHistoryGroup <- factor(COPD$PackHistoryGroup, levels = c("low", "med", "high"), ordered = TRUE)

Equal width binning method used: - Pack History has a skewed (left) distribution, and i want to capture differences in groupings - Additional argument added to make it an ordinal variable

COPD$smoking <- factor(COPD$smoking)
table(COPD$smoking, COPD$PackHistoryGroup)
##    
##     low med high
##   0   8   7    1
##   1  44  30   11

COPDSEVERITY/ CAT

table(COPD$COPDSEVERITY)
## 
##        MILD    MODERATE      SEVERE VERY SEVERE 
##          23          43          27           8
COPD$COPDSEVERITY_new <- ifelse(COPD$COPDSEVERITY %in% c("SEVERE", "VERY SEVERE"), "Severe", COPD$COPDSEVERITY)
table(COPD$COPDSEVERITY_new)
## 
##     MILD MODERATE   Severe 
##       23       43       35

Categories for ‘Very Severe’ and ‘Severe’ was combined - ‘Very Severe’ was underrepresented had few observations, and may result in insufficient statistical power to detect an effect/ unreliable results - Models with few observations are not generalizable. Insufficient sample size.

hist(COPD$CAT)

COPD$CAT[COPD$CAT > 40] <- NA
hist(COPD$CAT)

CAT ranges from 0-40, and outlier is likely a mistake. Observation to be removed.

quantile_breaks <- quantile(COPD$CAT, probs = seq(0, 1, length.out = 4), na.rm = TRUE)
COPD$CATGroup <- cut(COPD$CAT, breaks = quantile_breaks, include.lowest = TRUE, labels = c("Low", "Medium", "High"))
table(COPD$CATGroup)
## 
##    Low Medium   High 
##     37     30     33
table(COPD$CATGroup, COPD$COPDSEVERITY_new)
##         
##          MILD MODERATE Severe
##   Low      10       20      7
##   Medium    7       15      8
##   High      5        8     20
c1 <- lm(COPD$HAD ~ COPD$CATGroup, data = COPD)
summary(c1)
## 
## Call:
## lm(formula = COPD$HAD ~ COPD$CATGroup, data = COPD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.976  -5.919  -1.233   3.767  39.224 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.919      1.243   5.568 2.30e-07 ***
## COPD$CATGroupMedium    3.314      1.857   1.785   0.0774 .  
## COPD$CATGroupHigh     10.057      1.810   5.556 2.41e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.559 on 97 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2459, Adjusted R-squared:  0.2303 
## F-statistic: 15.81 on 2 and 97 DF,  p-value: 1.139e-06
c2 <- lm(COPD$HAD ~ COPD$COPDSEVERITY_new)
summary(c2)
## 
## Call:
## lm(formula = COPD$HAD ~ COPD$COPDSEVERITY_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.029  -6.029  -2.029   3.971  45.843 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    10.3565     1.7864   5.797 8.21e-08 ***
## COPD$COPDSEVERITY_newMODERATE  -0.2402     2.2132  -0.109    0.914    
## COPD$COPDSEVERITY_newSevere     2.6720     2.2996   1.162    0.248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.567 on 98 degrees of freedom
## Multiple R-squared:  0.02492,    Adjusted R-squared:  0.005024 
## F-statistic: 1.252 on 2 and 98 DF,  p-value: 0.2903

Equal width binning approach is chosen instead of previous quantile binning. It divides the range of data points into equal bins, and allow for easy comparison with COPDSeverity.

CATgroup has a higher adjusted R^2 value, and explains for 20% of variability in the dataset. CAT will be used as predictor variable

plot(COPD$HAD ~ COPD$CATGroup)

Further investigations show that 1. There is an overlap in ranges, indicating similarity in HAD score between groups 2. ‘Severe’ groups still tend to have a higher score (this is further confirmed with a statistically significant P value in the linear regression model) 3. Median scores are generally higher for the high, compared to the low group.

CAT score will be used as predictor variable

# GENDER


```r
class(COPD$gender)
## [1] "integer"
COPD$gender <- as.factor(COPD$gender)
class(COPD$gender)
## [1] "factor"

R will treat variables with numbers as integers, it is important to treat it as a category

COMORBID

comorbid <- rep(0, nrow(COPD))
comorbid[COPD$Diabetes == 1 | COPD$muscular == 1 | COPD$AtrialFib == 1 | COPD$IHD == 1] <- 1
comorbid[is.na(comorbid)] <- 0
comorbid <- factor(comorbid)
COPD$comorbid <- comorbid
table(comorbid)
## comorbid
##  0  1 
## 53 48

After inspecting variables: the following predictor variables will be used - AGE - GENDER - Pack History - CAT Group - FEV1 - comorbid

Fitting the model

m1 <- lm(COPD$HAD ~ COPD$AGE + COPD$gender + COPD$PackHistoryGroup + COPD$CATGroup + COPD$FEV1 + COPD$comorbid )
summary(m1)
## 
## Call:
## lm(formula = COPD$HAD ~ COPD$AGE + COPD$gender + COPD$PackHistoryGroup + 
##     COPD$CATGroup + COPD$FEV1 + COPD$comorbid)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.069  -4.717  -1.157   3.186  38.254 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             24.03568    7.45209   3.225  0.00175 ** 
## COPD$AGE                -0.22428    0.09966  -2.250  0.02683 *  
## COPD$gender1            -2.57008    1.70130  -1.511  0.13434    
## COPD$PackHistoryGroup.L -0.28857    1.76139  -0.164  0.87023    
## COPD$PackHistoryGroup.Q -2.25847    1.42164  -1.589  0.11561    
## COPD$CATGroupMedium      4.43108    1.90195   2.330  0.02203 *  
## COPD$CATGroupHigh       10.00743    1.85792   5.386 5.59e-07 ***
## COPD$FEV1               -0.37612    1.22535  -0.307  0.75959    
## COPD$comorbid1           0.70113    1.57131   0.446  0.65651    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.389 on 91 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3241, Adjusted R-squared:  0.2646 
## F-statistic: 5.453 on 8 and 91 DF,  p-value: 1.27e-05
par(mfrow = c(2,2))
plot(m1)

vif_result <- vif(m1)
vif_result
##                           GVIF Df GVIF^(1/(2*Df))
## COPD$AGE              1.117043  1        1.056903
## COPD$gender           1.221541  1        1.105234
## COPD$PackHistoryGroup 1.186271  2        1.043629
## COPD$CATGroup         1.256965  2        1.058841
## COPD$FEV1             1.241114  1        1.114053
## COPD$comorbid         1.126585  1        1.061407

VIF results show low colinearity between my variables

Exploring further into interaction effects

m2 <- lm(COPD$HAD ~ COPD$PackHistoryGroup * COPD$CATGroup)
summary(m2)
## 
## Call:
## lm(formula = COPD$HAD ~ COPD$PackHistoryGroup * COPD$CATGroup)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.111  -5.111  -1.333   3.481  39.619 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                   6.0231     1.7094   3.524
## COPD$PackHistoryGroup.L                      -1.2571     3.3657  -0.373
## COPD$PackHistoryGroup.Q                      -4.4113     2.4907  -1.771
## COPD$CATGroupMedium                           4.0602     2.2538   1.801
## COPD$CATGroupHigh                            11.6520     2.4713   4.715
## COPD$PackHistoryGroup.L:COPD$CATGroupMedium   0.3732     4.3140   0.087
## COPD$PackHistoryGroup.Q:COPD$CATGroupMedium   4.1052     3.4450   1.192
## COPD$PackHistoryGroup.L:COPD$CATGroupHigh     3.2033     4.7355   0.676
## COPD$PackHistoryGroup.Q:COPD$CATGroupHigh     5.1021     3.7707   1.353
##                                             Pr(>|t|)    
## (Intercept)                                 0.000668 ***
## COPD$PackHistoryGroup.L                     0.709651    
## COPD$PackHistoryGroup.Q                     0.079887 .  
## COPD$CATGroupMedium                         0.074942 .  
## COPD$CATGroupHigh                           8.68e-06 ***
## COPD$PackHistoryGroup.L:COPD$CATGroupMedium 0.931253    
## COPD$PackHistoryGroup.Q:COPD$CATGroupMedium 0.236499    
## COPD$PackHistoryGroup.L:COPD$CATGroupHigh   0.500477    
## COPD$PackHistoryGroup.Q:COPD$CATGroupHigh   0.179375    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.633 on 91 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2787, Adjusted R-squared:  0.2152 
## F-statistic: 4.394 on 8 and 91 DF,  p-value: 0.0001609

Results indicate the interaction does not have a significant effect. however, the main effect of the predictor is still significant, especially at the higher groups.

trialing a few iterations

m3 <- lm(formula = COPD$HAD ~ COPD$CATGroup * COPD$AGE + COPD$PackHistory + COPD$comorbid, data = COPD)
m4 <- lm(formula = COPD$HAD ~ COPD$CATGroup * COPD$AGE + COPD$FEV1, data = COPD)
summary(m4)
## 
## Call:
## lm(formula = COPD$HAD ~ COPD$CATGroup * COPD$AGE + COPD$FEV1, 
##     data = COPD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.738  -5.261  -0.910   3.465  38.767 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  13.29147   10.42208   1.275   0.2054  
## COPD$CATGroupMedium          30.09140   16.92094   1.778   0.0786 .
## COPD$CATGroupHigh            23.47690   15.91317   1.475   0.1435  
## COPD$AGE                     -0.07225    0.14159  -0.510   0.6111  
## COPD$FEV1                    -0.78878    1.16524  -0.677   0.5001  
## COPD$CATGroupMedium:COPD$AGE -0.37215    0.23732  -1.568   0.1202  
## COPD$CATGroupHigh:COPD$AGE   -0.19937    0.22759  -0.876   0.3833  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.407 on 93 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3058, Adjusted R-squared:  0.261 
## F-statistic: 6.827 on 6 and 93 DF,  p-value: 5.05e-06
plot(m4)

Overall Linear regression model: - Despite not having a significant correlation value, m4 showed the highest r^2 squared value, with the model explaing 26% of the variance. - F statistic has a p value <0.05, indicating the model as a whole is still statistically significant

Linear equation: (given values from model) HAD=13.29147+30.09140×MediumCATGroup+23.47690×HighCATGroup−0.07225×AGE−0.78878×FEV1−0.37215×(MediumCATGroup×AGE)−0.19937×(HighCATGroup×AGE)

Hypothetical example of 60 year old man, Med CAT category, 2.5FEV1 score: HAD= 13.29147+(30.09140×1)+(23.47690×0)−(0.07225×60)−(0.78878×2.5)+(−0.37215×1×60)+(−0.19937×0×60) HAD= 15.75

Plots: 1. Residuals vs Fitted: no clear pattern which is good, with some outliers 2. Q-Q: data points follow the line, indicating normal distribution 3. Scale- location: spread increasing for larger values, possible indicating skewed/ hetrodascity in the dataset 4. Residuals vs Leverage: generally few points with high leverage, indicating there are few observations explaining variances in the model.

Confidence intervals: - Values crosses 0, and…… basically a better moel needs to be built.

Summary

In all, variables showing direct indication to severity of COPD (as opposed to contributing factors such as smoking/ comorbidities) has a higher predictive value to HAD scores. Interesting to note that the model improves on interaction with age, meaning that- As one ages, there is a higher impact on the severity of disease to their HAD scores.

To improve: - Further refinement to model can be made. Possible to play around with more interactions/ different categorical groupings?