Pearson VS Spearman

Pearson’s assumptions: - Ideally both variables to follow a normal distribution - Continuous - Random sample

Spearman’s assumptions: - Monotonic relationship (goes up/ down together, but not necessarily linear) - Less sensitive to extremes and able to work with smaller sample size - Continuous and ordinal variable.

Linear regression

Least square regression -> line that best fits the data (minimizes the residuals (observed - line))
Intercept/ Alpha: value when predictor variable is 0
- If significant (e.g. AGE on FEV1, the value is age = 0 is significantly different than if age =1. It should not be used as a normative data point, as the model is based off the sample, and not the population parameter)
Beta: average increase in outcome variable, for every unit increase in predictor variable e.g. If beta is 0.07, for every change in x, you will expect 0.07 increase in y
R^2 value: how well the model fits the data. However, very sensitive to increase in no. of variables. runs the risk of overfitting the data
Adjusted R^2: penalizes for increase in predictor variables. e.g. if R^2 value is 0.26, it explainf for 26% of variance in the model
Confidence interval -> 95% of population estimate would have these values e.g. 9-25
- If interval crosses 0 (-1.5 to 3), there is a chance the estimate is 0, and result is statistically insignificant.
- Interval also gives information to direction/ whether the data is skewed positively/ negatively
P value -> the probability of finding the observed/ extreme effect given the null hypothesis (no significant effect). At a 0.05 p value, there is a 5% chance the extreme will occur if there is no significant effect
Categorical variables: the ‘lowest’ category will be the intercept. e.g. for mild/ mod/ severe categories in COPD, ‘mild’ will be the intercept. *to remember, regression models always needs to be plotted against a continuous variable

Multiple regression/ multi variate model

Interactions: change in one variable affecting the other
Colinearity: multiple variables explaining the same thing. risk of overfitting the model.

Methods for fitting a model

Stepwise regression -> identifying predictors with greatest influence on outcome
Backwards -> start with a full stack model and remove predictors

Flaws: 1) assumes all variables are independent (rarely the case), 2) Biased SE, P values, 3) Lacks generalizability to population. parameters on model will be based on sample estimates.

Good practice steps:

Inspect variables

Continuous variables -> use summary statistics and histogram
Categorical variables -> tabulations Are there any missing values?

Examine relationship

continuous -> scatter plot matrices
categorical -> cross tabulations are there any associations?

Fit a simple LM with each variable

COPD Case Study

“Developing a model to predict Hospital Anxiety Depression(HAD) scores”

Loading Data and necessary libraries

COPD <- read.csv("data/COPD_student_dataset.csv")
library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.3.2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(gmodels)

## Warning: package 'gmodels' was built under R version 4.3.2

library(car)

## Warning: package 'car' was built under R version 4.3.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.2

Inspect variables

to view the first five rows
to view an overall description of all the variables- describe(COPD)
to view the dimensions (no. of rows/ columns)
to view column names

head(COPD)

##   X  ID AGE PackHistory COPDSEVERITY MWT1 MWT2 MWT1Best FEV1 FEV1PRED  FVC
## 1 1  58  77          60       SEVERE  120  120      120 1.21       36 2.40
## 2 2  57  79          50     MODERATE  165  176      176 1.09       56 1.64
## 3 3  62  80          11     MODERATE  201  180      201 1.52       68 2.30
## 4 4 145  56          60  VERY SEVERE  210  210      210 0.47       14 1.14
## 5 5 136  65          68       SEVERE  204  210      210 1.07       42 2.91
## 6 6  84  67          26     MODERATE  216  180      216 1.09       50 1.99
##   FVCPRED CAT HAD  SGRQ AGEquartiles copd gender smoking Diabetes muscular
## 1      98  25   8 69.55            4    3      1       2        1        0
## 2      65  12  21 44.24            4    2      0       2        1        0
## 3      86  22  18 44.09            4    2      0       2        1        0
## 4      27  28  26 62.04            1    4      1       2        0        0
## 5      98  32  18 75.56            1    3      1       2        0        1
## 6      60  29  21 73.82            2    2      0       1        1        0
##   hypertension AtrialFib IHD
## 1            0         1   0
## 2            0         1   1
## 3            0         1   0
## 4            1         1   0
## 5            1         0   0
## 6            0         1   0

dim(COPD)

## [1] 101  24

colnames(COPD)

##  [1] "X"            "ID"           "AGE"          "PackHistory"  "COPDSEVERITY"
##  [6] "MWT1"         "MWT2"         "MWT1Best"     "FEV1"         "FEV1PRED"    
## [11] "FVC"          "FVCPRED"      "CAT"          "HAD"          "SGRQ"        
## [16] "AGEquartiles" "copd"         "gender"       "smoking"      "Diabetes"    
## [21] "muscular"     "hypertension" "AtrialFib"    "IHD"

Predictor variables

Continuous: lung function (multiple), Age, Comorbid?
Categorical: COPDSeverity/ CAT

Examining relationships between continuous variables

my_data <- COPD[,c("AGE", "PackHistory", "FEV1", "FEV1PRED", "FVC", "CAT")]
cor_matrix <- cor(my_data) 
round(cor_matrix,2)

##               AGE PackHistory  FEV1 FEV1PRED   FVC   CAT
## AGE          1.00        0.00 -0.10     0.07 -0.15  0.08
## PackHistory  0.00        1.00 -0.13    -0.13 -0.09 -0.14
## FEV1        -0.10       -0.13  1.00     0.78  0.82 -0.06
## FEV1PRED     0.07       -0.13  0.78     1.00  0.52  0.01
## FVC         -0.15       -0.09  0.82     0.52  1.00 -0.16
## CAT          0.08       -0.14 -0.06     0.01 -0.16  1.00

pairs(~AGE + PackHistory + FEV1 + FEV1PRED + FVC + CAT, data = COPD)

Lung function variables (FEV1, FEV1PRED, FVC) have a high correlation, as to be expected. Linear regression models showed no significant differences between each variable. Given that it measures the same thing, FEV1 will be chosen.

AGE

hist(COPD$AGE)

summary(COPD$AGE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    44.0    65.0    71.0    70.1    75.0    88.0

Assumptions of normal distribution are met.

PACK HISTORY/ SMOKING

table(COPD$smoking)

## 
##  1  2 
## 16 85

COPD$smoking <- COPD$smoking -1
table(COPD$smoking)

## 
##  0  1 
## 16 85

Headers are changed to binary for more intuitive model building (0- ex/ non- smokers, 1- current smokers)
Pack History will be categorised into low, med, high for easy comparision *Pack History- no. of packs smoked a day x years smoked

hist(COPD$PackHistory)

summary(COPD$PackHistory)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    20.0    36.0    39.7    54.0   109.0

min_value <- min(COPD$PackHistory, na.rm = TRUE)
max_value <- max(COPD$PackHistory, na.rm = TRUE)
width <- (max_value - min_value) / 3
breakpoints <- seq(min_value, max_value, by = width)
COPD$PackHistoryGroup <- cut(COPD$PackHistory, breaks = breakpoints, include.lowest = TRUE, labels = c("low", "med", "high"))
COPD$PackHistoryGroup <- factor(COPD$PackHistoryGroup, levels = c("low", "med", "high"), ordered = TRUE)

Equal width binning method used: - Pack History has a skewed (left) distribution, and i want to capture differences in groupings - Additional argument added to make it an ordinal variable

COPD$smoking <- factor(COPD$smoking)
table(COPD$smoking, COPD$PackHistoryGroup)

##    
##     low med high
##   0   8   7    1
##   1  44  30   11

Unable to use Chi- Squared test to compare its association as it does not meet assumptions of >5 per column
Pack History gives more information, and may be useful to be compared with HAD. Hence, it will be used as my predictor variable

COPDSEVERITY/ CAT

table(COPD$COPDSEVERITY)

## 
##        MILD    MODERATE      SEVERE VERY SEVERE 
##          23          43          27           8

COPD$COPDSEVERITY_new <- ifelse(COPD$COPDSEVERITY %in% c("SEVERE", "VERY SEVERE"), "Severe", COPD$COPDSEVERITY)
table(COPD$COPDSEVERITY_new)

## 
##     MILD MODERATE   Severe 
##       23       43       35

Categories for ‘Very Severe’ and ‘Severe’ was combined - ‘Very Severe’ was underrepresented had few observations, and may result in insufficient statistical power to detect an effect/ unreliable results - Models with few observations are not generalizable. Insufficient sample size.

hist(COPD$CAT)

COPD$CAT[COPD$CAT > 40] <- NA
hist(COPD$CAT)

CAT ranges from 0-40, and outlier is likely a mistake. Observation to be removed.

quantile_breaks <- quantile(COPD$CAT, probs = seq(0, 1, length.out = 4), na.rm = TRUE)
COPD$CATGroup <- cut(COPD$CAT, breaks = quantile_breaks, include.lowest = TRUE, labels = c("Low", "Medium", "High"))
table(COPD$CATGroup)

## 
##    Low Medium   High 
##     37     30     33

table(COPD$CATGroup, COPD$COPDSEVERITY_new)

##         
##          MILD MODERATE Severe
##   Low      10       20      7
##   Medium    7       15      8
##   High      5        8     20

c1 <- lm(COPD$HAD ~ COPD$CATGroup, data = COPD)
summary(c1)

## 
## Call:
## lm(formula = COPD$HAD ~ COPD$CATGroup, data = COPD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.976  -5.919  -1.233   3.767  39.224 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.919      1.243   5.568 2.30e-07 ***
## COPD$CATGroupMedium    3.314      1.857   1.785   0.0774 .  
## COPD$CATGroupHigh     10.057      1.810   5.556 2.41e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.559 on 97 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2459, Adjusted R-squared:  0.2303 
## F-statistic: 15.81 on 2 and 97 DF,  p-value: 1.139e-06

c2 <- lm(COPD$HAD ~ COPD$COPDSEVERITY_new)
summary(c2)

## 
## Call:
## lm(formula = COPD$HAD ~ COPD$COPDSEVERITY_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.029  -6.029  -2.029   3.971  45.843 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    10.3565     1.7864   5.797 8.21e-08 ***
## COPD$COPDSEVERITY_newMODERATE  -0.2402     2.2132  -0.109    0.914    
## COPD$COPDSEVERITY_newSevere     2.6720     2.2996   1.162    0.248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.567 on 98 degrees of freedom
## Multiple R-squared:  0.02492,    Adjusted R-squared:  0.005024 
## F-statistic: 1.252 on 2 and 98 DF,  p-value: 0.2903

Equal width binning approach is chosen instead of previous quantile binning. It divides the range of data points into equal bins, and allow for easy comparison with COPDSeverity.

CATgroup has a higher adjusted R^2 value, and explains for 20% of variability in the dataset. CAT will be used as predictor variable

plot(COPD$HAD ~ COPD$CATGroup)

Further investigations show that 1. There is an overlap in ranges, indicating similarity in HAD score between groups 2. ‘Severe’ groups still tend to have a higher score (this is further confirmed with a statistically significant P value in the linear regression model) 3. Median scores are generally higher for the high, compared to the low group.

CAT score will be used as predictor variable

# GENDER


```r
class(COPD$gender)

## [1] "integer"

COPD$gender <- as.factor(COPD$gender)
class(COPD$gender)

## [1] "factor"

R will treat variables with numbers as integers, it is important to treat it as a category

COMORBID

comorbid <- rep(0, nrow(COPD))
comorbid[COPD$Diabetes == 1 | COPD$muscular == 1 | COPD$AtrialFib == 1 | COPD$IHD == 1] <- 1
comorbid[is.na(comorbid)] <- 0
comorbid <- factor(comorbid)
COPD$comorbid <- comorbid
table(comorbid)

## comorbid
##  0  1 
## 53 48

After inspecting variables: the following predictor variables will be used - AGE - GENDER - Pack History - CAT Group - FEV1 - comorbid

Fitting the model

m1 <- lm(COPD$HAD ~ COPD$AGE + COPD$gender + COPD$PackHistoryGroup + COPD$CATGroup + COPD$FEV1 + COPD$comorbid )
summary(m1)

## 
## Call:
## lm(formula = COPD$HAD ~ COPD$AGE + COPD$gender + COPD$PackHistoryGroup + 
##     COPD$CATGroup + COPD$FEV1 + COPD$comorbid)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.069  -4.717  -1.157   3.186  38.254 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             24.03568    7.45209   3.225  0.00175 ** 
## COPD$AGE                -0.22428    0.09966  -2.250  0.02683 *  
## COPD$gender1            -2.57008    1.70130  -1.511  0.13434    
## COPD$PackHistoryGroup.L -0.28857    1.76139  -0.164  0.87023    
## COPD$PackHistoryGroup.Q -2.25847    1.42164  -1.589  0.11561    
## COPD$CATGroupMedium      4.43108    1.90195   2.330  0.02203 *  
## COPD$CATGroupHigh       10.00743    1.85792   5.386 5.59e-07 ***
## COPD$FEV1               -0.37612    1.22535  -0.307  0.75959    
## COPD$comorbid1           0.70113    1.57131   0.446  0.65651    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.389 on 91 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3241, Adjusted R-squared:  0.2646 
## F-statistic: 5.453 on 8 and 91 DF,  p-value: 1.27e-05

par(mfrow = c(2,2))
plot(m1)

vif_result <- vif(m1)
vif_result

##                           GVIF Df GVIF^(1/(2*Df))
## COPD$AGE              1.117043  1        1.056903
## COPD$gender           1.221541  1        1.105234
## COPD$PackHistoryGroup 1.186271  2        1.043629
## COPD$CATGroup         1.256965  2        1.058841
## COPD$FEV1             1.241114  1        1.114053
## COPD$comorbid         1.126585  1        1.061407

VIF results show low colinearity between my variables

Exploring further into interaction effects

m2 <- lm(COPD$HAD ~ COPD$PackHistoryGroup * COPD$CATGroup)
summary(m2)

## 
## Call:
## lm(formula = COPD$HAD ~ COPD$PackHistoryGroup * COPD$CATGroup)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.111  -5.111  -1.333   3.481  39.619 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                   6.0231     1.7094   3.524
## COPD$PackHistoryGroup.L                      -1.2571     3.3657  -0.373
## COPD$PackHistoryGroup.Q                      -4.4113     2.4907  -1.771
## COPD$CATGroupMedium                           4.0602     2.2538   1.801
## COPD$CATGroupHigh                            11.6520     2.4713   4.715
## COPD$PackHistoryGroup.L:COPD$CATGroupMedium   0.3732     4.3140   0.087
## COPD$PackHistoryGroup.Q:COPD$CATGroupMedium   4.1052     3.4450   1.192
## COPD$PackHistoryGroup.L:COPD$CATGroupHigh     3.2033     4.7355   0.676
## COPD$PackHistoryGroup.Q:COPD$CATGroupHigh     5.1021     3.7707   1.353
##                                             Pr(>|t|)    
## (Intercept)                                 0.000668 ***
## COPD$PackHistoryGroup.L                     0.709651    
## COPD$PackHistoryGroup.Q                     0.079887 .  
## COPD$CATGroupMedium                         0.074942 .  
## COPD$CATGroupHigh                           8.68e-06 ***
## COPD$PackHistoryGroup.L:COPD$CATGroupMedium 0.931253    
## COPD$PackHistoryGroup.Q:COPD$CATGroupMedium 0.236499    
## COPD$PackHistoryGroup.L:COPD$CATGroupHigh   0.500477    
## COPD$PackHistoryGroup.Q:COPD$CATGroupHigh   0.179375    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.633 on 91 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2787, Adjusted R-squared:  0.2152 
## F-statistic: 4.394 on 8 and 91 DF,  p-value: 0.0001609

Results indicate the interaction does not have a significant effect. however, the main effect of the predictor is still significant, especially at the higher groups.

trialing a few iterations

m3 <- lm(formula = COPD$HAD ~ COPD$CATGroup * COPD$AGE + COPD$PackHistory + COPD$comorbid, data = COPD)
m4 <- lm(formula = COPD$HAD ~ COPD$CATGroup * COPD$AGE + COPD$FEV1, data = COPD)
summary(m4)

## 
## Call:
## lm(formula = COPD$HAD ~ COPD$CATGroup * COPD$AGE + COPD$FEV1, 
##     data = COPD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.738  -5.261  -0.910   3.465  38.767 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  13.29147   10.42208   1.275   0.2054  
## COPD$CATGroupMedium          30.09140   16.92094   1.778   0.0786 .
## COPD$CATGroupHigh            23.47690   15.91317   1.475   0.1435  
## COPD$AGE                     -0.07225    0.14159  -0.510   0.6111  
## COPD$FEV1                    -0.78878    1.16524  -0.677   0.5001  
## COPD$CATGroupMedium:COPD$AGE -0.37215    0.23732  -1.568   0.1202  
## COPD$CATGroupHigh:COPD$AGE   -0.19937    0.22759  -0.876   0.3833  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.407 on 93 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3058, Adjusted R-squared:  0.261 
## F-statistic: 6.827 on 6 and 93 DF,  p-value: 5.05e-06

plot(m4)

Overall Linear regression model: - Despite not having a significant correlation value, m4 showed the highest r^2 squared value, with the model explaing 26% of the variance. - F statistic has a p value <0.05, indicating the model as a whole is still statistically significant

Linear equation: (given values from model) HAD=13.29147+30.09140×MediumCATGroup+23.47690×HighCATGroup−0.07225×AGE−0.78878×FEV1−0.37215×(MediumCATGroup×AGE)−0.19937×(HighCATGroup×AGE)

Hypothetical example of 60 year old man, Med CAT category, 2.5FEV1 score: HAD= 13.29147+(30.09140×1)+(23.47690×0)−(0.07225×60)−(0.78878×2.5)+(−0.37215×1×60)+(−0.19937×0×60) HAD= 15.75

Plots: 1. Residuals vs Fitted: no clear pattern which is good, with some outliers 2. Q-Q: data points follow the line, indicating normal distribution 3. Scale- location: spread increasing for larger values, possible indicating skewed/ hetrodascity in the dataset 4. Residuals vs Leverage: generally few points with high leverage, indicating there are few observations explaining variances in the model.

Confidence intervals: - Values crosses 0, and…… basically a better moel needs to be built.

Summary

In all, variables showing direct indication to severity of COPD (as opposed to contributing factors such as smoking/ comorbidities) has a higher predictive value to HAD scores. Interesting to note that the model improves on interaction with age, meaning that- As one ages, there is a higher impact on the severity of disease to their HAD scores.

To improve: - Further refinement to model can be made. Possible to play around with more interactions/ different categorical groupings?