Abstract

Chronic Obstructive Pulmonary Disease (COPD) is a prevalent lung condition associated with significant morbidity and mortality worldwide. Predicting the quality of life (QOL) of COPD patients is crucial for treatment planning and improving patient outcomes. In this project, we explore the application of multiple linear regression (MLR) models to predict the QOL of COPD patients based on various patient characteristics. We demonstrate how MLR models can serve as valuable tools in clinical research and public health by providing insights into disease severity and patient outcomes.

Introduction

Linear regression models have long been utilized in clinical research and public health to analyze and predict health-related outcomes. In the context of chronic diseases like COPD, understanding the factors influencing patient quality of life is essential for optimizing treatment strategies and resource allocation.

This project focuses on developing MLR models to predict the quality of life (QOL) of COPD patients based on demographic, clinical, and functional parameters. By leveraging machine learning algorithms, particularly MLR, we aim to provide clinicians and researchers with predictive tools that facilitate personalized patient care and inform clinical decision-making.

The utilization of MLR models in public health and clinical research enables the identification of key predictors associated with disease severity and patient outcomes. Through data-driven approaches, we can uncover complex relationships between patient characteristics and QOL, leading to more targeted interventions and improved patient management strategies.

Through this project, we showcase the versatility and utility of MLR models in addressing clinical challenges and advancing our understanding of COPD and other chronic diseases. By integrating data analytics with clinical expertise, we strive to enhance patient care and contribute to the broader goals of population health management and disease prevention.

Dataset and Features

The population of interest is a cohort of 101 patients, who had been invited to take part in a rehabilitation program to help manage their COPD.

The dataset consists of 24 patient variables of which the following are of interest:

Patient Characteristics

Age: Age of the patient in years
Gender: Male or Female, encoded as 1 for male and 0 for Female
PackHistory: Patient’s pack-year history calculated by multiplying the number of packs of cigarettes smoked per day by the number of years the person has smoked. This gives us an indication of the patient’s exposure to tobacco.

Exercise Tolerance

MWT1Best: 6-Minute walking test result measured in meters. This is a clinical tool used to assess patients with COPD. It evaluates the subject’s exercise capacity.

Lung Function

FEV1: Forced Expiratory Volume in 1 second. Used in diagnosing COPD. Values lower than average suggest the presence of COPD.
FVC: Forced Vital Capacity, a measure of the maximum amount of air a person can exhale forcefully after a maximal inhalation.

Disease Severity

CAT score: COPD Assessment Test scores. Ranges from 0 to 40. Higher scores indicate COPD has a greater impact on the overall health and well-being of the patient.
SGRQ (Saint George’s Respiratory Questionnaire): Our target variable. It is a disease-specific instrument designed to measure COPD impact on overall health.

Depression and Anxiety

HAD: Hospital Anxiety and Depression Scale. Used to evaluate the subject’s anxiety or depression level.

Comorbidities

Each as a categorical variable: 0 for absence of the comorbidity and 1 for presence:

Diabetes
Muscular(muscular dystrophy)
Hypertension
AtrialFib (Atrial Fibrillation)
IHD (Ischaemic Heart Disease)

Background and Rationale for Selecting Features to Investigate

Chronic obstructive pulmonary disease (COPD) is a common lung disease that causes restricted airflow and breathing problems. It is sometimes called emphysema or chronic bronchitis.

In people with COPD, the lungs can get damaged or clogged with phlegm, impeding their ability to breathe effectively. Symptoms include cough, sometimes with phlegm, difficulty breathing, wheezing, and tiredness.

Tobacco smoking accounts for over 70% of COPD cases in high-income countries. In LMICs (Low and middle-income countries), tobacco smoking accounts for 30–40% of COPD cases, and household air pollution is a major risk factor. Nearly 90% of COPD deaths in those under 70 years of age occur in low- and middle-income countries (LMIC). COPD is the seventh leading cause of poor health worldwide.

People with COPD also have a higher risk for other health problems, including lung infections, lung cancer, heart problems, weak muscles, brittle bones, depression, and anxiety.

Common symptoms of COPD develop from mid-life onwards, indicating an established relationship with age. As COPD progresses, people find it more difficult to carry out their normal daily activities, often due to breathlessness. There may be a considerable financial burden due to limitation of workplace and home productivity and costs of medical treatment. This indicates a significant relationship between the severity of COPD and the patient’s quality of life.

The diagnosis of COPD is confirmed by spirometry when the forced expiratory volume in 1 s to forced vital capacity (FEV1/FVC) ratio is less than 70%.

Studies have shown good correlation between disease severity and quality of life (QOL) scores independent of the underlying physiologic markers measured by spirometry. Patients with poor QOL scores based on the St George’s Respiratory Questionnaire (SGRQ) are at greater risk of hospital readmission.

Another aspect of disease severity in COPD is exercise tolerance. Studies in pulmonary rehabilitation have shown that assessment of exercise tolerance correlates well with disease severity.

Project Methodology

Libraries Used

We make use of makes use of several R libraries in this project for data manipulation, visualization, and modeling. The following libraries were utilized:

tidyverse: For data manipulation and visualization.
corrplot: For creating correlation matrices.
dplyr: For data manipulation tasks.
GGally: For creating scatterplot matrices.
caret: For machine learning tasks such as data splitting and model evaluation.
ggplot2: For creating customizable plots and visualizations.
car: For calculating Variance Inflation Factor (VIF) in regression models.
glmnet: For performing regularized regression (Elastic Net, Lasso, and Ridge).
pls: For performing Principal Component Regression (PCR).
Metrics: For evaluating model performance metrics.
MASS: For stepwise regression analysis.

Data Pre-processing

Loading Data

# Load the data
copd_data <- read.csv("COPD_dataset.csv")

Exploratory Data Analysis

Data Structure

# View the structure of the data
str(copd_data)

## 'data.frame':    101 obs. of  24 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ID          : int  58 57 62 145 136 84 93 27 114 152 ...
##  $ AGE         : int  77 79 80 56 65 67 67 83 72 75 ...
##  $ PackHistory : num  60 50 11 60 68 26 50 90 50 6 ...
##  $ COPDSEVERITY: chr  "SEVERE" "MODERATE" "MODERATE" "VERY SEVERE" ...
##  $ MWT1        : int  120 165 201 210 204 216 214 214 231 226 ...
##  $ MWT2        : int  120 176 180 210 210 180 237 237 237 240 ...
##  $ MWT1Best    : int  120 176 201 210 210 216 237 237 237 240 ...
##  $ FEV1        : num  1.21 1.09 1.52 0.47 1.07 1.09 0.69 0.68 2.13 1.06 ...
##  $ FEV1PRED    : num  36 56 68 14 42 50 35 32 63 46 ...
##  $ FVC         : num  2.4 1.64 2.3 1.14 2.91 1.99 1.31 2.23 4.38 2.06 ...
##  $ FVCPRED     : int  98 65 86 27 98 60 48 77 80 75 ...
##  $ CAT         : int  25 12 22 28 32 29 29 22 25 31 ...
##  $ HAD         : num  8 21 18 26 18 21 30 2 6 20 ...
##  $ SGRQ        : num  69.5 44.2 44.1 62 75.6 ...
##  $ AGEquartiles: int  4 4 4 1 1 2 2 4 3 3 ...
##  $ copd        : int  3 2 2 4 3 2 3 3 2 3 ...
##  $ gender      : int  1 0 0 1 1 0 0 1 1 0 ...
##  $ smoking     : int  2 2 2 2 2 1 1 2 1 2 ...
##  $ Diabetes    : int  1 1 1 0 0 1 1 1 1 0 ...
##  $ muscular    : int  0 0 0 0 1 0 0 0 0 1 ...
##  $ hypertension: int  0 0 0 1 1 0 0 0 0 0 ...
##  $ AtrialFib   : int  1 1 1 1 0 1 1 1 1 0 ...
##  $ IHD         : int  0 1 0 0 0 0 0 0 0 0 ...

Summary Statistics

# View summary statistics of the data
summary(copd_data)

##        X             ID              AGE        PackHistory   
##  Min.   :  1   Min.   :  1.00   Min.   :44.0   Min.   :  1.0  
##  1st Qu.: 26   1st Qu.: 49.00   1st Qu.:65.0   1st Qu.: 20.0  
##  Median : 51   Median : 87.00   Median :71.0   Median : 36.0  
##  Mean   : 51   Mean   : 91.41   Mean   :70.1   Mean   : 39.7  
##  3rd Qu.: 76   3rd Qu.:143.00   3rd Qu.:75.0   3rd Qu.: 54.0  
##  Max.   :101   Max.   :169.00   Max.   :88.0   Max.   :109.0  
##                                                               
##  COPDSEVERITY            MWT1            MWT2          MWT1Best    
##  Length:101         Min.   :120.0   Min.   :120.0   Min.   :120.0  
##  Class :character   1st Qu.:300.0   1st Qu.:303.8   1st Qu.:303.8  
##  Mode  :character   Median :419.0   Median :399.0   Median :420.0  
##                     Mean   :385.9   Mean   :390.3   Mean   :399.1  
##                     3rd Qu.:460.5   3rd Qu.:459.0   3rd Qu.:465.2  
##                     Max.   :688.0   Max.   :699.0   Max.   :699.0  
##                     NA's   :2       NA's   :1       NA's   :1      
##       FEV1          FEV1PRED           FVC           FVCPRED      
##  Min.   :0.450   Min.   :  3.29   Min.   :1.140   Min.   : 27.00  
##  1st Qu.:1.100   1st Qu.: 42.00   1st Qu.:2.270   1st Qu.: 71.00  
##  Median :1.600   Median : 60.00   Median :2.770   Median : 84.00  
##  Mean   :1.604   Mean   : 58.53   Mean   :2.955   Mean   : 86.44  
##  3rd Qu.:1.960   3rd Qu.: 75.00   3rd Qu.:3.630   3rd Qu.:103.00  
##  Max.   :3.180   Max.   :102.00   Max.   :5.370   Max.   :132.00  
##                                                                   
##       CAT              HAD             SGRQ        AGEquartiles  
##  Min.   :  3.00   Min.   : 0.00   Min.   : 2.00   Min.   :1.000  
##  1st Qu.: 12.00   1st Qu.: 6.00   1st Qu.:28.41   1st Qu.:1.000  
##  Median : 18.00   Median :10.00   Median :38.21   Median :3.000  
##  Mean   : 19.34   Mean   :11.18   Mean   :40.19   Mean   :2.475  
##  3rd Qu.: 24.00   3rd Qu.:15.00   3rd Qu.:55.23   3rd Qu.:3.000  
##  Max.   :188.00   Max.   :56.20   Max.   :77.44   Max.   :4.000  
##                                                                  
##       copd           gender          smoking         Diabetes     
##  Min.   :1.000   Min.   :0.0000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000   Median :2.000   Median :0.0000  
##  Mean   :2.198   Mean   :0.6436   Mean   :1.842   Mean   :0.2079  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :4.000   Max.   :1.0000   Max.   :2.000   Max.   :1.0000  
##                                                                   
##     muscular       hypertension      AtrialFib          IHD         
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.000   Median :0.00000  
##  Mean   :0.1881   Mean   :0.1188   Mean   :0.198   Mean   :0.08911  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :1.00000  
##

Correlation Matrix

# Calculate correlations for continuous variables
# Create a data frame with continuous variables
data <- data.frame(
  age = copd_data$AGE,            # Your age variable
  pack_history = copd_data$PackHistory,   # Your pack history variable
  MWT1Best = copd_data$MWT1Best,       # Your MWT1Best variable
  FVC = copd_data$FVC,            # Your FVC variable
  FEV1 = copd_data$FEV1,           # Your FEV1 variable
  CAT_Score = copd_data$CAT,      # Your CAT_Score variable
  SGRQ = copd_data$SGRQ,           # Your SGRQ variable
  HAD = copd_data$HAD             # Your HAD variable
)

# Calculate correlations for continuous variables
cor_matrix <- cor(data)

# View the correlation matrix
print(cor_matrix)

##                       age pack_history MWT1Best         FVC        FEV1
## age           1.000000000 -0.001545507       NA -0.14522574 -0.10212241
## pack_history -0.001545507  1.000000000       NA -0.09307289 -0.13150514
## MWT1Best               NA           NA        1          NA          NA
## FVC          -0.145225737 -0.093072888       NA  1.00000000  0.82016501
## FEV1         -0.102122406 -0.131505136       NA  0.82016501  1.00000000
## CAT_Score     0.083361126 -0.143247732       NA -0.15885862 -0.06480429
## SGRQ         -0.139361205  0.032126419       NA -0.22009022 -0.30340676
## HAD          -0.227120404  0.027876353       NA -0.12968238 -0.14814655
##                CAT_Score        SGRQ         HAD
## age           0.08336113 -0.13936121 -0.22712040
## pack_history -0.14324773  0.03212642  0.02787635
## MWT1Best              NA          NA          NA
## FVC          -0.15885862 -0.22009022 -0.12968238
## FEV1         -0.06480429 -0.30340676 -0.14814655
## CAT_Score     1.00000000  0.28778173  0.16191871
## SGRQ          0.28778173  1.00000000  0.39579044
## HAD           0.16191871  0.39579044  1.00000000

# Correlation between independent variables and the outcome variable (SGRQ)
outcome_correlations <- cor_matrix["SGRQ", -length(data)]
print(outcome_correlations)

##          age pack_history     MWT1Best          FVC         FEV1    CAT_Score 
##  -0.13936121   0.03212642           NA  -0.22009022  -0.30340676   0.28778173 
##         SGRQ 
##   1.00000000

# Generate scatterplot matrix
pairs_plot <- ggpairs(data)

# Print the scatterplot matrix
print(pairs_plot)

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 rows containing non-finite values (`stat_density()`).

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).

Data Cleaning and Transformation

From the output of summary(), we observe that “MWT1BEst” has 1 missing Value,

The values for the “HAD” has a maximum value of “59” which is not a possible value.

Note: The Hospital Anxiety and Depression Scale (HAD) consists of two subscales: the HAD-Anxiety (HAD-A) and HAD-Depression (HAD-D) scales. Each subscale has a range of scores from 0 to 21, with higher scores indicating higher levels of anxiety or depression.

Considering both subscales, the total HAD score would be the sum of the scores from HAD-A and HAD-D. Therefore, the maximum possible total HAD score would be 42 (21 for anxiety + 21 for depression).

The CAT score also has a maximum value of 188, which is not a possible value.

Note: The COPD Assessment Test (CAT) is a questionnaire used to assess the impact of chronic obstructive pulmonary disease (COPD) on a patient’s health. The CAT score typically ranges from 0 to 40, with higher scores indicating a more significant impact of COPD on a patient’s life.

Summary of features with issues:

MWT1Best = Missing Value HAD = abnormal value : normal = 0 - 42 CAT = abnormal value : normal = 0 - 40

To fix these issues we will be imputing using the Median values of the respective features, since Medians are not affected by outliers and extreme values. Imputing with the Median also preserves the central tendency of the variable

# Check for missing values
colSums(is.na(copd_data))

##            X           ID          AGE  PackHistory COPDSEVERITY         MWT1 
##            0            0            0            0            0            2 
##         MWT2     MWT1Best         FEV1     FEV1PRED          FVC      FVCPRED 
##            1            1            0            0            0            0 
##          CAT          HAD         SGRQ AGEquartiles         copd       gender 
##            0            0            0            0            0            0 
##      smoking     Diabetes     muscular hypertension    AtrialFib          IHD 
##            0            0            0            0            0            0

# Remove any rows with missing values
copd_data <- na.omit(copd_data)

# Perform median imputation for the 'MWT1Best' column
copd_data <- copd_data %>%
  mutate(MWT1Best = ifelse(is.na(MWT1Best), median(MWT1Best, na.rm = TRUE), MWT1Best))

# Perform median imputation for values less than 0 or greater than 42 in the 'HAD' column
copd_data$HAD <- ifelse(copd_data$HAD < 0 | copd_data$HAD > 42, median(copd_data$HAD, na.rm = TRUE), copd_data$HAD)

# Perform median imputation for values less than 0 or greater than 40 in the 'CAT' column
copd_data$CAT <- ifelse(copd_data$CAT < 0 | copd_data$CAT > 40, median(copd_data$CAT, na.rm = TRUE), copd_data$CAT)

# Convert Gender to factor
copd_data$gender <- as.factor(copd_data$gender)

# Convert Comorbidities to factors
copd_data$Diabetes <- as.factor(copd_data$Diabetes)
copd_data$muscular <- as.factor(copd_data$muscular)
copd_data$hypertension <- as.factor(copd_data$hypertension)
copd_data$AtrialFib <- as.factor(copd_data$AtrialFib)
copd_data$IHD <- as.factor(copd_data$IHD)

# Check unique levels for categorical variables
unique(copd_data$gender)

## [1] 1 0
## Levels: 0 1

unique(copd_data$Diabetes)

## [1] 1 0
## Levels: 0 1

unique(copd_data$muscular)

## [1] 0 1
## Levels: 0 1

unique(copd_data$hypertension)

## [1] 0 1
## Levels: 0 1

unique(copd_data$AtrialFib)

## [1] 1 0
## Levels: 0 1

unique(copd_data$IHD)

## [1] 0 1
## Levels: 0 1

Model Development

#Feature Engineering

#combine comorbidities in a single feature "comorbid"

#Creating an empty vector of same length as our dataset
comorbid <- length(copd_data$Diabetes)

#we want comorbid = 1 when Diabetes OR muscular OR hypertension OR AtrialFib OR IHD = 1.

comorbid[copd_data$Diabetes == 1 | copd_data$muscular == 1 | copd_data$hypertension == 1 | copd_data$AtrialFib == 1 | copd_data$IHD == 1] <- 1  

#We also want comorbid = 0 when ALL above variables = 0
comorbid[is.na(comorbid)] <- 0 

#Convert to factor datatype
comorbid <- as.factor(comorbid)
copd_data <- cbind(copd_data, comorbid) #Binds New feature to the original dataframe

# Remove unused columns from dataframe 
copd_data <- copd_data[, !(names(copd_data) %in% c("COPDSEVERITY", "MWT1", "MWT2", "FEV1PRED", "FVCPRED", "AGEquartiles", "copd", "smoking", "Diabetes", "muscular", "hypertension", "AtrialFib", "IHD","X", "ID"))]

FEV1 and FVC are both measures of lung function and can be used to calculate the Tiffeneu-Pinelli index- a ratio commonly used to diagnose COPD

# Calculate the Tiffeneau-Pinelli (T-P) index (FEV1/FVC ratio)
copd_data$TP_index <- copd_data$FEV1 / copd_data$FVC

#Remove FEV1 and FVC from the dataframe

copd_data <- copd_data[, !(names(copd_data) %in% c("FEV1", "FVC"))]

calculate VIF

# Define our independent variables
independent_variables <- c("PackHistory", "MWT1Best", "TP_index", "CAT", "HAD", "comorbid")

# Create a formula for the linear model with all independent variables
formula <- as.formula(paste("SGRQ ~", paste(independent_variables, collapse = " + ")))

# Fit a linear model using lm()
lm_model <- lm(formula, data = copd_data)

# Calculate VIF for independent variables
vif_results <- car::vif(lm_model)

# Print VIF results
print(vif_results)

## PackHistory    MWT1Best    TP_index         CAT         HAD    comorbid 
##    1.133857    1.513758    1.140857    1.708508    1.547106    1.139653

#Split Data

# Set seed for reproducibility
set.seed(123)
splitIndex <- createDataPartition(copd_data$TP_index, p = 0.8, list = FALSE)
train_data <- copd_data[splitIndex, ]
test_data <- copd_data[-splitIndex, ]

Model 1: Baseline Model with All Features

# Fit the baseline model with all features
baseline_model <- lm(SGRQ ~ ., data = train_data)

# Summarize the baseline model
summary(baseline_model)

## 
## Call:
## lm(formula = SGRQ ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.979  -7.504   0.363   7.809  20.395 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58.32714   20.40591   2.858  0.00559 ** 
## AGE         -0.27473    0.19760  -1.390  0.16877    
## PackHistory -0.02405    0.05951  -0.404  0.68734    
## MWT1Best    -0.04734    0.01762  -2.686  0.00899 ** 
## CAT          1.09925    0.21564   5.098 2.74e-06 ***
## HAD          0.49358    0.23371   2.112  0.03821 *  
## gender1      3.21649    2.93178   1.097  0.27630    
## comorbid1   -0.42003    2.87834  -0.146  0.88439    
## TP_index    -9.43706   10.59166  -0.891  0.37594    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.83 on 71 degrees of freedom
## Multiple R-squared:  0.6255, Adjusted R-squared:  0.5833 
## F-statistic: 14.82 on 8 and 71 DF,  p-value: 1.627e-12

Model 2: Elastic Net Regularization

independent_variables_SGRQ <- c("PackHistory", "MWT1Best", "TP_index", "CAT", "HAD", "comorbid", "SGRQ")

# Create predictor matrix for Model 2
X_train_model2 <- model.matrix(SGRQ ~ ., data = train_data[, independent_variables_SGRQ])

# Extract response variable for Model 2
Y_train_model2 <- train_data$SGRQ

# Perform cross-validated Elastic Net regression
cv_model2 <- cv.glmnet(X_train_model2, Y_train_model2, alpha = 0.5, nfolds = 10)

# Get the best lambda value
best_lambda_model2 <- cv_model2$lambda.min

# Fit the final Elastic Net model with the best lambda
final_model2 <- glmnet(X_train_model2, Y_train_model2, alpha = 0.5, lambda = best_lambda_model2)

# Make predictions using the final model
predictions_model2 <- predict(final_model2, s = best_lambda_model2, newx = X_train_model2)

# Evaluate the model performance
model2_performance <- cor(Y_train_model2, predictions_model2)

Model 3: Lasso Regularization

# Create predictor matrix for Model 3
X_train_model3 <- model.matrix(SGRQ ~ ., data = train_data[, independent_variables_SGRQ])

# Perform cross-validated Lasso regression
cv_model3 <- cv.glmnet(X_train_model3, Y_train_model2, alpha = 1, nfolds = 10)

# Get the best lambda value
best_lambda_model3 <- cv_model3$lambda.min

# Fit the final Lasso model with the best lambda
final_model3 <- glmnet(X_train_model3, Y_train_model2, alpha = 1, lambda = best_lambda_model3)

# Make predictions using the final model
predictions_model3 <- predict(final_model3, s = best_lambda_model3, newx = X_train_model3)

# Evaluate the model performance
model3_performance <- cor(Y_train_model2, predictions_model3)

Model 4: Ridge Regularization

# Create predictor matrix for Model 4
X_train_model4 <- model.matrix(SGRQ ~ ., data = train_data[, independent_variables_SGRQ])

# Perform cross-validated Ridge regression
cv_model4 <- cv.glmnet(X_train_model4, Y_train_model2, alpha = 0, nfolds = 10)

# Get the best lambda value
best_lambda_model4 <- cv_model4$lambda.min

# Fit the final Ridge model with the best lambda
final_model4 <- glmnet(X_train_model4, Y_train_model2, alpha = 0, lambda = best_lambda_model4)

# Make predictions using the final model
predictions_model4 <- predict(final_model4, s = best_lambda_model4, newx = X_train_model4)

# Evaluate the model performance
model4_performance <- cor(Y_train_model2, predictions_model4)

Model 5: Stepwise Regression

# Fit stepwise regression model

# Load the MASS package
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

stepwise_model <- stepAIC(baseline_model, direction = "both")

## Start:  AIC=403.8
## SGRQ ~ AGE + PackHistory + MWT1Best + CAT + HAD + gender + comorbid + 
##     TP_index
## 
##               Df Sum of Sq     RSS    AIC
## - comorbid     1       3.0  9945.3 401.83
## - PackHistory  1      22.9  9965.2 401.99
## - TP_index     1     111.2 10053.5 402.69
## - gender       1     168.6 10110.9 403.15
## <none>                      9942.3 403.80
## - AGE          1     270.7 10213.0 403.95
## - HAD          1     624.6 10566.9 406.68
## - MWT1Best     1    1010.5 10952.8 409.55
## - CAT          1    3639.0 13581.3 426.75
## 
## Step:  AIC=401.83
## SGRQ ~ AGE + PackHistory + MWT1Best + CAT + HAD + gender + TP_index
## 
##               Df Sum of Sq     RSS    AIC
## - PackHistory  1      25.8  9971.1 400.03
## - TP_index     1     131.1 10076.5 400.87
## - gender       1     165.6 10110.9 401.15
## <none>                      9945.3 401.83
## - AGE          1     279.1 10224.4 402.04
## + comorbid     1       3.0  9942.3 403.80
## - HAD          1     621.7 10567.0 404.68
## - MWT1Best     1    1011.8 10957.2 407.58
## - CAT          1    3638.7 13584.1 424.77
## 
## Step:  AIC=400.03
## SGRQ ~ AGE + MWT1Best + CAT + HAD + gender + TP_index
## 
##               Df Sum of Sq     RSS    AIC
## - TP_index     1     132.3 10103.4 399.09
## - gender       1     160.9 10132.0 399.31
## <none>                      9971.1 400.03
## - AGE          1     264.1 10235.2 400.13
## + PackHistory  1      25.8  9945.3 401.83
## + comorbid     1       5.9  9965.2 401.99
## - HAD          1     621.3 10592.4 402.87
## - MWT1Best     1    1012.0 10983.1 405.77
## - CAT          1    3771.9 13743.0 423.70
## 
## Step:  AIC=399.09
## SGRQ ~ AGE + MWT1Best + CAT + HAD + gender
## 
##               Df Sum of Sq     RSS    AIC
## - gender       1     171.0 10274.5 398.43
## <none>                     10103.4 399.09
## - AGE          1     313.0 10416.4 399.53
## + TP_index     1     132.3  9971.1 400.03
## + comorbid     1      29.9 10073.5 400.85
## + PackHistory  1      26.9 10076.5 400.87
## - HAD          1     631.3 10734.7 401.94
## - MWT1Best     1    1131.3 11234.7 405.58
## - CAT          1    3807.7 13911.2 422.67
## 
## Step:  AIC=398.43
## SGRQ ~ AGE + MWT1Best + CAT + HAD
## 
##               Df Sum of Sq   RSS    AIC
## - AGE          1     227.5 10502 398.18
## <none>                     10274 398.43
## + gender       1     171.0 10103 399.09
## + TP_index     1     142.4 10132 399.31
## + PackHistory  1      22.0 10252 400.26
## + comorbid     1      15.8 10259 400.31
## - HAD          1     581.8 10856 400.84
## - MWT1Best     1     993.1 11268 403.81
## - CAT          1    4114.9 14389 423.38
## 
## Step:  AIC=398.18
## SGRQ ~ MWT1Best + CAT + HAD
## 
##               Df Sum of Sq   RSS    AIC
## <none>                     10502 398.18
## + AGE          1     227.5 10274 398.43
## + TP_index     1     182.9 10319 398.78
## + gender       1      85.5 10416 399.53
## + comorbid     1      36.6 10465 399.90
## + PackHistory  1       9.8 10492 400.11
## - MWT1Best     1     773.3 11275 401.87
## - HAD          1     881.4 11383 402.63
## - CAT          1    4417.1 14919 424.27

# Summarize the stepwise model
summary(stepwise_model)

## 
## Call:
## lm(formula = SGRQ ~ MWT1Best + CAT + HAD, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.015  -6.620   0.029   6.973  21.452 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.53299    7.90538   3.483 0.000826 ***
## MWT1Best    -0.03464    0.01464  -2.366 0.020547 *  
## CAT          1.18112    0.20891   5.654 2.62e-07 ***
## HAD          0.55799    0.22093   2.526 0.013632 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.76 on 76 degrees of freedom
## Multiple R-squared:  0.6044, Adjusted R-squared:  0.5888 
## F-statistic: 38.71 on 3 and 76 DF,  p-value: 2.726e-15

Model 6: Principal Component Regression (PCR)

# Perform Principal Component Regression
pcr_model <- pcr(SGRQ ~ ., data = train_data[, independent_variables_SGRQ], scale = TRUE, validation = "CV")

# Summarize the PCR model
summary(pcr_model)

## Data:    X dimension: 80 6 
##  Y dimension: 80 1
## Fit method: svdpc
## Number of components considered: 6
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           18.45    12.56    12.08    11.96    12.10    12.38    12.49
## adjCV        18.45    12.54    12.07    11.92    12.07    12.34    12.43
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X       34.21    55.89    73.00    84.01    93.15   100.00
## SGRQ    53.31    56.64    59.52    59.60    59.61    61.18

# Obtain the predicted values from the PCR model
predicted_values <- predict(pcr_model, newdata = train_data[, independent_variables])

# Extract the appropriate column from predicted_values
predicted_values_1 <- predicted_values[, , 1]


# Calculate the correlation between observed and predicted values
pcr_cor <- cor(Y_train_model2, predicted_values_1)

# Print the correlation
print(pcr_cor)

## [1] 0.7301355

Model Evaluation

Comparing Model Performances

# Create a dataframe to store model performances
model_performance_df <- data.frame(
  Model = c("Baseline", "Elastic Net", "Lasso", "Ridge", "Stepwise", "PCR"),
  Correlation = c( cor(Y_train_model2, baseline_model$fitted.values), model2_performance, model3_performance, model4_performance, cor(Y_train_model2, stepwise_model$fitted.values), pcr_cor))

# Print the model performances
print(model_performance_df)

##         Model Correlation
## 1    Baseline   0.7908865
## 2 Elastic Net   0.7814297
## 3       Lasso   0.7811700
## 4       Ridge   0.7816457
## 5    Stepwise   0.7774471
## 6         PCR   0.7301355

Visualizing Model Performances

# Create a bar plot of model performances
ggplot(model_performance_df, aes(x = Model, y = Correlation)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Model Performances", x = "Model", y = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusion

In this project, we developed multiple linear regression models to predict the Quality of Life (QOL) of patients with Chronic Obstructive Pulmonary Disease (COPD). We explored various modeling techniques including baseline regression, regularized regression, stepwise regression, and principal component regression. Each model had its own strengths and weaknesses in terms of predictive performance.

The Elastic Net model showed the highest correlation with the actual SGRQ scores, followed closely by the Lasso and Ridge regression models. Stepwise regression provided a simpler model with decent predictive power. Principal Component Regression (PCR) also demonstrated competitive performance compared to other models.

Further fine-tuning and validation of these models may be necessary to optimize their performance and generalizability for broader clinical use. Additionally, incorporating additional features or exploring different machine learning algorithms could potentially enhance the predictive accuracy of the models.

In summary, predictive modeling of COPD patient outcomes holds promise for improving patient care, treatment planning, and resource allocation in healthcare settings.

Contact of the Author

Michael Adu,PharmD

Email: mikekay262@gmail.com

Connect with me on LinkedIn: Michael Adu

Predictive Modeling of Quality of Life in COPD Patients- A Multiple Linear Regression Approach

Michael Adu, PharmD

2024-02-16