Chronic Obstructive Pulmonary Disease (COPD) is a prevalent lung condition associated with significant morbidity and mortality worldwide. Predicting the quality of life (QOL) of COPD patients is crucial for treatment planning and improving patient outcomes. In this project, we explore the application of multiple linear regression (MLR) models to predict the QOL of COPD patients based on various patient characteristics. We demonstrate how MLR models can serve as valuable tools in clinical research and public health by providing insights into disease severity and patient outcomes.
Linear regression models have long been utilized in clinical research and public health to analyze and predict health-related outcomes. In the context of chronic diseases like COPD, understanding the factors influencing patient quality of life is essential for optimizing treatment strategies and resource allocation.
This project focuses on developing MLR models to predict the quality of life (QOL) of COPD patients based on demographic, clinical, and functional parameters. By leveraging machine learning algorithms, particularly MLR, we aim to provide clinicians and researchers with predictive tools that facilitate personalized patient care and inform clinical decision-making.
The utilization of MLR models in public health and clinical research enables the identification of key predictors associated with disease severity and patient outcomes. Through data-driven approaches, we can uncover complex relationships between patient characteristics and QOL, leading to more targeted interventions and improved patient management strategies.
Through this project, we showcase the versatility and utility of MLR models in addressing clinical challenges and advancing our understanding of COPD and other chronic diseases. By integrating data analytics with clinical expertise, we strive to enhance patient care and contribute to the broader goals of population health management and disease prevention.
The population of interest is a cohort of 101 patients, who had been invited to take part in a rehabilitation program to help manage their COPD.
The dataset consists of 24 patient variables of which the following are of interest:
Each as a categorical variable: 0 for absence of the comorbidity and 1 for presence:
Chronic obstructive pulmonary disease (COPD) is a common lung disease that causes restricted airflow and breathing problems. It is sometimes called emphysema or chronic bronchitis.
In people with COPD, the lungs can get damaged or clogged with phlegm, impeding their ability to breathe effectively. Symptoms include cough, sometimes with phlegm, difficulty breathing, wheezing, and tiredness.
Tobacco smoking accounts for over 70% of COPD cases in high-income countries. In LMICs (Low and middle-income countries), tobacco smoking accounts for 30–40% of COPD cases, and household air pollution is a major risk factor. Nearly 90% of COPD deaths in those under 70 years of age occur in low- and middle-income countries (LMIC). COPD is the seventh leading cause of poor health worldwide.
People with COPD also have a higher risk for other health problems, including lung infections, lung cancer, heart problems, weak muscles, brittle bones, depression, and anxiety.
Common symptoms of COPD develop from mid-life onwards, indicating an established relationship with age. As COPD progresses, people find it more difficult to carry out their normal daily activities, often due to breathlessness. There may be a considerable financial burden due to limitation of workplace and home productivity and costs of medical treatment. This indicates a significant relationship between the severity of COPD and the patient’s quality of life.
The diagnosis of COPD is confirmed by spirometry when the forced expiratory volume in 1 s to forced vital capacity (FEV1/FVC) ratio is less than 70%.
Studies have shown good correlation between disease severity and quality of life (QOL) scores independent of the underlying physiologic markers measured by spirometry. Patients with poor QOL scores based on the St George’s Respiratory Questionnaire (SGRQ) are at greater risk of hospital readmission.
Another aspect of disease severity in COPD is exercise tolerance. Studies in pulmonary rehabilitation have shown that assessment of exercise tolerance correlates well with disease severity.
We make use of makes use of several R libraries in this project for data manipulation, visualization, and modeling. The following libraries were utilized:
## 'data.frame': 101 obs. of 24 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ID : int 58 57 62 145 136 84 93 27 114 152 ...
## $ AGE : int 77 79 80 56 65 67 67 83 72 75 ...
## $ PackHistory : num 60 50 11 60 68 26 50 90 50 6 ...
## $ COPDSEVERITY: chr "SEVERE" "MODERATE" "MODERATE" "VERY SEVERE" ...
## $ MWT1 : int 120 165 201 210 204 216 214 214 231 226 ...
## $ MWT2 : int 120 176 180 210 210 180 237 237 237 240 ...
## $ MWT1Best : int 120 176 201 210 210 216 237 237 237 240 ...
## $ FEV1 : num 1.21 1.09 1.52 0.47 1.07 1.09 0.69 0.68 2.13 1.06 ...
## $ FEV1PRED : num 36 56 68 14 42 50 35 32 63 46 ...
## $ FVC : num 2.4 1.64 2.3 1.14 2.91 1.99 1.31 2.23 4.38 2.06 ...
## $ FVCPRED : int 98 65 86 27 98 60 48 77 80 75 ...
## $ CAT : int 25 12 22 28 32 29 29 22 25 31 ...
## $ HAD : num 8 21 18 26 18 21 30 2 6 20 ...
## $ SGRQ : num 69.5 44.2 44.1 62 75.6 ...
## $ AGEquartiles: int 4 4 4 1 1 2 2 4 3 3 ...
## $ copd : int 3 2 2 4 3 2 3 3 2 3 ...
## $ gender : int 1 0 0 1 1 0 0 1 1 0 ...
## $ smoking : int 2 2 2 2 2 1 1 2 1 2 ...
## $ Diabetes : int 1 1 1 0 0 1 1 1 1 0 ...
## $ muscular : int 0 0 0 0 1 0 0 0 0 1 ...
## $ hypertension: int 0 0 0 1 1 0 0 0 0 0 ...
## $ AtrialFib : int 1 1 1 1 0 1 1 1 1 0 ...
## $ IHD : int 0 1 0 0 0 0 0 0 0 0 ...
## X ID AGE PackHistory
## Min. : 1 Min. : 1.00 Min. :44.0 Min. : 1.0
## 1st Qu.: 26 1st Qu.: 49.00 1st Qu.:65.0 1st Qu.: 20.0
## Median : 51 Median : 87.00 Median :71.0 Median : 36.0
## Mean : 51 Mean : 91.41 Mean :70.1 Mean : 39.7
## 3rd Qu.: 76 3rd Qu.:143.00 3rd Qu.:75.0 3rd Qu.: 54.0
## Max. :101 Max. :169.00 Max. :88.0 Max. :109.0
##
## COPDSEVERITY MWT1 MWT2 MWT1Best
## Length:101 Min. :120.0 Min. :120.0 Min. :120.0
## Class :character 1st Qu.:300.0 1st Qu.:303.8 1st Qu.:303.8
## Mode :character Median :419.0 Median :399.0 Median :420.0
## Mean :385.9 Mean :390.3 Mean :399.1
## 3rd Qu.:460.5 3rd Qu.:459.0 3rd Qu.:465.2
## Max. :688.0 Max. :699.0 Max. :699.0
## NA's :2 NA's :1 NA's :1
## FEV1 FEV1PRED FVC FVCPRED
## Min. :0.450 Min. : 3.29 Min. :1.140 Min. : 27.00
## 1st Qu.:1.100 1st Qu.: 42.00 1st Qu.:2.270 1st Qu.: 71.00
## Median :1.600 Median : 60.00 Median :2.770 Median : 84.00
## Mean :1.604 Mean : 58.53 Mean :2.955 Mean : 86.44
## 3rd Qu.:1.960 3rd Qu.: 75.00 3rd Qu.:3.630 3rd Qu.:103.00
## Max. :3.180 Max. :102.00 Max. :5.370 Max. :132.00
##
## CAT HAD SGRQ AGEquartiles
## Min. : 3.00 Min. : 0.00 Min. : 2.00 Min. :1.000
## 1st Qu.: 12.00 1st Qu.: 6.00 1st Qu.:28.41 1st Qu.:1.000
## Median : 18.00 Median :10.00 Median :38.21 Median :3.000
## Mean : 19.34 Mean :11.18 Mean :40.19 Mean :2.475
## 3rd Qu.: 24.00 3rd Qu.:15.00 3rd Qu.:55.23 3rd Qu.:3.000
## Max. :188.00 Max. :56.20 Max. :77.44 Max. :4.000
##
## copd gender smoking Diabetes
## Min. :1.000 Min. :0.0000 Min. :1.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000 Median :2.000 Median :0.0000
## Mean :2.198 Mean :0.6436 Mean :1.842 Mean :0.2079
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :1.0000 Max. :2.000 Max. :1.0000
##
## muscular hypertension AtrialFib IHD
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.000 Median :0.00000
## Mean :0.1881 Mean :0.1188 Mean :0.198 Mean :0.08911
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.00000
##
# Calculate correlations for continuous variables
# Create a data frame with continuous variables
data <- data.frame(
age = copd_data$AGE, # Your age variable
pack_history = copd_data$PackHistory, # Your pack history variable
MWT1Best = copd_data$MWT1Best, # Your MWT1Best variable
FVC = copd_data$FVC, # Your FVC variable
FEV1 = copd_data$FEV1, # Your FEV1 variable
CAT_Score = copd_data$CAT, # Your CAT_Score variable
SGRQ = copd_data$SGRQ, # Your SGRQ variable
HAD = copd_data$HAD # Your HAD variable
)
# Calculate correlations for continuous variables
cor_matrix <- cor(data)
# View the correlation matrix
print(cor_matrix)
## age pack_history MWT1Best FVC FEV1
## age 1.000000000 -0.001545507 NA -0.14522574 -0.10212241
## pack_history -0.001545507 1.000000000 NA -0.09307289 -0.13150514
## MWT1Best NA NA 1 NA NA
## FVC -0.145225737 -0.093072888 NA 1.00000000 0.82016501
## FEV1 -0.102122406 -0.131505136 NA 0.82016501 1.00000000
## CAT_Score 0.083361126 -0.143247732 NA -0.15885862 -0.06480429
## SGRQ -0.139361205 0.032126419 NA -0.22009022 -0.30340676
## HAD -0.227120404 0.027876353 NA -0.12968238 -0.14814655
## CAT_Score SGRQ HAD
## age 0.08336113 -0.13936121 -0.22712040
## pack_history -0.14324773 0.03212642 0.02787635
## MWT1Best NA NA NA
## FVC -0.15885862 -0.22009022 -0.12968238
## FEV1 -0.06480429 -0.30340676 -0.14814655
## CAT_Score 1.00000000 0.28778173 0.16191871
## SGRQ 0.28778173 1.00000000 0.39579044
## HAD 0.16191871 0.39579044 1.00000000
# Correlation between independent variables and the outcome variable (SGRQ)
outcome_correlations <- cor_matrix["SGRQ", -length(data)]
print(outcome_correlations)
## age pack_history MWT1Best FVC FEV1 CAT_Score
## -0.13936121 0.03212642 NA -0.22009022 -0.30340676 0.28778173
## SGRQ
## 1.00000000
# Generate scatterplot matrix
pairs_plot <- ggpairs(data)
# Print the scatterplot matrix
print(pairs_plot)
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).
From the output of summary(), we observe that “MWT1BEst” has 1 missing Value,
The values for the “HAD” has a maximum value of “59” which is not a possible value.
Note: The Hospital Anxiety and Depression Scale (HAD) consists of two subscales: the HAD-Anxiety (HAD-A) and HAD-Depression (HAD-D) scales. Each subscale has a range of scores from 0 to 21, with higher scores indicating higher levels of anxiety or depression.
Considering both subscales, the total HAD score would be the sum of the scores from HAD-A and HAD-D. Therefore, the maximum possible total HAD score would be 42 (21 for anxiety + 21 for depression).
The CAT score also has a maximum value of 188, which is not a possible value.
Note: The COPD Assessment Test (CAT) is a questionnaire used to assess the impact of chronic obstructive pulmonary disease (COPD) on a patient’s health. The CAT score typically ranges from 0 to 40, with higher scores indicating a more significant impact of COPD on a patient’s life.
Summary of features with issues:
MWT1Best = Missing Value HAD = abnormal value : normal = 0 - 42 CAT = abnormal value : normal = 0 - 40
To fix these issues we will be imputing using the Median values of the respective features, since Medians are not affected by outliers and extreme values. Imputing with the Median also preserves the central tendency of the variable
## X ID AGE PackHistory COPDSEVERITY MWT1
## 0 0 0 0 0 2
## MWT2 MWT1Best FEV1 FEV1PRED FVC FVCPRED
## 1 1 0 0 0 0
## CAT HAD SGRQ AGEquartiles copd gender
## 0 0 0 0 0 0
## smoking Diabetes muscular hypertension AtrialFib IHD
## 0 0 0 0 0 0
# Remove any rows with missing values
copd_data <- na.omit(copd_data)
# Perform median imputation for the 'MWT1Best' column
copd_data <- copd_data %>%
mutate(MWT1Best = ifelse(is.na(MWT1Best), median(MWT1Best, na.rm = TRUE), MWT1Best))
# Perform median imputation for values less than 0 or greater than 42 in the 'HAD' column
copd_data$HAD <- ifelse(copd_data$HAD < 0 | copd_data$HAD > 42, median(copd_data$HAD, na.rm = TRUE), copd_data$HAD)
# Perform median imputation for values less than 0 or greater than 40 in the 'CAT' column
copd_data$CAT <- ifelse(copd_data$CAT < 0 | copd_data$CAT > 40, median(copd_data$CAT, na.rm = TRUE), copd_data$CAT)
# Convert Gender to factor
copd_data$gender <- as.factor(copd_data$gender)
# Convert Comorbidities to factors
copd_data$Diabetes <- as.factor(copd_data$Diabetes)
copd_data$muscular <- as.factor(copd_data$muscular)
copd_data$hypertension <- as.factor(copd_data$hypertension)
copd_data$AtrialFib <- as.factor(copd_data$AtrialFib)
copd_data$IHD <- as.factor(copd_data$IHD)
# Check unique levels for categorical variables
unique(copd_data$gender)
## [1] 1 0
## Levels: 0 1
## [1] 1 0
## Levels: 0 1
## [1] 0 1
## Levels: 0 1
## [1] 0 1
## Levels: 0 1
## [1] 1 0
## Levels: 0 1
## [1] 0 1
## Levels: 0 1
#Feature Engineering
#combine comorbidities in a single feature "comorbid"
#Creating an empty vector of same length as our dataset
comorbid <- length(copd_data$Diabetes)
#we want comorbid = 1 when Diabetes OR muscular OR hypertension OR AtrialFib OR IHD = 1.
comorbid[copd_data$Diabetes == 1 | copd_data$muscular == 1 | copd_data$hypertension == 1 | copd_data$AtrialFib == 1 | copd_data$IHD == 1] <- 1
#We also want comorbid = 0 when ALL above variables = 0
comorbid[is.na(comorbid)] <- 0
#Convert to factor datatype
comorbid <- as.factor(comorbid)
copd_data <- cbind(copd_data, comorbid) #Binds New feature to the original dataframe
# Remove unused columns from dataframe
copd_data <- copd_data[, !(names(copd_data) %in% c("COPDSEVERITY", "MWT1", "MWT2", "FEV1PRED", "FVCPRED", "AGEquartiles", "copd", "smoking", "Diabetes", "muscular", "hypertension", "AtrialFib", "IHD","X", "ID"))]
FEV1 and FVC are both measures of lung function and can be used to calculate the Tiffeneu-Pinelli index- a ratio commonly used to diagnose COPD
# Calculate the Tiffeneau-Pinelli (T-P) index (FEV1/FVC ratio)
copd_data$TP_index <- copd_data$FEV1 / copd_data$FVC
#Remove FEV1 and FVC from the dataframe
calculate VIF
# Define our independent variables
independent_variables <- c("PackHistory", "MWT1Best", "TP_index", "CAT", "HAD", "comorbid")
# Create a formula for the linear model with all independent variables
formula <- as.formula(paste("SGRQ ~", paste(independent_variables, collapse = " + ")))
# Fit a linear model using lm()
lm_model <- lm(formula, data = copd_data)
# Calculate VIF for independent variables
vif_results <- car::vif(lm_model)
# Print VIF results
print(vif_results)
## PackHistory MWT1Best TP_index CAT HAD comorbid
## 1.133857 1.513758 1.140857 1.708508 1.547106 1.139653
#Split Data
# Set seed for reproducibility
set.seed(123)
splitIndex <- createDataPartition(copd_data$TP_index, p = 0.8, list = FALSE)
train_data <- copd_data[splitIndex, ]
test_data <- copd_data[-splitIndex, ]
# Fit the baseline model with all features
baseline_model <- lm(SGRQ ~ ., data = train_data)
# Summarize the baseline model
summary(baseline_model)
##
## Call:
## lm(formula = SGRQ ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.979 -7.504 0.363 7.809 20.395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.32714 20.40591 2.858 0.00559 **
## AGE -0.27473 0.19760 -1.390 0.16877
## PackHistory -0.02405 0.05951 -0.404 0.68734
## MWT1Best -0.04734 0.01762 -2.686 0.00899 **
## CAT 1.09925 0.21564 5.098 2.74e-06 ***
## HAD 0.49358 0.23371 2.112 0.03821 *
## gender1 3.21649 2.93178 1.097 0.27630
## comorbid1 -0.42003 2.87834 -0.146 0.88439
## TP_index -9.43706 10.59166 -0.891 0.37594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.83 on 71 degrees of freedom
## Multiple R-squared: 0.6255, Adjusted R-squared: 0.5833
## F-statistic: 14.82 on 8 and 71 DF, p-value: 1.627e-12
independent_variables_SGRQ <- c("PackHistory", "MWT1Best", "TP_index", "CAT", "HAD", "comorbid", "SGRQ")
# Create predictor matrix for Model 2
X_train_model2 <- model.matrix(SGRQ ~ ., data = train_data[, independent_variables_SGRQ])
# Extract response variable for Model 2
Y_train_model2 <- train_data$SGRQ
# Perform cross-validated Elastic Net regression
cv_model2 <- cv.glmnet(X_train_model2, Y_train_model2, alpha = 0.5, nfolds = 10)
# Get the best lambda value
best_lambda_model2 <- cv_model2$lambda.min
# Fit the final Elastic Net model with the best lambda
final_model2 <- glmnet(X_train_model2, Y_train_model2, alpha = 0.5, lambda = best_lambda_model2)
# Make predictions using the final model
predictions_model2 <- predict(final_model2, s = best_lambda_model2, newx = X_train_model2)
# Evaluate the model performance
model2_performance <- cor(Y_train_model2, predictions_model2)
# Create predictor matrix for Model 3
X_train_model3 <- model.matrix(SGRQ ~ ., data = train_data[, independent_variables_SGRQ])
# Perform cross-validated Lasso regression
cv_model3 <- cv.glmnet(X_train_model3, Y_train_model2, alpha = 1, nfolds = 10)
# Get the best lambda value
best_lambda_model3 <- cv_model3$lambda.min
# Fit the final Lasso model with the best lambda
final_model3 <- glmnet(X_train_model3, Y_train_model2, alpha = 1, lambda = best_lambda_model3)
# Make predictions using the final model
predictions_model3 <- predict(final_model3, s = best_lambda_model3, newx = X_train_model3)
# Evaluate the model performance
model3_performance <- cor(Y_train_model2, predictions_model3)
# Create predictor matrix for Model 4
X_train_model4 <- model.matrix(SGRQ ~ ., data = train_data[, independent_variables_SGRQ])
# Perform cross-validated Ridge regression
cv_model4 <- cv.glmnet(X_train_model4, Y_train_model2, alpha = 0, nfolds = 10)
# Get the best lambda value
best_lambda_model4 <- cv_model4$lambda.min
# Fit the final Ridge model with the best lambda
final_model4 <- glmnet(X_train_model4, Y_train_model2, alpha = 0, lambda = best_lambda_model4)
# Make predictions using the final model
predictions_model4 <- predict(final_model4, s = best_lambda_model4, newx = X_train_model4)
# Evaluate the model performance
model4_performance <- cor(Y_train_model2, predictions_model4)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Start: AIC=403.8
## SGRQ ~ AGE + PackHistory + MWT1Best + CAT + HAD + gender + comorbid +
## TP_index
##
## Df Sum of Sq RSS AIC
## - comorbid 1 3.0 9945.3 401.83
## - PackHistory 1 22.9 9965.2 401.99
## - TP_index 1 111.2 10053.5 402.69
## - gender 1 168.6 10110.9 403.15
## <none> 9942.3 403.80
## - AGE 1 270.7 10213.0 403.95
## - HAD 1 624.6 10566.9 406.68
## - MWT1Best 1 1010.5 10952.8 409.55
## - CAT 1 3639.0 13581.3 426.75
##
## Step: AIC=401.83
## SGRQ ~ AGE + PackHistory + MWT1Best + CAT + HAD + gender + TP_index
##
## Df Sum of Sq RSS AIC
## - PackHistory 1 25.8 9971.1 400.03
## - TP_index 1 131.1 10076.5 400.87
## - gender 1 165.6 10110.9 401.15
## <none> 9945.3 401.83
## - AGE 1 279.1 10224.4 402.04
## + comorbid 1 3.0 9942.3 403.80
## - HAD 1 621.7 10567.0 404.68
## - MWT1Best 1 1011.8 10957.2 407.58
## - CAT 1 3638.7 13584.1 424.77
##
## Step: AIC=400.03
## SGRQ ~ AGE + MWT1Best + CAT + HAD + gender + TP_index
##
## Df Sum of Sq RSS AIC
## - TP_index 1 132.3 10103.4 399.09
## - gender 1 160.9 10132.0 399.31
## <none> 9971.1 400.03
## - AGE 1 264.1 10235.2 400.13
## + PackHistory 1 25.8 9945.3 401.83
## + comorbid 1 5.9 9965.2 401.99
## - HAD 1 621.3 10592.4 402.87
## - MWT1Best 1 1012.0 10983.1 405.77
## - CAT 1 3771.9 13743.0 423.70
##
## Step: AIC=399.09
## SGRQ ~ AGE + MWT1Best + CAT + HAD + gender
##
## Df Sum of Sq RSS AIC
## - gender 1 171.0 10274.5 398.43
## <none> 10103.4 399.09
## - AGE 1 313.0 10416.4 399.53
## + TP_index 1 132.3 9971.1 400.03
## + comorbid 1 29.9 10073.5 400.85
## + PackHistory 1 26.9 10076.5 400.87
## - HAD 1 631.3 10734.7 401.94
## - MWT1Best 1 1131.3 11234.7 405.58
## - CAT 1 3807.7 13911.2 422.67
##
## Step: AIC=398.43
## SGRQ ~ AGE + MWT1Best + CAT + HAD
##
## Df Sum of Sq RSS AIC
## - AGE 1 227.5 10502 398.18
## <none> 10274 398.43
## + gender 1 171.0 10103 399.09
## + TP_index 1 142.4 10132 399.31
## + PackHistory 1 22.0 10252 400.26
## + comorbid 1 15.8 10259 400.31
## - HAD 1 581.8 10856 400.84
## - MWT1Best 1 993.1 11268 403.81
## - CAT 1 4114.9 14389 423.38
##
## Step: AIC=398.18
## SGRQ ~ MWT1Best + CAT + HAD
##
## Df Sum of Sq RSS AIC
## <none> 10502 398.18
## + AGE 1 227.5 10274 398.43
## + TP_index 1 182.9 10319 398.78
## + gender 1 85.5 10416 399.53
## + comorbid 1 36.6 10465 399.90
## + PackHistory 1 9.8 10492 400.11
## - MWT1Best 1 773.3 11275 401.87
## - HAD 1 881.4 11383 402.63
## - CAT 1 4417.1 14919 424.27
##
## Call:
## lm(formula = SGRQ ~ MWT1Best + CAT + HAD, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.015 -6.620 0.029 6.973 21.452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.53299 7.90538 3.483 0.000826 ***
## MWT1Best -0.03464 0.01464 -2.366 0.020547 *
## CAT 1.18112 0.20891 5.654 2.62e-07 ***
## HAD 0.55799 0.22093 2.526 0.013632 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.76 on 76 degrees of freedom
## Multiple R-squared: 0.6044, Adjusted R-squared: 0.5888
## F-statistic: 38.71 on 3 and 76 DF, p-value: 2.726e-15
# Perform Principal Component Regression
pcr_model <- pcr(SGRQ ~ ., data = train_data[, independent_variables_SGRQ], scale = TRUE, validation = "CV")
# Summarize the PCR model
summary(pcr_model)
## Data: X dimension: 80 6
## Y dimension: 80 1
## Fit method: svdpc
## Number of components considered: 6
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 18.45 12.56 12.08 11.96 12.10 12.38 12.49
## adjCV 18.45 12.54 12.07 11.92 12.07 12.34 12.43
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 34.21 55.89 73.00 84.01 93.15 100.00
## SGRQ 53.31 56.64 59.52 59.60 59.61 61.18
# Obtain the predicted values from the PCR model
predicted_values <- predict(pcr_model, newdata = train_data[, independent_variables])
# Extract the appropriate column from predicted_values
predicted_values_1 <- predicted_values[, , 1]
# Calculate the correlation between observed and predicted values
pcr_cor <- cor(Y_train_model2, predicted_values_1)
# Print the correlation
print(pcr_cor)
## [1] 0.7301355
# Create a dataframe to store model performances
model_performance_df <- data.frame(
Model = c("Baseline", "Elastic Net", "Lasso", "Ridge", "Stepwise", "PCR"),
Correlation = c( cor(Y_train_model2, baseline_model$fitted.values), model2_performance, model3_performance, model4_performance, cor(Y_train_model2, stepwise_model$fitted.values), pcr_cor))
# Print the model performances
print(model_performance_df)
## Model Correlation
## 1 Baseline 0.7908865
## 2 Elastic Net 0.7814297
## 3 Lasso 0.7811700
## 4 Ridge 0.7816457
## 5 Stepwise 0.7774471
## 6 PCR 0.7301355
# Create a bar plot of model performances
ggplot(model_performance_df, aes(x = Model, y = Correlation)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Model Performances", x = "Model", y = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
In this project, we developed multiple linear regression models to predict the Quality of Life (QOL) of patients with Chronic Obstructive Pulmonary Disease (COPD). We explored various modeling techniques including baseline regression, regularized regression, stepwise regression, and principal component regression. Each model had its own strengths and weaknesses in terms of predictive performance.
The Elastic Net model showed the highest correlation with the actual SGRQ scores, followed closely by the Lasso and Ridge regression models. Stepwise regression provided a simpler model with decent predictive power. Principal Component Regression (PCR) also demonstrated competitive performance compared to other models.
Further fine-tuning and validation of these models may be necessary to optimize their performance and generalizability for broader clinical use. Additionally, incorporating additional features or exploring different machine learning algorithms could potentially enhance the predictive accuracy of the models.
In summary, predictive modeling of COPD patient outcomes holds promise for improving patient care, treatment planning, and resource allocation in healthcare settings.