Abstract
St. George respiratory questionnaire (SGRQ) is the COPD disease specific for assest quality of life in COPD patients,the lower SGRQ score mean better quality of life.I found that SGRQ score of COPD patient in this dataset can predicted by many variables in same dataset by creating multiple regression. I’m interested in patient’s age, CAT score, number of comorbidities, gender and smoking status as predictors. The multiple regression show that only patient’s age has negative correlation with SGRQ score, that mean the older COPD patients have better quality of life than younger. Others predictors in model have positive correlation with SGRQ score, which mean the higher the predictors, the worsen quality of life of that patient.
This article is a part of linear regression in R in public health course.The study question of this article is "What are the factors that predict quality of life in COPD patients?"
Step 0 : Import dataset into Rstudio.
copd<- read.csv("COPD.csv")
Step 1 : Inspect dataset.
library(DT)
datatable(copd)
dim(copd)
## [1] 101 24
There are 24 variables and 101 observations.
copd[rowSums(is.na(copd))!=0,3:8]
## AGE PackHistory COPDSEVERITY MWT1 MWT2 MWT1Best
## 57 60 14 MODERATE NA 438 438
## 101 78 55 MODERATE NA NA NA
There are 2 observations,57th and 101st, with NA value.
I interested in using age, gender, smoking status ,CAT score and number of comorbidities as predictors in this model.While smoking status and gender are binary,age and CAT score are continuous.Number of comorbidities is an ordinal.
str(copd$gender)
## int [1:101] 1 0 0 1 1 0 0 1 1 0 ...
str(copd$smoking)
## int [1:101] 2 2 2 2 2 1 1 2 1 2 ...
gender = 0 if female and 1 if male ; smoking = 1 if current and 2 if ex-smoking.
copd$smoking[copd$smoking==2]<-0 ##Assign 0 instead of 2 to smoking variable
##Setting binary factors.
copd$gender<-factor(copd$gender)
copd$smoking<-factor(copd$smoking)
str(copd$smoking)
## Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 2 1 ...
Now smoking = 0 if ex-smoking and 1 if current smoking.
Create new variable, “num_comorbid”, that is the sum of COPD patients comorbidities.
library(dplyr)
copd<- copd %>%
rowwise()%>%
mutate(num_comorbid =sum(c(Diabetes,muscular,hypertension,AtrialFib,IHD),na.rm = TRUE))
summary(copd$num_comorbid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 0.802 1.000 3.000
hist(copd$num_comorbid, main = "Histogram of number of patient comobidities",
xlab = "Number of patients comorbidities", col = rgb(0.6,0,0,0.7))
Histogram show that number of comorbidities is not the normal distribution.
summary(copd$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 44.0 65.0 71.0 70.1 75.0 88.0
hist(copd$AGE, main= "Histogram of age", xlab = "age", col =rgb(0,0.8,0,0.3), breaks = 20)
shapiro.test(copd$AGE)
##
## Shapiro-Wilk normality test
##
## data: copd$AGE
## W = 0.96765, p-value = 0.01394
AGE variable is skewed to left.
summary(copd$CAT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 18.00 19.34 24.00 188.00
hist(copd$CAT, main="Histogram of CAT score", xlab = "CAT score", col=rgb(0,0,0.3,0.3),breaks = 100)
shapiro.test(copd$CAT)
##
## Shapiro-Wilk normality test
##
## data: copd$CAT
## W = 0.41776, p-value < 2.2e-16
CAT variable is not approximate with normal distribution and has the outlier. Because of COPD Assessment Test (CAT) score is ranging from 0-40, so 188 is an error.
copd$CAT[copd$CAT>40]<-NA ## Assign NA to the error value.
hist(copd$CAT, main="Histogram of CAT score", xlab = "CAT score", col=rgb(0,0,0.3,0.3),breaks =10) ##Plot new CAT variable after handling the outlier.
shapiro.test(copd$CAT)
##
## Shapiro-Wilk normality test
##
## data: copd$CAT
## W = 0.9662, p-value = 0.0114
Both AGE and CAT are not normal distribution.
library(Hmisc)
describe(copd$gender)
## copd$gender
## n missing distinct
## 101 0 2
##
## Value 0 1
## Frequency 36 65
## Proportion 0.356 0.644
describe(copd$smoking)
## copd$smoking
## n missing distinct
## 101 0 2
##
## Value 0 1
## Frequency 85 16
## Proportion 0.842 0.158
Step 2 : Examine the relationship between candidate predictors.
cor.test(copd$AGE,copd$CAT,method = "spearman", use = "complete.obs")
##
## Spearman's rank correlation rho
##
## data: copd$AGE and copd$CAT
## S = 171834, p-value = 0.7587
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.03110844
There is no significant relation between AGE and CAT.
cor.test(copd$AGE,copd$num_comorbid,method = "spearman", use = "complete.obs")
##
## Spearman's rank correlation rho
##
## data: copd$AGE and copd$num_comorbid
## S = 149204, p-value = 0.1916
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.131021
There is no significant relation between AGE and num_comorbid.
library(gmodels)
CrossTable(copd$gender,copd$smoking)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 101
##
##
## | copd$smoking
## copd$gender | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 29 | 7 | 36 |
## | 0.056 | 0.295 | |
## | 0.806 | 0.194 | 0.356 |
## | 0.341 | 0.438 | |
## | 0.287 | 0.069 | |
## -------------|-----------|-----------|-----------|
## 1 | 56 | 9 | 65 |
## | 0.031 | 0.163 | |
## | 0.862 | 0.138 | 0.644 |
## | 0.659 | 0.562 | |
## | 0.554 | 0.089 | |
## -------------|-----------|-----------|-----------|
## Column Total | 85 | 16 | 101 |
## | 0.842 | 0.158 | |
## -------------|-----------|-----------|-----------|
##
##
There is no significant relation between gender and smoking.
EDA after cleaning data
Exploratory data analysis by simple plot show that only SGRQ and AGE have negative correlation.
Other predictors have positive correlation with SGRQ.
Step 3 : Fitting simple regression model.
3.1 Number of comorbidities.
comorbid <- lm(copd$SGRQ~copd$num_comorbid)
summary(comorbid)
##
## Call:
## lm(formula = copd$SGRQ ~ copd$num_comorbid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.149 -9.621 -2.719 11.594 40.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.286 2.433 14.917 <2e-16 ***
## copd$num_comorbid 4.862 2.074 2.345 0.021 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.84 on 99 degrees of freedom
## Multiple R-squared: 0.05262, Adjusted R-squared: 0.04305
## F-statistic: 5.499 on 1 and 99 DF, p-value: 0.02103
confint(comorbid)
## 2.5 % 97.5 %
## (Intercept) 31.459566 41.113092
## copd$num_comorbid 0.747929 8.976781
3.2 Age.
age <- lm(copd$SGRQ~copd$AGE)
summary(age)
##
## Call:
## lm(formula = copd$SGRQ ~ copd$AGE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.827 -11.486 -0.834 13.365 36.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.7440 16.2106 3.871 0.000195 ***
## copd$AGE -0.3218 0.2298 -1.400 0.164552
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.15 on 99 degrees of freedom
## Multiple R-squared: 0.01942, Adjusted R-squared: 0.009517
## F-statistic: 1.961 on 1 and 99 DF, p-value: 0.1646
confint(age)
## 2.5 % 97.5 %
## (Intercept) 30.5787348 94.9092336
## copd$AGE -0.7778013 0.1341933
3.3 CAT score.
cat <- lm(copd$SGRQ~copd$CAT)
summary(cat)
##
## Call:
## lm(formula = copd$SGRQ ~ copd$CAT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.651 -7.239 -0.742 9.931 26.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.3689 3.1171 3.326 0.00124 **
## copd$CAT 1.6913 0.1614 10.478 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.65 on 98 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5284, Adjusted R-squared: 0.5236
## F-statistic: 109.8 on 1 and 98 DF, p-value: < 2.2e-16
confint(cat)
## 2.5 % 97.5 %
## (Intercept) 4.182998 16.554727
## copd$CAT 1.370986 2.011613
3.4 Gender.
gender <- lm(copd$SGRQ~factor(copd$gender))
summary(gender)
##
## Call:
## lm(formula = copd$SGRQ ~ factor(copd$gender))
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.695 -12.285 -2.194 14.535 36.745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.6953 3.0545 13.323 <2e-16 ***
## factor(copd$gender)1 -0.7916 3.8076 -0.208 0.836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.33 on 99 degrees of freedom
## Multiple R-squared: 0.0004364, Adjusted R-squared: -0.00966
## F-statistic: 0.04322 on 1 and 99 DF, p-value: 0.8357
confint(gender)
## 2.5 % 97.5 %
## (Intercept) 34.634436 46.756119
## factor(copd$gender)1 -8.346628 6.763457
3.5 Smoking status.
smoking <- lm(copd$SGRQ~copd$smoking)
summary(smoking)
##
## Call:
## lm(formula = copd$SGRQ ~ copd$smoking)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.579 -13.239 -1.839 15.911 37.921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.579 1.946 19.820 <2e-16 ***
## copd$smoking1 10.141 4.890 2.074 0.0407 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.95 on 99 degrees of freedom
## Multiple R-squared: 0.04162, Adjusted R-squared: 0.03194
## F-statistic: 4.3 on 1 and 99 DF, p-value: 0.04072
confint(smoking)
## 2.5 % 97.5 %
## (Intercept) 34.7171849 42.44164
## copd$smoking1 0.4368634 19.84431
Conclusion from simple regression models
I found that the most positive influence predictor is smoking status, mean that current smoking patient has more 10.14 SGRQ points than ex-smoking patient.The highest adjusted R2 is 0.5236 in CAT score model.
Identify interaction between gender and smoking status
Fitting interaction model between gender and smoking status
model1 <- lm(copd$SGRQ~copd$gender+copd$smoking+factor(as.integer(copd$gender)*as.integer(copd$smoking)))
summary(model1)
##
## Call:
## lm(formula = copd$SGRQ ~ copd$gender + copd$smoking + factor(as.integer(copd$gender) *
## as.integer(copd$smoking)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.029 -11.862 -2.642 13.653 39.471
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error
## (Intercept) 37.029 3.327
## copd$gender1 -5.193 4.958
## copd$smoking1 11.311 4.958
## factor(as.integer(copd$gender) * as.integer(copd$smoking))2 7.546 4.958
## factor(as.integer(copd$gender) * as.integer(copd$smoking))4 NA NA
## t value Pr(>|t|)
## (Intercept) 11.130 <2e-16
## copd$gender1 -1.047 0.2976
## copd$smoking1 2.281 0.0247
## factor(as.integer(copd$gender) * as.integer(copd$smoking))2 1.522 0.1312
## factor(as.integer(copd$gender) * as.integer(copd$smoking))4 NA NA
##
## (Intercept) ***
## copd$gender1
## copd$smoking1 *
## factor(as.integer(copd$gender) * as.integer(copd$smoking))2
## factor(as.integer(copd$gender) * as.integer(copd$smoking))4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.92 on 97 degrees of freedom
## Multiple R-squared: 0.06401, Adjusted R-squared: 0.03506
## F-statistic: 2.211 on 3 and 97 DF, p-value: 0.09167
confint(model1)
## 2.5 %
## (Intercept) 30.425375
## copd$gender1 -15.032826
## copd$smoking1 1.470389
## factor(as.integer(copd$gender) * as.integer(copd$smoking))2 -2.293778
## factor(as.integer(copd$gender) * as.integer(copd$smoking))4 NA
## 97.5 %
## (Intercept) 43.631867
## copd$gender1 4.647657
## copd$smoking1 21.150872
## factor(as.integer(copd$gender) * as.integer(copd$smoking))2 17.386705
## factor(as.integer(copd$gender) * as.integer(copd$smoking))4 NA
The coefficient of interaction term is not statistical significant. So there is no interaction between gender and smoking status.
Step 4 Fitting multiple regression.
finalmodel <- lm(copd$SGRQ~copd$AGE+copd$num_comorbid+copd$CAT+copd$gender+copd$smoking)
summary(finalmodel)
##
## Call:
## lm(formula = copd$SGRQ ~ copd$AGE + copd$num_comorbid + copd$CAT +
## copd$gender + copd$smoking)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.613 -5.923 -1.160 9.554 26.988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.5973 11.8133 2.167 0.0328 *
## copd$AGE -0.2498 0.1630 -1.533 0.1287
## copd$num_comorbid 3.5607 1.4807 2.405 0.0181 *
## copd$CAT 1.5877 0.1619 9.808 4.71e-16 ***
## copd$gender1 0.7123 2.6090 0.273 0.7854
## copd$smoking1 4.8889 3.4512 1.417 0.1599
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.35 on 94 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5691, Adjusted R-squared: 0.5462
## F-statistic: 24.83 on 5 and 94 DF, p-value: 7.266e-16
confint(finalmodel)
## 2.5 % 97.5 %
## (Intercept) 2.1416639 49.05287956
## copd$AGE -0.5733683 0.07376031
## copd$num_comorbid 0.6207120 6.50060481
## copd$CAT 1.2662677 1.90910475
## copd$gender1 -4.4678011 5.89247366
## copd$smoking1 -1.9634934 11.74127767
par(mfrow=c(2,2))
plot(finalmodel)
Summary
This multiple regression model has adjusted R2 0.546, which slightly more than CAT score simple regression model.
1 unit increase in age , the SGRQ score decrease by 0.25 points, when other predictor are constant.
1 unit increase in number of comorbidity , the SGRQ score increase by 3.56 points, when other predictor are constant.
1 unit increase in CAT score , the SGRQ score increase by 1.59 points, when other predictor are constant.
Male COPD patient has more 0.7 SGRQ points than female, when other predictor are constant.
Patient who is current smoking has more 4.9 SGRQ points than who is ex-smoking, when other predictor are constant.