Applied Regression Analysis

In a study at the Ohio State University Medical Center, doctors investigated the association between ICU acquired paresis (ICUAP) (or muscular weakness) and subject’s admission SOFA score dichotomized at 11 (sofa11) (0 for less than 11 and 1 for 11 and greater).

Specifically the investigators wanted to determine if handgrip strength (max_grip) could be used as a surrogate for ICUAP. The investigators believed that the number of days the patient was mechanically ventilated (MV_days) could affect the relationship between weakness and dichotomized SOFA.

Use the ICU acquired weakness dataset to answer the following exercises. You can also find this data within this CSV file.

url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/Acquired_weakness.csv"
download.file(url, "Acquired_weakness.csv")
acw <- read.csv("Acquired_weakness.csv")
dim(acw)

## [1] 137   4

names(acw)

## [1] "id"       "max_grip" "MVdays"   "sofa11"

str(acw)

## 'data.frame':    137 obs. of  4 variables:
##  $ id      : int  3466 5477 3366 5325 5145 3004 2984 4681 2925 4145 ...
##  $ max_grip: num  12 40 26 22 30 8 14 12 12 7 ...
##  $ MVdays  : int  10 7 5 8 8 5 6 5 12 4 ...
##  $ sofa11  : int  0 0 0 0 0 0 0 0 0 0 ...

Exercise One

Using a two-sample t-test, determine whether the two SOFA Score groups differ significantly with respect to ICUAP (as measured by max_grip).

below11 <- acw[acw$sofa11 == 0, ]
above11 <- acw[acw$sofa11 == 1, ]
(nb11 <- nrow(below11))

## [1] 125

(na11 <- nrow(above11))

## [1] 12

# pooled standard deviation
var.below11 <- var(below11$max_grip)
var.above11 <- var(above11$max_grip)
(var.pooled <- (var.below11 * (nb11 - 1) + var.above11 * (na11 - 1)) / ((nb11 - 1) + (na11 - 1)))

## [1] 181.3294

(SE <- sqrt(var.pooled / nb11 + var.pooled / na11))

## [1] 4.069572

(Tstatistic <- (mean(below11$max_grip)-mean(above11$max_grip)) / SE)

## [1] 2.204982

(pValue <- 2 * pt(Tstatistic, df = na11 + nb11, lower.tail = F))

## [1] 0.02912334

Instead of calculating by hand the above calculation to get to the p-value of the difference of two group means, there is a single function in R, t.test that performs that:

(tt <- t.test(max_grip ~ sofa11, data = acw, var.equal = TRUE)) # default two-sided test

## 
##  Two Sample t-test
## 
## data:  max_grip by sofa11
## t = 2.205, df = 135, p-value = 0.02915
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.924972 17.021695
## sample estimates:
## mean in group 0 mean in group 1 
##       15.140000        6.166667

names(tt)

## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"

tt[1:9]

## $statistic
##        t 
## 2.204982 
## 
## $parameter
##  df 
## 135 
## 
## $p.value
## [1] 0.02914824
## 
## $conf.int
## [1]  0.924972 17.021695
## attr(,"conf.level")
## [1] 0.95
## 
## $estimate
## mean in group 0 mean in group 1 
##       15.140000        6.166667 
## 
## $null.value
## difference in means 
##                   0 
## 
## $alternative
## [1] "two.sided"
## 
## $method
## [1] " Two Sample t-test"
## 
## $data.name
## [1] "max_grip by sofa11"

Exercise Two

Perform the same analysis from Exercise One (i.e., 2 sample t-test) using a regression analysis.

(lm1 <- lm(max_grip ~ sofa11, data = acw))

## 
## Call:
## lm(formula = max_grip ~ sofa11, data = acw)
## 
## Coefficients:
## (Intercept)       sofa11  
##      15.140       -8.973

summary(lm1)

## 
## Call:
## lm(formula = max_grip ~ sofa11, data = acw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.14  -7.14  -1.14   4.86  68.86 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.140      1.204  12.570   <2e-16 ***
## sofa11        -8.973      4.070  -2.205   0.0291 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.47 on 135 degrees of freedom
## Multiple R-squared:  0.03476,    Adjusted R-squared:  0.02761 
## F-statistic: 4.862 on 1 and 135 DF,  p-value: 0.02915

(beta1 <- cov(acw$sofa11, acw$max_grip) / var(acw$sofa11))

## [1] -8.973333

(beta0 <- mean(acw$max_grip) - beta1 * mean(acw$sofa11))

## [1] 15.14

(var.y.given.x <- sum((acw$max_grip - lm1$fitted)^2) / (nrow(acw) - 2))

## [1] 181.3294

(var.beta1 <- var.y.given.x / (var(acw$sofa11)*(nrow(acw)-1)))

## [1] 16.56142

(SE <- sqrt(var.beta1))

## [1] 4.069572

(tStat <- beta1 / SE)

## [1] -2.204982

(pVal <- 2 * pt(tStat, df = nrow(acw) - 2))

## [1] 0.02914824

Exercise Three

Using a single model determine the regression of max_grip on MVdays for each of the two SOFA score groups.

library(ggplot2)
ggplot(data = acw, aes(MVdays, max_grip, color = as.factor(acw$sofa11)))  +
    geom_point(position = "jitter")

plot(acw$MVdays, acw$max_grip, col = acw$sofa11+1)  #added 1 to acw$sofa11 b/c it's an integer variable and one of the levels is 0. (could just make it a factor variable instead)

(lm3 <- lm(max_grip ~ MVdays*sofa11, data = acw))

## 
## Call:
## lm(formula = max_grip ~ MVdays * sofa11, data = acw)
## 
## Coefficients:
##   (Intercept)         MVdays         sofa11  MVdays:sofa11  
##       21.1842        -0.6938       -10.3081         0.4084

summary(lm3)

## 
## Call:
## lm(formula = max_grip ~ MVdays * sofa11, data = acw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.715  -7.022  -2.246   4.691  65.591 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    21.1842     1.9915  10.638  < 2e-16 ***
## MVdays         -0.6938     0.1865  -3.721 0.000292 ***
## sofa11        -10.3081     6.9788  -1.477 0.142023    
## MVdays:sofa11   0.4084     0.3851   1.060 0.290905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.88 on 133 degrees of freedom
## Multiple R-squared:   0.13,  Adjusted R-squared:  0.1104 
## F-statistic: 6.625 on 3 and 133 DF,  p-value: 0.0003318

Here I have built a model assuming there is interaction between number of days a patient spent on mechanical ventilation and his or her SOFA score at the time of addmission. For people with SOFA < 11, average maximal grip can be calculated by:
\[y_{SOFA<11}=21.1842062 -0.6937794*MVdays\]

For people with SOFA > 11, average max grip can be found as:
\[y_{SOFA<11}=21.1842062 -0.6937794*MVdays -10.3080939+0.4083585*MVdays\]

However, is there a meaningful interaction between those two, or the interaction coefficient is equal to zero?

Exercise Four

Test whether the two SOFA cohorts differ significantly, controlling for mechanically ventilated days.
To test whether the slope of the interaction term is equal to zero perform a partial F-test that compares model with an interaction term and model without an interaction term (max_grip ~ MVdays + sofa11).

summary(lm2 <- lm(max_grip ~ MVdays + sofa11, data = acw))

## 
## Call:
## lm(formula = max_grip ~ MVdays + sofa11, data = acw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.360  -7.193  -1.958   3.611  66.042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.3503     1.8304  11.118  < 2e-16 ***
## MVdays       -0.5981     0.1632  -3.664 0.000357 ***
## sofa11       -4.3157     4.0965  -1.053 0.294009    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.89 on 134 degrees of freedom
## Multiple R-squared:  0.1227, Adjusted R-squared:  0.1096 
## F-statistic: 9.367 on 2 and 134 DF,  p-value: 0.0001557

(anova1 <- anova(lm1))

## Analysis of Variance Table
## 
## Response: max_grip
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## sofa11      1   881.6  881.61  4.8619 0.02915 *
## Residuals 135 24479.5  181.33                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(anova2 <- anova(lm2))

## Analysis of Variance Table
## 
## Response: max_grip
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## MVdays      1  2926.6 2926.55 17.6249 4.872e-05 ***
## sofa11      1   184.3  184.29  1.1099     0.294    
## Residuals 134 22250.2  166.05                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(anova3 <- anova(lm3))

## Analysis of Variance Table
## 
## Response: max_grip
##                Df  Sum Sq Mean Sq F value    Pr(>F)    
## MVdays          1  2926.6 2926.55 17.6413 4.855e-05 ***
## sofa11          1   184.3  184.29  1.1109    0.2938    
## MVdays:sofa11   1   186.5  186.52  1.1244    0.2909    
## Residuals     133 22063.7  165.89                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Fstat <- ((sum(anova3[1:3, 2]) - sum(anova2[1:2,2]))) / anova3[4, 3])

## [1] 1.124351

(pVal <- pf(Fstat, df1 = 1, df2 = 133, lower.tail = F))

## [1] 0.2909054

p-value from the comparison of model3 and model2 that differ only in the interaction term says that the slope of the interaction term isn’t significantly different than zero. That means that there is no meaningful interaction between the variables. When we control for mechanically ventilated days, the two cohorts (SOFA < 11 and SOFA > 11) are the same.

We can visualize interaction using effects library

library(effects)

## Warning: package 'effects' was built under R version 3.2.4

lmm3 <- lm(max_grip ~ MVdays + sofa11 + MVdays:sofa11, data = acw)
plot(effect(term = "MVdays:sofa11", mod = lm3, xlevels = list(MVdays=c(2, 9, 16))), multiline = TRUE)

Exercise Five

Determine and Compare
Test whether the two SOFA cohorts have the same slope. Do they have the same intercept? Are they coincident? Use the homework forum to share your findings.
1. Are the two slopes parallel?
This is the same question as whether \(\beta_3=0\). I have answered that question in the previous exercise where I performed partial F-test that concluded that the \(\beta_3\) isn’t significantly different than zero.
2. Do the two cohorts of SOFA have the same intercept?
We can determine that by looking at the summary output of the second model (2 predictors, no interaction term) or calculate by hand whether \(\beta_2=0\):

summary(lm2)

## 
## Call:
## lm(formula = max_grip ~ MVdays + sofa11, data = acw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.360  -7.193  -1.958   3.611  66.042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.3503     1.8304  11.118  < 2e-16 ***
## MVdays       -0.5981     0.1632  -3.664 0.000357 ***
## sofa11       -4.3157     4.0965  -1.053 0.294009    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.89 on 134 degrees of freedom
## Multiple R-squared:  0.1227, Adjusted R-squared:  0.1096 
## F-statistic: 9.367 on 2 and 134 DF,  p-value: 0.0001557

(anova1mv <- anova(lm1mv <- lm(max_grip ~ MVdays, data = acw)))

## Analysis of Variance Table
## 
## Response: max_grip
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## MVdays      1  2926.6 2926.55  17.611 4.885e-05 ***
## Residuals 135 22434.5  166.18                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# calculation
(Fstat <- (sum(anova2[1:2, 2])- anova1mv[1, 2])/anova2[3,3])

## [1] 1.109859

(pVal <- pf(Fstat, df1 = 1, df2 = 134, lower.tail = F))

## [1] 0.2940085

Are the two slopes coincident, i.e. do they have the same slope and intercept?
To answer this we compute a Multiple Partial F-test where we compare model with an interaction and model without an interaction to find out whether model with an interaction is a better model than the one without it. This is the same as testing whether simultaneously \(\beta_2=\beta_3=0\)

summary(lm3)

## 
## Call:
## lm(formula = max_grip ~ MVdays * sofa11, data = acw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.715  -7.022  -2.246   4.691  65.591 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    21.1842     1.9915  10.638  < 2e-16 ***
## MVdays         -0.6938     0.1865  -3.721 0.000292 ***
## sofa11        -10.3081     6.9788  -1.477 0.142023    
## MVdays:sofa11   0.4084     0.3851   1.060 0.290905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.88 on 133 degrees of freedom
## Multiple R-squared:   0.13,  Adjusted R-squared:  0.1104 
## F-statistic: 6.625 on 3 and 133 DF,  p-value: 0.0003318

# calculation
(SSreg3 <- sum(anova3[1:3, 2]))

## [1] 3297.363

(SSreg2 <- sum(anova2[1:2, 2]))

## [1] 3110.842

(Fstat <- ((SSreg3 - SSreg2) / 1) / anova3[4, 3])

## [1] 1.124351

(pVal <- pf(Fstat, df1 = 1, df2 = 133, lower.tail = F))

## [1] 0.2909054

Since the null hypothesis of this test says that \(\beta_2=\beta_3=0\), and given that the p-value of the test was 0.2909054, we can’t reject the hypothesis that \(\beta_2=\beta_3=0\). In other words, the slopes for the two cohorts of SOFA coincide.

fitted3 <- lm3$fitted
ggplot(data = acw, aes(x = MVdays, y = max_grip, color = as.factor(sofa11)))  + 
    geom_point(position = "jitter")  +
    geom_line(aes(x = acw$MVdays,y = fitted3,  col= as.factor(acw$sofa11)))

Applied Regression Analysis_HW6

Exercise One

Exercise Two

Exercise Three

Exercise Four

Exercise Five