ANCOVA can be considered a marriage between ANOVA and regression. Before we start on analysis of covariance (ANCOVA), let’s review.

Analysis of Variance (ANOVA)

To do this, let’s go back to one of your past HW assignments, the elephant tusk data.

elephant <- read.csv("ElephantTusk.csv")
str(elephant)
## 'data.frame':    595 obs. of  7 variables:
##  $ Years                 : chr  "1966-68" "1966-68" "1966-68" "1966-68" ...
##  $ Elephant              : int  58 78 86 293 29 43 221 222 241 255 ...
##  $ Sex                   : chr  "f" "f" "f" "f" ...
##  $ Age.yrs.              : num  2.5 2.5 2.5 2.5 3 3 3.5 3.5 3.5 3.5 ...
##  $ shoulder.height.cm.   : int  149 151 127 156 146 149 166 178 177 171 ...
##  $ tusk.length.cm.       : num  30 27 27 32 28.5 30.5 43 44 45 46 ...
##  $ tusk.circumference.cm.: num  10.5 10.5 10.5 11 9.5 10 11 14 12 13 ...

Now, let’s do a simple two-factor ANOVA. We want to know whether tusk length depended on the era in which the elephants lived (Years) and on Sex. First, let’s get the summary statistics.

library(Rmisc)
eleSUM <- summarySE(elephant, measurevar = "tusk.length.cm.", groupvars = c("Years", "Sex"))
eleSUM
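If you don’t have Rmisc installed, here is a base-R sketch that gets similar group summaries (aggregate() returns the n/mean/sd as a matrix column):

# group-wise n, mean, and sd of tusk length for each Years x Sex combination
aggregate(tusk.length.cm. ~ Years + Sex, data = elephant,
          FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))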

Now, let’s perform the ANOVA.

tusk.lm1 <- aov(tusk.length.cm. ~ Years*Sex, data=elephant)
anova(tusk.lm1)
## Analysis of Variance Table
## 
## Response: tusk.length.cm.
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Years       1  45482   45482 32.0347 2.363e-08 ***
## Sex         1   3957    3957  2.7871   0.09556 .  
## Years:Sex   1   4315    4315  3.0390   0.08181 .  
## Residuals 591 839082    1420                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


As a group, review all the numbers in the ANOVA table.
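One thing worth checking as you review: each F value is just that row’s mean square divided by the residual mean square. A quick sketch using the numbers from the table above:

# F for Years = MS(Years) / MS(Residuals)
45482 / 1420   # ~32.03, matching the F value in the Years row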

Linear Regression

Let’s quickly review regression, since we covered it recently. First, let’s look at tusk length by shoulder height.

library(ggpubr)
## Loading required package: ggplot2
ggscatter(elephant, x="shoulder.height.cm.", y="tusk.length.cm.",
          add = "reg.line", conf.int = TRUE, 
          add.params = list(color="red"),
          cor.coef = TRUE, cor.method = "pearson",
          title = "All Elephants")
## `geom_smooth()` using formula 'y ~ x'

To run the stats for this model, you would do the following.

tusk.lm2 <- lm(tusk.length.cm.~shoulder.height.cm., data=elephant)
summary(tusk.lm2)
## 
## Call:
## lm(formula = tusk.length.cm. ~ shoulder.height.cm., data = elephant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.826  -9.939   0.718  11.239  86.950 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -97.81041    5.24531  -18.65   <2e-16 ***
## shoulder.height.cm.   0.85916    0.02344   36.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.47 on 593 degrees of freedom
## Multiple R-squared:  0.6938, Adjusted R-squared:  0.6933 
## F-statistic:  1344 on 1 and 593 DF,  p-value: < 2.2e-16

If you recall, in the HW assignment, I had you do the following.

How similar is the correlation between shoulder height and tusk length for elephants born recently (2005-13) compared to those born during heavy poaching (1966-68)? Hint: you might have to separate the eras (Years column).

Essentially, I am throwing in another variable, or what is called a covariate.

I asked you to compare the regression lines. You might have done it this way.

# first, split up the data
elephant2000s <- elephant[elephant$Years == "2005-13", ]
elephant60s <- elephant[elephant$Years == "1966-68", ]
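It never hurts to verify the split; the two subsets should add back up to the 595 rows we saw in str():

# sanity check: the two eras together should account for every row
nrow(elephant60s) + nrow(elephant2000s)   # should equal nrow(elephant), i.e. 595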

Then, make two graphs (one of the older elephants and one of the newer ones) to get a better visualization of your data.

ggscatter(elephant60s, x="shoulder.height.cm.", y="tusk.length.cm.",
          add = "reg.line", conf.int = TRUE, 
          add.params = list(color="red"),
          cor.coef = TRUE, cor.method = "pearson",
          title = "Elephants from 1966-68")
## `geom_smooth()` using formula 'y ~ x'

Then, graph the newer ones.

ggscatter(elephant2000s, x="shoulder.height.cm.", y="tusk.length.cm.",
          add = "reg.line", conf.int = TRUE, 
          add.params = list(color="red"),
          cor.coef = TRUE, cor.method = "pearson",
          title = "Elephants from 2005-2013")
## `geom_smooth()` using formula 'y ~ x'

Note, we also could have done it on one plot using ggplot2 by adding a fill variable. Look below.

library(ggplot2)
library(ggpubr)
ggplot(elephant, aes(y=tusk.length.cm., x=shoulder.height.cm., fill=Years, col=Years)) +
  geom_point() +
  geom_smooth(method=lm, aes(fill=Years)) + 
  stat_cor(method = "pearson")
## `geom_smooth()` using formula 'y ~ x'
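As an aside, ggscatter() from ggpubr can also draw both groups on a single plot if you hand it a grouping variable through color. This is just a sketch along the lines of the calls above, not something you need for the HW:

# one panel, one regression line per era
ggscatter(elephant, x="shoulder.height.cm.", y="tusk.length.cm.",
          color = "Years", add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          title = "All Elephants by Era")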

Now, let’s look at the regression equations.

older <- lm(tusk.length.cm.~shoulder.height.cm., data=elephant60s)
summary(older)
## 
## Call:
## lm(formula = tusk.length.cm. ~ shoulder.height.cm., data = elephant60s)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.669  -9.779  -0.707   8.232  70.322 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -120.65261    4.43088  -27.23   <2e-16 ***
## shoulder.height.cm.    0.99765    0.02003   49.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.09 on 433 degrees of freedom
## Multiple R-squared:  0.8513, Adjusted R-squared:  0.851 
## F-statistic:  2480 on 1 and 433 DF,  p-value: < 2.2e-16

Based on this result, we can write the equation for the older elephants: tusk length = -120.65261 + 0.99765 × (shoulder height).
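Rather than retyping the numbers, you can pull them straight out of the fitted model, and even predict from it (the 150 cm shoulder height below is just a made-up example value):

coef(older)   # intercept and slope for the 1966-68 line
# e.g., predicted tusk length for a 150 cm shoulder height:
predict(older, newdata = data.frame(shoulder.height.cm. = 150))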

For newer elephants:

newer <- lm(tusk.length.cm.~shoulder.height.cm., data=elephant2000s)
summary(newer)
## 
## Call:
## lm(formula = tusk.length.cm. ~ shoulder.height.cm., data = elephant2000s)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.900 -11.637  -0.996   8.475  88.231 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -74.75591    9.23809  -8.092 1.46e-13 ***
## shoulder.height.cm.   0.66940    0.04003  16.721  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.5 on 158 degrees of freedom
## Multiple R-squared:  0.6389, Adjusted R-squared:  0.6366 
## F-statistic: 279.6 on 1 and 158 DF,  p-value: < 2.2e-16

The equation is: tusk length = -74.75591 + 0.66940 × (shoulder height).

If we compare (observationally, not statistically) the two regression slopes, we find that the older slope is 0.99765 and the newer slope is 0.66940.
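You can do that comparison in R rather than by eye; a quick sketch:

# difference between the two fitted slopes
coef(older)["shoulder.height.cm."] - coef(newer)["shoulder.height.cm."]   # ~0.328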


As a group, go through the script for the dual graph, as well as the results of the regressions. Describe to each other what you think each line is doing.

They are each independently significant, so shoulder height is a good predictor of tusk length in both eras. However, how do the eras compare?

ANCOVA

So the point of ANCOVA is to account for that covariate, in this case Years. The general null hypothesis for an ANCOVA is that the relationship between the variables does not depend on the covariate. Testing it is a multistep process.

First, test the null hypothesis that the slopes of the regression lines are all equal.

Then, if you accept (or fail to reject; p > 0.05) the null hypothesis that the regression lines are equal (i.e., parallel), you test the second null hypothesis: that the Y intercepts of the regression lines (a) are all the same.

So, first, let’s test whether the regression lines (slopes) are equal. To do this, we fit the model with the lm() function we used for regression. But since we are examining whether the effect of the predictor depends on the covariate, we are primarily interested in the interaction. In this case, that interaction is shoulder height × Years.

model.1 <- lm(tusk.length.cm. ~ shoulder.height.cm. + Years + shoulder.height.cm.:Years,
              data = elephant)
library(car)
## Loading required package: carData
Anova(model.1, type = "II")
## Anova Table (Type II tests)
## 
## Response: tusk.length.cm.
##                           Sum Sq  Df  F value    Pr(>F)    
## shoulder.height.cm.       662982   1 2374.078 < 2.2e-16 ***
## Years                      89021   1  318.777 < 2.2e-16 ***
## shoulder.height.cm.:Years  19329   1   69.217 6.115e-16 ***
## Residuals                 165042 591                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the F value and p value for the interaction tell us that the effect of shoulder height differs significantly by Years, we reject our first null. This means that the slopes are significantly different between the eras.
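As a side note, R’s formula interface has a shorthand for “main effects plus their interaction”: the * operator. This sketch fits exactly the same model as model.1:

# equivalent to shoulder.height.cm. + Years + shoulder.height.cm.:Years
model.1b <- lm(tusk.length.cm. ~ shoulder.height.cm. * Years, data = elephant)
Anova(model.1b, type = "II")   # same table as above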

We don’t have to test the second null hypothesis, but for class purposes, let’s do it anyway. For this, we run a Type II ANOVA. In the model, we ADD the Years variable to our original regression model, leaving out the interaction.

model.3 = lm(tusk.length.cm. ~ shoulder.height.cm. + Years,
              data = elephant)
library(car)
Anova(model.3, type = "II")
## Anova Table (Type II tests)
## 
## Response: tusk.length.cm.
##                     Sum Sq  Df F value    Pr(>F)    
## shoulder.height.cm. 662982   1 2128.78 < 2.2e-16 ***
## Years                89021   1  285.84 < 2.2e-16 ***
## Residuals           184371 592                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Look at the Years row and examine the p value. Is it < 0.05? If so (it is), the intercepts of the two regression lines are also significantly different.
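To see what the two fitted parallel lines actually look like, you can inspect the coefficients of model.3 (the Years2005-13 name below assumes R’s default dummy coding of the factor):

coef(model.3)
# (Intercept)         = intercept of the 1966-68 line
# Years2005-13        = vertical shift of the 2005-13 line relative to it
# shoulder.height.cm. = the common slope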

Now You Try

1. What if we still wanted to ask whether another variable was a covariate? So maybe tusk length didn’t change due to shoulder height alone, but rather the relationship differed by Sex.

1a. Make a scatter plot (where both regressions are on the same plot, using ggplot() as shown above)

1b. Perform the statistical tests to compare.

Answers are below, but try first

ggplot(elephant, aes(y=tusk.length.cm., x=shoulder.height.cm., 
                     fill=Sex, 
                     col=Sex)) +
  geom_point() +
  geom_smooth(method=lm, aes(fill=Sex)) + 
  stat_cor(method = "pearson")
## `geom_smooth()` using formula 'y ~ x'

model.4 = lm(tusk.length.cm. ~ shoulder.height.cm. + Sex + shoulder.height.cm.:Sex,
              data = elephant)
Anova(model.4, type = "III")
## Anova Table (Type III tests)
## 
## Response: tusk.length.cm.
##                         Sum Sq  Df F value    Pr(>F)    
## (Intercept)              34833   1  80.525 < 2.2e-16 ***
## shoulder.height.cm.     170039   1 393.087 < 2.2e-16 ***
## Sex                       4898   1  11.323 0.0008151 ***
## shoulder.height.cm.:Sex   7654   1  17.695 2.996e-05 ***
## Residuals               255651 591                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the interaction is significant, the slopes are different. We would reject the first null. (Note that we used type = "III" here; a Type III test is one common choice when the model contains a significant interaction.)
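If you want to see the two sex-specific slopes side by side, one sketch (not part of the original assignment) is to fit each sex separately:

# intercept and slope for each sex, fit separately
by(elephant, elephant$Sex, function(d)
  coef(lm(tusk.length.cm. ~ shoulder.height.cm., data = d)))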

2. Using the hospitals dataset, try the following. Let’s examine the relationship between the number of doctors and the death rate in our hospital dataset (a new one, hospitals2.csv). However, we are wondering whether it differs by county (Washington vs. Jefferson counties). Thus, there is a covariate.

2a. Make a scatter plot (where both regressions are on the same plot, using ggplot() as shown above)
2b. Perform the statistical tests to compare.
2c. Compare 1b with 2b.

Answers are below, but try first

hospitals2 <- read.csv("hospitals2.csv")
str(hospitals2)
## 'data.frame':    105 obs. of  3 variables:
##  $ county    : chr  "washington" "washington" "washington" "washington" ...
##  $ death.rate: num  8 9.3 7.5 7.3 10.2 ...
##  $ doctors   : int  78 68 70 96 74 111 77 168 82 89 ...
ggplot(hospitals2, aes(y=death.rate, x=doctors, 
                     fill=county, 
                     col=county)) +
  geom_point() +
  geom_smooth(method=lm, aes(fill=county)) + 
  stat_cor(method = "pearson")
## `geom_smooth()` using formula 'y ~ x'

model.hosp = lm(death.rate ~ doctors + county + doctors:county,
              data = hospitals2)
summary(model.hosp)
## 
## Call:
## lm(formula = death.rate ~ doctors + county + doctors:county, 
##     data = hospitals2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3008 -1.1117 -0.0487  0.9597  3.9498 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               8.940945   0.776335  11.517   <2e-16 ***
## doctors                   0.001539   0.006388   0.241    0.810    
## countywashington         -0.194214   1.082128  -0.179    0.858    
## doctors:countywashington  0.002596   0.008888   0.292    0.771    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.688 on 101 degrees of freedom
## Multiple R-squared:  0.006028,   Adjusted R-squared:  -0.0235 
## F-statistic: 0.2042 on 3 and 101 DF,  p-value: 0.8933

Because the interaction is not significant, you would go on to test the intercepts. Note that the (Intercept) row here is not referring to the intercepts of the individual regression lines.

model.hosp2 = lm(death.rate ~ doctors + county,
              data = hospitals2)
Anova(model.hosp2, type = "II")
## Anova Table (Type II tests)
## 
## Response: death.rate
##            Sum Sq  Df F value Pr(>F)
## doctors     1.198   1  0.4244 0.5162
## county      0.300   1  0.1062 0.7452
## Residuals 288.045 102

Looking at the county row, the p value suggests that the intercepts are not significantly different.
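If you want to see just how small the fitted county effect is, inspect the coefficients (the countywashington name follows R’s default dummy coding, as in the summary above):

coef(model.hosp2)
# countywashington is the estimated intercept shift between the counties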