1 ANCOVA

Now we have learned to compare a variable (dependent variable) “linearly” in response to categorical variables (ANOVA) and in response to continuous variables (linear regression). How about you have both categorical and continuous variables?

Analysis of covariance (ANCOVA) combines elements from regression and ANOVA. The dependent variable is continuous, and there is at least one continuous explanatory variable and at least one categorical explanatory variable. It allows you to examine whether the effect of one independent variable is significant and whether the effect depends on the other independent variable.

Here I will use data from the Lahman package again to demonstrate ANCOVA.

install.packages("Lahman")
library(Lahman)
data("Salaries")
data("Batting")

Now let’s subset the salaries from one single year (say 2015). Do you remember how to do it?

money15 = subset(Salaries, yearID==2015)
bat14 = subset(Batting, yearID==2014) %>%
  battingStats(idvars = c("playerID", "stint", "teamID", "lgID"), cbind = FALSE)

dat = inner_join(money15, bat14, by="playerID") %>%
  filter(BA != "NA")
head(dat, 15)
##    yearID teamID.x lgID.x  playerID   salary stint teamID.y lgID.y    BA
## 1    2015      ARI     NL ahmedni01   508500     1      ARI     NL 0.200
## 2    2015      ARI     NL anderch01   512500     1      ARI     NL 0.029
## 3    2015      ARI     NL chafian01   507500     1      ARI     NL 0.500
## 4    2015      ARI     NL collmjo01  1400000     1      ARI     NL 0.111
## 5    2015      ARI     NL delarru01   516000     1      BOS     AL 0.000
## 6    2015      ARI     NL delgara01   526000     1      ARI     NL 0.111
## 7    2015      ARI     NL goldspa01  3100000     1      ARI     NL 0.300
## 8    2015      ARI     NL gosewtu01   514500     1      ARI     NL 0.225
## 9    2015      ARI     NL  hillaa01 12000000     1      ARI     NL 0.244
## 10   2015      ARI     NL inciaen01   513000     1      ARI     NL 0.278
## 11   2015      ARI     NL lairdge01  1000000     1      ATL     NL 0.204
## 12   2015      ARI     NL  lambja01   508500     1      ARI     NL 0.230
## 13   2015      ARI     NL owingch01   512500     1      ARI     NL 0.261
## 14   2015      ARI     NL pachejo01   900000     1      COL     NL 0.236
## 15   2015      ARI     NL pachejo01   900000     2      ARI     NL 0.272
##     PA  TB SlugPct   OBP   OPS BABIP
## 1   75  19   0.271 0.233 0.504 0.220
## 2   41   1   0.029 0.057 0.086 0.050
## 3    2   1   0.500 0.500 1.000 1.000
## 4   58   8   0.148 0.158 0.306 0.154
## 5    2   0   0.000 0.000 0.000 0.000
## 6   11   1   0.111 0.111 0.222 0.200
## 7  475 220   0.542 0.396 0.938 0.368
## 8  132  40   0.310 0.242 0.552 0.269
## 9  541 184   0.367 0.287 0.654 0.276
## 10 447 150   0.359 0.318 0.677 0.310
## 11 167  39   0.257 0.275 0.532 0.261
## 12 133  47   0.373 0.263 0.636 0.291
## 13 332 126   0.406 0.300 0.706 0.314
## 14  80  25   0.347 0.300 0.647 0.293
## 15  85  26   0.321 0.298 0.619 0.319
  • Notes that I’m using a cool function to join the two dataframe together. You are encouraged to try out other functions in the dplyr and tidyr packages.

I’m interested in how the salary of a batter is being affected by the batting average (BA, the most common statistics to evaluate the performance of a batter). I also would like to know if the salary differ from National League to American League.

If these two are the only question you are interested in, simple linear regression might be just enough for the first one and ANOVA might be just enough for the second one. However, what if the relationship between salary and BA differs from league to league? This is where you should consider ANCOVA

Let’s take a look on the data type of each variable.

str(dat)
## 'data.frame':    657 obs. of  15 variables:
##  $ yearID  : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ teamID.x: Factor w/ 35 levels "ANA","ARI","ATL",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ lgID.x  : Factor w/ 2 levels "AL","NL": 2 2 2 2 2 2 2 2 2 2 ...
##  $ playerID: chr  "ahmedni01" "anderch01" "chafian01" "collmjo01" ...
##  $ salary  : int  508500 512500 507500 1400000 516000 526000 3100000 514500 12000000 513000 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID.y: Factor w/ 149 levels "ALT","ANA","ARI",..: 3 3 3 3 16 3 3 3 3 3 ...
##  $ lgID.y  : Factor w/ 7 levels "AA","AL","FL",..: 5 5 5 5 2 5 5 5 5 5 ...
##  $ BA      : num  0.2 0.029 0.5 0.111 0 0.111 0.3 0.225 0.244 0.278 ...
##  $ PA      : num  75 41 2 58 2 11 475 132 541 447 ...
##  $ TB      : num  19 1 1 8 0 1 220 40 184 150 ...
##  $ SlugPct : num  0.271 0.029 0.5 0.148 0 0.111 0.542 0.31 0.367 0.359 ...
##  $ OBP     : num  0.233 0.057 0.5 0.158 0 0.111 0.396 0.242 0.287 0.318 ...
##  $ OPS     : num  0.504 0.086 1 0.306 0 0.222 0.938 0.552 0.654 0.677 ...
##  $ BABIP   : num  0.22 0.05 1 0.154 0 0.2 0.368 0.269 0.276 0.31 ...

So, the salary is integers (special case of numbers), BA is numbers and league ID is a categorical variable. This is what we need for executing ANCOVA.

Now let’s visualize it.

ggplot(data=dat, mapping=aes(x=BA, y=salary, color=factor(lgID.x)))+
  geom_point()

It’s a bit messy…Let’s focus on the subset of players that has BA greater than 0 but smaller than 0.375 and then log transform the salary and focus on those who earn more than 7.294163710^{5}.

dat.c = dat %>%
  mutate(sal.log=log(salary)) %>%
  filter(BA>0 & BA<0.375 & sal.log>13.5) 

ggplot(data=dat.c, aes(x=BA, y=sal.log, group=lgID.x, color=lgID.x))+
    geom_point()+
    scale_color_manual(name="League",
                       breaks=c("AL", "NL"),
                       labels=c("American League", "National League"),
                       values=c("red","blue"))+
    labs(x="Batting Average", y="Log(salary)")

Now we have a better view of the data we are going to be working on.

Let’s see how the salary related to the Batting Average in each league. First, the American league.

dat.c %>%
  filter(lgID.x=="AL") %>%
  ggplot()+
    geom_point(aes(x=BA, y=sal.log), color="red")+
    labs(x="Batting Average", y="Log(salary)", title="American League")+
    geom_smooth(aes(x=BA, y=sal.log), method="lm", color="black", se=FALSE)

dat.AL = dat.c %>%
  filter(lgID.x=="AL")
summary(lm(sal.log~ BA, data=dat.AL))
## 
## Call:
## lm(formula = sal.log ~ BA, data = dat.AL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.89659 -0.74677  0.03937  0.78588  1.94181 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.6213     0.3199  45.702   <2e-16 ***
## BA            3.0943     1.2852   2.408   0.0171 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9181 on 176 degrees of freedom
## Multiple R-squared:  0.03189,    Adjusted R-squared:  0.02639 
## F-statistic: 5.797 on 1 and 176 DF,  p-value: 0.01709

We see that there is a significant positive relationship between salary and BA in the American league.

How about the National League?

dat.c %>%
  filter(lgID.x=="NL") %>%
  ggplot(aes(x=BA, y=sal.log))+
    geom_point(color="blue")+
    labs(x="Batting Average", y="Log(salary)", title="National League")+
    geom_smooth(aes(x=BA, y=sal.log), method="lm", color="black", se=FALSE, linetype="dashed")

dat.NL = dat.c %>%
  filter(lgID.x=="NL")
summary(lm(sal.log~ BA, data=dat.NL))
## 
## Call:
## lm(formula = sal.log ~ BA, data = dat.NL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78248 -0.78087 -0.03334  0.78318  1.93664 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.4966     0.2218   69.85   <2e-16 ***
## BA           -0.7677     0.9366   -0.82    0.413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9534 on 200 degrees of freedom
## Multiple R-squared:  0.003348,   Adjusted R-squared:  -0.001635 
## F-statistic: 0.6719 on 1 and 200 DF,  p-value: 0.4134

Graphically and statistically, the relationship between salary and BA differs from league to league. We can do ANCOVA to statistically examine if the relationship does significantly differ from league to league.

1.1 Execute ANCOVA

We can use lm() function to do ancova. You can also use aov() function. The only difference is just the final summary table.

mod = lm(sal.log~ BA*lgID.x, data=dat.c)
aov = aov(sal.log~ BA*lgID.x, data=dat.c)

The “BA” is a continuous variable and the “lgID.x” is a categorical variable. The “*" means we specify a interaction term between the two variables in the model.

summary(mod)
## 
## Call:
## lm(formula = sal.log ~ BA * lgID.x, data = dat.c)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.89659 -0.75845 -0.00003  0.78644  1.94181 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.6213     0.3265  44.778   <2e-16 ***
## BA            3.0943     1.3117   2.359   0.0188 *  
## lgID.xNL      0.8754     0.3926   2.229   0.0264 *  
## BA:lgID.xNL  -3.8621     1.6025  -2.410   0.0164 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9371 on 376 degrees of freedom
## Multiple R-squared:  0.01708,    Adjusted R-squared:  0.009233 
## F-statistic: 2.177 on 3 and 376 DF,  p-value: 0.09027
anova(mod)
## Analysis of Variance Table
## 
## Response: sal.log
##            Df Sum Sq Mean Sq F value  Pr(>F)  
## BA          1   0.48  0.4760  0.5420 0.46205  
## lgID.x      1   0.16  0.1595  0.1817 0.67020  
## BA:lgID.x   1   5.10  5.1003  5.8083 0.01643 *
## Residuals 376 330.17  0.8781                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## BA            1    0.5   0.476   0.542 0.4620  
## lgID.x        1    0.2   0.160   0.182 0.6702  
## BA:lgID.x     1    5.1   5.100   5.808 0.0164 *
## Residuals   376  330.2   0.878                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output, we see that the interaction term is significant. This means that the effect (the coefficient) of one variable depends on the other one. In the case, we can say the relationship between salary and BA significantly differs from league to league.

1.2 Interaction plot

We can use a package called interplot that allows us to visualize

library(interplot)
interplot(m=mod, var1="BA", var2="lgID.x")+
  labs(x="League ID", y="Estimated coefficient for BA", title="Estimated Coefficient of BA on League ID")

This plot means the coefficient of BA on salary is higher in American League than that of National League.

Or, we can have another aspect of the interaction plot.

interplot(m=mod, var1="lgID.x", var2="BA")+
  labs(x="BA", y="Estimated coefficient for League ID", title="Estimated Coefficient of League ID on BA")

This means the salary in the National league is first greater than in the American league (positive), but this difference gradually decrease with the increase of BA. Finally, the salary in the National league is lower than in the American league.


Exercise 1

Try to use the following data set to execute ANCOVA and make interaction plots This data set contains the fruit weight (Fruit) of different plants with different root depth (root) and different treatment (Grazing).

gz = read.table(text=getURL("https://raw.githubusercontent.com/OscarFHC/NRE538_GSRA/master/Labs/NRE538_ANCOVA_n_Interaction/ipomopsis.txt"), sep="", header=T,comment.char="#")
head(gz)
##    Root Fruit  Grazing
## 1 6.225 59.77 Ungrazed
## 2 6.487 60.98 Ungrazed
## 3 4.919 14.73 Ungrazed
## 4 5.130 19.28 Ungrazed
## 5 5.417 34.25 Ungrazed
## 6 5.359 35.53 Ungrazed

2 Interaction terms

Keep in mind that anova, ancova and linear regression are all linear models with following formula.

\[Y_i = \beta_0 + \beta_{i1} X_{i1} + \beta_{i2} X_{i2} + ... + \beta_{ip} X_{ip} + \varepsilon_i\]

, where \(Y_i\) is the dependent variable (response variable) \(i\), \(X_{ip}\) are the independent variable, and \(\varepsilon_i\) is the error.

The difference between ANOVA/ANCOVA and linear regression is just the difference between \(X_{ip}\). In ANOVA, \(X_{ip}\) are categorical variable and in regression, \(X_{ip}\) are continuous variable. In ANCOVA, it is the combination of the two.

In this sense, we can also include interaction term in the linear regression model we had before. We used the air quality data in New York before to demonstrate the regression analysis. Let’s make it more complicated.

data("airquality")
head(airquality, 15)
##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10
## 11     7      NA  6.9   74     5  11
## 12    16     256  9.7   69     5  12
## 13    11     290  9.2   66     5  13
## 14    14     274 10.9   68     5  14
## 15    18      65 13.2   58     5  15
mod = lm(Temp~Wind, data=airquality)
summary(mod)
## 
## Call:
## lm(formula = Temp ~ Wind, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.291  -5.723   1.709   6.016  19.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  90.1349     2.0522  43.921  < 2e-16 ***
## Wind         -1.2305     0.1944  -6.331 2.64e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.442 on 151 degrees of freedom
## Multiple R-squared:  0.2098, Adjusted R-squared:  0.2045 
## F-statistic: 40.08 on 1 and 151 DF,  p-value: 2.642e-09
plot(Temp~Wind, data=airquality)
abline(lm(Temp~Wind, data=airquality), col="red")

This is the linear model we have seen before. We can have a interaction terms in the model but we first need to have two independent variables.

mod1 = lm(Temp~Wind + Ozone, data=airquality)
summary(mod1)
## 
## Call:
## lm(formula = Temp ~ Wind + Ozone, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.487  -5.137   1.220   4.674  13.102 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 74.18035    2.96526  25.017  < 2e-16 ***
## Wind        -0.37829    0.22080  -1.713   0.0894 .  
## Ozone        0.17615    0.02393   7.362 3.15e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.762 on 113 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.5007, Adjusted R-squared:  0.4918 
## F-statistic: 56.65 on 2 and 113 DF,  p-value: < 2.2e-16

In the above model, there is not interaction terms. We can include it by changing “+” to “*“.

mod2 = lm(Temp~Wind * Ozone, data=airquality)
summary(mod2)
## 
## Call:
## lm(formula = Temp ~ Wind * Ozone, data = airquality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2008  -4.1492   0.5976   4.6656  13.4738 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.268679   3.141135  25.554  < 2e-16 ***
## Wind        -1.133278   0.275753  -4.110 7.57e-05 ***
## Ozone        0.015800   0.044790   0.353    0.725    
## Wind:Ozone   0.023505   0.005687   4.133 6.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.326 on 112 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.5668, Adjusted R-squared:  0.5552 
## F-statistic: 48.84 on 3 and 112 DF,  p-value: < 2.2e-16

From the above model output, we see that there is a significant interaction between Wind and Ozone. This means that the coefficient of one variable depends on the other one. For example, we can use interaction plot to visualize how the coefficient of Ozone depends on Wind.

interplot(mod2, var1="Wind", var2="Ozone")+
  labs(x="Wind", y="coefficient for Ozone")


Exercise 2

  1. Use F-test to compare mod.1 and mod.2. Explain why you chose one over another to explain the data.

  2. Plot the coefficient for Wind to Ozone using interplot() function.


Note that as we include the interaction term in the model, the coefficient for Ozone becomes non-significant. This is because Ozone and Wind are highly correlated.

cor(airquality[,"Ozone"], airquality[,"Wind"], use="na.or.complete")
## [1] -0.6015465

If encountering this case we should remove the main effect of Ozone, since the interaction between Wind and Ozone is significant but the main effect for Ozone is non-significant.

mod3 = lm(Temp~Wind + Wind:Ozone, data=airquality)
summary(mod3)
## 
## Call:
## lm(formula = Temp ~ Wind + Wind:Ozone, data = airquality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2622  -3.9667   0.5829   4.6323  13.6894 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 81.101122   2.065187  39.271  < 2e-16 ***
## Wind        -1.210998   0.165189  -7.331 3.68e-11 ***
## Wind:Ozone   0.025242   0.002831   8.916 9.81e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.302 on 113 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.5663, Adjusted R-squared:  0.5586 
## F-statistic: 73.77 on 2 and 113 DF,  p-value: < 2.2e-16
anova(mod2, mod3)
## Analysis of Variance Table
## 
## Model 1: Temp ~ Wind * Ozone
## Model 2: Temp ~ Wind + Wind:Ozone
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    112 4482.7                           
## 2    113 4487.7 -1   -4.9808 0.1244 0.7249

From the above results, we see that the two models are not significantly different from each other in terms of the variance of response variable explained. This is based on F-test we have learned from the multiple regression session. This can only be executed if the two models are nested. However, this model comparison (or model selection) topic will not be covered here but you are welcome to explore it if you are interested in!