Now we have learned to compare a variable (dependent variable) “linearly” in response to categorical variables (ANOVA) and in response to continuous variables (linear regression). How about you have both categorical and continuous variables?
Analysis of covariance (ANCOVA) combines elements from regression and ANOVA. The dependent variable is continuous, and there is at least one continuous explanatory variable and at least one categorical explanatory variable. It allows you to examine whether the effect of one independent variable is significant and whether the effect depends on the other independent variable.
Here I will use data from the Lahman package again to demonstrate ANCOVA.
install.packages("Lahman")
library(Lahman)
data("Salaries")
data("Batting")
Now let’s subset the salaries from one single year (say 2015). Do you remember how to do it?
money15 = subset(Salaries, yearID==2015)
bat14 = subset(Batting, yearID==2014) %>%
battingStats(idvars = c("playerID", "stint", "teamID", "lgID"), cbind = FALSE)
dat = inner_join(money15, bat14, by="playerID") %>%
filter(BA != "NA")
head(dat, 15)
## yearID teamID.x lgID.x playerID salary stint teamID.y lgID.y BA
## 1 2015 ARI NL ahmedni01 508500 1 ARI NL 0.200
## 2 2015 ARI NL anderch01 512500 1 ARI NL 0.029
## 3 2015 ARI NL chafian01 507500 1 ARI NL 0.500
## 4 2015 ARI NL collmjo01 1400000 1 ARI NL 0.111
## 5 2015 ARI NL delarru01 516000 1 BOS AL 0.000
## 6 2015 ARI NL delgara01 526000 1 ARI NL 0.111
## 7 2015 ARI NL goldspa01 3100000 1 ARI NL 0.300
## 8 2015 ARI NL gosewtu01 514500 1 ARI NL 0.225
## 9 2015 ARI NL hillaa01 12000000 1 ARI NL 0.244
## 10 2015 ARI NL inciaen01 513000 1 ARI NL 0.278
## 11 2015 ARI NL lairdge01 1000000 1 ATL NL 0.204
## 12 2015 ARI NL lambja01 508500 1 ARI NL 0.230
## 13 2015 ARI NL owingch01 512500 1 ARI NL 0.261
## 14 2015 ARI NL pachejo01 900000 1 COL NL 0.236
## 15 2015 ARI NL pachejo01 900000 2 ARI NL 0.272
## PA TB SlugPct OBP OPS BABIP
## 1 75 19 0.271 0.233 0.504 0.220
## 2 41 1 0.029 0.057 0.086 0.050
## 3 2 1 0.500 0.500 1.000 1.000
## 4 58 8 0.148 0.158 0.306 0.154
## 5 2 0 0.000 0.000 0.000 0.000
## 6 11 1 0.111 0.111 0.222 0.200
## 7 475 220 0.542 0.396 0.938 0.368
## 8 132 40 0.310 0.242 0.552 0.269
## 9 541 184 0.367 0.287 0.654 0.276
## 10 447 150 0.359 0.318 0.677 0.310
## 11 167 39 0.257 0.275 0.532 0.261
## 12 133 47 0.373 0.263 0.636 0.291
## 13 332 126 0.406 0.300 0.706 0.314
## 14 80 25 0.347 0.300 0.647 0.293
## 15 85 26 0.321 0.298 0.619 0.319
I’m interested in how the salary of a batter is being affected by the batting average (BA, the most common statistics to evaluate the performance of a batter). I also would like to know if the salary differ from National League to American League.
If these two are the only question you are interested in, simple linear regression might be just enough for the first one and ANOVA might be just enough for the second one. However, what if the relationship between salary and BA differs from league to league? This is where you should consider ANCOVA
Let’s take a look on the data type of each variable.
str(dat)
## 'data.frame': 657 obs. of 15 variables:
## $ yearID : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ teamID.x: Factor w/ 35 levels "ANA","ARI","ATL",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ lgID.x : Factor w/ 2 levels "AL","NL": 2 2 2 2 2 2 2 2 2 2 ...
## $ playerID: chr "ahmedni01" "anderch01" "chafian01" "collmjo01" ...
## $ salary : int 508500 512500 507500 1400000 516000 526000 3100000 514500 12000000 513000 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID.y: Factor w/ 149 levels "ALT","ANA","ARI",..: 3 3 3 3 16 3 3 3 3 3 ...
## $ lgID.y : Factor w/ 7 levels "AA","AL","FL",..: 5 5 5 5 2 5 5 5 5 5 ...
## $ BA : num 0.2 0.029 0.5 0.111 0 0.111 0.3 0.225 0.244 0.278 ...
## $ PA : num 75 41 2 58 2 11 475 132 541 447 ...
## $ TB : num 19 1 1 8 0 1 220 40 184 150 ...
## $ SlugPct : num 0.271 0.029 0.5 0.148 0 0.111 0.542 0.31 0.367 0.359 ...
## $ OBP : num 0.233 0.057 0.5 0.158 0 0.111 0.396 0.242 0.287 0.318 ...
## $ OPS : num 0.504 0.086 1 0.306 0 0.222 0.938 0.552 0.654 0.677 ...
## $ BABIP : num 0.22 0.05 1 0.154 0 0.2 0.368 0.269 0.276 0.31 ...
So, the salary is integers (special case of numbers), BA is numbers and league ID is a categorical variable. This is what we need for executing ANCOVA.
Now let’s visualize it.
ggplot(data=dat, mapping=aes(x=BA, y=salary, color=factor(lgID.x)))+
geom_point()
It’s a bit messy…Let’s focus on the subset of players that has BA greater than 0 but smaller than 0.375 and then log transform the salary and focus on those who earn more than 7.294163710^{5}.
dat.c = dat %>%
mutate(sal.log=log(salary)) %>%
filter(BA>0 & BA<0.375 & sal.log>13.5)
ggplot(data=dat.c, aes(x=BA, y=sal.log, group=lgID.x, color=lgID.x))+
geom_point()+
scale_color_manual(name="League",
breaks=c("AL", "NL"),
labels=c("American League", "National League"),
values=c("red","blue"))+
labs(x="Batting Average", y="Log(salary)")
Now we have a better view of the data we are going to be working on.
Let’s see how the salary related to the Batting Average in each league. First, the American league.
dat.c %>%
filter(lgID.x=="AL") %>%
ggplot()+
geom_point(aes(x=BA, y=sal.log), color="red")+
labs(x="Batting Average", y="Log(salary)", title="American League")+
geom_smooth(aes(x=BA, y=sal.log), method="lm", color="black", se=FALSE)
dat.AL = dat.c %>%
filter(lgID.x=="AL")
summary(lm(sal.log~ BA, data=dat.AL))
##
## Call:
## lm(formula = sal.log ~ BA, data = dat.AL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89659 -0.74677 0.03937 0.78588 1.94181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.6213 0.3199 45.702 <2e-16 ***
## BA 3.0943 1.2852 2.408 0.0171 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9181 on 176 degrees of freedom
## Multiple R-squared: 0.03189, Adjusted R-squared: 0.02639
## F-statistic: 5.797 on 1 and 176 DF, p-value: 0.01709
We see that there is a significant positive relationship between salary and BA in the American league.
How about the National League?
dat.c %>%
filter(lgID.x=="NL") %>%
ggplot(aes(x=BA, y=sal.log))+
geom_point(color="blue")+
labs(x="Batting Average", y="Log(salary)", title="National League")+
geom_smooth(aes(x=BA, y=sal.log), method="lm", color="black", se=FALSE, linetype="dashed")
dat.NL = dat.c %>%
filter(lgID.x=="NL")
summary(lm(sal.log~ BA, data=dat.NL))
##
## Call:
## lm(formula = sal.log ~ BA, data = dat.NL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.78248 -0.78087 -0.03334 0.78318 1.93664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.4966 0.2218 69.85 <2e-16 ***
## BA -0.7677 0.9366 -0.82 0.413
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9534 on 200 degrees of freedom
## Multiple R-squared: 0.003348, Adjusted R-squared: -0.001635
## F-statistic: 0.6719 on 1 and 200 DF, p-value: 0.4134
Graphically and statistically, the relationship between salary and BA differs from league to league. We can do ANCOVA to statistically examine if the relationship does significantly differ from league to league.
We can use lm() function to do ancova. You can also use aov() function. The only difference is just the final summary table.
mod = lm(sal.log~ BA*lgID.x, data=dat.c)
aov = aov(sal.log~ BA*lgID.x, data=dat.c)
The “BA” is a continuous variable and the “lgID.x” is a categorical variable. The “*" means we specify a interaction term between the two variables in the model.
summary(mod)
##
## Call:
## lm(formula = sal.log ~ BA * lgID.x, data = dat.c)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89659 -0.75845 -0.00003 0.78644 1.94181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.6213 0.3265 44.778 <2e-16 ***
## BA 3.0943 1.3117 2.359 0.0188 *
## lgID.xNL 0.8754 0.3926 2.229 0.0264 *
## BA:lgID.xNL -3.8621 1.6025 -2.410 0.0164 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9371 on 376 degrees of freedom
## Multiple R-squared: 0.01708, Adjusted R-squared: 0.009233
## F-statistic: 2.177 on 3 and 376 DF, p-value: 0.09027
anova(mod)
## Analysis of Variance Table
##
## Response: sal.log
## Df Sum Sq Mean Sq F value Pr(>F)
## BA 1 0.48 0.4760 0.5420 0.46205
## lgID.x 1 0.16 0.1595 0.1817 0.67020
## BA:lgID.x 1 5.10 5.1003 5.8083 0.01643 *
## Residuals 376 330.17 0.8781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## BA 1 0.5 0.476 0.542 0.4620
## lgID.x 1 0.2 0.160 0.182 0.6702
## BA:lgID.x 1 5.1 5.100 5.808 0.0164 *
## Residuals 376 330.2 0.878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output, we see that the interaction term is significant. This means that the effect (the coefficient) of one variable depends on the other one. In the case, we can say the relationship between salary and BA significantly differs from league to league.
We can use a package called interplot that allows us to visualize
library(interplot)
interplot(m=mod, var1="BA", var2="lgID.x")+
labs(x="League ID", y="Estimated coefficient for BA", title="Estimated Coefficient of BA on League ID")
This plot means the coefficient of BA on salary is higher in American League than that of National League.
Or, we can have another aspect of the interaction plot.
interplot(m=mod, var1="lgID.x", var2="BA")+
labs(x="BA", y="Estimated coefficient for League ID", title="Estimated Coefficient of League ID on BA")
This means the salary in the National league is first greater than in the American league (positive), but this difference gradually decrease with the increase of BA. Finally, the salary in the National league is lower than in the American league.
Exercise 1
Try to use the following data set to execute ANCOVA and make interaction plots This data set contains the fruit weight (Fruit) of different plants with different root depth (root) and different treatment (Grazing).
gz = read.table(text=getURL("https://raw.githubusercontent.com/OscarFHC/NRE538_GSRA/master/Labs/NRE538_ANCOVA_n_Interaction/ipomopsis.txt"), sep="", header=T,comment.char="#")
head(gz)
## Root Fruit Grazing
## 1 6.225 59.77 Ungrazed
## 2 6.487 60.98 Ungrazed
## 3 4.919 14.73 Ungrazed
## 4 5.130 19.28 Ungrazed
## 5 5.417 34.25 Ungrazed
## 6 5.359 35.53 Ungrazed
Keep in mind that anova, ancova and linear regression are all linear models with following formula.
\[Y_i = \beta_0 + \beta_{i1} X_{i1} + \beta_{i2} X_{i2} + ... + \beta_{ip} X_{ip} + \varepsilon_i\]
, where \(Y_i\) is the dependent variable (response variable) \(i\), \(X_{ip}\) are the independent variable, and \(\varepsilon_i\) is the error.
The difference between ANOVA/ANCOVA and linear regression is just the difference between \(X_{ip}\). In ANOVA, \(X_{ip}\) are categorical variable and in regression, \(X_{ip}\) are continuous variable. In ANCOVA, it is the combination of the two.
In this sense, we can also include interaction term in the linear regression model we had before. We used the air quality data in New York before to demonstrate the regression analysis. Let’s make it more complicated.
data("airquality")
head(airquality, 15)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## 11 7 NA 6.9 74 5 11
## 12 16 256 9.7 69 5 12
## 13 11 290 9.2 66 5 13
## 14 14 274 10.9 68 5 14
## 15 18 65 13.2 58 5 15
mod = lm(Temp~Wind, data=airquality)
summary(mod)
##
## Call:
## lm(formula = Temp ~ Wind, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.291 -5.723 1.709 6.016 19.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.1349 2.0522 43.921 < 2e-16 ***
## Wind -1.2305 0.1944 -6.331 2.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.442 on 151 degrees of freedom
## Multiple R-squared: 0.2098, Adjusted R-squared: 0.2045
## F-statistic: 40.08 on 1 and 151 DF, p-value: 2.642e-09
plot(Temp~Wind, data=airquality)
abline(lm(Temp~Wind, data=airquality), col="red")
This is the linear model we have seen before. We can have a interaction terms in the model but we first need to have two independent variables.
mod1 = lm(Temp~Wind + Ozone, data=airquality)
summary(mod1)
##
## Call:
## lm(formula = Temp ~ Wind + Ozone, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.487 -5.137 1.220 4.674 13.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.18035 2.96526 25.017 < 2e-16 ***
## Wind -0.37829 0.22080 -1.713 0.0894 .
## Ozone 0.17615 0.02393 7.362 3.15e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.762 on 113 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.5007, Adjusted R-squared: 0.4918
## F-statistic: 56.65 on 2 and 113 DF, p-value: < 2.2e-16
In the above model, there is not interaction terms. We can include it by changing “+” to “*“.
mod2 = lm(Temp~Wind * Ozone, data=airquality)
summary(mod2)
##
## Call:
## lm(formula = Temp ~ Wind * Ozone, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2008 -4.1492 0.5976 4.6656 13.4738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.268679 3.141135 25.554 < 2e-16 ***
## Wind -1.133278 0.275753 -4.110 7.57e-05 ***
## Ozone 0.015800 0.044790 0.353 0.725
## Wind:Ozone 0.023505 0.005687 4.133 6.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.326 on 112 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.5668, Adjusted R-squared: 0.5552
## F-statistic: 48.84 on 3 and 112 DF, p-value: < 2.2e-16
From the above model output, we see that there is a significant interaction between Wind and Ozone. This means that the coefficient of one variable depends on the other one. For example, we can use interaction plot to visualize how the coefficient of Ozone depends on Wind.
interplot(mod2, var1="Wind", var2="Ozone")+
labs(x="Wind", y="coefficient for Ozone")
Exercise 2
Use F-test to compare mod.1 and mod.2. Explain why you chose one over another to explain the data.
Plot the coefficient for Wind to Ozone using interplot() function.
Note that as we include the interaction term in the model, the coefficient for Ozone becomes non-significant. This is because Ozone and Wind are highly correlated.
cor(airquality[,"Ozone"], airquality[,"Wind"], use="na.or.complete")
## [1] -0.6015465
If encountering this case we should remove the main effect of Ozone, since the interaction between Wind and Ozone is significant but the main effect for Ozone is non-significant.
mod3 = lm(Temp~Wind + Wind:Ozone, data=airquality)
summary(mod3)
##
## Call:
## lm(formula = Temp ~ Wind + Wind:Ozone, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2622 -3.9667 0.5829 4.6323 13.6894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.101122 2.065187 39.271 < 2e-16 ***
## Wind -1.210998 0.165189 -7.331 3.68e-11 ***
## Wind:Ozone 0.025242 0.002831 8.916 9.81e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.302 on 113 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.5663, Adjusted R-squared: 0.5586
## F-statistic: 73.77 on 2 and 113 DF, p-value: < 2.2e-16
anova(mod2, mod3)
## Analysis of Variance Table
##
## Model 1: Temp ~ Wind * Ozone
## Model 2: Temp ~ Wind + Wind:Ozone
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 112 4482.7
## 2 113 4487.7 -1 -4.9808 0.1244 0.7249
From the above results, we see that the two models are not significantly different from each other in terms of the variance of response variable explained. This is based on F-test we have learned from the multiple regression session. This can only be executed if the two models are nested. However, this model comparison (or model selection) topic will not be covered here but you are welcome to explore it if you are interested in!