Reference: Stock, James H. and Mark W. Watson (2003) Introduction to Econometrics, Addison-Wesley Educational Publishers, http://wps.aw.com/aw_stockwatsn_economtrcs_1, chapter 7.
#install.packages("Ecdat", repos="http://R-Forge.R-project.org")
library("Ecdat")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
data(MCAS)
mcas <- MCAS[,c("totsc4","avgsalary","regday","lnchpct")]
# There are some NA values for certain municipalities; omitting those
mcas <- mcas[complete.cases(mcas),]
attach(mcas)
Before we begin the modeling, here is a scattergram of the four variables of interest (3 independent and 1 dependent):
plot(mcas,pch=15)
It is clear that there are some significant correlations, both positive and negative, between these four variables.
totsc4.lm_EW <- lm(totsc4 ~ regday + lnchpct + avgsalary)
summary(totsc4.lm_EW)
##
## Call:
## lm(formula = totsc4 ~ regday + lnchpct + avgsalary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.164 -5.653 -1.200 5.405 24.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 693.253244 7.908557 87.659 <2e-16 ***
## regday 0.001592 0.000911 1.747 0.0822 .
## lnchpct -0.757799 0.043922 -17.253 <2e-16 ***
## avgsalary 0.575086 0.250464 2.296 0.0228 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.075 on 191 degrees of freedom
## Multiple R-squared: 0.6626, Adjusted R-squared: 0.6573
## F-statistic: 125 on 3 and 191 DF, p-value: < 2.2e-16
Here we simply add all the independent variables to the model at once, regardless of their expected contributions.
First I’m creating the correlation matrix.
cor(mcas)
## totsc4 avgsalary regday lnchpct
## totsc4 1.0000000 0.3686397 0.16603021 -0.79471760
## avgsalary 0.3686397 1.0000000 0.51841749 -0.26989136
## regday 0.1660302 0.5184175 1.00000000 -0.02357037
## lnchpct -0.7947176 -0.2698914 -0.02357037 1.00000000
From this matrix, we would expect lnchpct to have the effect on totsc4 (correlation = -0.79), so we start with that:
totsc4.lm_H1 <- lm(totsc4 ~ lnchpct)
summary(totsc4.lm_H1)
##
## Call:
## lm(formula = totsc4 ~ lnchpct)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.968 -6.491 -1.027 5.980 23.877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 721.77627 0.96708 746.35 <2e-16 ***
## lnchpct -0.79166 0.04352 -18.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.434 on 193 degrees of freedom
## Multiple R-squared: 0.6316, Adjusted R-squared: 0.6297
## F-statistic: 330.9 on 1 and 193 DF, p-value: < 2.2e-16
What this is suggesting is that if there were ZERO students on free and reduced price lunches, the average MCAS score for 4th graders would be 722 (which is greater than the state-wide average). Here is the regression line for this model, which is a 2D plot because we only include one independent variable.
conf <- confint(totsc4.lm_H1,level=0.95)
lwr <- conf[,1]
upr <- conf[,2]
plot(lnchpct,totsc4,pch=21, bg='red',main="MCAS score vs. % free lunch students")
abline(totsc4.lm_H1$coef, lwd=3)
abline(lwr, lty=2, col='red')
abline(upr, lty=2, col='red')
The second biggest \(R^2\) with totsc4 is avgsalary (correlation = 0.37), so we’ll add that next:
totsc4.lm_H2 <- lm(totsc4 ~ lnchpct + avgsalary)
summary(totsc4.lm_H2)
##
## Call:
## lm(formula = totsc4 ~ lnchpct + avgsalary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.246 -5.811 -1.160 4.546 26.121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 691.98870 7.91735 87.402 < 2e-16 ***
## lnchpct -0.74696 0.04371 -17.088 < 2e-16 ***
## avgsalary 0.80786 0.21322 3.789 0.000202 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.124 on 192 degrees of freedom
## Multiple R-squared: 0.6572, Adjusted R-squared: 0.6536
## F-statistic: 184.1 on 2 and 192 DF, p-value: < 2.2e-16
This tells us that if we had no reduced price lunch children but also did not pay the teachers, we would see a lower performance on test scores than we did above (drops from 722 to 692). Because we have two independent variables in this model, we can use a 3D plot to visualize the model and a regression plane (plus 95% confidence interval planes.
#install.packages("scatterplot3d")
library(scatterplot3d)
H2_3d <- scatterplot3d(lnchpct, avgsalary, totsc4, pch=15, main = "MCAS score vs. % free lunch students and teacher salary", bg = 'ivory4', xlab = "% free lunch students", ylab = "Avg. teacher salary", zlab = "Avg. 4th grade MCAS score", axis=TRUE)
H2_3d$plane3d(totsc4.lm_H2, col = 'blue')
H2_3d$plane3d(confint(totsc4.lm_H2)[,1],col="red")
H2_3d$plane3d(confint(totsc4.lm_H2)[,2],col="red")
Finally, we’ll add regday (correlation = 0.17):
totsc4.lm_H3 <- lm(totsc4 ~ lnchpct + avgsalary + regday)
summary(totsc4.lm_H3)
##
## Call:
## lm(formula = totsc4 ~ lnchpct + avgsalary + regday)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.164 -5.653 -1.200 5.405 24.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 693.253244 7.908557 87.659 <2e-16 ***
## lnchpct -0.757799 0.043922 -17.253 <2e-16 ***
## avgsalary 0.575086 0.250464 2.296 0.0228 *
## regday 0.001592 0.000911 1.747 0.0822 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.075 on 191 degrees of freedom
## Multiple R-squared: 0.6626, Adjusted R-squared: 0.6573
## F-statistic: 125 on 3 and 191 DF, p-value: < 2.2e-16
By adding the per capita student expenditures to the model, we see that a our intercept test score actually increases slightly from the previous model. By looking at the resulting coefficients and the correlation matrix, it appears that avgsalary is a suppressor of regday, as they have a correlation of 0.52 and the regday coefficient (0.0016) is significantly lower than the initial estimate based on the correlation matrix (0.166).
totsc4.lm_S <- lm(totsc4 ~ regday + lnchpct + avgsalary)
step(totsc4.lm_S,direction="both")
## Start: AIC=864.13
## totsc4 ~ regday + lnchpct + avgsalary
##
## Df Sum of Sq RSS AIC
## <none> 15731 864.13
## - regday 1 251.4 15982 865.22
## - avgsalary 1 434.2 16165 867.44
## - lnchpct 1 24516.9 40248 1045.31
##
## Call:
## lm(formula = totsc4 ~ regday + lnchpct + avgsalary)
##
## Coefficients:
## (Intercept) regday lnchpct avgsalary
## 693.253244 0.001592 -0.757799 0.575086
This model is emperically determining the order in which to include each independent variable. The decision is based on the increase in the \(R^2\) value that the addition of each variable imparts.