Data selection

  1. This data is taken from the R package ‘Ecdat’ that contains various data sets for econometrics (details here). The specific set being analyzed here consists of education statistics from 220 municipalities across the state of Massachussetts, collected during 1997-1998.

Reference: Stock, James H. and Mark W. Watson (2003) Introduction to Econometrics, Addison-Wesley Educational Publishers, http://wps.aw.com/aw_stockwatsn_economtrcs_1, chapter 7.

  1. Pick the independent and dependent variables.
  1. Read the dataset as a data.table or data.frame.
#install.packages("Ecdat", repos="http://R-Forge.R-project.org")
library("Ecdat")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
data(MCAS)
mcas <- MCAS[,c("totsc4","avgsalary","regday","lnchpct")]
# There are some NA values for certain municipalities; omitting those
mcas <- mcas[complete.cases(mcas),]
attach(mcas)
  1. The null hypothesis that I am testing, \(H_0\): there is no relationship between either teacher salary, per capita student spending, or percentage of children on reduced price lunch with the outcomes of the 4th grade MCAS test.

Linear model

Before we begin the modeling, here is a scattergram of the four variables of interest (3 independent and 1 dependent):

plot(mcas,pch=15)

It is clear that there are some significant correlations, both positive and negative, between these four variables.

Entry-wise model:

totsc4.lm_EW <- lm(totsc4 ~ regday + lnchpct + avgsalary)
summary(totsc4.lm_EW)
## 
## Call:
## lm(formula = totsc4 ~ regday + lnchpct + avgsalary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.164  -5.653  -1.200   5.405  24.924 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 693.253244   7.908557  87.659   <2e-16 ***
## regday        0.001592   0.000911   1.747   0.0822 .  
## lnchpct      -0.757799   0.043922 -17.253   <2e-16 ***
## avgsalary     0.575086   0.250464   2.296   0.0228 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.075 on 191 degrees of freedom
## Multiple R-squared:  0.6626, Adjusted R-squared:  0.6573 
## F-statistic:   125 on 3 and 191 DF,  p-value: < 2.2e-16

Here we simply add all the independent variables to the model at once, regardless of their expected contributions.

Hierarchical model:

First I’m creating the correlation matrix.

cor(mcas)
##               totsc4  avgsalary      regday     lnchpct
## totsc4     1.0000000  0.3686397  0.16603021 -0.79471760
## avgsalary  0.3686397  1.0000000  0.51841749 -0.26989136
## regday     0.1660302  0.5184175  1.00000000 -0.02357037
## lnchpct   -0.7947176 -0.2698914 -0.02357037  1.00000000

From this matrix, we would expect lnchpct to have the effect on totsc4 (correlation = -0.79), so we start with that:

totsc4.lm_H1 <- lm(totsc4 ~ lnchpct)
summary(totsc4.lm_H1)
## 
## Call:
## lm(formula = totsc4 ~ lnchpct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.968  -6.491  -1.027   5.980  23.877 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 721.77627    0.96708  746.35   <2e-16 ***
## lnchpct      -0.79166    0.04352  -18.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.434 on 193 degrees of freedom
## Multiple R-squared:  0.6316, Adjusted R-squared:  0.6297 
## F-statistic: 330.9 on 1 and 193 DF,  p-value: < 2.2e-16

What this is suggesting is that if there were ZERO students on free and reduced price lunches, the average MCAS score for 4th graders would be 722 (which is greater than the state-wide average). Here is the regression line for this model, which is a 2D plot because we only include one independent variable.

conf <- confint(totsc4.lm_H1,level=0.95)
lwr <- conf[,1]
upr <- conf[,2]
plot(lnchpct,totsc4,pch=21, bg='red',main="MCAS score vs. % free lunch students")
abline(totsc4.lm_H1$coef, lwd=3)
abline(lwr, lty=2, col='red')
abline(upr, lty=2, col='red')

The second biggest \(R^2\) with totsc4 is avgsalary (correlation = 0.37), so we’ll add that next:

totsc4.lm_H2 <- lm(totsc4 ~ lnchpct + avgsalary)
summary(totsc4.lm_H2)
## 
## Call:
## lm(formula = totsc4 ~ lnchpct + avgsalary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.246  -5.811  -1.160   4.546  26.121 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 691.98870    7.91735  87.402  < 2e-16 ***
## lnchpct      -0.74696    0.04371 -17.088  < 2e-16 ***
## avgsalary     0.80786    0.21322   3.789 0.000202 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.124 on 192 degrees of freedom
## Multiple R-squared:  0.6572, Adjusted R-squared:  0.6536 
## F-statistic: 184.1 on 2 and 192 DF,  p-value: < 2.2e-16

This tells us that if we had no reduced price lunch children but also did not pay the teachers, we would see a lower performance on test scores than we did above (drops from 722 to 692). Because we have two independent variables in this model, we can use a 3D plot to visualize the model and a regression plane (plus 95% confidence interval planes.

#install.packages("scatterplot3d")
library(scatterplot3d)
H2_3d <- scatterplot3d(lnchpct, avgsalary, totsc4, pch=15, main = "MCAS score vs. % free lunch students and teacher salary", bg = 'ivory4', xlab = "% free lunch students", ylab = "Avg. teacher salary", zlab = "Avg. 4th grade MCAS score", axis=TRUE)
H2_3d$plane3d(totsc4.lm_H2, col = 'blue')
H2_3d$plane3d(confint(totsc4.lm_H2)[,1],col="red")
H2_3d$plane3d(confint(totsc4.lm_H2)[,2],col="red")

Finally, we’ll add regday (correlation = 0.17):

totsc4.lm_H3 <- lm(totsc4 ~ lnchpct + avgsalary + regday)
summary(totsc4.lm_H3)
## 
## Call:
## lm(formula = totsc4 ~ lnchpct + avgsalary + regday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.164  -5.653  -1.200   5.405  24.924 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 693.253244   7.908557  87.659   <2e-16 ***
## lnchpct      -0.757799   0.043922 -17.253   <2e-16 ***
## avgsalary     0.575086   0.250464   2.296   0.0228 *  
## regday        0.001592   0.000911   1.747   0.0822 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.075 on 191 degrees of freedom
## Multiple R-squared:  0.6626, Adjusted R-squared:  0.6573 
## F-statistic:   125 on 3 and 191 DF,  p-value: < 2.2e-16

By adding the per capita student expenditures to the model, we see that a our intercept test score actually increases slightly from the previous model. By looking at the resulting coefficients and the correlation matrix, it appears that avgsalary is a suppressor of regday, as they have a correlation of 0.52 and the regday coefficient (0.0016) is significantly lower than the initial estimate based on the correlation matrix (0.166).

Sequential model:

totsc4.lm_S <- lm(totsc4 ~ regday + lnchpct + avgsalary)
step(totsc4.lm_S,direction="both")
## Start:  AIC=864.13
## totsc4 ~ regday + lnchpct + avgsalary
## 
##             Df Sum of Sq   RSS     AIC
## <none>                   15731  864.13
## - regday     1     251.4 15982  865.22
## - avgsalary  1     434.2 16165  867.44
## - lnchpct    1   24516.9 40248 1045.31
## 
## Call:
## lm(formula = totsc4 ~ regday + lnchpct + avgsalary)
## 
## Coefficients:
## (Intercept)       regday      lnchpct    avgsalary  
##  693.253244     0.001592    -0.757799     0.575086

This model is emperically determining the order in which to include each independent variable. The decision is based on the increase in the \(R^2\) value that the addition of each variable imparts.