library(Ecdat)
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
library(scatterplot3d)
library(RColorBrewer)
data(MCAS)
# clean data
MCAS <- MCAS[complete.cases(MCAS[,c('totsc8','percap','totday', 'avgsalary')]),]
# created trimmed version of data with only variables that are used in the model
MCAS.f <- MCAS[,c('totsc8', 'percap','totday', 'avgsalary')]
attach(MCAS)
This dataset is based on information collected from the Massachussetts Comprehensive Assessment System (MCAS) from their department of education as part of the 1990 census. My \( H_0 \) is that per capita income, percent english learners, and percent of students eligible for reduced price or free lunch have no effect on students 8th grade test scores (totsc8). I plan to build an explanatory model of these data to see if test scores can be explained by other econometric variables.
summary(MCAS)
## code municipa district regday
## Min. : 1.0 ABINGTON : 1 Abington : 1 Min. :3023
## 1st Qu.: 84.5 ACUSHNET : 1 Acushnet : 1 1st Qu.:4158
## Median :161.0 AGAWAM : 1 Agawam : 1 Median :4521
## Mean :166.6 AMESBURY : 1 Amesbury : 1 Mean :4680
## 3rd Qu.:259.5 ANDOVER : 1 Andover : 1 3rd Qu.:5031
## Max. :348.0 ARLINGTON: 1 Arlington: 1 Max. :8759
## (Other) :157 (Other) :157
## specneed bilingua occupday totday
## Min. : 3832 Min. : 0 Min. : 0 Min. :3868
## 1st Qu.: 7432 1st Qu.: 0 1st Qu.: 0 1st Qu.:4786
## Median : 8288 Median : 0 Median : 0 Median :5223
## Mean : 8621 Mean : 4039 Mean : 1398 Mean :5433
## 3rd Qu.: 9710 3rd Qu.: 3628 3rd Qu.: 0 3rd Qu.:5870
## Max. :15741 Max. :295140 Max. :15088 Max. :9868
##
## spc speced lnchpct tchratio
## Min. : 2.600 Min. :10.40 Min. : 0.40 Min. :11.40
## 1st Qu.: 6.350 1st Qu.:13.60 1st Qu.: 5.40 1st Qu.:16.00
## Median : 7.900 Median :15.50 Median :11.20 Median :17.10
## Mean : 8.174 Mean :16.02 Mean :16.58 Mean :17.34
## 3rd Qu.: 9.800 3rd Qu.:17.85 3rd Qu.:21.40 3rd Qu.:18.85
## Max. :16.600 Max. :26.00 Max. :76.20 Max. :27.00
## NA's :8
## percap totsc4 totsc8 avgsalary
## Min. : 9.686 Min. :658.0 Min. :641 Min. :24.96
## 1st Qu.:15.058 1st Qu.:700.0 1st Qu.:685 1st Qu.:33.92
## Median :17.275 Median :710.0 Median :698 Median :36.09
## Mean :18.528 Mean :708.7 Mean :698 Mean :36.22
## 3rd Qu.:20.288 3rd Qu.:720.0 3rd Qu.:712 3rd Qu.:38.11
## Max. :46.855 Max. :740.0 Max. :747 Max. :44.49
##
## pctel
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 1.340
## 3rd Qu.: 1.146
## Max. :24.494
##
# look at plots of variables that I will use
plot(MCAS.f)
# look at correlations of variables that I will use
cor(MCAS.f)
## totsc8 percap totday avgsalary
## totsc8 1.0000000 0.7832548 0.1638890 0.4332872
## percap 0.7832548 1.0000000 0.4415456 0.6206904
## totday 0.1638890 0.4415456 1.0000000 0.4818724
## avgsalary 0.4332872 0.6206904 0.4818724 1.0000000
# check normality of the variables that I will use
par(mfrow = c(2,2))
hist(MCAS.f$totsc8)
hist(MCAS.f$percap)
hist(MCAS.f$totday)
hist(MCAS.f$avgsalary)
The checking for the normality in these variables it seems that the distributions of percap and totday are skewed to the left slightly, but remain mostly normally distributed. This leads me to believe that for the purposes of this analysis we can leave the dataset as is rather than selectively adjusting the aggregated data to fit our needs and possibly introducing bias.
code:district code (numerical) municipa:municipality (name) district:district name regday:spending per pupil, regular specneed:spending per pupil, special needs bilingua:spending per pupil, bilingual occupday:spending per pupil, occupational totday:spending per pupil, total spc:students per computer speced:special education students lnchpct:eligible for free or reduced price lunch tchratio:students per teacher percap:per capita income totsc4:4th grade score (math+english+science) totsc8:8th grade score (math+english+science) avgsalary:average teacher salary pctel:percent english learners
For the entry-wise model I will be inputting the variables percap, totday, and avgsalary. Since this model is entry-wise I will input all three into the same linear model at once.
## Entry-wise model
fit <- lm(totsc8 ~ percap + totday + avgsalary)
summary(fit)
##
## Call:
## lm(formula = totsc8 ~ percap + totday + avgsalary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.080 -7.720 0.346 6.970 46.343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 663.657864 12.308381 53.919 < 2e-16 ***
## percap 3.487879 0.238546 14.621 < 2e-16 ***
## totday -0.005105 0.001243 -4.107 6.39e-05 ***
## avgsalary -0.070326 0.415916 -0.169 0.866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.76 on 159 degrees of freedom
## Multiple R-squared: 0.6547, Adjusted R-squared: 0.6482
## F-statistic: 100.5 on 3 and 159 DF, p-value: < 2.2e-16
Building a step wise model depends on inputing terms based on the amount of variance explained sequentially into the model. Based on the correlation matrix between the variables the order of variables to be added will be determined.
## Step-wise
cor(MCAS.f)
## totsc8 percap totday avgsalary
## totsc8 1.0000000 0.7832548 0.1638890 0.4332872
## percap 0.7832548 1.0000000 0.4415456 0.6206904
## totday 0.1638890 0.4415456 1.0000000 0.4818724
## avgsalary 0.4332872 0.6206904 0.4818724 1.0000000
# based on the correlation matrix percap should be the first variable added to the model
fit_sw1 <- lm(totsc8 ~ percap)
summary(fit_sw1)
##
## Call:
## lm(formula = totsc8 ~ percap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.110 -6.688 1.009 8.051 43.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 641.0248 3.7157 172.52 <2e-16 ***
## percap 3.0751 0.1924 15.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.41 on 161 degrees of freedom
## Multiple R-squared: 0.6135, Adjusted R-squared: 0.6111
## F-statistic: 255.5 on 1 and 161 DF, p-value: < 2.2e-16
# next the average teacher salary should be input
fit_sw2 <- lm(totsc8 ~ percap + avgsalary)
summary(fit_sw2)
##
## Call:
## lm(formula = totsc8 ~ percap + avgsalary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.482 -7.536 0.324 8.019 38.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 657.9665 12.8223 51.31 <2e-16 ***
## percap 3.2847 0.2447 13.43 <2e-16 ***
## avgsalary -0.5750 0.4166 -1.38 0.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.38 on 160 degrees of freedom
## Multiple R-squared: 0.618, Adjusted R-squared: 0.6133
## F-statistic: 129.4 on 2 and 160 DF, p-value: < 2.2e-16
# finally the total money spent per student per day should be input
fit_sw3 <- lm(totsc8 ~ percap + avgsalary + totday)
summary(fit_sw3)
##
## Call:
## lm(formula = totsc8 ~ percap + avgsalary + totday)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.080 -7.720 0.346 6.970 46.343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 663.657864 12.308381 53.919 < 2e-16 ***
## percap 3.487879 0.238546 14.621 < 2e-16 ***
## avgsalary -0.070326 0.415916 -0.169 0.866
## totday -0.005105 0.001243 -4.107 6.39e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.76 on 159 degrees of freedom
## Multiple R-squared: 0.6547, Adjusted R-squared: 0.6482
## F-statistic: 100.5 on 3 and 159 DF, p-value: < 2.2e-16
To build a hierarchical model we will input the terms based on theory into the model.
# per capita income model
fit_p <- lm(totsc8 ~ percap)
summary(fit_p)
##
## Call:
## lm(formula = totsc8 ~ percap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.110 -6.688 1.009 8.051 43.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 641.0248 3.7157 172.52 <2e-16 ***
## percap 3.0751 0.1924 15.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.41 on 161 degrees of freedom
## Multiple R-squared: 0.6135, Adjusted R-squared: 0.6111
## F-statistic: 255.5 on 1 and 161 DF, p-value: < 2.2e-16
# per capita plus average teacher salary model
fit_pt <- lm(totsc8 ~ percap + totday)
summary(fit_pt)
##
## Call:
## lm(formula = totsc8 ~ percap + totday)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.725 -7.691 0.293 7.088 47.011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 661.835686 5.928563 111.635 < 2e-16 ***
## percap 3.466951 0.203304 17.053 < 2e-16 ***
## totday -0.005167 0.001184 -4.365 2.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.72 on 160 degrees of freedom
## Multiple R-squared: 0.6546, Adjusted R-squared: 0.6503
## F-statistic: 151.6 on 2 and 160 DF, p-value: < 2.2e-16
# all three
fit_pta <- lm(totsc8 ~ percap + totday + avgsalary)
summary(fit_pta)
##
## Call:
## lm(formula = totsc8 ~ percap + totday + avgsalary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.080 -7.720 0.346 6.970 46.343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 663.657864 12.308381 53.919 < 2e-16 ***
## percap 3.487879 0.238546 14.621 < 2e-16 ***
## totday -0.005105 0.001243 -4.107 6.39e-05 ***
## avgsalary -0.070326 0.415916 -0.169 0.866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.76 on 159 degrees of freedom
## Multiple R-squared: 0.6547, Adjusted R-squared: 0.6482
## F-statistic: 100.5 on 3 and 159 DF, p-value: < 2.2e-16
anova(fit_p, fit_pt, fit_pta)
## Analysis of Variance Table
##
## Model 1: totsc8 ~ percap
## Model 2: totsc8 ~ percap + totday
## Model 3: totsc8 ~ percap + totday + avgsalary
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 161 28966
## 2 160 25884 1 3081.99 18.9354 2.41e-05 ***
## 3 159 25879 1 4.65 0.0286 0.8659
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# per capita income
plot(percap, totsc8, pch = 21, bg = 'blue')
conf <- confint(fit_p)
abline(fit_p)
abline(conf[,1], lty = 2, col = 'red')
abline(conf[,2], lty = 2, col = 'red')
# total amount of money spent on a student per day
fit_t <- lm(totsc8 ~ totday)
plot(totday, totsc8, pch = 21, bg = 'blue')
abline(fit_t)
conf <- confint(fit_t)
abline(conf[,1], lty = 2, col = 'red')
abline(conf[,2], lty = 2, col = 'red')
# average teacher salary
fit_a <- lm(totsc8 ~ avgsalary)
plot(avgsalary, totsc8, pch = 21, bg = 'blue')
abline(fit_a)
conf <- confint(fit_a)
abline(conf[,1], lty = 2, col = 'red')
abline(conf[,2], lty = 2, col = 'red')
# plot hierarchical model in 3d
mdl2 <- scatterplot3d(percap, avgsalary, totsc8, pch=21, main = "Percap and avgsalary vs. 8th grade test scores", bg = 'ivory4', xlab = "Per-Capita Income", ylab = "Percentage of Average Teacher Salary", zlab = "8th Grade Test Scores", axis=TRUE)
mdl2$plane3d(fit_sw2, col = 'dodgerblue4')
# plot stepwise model in 3d
# plot model in 3d
mdl2 <- scatterplot3d(percap, totday, totsc8, pch=21, main = "Percap and totday vs. 8th grade test scores", bg = 'ivory4', xlab = "Per-Capita Income", ylab = "Percentage of Free/Reduced Lunch Students", zlab = "8th Grade Test Scores", axis=TRUE)
mdl2$plane3d(fit_pt, col = 'dodgerblue4')
The entry-wise model showed a significant amount of variance explained by percap and totday, while avgsalary did not explain a significant amount of variance at all. This was shown by it's insignificant F test score from the summary of the linear model. The \( R^2 \) was .64, meaning that 64% of the varaiance in test scores was explained by these three variables.
The step wise model indicated that building a step wise model based on correlation information may not be the best way, as totday seemed to explain more of the variance in totsc8 although avgsalary was less correlated. When building the step wise model we can see that avgsalary increases the \( R^2 \) very little (.0022), whereas totday increases it by .03. Both of the other terms besides percap seem to contribute little to the overall explanation of test scores.
The theoretical predictions made to build the hierarchical model turned out to be true in terms of order. Totday explained more variance than avgsalary. Totday explains more variance when just added to percap than it does when put into a model with avgsalary as well, meaning the best model in terms of variance explained is just percap and totday. The variance explained probably drops when avgsalary is added because of the low level correlation between totday and avgsalary.
From the Anova test of fit models we can see that the model with just totday and percap explains significantly more varaince than the other models in question, and thus is most likely the best explanatory model. Thus, using this model we reject the null hypothesis that totday and percap do not influence test scores but accept that average teacher salary does not affect test scores. Looking at the outliers in the avgsalary plot they seem to play a role in the overall fit of the model, making it much worse for this variable. In a future analysis it may be best to take these out to improve explanatory accuracy.
In the final model \( b_0 \) was 661.8, meaning that this is the test score when both the per capita income and total amount spent on a student per day are 0. The \( b_1 \) for each of these variables was 3.46 for percap and -.005 for totday. This makes sense for percap, as you would expect test scores to go up for a student with improved socio-economic status but for totday this seems to be the opposite of what you would expect. This may reflect the small impact that totday has on the variance explained and the influence that outliers had on it's reliability. Totday may explain variance within the data set but may not hold 'real-world' validity in predicting test scores based on it's small and possibly negligable effect.