Project #2: Multiple Regression

Sam Rose

3/19/2015

1. Data

Load data

library(Ecdat)

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

library(scatterplot3d)
library(RColorBrewer)
data(MCAS)
# clean data
MCAS <- MCAS[complete.cases(MCAS[,c('totsc8','percap','totday', 'avgsalary')]),]
# created trimmed version of data with only variables that are used in the model
MCAS.f <- MCAS[,c('totsc8', 'percap','totday', 'avgsalary')]
attach(MCAS)

Description

This dataset is based on information collected from the Massachussetts Comprehensive Assessment System (MCAS) from their department of education as part of the 1990 census. My \( H_0 \) is that per capita income, percent english learners, and percent of students eligible for reduced price or free lunch have no effect on students 8th grade test scores (totsc8). I plan to build an explanatory model of these data to see if test scores can be explained by other econometric variables.

summary(MCAS)

##       code            municipa        district       regday    
##  Min.   :  1.0   ABINGTON :  1   Abington :  1   Min.   :3023  
##  1st Qu.: 84.5   ACUSHNET :  1   Acushnet :  1   1st Qu.:4158  
##  Median :161.0   AGAWAM   :  1   Agawam   :  1   Median :4521  
##  Mean   :166.6   AMESBURY :  1   Amesbury :  1   Mean   :4680  
##  3rd Qu.:259.5   ANDOVER  :  1   Andover  :  1   3rd Qu.:5031  
##  Max.   :348.0   ARLINGTON:  1   Arlington:  1   Max.   :8759  
##                  (Other)  :157   (Other)  :157                 
##     specneed        bilingua         occupday         totday    
##  Min.   : 3832   Min.   :     0   Min.   :    0   Min.   :3868  
##  1st Qu.: 7432   1st Qu.:     0   1st Qu.:    0   1st Qu.:4786  
##  Median : 8288   Median :     0   Median :    0   Median :5223  
##  Mean   : 8621   Mean   :  4039   Mean   : 1398   Mean   :5433  
##  3rd Qu.: 9710   3rd Qu.:  3628   3rd Qu.:    0   3rd Qu.:5870  
##  Max.   :15741   Max.   :295140   Max.   :15088   Max.   :9868  
##                                                                 
##       spc             speced         lnchpct         tchratio    
##  Min.   : 2.600   Min.   :10.40   Min.   : 0.40   Min.   :11.40  
##  1st Qu.: 6.350   1st Qu.:13.60   1st Qu.: 5.40   1st Qu.:16.00  
##  Median : 7.900   Median :15.50   Median :11.20   Median :17.10  
##  Mean   : 8.174   Mean   :16.02   Mean   :16.58   Mean   :17.34  
##  3rd Qu.: 9.800   3rd Qu.:17.85   3rd Qu.:21.40   3rd Qu.:18.85  
##  Max.   :16.600   Max.   :26.00   Max.   :76.20   Max.   :27.00  
##  NA's   :8                                                       
##      percap           totsc4          totsc8      avgsalary    
##  Min.   : 9.686   Min.   :658.0   Min.   :641   Min.   :24.96  
##  1st Qu.:15.058   1st Qu.:700.0   1st Qu.:685   1st Qu.:33.92  
##  Median :17.275   Median :710.0   Median :698   Median :36.09  
##  Mean   :18.528   Mean   :708.7   Mean   :698   Mean   :36.22  
##  3rd Qu.:20.288   3rd Qu.:720.0   3rd Qu.:712   3rd Qu.:38.11  
##  Max.   :46.855   Max.   :740.0   Max.   :747   Max.   :44.49  
##                                                                
##      pctel       
##  Min.   : 0.000  
##  1st Qu.: 0.000  
##  Median : 0.000  
##  Mean   : 1.340  
##  3rd Qu.: 1.146  
##  Max.   :24.494  
##

# look at plots of variables that I will use
plot(MCAS.f)

plot of chunk unnamed-chunk-2

# look at correlations of variables that I will use
cor(MCAS.f)

##              totsc8    percap    totday avgsalary
## totsc8    1.0000000 0.7832548 0.1638890 0.4332872
## percap    0.7832548 1.0000000 0.4415456 0.6206904
## totday    0.1638890 0.4415456 1.0000000 0.4818724
## avgsalary 0.4332872 0.6206904 0.4818724 1.0000000

# check normality of the variables that I will use
par(mfrow = c(2,2))
hist(MCAS.f$totsc8)
hist(MCAS.f$percap)
hist(MCAS.f$totday)
hist(MCAS.f$avgsalary)

plot of chunk unnamed-chunk-2

The checking for the normality in these variables it seems that the distributions of percap and totday are skewed to the left slightly, but remain mostly normally distributed. This leads me to believe that for the purposes of this analysis we can leave the dataset as is rather than selectively adjusting the aggregated data to fit our needs and possibly introducing bias.

Data labels

code:district code (numerical) municipa:municipality (name) district:district name regday:spending per pupil, regular specneed:spending per pupil, special needs bilingua:spending per pupil, bilingual occupday:spending per pupil, occupational totday:spending per pupil, total spc:students per computer speced:special education students lnchpct:eligible for free or reduced price lunch tchratio:students per teacher percap:per capita income totsc4:4th grade score (math+english+science) totsc8:8th grade score (math+english+science) avgsalary:average teacher salary pctel:percent english learners

Create Multiple Regression Model

Entry-wise Model

For the entry-wise model I will be inputting the variables percap, totday, and avgsalary. Since this model is entry-wise I will input all three into the same linear model at once.

## Entry-wise model
fit <- lm(totsc8 ~ percap + totday + avgsalary)
summary(fit)

## 
## Call:
## lm(formula = totsc8 ~ percap + totday + avgsalary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.080  -7.720   0.346   6.970  46.343 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 663.657864  12.308381  53.919  < 2e-16 ***
## percap        3.487879   0.238546  14.621  < 2e-16 ***
## totday       -0.005105   0.001243  -4.107 6.39e-05 ***
## avgsalary    -0.070326   0.415916  -0.169    0.866    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.76 on 159 degrees of freedom
## Multiple R-squared:  0.6547, Adjusted R-squared:  0.6482 
## F-statistic: 100.5 on 3 and 159 DF,  p-value: < 2.2e-16

Step-wise model

Building a step wise model depends on inputing terms based on the amount of variance explained sequentially into the model. Based on the correlation matrix between the variables the order of variables to be added will be determined.

## Step-wise

cor(MCAS.f)

##              totsc8    percap    totday avgsalary
## totsc8    1.0000000 0.7832548 0.1638890 0.4332872
## percap    0.7832548 1.0000000 0.4415456 0.6206904
## totday    0.1638890 0.4415456 1.0000000 0.4818724
## avgsalary 0.4332872 0.6206904 0.4818724 1.0000000

# based on the correlation matrix percap should be the first variable added to the model

fit_sw1 <- lm(totsc8 ~ percap)
summary(fit_sw1)

## 
## Call:
## lm(formula = totsc8 ~ percap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.110  -6.688   1.009   8.051  43.816 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 641.0248     3.7157  172.52   <2e-16 ***
## percap        3.0751     0.1924   15.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.41 on 161 degrees of freedom
## Multiple R-squared:  0.6135, Adjusted R-squared:  0.6111 
## F-statistic: 255.5 on 1 and 161 DF,  p-value: < 2.2e-16

# next the average teacher salary should be input
fit_sw2 <- lm(totsc8 ~ percap + avgsalary)
summary(fit_sw2)

## 
## Call:
## lm(formula = totsc8 ~ percap + avgsalary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.482  -7.536   0.324   8.019  38.674 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 657.9665    12.8223   51.31   <2e-16 ***
## percap        3.2847     0.2447   13.43   <2e-16 ***
## avgsalary    -0.5750     0.4166   -1.38    0.169    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.38 on 160 degrees of freedom
## Multiple R-squared:  0.618,  Adjusted R-squared:  0.6133 
## F-statistic: 129.4 on 2 and 160 DF,  p-value: < 2.2e-16

# finally the total money spent per student per day should be input
fit_sw3 <- lm(totsc8 ~ percap + avgsalary + totday)
summary(fit_sw3)

## 
## Call:
## lm(formula = totsc8 ~ percap + avgsalary + totday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.080  -7.720   0.346   6.970  46.343 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 663.657864  12.308381  53.919  < 2e-16 ***
## percap        3.487879   0.238546  14.621  < 2e-16 ***
## avgsalary    -0.070326   0.415916  -0.169    0.866    
## totday       -0.005105   0.001243  -4.107 6.39e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.76 on 159 degrees of freedom
## Multiple R-squared:  0.6547, Adjusted R-squared:  0.6482 
## F-statistic: 100.5 on 3 and 159 DF,  p-value: < 2.2e-16

Hierarchical

To build a hierarchical model we will input the terms based on theory into the model.

socio-economic status has often been cited as a positive influencer of academic aptitude and test scores, so per capita income will go first into the model
theoretically, if you pay more money per student per day then this should translate into overall productivity and performance of the students so this will go second
the average teacher salary will go third because this may not be the best predictor or student performance, being that the age of teachers generally will mean they get paid more due to greater tenure, but may not necessarily mean that they are better teachers.

# per capita income model
fit_p <- lm(totsc8 ~ percap)
summary(fit_p)

## 
## Call:
## lm(formula = totsc8 ~ percap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.110  -6.688   1.009   8.051  43.816 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 641.0248     3.7157  172.52   <2e-16 ***
## percap        3.0751     0.1924   15.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.41 on 161 degrees of freedom
## Multiple R-squared:  0.6135, Adjusted R-squared:  0.6111 
## F-statistic: 255.5 on 1 and 161 DF,  p-value: < 2.2e-16

# per capita plus average teacher salary model
fit_pt <- lm(totsc8 ~ percap + totday)
summary(fit_pt)

## 
## Call:
## lm(formula = totsc8 ~ percap + totday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.725  -7.691   0.293   7.088  47.011 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 661.835686   5.928563 111.635  < 2e-16 ***
## percap        3.466951   0.203304  17.053  < 2e-16 ***
## totday       -0.005167   0.001184  -4.365 2.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.72 on 160 degrees of freedom
## Multiple R-squared:  0.6546, Adjusted R-squared:  0.6503 
## F-statistic: 151.6 on 2 and 160 DF,  p-value: < 2.2e-16

# all three
fit_pta <- lm(totsc8 ~ percap + totday + avgsalary)
summary(fit_pta)

## 
## Call:
## lm(formula = totsc8 ~ percap + totday + avgsalary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.080  -7.720   0.346   6.970  46.343 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 663.657864  12.308381  53.919  < 2e-16 ***
## percap        3.487879   0.238546  14.621  < 2e-16 ***
## totday       -0.005105   0.001243  -4.107 6.39e-05 ***
## avgsalary    -0.070326   0.415916  -0.169    0.866    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.76 on 159 degrees of freedom
## Multiple R-squared:  0.6547, Adjusted R-squared:  0.6482 
## F-statistic: 100.5 on 3 and 159 DF,  p-value: < 2.2e-16

anova(fit_p, fit_pt, fit_pta)

## Analysis of Variance Table
## 
## Model 1: totsc8 ~ percap
## Model 2: totsc8 ~ percap + totday
## Model 3: totsc8 ~ percap + totday + avgsalary
##   Res.Df   RSS Df Sum of Sq       F   Pr(>F)    
## 1    161 28966                                  
## 2    160 25884  1   3081.99 18.9354 2.41e-05 ***
## 3    159 25879  1      4.65  0.0286   0.8659    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. Plots

# per capita income
plot(percap, totsc8, pch = 21, bg = 'blue')
conf <- confint(fit_p)
abline(fit_p)
abline(conf[,1], lty = 2, col = 'red')
abline(conf[,2], lty = 2, col = 'red')

plot of chunk unnamed-chunk-6

# total amount of money spent on a student per day
fit_t <- lm(totsc8 ~ totday)
plot(totday, totsc8, pch = 21, bg = 'blue')
abline(fit_t)
conf <- confint(fit_t)
abline(conf[,1], lty = 2, col = 'red')
abline(conf[,2], lty = 2, col = 'red')

plot of chunk unnamed-chunk-6

# average teacher salary
fit_a <- lm(totsc8 ~ avgsalary)
plot(avgsalary, totsc8, pch = 21, bg = 'blue')
abline(fit_a)
conf <- confint(fit_a)
abline(conf[,1], lty = 2, col = 'red')
abline(conf[,2], lty = 2, col = 'red')

plot of chunk unnamed-chunk-6

# plot hierarchical model in 3d
mdl2 <- scatterplot3d(percap, avgsalary, totsc8, pch=21, main = "Percap and avgsalary vs. 8th grade test scores", bg = 'ivory4', xlab = "Per-Capita Income", ylab = "Percentage of Average Teacher Salary", zlab = "8th Grade Test Scores", axis=TRUE)
mdl2$plane3d(fit_sw2, col = 'dodgerblue4')

plot of chunk unnamed-chunk-6

# plot stepwise model in 3d
# plot model in 3d
mdl2 <- scatterplot3d(percap, totday, totsc8, pch=21, main = "Percap and totday vs. 8th grade test scores", bg = 'ivory4', xlab = "Per-Capita Income", ylab = "Percentage of Free/Reduced Lunch Students", zlab = "8th Grade Test Scores", axis=TRUE)
mdl2$plane3d(fit_pt, col = 'dodgerblue4')

plot of chunk unnamed-chunk-6

4. Interpretation

Entry-Wise

The entry-wise model showed a significant amount of variance explained by percap and totday, while avgsalary did not explain a significant amount of variance at all. This was shown by it's insignificant F test score from the summary of the linear model. The \( R^2 \) was .64, meaning that 64% of the varaiance in test scores was explained by these three variables.

Step-Wise

The step wise model indicated that building a step wise model based on correlation information may not be the best way, as totday seemed to explain more of the variance in totsc8 although avgsalary was less correlated. When building the step wise model we can see that avgsalary increases the \( R^2 \) very little (.0022), whereas totday increases it by .03. Both of the other terms besides percap seem to contribute little to the overall explanation of test scores.

Hierarchical

The theoretical predictions made to build the hierarchical model turned out to be true in terms of order. Totday explained more variance than avgsalary. Totday explains more variance when just added to percap than it does when put into a model with avgsalary as well, meaning the best model in terms of variance explained is just percap and totday. The variance explained probably drops when avgsalary is added because of the low level correlation between totday and avgsalary.

From the Anova test of fit models we can see that the model with just totday and percap explains significantly more varaince than the other models in question, and thus is most likely the best explanatory model. Thus, using this model we reject the null hypothesis that totday and percap do not influence test scores but accept that average teacher salary does not affect test scores. Looking at the outliers in the avgsalary plot they seem to play a role in the overall fit of the model, making it much worse for this variable. In a future analysis it may be best to take these out to improve explanatory accuracy.

In the final model \( b_0 \) was 661.8, meaning that this is the test score when both the per capita income and total amount spent on a student per day are 0. The \( b_1 \) for each of these variables was 3.46 for percap and -.005 for totday. This makes sense for percap, as you would expect test scores to go up for a student with improved socio-economic status but for totday this seems to be the opposite of what you would expect. This may reflect the small impact that totday has on the variance explained and the influence that outliers had on it's reliability. Totday may explain variance within the data set but may not hold 'real-world' validity in predicting test scores based on it's small and possibly negligable effect.