Dataset Selection
I selected a dataset from 100+ Interesting Data Sets: Ecdat-dataset for econometrics package: http://www.rdocumentation.org/packages/ecdat
I chose the ‘Caschool’ dataset: http://www.rdocumentation.org/packages/Ecdat/functions/Caschool
Dataset Description
This is a California Test Score dataset from 1998-1999 that looked at 420 schools and made 17 observations relating to test scores of students within those schools, student teacher ratio, average income of the district within which the school was located, etc.
The source of the dataset: California Department of Education http://www.cde.ca.gov.
library("Ecdat", lib.loc="~/R/win-library/3.0")
data(Caschool)
dim(Caschool)
## [1] 420 17
str(Caschool)
## 'data.frame': 420 obs. of 17 variables:
## $ distcod : int 75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
## $ county : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
## $ district: Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 132 270 53 152 383 263 94 ...
## $ grspan : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
## $ enrltot : int 195 240 1550 243 1335 137 195 888 379 2247 ...
## $ teachers: num 10.9 11.1 82.9 14 71.5 ...
## $ calwpct : num 0.51 15.42 55.03 36.48 33.11 ...
## $ mealpct : num 2.04 47.92 76.32 77.05 78.43 ...
## $ computer: int 67 101 169 85 171 25 28 66 35 0 ...
## $ testscr : num 691 661 644 648 641 ...
## $ compstu : num 0.344 0.421 0.109 0.35 0.128 ...
## $ expnstu : num 6385 5099 5502 7102 5236 ...
## $ str : num 17.9 21.5 18.7 17.4 18.7 ...
## $ avginc : num 22.69 9.82 8.98 8.98 9.08 ...
## $ elpct : num 0 4.58 30 0 13.86 ...
## $ readscr : num 692 660 636 652 642 ...
## $ mathscr : num 690 662 651 644 640 ...
summary(Caschool)
## distcod county district
## Min. :61382 Sonoma : 29 Lakeside Union Elementary: 3
## 1st Qu.:64308 Kern : 27 Mountain View Elementary : 3
## Median :67761 Los Angeles: 27 Jefferson Elementary : 2
## Mean :67473 Tulare : 24 Liberty Elementary : 2
## 3rd Qu.:70419 San Diego : 21 Ocean View Elementary : 2
## Max. :75440 Santa Clara: 20 Pacific Union Elementary : 2
## (Other) :272 (Other) :406
## grspan enrltot teachers calwpct
## KK-06: 61 Min. : 81.0 Min. : 4.85 Min. : 0.000
## KK-08:359 1st Qu.: 379.0 1st Qu.: 19.66 1st Qu.: 4.395
## Median : 950.5 Median : 48.56 Median :10.520
## Mean : 2628.8 Mean : 129.07 Mean :13.246
## 3rd Qu.: 3008.0 3rd Qu.: 146.35 3rd Qu.:18.981
## Max. :27176.0 Max. :1429.00 Max. :78.994
##
## mealpct computer testscr compstu
## Min. : 0.00 Min. : 0.0 Min. :605.5 Min. :0.00000
## 1st Qu.: 23.28 1st Qu.: 46.0 1st Qu.:640.0 1st Qu.:0.09377
## Median : 41.75 Median : 117.5 Median :654.5 Median :0.12546
## Mean : 44.71 Mean : 303.4 Mean :654.2 Mean :0.13593
## 3rd Qu.: 66.86 3rd Qu.: 375.2 3rd Qu.:666.7 3rd Qu.:0.16447
## Max. :100.00 Max. :3324.0 Max. :706.8 Max. :0.42083
##
## expnstu str avginc elpct
## Min. :3926 Min. :14.00 Min. : 5.335 Min. : 0.000
## 1st Qu.:4906 1st Qu.:18.58 1st Qu.:10.639 1st Qu.: 1.941
## Median :5215 Median :19.72 Median :13.728 Median : 8.778
## Mean :5312 Mean :19.64 Mean :15.317 Mean :15.768
## 3rd Qu.:5601 3rd Qu.:20.87 3rd Qu.:17.629 3rd Qu.:22.970
## Max. :7712 Max. :25.80 Max. :55.328 Max. :85.540
##
## readscr mathscr
## Min. :604.5 Min. :605.4
## 1st Qu.:640.4 1st Qu.:639.4
## Median :655.8 Median :652.5
## Mean :655.0 Mean :653.3
## 3rd Qu.:668.7 3rd Qu.:665.9
## Max. :704.0 Max. :709.5
##
Selection of independent and dependent variables
I am interested in test scores of students as the response variable. The explanatory variables I want to look at are student teacher ratio, computers per student and average income because I anticipate that these are predictors that may help explain the variation in test scores received by students.
Independent variables:
Student-teacher ratio “str”
Computers per student “compstu”
Average income in the district the school is located in (in 1,000 USD) “avginc”
Dependent variable: Test scores of students “testscr”
data=Caschool[,c("county","district","testscr","str","compstu","avginc")]
head(data)
## county district testscr str compstu
## 1 Alameda Sunol Glen Unified 690.80 17.88991 0.3435898
## 2 Butte Manzanita Elementary 661.20 21.52466 0.4208333
## 3 Butte Thermalito Union Elementary 643.60 18.69723 0.1090323
## 4 Butte Golden Feather Union Elementary 647.70 17.35714 0.3497942
## 5 Butte Palermo Union Elementary 640.85 18.67133 0.1280899
## 6 Fresno Burrel Union Elementary 605.55 21.40625 0.1824818
## avginc
## 1 22.690001
## 2 9.824000
## 3 8.978000
## 4 8.978000
## 5 9.080333
## 6 10.415000
Linear model
The multiple linear regression model is testing whether the variation in the explanatory variables (student teacher ratio, number of computers per student and average income in the district in 1,000 units of USD) can explain the variation in the response variable test scores within schools in California in the years 1998-1999.
Null hypothesis
The null hypothesis is that there is no association between the independent variables (student teacher ratio, computers per student and average income) and the dependent variable (test scores of students).
A quick look at correlations
plot(data[,c(3:6)])
cor(data[,c(3:6)])
## testscr str compstu avginc
## testscr 1.0000000 -0.2263628 0.2707034 0.7124308
## str -0.2263628 1.0000000 -0.3070702 -0.2321937
## compstu 0.2707034 -0.3070702 1.0000000 0.1948062
## avginc 0.7124308 -0.2321937 0.1948062 1.0000000
There seems to be a positive association between the independent variable ‘compstu’ - number of computers per student and the dependent variable ‘testscr’ - test scores of students.
Also seems to be a positive association between the independent variable ‘avginc’ - average income in the school district and the dependent variable ‘testscr’ - test scores.
Looking at the plots of the pair-wise relationships between the indepdendent variables, ‘str’ student teacher ratio, ‘compstu’ computers per student and ‘avginc’ average income in district, it is hard to say whether there is any association.
Overall - a strong positive correlation between average income and test scores. Not much correlation between the independent predictors.
Data diagnostics
Boxplots and histograms are made to assess the distribution of the response variable and the explanatory variables, and to check for overall spread, outliers and skewness.
attach(data)
par(mfrow=c(2,4))
boxplot(testscr, main="test score")
boxplot(avginc, main="average income")
boxplot(compstu, main="computers/student")
boxplot(str, main="student teacher ratio")
hist(testscr, main="test score")
hist(avginc, main="average income")
hist(compstu, main="computers/student")
hist(str, main="student teacher ratio")
The response variable test scores and student teacher ratio appear to be fairly normally distributed, with 1 outlier in the former and a few outliers in the latter. There is a right skew both in average income and computers per student, with a number of outliers reaching above the upper whiskers.
Based on the plot above, it looks like the majority of the observations have an average income within $25K, while there are only a few points above $30K. Similarly for computers per students, a majority of the data seems to lie within 0-0.25, while only a few points extend above 0.25. While it may make sense to remove these extreme points, we will try to fit all of the data for now.
Hierarchical regression
For this model building method, we add independent variables in a predetermined manner and then observe the change in the strength of the association by R square and the significance of the association by the p value. Because I think that student teacher ratio is the strongest predictor of test scores, I will add this first. Then I will add computers per student, and finally average income.
Test score and student teacher ratio
h.lm.one<-lm(testscr~str)
summary(h.lm.one)
##
## Call:
## lm(formula = testscr ~ str)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.727 -14.251 0.483 12.822 48.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.9330 9.4675 73.825 < 2e-16 ***
## str -2.2798 0.4798 -4.751 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
Even though student teacher ratio is a significant predictor of the test scores, it explains only 5% of the variation in the test scores.
Test score and student teacher ratio + computers per student
h.lm.two<-lm(testscr~str+compstu)
summary(h.lm.two)
##
## Call:
## lm(formula = testscr ~ str + compstu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.83 -14.03 -1.47 12.89 46.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 676.5830 10.4808 64.555 < 2e-16 ***
## str -1.5928 0.4928 -3.232 0.00132 **
## compstu 65.1599 14.3513 4.540 7.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.16 on 417 degrees of freedom
## Multiple R-squared: 0.09593, Adjusted R-squared: 0.0916
## F-statistic: 22.12 on 2 and 417 DF, p-value: 7.375e-10
Both student teacher ratio and computers per student are significant, but computers per student only added a little more information after student teacher ratio was considered - now the variation explained is 9%.
Test score and student teacher ratio + computers per student + average income
h.lm.three<-lm(testscr~str+compstu+avginc)
summary(h.lm.three)
##
## Call:
## lm(formula = testscr ~ str + compstu + avginc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.067 -9.114 0.817 8.946 34.910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 627.1742 8.0090 78.309 < 2e-16 ***
## str -0.2882 0.3633 -0.793 0.428040
## compstu 37.9366 10.4939 3.615 0.000337 ***
## avginc 1.7946 0.0923 19.444 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.16 on 416 degrees of freedom
## Multiple R-squared: 0.5264, Adjusted R-squared: 0.5229
## F-statistic: 154.1 on 3 and 416 DF, p-value: < 2.2e-16
There seems to be a significant improvement to the model when taking average income into account. We now see that student teacher ratio is no longer a significant predictor, while computers per student and average income are significant. The R square changed from 9% in the previous model to 52% in the new model, suggesting that average income contributed significant information to the test scores.
Step-wise regression
Here we add independent variables to the model based on the strength of their correlations with the dependent variable, from highest to lowest. Based on pair-wise correlations from before, average income had the highest correlation with test scores, followed by computers per student and finally student teacher ratio. So we build three models in this order – the first model using average income alone, then a second model by adding computers per student and finally, we add student teacher ratio.
Test scores and average income
s.lm.one<-lm(testscr~avginc)
summary(s.lm.one)
##
## Call:
## lm(formula = testscr ~ avginc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.574 -8.803 0.603 9.032 32.530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 625.3836 1.5324 408.11 <2e-16 ***
## avginc 1.8785 0.0905 20.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.39 on 418 degrees of freedom
## Multiple R-squared: 0.5076, Adjusted R-squared: 0.5064
## F-statistic: 430.8 on 1 and 418 DF, p-value: < 2.2e-16
Even though the predictor average income is significant (P<2e-16), the R squared of 0.5 suggests moderate association between average income and test scores.
Test scores and average income + computers per student
s.lm.two<-lm(testscr~avginc+compstu)
summary(s.lm.two)
##
## Call:
## lm(formula = testscr ~ avginc + compstu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.616 -9.088 1.033 9.098 35.351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 620.99523 1.86507 332.961 < 2e-16 ***
## avginc 1.80811 0.09067 19.942 < 2e-16 ***
## compstu 40.22145 10.08643 3.988 7.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.15 on 417 degrees of freedom
## Multiple R-squared: 0.5256, Adjusted R-squared: 0.5234
## F-statistic: 231 on 2 and 417 DF, p-value: < 2.2e-16
Both average income and computers per student are significant in the second model and the variation in the two predictors explain about 52% of the variation in test scores, a slight improvement from the first model.
Test scores and average income + computers per student + student teacher ratio
s.lm.three<-lm(testscr~avginc+compstu+str)
summary(s.lm.three)
##
## Call:
## lm(formula = testscr ~ avginc + compstu + str)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.067 -9.114 0.817 8.946 34.910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 627.1742 8.0090 78.309 < 2e-16 ***
## avginc 1.7946 0.0923 19.444 < 2e-16 ***
## compstu 37.9366 10.4939 3.615 0.000337 ***
## str -0.2882 0.3633 -0.793 0.428040
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.16 on 416 degrees of freedom
## Multiple R-squared: 0.5264, Adjusted R-squared: 0.5229
## F-statistic: 154.1 on 3 and 416 DF, p-value: < 2.2e-16
Again, both average income and computers per student are significant in the third model, while student teacher ratio is not. The variation explained in the third model is not a significant improvement from the second model; in other words, student teacher information did not add any significant information after average income and computers per student were taken into consideration.
In fact when I ran a model incorporating average income and student teacher ratio (data not shown), the R squared was 0.51 compared to 0.5 when only average income was used. Student teacher ratio had a p-value of 0.06, and failed to reach significance.
In conclusion at an alpha of 0.05 we reject the null hypothesis for average income and computers per student – they are significantly associated with test scores, while we accept the null hypothesis for student teacher ratio that there is no association with it and test scores.
However, an important caveat is that even after adding all three independent variables to the model, we were able to explain only about 52% of the variation in the test scores. Clearly, we have not captured other possibly important explanatory variables.
Entry-wise regression
For this model, we input all independent variables together and look at the model performance.
lm.all<-lm(testscr~str+compstu+avginc)
summary(lm.all)
##
## Call:
## lm(formula = testscr ~ str + compstu + avginc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.067 -9.114 0.817 8.946 34.910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 627.1742 8.0090 78.309 < 2e-16 ***
## str -0.2882 0.3633 -0.793 0.428040
## compstu 37.9366 10.4939 3.615 0.000337 ***
## avginc 1.7946 0.0923 19.444 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.16 on 416 degrees of freedom
## Multiple R-squared: 0.5264, Adjusted R-squared: 0.5229
## F-statistic: 154.1 on 3 and 416 DF, p-value: < 2.2e-16
The interpretation is the same as above.
Scattergrams
Independent variable student teacher ratio and dependent variable test scores.
plot(str,testscr, cex=1,pch=16,col="red", xlab="student teacher ratio", ylab="test scores", main="Test scores vs. Student teacher ratio")
lm.one<-lm(testscr~str)
abline(lm.one$coef, lwd=2)
summary(lm.one)
##
## Call:
## lm(formula = testscr ~ str)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.727 -14.251 0.483 12.822 48.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.9330 9.4675 73.825 < 2e-16 ***
## str -2.2798 0.4798 -4.751 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
Independent variable computers per student and dependent variable test scores.
plot(compstu,testscr, cex=1,pch=16,col="red", xlab="number of computers per student", ylab="test scores", main="Test scores vs. Computers per student")
lm.two<-lm(testscr~compstu)
abline(lm.two$coef, lwd=2)
summary(lm.two)
##
## Call:
## lm(formula = testscr ~ compstu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.303 -13.880 -0.259 12.649 48.013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 643.36 2.08 309.275 < 2e-16 ***
## compstu 79.41 13.81 5.749 1.73e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.36 on 418 degrees of freedom
## Multiple R-squared: 0.07328, Adjusted R-squared: 0.07106
## F-statistic: 33.05 on 1 and 418 DF, p-value: 1.732e-08
Independent variable average income in district and dependent variable test scores.
plot(avginc,testscr, cex=1,pch=16,col="red", xlab="average income in school district (USD X 1,000)", ylab="test scores", main="Test scores vs. Average income in school district")
lm.three<-lm(testscr~avginc)
abline(lm.three$coef, lwd=2)
summary(lm.three)
##
## Call:
## lm(formula = testscr ~ avginc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.574 -8.803 0.603 9.032 32.530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 625.3836 1.5324 408.11 <2e-16 ***
## avginc 1.8785 0.0905 20.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.39 on 418 degrees of freedom
## Multiple R-squared: 0.5076, Adjusted R-squared: 0.5064
## F-statistic: 430.8 on 1 and 418 DF, p-value: < 2.2e-16
3D Plot
Including computers per student + average income since student teacher ratio was the least explanatory/correlated with test scores
lm<-lm(testscr~compstu+avginc)
library("scatterplot3d", lib.loc="~/R/win-library/3.0")
plot3d <- scatterplot3d(compstu, avginc, testscr, pch=16, color="red", main="3D Scatterplot", xlab="Number of computers per student", ylab="Average income in school district (USD X 1,000)", zlab="Test Scores of students")
plot3d$plane3d(lm)
summary(lm)
##
## Call:
## lm(formula = testscr ~ compstu + avginc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.616 -9.088 1.033 9.098 35.351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 620.99523 1.86507 332.961 < 2e-16 ***
## compstu 40.22145 10.08643 3.988 7.88e-05 ***
## avginc 1.80811 0.09067 19.942 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.15 on 417 degrees of freedom
## Multiple R-squared: 0.5256, Adjusted R-squared: 0.5234
## F-statistic: 231 on 2 and 417 DF, p-value: < 2.2e-16
95% Confidence intervals
confint(lm.all)
## 2.5 % 97.5 %
## (Intercept) 611.430978 642.9173294
## str -1.002450 0.4259604
## compstu 17.308987 58.5642392
## avginc 1.613165 1.9760215
The 95% confidence intervals says that we are 95% confident that the true value of the parameters that we are interested in lies within our confidence interval. For the intercept and the explanatory variables, in line with our conclusion from the linear models, student teacher ratio is not significant (interval includes 0).
Residuals and Standardized residuals
lm.all<-lm(testscr~avginc+compstu+str)
resid<-resid(lm.all)
st.resid<-rstandard(lm.all)
par(mfrow=c(1,2))
plot(resid, pch=16, col="red", main="Residuals")
abline(0,0)
plot(st.resid, pch=16, col="blue", main="Standardized Residuals")
abline(0,0)
There seems to be a positive relationship between the residuals and the test score. Given that the distribution of the residuals appears non random, a linear model may not be the best fit for the data. Plotting the histogram of the log transformed values of average income showed a more symmetric distribution. When I fit a model of log-transformed average income with test scores, the residuals were more randomly distributed.
hist(log(avginc), main="log (avg income)")
test.lm<-lm(testscr~log(avginc))
plot(log(avginc),testscr, cex=1,pch=16,col="red", xlab="log of average income in school district (USD X 1,000)", ylab="test scores", main="Test scores vs. Average income in school district")
abline(test.lm$coef)
resid<-resid(test.lm)
plot(resid, cex=1,pch=16, col="blue", main="Residuals")
abline(0,0)
Finally, it might make sense to remove some of the extreme data points that were outliers in the distribution of the computers per student and the average income, as these may represent a small subset of schools that may be in wealthier neighborhoods.