Applied Linear Regression Project 2

Dataset Selection

I selected a dataset from 100+ Interesting Data Sets: Ecdat-dataset for econometrics package: http://www.rdocumentation.org/packages/ecdat
I chose the ‘Caschool’ dataset: http://www.rdocumentation.org/packages/Ecdat/functions/Caschool

Dataset Description

This is a California Test Score dataset from 1998-1999 that looked at 420 schools and made 17 observations relating to test scores of students within those schools, student teacher ratio, average income of the district within which the school was located, etc.
The source of the dataset: California Department of Education http://www.cde.ca.gov.

library("Ecdat", lib.loc="~/R/win-library/3.0")
data(Caschool)
dim(Caschool)

## [1] 420  17

str(Caschool)

## 'data.frame':    420 obs. of  17 variables:
##  $ distcod : int  75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
##  $ county  : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
##  $ district: Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 132 270 53 152 383 263 94 ...
##  $ grspan  : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
##  $ enrltot : int  195 240 1550 243 1335 137 195 888 379 2247 ...
##  $ teachers: num  10.9 11.1 82.9 14 71.5 ...
##  $ calwpct : num  0.51 15.42 55.03 36.48 33.11 ...
##  $ mealpct : num  2.04 47.92 76.32 77.05 78.43 ...
##  $ computer: int  67 101 169 85 171 25 28 66 35 0 ...
##  $ testscr : num  691 661 644 648 641 ...
##  $ compstu : num  0.344 0.421 0.109 0.35 0.128 ...
##  $ expnstu : num  6385 5099 5502 7102 5236 ...
##  $ str     : num  17.9 21.5 18.7 17.4 18.7 ...
##  $ avginc  : num  22.69 9.82 8.98 8.98 9.08 ...
##  $ elpct   : num  0 4.58 30 0 13.86 ...
##  $ readscr : num  692 660 636 652 642 ...
##  $ mathscr : num  690 662 651 644 640 ...

summary(Caschool)

##     distcod              county                         district  
##  Min.   :61382   Sonoma     : 29   Lakeside Union Elementary:  3  
##  1st Qu.:64308   Kern       : 27   Mountain View Elementary :  3  
##  Median :67761   Los Angeles: 27   Jefferson Elementary     :  2  
##  Mean   :67473   Tulare     : 24   Liberty Elementary       :  2  
##  3rd Qu.:70419   San Diego  : 21   Ocean View Elementary    :  2  
##  Max.   :75440   Santa Clara: 20   Pacific Union Elementary :  2  
##                  (Other)    :272   (Other)                  :406  
##    grspan       enrltot           teachers          calwpct      
##  KK-06: 61   Min.   :   81.0   Min.   :   4.85   Min.   : 0.000  
##  KK-08:359   1st Qu.:  379.0   1st Qu.:  19.66   1st Qu.: 4.395  
##              Median :  950.5   Median :  48.56   Median :10.520  
##              Mean   : 2628.8   Mean   : 129.07   Mean   :13.246  
##              3rd Qu.: 3008.0   3rd Qu.: 146.35   3rd Qu.:18.981  
##              Max.   :27176.0   Max.   :1429.00   Max.   :78.994  
##                                                                  
##     mealpct          computer         testscr         compstu       
##  Min.   :  0.00   Min.   :   0.0   Min.   :605.5   Min.   :0.00000  
##  1st Qu.: 23.28   1st Qu.:  46.0   1st Qu.:640.0   1st Qu.:0.09377  
##  Median : 41.75   Median : 117.5   Median :654.5   Median :0.12546  
##  Mean   : 44.71   Mean   : 303.4   Mean   :654.2   Mean   :0.13593  
##  3rd Qu.: 66.86   3rd Qu.: 375.2   3rd Qu.:666.7   3rd Qu.:0.16447  
##  Max.   :100.00   Max.   :3324.0   Max.   :706.8   Max.   :0.42083  
##                                                                     
##     expnstu          str            avginc           elpct       
##  Min.   :3926   Min.   :14.00   Min.   : 5.335   Min.   : 0.000  
##  1st Qu.:4906   1st Qu.:18.58   1st Qu.:10.639   1st Qu.: 1.941  
##  Median :5215   Median :19.72   Median :13.728   Median : 8.778  
##  Mean   :5312   Mean   :19.64   Mean   :15.317   Mean   :15.768  
##  3rd Qu.:5601   3rd Qu.:20.87   3rd Qu.:17.629   3rd Qu.:22.970  
##  Max.   :7712   Max.   :25.80   Max.   :55.328   Max.   :85.540  
##                                                                  
##     readscr         mathscr     
##  Min.   :604.5   Min.   :605.4  
##  1st Qu.:640.4   1st Qu.:639.4  
##  Median :655.8   Median :652.5  
##  Mean   :655.0   Mean   :653.3  
##  3rd Qu.:668.7   3rd Qu.:665.9  
##  Max.   :704.0   Max.   :709.5  
##

Selection of independent and dependent variables

I am interested in test scores of students as the response variable. The explanatory variables I want to look at are student teacher ratio, computers per student and average income because I anticipate that these are predictors that may help explain the variation in test scores received by students.

Independent variables:

Student-teacher ratio “str”
Computers per student “compstu”
Average income in the district the school is located in (in 1,000 USD) “avginc”

Dependent variable: Test scores of students “testscr”

data=Caschool[,c("county","district","testscr","str","compstu","avginc")]
head(data)

##    county                        district testscr      str   compstu
## 1 Alameda              Sunol Glen Unified  690.80 17.88991 0.3435898
## 2   Butte            Manzanita Elementary  661.20 21.52466 0.4208333
## 3   Butte     Thermalito Union Elementary  643.60 18.69723 0.1090323
## 4   Butte Golden Feather Union Elementary  647.70 17.35714 0.3497942
## 5   Butte        Palermo Union Elementary  640.85 18.67133 0.1280899
## 6  Fresno         Burrel Union Elementary  605.55 21.40625 0.1824818
##      avginc
## 1 22.690001
## 2  9.824000
## 3  8.978000
## 4  8.978000
## 5  9.080333
## 6 10.415000

Linear model

The multiple linear regression model is testing whether the variation in the explanatory variables (student teacher ratio, number of computers per student and average income in the district in 1,000 units of USD) can explain the variation in the response variable test scores within schools in California in the years 1998-1999.

Null hypothesis

The null hypothesis is that there is no association between the independent variables (student teacher ratio, computers per student and average income) and the dependent variable (test scores of students).

A quick look at correlations

plot(data[,c(3:6)])

cor(data[,c(3:6)])

##            testscr        str    compstu     avginc
## testscr  1.0000000 -0.2263628  0.2707034  0.7124308
## str     -0.2263628  1.0000000 -0.3070702 -0.2321937
## compstu  0.2707034 -0.3070702  1.0000000  0.1948062
## avginc   0.7124308 -0.2321937  0.1948062  1.0000000

There seems to be a positive association between the independent variable ‘compstu’ - number of computers per student and the dependent variable ‘testscr’ - test scores of students.
Also seems to be a positive association between the independent variable ‘avginc’ - average income in the school district and the dependent variable ‘testscr’ - test scores.
Looking at the plots of the pair-wise relationships between the indepdendent variables, ‘str’ student teacher ratio, ‘compstu’ computers per student and ‘avginc’ average income in district, it is hard to say whether there is any association.

Overall - a strong positive correlation between average income and test scores. Not much correlation between the independent predictors.

Data diagnostics

Boxplots and histograms are made to assess the distribution of the response variable and the explanatory variables, and to check for overall spread, outliers and skewness.

attach(data)
par(mfrow=c(2,4))
boxplot(testscr, main="test score")
boxplot(avginc, main="average income")
boxplot(compstu, main="computers/student")
boxplot(str, main="student teacher ratio")
hist(testscr, main="test score")
hist(avginc, main="average income")
hist(compstu, main="computers/student")
hist(str, main="student teacher ratio")

The response variable test scores and student teacher ratio appear to be fairly normally distributed, with 1 outlier in the former and a few outliers in the latter. There is a right skew both in average income and computers per student, with a number of outliers reaching above the upper whiskers.

Based on the plot above, it looks like the majority of the observations have an average income within $25K, while there are only a few points above $30K. Similarly for computers per students, a majority of the data seems to lie within 0-0.25, while only a few points extend above 0.25. While it may make sense to remove these extreme points, we will try to fit all of the data for now.

Hierarchical regression

For this model building method, we add independent variables in a predetermined manner and then observe the change in the strength of the association by R square and the significance of the association by the p value. Because I think that student teacher ratio is the strongest predictor of test scores, I will add this first. Then I will add computers per student, and finally average income.

Test score and student teacher ratio

h.lm.one<-lm(testscr~str)
summary(h.lm.one)

## 
## Call:
## lm(formula = testscr ~ str)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

Even though student teacher ratio is a significant predictor of the test scores, it explains only 5% of the variation in the test scores.

Test score and student teacher ratio + computers per student

h.lm.two<-lm(testscr~str+compstu)
summary(h.lm.two)

## 
## Call:
## lm(formula = testscr ~ str + compstu)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48.83 -14.03  -1.47  12.89  46.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 676.5830    10.4808  64.555  < 2e-16 ***
## str          -1.5928     0.4928  -3.232  0.00132 ** 
## compstu      65.1599    14.3513   4.540 7.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.16 on 417 degrees of freedom
## Multiple R-squared:  0.09593,    Adjusted R-squared:  0.0916 
## F-statistic: 22.12 on 2 and 417 DF,  p-value: 7.375e-10

Both student teacher ratio and computers per student are significant, but computers per student only added a little more information after student teacher ratio was considered - now the variation explained is 9%.

Test score and student teacher ratio + computers per student + average income

h.lm.three<-lm(testscr~str+compstu+avginc)
summary(h.lm.three)

## 
## Call:
## lm(formula = testscr ~ str + compstu + avginc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.067  -9.114   0.817   8.946  34.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 627.1742     8.0090  78.309  < 2e-16 ***
## str          -0.2882     0.3633  -0.793 0.428040    
## compstu      37.9366    10.4939   3.615 0.000337 ***
## avginc        1.7946     0.0923  19.444  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.16 on 416 degrees of freedom
## Multiple R-squared:  0.5264, Adjusted R-squared:  0.5229 
## F-statistic: 154.1 on 3 and 416 DF,  p-value: < 2.2e-16

There seems to be a significant improvement to the model when taking average income into account. We now see that student teacher ratio is no longer a significant predictor, while computers per student and average income are significant. The R square changed from 9% in the previous model to 52% in the new model, suggesting that average income contributed significant information to the test scores.

Step-wise regression

Here we add independent variables to the model based on the strength of their correlations with the dependent variable, from highest to lowest. Based on pair-wise correlations from before, average income had the highest correlation with test scores, followed by computers per student and finally student teacher ratio. So we build three models in this order – the first model using average income alone, then a second model by adding computers per student and finally, we add student teacher ratio.

Test scores and average income

s.lm.one<-lm(testscr~avginc)
summary(s.lm.one)

## 
## Call:
## lm(formula = testscr ~ avginc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.574  -8.803   0.603   9.032  32.530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 625.3836     1.5324  408.11   <2e-16 ***
## avginc        1.8785     0.0905   20.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.39 on 418 degrees of freedom
## Multiple R-squared:  0.5076, Adjusted R-squared:  0.5064 
## F-statistic: 430.8 on 1 and 418 DF,  p-value: < 2.2e-16

Even though the predictor average income is significant (P<2e-16), the R squared of 0.5 suggests moderate association between average income and test scores.

Test scores and average income + computers per student

s.lm.two<-lm(testscr~avginc+compstu)
summary(s.lm.two)

## 
## Call:
## lm(formula = testscr ~ avginc + compstu)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.616  -9.088   1.033   9.098  35.351 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 620.99523    1.86507 332.961  < 2e-16 ***
## avginc        1.80811    0.09067  19.942  < 2e-16 ***
## compstu      40.22145   10.08643   3.988 7.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.15 on 417 degrees of freedom
## Multiple R-squared:  0.5256, Adjusted R-squared:  0.5234 
## F-statistic:   231 on 2 and 417 DF,  p-value: < 2.2e-16

Both average income and computers per student are significant in the second model and the variation in the two predictors explain about 52% of the variation in test scores, a slight improvement from the first model.

Test scores and average income + computers per student + student teacher ratio

s.lm.three<-lm(testscr~avginc+compstu+str)
summary(s.lm.three)

## 
## Call:
## lm(formula = testscr ~ avginc + compstu + str)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.067  -9.114   0.817   8.946  34.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 627.1742     8.0090  78.309  < 2e-16 ***
## avginc        1.7946     0.0923  19.444  < 2e-16 ***
## compstu      37.9366    10.4939   3.615 0.000337 ***
## str          -0.2882     0.3633  -0.793 0.428040    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.16 on 416 degrees of freedom
## Multiple R-squared:  0.5264, Adjusted R-squared:  0.5229 
## F-statistic: 154.1 on 3 and 416 DF,  p-value: < 2.2e-16

Again, both average income and computers per student are significant in the third model, while student teacher ratio is not. The variation explained in the third model is not a significant improvement from the second model; in other words, student teacher information did not add any significant information after average income and computers per student were taken into consideration.

In fact when I ran a model incorporating average income and student teacher ratio (data not shown), the R squared was 0.51 compared to 0.5 when only average income was used. Student teacher ratio had a p-value of 0.06, and failed to reach significance.

In conclusion at an alpha of 0.05 we reject the null hypothesis for average income and computers per student – they are significantly associated with test scores, while we accept the null hypothesis for student teacher ratio that there is no association with it and test scores.

However, an important caveat is that even after adding all three independent variables to the model, we were able to explain only about 52% of the variation in the test scores. Clearly, we have not captured other possibly important explanatory variables.

Entry-wise regression

For this model, we input all independent variables together and look at the model performance.

lm.all<-lm(testscr~str+compstu+avginc)
summary(lm.all)

## 
## Call:
## lm(formula = testscr ~ str + compstu + avginc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.067  -9.114   0.817   8.946  34.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 627.1742     8.0090  78.309  < 2e-16 ***
## str          -0.2882     0.3633  -0.793 0.428040    
## compstu      37.9366    10.4939   3.615 0.000337 ***
## avginc        1.7946     0.0923  19.444  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.16 on 416 degrees of freedom
## Multiple R-squared:  0.5264, Adjusted R-squared:  0.5229 
## F-statistic: 154.1 on 3 and 416 DF,  p-value: < 2.2e-16

The interpretation is the same as above.

Scattergrams

Independent variable student teacher ratio and dependent variable test scores.

plot(str,testscr, cex=1,pch=16,col="red", xlab="student teacher ratio", ylab="test scores", main="Test scores vs. Student teacher ratio")
lm.one<-lm(testscr~str)
abline(lm.one$coef, lwd=2)

summary(lm.one)

## 
## Call:
## lm(formula = testscr ~ str)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

Independent variable computers per student and dependent variable test scores.

plot(compstu,testscr, cex=1,pch=16,col="red", xlab="number of computers per student", ylab="test scores", main="Test scores vs. Computers per student")
lm.two<-lm(testscr~compstu)
abline(lm.two$coef, lwd=2)

summary(lm.two)

## 
## Call:
## lm(formula = testscr ~ compstu)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.303 -13.880  -0.259  12.649  48.013 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   643.36       2.08 309.275  < 2e-16 ***
## compstu        79.41      13.81   5.749 1.73e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.36 on 418 degrees of freedom
## Multiple R-squared:  0.07328,    Adjusted R-squared:  0.07106 
## F-statistic: 33.05 on 1 and 418 DF,  p-value: 1.732e-08

Independent variable average income in district and dependent variable test scores.

plot(avginc,testscr, cex=1,pch=16,col="red", xlab="average income in school district (USD X 1,000)", ylab="test scores", main="Test scores vs. Average income in school district")
lm.three<-lm(testscr~avginc)
abline(lm.three$coef, lwd=2)

summary(lm.three)

## 
## Call:
## lm(formula = testscr ~ avginc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.574  -8.803   0.603   9.032  32.530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 625.3836     1.5324  408.11   <2e-16 ***
## avginc        1.8785     0.0905   20.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.39 on 418 degrees of freedom
## Multiple R-squared:  0.5076, Adjusted R-squared:  0.5064 
## F-statistic: 430.8 on 1 and 418 DF,  p-value: < 2.2e-16

3D Plot

Including computers per student + average income since student teacher ratio was the least explanatory/correlated with test scores

lm<-lm(testscr~compstu+avginc)
library("scatterplot3d", lib.loc="~/R/win-library/3.0") 
plot3d <- scatterplot3d(compstu, avginc, testscr, pch=16, color="red", main="3D Scatterplot", xlab="Number of computers per student", ylab="Average income in school district (USD X 1,000)", zlab="Test Scores of students")
plot3d$plane3d(lm)

summary(lm)

## 
## Call:
## lm(formula = testscr ~ compstu + avginc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.616  -9.088   1.033   9.098  35.351 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 620.99523    1.86507 332.961  < 2e-16 ***
## compstu      40.22145   10.08643   3.988 7.88e-05 ***
## avginc        1.80811    0.09067  19.942  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.15 on 417 degrees of freedom
## Multiple R-squared:  0.5256, Adjusted R-squared:  0.5234 
## F-statistic:   231 on 2 and 417 DF,  p-value: < 2.2e-16

95% Confidence intervals

confint(lm.all)

##                  2.5 %      97.5 %
## (Intercept) 611.430978 642.9173294
## str          -1.002450   0.4259604
## compstu      17.308987  58.5642392
## avginc        1.613165   1.9760215

The 95% confidence intervals says that we are 95% confident that the true value of the parameters that we are interested in lies within our confidence interval. For the intercept and the explanatory variables, in line with our conclusion from the linear models, student teacher ratio is not significant (interval includes 0).

Residuals and Standardized residuals

lm.all<-lm(testscr~avginc+compstu+str)
resid<-resid(lm.all)
st.resid<-rstandard(lm.all)
par(mfrow=c(1,2))
plot(resid, pch=16, col="red", main="Residuals")
abline(0,0)
plot(st.resid, pch=16, col="blue", main="Standardized Residuals")
abline(0,0)

There seems to be a positive relationship between the residuals and the test score. Given that the distribution of the residuals appears non random, a linear model may not be the best fit for the data. Plotting the histogram of the log transformed values of average income showed a more symmetric distribution. When I fit a model of log-transformed average income with test scores, the residuals were more randomly distributed.

hist(log(avginc), main="log (avg income)")

test.lm<-lm(testscr~log(avginc))
plot(log(avginc),testscr, cex=1,pch=16,col="red", xlab="log of average income in school district (USD X 1,000)", ylab="test scores", main="Test scores vs. Average income in school district")
abline(test.lm$coef)

resid<-resid(test.lm)
plot(resid, cex=1,pch=16, col="blue", main="Residuals")
abline(0,0)

Finally, it might make sense to remove some of the extreme data points that were outliers in the distribution of the computers per student and the average income, as these may represent a small subset of schools that may be in wealthier neighborhoods.

Applied Linear Regression Project 2

Kalpana Gopalakrishnan

Thursday, March 12, 2015