Cheryl Tran

RPI

10/9/2014 Version 1

1. Setting

System under test

This recipe examines the California Test Score Data Set (Caschool) from the Ecdat package. The dataset contains data observed from schools in California in 1998-1999. The experiment tests the effect of computers per student and student-teacher ratio on the average reading score.

install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\RtmpyEJh5a\downloaded_packages
library("Ecdat")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
Cal<-Caschool

Factors and Levels

In this experiment, the two factors being observed are the number of computers per student (compstu) and the student-teacher ratio (str). Both are recorded as continuous values rather than as a fixed set of levels.

head(Cal)
##   distcod  county                        district grspan enrltot teachers
## 1   75119 Alameda              Sunol Glen Unified  KK-08     195    10.90
## 2   61499   Butte            Manzanita Elementary  KK-08     240    11.15
## 3   61549   Butte     Thermalito Union Elementary  KK-08    1550    82.90
## 4   61457   Butte Golden Feather Union Elementary  KK-08     243    14.00
## 5   61523   Butte        Palermo Union Elementary  KK-08    1335    71.50
## 6   62042  Fresno         Burrel Union Elementary  KK-08     137     6.40
##   calwpct mealpct computer testscr compstu expnstu   str avginc  elpct
## 1  0.5102   2.041       67   690.8  0.3436    6385 17.89 22.690  0.000
## 2 15.4167  47.917      101   661.2  0.4208    5099 21.52  9.824  4.583
## 3 55.0323  76.323      169   643.6  0.1090    5502 18.70  8.978 30.000
## 4 36.4754  77.049       85   647.7  0.3498    7102 17.36  8.978  0.000
## 5 33.1086  78.427      171   640.8  0.1281    5236 18.67  9.080 13.858
## 6 12.3188  86.956       25   605.6  0.1825    5580 21.41 10.415 12.409
##   readscr mathscr
## 1   691.6   690.0
## 2   660.5   661.9
## 3   636.3   650.9
## 4   651.9   643.5
## 5   641.8   639.9
## 6   605.7   605.4
tail(Cal)
##     distcod      county                  district grspan enrltot teachers
## 415   69682 Santa Clara Saratoga Union Elementary  KK-08    2341   124.09
## 416   68957   San Mateo    Las Lomitas Elementary  KK-08     984    59.73
## 417   69518 Santa Clara      Los Altos Elementary  KK-08    3724   208.48
## 418   72611     Ventura    Somis Union Elementary  KK-08     441    20.15
## 419   72744        Yuba         Plumas Elementary  KK-08     101     5.00
## 420   72751        Yuba      Wheatland Elementary  KK-08    1778    93.40
##     calwpct mealpct computer testscr compstu expnstu   str avginc  elpct
## 415  0.1709   0.598      286   700.3  0.1222    5393 18.87 40.402  2.050
## 416  0.1016   3.557      195   704.3  0.1982    7290 16.47 28.717  5.996
## 417  1.0741   1.504      721   706.8  0.1936    5741 17.86 41.734  4.726
## 418  3.5635  37.194       45   645.0  0.1020    4403 21.89 23.733 24.263
## 419 11.8812  59.406       14   672.2  0.1386    4776 20.20  9.952  2.970
## 420  6.9235  47.571      313   655.8  0.1760    5993 19.04 12.502  5.006
##     readscr mathscr
## 415   698.9   701.7
## 416   700.9   707.7
## 417   704.0   709.5
## 418   648.3   641.7
## 419   667.9   676.5
## 420   660.5   651.0
summary(Cal)
##     distcod              county                         district  
##  Min.   :61382   Sonoma     : 29   Lakeside Union Elementary:  3  
##  1st Qu.:64308   Kern       : 27   Mountain View Elementary :  3  
##  Median :67760   Los Angeles: 27   Jefferson Elementary     :  2  
##  Mean   :67473   Tulare     : 24   Liberty Elementary       :  2  
##  3rd Qu.:70419   San Diego  : 21   Ocean View Elementary    :  2  
##  Max.   :75440   Santa Clara: 20   Pacific Union Elementary :  2  
##                  (Other)    :272   (Other)                  :406  
##    grspan       enrltot         teachers         calwpct    
##  KK-06: 61   Min.   :   81   Min.   :   4.8   Min.   : 0.0  
##  KK-08:359   1st Qu.:  379   1st Qu.:  19.7   1st Qu.: 4.4  
##              Median :  950   Median :  48.6   Median :10.5  
##              Mean   : 2629   Mean   : 129.1   Mean   :13.2  
##              3rd Qu.: 3008   3rd Qu.: 146.4   3rd Qu.:19.0  
##              Max.   :27176   Max.   :1429.0   Max.   :79.0  
##                                                             
##     mealpct         computer       testscr       compstu      
##  Min.   :  0.0   Min.   :   0   Min.   :606   Min.   :0.0000  
##  1st Qu.: 23.3   1st Qu.:  46   1st Qu.:640   1st Qu.:0.0938  
##  Median : 41.8   Median : 118   Median :654   Median :0.1255  
##  Mean   : 44.7   Mean   : 303   Mean   :654   Mean   :0.1359  
##  3rd Qu.: 66.9   3rd Qu.: 375   3rd Qu.:667   3rd Qu.:0.1645  
##  Max.   :100.0   Max.   :3324   Max.   :707   Max.   :0.4208  
##                                                               
##     expnstu          str           avginc          elpct      
##  Min.   :3926   Min.   :14.0   Min.   : 5.34   Min.   : 0.00  
##  1st Qu.:4906   1st Qu.:18.6   1st Qu.:10.64   1st Qu.: 1.94  
##  Median :5215   Median :19.7   Median :13.73   Median : 8.78  
##  Mean   :5312   Mean   :19.6   Mean   :15.32   Mean   :15.77  
##  3rd Qu.:5601   3rd Qu.:20.9   3rd Qu.:17.63   3rd Qu.:22.97  
##  Max.   :7712   Max.   :25.8   Max.   :55.33   Max.   :85.54  
##                                                               
##     readscr       mathscr   
##  Min.   :604   Min.   :605  
##  1st Qu.:640   1st Qu.:639  
##  Median :656   Median :652  
##  Mean   :655   Mean   :653  
##  3rd Qu.:669   3rd Qu.:666  
##  Max.   :704   Max.   :710  
## 

Continuous variables (if any)

The continuous variables in the dataset are total enrollment, number of teachers, percent qualifying for CalWorks, percent qualifying for reduced-price lunch, number of computers, average test score, computers per student, expenditure per student, student-teacher ratio, district average income, percent of English learners, average reading score, and average math score.
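One way to check this list against the data is to ask R which columns are stored as numbers. The sketch below is illustrative only: `numeric_columns` is a hypothetical helper name (not part of Ecdat), and the usage lines assume the Ecdat package is installed.

```r
# Hypothetical helper: names of the columns stored as numeric (or integer) values.
numeric_columns <- function(df) {
  names(df)[vapply(df, is.numeric, logical(1))]
}

# Usage on the data in this recipe (skipped when Ecdat is not installed).
if (requireNamespace("Ecdat", quietly = TRUE)) {
  data("Caschool", package = "Ecdat")
  # distcod is stored as a number but is a district identifier, not a measurement,
  # so it is dropped from the list of continuous variables.
  print(setdiff(numeric_columns(Caschool), "distcod"))
}
```

Columns such as county, district, and grspan are factors and are therefore excluded by the helper.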

Response variables

In this experiment, the response variable is the average reading score (readscr), which is modeled as a function of the number of computers per student and the student-teacher ratio.

The Data: How is it organized and what does it look like?

The dataset contains 420 observations of California test scores from 1998-1999. The data is organized into 17 variables: distcod, county, district, grspan, enrltot, teachers, calwpct, mealpct, computer, testscr, compstu, expnstu, str, avginc, elpct, readscr, mathscr.

Randomization

There is no randomization. The dataset is based on observations at California schools.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The ANOVA test analyzes whether the variation in average reading scores can be attributed to variation in the number of computers per student or the student-teacher ratio. The null hypothesis for this experiment is that the variation in average reading scores cannot be attributed to the variation in the number of computers per student or the student-teacher ratio. The alternative is that the variation can be attributed to the variation in the number of computers per student or the student-teacher ratio.

What is the rationale for this design?

The ANOVA test is used to analyze the observed variance in a variable. This variance is broken down by factor and tested to determine whether the factors can be used to explain the variation. One may assume that the number of computers per student or the student-teacher ratio could affect the average reading scores for students. It seems that more computers per student or fewer students per teacher would make a better learning environment. However, this may not be true, so this experiment is used to test the hypothesis.

Randomize: What is the Randomization Scheme?

It is unknown how the data was collected or whether randomization was used.

Replicate: Are there replicates and/or repeated measures?

There are no replicates. Data is collected from each individual school.

Block: Did you use blocking in the design?

There was no blocking used in the design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

# histograms and boxplots of number of computers per student and the student teacher ratio
hist(Cal$compstu)

[Figure: histogram of Cal$compstu]

hist(Cal$str)

[Figure: histogram of Cal$str]

boxplot(Cal$compstu)

[Figure: boxplot of Cal$compstu]

boxplot(Cal$str)

[Figure: boxplot of Cal$str]

The data for the number of computers per student appears to be skewed to the right, while the student-teacher ratio appears to be approximately normally distributed.
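The visual impression of skew can be backed by a number. The following is a minimal sketch, assuming a simple moment-based definition of skewness; `sample_skewness` is a hypothetical helper written for illustration, and the usage lines assume the Ecdat package is installed.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Positive values indicate a longer right tail; values near 0 indicate symmetry.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Usage on the two factors in this recipe (skipped when Ecdat is not installed).
if (requireNamespace("Ecdat", quietly = TRUE)) {
  data("Caschool", package = "Ecdat")
  print(sample_skewness(Caschool$compstu))  # right skew would show as a positive value
  print(sample_skewness(Caschool$str))      # a symmetric variable would be near 0
}
```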

Testing

model1=aov(readscr~compstu, data=Cal)
anova(model1)
## Analysis of Variance Table
## 
## Response: readscr
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## compstu     1  13392   13392    35.9 4.5e-09 ***
## Residuals 418 156022     373                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2=aov(readscr~str, data=Cal)
anova(model2)
## Analysis of Variance Table
## 
## Response: readscr
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## str         1  10302   10302    27.1 3.1e-07 ***
## Residuals 418 159113     381                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3=aov(readscr~str*compstu, data=Cal)
anova(model3)
## Analysis of Variance Table
## 
## Response: readscr
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## str           1  10302   10302   28.55 1.5e-07 ***
## compstu       1   7894    7894   21.88 3.9e-06 ***
## str:compstu   1   1114    1114    3.09    0.08 .  
## Residuals   416 150104     361                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the results of the first ANOVA, we reject the null hypothesis: the variation in average reading scores can be explained by something other than random variation, and can be attributed in part to the number of computers per student. The probability of obtaining an F value of 35.9 or larger if the null hypothesis were true is 4.5e-09. For the second ANOVA, we also reject the null hypothesis (F = 27.1, p = 3.1e-07), so the variation in average reading scores can be attributed in part to the student-teacher ratio. In the third ANOVA, which includes both factors and their interaction, both main effects remain significant (p = 1.5e-07 for str and p = 3.9e-06 for compstu). The interaction term str:compstu, however, has a p-value of 0.08, so at the 0.05 level we fail to reject the null hypothesis for the interaction; its contribution cannot be distinguished from random variation.
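The F value and p-value in the first table can be reproduced by hand from the sums of squares and degrees of freedom. The sketch below uses only the rounded numbers printed in the readscr ~ compstu table, so small rounding differences from R's own output are to be expected; the variable names are chosen here for illustration.

```r
# Values copied from the first ANOVA table (readscr ~ compstu).
ss_factor <- 13392    # Sum Sq for compstu
ss_resid  <- 156022   # Sum Sq for Residuals
df_factor <- 1
df_resid  <- 418

# Mean squares are sums of squares divided by their degrees of freedom.
ms_factor <- ss_factor / df_factor
ms_resid  <- ss_resid / df_resid

# F is the ratio of the two mean squares; the p-value is the upper tail
# of the F distribution with (df_factor, df_resid) degrees of freedom.
f_value <- ms_factor / ms_resid
p_value <- pf(f_value, df_factor, df_resid, lower.tail = FALSE)

# Share of the total sum of squares associated with compstu.
r_squared <- ss_factor / (ss_factor + ss_resid)

print(c(F = f_value, p = p_value, R2 = r_squared))
```

The last quantity is worth reading alongside the p-value: compstu accounts for roughly 8% of the total sum of squares, so the effect is statistically detectable but leaves most of the variation in reading scores unexplained.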

Diagnostics/Model Adequacy Checking

qqnorm(residuals(model3))
qqline(residuals(model3))

[Figure: normal Q-Q plot of the residuals of model3, with Q-Q line]

shapiro.test(Cal$str)
## 
##  Shapiro-Wilk normality test
## 
## data:  Cal$str
## W = 0.992, p-value = 0.02385
shapiro.test(Cal$compstu)
## 
##  Shapiro-Wilk normality test
## 
## data:  Cal$compstu
## W = 0.9485, p-value = 6.53e-11
plot(fitted(model3), residuals(model3))

[Figure: residuals of model3 plotted against fitted values]

interaction.plot(Cal$str, Cal$compstu, Cal$readscr)

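[Figure: interaction plot of readscr by str and compstu]

A Q-Q plot compares the distribution of a sample with a theoretical distribution, here the normal. Judging by the Q-Q plot and Q-Q line, the residuals appear to be normal. The Shapiro-Wilk test is also used to check normality; its null hypothesis is that the data are normally distributed. Here it was applied to the two predictors rather than to the residuals, and both p-values are below 0.05 (0.02385 for str and 6.53e-11 for compstu), so normality is rejected for both, most clearly for compstu. This result does not by itself say whether the model residuals are normal. In the plot of residuals against fitted values, the points appear to be scattered randomly. Based on the interaction plot, there does not appear to be any interaction.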
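Because the ANOVA normality assumption concerns the residuals, a residual-based version of the Shapiro-Wilk check may be more informative than testing str and compstu directly. The following is a minimal sketch, not part of the original analysis; `residual_normality` is a hypothetical helper, and the usage lines assume the Ecdat package is installed.

```r
# Hypothetical helper: fit the ANOVA model and run the Shapiro-Wilk test
# on its residuals instead of on the predictors.
residual_normality <- function(formula, data) {
  fit <- aov(formula, data = data)
  shapiro.test(residuals(fit))
}

# Usage with the same formula as model3 (skipped when Ecdat is not installed).
if (requireNamespace("Ecdat", quietly = TRUE)) {
  data("Caschool", package = "Ecdat")
  print(residual_normality(readscr ~ str * compstu, Caschool))
}
```

A small p-value from this test would be evidence against normal residuals; a large one would be consistent with the impression given by the Q-Q plot.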

When running a Tukey test, the null hypothesis is that there is no difference between the means of a pair of groups, while the alternative states that there is a significant difference between the means. The Tukey test creates a set of confidence intervals on the differences between the means of the levels of a factor with the specified family-wise probability of coverage.
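A Tukey test needs a factor with discrete levels, while str and compstu are continuous. One possible way to apply it, shown here purely as a sketch, is to cut a predictor into quantile groups first. The choice of three groups, the labels Q1 to Q3, and the helper names `bin_by_quantile` and `tukey_on_bins` are all assumptions made for illustration; the usage lines assume the Ecdat package is installed.

```r
# Hypothetical helper: cut a continuous predictor into equal-count groups
# (tertiles by default) so that it can be treated as a factor with levels.
bin_by_quantile <- function(x, n_bins = 3) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE)
  cut(x, breaks = breaks, include.lowest = TRUE,
      labels = paste0("Q", seq_len(n_bins)))
}

# Hypothetical helper: one-way ANOVA of the response on the binned predictor,
# followed by Tukey's honest significant difference intervals.
tukey_on_bins <- function(response, predictor, n_bins = 3) {
  d <- data.frame(y = response, group = bin_by_quantile(predictor, n_bins))
  TukeyHSD(aov(y ~ group, data = d))
}

# Usage on reading score by student-teacher ratio group
# (skipped when Ecdat is not installed).
if (requireNamespace("Ecdat", quietly = TRUE)) {
  data("Caschool", package = "Ecdat")
  print(tukey_on_bins(Caschool$readscr, Caschool$str))
}
```

Each row of the result compares one pair of groups; an interval that excludes 0 indicates a difference in mean reading score between those two groups. Binning discards information, so the result depends on the number of groups chosen.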

4. Contingencies

A nonparametric test could be used to test the hypothesis. For example, the Kruskal-Wallis test and the Friedman test are nonparametric methods. Both tests are based on ranks. The Kruskal-Wallis test does not assume a normal distribution of the residuals.
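As a sketch of what this contingency could look like, the Kruskal-Wallis test can be run on the reading score across groups of a predictor. Like any test that compares groups, it needs discrete groups, so the continuous predictor is cut into quantile groups here; the use of three groups and the helper name `kruskal_on_bins` are assumptions made for illustration, and the usage lines assume the Ecdat package is installed.

```r
# Hypothetical helper: Kruskal-Wallis rank test of a response across
# quantile groups of a continuous predictor (tertiles by default).
kruskal_on_bins <- function(response, predictor, n_bins = 3) {
  breaks <- quantile(predictor, probs = seq(0, 1, length.out = n_bins + 1),
                     na.rm = TRUE)
  group <- cut(predictor, breaks = breaks, include.lowest = TRUE)
  kruskal.test(response, group)
}

# Usage on both factors from this recipe (skipped when Ecdat is not installed).
if (requireNamespace("Ecdat", quietly = TRUE)) {
  data("Caschool", package = "Ecdat")
  print(kruskal_on_bins(Caschool$readscr, Caschool$compstu))
  print(kruskal_on_bins(Caschool$readscr, Caschool$str))
}
```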

5. References to the literature

http://www.cde.ca.gov

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code