This recipe examines the California Test Score data set (Caschool) from the Ecdat package. The data set contains observations from California schools for 1998-1999. The analysis tests the effect of computers per student and the student-teacher ratio on the average reading score.
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tranc3\AppData\Local\Temp\RtmpyEJh5a\downloaded_packages
library(Ecdat)
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
Cal<-Caschool
In this experiment, the two factors being observed are the number of computers per student (compstu) and the student-teacher ratio (str).
head(Cal)
##   distcod  county                        district grspan enrltot teachers
## 1   75119 Alameda              Sunol Glen Unified  KK-08     195    10.90
## 2   61499   Butte            Manzanita Elementary  KK-08     240    11.15
## 3   61549   Butte     Thermalito Union Elementary  KK-08    1550    82.90
## 4   61457   Butte Golden Feather Union Elementary  KK-08     243    14.00
## 5   61523   Butte        Palermo Union Elementary  KK-08    1335    71.50
## 6   62042  Fresno         Burrel Union Elementary  KK-08     137     6.40
##   calwpct mealpct computer testscr compstu expnstu   str avginc  elpct
## 1  0.5102   2.041       67   690.8  0.3436    6385 17.89 22.690  0.000
## 2 15.4167  47.917      101   661.2  0.4208    5099 21.52  9.824  4.583
## 3 55.0323  76.323      169   643.6  0.1090    5502 18.70  8.978 30.000
## 4 36.4754  77.049       85   647.7  0.3498    7102 17.36  8.978  0.000
## 5 33.1086  78.427      171   640.8  0.1281    5236 18.67  9.080 13.858
## 6 12.3188  86.956       25   605.6  0.1825    5580 21.41 10.415 12.409
##   readscr mathscr
## 1   691.6   690.0
## 2   660.5   661.9
## 3   636.3   650.9
## 4   651.9   643.5
## 5   641.8   639.9
## 6   605.7   605.4
tail(Cal)
##     distcod      county                  district grspan enrltot teachers
## 415   69682 Santa Clara Saratoga Union Elementary  KK-08    2341   124.09
## 416   68957   San Mateo    Las Lomitas Elementary  KK-08     984    59.73
## 417   69518 Santa Clara      Los Altos Elementary  KK-08    3724   208.48
## 418   72611     Ventura    Somis Union Elementary  KK-08     441    20.15
## 419   72744        Yuba         Plumas Elementary  KK-08     101     5.00
## 420   72751        Yuba      Wheatland Elementary  KK-08    1778    93.40
##     calwpct mealpct computer testscr compstu expnstu   str avginc  elpct
## 415  0.1709   0.598      286   700.3  0.1222    5393 18.87 40.402  2.050
## 416  0.1016   3.557      195   704.3  0.1982    7290 16.47 28.717  5.996
## 417  1.0741   1.504      721   706.8  0.1936    5741 17.86 41.734  4.726
## 418  3.5635  37.194       45   645.0  0.1020    4403 21.89 23.733 24.263
## 419 11.8812  59.406       14   672.2  0.1386    4776 20.20  9.952  2.970
## 420  6.9235  47.571      313   655.8  0.1760    5993 19.04 12.502  5.006
##     readscr mathscr
## 415   698.9   701.7
## 416   700.9   707.7
## 417   704.0   709.5
## 418   648.3   641.7
## 419   667.9   676.5
## 420   660.5   651.0
summary(Cal)
##     distcod              county                          district
##  Min.   :61382   Sonoma     : 29   Lakeside Union Elementary:  3
##  1st Qu.:64308   Kern       : 27   Mountain View Elementary :  3
##  Median :67760   Los Angeles: 27   Jefferson Elementary     :  2
##  Mean   :67473   Tulare     : 24   Liberty Elementary       :  2
##  3rd Qu.:70419   San Diego  : 21   Ocean View Elementary    :  2
##  Max.   :75440   Santa Clara: 20   Pacific Union Elementary :  2
##                  (Other)    :272   (Other)                  :406
##    grspan       enrltot         teachers          calwpct
##  KK-06: 61   Min.   :   81   Min.   :   4.8   Min.   : 0.0
##  KK-08:359   1st Qu.:  379   1st Qu.:  19.7   1st Qu.: 4.4
##              Median :  950   Median :  48.6   Median :10.5
##              Mean   : 2629   Mean   : 129.1   Mean   :13.2
##              3rd Qu.: 3008   3rd Qu.: 146.4   3rd Qu.:19.0
##              Max.   :27176   Max.   :1429.0   Max.   :79.0
##
##     mealpct          computer        testscr        compstu
##  Min.   :  0.0   Min.   :   0   Min.   :606   Min.   :0.0000
##  1st Qu.: 23.3   1st Qu.:  46   1st Qu.:640   1st Qu.:0.0938
##  Median : 41.8   Median : 118   Median :654   Median :0.1255
##  Mean   : 44.7   Mean   : 303   Mean   :654   Mean   :0.1359
##  3rd Qu.: 66.9   3rd Qu.: 375   3rd Qu.:667   3rd Qu.:0.1645
##  Max.   :100.0   Max.   :3324   Max.   :707   Max.   :0.4208
##
##     expnstu          str            avginc           elpct
##  Min.   :3926   Min.   :14.0   Min.   : 5.34   Min.   : 0.00
##  1st Qu.:4906   1st Qu.:18.6   1st Qu.:10.64   1st Qu.: 1.94
##  Median :5215   Median :19.7   Median :13.73   Median : 8.78
##  Mean   :5312   Mean   :19.6   Mean   :15.32   Mean   :15.77
##  3rd Qu.:5601   3rd Qu.:20.9   3rd Qu.:17.63   3rd Qu.:22.97
##  Max.   :7712   Max.   :25.8   Max.   :55.33   Max.   :85.54
##
##     readscr        mathscr
##  Min.   :604   Min.   :605
##  1st Qu.:640   1st Qu.:639
##  Median :656   Median :652
##  Mean   :655   Mean   :653
##  3rd Qu.:669   3rd Qu.:666
##  Max.   :704   Max.   :710
##
The continuous variables in the data set are total enrollment, number of teachers, percent qualifying for CalWorks, percent qualifying for reduced-price lunch, number of computers, average test score, computers per student, expenditure per student, student-teacher ratio, district average income, percent of English learners, average reading score, and average math score.
In this experiment, the response variable is the average reading score (readscr). The two explanatory variables are the number of computers per student (compstu) and the student-teacher ratio (str).
The data set consists of 420 observations of California test scores from 1998-1999. It is organized into 17 variables: distcod, county, district, grspan, enrltot, teachers, calwpct, mealpct, computer, testscr, compstu, expnstu, str, avginc, elpct, readscr, mathscr.
There is no randomization. The data set is based on observations at California schools.
The ANOVA test analyzes whether the variation in average reading scores can be attributed to variation in the number of computers per student or in the student-teacher ratio. The null hypothesis is that the variation in average reading scores cannot be attributed to variation in the number of computers per student or the student-teacher ratio. The alternative hypothesis is that it can be attributed to variation in one or both of these factors.
The ANOVA test is used to analyze the observed variance in a response variable. That variance is partitioned among the factors, and each factor is tested to determine whether it explains a significant share of the variation. One might assume that the number of computers per student or the student-teacher ratio affects students' average reading scores, since more computers per student or fewer students per teacher would seem to make a better learning environment. However, this may not be true, so this experiment tests the hypothesis.
It is unknown how the data was collected or whether randomization was used.
There are no replicates. Data is collected from each individual school.
There was no blocking used in the design.
# histograms and boxplots of number of computers per student and the student teacher ratio
hist(Cal$compstu)
hist(Cal$str)
boxplot(Cal$compstu)
boxplot(Cal$str)
The data for the number of computers per student appears to be skewed to the right, while the student-teacher ratio appears to be approximately normally distributed.
model1=aov(readscr~compstu, data=Cal)
anova(model1)
## Analysis of Variance Table
##
## Response: readscr
## Df Sum Sq Mean Sq F value Pr(>F)
## compstu 1 13392 13392 35.9 4.5e-09 ***
## Residuals 418 156022 373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2=aov(readscr~str, data=Cal)
anova(model2)
## Analysis of Variance Table
##
## Response: readscr
## Df Sum Sq Mean Sq F value Pr(>F)
## str 1 10302 10302 27.1 3.1e-07 ***
## Residuals 418 159113 381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3=aov(readscr~str*compstu, data=Cal)
anova(model3)
## Analysis of Variance Table
##
## Response: readscr
## Df Sum Sq Mean Sq F value Pr(>F)
## str 1 10302 10302 28.55 1.5e-07 ***
## compstu 1 7894 7894 21.88 3.9e-06 ***
## str:compstu 1 1114 1114 3.09 0.08 .
## Residuals 416 150104 361
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the results of the first ANOVA, we reject the null hypothesis: the variation in average reading scores can be explained by something other than random variation, namely the number of computers per student. The probability of obtaining an F value of 35.9 or larger under the null hypothesis is 4.5e-09. For the second ANOVA, we also reject the null hypothesis (F = 27.1, p = 3.1e-07), so the variation in average reading scores can be attributed to the student-teacher ratio. In the third ANOVA, both main effects remain significant: the student-teacher ratio (F = 28.55, p = 1.5e-07) and the number of computers per student (F = 21.88, p = 3.9e-06). The interaction term, however, is not significant at the 0.05 level (F = 3.09, p = 0.08), so we fail to reject the null hypothesis that there is no interaction between the two factors.
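The F value and p-value quoted for the first ANOVA can be reproduced by hand from the sums of squares and degrees of freedom printed in its table. The sketch below only does that arithmetic; the variable names are chosen for illustration, and because the table values are rounded the result should agree with the printed output only approximately.

```r
# Values copied from the first ANOVA table (readscr ~ compstu).
ss_factor   <- 13392    # Sum Sq for compstu
df_factor   <- 1        # Df for compstu
ss_residual <- 156022   # Sum Sq for Residuals
df_residual <- 418      # Df for Residuals

# Mean squares are sums of squares divided by their degrees of freedom.
ms_factor   <- ss_factor / df_factor
ms_residual <- ss_residual / df_residual

# F is the ratio of the factor mean square to the residual mean square.
f_value <- ms_factor / ms_residual

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_value, df_factor, df_residual, lower.tail = FALSE)

print(round(f_value, 1))
print(signif(p_value, 2))
```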
qqnorm(residuals(model3))
qqline(residuals(model3))
shapiro.test(Cal$str)
##
## Shapiro-Wilk normality test
##
## data: Cal$str
## W = 0.992, p-value = 0.02385
shapiro.test(Cal$compstu)
##
## Shapiro-Wilk normality test
##
## data: Cal$compstu
## W = 0.9485, p-value = 6.53e-11
plot(fitted(model3), residuals(model3))
interaction.plot(Cal$str, Cal$compstu, Cal$readscr)
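A Q-Q plot compares the distribution of a sample against a theoretical normal distribution. The residuals of model3 fall close to the Q-Q line, so they appear to be approximately normal. The Shapiro-Wilk test is a formal check of normality whose null hypothesis is that the sample is normally distributed. Here both p-values (0.02385 for str and 6.53e-11 for compstu) are below 0.05, so normality is rejected for both variables: only marginally for str, but strongly for compstu, which agrees with the right skew seen in its histogram. Note that these tests were run on the explanatory variables rather than on the model residuals, which are what the ANOVA normality assumption concerns. In the plot of residuals against fitted values, the points appear to be scattered randomly. Based on the interaction plot, there does not appear to be any interaction.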
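Since the normality assumption in ANOVA applies to the residuals, one option is to run the Shapiro-Wilk test on the residuals of the fitted model. The following is a minimal sketch of that check. The helper name check_residual_normality is hypothetical, and the data frame is simulated with arbitrary values so that the block runs without Ecdat; it is not the Caschool data.

```r
# Hypothetical helper: Shapiro-Wilk test on the residuals of a fitted model.
check_residual_normality <- function(model) {
  shapiro.test(residuals(model))
}

# Simulated stand-in data (arbitrary values, NOT the Caschool data).
set.seed(1)
sim <- data.frame(x = runif(200))
sim$y <- 650 + 20 * sim$x + rnorm(200, sd = 15)

fit <- aov(y ~ x, data = sim)
res <- check_residual_normality(fit)
print(res)

# With the objects from this recipe, the call would be:
# check_residual_normality(model3)
```

A large p-value here would mean the test finds no evidence against normal residuals; a small one would suggest the assumption is questionable.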
When running a Tukey test, the null hypothesis is that there is no difference between the means of a pair of groups, while the alternative is that there is a significant difference between the means. The Tukey test creates a set of confidence intervals on the differences between the means of the levels of a factor, with the specified family-wise probability of coverage.
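TukeyHSD() compares the levels of a categorical factor, whereas compstu and str are continuous in this data set. One way to apply the test, shown here only as a sketch, is to first bin a continuous variable into groups. The tercile binning, the helper name bin_into_terciles, and the simulated data (arbitrary values, not the Caschool data) are all assumptions made for illustration.

```r
# Hypothetical helper: split a numeric vector into three equal-sized groups.
bin_into_terciles <- function(x) {
  cut(x,
      breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
      labels = c("low", "mid", "high"),
      include.lowest = TRUE)
}

# Simulated stand-in data (arbitrary values, NOT the Caschool data).
set.seed(42)
sim <- data.frame(str = rnorm(300, mean = 19.6, sd = 1.9))
sim$readscr <- 700 - 2.3 * sim$str + rnorm(300, sd = 19)
sim$str_group <- bin_into_terciles(sim$str)

# One-way ANOVA on the binned factor, followed by Tukey's HSD.
fit <- aov(readscr ~ str_group, data = sim)
tukey <- TukeyHSD(fit, conf.level = 0.95)
print(tukey)

# With the recipe's data, the binning step would be:
# Cal$str_group <- bin_into_terciles(Cal$str)
```

Each row of the output is one pairwise comparison, with the estimated difference in means, the lower and upper confidence bounds, and an adjusted p-value. An interval that excludes zero indicates a significant difference for that pair.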
A non-parametric test could also be used to test the hypothesis, for example the Kruskal-Wallis test or the Friedman test. Both are rank-based: the Kruskal-Wallis test is a rank sum test for independent groups, while the Friedman test ranks observations within blocks. The Kruskal-Wallis test does not assume a normal distribution of the residuals.
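As a rough illustration of the Kruskal-Wallis alternative, the sketch below runs kruskal.test() on a grouped version of a continuous predictor. The grouping into terciles is an assumption made for the example, and the data is simulated with arbitrary values so the block runs without Ecdat; it is not the Caschool data.

```r
# Simulated stand-in data (arbitrary values, NOT the Caschool data).
set.seed(7)
sim <- data.frame(str = rnorm(300, mean = 19.6, sd = 1.9))
sim$readscr <- 700 - 2.3 * sim$str + rnorm(300, sd = 19)

# Group the continuous predictor into terciles (an illustrative choice).
sim$str_group <- cut(sim$str,
                     breaks = quantile(sim$str, probs = c(0, 1/3, 2/3, 1)),
                     labels = c("low", "mid", "high"),
                     include.lowest = TRUE)

# Kruskal-Wallis rank sum test: do the groups share the same distribution?
kw <- kruskal.test(readscr ~ str_group, data = sim)
print(kw)
```

With three groups the test statistic is compared against a chi-squared distribution with two degrees of freedom.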