This recipe will conduct an experiment on the ecdat dataset. The experiment will attempt to investigate the Test Scores of Students in California from the Caschool dataset.
CA <- read.csv("C:/Users/Matt/Desktop/Cali.csv")
The data being examined has 420 observations of 22 variables.
head(CA)
## X distcod county district grspan enrltot
## 1 1 75119 Alameda Sunol Glen Unified KK-08 195
## 2 2 61499 Butte Manzanita Elementary KK-08 240
## 3 3 61549 Butte Thermalito Union Elementary KK-08 1550
## 4 4 61457 Butte Golden Feather Union Elementary KK-08 243
## 5 5 61523 Butte Palermo Union Elementary KK-08 1335
## 6 6 62042 Fresno Burrel Union Elementary KK-08 137
## teacherslev teachers calwpct mealpctlev mealpct computerlev computer
## 1 1 10.90 0.5102 1 2.041 1 67
## 2 1 11.15 15.4167 2 47.917 1 101
## 3 1 82.90 55.0323 3 76.323 1 169
## 4 1 14.00 36.4754 3 77.049 1 85
## 5 1 71.50 33.1086 3 78.427 1 171
## 6 1 6.40 12.3188 3 86.957 1 25
## testscr compstu expnstu str avginclev avginc elpct readscr mathscr
## 1 690.8 0.3436 6385 17.89 2 22.690 0.000 691.6 690.0
## 2 661.2 0.4208 5099 21.52 1 9.824 4.583 660.5 661.9
## 3 643.6 0.1090 5502 18.70 1 8.978 30.000 636.3 650.9
## 4 647.7 0.3498 7102 17.36 1 8.978 0.000 651.9 643.5
## 5 640.9 0.1281 5236 18.67 1 9.080 13.858 641.8 639.9
## 6 605.5 0.1825 5580 21.41 1 10.415 12.409 605.7 605.4
tail(CA)
## X distcod county district grspan enrltot
## 415 415 69682 Santa Clara Saratoga Union Elementary KK-08 2341
## 416 416 68957 San Mateo Las Lomitas Elementary KK-08 984
## 417 417 69518 Santa Clara Los Altos Elementary KK-08 3724
## 418 418 72611 Ventura Somis Union Elementary KK-08 441
## 419 419 72744 Yuba Plumas Elementary KK-08 101
## 420 420 72751 Yuba Wheatland Elementary KK-08 1778
## teacherslev teachers calwpct mealpctlev mealpct computerlev computer
## 415 2 124.09 0.1709 1 0.598 1 286
## 416 1 59.73 0.1016 1 3.557 1 195
## 417 2 208.48 1.0741 1 1.504 2 721
## 418 1 20.15 3.5635 2 37.194 1 45
## 419 1 5.00 11.8812 2 59.406 1 14
## 420 1 93.40 6.9235 2 47.571 1 313
## testscr compstu expnstu str avginclev avginc elpct readscr mathscr
## 415 700.3 0.1222 5393 18.87 3 40.402 2.050 698.9 701.7
## 416 704.3 0.1982 7290 16.47 2 28.717 5.996 700.9 707.7
## 417 706.8 0.1936 5741 17.86 3 41.734 4.726 704.0 709.5
## 418 645.0 0.1020 4403 21.89 2 23.733 24.263 648.3 641.7
## 419 672.2 0.1386 4776 20.20 1 9.952 2.970 667.9 676.5
## 420 655.8 0.1760 5993 19.04 1 12.502 5.006 660.5 651.0
summary(CA)
## X distcod county
## Min. : 1 Min. :61382 Sonoma : 29
## 1st Qu.:106 1st Qu.:64308 Kern : 27
## Median :210 Median :67760 Los Angeles: 27
## Mean :210 Mean :67473 Tulare : 24
## 3rd Qu.:315 3rd Qu.:70419 San Diego : 21
## Max. :420 Max. :75440 Santa Clara: 20
## (Other) :272
## district grspan enrltot
## Lakeside Union Elementary: 3 KK-06: 61 Min. : 81
## Mountain View Elementary : 3 KK-08:359 1st Qu.: 379
## Jefferson Elementary : 2 Median : 950
## Liberty Elementary : 2 Mean : 2629
## Ocean View Elementary : 2 3rd Qu.: 3008
## Pacific Union Elementary : 2 Max. :27176
## (Other) :406
## teacherslev teachers calwpct mealpctlev
## Min. :1.00 Min. : 4.8 Min. : 0.0 Min. :0.00
## 1st Qu.:1.00 1st Qu.: 19.7 1st Qu.: 4.4 1st Qu.:1.00
## Median :1.00 Median : 48.6 Median :10.5 Median :2.00
## Mean :1.42 Mean : 129.1 Mean :13.2 Mean :1.83
## 3rd Qu.:2.00 3rd Qu.: 146.3 3rd Qu.:19.0 3rd Qu.:3.00
## Max. :3.00 Max. :1429.0 Max. :79.0 Max. :3.00
##
## mealpct computerlev computer testscr
## Min. : 0.0 Min. :0.00 Min. : 0 Min. :606
## 1st Qu.: 23.3 1st Qu.:1.00 1st Qu.: 46 1st Qu.:640
## Median : 41.8 Median :1.00 Median : 118 Median :654
## Mean : 44.7 Mean :1.26 Mean : 303 Mean :654
## 3rd Qu.: 66.9 3rd Qu.:1.00 3rd Qu.: 375 3rd Qu.:667
## Max. :100.0 Max. :3.00 Max. :3324 Max. :707
##
## compstu expnstu str avginclev
## Min. :0.0000 Min. :3926 Min. :14.0 Min. :1.00
## 1st Qu.:0.0938 1st Qu.:4906 1st Qu.:18.6 1st Qu.:1.00
## Median :0.1255 Median :5215 Median :19.7 Median :1.00
## Mean :0.1359 Mean :5312 Mean :19.6 Mean :1.18
## 3rd Qu.:0.1645 3rd Qu.:5601 3rd Qu.:20.9 3rd Qu.:1.00
## Max. :0.4208 Max. :7712 Max. :25.8 Max. :3.00
##
## avginc elpct readscr mathscr
## Min. : 5.34 Min. : 0.00 Min. :604 Min. :605
## 1st Qu.:10.64 1st Qu.: 1.94 1st Qu.:640 1st Qu.:639
## Median :13.73 Median : 8.78 Median :656 Median :652
## Mean :15.32 Mean :15.77 Mean :655 Mean :653
## 3rd Qu.:17.63 3rd Qu.:22.97 3rd Qu.:669 3rd Qu.:666
## Max. :55.33 Max. :85.54 Max. :704 Max. :710
##
str(CA)
## 'data.frame': 420 obs. of 22 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ distcod : int 75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
## $ county : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
## $ district : Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 134 270 53 152 383 263 93 ...
## $ grspan : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
## $ enrltot : int 195 240 1550 243 1335 137 195 888 379 2247 ...
## $ teacherslev: int 1 1 1 1 1 1 1 1 1 2 ...
## $ teachers : num 10.9 11.2 82.9 14 71.5 ...
## $ calwpct : num 0.51 15.42 55.03 36.48 33.11 ...
## $ mealpctlev : int 1 2 3 3 3 3 3 3 3 3 ...
## $ mealpct : num 2.04 47.92 76.32 77.05 78.43 ...
## $ computerlev: int 1 1 1 1 1 1 1 1 1 1 ...
## $ computer : int 67 101 169 85 171 25 28 66 35 0 ...
## $ testscr : num 691 661 644 648 641 ...
## $ compstu : num 0.344 0.421 0.109 0.35 0.128 ...
## $ expnstu : num 6385 5099 5502 7102 5236 ...
## $ str : num 17.9 21.5 18.7 17.4 18.7 ...
## $ avginclev : int 2 1 1 1 1 1 1 1 1 1 ...
## $ avginc : num 22.69 9.82 8.98 8.98 9.08 ...
## $ elpct : num 0 4.58 30 0 13.86 ...
## $ readscr : num 692 660 636 652 642 ...
## $ mathscr : num 690 662 651 644 640 ...
If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable. The continuous variable in this dataset is test scores. ### Response variables A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, a response variable is test scores. ### The Data: How is it organized and what does it look like? The data is organized initially into a 18 column table. Data in the selected columns have been transformed into binary integers.The integers were assigned a 1, 2, or 3 depending on whether or not the initial value fell in between a certain range.
par(mfrow=c(2,2))
plot(CA$testscr~CA$teacherslev, xlab="# of teachers", ylab="Average test score")
plot(CA$testscr~CA$computerlev, xlab="# of computers", ylab="Average test score")
plot(CA$testscr~CA$avginclev, xlab="district average income", ylab="Average test score")
plot(CA$testscr~CA$mealpctlev, xlab="% qualifying for reduced-price lunch", ylab="Average test score")
### Randomization This data comes from a cross-section from 1993-1995. Since this is the only information available in regards to background information about the data collection, it is entirely possible that this data might not be completely randomized or the experiment had a completely randomized design. # 2. (Experimental) Design ### How will the experiment be organized and conducted to test the hypothesis? In order to conduct this experiment, I will conduct an an experiment using response surface methods. The experiment will test which of the 4 selected factors, teachers, computers, average income, and meal percentage suggest an influence on the test scores of students.
library("rsm")
This dataframe in this recipe contains a single factor with multiple levels. The reason that the three-level designs were proposed is to model possible curvature in the response function and to handle the case of nominal factors at 3 levels. A third level for a continuous factor facilitates investigation of a quadratic relationship between the response and each of the factors. ### Randomize: What is the Randomization Scheme? The experiment was based on observations and therefore was not randomized. ### Replicate: Are there replicates and/or repeated measures? There are no replicates, but repeated measures do occur between the factors and levels. ### Block: Did you use blocking in the design? No blocking was performed in this design. ###Initial Model Creation:
model=rsm(testscr~SO(teacherslev, computerlev, avginclev, mealpctlev), data=CA)
summary(model)
##
## Call:
## rsm(formula = testscr ~ SO(teacherslev, computerlev, avginclev,
## mealpctlev), data = CA)
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 663.1180 11.9389 55.54 < 2e-16 ***
## teacherslev -21.3946 21.8042 -0.98 0.327
## computerlev 4.8050 22.3655 0.21 0.830
## avginclev 17.8485 10.2155 1.75 0.081 .
## mealpctlev 5.1223 5.6712 0.90 0.367
## teacherslev:computerlev -3.6097 11.1419 -0.32 0.746
## teacherslev:avginclev 6.3923 3.0745 2.08 0.038 *
## teacherslev:mealpctlev -2.2664 1.7758 -1.28 0.203
## computerlev:avginclev -5.0301 3.1569 -1.59 0.112
## computerlev:mealpctlev 3.7344 1.7738 2.11 0.036 *
## avginclev:mealpctlev -5.4840 2.8130 -1.95 0.052 .
## teacherslev^2 6.0712 10.7929 0.56 0.574
## computerlev^2 0.0974 1.7725 0.05 0.956
## avginclev^2 -0.3411 2.4060 -0.14 0.887
## mealpctlev^2 -4.1738 0.9459 -4.41 1.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.717, Adjusted R-squared: 0.707
## F-statistic: 73.4 on 14 and 405 DF, p-value: <2e-16
##
## Analysis of Variance Table
##
## Response: testscr
## Df Sum Sq Mean Sq
## FO(teacherslev, computerlev, avginclev, mealpctlev) 4 105021 26255
## TWI(teacherslev, computerlev, avginclev, mealpctlev) 6 1942 324
## PQ(teacherslev, computerlev, avginclev, mealpctlev) 4 2134 534
## Residuals 405 43012 106
## Lack of fit 18 2895 161
## Pure error 387 40117 104
## F value Pr(>F)
## FO(teacherslev, computerlev, avginclev, mealpctlev) 247.22 < 2e-16
## TWI(teacherslev, computerlev, avginclev, mealpctlev) 3.05 0.00630
## PQ(teacherslev, computerlev, avginclev, mealpctlev) 5.02 0.00058
## Residuals
## Lack of fit 1.55 0.06975
## Pure error
##
## Stationary point of response surface:
## teacherslev computerlev avginclev mealpctlev
## 4.021393 6.171307 0.001985 2.281354
##
## Eigenanalysis:
## $values
## [1] 8.845 1.104 -2.612 -5.684
##
## $vectors
## [,1] [,2] [,3] [,4]
## teacherslev 0.8075 -0.5728 0.1316 -0.0495
## computerlev -0.3369 -0.6246 -0.6951 -0.1151
## avginclev 0.4360 0.4246 -0.6646 0.4334
## mealpctlev -0.2104 -0.3186 0.2404 0.8924
Graphs showing trends in the data:
par(mfrow=c(2,3))
contour(model, ~teacherslev+computerlev+avginclev+mealpctlev, image=TRUE, at=summary(model$canonical$xs))
Next, I create plots to attempt to identify any patterns in the data.
library("rgl", lib.loc="~/R/win-library/3.1")
attach(CA)
par(mfrow=c(1,1))
persp(model, ~teacherslev + computerlev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model, ~teacherslev + avginclev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model, ~teacherslev + mealpctlev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model, ~computerlev + avginclev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model, ~computerlev + mealpctlev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model, ~avginclev + mealpctlev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
###View response surfaces: One thing to remember, however, is that these can be slightly misleading because they only take into account two factors at a time. In this case, comparing just two factors at a time can produce contradictory results. For this reason, it is crucial to compare all factors simultaneously to understand the true optimal operating point. ##Shapiro-Wilk Here I use the Shapiro-Wilk normality test to determine if the response variable is normally distributed.
shapiro.test(CA$testscr)
##
## Shapiro-Wilk normality test
##
## data: CA$testscr
## W = 0.9963, p-value = 0.436
The null of a Shapiro-Wilk test states that the data presented is normally distributed. Fortunately, the p-value is greater than an alpha of 0.05, and therefore, we must fail to reject the null hypothesis. This suggests, instead, that the price of personal computers data is normally distributed. ##Diagnostics/Model Adequacy Checking Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. Based on the theoretical distribution, the expected value for each datum is determined. If the data values in a set follow the theoretical distribution, then they will appear as a straight line on a Q-Q plot. When an anova is performed, it is done so with the assumption that the test statistic follows a normal distribution. Visualization of a Q-Q plot will further confirm if that assumption is correct for the anova tests that were performed.
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model),residuals(model))
A Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. These plots are used to identify linearity, outliers, and error variances.