Recipe 9

Matthew Macchi

Rensselaer Polytechnic Institute

12/1/14 Version 1

1. Setting

System under test

This recipe will conduct an experiment on the ecdat dataset. The experiment will attempt to investigate the Test Scores of Students in California from the Caschool dataset.

CA <- read.csv("C:/Users/Matt/Desktop/Cali.csv")

Factors and Levels

The data being examined has 420 observations of 22 variables.

head(CA)

##   X distcod  county                        district grspan enrltot
## 1 1   75119 Alameda              Sunol Glen Unified  KK-08     195
## 2 2   61499   Butte            Manzanita Elementary  KK-08     240
## 3 3   61549   Butte     Thermalito Union Elementary  KK-08    1550
## 4 4   61457   Butte Golden Feather Union Elementary  KK-08     243
## 5 5   61523   Butte        Palermo Union Elementary  KK-08    1335
## 6 6   62042  Fresno         Burrel Union Elementary  KK-08     137
##   teacherslev teachers calwpct mealpctlev mealpct computerlev computer
## 1           1    10.90  0.5102          1   2.041           1       67
## 2           1    11.15 15.4167          2  47.917           1      101
## 3           1    82.90 55.0323          3  76.323           1      169
## 4           1    14.00 36.4754          3  77.049           1       85
## 5           1    71.50 33.1086          3  78.427           1      171
## 6           1     6.40 12.3188          3  86.957           1       25
##   testscr compstu expnstu   str avginclev avginc  elpct readscr mathscr
## 1   690.8  0.3436    6385 17.89         2 22.690  0.000   691.6   690.0
## 2   661.2  0.4208    5099 21.52         1  9.824  4.583   660.5   661.9
## 3   643.6  0.1090    5502 18.70         1  8.978 30.000   636.3   650.9
## 4   647.7  0.3498    7102 17.36         1  8.978  0.000   651.9   643.5
## 5   640.9  0.1281    5236 18.67         1  9.080 13.858   641.8   639.9
## 6   605.5  0.1825    5580 21.41         1 10.415 12.409   605.7   605.4

tail(CA)

##       X distcod      county                  district grspan enrltot
## 415 415   69682 Santa Clara Saratoga Union Elementary  KK-08    2341
## 416 416   68957   San Mateo    Las Lomitas Elementary  KK-08     984
## 417 417   69518 Santa Clara      Los Altos Elementary  KK-08    3724
## 418 418   72611     Ventura    Somis Union Elementary  KK-08     441
## 419 419   72744        Yuba         Plumas Elementary  KK-08     101
## 420 420   72751        Yuba      Wheatland Elementary  KK-08    1778
##     teacherslev teachers calwpct mealpctlev mealpct computerlev computer
## 415           2   124.09  0.1709          1   0.598           1      286
## 416           1    59.73  0.1016          1   3.557           1      195
## 417           2   208.48  1.0741          1   1.504           2      721
## 418           1    20.15  3.5635          2  37.194           1       45
## 419           1     5.00 11.8812          2  59.406           1       14
## 420           1    93.40  6.9235          2  47.571           1      313
##     testscr compstu expnstu   str avginclev avginc  elpct readscr mathscr
## 415   700.3  0.1222    5393 18.87         3 40.402  2.050   698.9   701.7
## 416   704.3  0.1982    7290 16.47         2 28.717  5.996   700.9   707.7
## 417   706.8  0.1936    5741 17.86         3 41.734  4.726   704.0   709.5
## 418   645.0  0.1020    4403 21.89         2 23.733 24.263   648.3   641.7
## 419   672.2  0.1386    4776 20.20         1  9.952  2.970   667.9   676.5
## 420   655.8  0.1760    5993 19.04         1 12.502  5.006   660.5   651.0

summary(CA)

##        X          distcod              county   
##  Min.   :  1   Min.   :61382   Sonoma     : 29  
##  1st Qu.:106   1st Qu.:64308   Kern       : 27  
##  Median :210   Median :67760   Los Angeles: 27  
##  Mean   :210   Mean   :67473   Tulare     : 24  
##  3rd Qu.:315   3rd Qu.:70419   San Diego  : 21  
##  Max.   :420   Max.   :75440   Santa Clara: 20  
##                                (Other)    :272  
##                       district     grspan       enrltot     
##  Lakeside Union Elementary:  3   KK-06: 61   Min.   :   81  
##  Mountain View Elementary :  3   KK-08:359   1st Qu.:  379  
##  Jefferson Elementary     :  2               Median :  950  
##  Liberty Elementary       :  2               Mean   : 2629  
##  Ocean View Elementary    :  2               3rd Qu.: 3008  
##  Pacific Union Elementary :  2               Max.   :27176  
##  (Other)                  :406                              
##   teacherslev      teachers         calwpct       mealpctlev  
##  Min.   :1.00   Min.   :   4.8   Min.   : 0.0   Min.   :0.00  
##  1st Qu.:1.00   1st Qu.:  19.7   1st Qu.: 4.4   1st Qu.:1.00  
##  Median :1.00   Median :  48.6   Median :10.5   Median :2.00  
##  Mean   :1.42   Mean   : 129.1   Mean   :13.2   Mean   :1.83  
##  3rd Qu.:2.00   3rd Qu.: 146.3   3rd Qu.:19.0   3rd Qu.:3.00  
##  Max.   :3.00   Max.   :1429.0   Max.   :79.0   Max.   :3.00  
##                                                               
##     mealpct       computerlev      computer       testscr   
##  Min.   :  0.0   Min.   :0.00   Min.   :   0   Min.   :606  
##  1st Qu.: 23.3   1st Qu.:1.00   1st Qu.:  46   1st Qu.:640  
##  Median : 41.8   Median :1.00   Median : 118   Median :654  
##  Mean   : 44.7   Mean   :1.26   Mean   : 303   Mean   :654  
##  3rd Qu.: 66.9   3rd Qu.:1.00   3rd Qu.: 375   3rd Qu.:667  
##  Max.   :100.0   Max.   :3.00   Max.   :3324   Max.   :707  
##                                                             
##     compstu          expnstu          str         avginclev   
##  Min.   :0.0000   Min.   :3926   Min.   :14.0   Min.   :1.00  
##  1st Qu.:0.0938   1st Qu.:4906   1st Qu.:18.6   1st Qu.:1.00  
##  Median :0.1255   Median :5215   Median :19.7   Median :1.00  
##  Mean   :0.1359   Mean   :5312   Mean   :19.6   Mean   :1.18  
##  3rd Qu.:0.1645   3rd Qu.:5601   3rd Qu.:20.9   3rd Qu.:1.00  
##  Max.   :0.4208   Max.   :7712   Max.   :25.8   Max.   :3.00  
##                                                               
##      avginc          elpct          readscr       mathscr   
##  Min.   : 5.34   Min.   : 0.00   Min.   :604   Min.   :605  
##  1st Qu.:10.64   1st Qu.: 1.94   1st Qu.:640   1st Qu.:639  
##  Median :13.73   Median : 8.78   Median :656   Median :652  
##  Mean   :15.32   Mean   :15.77   Mean   :655   Mean   :653  
##  3rd Qu.:17.63   3rd Qu.:22.97   3rd Qu.:669   3rd Qu.:666  
##  Max.   :55.33   Max.   :85.54   Max.   :704   Max.   :710  
##

str(CA)

## 'data.frame':    420 obs. of  22 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ distcod    : int  75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
##  $ county     : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
##  $ district   : Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 134 270 53 152 383 263 93 ...
##  $ grspan     : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
##  $ enrltot    : int  195 240 1550 243 1335 137 195 888 379 2247 ...
##  $ teacherslev: int  1 1 1 1 1 1 1 1 1 2 ...
##  $ teachers   : num  10.9 11.2 82.9 14 71.5 ...
##  $ calwpct    : num  0.51 15.42 55.03 36.48 33.11 ...
##  $ mealpctlev : int  1 2 3 3 3 3 3 3 3 3 ...
##  $ mealpct    : num  2.04 47.92 76.32 77.05 78.43 ...
##  $ computerlev: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ computer   : int  67 101 169 85 171 25 28 66 35 0 ...
##  $ testscr    : num  691 661 644 648 641 ...
##  $ compstu    : num  0.344 0.421 0.109 0.35 0.128 ...
##  $ expnstu    : num  6385 5099 5502 7102 5236 ...
##  $ str        : num  17.9 21.5 18.7 17.4 18.7 ...
##  $ avginclev  : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ avginc     : num  22.69 9.82 8.98 8.98 9.08 ...
##  $ elpct      : num  0 4.58 30 0 13.86 ...
##  $ readscr    : num  692 660 636 652 642 ...
##  $ mathscr    : num  690 662 651 644 640 ...

Continuous variables (if any)

If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable. The continuous variable in this dataset is test scores. ### Response variables A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, a response variable is test scores. ### The Data: How is it organized and what does it look like? The data is organized initially into a 18 column table. Data in the selected columns have been transformed into binary integers.The integers were assigned a 1, 2, or 3 depending on whether or not the initial value fell in between a certain range.

par(mfrow=c(2,2))
plot(CA$testscr~CA$teacherslev, xlab="# of teachers", ylab="Average test score")
plot(CA$testscr~CA$computerlev, xlab="# of computers", ylab="Average test score")
plot(CA$testscr~CA$avginclev, xlab="district average income", ylab="Average test score")
plot(CA$testscr~CA$mealpctlev, xlab="% qualifying for reduced-price lunch", ylab="Average test score")

plot of chunk unnamed-chunk-3 ### Randomization This data comes from a cross-section from 1993-1995. Since this is the only information available in regards to background information about the data collection, it is entirely possible that this data might not be completely randomized or the experiment had a completely randomized design. # 2. (Experimental) Design ### How will the experiment be organized and conducted to test the hypothesis? In order to conduct this experiment, I will conduct an an experiment using response surface methods. The experiment will test which of the 4 selected factors, teachers, computers, average income, and meal percentage suggest an influence on the test scores of students.

library("rsm")

What is the rationale for this design?

This dataframe in this recipe contains a single factor with multiple levels. The reason that the three-level designs were proposed is to model possible curvature in the response function and to handle the case of nominal factors at 3 levels. A third level for a continuous factor facilitates investigation of a quadratic relationship between the response and each of the factors. ### Randomize: What is the Randomization Scheme? The experiment was based on observations and therefore was not randomized. ### Replicate: Are there replicates and/or repeated measures? There are no replicates, but repeated measures do occur between the factors and levels. ### Block: Did you use blocking in the design? No blocking was performed in this design. ###Initial Model Creation:

model=rsm(testscr~SO(teacherslev, computerlev, avginclev, mealpctlev), data=CA)
summary(model)

## 
## Call:
## rsm(formula = testscr ~ SO(teacherslev, computerlev, avginclev, 
##     mealpctlev), data = CA)
## 
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             663.1180    11.9389   55.54  < 2e-16 ***
## teacherslev             -21.3946    21.8042   -0.98    0.327    
## computerlev               4.8050    22.3655    0.21    0.830    
## avginclev                17.8485    10.2155    1.75    0.081 .  
## mealpctlev                5.1223     5.6712    0.90    0.367    
## teacherslev:computerlev  -3.6097    11.1419   -0.32    0.746    
## teacherslev:avginclev     6.3923     3.0745    2.08    0.038 *  
## teacherslev:mealpctlev   -2.2664     1.7758   -1.28    0.203    
## computerlev:avginclev    -5.0301     3.1569   -1.59    0.112    
## computerlev:mealpctlev    3.7344     1.7738    2.11    0.036 *  
## avginclev:mealpctlev     -5.4840     2.8130   -1.95    0.052 .  
## teacherslev^2             6.0712    10.7929    0.56    0.574    
## computerlev^2             0.0974     1.7725    0.05    0.956    
## avginclev^2              -0.3411     2.4060   -0.14    0.887    
## mealpctlev^2             -4.1738     0.9459   -4.41  1.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Multiple R-squared:  0.717,  Adjusted R-squared:  0.707 
## F-statistic: 73.4 on 14 and 405 DF,  p-value: <2e-16
## 
## Analysis of Variance Table
## 
## Response: testscr
##                                                       Df Sum Sq Mean Sq
## FO(teacherslev, computerlev, avginclev, mealpctlev)    4 105021   26255
## TWI(teacherslev, computerlev, avginclev, mealpctlev)   6   1942     324
## PQ(teacherslev, computerlev, avginclev, mealpctlev)    4   2134     534
## Residuals                                            405  43012     106
## Lack of fit                                           18   2895     161
## Pure error                                           387  40117     104
##                                                      F value  Pr(>F)
## FO(teacherslev, computerlev, avginclev, mealpctlev)   247.22 < 2e-16
## TWI(teacherslev, computerlev, avginclev, mealpctlev)    3.05 0.00630
## PQ(teacherslev, computerlev, avginclev, mealpctlev)     5.02 0.00058
## Residuals                                                           
## Lack of fit                                             1.55 0.06975
## Pure error                                                          
## 
## Stationary point of response surface:
## teacherslev computerlev   avginclev  mealpctlev 
##    4.021393    6.171307    0.001985    2.281354 
## 
## Eigenanalysis:
## $values
## [1]  8.845  1.104 -2.612 -5.684
## 
## $vectors
##                [,1]    [,2]    [,3]    [,4]
## teacherslev  0.8075 -0.5728  0.1316 -0.0495
## computerlev -0.3369 -0.6246 -0.6951 -0.1151
## avginclev    0.4360  0.4246 -0.6646  0.4334
## mealpctlev  -0.2104 -0.3186  0.2404  0.8924

3. Statistical Analysis

Exploratory Data Analysis: Graphics and Decriptive Summary

Graphs showing trends in the data:

par(mfrow=c(2,3))
contour(model, ~teacherslev+computerlev+avginclev+mealpctlev, image=TRUE, at=summary(model$canonical$xs))

plot of chunk unnamed-chunk-6 Next, I create plots to attempt to identify any patterns in the data.

library("rgl", lib.loc="~/R/win-library/3.1")
attach(CA)
par(mfrow=c(1,1))
persp(model, ~teacherslev + computerlev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")

## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

persp(model, ~teacherslev + avginclev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")

## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

persp(model, ~teacherslev + mealpctlev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")

## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

persp(model, ~computerlev + avginclev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")

## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

persp(model, ~computerlev + mealpctlev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")

## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

persp(model, ~avginclev + mealpctlev, image = TRUE, at = c(summary(model)$canonical$xs, Block="B2"), theta=30, zlab="average test scores", col.lab=33, contour="colors")

## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7 ###View response surfaces: One thing to remember, however, is that these can be slightly misleading because they only take into account two factors at a time. In this case, comparing just two factors at a time can produce contradictory results. For this reason, it is crucial to compare all factors simultaneously to understand the true optimal operating point. ##Shapiro-Wilk Here I use the Shapiro-Wilk normality test to determine if the response variable is normally distributed.

shapiro.test(CA$testscr)

## 
##  Shapiro-Wilk normality test
## 
## data:  CA$testscr
## W = 0.9963, p-value = 0.436

The null of a Shapiro-Wilk test states that the data presented is normally distributed. Fortunately, the p-value is greater than an alpha of 0.05, and therefore, we must fail to reject the null hypothesis. This suggests, instead, that the price of personal computers data is normally distributed. ##Diagnostics/Model Adequacy Checking Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. Based on the theoretical distribution, the expected value for each datum is determined. If the data values in a set follow the theoretical distribution, then they will appear as a straight line on a Q-Q plot. When an anova is performed, it is done so with the assumption that the test statistic follows a normal distribution. Visualization of a Q-Q plot will further confirm if that assumption is correct for the anova tests that were performed.

qqnorm(residuals(model))
qqline(residuals(model))

plot of chunk unnamed-chunk-9

plot(fitted(model),residuals(model))

plot of chunk unnamed-chunk-9 A Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. These plots are used to identify linearity, outliers, and error variances.