Recipe 7: Resampling Methods

Recipe for Resampling Methods

Wei Zou

RPI

Nov 5, 14; Version 1

1. Setting

System under test

The purpose of this recipe is to use the resampling methods to repeat the ANOVA and compare the results. The dataset used for this experiment is the “Star” from the “Ecdat Package” in R, which is used to explore the effects on learning of small class sizes. In this study, we focus on the effect of class type on students' math score and the null/alternative hyphothesis are stated as follows:

H0: The variation in the total math scaled score is due to sample randomization only. (i.e, the type of class has no effect on the students' total math scaled scores) HA: The variation in the total math scaled score is due to something else other than sample randomization (i.e., the type of class may affect the students' total math scaled scores)

To test the hypothesis, we conduct a single factor (with 3 levels) experiment to analyze the effect of class type on the variability in the math scores. We first conduct a exploratory data analysis, followed by a ANOVA test with model adequacy checking, and then use the resampling methods to repeat the ANOVA and conclude the results.

#Read in the data 
library("Ecdat", lib.loc="~/R/win-library/3.1")

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

data1<-Star
attach(data1)

Data Summary

Factor: The original data have four factors: “classk” with three levels (regular class, small class, regular class aid) indicates the type of class; “Sex” with two levels (girl, boy), “freelunk” with two levels (qualified for free lunch, not qualified for free lunch) and “race” with three levels (white, black, other). In this study, we focus on the effect of “classk” (i.e., the class type).

Continuous variable and Response Variable: The original data have three continous variables: “tmathssk” represents the total math scaled score; “treadssk” represents the total reading scaled score and “totexpk” represents the years of total teching experience. In this study, we treat “tmathssk” as the response variable.

Organization: The original data were obtained from Project Star with a cross-sectional stduy from 1985 to 1989 in the state of Tennessee, it was intended to “test whether students attending small clasess in grades K-3 had higher academic achievement than their peers in larger classes”. The data has 5745 observations with eight variables.

Randomization: The data were randomly selected among the K-3 students in Tennessee, and the students are randomly assigned to different types of classes, however, there is no randomize execution order in the experiment.

2. (Experimental) Design

Organization of the Experiment

In this sample recipe, the effect of class type on the students' total math scaled scores is studied. An ANOVA is performed to verify if the variation in math scores is due to pure sample randomization or the class type has a contribution effect. The analysis is followed by the model adequancy checking and resampling methods are used to repeat the ANOVA and conclude the findings.

Rationale for this design

An ANOVA is used as it checks whether the mean of the response variable is the same among several groups. In this study we are testing whether the measn of math scores is the same among different class type groups, therefore ANOVA is an appropriate test. The resampling technique is used the data are not following a normal distribution as assumed in ANOVA.

Randomize, Replicate and Block

As discussed in the previous section, the data are randomly selected and assigned, however, they are not randomly executed. There are no replicated and/or blocking used in this experiment.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

The boxplot shows that the the math scores do vary among different class types, noted that the median math score in small class is higher than the median math score in regular class and regular class with aide, and it seems that there is no obvious difference between the regular class group and the regular with aid group. Therefore,it is possible that the variation in math scores can be explained by the variation in the class type.

The histogram of the response variable shows that the data is close to a normal distribution, further model adequecy checking is needed to test the underlying population distribution.

#bloxplots 
boxplot(tmathssk~classk, ylab="Total math scaled score")
title("Boxplot of Math Scored among Different Class Types")

plot of chunk unnamed-chunk-2

#histograms
hist(tmathssk, xlab="total math scaled score")
title("Histogram of the Match Score")

plot of chunk unnamed-chunk-2

Testing

According to the ANOVA result, we reject the null hypothesis that the variation in math scores is due to sample randomization only and the effect of class type is shown to be statistically siginificant.

model1 = aov (tmathssk~classk, data = data1)
summary(model1)

##               Df   Sum Sq Mean Sq F value  Pr(>F)    
## classk         2    84166   42083    18.6 9.3e-09 ***
## Residuals   5745 13031173    2268                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4. Diagnostics/Model Adequacy Checking

The qqplot shows that the residuals generated from our ANOVA deviates from the normal qqline, especially around the tails, indicating that the assumption of normal distribution might not be satisfied. The finding is further confirmed from the Shapiro-Wilk test where the null hypothosis that the data follows a normal distribution is rejected (with p-value approximately equals to zero). The Fitted Y vs. Residuals plots further shows the converging trend in the residuals, therefore, resampling is necessary to ensure the model adequacy.

#qqplot
qqnorm(residuals(model1),ylab="Total Math Scaled Score")
qqline(residuals(model1),ylab="Total Math Scaled Score")

plot of chunk unnamed-chunk-4

#Fitted Y vs. Residuals
plot(fitted(model1), residuals(model1))

plot of chunk unnamed-chunk-4

#Shapiro Test
shapiro.test(tmathssk[1:5000])

## 
##  Shapiro-Wilk normality test
## 
## data:  tmathssk[1:5000]
## W = 0.9813, p-value < 2.2e-16

5.Resampling

Monte Carlo Estimation of Power

The “Monte Carlo” simulation is perfromed from a known theoretical distribution and use repeater random sampling techniques to determined the distribution of a dataset. Assuming that the pooled variance of residuals applied to all groups, we estimate the theoretical F distribution and plot the histogram of the simulated F distribution.

#Create an array of the means of the levels
meanstar = with(tmathssk,tapply(tmatssk,classk,mean))

## Error: numeric 'envir' arg not of length one

#Create an array of the variance of the levels
with(tmathssk,tapply(tmathssk,classk,var))

## Error: numeric 'envir' arg not of length one

#Create an array of the length of the levels
with(tmathssk,tapply(tmathssk,classk,length))

## Error: numeric 'envir' arg not of length one

#Summary of the original ANOVA results 
summary(model1)

##               Df   Sum Sq Mean Sq F value  Pr(>F)    
## classk         2    84166   42083    18.6 9.3e-09 ***
## Residuals   5745 13031173    2268                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#The sd(residual error) of the residuals is thus:
sdstar = sqrt(2268)
#The mean of all the math scores
meanstar = mean(tmathssk)
# Monte-Carlo Simulation with 10,000 iterations
R = 10000  
Fstar = numeric(R)
simclassk = classk
for (i in 1:R) {
  groupA = rnorm(2000, mean=meanstar, sd=sdstar)
  groupB = rnorm(1733, mean=meanstar, sd=sdstar)
  groupC = rnorm(2015, mean=meanstar, sd=sdstar)
  simscore = c(groupA,groupB,groupC)
  simdata = data.frame(simscore,simclassk)
  Fstar[i] = oneway.test(simscore~simclassk, var.equal=T, data=simdata)$statistic
}
# computes empirical distribution of F under H0 and display as histogram
hist(Fstar, ylim=c(0,1), prob=T,main ="PDF of the Theoretical F Distribution")
# Plot F distribution from .25 to 8 in .5 increments
x=seq(.25,8,.5)
# Overlay the analytic F distribution as red dots (degrees of freedom known)
# 2 = 3 levels-1 mean; 5745= 5748 (n)-3 means
# Notice the quality of the fit of the Monte Carlo simulation to the empirical distribution of F
points(x,y=df(x,2,5748),type="b",col="red")

plot of chunk unnamed-chunk-5

Bootstrap F-Distribution

We then use the Bottstrap to estimate the actual F-distribution. To do so, we use the given data to simulate random sampling with replacement and the experiment is carried out for 10,000 times. By comparing the PDF plots of the Theoretical F-distribution (generated from Monte-Carlo Simulation) and actual F-distribution (generated from Bootstrap), we may conclude that the actual F-distribution of the data is different from the theoretical F-distribution and the normal distribution assumption might not be satisfied. The standard error of the bootstrapped F-distribution and bias betweent the theoretical and actual F-distribution are also estimated for reference.

#Bootstrap version (with 10,000 iterations)
grpA = tmathssk[classk=="regular"] - meanstar[1]
grpB = tmathssk[classk=="small.class"]  - meanstar[2]
grpC = tmathssk[classk=="regular.with.aide"]  - meanstar[3]
newsimclassk= classk
R = 10000
newFstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=2000, replace=T)
  groupB = sample(grpB, size=1733, replace=T)
  groupC = sample(grpC, size=2015, replace=T)
  newsimscore = c(groupA,groupB,groupC)
  newsimdata = data.frame(newsimscore,newsimclassk)
  newFstar[i] = oneway.test(newsimscore~newsimclassk, var.equal=T, data=newsimdata)$statistic
}

# Now generate a similar plot, comparing bootstrapped distribution to the known F distribution

par(mfrow=c(1,2))
hist(Fstar,ylim = c(0,1),prob=TRUE,main="PDF of Theoretical F-distribution")
x=seq(.25,8,.5)
points(x,y=df(x,2,5748), type = "b", col = "red")

hist(newFstar, ylim = c(0,1.0),prob=TRUE, main = "PDF of Actual F-distribution")
newx=seq(.25,8,.5)
points(newx,y=df(newx,2,5748),type="b",col="red")

plot of chunk unnamed-chunk-7

# Generate quantiles of the acutal F distribution
qf(.95,2,5748)

## [1] 2.997

# Estimate the standard error
sd(newFstar)

## [1] 1.003

# Estimate the bias
mean(Fstar) - mean(newFstar)

## [1] -0.01464

ANOVA of the new model

Now we repeat the ANOVA with the new resampled data (through Bootstrapping). This time according the results, we fail to reject the null hyphothesis that the variation in math scored is due to sample randomization (with p-value > 0.05). Noted that completely different results are achieved from the resampled data (as compared to the original data), indicating the that original data do not satisfy the normal distribution assumption and the ANOVA results obtained based on them are not reliable. We have also conducted model adequacy checking for the new ANOVA results and the plots appear to be appropriate and we may conclude that the results we obtained from the ANOVA is adequate.

newmodel = lm(newsimscore~newsimclassk, data = newsimdata)
anova(newmodel)

## Analysis of Variance Table
## 
## Response: newsimscore
##                Df  Sum Sq Mean Sq F value Pr(>F)
## newsimclassk    2     558     279    0.12   0.89
## Residuals    1997 4691008    2349

par(mfrow=c(2,2))
plot(newmodel)

plot of chunk unnamed-chunk-8

6. References to the literature

http://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf