Recipe 7

Matthew Macchi

Rensselaer Polytechnic Institute

11/6/14 Version 1

1. Setting

System under test

This recipe will conduct an experiment on the ecdat dataset. The experiment will attempt to investigate the earnings for three age groups on the Earnings dataset. Tuna{Ecdat}

install.packages("Ecdat", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/55/ql66yz5j3jzgkn6dmnb9sk1c0000gn/T//RtmpqkXskM/downloaded_packages
library("Ecdat", lib.loc="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
x<-Earnings

Factors and Levels

The data being examined has 4266 observations of 2 variables.

head(x)
##   age      y
## 1  g3  569.5
## 2  g3  895.5
## 3  g3 1111.0
## 4  g3 1182.0
## 5  g3 1277.5
## 6  g3 1384.0
tail(x)
##      age     y
## 4261  g1 80906
## 4262  g1 83769
## 4263  g1 83810
## 4264  g1 83810
## 4265  g1 83810
## 4266  g1 83810
summary(x)
##  age             y        
##  g1:1109   Min.   :  332  
##  g2:1678   1st Qu.:13827  
##  g3:1479   Median :22691  
##            Mean   :25511  
##            3rd Qu.:33588  
##            Max.   :83810
str(x)
## 'data.frame':    4266 obs. of  2 variables:
##  $ age: Factor w/ 3 levels "g1","g2","g3": 3 3 3 3 3 3 3 3 3 3 ...
##  $ y  : num  570 896 1111 1182 1278 ...
x$y <- as.numeric(x$y)

Continuous variables (if any)

If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable.

The continuous variable in this dataset is earnings. ### Response variables A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, a response variable is earnings. ### The Data: How is it organized and what does it look like? The data is organized initially into a 2 column table: The columns are titled as follows: age, and y. Data in the age column is textual, where as the data in y column is numeric.

str(x)
## 'data.frame':    4266 obs. of  2 variables:
##  $ age: Factor w/ 3 levels "g1","g2","g3": 3 3 3 3 3 3 3 3 3 3 ...
##  $ y  : num  570 896 1111 1182 1278 ...

Randomization

This data comes from a cross-section from 1988-1989. Since this is the only information available in regards to background information about the data collection, it is entirely possible that this data might not be completely randomized or the experiment had a completely randomized design. # 2. (Experimental) Design ### How will the experiment be organized and conducted to test the hypothesis? In order to conduct this experiment, I will conduct an analysis of covariance (ANOVA). After ANOVA has been performed, resampling methods will be implemented to determine their effect on the outcome of the ANOVA test.
### What is the rationale for this design? This dataframe in this recipe contains a single factor with multiple levels. Therefore, an ANOVA is the appropriate test to be performed. The resampling techniques performed are conducted due to the fact that ANOVA makes the assumption that all data is normally distributed. While it is possible for this distribution to occur, rarely is any data completely normal. ### Randomize: What is the Randomization Scheme? The experiment was based on observations and therefore was not randomized. ### Replicate: Are there replicates and/or repeated measures? There are no replicates, but repeated measures do occur between the factors and levels. ### Block: Did you use blocking in the design? No blocking was performed in this design. # 3. Statistical Analysis ##Exploratory Data Analysis: Graphics and Decriptive Summary

boxplot(y~age, data=x, main="Earnings for Three Age Groups", xlab="Age Group", ylab="Earnings ($)")

plot of chunk unnamed-chunk-4 ##ANOVA Testing An analysis of variance (ANOVA) will be used to determine the statistical significance between the individual earnings means. If the null hypothesis is accepted, it can be assumed that there is no variation in earnings due to the variation of age groups. If the null hypothesis is rejected, it can be assumed that the variation of earnings can be attributed to the variation of age groups.

modelx = aov(y~age, data=x)
anova(modelx)
## Analysis of Variance Table
## 
## Response: y
##             Df   Sum Sq  Mean Sq F value  Pr(>F)    
## age          2 1.70e+10 8.48e+09      35 8.5e-16 ***
## Residuals 4263 1.03e+12 2.42e+08                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-Hoc Analysis

Tukey’s Honestly Significantly Difference is a multiple comparison procedure that is used after an ANOVA to determine which specific sample means are significantly different from the others. In this recipe, Tukey’s HSD is not necessary because the ANOVA indicated that there was no variation in weight means.

Resampling Methods

Monte Carlo Estimation of Power

If resampling is performed from a known theoretical distribution, it is refered to as a “Monte Carlo” simulation. Monte Carlo simulations are algorithms that utilize repeated random sampling techniques to determine the distribution of a data set.

meanstar=with(x,tapply(y,age,mean))
with(x, tapply(y,age,var))
##        g1        g2        g3 
## 215138107 233841809 272338659
with(x, tapply(y,age,length))
##   g1   g2   g3 
## 1109 1678 1479
summary(aov(y~age, data=x))
##               Df   Sum Sq  Mean Sq F value  Pr(>F)    
## age            2 1.70e+10 8.48e+09      35 8.5e-16 ***
## Residuals   4263 1.03e+12 2.42e+08                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The code below creates a random normal distributed with 85 observations and performs an ANOVA on the resultant dataframe. The result is the theoretical F-distribution, show in the histogram with two degrees of freedom for the age groups and 4263 degrees of freedom for the residuals.

meanx=mean(x$y)
sqrtMS=sqrt(2.423e+08)
simage = x$age
Runs = 1000
Fstar = numeric(Runs)
for (i in 1:Runs)
  {A = rnorm(1109, mean=meanx, sd=sqrtMS)
 B = rnorm(1678, mean=meanx, sd=sqrtMS)
 C = rnorm(1479, mean=meanx, sd=sqrtMS)
 simx = c(A,B,C)
 simdata = data.frame(simx, simage)
 Fstar[i] = oneway.test(simx~simage, var.equal=TRUE, data=simdata)$statistic}

Determine the Bootstrap F-distribution

meanstar
##    g1    g2    g3 
## 22880 25080 27974
grpA = x$y[x$age=="g1"] - meanstar[1]
grpB = x$y[x$age=="g2"] - meanstar[2]
grpC = x$y[x$age=="g3"] - meanstar[3]
newsimage = x$age
newFstar = numeric(Runs)
for (i in 1:Runs)
{bunchA = sample(grpA, size=1109, replace=TRUE)
bunchB = sample(grpB, size=1678, replace=TRUE)
  bunchC = sample(grpC, size=1479, replace=TRUE)
  newsimx = c(bunchA,bunchB,bunchC)
  newsimdata = data.frame(newsimx, newsimage)
  newFstar[i] = oneway.test(newsimx~newsimage, var.equal=TRUE, data=newsimdata)$statistic} 

Bootstrap F-dist Histogram

hist(newFstar, prob=TRUE, main ="Theoretical F Distrobution")
s=seq(0,6,.25)
points(s,y=df(s,2,4263), type="b", col="blue")

plot of chunk unnamed-chunk-10

Estimation of the Alpha Level

mean(Fstar>=newFstar)
## [1] 0.469
qf(.95, 2,4263)
## [1] 2.998
quantile(newFstar,.95)
##   95% 
## 3.101

Repeat ANOVA Test with Resampling

modelB=aov(newsimx~newsimage, data=newsimdata)
anova(modelB)
## Analysis of Variance Table
## 
## Response: newsimx
##             Df   Sum Sq  Mean Sq F value Pr(>F)
## newsimage    2 8.83e+08 4.42e+08     1.8   0.17
## Residuals 4263 1.05e+12 2.45e+08

ANOVA Results: The second anova test that analyzed the variation in player weight as a result of the variation in player team produced a p-value less than 0.05. Again, this indicates that there is a low probability that the variation of player weight can be attributed to solely randomization.

Diagnostics/Model Adequacy Checking

Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. Based on the theoretical distribution, the expected value for each datum is determined. If the data values in a set follow the theoretical distribution, then they will appear as a straight line on a Q-Q plot. When an anova is performed, it is done so with the assumption that the test statistic follows a normal distribution. Visualization of a Q-Q plot will further confirm if that assumption is correct for the anova tests that were performed.

qqnorm(residuals(modelA))
qqline(residuals(modelA))

plot of chunk unnamed-chunk-13

plot(fitted(modelA),residuals(modelA))

plot of chunk unnamed-chunk-13

qqnorm(residuals(modelB))
qqline(residuals(modelB))

plot of chunk unnamed-chunk-13

plot(fitted(modelB),residuals(modelB))

plot of chunk unnamed-chunk-13 A Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. These plots are used to identify linearity, outliers, and error variances.