Recipe 7: Resampling Techniques

Weights of Football Players

Max Winkelman

Rensselaer Polytechnic Institute

November 11, 2014

Version 1

1. Setting

Football Player Weights

The data analyzed in this recipe is a csv file that contains the weights of football players from five NFL teams.

Load the ‘FootballPlayerWeights.csv’ file

weight <- read.csv("~/RPI/Classes/Design of Experiments/R/FootballPlayerWeights.csv", header=TRUE)
#reads in the data from the csv file 'FootballPlayerWeights.csv' and assigns it to the dataframe weight
weight$Team = as.factor(weight$Team)

Factors and Levels

Factor: NFL Team

Levels: Cowboys, Packers, Broncos, Dolphins, Forty Niners

#Summary of Data 
head(weight)

##      Team Weight
## 1 Cowboys    250
## 2 Cowboys    255
## 3 Cowboys    255
## 4 Cowboys    264
## 5 Cowboys    250
## 6 Cowboys    265

#displays the first 6 rows of the data frame
tail(weight)

##            Team Weight
## 80 Forty Niners    253
## 81 Forty Niners    249
## 82 Forty Niners    223
## 83 Forty Niners    221
## 84 Forty Niners    228
## 85 Forty Niners    271

#displays the last 6 rows of the data frame
summary(weight)

##            Team        Weight   
##  Broncos     :17   Min.   :208  
##  Cowboys     :17   1st Qu.:236  
##  Dolphins    :17   Median :252  
##  Forty Niners:17   Mean   :248  
##  Packers     :17   3rd Qu.:260  
##                    Max.   :281

#displays a summary of the variables

Continuous variables:

In the csv file ‘FootballPlayerWeights.csv,’ the continuous variable is ‘Weight,’ which is measured in pounds.

Response variables:

The response variable in this recipe will be football player weight (lbs).

The Data: How is it organized and what does it look like?

The data in this recipe was taken from The Sports Encyclopedia Pro Football and is organized into two columns: ‘Team,’ which displays the NFL team that a player is from, and ‘Weight,’ which is the weight of the player in pounds.

Randomization

It can be assumed that the original data was gathered with proper randomization methods.

2. Experimental Design

How will the experiment be organized and conducted to test the hypothesis?

The experiment in this recipe determines whether variation in NFL player weight can be attributed to differences between NFL teams. A one-way analysis of variance (ANOVA) at a significance level of 0.05 (95% confidence) will be performed to test whether the team that a player is a member of has an effect on his weight. The null hypothesis for this ANOVA is that the mean player weight is equal for all NFL teams. The alternative hypothesis is that at least one team's mean weight differs from the others. After the ANOVA has been performed, resampling methods will be implemented to determine their effect on the outcome of the ANOVA test.

What is the rationale for this design?

The data frame in this recipe contains a single factor with multiple levels, so a one-way ANOVA is the appropriate test. The resampling techniques are included because ANOVA assumes that the residuals are normally distributed. While data can approximate this distribution, real data is rarely exactly normal.

Randomize: What is the Randomization Scheme?

No randomization scheme was used in this recipe because the data used is not original content.

Replicate: Are there replicates and/or repeated measures?

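Each team contributes 17 player weights, which act as replicate observations within each level of the factor. There are no repeated measures, because each player is weighed only once.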

Block: Did you use blocking in the design?

There will be no blocking in this recipe.

3. Statistical Analysis

Exploratory Data Analysis: Graphics and Descriptive Summary

boxplot(Weight~Team,data=weight, xlab="NFL Team", ylab="Weight (lbs)")
title("NFL Player Weights")

[Figure: boxplot "NFL Player Weights", weight (lbs) by NFL team]

The boxplot above shows the distribution of player weights from five NFL teams. All medians fall in between 240 and 260 pounds. The Broncos and the Cowboys are the only teams that have outliers. No statistical inference can be drawn from this plot.
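
As a rough illustration of how boxplot() decides which points to draw as outliers, the sketch below applies the same rule (1.5 hinge spreads beyond either hinge) through boxplot.stats(). The six weights are hypothetical values chosen for illustration; they are not taken from the recipe's data.

```r
# Hypothetical weights (lbs), NOT taken from FootballPlayerWeights.csv
x <- c(240, 245, 250, 255, 260, 300)

# boxplot.stats() returns the five-number summary and the flagged points
stats <- boxplot.stats(x)

lower_hinge <- stats$stats[2]
upper_hinge <- stats$stats[4]
spread <- upper_hinge - lower_hinge

# Points more than 1.5 hinge spreads beyond a hinge are drawn as outliers
lower_fence <- lower_hinge - 1.5 * spread
upper_fence <- upper_hinge + 1.5 * spread
outliers <- stats$out
```

Running boxplot.stats() on each team's weights would list the specific players that appear as outliers in the plot.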

ANOVA Testing

An analysis of variance (ANOVA) will be used to test whether the differences between the mean player weights are statistically significant. If the null hypothesis is not rejected, there is no evidence that player weight varies with NFL team. If the null hypothesis is rejected, at least part of the variation in player weight can be attributed to differences between teams.

# ANOVA
#Player Weight
model = aov(Weight~Team,data=weight)
anova(model)

## Analysis of Variance Table
## 
## Response: Weight
##           Df Sum Sq Mean Sq F value Pr(>F)
## Team       4   1714     428    1.58   0.19
## Residuals 80  21761     272

#performs an anova test

ANOVA Results: The ANOVA of player weight by team produced a p-value of 0.19. Because this is greater than the 0.05 significance level, the null hypothesis is not rejected: differences in mean weight this large would be fairly common under random variation alone, so the data provide no evidence that a player’s team affects the player’s weight. The mean square of the residuals is 272.
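
The F value and p-value can be reproduced from the sums of squares and degrees of freedom printed in the ANOVA table. The sketch below uses only those rounded table entries, so the last digits may differ slightly from what anova() computes on the raw data.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_team <- 1714
df_team <- 4
ss_resid <- 21761
df_resid <- 80

# Mean squares are sums of squares divided by their degrees of freedom
ms_team <- ss_team / df_team
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail probability under F(4, 80)
f_value <- ms_team / ms_resid
p_value <- pf(f_value, df_team, df_resid, lower.tail = FALSE)
round(c(F = f_value, p = p_value), 3)
```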

Post-Hoc Analysis

Tukey’s Honestly Significant Difference (HSD) is a multiple comparison procedure that is used after an ANOVA to determine which specific sample means are significantly different from the others. In this recipe, Tukey’s HSD is not necessary because the ANOVA did not detect a significant difference between the team means.

Resampling Methods

Monte Carlo Estimation of Power

If resampling is performed from a known theoretical distribution, it is referred to as a “Monte Carlo” simulation. Monte Carlo simulations are algorithms that use repeated random sampling to approximate the distribution of a statistic, here the F statistic of the ANOVA.

meanstar=with(weight,tapply(Weight,Team,mean))
#creates an array of the means of the levels
with(weight,tapply(Weight,Team,var))

##      Broncos      Cowboys     Dolphins Forty Niners      Packers 
##        263.6        235.3        281.1        288.5        291.6

#creates an array of the variances of the levels
with(weight,tapply(Weight,Team,length))

##      Broncos      Cowboys     Dolphins Forty Niners      Packers 
##           17           17           17           17           17

#creates an array of the lengths of the levels

summary(aov(Weight~Team,data=weight))

##             Df Sum Sq Mean Sq F value Pr(>F)
## Team         4   1714     428    1.58   0.19
## Residuals   80  21761     272

#summary of the anova model for reference
 
meanWeight = mean(weight$Weight)
#mean of all the weights in the data set
sqrtMS = sqrt(272)
#squareroot of the mean square of the residuals
simTeam = weight$Team
#assigns the Teams to one data set
Runs = 1000
#number of random simulations that will be run
Fstar = numeric(Runs)
for (i in 1:Runs) 
  {A = rnorm(17, mean=meanWeight, sd=sqrtMS)
  B = rnorm(17, mean=meanWeight, sd=sqrtMS)
  C = rnorm(17, mean=meanWeight, sd=sqrtMS)
  D = rnorm(17, mean=meanWeight, sd=sqrtMS)
  E = rnorm(17, mean=meanWeight, sd=sqrtMS)
  simWeight = c(A,B,C,D,E)
  simdata = data.frame(simWeight,simTeam)
  Fstar[i] = oneway.test(simWeight~simTeam, var.equal=T, data=simdata)$statistic}

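The code above draws 85 observations (17 per team) from a normal distribution with the overall mean weight and a standard deviation equal to the square root of the residual mean square, then performs a one-way ANOVA on the resulting data frame; this is repeated 1000 times. The resulting F statistics approximate the F-distribution under the null hypothesis, shown in the histogram, with 4 degrees of freedom for the NFL teams and 80 degrees of freedom for the residuals.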

hist(Fstar, prob=TRUE, main ="Theoretical F Distribution", ylim=c(0,0.8), xlim=c(0,7))
x=seq(.25,7,.25)
#degrees of freedom: 4 = 5 levels - 1 and 80 = 85 observations - 5 group means
points(x,y=df(x,4,80),type="b",col="blue")

[Figure: histogram of the Monte Carlo F statistics with the theoretical F(4, 80) density overlaid as blue points]

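The plotted points of the theoretical F-distribution match up relatively nicely with the Monte Carlo F-distribution. This agreement is expected, because the simulated weights were drawn from a normal distribution with equal means; it confirms that the simulation reproduces the theoretical distribution, but on its own it says nothing about whether the original data set is normally distributed.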
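
The simulation above generates every team from the same mean, so it approximates the null distribution of F rather than power. A minimal sketch of a power estimate follows, under a loudly stated assumption: the true team means are taken to equal the rounded sample means reported in the parameter summaries of this recipe, and the common standard deviation is taken to be the square root of the residual mean square. The seed and all variable names are illustrative.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# ASSUMPTION: true team means equal the rounded sample means in this recipe
team_means <- c(Broncos = 254, Cowboys = 247, Dolphins = 249,
                `Forty Niners` = 241, Packers = 251)
n_per_team <- 17
sigma <- sqrt(272)         # square root of the residual mean square
runs <- 1000
f_crit <- qf(0.95, 4, 80)  # rejection threshold at alpha = 0.05

team <- factor(rep(names(team_means), each = n_per_team))
mu <- rep(team_means, each = n_per_team)

reject <- logical(runs)
for (i in seq_len(runs)) {
  # Simulate weights under the alternative: each team has its own mean
  sim_weight <- rnorm(length(mu), mean = mu, sd = sigma)
  f_stat <- oneway.test(sim_weight ~ team, var.equal = TRUE)$statistic
  reject[i] <- f_stat > f_crit
}

# Estimated power: share of simulated experiments that reject the null
power_hat <- mean(reject)
power_hat
```

The estimate is only as good as the assumed means; it answers how often an experiment of this size would detect differences of the observed magnitude if they were real.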

Bootstrap Method

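The bootstrap is the second resampling method performed. It is entirely empirical: it resamples the observed data with replacement and makes no assumptions about the distribution of the data. Each team's weights are first centered on that team's mean, so the resampled data satisfy the null hypothesis of equal means, and an F statistic is computed for every resample. The final resample is also used as the input to a second ANOVA.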

# Bootstrap version
#rm(list=ls())
grpB = weight$Weight[weight$Team=="Broncos"] - meanstar[1]
grpC = weight$Weight[weight$Team=="Cowboys"] - meanstar[2]
grpD = weight$Weight[weight$Team=="Dolphins"] - meanstar[3]
grpFN = weight$Weight[weight$Team=="Forty Niners"] - meanstar[4]
grpP = weight$Weight[weight$Team=="Packers"] - meanstar[5]
newsimTeam = weight$Team

newFstar = numeric(Runs)
for (i in 1:Runs) 
  {groupA = sample(grpB, size=17, replace=TRUE)
   groupB = sample(grpC, size=17, replace=TRUE)
   groupC = sample(grpD, size=17, replace=TRUE)
   groupD = sample(grpFN, size=17, replace=TRUE)
   groupE = sample(grpP, size=17, replace=TRUE)
   newsimweight = c(groupA,groupB,groupC,groupD,groupE)
   newsimdata = data.frame(newsimweight,newsimTeam)
   newFstar[i] = oneway.test(newsimweight~newsimTeam, var.equal=T, data=newsimdata)$statistic}

Bootstrap F-distribution Histogram

hist(newFstar, prob=TRUE, main ="BootStrap F-Distribution", ylim=c(0,0.8), xlim=c(0,7))
x=seq(.25,7,.5)
#degrees of freedom: 4 = 5 levels - 1 and 80 = 85 observations - 5 group means
points(x,y=df(x,4,80),type="b",col="blue")

[Figure: histogram of the bootstrap F statistics with the theoretical F(4, 80) density overlaid as blue points]

The bootstrap histogram is very similar to the theoretical F-distribution plotted over it, suggesting that the F statistic for these data behaves much as normal theory predicts and that the original data set is reasonably close to being normally distributed.
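
The bootstrap loop can also be written without one variable per team. The sketch below is one possible compact form, not the recipe's own code: it centers the weights with ave() and resamples within each team inside a helper function. Because the csv file is not available here, `demo` is a hypothetical stand-in data set, and `boot_f` is an illustrative name.

```r
set.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for the recipe's data: 5 teams x 17 players
demo <- data.frame(
  Team = factor(rep(c("A", "B", "C", "D", "E"), each = 17)),
  Weight = rnorm(85, mean = 248, sd = sqrt(272))
)

# Center each weight on its own team mean so every team has mean zero
centered <- demo$Weight - ave(demo$Weight, demo$Team)

boot_f <- function(resid, group, runs = 1000) {
  f_star <- numeric(runs)
  for (i in seq_len(runs)) {
    # Resample with replacement inside each team
    resampled <- ave(resid, group,
                     FUN = function(r) sample(r, length(r), replace = TRUE))
    f_star[i] <- oneway.test(resampled ~ group, var.equal = TRUE)$statistic
  }
  f_star
}

f_star <- boot_f(centered, demo$Team, runs = 200)
quantile(f_star, 0.95)
```

With the real data, the call would be boot_f() on the centered `weight$Weight` values and `weight$Team`, which keeps each resampled value attached to its own team label.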

Estimation of the Alpha Level

# Alpha level (Bootstrapped Distribution)
print(realFstar<-oneway.test(Weight~Team, var.equal=T, data=weight)$statistic)

##     F 
## 1.575

#Bootstrap p-value: proportion of bootstrap F statistics at least as large as the observed F
mean(newFstar>=realFstar)

## [1] 0.16

#Quantiles 
#Analytic F distribution
qf(.95,4,80)

## [1] 2.486

#Bootstrapped F distribution (alpha= 0.05)
quantile(newFstar,.95)

##   95% 
## 2.231
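
Because the bootstrap p-value is a proportion estimated from a finite number of resamples, it carries simulation error of its own. As a rough guide, the sketch below applies the usual binomial standard error to the reported value of 0.16 and the 1000 runs used in this recipe; the normal-approximation interval is an assumption of the sketch.

```r
p_boot <- 0.16  # bootstrap p-value reported above
runs <- 1000    # number of bootstrap resamples (Runs in the recipe)

# Binomial standard error of a proportion estimated from `runs` draws
se_boot <- sqrt(p_boot * (1 - p_boot) / runs)

# Approximate 95% interval for the bootstrap p-value
interval <- p_boot + c(-1.96, 1.96) * se_boot
round(c(se = se_boot, lower = interval[1], upper = interval[2]), 3)
```

Increasing the number of runs narrows this interval, at the cost of a longer simulation.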

Repeat ANOVA Test with Resampling

newmodel=aov(newsimweight~newsimTeam, data=newsimdata)
anova(newmodel)

## Analysis of Variance Table
## 
## Response: newsimweight
##            Df Sum Sq Mean Sq F value Pr(>F)
## newsimTeam  4   1575     394    1.55    0.2
## Residuals  80  20293     254

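ANOVA Results: The second ANOVA, fit to the final bootstrap resample generated by the loop above, produced a p-value of about 0.2. Because the resampled weights were centered on their team means, a non-significant result is expected here, and the null hypothesis is again not rejected. A single resample is only one draw from the bootstrap distribution; the bootstrap p-value of 0.16 computed above, which uses all 1000 resamples, is the more informative summary.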

Estimation of Parameters

Summary of all factors and levels

#Summaries
#Broncos
broncos<-weight$Team=="Broncos" 
#logical vector marking the rows that belong to the Broncos
summary(weight[broncos,"Weight"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     216     250     256     254     265     281

#displays a summary of the weight data for the Broncos

#Packers
packers<-weight$Team=="Packers" 
#logical vector marking the rows that belong to the Packers
summary(weight[packers,"Weight"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     222     240     254     251     263     275

#displays a summary of the weight data for the Packers

#Dolphins
dolphins<-weight$Team=="Dolphins" 
#logical vector marking the rows that belong to the Dolphins
summary(weight[dolphins,"Weight"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     210     236     253     249     263     268

#displays a summary of the weight data for the Dolphins

#49ers
fortyniners<-weight$Team=="Forty Niners" 
#logical vector marking the rows that belong to the 49ers
summary(weight[fortyniners,"Weight"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     208     228     247     241     249     271

#displays a summary of the weight data for the 49ers

#Cowboys
cowboys<-weight$Team=="Cowboys" 
#logical vector marking the rows that belong to the Cowboys
summary(weight[cowboys,"Weight"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     220     245     250     247     255     266

#displays a summary of the weight data for the Cowboys

Diagnostics/Model Adequacy Checking

Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. Based on the theoretical distribution, the expected value for each datum is determined. If the data values in a set follow the theoretical distribution, then they will appear as a straight line on a Q-Q plot. When an ANOVA is performed, it is done so with the assumption that the residuals follow a normal distribution. Inspecting a Q-Q plot of the residuals helps check whether that assumption is reasonable for the ANOVA tests that were performed.

#Q-Q Plots
#Player Weight
qqnorm(residuals(model), main="Normal Q-Q Plot for NFL Player Weight", ylab="Weight (lbs) Residuals")
qqline(residuals(model))

[Figure: Normal Q-Q plot of the residuals from the original ANOVA model]

#produces a Q-Q normal plot for the anova model

qqnorm(residuals(newmodel), main="Normal Q-Q Plot for NFL Player Weight (Resampling)", ylab="Weight (lbs) Residuals")
qqline(residuals(newmodel))

[Figure: Normal Q-Q plot of the residuals from the resampled ANOVA model]

#produces a Q-Q normal plot for the anova model with resampling

The Normal Q-Q plots for both ANOVA models show relatively linear relationships between the residuals and the theoretical quantiles, indicating that the normality assumption is reasonable and that the use of ANOVA in this recipe was appropriate.
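
To make the construction of a Q-Q plot concrete, the sketch below builds one by hand: the ordered residuals are paired with the standard normal quantiles expected at the same ranks. The residuals here are hypothetical normal draws with the recipe's residual spread, used only so the example runs without the csv file.

```r
set.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical residuals: 85 draws with the recipe's residual spread
res <- rnorm(85, mean = 0, sd = sqrt(272))

n <- length(res)
theoretical <- qnorm(ppoints(n))  # expected standard normal quantiles
observed <- sort(res)             # ordered residuals

# The same kind of picture qqnorm() draws, built by hand
plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# A correlation near 1 corresponds to points close to a straight line
qq_cor <- cor(theoretical, observed)
qq_cor
```

Substituting residuals(model) for `res` gives a single number to accompany the visual check, though it is a rough summary rather than a formal test.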

A Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. These plots are used to identify linearity, outliers, and error variances.

#Player Weights
plot(fitted(model),residuals(model), main="Residual vs Fitted Plot for NFL Player Weights")

[Figure: residuals vs. fitted values for the original ANOVA model]

plot(fitted(newmodel),residuals(newmodel), main="Residual vs Fitted Plot for NFL Player Weights (Resampling)")

[Figure: residuals vs. fitted values for the resampled ANOVA model]

#produces a Residual vs Fits Plot for the first and second ANOVA

The residual plots above both show relatively even distributions about zero and do not contain any obvious outliers, indicating that the use of ANOVA was appropriate in this recipe.
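
In a one-way ANOVA every fitted value is simply the mean of the observation's own group, which is why these plots show one vertical strip of points per team. The sketch below checks this on a hypothetical data set shaped like the recipe's (5 teams of 17 players) and summarizes the residual spread per team; all names are illustrative.

```r
set.seed(4)  # fixed seed so the sketch is reproducible

# Hypothetical data shaped like the recipe's: 5 teams x 17 players
team <- factor(rep(c("A", "B", "C", "D", "E"), each = 17))
wt <- rnorm(85, mean = 248, sd = sqrt(272))

fit <- aov(wt ~ team)

# Each fitted value should equal the mean of that observation's team
group_means <- ave(wt, team)
same <- isTRUE(all.equal(unname(fitted(fit)), group_means))

# Residual standard deviation within each team; similar values are
# consistent with the equal-variance assumption of the ANOVA
resid_sd <- tapply(residuals(fit), team, sd)
round(resid_sd, 1)
```

Applied to the recipe's model, the per-team standard deviations correspond to the square roots of the variances printed earlier with tapply().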

4. References to the Literature

No literature was used in this sample recipe.

5. Appendices

The raw data used in this statistical analysis is available as a downloadable file from: http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/owan/frames/frame.html