Recipie for Resampling Methods

Ali Svoobda

RPI

11/6/14

1. Setting

System under test

For this recipie, we will examine the Crime dataset from the Ecdat package. This dataset contains statistics on crime in North Caroline from 1981 to 1987.

Read in and subset data:

install.packages("Ecdat")

## Installing package into 'C:/Users/svoboa/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)

## Error: trying to use CRAN without setting a mirror

library("Ecdat", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")

## Warning: package 'Ecdat' was built under R version 3.1.2

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

x<-Crime

For more on the dataset:

?Crime

## starting httpd help server ... done

Install Agricolae Package:

install.packages("agricolae")

## Installing package into 'C:/Users/svoboa/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)

## Error: trying to use CRAN without setting a mirror

library(agricolae)

## Warning: package 'agricolae' was built under R version 3.1.2

Factors and Levels 24

The dataset contains 2 factors, region (3 levels) and SMSA (2 levels).

For this experiment, we will only be examining region (central, west or other).

Set up region as factor:

x$region=as.factor(x$region)

Continuous Variables

There are 22 continuous variables on the crime in NC. see ?Crime above for descriptions.

Response Variables

Crime rate (crmrte) will be the response variable for this experiment.

The Data: How is it organized and what does it look like?

The Crime dataset has 630 observations of 24 total variables. Each county is observed each year.

Since there are so many variables not under study in this recipie, we will take out those columns.

data<- x[,c(1,2,3,11)]

Structure, summary, and first/last observations of dataset:

str(data)

## 'data.frame':    630 obs. of  4 variables:
##  $ county: int  1 1 1 1 1 1 1 3 3 3 ...
##  $ year  : int  81 82 83 84 85 86 87 81 82 83 ...
##  $ crmrte: num  0.0399 0.0383 0.0303 0.0347 0.0366 ...
##  $ region: Factor w/ 3 levels "other","west",..: 3 3 3 3 3 3 3 3 3 3 ...

summary(data)

##      county         year        crmrte            region   
##  Min.   :  1   Min.   :81   Min.   :0.00181   other  :245  
##  1st Qu.: 51   1st Qu.:82   1st Qu.:0.01835   west   :147  
##  Median :103   Median :84   Median :0.02844   central:238  
##  Mean   :101   Mean   :84   Mean   :0.03159                
##  3rd Qu.:151   3rd Qu.:86   3rd Qu.:0.03841                
##  Max.   :197   Max.   :87   Max.   :0.16384

head(data)

##   county year  crmrte  region
## 1      1   81 0.03988 central
## 2      1   82 0.03834 central
## 3      1   83 0.03030 central
## 4      1   84 0.03473 central
## 5      1   85 0.03657 central
## 6      1   86 0.03475 central

tail(data)

##     county year  crmrte region
## 625    197   82 0.01807   west
## 626    197   83 0.01557   west
## 627    197   84 0.01366   west
## 628    197   85 0.01309   west
## 629    197   86 0.01287   west
## 630    197   87 0.01419   west

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

An analysis of variance will be performed. We will test to see if region can explain any of the variation in crime rate (crmrte).

What is the Rationale for this design?

This design was choosen to see if crime rates varried from one region of NC to another.

Randomize: What is the Randomization Scheme?

The dataset Crime is a collection of survey data from crime rates and other statistics from 1981 to 1987.

Replicate: Are there replicates and/or repeated measures?

Each county (many counties in a region) is observed once a year, or 7 times total.

Block: Did you use blocking in the design?

Blocking is not used in this design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Mean of crime rate by each region:

tapply(data$crmrte, data$region, mean)

##   other    west central 
## 0.03478 0.01976 0.03561

tapply(data$crmrte, data$region, length)

##   other    west central 
##     245     147     238

The west has a slightly higher crime rate while the central and other regions seem to be more similar. Further analysis will be required to see if these differences are significant

Boxplots:

boxplot(data$crmrte~data$region, xlab="Region of NC", ylab="Crime Rate per person)")

plot of chunk unnamed-chunk-8

The boxplot shows similar results to examining the means.

Histogram of crime rate:

hist(data$crmrte)

plot of chunk unnamed-chunk-9

Most crime rates fall between 0 and .05 crimes per person

Testing

Original Sample

ANOVA Model 1

Null Hypothesis: The variation in the response crime rate cannot be explained by anything other than randomization. Alternative Hypothesis: The variation in response can be explained by something other than randomization (region). Assumption: Data is normally distributed.

model1=aov(data$crmrte~data$region)
summary(model1)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## data$region   2 0.0269 0.01346      47 <2e-16 ***
## Residuals   627 0.1796 0.00029                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is very small, we reject the null hypothesis that crime rate cannot be explained by anything other than randomization. It is likely that the region you are in within North Carolina can explain some of the changes in crime rate.

Note for later, residual error = 0.000286

Tukey Test 1 (for differences in crime rate between the 3 regions)

When the line for a factor pair crosses zero or a p-value greater than .05 is generated, that indicates we fail to reject the null hypothesis that there is no difference in the means of that combination of pairs. When the plotted line does not cross zero or we generate a small p-value, that indicated there is likely a difference in means between the two factor levels.

tukey1<-TukeyHSD(aov(data$crmrte~data$region))
tukey1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = data$crmrte ~ data$region)
## 
## $`data$region`
##                     diff       lwr       upr  p adj
## west-other    -0.0150199 -0.019168 -0.010872 0.0000
## central-other  0.0008374 -0.002781  0.004456 0.8498
## central-west   0.0158573  0.011686  0.020029 0.0000

plot(tukey1)

plot of chunk unnamed-chunk-11

The central-other region pair returns a high p-value, meaning we fail to rejec the null that there is no difference in the mean crime rate between the two regions. Looking back at the boxplot, this makes sense as the other and central boxplots looked similar.

Resampling

We will run two resampling simulations in this recipie.

For the simulations, we will need the standard deviation which is the square root of mean sq error in anova summary above:

sdstar=sqrt(0.000286)

We will also attach the dataset so it easier to call the variables, save the mean crime rate, set the number of runs, ® and name the regions.

attach(data)
meanstar=mean(crmrte)
sdstar=sqrt(0.000286)
R= 10000
simregion=region

Monte Carlo simulation

First, this recipie will run a monte carlo simulation which assumes normallity. The simulation looks to examine how closely the empirical distribution adheres to the true F-distribution.

The group size is choosen by length of each level of region (see tapply length code above in statisical analysis).

Fstar=numeric(R)
for (i in 1:R){
  groupA = rnorm(245, mean=meanstar, sd=sdstar)
  groupB = rnorm(147, mean=meanstar, sd=sdstar)
  groupC = rnorm(238, mean=meanstar, sd=sdstar)
  simcrmrte = c(groupA, groupB, groupC)
  simdata = data.frame(simcrmrte,simregion)
  Fstar[i] = oneway.test(simcrmrte~simregion, var.equal=T, data=simdata)$statistic
}

Below is a histogram and line points comparing the emperical to the analytic distribution of the F-Statistic. The Histogram is an emperical dist of f statistic (based on collected data, with assumption that it is normally distributed) while the red points represent the f-distribution with the same number of degrees of freedom based on the analytical distribution.

The second line of code is identifying which points to estimate for the f-distribution (plot from 0 to 7 at intervals of .25)

For the points code: DF(region)=2
DF(residuals)=627 -> observations-levels -> 630-3=627 Degrees of freedom can also be found in summary of anova model 1

hist(Fstar, xlim=c(0, 8), ylim=c(0,1), prob=T)
x=seq(0,7,.25)
points(x,y=df(x,2,627),type="b",col="red")

plot of chunk unnamed-chunk-15

This plot looks to examine the quality of the estimation. The ermerical distribution follows the analytic distribution closely, indicating a good fit.

However, remember an assumption of the Monte carlo is that the original data is normally distributed. In the Model Adequacy Checking section below, we will find this is not the case, meaning that the fit of the emperical to the anlytical distribution of the F-statistic may not be as good as we see.

Bootstrapping Method simulation (shuffle with replacement)

Unlike the monte carlo simulation performed above, the bootstrapping method of ANOVA does not assume normallity.

For group size in the sample, the size should be the same as in the original data (again see tapply length)

BFstar=numeric(R)

meanstarB = with(data, tapply(crmrte,region,mean))
grpA = crmrte[region=="other"] - meanstarB[1]
grpB = crmrte[region=="west"] - meanstarB[2]
grpC = crmrte[region=="central"] - meanstarB[3]

for (i in 1:R) {
  BgroupA = sample(grpA, size=245, replace=T)
  BgroupB = sample(grpB, size=147, replace=T)
  BgroupC = sample(grpC, size=238, replace=T)

  Bsimcrmrte = c(BgroupA,BgroupB,BgroupC)
  Bsimdata = data.frame(Bsimcrmrte,simregion)
  BFstar[i] = oneway.test(Bsimcrmrte~simregion, var.equal=T, data=Bsimdata)$statistic
}

Compare Bootstrap simulated distribution to the known F-distribution:

DF(region)=2 DF(residuals)=627

hist(BFstar, ylim=c(0,.9), xlim=c(0,8), prob=T)
Bx=seq(0,7,.25)
points(Bx,y=df(Bx,2,627),type="b",col="red")

plot of chunk unnamed-chunk-17

The graph shows a good fit.

Here is the alpha level from the bootstrapped distribution:

print(realFstar<-oneway.test(crmrte~region, var.equal=T, data=data)$statistic)

##  F 
## 47

estimate alpha level real is from exp bfstar is vector thats result from bootstrap sim compare each elimnet of vector to real % of values that exceeded real esitmate alpha, prob that estimate exceeds real would want below .05

When comparing each element of the estimate and the estimate and real value, the probability that the estimate exceeds the real value is:

mean(BFstar>=realFstar)

## [1] 0

The probability of getting an f-statistic of 45 is zero, and we would be satisfied with any value under .05.

Quantiles of f-dist:

qf(.95,2,627)

## [1] 3.01

We want to see similar values above and below.

Quantiles of analytical result

quantile(BFstar,.95)

##   95% 
## 2.965

Diagnostics/Model Adequacy Checking

Model 1

Visually inspect normality of data:

qqnorm(residuals(model1))
qqline(residuals(model1))

plot of chunk unnamed-chunk-22

The data appears it may not be normal.

Shapiro-Wilks normality test to check:

shapiro.test(residuals(model1))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model1)
## W = 0.9048, p-value < 2.2e-16

Null hypothesis: The data came from a normally distributed population. We reject the null. We cannot assume the data is normal. This will be addressed in the contingencies section below.

Fitted vs Residuals Plot

plot(fitted(model1),residuals(model1))

plot of chunk unnamed-chunk-24

The data should be symetric over the zero and spread out over the dynamic range. The model may not be a good fit since neither of these are true.

4. Contingencies

Since the data did not fulfill the normality assumption of the ancova model (which could be why the fit of the model was questionable), a Kruskal-Wallis non-parametric analysis of variance by Rank Sum Test should be performed:

kruskal.test(crmrte~region)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  crmrte by region
## Kruskal-Wallis chi-squared = 105.3, df = 2, p-value < 2.2e-16

The null hypothesis of the kruskal test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means). Since the test results have a low p-value, we reject this null hypothesis. It is likely that the variation in the rank means of region can explain the variaion in crime rate.

5. References to the Literature

None used.

6. Appendicies

Link to raw data

Data is from the NYCflight13 package

Complete R Code

All included above.