For this recipie, we will examine the Crime dataset from the Ecdat package. This dataset contains statistics on crime in North Caroline from 1981 to 1987.
Read in and subset data:
install.packages("Ecdat")
## Installing package into 'C:/Users/svoboa/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library("Ecdat", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
x<-Crime
For more on the dataset:
?Crime
## starting httpd help server ... done
Install Agricolae Package:
install.packages("agricolae")
## Installing package into 'C:/Users/svoboa/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library(agricolae)
## Warning: package 'agricolae' was built under R version 3.1.2
The dataset contains 2 factors, region (3 levels) and SMSA (2 levels).
For this experiment, we will only be examining region (central, west or other).
Set up region as factor:
x$region=as.factor(x$region)
There are 22 continuous variables on the crime in NC. see ?Crime above for descriptions.
Crime rate (crmrte) will be the response variable for this experiment.
The Crime dataset has 630 observations of 24 total variables. Each county is observed each year.
Since there are so many variables not under study in this recipie, we will take out those columns.
data<- x[,c(1,2,3,11)]
Structure, summary, and first/last observations of dataset:
str(data)
## 'data.frame': 630 obs. of 4 variables:
## $ county: int 1 1 1 1 1 1 1 3 3 3 ...
## $ year : int 81 82 83 84 85 86 87 81 82 83 ...
## $ crmrte: num 0.0399 0.0383 0.0303 0.0347 0.0366 ...
## $ region: Factor w/ 3 levels "other","west",..: 3 3 3 3 3 3 3 3 3 3 ...
summary(data)
## county year crmrte region
## Min. : 1 Min. :81 Min. :0.00181 other :245
## 1st Qu.: 51 1st Qu.:82 1st Qu.:0.01835 west :147
## Median :103 Median :84 Median :0.02844 central:238
## Mean :101 Mean :84 Mean :0.03159
## 3rd Qu.:151 3rd Qu.:86 3rd Qu.:0.03841
## Max. :197 Max. :87 Max. :0.16384
head(data)
## county year crmrte region
## 1 1 81 0.03988 central
## 2 1 82 0.03834 central
## 3 1 83 0.03030 central
## 4 1 84 0.03473 central
## 5 1 85 0.03657 central
## 6 1 86 0.03475 central
tail(data)
## county year crmrte region
## 625 197 82 0.01807 west
## 626 197 83 0.01557 west
## 627 197 84 0.01366 west
## 628 197 85 0.01309 west
## 629 197 86 0.01287 west
## 630 197 87 0.01419 west
An analysis of variance will be performed. We will test to see if region can explain any of the variation in crime rate (crmrte).
This design was choosen to see if crime rates varried from one region of NC to another.
The dataset Crime is a collection of survey data from crime rates and other statistics from 1981 to 1987.
Each county (many counties in a region) is observed once a year, or 7 times total.
Blocking is not used in this design.
Mean of crime rate by each region:
tapply(data$crmrte, data$region, mean)
## other west central
## 0.03478 0.01976 0.03561
tapply(data$crmrte, data$region, length)
## other west central
## 245 147 238
The west has a slightly higher crime rate while the central and other regions seem to be more similar. Further analysis will be required to see if these differences are significant
Boxplots:
boxplot(data$crmrte~data$region, xlab="Region of NC", ylab="Crime Rate per person)")
The boxplot shows similar results to examining the means.
Histogram of crime rate:
hist(data$crmrte)
Most crime rates fall between 0 and .05 crimes per person
Null Hypothesis: The variation in the response crime rate cannot be explained by anything other than randomization. Alternative Hypothesis: The variation in response can be explained by something other than randomization (region). Assumption: Data is normally distributed.
model1=aov(data$crmrte~data$region)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## data$region 2 0.0269 0.01346 47 <2e-16 ***
## Residuals 627 0.1796 0.00029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is very small, we reject the null hypothesis that crime rate cannot be explained by anything other than randomization. It is likely that the region you are in within North Carolina can explain some of the changes in crime rate.
When the line for a factor pair crosses zero or a p-value greater than .05 is generated, that indicates we fail to reject the null hypothesis that there is no difference in the means of that combination of pairs. When the plotted line does not cross zero or we generate a small p-value, that indicated there is likely a difference in means between the two factor levels.
tukey1<-TukeyHSD(aov(data$crmrte~data$region))
tukey1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data$crmrte ~ data$region)
##
## $`data$region`
## diff lwr upr p adj
## west-other -0.0150199 -0.019168 -0.010872 0.0000
## central-other 0.0008374 -0.002781 0.004456 0.8498
## central-west 0.0158573 0.011686 0.020029 0.0000
plot(tukey1)
The central-other region pair returns a high p-value, meaning we fail to rejec the null that there is no difference in the mean crime rate between the two regions. Looking back at the boxplot, this makes sense as the other and central boxplots looked similar.
We will run two resampling simulations in this recipie.
For the simulations, we will need the standard deviation which is the square root of mean sq error in anova summary above:
sdstar=sqrt(0.000286)
We will also attach the dataset so it easier to call the variables, save the mean crime rate, set the number of runs, ® and name the regions.
attach(data)
meanstar=mean(crmrte)
sdstar=sqrt(0.000286)
R= 10000
simregion=region
First, this recipie will run a monte carlo simulation which assumes normallity. The simulation looks to examine how closely the empirical distribution adheres to the true F-distribution.
The group size is choosen by length of each level of region (see tapply length code above in statisical analysis).
Fstar=numeric(R)
for (i in 1:R){
groupA = rnorm(245, mean=meanstar, sd=sdstar)
groupB = rnorm(147, mean=meanstar, sd=sdstar)
groupC = rnorm(238, mean=meanstar, sd=sdstar)
simcrmrte = c(groupA, groupB, groupC)
simdata = data.frame(simcrmrte,simregion)
Fstar[i] = oneway.test(simcrmrte~simregion, var.equal=T, data=simdata)$statistic
}
Below is a histogram and line points comparing the emperical to the analytic distribution of the F-Statistic. The Histogram is an emperical dist of f statistic (based on collected data, with assumption that it is normally distributed) while the red points represent the f-distribution with the same number of degrees of freedom based on the analytical distribution.
The second line of code is identifying which points to estimate for the f-distribution (plot from 0 to 7 at intervals of .25)
For the points code:
DF(region)=2
DF(residuals)=627 -> observations-levels -> 630-3=627
Degrees of freedom can also be found in summary of anova model 1
hist(Fstar, xlim=c(0, 8), ylim=c(0,1), prob=T)
x=seq(0,7,.25)
points(x,y=df(x,2,627),type="b",col="red")
This plot looks to examine the quality of the estimation. The ermerical distribution follows the analytic distribution closely, indicating a good fit.
However, remember an assumption of the Monte carlo is that the original data is normally distributed. In the Model Adequacy Checking section below, we will find this is not the case, meaning that the fit of the emperical to the anlytical distribution of the F-statistic may not be as good as we see.
Unlike the monte carlo simulation performed above, the bootstrapping method of ANOVA does not assume normallity.
For group size in the sample, the size should be the same as in the original data (again see tapply length)
BFstar=numeric(R)
meanstarB = with(data, tapply(crmrte,region,mean))
grpA = crmrte[region=="other"] - meanstarB[1]
grpB = crmrte[region=="west"] - meanstarB[2]
grpC = crmrte[region=="central"] - meanstarB[3]
for (i in 1:R) {
BgroupA = sample(grpA, size=245, replace=T)
BgroupB = sample(grpB, size=147, replace=T)
BgroupC = sample(grpC, size=238, replace=T)
Bsimcrmrte = c(BgroupA,BgroupB,BgroupC)
Bsimdata = data.frame(Bsimcrmrte,simregion)
BFstar[i] = oneway.test(Bsimcrmrte~simregion, var.equal=T, data=Bsimdata)$statistic
}
Compare Bootstrap simulated distribution to the known F-distribution:
DF(region)=2 DF(residuals)=627
hist(BFstar, ylim=c(0,.9), xlim=c(0,8), prob=T)
Bx=seq(0,7,.25)
points(Bx,y=df(Bx,2,627),type="b",col="red")
The graph shows a good fit.
Here is the alpha level from the bootstrapped distribution:
print(realFstar<-oneway.test(crmrte~region, var.equal=T, data=data)$statistic)
## F
## 47
estimate alpha level real is from exp bfstar is vector thats result from bootstrap sim compare each elimnet of vector to real % of values that exceeded real esitmate alpha, prob that estimate exceeds real would want below .05
When comparing each element of the estimate and the estimate and real value, the probability that the estimate exceeds the real value is:
mean(BFstar>=realFstar)
## [1] 0
The probability of getting an f-statistic of 45 is zero, and we would be satisfied with any value under .05.
Quantiles of f-dist:
qf(.95,2,627)
## [1] 3.01
We want to see similar values above and below.
Quantiles of analytical result
quantile(BFstar,.95)
## 95%
## 2.965
Visually inspect normality of data:
qqnorm(residuals(model1))
qqline(residuals(model1))
The data appears it may not be normal.
Shapiro-Wilks normality test to check:
shapiro.test(residuals(model1))
##
## Shapiro-Wilk normality test
##
## data: residuals(model1)
## W = 0.9048, p-value < 2.2e-16
Null hypothesis: The data came from a normally distributed population. We reject the null. We cannot assume the data is normal. This will be addressed in the contingencies section below.
plot(fitted(model1),residuals(model1))
The data should be symetric over the zero and spread out over the dynamic range. The model may not be a good fit since neither of these are true.
Since the data did not fulfill the normality assumption of the ancova model (which could be why the fit of the model was questionable), a Kruskal-Wallis non-parametric analysis of variance by Rank Sum Test should be performed:
kruskal.test(crmrte~region)
##
## Kruskal-Wallis rank sum test
##
## data: crmrte by region
## Kruskal-Wallis chi-squared = 105.3, df = 2, p-value < 2.2e-16
The null hypothesis of the kruskal test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means). Since the test results have a low p-value, we reject this null hypothesis. It is likely that the variation in the rank means of region can explain the variaion in crime rate.
None used.
Data is from the NYCflight13 package
All included above.