This recipe is examining unemployment of Blue Collar Workers from the Ecdat package. We are looking at how 6 different factors, each with 2 levels effect the replacement rate.
#installing package
install.packages("FrF2",repos='http://cran.us.r-project.org')
## package 'FrF2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tranc3\AppData\Local\Temp\RtmpUxQbWb\downloaded_packages
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tranc3\AppData\Local\Temp\RtmpUxQbWb\downloaded_packages
library("Ecdat", lib.loc="C:/Program Files/R/R-3.1.1/library")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
#Pulling the Benefits data from the library
Blue<-Benefits
library(FrF2)
## Warning: package 'FrF2' was built under R version 3.1.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
We are examining a two level multi-factor analysis with 6 different factors. They are “nwhite” which is nonwhite?, “school12” more than 12 years of school?, “sex” so male or female, “smsa” if they live in smsa, “married” if they are married, and lastly “dkids”, if they have kids. These factors are broken down to yes or no or male or female and are assigned -1 or 1.
head(Blue)
## stateur statemb state age tenure joblost nwhite school12 sex
## 1 4.5 167 42 49 21 other no no male
## 2 10.5 251 55 26 2 slack_work no no male
## 3 7.2 260 21 40 19 other no yes female
## 4 5.8 245 56 51 17 slack_work yes no female
## 5 6.5 125 58 33 1 slack_work no yes male
## 6 7.5 188 11 51 3 other no no male
## bluecol smsa married dkids dykids yrdispl rr head ui
## 1 yes yes no no no 7 0.2906 yes yes
## 2 yes yes no yes yes 10 0.5202 yes no
## 3 yes yes yes no no 10 0.4325 yes yes
## 4 yes yes yes no no 10 0.5000 no yes
## 5 yes yes yes yes yes 4 0.3906 yes no
## 6 yes yes yes no no 10 0.4822 yes yes
tail(Blue)
## stateur statemb state age tenure joblost nwhite school12 sex
## 4872 8.6 125 59 21 2 slack_work no no male
## 4873 7.4 168 33 35 12 slack_work yes yes female
## 4874 7.0 189 74 39 5 slack_work no no male
## 4875 8.0 168 74 59 13 slack_work no no male
## 4876 6.3 191 41 33 11 slack_work no no male
## 4877 9.3 188 94 27 2 other no no male
## bluecol smsa married dkids dykids yrdispl rr head ui
## 4872 yes yes no no no 2 0.3125 yes no
## 4873 yes yes no yes no 6 0.4636 yes yes
## 4874 yes yes yes no no 4 0.3937 yes yes
## 4875 yes yes yes yes no 2 0.3724 yes yes
## 4876 yes yes yes yes yes 3 0.5000 yes no
## 4877 yes yes yes yes yes 7 0.4954 yes yes
The continuous variables in the data set are the state unemployment rate, years of tenure in job lost, replacement rate.
In this experiment, we are looking at the replacement rate.
The dataset has 4877 observations from 1972. They are individual observations in the United States.They are organized into stateur, statemb, state, age, tenure, joblost, nwhite, school12, sex, bluecol, smsa, married, dkids, dykids, yrdispl, rr, head, and ui.
The dataset consists of observations so it unknow if there was randomization.
This experiment being conducted is to use a fractional factorial design and perform data analysis on it (ANOVA). An anova will be done on the data, then it will be performed again on the merged data of the original and the data from the FrF2 function.
This method is to check whether the means of several groups are equal when multiple factors are involved. the fractional factorial design is used so that all possible combinations will be included and interaction effects as well.
The dataset is the observations so it was not randomized
There are no replicates or repeated measures
There was no blocking used in the design.
summary(Blue)
## stateur statemb state age
## Min. : 2.20 Min. : 84 Min. :11.0 Min. :20.0
## 1st Qu.: 5.70 1st Qu.:150 1st Qu.:31.0 1st Qu.:28.0
## Median : 7.20 Median :177 Median :55.0 Median :34.0
## Mean : 7.51 Mean :181 Mean :52.8 Mean :36.1
## 3rd Qu.: 9.00 3rd Qu.:205 3rd Qu.:74.0 3rd Qu.:43.0
## Max. :18.00 Max. :293 Max. :95.0 Max. :61.0
## tenure joblost nwhite school12
## Min. : 1.00 other :1976 no :4159 no :3950
## 1st Qu.: 2.00 position_abolished: 402 yes: 718 yes: 927
## Median : 3.00 seasonal_job_ended: 177
## Mean : 5.66 slack_work :2322
## 3rd Qu.: 7.00
## Max. :41.00
## sex bluecol smsa married dkids dykids
## female:1150 yes:4877 no :1694 no :1791 no :2508 no :3796
## male :3727 yes:3183 yes:3086 yes:2369 yes:1081
##
##
##
##
## yrdispl rr head ui
## Min. : 1.0 Min. :0.0386 no :1558 no :1542
## 1st Qu.: 2.0 1st Qu.:0.3752 yes:3319 yes:3335
## Median : 5.0 Median :0.4904
## Mean : 5.2 Mean :0.4384
## 3rd Qu.: 8.0 3rd Qu.:0.5116
## Max. :10.0 Max. :0.6912
boxplot(rr~nwhite, data=Blue, xlab="nonwhite?", ylab="replacement rate")
title("Replacement Rate of white or nonwhite")
boxplot(rr~school12, data=Blue, xlab="More than 12 years of school?", ylab="replacement rate")
title("Replacement Rate of people with or without 12 years of school")
boxplot(rr~sex, data=Blue, xlab="Gender?", ylab="replacement rate")
title("Replacement Rate of Gender")
boxplot(rr~smsa, data=Blue, xlab="Lives in smsa?", ylab="replacement rate")
title("Replacement Rate of people who live or dont live in smsa")
boxplot(rr~married, data=Blue, xlab="Married?", ylab="replacement rate")
title("Replacement Rate of people who are married")
boxplot(rr~smsa, data=Blue, xlab="has kids?", ylab="replacement rate")
title("Replacement Rate of people who have or don't have kids")
#make factors binary
Blue$nwhite<-as.character(Blue$nwhite)
Blue$nwhite[Blue$nwhite =="no"]<--1
Blue$nwhite[Blue$nwhite =="yes"]<- 1
Blue$school12<-as.character(Blue$school12)
Blue$school12[Blue$school12 =="no"]<--1
Blue$school12[Blue$school12 =="yes"]<- 1
Blue$sex<-as.character(Blue$sex)
Blue$sex[Blue$sex =="male"]<--1
Blue$sex[Blue$sex =="female"]<- 1
Blue$smsa<-as.character(Blue$smsa)
Blue$smsa[Blue$smsa =="no"]<--1
Blue$smsa[Blue$smsa =="yes"]<- 1
Blue$married<-as.character(Blue$married)
Blue$married[Blue$married =="no"]<--1
Blue$married[Blue$married =="yes"]<- 1
Blue$dkids<-as.character(Blue$dkids)
Blue$dkids[Blue$dkids =="no"]<--1
Blue$dkids[Blue$dkids =="yes"]<- 1
When looking at the boxplots of the replacement rate for white or non white, the white had a larger interquartile range and range. However, both of the medians seem to be fairly similar and the data sets seems skewed. When looking at if they had 12 years of school, the yes group had a larger interquartile range and didnt have any outliers. The no group had a lot of outliers and seemed to be more skewed than the yes group. When looking at gender, the female group had a way smaller range and interquartile range then the men. However, there seemed to be more outliers for the women.The boxplots for the who do or do not live in smsa look very similar. The ranges seem to be about the same and same with the interquartile range. The yes group may have a slightly larger maximum but the boxplots look similar.For people that are not married, their data looks more skewed for the no group. However the medians both seem close to being 0.5. There are outliers for both of the marriage groups.people that do or do not have kids have very similar distributions for the replacement rate. Their medians are both around 0.5 and the ranges are about the same.
model1=aov(rr~nwhite+school12+sex+smsa+married+dkids, data=Blue)
anova(model1)
## Analysis of Variance Table
##
## Response: rr
## Df Sum Sq Mean Sq F value Pr(>F)
## nwhite 1 0.5 0.459 44.54 2.8e-11 ***
## school12 1 0.8 0.827 80.14 < 2e-16 ***
## sex 1 2.8 2.752 266.71 < 2e-16 ***
## smsa 1 0.0 0.010 0.99 0.32
## married 1 0.7 0.692 67.11 3.3e-16 ***
## dkids 1 0.0 0.005 0.45 0.50
## Residuals 4870 50.2 0.010
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Frdesign<-FrF2(32, nfactors=6, estimable=formula("~nwhite+school12+sex+smsa+married+dkids+sex:(nwhite+school12+smsa+married+dkids)"), factor.names=c("nwhite", "school12", "sex","smsa","married","dkids"), res5=TRUE, clear=FALSE)
Frdesign
## nwhite school12 sex smsa married dkids
## 1 1 -1 -1 1 -1 -1
## 2 -1 1 -1 -1 1 -1
## 3 -1 1 1 -1 -1 -1
## 4 1 -1 1 -1 1 1
## 5 1 -1 1 1 1 -1
## 6 1 1 1 1 -1 -1
## 7 1 -1 1 -1 -1 -1
## 8 -1 -1 1 1 -1 -1
## 9 1 -1 -1 -1 1 -1
## 10 -1 1 1 1 -1 1
## 11 1 1 -1 -1 1 1
## 12 1 -1 1 1 -1 1
## 13 1 1 -1 1 1 -1
## 14 1 -1 -1 1 1 1
## 15 -1 -1 -1 -1 -1 -1
## 16 -1 -1 -1 1 1 -1
## 17 -1 -1 -1 1 -1 1
## 18 -1 1 1 1 1 -1
## 19 -1 -1 1 -1 -1 1
## 20 -1 -1 -1 -1 1 1
## 21 1 -1 -1 -1 -1 1
## 22 -1 -1 1 1 1 1
## 23 -1 1 -1 1 -1 -1
## 24 -1 1 -1 -1 -1 1
## 25 1 1 -1 1 -1 1
## 26 -1 1 -1 1 1 1
## 27 1 1 1 -1 1 -1
## 28 1 1 1 1 1 1
## 29 -1 -1 1 -1 1 -1
## 30 1 1 -1 -1 -1 -1
## 31 1 1 1 -1 -1 1
## 32 -1 1 1 -1 1 1
## class=design, type= FrF2.estimable
aliasprint(Frdesign)
## $legend
## [1] A=nwhite B=school12 C=sex D=smsa E=married F=dkids
##
## [[2]]
## [1] no aliasing among main effects and 2fis
new=merge(Frdesign, Blue, by=c("nwhite", "school12", "sex","smsa", "married","dkids"),all=FALSE)
The null hypothesis for this ANOVA is that the variation in the replacement rate can not be attributed to anything other than randomization. The alternate hypothesis is that the variation in the replacement rate can be attributed to one of the 6 factors.
From our model we would reject our null hypothesis and the variation in the replacement rate can be attributed to something other than randomization such as nonwhite,12 years of school, gender, and marriage status. For nwhite, there is a 2.8e-11 probability of getting the F value 44.54. For school12, there is a 2e-16 probability of getting a F value of 80.14. For gender, there is a 2e-16 probability for getting a f value of 266.71. lastly, for married there is a 3.3e-16 probability of getting an f value of 67.11. When looking at smsa or dkids, the variation in the replacement rate can not be attributed to anything other than randomization.
model2=aov(rr~nwhite*sex+school12*sex+smsa*sex+married*sex+dkids*sex, data=new)
anova(model2)
## Analysis of Variance Table
##
## Response: rr
## Df Sum Sq Mean Sq F value Pr(>F)
## nwhite 1 0.41 0.414 39.92 3.2e-10 ***
## sex 1 1.47 1.472 141.88 < 2e-16 ***
## school12 1 0.38 0.378 36.45 1.8e-09 ***
## smsa 1 0.02 0.021 2.05 0.1518
## married 1 0.24 0.243 23.44 1.4e-06 ***
## dkids 1 0.00 0.000 0.02 0.8794
## nwhite:sex 1 0.05 0.045 4.35 0.0371 *
## sex:school12 1 0.03 0.034 3.32 0.0687 .
## sex:smsa 1 0.03 0.025 2.43 0.1193
## sex:married 1 0.09 0.090 8.67 0.0033 **
## sex:dkids 1 0.00 0.000 0.04 0.8485
## Residuals 2364 24.53 0.010
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(new$rr)
##
## Shapiro-Wilk normality test
##
## data: new$rr
## W = 0.9089, p-value < 2.2e-16
For our second Anova test, we tested the main effects of the factors as well as the interaction effect of gender with all of the factors. As seen in our results, we can reject the null hypothesis and the variation in replacement rate can be attributed to nwhite, gender, school12, and married. Also, there is significant interaction effect between gender and nwhite and married.
The shapiro test is used to check if the response variable is normal. With an alpha level of .05, and a pvalue of 2.2e-16, we assume that the data is normally distributed.
qqnorm(residuals(model1))
qqline(residuals(model1))
plot(fitted(model1), residuals(model1))
qqnorm(residuals(model2))
qqline(residuals(model2))
plot(fitted(model2), residuals(model2))
A Q-Q plot can be used to compare the shape of the distribution of the dataset. The Q-Q plot and Q-Q line of the residuals for both ANOVA models appear to be normal. However at each of the ends, the residuals seem to stray from the line. When looking at the plot of the fitted model and the residuals, a good fit would be if the plot was symmetric at zero and had a high variation. The points seem to be symmetric at zero however the points at the beginning have higher variation and it decreases towards the right of the plot.
Since the dataset consist of observations of unemployment of blue collar workers, it is possible that all of the values were from chance and that there might have been something different about the year the observations were made.There may be other nuisance factors that were not recorded that year. Also when looking at the boxplots of the variables we were testing, the distributions werent normal. The Kruskal Wallis test does not assume a normal distrubtion of the residuals and could be used to test if the variation can be attributed to anything other than randomization without assuming a normal distribution.
kruskal.test(rr~nwhite, data=new)
##
## Kruskal-Wallis rank sum test
##
## data: rr by nwhite
## Kruskal-Wallis chi-squared = 29.57, df = 1, p-value = 5.396e-08
kruskal.test(rr~school12, data=new)
##
## Kruskal-Wallis rank sum test
##
## data: rr by school12
## Kruskal-Wallis chi-squared = 43.04, df = 1, p-value = 5.362e-11
kruskal.test(rr~sex, data=new)
##
## Kruskal-Wallis rank sum test
##
## data: rr by sex
## Kruskal-Wallis chi-squared = 127.6, df = 1, p-value < 2.2e-16
kruskal.test(rr~smsa, data=new)
##
## Kruskal-Wallis rank sum test
##
## data: rr by smsa
## Kruskal-Wallis chi-squared = 0.9053, df = 1, p-value = 0.3414
kruskal.test(rr~married, data=new)
##
## Kruskal-Wallis rank sum test
##
## data: rr by married
## Kruskal-Wallis chi-squared = 47.01, df = 1, p-value = 7.052e-12
kruskal.test(rr~dkids, data=new)
##
## Kruskal-Wallis rank sum test
##
## data: rr by dkids
## Kruskal-Wallis chi-squared = 1.894, df = 1, p-value = 0.1687
When looking at the results of the Kruskal-Wallis test, we would reject the null hypothesis and the variation in ther replacement rate can be attributed to nwhite, school12, sex, and married.