Cheryl Tran

RPI

11/20/2014 Version 1

1. Setting

System under test

This recipe is examining unemployment of Blue Collar Workers from the Ecdat package. We are looking at how 6 different factors, each with 2 levels effect the replacement rate.

#installing package
install.packages("FrF2",repos='http://cran.us.r-project.org')
## package 'FrF2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\RtmpUxQbWb\downloaded_packages
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\RtmpUxQbWb\downloaded_packages
library("Ecdat", lib.loc="C:/Program Files/R/R-3.1.1/library")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
#Pulling the Benefits data from the library
Blue<-Benefits
library(FrF2)
## Warning: package 'FrF2' was built under R version 3.1.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design

Factors and Levels

We are examining a two level multi-factor analysis with 6 different factors. They are “nwhite” which is nonwhite?, “school12” more than 12 years of school?, “sex” so male or female, “smsa” if they live in smsa, “married” if they are married, and lastly “dkids”, if they have kids. These factors are broken down to yes or no or male or female and are assigned -1 or 1.

head(Blue)
##   stateur statemb state age tenure    joblost nwhite school12    sex
## 1     4.5     167    42  49     21      other     no       no   male
## 2    10.5     251    55  26      2 slack_work     no       no   male
## 3     7.2     260    21  40     19      other     no      yes female
## 4     5.8     245    56  51     17 slack_work    yes       no female
## 5     6.5     125    58  33      1 slack_work     no      yes   male
## 6     7.5     188    11  51      3      other     no       no   male
##   bluecol smsa married dkids dykids yrdispl     rr head  ui
## 1     yes  yes      no    no     no       7 0.2906  yes yes
## 2     yes  yes      no   yes    yes      10 0.5202  yes  no
## 3     yes  yes     yes    no     no      10 0.4325  yes yes
## 4     yes  yes     yes    no     no      10 0.5000   no yes
## 5     yes  yes     yes   yes    yes       4 0.3906  yes  no
## 6     yes  yes     yes    no     no      10 0.4822  yes yes
tail(Blue)
##      stateur statemb state age tenure    joblost nwhite school12    sex
## 4872     8.6     125    59  21      2 slack_work     no       no   male
## 4873     7.4     168    33  35     12 slack_work    yes      yes female
## 4874     7.0     189    74  39      5 slack_work     no       no   male
## 4875     8.0     168    74  59     13 slack_work     no       no   male
## 4876     6.3     191    41  33     11 slack_work     no       no   male
## 4877     9.3     188    94  27      2      other     no       no   male
##      bluecol smsa married dkids dykids yrdispl     rr head  ui
## 4872     yes  yes      no    no     no       2 0.3125  yes  no
## 4873     yes  yes      no   yes     no       6 0.4636  yes yes
## 4874     yes  yes     yes    no     no       4 0.3937  yes yes
## 4875     yes  yes     yes   yes     no       2 0.3724  yes yes
## 4876     yes  yes     yes   yes    yes       3 0.5000  yes  no
## 4877     yes  yes     yes   yes    yes       7 0.4954  yes yes

Continuous variables (if any)

The continuous variables in the data set are the state unemployment rate, years of tenure in job lost, replacement rate.

Response variables

In this experiment, we are looking at the replacement rate.

The Data: How is it organized and what does it look like?

The dataset has 4877 observations from 1972. They are individual observations in the United States.They are organized into stateur, statemb, state, age, tenure, joblost, nwhite, school12, sex, bluecol, smsa, married, dkids, dykids, yrdispl, rr, head, and ui.

Randomization

The dataset consists of observations so it unknow if there was randomization.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment being conducted is to use a fractional factorial design and perform data analysis on it (ANOVA). An anova will be done on the data, then it will be performed again on the merged data of the original and the data from the FrF2 function.

What is the rationale for this design?

This method is to check whether the means of several groups are equal when multiple factors are involved. the fractional factorial design is used so that all possible combinations will be included and interaction effects as well.

Randomize: What is the Randomization Scheme?

The dataset is the observations so it was not randomized

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures

Block: Did you use blocking in the design?

There was no blocking used in the design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

summary(Blue)
##     stateur         statemb        state           age      
##  Min.   : 2.20   Min.   : 84   Min.   :11.0   Min.   :20.0  
##  1st Qu.: 5.70   1st Qu.:150   1st Qu.:31.0   1st Qu.:28.0  
##  Median : 7.20   Median :177   Median :55.0   Median :34.0  
##  Mean   : 7.51   Mean   :181   Mean   :52.8   Mean   :36.1  
##  3rd Qu.: 9.00   3rd Qu.:205   3rd Qu.:74.0   3rd Qu.:43.0  
##  Max.   :18.00   Max.   :293   Max.   :95.0   Max.   :61.0  
##      tenure                    joblost     nwhite     school12  
##  Min.   : 1.00   other             :1976   no :4159   no :3950  
##  1st Qu.: 2.00   position_abolished: 402   yes: 718   yes: 927  
##  Median : 3.00   seasonal_job_ended: 177                        
##  Mean   : 5.66   slack_work        :2322                        
##  3rd Qu.: 7.00                                                  
##  Max.   :41.00                                                  
##      sex       bluecol     smsa      married    dkids      dykids    
##  female:1150   yes:4877   no :1694   no :1791   no :2508   no :3796  
##  male  :3727              yes:3183   yes:3086   yes:2369   yes:1081  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##     yrdispl           rr          head        ui      
##  Min.   : 1.0   Min.   :0.0386   no :1558   no :1542  
##  1st Qu.: 2.0   1st Qu.:0.3752   yes:3319   yes:3335  
##  Median : 5.0   Median :0.4904                        
##  Mean   : 5.2   Mean   :0.4384                        
##  3rd Qu.: 8.0   3rd Qu.:0.5116                        
##  Max.   :10.0   Max.   :0.6912
boxplot(rr~nwhite, data=Blue, xlab="nonwhite?", ylab="replacement rate")
title("Replacement Rate of white or nonwhite")

plot of chunk unnamed-chunk-3

boxplot(rr~school12, data=Blue, xlab="More than 12 years of school?", ylab="replacement rate")
title("Replacement Rate of people with or without 12 years of school")

plot of chunk unnamed-chunk-3

boxplot(rr~sex, data=Blue, xlab="Gender?", ylab="replacement rate")
title("Replacement Rate of Gender")

plot of chunk unnamed-chunk-3

boxplot(rr~smsa, data=Blue, xlab="Lives in smsa?", ylab="replacement rate")
title("Replacement Rate of people who live or dont live in smsa")

plot of chunk unnamed-chunk-3

boxplot(rr~married, data=Blue, xlab="Married?", ylab="replacement rate")
title("Replacement Rate of people who are married")

plot of chunk unnamed-chunk-3

boxplot(rr~smsa, data=Blue, xlab="has kids?", ylab="replacement rate")
title("Replacement Rate of people who have or don't have kids")

plot of chunk unnamed-chunk-3

#make factors binary
Blue$nwhite<-as.character(Blue$nwhite)
Blue$nwhite[Blue$nwhite =="no"]<--1
Blue$nwhite[Blue$nwhite =="yes"]<- 1
Blue$school12<-as.character(Blue$school12)
Blue$school12[Blue$school12 =="no"]<--1
Blue$school12[Blue$school12 =="yes"]<- 1
Blue$sex<-as.character(Blue$sex)
Blue$sex[Blue$sex =="male"]<--1
Blue$sex[Blue$sex =="female"]<- 1
Blue$smsa<-as.character(Blue$smsa)
Blue$smsa[Blue$smsa =="no"]<--1
Blue$smsa[Blue$smsa =="yes"]<- 1
Blue$married<-as.character(Blue$married)
Blue$married[Blue$married =="no"]<--1
Blue$married[Blue$married =="yes"]<- 1
Blue$dkids<-as.character(Blue$dkids)
Blue$dkids[Blue$dkids =="no"]<--1
Blue$dkids[Blue$dkids =="yes"]<- 1

When looking at the boxplots of the replacement rate for white or non white, the white had a larger interquartile range and range. However, both of the medians seem to be fairly similar and the data sets seems skewed. When looking at if they had 12 years of school, the yes group had a larger interquartile range and didnt have any outliers. The no group had a lot of outliers and seemed to be more skewed than the yes group. When looking at gender, the female group had a way smaller range and interquartile range then the men. However, there seemed to be more outliers for the women.The boxplots for the who do or do not live in smsa look very similar. The ranges seem to be about the same and same with the interquartile range. The yes group may have a slightly larger maximum but the boxplots look similar.For people that are not married, their data looks more skewed for the no group. However the medians both seem close to being 0.5. There are outliers for both of the marriage groups.people that do or do not have kids have very similar distributions for the replacement rate. Their medians are both around 0.5 and the ranges are about the same.

Testing

model1=aov(rr~nwhite+school12+sex+smsa+married+dkids, data=Blue)
anova(model1)
## Analysis of Variance Table
## 
## Response: rr
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## nwhite       1    0.5   0.459   44.54 2.8e-11 ***
## school12     1    0.8   0.827   80.14 < 2e-16 ***
## sex          1    2.8   2.752  266.71 < 2e-16 ***
## smsa         1    0.0   0.010    0.99    0.32    
## married      1    0.7   0.692   67.11 3.3e-16 ***
## dkids        1    0.0   0.005    0.45    0.50    
## Residuals 4870   50.2   0.010                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Frdesign<-FrF2(32, nfactors=6, estimable=formula("~nwhite+school12+sex+smsa+married+dkids+sex:(nwhite+school12+smsa+married+dkids)"), factor.names=c("nwhite", "school12", "sex","smsa","married","dkids"), res5=TRUE, clear=FALSE)
Frdesign
##    nwhite school12 sex smsa married dkids
## 1       1       -1  -1    1      -1    -1
## 2      -1        1  -1   -1       1    -1
## 3      -1        1   1   -1      -1    -1
## 4       1       -1   1   -1       1     1
## 5       1       -1   1    1       1    -1
## 6       1        1   1    1      -1    -1
## 7       1       -1   1   -1      -1    -1
## 8      -1       -1   1    1      -1    -1
## 9       1       -1  -1   -1       1    -1
## 10     -1        1   1    1      -1     1
## 11      1        1  -1   -1       1     1
## 12      1       -1   1    1      -1     1
## 13      1        1  -1    1       1    -1
## 14      1       -1  -1    1       1     1
## 15     -1       -1  -1   -1      -1    -1
## 16     -1       -1  -1    1       1    -1
## 17     -1       -1  -1    1      -1     1
## 18     -1        1   1    1       1    -1
## 19     -1       -1   1   -1      -1     1
## 20     -1       -1  -1   -1       1     1
## 21      1       -1  -1   -1      -1     1
## 22     -1       -1   1    1       1     1
## 23     -1        1  -1    1      -1    -1
## 24     -1        1  -1   -1      -1     1
## 25      1        1  -1    1      -1     1
## 26     -1        1  -1    1       1     1
## 27      1        1   1   -1       1    -1
## 28      1        1   1    1       1     1
## 29     -1       -1   1   -1       1    -1
## 30      1        1  -1   -1      -1    -1
## 31      1        1   1   -1      -1     1
## 32     -1        1   1   -1       1     1
## class=design, type= FrF2.estimable
aliasprint(Frdesign)
## $legend
## [1] A=nwhite   B=school12 C=sex      D=smsa     E=married  F=dkids   
## 
## [[2]]
## [1] no aliasing among main effects and 2fis
new=merge(Frdesign, Blue, by=c("nwhite", "school12", "sex","smsa", "married","dkids"),all=FALSE)

The null hypothesis for this ANOVA is that the variation in the replacement rate can not be attributed to anything other than randomization. The alternate hypothesis is that the variation in the replacement rate can be attributed to one of the 6 factors.

From our model we would reject our null hypothesis and the variation in the replacement rate can be attributed to something other than randomization such as nonwhite,12 years of school, gender, and marriage status. For nwhite, there is a 2.8e-11 probability of getting the F value 44.54. For school12, there is a 2e-16 probability of getting a F value of 80.14. For gender, there is a 2e-16 probability for getting a f value of 266.71. lastly, for married there is a 3.3e-16 probability of getting an f value of 67.11. When looking at smsa or dkids, the variation in the replacement rate can not be attributed to anything other than randomization.

model2=aov(rr~nwhite*sex+school12*sex+smsa*sex+married*sex+dkids*sex, data=new)
anova(model2)
## Analysis of Variance Table
## 
## Response: rr
##                Df Sum Sq Mean Sq F value  Pr(>F)    
## nwhite          1   0.41   0.414   39.92 3.2e-10 ***
## sex             1   1.47   1.472  141.88 < 2e-16 ***
## school12        1   0.38   0.378   36.45 1.8e-09 ***
## smsa            1   0.02   0.021    2.05  0.1518    
## married         1   0.24   0.243   23.44 1.4e-06 ***
## dkids           1   0.00   0.000    0.02  0.8794    
## nwhite:sex      1   0.05   0.045    4.35  0.0371 *  
## sex:school12    1   0.03   0.034    3.32  0.0687 .  
## sex:smsa        1   0.03   0.025    2.43  0.1193    
## sex:married     1   0.09   0.090    8.67  0.0033 ** 
## sex:dkids       1   0.00   0.000    0.04  0.8485    
## Residuals    2364  24.53   0.010                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(new$rr)
## 
##  Shapiro-Wilk normality test
## 
## data:  new$rr
## W = 0.9089, p-value < 2.2e-16

For our second Anova test, we tested the main effects of the factors as well as the interaction effect of gender with all of the factors. As seen in our results, we can reject the null hypothesis and the variation in replacement rate can be attributed to nwhite, gender, school12, and married. Also, there is significant interaction effect between gender and nwhite and married.

The shapiro test is used to check if the response variable is normal. With an alpha level of .05, and a pvalue of 2.2e-16, we assume that the data is normally distributed.

Diagnostics/Model Adequacy Checking

qqnorm(residuals(model1))
qqline(residuals(model1))

plot of chunk unnamed-chunk-6

plot(fitted(model1), residuals(model1))

plot of chunk unnamed-chunk-6

qqnorm(residuals(model2))
qqline(residuals(model2))

plot of chunk unnamed-chunk-6

plot(fitted(model2), residuals(model2))

plot of chunk unnamed-chunk-6 A Q-Q plot can be used to compare the shape of the distribution of the dataset. The Q-Q plot and Q-Q line of the residuals for both ANOVA models appear to be normal. However at each of the ends, the residuals seem to stray from the line. When looking at the plot of the fitted model and the residuals, a good fit would be if the plot was symmetric at zero and had a high variation. The points seem to be symmetric at zero however the points at the beginning have higher variation and it decreases towards the right of the plot.

4. Contingencies

Since the dataset consist of observations of unemployment of blue collar workers, it is possible that all of the values were from chance and that there might have been something different about the year the observations were made.There may be other nuisance factors that were not recorded that year. Also when looking at the boxplots of the variables we were testing, the distributions werent normal. The Kruskal Wallis test does not assume a normal distrubtion of the residuals and could be used to test if the variation can be attributed to anything other than randomization without assuming a normal distribution.

kruskal.test(rr~nwhite, data=new)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  rr by nwhite
## Kruskal-Wallis chi-squared = 29.57, df = 1, p-value = 5.396e-08
kruskal.test(rr~school12, data=new)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  rr by school12
## Kruskal-Wallis chi-squared = 43.04, df = 1, p-value = 5.362e-11
kruskal.test(rr~sex, data=new)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  rr by sex
## Kruskal-Wallis chi-squared = 127.6, df = 1, p-value < 2.2e-16
kruskal.test(rr~smsa, data=new)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  rr by smsa
## Kruskal-Wallis chi-squared = 0.9053, df = 1, p-value = 0.3414
kruskal.test(rr~married, data=new)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  rr by married
## Kruskal-Wallis chi-squared = 47.01, df = 1, p-value = 7.052e-12
kruskal.test(rr~dkids, data=new)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  rr by dkids
## Kruskal-Wallis chi-squared = 1.894, df = 1, p-value = 0.1687

When looking at the results of the Kruskal-Wallis test, we would reject the null hypothesis and the variation in ther replacement rate can be attributed to nwhite, school12, sex, and married.

5. References to the literature

http://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code