Recipe 8: Fractional Factorial Designs

Recipe for Resampling Methods

Wei Zou

RPI

Nov 19, 14; Version 1

1. Setting

System under test

The purpose of this project is to create a 2 level half fractional factorial design with 6 factors.The dataset used for this experiment is the “Benefit” from the “Ecdat Package” in R, which is used to explore the influencing factors of state unemployment rates of blue collar workers in 1972. The original dataset has 4877 observations with 18 variables. In this study, we focus on the following 6 effects on the state unemployment rate: race (white or non-white), education(more than 12 years of school or not), married or not, has kids or not, and if the respondent is the head of the household.The null/alternative hyphothesis of the experiment can be stated as:

H0: The variation in state unemployment rates is due to sample randomization only. (i.e, the selected 6 factors cannot explain the variation in state unemployment rates) HA: The variation in state unemployment rates is due to something else other than sample randomization (i.e., the selected 6 factors may affect the state unemployment rates)

To test the hypothesis, we first conduct a 6 factor (with 2 levels each) experiment to analyze the variations in state unemployment rates. Then we use a 2 level half fractional factorial design to select a subsample with 2^6-1 observations from the original data. After that, an exploratory data analysis and an ANOVA test with model adequacy checking are carried out on the subsample. The results are concluded by comparing the model estimates from the original data to the model estimates from the subsample.

#Read in the data 
library("Ecdat", lib.loc="~/R/win-library/3.1")

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

data1<-Benefits
summary(data1)

##     stateur         statemb        state           age      
##  Min.   : 2.20   Min.   : 84   Min.   :11.0   Min.   :20.0  
##  1st Qu.: 5.70   1st Qu.:150   1st Qu.:31.0   1st Qu.:28.0  
##  Median : 7.20   Median :177   Median :55.0   Median :34.0  
##  Mean   : 7.51   Mean   :181   Mean   :52.8   Mean   :36.1  
##  3rd Qu.: 9.00   3rd Qu.:205   3rd Qu.:74.0   3rd Qu.:43.0  
##  Max.   :18.00   Max.   :293   Max.   :95.0   Max.   :61.0  
##      tenure                    joblost     nwhite     school12  
##  Min.   : 1.00   other             :1976   no :4159   no :3950  
##  1st Qu.: 2.00   position_abolished: 402   yes: 718   yes: 927  
##  Median : 3.00   seasonal_job_ended: 177                        
##  Mean   : 5.66   slack_work        :2322                        
##  3rd Qu.: 7.00                                                  
##  Max.   :41.00                                                  
##      sex       bluecol     smsa      married    dkids      dykids    
##  female:1150   yes:4877   no :1694   no :1791   no :2508   no :3796  
##  male  :3727              yes:3183   yes:3086   yes:2369   yes:1081  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##     yrdispl           rr          head        ui      
##  Min.   : 1.0   Min.   :0.0386   no :1558   no :1542  
##  1st Qu.: 2.0   1st Qu.:0.3752   yes:3319   yes:3335  
##  Median : 5.0   Median :0.4904                        
##  Mean   : 5.2   Mean   :0.4384                        
##  3rd Qu.: 8.0   3rd Qu.:0.5116                        
##  Max.   :10.0   Max.   :0.6912

attach(data1)

Data Summary

Factor: The original data have nine factors. For the purpose of this study, we focus on the following recoded 6: “white”: 1 if the respondent is white, -1 otherwise “school”: 1 if the respodent has more than 12 years of school, -1 otherwise “gender”: 1 if the respondent is female, -1 otherwise “marriage”: 1 if the respondent is married, -1 otherwise “kids”: 1 if the respondent has kids, -1 otherwise “headofhh”: 1 if the respondent is the head of the household, -1 otherwise

Continuous variable and Response Variable: The original data have four continous variables. In this study, we treat “stateur” (the state unemployment rate, in %) as the response variable.

Organization: The original data were derived from the 1972 Current Population Survey's Displaced Worker Supplements. The data have 4877 observations with 18 variables.

Randomization: The data were randomly selected among the blue collar workers in the US, however, there is no randomize assignment and random execution order in the experiment.

#data recoding
nwhite<-as.factor(nwhite)
data1$white<-NA
data1$white[nwhite=="yes"]<-1
data1$white[nwhite=="no"]<--1

school12<-as.factor(school12)
data1$school<-NA
data1$school[school12=="yes"]<-1
data1$school[school12=="no"]<--1

sex<-as.factor(sex)
data1$gender<-NA
data1$gender[sex=="female"]<-1
data1$gender[sex=="male"]<--1

married<-as.factor(married)
data1$marriage<-NA
data1$marriage[married=="yes"]<-1
data1$marriage[married=="no"]<--1

dkids<-as.factor(dykids)
data1$kids<-NA
data1$kids[dykids=="yes"]<-1
data1$kids[dykids=="no"]<--1

head<-as.factor(head)
data1$headofhh<-NA
data1$headofhh[head=="yes"]<-1
data1$headofhh[head=="no"]<--1

2. (Experimental) Design

Organization of the Experiment and Rationale for this design

In this sample recipe, the effects of the selected 6 factors on the state unemployment rates are studied. In the first,an ANOVA is performed on the original data to verify if the variation in state unemployment rates are due to pure sample randomization.

Then we carry out 2 level half fractional factorial design, and use the design matrix to select 32 (= 2⁵⁾ matching experimental runs. The fractional factorial design carefully selects a subset (32 obs in this study) of the experimental runs of a full factorial design (4877 obs in this study), and ensures that the selected subset maintains the most important features of the problem studied (main effects and low-order interaction effects among the factors in this study). By doing so, we can test the hypothesis with fewer number of experimental runs, and thus reduce the experimental time and costs substantially.

In the last, we perform an ANOVA to the selected subset sample, compare the estimates and concludes the results.

Randomize, Replicate and Block

As discussed in the previous section, the data are randomly selected, however, they are not randomly assigned and executed.

There are no replicated and/or blocking used in this experiment.

3. (Statistical) Analysis for the Original Data

(Exploratory Data Analysis) Graphics and descriptive summary

The boxplot shows that the the math scores do not vary alot among different levels of the selected factors, noted that the median state unemployment rate in the “-1” group is similar to the “+1” group for all the 6 selected factors. Therefore,it is possible that the variation in state unemployment is due to sample randomization only.

The histogram of the response variable shows that the data are slightly right skewed, thus further model adequecy checking is needed to test the underlying population distribution.

#bloxplots 
par(mfrow=c(2,3))
boxplot(data1$stateur~data1$white,xlab="White/non-White",ylab="State Unemployment Rate (%)")
boxplot(data1$stateur~data1$school,xlab="Education (12 years of school or not)",ylab="State Unemployment Rate (%)")
boxplot(data1$stateur~data1$gender,xlab="Gender",ylab="State Unemployment Rate (%)")
boxplot(data1$stateur~data1$marriage,xlab="Married or Not",ylab="State Unemployment Rate (%)")
boxplot(data1$stateur~data1$kids,xlab="Had kids or Not",ylab="State Unemployment Rate (%)")
boxplot(data1$stateur~data1$headofhh,xlab="Head or the Household or Not",ylab="State Unemployment Rate (%)")

plot of chunk unnamed-chunk-3

#histograms
par(mfrow=c(1,1))
hist(stateur, xlab="state unemployment rate", main = "Histogram of the State Unemployment Rate")

plot of chunk unnamed-chunk-3

Testing for the Original Data

According to the ANOVA result from the full factorial design, the effect of gender (p-value = 0.0046), and the interaction effect of “marriage” and “headof hh” (p-value = 4e-5) are shown to be statistically siginificant at a 0.05 level. Therefore we reject the null hypothesis that the variation in state unemployment rate is due to sample randomization only.

model1 = aov (stateur~(white+school+gender+marriage+kids+headofhh)^2, data = data1)
summary(model1)

##                     Df Sum Sq Mean Sq F value Pr(>F)    
## white                1      0     0.0    0.00 0.9589    
## school               1      3     2.5    0.41 0.5220    
## gender               1     50    49.9    8.02 0.0047 ** 
## marriage             1      3     3.3    0.54 0.4645    
## kids                 1     14    13.5    2.18 0.1401    
## headofhh             1      0     0.2    0.04 0.8428    
## white:school         1      4     4.3    0.69 0.4076    
## white:gender         1     22    21.6    3.47 0.0627 .  
## white:marriage       1      2     2.0    0.32 0.5690    
## white:kids           1     10    10.5    1.68 0.1944    
## white:headofhh       1      1     0.7    0.11 0.7380    
## school:gender        1     14    13.7    2.21 0.1371    
## school:marriage      1      1     0.6    0.10 0.7467    
## school:kids          1      9     8.7    1.40 0.2368    
## school:headofhh      1      5     5.4    0.86 0.3530    
## gender:marriage      1      2     2.1    0.34 0.5574    
## gender:kids          1      0     0.0    0.00 0.9565    
## gender:headofhh      1     12    11.5    1.85 0.1735    
## marriage:kids        1      2     2.4    0.39 0.5343    
## marriage:headofhh    1    105   105.1   16.90  4e-05 ***
## kids:headofhh        1     37    36.7    5.90 0.0152 *  
## Residuals         4855  30194     6.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4. Fractional Factorial Design

Now we carry out the 2 level half fractional factorial design, and use the design matrix to select a subset from the full factorial design runs: Step 1: Use the “FrF2” fuction to generate a 2^6-1 fractional factorial design with 32 observations, and the design matrix is saved in data2. Step 2: Use the design matrix to select the matching experimental runs from the original data, and the results are stored in data 3. Since there are multiple matching observations for each entry in the design matrix, we use the “unique” function to delete the identical observations and save the results in data4. Ideally there should be 32 unique observations in data4 for the purpose of perfectly balanced selection, however, we were only able to match 29 unique observations in the original data to the design matrix. (Noted that 2⁵ fractional factorial design includes all the possible combinations of factors while in a true experiment, it is possible that only part of the full combinations are carried out). Step 3: The response variable is matched to the selected 29 obs, and the final subset data are saved in data5 with 29 obs, 1 response variable and 6 factors (each with 2 levels).

# Construct the fractional factorial design
library(FrF2)

## Warning: package 'FrF2' was built under R version 3.1.2

## Loading required package: DoE.base

## Warning: package 'DoE.base' was built under R version 3.1.2

## Loading required package: grid
## Loading required package: conf.design
## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design

halffrac = FrF2(32,nfactors=6,estimable=formula("~white+school+gender+marriage+kids+headofhh+white:(school+gender+marriage+kids+headofhh)"),factor.names=c("white","school","gender","marriage","kids","headofhh"),clear=FALSE)
halffrac

##    white school gender marriage kids headofhh
## 1     -1      1      1        1   -1        1
## 2      1      1      1       -1    1       -1
## 3     -1      1     -1        1   -1       -1
## 4     -1      1     -1       -1    1       -1
## 5      1      1     -1       -1   -1       -1
## 6      1     -1      1        1   -1        1
## 7      1     -1      1        1    1       -1
## 8     -1     -1     -1        1    1       -1
## 9     -1     -1      1        1   -1       -1
## 10    -1      1      1        1    1       -1
## 11     1      1     -1       -1    1        1
## 12     1     -1      1       -1   -1       -1
## 13    -1     -1     -1       -1   -1       -1
## 14    -1      1      1       -1    1        1
## 15    -1      1     -1       -1   -1        1
## 16    -1     -1     -1       -1    1        1
## 17    -1     -1     -1        1   -1        1
## 18     1     -1     -1        1    1        1
## 19     1     -1     -1       -1    1       -1
## 20     1      1     -1        1   -1        1
## 21    -1     -1      1        1    1        1
## 22     1      1      1        1   -1       -1
## 23    -1     -1      1       -1   -1        1
## 24    -1     -1      1       -1    1       -1
## 25     1     -1      1       -1    1        1
## 26     1      1      1        1    1        1
## 27     1      1      1       -1   -1        1
## 28     1      1     -1        1    1       -1
## 29     1     -1     -1        1   -1       -1
## 30    -1      1     -1        1    1        1
## 31    -1      1      1       -1   -1       -1
## 32     1     -1     -1       -1   -1        1
## class=design, type= FrF2.estimable

aliasprint(halffrac)

## $legend
## [1] A=white    B=school   C=gender   D=marriage E=kids     F=headofhh
## 
## [[2]]
## [1] no aliasing among main effects and 2fis

data2<-data.frame(halffrac)
# Use the design matrix to select the experimental runs
data3 <- merge(data2,data1, by=c("white","school","gender","marriage","kids","headofhh"))
data4 <- unique(data3[ , 1:6])
rownames(data4)

##  [1] "1"    "401"  "425"  "1565" "1585" "1753" "1759" "2103" "2106" "2234"
## [11] "2235" "2251" "2403" "2429" "2440" "2442" "2462" "2536" "2540" "2548"
## [21] "2612" "2643" "2662" "2664" "2684" "2714" "2715" "2751" "2764"

stateur <- data3$stateur[index=c(1,401,425,1565,1585,1753,1759,2103,2106,2234,2235,2251,2403,2429,2440,2442,2462,2536,2540,2548,2612,2643,2662,2664,2684,2714,2715,2751,2764)]
data5<-cbind(stateur,data4)
summary(data5)

##     stateur      white   school  gender  marriage kids    headofhh
##  Min.   : 3.60   -1:16   -1:16   -1:15   -1:15    -1:16   -1:14   
##  1st Qu.: 5.70   1 :13   1 :13   1 :14   1 :14    1 :13   1 :15   
##  Median : 7.20                                                    
##  Mean   : 7.28                                                    
##  3rd Qu.: 8.90                                                    
##  Max.   :11.50

detach(data1)
attach(data5)

## The following object is masked _by_ .GlobalEnv:
## 
##     stateur

5. (Statistical) Analysis for the Fractional Designed Data

(Exploratory Data Analysis) Graphics and descriptive summary

The boxplot shows that the the math scores do not vary alot among different levels of “Gender” and “Had Kids or Not”. For other factors, the median state unemployment rate in the “-1” group shows difference to the “+1” group. Therefore,it is possible that the variation in state unemployment is not due to sample randomization only, and some factors might be able to explain the variation.

The histogram of the response variable shows that the data is not perfectly normally distributed, thus further model adequecy checking is needed to test the underlying population distribution.

#bloxplots 
par(mfrow=c(2,3))
boxplot(stateur~white,xlab="White/non-White",ylab="State Unemployment Rate (%)")
boxplot(stateur~school,xlab="Education (12 years of school or not)",ylab="State Unemployment Rate (%)")
boxplot(stateur~gender,xlab="Gender",ylab="State Unemployment Rate (%)")
boxplot(stateur~marriage,xlab="Married or Not",ylab="State Unemployment Rate (%)")
boxplot(stateur~kids,xlab="Had kids or Not",ylab="State Unemployment Rate (%)")
boxplot(stateur~headofhh,xlab="Head or the Household or Not",ylab="State Unemployment Rate (%)")

plot of chunk unnamed-chunk-6

#histograms
par(mfrow=c(1,1))
hist(stateur, xlab="state unemployment rate", main ="Histogram of the State Unemployment Rate")

plot of chunk unnamed-chunk-6

Testing for the Original Data

According to the ANOVA result from the fractional factorial design susbset data, none of the main effect of selected variables shows statistical significance. For the interaction effects, “white and kids”, “shcool and gender”, “marriage and headofhh” are shown to be statistically siginificant at the 0.05 level. Therefore we reject the null hypothesis that the variation in state unemployment rate is due to sample randomization only. Noted that by comparing this result to the ones we obtained from the full factorial design, we may conclude that the 2 level fractional factorial design helps to maintain part of the complete results and reduces the estimation time and cost sustantially.

model2 = aov (stateur~(white+school+gender+marriage+kids+headofhh)^2, data = data5)
summary(model2)

##                   Df Sum Sq Mean Sq F value Pr(>F)  
## white              1   0.15    0.15    0.03  0.859  
## school             1   6.63    6.63    1.52  0.258  
## gender             1   0.04    0.04    0.01  0.924  
## marriage           1   0.03    0.03    0.01  0.940  
## kids               1   4.69    4.69    1.07  0.335  
## headofhh           1   0.84    0.84    0.19  0.675  
## white:school       1   4.73    4.73    1.08  0.333  
## white:gender       1  18.16   18.16    4.15  0.081 .
## white:marriage     1  11.21   11.21    2.56  0.153  
## white:kids         1  22.82   22.82    5.22  0.056 .
## white:headofhh     1   0.66    0.66    0.15  0.709  
## school:gender      1   0.35    0.35    0.08  0.786  
## school:marriage    1   1.99    1.99    0.46  0.522  
## school:kids        1   0.03    0.03    0.01  0.932  
## school:headofhh    1   6.73    6.73    1.54  0.255  
## gender:marriage    1   0.24    0.24    0.05  0.822  
## gender:kids        1   0.27    0.27    0.06  0.812  
## gender:headofhh    1   6.17    6.17    1.41  0.274  
## marriage:kids      1   0.07    0.07    0.01  0.906  
## marriage:headofhh  1   1.88    1.88    0.43  0.533  
## kids:headofhh      1   1.03    1.03    0.24  0.642  
## Residuals          7  30.59    4.37                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6. Diagnostics/Model Adequacy Checking

The Q-Q Norm plots test the normality of the sample. The result (tails not matching the qqline perfectly) shows that our subset sample might not satisfy the assumption of normal distribution perfectly, therefore resampling might be needed to ensure the validity of model estimates.

The scatter plots of the residuals and the fitted model does not indicate any obvious trend, thus we may conclude that data in the selected subset are randomly distributed.

#qqplot
qqnorm(residuals(model2),ylab="State Unemployment Rate")
qqline(residuals(model2),ylab="State Unemployment Rate")

plot of chunk unnamed-chunk-8

#Fitted Y vs. Residuals
plot(fitted(model2), residuals(model2))

plot of chunk unnamed-chunk-8

6. References to the literature

http://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf