1. Setting

System Under Test

In this project, I will be examining the ‘Health Insurance and Hours Worked by Wives’ data set found in the Ecdat package.

library("Ecdat")

## Loading required package: Ecfun

## Warning: package 'Ecfun' was built under R version 3.2.5

## 
## Attaching package: 'Ecfun'

## The following object is masked from 'package:base':
## 
##     sign

## 
## Attaching package: 'Ecdat'

## The following object is masked from 'package:datasets':
## 
##     Orange

hours <- HI
colnames(hours)

##  [1] "whrswk"     "hhi"        "whi"        "hhi2"       "education" 
##  [6] "race"       "hispanic"   "experience" "kidslt6"    "kids618"   
## [11] "husby"      "region"     "wght"

colnames <- c("WifeWorkHours", "HusbandHI", "WifeHI", "Race", "Region")

project <- data.frame(hours$whrswk, hours$hhi, hours$whi, hours$race, hours$region)
colnames(project) <- colnames

project3 <- subset(project, project$Region == "northcentral" | project$Region == "south" | project$Region == "west")

Factors and Levels

In this experiment, four variables will be examined: two binary factors and two three-level factors.

HusbandHI: is the wife covered by the husband’s health insurance? Yes/No

WifeHI: Wife has health insurance through her job? Yes/No

Race: White/Black/Other (NOTE: it is unclear whether this field denotes the race of the wife, the husband, or both. Here we will assume it denotes the race of the wife.)

Region: Northcentral/South/West

Response Variables

The response variable is continous variable denoting the hours the wife works per week.

The data: How is it organized and what does it look like?

As per the instructions for Project #3, the two three-level variables will be represented by two two-level factors (each). Therefore, we will have a 2^6 experimental design.

head(project3, 10)

##    WifeWorkHours HusbandHI WifeHI  Race       Region
## 1              0        no     no white northcentral
## 2             50        no    yes white northcentral
## 3             40       yes     no white northcentral
## 4             40        no    yes white northcentral
## 5              0       yes     no white northcentral
## 6             40       yes    yes white northcentral
## 7             40       yes     no white northcentral
## 8             25        no     no white northcentral
## 9             45        no    yes white northcentral
## 10            30        no     no white northcentral

Experimental Design

How will the experiment be organized and conducted to test the hypothesis?

As mentioned above, the full factorial experimental design is a 2^6 design. A fractional factorial design of 2^3 will be utilized to approximate a full factorial experimental design. This will be done using the FrF2 package in R.

Design Rationale

Randomization, Replication, and Blocking

The data can be assumed to be randomly collected, as it is part of an R package of data sets that are meant to be used in analyses like these. There are no deliberate repeated measures in this data set, as husband/wife pairings can be assumed to be unique data points (no polygamy, probably). No blocking factors were specified.

3. Statistical Analysis

Exploratory Data Analysis

Before we split the three-level factors into several two-level factors, we will do some exploratory data analysis.

HHI <- as.factor(project3$HusbandHI)
WHI <- as.factor(project3$WifeHI)
WWH <- as.numeric(project3$WifeWorkHours)
Race <- as.factor(project3$Race)
Region <- as.factor(project3$Region)

boxplot(WWH ~ HHI, xlab = "Husband Health Insurance", ylab = "Wife Working Hours/Week")

boxplot(WWH ~ WHI, xlab = "Wife Health Insurance", ylab = "Wife Working Hours/Week")

boxplot(WWH ~ Race, xlab = "Race", ylab = "Wife Working Hours/Week")

boxplot(WWH ~ Region, xlab = "Region", ylab = "Wife Working Hours/Week")

Now that we’ve looked at that, we can split the two three-level factors into two two-level factors (each). These new factors will be WhiteB, BlackB, SouthB, and WestB (where the B stands for ‘binary’). Each factor will be either 0 or 1. The third level of the original factor is when both of the binary factors are zero. For example, if the region is neither ‘south’ (SouthB = 0) nor ‘west’ (WestB = 0), then the region must be ‘northcentral.’

White <- matrix(nrow = 17102, ncol = 1)
Black <- matrix(nrow = 17102, ncol = 1)
South <- matrix(nrow = 17102, ncol = 1)
West <- matrix(nrow = 17102, ncol = 1)

project33 <- data.frame(project3, White, Black, South, West)
colnames(project33) <- c("WifeWorkHours", "HusbandHI", "WifeHI", "Race3", "Region3", "WhiteB", "BlackB", "SouthB", "WestB")

project33$WhiteB[project33$Race3 == "white"] <- 1
project33$WhiteB[project33$Race3 == "black"] <- 0
project33$WhiteB[project33$Race3 == "other"] <- 0

project33$BlackB[project33$Race3 == "white"] <- 0
project33$BlackB[project33$Race3 == "other"] <- 0
project33$BlackB[project33$Race3 == "black"] <- 1

project33$WestB[project33$Region3 == "west"] <- 1
project33$WestB[project33$Region3 == "south"] <- 0
project33$WestB[project33$Region3 == "northcentral"] <- 0

project33$SouthB[project33$Region3 == "west"] <- 0
project33$SouthB[project33$Region3 == "south"] <- 1
project33$SouthB[project33$Region3 == "northcentral"] <- 0

head(project33)

##   WifeWorkHours HusbandHI WifeHI Race3      Region3 WhiteB BlackB SouthB
## 1             0        no     no white northcentral      1      0      0
## 2            50        no    yes white northcentral      1      0      0
## 3            40       yes     no white northcentral      1      0      0
## 4            40        no    yes white northcentral      1      0      0
## 5             0       yes     no white northcentral      1      0      0
## 6            40       yes    yes white northcentral      1      0      0
##   WestB
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0

Testing

First, we’ll look at the full factorial experimental design of 64 runs:

expand.grid(HusbandHI = c(0,1), WifeHI = c(0,1), White = c(0,1), Black = c(0,1), South = c(0,1), West = c(0,1))

##    HusbandHI WifeHI White Black South West
## 1          0      0     0     0     0    0
## 2          1      0     0     0     0    0
## 3          0      1     0     0     0    0
## 4          1      1     0     0     0    0
## 5          0      0     1     0     0    0
## 6          1      0     1     0     0    0
## 7          0      1     1     0     0    0
## 8          1      1     1     0     0    0
## 9          0      0     0     1     0    0
## 10         1      0     0     1     0    0
## 11         0      1     0     1     0    0
## 12         1      1     0     1     0    0
## 13         0      0     1     1     0    0
## 14         1      0     1     1     0    0
## 15         0      1     1     1     0    0
## 16         1      1     1     1     0    0
## 17         0      0     0     0     1    0
## 18         1      0     0     0     1    0
## 19         0      1     0     0     1    0
## 20         1      1     0     0     1    0
## 21         0      0     1     0     1    0
## 22         1      0     1     0     1    0
## 23         0      1     1     0     1    0
## 24         1      1     1     0     1    0
## 25         0      0     0     1     1    0
## 26         1      0     0     1     1    0
## 27         0      1     0     1     1    0
## 28         1      1     0     1     1    0
## 29         0      0     1     1     1    0
## 30         1      0     1     1     1    0
## 31         0      1     1     1     1    0
## 32         1      1     1     1     1    0
## 33         0      0     0     0     0    1
## 34         1      0     0     0     0    1
## 35         0      1     0     0     0    1
## 36         1      1     0     0     0    1
## 37         0      0     1     0     0    1
## 38         1      0     1     0     0    1
## 39         0      1     1     0     0    1
## 40         1      1     1     0     0    1
## 41         0      0     0     1     0    1
## 42         1      0     0     1     0    1
## 43         0      1     0     1     0    1
## 44         1      1     0     1     0    1
## 45         0      0     1     1     0    1
## 46         1      0     1     1     0    1
## 47         0      1     1     1     0    1
## 48         1      1     1     1     0    1
## 49         0      0     0     0     1    1
## 50         1      0     0     0     1    1
## 51         0      1     0     0     1    1
## 52         1      1     0     0     1    1
## 53         0      0     1     0     1    1
## 54         1      0     1     0     1    1
## 55         0      1     1     0     1    1
## 56         1      1     1     0     1    1
## 57         0      0     0     1     1    1
## 58         1      0     0     1     1    1
## 59         0      1     0     1     1    1
## 60         1      1     0     1     1    1
## 61         0      0     1     1     1    1
## 62         1      0     1     1     1    1
## 63         0      1     1     1     1    1
## 64         1      1     1     1     1    1

Obviously, this would require a significant amount of time and resources to go through, so we will approximate it with a fractional factorial design. The fractional factorial design we chose is a 2^(m-3), or a 2^3 design using the R package ‘FrF2.’ Note that the use of binary values (0,1) in the fractional factorial design denote whether the factor is true (1) or not (0). The highest resolution possible for this design (six factors, eight runs) is III. We can also use the FrF2 package to determine the aliasing structure of the fractional factorial design, and the generators of the fractional factorial design.

library("FrF2")

## Warning: package 'FrF2' was built under R version 3.2.5

## Loading required package: DoE.base

## Warning: package 'DoE.base' was built under R version 3.2.5

## Loading required package: grid

## Loading required package: conf.design

## 
## Attaching package: 'DoE.base'

## The following objects are masked from 'package:stats':
## 
##     aov, lm

## The following object is masked from 'package:graphics':
## 
##     plot.design

## The following object is masked from 'package:base':
## 
##     lengths

design <- FrF2(8, factor.names = c("HusbandHI", "WifeHI", "White", "Black", "South", "West"), default.levels = c("0", "1"))

design

##   HusbandHI WifeHI White Black South West
## 1         0      1     1     0     0    1
## 2         1      0     1     0     1    0
## 3         0      0     0     1     1    1
## 4         0      0     1     1     0    0
## 5         0      1     0     0     1    0
## 6         1      1     0     1     0    0
## 7         1      1     1     1     1    1
## 8         1      0     0     0     0    1
## class=design, type= FrF2

summary(design)

## Call:
## FrF2(8, factor.names = c("HusbandHI", "WifeHI", "White", "Black", 
##     "South", "West"), default.levels = c("0", "1"))
## 
## Experimental design of type  FrF2 
## 8  runs
## 
## Factor settings (scale ends):
##   HusbandHI WifeHI White Black South West
## 1         0      0     0     0     0    0
## 2         1      1     1     1     1    1
## 
## Design generating information:
## $legend
## [1] A=HusbandHI B=WifeHI    C=White     D=Black     E=South     F=West     
## 
## $generators
## [1] D=AB E=AC F=BC
## 
## 
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD
## 
## 
## The design itself:
##   HusbandHI WifeHI White Black South West
## 1         0      1     1     0     0    1
## 2         1      0     1     0     1    0
## 3         0      0     0     1     1    1
## 4         0      0     1     1     0    0
## 5         0      1     0     0     1    0
## 6         1      1     0     1     0    0
## 7         1      1     1     1     1    1
## 8         1      0     0     0     0    1
## class=design, type= FrF2

aliasprint(design)

## $legend
## [1] A=HusbandHI B=WifeHI    C=White     D=Black     E=South     F=West     
## 
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD

From this, we gather that the generators of the fractional factorial design are D = AB, E = AC, F = BC. The aliasing structure shows that the main effects are aliased with some of the two factor interactions, which are highly aliased with each other: AF = BE = CD. This is because the design is only a resolution III design, so only the main effects are a reliable source of analysis.

6020:

The generator for the fractional factorial experiment can also be calculated separately. Through the aliased interactions D = AB, E = AC, and F = BC the generator, I, can be calculated as I = ABD = ACE = BCF.

Estimation

Now we can estimate the main and interaction effects using a linear model and ANOVA analysis. This can be seen here, with all the main effects and all the two-factor interaction effects:

fit <- lm(project33$WifeWorkHours ~ (project33$HusbandHI + project33$WifeHI + project33$WhiteB + project33$BlackB + project33$SouthB + project33$WestB)^2)

anova(fit)

## Analysis of Variance Table
## 
## Response: project33$WifeWorkHours
##                                         Df  Sum Sq Mean Sq   F value
## project33$HusbandHI                      1   97859   97859  380.6127
## project33$WifeHI                         1 1582215 1582215 6153.8614
## project33$WhiteB                         1    5625    5625   21.8785
## project33$BlackB                         1     715     715    2.7803
## project33$SouthB                         1     147     147    0.5731
## project33$WestB                          1    3229    3229   12.5581
## project33$HusbandHI:project33$WifeHI     1    1987    1987    7.7265
## project33$HusbandHI:project33$WhiteB     1     242     242    0.9395
## project33$HusbandHI:project33$BlackB     1     732     732    2.8468
## project33$HusbandHI:project33$SouthB     1     251     251    0.9756
## project33$HusbandHI:project33$WestB      1       4       4    0.0168
## project33$WifeHI:project33$WhiteB        1    2501    2501    9.7267
## project33$WifeHI:project33$BlackB        1     147     147    0.5721
## project33$WifeHI:project33$SouthB        1     945     945    3.6757
## project33$WifeHI:project33$WestB         1    1097    1097    4.2685
## project33$WhiteB:project33$SouthB        1    1151    1151    4.4748
## project33$WhiteB:project33$WestB         1     291     291    1.1306
## project33$BlackB:project33$SouthB        1     898     898    3.4931
## project33$BlackB:project33$WestB         1    1031    1031    4.0088
## Residuals                            17082 4391942     257          
##                                         Pr(>F)    
## project33$HusbandHI                  < 2.2e-16 ***
## project33$WifeHI                     < 2.2e-16 ***
## project33$WhiteB                     2.927e-06 ***
## project33$BlackB                     0.0954481 .  
## project33$SouthB                     0.4490530    
## project33$WestB                      0.0003956 ***
## project33$HusbandHI:project33$WifeHI 0.0054477 ** 
## project33$HusbandHI:project33$WhiteB 0.3324294    
## project33$HusbandHI:project33$BlackB 0.0915754 .  
## project33$HusbandHI:project33$SouthB 0.3232944    
## project33$HusbandHI:project33$WestB  0.8969055    
## project33$WifeHI:project33$WhiteB    0.0018192 ** 
## project33$WifeHI:project33$BlackB    0.4494280    
## project33$WifeHI:project33$SouthB    0.0552280 .  
## project33$WifeHI:project33$WestB     0.0388394 *  
## project33$WhiteB:project33$SouthB    0.0344133 *  
## project33$WhiteB:project33$WestB     0.2876671    
## project33$BlackB:project33$SouthB    0.0616440 .  
## project33$BlackB:project33$WestB     0.0452784 *  
## Residuals                                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Creation & Adequacy Checking

We can also look at just the main effects:

fit2 <- lm(WifeWorkHours ~ HusbandHI + WifeHI + WhiteB + BlackB + SouthB + WestB, data = project33)
anova(fit2)

## Analysis of Variance Table
## 
## Response: WifeWorkHours
##              Df  Sum Sq Mean Sq   F value    Pr(>F)    
## HusbandHI     1   97859   97859  379.9270 < 2.2e-16 ***
## WifeHI        1 1582215 1582215 6142.7741 < 2.2e-16 ***
## WhiteB        1    5625    5625   21.8390 2.988e-06 ***
## BlackB        1     715     715    2.7753 0.0957471 .  
## SouthB        1     147     147    0.5720 0.4494618    
## WestB         1    3229    3229   12.5354 0.0004004 ***
## Residuals 17095 4403217     258                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, the conclusion remains the same because the main effects are aliased with the two-factor interactions.

We can also examine the plots of the residuals:

par(mfrow = c(1,2))
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit), residuals(fit))

It is important to note that the Q-Q plot shows that this data probably does not follow a normal distribution, so further analysis could be done on what distribution best fits this data, or potential confounding factors that were left out of the data (husband’s salary perhaps).

4. References to the literature

Montgomery, Douglas C. Design and analysis of experiments. John Wiley & Sons, 2008.

Appendices

The raw data can be found in the Ecdat package in R. A pdf describing the data sets found in this package can be found here. The description of the ‘Health Insurance and Hours Worked by Wives’ is found on page 54.

The complete code in this project can be seen here:

library("Ecdat")
hours <- HI
colnames(hours)

colnames <- c("WifeWorkHours", "HusbandHI", "WifeHI", "Race", "Region")

#assigns relevant columns to a new dataframe for simpler analysis
project <- data.frame(hours$whrswk, hours$hhi, hours$whi, hours$race, hours$region)
colnames(project) <- colnames

#subsets into new data set to discard values of 'other' in the region factor
project3 <- subset(project, project$Region == "northcentral" | project$Region == "south" | project$Region == "west")

#displays the first 10 rows of the 2^2 * 3^2 data set (project3)
head(project3, 10)

#sets columns as appropriate value types
HHI <- as.factor(project3$HusbandHI)
WHI <- as.factor(project3$WifeHI)
WWH <- as.numeric(project3$WifeWorkHours)
Race <- as.factor(project3$Race)
Region <- as.factor(project3$Region)

#displays a 4x4 grid of plots
par(mfrow = c(2,2))

#boxplots
boxplot(WWH ~ HHI, xlab = "Husband Health Insurance", ylab = "Wife Working Hours/Week")
boxplot(WWH ~ WHI, xlab = "Wife Health Insurance", ylab = "Wife Working Hours/Week")
boxplot(WWH ~ Race, xlab = "Race", ylab = "Wife Working Hours/Week")
boxplot(WWH ~ Region, xlab = "Region", ylab = "Wife Working Hours/Week")

#starts the process to decompose three-level factors into two two-level factors (each) by creating empty matrices to put data into
White <- matrix(nrow = 17102, ncol = 1)
Black <- matrix(nrow = 17102, ncol = 1)
South <- matrix(nrow = 17102, ncol = 1)
West <- matrix(nnrow = 17102, ncol = 1)

#creates another new data frame for decomposed three-level factors
project33 <- data.frame(project3, White, Black, South, West)
colnames(project33) <- c("WifeWorkHours", "HusbandHI", "WifeHI", "Race3", "Region3", "WhiteB", "BlackB", "SouthB", "WestB")

#assigns appropriate values to new binary factors
project33$WhiteB[project33$Race3 == "white"] <- 1
project33$WhiteB[project33$Race3 == "black"] <- 0
project33$WhiteB[project33$Race3 == "other"] <- 0

project33$BlackB[project33$Race3 == "white"] <- 0
project33$BlackB[project33$Race3 == "other"] <- 0
project33$BlackB[project33$Race3 == "black"] <- 1

project33$WestB[project33$Region3 == "west"] <- 1
project33$WestB[project33$Region3 == "south"] <- 0
project33$WestB[project33$Region3 == "northcentral"] <- 0

project33$SouthB[project33$Region3 == "west"] <- 0
project33$SouthB[project33$Region3 == "south"] <- 1
project33$SouthB[project33$Region3 == "northcentral"] <- 0

#displays the first 6 rows of the new data frame
head(project33)

#shows a full factorial design of 64 runs 
expand.grid(HusbandHI = c(0,1), WifeHI = c(0,1), White = c(0,1), Black = c(0,1), South = c(0,1), West = c(0,1))

#loads the fractional factorial design package
library("FrF2")

#creates a 2^3 fractional factorial design 
design <- FrF2(8, factor.names = c("HusbandHI", "WifeHI", "White", "Black", "South", "West"), default.levels = c("0", "1"))

#displays this fractional factorial design
design

#shows aliases, generators, and other information on the fractional factorial design
summary(design)

#shows the aliasing structure of the fractional factorial design
aliasprint(design)

#creates a linear model with all main and interaction effects and does an analysis of variance on it
fit <- lm(WifeWorkHours ~ (HusbandHI + WifeHI + WhiteB + BlackB + SouthB + WestB)^2, data = project33)
anova(fit)

#creates a linear model of all main effects and does an analysis of variance on it
fit2 <- lm(WifeWorkHours ~ HusbandHI + WifeHI + WhiteB + BlackB + SouthB + WestB, data = project33)
anova(fit2)

#plots different residual plots for normality checks
par(mfrow = c(1,2))
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit), residuals(fit))

DOE Project 3

Molly Renaud

12/6/2016