In this project, I will be examining the ‘Health Insurance and Hours Worked by Wives’ data set found in the Ecdat package.
library("Ecdat")
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.2.5
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
hours <- HI
colnames(hours)
## [1] "whrswk" "hhi" "whi" "hhi2" "education"
## [6] "race" "hispanic" "experience" "kidslt6" "kids618"
## [11] "husby" "region" "wght"
colnames <- c("WifeWorkHours", "HusbandHI", "WifeHI", "Race", "Region")
project <- data.frame(hours$whrswk, hours$hhi, hours$whi, hours$race, hours$region)
colnames(project) <- colnames
project3 <- subset(project, project$Region == "northcentral" | project$Region == "south" | project$Region == "west")
In this experiment, four variables will be examined: two binary factors and two three-level factors.
HusbandHI: is the wife covered by the husband’s health insurance? Yes/No
WifeHI: Wife has health insurance through her job? Yes/No
Race: White/Black/Other (NOTE: it is unclear whether this field denotes the race of the wife, the husband, or both. Here we will assume it denotes the race of the wife.)
Region: Northcentral/South/West
The response variable is continous variable denoting the hours the wife works per week.
As per the instructions for Project #3, the two three-level variables will be represented by two two-level factors (each). Therefore, we will have a 2^6 experimental design.
head(project3, 10)
## WifeWorkHours HusbandHI WifeHI Race Region
## 1 0 no no white northcentral
## 2 50 no yes white northcentral
## 3 40 yes no white northcentral
## 4 40 no yes white northcentral
## 5 0 yes no white northcentral
## 6 40 yes yes white northcentral
## 7 40 yes no white northcentral
## 8 25 no no white northcentral
## 9 45 no yes white northcentral
## 10 30 no no white northcentral
As mentioned above, the full factorial experimental design is a 2^6 design. A fractional factorial design of 2^3 will be utilized to approximate a full factorial experimental design. This will be done using the FrF2 package in R.
The data can be assumed to be randomly collected, as it is part of an R package of data sets that are meant to be used in analyses like these. There are no deliberate repeated measures in this data set, as husband/wife pairings can be assumed to be unique data points (no polygamy, probably). No blocking factors were specified.
Before we split the three-level factors into several two-level factors, we will do some exploratory data analysis.
HHI <- as.factor(project3$HusbandHI)
WHI <- as.factor(project3$WifeHI)
WWH <- as.numeric(project3$WifeWorkHours)
Race <- as.factor(project3$Race)
Region <- as.factor(project3$Region)
boxplot(WWH ~ HHI, xlab = "Husband Health Insurance", ylab = "Wife Working Hours/Week")
boxplot(WWH ~ WHI, xlab = "Wife Health Insurance", ylab = "Wife Working Hours/Week")
boxplot(WWH ~ Race, xlab = "Race", ylab = "Wife Working Hours/Week")
boxplot(WWH ~ Region, xlab = "Region", ylab = "Wife Working Hours/Week")
Now that we’ve looked at that, we can split the two three-level factors into two two-level factors (each). These new factors will be WhiteB, BlackB, SouthB, and WestB (where the B stands for ‘binary’). Each factor will be either 0 or 1. The third level of the original factor is when both of the binary factors are zero. For example, if the region is neither ‘south’ (SouthB = 0) nor ‘west’ (WestB = 0), then the region must be ‘northcentral.’
White <- matrix(nrow = 17102, ncol = 1)
Black <- matrix(nrow = 17102, ncol = 1)
South <- matrix(nrow = 17102, ncol = 1)
West <- matrix(nrow = 17102, ncol = 1)
project33 <- data.frame(project3, White, Black, South, West)
colnames(project33) <- c("WifeWorkHours", "HusbandHI", "WifeHI", "Race3", "Region3", "WhiteB", "BlackB", "SouthB", "WestB")
project33$WhiteB[project33$Race3 == "white"] <- 1
project33$WhiteB[project33$Race3 == "black"] <- 0
project33$WhiteB[project33$Race3 == "other"] <- 0
project33$BlackB[project33$Race3 == "white"] <- 0
project33$BlackB[project33$Race3 == "other"] <- 0
project33$BlackB[project33$Race3 == "black"] <- 1
project33$WestB[project33$Region3 == "west"] <- 1
project33$WestB[project33$Region3 == "south"] <- 0
project33$WestB[project33$Region3 == "northcentral"] <- 0
project33$SouthB[project33$Region3 == "west"] <- 0
project33$SouthB[project33$Region3 == "south"] <- 1
project33$SouthB[project33$Region3 == "northcentral"] <- 0
head(project33)
## WifeWorkHours HusbandHI WifeHI Race3 Region3 WhiteB BlackB SouthB
## 1 0 no no white northcentral 1 0 0
## 2 50 no yes white northcentral 1 0 0
## 3 40 yes no white northcentral 1 0 0
## 4 40 no yes white northcentral 1 0 0
## 5 0 yes no white northcentral 1 0 0
## 6 40 yes yes white northcentral 1 0 0
## WestB
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
First, we’ll look at the full factorial experimental design of 64 runs:
expand.grid(HusbandHI = c(0,1), WifeHI = c(0,1), White = c(0,1), Black = c(0,1), South = c(0,1), West = c(0,1))
## HusbandHI WifeHI White Black South West
## 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 0 0 1 0 0 0
## 6 1 0 1 0 0 0
## 7 0 1 1 0 0 0
## 8 1 1 1 0 0 0
## 9 0 0 0 1 0 0
## 10 1 0 0 1 0 0
## 11 0 1 0 1 0 0
## 12 1 1 0 1 0 0
## 13 0 0 1 1 0 0
## 14 1 0 1 1 0 0
## 15 0 1 1 1 0 0
## 16 1 1 1 1 0 0
## 17 0 0 0 0 1 0
## 18 1 0 0 0 1 0
## 19 0 1 0 0 1 0
## 20 1 1 0 0 1 0
## 21 0 0 1 0 1 0
## 22 1 0 1 0 1 0
## 23 0 1 1 0 1 0
## 24 1 1 1 0 1 0
## 25 0 0 0 1 1 0
## 26 1 0 0 1 1 0
## 27 0 1 0 1 1 0
## 28 1 1 0 1 1 0
## 29 0 0 1 1 1 0
## 30 1 0 1 1 1 0
## 31 0 1 1 1 1 0
## 32 1 1 1 1 1 0
## 33 0 0 0 0 0 1
## 34 1 0 0 0 0 1
## 35 0 1 0 0 0 1
## 36 1 1 0 0 0 1
## 37 0 0 1 0 0 1
## 38 1 0 1 0 0 1
## 39 0 1 1 0 0 1
## 40 1 1 1 0 0 1
## 41 0 0 0 1 0 1
## 42 1 0 0 1 0 1
## 43 0 1 0 1 0 1
## 44 1 1 0 1 0 1
## 45 0 0 1 1 0 1
## 46 1 0 1 1 0 1
## 47 0 1 1 1 0 1
## 48 1 1 1 1 0 1
## 49 0 0 0 0 1 1
## 50 1 0 0 0 1 1
## 51 0 1 0 0 1 1
## 52 1 1 0 0 1 1
## 53 0 0 1 0 1 1
## 54 1 0 1 0 1 1
## 55 0 1 1 0 1 1
## 56 1 1 1 0 1 1
## 57 0 0 0 1 1 1
## 58 1 0 0 1 1 1
## 59 0 1 0 1 1 1
## 60 1 1 0 1 1 1
## 61 0 0 1 1 1 1
## 62 1 0 1 1 1 1
## 63 0 1 1 1 1 1
## 64 1 1 1 1 1 1
Obviously, this would require a significant amount of time and resources to go through, so we will approximate it with a fractional factorial design. The fractional factorial design we chose is a 2^(m-3), or a 2^3 design using the R package ‘FrF2.’ Note that the use of binary values (0,1) in the fractional factorial design denote whether the factor is true (1) or not (0). The highest resolution possible for this design (six factors, eight runs) is III. We can also use the FrF2 package to determine the aliasing structure of the fractional factorial design, and the generators of the fractional factorial design.
library("FrF2")
## Warning: package 'FrF2' was built under R version 3.2.5
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.2.5
## Loading required package: grid
## Loading required package: conf.design
##
## Attaching package: 'DoE.base'
## The following objects are masked from 'package:stats':
##
## aov, lm
## The following object is masked from 'package:graphics':
##
## plot.design
## The following object is masked from 'package:base':
##
## lengths
design <- FrF2(8, factor.names = c("HusbandHI", "WifeHI", "White", "Black", "South", "West"), default.levels = c("0", "1"))
design
## HusbandHI WifeHI White Black South West
## 1 0 1 1 0 0 1
## 2 1 0 1 0 1 0
## 3 0 0 0 1 1 1
## 4 0 0 1 1 0 0
## 5 0 1 0 0 1 0
## 6 1 1 0 1 0 0
## 7 1 1 1 1 1 1
## 8 1 0 0 0 0 1
## class=design, type= FrF2
summary(design)
## Call:
## FrF2(8, factor.names = c("HusbandHI", "WifeHI", "White", "Black",
## "South", "West"), default.levels = c("0", "1"))
##
## Experimental design of type FrF2
## 8 runs
##
## Factor settings (scale ends):
## HusbandHI WifeHI White Black South West
## 1 0 0 0 0 0 0
## 2 1 1 1 1 1 1
##
## Design generating information:
## $legend
## [1] A=HusbandHI B=WifeHI C=White D=Black E=South F=West
##
## $generators
## [1] D=AB E=AC F=BC
##
##
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
##
## $fi2
## [1] AF=BE=CD
##
##
## The design itself:
## HusbandHI WifeHI White Black South West
## 1 0 1 1 0 0 1
## 2 1 0 1 0 1 0
## 3 0 0 0 1 1 1
## 4 0 0 1 1 0 0
## 5 0 1 0 0 1 0
## 6 1 1 0 1 0 0
## 7 1 1 1 1 1 1
## 8 1 0 0 0 0 1
## class=design, type= FrF2
aliasprint(design)
## $legend
## [1] A=HusbandHI B=WifeHI C=White D=Black E=South F=West
##
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
##
## $fi2
## [1] AF=BE=CD
From this, we gather that the generators of the fractional factorial design are D = AB, E = AC, F = BC. The aliasing structure shows that the main effects are aliased with some of the two factor interactions, which are highly aliased with each other: AF = BE = CD. This is because the design is only a resolution III design, so only the main effects are a reliable source of analysis.
The generator for the fractional factorial experiment can also be calculated separately. Through the aliased interactions D = AB, E = AC, and F = BC the generator, I, can be calculated as I = ABD = ACE = BCF.
Now we can estimate the main and interaction effects using a linear model and ANOVA analysis. This can be seen here, with all the main effects and all the two-factor interaction effects:
fit <- lm(project33$WifeWorkHours ~ (project33$HusbandHI + project33$WifeHI + project33$WhiteB + project33$BlackB + project33$SouthB + project33$WestB)^2)
anova(fit)
## Analysis of Variance Table
##
## Response: project33$WifeWorkHours
## Df Sum Sq Mean Sq F value
## project33$HusbandHI 1 97859 97859 380.6127
## project33$WifeHI 1 1582215 1582215 6153.8614
## project33$WhiteB 1 5625 5625 21.8785
## project33$BlackB 1 715 715 2.7803
## project33$SouthB 1 147 147 0.5731
## project33$WestB 1 3229 3229 12.5581
## project33$HusbandHI:project33$WifeHI 1 1987 1987 7.7265
## project33$HusbandHI:project33$WhiteB 1 242 242 0.9395
## project33$HusbandHI:project33$BlackB 1 732 732 2.8468
## project33$HusbandHI:project33$SouthB 1 251 251 0.9756
## project33$HusbandHI:project33$WestB 1 4 4 0.0168
## project33$WifeHI:project33$WhiteB 1 2501 2501 9.7267
## project33$WifeHI:project33$BlackB 1 147 147 0.5721
## project33$WifeHI:project33$SouthB 1 945 945 3.6757
## project33$WifeHI:project33$WestB 1 1097 1097 4.2685
## project33$WhiteB:project33$SouthB 1 1151 1151 4.4748
## project33$WhiteB:project33$WestB 1 291 291 1.1306
## project33$BlackB:project33$SouthB 1 898 898 3.4931
## project33$BlackB:project33$WestB 1 1031 1031 4.0088
## Residuals 17082 4391942 257
## Pr(>F)
## project33$HusbandHI < 2.2e-16 ***
## project33$WifeHI < 2.2e-16 ***
## project33$WhiteB 2.927e-06 ***
## project33$BlackB 0.0954481 .
## project33$SouthB 0.4490530
## project33$WestB 0.0003956 ***
## project33$HusbandHI:project33$WifeHI 0.0054477 **
## project33$HusbandHI:project33$WhiteB 0.3324294
## project33$HusbandHI:project33$BlackB 0.0915754 .
## project33$HusbandHI:project33$SouthB 0.3232944
## project33$HusbandHI:project33$WestB 0.8969055
## project33$WifeHI:project33$WhiteB 0.0018192 **
## project33$WifeHI:project33$BlackB 0.4494280
## project33$WifeHI:project33$SouthB 0.0552280 .
## project33$WifeHI:project33$WestB 0.0388394 *
## project33$WhiteB:project33$SouthB 0.0344133 *
## project33$WhiteB:project33$WestB 0.2876671
## project33$BlackB:project33$SouthB 0.0616440 .
## project33$BlackB:project33$WestB 0.0452784 *
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also look at just the main effects:
fit2 <- lm(WifeWorkHours ~ HusbandHI + WifeHI + WhiteB + BlackB + SouthB + WestB, data = project33)
anova(fit2)
## Analysis of Variance Table
##
## Response: WifeWorkHours
## Df Sum Sq Mean Sq F value Pr(>F)
## HusbandHI 1 97859 97859 379.9270 < 2.2e-16 ***
## WifeHI 1 1582215 1582215 6142.7741 < 2.2e-16 ***
## WhiteB 1 5625 5625 21.8390 2.988e-06 ***
## BlackB 1 715 715 2.7753 0.0957471 .
## SouthB 1 147 147 0.5720 0.4494618
## WestB 1 3229 3229 12.5354 0.0004004 ***
## Residuals 17095 4403217 258
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you can see, the conclusion remains the same because the main effects are aliased with the two-factor interactions.
We can also examine the plots of the residuals:
par(mfrow = c(1,2))
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit), residuals(fit))
It is important to note that the Q-Q plot shows that this data probably does not follow a normal distribution, so further analysis could be done on what distribution best fits this data, or potential confounding factors that were left out of the data (husband’s salary perhaps).
Montgomery, Douglas C. Design and analysis of experiments. John Wiley & Sons, 2008.
The raw data can be found in the Ecdat package in R. A pdf describing the data sets found in this package can be found here. The description of the ‘Health Insurance and Hours Worked by Wives’ is found on page 54.
The complete code in this project can be seen here:
library("Ecdat")
hours <- HI
colnames(hours)
colnames <- c("WifeWorkHours", "HusbandHI", "WifeHI", "Race", "Region")
#assigns relevant columns to a new dataframe for simpler analysis
project <- data.frame(hours$whrswk, hours$hhi, hours$whi, hours$race, hours$region)
colnames(project) <- colnames
#subsets into new data set to discard values of 'other' in the region factor
project3 <- subset(project, project$Region == "northcentral" | project$Region == "south" | project$Region == "west")
#displays the first 10 rows of the 2^2 * 3^2 data set (project3)
head(project3, 10)
#sets columns as appropriate value types
HHI <- as.factor(project3$HusbandHI)
WHI <- as.factor(project3$WifeHI)
WWH <- as.numeric(project3$WifeWorkHours)
Race <- as.factor(project3$Race)
Region <- as.factor(project3$Region)
#displays a 4x4 grid of plots
par(mfrow = c(2,2))
#boxplots
boxplot(WWH ~ HHI, xlab = "Husband Health Insurance", ylab = "Wife Working Hours/Week")
boxplot(WWH ~ WHI, xlab = "Wife Health Insurance", ylab = "Wife Working Hours/Week")
boxplot(WWH ~ Race, xlab = "Race", ylab = "Wife Working Hours/Week")
boxplot(WWH ~ Region, xlab = "Region", ylab = "Wife Working Hours/Week")
#starts the process to decompose three-level factors into two two-level factors (each) by creating empty matrices to put data into
White <- matrix(nrow = 17102, ncol = 1)
Black <- matrix(nrow = 17102, ncol = 1)
South <- matrix(nrow = 17102, ncol = 1)
West <- matrix(nnrow = 17102, ncol = 1)
#creates another new data frame for decomposed three-level factors
project33 <- data.frame(project3, White, Black, South, West)
colnames(project33) <- c("WifeWorkHours", "HusbandHI", "WifeHI", "Race3", "Region3", "WhiteB", "BlackB", "SouthB", "WestB")
#assigns appropriate values to new binary factors
project33$WhiteB[project33$Race3 == "white"] <- 1
project33$WhiteB[project33$Race3 == "black"] <- 0
project33$WhiteB[project33$Race3 == "other"] <- 0
project33$BlackB[project33$Race3 == "white"] <- 0
project33$BlackB[project33$Race3 == "other"] <- 0
project33$BlackB[project33$Race3 == "black"] <- 1
project33$WestB[project33$Region3 == "west"] <- 1
project33$WestB[project33$Region3 == "south"] <- 0
project33$WestB[project33$Region3 == "northcentral"] <- 0
project33$SouthB[project33$Region3 == "west"] <- 0
project33$SouthB[project33$Region3 == "south"] <- 1
project33$SouthB[project33$Region3 == "northcentral"] <- 0
#displays the first 6 rows of the new data frame
head(project33)
#shows a full factorial design of 64 runs
expand.grid(HusbandHI = c(0,1), WifeHI = c(0,1), White = c(0,1), Black = c(0,1), South = c(0,1), West = c(0,1))
#loads the fractional factorial design package
library("FrF2")
#creates a 2^3 fractional factorial design
design <- FrF2(8, factor.names = c("HusbandHI", "WifeHI", "White", "Black", "South", "West"), default.levels = c("0", "1"))
#displays this fractional factorial design
design
#shows aliases, generators, and other information on the fractional factorial design
summary(design)
#shows the aliasing structure of the fractional factorial design
aliasprint(design)
#creates a linear model with all main and interaction effects and does an analysis of variance on it
fit <- lm(WifeWorkHours ~ (HusbandHI + WifeHI + WhiteB + BlackB + SouthB + WestB)^2, data = project33)
anova(fit)
#creates a linear model of all main effects and does an analysis of variance on it
fit2 <- lm(WifeWorkHours ~ HusbandHI + WifeHI + WhiteB + BlackB + SouthB + WestB, data = project33)
anova(fit2)
#plots different residual plots for normality checks
par(mfrow = c(1,2))
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit), residuals(fit))