1. Setting

System under test

The data set tested comes from the Ecdat package in R and the data set is named Males. The data set consists of 4360 observations from 1980 to 1987

## Loading required package: Ecfun
## 
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
## 
##     sign
## 
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
## 
##     Orange

Factors and Levels

The data set contains the following factors: union: levels “no”" and “yes”

ethn: levels “black”,“hisp”, and “other”

maried: levels “no” and “yes”

health: levels “no” and “yes”

industry: levels “Agricultural”, “Mining”, “Construction”,“Trade”, “Transportation,”Finance“,”Business and Repair Service“,”Personal_Service“,”Entertainment“,”Manufacturing“,”Professional and Related Service“,”Public Administration"

occupation: levels “Professional, Technicaland kindred”, “Managers, Officials_and_Proprietors”, “Sales_Workers”,“Clerical_and_kindred”, “Craftsmen, Foremen_and_kindred”,“Operatives_and_kindred”,“Laborers_and_farmers”,“Farm_Laborers_and_Foreman”,“Service_Workers”

residence: levels rural area,north east, northern central, south

Note the residence factor will be grouped as a three-level factor by merging together south and rural area in to rural-south

Countionus Variable

The data set contains the following continous variables: nr: identifiant

year: years

school: years of schooling

exper: years of experience

wage: log of hourly wages

The Response Variable

The response is the log of hourly wages wage.

The Data

Below are the first 6 observations of the Males data set

##   nr year school exper union  ethn maried health     wage
## 1 13 1980     14     1    no other     no     no 1.197540
## 2 13 1981     14     2   yes other     no     no 1.853060
## 3 13 1982     14     3    no other     no     no 1.344462
## 4 13 1983     14     4    no other     no     no 1.433213
## 5 13 1984     14     5    no other     no     no 1.568125
## 6 13 1985     14     6    no other     no     no 1.699891
##                      industry                          occupation
## 1 Business_and_Repair_Service                     Service_Workers
## 2            Personal_Service                     Service_Workers
## 3 Business_and_Repair_Service                     Service_Workers
## 4 Business_and_Repair_Service                     Service_Workers
## 5            Personal_Service      Craftsmen, Foremen_and_kindred
## 6 Business_and_Repair_Service Managers, Officials_and_Proprietors
##    residence
## 1 north_east
## 2 north_east
## 3 north_east
## 4 north_east
## 5 north_east
## 6 north_east

Below is the structure of the data set

## 'data.frame':    4360 obs. of  12 variables:
##  $ nr        : int  13 13 13 13 13 13 13 13 17 17 ...
##  $ year      : int  1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
##  $ school    : int  14 14 14 14 14 14 14 14 13 13 ...
##  $ exper     : int  1 2 3 4 5 6 7 8 4 5 ...
##  $ union     : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ethn      : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ maried    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ health    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wage      : num  1.2 1.85 1.34 1.43 1.57 ...
##  $ industry  : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
##  $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
##  $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...

2 (Experimental) Design

The purpose of this experiment is to observe the impact of a multitude factors on the log of hourly wage. A fractional factorial design is used.

How will the expreriment be organized and conducted to test the hypothesis?

A \(2^{6-3}\) fractional factorial design will be used. The two 2-level factors for this experiment are heath and maried. The two 3-level factors for this experiment are ethn and residence after decomposed to a 3-level factor. Both of these three-level factors will be decomposed to 2-level factors.

Randomize: What is the Randomization Scheme?

There is no randomization in the data set.

Replicate: Are there any replicates or repeated measures?

There are no replicates in the original dataset.

Block: Did you use blocking in the design?

Blocking is not used in this design. However, it is possible to block if necessary.

3.(Statistical) Analysis

(Exploratory) Data Analysis

The following below are a series of boxplots

M<-as.factor(Males$maried)
R<-as.factor(Males$residence)
H<-as.factor(Males$health)
E<-as.factor(Males$ethn)
W<-as.numeric(Males$wage)
boxplot(W~M,xlab="Maritial Status",ylab="Log of hourly wages")

boxplot(W~R,xlab="Residence",ylab="Log of hourly wages")

boxplot(W~H,xlab="Any health probles",ylab="Log of hourly wages")

boxplot(W~E,xlab="Ethnicity",ylab="Log of hourly wages")

Decomposoiton of factors into two 2-level factors. The new factors are RuralSouth,NorthEC,Hisp,and Black.

colnames <-c("wage","maried","health","ethn","residence")
project <-data.frame(Males$wage,Males$maried,Males$health,Males$ethn,Males$residence)
colnames(project)<- colnames
RuralSouth <- matrix(0,nrow = 4360, ncol = 1)
NorthEC <- matrix(0,nrow = 4360, ncol = 1)
Hisp <- matrix(0,nrow = 4360, ncol = 1)
Black <- matrix(0,nrow = 4360, ncol = 1)
project3<-data.frame(project,RuralSouth,NorthEC,Hisp,Black)
colnames <-c("wage","maried","health","ethn","residence","RS","NEC","H","B")
colnames(project3) <- colnames
project3$RS[project3$residence=="rural_area"]<-1
project3$RS[project3$residence=="south"]<-1
project3$RS[project3$residence=="north_east"]<-0
project3$RS[project3$residence=="northern_central"]<-0
project3$NEC[project3$residence=="rural_area"]<-0
project3$NEC[project3$residence=="south"]<-0
project3$NEC[project3$residence=="north_east"]<-1
project3$NEC[project3$residence=="northern_central"]<-1
project3$H[project3$residence=="black"]<-0
project3$H[project3$residence=="other"]<-0
project3$H[project3$residence=="hisp"]<-1
project3$B[project3$residence=="black"]<-1
project3$B[project3$residence=="other"]<-0
project3$B[project3$residence=="hisp"]<-0
head(project3)
##       wage maried health  ethn  residence RS NEC H B
## 1 1.197540     no     no other north_east  0   1 0 0
## 2 1.853060     no     no other north_east  0   1 0 0
## 3 1.344462     no     no other north_east  0   1 0 0
## 4 1.433213     no     no other north_east  0   1 0 0
## 5 1.568125     no     no other north_east  0   1 0 0
## 6 1.699891     no     no other north_east  0   1 0 0

Testing

The full factorial experimental design of 2^6 = 64 runs is below:

expand.grid(maried=c(0,1),health=c(0,1),RS=c(0,1),NEC=c(0,1),H=c(0,1),B=c(0,1))
##    maried health RS NEC H B
## 1       0      0  0   0 0 0
## 2       1      0  0   0 0 0
## 3       0      1  0   0 0 0
## 4       1      1  0   0 0 0
## 5       0      0  1   0 0 0
## 6       1      0  1   0 0 0
## 7       0      1  1   0 0 0
## 8       1      1  1   0 0 0
## 9       0      0  0   1 0 0
## 10      1      0  0   1 0 0
## 11      0      1  0   1 0 0
## 12      1      1  0   1 0 0
## 13      0      0  1   1 0 0
## 14      1      0  1   1 0 0
## 15      0      1  1   1 0 0
## 16      1      1  1   1 0 0
## 17      0      0  0   0 1 0
## 18      1      0  0   0 1 0
## 19      0      1  0   0 1 0
## 20      1      1  0   0 1 0
## 21      0      0  1   0 1 0
## 22      1      0  1   0 1 0
## 23      0      1  1   0 1 0
## 24      1      1  1   0 1 0
## 25      0      0  0   1 1 0
## 26      1      0  0   1 1 0
## 27      0      1  0   1 1 0
## 28      1      1  0   1 1 0
## 29      0      0  1   1 1 0
## 30      1      0  1   1 1 0
## 31      0      1  1   1 1 0
## 32      1      1  1   1 1 0
## 33      0      0  0   0 0 1
## 34      1      0  0   0 0 1
## 35      0      1  0   0 0 1
## 36      1      1  0   0 0 1
## 37      0      0  1   0 0 1
## 38      1      0  1   0 0 1
## 39      0      1  1   0 0 1
## 40      1      1  1   0 0 1
## 41      0      0  0   1 0 1
## 42      1      0  0   1 0 1
## 43      0      1  0   1 0 1
## 44      1      1  0   1 0 1
## 45      0      0  1   1 0 1
## 46      1      0  1   1 0 1
## 47      0      1  1   1 0 1
## 48      1      1  1   1 0 1
## 49      0      0  0   0 1 1
## 50      1      0  0   0 1 1
## 51      0      1  0   0 1 1
## 52      1      1  0   0 1 1
## 53      0      0  1   0 1 1
## 54      1      0  1   0 1 1
## 55      0      1  1   0 1 1
## 56      1      1  1   0 1 1
## 57      0      0  0   1 1 1
## 58      1      0  0   1 1 1
## 59      0      1  0   1 1 1
## 60      1      1  0   1 1 1
## 61      0      0  1   1 1 1
## 62      1      0  1   1 1 1
## 63      0      1  1   1 1 1
## 64      1      1  1   1 1 1

It’s apparent that this would take a while to complete this analysis. Therefore, it will be approximated with a fraction factorial design. The fractional factorial design chosen is a 2^6-3 design by utilizing the R package FrF2. The highest possible resolution for this design is Resolution III. The FRF2 package is also used to determine the aliasing structure of the fractional factorial design and the generators of the fractional factorial design.

library(FrF2)
## Warning: package 'FrF2' was built under R version 3.3.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.3.2
## 
## Attaching package: 'DoE.base'
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## The following object is masked from 'package:graphics':
## 
##     plot.design
## The following object is masked from 'package:base':
## 
##     lengths
design <-FrF2(8,factor.names=c("maried","health","RS","NEC","H","B"),default.levels = c("0","1"))
design
##   maried health RS NEC H B
## 1      0      1  0   0 1 0
## 2      0      0  0   1 1 1
## 3      1      0  0   0 0 1
## 4      1      1  1   1 1 1
## 5      1      0  1   0 1 0
## 6      1      1  0   1 0 0
## 7      0      0  1   1 0 0
## 8      0      1  1   0 0 1
## class=design, type= FrF2
summary(design)
## Call:
## FrF2(8, factor.names = c("maried", "health", "RS", "NEC", "H", 
##     "B"), default.levels = c("0", "1"))
## 
## Experimental design of type  FrF2 
## 8  runs
## 
## Factor settings (scale ends):
##   maried health RS NEC H B
## 1      0      0  0   0 0 0
## 2      1      1  1   1 1 1
## 
## Design generating information:
## $legend
## [1] A=maried B=health C=RS     D=NEC    E=H      F=B     
## 
## $generators
## [1] D=AB E=AC F=BC
## 
## 
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD
## 
## 
## The design itself:
##   maried health RS NEC H B
## 1      0      1  0   0 1 0
## 2      0      0  0   1 1 1
## 3      1      0  0   0 0 1
## 4      1      1  1   1 1 1
## 5      1      0  1   0 1 0
## 6      1      1  0   1 0 0
## 7      0      0  1   1 0 0
## 8      0      1  1   0 0 1
## class=design, type= FrF2
aliasprint(design)
## $legend
## [1] A=maried B=health C=RS     D=NEC    E=H      F=B     
## 
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD

From the above result, it’s apparent that the generators of the fractional factorial design are D=AB,E=AC,F=BC. From the aliasing structure, it’s apparent that the main effects are aliased with some of the two factor interations, which are highly aliased with each other: AF=BE=CD. Because the design is a Resolution III design, only the main effects are a reliable source of analysis.

6020

From the aliased interactions (D=AB, E=AC, and F= BC), the generator I is calculated as I=ABD=ACE=BCF

Estimation

The main and interaction effects can be estimated by using a linear model and ANOVA analysis. This can be seen below, with all of the main effects and all of the two-factor interaction effects:

fit <- lm(project3$wage~(project3$maried+project3$health+project3$RS+project3$NEC+project3$H+project3$B)^2,data=project3)
anova(fit)
## Analysis of Variance Table
## 
## Response: project3$wage
##                                   Df  Sum Sq Mean Sq  F value    Pr(>F)
## project3$maried                    1   52.11  52.114 194.7300 < 2.2e-16
## project3$health                    1    0.84   0.837   3.1266 0.0770953
## project3$RS                        1    2.97   2.966  11.0837 0.0008781
## project3$NEC                       1    9.81   9.805  36.6389 1.541e-09
## project3$maried:project3$health    1    0.46   0.463   1.7309 0.1883711
## project3$maried:project3$RS        1    5.40   5.395  20.1591 7.310e-06
## project3$maried:project3$NEC       1    0.60   0.603   2.2519 0.1335193
## project3$health:project3$RS        1    0.06   0.057   0.2133 0.6441796
## project3$health:project3$NEC       1    0.13   0.130   0.4849 0.4862504
## Residuals                       4350 1164.16   0.268                   
##                                    
## project3$maried                 ***
## project3$health                 .  
## project3$RS                     ***
## project3$NEC                    ***
## project3$maried:project3$health    
## project3$maried:project3$RS     ***
## project3$maried:project3$NEC       
## project3$health:project3$RS        
## project3$health:project3$NEC       
## Residuals                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Diagnostic and Model Adequacy Checking

To look at just the main effects:

fit2 <- lm(project3$wage~(project3$maried+project3$health+project3$RS+project3$NEC+project3$H+project3$B),data=project3)
anova(fit2)
## Analysis of Variance Table
## 
## Response: project3$wage
##                   Df  Sum Sq Mean Sq  F value    Pr(>F)    
## project3$maried    1   52.11  52.114 193.8468 < 2.2e-16 ***
## project3$health    1    0.84   0.837   3.1124 0.0777685 .  
## project3$RS        1    2.97   2.966  11.0335 0.0009022 ***
## project3$NEC       1    9.81   9.805  36.4728 1.677e-09 ***
## Residuals       4355 1170.81   0.269                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The conclusion remains the same because the main effects are aliased with the two-factor interactions, Below are the plots of the residuals.

qqnorm(residuals(fit))
qqline(residuals(fit))

plot(fitted(fit),residuals(fit))

The Q-Q plot shows the data probably follows a normal distribution.

4. References to the literature

[1]https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

5. Appendices

https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

The complete code for the project can be seen below:

library(Ecdat)
data("Males")
head(Males,6)
str(Males)
M<-as.factor(Males$maried)
R<-as.factor(Males$residence)
H<-as.factor(Males$health)
E<-as.factor(Males$ethn)
W<-as.numeric(Males$wage)
boxplot(W~M,xlab="Maritial Status",ylab="Log of hourly wages")
boxplot(W~R,xlab="Residence",ylab="Log of hourly wages")
boxplot(W~H,xlab="Any health probles",ylab="Log of hourly wages")
boxplot(W~E,xlab="Ethnicity",ylab="Log of hourly wages")
colnames <-c("wage","maried","health","ethn","residence")
project <-data.frame(Males$wage,Males$maried,Males$health,Males$ethn,Males$residence)
colnames(project)<- colnames
RuralSouth <- matrix(0,nrow = 4360, ncol = 1)
NorthEC <- matrix(0,nrow = 4360, ncol = 1)
Hisp <- matrix(0,nrow = 4360, ncol = 1)
Black <- matrix(0,nrow = 4360, ncol = 1)
project3<-data.frame(project,RuralSouth,NorthEC,Hisp,Black)
colnames <-c("wage","maried","health","ethn","residence","RS","NEC","H","B")
colnames(project3) <- colnames
project3$RS[project3$residence=="rural_area"]<-1
project3$RS[project3$residence=="south"]<-1
project3$RS[project3$residence=="north_east"]<-0
project3$RS[project3$residence=="northern_central"]<-0
project3$NEC[project3$residence=="rural_area"]<-0
project3$NEC[project3$residence=="south"]<-0
project3$NEC[project3$residence=="north_east"]<-1
project3$NEC[project3$residence=="northern_central"]<-1
project3$H[project3$residence=="black"]<-0
project3$H[project3$residence=="other"]<-0
project3$H[project3$residence=="hisp"]<-1
project3$B[project3$residence=="black"]<-1
project3$B[project3$residence=="other"]<-0
project3$B[project3$residence=="hisp"]<-0
head(project3)
expand.grid(maried=c(0,1),health=c(0,1),RuralSouth=c(0,1),NorthEC=c(0,1),Hisp=c(0,1),Black=c(0,1))
library(FrF2)
design <-FrF2(8,factor.names=c("maried","health","RS","NEC","H","B"),default.levels = c("0","1"))
design
summary(design)
aliasprint(design)
fit <- lm(project3$wage~(project3$maried+project3$health+project3$RS+project3$NEC+project3$H+project3$B)^2,data=project3)
anova(fit)
fit2 <- lm(project3$wage~(project3$maried+project3$health+project3$RS+project3$NEC+project3$H+project3$B),data=project3)
anova(fit2)
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit),residuals(fit))