The data set tested comes from the Ecdat package in R and the data set is named Males. The data set consists of 4360 observations from 1980 to 1987
## Loading required package: Ecfun
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
The data set contains the following factors: union: levels “no”" and “yes”
ethn: levels “black”,“hisp”, and “other”
maried: levels “no” and “yes”
health: levels “no” and “yes”
industry: levels “Agricultural”, “Mining”, “Construction”,“Trade”, “Transportation,”Finance“,”Business and Repair Service“,”Personal_Service“,”Entertainment“,”Manufacturing“,”Professional and Related Service“,”Public Administration"
occupation: levels “Professional, Technicaland kindred”, “Managers, Officials_and_Proprietors”, “Sales_Workers”,“Clerical_and_kindred”, “Craftsmen, Foremen_and_kindred”,“Operatives_and_kindred”,“Laborers_and_farmers”,“Farm_Laborers_and_Foreman”,“Service_Workers”
residence: levels rural area,north east, northern central, south
Note the residence factor will be grouped as a three-level factor by merging together south and rural area in to rural-south
The data set contains the following continous variables: nr: identifiant
year: years
school: years of schooling
exper: years of experience
wage: log of hourly wages
The response is the log of hourly wages wage.
Below are the first 6 observations of the Males data set
## nr year school exper union ethn maried health wage
## 1 13 1980 14 1 no other no no 1.197540
## 2 13 1981 14 2 yes other no no 1.853060
## 3 13 1982 14 3 no other no no 1.344462
## 4 13 1983 14 4 no other no no 1.433213
## 5 13 1984 14 5 no other no no 1.568125
## 6 13 1985 14 6 no other no no 1.699891
## industry occupation
## 1 Business_and_Repair_Service Service_Workers
## 2 Personal_Service Service_Workers
## 3 Business_and_Repair_Service Service_Workers
## 4 Business_and_Repair_Service Service_Workers
## 5 Personal_Service Craftsmen, Foremen_and_kindred
## 6 Business_and_Repair_Service Managers, Officials_and_Proprietors
## residence
## 1 north_east
## 2 north_east
## 3 north_east
## 4 north_east
## 5 north_east
## 6 north_east
Below is the structure of the data set
## 'data.frame': 4360 obs. of 12 variables:
## $ nr : int 13 13 13 13 13 13 13 13 17 17 ...
## $ year : int 1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
## $ school : int 14 14 14 14 14 14 14 14 13 13 ...
## $ exper : int 1 2 3 4 5 6 7 8 4 5 ...
## $ union : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ ethn : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ maried : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ health : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ wage : num 1.2 1.85 1.34 1.43 1.57 ...
## $ industry : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
## $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
## $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...
The purpose of this experiment is to observe the impact of a multitude factors on the log of hourly wage. A fractional factorial design is used.
A \(2^{6-3}\) fractional factorial design will be used. The two 2-level factors for this experiment are heath and maried. The two 3-level factors for this experiment are ethn and residence after decomposed to a 3-level factor. Both of these three-level factors will be decomposed to 2-level factors.
There is no randomization in the data set.
There are no replicates in the original dataset.
Blocking is not used in this design. However, it is possible to block if necessary.
The following below are a series of boxplots
M<-as.factor(Males$maried)
R<-as.factor(Males$residence)
H<-as.factor(Males$health)
E<-as.factor(Males$ethn)
W<-as.numeric(Males$wage)
boxplot(W~M,xlab="Maritial Status",ylab="Log of hourly wages")
boxplot(W~R,xlab="Residence",ylab="Log of hourly wages")
boxplot(W~H,xlab="Any health probles",ylab="Log of hourly wages")
boxplot(W~E,xlab="Ethnicity",ylab="Log of hourly wages")
Decomposoiton of factors into two 2-level factors. The new factors are RuralSouth,NorthEC,Hisp,and Black.
colnames <-c("wage","maried","health","ethn","residence")
project <-data.frame(Males$wage,Males$maried,Males$health,Males$ethn,Males$residence)
colnames(project)<- colnames
RuralSouth <- matrix(0,nrow = 4360, ncol = 1)
NorthEC <- matrix(0,nrow = 4360, ncol = 1)
Hisp <- matrix(0,nrow = 4360, ncol = 1)
Black <- matrix(0,nrow = 4360, ncol = 1)
project3<-data.frame(project,RuralSouth,NorthEC,Hisp,Black)
colnames <-c("wage","maried","health","ethn","residence","RS","NEC","H","B")
colnames(project3) <- colnames
project3$RS[project3$residence=="rural_area"]<-1
project3$RS[project3$residence=="south"]<-1
project3$RS[project3$residence=="north_east"]<-0
project3$RS[project3$residence=="northern_central"]<-0
project3$NEC[project3$residence=="rural_area"]<-0
project3$NEC[project3$residence=="south"]<-0
project3$NEC[project3$residence=="north_east"]<-1
project3$NEC[project3$residence=="northern_central"]<-1
project3$H[project3$residence=="black"]<-0
project3$H[project3$residence=="other"]<-0
project3$H[project3$residence=="hisp"]<-1
project3$B[project3$residence=="black"]<-1
project3$B[project3$residence=="other"]<-0
project3$B[project3$residence=="hisp"]<-0
head(project3)
## wage maried health ethn residence RS NEC H B
## 1 1.197540 no no other north_east 0 1 0 0
## 2 1.853060 no no other north_east 0 1 0 0
## 3 1.344462 no no other north_east 0 1 0 0
## 4 1.433213 no no other north_east 0 1 0 0
## 5 1.568125 no no other north_east 0 1 0 0
## 6 1.699891 no no other north_east 0 1 0 0
The full factorial experimental design of 2^6 = 64 runs is below:
expand.grid(maried=c(0,1),health=c(0,1),RS=c(0,1),NEC=c(0,1),H=c(0,1),B=c(0,1))
## maried health RS NEC H B
## 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 0 0 1 0 0 0
## 6 1 0 1 0 0 0
## 7 0 1 1 0 0 0
## 8 1 1 1 0 0 0
## 9 0 0 0 1 0 0
## 10 1 0 0 1 0 0
## 11 0 1 0 1 0 0
## 12 1 1 0 1 0 0
## 13 0 0 1 1 0 0
## 14 1 0 1 1 0 0
## 15 0 1 1 1 0 0
## 16 1 1 1 1 0 0
## 17 0 0 0 0 1 0
## 18 1 0 0 0 1 0
## 19 0 1 0 0 1 0
## 20 1 1 0 0 1 0
## 21 0 0 1 0 1 0
## 22 1 0 1 0 1 0
## 23 0 1 1 0 1 0
## 24 1 1 1 0 1 0
## 25 0 0 0 1 1 0
## 26 1 0 0 1 1 0
## 27 0 1 0 1 1 0
## 28 1 1 0 1 1 0
## 29 0 0 1 1 1 0
## 30 1 0 1 1 1 0
## 31 0 1 1 1 1 0
## 32 1 1 1 1 1 0
## 33 0 0 0 0 0 1
## 34 1 0 0 0 0 1
## 35 0 1 0 0 0 1
## 36 1 1 0 0 0 1
## 37 0 0 1 0 0 1
## 38 1 0 1 0 0 1
## 39 0 1 1 0 0 1
## 40 1 1 1 0 0 1
## 41 0 0 0 1 0 1
## 42 1 0 0 1 0 1
## 43 0 1 0 1 0 1
## 44 1 1 0 1 0 1
## 45 0 0 1 1 0 1
## 46 1 0 1 1 0 1
## 47 0 1 1 1 0 1
## 48 1 1 1 1 0 1
## 49 0 0 0 0 1 1
## 50 1 0 0 0 1 1
## 51 0 1 0 0 1 1
## 52 1 1 0 0 1 1
## 53 0 0 1 0 1 1
## 54 1 0 1 0 1 1
## 55 0 1 1 0 1 1
## 56 1 1 1 0 1 1
## 57 0 0 0 1 1 1
## 58 1 0 0 1 1 1
## 59 0 1 0 1 1 1
## 60 1 1 0 1 1 1
## 61 0 0 1 1 1 1
## 62 1 0 1 1 1 1
## 63 0 1 1 1 1 1
## 64 1 1 1 1 1 1
It’s apparent that this would take a while to complete this analysis. Therefore, it will be approximated with a fraction factorial design. The fractional factorial design chosen is a 2^6-3 design by utilizing the R package FrF2. The highest possible resolution for this design is Resolution III. The FRF2 package is also used to determine the aliasing structure of the fractional factorial design and the generators of the fractional factorial design.
library(FrF2)
## Warning: package 'FrF2' was built under R version 3.3.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.3.2
##
## Attaching package: 'DoE.base'
## The following objects are masked from 'package:stats':
##
## aov, lm
## The following object is masked from 'package:graphics':
##
## plot.design
## The following object is masked from 'package:base':
##
## lengths
design <-FrF2(8,factor.names=c("maried","health","RS","NEC","H","B"),default.levels = c("0","1"))
design
## maried health RS NEC H B
## 1 0 1 0 0 1 0
## 2 0 0 0 1 1 1
## 3 1 0 0 0 0 1
## 4 1 1 1 1 1 1
## 5 1 0 1 0 1 0
## 6 1 1 0 1 0 0
## 7 0 0 1 1 0 0
## 8 0 1 1 0 0 1
## class=design, type= FrF2
summary(design)
## Call:
## FrF2(8, factor.names = c("maried", "health", "RS", "NEC", "H",
## "B"), default.levels = c("0", "1"))
##
## Experimental design of type FrF2
## 8 runs
##
## Factor settings (scale ends):
## maried health RS NEC H B
## 1 0 0 0 0 0 0
## 2 1 1 1 1 1 1
##
## Design generating information:
## $legend
## [1] A=maried B=health C=RS D=NEC E=H F=B
##
## $generators
## [1] D=AB E=AC F=BC
##
##
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
##
## $fi2
## [1] AF=BE=CD
##
##
## The design itself:
## maried health RS NEC H B
## 1 0 1 0 0 1 0
## 2 0 0 0 1 1 1
## 3 1 0 0 0 0 1
## 4 1 1 1 1 1 1
## 5 1 0 1 0 1 0
## 6 1 1 0 1 0 0
## 7 0 0 1 1 0 0
## 8 0 1 1 0 0 1
## class=design, type= FrF2
aliasprint(design)
## $legend
## [1] A=maried B=health C=RS D=NEC E=H F=B
##
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
##
## $fi2
## [1] AF=BE=CD
From the above result, it’s apparent that the generators of the fractional factorial design are D=AB,E=AC,F=BC. From the aliasing structure, it’s apparent that the main effects are aliased with some of the two factor interations, which are highly aliased with each other: AF=BE=CD. Because the design is a Resolution III design, only the main effects are a reliable source of analysis.
From the aliased interactions (D=AB, E=AC, and F= BC), the generator I is calculated as I=ABD=ACE=BCF
The main and interaction effects can be estimated by using a linear model and ANOVA analysis. This can be seen below, with all of the main effects and all of the two-factor interaction effects:
fit <- lm(project3$wage~(project3$maried+project3$health+project3$RS+project3$NEC+project3$H+project3$B)^2,data=project3)
anova(fit)
## Analysis of Variance Table
##
## Response: project3$wage
## Df Sum Sq Mean Sq F value Pr(>F)
## project3$maried 1 52.11 52.114 194.7300 < 2.2e-16
## project3$health 1 0.84 0.837 3.1266 0.0770953
## project3$RS 1 2.97 2.966 11.0837 0.0008781
## project3$NEC 1 9.81 9.805 36.6389 1.541e-09
## project3$maried:project3$health 1 0.46 0.463 1.7309 0.1883711
## project3$maried:project3$RS 1 5.40 5.395 20.1591 7.310e-06
## project3$maried:project3$NEC 1 0.60 0.603 2.2519 0.1335193
## project3$health:project3$RS 1 0.06 0.057 0.2133 0.6441796
## project3$health:project3$NEC 1 0.13 0.130 0.4849 0.4862504
## Residuals 4350 1164.16 0.268
##
## project3$maried ***
## project3$health .
## project3$RS ***
## project3$NEC ***
## project3$maried:project3$health
## project3$maried:project3$RS ***
## project3$maried:project3$NEC
## project3$health:project3$RS
## project3$health:project3$NEC
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To look at just the main effects:
fit2 <- lm(project3$wage~(project3$maried+project3$health+project3$RS+project3$NEC+project3$H+project3$B),data=project3)
anova(fit2)
## Analysis of Variance Table
##
## Response: project3$wage
## Df Sum Sq Mean Sq F value Pr(>F)
## project3$maried 1 52.11 52.114 193.8468 < 2.2e-16 ***
## project3$health 1 0.84 0.837 3.1124 0.0777685 .
## project3$RS 1 2.97 2.966 11.0335 0.0009022 ***
## project3$NEC 1 9.81 9.805 36.4728 1.677e-09 ***
## Residuals 4355 1170.81 0.269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The conclusion remains the same because the main effects are aliased with the two-factor interactions, Below are the plots of the residuals.
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit),residuals(fit))
The Q-Q plot shows the data probably follows a normal distribution.
https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf
The complete code for the project can be seen below:
library(Ecdat)
data("Males")
head(Males,6)
str(Males)
M<-as.factor(Males$maried)
R<-as.factor(Males$residence)
H<-as.factor(Males$health)
E<-as.factor(Males$ethn)
W<-as.numeric(Males$wage)
boxplot(W~M,xlab="Maritial Status",ylab="Log of hourly wages")
boxplot(W~R,xlab="Residence",ylab="Log of hourly wages")
boxplot(W~H,xlab="Any health probles",ylab="Log of hourly wages")
boxplot(W~E,xlab="Ethnicity",ylab="Log of hourly wages")
colnames <-c("wage","maried","health","ethn","residence")
project <-data.frame(Males$wage,Males$maried,Males$health,Males$ethn,Males$residence)
colnames(project)<- colnames
RuralSouth <- matrix(0,nrow = 4360, ncol = 1)
NorthEC <- matrix(0,nrow = 4360, ncol = 1)
Hisp <- matrix(0,nrow = 4360, ncol = 1)
Black <- matrix(0,nrow = 4360, ncol = 1)
project3<-data.frame(project,RuralSouth,NorthEC,Hisp,Black)
colnames <-c("wage","maried","health","ethn","residence","RS","NEC","H","B")
colnames(project3) <- colnames
project3$RS[project3$residence=="rural_area"]<-1
project3$RS[project3$residence=="south"]<-1
project3$RS[project3$residence=="north_east"]<-0
project3$RS[project3$residence=="northern_central"]<-0
project3$NEC[project3$residence=="rural_area"]<-0
project3$NEC[project3$residence=="south"]<-0
project3$NEC[project3$residence=="north_east"]<-1
project3$NEC[project3$residence=="northern_central"]<-1
project3$H[project3$residence=="black"]<-0
project3$H[project3$residence=="other"]<-0
project3$H[project3$residence=="hisp"]<-1
project3$B[project3$residence=="black"]<-1
project3$B[project3$residence=="other"]<-0
project3$B[project3$residence=="hisp"]<-0
head(project3)
expand.grid(maried=c(0,1),health=c(0,1),RuralSouth=c(0,1),NorthEC=c(0,1),Hisp=c(0,1),Black=c(0,1))
library(FrF2)
design <-FrF2(8,factor.names=c("maried","health","RS","NEC","H","B"),default.levels = c("0","1"))
design
summary(design)
aliasprint(design)
fit <- lm(project3$wage~(project3$maried+project3$health+project3$RS+project3$NEC+project3$H+project3$B)^2,data=project3)
anova(fit)
fit2 <- lm(project3$wage~(project3$maried+project3$health+project3$RS+project3$NEC+project3$H+project3$B),data=project3)
anova(fit2)
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit),residuals(fit))