Fractional Factorial Designs

Prasanna Date

RPI

December 9, 2016 V.1

1. Setting

From the Ecdat package, VietNamH dataset was used for this project. This dataset has a total of 5999 observations describing the household expenses of households in Vietnam. The dataset can be loaded as follows:

#install.packages("Ecdat")
library("Ecdat")

## Loading required package: Ecfun

## 
## Attaching package: 'Ecfun'

## The following object is masked from 'package:base':
## 
##     sign

## 
## Attaching package: 'Ecdat'

## The following object is masked from 'package:datasets':
## 
##     Orange

library(pid)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.3.2

## Loading required package: png

## Loading required package: FrF2

## Loading required package: DoE.base

## Loading required package: grid

## Loading required package: conf.design

## 
## Attaching package: 'DoE.base'

## The following objects are masked from 'package:stats':
## 
##     aov, lm

## The following object is masked from 'package:graphics':
## 
##     plot.design

## The following object is masked from 'package:base':
## 
##     lengths

library(FrF2)
library(knitr)
library(tidyr)
library(kimisc)
df <- VietNamH

System Under Test

As described above, the VietNamH dataset has data about the household expenses of households in Vietnam. Specifically, we have the following variables in the data set:

sex: Gender of household head (male, female)
age: Age of household head
educyr: Schooling year of household head
farm: yes if farm household
urban: yes if urban household
hhsize: Size of the household
lntotal: Natural logarighm of total expenditure of the household
lnmed: Natural logarithm of medical expenditure of the household
lnrlfood: Natural logarithm of food expenditure of the household
lnexp12m: Natural logarithm of total household health care expenditure for 12 months
commune: Commune

This dataset has been taken from the Vietnam World Bank Livings Standards Survey.

head(df)

##      sex age educyr farm urban hhsize  lntotal     lnmed  lnrlfood
## 1 female  68      4   no   yes      6 10.13649 11.233210  8.639339
## 2 female  57      8   no   yes      6 10.25206  8.505120  9.345752
## 3   male  42     14   no   yes      6 10.93231  8.713418 10.226330
## 4 female  72      9   no   yes      6 10.26749  9.291736  9.263722
## 5 female  73      1   no   yes      8 10.48811  7.555382  9.592890
## 6 female  66     13   no   yes      7 10.52660  9.789702  9.372034
##    lnexp12m commune
## 1 11.233210       1
## 2  8.505120       1
## 3  8.713418       1
## 4  9.291736       1
## 5  7.555382       1
## 6  9.789702       1

Factors and Levels

First of all, we must keep only those variables that will be needed in our analysis. For our analysis, we have two 2-level IVs (sex and urban) and two 3-level IVs (age and hhsize). The response variable is lntotal. Thus, our modified dataset becomes:

df = df[c(1,5,2,6,7)]
for (i in 1:nrow(df))
{
  if (df$age[i] <= 35) 
  {
    df$age_factor[i] <- "Young"
  }
  else if (df$age[i] <= 60) 
  {
    df$age_factor[i] <- "Adult"
  }
  else
  {
    df$age_factor[i] <- "Old"
  }
  
  if (df$hhsize[i] <= 4) 
  {
    df$hhsize_factor[i] <- "Small"
  }
  else if (df$hhsize[i] <= 8)
  {
    df$hhsize_factor[i] <- "Medium"
  }
  else
  {
    df$hhsize_factor[i] <- "Large"
  }
}

df$sex <- as.factor(df$sex)
df$urban <- as.factor(df$urban)
df$age <- as.factor(df$age_factor)
df$hhsize <- as.factor(df$hhsize_factor)
df <- df[-c(6,7)]

head(df)

##      sex urban   age hhsize  lntotal
## 1 female   yes   Old Medium 10.13649
## 2 female   yes Adult Medium 10.25206
## 3   male   yes Adult Medium 10.93231
## 4 female   yes   Old Medium 10.26749
## 5 female   yes   Old Medium 10.48811
## 6 female   yes   Old Medium 10.52660

str(df)

## 'data.frame':    5999 obs. of  5 variables:
##  $ sex    : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
##  $ urban  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ age    : Factor w/ 3 levels "Adult","Old",..: 2 1 1 2 2 2 2 1 1 1 ...
##  $ hhsize : Factor w/ 3 levels "Large","Medium",..: 2 2 2 2 2 2 1 3 2 3 ...
##  $ lntotal: num  10.1 10.3 10.9 10.3 10.5 ...

summary(df)

##      sex       urban         age          hhsize        lntotal      
##  male  :4375   no :4269   Adult:3497   Large : 233   Min.   : 6.543  
##  female:1624   yes:1730   Old  :1283   Medium:2920   1st Qu.: 8.920  
##                           Young:1219   Small :2846   Median : 9.311  
##                                                      Mean   : 9.342  
##                                                      3rd Qu.: 9.759  
##                                                      Max.   :12.202

As can be seen above, we have two 2-level IVs (sex and urban) and two 3-level IVs (age and hhsize).

Continuous Variables and Response Variables

The response variable in our study (lntotal) is the only continuous variable. We converted the continuous factors into categorical variables.

2. Experimental Design

How will the experiment be organized and conducted to test the hypothesis?

If we were to use $2^2 \cdot 3^2$ design, we would end up with 36 different configurations of input variables. However we plan to do a $2^k$ design by converting the 3-level variables into two 2-level variables. As a result the mixed design that we currently have would be converted to a $2^k$ design with $k = 6$.

The 3-level factors can be converted into two 2-level factors by assigning $(-1, -1), (-1, +1), (+1, -1), (+1, +1)$ to $small, medium, medium$ and $large$ respectively.

We can now begin the process of modifying the current dataset. The first 3-level variable (age) can be converted as follows.

C <- c(-1, 1, -1, 1)
D <- c(-1, -1, 1, 1)
X <- c("Young", "Adult", "Adult", "Old")
FactX <- as.data.frame(cbind(C,D,X))
kable(FactX, align='c')

C	D	X
-1	-1	Young
1	-1	Adult
-1	1	Adult
1	1	Old

The second 3-level variable (hhsize) can be converted to two 2-level variables as follows.

E <- C
F <- D
Y <- c("Small", "Medium", "Medium", "Large")
FactY <- as.data.frame(cbind(E, F, Y))
kable(FactY, align='c')

E	F	Y
-1	-1	Small
1	-1	Medium
-1	1	Medium
1	1	Large

Now that we have converted the 3-level factors into two 2-level factors each, we can construct a $2^6$ design as follows.

K <- as.data.frame(FrF2(64, 6, randomize = FALSE))

## creating full factorial with 64 runs ...

A <- c(as.integer(K[,1]))
B <- c(as.integer(K[,2]))
C <- c(as.integer(K[,3]))
D <- c(as.integer(K[,4]))
CD <- C*D
E <- c(as.integer(K[,5]))
F <- c(as.integer(K[,6]))
EF <- E*F

FullDesignMatrix <- as.data.frame(cbind(A,B,C,D,E,F,CD,EF))

for (i in 1:nrow(FullDesignMatrix))
{
  if (FullDesignMatrix$A[i] == 1)
  {
    FullDesignMatrix$Sex[i] <- "Male"
  }
  else
  {
    FullDesignMatrix$Sex[i] <- "Female"
  }
  
  if (FullDesignMatrix$B[i] == 1)
  {
    FullDesignMatrix$Urban[i] <- "No"
  }
  else
  {
    FullDesignMatrix$Urban[i] <- "Yes"
  }
  
  if (FullDesignMatrix$CD[i] == 1)
  {
    FullDesignMatrix$Age[i] <- "Adult"
  }
  else if (FullDesignMatrix$CD[i] == 2)
  {
    FullDesignMatrix$Age[i] <- "Old"
  }
  else
  {
    FullDesignMatrix$Age[i] <- "Young"
  }
  
  if (FullDesignMatrix$EF[i] == 1)
  {
    FullDesignMatrix$Hhsize[i] <- "Large"
  }
  else if (FullDesignMatrix$EF[i] == 2)
  {
    FullDesignMatrix$Hhsize[i] <- "Medium"
  }
  else
  {
    FullDesignMatrix$Hhsize[i] <- "Small"
  }
}

FullDesign <- cbind(K, FullDesignMatrix$Sex, FullDesignMatrix$Urban, FullDesignMatrix$Age, FullDesignMatrix$Hhsize)
kable(FullDesign, align='c')

A	B	C	D	E	F	FullDesignMatrix$Sex	FullDesignMatrix$Urban	FullDesignMatrix$Age	FullDesignMatrix$Hhsize
-1	-1	-1	-1	-1	-1	Male	No	Adult	Large
1	-1	-1	-1	-1	-1	Female	No	Adult	Large
-1	1	-1	-1	-1	-1	Male	Yes	Adult	Large
1	1	-1	-1	-1	-1	Female	Yes	Adult	Large
-1	-1	1	-1	-1	-1	Male	No	Old	Large
1	-1	1	-1	-1	-1	Female	No	Old	Large
-1	1	1	-1	-1	-1	Male	Yes	Old	Large
1	1	1	-1	-1	-1	Female	Yes	Old	Large
-1	-1	-1	1	-1	-1	Male	No	Old	Large
1	-1	-1	1	-1	-1	Female	No	Old	Large
-1	1	-1	1	-1	-1	Male	Yes	Old	Large
1	1	-1	1	-1	-1	Female	Yes	Old	Large
-1	-1	1	1	-1	-1	Male	No	Young	Large
1	-1	1	1	-1	-1	Female	No	Young	Large
-1	1	1	1	-1	-1	Male	Yes	Young	Large
1	1	1	1	-1	-1	Female	Yes	Young	Large
-1	-1	-1	-1	1	-1	Male	No	Adult	Medium
1	-1	-1	-1	1	-1	Female	No	Adult	Medium
-1	1	-1	-1	1	-1	Male	Yes	Adult	Medium
1	1	-1	-1	1	-1	Female	Yes	Adult	Medium
-1	-1	1	-1	1	-1	Male	No	Old	Medium
1	-1	1	-1	1	-1	Female	No	Old	Medium
-1	1	1	-1	1	-1	Male	Yes	Old	Medium
1	1	1	-1	1	-1	Female	Yes	Old	Medium
-1	-1	-1	1	1	-1	Male	No	Old	Medium
1	-1	-1	1	1	-1	Female	No	Old	Medium
-1	1	-1	1	1	-1	Male	Yes	Old	Medium
1	1	-1	1	1	-1	Female	Yes	Old	Medium
-1	-1	1	1	1	-1	Male	No	Young	Medium
1	-1	1	1	1	-1	Female	No	Young	Medium
-1	1	1	1	1	-1	Male	Yes	Young	Medium
1	1	1	1	1	-1	Female	Yes	Young	Medium
-1	-1	-1	-1	-1	1	Male	No	Adult	Medium
1	-1	-1	-1	-1	1	Female	No	Adult	Medium
-1	1	-1	-1	-1	1	Male	Yes	Adult	Medium
1	1	-1	-1	-1	1	Female	Yes	Adult	Medium
-1	-1	1	-1	-1	1	Male	No	Old	Medium
1	-1	1	-1	-1	1	Female	No	Old	Medium
-1	1	1	-1	-1	1	Male	Yes	Old	Medium
1	1	1	-1	-1	1	Female	Yes	Old	Medium
-1	-1	-1	1	-1	1	Male	No	Old	Medium
1	-1	-1	1	-1	1	Female	No	Old	Medium
-1	1	-1	1	-1	1	Male	Yes	Old	Medium
1	1	-1	1	-1	1	Female	Yes	Old	Medium
-1	-1	1	1	-1	1	Male	No	Young	Medium
1	-1	1	1	-1	1	Female	No	Young	Medium
-1	1	1	1	-1	1	Male	Yes	Young	Medium
1	1	1	1	-1	1	Female	Yes	Young	Medium
-1	-1	-1	-1	1	1	Male	No	Adult	Small
1	-1	-1	-1	1	1	Female	No	Adult	Small
-1	1	-1	-1	1	1	Male	Yes	Adult	Small
1	1	-1	-1	1	1	Female	Yes	Adult	Small
-1	-1	1	-1	1	1	Male	No	Old	Small
1	-1	1	-1	1	1	Female	No	Old	Small
-1	1	1	-1	1	1	Male	Yes	Old	Small
1	1	1	-1	1	1	Female	Yes	Old	Small
-1	-1	-1	1	1	1	Male	No	Old	Small
1	-1	-1	1	1	1	Female	No	Old	Small
-1	1	-1	1	1	1	Male	Yes	Old	Small
1	1	-1	1	1	1	Female	Yes	Old	Small
-1	-1	1	1	1	1	Male	No	Young	Small
1	-1	1	1	1	1	Female	No	Young	Small
-1	1	1	1	1	1	Male	Yes	Young	Small
1	1	1	1	1	1	Female	Yes	Young	Small

To construct the $2^{6 - 3}$ design, we follow the following steps. First of all, it is important to note that $2^{6 - 3}$ design is essentially a design having $\frac{1}{8}$ number of configurations as compared to $2^6$ design. Note that with the given set of input variables, it is not possible to achieve Resolution IV and V because for Resolution IV, we would require $2^{6 - 2}$ i.e. 16 runs and for Resolution V, we would require $2^{6 - 1}$ i.e. 32 runs. We know that for a Resolution III design, no main effect is aliased with any other main effect, but main effects are aliased with two factor interactions. Now, we can construct our $2^{6 - 3}_{III}$ comprising of 8 runs as follows.

K <- as.data.frame(FrF2(8,6,randomize = FALSE))
A <- c(as.integer(K[,1]))
B <- c(as.integer(K[,2]))
C <- c(as.integer(K[,3]))
D <- c(as.integer(K[,4]))
CD <- C*D
E <- c(as.integer(K[,5]))
F <- c(as.integer(K[,6]))
EF <- E*F

FracDesignMatrix <- as.data.frame(cbind(A,B,C,D,E,F,CD,EF))

for (i in 1:nrow(FracDesignMatrix))
{
  if (FracDesignMatrix$A[i] == 1)
  {
    FracDesignMatrix$Sex[i] <- "Male"
  }
  else
  {
    FracDesignMatrix$Sex[i] <- "Female"
  }
  
  if (FracDesignMatrix$B[i] == 1)
  {
    FracDesignMatrix$Urban[i] <- "No"
  }
  else
  {
    FracDesignMatrix$Urban[i] <- "Yes"
  }
  
  if (FracDesignMatrix$CD[i] == 1)
  {
    FracDesignMatrix$Age[i] <- "Adult"
  }
  else if (FracDesignMatrix$CD[i] == 2)
  {
    FracDesignMatrix$Age[i] <- "Old"
  }
  else
  {
    FracDesignMatrix$Age[i] <- "Young"
  }
  
  if (FracDesignMatrix$EF[i] == 1)
  {
    FracDesignMatrix$Hhsize[i] <- "Large"
  }
  else if (FracDesignMatrix$EF[i] == 2)
  {
    FracDesignMatrix$Hhsize[i] <- "Medium"
  }
  else
  {
    FracDesignMatrix$Hhsize[i] <- "Small"
  }
}

FracDesign <- cbind(K, FracDesignMatrix$Sex, FracDesignMatrix$Urban, FracDesignMatrix$Age, FracDesignMatrix$Hhsize)

kable(FracDesign, align='c')

A	B	C	D	E	F	FracDesignMatrix$Sex	FracDesignMatrix$Urban	FracDesignMatrix$Age	FracDesignMatrix$Hhsize
-1	-1	-1	1	1	1	Male	No	Old	Small
1	-1	-1	-1	-1	1	Female	No	Adult	Medium
-1	1	-1	-1	1	-1	Male	Yes	Adult	Medium
1	1	-1	1	-1	-1	Female	Yes	Old	Large
-1	-1	1	1	-1	-1	Male	No	Young	Large
1	-1	1	-1	1	-1	Female	No	Old	Medium
-1	1	1	-1	-1	1	Male	Yes	Old	Medium
1	1	1	1	1	1	Female	Yes	Young	Small

We can now construct our $2^3 = 2^{6 - 3}$ table as follows.

A <- c(-1, 1)
B <- c(-1, 1)
C <- c(-1, 1)
des <- expand.grid(A=A, B=B, C=C)
des <- as.data.frame(des)
des

##    A  B  C
## 1 -1 -1 -1
## 2  1 -1 -1
## 3 -1  1 -1
## 4  1  1 -1
## 5 -1 -1  1
## 6  1 -1  1
## 7 -1  1  1
## 8  1  1  1

Since we have six factors, we need to look up the trade off table for $2^k$ fractional factorial analysis.

tradeOffTable()

As we can see, for $2^{6 - 3}_{III}$ fractional factorial design, we can add additional generators as follows.

des$D <- des$A * des$B
des$E <- des$A * des$C
des$F <- des$B * des$C

des

##    A  B  C  D  E  F
## 1 -1 -1 -1  1  1  1
## 2  1 -1 -1 -1 -1  1
## 3 -1  1 -1 -1  1 -1
## 4  1  1 -1  1 -1 -1
## 5 -1 -1  1  1 -1 -1
## 6  1 -1  1 -1  1 -1
## 7 -1  1  1 -1 -1  1
## 8  1  1  1  1  1  1

Now, we have the $2^{6 - 3}_{III}$ design matrix. We can determine the aliasing structure by using the following commands.

K <- FrF2(8,6,randomize = FALSE)
summary(K)

## Call:
## FrF2(8, 6, randomize = FALSE)
## 
## Experimental design of type  FrF2 
## 8  runs
## 
## Factor settings (scale ends):
##    A  B  C  D  E  F
## 1 -1 -1 -1 -1 -1 -1
## 2  1  1  1  1  1  1
## 
## Design generating information:
## $legend
## [1] A=A B=B C=C D=D E=E F=F
## 
## $generators
## [1] D=AB E=AC F=BC
## 
## 
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD
## 
## 
## The design itself:
##    A  B  C  D  E  F
## 1 -1 -1 -1  1  1  1
## 2  1 -1 -1 -1 -1  1
## 3 -1  1 -1 -1  1 -1
## 4  1  1 -1  1 -1 -1
## 5 -1 -1  1  1 -1 -1
## 6  1 -1  1 -1  1 -1
## 7 -1  1  1 -1 -1  1
## 8  1  1  1  1  1  1
## class=design, type= FrF2

This aliasing structure is in sync with the description of $2^{6 - 3}_{III}$ fractional factorial designs in Montgomery textbook.

Before we proceed further, let’s summarize the input variables once again. $A$ represents the gender of household head, which is a 2-level factor. $B$ represents if the household is situated in urban area or not, which is also a 2-level factor. $C$ and $D$ together represent the 3-level factor age. $E$ and $F$ together represent the 3-level factor household size.

What is the rationale for this design?

The rationale for this design is to be able to draw as much information using as less number of experimental runs as possible. Furthermore, in order to simulate the experiment, we are using $2^{6 - 3}_{III}$ design. We wish to draw maximum possible information from the 8 runs.

Randomization, Blocking and Replication

For ANOVA, Randomization must occur at three levels: Random Selection, Random Assignment and Random Execution. Random Selection refers to selecting a random set of sample from the population. Random Assignment refers to assigning those samples to different groups. Random Execution refers to ordering the experimental trials randomly [1]. The data comes from the Vietnam World Bank Livings Standards Survey. There doesn’t seem to be any information pertaining to random selection. However, random assignment and random execution is ensured in the analysis.

Blocking refers to arranging the experimental runs in blocks that are similar to each other [2]. In this experiment, as mentioned above, we would be blocking on the categorical IV sex in order to eliminate any variablity caused by the levels of this factor.

Replication is the repeatition of an experimental condition so that the variablity associated with the phenomenon under study can be estimated [3]. Repeated measures refers to using the same subjects with every branch of research, including the control group [4]. In this dataset, there are several replications of possible configurations of the input IVs. There is no evidence of repeated measures.

3. Statistical Analysis

Exploratory Data Analysis (EDA)

It is important to note that the EDA performed here is only done to obtain a feel for the descriptive statistics of the problem because we are trying to use eight runs for our fractional factorial designs.

summary(df)

##      sex       urban         age          hhsize        lntotal      
##  male  :4375   no :4269   Adult:3497   Large : 233   Min.   : 6.543  
##  female:1624   yes:1730   Old  :1283   Medium:2920   1st Qu.: 8.920  
##                           Young:1219   Small :2846   Median : 9.311  
##                                                      Mean   : 9.342  
##                                                      3rd Qu.: 9.759  
##                                                      Max.   :12.202

str(df)

## 'data.frame':    5999 obs. of  5 variables:
##  $ sex    : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
##  $ urban  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ age    : Factor w/ 3 levels "Adult","Old",..: 2 1 1 2 2 2 2 1 1 1 ...
##  $ hhsize : Factor w/ 3 levels "Large","Medium",..: 2 2 2 2 2 2 1 3 2 3 ...
##  $ lntotal: num  10.1 10.3 10.9 10.3 10.5 ...

head(df)

##      sex urban   age hhsize  lntotal
## 1 female   yes   Old Medium 10.13649
## 2 female   yes Adult Medium 10.25206
## 3   male   yes Adult Medium 10.93231
## 4 female   yes   Old Medium 10.26749
## 5 female   yes   Old Medium 10.48811
## 6 female   yes   Old Medium 10.52660

Boxplots can be generated for the four input variables as follows.

boxplot(df$lntotal~df$sex, xlab="Sex", ylab="Log of Total Expenditure", main="Boxplot of the IV Sex")

boxplot(df$lntotal~df$urban, xlab="Urban?", ylab="Log of Total Expenditure", main="Boxplot of the IV Urban")

boxplot(df$lntotal~df$age, xlab="Age", ylab="Log of Total Expenditure", main="Boxplot of the IV Age")

boxplot(df$lntotal~df$hhsize, xlab="Household Income", ylab="Log of Total Expenditure", main="Boxplot of the IV Household Size")

Testing

We can form the ordered $2^{6 - 3}_{III}$ fractional factorial design matrix as follows.

K <- as.data.frame(FrF2(8,6, randomize = FALSE))
A <- c(as.integer(K[,1]))
B <- c(as.integer(K[,2]))
C <- c(as.integer(K[,3]))
D <- c(as.integer(K[,4]))
CD <- C*D
E <- c(as.integer(K[,5]))
F <- c(as.integer(K[,6]))
EF <- E*F

Design <- as.data.frame(cbind(A,B,C,D,E,F,CD,EF))

for (i in 1:nrow(K))
{
  if (Design$A[i] == 1)
  {
    Design$Sex[i] <- "Male"
  }
  else
  {
    Design$Sex[i] <- "Female"
  }
  
  if (Design$B[i] == 1)
  {
    Design$Urban[i] <- "No"
  }
  else
  {
    Design$Urban[i] <- "Yes"
  }
  
  if (Design$CD[i] == 1)
  {
    Design$Age[i] <- "Adult"
  }
  else if (Design$CD[i] == 2)
  {
    Design$Age[i] <- "Old"
  }
  else
  {
    Design$Age[i] <- "Young"
  }
  
  if (Design$EF[i] == 1)
  {
    Design$Hhsize[i] <- "Large"
  }
  else if (Design$EF[i] == 2)
  {
    Design$Hhsize[i] <- "Medium"
  }
  else
  {
    Design$Hhsize[i] <- "Small"
  }
}

dfDesign <- cbind(K, Design$Sex, Design$Urban, Design$Age, Design$Hhsize)
kable(dfDesign, align='c')

A	B	C	D	E	F	Design$Sex	Design$Urban	Design$Age	Design$Hhsize
-1	-1	-1	1	1	1	Male	No	Old	Small
1	-1	-1	-1	-1	1	Female	No	Adult	Medium
-1	1	-1	-1	1	-1	Male	Yes	Adult	Medium
1	1	-1	1	-1	-1	Female	Yes	Old	Large
-1	-1	1	1	-1	-1	Male	No	Young	Large
1	-1	1	-1	1	-1	Female	No	Old	Medium
-1	1	1	-1	-1	1	Male	Yes	Old	Medium
1	1	1	1	1	1	Female	Yes	Young	Small

The code for a randomized design matrix is as follows.

plan.design <- FrF2(8,6, randomize=TRUE)
K <- as.data.frame(plan.design)
A <- c(as.integer(K[,1]))
B <- c(as.integer(K[,2]))
C <- c(as.integer(K[,3]))
D <- c(as.integer(K[,4]))
CD <- C*D
E <- c(as.integer(K[,5]))
F <- c(as.integer(K[,6]))
EF <- E*F

Design <- as.data.frame(cbind(A,B,C,D,E,F,CD,EF))

for (i in 1:nrow(K))
{
  if (Design$A[i] == 1)
  {
    Design$Sex[i] <- "male"
  }
  else
  {
    Design$Sex[i] <- "female"
  }
  
  if (Design$B[i] == 1)
  {
    Design$Urban[i] <- "no"
  }
  else
  {
    Design$Urban[i] <- "yes"
  }
  
  if (Design$CD[i] == 1)
  {
    Design$Age[i] <- "Adult"
  }
  else if (Design$CD[i] == 2)
  {
    Design$Age[i] <- "Old"
  }
  else
  {
    Design$Age[i] <- "Young"
  }
  
  if (Design$EF[i] == 1)
  {
    Design$Hhsize[i] <- "Large"
  }
  else if (Design$EF[i] == 2)
  {
    Design$Hhsize[i] <- "Medium"
  }
  else
  {
    Design$Hhsize[i] <- "Small"
  }
}

dfDesign <- cbind(K, Design$Sex, Design$Urban, Design$Age, Design$Hhsize)
kable(dfDesign, align='c')

A	B	C	D	E	F	Design$Sex	Design$Urban	Design$Age	Design$Hhsize
-1	-1	1	1	-1	-1	male	no	Young	Large
-1	1	-1	-1	1	-1	male	yes	Adult	Medium
1	1	1	1	1	1	female	yes	Young	Small
1	-1	1	-1	1	-1	female	no	Old	Medium
-1	-1	-1	1	1	1	male	no	Old	Small
-1	1	1	-1	-1	1	male	yes	Old	Medium
1	-1	-1	-1	-1	1	female	no	Adult	Medium
1	1	-1	1	-1	-1	female	yes	Old	Large

summary(dfDesign)

##   A      B      C      D      E      F      Design$Sex Design$Urban
##  -1:4   -1:4   -1:4   -1:4   -1:4   -1:4   female:4    no :4       
##  1 :4   1 :4   1 :4   1 :4   1 :4   1 :4   male  :4    yes:4       
##                                                                    
##  Design$Age Design$Hhsize
##  Adult:2    Large :2     
##  Old  :4    Medium:4     
##  Young:2    Small :2

design_with_ref <- unite(dfDesign, refcol, c(7,8,9,10), remove=FALSE)
kable(design_with_ref, align='c')

A	B	C	D	E	F	refcol	Design$Sex	Design$Urban	Design$Age	Design$Hhsize
-1	-1	1	1	-1	-1	male_no_Young_Large	male	no	Young	Large
-1	1	-1	-1	1	-1	male_yes_Adult_Medium	male	yes	Adult	Medium
1	1	1	1	1	1	female_yes_Young_Small	female	yes	Young	Small
1	-1	1	-1	1	-1	female_no_Old_Medium	female	no	Old	Medium
-1	-1	-1	1	1	1	male_no_Old_Small	male	no	Old	Small
-1	1	1	-1	-1	1	male_yes_Old_Medium	male	yes	Old	Medium
1	-1	-1	-1	-1	1	female_no_Adult_Medium	female	no	Adult	Medium
1	1	-1	1	-1	-1	female_yes_Old_Large	female	yes	Old	Large

Design.refcol <- c(dfDesign$refcol)
df <- unite(df, refcol, c(1,2,3,4), remove=FALSE)
head(df)

##                    refcol    sex urban   age hhsize  lntotal
## 1   female_yes_Old_Medium female   yes   Old Medium 10.13649
## 2 female_yes_Adult_Medium female   yes Adult Medium 10.25206
## 3   male_yes_Adult_Medium   male   yes Adult Medium 10.93231
## 4   female_yes_Old_Medium female   yes   Old Medium 10.26749
## 5   female_yes_Old_Medium female   yes   Old Medium 10.48811
## 6   female_yes_Old_Medium female   yes   Old Medium 10.52660

head(design_with_ref)

##    A  B  C  D  E  F                 refcol Design$Sex Design$Urban
## 1 -1 -1  1  1 -1 -1    male_no_Young_Large       male           no
## 2 -1  1 -1 -1  1 -1  male_yes_Adult_Medium       male          yes
## 3  1  1  1  1  1  1 female_yes_Young_Small     female          yes
## 4  1 -1  1 -1  1 -1   female_no_Old_Medium     female           no
## 5 -1 -1 -1  1  1  1      male_no_Old_Small       male           no
## 6 -1  1  1 -1 -1  1    male_yes_Old_Medium       male          yes
##   Design$Age Design$Hhsize
## 1      Young         Large
## 2      Adult        Medium
## 3      Young         Small
## 4        Old        Medium
## 5        Old         Small
## 6        Old        Medium

for (j in 1:nrow(df))
{
  if (design_with_ref$refcol[1] == df$refcol[j])
  {
    df$R1[j] <- df$lntotal[j]
  }
  else
  {
    df$R1[j] <- 0
  }
}

for (j in 1:nrow(df))
{
  if (design_with_ref$refcol[2] == df$refcol[j])
  {
    df$R2[j] <- df$lntotal[j]
  }
  else
  {
    df$R2[j] <- 0
  }
}

for (j in 1:nrow(df))
{
  if (design_with_ref$refcol[3] == df$refcol[j])
  {
    df$R3[j] <- df$lntotal[j]
  }
  else
  {
    df$R3[j] <- 0
  }
}
 
for (j in 1:nrow(df))
{ 
  if (design_with_ref$refcol[4] == df$refcol[j])
  {
    df$R4[j] <- df$lntotal[j]
  }
  else
  {
    df$R4[j] <- 0
  }
}  

for (j in 1:nrow(df))
{
  if (design_with_ref$refcol[5] == df$refcol[j])
  {
    df$R5[j] <- df$lntotal[j]
  }
  else
  {
    df$R5[j] <- 0
  }
}

for (j in 1:nrow(df))
{
  if (design_with_ref$refcol[6] == df$refcol[j])
  {
    df$R6[j] <- df$lntotal[j]
  }
  else
  {
    df$R6[j] <- 0
  }
}

for (j in 1:nrow(df))
{
  if (design_with_ref$refcol[7] == df$refcol[j])
  {
    df$R7[j] <- df$lntotal[j]
  }
  else
  {
    df$R7[j] <- 0
  }
}  

for (j in 1:nrow(df))
{
  if (design_with_ref$refcol[8] == df$refcol[j])
  {
    df$R8[j] <- df$lntotal[j]
  }
  else
  {
    df$R8[j] <- 0
  }
}


R_1 <- sample.rows(subset(df, R1 > 0), 1)
R_2 <- sample.rows(subset(df, R2 > 0), 1)
R_3 <- sample.rows(subset(df, R3 > 0), 1)
R_4 <- sample.rows(subset(df, R4 > 0), 1)
R_5 <- sample.rows(subset(df, R5 > 0), 1)
R_6 <- sample.rows(subset(df, R6 > 0), 1)
R_7 <- sample.rows(subset(df, R7 > 0), 1)
R_8 <- sample.rows(subset(df, R8 > 0), 1)

RV <- c(R_1$R1, R_2$R2, R_3$R3, R_4$R4, R_5$R5, R_6$R6, R_7$R7, R_8$R8)

RV

## [1]  9.536133  9.683457  9.905867  9.441408  9.153513  9.816130  9.547972
## [8] 10.229960

The above eight response variable values can be added to our design.

plan.response <- add.response(plan.design, RV)
summary(plan.response)

## Call:
## FrF2(8, 6, randomize = TRUE)
## 
## Experimental design of type  FrF2 
## 8  runs
## 
## Factor settings (scale ends):
##    A  B  C  D  E  F
## 1 -1 -1 -1 -1 -1 -1
## 2  1  1  1  1  1  1
## 
## Responses:
## [1] RV
## 
## Design generating information:
## $legend
## [1] A=A B=B C=C D=D E=E F=F
## 
## $generators
## [1] D=AB E=AC F=BC
## 
## 
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD
## 
## 
## The design itself:
##    A  B  C  D  E  F        RV
## 1 -1 -1  1  1 -1 -1  9.536133
## 2 -1  1 -1 -1  1 -1  9.683457
## 3  1  1  1  1  1  1  9.905867
## 4  1 -1  1 -1  1 -1  9.441408
## 5 -1 -1 -1  1  1  1  9.153513
## 6 -1  1  1 -1 -1  1  9.816130
## 7  1 -1 -1 -1 -1  1  9.547972
## 8  1  1 -1  1 -1 -1 10.229960
## class=design, type= FrF2

Now that we have our design matrix along with the response variable values ready, we can begin the analysis.

summary(lm(plan.response))

## Number of observations used: 8 
## Formula:
## RV ~ (A + B + C + D + E + F)^2
## 
## Call:
## lm.default(formula = fo, data = model.frame(fo, data = formula))
## 
## Residuals:
## ALL 8 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (14 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  9.664305         NA      NA       NA
## A1           0.116997         NA      NA       NA
## B1           0.244549         NA      NA       NA
## C1           0.010579         NA      NA       NA
## D1           0.042063         NA      NA       NA
## E1          -0.118244         NA      NA       NA
## F1          -0.058434         NA      NA       NA
## A1:F1        0.004052         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 7 and 0 DF,  p-value: NA

To compute the main effects, we can use the following command.

MEPlot(plan.response, abbrev=5, cex.xax=1.6, cex.main=2)

The interaction effects can be computed as follows.

IAPlot(plan.response, abbrev=5, show.alias=TRUE, lwd=2, cex=2, cex.xax=1.2, cex.lab=1.5)

According to the above plots, the main effects of $B$, $C$, $D$ and $F$ seem significant. These correspond to the urban (i.e. whether house is in urban area or not), age (i.e. age of head of the household) and hhsize (i.e. size of the household). It seems that the urban households have a higher total expenditure. Furthermore, age of the household owner negatively impacts the household expenses. Also, the larger the household, the higher is the expenditure (note that the hhsize was encoded as Large, Medium and Small respectively). Finally, it doesn’t look like the sex of the head of the household affects the total household expenditure - but we need to quantify the main effect through ANOVA.

Thus, we can formulate our null hypothesis as follows: The total household expenditure is not affected by sex, urban, age and household size.

AOVObject = aov(df$lntotal~df$sex+df$urban+df$age+df$hhsize, data = df)
anova(AOVObject)

## Analysis of Variance Table
## 
## Response: df$lntotal
##             Df  Sum Sq Mean Sq  F value    Pr(>F)    
## df$sex       1   18.87   18.87   62.744 2.785e-15 ***
## df$urban     1  635.53  635.53 2113.101 < 2.2e-16 ***
## df$age       2  113.70   56.85  189.030 < 2.2e-16 ***
## df$hhsize    2  266.78  133.39  443.519 < 2.2e-16 ***
## Residuals 5992 1802.13    0.30                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is apparent from ANOVA reults that we can safely discard the null hypothesis - all the main effects are statistically significant.

Model Adequacy Checking

First of all, we can plot the histogram of the response variable

hist(df$lntotal, breaks=10, main="Histogram of logarithm of total household expenditure")

The distribution seems normal from the histogram.

qqnorm(residuals(AOVObject))
qqline(residuals(AOVObject))

plot(fitted(AOVObject), residuals(AOVObject))

The QQ plot and residuals vs fitted plots indeed confirm that the data was normally distributed and randomly selected.

Project 3