From the datasets in the Ecdat R package, we select “Mathlevel”, which is useful for explaining SAT Math Score.
In the Mathlevel data, ‘language’, ‘sex’, ‘physiccourse’ and ‘chemistcourse’ are selected as factors that could explain SAT Math Score; they have 2, 2, 3 and 3 levels respectively. ‘sat’ is chosen as the response variable.
The original dataset was collected to examine the effect of math level on learning economics and consists of 8 variables and 609 observations. The original data frame contains:
mathlevel highest level of math attained, an ordered factor with levels 170, 171a, 172, 171b, 172b, 221a, 221b
sat sat math score
language status of foreign language proficiency
sex male, female
major one of other, eco, oss (other social sciences), ns (natural sciences), hum (humanities)
mathcourse number of courses in advanced math (0 to 3)
physiccourse number of courses in physics (0 to 2)
chemistcourse number of courses in chemistry (0 to 2)
We will use a subset of these variables as factors in the analysis for Project 3.
We can inspect the head and tail of the data frame to see what the data look like.
load("C:/Users/bokjh3/Desktop/Ecdat_0.2-9/Ecdat/data/Mathlevel.rda")
head(Mathlevel, n=10)
## mathlevel sat language sex major mathcourse physiccourse
## 1 170 670 no male ns 1 2
## 2 170 660 no male other 1 1
## 3 170 610 no female eco 1 0
## 4 170 620 yes male eco 1 0
## 5 170 430 no male eco 0 1
## 6 170 580 no female oss 2 1
## 7 170 550 yes female other 1 0
## 8 170 510 no female eco 1 1
## 9 170 560 yes male hum 1 0
## 10 170 670 no male oss 1 0
## chemistcourse
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 0
## 10 1
tail(Mathlevel, n=10)
## mathlevel sat language sex major mathcourse physiccourse
## 600 221b 660 no female ns 2 1
## 601 221b 670 no female ns 2 1
## 602 221b 670 no male other 2 1
## 603 221b 660 no male ns 1 0
## 604 221b 590 no female ns 2 1
## 605 221b 580 no female oss 2 1
## 606 221b 770 no male oss 2 1
## 607 221b 660 no male other 2 1
## 608 221b 710 no female eco 2 0
## 609 221b 590 no female oss 2 0
## chemistcourse
## 600 1
## 601 1
## 602 1
## 603 1
## 604 1
## 605 1
## 606 1
## 607 1
## 608 1
## 609 1
The 4 selected factors are language, sex, physiccourse and chemistcourse. We convert physiccourse and chemistcourse from numeric variables to factors for the analysis.
language status of foreign language proficiency
sex male, female
physiccourse number of courses in physics (0 to 2)
chemistcourse number of courses in chemistry (0 to 2)
Mathlevel$physiccourse <- as.factor(Mathlevel$physiccourse)
Mathlevel$chemistcourse <- as.factor(Mathlevel$chemistcourse)
str(Mathlevel)
## 'data.frame': 609 obs. of 8 variables:
## $ mathlevel : Ord.factor w/ 7 levels "170"<"171a"<"172a"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sat : int 670 660 610 620 430 580 550 510 560 670 ...
## $ language : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 2 1 ...
## $ sex : Factor w/ 2 levels "male","female": 1 1 2 1 1 2 2 2 1 1 ...
## $ major : Factor w/ 5 levels "other","eco",..: 4 1 2 2 2 3 1 2 5 3 ...
## $ mathcourse : num 1 1 1 1 0 2 1 1 1 1 ...
## $ physiccourse : Factor w/ 3 levels "0","1","2": 3 2 1 1 2 2 1 2 1 1 ...
## $ chemistcourse: Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 1 2 ...
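The conversion matters for the later ANOVA: a numeric course count enters a linear model as a single slope (1 degree of freedom), while a 3-level factor receives 2 degrees of freedom. A minimal sketch on simulated data (the data frame `toy` and its values are made up for illustration and are not taken from Mathlevel):

```r
# Simulated stand-in data; NOT the Mathlevel dataset.
set.seed(1)
toy <- data.frame(
  courses = rep(0:2, each = 10),
  score   = rnorm(30, mean = 600, sd = 50)
)

# Numeric predictor: one slope, so 1 degree of freedom.
df_numeric <- anova(lm(score ~ courses, data = toy))$Df[1]

# Factor predictor: 3 levels, so 3 - 1 = 2 degrees of freedom.
toy$courses_f <- as.factor(toy$courses)
df_factor <- anova(lm(score ~ courses_f, data = toy))$Df[1]

c(numeric = df_numeric, factor = df_factor)
```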
sat is a continuous variable. ANOVA requires a continuous dependent variable, so the response must be chosen from the continuous variables in the dataset.
The response variable is sat, the SAT Math Score. We investigate the effects of the factors language, sex, physiccourse and chemistcourse on this response.
The “Mathlevel” data are cross-sectional data collected from 1983 to 1986 in the United States. The number of observations is 609. The dataset is organized by the following variables: mathlevel, sat, language, sex, major, mathcourse, physiccourse and chemistcourse.
We select language, sex, physiccourse and chemistcourse as factors, which have 2, 2, 3 and 3 levels respectively, and sat as the response variable, which is continuous.
A Taguchi design reduces the total number of runs relative to a full factorial design while still allowing main effects to be computed.
To keep the design simple, each 3-level factor is decomposed into two 2-level factors (coded 0/1) whose sum reproduces the original 3-level value. This gives us a total of six 2-level factors for inclusion in our design.
The Taguchi experimental design will be used for further analysis with the smallest number of experimental runs.
Randomization is a technique used to balance the effect of extraneous or uncontrollable conditions that can affect the results of an experiment. In this analysis we cannot randomize because we had no control over data collection, so we assume that the dataset satisfies the assumptions of a randomized design.
Replicates are multiple experimental runs with the same factor settings (levels). Replicates are subject to the same sources of variability, independently of each other. We can replicate combinations of factor levels, groups of factor level combinations, or entire designs. There is no replication in this research.
In experimental design, blocking is a technique used to deal with nuisance factors that may affect the results of the experiment. The experiment is organized into blocks, where the nuisance factor is maintained at a constant level in each block. Blocking is not utilized in this research.
The design uses 4 factors, of which 2 have 2 levels and 2 have 3 levels. Each 3-level factor is decomposed into two 2-level factors by the following process, leaving six 2-level factors for the analysis.
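# Expand physiccourse (levels 0, 1, 2) into two 0/1 columns, physiccourseA and physiccourseB.
# Row 1 is used to initialise the result with coding c(0,0). Note: row 1 has
# physiccourse == 2 (see the head() output), which the loop below would code as c(1,1).
x = Mathlevel[1,]
physiccourse = c(0,0)
L = levels(Mathlevel[,'physiccourse'])
for(i in seq(2,dim(Mathlevel)[1])){
  if(Mathlevel[i,'physiccourse']==L[1]){
    # level 0: one row coded (0,0)
    x = rbind(x,Mathlevel[i,])
    physiccourse = rbind(physiccourse,c(0,0))
  }
  else if(Mathlevel[i,'physiccourse']==L[2]){
    # level 1: the row is duplicated, coded once as (1,0) and once as (0,1)
    x = rbind(x,Mathlevel[i,])
    x = rbind(x,Mathlevel[i,])
    physiccourse = rbind(physiccourse,c(1,0))
    physiccourse = rbind(physiccourse,c(0,1))
  }
  else{
    # level 2: one row coded (1,1)
    x = rbind(x,Mathlevel[i,])
    physiccourse = rbind(physiccourse,c(1,1))
  }
}
x = cbind(x,physiccourse)
# Drop the original physiccourse column (column 7) and name the two new columns
x = x[,c(1,2,3,4,5,6,8,9,10)]
colnames(x) = c(colnames(Mathlevel[1,c(1,2,3,4,5,6,8)]),'physiccourseA','physiccourseB')

# Repeat the expansion for chemistcourse.
# Row 1 has chemistcourse == 1, so it is entered twice, coded (1,0) and (0,1).
Lc = levels(Mathlevel[,'chemistcourse'])
chemistcourse = c(1,0)
chemistcourse = rbind(chemistcourse,c(0,1))
dat = x[1,]
dat = rbind(dat,dat)
for(i in seq(2,dim(x)[1])){
  if(x[i,'chemistcourse']==Lc[1]){
    dat = rbind(dat,x[i,])
    chemistcourse = rbind(chemistcourse,c(0,0))
  }
  else if(x[i,'chemistcourse']==Lc[2]){
    dat = rbind(dat,x[i,])
    dat = rbind(dat,x[i,])
    chemistcourse = rbind(chemistcourse,c(1,0))
    chemistcourse = rbind(chemistcourse,c(0,1))
  }
  else{
    dat = rbind(dat,x[i,])
    chemistcourse = rbind(chemistcourse,c(1,1))
  }
}
dat = cbind(dat,chemistcourse)
colnames(dat) = c(colnames(x),'chemistcourseA','chemistcourseB')
# Keep all 11 columns (this selection does not drop anything)
dat = dat[,c(1,2,3,4,5,6,7,8,9,10,11)]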
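The same coding rule (0 becomes (0,0), 1 becomes two rows coded (1,0) and (0,1), 2 becomes (1,1)) can also be written without growing the data frame row by row. The following is only a sketch: `expand_three_level` is a hypothetical helper, not part of any package, it keeps the original column rather than dropping it, and the three-row `toy` data frame is invented for illustration.

```r
# Hypothetical helper: split a 0/1/2 column into two 0/1 indicator columns.
expand_three_level <- function(df, col, name_a, name_b) {
  lev <- as.integer(as.character(df[[col]]))   # works for factors or numerics
  reps <- ifelse(lev == 1, 2L, 1L)             # level 1 rows appear twice
  idx <- rep(seq_len(nrow(df)), times = reps)
  out <- df[idx, , drop = FALSE]
  lev_out <- lev[idx]
  first <- !duplicated(idx)                    # TRUE for the first copy of a row
  # First copy of a level-1 row is coded (1,0), second copy (0,1).
  out[[name_a]] <- ifelse(lev_out == 2, 1L, ifelse(lev_out == 1 & first, 1L, 0L))
  out[[name_b]] <- ifelse(lev_out == 2, 1L, ifelse(lev_out == 1 & !first, 1L, 0L))
  rownames(out) <- NULL
  out
}

# Invented example data, one row per level.
toy <- data.frame(sat = c(500, 600, 700), physiccourse = factor(c(0, 1, 2)))
res <- expand_three_level(toy, "physiccourse", "physiccourseA", "physiccourseB")
res
```

Because every row is coded from its own level, this version needs no special handling for the first row.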
Utilizing the taguchiChoose function in the qualityTools package, we can investigate all possible Taguchi designs based on the number of factors and the number of levels for each factor.
install.packages("qualityTools", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/bokjh3/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'qualityTools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\bokjh3\AppData\Local\Temp\RtmpGIOuGb\downloaded_packages
library(qualityTools)
## Warning: package 'qualityTools' was built under R version 3.3.2
## Loading required package: Rsolnp
## Warning: package 'Rsolnp' was built under R version 3.3.2
## Loading required package: MASS
##
## Attaching package: 'qualityTools'
## The following object is masked from 'package:stats':
##
## sigma
taguchiChoose(factors1 = 6, level1 = 2)
## 6 factors on 2 levels and 0 factors on 0 levels with 0 desired interactions to be estimated
##
## Possible Designs:
##
## L8_2 L12_2 L16_2 L32_2
##
## Use taguchiDesign("L8_2") or different to create a taguchi design object
Although several Taguchi designs are available, we are only interested in main effects and want to minimize the number of experimental runs. Therefore, the L8_2 design is the most appropriate.
Next, we will create our design using the taguchiDesign function. We will also set the random seed so that our results are reproducible.
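The run-count argument can be made concrete with a little arithmetic. This sketch assumes the usual rule of thumb that estimating k two-level main effects plus an overall mean needs at least k + 1 runs; the candidate sizes are read off the design names listed by taguchiChoose above.

```r
n_factors <- 6

# A full factorial design needs every combination of the six 2-level factors.
full_factorial_runs <- 2^n_factors

# Assumption: one parameter per main effect plus one for the overall mean.
min_runs <- n_factors + 1

# Run counts implied by the design names L8_2, L12_2, L16_2, L32_2.
candidates <- c(L8_2 = 8, L12_2 = 12, L16_2 = 16, L32_2 = 32)
smallest <- names(candidates)[candidates >= min_runs][1]

c(full_factorial = full_factorial_runs, minimum = min_runs)
smallest
```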
set.seed(1587)
design <- taguchiDesign("L8_2")
## Warning in `[<-`(`*tmp*`, i, value = <S4 object of class
## structure("taguchiFactor", package = "qualityTools")>): implicit list
## embedding of S4 objects is deprecated
## Warning in `[<-`(`*tmp*`, i, value = <S4 object of class
## structure("taguchiFactor", package = "qualityTools")>): implicit list
## embedding of S4 objects is deprecated
## Warning in `[<-`(`*tmp*`, i, value = <S4 object of class
## structure("taguchiFactor", package = "qualityTools")>): implicit list
## embedding of S4 objects is deprecated
## Warning in `[<-`(`*tmp*`, i, value = <S4 object of class
## structure("taguchiFactor", package = "qualityTools")>): implicit list
## embedding of S4 objects is deprecated
## Warning in `[<-`(`*tmp*`, i, value = <S4 object of class
## structure("taguchiFactor", package = "qualityTools")>): implicit list
## embedding of S4 objects is deprecated
## Warning in `[<-`(`*tmp*`, i, value = <S4 object of class
## structure("taguchiFactor", package = "qualityTools")>): implicit list
## embedding of S4 objects is deprecated
## Warning in `[<-`(`*tmp*`, i, value = <S4 object of class
## structure("taguchiFactor", package = "qualityTools")>): implicit list
## embedding of S4 objects is deprecated
names(design) = c("dat$physiccourseA", "dat$physiccourseB", "dat$chemistcourseA", "chemistcourseB", "language", "sex")
design
## StandOrder RunOrder Replicate A B C D E F G y
## 1 8 1 1 2 2 1 2 1 1 2 NA
## 2 5 2 1 2 1 2 1 2 1 2 NA
## 3 3 3 1 1 2 2 1 1 2 2 NA
## 4 6 4 1 2 1 2 2 1 2 1 NA
## 5 4 5 1 1 2 2 2 2 1 1 NA
## 6 1 6 1 1 1 1 1 1 1 1 NA
## 7 7 7 1 2 2 1 1 2 2 1 NA
## 8 2 8 1 1 1 1 2 2 2 2 NA
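Main effects can be estimated from so few runs because the array is balanced and its columns are pairwise orthogonal. The check below is a sketch in base R; the matrix `l8` is typed in by hand from the columns A to G printed above rather than extracted from the `design` object.

```r
# L8 array copied from the printed design (columns A..G, rows in run order).
l8 <- matrix(c(
  2, 2, 1, 2, 1, 1, 2,
  2, 1, 2, 1, 2, 1, 2,
  1, 2, 2, 1, 1, 2, 2,
  2, 1, 2, 2, 1, 2, 1,
  1, 2, 2, 2, 2, 1, 1,
  1, 1, 1, 1, 1, 1, 1,
  2, 2, 1, 1, 2, 2, 1,
  1, 1, 1, 2, 2, 2, 2
), ncol = 7, byrow = TRUE, dimnames = list(NULL, LETTERS[1:7]))

# Balance: every column has four runs at level 1 and four at level 2.
balanced <- all(apply(l8, 2, function(col) all(table(col) == 4)))

# Orthogonality: for every pair of columns, each of the four
# level combinations occurs exactly twice.
pairs <- combn(ncol(l8), 2)
orthogonal <- all(apply(pairs, 2, function(p) {
  counts <- table(l8[, p[1]], l8[, p[2]])
  all(dim(counts) == c(2, 2)) && all(counts == 2)
}))

c(balanced = balanced, orthogonal = orthogonal)
```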
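We collect the data for our 8 experimental runs based on our design. To build the subsets, we convert language and sex from factors to numerics. For each run, the two indicator columns of a course factor are combined to give the original course count (for example, A = 2 and B = 1 correspond to physiccourse == 1).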
Mathlevel$language <- as.numeric(Mathlevel$language)
Mathlevel$sex <- as.numeric(Mathlevel$sex)
set1 <- subset(Mathlevel, physiccourse == 2 & chemistcourse == 1 & language == 1 & sex == 1)
set2 <- subset(Mathlevel, physiccourse == 1 & chemistcourse == 1 & language == 2 & sex == 1)
set3 <- subset(Mathlevel, physiccourse == 1 & chemistcourse == 1 & language == 1 & sex == 2)
set4 <- subset(Mathlevel, physiccourse == 1 & chemistcourse == 2 & language == 1 & sex == 2)
set5 <- subset(Mathlevel, physiccourse == 1 & chemistcourse == 2 & language == 2 & sex == 1)
set6 <- subset(Mathlevel, physiccourse == 0 & chemistcourse == 0 & language == 1 & sex == 1)
set7 <- subset(Mathlevel, physiccourse == 0 & chemistcourse == 0 & language == 2 & sex == 2)  # note: design run 7 has A = 2 and B = 2, which corresponds to physiccourse == 2, not 0
set8 <- subset(Mathlevel, physiccourse == 0 & chemistcourse == 1 & language == 2 & sex == 2)
run1 <- set1[sample(1:nrow(set1), 1), ]
run2 <- set2[sample(1:nrow(set2), 1), ]
run3 <- set3[sample(1:nrow(set3), 1), ]
run4 <- set4[sample(1:nrow(set4), 1), ]
run5 <- set5[sample(1:nrow(set5), 1), ]
run6 <- set6[sample(1:nrow(set6), 1), ]
run7 <- set7[sample(1:nrow(set7), 1), ]
run8 <- set8[sample(1:nrow(set8), 1), ]
response <- c(run1$sat, run2$sat, run3$sat, run4$sat, run5$sat, run6$sat, run7$sat, run8$sat)
response
## [1] 580 780 650 590 670 560 610 550
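The subset conditions follow from the design rows: with design levels coded 1 and 2, the course count is (A - 1) + (B - 1) for physics and (C - 1) + (D - 1) for chemistry, while columns E and F give language and sex directly. A sketch of that mapping, with the array `l8` typed in from the printed design and the data frame name `runs` chosen only for illustration:

```r
# L8 array copied from the printed design (columns A..G, rows in run order).
l8 <- matrix(c(
  2, 2, 1, 2, 1, 1, 2,
  2, 1, 2, 1, 2, 1, 2,
  1, 2, 2, 1, 1, 2, 2,
  2, 1, 2, 2, 1, 2, 1,
  1, 2, 2, 2, 2, 1, 1,
  1, 1, 1, 1, 1, 1, 1,
  2, 2, 1, 1, 2, 2, 1,
  1, 1, 1, 2, 2, 2, 2
), ncol = 7, byrow = TRUE, dimnames = list(NULL, LETTERS[1:7]))

# Translate each design row into the factor levels used in subset().
runs <- data.frame(
  physiccourse  = (l8[, "A"] - 1) + (l8[, "B"] - 1),
  chemistcourse = (l8[, "C"] - 1) + (l8[, "D"] - 1),
  language      = l8[, "E"],
  sex           = l8[, "F"]
)
runs
```

Under this mapping, run 7 (A = 2, B = 2, C = 1, D = 1) corresponds to physiccourse == 2 and chemistcourse == 0, which is worth checking against the subset condition used for each run.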
Now we add the response column to our design.
response(design) = response
summary(design)
## Taguchi SINGLE Design
## Information about the factors:
##
## A B C
## value 1 1 1 1
## value 2 2 2 2
## name dat$physiccourseA dat$physiccourseB dat$chemistcourseA
## unit
## type numeric numeric numeric
## D E F G
## value 1 1 1 1 1
## value 2 2 2 2 2
## name chemistcourseB language sex <NA>
## unit
## type numeric numeric numeric numeric
##
## -----------
##
## StandOrder RunOrder Replicate A B C D E F G response
## 1 8 1 1 2 2 1 2 1 1 2 580
## 2 5 2 1 2 1 2 1 2 1 2 780
## 3 3 3 1 1 2 2 1 1 2 2 650
## 4 6 4 1 2 1 2 2 1 2 1 590
## 5 4 5 1 1 2 2 2 2 1 1 670
## 6 1 6 1 1 1 1 1 1 1 1 560
## 7 7 7 1 2 2 1 1 2 2 1 610
## 8 2 8 1 1 1 1 2 2 2 2 550
##
## -----------
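According to the main effect plots, physiccourse and sex appear to affect SAT Math Score. With a single unreplicated run per design row, the plots indicate the size of each effect but cannot by themselves establish statistical significance.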
effectPlot(design)
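The quantity behind a main effect plot can also be computed by hand as the mean response at level 2 minus the mean response at level 1 for each column. This is a sketch using the usual definition of a main effect; whether effectPlot displays exactly these differences is an assumption. The array `l8` and the vector `y` are typed in from the design and response printed above.

```r
# L8 array copied from the printed design (columns A..G, rows in run order).
l8 <- matrix(c(
  2, 2, 1, 2, 1, 1, 2,
  2, 1, 2, 1, 2, 1, 2,
  1, 2, 2, 1, 1, 2, 2,
  2, 1, 2, 2, 1, 2, 1,
  1, 2, 2, 2, 2, 1, 1,
  1, 1, 1, 1, 1, 1, 1,
  2, 2, 1, 1, 2, 2, 1,
  1, 1, 1, 2, 2, 2, 2
), ncol = 7, byrow = TRUE, dimnames = list(NULL, LETTERS[1:7]))

# Responses for the 8 runs, as printed in the summary above.
y <- c(580, 780, 650, 590, 670, 560, 610, 550)

# Main effect of a column: mean(y at level 2) - mean(y at level 1).
effects <- apply(l8, 2, function(col) mean(y[col == 2]) - mean(y[col == 1]))
round(effects, 1)
```

Columns A to F correspond to physiccourseA, physiccourseB, chemistcourseA, chemistcourseB, language and sex; column G has no factor assigned in this design.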
In this section, we conduct an ANOVA test. We estimate main effects on the full dataset in order to check the validity of the results. An ANOVA table for the main effects is shown below. The table shows that sex and physiccourse have statistically significant effects on the SAT Math Score, as expected from the main effects determined by our Taguchi design experiment.
model = lm(sat~language+sex+physiccourse+chemistcourse,data=Mathlevel)
anova(model)
## Analysis of Variance Table
##
## Response: sat
## Df Sum Sq Mean Sq F value Pr(>F)
## language 1 11963 11963 3.5283 0.06081 .
## sex 1 66461 66461 19.6012 1.133e-05 ***
## physiccourse 2 24716 12358 3.6446 0.02671 *
## chemistcourse 2 13678 6839 2.0169 0.13396
## Residuals 602 2041192 3391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
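Each row of the table can be reproduced from its sum of squares: the mean square is Sum Sq / Df, the F value is that mean square divided by the residual mean square, and the p-value is the upper tail of the F distribution. The sketch below retypes the numbers from the table above, so small rounding differences are expected. Note also that anova() on an lm fit reports sequential sums of squares, so each term is adjusted only for the terms listed before it.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss <- c(language = 11963, sex = 66461, physiccourse = 24716, chemistcourse = 13678)
df <- c(language = 1, sex = 1, physiccourse = 2, chemistcourse = 2)
ss_resid <- 2041192
df_resid <- 602

ms       <- ss / df                 # mean square per term
ms_resid <- ss_resid / df_resid     # residual mean square
f_value  <- ms / ms_resid
p_value  <- pf(f_value, df, df_resid, lower.tail = FALSE)

round(cbind(Df = df, MeanSq = ms, F = f_value, p = p_value), 5)
```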
To compare the Fractional Factorial Design with the Taguchi Design, we look at the main effects plots for both (effect plot and boxplot). We provide boxplots to examine the main effects of the factors in the Fractional Factorial Design on the response variable, sat.
boxplot(dat$sat~dat$language, xlab="language", ylab="SAT Math Score")
boxplot(dat$sat~dat$sex, xlab="sex", ylab="SAT Math Score")
boxplot(dat$sat~dat$physiccourseA, xlab="physiccourseA", ylab="SAT Math Score")
boxplot(dat$sat~dat$physiccourseB, xlab="physiccourseB", ylab="SAT Math Score")
boxplot(dat$sat~dat$chemistcourseA, xlab="chemistcourseA", ylab="SAT Math Score")
boxplot(dat$sat~dat$chemistcourseB, xlab="chemistcourseB", ylab="SAT Math Score")
We also conduct an ANOVA test for the main effects based on the Fractional Factorial Design.
model_FFD = lm(sat~language+sex+physiccourseA+physiccourseB+chemistcourseA+chemistcourseB,data=dat)
anova(model_FFD)
## Analysis of Variance Table
##
## Response: sat
## Df Sum Sq Mean Sq F value Pr(>F)
## language 1 40659 40659 11.6992 0.000639 ***
## sex 1 198469 198469 57.1074 6.478e-14 ***
## physiccourseA 1 7089 7089 2.0397 0.153413
## physiccourseB 1 33684 33684 9.6923 0.001879 **
## chemistcourseA 1 2080 2080 0.5985 0.439263
## chemistcourseB 1 18501 18501 5.3236 0.021150 *
## Residuals 1833 6370355 3475
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A researcher using a Taguchi design focuses only on the main effects of the few factors that contribute most to the response. The purpose of the comparison is therefore to check how a Taguchi design compares with a fractional factorial design in estimating main effects.
Both designs can estimate main effects with only 8 experimental runs for our experiment with six 2-level factors. They require fewer experimental runs and therefore save considerable time and resources. However, neither can guarantee the statistical precision of results based on the full dataset.
The results of the two analyses differ considerably although they use the same dataset. Comparing them with the ANOVA conducted on the full dataset, the Taguchi model appears to give a better estimate of the main effects than the Fractional Factorial Design.
For this dataset, the Taguchi design appears to be the more effective design. A plausible explanation is that the main effects in the Fractional Factorial Design are aliased (confounded) with the 2-factor interactions.
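Aliasing can be checked numerically on any two-level design matrix: recode the levels to -1/+1, multiply two columns to obtain their interaction contrast, and compare the product with the remaining columns. The sketch below applies this to the L8 array printed earlier (typed in by hand as `l8`). In that array the A×B product equals the negative of column C, so with seven columns in eight runs a main-effect column can also coincide with a two-factor interaction.

```r
# L8 array copied from the printed design (columns A..G, rows in run order).
l8 <- matrix(c(
  2, 2, 1, 2, 1, 1, 2,
  2, 1, 2, 1, 2, 1, 2,
  1, 2, 2, 1, 1, 2, 2,
  2, 1, 2, 2, 1, 2, 1,
  1, 2, 2, 2, 2, 1, 1,
  1, 1, 1, 1, 1, 1, 1,
  2, 2, 1, 1, 2, 2, 1,
  1, 1, 1, 2, 2, 2, 2
), ncol = 7, byrow = TRUE, dimnames = list(NULL, LETTERS[1:7]))

# Recode levels 1/2 as -1/+1 so that products represent interactions.
coded <- 2 * l8 - 3

# Interaction contrast of A and B, compared with main-effect column C.
ab <- coded[, "A"] * coded[, "B"]
alias_AB_C <- cor(ab, coded[, "C"])
alias_AB_C
```

A correlation of exactly +1 or -1 means the two contrasts cannot be separated by this design; a correlation of 0 means they are estimated independently.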
First, we look at the histogram of sat to check the assumption of normality. The distribution appears approximately normal.
hist(Mathlevel$sat, main = "SAT Math Score")
Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. An approximately linear pattern supports the use of ANOVA to test for significant differences. In the Q-Q plot, the residuals fall close to a straight line, so the normality assumption appears to be met.
qqnorm(residuals(model))
qqline(residuals(model))
The Residuals vs. Fits plot is a common graph used in residual analysis. It is a scatter plot of the residuals against the fitted values, that is, the estimated responses. In the Residuals vs. Fits plot, the points appear randomly scattered, although there are a few outliers and some visible structure.
plot(fitted(model),residuals(model))
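The visual checks can be complemented with numerical ones. The sketch below uses simulated data (the data frame `toy` and the model `fit` are invented stand-ins, not the Mathlevel model) to show a Shapiro-Wilk test on the residuals and a per-group comparison of residual spread. Both functions are in base R's stats package.

```r
# Simulated stand-in data; NOT the Mathlevel dataset.
set.seed(42)
toy <- data.frame(group = factor(rep(c("a", "b", "c"), each = 40)))
toy$score <- 600 + 20 * as.integer(toy$group) + rnorm(nrow(toy), sd = 50)

fit <- lm(score ~ group, data = toy)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals.
sw <- shapiro.test(residuals(fit))
sw$p.value

# Residual standard deviation per group: similar values are
# consistent with the constant-variance assumption.
resid_sd <- tapply(residuals(fit), toy$group, sd)
resid_sd
```

With the real data, `fit` would be replaced by the fitted model from the ANOVA section.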
Montgomery, Douglas C. 2012. Design and Analysis of Experiments, 8th Edition.
The Mathlevel dataset can be found by installing and loading the Ecdat R package.