Among a dataset from the Ecdat R Package, we select “Mathlevel” which would be useful for expecting SAT Math Score.
In Mathlevel data, ‘language’, ‘sex’, ‘physiccourse’ and ‘chemistcourse’ are selected as factors, which could explain the result of SAT Math Score and have 2 levels, 2 levels, 3 levels and 3 levels respectively. And ‘sat’ is chosen as a response variable.
The orginal dataset was collected for examining effect of level of math on learning Economics and consist of 8 variables and 609 observations. Original dataframe containing:
mathlevel highest level of math attained , an ordered factor with levels 170, 171a, 172, 171b, 172b, 221a, 221b
sat sat math score
language status of foreign language proficiency
sex male, female
major one of other, eco, oss (other social sciences), ns (natural sciences), hum (humanities)
mathcourse number of courses in advanced math (0 to 3)
physiccourse number of courses in physics (0 to 2)
chemistcourse number of courses in chemistry (0 to 2)
We will use some factors among them in order to conduct the analysis for the Project 3.
We can inspect the head and tail of the dataframe in order to see what the data look like.
load("C:/Users/bokjh3/Desktop/Ecdat_0.2-9/Ecdat/data/Mathlevel.rda")
head(Mathlevel, n=10)
## mathlevel sat language sex major mathcourse physiccourse
## 1 170 670 no male ns 1 2
## 2 170 660 no male other 1 1
## 3 170 610 no female eco 1 0
## 4 170 620 yes male eco 1 0
## 5 170 430 no male eco 0 1
## 6 170 580 no female oss 2 1
## 7 170 550 yes female other 1 0
## 8 170 510 no female eco 1 1
## 9 170 560 yes male hum 1 0
## 10 170 670 no male oss 1 0
## chemistcourse
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 0
## 10 1
tail(Mathlevel, n=10)
## mathlevel sat language sex major mathcourse physiccourse
## 600 221b 660 no female ns 2 1
## 601 221b 670 no female ns 2 1
## 602 221b 670 no male other 2 1
## 603 221b 660 no male ns 1 0
## 604 221b 590 no female ns 2 1
## 605 221b 580 no female oss 2 1
## 606 221b 770 no male oss 2 1
## 607 221b 660 no male other 2 1
## 608 221b 710 no female eco 2 0
## 609 221b 590 no female oss 2 0
## chemistcourse
## 600 1
## 601 1
## 602 1
## 603 1
## 604 1
## 605 1
## 606 1
## 607 1
## 608 1
## 609 1
4 factors selected are language, sex, physiccourse and chemistcourse. We convert variables of physiccourse and chemistcourse from numerics to factors for the analysis.
language status of foreign language proficiency
sex male, female
physiccourse number of courses in physics (0 to 2)
chemistcourse number of courses in chemistry (0 to 2)
Mathlevel$physiccourse <- as.factor(Mathlevel$physiccourse)
Mathlevel$chemistcourse <- as.factor(Mathlevel$chemistcourse)
str(Mathlevel)
## 'data.frame': 609 obs. of 8 variables:
## $ mathlevel : Ord.factor w/ 7 levels "170"<"171a"<"172a"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sat : int 670 660 610 620 430 580 550 510 560 670 ...
## $ language : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 2 1 ...
## $ sex : Factor w/ 2 levels "male","female": 1 1 2 1 1 2 2 2 1 1 ...
## $ major : Factor w/ 5 levels "other","eco",..: 4 1 2 2 2 3 1 2 5 3 ...
## $ mathcourse : num 1 1 1 1 0 2 1 1 1 1 ...
## $ physiccourse : Factor w/ 3 levels "0","1","2": 3 2 1 1 2 2 1 2 1 1 ...
## $ chemistcourse: Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 1 2 ...
Sat is a continous variable. In order to conduct ANOVA, a continuous dependent variable must be choosed as the response variable among the variables in the dataset.
The response variable is sat. The sat is defined as SAT Math Score. Sat is a continuous variable. We investigate the effects on factors such as language, sex, physiccourse, and chemistcourse on the response variable, sat.
The “Mathlevel” data is a cross-section data from 1983 to 1986 in United States. The number of observations is 609. The dataset is organized by the following variables: mathlevel, sat, language, sex, major, mathcourse, physiccourse and chemistcourse.
We select language, sex, physiccourse and chemistcourse as factors, which have 2, 2, 3 and 3 levles respectively, and sat as a response variable, which is a continous variable.
Fractional factorial design is conducted in order to reduce the computing power and runs necessary to reach an appropriate conclusion. A fractional factorial design of the 2m-3 type will be utilized in order to make the calculations simple.
The factors with 3 levels will be decomposed into factors with 2-level for calculating the required data needed to obtain appropriate data. For these, the sum of the variables with 2 levels will yield the value with 3 levels.
The results from the factional factorial design will be evaluated for aliasing in order to decide the true amount of data which comes from more limited fractional design. Main and interaction effects are calculated using ANOVA analysis.
A fractional factorial model is utilized in order to reduce the computing power required to get an accurate solution. Alias testing is used to decide the accuracy and resolution of the design. This enables the researcher to know the limitations of this level of design, and determine if another method is more suitable for the research.
Randomization is a technique used to balance the effect of extraneous or uncontrollable conditions that can impact the results of an experiment. In this experiment, we do not consider randomization becasue we do not have control over data collection, but we assume that this dataset satisfies random design assumptions.
Replicates are multiple experimental runs with the same factor settings (levels). Replicates are subject to the same sources of variability, independently of each other. We can replicate combinations of factor levels, groups of factor level combinations, or entire designs. There is no replication in this research.
In experimental design, blocking is a technique used to deal with nuisance factors that may affect the results of the experiment. The experiment is organized into blocks, where the nuisance factor is maintained at a constant level in each block. Blocking is not utilized in this research.
At first, we look at the histogram of sat to see if it satisfies the assumption of normality. As you can see, the distribution meets the assumption of normality.
hist(Mathlevel$sat, main = "SAT Math Score")
And then, the levels of each factor are shown in a boxplot to analyze the main effects of factors over the response variable, sat.
boxplot(Mathlevel$sat~Mathlevel$language, xlab="language", ylab="SAT Math Score")
title("Impact of language on SAT Math Score")
boxplot(Mathlevel$sat~Mathlevel$sex, xlab="sex", ylab="SAT Math Score")
title("Impact of sex on SAT Math Score")
boxplot(Mathlevel$sat~Mathlevel$physiccourse, xlab="number of courses in physics", ylab="SAT Math Score")
title("Impact of physiccourse on SAT Math Score")
boxplot(Mathlevel$sat~Mathlevel$chemistcourse, xlab="number of courses in chemistry", ylab="SAT Math Score")
title("Impact of chemistcourse on SAT Math Score")
In this research, 2k fractional factorial design is utilized. The dataset consists of 4 factors of which 2 factors have 2 levels and the other 2 factors have 3 levels. These 2 factors with 3 levels will be decomposed into 2-level factors respectively by the following process. Six 2-level factors remain and will be utilized in the research.
x = Mathlevel[1,]
physiccourse = c(0,0)
L = levels(Mathlevel[,'physiccourse'])
for(i in seq(2,dim(Mathlevel)[1])){
if(Mathlevel[i,'physiccourse']==L[1]){
x = rbind(x,Mathlevel[i,])
physiccourse = rbind(physiccourse,c(0,0))
}
else if(Mathlevel[i,'physiccourse']==L[2]){
x = rbind(x,Mathlevel[i,])
x = rbind(x,Mathlevel[i,])
physiccourse = rbind(physiccourse,c(1,0))
physiccourse = rbind(physiccourse,c(0,1))
}
else{
x = rbind(x,Mathlevel[i,])
physiccourse = rbind(physiccourse,c(1,1))
}
}
x = cbind(x,physiccourse)
x = x[,c(1,2,3,4,5,6,8,9,10)]
colnames(x) = c(colnames(Mathlevel[1,c(1,2,3,4,5,6,8)]),'physiccourseA','physiccourseB')
Lc = levels(Mathlevel[,'chemistcourse'])
chemistcourse = c(1,0)
chemistcourse = rbind(chemistcourse,c(0,1))
dat = x[1,]
dat = rbind(dat,dat)
for(i in seq(2,dim(x)[1])){
if(x[i,'chemistcourse']==Lc[1]){
dat = rbind(dat,x[i,])
chemistcourse = rbind(chemistcourse,c(0,0))
}
else if(x[i,'chemistcourse']==Lc[2]){
dat = rbind(dat,x[i,])
dat = rbind(dat,x[i,])
chemistcourse = rbind(chemistcourse,c(1,0))
chemistcourse = rbind(chemistcourse,c(0,1))
}
else{
dat = rbind(dat,x[i,])
chemistcourse = rbind(chemistcourse,c(1,1))
}
}
dat = cbind(dat,chemistcourse)
colnames(dat) = c(colnames(x),'chemistcourseA','chemistcourseB')
dat = dat[,c(1,2,3,4,5,6,7,8,9,10,11)]
And, we look at the full factorial experimental design of 64 runs
expand.grid(language = c(0,1), sex = c(0,1), physiccourseA = c(0,1), physiccourseB = c(0,1), chemistcourseA = c(0,1), chemistcourseB = c(0,1))
## language sex physiccourseA physiccourseB chemistcourseA chemistcourseB
## 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 0 0 1 0 0 0
## 6 1 0 1 0 0 0
## 7 0 1 1 0 0 0
## 8 1 1 1 0 0 0
## 9 0 0 0 1 0 0
## 10 1 0 0 1 0 0
## 11 0 1 0 1 0 0
## 12 1 1 0 1 0 0
## 13 0 0 1 1 0 0
## 14 1 0 1 1 0 0
## 15 0 1 1 1 0 0
## 16 1 1 1 1 0 0
## 17 0 0 0 0 1 0
## 18 1 0 0 0 1 0
## 19 0 1 0 0 1 0
## 20 1 1 0 0 1 0
## 21 0 0 1 0 1 0
## 22 1 0 1 0 1 0
## 23 0 1 1 0 1 0
## 24 1 1 1 0 1 0
## 25 0 0 0 1 1 0
## 26 1 0 0 1 1 0
## 27 0 1 0 1 1 0
## 28 1 1 0 1 1 0
## 29 0 0 1 1 1 0
## 30 1 0 1 1 1 0
## 31 0 1 1 1 1 0
## 32 1 1 1 1 1 0
## 33 0 0 0 0 0 1
## 34 1 0 0 0 0 1
## 35 0 1 0 0 0 1
## 36 1 1 0 0 0 1
## 37 0 0 1 0 0 1
## 38 1 0 1 0 0 1
## 39 0 1 1 0 0 1
## 40 1 1 1 0 0 1
## 41 0 0 0 1 0 1
## 42 1 0 0 1 0 1
## 43 0 1 0 1 0 1
## 44 1 1 0 1 0 1
## 45 0 0 1 1 0 1
## 46 1 0 1 1 0 1
## 47 0 1 1 1 0 1
## 48 1 1 1 1 0 1
## 49 0 0 0 0 1 1
## 50 1 0 0 0 1 1
## 51 0 1 0 0 1 1
## 52 1 1 0 0 1 1
## 53 0 0 1 0 1 1
## 54 1 0 1 0 1 1
## 55 0 1 1 0 1 1
## 56 1 1 1 0 1 1
## 57 0 0 0 1 1 1
## 58 1 0 0 1 1 1
## 59 0 1 0 1 1 1
## 60 1 1 0 1 1 1
## 61 0 0 1 1 1 1
## 62 1 0 1 1 1 1
## 63 0 1 1 1 1 1
## 64 1 1 1 1 1 1
Evidently, this would require a lot of time to examine, therefore we make similar to it with a fractional factorial design. The fractional factorial design is \(\sf{2_{Ⅲ}}\)6-3 which yields 8 experimental runs. The R package FrF2 provides functions for creating such a design. The highest resolution possible for this research (6 factors, 8 runs) is III. We can also utilize the FrF2 package to decide the aliasing structure of the fractional factorial design, and the generators of the fractional factorial design.
library(FrF2)
## Warning: package 'FrF2' was built under R version 3.3.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.3.2
##
## Attaching package: 'DoE.base'
## The following objects are masked from 'package:stats':
##
## aov, lm
## The following object is masked from 'package:graphics':
##
## plot.design
## The following object is masked from 'package:base':
##
## lengths
mathscore = FrF2(8,6,factor.names = c('language','sex','physiccourseA','physiccourseB', 'chemistcourseA', 'chemistcourseB'))
mathscore
## language sex physiccourseA physiccourseB chemistcourseA chemistcourseB
## 1 -1 -1 1 1 -1 -1
## 2 1 -1 -1 -1 -1 1
## 3 -1 1 1 -1 -1 1
## 4 -1 -1 -1 1 1 1
## 5 1 1 1 1 1 1
## 6 -1 1 -1 -1 1 -1
## 7 1 1 -1 1 -1 -1
## 8 1 -1 1 -1 1 -1
## class=design, type= FrF2
From this, we learn that the generators of the fractional factorial design are D = AB, E = AC, F = BC. The generator for the fractional factorial experiment can also be calculated separately. According to the interactions D = AB, E = AC, and F = BC, the generators lead to a generator of I = ABD = ACE = BCF.
summary(mathscore)
## Call:
## FrF2(8, 6, factor.names = c("language", "sex", "physiccourseA",
## "physiccourseB", "chemistcourseA", "chemistcourseB"))
##
## Experimental design of type FrF2
## 8 runs
##
## Factor settings (scale ends):
## language sex physiccourseA physiccourseB chemistcourseA chemistcourseB
## 1 -1 -1 -1 -1 -1 -1
## 2 1 1 1 1 1 1
##
## Design generating information:
## $legend
## [1] A=language B=sex C=physiccourseA D=physiccourseB
## [5] E=chemistcourseA F=chemistcourseB
##
## $generators
## [1] D=AB E=AC F=BC
##
##
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
##
## $fi2
## [1] AF=BE=CD
##
##
## The design itself:
## language sex physiccourseA physiccourseB chemistcourseA chemistcourseB
## 1 -1 -1 1 1 -1 -1
## 2 1 -1 -1 -1 -1 1
## 3 -1 1 1 -1 -1 1
## 4 -1 -1 -1 1 1 1
## 5 1 1 1 1 1 1
## 6 -1 1 -1 -1 1 -1
## 7 1 1 -1 1 -1 -1
## 8 1 -1 1 -1 1 -1
## class=design, type= FrF2
The table below provides which factors are confounded. (the aliasing structure). Because this design is a resolution III, the aliasing structure shows that the main effects are aliased with some of the 2 factor interactions, which are highly aliased with each other: AF = BE = CD. Therefore, the main effects are reliable to analyze.
design.info(mathscore)$aliased
## $legend
## [1] "A=language" "B=sex" "C=physiccourseA"
## [4] "D=physiccourseB" "E=chemistcourseA" "F=chemistcourseB"
##
## $main
## [1] "A=BD=CE" "B=AD=CF" "C=AE=BF" "D=AB=EF" "E=AC=DF" "F=BC=DE"
##
## $fi2
## [1] "AF=BE=CD"
In this section, we conduct ANOVA test. We estimate main effects and all the 2-factor interaction effects.
model1 = lm(sat~(language+sex+physiccourseA+physiccourseB+chemistcourseA+chemistcourseB)^2,data=dat)
anova(model1)
## Analysis of Variance Table
##
## Response: sat
## Df Sum Sq Mean Sq F value Pr(>F)
## language 1 40659 40659 11.7450 0.0006236 ***
## sex 1 198469 198469 57.3308 5.824e-14 ***
## physiccourseA 1 7089 7089 2.0477 0.1526123
## physiccourseB 1 33684 33684 9.7302 0.0018410 **
## chemistcourseA 1 2080 2080 0.6008 0.4383696
## chemistcourseB 1 18501 18501 5.3444 0.0209001 *
## language:sex 1 36675 36675 10.5940 0.0011554 **
## language:physiccourseA 1 977 977 0.2822 0.5953439
## language:physiccourseB 1 4154 4154 1.1999 0.2734860
## language:chemistcourseA 1 1138 1138 0.3288 0.5664473
## language:chemistcourseB 1 10688 10688 3.0875 0.0790632 .
## sex:physiccourseA 1 1952 1952 0.5638 0.4528316
## sex:physiccourseB 1 8522 8522 2.4617 0.1168264
## sex:chemistcourseA 1 315 315 0.0911 0.7627861
## sex:chemistcourseB 1 2976 2976 0.8598 0.3539273
## physiccourseA:physiccourseB 1 2168 2168 0.6262 0.4288667
## physiccourseA:chemistcourseA 1 3 3 0.0009 0.9763914
## physiccourseA:chemistcourseB 1 27 27 0.0077 0.9298829
## physiccourseB:chemistcourseA 1 15 15 0.0042 0.9482711
## physiccourseB:chemistcourseB 1 132 132 0.0380 0.8454004
## chemistcourseA:chemistcourseB 1 7017 7017 2.0271 0.1546890
## Residuals 1818 6293597 3462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An ANOVA table for the main effects is shown below. The table shows that the physiccourseA and chemistcourseA do not have statistically significant effects on the SAT Math Score. However, we already know that the main effects for physiccourseA and chemistcourseA are confounded with the 2-factor interactions. The conclusion remains the same because the main effects are aliased with the 2-factor interactions.
model2 = lm(sat~language+sex+physiccourseA+physiccourseB+chemistcourseA+chemistcourseB,data=dat)
anova(model2)
## Analysis of Variance Table
##
## Response: sat
## Df Sum Sq Mean Sq F value Pr(>F)
## language 1 40659 40659 11.6992 0.000639 ***
## sex 1 198469 198469 57.1074 6.478e-14 ***
## physiccourseA 1 7089 7089 2.0397 0.153413
## physiccourseB 1 33684 33684 9.6923 0.001879 **
## chemistcourseA 1 2080 2080 0.5985 0.439263
## chemistcourseB 1 18501 18501 5.3236 0.021150 *
## Residuals 1833 6370355 3475
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The parameters for the second model are shown in the table below.
coef(model2)
## (Intercept) languageyes sexfemale physiccourseA physiccourseB
## 611.15334 20.16151 -20.52596 11.43258 11.43258
## chemistcourseA chemistcourseB
## 10.52719 10.52719
Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. The relatively linear relationship for all data sets justifies the use of ANOVA to test for the significant difference. From the QQ plot, the residuals nearly form a linear line, and we can find out that the assumptions of normality are met.
qqnorm(residuals(model2))
qqline(residuals(model2))
Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. From the Residuals vs. Fits plot, the distribution of points seems random although we can see a few outliers and some linearity.
plot(fitted(model2),residuals(model2))
Montgomery, Douglas C. 2012. Design and Analysis of Experiments, 8th Edition.
The Mathlevel dataset can be found by installing and loading the Ecdat R package.
#load("C:/Users/bokjh3/Desktop/Ecdat_0.2-9/Ecdat/data/Mathlevel.rda")
#head(Mathlevel, n=10)
#tail(Mathlevel, n=10)
#Mathlevel$physiccourse <- as.factor(Mathlevel$physiccourse)
#Mathlevel$chemistcourse <- as.factor(Mathlevel$chemistcourse)
#str(Mathlevel)
#hist(Mathlevel$sat, main = "SAT Math Score")
#boxplot(Mathlevel$sat~Mathlevel$language, xlab="language", ylab="SAT Math Score")
#title("Impact of language on SAT Math Score")
#boxplot(Mathlevel$sat~Mathlevel$sex, xlab="sex", ylab="SAT Math Score")
#title("Impact of sex on SAT Math Score")
#boxplot(Mathlevel$sat~Mathlevel$physiccourse, xlab="number of courses in physics", ylab="SAT Math Score")
#title("Impact of physiccourse on SAT Math Score")
#boxplot(Mathlevel$sat~Mathlevel$chemistcourse, xlab="number of courses in chemistry", ylab="SAT Math Score")
#title("Impact of chemistcourse on SAT Math Score")
#x = Mathlevel[1,]
#physiccourse = c(0,0)
#L = levels(Mathlevel[,'physiccourse'])
#for(i in seq(2,dim(Mathlevel)[1])){
# if(Mathlevel[i,'physiccourse']==L[1]){
# x = rbind(x,Mathlevel[i,])
# physiccourse = rbind(physiccourse,c(0,0))
# }
# else if(Mathlevel[i,'physiccourse']==L[2]){
# x = rbind(x,Mathlevel[i,])
# x = rbind(x,Mathlevel[i,])
# physiccourse = rbind(physiccourse,c(1,0))
# physiccourse = rbind(physiccourse,c(0,1))
# }
# else{
# x = rbind(x,Mathlevel[i,])
# physiccourse = rbind(physiccourse,c(1,1))
# }
#}
#x = cbind(x,physiccourse)
#x = x[,c(1,2,3,4,5,6,8,9,10)]
#colnames(x) = c(colnames(Mathlevel[1,c(1,2,3,4,5,6,8)]),'physiccourseA','physiccourseB')
#Lc = levels(Mathlevel[,'chemistcourse'])
#chemistcourse = c(1,0)
#chemistcourse = rbind(chemistcourse,c(0,1))
#dat = x[1,]
#dat = rbind(dat,dat)
#for(i in seq(2,dim(x)[1])){
# if(x[i,'chemistcourse']==Lc[1]){
# dat = rbind(dat,x[i,])
# chemistcourse = rbind(chemistcourse,c(0,0))
# }
# else if(x[i,'chemistcourse']==Lc[2]){
# dat = rbind(dat,x[i,])
# dat = rbind(dat,x[i,])
# chemistcourse = rbind(chemistcourse,c(1,0))
# chemistcourse = rbind(chemistcourse,c(0,1))
# }
# else{
# dat = rbind(dat,x[i,])
# chemistcourse = rbind(chemistcourse,c(1,1))
# }
#}
#dat = cbind(dat,chemistcourse)
#colnames(dat) = c(colnames(x),'chemistcourseA','chemistcourseB')
#dat = dat[,c(1,2,3,4,5,6,7,8,9,10,11)]
#expand.grid(language = c(0,1), sex = c(0,1), physiccourseA = c(0,1), physiccourseB = c(0,1), chemistcourseA = c(0,1), chemistcourseB = c(0,1))
#library(FrF2)
#mathscore = FrF2(8,6,factor.names = c('language','sex','physiccourseA','physiccourseB', 'chemistcourseA', 'chemistcourseB'))
#mathscore
#summary(mathscore)
#design.info(mathscore)$aliased
#model1 = lm(sat~(language+sex+physiccourseA+physiccourseB+chemistcourseA+chemistcourseB)^2,data=dat)
#anova(model1)
#model2 = lm(sat~language+sex+physiccourseA+physiccourseB+chemistcourseA+chemistcourseB,data=dat)
#anova(model2)
#coef(model2)
#qqnorm(residuals(model2))
#qqline(residuals(model2))
#plot(fitted(model2),residuals(model2))