Munira Shahir
Rensselaer Polytechnic Institute
The data set tested comes from the Ecdat package in R and the data set is named Mathlevel. The data set is a cross-section from 1983-1986 of 609 individuals in the United States.
## Loading required package: Ecfun
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
## Warning: package 'FrF2' was built under R version 3.3.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.3.2
##
## Attaching package: 'DoE.base'
## The following objects are masked from 'package:stats':
##
## aov, lm
## The following object is masked from 'package:graphics':
##
## plot.design
## The following object is masked from 'package:base':
##
## lengths
The data set contains the following factors:
mathlevel:higest level of math attained (170,171a,172,171b,172b,221a,221b)
language: foreign langauge proficiency (yes,no)
sex: levels are male,female
major: level is one of the following majors one of other,eco,oss(other social sciences),ns (natural sciences), hum (humanities)
mathcourse: level is number of courses in adbanced math (0 to 3)
physsiccourse: level is number of courses in physics (0,2)
chemistcourse: level is number of courses in chemistry (0,2)
There is one continous variable in the data set: sat: SAT Math Score
The response is the SAT Math Score sat.
Below are the first 6 observations of the Mathlevel data set
## mathlevel sat language sex major mathcourse physiccourse
## 1 170 670 no male ns 1 2
## 2 170 660 no male other 1 1
## 3 170 610 no female eco 1 0
## 4 170 620 yes male eco 1 0
## 5 170 430 no male eco 0 1
## 6 170 580 no female oss 2 1
## chemistcourse
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
Below is the structure of the data set
## 'data.frame': 609 obs. of 8 variables:
## $ mathlevel : Ord.factor w/ 7 levels "170"<"171a"<"172a"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sat : int 670 660 610 620 430 580 550 510 560 670 ...
## $ language : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 2 1 ...
## $ sex : Factor w/ 2 levels "male","female": 1 1 2 1 1 2 2 2 1 1 ...
## $ major : Factor w/ 5 levels "other","eco",..: 4 1 2 2 2 3 1 2 5 3 ...
## $ mathcourse : num 1 1 1 1 0 2 1 1 1 1 ...
## $ physiccourse : num 2 1 0 0 1 1 0 1 0 0 ...
## $ chemistcourse: num 1 1 1 1 1 1 1 1 0 1 ...
This data set consists of 609 observations of 8 variables.
The purpose of this experiment is to observe the impact of a multitude of factors on the highest math level of a student. A fractional factorial design is used.
A \(2^{6-3}\) fractional factorial design will be used. The two 2-level factors for this experiment are language and sex. The two 3-level factors for this experiment are physsiccourse and chemistcourse. Both of these three-level factors will be decomposed to 2-level factors.
There is no randomization in the data set.
There are no replicates in the original dataset.
Blocking is not used in this design. However, it is possible to block if necessary.
Before decomposing the two 3-level factors, exploratory data analysis will be conducted.
boxplot(SMS~S,xlab="Sex of Student",ylab="SAT Math Score")
boxplot(SMS~FL,xlab="Is the student proficient in a foreign language",ylab="SAT Math Score")
boxplot(SMS~PC,xlab= "Number of physic courses the student has taken",ylab="SAT Math Score")
boxplot(SMS~CC,xlab= "Number of chemistry courses the student has taken",ylab="SAT Math Score")
The two 3-level factors are each split into two 2-level factors. The new factors are Atmost1phy,Atleast1phy,Atmost1chem, and Atleast1chem which refers to have at least 1 or at most 1 physic or chemistry course.
Atmost1phy <- matrix(0,nrow=nrow(Mathlevel),ncol=1)
Atleast1phy <- matrix(0,nrow=nrow(Mathlevel),ncol=1)
Atmost1chem <- matrix(0,nrow=nrow(Mathlevel),ncol=1)
Atleast1chem <- matrix(0,nrow=nrow(Mathlevel),ncol=1)
project3 <- data.frame(Mathlevel$sat,Mathlevel$language,Mathlevel$sex,Mathlevel$physiccourse,Mathlevel$chemistcourse)
colnames<-c("sat","language","sex","physiccourse","chemistcourse")
colnames(project3)<-colnames
Mathlevel2 <- data.frame(project3,Atmost1phy,Atleast1phy,Atmost1chem,Atleast1chem)
for (i in 1:609){
if(Mathlevel2$physiccourse[i]==1){
Mathlevel2$Atmost1phy[i] <- 1
Mathlevel2$Atleast1phy[i] <- 1
}
else if (Mathlevel2$physiccourse[i]==0){
Mathlevel2$Atmost1phy[i] <- 0
Mathlevel2$Atleast1phy[i] <- 1
}
else{
Mathlevel2$Atmost1phy[i] <- 1
Mathlevel2$Atleast1phy[i] <- 0
}
if(Mathlevel2$chemistcourse[i]==1){
Mathlevel2$Atmost1chem[i] <- 1
Mathlevel2$Atleast1chem[i] <- 1
}
else if (Mathlevel2$chemistcourse[i]==0){
Mathlevel2$Atmost1chem[i] <- 0
Mathlevel2$Atleast1chem[i] <- 1
}
else{
Mathlevel2$Atmost1chem[i] <- 1
Mathlevel2$Atleast1chem[i] <- 0
}
}
head(Mathlevel2$sat~(Mathlevel2))
## Mathlevel2$sat ~ (Mathlevel2)
The full factorial expirimental design of 2^6 = 64 runs is below:
expand.grid(language=c(0,1),sex=c(0,1),Atmost1phy=c(0,1),Atleast1phy=c(0,1),Atmost1chem=c(0,1),Atleast1chem=c(0,1))
## language sex Atmost1phy Atleast1phy Atmost1chem Atleast1chem
## 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 0 0 1 0 0 0
## 6 1 0 1 0 0 0
## 7 0 1 1 0 0 0
## 8 1 1 1 0 0 0
## 9 0 0 0 1 0 0
## 10 1 0 0 1 0 0
## 11 0 1 0 1 0 0
## 12 1 1 0 1 0 0
## 13 0 0 1 1 0 0
## 14 1 0 1 1 0 0
## 15 0 1 1 1 0 0
## 16 1 1 1 1 0 0
## 17 0 0 0 0 1 0
## 18 1 0 0 0 1 0
## 19 0 1 0 0 1 0
## 20 1 1 0 0 1 0
## 21 0 0 1 0 1 0
## 22 1 0 1 0 1 0
## 23 0 1 1 0 1 0
## 24 1 1 1 0 1 0
## 25 0 0 0 1 1 0
## 26 1 0 0 1 1 0
## 27 0 1 0 1 1 0
## 28 1 1 0 1 1 0
## 29 0 0 1 1 1 0
## 30 1 0 1 1 1 0
## 31 0 1 1 1 1 0
## 32 1 1 1 1 1 0
## 33 0 0 0 0 0 1
## 34 1 0 0 0 0 1
## 35 0 1 0 0 0 1
## 36 1 1 0 0 0 1
## 37 0 0 1 0 0 1
## 38 1 0 1 0 0 1
## 39 0 1 1 0 0 1
## 40 1 1 1 0 0 1
## 41 0 0 0 1 0 1
## 42 1 0 0 1 0 1
## 43 0 1 0 1 0 1
## 44 1 1 0 1 0 1
## 45 0 0 1 1 0 1
## 46 1 0 1 1 0 1
## 47 0 1 1 1 0 1
## 48 1 1 1 1 0 1
## 49 0 0 0 0 1 1
## 50 1 0 0 0 1 1
## 51 0 1 0 0 1 1
## 52 1 1 0 0 1 1
## 53 0 0 1 0 1 1
## 54 1 0 1 0 1 1
## 55 0 1 1 0 1 1
## 56 1 1 1 0 1 1
## 57 0 0 0 1 1 1
## 58 1 0 0 1 1 1
## 59 0 1 0 1 1 1
## 60 1 1 0 1 1 1
## 61 0 0 1 1 1 1
## 62 1 0 1 1 1 1
## 63 0 1 1 1 1 1
## 64 1 1 1 1 1 1
It’s apparent that this would take a while to complete this analysis. Therefore, it will be approximated with a fraction factorial design. The fractional factorial design chosen is a 2^6-3 design by utilizing the R package FrF2. The highest possible resolution for this design is Resolution III. The FRF2 package is also used to determine the aliasing structure of the fractional factorial design and the generators of the fractional factorial design.
design <-FrF2(8,factor.names = c("language","sex","Atmost1phy","Atleast1phy","Atmost1chem","Atleast1chem"),default.levels = c("0","1"))
design
## language sex Atmost1phy Atleast1phy Atmost1chem Atleast1chem
## 1 1 1 0 1 0 0
## 2 0 1 1 0 0 1
## 3 0 0 1 1 0 0
## 4 0 0 0 1 1 1
## 5 0 1 0 0 1 0
## 6 1 0 0 0 0 1
## 7 1 0 1 0 1 0
## 8 1 1 1 1 1 1
## class=design, type= FrF2
summary(design)
## Call:
## FrF2(8, factor.names = c("language", "sex", "Atmost1phy", "Atleast1phy",
## "Atmost1chem", "Atleast1chem"), default.levels = c("0", "1"))
##
## Experimental design of type FrF2
## 8 runs
##
## Factor settings (scale ends):
## language sex Atmost1phy Atleast1phy Atmost1chem Atleast1chem
## 1 0 0 0 0 0 0
## 2 1 1 1 1 1 1
##
## Design generating information:
## $legend
## [1] A=language B=sex C=Atmost1phy D=Atleast1phy
## [5] E=Atmost1chem F=Atleast1chem
##
## $generators
## [1] D=AB E=AC F=BC
##
##
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
##
## $fi2
## [1] AF=BE=CD
##
##
## The design itself:
## language sex Atmost1phy Atleast1phy Atmost1chem Atleast1chem
## 1 1 1 0 1 0 0
## 2 0 1 1 0 0 1
## 3 0 0 1 1 0 0
## 4 0 0 0 1 1 1
## 5 0 1 0 0 1 0
## 6 1 0 0 0 0 1
## 7 1 0 1 0 1 0
## 8 1 1 1 1 1 1
## class=design, type= FrF2
aliasprint(design)
## $legend
## [1] A=language B=sex C=Atmost1phy D=Atleast1phy
## [5] E=Atmost1chem F=Atleast1chem
##
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
##
## $fi2
## [1] AF=BE=CD
From the above result, it’s apparent that the generators of the fractional factorial design are D=AB,E=AC,F=BC. From the aliasing structure, it’s apparent that the main effects are aliased with some of the two factor interations, which are highly aliased with each other: AF=BE=CD. Because the design is a Resolution III design, only the main effects are a reliable source of analysis.
From the aliased interactions (D=AB, E=AC, and F= BC), the generator I is calculated as I=ABD=ACE=BCF
The main and interaction effects can be estimated by using a linear model and ANOVA analysis. This can be seen below, with all of the main effects and all of the two-factor interaction effects:
fit <- lm(Mathlevel2$sat~(Mathlevel2$language+Mathlevel2$sex+Mathlevel2$Atmost1phy+Mathlevel2$Atleast1phy+Mathlevel2$Atmost1chem+Mathlevel2$Atleast1chem)^2,data = Mathlevel2)
anova(fit)
## Analysis of Variance Table
##
## Response: Mathlevel2$sat
## Df Sum Sq Mean Sq F value
## Mathlevel2$language 1 11963 11963 3.5144
## Mathlevel2$sex 1 66461 66461 19.5238
## Mathlevel2$Atmost1phy 1 24640 24640 7.2384
## Mathlevel2$Atleast1phy 1 75 75 0.0221
## Mathlevel2$Atmost1chem 1 785 785 0.2306
## Mathlevel2$Atleast1chem 1 12893 12893 3.7873
## Mathlevel2$language:Mathlevel2$sex 1 5161 5161 1.5162
## Mathlevel2$language:Mathlevel2$Atmost1phy 1 3010 3010 0.8841
## Mathlevel2$language:Mathlevel2$Atleast1phy 1 3927 3927 1.1535
## Mathlevel2$language:Mathlevel2$Atmost1chem 1 5475 5475 1.6082
## Mathlevel2$language:Mathlevel2$Atleast1chem 1 2 2 0.0004
## Mathlevel2$sex:Mathlevel2$Atmost1phy 1 6618 6618 1.9440
## Mathlevel2$sex:Mathlevel2$Atleast1phy 1 8810 8810 2.5879
## Mathlevel2$sex:Mathlevel2$Atmost1chem 1 92 92 0.0269
## Mathlevel2$sex:Mathlevel2$Atleast1chem 1 1083 1083 0.3181
## Mathlevel2$Atmost1phy:Mathlevel2$Atmost1chem 1 1115 1115 0.3276
## Mathlevel2$Atmost1phy:Mathlevel2$Atleast1chem 1 530 530 0.1558
## Mathlevel2$Atleast1phy:Mathlevel2$Atmost1chem 1 3 3 0.0007
## Mathlevel2$Atleast1phy:Mathlevel2$Atleast1chem 1 343 343 0.1008
## Residuals 589 2005025 3404
## Pr(>F)
## Mathlevel2$language 0.061332 .
## Mathlevel2$sex 1.183e-05 ***
## Mathlevel2$Atmost1phy 0.007339 **
## Mathlevel2$Atleast1phy 0.881890
## Mathlevel2$Atmost1chem 0.631250
## Mathlevel2$Atleast1chem 0.052117 .
## Mathlevel2$language:Mathlevel2$sex 0.218691
## Mathlevel2$language:Mathlevel2$Atmost1phy 0.347461
## Mathlevel2$language:Mathlevel2$Atleast1phy 0.283258
## Mathlevel2$language:Mathlevel2$Atmost1chem 0.205243
## Mathlevel2$language:Mathlevel2$Atleast1chem 0.983238
## Mathlevel2$sex:Mathlevel2$Atmost1phy 0.163760
## Mathlevel2$sex:Mathlevel2$Atleast1phy 0.108216
## Mathlevel2$sex:Mathlevel2$Atmost1chem 0.869807
## Mathlevel2$sex:Mathlevel2$Atleast1chem 0.572944
## Mathlevel2$Atmost1phy:Mathlevel2$Atmost1chem 0.567274
## Mathlevel2$Atmost1phy:Mathlevel2$Atleast1chem 0.693169
## Mathlevel2$Atleast1phy:Mathlevel2$Atmost1chem 0.978278
## Mathlevel2$Atleast1phy:Mathlevel2$Atleast1chem 0.750933
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To look at just the main effects:
fit2 <- lm(Mathlevel2$sat~Mathlevel2$language+Mathlevel2$sex+Mathlevel2$Atmost1phy+Mathlevel2$Atleast1phy+Mathlevel2$Atmost1chem+Mathlevel2$Atleast1chem,data = Mathlevel2)
anova(fit2)
## Analysis of Variance Table
##
## Response: Mathlevel2$sat
## Df Sum Sq Mean Sq F value Pr(>F)
## Mathlevel2$language 1 11963 11963 3.5283 0.060812 .
## Mathlevel2$sex 1 66461 66461 19.6012 1.133e-05 ***
## Mathlevel2$Atmost1phy 1 24640 24640 7.2671 0.007219 **
## Mathlevel2$Atleast1phy 1 75 75 0.0222 0.881657
## Mathlevel2$Atmost1chem 1 785 785 0.2315 0.630571
## Mathlevel2$Atleast1chem 1 12893 12893 3.8023 0.051645 .
## Residuals 602 2041192 3391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The conclusion remains the same because the main effects are aliased with the two-factor interactions, Below are the plots of the residuals.
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit),residuals(fit))
The Q-Q plot shows the data probably follows a normal distribution. ##4. References to the literature [1]https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf
https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf
The complete code for the project can be seen below:
library(Ecdat)
library(FrF2)
data("Mathlevel")
head(Mathlevel)
str(Mathlevel)
S <- as.factor(Mathlevel$sex)
FL <- as.factor(Mathlevel$language)
PC <- as.factor(Mathlevel$physiccourse)
CC <- as.factor(Mathlevel$chemistcourse)
SMS <- as.numeric(Mathlevel$sat)
#Boxplots
boxplot(SMS~S,xlab="Sex of Student",ylab="SAT Math Score")
boxplot(SMS~FL,xlab="Is the student proficient in a foreign language",ylab="SAT Math Score")
boxplot(SMS~PC,xlab= "Number of physic courses the student has taken",ylab="SAT Math Score")
boxplot(SMS~CC,xlab= "Number of chemistry courses the student has taken",ylab="SAT Math Score")
Atmost1phy <- matrix(0,nrow=nrow(Mathlevel),ncol=1)
Atleast1phy <- matrix(0,nrow=nrow(Mathlevel),ncol=1)
Atmost1chem <- matrix(0,nrow=nrow(Mathlevel),ncol=1)
Atleast1chem <- matrix(0,nrow=nrow(Mathlevel),ncol=1)
#Placing relavent information into a new data frame for easier analysis
project3 <- data.frame(Mathlevel$sat,Mathlevel$language,Mathlevel$sex,Mathlevel$physiccourse,Mathlevel$chemistcourse)
colnames<-c("sat","language","sex","physiccourse","chemistcourse")
colnames(project3)<-colnames
Mathlevel2 <- data.frame(project3,Atmost1phy,Atleast1phy,Atmost1chem,Atleast1chem)
#Decomposing for 3-factor levels to 2-factor levels
for (i in 1:609){
if(Mathlevel2$physiccourse[i]==1){
Mathlevel2$Atmost1phy[i] <- 1
Mathlevel2$Atleast1phy[i] <- 1
}
else if (Mathlevel2$physiccourse[i]==0){
Mathlevel2$Atmost1phy[i] <- 0
Mathlevel2$Atleast1phy[i] <- 1
}
else{
Mathlevel2$Atmost1phy[i] <- 1
Mathlevel2$Atleast1phy[i] <- 0
}
if(Mathlevel2$chemistcourse[i]==1){
Mathlevel2$Atmost1chem[i] <- 1
Mathlevel2$Atleast1chem[i] <- 1
}
else if (Mathlevel2$chemistcourse[i]==0){
Mathlevel2$Atmost1chem[i] <- 0
Mathlevel2$Atleast1chem[i] <- 1
}
else{
Mathlevel2$Atmost1chem[i] <- 1
Mathlevel2$Atleast1chem[i] <- 0
}
}
head(Mathlevel2)
expand.grid(language=c(0,1),sex=c(0,1),Atmost1phy=c(0,1),Atleast1phy=c(0,1),Atmost1chem=c(0,1),Atleast1chem=c(0,1))
design <-FrF2(8,factor.names = c("language","sex","Atmost1phy","Atleast1phy","Atmost1chem","Atleast1chem"),default.levels = c("0","1"))
design
summary(design)
aliasprint(design)
#Anova and linear model
#main and interaction
fit <- lm(Mathlevel2$sat~(Mathlevel2$language+Mathlevel2$sex+Mathlevel2$Atmost1phy+Mathlevel2$Atleast1phy+Mathlevel2$Atmost1chem+Mathlevel2$Atleast1chem)^2,data = Mathlevel2)
anova(fit)
#just main
fit2 <- lm(Mathlevel2$sat~Mathlevel2$language+Mathlevel2$sex+Mathlevel2$Atmost1phy+Mathlevel2$Atleast1phy+Mathlevel2$Atmost1chem+Mathlevel2$Atleast1chem,data = Mathlevel2)
anova(fit2)
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit),residuals(fit))