Dataset containing test scores for reading and mathmatics.
library(Ecfun) # Package needed for Ecdat
library(Ecdat)
library(FrF2)
data("Star") # Load dataset
Below are the first 6 observations
head(Star)
## tmathssk treadssk classk totexpk sex freelunk race schidkn
## 2 473 447 small.class 7 girl no white 63
## 3 536 450 small.class 21 girl no black 20
## 5 463 439 regular.with.aide 0 boy yes black 19
## 11 559 448 regular 16 boy no white 69
## 12 489 447 small.class 5 boy yes white 79
## 13 454 431 regular 8 boy yes white 5
The dataset contains the following 4 factors.
levels(Star$sex)
## [1] "girl" "boy"
levels(Star$freelunk)
## [1] "no" "yes"
levels(Star$race)
## [1] "white" "black" "other"
levels(Star$classk)
## [1] "regular" "small.class" "regular.with.aide"
‘totexpk’ and ‘schidkn’ are continuous variables. Years of total teaching experience and school indicator variable, respectively.
The response variables are the two test scores, ‘tmathssk’ and ‘treadssk’,
There is a reference in the Ecdat package manual supposedly with information about the dataset. Unfortunately, the link is for an outdated website with no information about how the experiment was conducted or how the data was collected.
However, in this assignment a \(2^k\) fractional factorial design is used.
The data set contains 4 factors of which two have 3 levels. These two factor will be decomposed to two 2-level factors each by the following process.
x = Star[1,]
race = c(0,0)
L = levels(Star[,'race'])
for(i in seq(2,dim(Star)[1])){
if(Star[i,'race']==L[1]){
x = rbind(x,Star[i,])
race = rbind(race,c(0,0))
}
else if(Star[i,'race']==L[2]){
x = rbind(x,Star[i,])
x = rbind(x,Star[i,])
race = rbind(race,c(1,0))
race = rbind(race,c(0,1))
}
else{
x = rbind(x,Star[i,])
race = rbind(race,c(1,1))
}
}
x = cbind(x,race)
x = x[,c(1,2,3,4,5,6,8,9,10)]
colnames(x) = c(colnames(Star[1,c(1,2,3,4,5,6,8)]),'raceA','raceB')
Lc = levels(Star[,'classk'])
classk = c(1,0)
classk = rbind(classk,c(0,1))
dat = x[1,]
dat = rbind(dat,dat)
for(i in seq(2,dim(x)[1])){
if(x[i,'classk']==Lc[1]){
dat = rbind(dat,x[i,])
classk = rbind(classk,c(0,0))
}
else if(x[i,'classk']==Lc[2]){
dat = rbind(dat,x[i,])
dat = rbind(dat,x[i,])
classk = rbind(classk,c(1,0))
classk = rbind(classk,c(0,1))
}
else{
dat = rbind(dat,x[i,])
classk = rbind(classk,c(1,1))
}
}
dat = cbind(dat,classk)
colnames(dat) = c(colnames(x),'classkA','classkB')
dat = dat[,c(1,2,4,5,6,7,8,9,10,11)]
Thus, six 2-level factors remain and will be used in the design.
The fractional factorial design is \(2_{III}^{6-3}\), which yields 8 experimental runs. The R package FrF2 provides functions for creating such a design.
# Fractional Factorial Design
plan = FrF2(8,6,factor.names = c('classkA','classkB','sex','freelunk','raceA','raceB'))
plan
## classkA classkB sex freelunk raceA raceB
## 1 1 -1 -1 -1 -1 1
## 2 -1 -1 -1 1 1 1
## 3 1 1 1 1 1 1
## 4 1 -1 1 -1 1 -1
## 5 -1 -1 1 1 -1 -1
## 6 1 1 -1 1 -1 -1
## 7 -1 1 1 -1 -1 1
## 8 -1 1 -1 -1 1 -1
## class=design, type= FrF2
This design has resolution III, thus the main effects and 2-factor interactions are confounded with 2-factor interactions. The table below shows which factors are confounded.
# view aliasing
design.info(plan)$aliased
## $legend
## [1] "A=classkA" "B=classkB" "C=sex" "D=freelunk" "E=raceA"
## [6] "F=raceB"
##
## $main
## [1] "A=BD=CE" "B=AD=CF" "C=AE=BF" "D=AB=EF" "E=AC=DF" "F=BC=DE"
##
## $fi2
## [1] "AF=BE=CD"
The main effect “classkA” is confounded with the 2-factor interactions (classkB,freelunk) and (sex,raceA).
In this section, some of the factors effects on the test scores are shown in boxplots. The following boxplot shows the effect of class-type on math test scores.
boxplot(tmathssk~classk,data=Star,ylab="Math test score",xlab="type of class")
The medians of the math test scores are very similar, although the median for “small.class” is slightly higher.
par(mfrow=c(1,2))
boxplot(tmathssk~sex,data=Star,ylab="Math test score",xlab="sex")
boxplot(treadssk~sex,data=Star,ylab="Reading test score",xlab="sex")
Median test scores a slightly different for girls and boys, but not by much.
The main and interaction effects are estimated using lm an anova.
m1 = lm(tmathssk+treadssk~(sex+freelunk+raceA+raceB+classkA+classkB)^2,data=dat)
anova(m1)
## Analysis of Variance Table
##
## Response: tmathssk + treadssk
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 449201 449201 88.9648 < 2.2e-16 ***
## freelunk 1 3826342 3826342 757.8109 < 2.2e-16 ***
## raceA 1 104898 104898 20.7751 5.227e-06 ***
## raceB 1 249943 249943 49.5015 2.114e-12 ***
## classkA 1 885 885 0.1752 0.67551
## classkB 1 755 755 0.1496 0.69891
## sex:freelunk 1 2385 2385 0.4724 0.49190
## sex:raceA 1 10020 10020 1.9844 0.15896
## sex:raceB 1 23880 23880 4.7295 0.02967 *
## sex:classkA 1 15927 15927 3.1545 0.07575 .
## sex:classkB 1 13616 13616 2.6967 0.10059
## freelunk:raceA 1 17950 17950 3.5551 0.05939 .
## freelunk:raceB 1 27883 27883 5.5223 0.01880 *
## freelunk:classkA 1 830 830 0.1644 0.68514
## freelunk:classkB 1 718 718 0.1421 0.70619
## raceA:raceB 1 13230 13230 2.6202 0.10554
## raceA:classkA 1 62 62 0.0124 0.91147
## raceA:classkB 1 53 53 0.0106 0.91816
## raceB:classkA 1 154 154 0.0305 0.86130
## raceB:classkB 1 138 138 0.0274 0.86856
## classkA:classkB 1 447579 447579 88.6434 < 2.2e-16 ***
## Residuals 9851 49739708 5049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An Anova table for the main effects is shown below.
m2 = lm(tmathssk+treadssk~sex+freelunk+raceA+raceB+classkA+classkB,data=dat)
anova(m2)
## Analysis of Variance Table
##
## Response: tmathssk + treadssk
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 449201 449201 88.0830 < 2.2e-16 ***
## freelunk 1 3826342 3826342 750.2999 < 2.2e-16 ***
## raceA 1 104898 104898 20.5692 5.819e-06 ***
## raceB 1 249943 249943 49.0109 2.711e-12 ***
## classkA 1 885 885 0.1735 0.6770
## classkB 1 755 755 0.1481 0.7003
## Residuals 9866 50314135 5100
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The table shows that the type of class does not have a significant effect on the test scores. However, we know that the main effect for classkA (and classkB) are confounded with the 2-factor interactions (classkB,freelunk) and (sex,raceA). Since, this model does not consider interactions, the conclusion remains the same.
The parameters for the second model are shown in the following table.
coef(m2)
## (Intercept) sexboy freelunkyes raceA raceB classkA
## 951.7719253 -14.1819638 -33.4484036 -13.2979406 -13.2979406 0.5550271
## classkB
## 0.5550271
In this section, the assmumptions of using a linear fixed effect model are checked for the first model. Starting with a Normal QQ-plot. The linear model in lm assumes the response is normally distributed.
plot(m2,2)
The points should not differ much from the straight dotted line. Thus the assumption, that the response is normal, may not be satisfied.
Next,the residuals are are plotted vs the fitted values. The residuals should not increase with the fitted values.
plot(m2,1)
The plot shows that the residuals do not increase substantially with the fitted values.
R Package Ecdat: https://cran.r-project.org/web/packages/Ecdat/index.html