Project 3

1. Setting

Effects on Learning of Small Class Sizes

Dataset containing test scores for reading and mathmatics.

library(Ecfun) # Package needed for Ecdat
library(Ecdat)
library(FrF2)
data("Star") # Load dataset

Below are the first 6 observations

head(Star)

##    tmathssk treadssk            classk totexpk  sex freelunk  race schidkn
## 2       473      447       small.class       7 girl       no white      63
## 3       536      450       small.class      21 girl       no black      20
## 5       463      439 regular.with.aide       0  boy      yes black      19
## 11      559      448           regular      16  boy       no white      69
## 12      489      447       small.class       5  boy      yes white      79
## 13      454      431           regular       8  boy      yes white       5

Factors and levels

The dataset contains the following 4 factors.

levels(Star$sex)

## [1] "girl" "boy"

levels(Star$freelunk)

## [1] "no"  "yes"

levels(Star$race)

## [1] "white" "black" "other"

levels(Star$classk)

## [1] "regular"           "small.class"       "regular.with.aide"

Continuous variables

‘totexpk’ and ‘schidkn’ are continuous variables. Years of total teaching experience and school indicator variable, respectively.

Response variables

The response variables are the two test scores, ‘tmathssk’ and ‘treadssk’,

2. Experimental Design

There is a reference in the Ecdat package manual supposedly with information about the dataset. Unfortunately, the link is for an outdated website with no information about how the experiment was conducted or how the data was collected.

However, in this assignment a \(2^k\) fractional factorial design is used.

The data set contains 4 factors of which two have 3 levels. These two factor will be decomposed to two 2-level factors each by the following process.

x = Star[1,]
race = c(0,0)

L = levels(Star[,'race'])
for(i in seq(2,dim(Star)[1])){
  if(Star[i,'race']==L[1]){
    x = rbind(x,Star[i,])
    race = rbind(race,c(0,0))
  } 
  else if(Star[i,'race']==L[2]){
    x = rbind(x,Star[i,])
    x = rbind(x,Star[i,])
    race = rbind(race,c(1,0))
    race = rbind(race,c(0,1))
  } 
  else{
    x = rbind(x,Star[i,])
    race = rbind(race,c(1,1))
  }
}

x = cbind(x,race)
x = x[,c(1,2,3,4,5,6,8,9,10)]
colnames(x) = c(colnames(Star[1,c(1,2,3,4,5,6,8)]),'raceA','raceB')

Lc = levels(Star[,'classk'])

classk = c(1,0)
classk = rbind(classk,c(0,1))
dat = x[1,]
dat = rbind(dat,dat)



for(i in seq(2,dim(x)[1])){
  if(x[i,'classk']==Lc[1]){
    dat = rbind(dat,x[i,])
    classk = rbind(classk,c(0,0))
  } 
  else if(x[i,'classk']==Lc[2]){
    dat = rbind(dat,x[i,])
    dat = rbind(dat,x[i,])
    classk = rbind(classk,c(1,0))
    classk = rbind(classk,c(0,1))
  } 
  else{
    dat = rbind(dat,x[i,])
    classk = rbind(classk,c(1,1))
  }
}

dat = cbind(dat,classk)

colnames(dat) = c(colnames(x),'classkA','classkB')
dat = dat[,c(1,2,4,5,6,7,8,9,10,11)]

Thus, six 2-level factors remain and will be used in the design.

The fractional factorial design is \(2_{III}^{6-3}\), which yields 8 experimental runs. The R package FrF2 provides functions for creating such a design.

# Fractional Factorial Design
plan = FrF2(8,6,factor.names = c('classkA','classkB','sex','freelunk','raceA','raceB'))
plan

##   classkA classkB sex freelunk raceA raceB
## 1       1      -1  -1       -1    -1     1
## 2      -1      -1  -1        1     1     1
## 3       1       1   1        1     1     1
## 4       1      -1   1       -1     1    -1
## 5      -1      -1   1        1    -1    -1
## 6       1       1  -1        1    -1    -1
## 7      -1       1   1       -1    -1     1
## 8      -1       1  -1       -1     1    -1
## class=design, type= FrF2

This design has resolution III, thus the main effects and 2-factor interactions are confounded with 2-factor interactions. The table below shows which factors are confounded.

# view aliasing
design.info(plan)$aliased

## $legend
## [1] "A=classkA"  "B=classkB"  "C=sex"      "D=freelunk" "E=raceA"   
## [6] "F=raceB"   
## 
## $main
## [1] "A=BD=CE" "B=AD=CF" "C=AE=BF" "D=AB=EF" "E=AC=DF" "F=BC=DE"
## 
## $fi2
## [1] "AF=BE=CD"

The main effect “classkA” is confounded with the 2-factor interactions (classkB,freelunk) and (sex,raceA).

3. Statistical Analysis

Exploratoty data analysis

In this section, some of the factors effects on the test scores are shown in boxplots. The following boxplot shows the effect of class-type on math test scores.

boxplot(tmathssk~classk,data=Star,ylab="Math test score",xlab="type of class")

The medians of the math test scores are very similar, although the median for “small.class” is slightly higher.

par(mfrow=c(1,2))
boxplot(tmathssk~sex,data=Star,ylab="Math test score",xlab="sex")
boxplot(treadssk~sex,data=Star,ylab="Reading test score",xlab="sex")

Median test scores a slightly different for girls and boys, but not by much.

Testing

The main and interaction effects are estimated using lm an anova.

m1 = lm(tmathssk+treadssk~(sex+freelunk+raceA+raceB+classkA+classkB)^2,data=dat)
anova(m1)

## Analysis of Variance Table
## 
## Response: tmathssk + treadssk
##                    Df   Sum Sq Mean Sq  F value    Pr(>F)    
## sex                 1   449201  449201  88.9648 < 2.2e-16 ***
## freelunk            1  3826342 3826342 757.8109 < 2.2e-16 ***
## raceA               1   104898  104898  20.7751 5.227e-06 ***
## raceB               1   249943  249943  49.5015 2.114e-12 ***
## classkA             1      885     885   0.1752   0.67551    
## classkB             1      755     755   0.1496   0.69891    
## sex:freelunk        1     2385    2385   0.4724   0.49190    
## sex:raceA           1    10020   10020   1.9844   0.15896    
## sex:raceB           1    23880   23880   4.7295   0.02967 *  
## sex:classkA         1    15927   15927   3.1545   0.07575 .  
## sex:classkB         1    13616   13616   2.6967   0.10059    
## freelunk:raceA      1    17950   17950   3.5551   0.05939 .  
## freelunk:raceB      1    27883   27883   5.5223   0.01880 *  
## freelunk:classkA    1      830     830   0.1644   0.68514    
## freelunk:classkB    1      718     718   0.1421   0.70619    
## raceA:raceB         1    13230   13230   2.6202   0.10554    
## raceA:classkA       1       62      62   0.0124   0.91147    
## raceA:classkB       1       53      53   0.0106   0.91816    
## raceB:classkA       1      154     154   0.0305   0.86130    
## raceB:classkB       1      138     138   0.0274   0.86856    
## classkA:classkB     1   447579  447579  88.6434 < 2.2e-16 ***
## Residuals        9851 49739708    5049                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An Anova table for the main effects is shown below.

m2 = lm(tmathssk+treadssk~sex+freelunk+raceA+raceB+classkA+classkB,data=dat)
anova(m2)

## Analysis of Variance Table
## 
## Response: tmathssk + treadssk
##             Df   Sum Sq Mean Sq  F value    Pr(>F)    
## sex          1   449201  449201  88.0830 < 2.2e-16 ***
## freelunk     1  3826342 3826342 750.2999 < 2.2e-16 ***
## raceA        1   104898  104898  20.5692 5.819e-06 ***
## raceB        1   249943  249943  49.0109 2.711e-12 ***
## classkA      1      885     885   0.1735    0.6770    
## classkB      1      755     755   0.1481    0.7003    
## Residuals 9866 50314135    5100                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table shows that the type of class does not have a significant effect on the test scores. However, we know that the main effect for classkA (and classkB) are confounded with the 2-factor interactions (classkB,freelunk) and (sex,raceA). Since, this model does not consider interactions, the conclusion remains the same.

Estimation

The parameters for the second model are shown in the following table.

coef(m2)

## (Intercept)      sexboy freelunkyes       raceA       raceB     classkA 
## 951.7719253 -14.1819638 -33.4484036 -13.2979406 -13.2979406   0.5550271 
##     classkB 
##   0.5550271

Diagnostics / Model Adequacy Checking

In this section, the assmumptions of using a linear fixed effect model are checked for the first model. Starting with a Normal QQ-plot. The linear model in lm assumes the response is normally distributed.

plot(m2,2)

The points should not differ much from the straight dotted line. Thus the assumption, that the response is normal, may not be satisfied.

Next,the residuals are are plotted vs the fitted values. The residuals should not increase with the fitted values.

plot(m2,1)

The plot shows that the residuals do not increase substantially with the fitted values.

4. References

R Package Ecdat: https://cran.r-project.org/web/packages/Ecdat/index.html