Under the 100 interesting data sets, with the below direction Ecdat has been selected :
After examining the available data sets in Ecdat, Star data set has been selected.
Installation of the package and selection of the data-set is shown below :
#install.packages("pid")
#install.packages("Ecdat")
library(pid)
Loading required package: ggplot2
Loading required package: png
Loading required package: FrF2
Loading required package: DoE.base
Loading required package: grid
Loading required package: conf.design
Attaching package: 'DoE.base'
The following objects are masked from 'package:stats':
aov, lm
The following object is masked from 'package:graphics':
plot.design
The following object is masked from 'package:base':
lengths
library("Ecdat")
Loading required package: Ecfun
Attaching package: 'Ecfun'
The following object is masked from 'package:base':
sign
Attaching package: 'Ecdat'
The following object is masked from 'package:datasets':
Orange
prj <- Star
Star data set was collected for examining Effects on Learning of Small Class Sizes and consist of 8 variables and 5748 observations.
Definitions of the variables in data set are as below:
head(prj)
tmathssk treadssk classk totexpk sex freelunk race schidkn
2 473 447 small.class 7 girl no white 63
3 536 450 small.class 21 girl no black 20
5 463 439 regular.with.aide 0 boy yes black 19
11 559 448 regular 16 boy no white 69
12 489 447 small.class 5 boy yes white 79
13 454 431 regular 8 boy yes white 5
Factors and levels of each factor are listed as below:
str(prj)
## 'data.frame': 5748 obs. of 8 variables:
## $ tmathssk: int 473 536 463 559 489 454 423 500 439 528 ...
## $ treadssk: int 447 450 439 448 447 431 395 451 478 455 ...
## $ classk : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ...
## $ totexpk : int 7 21 0 16 5 8 17 3 11 10 ...
## $ sex : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ...
## $ freelunk: Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ...
## $ race : Factor w/ 3 levels "white","black",..: 1 2 2 1 1 1 2 1 2 1 ...
## $ schidkn : int 63 20 19 69 79 5 16 56 11 66 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:5850] 1 4 6 7 8 9 10 15 16 17 ...
## .. ..- attr(*, "names")= chr [1:5850] "1" "4" "6" "7" ...
However, one of the factors that we are planning to use in the analysis, totexpk, is defined as integer. We should change it to a factor before starting the analysis. I also grouped the factor levels in to two categories 0-10 years of experience will be defined as Exp1 and higher experience will be defined as Exp2 (nearly %50-%50 of the sample size) :
prj$totexpk <- as.factor(prj$totexpk)
levels(prj$totexpk) <- c("Exp1","Exp1","Exp1","Exp1","Exp1","Exp1","Exp1","Exp1","Exp1","Exp1","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2","Exp2")
str(prj$totexpk)
## Factor w/ 2 levels "Exp1","Exp2": 1 2 1 2 1 1 2 1 2 2 ...
We will use following 4 factors with below levels:
levels(prj$classk)
## [1] "regular" "small.class" "regular.with.aide"
levels(prj$totexpk)
## [1] "Exp1" "Exp2"
levels(prj$sex)
## [1] "girl" "boy"
levels(prj$freelunk)
## [1] "no" "yes"
In the data set we have two continuous variables
Since we assume that the scores represent the learning results, we can use either one of the scores as response variable.
We will use tmathssk : total math scaled score as our response variable.
As it has been mentioned before data consists of 8 variables and 5748 observations between years 1985 and 1989. One of the variables is schidkn : school indicator variable with a maximum value of 80, that represents the number of schools used in the analysis.
General view of the data is as below:
head(prj)
## tmathssk treadssk classk totexpk sex freelunk race schidkn
## 2 473 447 small.class Exp1 girl no white 63
## 3 536 450 small.class Exp2 girl no black 20
## 5 463 439 regular.with.aide Exp1 boy yes black 19
## 11 559 448 regular Exp2 boy no white 69
## 12 489 447 small.class Exp1 boy yes white 79
## 13 454 431 regular Exp1 boy yes white 5
summary(prj)
## tmathssk treadssk classk totexpk
## Min. :320.0 Min. :315.0 regular :2000 Exp1:2940
## 1st Qu.:454.0 1st Qu.:414.0 small.class :1733 Exp2:2808
## Median :484.0 Median :433.0 regular.with.aide:2015
## Mean :485.6 Mean :436.7
## 3rd Qu.:513.0 3rd Qu.:453.0
## Max. :626.0 Max. :627.0
## sex freelunk race schidkn
## girl:2794 no :2973 white:3869 Min. : 1.00
## boy :2954 yes:2775 black:1852 1st Qu.:20.00
## other: 27 Median :39.00
## Mean :39.84
## 3rd Qu.:60.00
## Max. :80.00
We are using 4 factors with multiple levels, we will analyze their individual and interaction effects on learning of small classes.
For this experiment our null hypothesis can be set as learning of small classes does not affected by class type, gender of the students, experience of teacher, free lunch qualification and any two way interactions of these factors.
Data set has been retrieved from Ecdat, and factors had been already chosen. In this situation it is not possible for us to define the reasons of why these factors were selected. Other factors like education level of the parents of subjects or methods used in teaching (tablet, book, etc.) or grades (K-12) might have been selected.
For our analysis during 4 factors have been selected according to my personal opinion regarding the possible effects on learning.
Replication, local control(blocking) and randomization are fundamentals of a designing an experiment. Replication and blocking increases the precision in the experiment and randomization helps reducing bias.
Randomization is possible with below three conditions :
If all these are realized in a random behaviour we can assume the analyze is randomized.
For this data set We do not have a certain information about random selection. Only sign of this could be the school indicator variable in the data set is in a mixed order and also we can assume that schools and students selected randomly.
However, with randomly sub-setting and ordering the data we can assure the random assignment and random execution. By running below code each time we will select 4000 different observations randomly among 5748.
x <- sample(1:5748,5748,replace=FALSE)
R_prj <- cbind(prj,x)
prj <- subset(R_prj,x<=4000)
“Repeat and replicate measurements are both multiple response measurements taken at the same combination of factor settings; but repeat measurements are taken during the same experimental run or consecutive runs, while replicate measurements are taken during identical but different experimental runs, which are often randomized.”[1]
Replication is using the same treatment on different subjects but on the other hand in repeated measurements the same subject evaluated with multiple readings of the dependent variable. As an example we have three plants and we used a fertilizer on all of them and measured the length of each plants once. Since each of the observation is independent this can be accepted as a Replication. However, if we measure the same plant`s length multiple times (2,3 ..) this is named as Repeated measurement.
In the dataset we have two exam results one is math the other is reading. Similar to our previous study in Alspaugh paper these academic areas can be defined as repeated measures. However, we chose only one dependent variable and our calculations conducted without the analysis without repeated measurements.
“A block is a categorical variable that explains variation in the response variable that is not caused by the factors. Although each measurement should be taken under consistent experimental conditions (other than the factors that are being varied as part of the experiment), this is not always possible. Use blocks in designed experiments and analysis to minimize bias and variance of the error because of nuisance factors.” [2]
As the definition of nuisance factor, it might have some effect on the response but on the other hand it should not be in main focus of the researcher. Since the limitations given for the project requires 4 factors.
No blocking has been used in our analysis.
Our new dataset summary which has been randomly selected is as below. Additionally, histogram of our response variable (math score) is plotted, for visually analyzing the normal distribution.
summary(prj)
## tmathssk treadssk classk totexpk
## Min. :320.0 Min. :315.0 regular :1399 Exp1:2030
## 1st Qu.:454.0 1st Qu.:414.0 small.class :1233 Exp2:1970
## Median :478.0 Median :433.0 regular.with.aide:1368
## Mean :485.3 Mean :436.3
## 3rd Qu.:513.0 3rd Qu.:451.0
## Max. :626.0 Max. :627.0
## sex freelunk race schidkn x
## girl:1972 no :2078 white:2706 Min. : 1.00 Min. : 1
## boy :2028 yes:1922 black:1275 1st Qu.:20.00 1st Qu.:1001
## other: 19 Median :39.00 Median :2000
## Mean :39.81 Mean :2000
## 3rd Qu.:60.00 3rd Qu.:3000
## Max. :80.00 Max. :4000
hist(prj$tmathssk)
In general box plots visually represents the significance of factor, response differences between levels of the factor and variation differences in the levels. However for a correctcalculation of main effects means of the factors should be used, and in below box-plots means represented by a red diamond.
boxplot(prj$tmathssk~prj$classk, xlab="Type of Class", ylab="Math Score")
title("Class Type")
means1 <- by(prj$tmathssk, prj$classk, mean)
points(1:3, means1, pch = 23, cex = 2, bg = "red")
text(1:3 - 0.1, means1,labels = format(means1, format = "f", digits = 0),pos = 3, cex = 0.9, col = "red")
boxplot(prj$tmathssk~prj$totexpk, xlab="Experience of the teacher", ylab="Math Score")
title("Teacher")
means2 <- by(prj$tmathssk, prj$totexpk, mean)
points(1:2, means2, pch = 23, cex = 2, bg = "red")
text(1:2 - 0.1, means2,labels = format(means2, format = "f", digits = 0),pos = 3, cex = 0.9, col = "red")
boxplot(prj$tmathssk~prj$sex, xlab="Gender", ylab="Math Score")
title("Gender")
means3 <- by(prj$tmathssk, prj$sex, mean)
points(1:2, means3, pch = 23, cex = 2, bg = "red")
text(1:2 - 0.1, means3,labels = format(means3, format = "f", digits = 0),pos = 3, cex = 0.9, col = "red")
boxplot(prj$tmathssk~prj$freelunk, xlab="Free Lunch", ylab="Math Score")
title("Free Lunch")
means4 <- by(prj$tmathssk, prj$freelunk, mean)
points(1:2, means4, pch = 23, cex = 2, bg = "red")
text(1:2 - 0.1, means4,labels = format(means4, format = "f", digits = 0),pos = 3, cex = 0.9, col = "red")
According to means above we may say that all factors have some effect on the response variable, however free lunch support level differences are more obvious than the others.
For analyzing significance of these effects ANOVA method is going to be used in next section.
In order to determine the statistical significance of the results of the factorial experiment, ANOVA will be conducted. This will show the possibility that the significance is by chance, or not by chance. This allows researchers to either accept the null hypothesis: that there is not a statistically significant main or interaction effect of factors, or reject the null and say that there is a statistically significant main or interaction effect of factor(s) on the response variable.
model_1 <- aov(prj$tmathssk~prj$classk)
anova(model_1)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$classk 2 53949 26974.3 11.946 6.717e-06 ***
## Residuals 3997 9025023 2257.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_2 <- aov(prj$tmathssk~prj$totexpk)
anova(model_2)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$totexpk 1 53744 53744 23.808 1.106e-06 ***
## Residuals 3998 9025227 2257
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_3 <- aov(prj$tmathssk~prj$sex)
anova(model_3)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$sex 1 51902 51902 22.987 1.69e-06 ***
## Residuals 3998 9027070 2258
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_4 <- aov(prj$tmathssk~prj$freelunk)
anova(model_4)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$freelunk 1 552151 552151 258.89 < 2.2e-16 ***
## Residuals 3998 8526821 2133
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can summarize the ANOVA results for all main effects since all have a p < 0.05. According to this finding we cannot say that the variance in math scores (response variable) are not affected by these factors and we should reject the null hypothesis.
We can summarize the ANOVA results for interaction effects as below:
model_12 <- aov(prj$tmathssk~prj$classk*prj$totexpk)
anova(model_12)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$classk 2 53949 26974 12.0496 6.063e-06 ***
## prj$totexpk 1 61137 61137 27.3103 1.821e-07 ***
## prj$classk:prj$totexpk 2 22876 11438 5.1093 0.00608 **
## Residuals 3994 8941010 2239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(prj$classk,prj$totexpk,prj$tmathssk)
Class type x teacher experience - p < 0.05 - Evidence of significant interaction effect.
model_13 <- aov(prj$tmathssk~prj$classk*prj$sex)
anova(model_13)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$classk 2 53949 26974 12.0328 6.164e-06 ***
## prj$sex 1 52186 52186 23.2794 1.453e-06 ***
## prj$classk:prj$sex 2 19397 9698 4.3263 0.01328 *
## Residuals 3994 8953440 2242
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(prj$sex,prj$classk,prj$tmathssk)
Class type x gender - p < 0.05 - Evidence of significant interaction effect.
model_14 <- aov(prj$tmathssk~prj$classk*prj$freelunk)
anova(model_14)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$classk 2 53949 26974 12.7075 3.153e-06 ***
## prj$freelunk 1 545970 545970 257.2045 < 2.2e-16 ***
## prj$classk:prj$freelunk 2 956 478 0.2251 0.7985
## Residuals 3994 8478098 2123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(prj$classk,prj$freelunk,prj$tmathssk)
Class type x free lunch - p > 0.05 - NO INTERACTION EFFECT This can be also observed from the interaction plot that the two factors are moving parallel to each other.
model_23 <- aov(prj$tmathssk~prj$totexpk*prj$sex)
anova(model_23)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$totexpk 1 53744 53744 23.921 1.043e-06 ***
## prj$sex 1 47276 47276 21.042 4.630e-06 ***
## prj$totexpk:prj$sex 1 0 0 0.000 0.9963
## Residuals 3996 8977951 2247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(prj$totexpk,prj$sex,prj$tmathssk)
Gender x teacher experience - p > 0.05 - NO INTERACTION EFFECT This can be also observed from the interaction plot that the two factors are moving parallel to each other.
model_24 <- aov(prj$tmathssk~prj$totexpk*prj$freelunk)
anova(model_24)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$totexpk 1 53744 53744 25.3125 5.09e-07 ***
## prj$freelunk 1 540031 540031 254.3442 < 2.2e-16 ***
## prj$totexpk:prj$freelunk 1 778 778 0.3663 0.545
## Residuals 3996 8484419 2123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(prj$totexpk,prj$freelunk,prj$tmathssk)
Free lunch x teacher experience - p > 0.05 - NO INTERACTION EFFECT This can be also observed from the interaction plot that the two factors are moving parallel to each other.
model_34 <- aov(prj$tmathssk~prj$sex*prj$freelunk)
anova(model_34)
## Analysis of Variance Table
##
## Response: prj$tmathssk
## Df Sum Sq Mean Sq F value Pr(>F)
## prj$sex 1 51902 51902 24.4907 7.775e-07 ***
## prj$freelunk 1 558252 558252 263.4181 < 2.2e-16 ***
## prj$sex:prj$freelunk 1 253 253 0.1193 0.7298
## Residuals 3996 8468565 2119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(prj$sex,prj$freelunk,prj$tmathssk)
Free lunch x gender - p > 0.05 - NO INTERACTION EFFECT This can be also observed from the interaction plot that the two factors are moving parallel to each other.
Another method for calculating the main effects is as below : (This calculation does required pid package)
model_main <- lm(prj$tmathssk~prj$classk*prj$totexpk + prj$classk*prj$sex + prj$classk*prj$freelunk + prj$totexpk*prj$sex + prj$totexpk*prj$freelunk + prj$sex*prj$freelunk)
summary(model_main)
##
## Call:
## lm.default(formula = prj$tmathssk ~ prj$classk * prj$totexpk +
## prj$classk * prj$sex + prj$classk * prj$freelunk + prj$totexpk *
## prj$sex + prj$totexpk * prj$freelunk + prj$sex * prj$freelunk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.745 -31.620 -4.765 27.516 164.255
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 495.4432 2.8444 174.185
## prj$classksmall.class 9.2924 3.6105 2.574
## prj$classkregular.with.aide -4.0468 3.5721 -1.133
## prj$totexpkExp2 9.3519 3.1922 2.930
## prj$sexboy -13.2054 3.1887 -4.141
## prj$freelunkyes -21.0680 3.2425 -6.498
## prj$classksmall.class:prj$totexpkExp2 -10.8685 3.5988 -3.020
## prj$classkregular.with.aide:prj$totexpkExp2 2.2782 3.4895 0.653
## prj$classksmall.class:prj$sexboy 9.1323 3.5798 2.551
## prj$classkregular.with.aide:prj$sexboy 7.1414 3.4851 2.049
## prj$classksmall.class:prj$freelunkyes -3.0476 3.5942 -0.848
## prj$classkregular.with.aide:prj$freelunkyes -3.3057 3.4878 -0.948
## prj$totexpkExp2:prj$sexboy 0.7244 2.9045 0.249
## prj$totexpkExp2:prj$freelunkyes -1.2439 2.9132 -0.427
## prj$sexboy:prj$freelunkyes 0.5755 2.9019 0.198
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## prj$classksmall.class 0.01010 *
## prj$classkregular.with.aide 0.25733
## prj$totexpkExp2 0.00341 **
## prj$sexboy 3.52e-05 ***
## prj$freelunkyes 9.17e-11 ***
## prj$classksmall.class:prj$totexpkExp2 0.00254 **
## prj$classkregular.with.aide:prj$totexpkExp2 0.51388
## prj$classksmall.class:prj$sexboy 0.01078 *
## prj$classkregular.with.aide:prj$sexboy 0.04052 *
## prj$classksmall.class:prj$freelunkyes 0.39652
## prj$classkregular.with.aide:prj$freelunkyes 0.34330
## prj$totexpkExp2:prj$sexboy 0.80304
## prj$totexpkExp2:prj$freelunkyes 0.66942
## prj$sexboy:prj$freelunkyes 0.84279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.72 on 3985 degrees of freedom
## Multiple R-squared: 0.0827, Adjusted R-squared: 0.07948
## F-statistic: 25.66 on 14 and 3985 DF, p-value: < 2.2e-16
paretoPlot(model_main)
This relation between main effects and interaction effects also represents that their effects are independent. Even all factors have a main effect, some of them does not have an interaction effect. As we have stated in the wikibook Chapter 0 : “Main effects and interaction effects are independent from each other (orthogonal) That means in some cases there might be no main effect, but some interaction effects or alternatively, a main effect may exist, but no interactions will be present.
ANOVA analysis bases on a list of assumptions, if these assumptions are not met then further analysis should be implemented or a new experiment should be planned. List of model adequacy assumptions are as below :
For testing the normality :
“An extremely useful procedure is to construct a normal probability plot of the residuals. … In the analysis of variance, it is usually more effective (and straightforward) to do this with the residuals. If the underlying error distribution is normal, this plot will resemble a straight line. In visualizing the straight line, place more emphasis on the central values of the plot than on the extremes.” [3]
For checking homogeneity of variance:
If the model is correct and the assumptions are satisfied, the residuals should be structureless; in particular, they should be unrelated to any other variable including the predicted response. A simple check is to plot the residuals versus the fitted values
These graphs are plotted as below and they represent that the assumptions have been met:
qqnorm(residuals(model_1))
qqline(residuals(model_1))
plot(fitted(model_1),residuals(model_1))
qqnorm(residuals(model_34))
qqline(residuals(model_34))
plot(fitted(model_34),residuals(model_34))
[3] - Montgomery, Douglas C.. Design and Analysis of Experiments, 8th Edition (Page 80). Wiley. Kindle Edition.
[4] - Montgomery, Douglas C.. Design and Analysis of Experiments, 8th Edition (Page 83). Wiley. Kindle Edition.
Mean point addition to BoxPlots: https://stat.ethz.ch/pipermail/r-help/2004-January/044833.html
pid Package and effects calculation by Kevin DUNN : https://www.youtube.com/watch?v=Rti2zqQFrTY&index=43&list=PLHUnYbefLmeOPRuT1sukKmRyOVd4WSxJE