The data to be tested are from the Ecdat package in R, from the data set entitled Star. The data come from a study of the effects of class style on learning, conducted from 1985 to 1989. Learning is assessed using test scores for three separate class styles, alongside other relevant student, class, and teacher information.
These data will be used to examine the effect of class type on students' learning results. The data will be blocked by the sex of the students so that differences attributable to class style are easier to separate from other causes. Sex is a very common blocking variable, as it is relatively easy to identify and control for. The data will be analyzed in terms of these two variables only.
The independent variable under test is classk, which represents the teaching style of the student's class. It has the levels small.class, regular, and regular.with.aide. The blocking variable is sex, with levels boy and girl. Blocking removes some of the variation from the data and allows a more precise estimate of the effect of the classk variable.
Three continuous variables exist in this data set: totexpk, the teacher's total years of teaching experience, and tmathssk and treadssk, the math and reading test scores, respectively. The totexpk variable is not analyzed in this report. The two test scores are outcomes rather than independent variables; they will be combined into a single variable, totalssk, which is simply their sum and represents the student's overall achievement in both subjects.
totalssk is the response variable analyzed in this report.
head(df)
## tmathssk treadssk classk totexpk sex freelunk race schidkn
## 2 473 447 small.class 7 girl no white 63
## 3 536 450 small.class 21 girl no black 20
## 5 463 439 regular.with.aide 0 boy yes black 19
## 11 559 448 regular 16 boy no white 69
## 12 489 447 small.class 5 boy yes white 79
## 13 454 431 regular 8 boy yes white 5
str(df)
## 'data.frame': 5748 obs. of 8 variables:
## $ tmathssk: int 473 536 463 559 489 454 423 500 439 528 ...
## $ treadssk: int 447 450 439 448 447 431 395 451 478 455 ...
## $ classk : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ...
## $ totexpk : int 7 21 0 16 5 8 17 3 11 10 ...
## $ sex : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ...
## $ freelunk: Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ...
## $ race : Factor w/ 3 levels "white","black",..: 1 2 2 1 1 1 2 1 2 1 ...
## $ schidkn : int 63 20 19 69 79 5 16 56 11 66 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:5850] 1 4 6 7 8 9 10 15 16 17 ...
## .. ..- attr(*, "names")= chr [1:5850] "1" "4" "6" "7" ...
The data set contains 5748 observations with 8 variables.
The following alterations will be performed:
All incomplete cases in the set will be removed.
The tmathssk and treadssk observations will be summed in order to show the overall aptitude of that student.
The data set will be condensed down to the response variable totalssk, the blocking variable sex, and the testing variable classk.
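Following the paring of the data set, it looks as follows. No rows were removed for being incomplete, because the Star data had already been filtered (the na.action attribute in the structure above lists 5850 previously omitted rows); the data are simply in a more presentable format.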
head(df2)
## classk sex totalssk
## 2 small.class girl 920
## 3 small.class girl 986
## 5 regular.with.aide boy 902
## 11 regular boy 1007
## 12 small.class boy 936
## 13 regular boy 885
str(df2)
## 'data.frame': 5748 obs. of 3 variables:
## $ classk : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ...
## $ sex : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ...
## $ totalssk: int 920 986 902 1007 936 885 818 951 917 983 ...
summary(df2)
## classk sex totalssk
## regular :2000 girl:2794 Min. : 635.0
## small.class :1733 boy :2954 1st Qu.: 870.0
## regular.with.aide:2015 Median : 915.0
## Mean : 922.4
## 3rd Qu.: 964.0
## Max. :1253.0
This experiment seeks to observe the impact of class style on students' test results. Statistical and visual analysis can be used to guide future class decisions. The analysis will not capture all of the variation in the data set, but that is not its purpose; we only want to see whether class style has an impact.
Three separate methods will be used to determine whether class style has a significant effect. All data will be analyzed using a blocked design, with sex as the block. The three methods are ANOVA, confidence interval analysis, and resampling, in which the blocked ANOVA is refit on repeated random samples and the percentage of statistically significant results is measured. The data will be resampled a total of 10,000 times. Sample sizes for the tests will be determined using G*Power 3.1 with an \(\alpha\) of .05 and a \(\beta\) of .1 (power of .9).
To calculate the sample size, we need an estimate of the main effect. This was also done in G*Power. The three groups evaluated were the three class types. The means and standard deviations of totalssk in each group were calculated and yielded the following results.
Small.Class: mean = 932.0508, sd = 76.42836
Regular: mean = 917.942, sd = 73.15339
Regular.With.Aide: mean = 918.4973, sd = 71.58013
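The group means and standard deviations above can be obtained with tapply. The following is a minimal sketch rather than the code used for the report: group_stats is a hypothetical helper, and toy is a simulated stand-in for df2 with made-up scores, so its numbers will not match those listed above.

```r
# Simulated stand-in for df2 (made-up values, NOT the Star data)
set.seed(1)
toy <- data.frame(
  classk = factor(rep(c("regular", "small.class", "regular.with.aide"), each = 200)),
  sex = factor(rep(c("girl", "boy"), times = 300)),
  totalssk = round(rnorm(600, mean = rep(c(918, 932, 918), each = 200), sd = 74))
)

# Hypothetical helper: mean, standard deviation and size of each classk group
group_stats <- function(d) {
  data.frame(
    classk = levels(d$classk),
    mean = as.numeric(tapply(d$totalssk, d$classk, mean)),
    sd = as.numeric(tapply(d$totalssk, d$classk, sd)),
    n = as.numeric(table(d$classk))
  )
}

stats <- group_stats(toy)
print(stats)
```

With the report's data the call would be group_stats(df2).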
The homogeneity of variances was tested with a Bartlett test.
##
## Bartlett test of homogeneity of variances
##
## data: classk by sex
## Bartlett's K-squared = 0.00011189, df = 1, p-value = 0.9916
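The p-value is very high, so the test as run shows no significant difference in variance. Note, however, that the output reports a test of classk by sex (df = 1), which is not a comparison of the variance of totalssk across the three class types; that comparison would need totalssk grouped by classk. Homogeneity of variances was nonetheless assumed, which the similar group standard deviations above (71.6 to 76.4) make plausible, and the mean of the three standard deviations (73.72) was used in the calculation of the effect size. Entering these values into G*Power yielded a required sample size of 1710.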
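As a rough cross-check of the effect size and sample size, the same inputs can be run through base R. This sketch does not reproduce G*Power's procedure. It computes Cohen's f with the groups weighted by their observed sizes (an assumption about how the effect size was entered), then calls power.anova.test, which assumes equal group sizes, so its total need not equal 1710.

```r
# Group summaries reported above
means <- c(small.class = 932.0508, regular = 917.942, regular.with.aide = 918.4973)
sds <- c(76.42836, 73.15339, 71.58013)
n_obs <- c(1733, 2000, 2015)  # group sizes from summary(df2)

sigma <- mean(sds)  # about 73.72, the common SD used in the report

# Cohen's f = SD of the group means / common SD, weighting by group size
w <- n_obs / sum(n_obs)
grand_mean <- sum(w * means)
f_weighted <- sqrt(sum(w * (means - grand_mean)^2)) / sigma

# Equal-allocation sample size for alpha = .05 and power = .90
pw <- power.anova.test(groups = 3, between.var = var(means),
                       within.var = sigma^2, sig.level = 0.05, power = 0.90)
n_total <- 3 * ceiling(pw$n)

print(f_weighted)
print(n_total)
```

The effect is small by conventional standards, which is why a sample in the low thousands is needed to detect it reliably.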
With this sample size, the above methods can all be performed in order to determine if there is a difference between the mean test scores of the various class styles.
Using three separate methods to test the hypothesis gives a better understanding of how robust the differences are. The combination of null hypothesis testing, resampling, and visual CI assessment portrays the data in a variety of ways, each of which can be examined for significance. It provides a better picture of the overall results than null hypothesis testing alone and also allows the results to be corroborated.
Blocking is important in this model in order to eliminate one source of variation. Sex was chosen as the blocking variable, as it would be the easiest to block by in most studies.
Assumptions
This data set meets many of the assumptions required for a randomized block design. It has a large number of replicates in each group, appears to be randomly selected, and can be blocked by sex. Based on the principles of randomization, replication, and blocking, the data set therefore appears suitable for this analysis.
Three different analyses will be performed to assess the differences in group means. First, a blocked ANOVA will be conducted to determine whether the null hypothesis of no difference can be rejected. Next, a graphical 80% CI assessment will be used to determine which groups have significantly different means. Finally, resampling will be used to determine the percentage of resamples in which the difference is significant. If this value is above 95%, we will treat the difference as significant.
Boxplots of the blocked groups are displayed to show the differences between the groups. There are some small differences in the medians, but the two blocks look relatively similar in composition. This will likely change once the data are broken down further.
Additionally, the data sets are analyzed by the class type variable. This analysis shows that there are slight differences in the means of the class types. However, blocking will likely be beneficial in the full analysis of the data set.
The three testing methods to be used are null hypothesis testing, confidence interval (CI) plots, and resampling. For all experiments the hypotheses are the following:
\(H_0\): The mean totalssk is the same for all classk groups.
\(H_a\): At least two classk groups differ in mean totalssk.
In the ANOVA method, sex was entered as the blocking term and class structure as the testing variable. The output is displayed below:
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 276469 276469 51.45 8.26e-13 ***
## classk 2 234012 117006 21.78 3.79e-10 ***
## Residuals 5745 30868728 5373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F statistic for the classk variable is 21.78 and the p-value is 3.79e-10, which indicates that the result is highly significant. The blocking variable, sex, is also statistically significant, although this does not matter for the purposes of this experiment. These results show that there are significant differences in student achievement between the class styles. Note that the 5745 residual degrees of freedom indicate the model was fit to essentially the full data set rather than to a sample of 1710. To determine which means differ, we use a Tukey post-hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = totalssk ~ sex + classk, data = df2_ss)
##
## $classk
## diff lwr upr p adj
## small.class-regular 14.1682483 8.529575 19.806921 0.0000000
## regular.with.aide-regular 0.5503816 -4.873617 5.974380 0.9692872
## regular.with.aide-small.class -13.6178667 -19.246785 -7.988949 0.0000000
From this analysis, we can see statistically significant differences (p < .05) for the small.class-regular comparison and the regular.with.aide-small.class comparison (adjusted p < 0.0001 for both; the output rounds them to 0). There is no significant difference for the regular.with.aide-regular comparison (p = 0.9692872). Only the two comparisons involving small.class are therefore significant at the 95% family-wise confidence level.
Based on the plots resulting from the ANOVA, we can see that the residuals are somewhat evenly distributed about 0, although it appears that more residuals are present above 0 than below.
The Q-Q plot of the residuals indicates deviation from the normal distribution, as the extremes depart from the straight line expected if the residuals were normal. This suggests that the model leaves some structure in the data unexplained. Transformations, including square and square root transformations, failed to normalize the residuals. As a result, the untransformed data are presented above. Since the analysis uses only two of the variables in the set, it is relatively unsurprising that the residuals are not normal.
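One way to compare the raw, square-root, and squared responses is to refit the blocked ANOVA on each and compare the Shapiro-Wilk W statistic of the residuals, where values closer to 1 indicate residuals closer to normal. This is a sketch under assumptions, not the check used for the report: resid_w is a hypothetical helper and toy is simulated stand-in data. Because shapiro.test accepts at most 5000 values, the helper subsamples larger residual sets, which would apply to df2.

```r
# Simulated stand-in for df2 (made-up values, NOT the Star data)
set.seed(1)
toy <- data.frame(
  classk = factor(rep(c("regular", "small.class", "regular.with.aide"), each = 200)),
  sex = factor(rep(c("girl", "boy"), times = 300)),
  totalssk = round(rnorm(600, mean = rep(c(918, 932, 918), each = 200), sd = 74))
)

# Hypothetical helper: Shapiro-Wilk W of the residuals after transforming the response
resid_w <- function(d, f_trans = identity) {
  d$y <- f_trans(d$totalssk)
  fit <- aov(y ~ sex + classk, data = d)
  r <- residuals(fit)
  if (length(r) > 5000) r <- sample(r, 5000)  # shapiro.test limit
  unname(shapiro.test(r)$statistic)
}

w_values <- c(
  raw = resid_w(toy),
  square_root = resid_w(toy, sqrt),
  square = resid_w(toy, function(x) x^2)
)
print(w_values)
```

With samples in the thousands, the Shapiro-Wilk p-value flags even trivial departures from normality, so W and the Q-Q plot are usually more informative than the test decision itself.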
The next analysis method is a graphical analysis of the means of the three class groups. This will be done by calculating a mean and a confidence interval for the data. When this is plotted, visual analysis will be conducted to determine differences in the means of the data. 80% confidence intervals will be calculated and depicted as error bars. In order to say that the means are significantly different from each other, the mean of one group must not be located within the confidence interval of another. This analysis will be conducted with all of the data available in the set.
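The half-width of the CI is computed with the following equation for each group, where 1.28 is the z-score for 80% confidence, \(\sigma_{totalssk}\) is the group's standard deviation of totalssk, and \(n\) is the group's sample size:
\[1.28 \cdot \frac{\sigma_{totalssk}}{\sqrt{n}}\]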
Based on the plot above, it is easy to see that with 80% confidence intervals, the small.class achievement mean is significantly different from both the regular and regular.with.aide means, as the CI for this group does not contain the mean of either other group. There is no significant difference between the means of the other two groups, as each mean is contained within the confidence interval of the other. This corroborates the results of the ANOVA conducted above. The conclusion holds at an 80% confidence level, based on the size of the CI bars.
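The interval calculation and the "mean outside the other group's interval" criterion can be sketched as follows. The helper ci_by_group and the simulated toy data are assumptions for illustration. The sketch derives the z-score with qnorm instead of hard-coding 1.28.

```r
# Simulated stand-in for df2 (made-up values, NOT the Star data)
set.seed(1)
toy <- data.frame(
  classk = factor(rep(c("regular", "small.class", "regular.with.aide"), each = 200)),
  sex = factor(rep(c("girl", "boy"), times = 300)),
  totalssk = round(rnorm(600, mean = rep(c(918, 932, 918), each = 200), sd = 74))
)

# Hypothetical helper: mean and normal-approximation CI for each classk group
ci_by_group <- function(d, level = 0.80) {
  z <- qnorm(1 - (1 - level) / 2)  # about 1.28 for an 80% interval
  m <- tapply(d$totalssk, d$classk, mean)
  s <- tapply(d$totalssk, d$classk, sd)
  n <- tapply(d$totalssk, d$classk, length)
  half <- z * s / sqrt(n)
  data.frame(classk = names(m), mean = as.numeric(m),
             lower = as.numeric(m - half), upper = as.numeric(m + half))
}

ci <- ci_by_group(toy)

# TRUE where the mean of the column group lies outside the interval of the row group
idx <- seq_len(nrow(ci))
outside <- outer(idx, idx, function(i, j) {
  ci$mean[j] < ci$lower[i] | ci$mean[j] > ci$upper[i]
})
dimnames(outside) <- list(interval_of = ci$classk, mean_of = ci$classk)

print(ci)
print(outside)
```

The diagonal of outside is always FALSE, since a group's mean sits at the centre of its own interval. Only the off-diagonal entries carry information.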
Diagnostics for this method are limited to checking agreement with the analysis above, which is confirmed. As the method is mostly visual, it is not possible to fully diagnose any issues within the model. However, the notes from the ANOVA diagnostics will likely still apply.
The final method for analyzing the data set is resampling. In this method, a random sample of size 1710 is drawn from df2 10,000 times, and each sample is analyzed using the blocked ANOVA. On every iteration, the classk p-value is recorded in an array. After the 10,000 simulations are complete, each p-value is compared with .05, giving a 1 if it is smaller and a 0 otherwise. The mean of these indicators is the proportion of runs in which the p-value indicated a significant difference between the groups.
The following code performs the resampling.
alpha <- numeric(10000) # initialize p-value array
for (i in 1:10000){
  rand_ss <- sample(nrow(df2), 1710) # draw 1710 row indices (add replace = TRUE for a with-replacement bootstrap)
  df2_ss <- df2[rand_ss,] # create random sample
  av <- aov(totalssk ~ sex + classk, data = df2_ss) # blocked ANOVA
  a <- summary(av)[[1]][["Pr(>F)"]][[2]] # extract classk p-value
  alpha[i] <- a # save p-value
}
mean(alpha < .05) # proportion of significant p-values
After 10,000 samples, the proportion of significant tests was 1, meaning that every run was statistically significant. This corroborates the two prior analyses, but it should not be read as certainty. The reported figure was obtained with the sampling call sample(c(1:nrow(df2),1710)), which shuffles every row of df2 (plus one duplicate) rather than drawing 1710 rows, so each run refit essentially the full-data ANOVA. With genuine subsamples of 1710, the proportion would estimate the power of the test at that sample size and need not equal 1.
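The size argument of sample() determines how many rows each resample contains, which a small case makes concrete. The sketch below first contrasts the two call forms, then wraps the resampling loop in a hypothetical helper, prop_significant, run on simulated stand-in data with far fewer repetitions than the report's 10,000.

```r
# sample(x) with no size argument returns a permutation of all of x
length(sample(c(1:10, 3)))  # 11: the ten indices plus an extra 3
length(sample(10, 3))       # 3: three distinct indices drawn from 1..10

# Simulated stand-in for df2 (made-up values, NOT the Star data)
set.seed(1)
toy <- data.frame(
  classk = factor(rep(c("regular", "small.class", "regular.with.aide"), each = 200)),
  sex = factor(rep(c("girl", "boy"), times = 300)),
  totalssk = round(rnorm(600, mean = rep(c(918, 932, 918), each = 200), sd = 74))
)

# Hypothetical helper: share of subsamples of size n whose classk p-value is below alpha
prop_significant <- function(d, n, reps = 200, alpha = 0.05) {
  p <- replicate(reps, {
    idx <- sample(nrow(d), n)  # n distinct row indices
    fit <- aov(totalssk ~ sex + classk, data = d[idx, ])
    summary(fit)[[1]][["Pr(>F)"]][2]  # second row of the table is classk
  })
  mean(p < alpha)
}

prop_toy <- prop_significant(toy, n = 300, reps = 200)
print(prop_toy)
```

For the report's setting the call would be prop_significant(df2, n = 1710, reps = 10000). The result is an empirical estimate of power at that sample size.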
The residuals of this test are going to be distributed in a manner similar to the ANOVA conducted in the prior test, as it is just resampling of the same test. The conducted test will not come close to explaining all of the variance, as there are many variables not studied in this experiment.
In conclusion, all three methods showed a statistically significant difference between at least two of the group means. Methods 1 and 2 reveal which group means are distinct:
small.class - regular
small.class - regular.with.aide
Additionally, method 3 found a significant difference in the means of the classk groups in every resample that was run.
Test performance in the small class group was higher, indicating that, among the classk options, small classes are associated with the highest test scores. This option could be pursued in order to increase student achievement.
The data were drawn from the R Ecdat package, and the data set used was Star. The structure and the head of the data set can be seen in the section Experimental Setting - Data Overview.
library(Ecdat) # load package (stops with an error if it is not installed)
df <- Star # load data frame
head(df) # show head
str(df) # show structure
df2 <- df[complete.cases(df),] # eliminate incomplete cases
df2$totalssk <- df2$tmathssk + df2$treadssk # create total test variable
df2 <- df2[, c("classk", "sex", "totalssk")] # keep only the necessary columns
head(df2) # show head of new data frame
str(df2) # structure of new data frame
summary(df2) # summary of data frame
ss <- 1710 # G*Power calculated sample size
rand_ss <- sample(nrow(df2), ss) # draw ss random row indices
df2_ss <- df2[rand_ss,] # Pulls data from random rows
boxplot(df2$totalssk~df2$sex,xlab="Sex",ylab="Combined Test Score", main="Analysis of Blocks") # boxplot of blocking variable
boxplot(df2$totalssk~df2$classk,xlab="Class Type",ylab="Combined Test Score", main="Analysis of Testing Variables") # boxplot of testing variable
av = aov(totalssk~ sex + classk, data = df2_ss) # Analyze variance
summary(av) # display ANOVA results
options(scipen = 999) # remove scientific notation for numerics
a <- summary(av)[[1]][["Pr(>F)"]][[2]] # extract classk p-value from ANOVA
b <- summary(av)[[1]][["F value"]][[2]] # extract classk F statistic from ANOVA
posthoc <- TukeyHSD(av,"classk",conf.level = .95) # calculate mean differences
posthoc # show mean differences
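a <- posthoc$classk["small.class-regular", "p adj"] # adjusted p-value, small.class vs regular
b <- posthoc$classk["regular.with.aide-regular", "p adj"] # adjusted p-value, regular.with.aide vs regular
c <- posthoc$classk["regular.with.aide-small.class", "p adj"] # adjusted p-value, regular.with.aide vs small.class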
plot(av,which = c(1:2)) # ANOVA diagnostics and residuals
df2_s <- subset(df2,df2$classk == 'small.class') # subset for small class
df2_r <- subset(df2,df2$classk == 'regular') # subset for regular class
df2_ra <- subset(df2,df2$classk == 'regular.with.aide') # subset for regular aided class
#Calculate confidence intervals at 80% for each class type
CI80r = 1.28 *(sd(df2_r$totalssk) / sqrt(nrow(df2_r)))
CI80ra = 1.28 * (sd(df2_ra$totalssk) / sqrt(nrow(df2_ra)))
CI80s = 1.28 * (sd(df2_s$totalssk) / sqrt(nrow(df2_s)))
# Take mean of each class type test performance
ave_r = mean(df2_r$totalssk)
ave_ra = mean(df2_ra$totalssk)
ave_s = mean(df2_s$totalssk)
# Create initial plot of means
df_plot <- data.frame(x = 1:3,F = c(ave_r,ave_ra,ave_s),L = c(CI80r,CI80ra,CI80s))
# load ggplot package
require(ggplot2)
# Plot CIs on chart with different colors and appropriate axis labels
pg <- ggplot(df_plot,aes(x=x, y = F))+
geom_point(size=4,colour = c("red","blue","green"))+
geom_errorbar(aes(ymax = F+L , ymin = F-L),colour = c("red","blue","green"))+
labs(x= "Class Style",y = "Mean Combined Achievement Score") +
scale_x_continuous(breaks = c(1:3),labels = c("regular","aide","small"))
# display graph
print(pg)
alpha <- numeric(10000) # initialize p-value array
for (i in 1:10000){
  rand_ss <- sample(nrow(df2), 1710) # draw 1710 row indices (add replace = TRUE for a with-replacement bootstrap)
  df2_ss <- df2[rand_ss,] # create random sample
  av <- aov(totalssk ~ sex + classk, data = df2_ss) # blocked ANOVA
  a <- summary(av)[[1]][["Pr(>F)"]][[2]] # extract classk p-value
  alpha[i] <- a # save p-value
}
mean(alpha < .05) # proportion of significant p-values