Experimental Setting

System under test

The data to be tested are from the Ecdat package in R, from the data set entitled Star. These data come from a study of the effects of class style on learning, conducted from 1985 to 1989. Learning is assessed using test scores across three class styles, and the data set also records other relevant student, class, and teacher information.

These data will be used to examine the effect of class type on the learning results of students. The data will be blocked by the sex of the students so that differences attributable to class style are easier to separate from other causes. Sex is a very common blocking variable, as it is relatively easy to identify and control for. The data will be analyzed in terms of these two variables only.

Factors and Levels

The testing independent variable of this dataset is classk, which represents the class teaching style of the student. This has levels small.class, regular, and regular.with.aide. The blocking variable in the data set is sex, with levels boy and girl. This will be used in order to remove some of the variation from the data and allow for more exact representation of the effect from the classk variable.

Continuous Variables

Three continuous variables exist in this data set: totexpk, the teacher's total years of classroom experience, and tmathssk and treadssk, the math and reading test scores, respectively. The totexpk variable is not analyzed in this report. The two test scores will be combined into a single variable, totalssk, which is simply their sum and represents the overall achievement of the student in both subjects. This is the response variable analyzed in this report.

Response Variables

The response variable is overall achievement totalssk, which is simply the summed total of the students’ math and reading examination scores.

Data Overview

head(df)

##    tmathssk treadssk            classk totexpk  sex freelunk  race schidkn
## 2       473      447       small.class       7 girl       no white      63
## 3       536      450       small.class      21 girl       no black      20
## 5       463      439 regular.with.aide       0  boy      yes black      19
## 11      559      448           regular      16  boy       no white      69
## 12      489      447       small.class       5  boy      yes white      79
## 13      454      431           regular       8  boy      yes white       5

str(df)

## 'data.frame':    5748 obs. of  8 variables:
##  $ tmathssk: int  473 536 463 559 489 454 423 500 439 528 ...
##  $ treadssk: int  447 450 439 448 447 431 395 451 478 455 ...
##  $ classk  : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ...
##  $ totexpk : int  7 21 0 16 5 8 17 3 11 10 ...
##  $ sex     : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ...
##  $ freelunk: Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ...
##  $ race    : Factor w/ 3 levels "white","black",..: 1 2 2 1 1 1 2 1 2 1 ...
##  $ schidkn : int  63 20 19 69 79 5 16 56 11 66 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5850] 1 4 6 7 8 9 10 15 16 17 ...
##   .. ..- attr(*, "names")= chr [1:5850] "1" "4" "6" "7" ...

The data set contains 5748 observations with 8 variables.

The following alterations will be performed:

All incomplete cases in the set will be removed.

The tmathssk and treadssk observations will be summed in order to show the overall aptitude of that student.

The data set will be condensed down to the response variable totalssk, the blocking variable sex, and the testing variable classk.
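The three alterations above can be sketched as follows. This is a minimal illustration on a small stand-in data frame (star_toy and its values are hypothetical, chosen only to mirror the Star column names), and it selects columns by name rather than by position.

```r
# Toy stand-in for the Star data (hypothetical values, same column names)
star_toy <- data.frame(
  tmathssk = c(473, 536, NA, 559),
  treadssk = c(447, 450, 439, 448),
  classk   = factor(c("small.class", "small.class", "regular.with.aide", "regular")),
  totexpk  = c(7, 21, 0, 16),
  sex      = factor(c("girl", "girl", "boy", "boy"))
)

# 1. Drop incomplete cases
prepared <- star_toy[complete.cases(star_toy), ]

# 2. Sum the math and reading scores into one response
prepared$totalssk <- prepared$tmathssk + prepared$treadssk

# 3. Keep only the testing variable, the block, and the response, by name
prepared <- prepared[, c("classk", "sex", "totalssk")]

print(prepared)
```

Selecting by name keeps the step valid even if the column order of the source data changes.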

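After paring, the data set looks as follows. No rows were removed as incomplete, because the incomplete cases had already been omitted from the source data (see the na.action attribute in the structure above), but the data are now in a more presentable format.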

head(df2)

##               classk  sex totalssk
## 2        small.class girl      920
## 3        small.class girl      986
## 5  regular.with.aide  boy      902
## 11           regular  boy     1007
## 12       small.class  boy      936
## 13           regular  boy      885

str(df2)

## 'data.frame':    5748 obs. of  3 variables:
##  $ classk  : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ...
##  $ sex     : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ...
##  $ totalssk: int  920 986 902 1007 936 885 818 951 917 983 ...

summary(df2)

##                classk       sex          totalssk     
##  regular          :2000   girl:2794   Min.   : 635.0  
##  small.class      :1733   boy :2954   1st Qu.: 870.0  
##  regular.with.aide:2015               Median : 915.0  
##                                       Mean   : 922.4  
##                                       3rd Qu.: 964.0  
##                                       Max.   :1253.0

Experimental Design

This experiment seeks to observe the impact of the class style on the testing results of students. Statistical and visual analysis can be used in order to guide future class decisions. Obviously, this will not capture all of the variation from the set, but that is also not the defined purpose of the experiment. We only want to see if class style has an impact.

How will the experiment be organized and conducted?

Three separate methods will be used to determine whether class style has a significant effect. All analyses block the data by sex. The three methods are ANOVA, confidence interval analysis, and resampling, in which an ANOVA is run on each resample and the percentage of statistically significant results is measured. The data will be resampled a total of 10,000 times. Sample sizes for the tests will be determined using G*Power 3.1 with an \(\alpha\) of .05 and a \(\beta\) of .1 (a power of .9).

In order to calculate the sample size, we need an estimate of the main effect. This was also done in G*Power. The three groups to be evaluated were the three class types. The means and standard deviations of each group were calculated and yielded the following results.

Small.Class: mean = 932.0508, sd = 76.42836
Regular: mean = 917.942, sd = 73.15339
Regular.With.Aide: mean = 918.4973, sd = 71.58013
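Group summaries of this kind could be computed along the following lines. This is a sketch on simulated data (the toy data frame and its scores are hypothetical stand-ins for df2), so the printed values will not match the figures above.

```r
# Toy stand-in for df2 (hypothetical scores; the real data come from Ecdat::Star)
set.seed(1)
toy <- data.frame(
  classk   = factor(rep(c("regular", "small.class", "regular.with.aide"), each = 50)),
  totalssk = round(rnorm(150, mean = 920, sd = 75))
)

# Split the response by class type, then summarize each piece
by_class <- split(toy$totalssk, toy$classk)
group_stats <- data.frame(
  mean = vapply(by_class, mean, numeric(1)),
  sd   = vapply(by_class, sd, numeric(1)),
  n    = vapply(by_class, length, integer(1))
)
print(group_stats)
```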

The homogeneity of variances was tested with a Bartlett test.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  classk by sex
## Bartlett's K-squared = 0.00011189, df = 1, p-value = 0.9916

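As the p-value is very high, the test shows no significant difference in variance. Note that the test as specified (classk by sex, df = 1) compares the two sex blocks rather than the three class types, so it supports the assumption only indirectly; the three group standard deviations listed above are nonetheless similar. Homogeneity of variances was therefore assumed, and the mean of the three standard deviations (73.72) was used in the calculation of the effect size. Input of these results into G*Power yielded a required sample size of 1710.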
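As a rough cross-check of this kind of calculation, the effect size and sample size can be sketched in base R with power.anova.test. This is not the G*Power procedure itself: it assumes equal group sizes and a one-way layout, so it need not reproduce the figure of 1710 exactly.

```r
# Group means and SDs as reported above
group_means <- c(small.class = 932.0508, regular = 917.942, regular.with.aide = 918.4973)
group_sds   <- c(76.42836, 73.15339, 71.58013)

# Common SD taken as the mean of the three group SDs (about 73.72)
common_sd <- mean(group_sds)

# Cohen's f: SD of the group means (population form) over the common SD
f <- sqrt(mean((group_means - mean(group_means))^2)) / common_sd

# Per-group n for alpha = .05 and power = .90 (beta = .10); assumes equal group sizes
pw <- power.anova.test(groups = 3,
                       between.var = var(group_means),
                       within.var = common_sd^2,
                       sig.level = 0.05, power = 0.90)
total_n <- 3 * ceiling(pw$n)
print(c(cohens_f = f, total_n = total_n))
```

The resulting f is small by conventional standards, which is why a sample in the thousands is needed to reach a power of .9.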

With this sample size, the above methods can all be performed in order to determine if there is a difference between the mean test scores of the various class styles.

What is the rationale for this design?

Using three separate methods to test the hypothesis gives a better understanding of how robust the differences are. The combination of null hypothesis testing, resampling, and visual CI assessment portrays the data in a variety of ways, all of which can be examined for significance. It provides a better picture of the overall results than null hypothesis testing alone and also allows the results to be corroborated.

Blocking is important in this model in order to eliminate one source of variation. Sex was chosen as the blocking variable, as it would be the easiest to block by in most studies.

Assumptions

This data set meets many of the assumptions required for a randomized design. It has a large number of replicates in each group, appears to be randomly selected, and can be blocked by sex. Based on the principles of randomization, replication, and blocking, the data set therefore appears suitable for factorial analysis.

Statistical Analysis

Three different analyses will be performed in order to assess the differences in group means. First, a blocked ANOVA will be conducted in order to determine if the null hypothesis of no difference can be rejected. Next, a graphical 80% CI assessment will be used in order to determine which groups have significantly different means. Finally, resampling statistics will be used in order to determine the percentage of time in which the difference is significant. If this value is above 95%, then we will assume that the difference is significant.

Descriptive Summary

Boxplots of the blocked groups are displayed in order to show the differences between the groups. There are some small differences in the medians, but the blocked data sets look relatively similar in composition. This will likely change once they are broken down further.

Additionally, the data sets are analyzed by the class type variable. This analysis shows that there are slight differences in the means of the class types. However, blocking will likely be beneficial in the full analysis of the data set.

Testing

The three testing methods that are to be used are null hypothesis testing, Confidence Interval (CI) plots and a resampling utilizing a bootstrapping approach. For all experiments the hypotheses will be the following:

Ho - There is no difference between the population means of the classk groups.

Ha - At least one classk group mean differs from the others.
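In symbols, writing \(\mu\) for the population mean of totalssk in each classk group (notation introduced here for illustration only), the hypotheses can be stated as:

\[H_0: \mu_{small.class} = \mu_{regular} = \mu_{regular.with.aide}\]

\[H_a: \mu_i \neq \mu_j \text{ for at least one pair of groups } i \neq j\]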

ANOVA

In the ANOVA method, the sample was blocked by sex, with class style as the testing variable. The output is displayed below:

##               Df   Sum Sq Mean Sq F value   Pr(>F)    
## sex            1   276469  276469   51.45 8.26e-13 ***
## classk         2   234012  117006   21.78 3.79e-10 ***
## Residuals   5745 30868728    5373                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-stat for the classk variable is 21.78, and the p-value is 3.79e-10, which indicates that the result of the test is highly significant. The blocking variable, sex, is also statistically significant, although this does not matter for the purposes of this experiment. These results show that there are significant differences in student achievement between the different class styles. In order to determine which of the means are different, we will use a Tukey post-hoc test.
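The relationship between the table entries can be checked by hand. The sketch below recomputes the classk F-statistic and p-value from the rounded mean squares and degrees of freedom printed in the ANOVA table above (the variable names are chosen here for illustration).

```r
# Values copied from the classk and Residuals rows of the ANOVA table above
ms_classk <- 117006
ms_resid  <- 5373
df_classk <- 2
df_resid  <- 5745

# F is the ratio of the factor mean square to the residual mean square
f_stat <- ms_classk / ms_resid

# The upper tail of the F distribution gives the p-value
p_val <- pf(f_stat, df_classk, df_resid, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

Because the mean squares are rounded in the printed table, the recomputed values agree with the table only to about three significant figures.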

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = totalssk ~ sex + classk, data = df2_ss)
## 
## $classk
##                                      diff        lwr       upr     p adj
## small.class-regular            14.1682483   8.529575 19.806921 0.0000000
## regular.with.aide-regular       0.5503816  -4.873617  5.974380 0.9692872
## regular.with.aide-small.class -13.6178667 -19.246785 -7.988949 0.0000000

From this analysis, we can see that there are statistically significant differences (p < .05) for the small.class-regular comparison (adjusted p < .0001) and the regular.with.aide-small.class comparison (adjusted p < .0001). There is no significant difference for the regular.with.aide-regular comparison (p = 0.9692872). Only the two comparisons involving small.class are therefore significant at the 95% confidence level.
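The significant pairs can also be picked out programmatically rather than read off the printout. The following sketch uses simulated data (the toy data frame, with a deliberately built-in small-class advantage, is hypothetical) and selects rows of the Tukey comparison matrix by its "p adj" column.

```r
# Simulated stand-in for df2 (hypothetical data with a built-in small-class advantage)
set.seed(42)
n <- 300
toy <- data.frame(
  sex    = factor(sample(c("girl", "boy"), n, replace = TRUE)),
  classk = factor(sample(c("regular", "small.class", "regular.with.aide"), n, replace = TRUE))
)
toy$totalssk <- 920 + 40 * (toy$classk == "small.class") + rnorm(n, sd = 30)

# Blocked ANOVA followed by Tukey's HSD on the class-type factor
fit   <- aov(totalssk ~ sex + classk, data = toy)
tukey <- TukeyHSD(fit, "classk", conf.level = 0.95)

# Keep the comparisons whose adjusted p-value is below .05
comparisons <- tukey$classk
significant <- rownames(comparisons)[comparisons[, "p adj"] < 0.05]
print(significant)
```

Indexing by column name is less fragile than indexing the matrix by position.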

Model Diagnostics

Based on the plots resulting from the ANOVA, we can see that the residuals are somewhat evenly distributed about 0, although it appears that more residuals are present above 0 than below.

The Q-Q plot of the residuals indicates that there is deviation from the normal distribution, as the extremes deviate from the straight line that is expected if the residuals are normal. This indicates that there is more variance that should be analyzed. Transformations, including square and square root transformations, failed to normalize the residuals. As a result, the untransformed data are presented above. Since only two of the variables in the set are analyzed, it is relatively unsurprising that the residuals are not normal.

Confidence Interval Analysis

The next analysis method is a graphical analysis of the means of the three class groups. This will be done by calculating a mean and a confidence interval for the data. When this is plotted, visual analysis will be conducted to determine differences in the means of the data. 80% confidence intervals will be calculated and depicted as error bars. In order to say that the means are significantly different from each other, the mean of one group must not be located within the confidence interval of another. This analysis will be conducted with all of the data available in the set.

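The half-width of the CI is computed with the following equation for each group, where 1.28 is the z-score for 80% confidence, \(\sigma_{totalssk}\) is the standard deviation of totalssk in the group, and \(n\) is the sample size of the group:

\[1.28 \times \frac{\sigma_{totalssk}}{\sqrt{n}}\]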

Based on the plot above, it is easy to see that with 80% confidence intervals, the small.class achievement mean is significantly different from both the regular and regular.with.aide means, as the CI for this group does not contain the mean of either other group. There is no significant difference between the means of the other two groups, as the mean of each is contained within the confidence interval of the other. This corroborates the results of the ANOVA conducted above. This is significant at an 80% confidence level, based on the size of the CI bars.
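The intervals can be worked through numerically from the summary statistics already reported in this document (the group means and standard deviations from the Experimental Design section and the group counts from summary(df2)). The sketch below is illustrative; it uses qnorm(0.90) in place of the rounded multiplier 1.28.

```r
# Group statistics as reported earlier in this report
stats <- data.frame(
  group = c("small.class", "regular", "regular.with.aide"),
  mean  = c(932.0508, 917.942, 918.4973),
  sd    = c(76.42836, 73.15339, 71.58013),
  n     = c(1733, 2000, 2015)
)

# z multiplier for a two-sided 80% interval (about 1.28)
z80 <- qnorm(0.90)

# Half-width and interval bounds for each group mean
stats$half  <- z80 * stats$sd / sqrt(stats$n)
stats$lower <- stats$mean - stats$half
stats$upper <- stats$mean + stats$half
print(stats)
```

Each half-width comes out at roughly 2 points, so the small.class interval lies entirely above the other two, while the regular and regular.with.aide intervals each contain the mean of the other.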

Model Diagnostics

Diagnostics of this method are limited to checking agreement with the above analysis, which is confirmed. As the method is mostly visual, it is not possible to fully diagnose any issues within the model. However, the notes from the above model will likely still apply.

Resampling Analysis

The final method for analyzing the data set is resampling. In this method, a random sample of size 1710 is drawn from df2 10,000 times, and each sample is analyzed using an ANOVA. In every iteration, the p-value is recorded in an array. After the 10,000 simulations are complete, each p-value is compared against .05, yielding 1 if it is smaller and 0 otherwise. The mean of these indicators gives the proportion of runs in which the p-value indicated a significant difference between the groups.

The following code was used for bootstrapping.

alpha = numeric(10000) # initialize p-value array
for (i in 1:10000){
  rand_ss <- sample(nrow(df2), 1710) # draw 1710 random row indices
  df2_ss <- df2[rand_ss,] # create random sample
  av = aov(totalssk~ sex + classk, data = df2_ss) #ANOVA
  a <- summary(av)[[1]][["Pr(>F)"]][[2]] # extract p-value

  alpha[i] <- a # save p-value
  
}
mean(alpha < .05) # proportion of significant p-values

After 10,000 samples, the proportion of significant ANOVA tests was 1; that is, every run was statistically significant. This is strong and consistent evidence of a difference between the average achievement of the different class styles, and it corroborates the results of the two prior analyses.

Model Diagnostics

The residuals of this test are going to be distributed in a manner similar to the ANOVA conducted in the prior test, as it is just resampling of the same test. The conducted test will not come close to explaining all of the variance, as there are many variables not studied in this experiment.

Conclusions

In conclusion, all three methods showed a statistically significant difference between at least two of the group means. Methods 1 and 2 reveal which group means are distinct:

small.class - regular
small.class - regular.with.aide

Additionally, test 3 shows that regardless of the sample that is drawn, there is a significant difference in the means of the classk groups.

Test performance was higher in the small class group, indicating that, of the three class styles, small classes are the best option for increasing students' test scores. This could be pursued going forward in order to increase the achievement of students.

Appendix 1: Raw Data

The data were drawn from the R Ecdat library, and the dataset used was Star. The structure and the head of the data set can be seen in the section Experimental Setting - Data Overview.

Appendix 2: R Code

require(Ecdat) # load package
df <- Star # load data frame

head(df) # show head 

str(df) # show structure

df2 <- df[complete.cases(df),] # eliminate incomplete cases


df2$totalssk <- df2$tmathssk + df2$treadssk # create total test variable

df2 <- df2[,c(3,5,9)] # select only necessary columns (classk, sex, totalssk)

head(df2) # show head of new data frame

str(df2) # structure of new data frame

summary(df2) # summary of data frame

ss <- 1710 # G*Power calculated sample size

rand_ss <- sample(nrow(df2), ss) # Draws ss random row indices
df2_ss <- df2[rand_ss,] # Pulls data from random rows

boxplot(df2$totalssk~df2$sex,xlab="Sex",ylab="Combined Test Score", main="Analysis of Blocks") # boxplot of blocking variable

boxplot(df2$totalssk~df2$classk,xlab="Class Type",ylab="Combined Test Score", main="Analysis of Testing Variables") # boxplot of testing variable

av = aov(totalssk~ sex + classk, data = df2_ss) # Analyze variance
summary(av) # display ANOVA results
options(scipen = 999) # remove scientific notation for numerics
a <- summary(av)[[1]][["Pr(>F)"]][[2]] # extract p-value for classk from ANOVA

b <- summary(av)[[1]][["F value"]][[2]] # extract F-stat for classk from ANOVA

posthoc <- TukeyHSD(av,"classk",conf.level = .95) # calculate mean differences 
posthoc # show mean differences
a <- posthoc$classk[10] # p-value of group 1
b <- posthoc$classk[11] # p-value of group 2
c <- posthoc$classk[12] # p-value of group 3

plot(av,which = c(1:2)) # ANOVA diagnostics and residuals


df2_s <- subset(df2,df2$classk == 'small.class') # subset for small class
df2_r <- subset(df2,df2$classk == 'regular') # subset for regular class
df2_ra <- subset(df2,df2$classk == 'regular.with.aide') # subset for regular aided class

#Calculate confidence intervals at 80% for each class type
CI80r = 1.28 *(sd(df2_r$totalssk) / sqrt(nrow(df2_r)))
CI80ra = 1.28 * (sd(df2_ra$totalssk) / sqrt(nrow(df2_ra)))
CI80s = 1.28 * (sd(df2_s$totalssk) / sqrt(nrow(df2_s)))

# Take mean of each class type test performance
ave_r = mean(df2_r$totalssk)
ave_ra = mean(df2_ra$totalssk)
ave_s = mean(df2_s$totalssk)

# Create initial plot of means 
df_plot <- data.frame(x = 1:3,F = c(ave_r,ave_ra,ave_s),L = c(CI80r,CI80ra,CI80s))

# load ggplot package
require(ggplot2)

# Plot CIs on chart with different colors and appropriate axis labels
pg <- ggplot(df_plot,aes(x=x, y = F))+
  geom_point(size=4,colour = c("red","blue","green"))+ 
  geom_errorbar(aes(ymax = F+L , ymin = F-L),colour = c("red","blue","green"))+
  labs(x= "Class Style",y = "Mean Combined Achievement Score") + 
  scale_x_continuous(breaks = c(1:3),labels = c("regular","aide","small"))
  
# display graph
print(pg)


alpha = numeric(10000) # initialize p-value array
for (i in 1:10000){
  rand_ss <- sample(nrow(df2), 1710) # draw 1710 random row indices
  df2_ss <- df2[rand_ss,] # create random sample
  av = aov(totalssk~ sex + classk, data = df2_ss) #ANOVA
  a <- summary(av)[[1]][["Pr(>F)"]][[2]] # extract p-value

  alpha[i] <- a # save p-value
  
}
mean(alpha < .05) # proportion of significant p-values

Effect of Class Style on Student Achievement

Michael Wassick

November 10, 2016
