Overview

In this project, we used the bootstrap and a permutation test to determine whether there is a statistically significant difference between the midterm scores of two class sections. Preliminary data exploration showed that the scores in neither section follow a normal distribution or any other distribution we could identify; hence we used a permutation test, which makes no distributional assumption, to calculate the achieved significance level.

Library

library(dplyr)

We used the dplyr package for its data manipulation functions.

Importing Data

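# check.names = FALSE keeps the column name `Section ID` exactly as written in the file
score_data = read.csv("midterm_score.csv", header = TRUE, check.names = FALSE)
score_data = score_data[complete.cases(score_data), ]  # drop rows with missing values
score_data = score_data %>%
  filter(score != 0)  # drop zero scores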

We imported the midterm score data, dropped rows with missing values, and excluded scores of 0 since they were outliers.

Section_A = score_data %>%
  filter(`Section ID` %in% c("A11", "A12", "A13", "A14"))

Section_C = score_data %>%
  filter(`Section ID` %in% c("C31", "C32", "C33", "C34"))

Then we separated the scores by section. We were mainly interested in the difference between the two major sections, section A and section C, also known as the morning section and the afternoon section.

Preliminary Data Exploration

par(mfrow=c(1,3))
boxplot(Section_A$score)
hist(Section_A$score, main="Section A Test Score", xlab ="Test Score")
qqnorm(Section_A$score);qqline(Section_A$score)

par(mfrow=c(1,3))
boxplot(Section_C$score)
hist(Section_C$score, main="Section C Test Score", xlab ="Test Score")
qqnorm(Section_C$score);qqline(Section_C$score)

From the boxplots, we noticed that both sections have a median score of around 85, but the histograms showed that the two score distributions differ somewhat in shape. We wanted to compare the sections’ average scores; however, the normal Q-Q plots suggested that the scores might not be normally distributed. Hence, constructing and comparing normal-theory confidence intervals for the average scores would not reliably tell us whether being in a different section has a statistically significant effect on the midterm score.

Normality Test

shapiro.test(Section_A$score)

## 
##  Shapiro-Wilk normality test
## 
## data:  Section_A$score
## W = 0.92428, p-value = 0.0002738

shapiro.test(Section_C$score)

## 
##  Shapiro-Wilk normality test
## 
## data:  Section_C$score
## W = 0.94429, p-value = 0.0009528

We ran the Shapiro-Wilk normality test on both sections’ scores, and the low p-values (below 0.001 for both sections) suggested that neither meets the normality assumption. We could have checked whether the scores follow some other known distribution; however, we believed that they do not, and decided to use tests that require little information about the underlying distribution.
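To illustrate how the test behaves, the following sketch runs shapiro.test on simulated data rather than on the midterm scores. The seed, the sample size, and the two simulated vectors are assumptions made purely for illustration.

```r
# Illustration on simulated data (not the midterm scores).
set.seed(1)                                     # arbitrary seed for reproducibility
normal_like <- rnorm(80, mean = 80, sd = 10)    # roughly bell-shaped scores
left_skewed <- 100 - rexp(80, rate = 1 / 15)    # long left tail, capped at 100

p_normal <- shapiro.test(normal_like)$p.value
p_skewed <- shapiro.test(left_skewed)$p.value

# A small p-value is evidence against normality;
# a large p-value does not prove that the data are normal.
c(normal_like = p_normal, left_skewed = p_skewed)
```

With strongly skewed data the p-value is typically very small, which is the pattern we saw for both sections.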

Bootstrap

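boots_trap = function(rep_num_list, data){
  # For each repetition count in rep_num_list, return the bootstrap
  # standard deviation of log(mean(data)).

  sd_list = c()
  n = length(data)

  # n_rep avoids shadowing the base function rep()
  for (n_rep in rep_num_list){

    boot_rep = rep(NA, n_rep)

    for (i in 1:n_rep){
      id = sample(n, n, replace = TRUE)      # resample indices with replacement
      sample_data <- data[id]
      boot_rep[i] = log(mean(sample_data))   # statistic is on the log scale
    }
    sd_list = append(sd_list, sd(boot_rep))
  }
  return(sd_list)
}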

The idea behind our bootstrap is to estimate the standard deviation (the standard error) of our estimate. We wanted to estimate each section’s average score by constructing a 95% C.I. using the sd calculated from our bootstrap function. We repeatedly resampled the observed scores with replacement, computed the statistic on each resample, and took the standard deviation of those values. Note that the function records the log of each resampled mean, so the sd it returns is on the log scale. We did not know how many repetitions would be adequate, so the function accepts a list of repetition counts to help us determine the right number of resamples.
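The resampling idea can be checked against a case where the answer is known. The sketch below uses simulated scores and a hypothetical helper, boot_se_mean, that records mean(...) rather than log(mean(...)); the seed, the data, and the number of repetitions are all assumptions. It compares the bootstrap standard error of the mean with the textbook value sd / sqrt(n).

```r
# Bootstrap standard error of the sample mean, on simulated scores.
set.seed(42)
scores <- round(pmin(100, rnorm(70, mean = 80, sd = 12)))  # hypothetical data

boot_se_mean <- function(x, n_rep = 2000) {
  # Each replicate resamples x with replacement and records its mean.
  boot_means <- replicate(n_rep, mean(sample(x, length(x), replace = TRUE)))
  sd(boot_means)
}

se_boot    <- boot_se_mean(scores)
se_formula <- sd(scores) / sqrt(length(scores))  # textbook standard error
c(bootstrap = se_boot, formula = se_formula)
```

The two numbers should be close. The advantage of the bootstrap is that the same recipe works for statistics that have no simple standard error formula.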

Average Score 95% C.I.

rep_num = seq(25, 2500, by=25)

par(mfrow=c(1,2))
sd_list_A = boots_trap(rep_num, Section_A$score)
sd_list_C = boots_trap(rep_num, Section_C$score)

plot(rep_num, sd_list_A, ylim = c(0.015,0.025), xlab = "# Repetition",ylab = "SD",main = "Section A")
plot(rep_num, sd_list_C, ylim = c(0.015,0.025), xlab = "# Repetition",ylab = "SD",main = "Section C")

mean(Section_A$score) + c(-1, 1) * qnorm(0.975) * sd_list_A[length(sd_list_A)]

## [1] 80.60049 80.66978

mean(Section_C$score) + c(-1, 1) * qnorm(0.975) * sd_list_C[length(sd_list_C)]

## [1] 81.63203 81.70131

We observed that after roughly 1000 repetitions the calculated standard deviation settled near a stable value. We then used the sd from the largest number of repetitions (2500) to construct a 95% confidence interval for each section’s average midterm score. Because that sd is computed for log(mean) but added to the raw mean, the intervals above are far too narrow: an sd of about 0.018 on the log scale corresponds to a standard error of roughly 1.4 points on the score scale. The difference between the two sections’ average scores is about 1 point. To check whether that difference is statistically significant, we used a permutation test.
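Since boots_trap records log(mean(...)), an interval built from its sd is easiest to interpret when everything stays on one scale. The sketch below shows two ways to do that on simulated scores: build the interval for log(mean) and exponentiate it, or convert the log-scale sd to score units with the delta method. The seed, the data, and the variable names are assumptions, not part of the original analysis.

```r
# Simulated scores stand in for one section (hypothetical data).
set.seed(7)
scores <- round(pmin(100, rnorm(70, mean = 80, sd = 12)))
n <- length(scores)

# Bootstrap SD of log(mean), the same statistic that boots_trap records.
log_means <- replicate(2000, log(mean(sample(scores, n, replace = TRUE))))
sd_log <- sd(log_means)

z <- qnorm(0.975)
# (a) Interval on the log scale, then back-transformed to score units.
ci_backtransformed <- exp(log(mean(scores)) + c(-1, 1) * z * sd_log)
# (b) Delta method: SE(mean) is approximately mean * SD(log(mean)).
ci_delta <- mean(scores) + c(-1, 1) * z * mean(scores) * sd_log

rbind(ci_backtransformed, ci_delta)
```

Both intervals should nearly coincide and should be several points wide, in contrast to an interval that adds the log-scale sd directly to the raw mean.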

Permutation Test

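permutation_test = function(list1, list2, rep_num) {

  # Observed difference in means between the two groups
  mean_diff = mean(list1) - mean(list2)
  total_list = c(list1, list2)
  list_length = length(total_list)
  # Each column holds the indices randomly assigned to group 1 in one permutation
  perm_index = replicate(rep_num, sample(list_length, length(list1)))

  diff = numeric(rep_num)  # preallocate instead of growing with append()
  for(i in 1:rep_num) {
    # Difference in means under the shuffled group labels
    diff[i] = mean(total_list[perm_index[,i]]) - mean(total_list[-perm_index[,i]])
  }

  # ASL: share of permuted differences more extreme than the observed one
  two_sided = mean(abs(diff) > abs(mean_diff))
  list(diff=diff, significance = two_sided)
}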

perm_test = permutation_test(Section_A$score, Section_C$score, 1000)
perm_test$significance

## [1] 0.623

hist(perm_test$diff,xlab = "Score Difference", main = "Score Difference Histogram")

The purpose of our permutation test was to determine whether there is significant evidence to reject the null hypothesis that the two sections’ scores come from the same distribution, and in particular that their mean scores are equal. We pooled the scores, repeatedly reassigned them at random to two groups of the original sizes, and compared the observed difference in means with the resulting differences to compute the achieved significance level (ASL).

We were hoping to obtain an ASL below 0.05, which would let us reject the null hypothesis at the 5% significance level; however, the ASL of 0.623 is far higher, so we could not find a significant difference between the average midterm scores of the two sections.
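For contrast, it helps to see what the ASL looks like when a real difference exists. The sketch below applies the same permutation idea to two simulated groups whose means differ by 10 points. The helper perm_asl, the seed, and the data are hypothetical, and the sketch counts permuted differences that are at least as extreme as the observed one (>=), a common convention.

```r
# Two simulated groups with a known 10-point shift (hypothetical data).
set.seed(123)
group_a <- rnorm(60, mean = 80, sd = 10)
group_b <- rnorm(60, mean = 90, sd = 10)

perm_asl <- function(x, y, n_perm = 2000) {
  observed <- mean(x) - mean(y)
  pooled <- c(x, y)
  perm_diffs <- replicate(n_perm, {
    idx <- sample(length(pooled), length(x))   # random relabelling of the pool
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # Share of relabellings at least as extreme as the observed difference.
  mean(abs(perm_diffs) >= abs(observed))
}

asl_shifted <- perm_asl(group_a, group_b)
asl_shifted
```

When the groups really differ, almost no random relabelling reproduces a gap as large as the observed one, so the ASL is close to 0. Our ASL of 0.623 means that most relabellings produced a gap larger than the one we observed.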

Conclusion

By using resampling methods such as the bootstrap and the permutation test, we could analyze the score difference beyond a simple comparison of averages and obtain statistical evidence on this question.

Based on this analysis, we found no statistically significant difference between the two sections’ midterm scores. As a TA, it was reassuring that being in a different section does not appear to lead to different grades, since such a difference could be a sign of bias.

Bootstrap and Permutation Test - Understanding Unknown Distribution

Kitae Kim