Introduction

I have been hired to run some numbers. In the Spring 2018 semester, two biology professors had a natural experiment on their hands: of Dr. Kamal Dulai and Dr. Mufadhal Al-Kuhlani, one taught the Biology 2 course in a “traditional” way while the other employed “active learning” techniques. Here, we will run statistical tests to see whether one paradigm is possibly better than the other.

In this exploratory report, I compare the final exams from Biology 2 for the Spring 2018 semester. The exam came in two versions, which happened to be named “female” and “male” after the Scantron forms used, but the questions were simply permuted between versions so that the experiment was preserved. The class schedule included three sections (labeled 1, 20, and 30 in the data below), for a total of 330 students who took the same final exam.

This early report focuses on “Scantron 1”, the first portion of the final exam, which consisted purely of single-answer multiple-choice questions (i.e., no multiple-answer, diagram, short-answer, or essay tasks).

Data

library("readxl")
library("DT")
library("tidyverse")

df <- read_excel("Bio 2_-_final exams data.xlsx")
DT::datatable(df)

Summary Statistics

Overall, the scores out of 100 multiple-choice questions are summarized as follows.

summary(df$score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   44.00   51.00   51.91   58.75   88.00

Since the exam was administered at the same time for all of these students (except those with other accommodations), the test version within each lecture section should not matter, but as a practice calculation, here are the means and medians.

df %>%
  group_by(section, version) %>%
  summarize(mean = mean(score),
            median = median(score))
## # A tibble: 6 x 4
## # Groups:   section [?]
##   section version  mean median
##     <dbl> <chr>   <dbl>  <dbl>
## 1       1 female   51.0   50  
## 2       1 male     49.5   48.5
## 3      20 female   53.1   53  
## 4      20 male     53.6   51.5
## 5      30 female   52.3   50  
## 6      30 male     52.3   51

Now for the main show, here is a quick comparison between the teaching styles and how their students performed on that final exam.

df %>%
  group_by(style) %>%
  summarize(mean = mean(score),
            median = median(score))
## # A tibble: 2 x 3
##   style        mean median
##   <chr>       <dbl>  <dbl>
## 1 active       51.8     51
## 2 traditional  52.3     51

At this point, the means differ by only about half a point and the medians are identical, so the summary statistics are too close to support either teaching paradigm.
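To judge how close two group means really are, it helps to see the group sizes and spread next to them. The following is a minimal sketch on simulated scores: the `sim` data frame and its values are made up for illustration and are not the course data, and only the `style` and `score` column names are taken from the code above.

```r
library(dplyr)

# Simulated stand-in data: NOT the real exam scores.
set.seed(1)
sim <- data.frame(
  style = rep(c("active", "traditional"), times = c(60, 40)),
  score = round(rnorm(100, mean = 52, sd = 10))
)

# Group size, mean, standard deviation, and standard error of the mean.
style_summary <- sim %>%
  group_by(style) %>%
  summarize(n_students = n(),
            mean_score = mean(score),
            sd_score = sd(score),
            se_score = sd_score / sqrt(n_students))

style_summary
```

If the gap between the two means is small relative to the standard errors, the summary statistics alone cannot separate the groups.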

Density Plots

Beyond the means and medians, let us now look at the distributions of the final exam scores.

Once again, the test version should not matter, but as a practice calculation, here are the score distributions for the two versions.

df %>%
  ggplot(aes(x = score, color = version, fill = version)) +
  geom_density(kernel = "gaussian", alpha = 0.5) +
  labs(title = "Test Versions")

For the main experiment, here are the distributions between the teaching styles and how their students performed on that final exam.

df %>%
  ggplot(aes(x = score, color = style, fill = style)) +
  geom_density(kernel = "gaussian", alpha = 0.5) +
  labs(title = "Teaching Styles")

Once again, the density curves overlap so much that the results are too close to support either teaching paradigm.

Hypothesis Testing

Finally, let us run hypothesis tests to compare the results in statistically sound ways.

First, as a baseline comparison, I will do the silly calculation between the test versions (same exam questions, but shuffled), where no difference is expected.

# Scores from the "female" version of the exam
female_exams <- df %>%
  filter(version == "female") %>%
  select(score)
# Scores from the "male" version of the exam
male_exams <- df %>%
  filter(version == "male") %>%
  select(score)

t.test(as.data.frame(female_exams), 
       as.data.frame(male_exams), 
       alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  as.data.frame(female_exams) and as.data.frame(male_exams)
## t = 0.39079, df = 327.97, p-value = 0.6962
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.873606  2.802522
## sample estimates:
## mean of x mean of y 
##  52.14110  51.67665

Since the p-value (0.70) is greater than 0.05, we fail to reject the null hypothesis that the mean scores on the “female” and “male” exam versions are equal.
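For readers curious about what the Welch test computes, here is a minimal sketch that rebuilds the t statistic, the degrees of freedom, and the two-sided p-value by hand and checks them against `t.test()`. The vectors `x` and `y` are simulated stand-ins (an assumption for illustration), not the exam scores.

```r
# Simulated stand-in scores: NOT the real exam data.
set.seed(42)
x <- rnorm(50, mean = 52, sd = 10)
y <- rnorm(40, mean = 51, sd = 12)

# Welch's t statistic: difference in means over its standard error.
vx <- var(x) / length(x)
vy <- var(y) / length(y)
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# t.test() runs the Welch test by default (var.equal = FALSE).
welch <- t.test(x, y, alternative = "two.sided")
all.equal(unname(welch$statistic), t_manual)
all.equal(unname(welch$parameter), df_manual)
all.equal(welch$p.value, p_manual)
```

Because the Welch test does not assume equal variances, its degrees of freedom are generally not a whole number, which is why the output above reports a fractional df.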


Now for the main show.

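# Scores from students in the active-learning class
active_learning <- df %>%
  filter(style == "active") %>%
  select(score)
# Scores from students in the traditional-lecture class
traditional_lecture <- df %>%
  filter(style == "traditional") %>%
  select(score)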

t.test(as.data.frame(active_learning), 
       as.data.frame(traditional_lecture), 
       alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  as.data.frame(active_learning) and as.data.frame(traditional_lecture)
## t = -0.39735, df = 150.04, p-value = 0.6917
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.200989  2.129124
## sample estimates:
## mean of x mean of y 
##  51.76639  52.30233

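Since the p-value (0.69) is greater than 0.05, we fail to reject the null hypothesis that traditional lectures and active learning led to the same mean final exam score. The 95 percent confidence interval for the difference in means, roughly -3.2 to 2.1 points, includes zero.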
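The same kind of comparison can also be written with the formula interface of `t.test()`, which avoids building a separate object per group. This is a minimal sketch on simulated data: the `sim` data frame is a made-up stand-in, and only the `score` and `style` column names are taken from the report.

```r
# Simulated stand-in data: NOT the real exam scores.
set.seed(7)
sim <- data.frame(
  style = rep(c("active", "traditional"), times = c(60, 40)),
  score = rnorm(100, mean = 52, sd = 10)
)

# Formula interface: score grouped by style, Welch test by default.
by_formula <- t.test(score ~ style, data = sim)

# Equivalent call with two plain numeric vectors.
by_vectors <- t.test(sim$score[sim$style == "active"],
                     sim$score[sim$style == "traditional"])

all.equal(by_formula$p.value, by_vectors$p.value)

# 95% confidence interval for the difference in means
# (first group minus second, here active minus traditional).
by_formula$conf.int
```

With a character grouping column, the groups are ordered alphabetically, so the reported difference is active minus traditional.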

Conclusion

Alas, my quick report produced inconclusive results when comparing the final exam results at UC Merced for the active-learning and traditional-lecture professors.

I will inquire with those instructors to see if they want further analyses.