I was hired to run some numbers. In the Spring 2018 semester, a couple of biology professors had a natural experiment on their hands: of Dr. Kamal Dulai and Dr. Mufadhal Al-Kuhlani, one taught the Biology 2 course in a “traditional” way while the other employed “active learning” techniques. Here, we will run statistical tests to see whether one paradigm is possibly better than the other.
In this exploratory report, I compared the final exams from Biology 2 for the Spring 2018 semester. The exams came in two versions (named “female” and “male” after the Scantron form used), but the questions were simply permuted between versions to preserve the experiment. The class schedule included three sections (labeled 1, 20, and 30 in the data), for a total of 330 students who took the same final exam.
This early report focuses on “Scantron 1”, the first portion of the final exam, which consisted purely of single-answer multiple-choice questions (i.e., no multiple-answer, diagram, short-answer, or essay tasks).
library("readxl")
library("DT")
library("tidyverse")
df <- read_excel("Bio 2_-_final exams data.xlsx")
DT::datatable(df)
Overall, the summary of scores out of 100 multiple-choice questions is
summary(df$score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 44.00 51.00 51.91 58.75 88.00
Since the exam was administered at the same time for all of these students (except those with accommodations), the test version within each lecture section should not matter, but as a practice calculation, here are the means and medians.
df %>%
group_by(section, version) %>%
summarize(mean = mean(score),
median = median(score))
## # A tibble: 6 x 4
## # Groups: section [?]
## section version mean median
## <dbl> <chr> <dbl> <dbl>
## 1 1 female 51.0 50
## 2 1 male 49.5 48.5
## 3 20 female 53.1 53
## 4 20 male 53.6 51.5
## 5 30 female 52.3 50
## 6 30 male 52.3 51
Now for the main show, here is a quick comparison between the teaching styles and how their students performed on that final exam.
df %>%
group_by(style) %>%
summarize(mean = mean(score),
median = median(score))
## # A tibble: 2 x 3
## style mean median
## <chr> <dbl> <dbl>
## 1 active 51.8 51
## 2 traditional 52.3 51
The means differ by only about half a point and the medians are identical, so at the moment these statistics are too close to favor either teaching paradigm.
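Means and medians alone do not say how much noise surrounds a half-point gap, so it can help to report group sizes and spread next to them. The following is a minimal sketch on made-up numbers, not the exam data: the `toy` data frame and the column names `n`, `sd_score`, and `se_mean` are assumptions introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy scores standing in for the exam spreadsheet
toy <- data.frame(
  style = c(rep("active", 6), rep("traditional", 4)),
  score = c(48, 52, 55, 41, 60, 50, 53, 47, 58, 51)
)

toy_summary <- toy %>%
  group_by(style) %>%
  summarize(n = n(),                        # students per group
            mean = mean(score),
            sd_score = sd(score),           # spread within the group
            se_mean = sd_score / sqrt(n))   # standard error of the mean

toy_summary
```

If the gap between the two means is small compared with the standard errors, as appears to be the case in the real data, the difference is hard to tell apart from chance.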
Beyond the means and medians, let us now look at the distributions of the final exam scores.
Once again, the test version within each lecture section should not matter, but as a practice calculation, here are the score densities by version.
df %>%
ggplot(aes(x = score, color = version, fill = version)) +
geom_density(kernel = "gaussian", alpha = 0.5) +
labs(title = "Test Versions")
For the main experiment, here are the distributions between the teaching styles and how their students performed on that final exam.
df %>%
ggplot(aes(x = score, color = style, fill = style)) +
geom_density(kernel = "gaussian", alpha = 0.5) +
labs(title = "Teaching Styles")
Once again, the densities overlap so much that the results are too close to favor either teaching paradigm.
Finally, let us run hypothesis tests to compare the results in statistically sound ways.
First, I will do the silly calculation between the test versions (same exam questions, but shuffled) as a baseline comparison.
female_exams <- df %>%
filter(version == "female") %>%
select(score)
male_exams <- df %>%
filter(version == "male") %>%
select(score)
t.test(as.data.frame(female_exams),
as.data.frame(male_exams),
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: as.data.frame(female_exams) and as.data.frame(male_exams)
## t = 0.39079, df = 327.97, p-value = 0.6962
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.873606 2.802522
## sample estimates:
## mean of x mean of y
## 52.14110 51.67665
Since the p-value of 0.70 is greater than 0.05, we fail to reject the null hypothesis that the mean scores on the “female” and “male” exams are equal.
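As an aside, the same kind of two-group comparison can be written with the formula interface of `t.test`, which avoids building two separate data frames. This is only a sketch on simulated scores: the `toy` data frame, its size, and the seed are assumptions, not the exam data.

```r
# Simulated stand-in for the exam data (hypothetical values)
set.seed(2018)
toy <- data.frame(
  score = round(rnorm(40, mean = 52, sd = 10)),
  version = rep(c("female", "male"), each = 20)
)

# Formula interface: score explained by version (Welch test by default)
by_formula <- t.test(score ~ version, data = toy)

# Equivalent call with two plain numeric vectors
by_vectors <- t.test(toy$score[toy$version == "female"],
                     toy$score[toy$version == "male"])

by_formula$p.value
by_vectors$p.value
```

Both calls run the Welch test and should agree, because the formula interface orders the groups alphabetically (“female” first, then “male”).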
Now for the main show.
active_learning <- df %>%
filter(style == "active") %>%
select(score)
traditional_lecture <- df %>%
filter(style == "traditional") %>%
select(score)
t.test(as.data.frame(active_learning),
as.data.frame(traditional_lecture),
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: as.data.frame(active_learning) and as.data.frame(traditional_lecture)
## t = -0.39735, df = 150.04, p-value = 0.6917
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.200989 2.129124
## sample estimates:
## mean of x mean of y
## 51.76639 52.30233
Since the p-value of 0.69 is greater than 0.05, we fail to reject the null hypothesis that traditional lectures and active learning environments lead to the same mean exam score.
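As a sanity check, the confidence interval and p-value can be rebuilt from the other numbers printed in the output above. The sketch below uses only the reported group means, t statistic, and degrees of freedom; the variable names are mine, and the results match the printout only up to rounding.

```r
# Values copied from the Welch test output above
mean_active      <- 51.76639
mean_traditional <- 52.30233
t_stat           <- -0.39735
df_welch         <- 150.04

diff_means <- mean_active - mean_traditional   # about -0.54 points
se_diff    <- diff_means / t_stat              # standard error implied by t

# 95% confidence interval: difference +/- critical t value * standard error
ci <- diff_means + c(-1, 1) * qt(0.975, df_welch) * se_diff

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df_welch)

ci
p_value
```

The interval comfortably contains zero, which is another way of seeing that the half-point gap between the two teaching styles is well within the noise.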
Alas, my quick report produced inconclusive results when comparing the final exam results at UC Merced for the active-learning and traditional-lecture professors.
I will ask those instructors whether they want further analyses.