Project Overview

This document covers the analyses for a paired (within-subjects) study that evaluates whether a detailed classification rubric improves subjects' accuracy when classifying proactive safety reports according to HRO hallmarks.

Design Choice

A paired design is appropriate for this research because the primary focus is the change in classification accuracy within each participant after the rubric is introduced. The advantages of the paired design are increased sensitivity and the smaller number of participants required to detect an effect of the same size. Limitations of this approach include order effects, carryover effects, and possible participant fatigue from reading a total of 16 reports.
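As a rough illustration of the sample-size advantage (a minimal sketch using the pwr package; the effect size d = 0.8 is an arbitrary example value, not taken from this study, and the paired and two-sample d's are defined on slightly different scales):

library(pwr)

# Pairs needed for a paired t-test to detect d = 0.8 at 80% power, alpha = 0.05
pwr.t.test(d = 0.8, power = 0.8, sig.level = 0.05, type = "paired")$n

# Participants needed per group for an independent-samples t-test at the same d
pwr.t.test(d = 0.8, power = 0.8, sig.level = 0.05, type = "two.sample")$n

The two-sample design also requires two separate groups, so its total participant count is roughly double the per-group figure again.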

Data Input

To start, we load the required packages and enter the provided data.

library(ggplot2)
library(pwr)
library(tidyr)

Sample_size <- 23

# Create a data frame from the provided data
Data <- data.frame(

  # Set number of participants
  Participants = Sample_size,

  # Percent correct before the rubric
  Before_Data = c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70),

  # Percent correct after the rubric
  After_Data = c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
)
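As an optional sanity check (a minimal sketch using only base R), we can confirm the data frame contains the expected 23 paired observations:

# Quick checks that the data were entered as expected
nrow(Data)                   # should equal Sample_size (23)
summary(Data$Before_Data)    # pre-rubric percent correct
summary(Data$After_Data)     # post-rubric percent correct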

Confounds

In our paired design, the same participants classified safety reports before and after receiving the rubric. While this design minimizes the impact of individual differences, there are still some confounding factors that could affect our findings, such as:

  1. Learning/Practice Effect
    Participants may score higher post-rubric simply because they have become more familiar with the HRO hallmarks or understand the classification task better. This could inflate post-rubric scores even if the rubric itself was responsible for only part of the improvement.

  2. Variations in Report Difficulty
    The before rubric and after rubric datasets contain different sets of safety reports. Even with randomization, the two sets might not be perfectly balanced in difficulty. If the after rubric reports were easier, the improvement might be overstated. If they were more difficult, it could understate the impact of the rubric.

  3. Motivation or Exhaustion
    Students may be more focused or enthusiastic for one portion of the task than the other. Task fatigue, boredom, or simply rushing could decrease accuracy in either portion and introduce unrelated variance.

  4. Variation in Personal Interpretation
    Some hallmarks (e.g., Preoccupation with Failure vs. Sensitivity to Operations) may be interpreted differently by individual students, even when they work from the same rubric. Systematic bias arising from individual differences in reading comprehension or interpretation is not accounted for by the study design.

Summary:

These confounds could inflate or dampen the apparent gains and should be taken into consideration when assessing the impact of the classification rubric.

Descriptive Statistics

# Compute means for pre and post
Mean_Pre  <- mean(Data$Before_Data)
Mean_Post <- mean(Data$After_Data)

# Difference of after minus before
Data$difference <- Data$After_Data - Data$Before_Data

# Standard deviation and mean of the differences, and the observed
# standardized effect size (mean difference / SD of differences)
SD_Diff   <- sd(Data$difference)
Mean_Diff <- mean(Data$difference)
D_observation <- Mean_Diff / SD_Diff

Mean_Pre
## [1] 46.08696
Mean_Post
## [1] 70.86957
SD_Diff
## [1] 26.946
Mean_Diff
## [1] 24.78261

Students averaged 46.1% correct on the pre-rubric assessment and 70.9% correct on the post-rubric assessment, an average improvement of approximately 24.8 percentage points (standard deviation ≈ 27). This indicates that while the large majority of students improved, the size of the improvement varied considerably across participants.
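To back up the claim that the large majority of students improved, a quick tally of how many improved, stayed the same, or declined (a minimal sketch reusing the difference column computed above):

# Count declines (-1), no change (0), and improvements (1)
table(sign(Data$difference))

# Proportion of students who improved
mean(Data$difference > 0)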

Histogram of Differences:

# Histogram of improvement scores
hist(
  Data$difference,
  main = "Histogram of Improvement (Post - Pre)",
  xlab = "Improvement in Percentage Points",
  col  = "skyblue",
  border = "black"
)

This histogram shows how much each student improved. Most of the differences are positive, indicating that most students scored higher after using the rubric, but the wide range shows that the improvement was far from uniform: some students improved much more than others.

Before / After boxplots:

boxplot(
  Data$Before_Data,
  Data$After_Data,
  names = c("Pre-Rubric", "Post-Rubric"),
  main  = "Classification Accuracy Before and After Rubric",
  ylab  = "Percentage Correct",
  col   = c("orange", "yellow")
)

This boxplot shows the distribution and range of students' scores before and after using the rubric. The post-rubric box sits higher, indicating that students generally did better, but the post-rubric distribution is noticeably skewed, as those scores were not uniformly distributed.
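The boxplots hide the pairing between the two measurements. A per-participant before/after line plot makes the within-subject changes visible (a minimal sketch using tidyr and ggplot2, both loaded above; the ID column added here is only for plotting):

# Add a participant ID and reshape to long format for plotting
Data$ID <- seq_len(nrow(Data))

Paired_long <- tidyr::pivot_longer(
  Data[, c("ID", "Before_Data", "After_Data")],
  cols = c("Before_Data", "After_Data"),
  names_to = "Phase",
  values_to = "Score"
)
Paired_long$Phase <- factor(Paired_long$Phase,
                            levels = c("Before_Data", "After_Data"),
                            labels = c("Pre-Rubric", "Post-Rubric"))

# One line per participant, connecting their pre- and post-rubric scores
ggplot(Paired_long, aes(x = Phase, y = Score, group = ID)) +
  geom_line(alpha = 0.4) +
  geom_point(color = "steelblue") +
  labs(
    title = "Per-Participant Accuracy Before and After the Rubric",
    x = NULL,
    y = "Percentage Correct"
  )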

Power Analysis

Given the sample size of 23 students, we can create a power curve showing the statistical power as a function of the effect size.

# Range of effect sizes
Effect_Sizes <- seq(0.1, 2.0, by = 0.05)

# Power of a paired t-test at each effect size (n = 23, alpha = 0.05)
Powers <- sapply(Effect_Sizes, function(d) {
  pwr.t.test(n = Sample_size, d = d, sig.level = 0.05, type = "paired",
             alternative = "two.sided")$power
})

Power_dat <- data.frame(
  Effect_Size = Effect_Sizes,
  Power = Powers
)

ggplot(Power_dat, aes(x = Effect_Size, y = Power)) +
  geom_line(color = "green") +
  geom_point(color = "blue") +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "red") +
  labs(
    title = "Power Curve for Paired t-Test (N = 23, alpha = 0.05)",
    x = "Effect Size",
    y = "Statistical Power"
  ) +
  scale_y_continuous(labels = scales::percent)

Now we can calculate the minimum detectable effect size at 80% power:

# Calculate the minimum detectable effect size for 80% power
Min_detect <- pwr.t.test(
  n = Sample_size,
  power = 0.8,
  sig.level = 0.05,
  type = "paired",
  alternative = "two.sided"
)

# Pull the minimum effect size
min_effect <- Min_detect$d
min_effect
## [1] 0.6112775

With 23 participants and a significance level of 0.05, the minimum detectable effect size at 80% power is approximately d = 0.611.
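For comparison, the observed standardized effect size was already computed above as D_observation (the mean of the differences divided by their standard deviation). A quick check against this threshold (a minimal sketch reusing objects defined above):

# Observed paired effect size vs. minimum detectable effect size
D_observation
min_effect
D_observation >= min_effect   # TRUE if the observed effect exceeds the detectable minimum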

Paired t-test

Hypotheses

Null Hypothesis (H₀): The mean percentage correct before the rubric is equal to the mean percentage correct after the rubric.
\[H_0 : \mu_{\text{before}} = \mu_{\text{after}}\]

Alternative Hypothesis (H₁): The mean percentage correct after the rubric is greater than the mean before the rubric.
\[H_1 : \mu_{\text{after}} > \mu_{\text{before}}\]

Check Assumptions (Normality of Differences)

# Calculate difference scores (After - Before)
diff <- Data$After_Data - Data$Before_Data

# Histogram of differences
hist(
  diff,
  main = "Histogram of Score Differences",
  xlab = "Difference (After - Before)"
)

# QQ-plot for normality
qqnorm(diff)
qqline(diff)

# Shapiro-Wilk normality test
shapiro.test(diff)
## 
##  Shapiro-Wilk normality test
## 
## data:  diff
## W = 0.92204, p-value = 0.07369
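Since the Shapiro-Wilk p-value (about 0.074) is above 0.05, we do not reject normality of the differences, so the paired t-test below is reasonable. Had normality been clearly violated, a common fallback would be the Wilcoxon signed-rank test; a minimal sketch is shown here for reference only, not as the primary analysis:

# Non-parametric alternative to the paired t-test (for reference only)
# exact = FALSE avoids warnings caused by ties and zero differences
wilcox.test(Data$After_Data, Data$Before_Data,
            paired = TRUE, alternative = "greater", exact = FALSE)

In this report the paired t-test remains the primary analysis.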
# Paired t-test
t_test_result <- t.test(Data$After_Data, 
                        Data$Before_Data,
                        paired = TRUE,
                        alternative = "greater")

t_test_result
## 
##  Paired t-test
## 
## data:  Data$After_Data and Data$Before_Data
## t = 4.4108, df = 22, p-value = 0.0001106
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
##  15.13461      Inf
## sample estimates:
## mean difference 
##        24.78261

Paired t-test Results

The paired t-test was conducted to compare percentage correct before and after the rubric.

t-statistic: 4.4108
Degrees of freedom: 22
p-value: 0.0001106
95% Confidence Interval: [15.13461, Inf)

Conclusion: Since the p-value is much less than 0.05, we reject the null hypothesis. Scores after the rubric were significantly higher than scores before, consistent with the rubric improving classification accuracy.

Interpretation:

1. Effectiveness of the Rubric

The paired t-test shows a statistically significant improvement in scores after using the rubric (p < 0.05). This means students performed significantly better once the rubric was introduced, which supports the rubric being effective at increasing the percentage of correct responses, subject to the confounds discussed earlier.

2. Observed Effect Size vs. Predicted Effect Size

The observed effect size, computed from the sample as the mean difference divided by the standard deviation of the differences (D_observation ≈ 24.78 / 26.95 ≈ 0.92), represents the improvement actually seen in the data. It is noticeably larger than the minimum detectable effect size of about 0.61 from the power analysis, meaning the rubric's apparent impact was stronger than the smallest effect the study was designed to detect.

3. Observed Power vs. Predicted Power

The a priori power analysis estimated how likely the study was to detect a real improvement before any data were collected: with 23 participants, effects of d ≈ 0.61 or larger could be detected with at least 80% power. Because the observed effect size (≈ 0.92) exceeds that threshold, the post-hoc power of the test is well above 80%, indicating the sample size was adequate; had the observed power been lower, the study might have been underpowered and could have missed meaningful differences. A post-hoc power calculation is sketched below.
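For reference, the post-hoc power implied by the observed effect size can be computed directly (a minimal sketch reusing Sample_size and D_observation from above; note that post-hoc power calculated from the observed effect largely restates the p-value and should be interpreted cautiously):

# Post-hoc power at the observed effect size, matching the one-sided test used above
pwr.t.test(n = Sample_size,
           d = D_observation,
           sig.level = 0.05,
           type = "paired",
           alternative = "greater")$power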

4. Practical Significance vs. Statistical Significance

Even though the improvement is statistically significant, it is important to consider whether its size is meaningful in a real academic context. A small increase in scores may not matter much to instructors or students even when the p-value is low. Here the average improvement of roughly 25 percentage points is substantial, suggesting the rubric is both statistically and practically beneficial, although the confounds noted earlier temper how much of that gain can be attributed to the rubric alone.
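To help judge practical significance, a two-sided 95% confidence interval for the mean improvement gives a plausible range in percentage points (a minimal sketch; this call differs from the test above only in using the default two-sided alternative):

# Two-sided 95% CI for the mean improvement (percentage points)
t.test(Data$After_Data, Data$Before_Data, paired = TRUE)$conf.int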

All Data

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
library(ggplot2)
library(pwr)
library(tidyr)

Sample_size <- 23

# Create a data frame from the provided data
Data <- data.frame(

  # Set number of participants
  Participants = Sample_size,

  # Percent correct before the rubric
  Before_Data = c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70),

  # Percent correct after the rubric
  After_Data = c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
)
# Compute means for pre and post
Mean_Pre  <- mean(Data$Before_Data)
Mean_Post <- mean(Data$After_Data)

# Difference of after minus before
Data$difference <- Data$After_Data - Data$Before_Data

# Standard deviation and mean of the differences, and the observed
# standardized effect size (mean difference / SD of differences)
SD_Diff   <- sd(Data$difference)
Mean_Diff <- mean(Data$difference)
D_observation <- Mean_Diff / SD_Diff

Mean_Pre
Mean_Post
SD_Diff
Mean_Diff

# Histogram of improvement scores
hist(
  Data$difference,
  main = "Histogram of Improvement (Post - Pre)",
  xlab = "Improvement in Percentage Points",
  col  = "skyblue",
  border = "black"
)

boxplot(
  Data$Before_Data,
  Data$After_Data,
  names = c("Pre-Rubric", "Post-Rubric"),
  main  = "Classification Accuracy Before and After Rubric",
  ylab  = "Percentage Correct",
  col   = c("orange", "yellow")
)

# Range of effect sizes
Effect_Sizes <- seq(0.1, 2.0, by = 0.05)

# Power of a paired t-test at each effect size (n = 23, alpha = 0.05)
Powers <- sapply(Effect_Sizes, function(d) {
  pwr.t.test(n = Sample_size, d = d, sig.level = 0.05, type = "paired",
             alternative = "two.sided")$power
})

Power_dat <- data.frame(
  Effect_Size = Effect_Sizes,
  Power = Powers
)

ggplot(Power_dat, aes(x = Effect_Size, y = Power)) +
  geom_line(color = "green") +
  geom_point(color = "blue") +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "red") +
  labs(
    title = "Power Curve for Paired t-Test (N = 23, alpha = 0.05)",
    x = "Effect Size",
    y = "Statistical Power"
  ) +
  scale_y_continuous(labels = scales::percent)

# Calculate the minimum detectable effect size for 80% power
Min_detect <- pwr.t.test(
  n = Sample_size,
  power = 0.8,
  sig.level = 0.05,
  type = "paired",
  alternative = "two.sided"
)

# Pull the minimum effect size
min_effect <- Min_detect$d
min_effect

# Calculate difference scores (After - Before)
diff <- Data$After_Data - Data$Before_Data

# Histogram of differences
hist(
  diff,
  main = "Histogram of Score Differences",
  xlab = "Difference (After - Before)"
)

# QQ-plot for normality
qqnorm(diff)
qqline(diff)

# Shapiro-Wilk normality test
shapiro.test(diff)
# Paired t-test
t_test_result <- t.test(Data$After_Data, 
                        Data$Before_Data,
                        paired = TRUE,
                        alternative = "greater")

t_test_result