1 Introduction

High Reliability Organizations (HROs) achieve strong safety records by adhering to five core principles that sharpen their ability to detect and address weak signals of potential failure. Many organizations encourage proactive reporting of near-misses and hazards, but assigning these reports to the correct HRO principle remains challenging when reviewers rely solely on high-level definitions. We find that using the rubric leads to a significant improvement in classification accuracy.

1.1 Study Context

This study investigates whether using a scientifically developed classification rubric improves the accuracy with which proactive safety reports are assigned to the five hallmarks of HROs. The dataset originates from Precision Metal Components Inc., a mid-sized manufacturing company whose operations include:

CNC machining operations

Material handling (forklifts, cranes)

Maintenance operations

Chemical storage and treatment

Welding and fabrication areas

Quality control lab

These operational conditions provide a realistic environment in which proactive safety reports capture meaningful early indicators of potential hazards and warnings.

1.2 Experimental Design

A paired design is used: each participant completed the classification task twice, once before receiving the rubric (Pre-Rubric) and once after (Post-Rubric). This approach allows a direct comparison of performance across conditions while controlling for individual differences.

Randomization: The order of the report sets was randomized to mitigate potential bias arising from differences in report difficulty.
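
The exact randomization procedure is not included with the dataset; a hypothetical sketch of how such an assignment could be generated in R (set names and seed are illustrative only, not from the study) is:

# Hypothetical sketch: assign which report set each participant classifies first
set.seed(123)                                        # illustrative seed, not from the study
set_order <- sample(c("Set A first", "Set B first"), size = 23, replace = TRUE)
table(set_order)                                     # inspect the resulting split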

1.3 Data Collected

Below are the data collected in both the Pre-Rubric and Post-Rubric conditions:

# Read the paired classification scores (one row per participant)
data <- read.csv("C:/Users/sings/Downloads/Data_Pre_Post_Rubric_HRO.csv")
data <- as.data.frame(data)
data
##    Subject PreRubric PostRubric
## 1        1        30        100
## 2        2        70         90
## 3        3        40         80
## 4        4        70         50
## 5        5        30        100
## 6        6        30         70
## 7        7        30         40
## 8        8        30         70
## 9        9        70         80
## 10      10        50         70
## 11      11        50         50
## 12      12        50         50
## 13      13        50         50
## 14      14        20         40
## 15      15        70         50
## 16      16        30         60
## 17      17        30         50
## 18      18        60        100
## 19      19        30        100
## 20      20        80         90
## 21      21        40         50
## 22      22        30        100
## 23      23        70         90

2 Design Choice

A paired design is appropriate for this study because each student classifies reports both before and after receiving the rubric. This means each person serves as their own control.

In addition, the main advantage over an independent-groups design is that it controls for individual differences. Some students are naturally better at classifying safety reports than others, but with a paired design each student's improvement is measured against their own baseline rather than against different people. This reduces variability and allows smaller effects to be detected with fewer participants.
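
As a minimal check of this variance-reduction argument on the data loaded above, the same scores can be analyzed with and without the pairing; the paired analysis produces a larger t statistic (about 4.41 versus about 4.16 for an unpaired comparison) because differencing removes between-student variability:

# Same scores analyzed with and without the pairing
t.test(data$PostRubric, data$PreRubric, paired = TRUE)$statistic    # uses within-person differences
t.test(data$PostRubric, data$PreRubric, paired = FALSE)$statistic   # ignores the pairing (Welch test)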

Potential limitations include order effects. Since everyone does the “before” condition first, any improvement could partly be due to practice or familiarity with the task rather than the rubric itself. Also, the two sets of 10 reports are different, so any difference in difficulty between the sets could affect results.

2.1 Power Analysis

A power analysis was conducted to determine the minimum effect size our study could detect with 80% power at a significance level of 0.05. With our sample size of 23 students, the minimum detectable effect size is d = 0.61.

library(pwr)
library(ggplot2)
n <- nrow(data)
effect_sizes <- seq(0.1, 1.2, by = 0.05)
power_values <- sapply(effect_sizes, function(d) {
  pwr.t.test(n = n, d = d, sig.level = 0.05, type = "paired",
             alternative = "two.sided")$power
})
power_df <- data.frame(effect_size = effect_sizes, power = power_values)
ggplot(power_df, aes(x = effect_size, y = power)) +
  geom_line(color = "orange", linewidth = 1.2) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "grey", linewidth = 1) +
  labs(title = paste("Power Curve for Paired t-test (n =", n, ")"),
       x = "Effect Size (Cohen's d)",
       y = "Statistical Power") +
  theme_minimal()

d_grid <- seq(0.1, 1.5, by = 0.001)
power_grid <- sapply(d_grid, function(d) {
  pwr.t.test(n = n, d = d, sig.level = 0.05, type = "paired",
             alternative = "two.sided")$power
})
min_d_for_80 <- d_grid[which(power_grid >= 0.80)[1]]
min_d_for_80
## [1] 0.612
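
As a cross-check, pwr.t.test can also solve for the effect size directly by leaving d unspecified, which should return essentially the same minimum detectable effect (d ≈ 0.61):

# Direct solution for the smallest d detectable at 80% power with n = 23 pairs
pwr.t.test(n = n, d = NULL, sig.level = 0.05, power = 0.80,
           type = "paired", alternative = "two.sided")$d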

The power curve shows the relationship between effect size and statistical power for our sample size. As the effect size increases, power increases. At small effect sizes, our power is very low, meaning we would likely miss a real effect. Our observed effect size was 0.92, which is well above the minimum of 0.61. This means the analysis was well-powered to detect the effect of the rubric.

3 Potential Confounds

  1. Unequal Difficulty Across Report Sets

Even with randomized order, the two sets of proactive safety reports may differ in classification difficulty. If the Phase 2 set happens to be easier, Post-Rubric accuracy would rise for reasons unrelated to the rubric, leading to an overestimated effect; a harder second set would instead cause the rubric's true benefit to be underestimated.

  2. Practice and Learning Effects

Completing the first 10 classifications in Phase 1 gives participants experience in spotting HRO cues and interpreting reports. This practice alone can improve performance in Phase 2, making it hard to disentangle learning or practice gains from the specific contribution of the rubric. As a result, measured improvements may be upwardly biased.

  3. Variation in Engagement of Participants

Participants vary in how thoroughly they engage with the rubric: some read and apply it carefully, while others skim or disregard key sections. These individual differences in rubric use add noise to Phase 2 accuracy scores, which can either mask a real rubric benefit or create the appearance of an effect that is not consistently driven by the tool itself.

4 Descriptive Statistics

# Per-participant improvement (Post - Pre)
data$Difference <- data$PostRubric - data$PreRubric
meanpre  <- mean(data$PreRubric)
meanpre
## [1] 46.08696
meanpost <- mean(data$PostRubric)
meanpost
## [1] 70.86957
meandiff <- mean(data$Difference)
meandiff
## [1] 24.78261
sddiff   <- sd(data$Difference)
sddiff
## [1] 26.946
t_testresult <- t.test(data$PostRubric, data$PreRubric, paired = TRUE)
t_testresult
## 
##  Paired t-test
## 
## data:  data$PostRubric and data$PreRubric
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  13.13028 36.43493
## sample estimates:
## mean difference 
##        24.78261
effectsize <- meandiff / sddiff
effectsize
## [1] 0.9197138
boxplot(data$PreRubric, data$PostRubric, names = c("Before", "After"),
        main = "Scores before & after", col = c("pink", "purple"), ylab = "Score (%)")

hist(data$Difference, main = "Improvement Distribution",
     xlab = "Difference (Post - Pre)", col = "red", border = "blue")

Below are the observed results from the code above:

Mean percentage correct before the rubric: 46.09%

Mean percentage correct after the rubric: 70.87%

Mean improvement (Post - Pre): 24.78 percentage points

Standard deviation of the differences: 26.95
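
The standardized effect size computed above (0.92) is Cohen's d for paired data, i.e. the mean difference divided by the standard deviation of the differences:

\(d = \dfrac{\bar{d}}{s_d} = \dfrac{24.78}{26.95} \approx 0.92\)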

5 Paired t-Test

We used a paired t-test to determine if the classification rubric actually helped students improve their accuracy. This test compares the mean difference between paired observations (Pre-Rubric vs Post-Rubric scores) against a null hypothesis of zero difference.
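
Concretely, the paired t statistic is the mean difference divided by its standard error; with the values from the descriptive statistics it reproduces the test statistic reported below:

\(t = \dfrac{\bar{d}}{s_d/\sqrt{n}} = \dfrac{24.78}{26.95/\sqrt{23}} \approx 4.41\)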

Hypotheses

We tested the following hypotheses about the mean of the paired differences (Post-Rubric minus Pre-Rubric):

Null:

\(H_{0}\): \(\mu_d = \mu_2 - \mu_1 = 0\)

Alternate:

\(H_{a}\): \(\mu_d = \mu_2 - \mu_1 \neq 0\)

where,

\(\mu_1\) = mean Pre-Rubric score

\(\mu_2\) = mean Post-Rubric score

\(\mu_d\) = mean of the within-participant differences

5.1 Check Assumptions: Normality of Differences

qqnorm(data$Difference, main = "Q-Q Plot of Difference Scores")
qqline(data$Difference)
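
Alongside this visual check, a Shapiro-Wilk test on the difference scores offers a complementary formal check (shapiro.test is part of base R):

# Formal complement to the Q-Q plot: Shapiro-Wilk test of the paired differences
shapiro.test(data$Difference)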

# Separate, prospective calculation: pairs needed to detect d = 0.5 with 95% power (one-sided)
pwr.t.test(n = NULL, d = 0.5, sig.level = 0.05, power = 0.95,
           type = "paired", alternative = "greater")
## 
##      Paired t test power calculation 
## 
##               n = 44.67998
##               d = 0.5
##       sig.level = 0.05
##           power = 0.95
##     alternative = greater
## 
## NOTE: n is number of *pairs*

5.2 Test Results

t.test(data$PostRubric, data$PreRubric, paired = TRUE)
## 
##  Paired t-test
## 
## data:  data$PostRubric and data$PreRubric
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  13.13028 36.43493
## sample estimates:
## mean difference 
##        24.78261

The analysis yielded the following results:

  • Test Statistic: t = 4.4108 (df = 22)

  • P-value: p = 0.00022

  • 95% Confidence Interval: [13.13, 36.43]

Since the p-value is far below the alpha level of 0.05, we reject the null hypothesis. The mean improvement in scores is statistically significant, consistent with the rubric genuinely improving classification accuracy.

6 Interpretation

Effectiveness of the Rubric

The rubric was very effective.

Effect Size - The power analysis showed that, with 23 participants, the smallest effect detectable with 80% power is d = 0.61. The observed effect size was d = 0.92, well above that threshold and a large effect by Cohen's conventions.

Power - Because the observed effect is large, the study's achieved power is very high (~98.8%), meaning an effect of this size would almost certainly have been detected. A sample of 23 students was more than sufficient for this comparison.
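
The quoted achieved power can be verified by plugging the observed effect size back into pwr.t.test; with n = 23 pairs and d = 0.92 this returns a value consistent with the ~98.8% figure above:

# Achieved (post-hoc) power at the observed effect size with n = 23 pairs
pwr.t.test(n = n, d = effectsize, sig.level = 0.05,
           type = "paired", alternative = "two.sided")$power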

The practical significance is that an improvement from ~46% to ~71% is the difference between failing and passing. In a High Reliability Organization (such as a nuclear plant), correctly classifying roughly 25 percentage points more safety reports could help prevent a serious accident. The rubric clearly helps employees better understand and classify safety risks.