1 INTRODUCTION

High Reliability Organizations (HROs) maintain strong safety performance by following five key principles that guide how they detect and respond to early signs of failure. Many companies use proactive safety reports to capture near-misses and hazards, but accurately classifying these reports into the correct HRO hallmark is difficult when relying only on the original broad definitions. This project evaluates whether a scientifically grounded classification rubric designed with clearer behavioral cues improves the accuracy of classifying safety reports.

Our findings show that using the rubric leads to a significant improvement in classification accuracy.

1.1 Study Context

This study evaluates whether a scientifically grounded classification rubric improves accuracy when classifying proactive safety reports using the five HRO hallmarks. The reports come from Precision Metal Components Inc., a mid-sized manufacturing facility with CNC machining, welding, chemical handling, material movement, and strict safety protocols such as lockout/tagout, PPE use, machine guarding, and daily inspections. These operations create realistic conditions where proactive safety reports capture meaningful early warnings.

1.2 Experimental Design

A paired design was used: each participant completed the task twice, once before receiving the rubric and once after, allowing a direct comparison of classification performance. The planned sample size was 40 participants; the dataset analyzed below contains 23 completed pairs.
Phase 1 involved reading 10 safety reports and labeling each with one HRO hallmark using only the original Weick & Sutcliffe definitions.
Phase 2 involved classifying a different set of 10 reports using a detailed rubric that provides clearer behavioral cues.
The order of report sets was randomized to help control for differences in difficulty.
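A hypothetical sketch of how this order randomization could be implemented in R; the set labels, seed, and use of simple (rather than strictly balanced) randomization are illustrative assumptions, not the actual assignment procedure:

set.seed(2024)                                     # illustrative seed for a reproducible assignment
participants <- 1:40                               # planned sample size from the design
first_set <- sample(c("Set A", "Set B"),           # which report set each participant classifies first
                    size = length(participants), replace = TRUE)
table(first_set)                                   # check that the two orders are roughly balanced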

1.3 Data Collected

Results of the data collected (before and after the rubric):

data <- read.csv("C:/Users/dhana/Downloads/Data_Pre_Post_Rubric_HRO.csv")
data <- as.data.frame(data)
str(data)
## 'data.frame':    23 obs. of  3 variables:
##  $ Subject    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Pre.Rubric : int  30 70 40 70 30 30 30 30 70 50 ...
##  $ Post.Rubric: int  100 90 80 50 100 70 40 70 80 70 ...
data
##    Subject Pre.Rubric Post.Rubric
## 1        1         30         100
## 2        2         70          90
## 3        3         40          80
## 4        4         70          50
## 5        5         30         100
## 6        6         30          70
## 7        7         30          40
## 8        8         30          70
## 9        9         70          80
## 10      10         50          70
## 11      11         50          50
## 12      12         50          50
## 13      13         50          50
## 14      14         20          40
## 15      15         70          50
## 16      16         30          60
## 17      17         30          50
## 18      18         60         100
## 19      19         30         100
## 20      20         80          90
## 21      21         40          50
## 22      22         30         100
## 23      23         70          90

2 DESIGN CHOICE

2.1 Design Choice: Paired t-test

In this study, each participant completed the classification task twice: first without the rubric and then with it. Because the same individuals provided both measurements, a paired (within-subjects) design is appropriate. This design allows us to assess the within-person change in accuracy resulting from the rubric while controlling for individual differences such as baseline comprehension or prior safety knowledge. For this reason, a paired t-test is the correct statistical method to evaluate whether the rubric produces a significant improvement in performance.
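As a quick illustration (a sketch using the data loaded in Section 1.3), the paired t-test is equivalent to a one-sample t-test on the within-person difference scores, which is exactly the within-person change the design targets:

diffs <- data$Post.Rubric - data$Pre.Rubric                 # within-person change for each participant
t.test(diffs, mu = 0)                                       # one-sample test on the differences
t.test(data$Post.Rubric, data$Pre.Rubric, paired = TRUE)    # gives the same t, df, and p-value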

Advantages Compared to an Independent Groups Design

Controls individual differences.
Because the same participants provide both measurements, stable traits such as comprehension ability or prior safety knowledge are held constant, allowing changes to be attributed more directly to the rubric.

Greater statistical power.
Reducing between-person variability lowers error variance, making the paired t-test more sensitive than an independent-groups test (illustrated in the simulation sketch below).

Smaller sample needed.
Within-subjects designs typically achieve adequate power with fewer participants, which is beneficial in classroom research settings.
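A minimal simulation sketch of this power advantage; the mean shift, pre/post correlation, and score SD used here are illustrative assumptions, not estimates from our data. When pre and post scores are correlated within person, the paired test detects the same mean shift far more often than an unpaired test on the same scores:

set.seed(1)
n_sims <- 2000; n_pairs <- 23             # simulated studies and pairs per study
shift <- 10; r <- 0.7; sd_score <- 20     # assumed mean improvement, pre/post correlation, score SD
hits <- replicate(n_sims, {
  person <- rnorm(n_pairs, 0, sd_score * sqrt(r))               # shared person-level component
  pre    <- person + rnorm(n_pairs, 0,     sd_score * sqrt(1 - r))
  post   <- person + rnorm(n_pairs, shift, sd_score * sqrt(1 - r))
  c(paired   = t.test(post, pre, paired = TRUE)$p.value < 0.05,
    unpaired = t.test(post, pre)$p.value < 0.05)
})
rowMeans(hits)    # empirical power of each test; the paired test should be clearly higher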

Limitations of a Paired Design

Order effects.
Performance may improve simply because participants repeat a similar task, not solely because of the rubric.

Fatigue effects.
Completing the second task later may introduce tiredness or reduced attention, potentially lowering post-rubric accuracy.

Carryover effects.
What participants learn in Phase 1 may influence their Phase 2 performance, independent of the rubric.

Randomizing the order of report sets can reduce but not fully eliminate these sources of bias.

2.2 Power Analysis

Using the n = 23 paired observations in the dataset, we conducted a power analysis for a paired t-test to understand what range of effect sizes we can reliably detect. Specifically, we generated a power curve showing statistical power as a function of Cohen's d (the standardized mean difference of the pre/post scores) and identified the minimum effect size that achieves 80% power at \(\alpha = 0.05\).

library(pwr)
library(ggplot2)

n <- nrow(data)

effect_sizes <- seq(0.1, 1.2, by = 0.05)

power_values <- sapply(effect_sizes, function(d) {
  pwr.t.test(n = n, d = d, sig.level = 0.05,
             type = "paired", alternative = "two.sided")$power
})

power_df <- data.frame(effect_size = effect_sizes, power = power_values)

ggplot(power_df, aes(x = effect_size, y = power)) +
  geom_line(color = "orange", linewidth = 1.2) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "red", linewidth = 1) +
  labs(title = paste("Power Curve for Paired t-test (n =", n, ")"),
       x = "Effect Size (Cohen's d)",
       y = "Statistical Power") +
  theme_minimal()

d_grid <- seq(0.1, 1.5, by = 0.001)
power_grid <- sapply(d_grid, function(d) {
  pwr.t.test( n = n, d = d, sig.level = 0.05, type = "paired", alternative = "two.sided")$power
})

min_d_for_80 <- d_grid[which(power_grid >= 0.80)[1]]
min_d_for_80
## [1] 0.612

The power curve shows that statistical power increases with effect size, and our observed effect size of 0.92 is well above the minimum detectable value of 0.61. This indicates that our study was more than adequately powered to detect the rubric’s impact.
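As a cross-check (a minimal sketch), pwr.t.test can solve for the minimum detectable effect size directly by leaving d unspecified, rather than grid searching:

pwr.t.test(n = n, power = 0.80, sig.level = 0.05,
           type = "paired", alternative = "two.sided")$d   # should agree with the ≈ 0.61 found above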

3 Potential Confounds

Differences in Report Set Difficulty

Although the order of report sets is randomized, the two sets of proactive safety reports may not be equally difficult to classify. If the Phase 2 reports happen to be clearer or easier, accuracy may increase even without the rubric—artificially inflating the measured effect. Conversely, if the second set is harder, the true benefit of the rubric may be underestimated.

Practice and Learning Effects

By completing 10 classifications in Phase 1, participants gain experience identifying HRO cues and interpreting narrative reports. This natural learning process may improve performance in Phase 2 regardless of the rubric, making it difficult to separate the effect of practice from the effect of the rubric itself. This can bias results upward.

Variation in Participant Engagement With the Rubric

Participants may differ in how carefully they read, understand, and apply the rubric. Some may follow the decision cues precisely, while others may skim or ignore parts of it. These differences introduce variability into post-rubric scores, potentially masking or distorting the true effect of the rubric on classification accuracy.

4 Descriptive Statistics

data$Difference <- data$Post.Rubric - data$Pre.Rubric
mean_pre  <- mean(data$Pre.Rubric)
mean_pre
## [1] 46.08696
mean_post <- mean(data$Post.Rubric)
mean_post
## [1] 70.86957
mean_diff <- mean(data$Difference)
mean_diff
## [1] 24.78261
sd_diff   <- sd(data$Difference)
sd_diff
## [1] 26.946
t_test_result <- t.test(data$Post.Rubric, data$Pre.Rubric, paired = TRUE)
t_test_result
## 
##  Paired t-test
## 
## data:  data$Post.Rubric and data$Pre.Rubric
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  13.13028 36.43493
## sample estimates:
## mean difference 
##        24.78261
effect_size <- mean_diff / sd_diff   # Cohen's d for paired data: mean difference / SD of the differences
effect_size
## [1] 0.9197138
boxplot(data$Pre.Rubric, data$Post.Rubric,
        names = c("Before", "After"),
        main = "Scores Before vs After",
        col = c("lightpink", "violet"),
        ylab = "Score (%)")

hist(data$Difference,
     main = "Distribution of Improvement",
     xlab = "Difference (Post - Pre)",
     col = "green", border = "blue")

Below are the observed results from the above code:

Mean score before the rubric: 46.09%

Mean score after the rubric: 70.87%

Mean improvement (Post − Pre): 24.78 percentage points

Standard deviation of the differences: 26.95
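As an arithmetic check (a sketch using the quantities computed above), the paired t statistic is simply the mean difference divided by its standard error, which reproduces the t.test output reported above:

n_pairs <- nrow(data)                     # 23 pairs
se_diff <- sd_diff / sqrt(n_pairs)        # standard error of the mean difference
t_manual <- mean_diff / se_diff           # ≈ 4.41, matching t = 4.4108 above
t_manual
2 * pt(-abs(t_manual), df = n_pairs - 1)  # two-sided p-value, ≈ 0.00022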

5 Paired t-test

Hypotheses

Null:

\(H_{0}\): \(\mu_d = 0\)

Alternate:

\(H_{a}\): \(\mu_d \neq 0\)

where,

\(\mu_d\) = population mean of the within-person differences (Post Rubric − Pre Rubric)

Equivalently, \(H_0: \mu_1 = \mu_2\) versus \(H_a: \mu_1 \neq \mu_2\), where \(\mu_1\) is the mean pre-rubric score and \(\mu_2\) is the mean post-rubric score.
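For reference, the paired test statistic is computed from the difference scores:

\(t = \frac{\bar{d}}{s_d / \sqrt{n}}\), with \(n - 1\) degrees of freedom,

where \(\bar{d}\) is the sample mean of the Post − Pre differences, \(s_d\) is their standard deviation, and \(n\) is the number of pairs.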

5.1 Check Assumptions: Normality of Differences

# Q-Q plot
qqnorm(data$Difference, main = "Q-Q Plot of Difference Scores")
qqline(data$Difference)

library(ggpubr)
ggqqplot(data$Difference, title = "Q-Q Plot of Difference Scores")
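A Shapiro–Wilk test offers a complementary numeric check of the same assumption (a sketch; with only 23 difference scores, the Q-Q plots above remain the primary diagnostic):

shapiro.test(data$Difference)   # a non-significant result is consistent with approximate normality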

# Prospective sample-size check: pairs needed for 95% power at d = 0.5 (one-sided)
pwr.t.test(n = NULL, d = 0.5, sig.level = 0.05, power = 0.95,
           type = "paired", alternative = "greater")
## 
##      Paired t test power calculation 
## 
##               n = 44.67998
##               d = 0.5
##       sig.level = 0.05
##           power = 0.95
##     alternative = greater
## 
## NOTE: n is number of *pairs*

5.2 Paired t-test

t.test(data$Post.Rubric, data$Pre.Rubric, paired = TRUE)
## 
##  Paired t-test
## 
## data:  data$Post.Rubric and data$Pre.Rubric
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  13.13028 36.43493
## sample estimates:
## mean difference 
##        24.78261

Since the p-value (0.0002212) < α (0.05), we reject the null hypothesis: the rubric produced a statistically significant improvement in classification accuracy.

6 Interpretation

The rubric significantly improved classification accuracy, increasing average scores from 46.1% to 70.9%. The observed effect size (d = 0.92) was well above the minimum detectable effect size for 80% power (d = 0.61), indicating a stronger-than-expected impact. Because of this large effect, the achieved power was very high (approximately 98.8%), confirming that the sample was more than sufficient. Practically, an improvement of roughly 25 percentage points is substantial and can meaningfully enhance safety-risk identification in real organizational settings.
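A short sketch of where the achieved-power figure comes from: post-hoc power at the observed effect size, using the same pwr.t.test call as in Section 2.2:

pwr.t.test(n = 23, d = 0.92, sig.level = 0.05,
           type = "paired", alternative = "two.sided")$power   # ≈ 0.99, consistent with the ~98.8% cited above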