High Reliability Organizations (HROs) are organizations that operate in high-risk environments but maintain exceptionally low accident rates (e.g., nuclear power plants, aircraft carriers, air traffic control). Researchers Karl Weick and Kathleen Sutcliffe identified five “hallmarks” or principles that characterize these organizations:
HROs achieve their safety records by adhering to five core principles that sharpen the organization's ability to detect and address weak signals of potential failure. Many organizations encourage proactive reporting of near-misses and hazards, but assigning these reports to the correct HRO principle remains challenging when reviewers rely solely on high-level definitions. As shown below, using a structured rubric leads to a significant improvement in classification accuracy.
This study investigates whether a scientifically developed classification rubric improves the accuracy with which proactive safety reports are assigned to the five HRO hallmarks. The dataset originates from Precision Metal Components Inc., a mid-sized manufacturing company whose operations include:
CNC machining operations
Material handling (forklifts, cranes)
Maintenance operations
Chemical storage and treatment
Welding and fabrication areas
Quality control lab
These operational conditions provide a realistic environment in which proactive safety reports capture meaningful early indicators of potential hazards and warnings.
A paired design is used, in which each participant completed the classification task twice: once before receiving the rubric (Pre-Rubric) and once after (Post-Rubric). This approach allows a direct comparison of performance across conditions while controlling for individual differences.
Pre-Rubric: Participants reviewed 10 proactive safety reports and classified each into one of the five HRO hallmarks using only the original definitions provided by Weick and Sutcliffe.
Post-Rubric: Participants classified a different set of 10 reports using a detailed rubric designed to provide clearer behavioral cues and reduce ambiguity.
Randomization: The order of the report sets was randomized to mitigate potential bias arising from differences in report difficulty.
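Purely as an illustration (the actual randomization procedure used by the researchers is not documented here), report-set order could be assigned per participant in R as follows; the seed and set labels are hypothetical:
# Hypothetical illustration of randomizing which report set each participant sees first
set.seed(42)                                   # arbitrary seed, for reproducibility
set_order <- sample(c("Set A in Phase 1", "Set B in Phase 1"), size = 23, replace = TRUE)
table(set_order)                               # counts of each ordering across the 23 participants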
The data collected in the Pre-Rubric and Post-Rubric conditions are shown below:
# Read the Pre-Rubric and Post-Rubric accuracy scores (one row per participant)
data <- read.csv("C:/Users/rohit/Downloads/Data_Pre_Post_Rubric_HRO.csv")
data <- as.data.frame(data)
data
## Subject Pre.Rubric Post.Rubric
## 1 1 30 100
## 2 2 70 90
## 3 3 40 80
## 4 4 70 50
## 5 5 30 100
## 6 6 30 70
## 7 7 30 40
## 8 8 30 70
## 9 9 70 80
## 10 10 50 70
## 11 11 50 50
## 12 12 50 50
## 13 13 50 50
## 14 14 20 40
## 15 15 70 50
## 16 16 30 60
## 17 17 30 50
## 18 18 60 100
## 19 19 30 100
## 20 20 80 90
## 21 21 40 50
## 22 22 30 100
## 23 23 70 90
A paired design is appropriate for this study because each student classifies reports both before and after receiving the rubric. This means each person serves as their own control.
In addition, the main advantage over an independent-groups design is that it controls for individual differences: some students might naturally be better at classifying safety reports than others. With a paired design, each student's improvement is compared against their own baseline rather than against different people. This reduces unexplained variability and helps detect smaller effects with fewer participants.
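To see this with the data at hand, one can compare the paired analysis with a (hypothetical) independent-groups analysis of the same scores; how much pairing helps depends on how strongly each student's Pre and Post scores are correlated. A minimal sketch:
# Sketch: the benefit of pairing depends on the within-student Pre/Post correlation
cor(data$Pre.Rubric, data$Post.Rubric)                      # within-student correlation
t.test(data$Post.Rubric, data$Pre.Rubric, paired = TRUE)    # uses within-student differences
t.test(data$Post.Rubric, data$Pre.Rubric, paired = FALSE)   # ignores the pairing (for comparison only)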
Potential limitations include order effects. Since everyone does the “before” condition first, any improvement could partly be due to practice or familiarity with the task rather than the rubric itself. Also, the two sets of 10 reports are different, so any difference in difficulty between the sets could affect results.
A power analysis was conducted to determine the minimum effect size our study could detect with 80% power at a significance level of 0.05. With our sample size of 23 students, the minimum detectable effect size is d = 0.61.
library(pwr)
# Minimum detectable effect size with n = 23 pairs, 80% power, alpha = 0.05
power_values <- pwr.t.test(n = 23, power = 0.8, sig.level = 0.05, type = "paired")
power_values
##
## Paired t test power calculation
##
## n = 23
## d = 0.6112775
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number of *pairs*
effect_sizes <- seq(0.1, 1.5, 0.1)
power_values <- pwr.t.test(n = 23, d = effect_sizes, sig.level = 0.05, type = "paired")$power
plot(effect_sizes, power_values, type = "l", col = "orange", lwd = 3, main = "Power Curve (n = 23)", xlab = "Effect Size", ylab = "Statistical Power")
abline(h = 0.80, col = "grey", lty = 2, lwd = 2)
The power curve shows the relationship between effect size and statistical power for our sample size. As the effect size increases, power increases. At small effect sizes, our power is very low, meaning we would likely miss a real effect. Our observed effect size was 0.92, which is well above the minimum of 0.61. This means the analysis was well-powered to detect the effect of the rubric.
Inconsistent Training Duration: If the time allotted for studying the rubric is not strictly controlled (for example, some students glance at it for 30 seconds while others study it for 10 minutes), the "treatment" is not the same for everyone. This lack of standardization adds noise to the data, which makes it harder to detect a significant effect and reduces the statistical power of the test.
Unequal Difficulty Across Report Sets: Even with randomized order, the two sets of proactive safety reports may differ in classification difficulty. If the Phase 2 set is accidentally easier, post-rubric accuracy could rise for reasons unrelated to the rubric, leading to an overestimated effect. Conversely, a harder second set could cause the rubric's true benefit to be underestimated.
Fatigue and Boredom Effects: Participants must complete two full sets of classifications plus a training phase in the middle. By the time they reach Phase 2, they may experience mental fatigue or boredom, leading to carelessness and a drop in performance in the second half regardless of the rubric's quality. This would bias the results downward, making the rubric look less effective than it actually is.
Practice and Learning Effects : Completing the first 10 classifications in Phase 1 gives participants experience in spotting HRO cues and interpreting reports. This practice alone can improve performance in Phase 2, making it hard to disentangle learning or practice gains from the specific contribution of the rubric. As a result, measured improvements may be upwardly biased.
Inconsistent Engagement with the Rubric: Participants vary in how thoroughly they engage with the rubric: some read and apply it carefully, while others skim or disregard key sections. These individual differences in rubric use add noise to the Phase 2 accuracy scores, which can either mask a real rubric benefit or create the illusion of an effect that isn't consistently driven by the tool itself.
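Several of the issues above (inconsistent training time, uneven engagement) act mainly by inflating the variability of the difference scores, which shrinks the standardized effect size d = mean(difference) / sd(difference). A quick sketch with illustrative, hypothetical numbers shows how added noise erodes power at our sample size:
# Hypothetical illustration: the same mean improvement with noisier differences yields lower power
library(pwr)
pwr.t.test(n = 23, d = 25 / 27, sig.level = 0.05, type = "paired")$power   # roughly the observed situation
pwr.t.test(n = 23, d = 25 / 40, sig.level = 0.05, type = "paired")$power   # same mean gain, more noise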
meanpre <- mean(data$Pre.Rubric)
meanpre
## [1] 46.08696
meanpost <- mean(data$Post.Rubric)
meanpost
## [1] 70.86957
data$Difference <- data$Post.Rubric - data$Pre.Rubric
meandiff <- mean(data$Difference)
meandiff
## [1] 24.78261
sddiff <- sd(data$Difference)
sddiff
## [1] 26.946
effectsize <- meandiff / sddiff
effectsize
## [1] 0.9197138
boxplot(data$Pre.Rubric, data$Post.Rubric, names = c("Before", "After"), main = "Scores before & after", col = c("yellow", "blue"), ylab = "Score (%)")
hist(data$Difference, main = "Improvement Distribution", xlab = "Difference (Post - Pre)", col = "red", border = "blue")
Conclusion:
Mean percentage correct before the rubric: 46.09%
Mean percentage correct after the rubric: 70.87%
Standard deviation of the differences: 26.95 percentage points
Effect size (mean difference / SD of differences): 0.92
We used a paired t-test to determine if the classification rubric actually helped students improve their accuracy. This test compares the mean difference between paired observations (Pre-Rubric vs Post-Rubric scores) against a null hypothesis of zero difference.
Hypothesis
We tested the following hypotheses:
Null Hypothesis:
\(H_{0}\) : \(\mu_{difference} = 0\). (The rubric makes no difference; the average improvement is zero, which is equivalent to \(\mu_1 = \mu_2\).)
Alternate Hypothesis:
\(H_{a}\) : \(\mu_{difference} \neq 0\). (The rubric changes classification accuracy, which is equivalent to \(\mu_1 \neq \mu_2\).)
where,
\(\mu_{difference}\) = mean of the paired differences (Post-Rubric minus Pre-Rubric)
\(\mu_1\) = mean Pre-Rubric score
\(\mu_2\) = mean Post-Rubric score
qqnorm(data$Difference, main = "Q-Q Plot of Difference Scores")
qqline(data$Difference)
Conclusion - We created a histogram of the differences and it looked roughly bell-shaped with no major outliers. We also checked the Q-Q plot, where the data points lined up mostly along the straight diagonal line as expected. Therefore, the assumption of normality is satisfied.
t_testresult <- t.test(data$Post.Rubric, data$Pre.Rubric, paired = TRUE)
t_testresult
##
## Paired t-test
##
## data: data$Post.Rubric and data$Pre.Rubric
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 13.13028 36.43493
## sample estimates:
## mean difference
## 24.78261
The analysis yielded the following results:
Test Statistic: t = 4.4108
Degrees of Freedom = 22
P-value: p = 0.0002212
95% Confidence Interval: [13.13, 36.43]
Since the p-value is far below the alpha level of 0.05, we reject the null hypothesis. The difference in scores is statistically significant, indicating that the rubric produced a genuine improvement in classification accuracy.
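As a sanity check, the test statistic and confidence interval can be reproduced directly from the summary statistics computed earlier; a short sketch:
# Reproducing the t statistic and 95% CI from the difference scores
n  <- nrow(data)                                      # 23 pairs
se <- sd(data$Difference) / sqrt(n)                   # standard error of the mean difference
mean(data$Difference) / se                            # t statistic (about 4.41)
mean(data$Difference) + c(-1, 1) * qt(0.975, df = n - 1) * se   # 95% CI (about 13.1 to 36.4)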
Effectiveness of the Rubric
The rubric was very effective.
Before Rubric: The average score was 46.1%.
After Rubric: The average score rose to 70.9%. The 95% confidence interval indicates that using the rubric improves accuracy by somewhere between 13 and 36 percentage points.
Effect Size - The power analysis showed that, with 23 participants, the smallest effect detectable with 80% power is d = 0.61. The observed effect size was d = 0.92, comfortably above that threshold and large by conventional standards (Cohen's d of 0.8 or more).
Power - Because the observed effect was large, the study had very high statistical power (~98.8%) for an effect of this size. This means there was very little chance of missing an improvement this large if it were real; 23 students were more than enough to demonstrate that the rubric works.
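The ~98.8% figure can be reproduced by plugging the observed effect size back into the same power function used earlier; a brief sketch:
# Power of a paired t-test with n = 23 pairs at the observed effect size
library(pwr)
pwr.t.test(n = 23, d = 0.92, sig.level = 0.05, type = "paired")$power   # close to the ~98.8% quoted above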
The practical significance here is that an improvement from ~46% to ~71% is the difference between failing and passing. In a High Reliability Organization (such as a nuclear plant), correctly classifying roughly 25 percentage points more safety reports could help prevent a serious accident. This tool clearly helps employees better understand and classify safety risks.
data <- read.csv("C:/Users/rohit/Downloads/Data_Pre_Post_Rubric_HRO.csv")
data <- as.data.frame(data)
data
library(pwr)
power_values <- pwr.t.test(n = 23, power = 0.8, sig.level = 0.05, type = "paired")
power_values
effect_sizes <- seq(0.1, 1.5, 0.1)
power_values <- pwr.t.test(n = 23, d = effect_sizes, sig.level = 0.05, type = "paired")$power
plot(effect_sizes, power_values, type = "l", col = "orange", lwd = 3, main = "Power Curve (n = 23)", xlab = "Effect Size", ylab = "Statistical Power")
abline(h = 0.80, col = "grey", lty = 2, lwd = 2)
meanpre <- mean(data$Pre.Rubric)
meanpre
meanpost <- mean(data$Post.Rubric)
meanpost
data$Difference <- data$Post.Rubric - data$Pre.Rubric
meandiff <- mean(data$Difference)
meandiff
sddiff <- sd(data$Difference)
sddiff
effectsize <- meandiff / sddiff
effectsize
boxplot(data$Pre.Rubric, data$Post.Rubric, names = c("Before", "After"), main = "Scores before & after", col = c("yellow", "blue"), ylab = "Score (%)")
hist(data$Difference, main = "Improvement Distribution", xlab = "Difference (Post - Pre)", col = "red", border = "blue")
qqnorm(data$Difference, main = "Q-Q Plot of Difference Scores")
qqline(data$Difference)
t_testresult <- t.test(data$Post.Rubric, data$Pre.Rubric, paired = TRUE)
t_testresult