1 INTRODUCTION

High Reliability Organizations (HROs) are organizations that operate in high-risk environments but maintain exceptionally low accident rates (e.g., nuclear power plants, aircraft carriers, air traffic control). Researchers Karl Weick and Kathleen Sutcliffe identified five “hallmarks” or principles that characterize these organizations:

  • Preoccupation with failure

  • Reluctance to simplify interpretations

  • Sensitivity to operations

  • Commitment to resilience

  • Deference to expertise

HROs achieve these safety records by adhering to five core principles that sharpen the organization's ability to detect and address weak signals of potential failure. Many organizations encourage proactive reporting of near-misses and hazards, but assigning these reports to the correct HRO principle remains challenging when reviewers rely solely on high-level definitions. This report tests whether a detailed classification rubric leads to a significant improvement in classification accuracy.

1.1 Study Context

This study investigates whether a scientifically developed classification rubric improves the accuracy with which proactive safety reports are assigned to the five hallmarks of HROs. The dataset originates from Precision Metal Components Inc., a mid-sized manufacturing company whose operations include:

  • CNC machining operations

  • Material handling (forklifts, cranes)

  • Maintenance operations

  • Chemical storage and treatment

  • Welding and fabrication areas

  • Quality control lab

These operational conditions provide a realistic environment in which proactive safety reports capture meaningful early indicators of potential hazards.

1.2 Experimental Design

A paired design was used, in which each participant completed the classification task twice: once before receiving the rubric (Pre-Rubric) and once after (Post-Rubric). This approach allows direct comparison of performance across conditions while controlling for individual differences.

  • Pre-Rubric: Participants reviewed 10 proactive safety reports and classified each into one of the five HRO hallmarks using only the original definitions provided by Weick and Sutcliffe.

  • Post-Rubric: Participants classified a different set of 10 reports using a detailed rubric designed to provide clearer behavioral cues and reduce ambiguity.

Randomization: The order of the report sets was randomized to mitigate potential bias arising from differences in report difficulty.
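The randomization itself is not part of the analysis code. A minimal sketch of how it could be implemented in R (the set labels "A" and "B" are hypothetical placeholders for the two 10-report sets, which are not named in the study materials):

# Hypothetical sketch: assign which report set each participant sees
# in the Pre-Rubric condition; the other set is used Post-Rubric.
set.seed(123)                                   # reproducible assignment
first_set <- sample(c("A", "B"), size = 23, replace = TRUE)
table(first_set)                                # check the split is roughly balanced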

1.3 Data Collected

Below are the data collected in the Pre-Rubric and Post-Rubric conditions (scores are the percentage of reports classified correctly):

# Load the pre/post classification scores (one row per participant)
data <- read.csv("C:/Users/rohit/Downloads/Data_Pre_Post_Rubric_HRO.csv")
data <- as.data.frame(data)
data
##    Subject Pre.Rubric Post.Rubric
## 1        1         30         100
## 2        2         70          90
## 3        3         40          80
## 4        4         70          50
## 5        5         30         100
## 6        6         30          70
## 7        7         30          40
## 8        8         30          70
## 9        9         70          80
## 10      10         50          70
## 11      11         50          50
## 12      12         50          50
## 13      13         50          50
## 14      14         20          40
## 15      15         70          50
## 16      16         30          60
## 17      17         30          50
## 18      18         60         100
## 19      19         30         100
## 20      20         80          90
## 21      21         40          50
## 22      22         30         100
## 23      23         70          90

2 DESIGN CHOICE

A paired design is appropriate for this study because each student classifies reports both before and after receiving the rubric. This means each person serves as their own control.

The main advantage over an independent-groups design is that it controls for individual differences. Some students might naturally be better at classifying safety reports than others; with a paired design, each student's improvement is compared against their own baseline rather than against different people. This reduces variability and helps detect smaller effects with fewer participants.
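As a quick check of how much the pairing buys with these data (a sketch using the data frame loaded in Section 1.3), the variance of the paired differences can be compared with the variance an independent-groups contrast would face; the gap between the two is driven by the pre/post correlation:

# var(Post - Pre) = var(Pre) + var(Post) - 2*cov(Pre, Post),
# so a positive pre/post correlation shrinks the paired variance.
var(data$Post.Rubric - data$Pre.Rubric)         # paired-design variance
var(data$Pre.Rubric) + var(data$Post.Rubric)    # independent-groups analogue
cor(data$Pre.Rubric, data$Post.Rubric)          # correlation driving the gap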

Potential limitations include order effects. Since everyone does the “before” condition first, any improvement could partly be due to practice or familiarity with the task rather than the rubric itself. Also, the two sets of 10 reports are different, so any difference in difficulty between the sets could affect results.

2.1 Power Analysis

A power analysis was conducted to determine the minimum effect size our study could detect with 80% power at a significance level of 0.05. With our sample size of 23 students, the minimum detectable effect size is d = 0.61.

library(pwr)
# Solve for the minimum detectable effect size d, given n, power, and alpha
power_values <- pwr.t.test(n = 23, power = 0.8, sig.level = 0.05, type = "paired")
power_values
## 
##      Paired t test power calculation 
## 
##               n = 23
##               d = 0.6112775
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*
effect_sizes <- seq(0.1, 1.5, 0.1)
# Power at each effect size for n = 23 pairs
power_values <- pwr.t.test(n = 23, d = effect_sizes, sig.level = 0.05, type = "paired")$power
plot(effect_sizes, power_values, type = "l", col = "orange", lwd = 3,
     main = "Power Curve (n = 23)", xlab = "Effect Size", ylab = "Statistical Power")
abline(h = 0.80, col = "grey", lty = 2, lwd = 2)  # 80% power threshold

The power curve shows the relationship between effect size and statistical power for our sample size: power rises as the effect size increases, and at small effect sizes power is very low, meaning a real effect would likely be missed. Our observed effect size of d = 0.92 (Section 4) is well above the minimum detectable effect of d = 0.61, so the analysis was well-powered to detect the effect of the rubric.

3 POTENTIAL CONFOUNDS

Two confounds follow directly from the design limitations noted in Section 2. First, order and practice effects: every participant completed the Pre-Rubric condition first, so some of the measured improvement may reflect growing familiarity with the classification task rather than the rubric itself. Second, report-set difficulty: the Pre-Rubric and Post-Rubric conditions used different sets of 10 reports, so a difference in difficulty between the sets could inflate or mask the rubric's effect. Randomizing the order of the report sets mitigates, but does not eliminate, this second risk.

4 DESCRIPTIVE STATISTICS

meanpre  <- mean(data$Pre.Rubric)
meanpre
## [1] 46.08696
meanpost <- mean(data$Post.Rubric)
meanpost
## [1] 70.86957
# Per-participant improvement (Post minus Pre)
data$Difference <- data$Post.Rubric - data$Pre.Rubric
meandiff <- mean(data$Difference)
meandiff
## [1] 24.78261
sddiff   <- sd(data$Difference)
sddiff
## [1] 26.946
# Cohen's d for paired samples: mean difference / SD of differences
effectsize <- meandiff / sddiff
effectsize
## [1] 0.9197138
boxplot(data$Pre.Rubric, data$Post.Rubric, names = c("Before", "After"),
        main = "Scores before & after", col = c("yellow", "blue"), ylab = "Score (%)")

hist(data$Difference, main = "Improvement Distribution",
     xlab = "Difference (Post - Pre)", col = "red", border = "blue")

Conclusion:

  • Mean percentage correct before rubric: 46.09%

  • Mean percentage correct after rubric: 70.87%

  • Standard deviation of the differences: 26.95 percentage points

  • Effect size (Cohen's d): 0.92
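For reference, the effect size reported above is Cohen's d for paired samples: the mean of the differences divided by their standard deviation, \(d = \bar{d} / s_d = 24.78 / 26.95 \approx 0.92\).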

5 PAIRED T-TEST

We used a paired t-test to determine if the classification rubric actually helped students improve their accuracy. This test compares the mean difference between paired observations (Pre-Rubric vs Post-Rubric scores) against a null hypothesis of zero difference.

Hypothesis

We tested the following hypotheses:

Null Hypothesis:

\(H_{0}\) : \(\mu_{difference} = 0\), i.e., \(\mu_1 = \mu_2\). (The rubric makes no difference; the average improvement is zero.)

Alternate Hypothesis:

\(H_{a}\) : \(\mu_{difference} \neq 0\), i.e., \(\mu_1 \neq \mu_2\). (The rubric changes mean classification accuracy.)

where,

\(\mu_1\) = mean Pre-Rubric score,

\(\mu_2\) = mean Post-Rubric score, and

\(\mu_{difference} = \mu_2 - \mu_1\) = mean of the paired differences.
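The test statistic is the mean difference divided by its standard error, with \(n - 1\) degrees of freedom:

\(t = \dfrac{\bar{d}}{s_d / \sqrt{n}} = \dfrac{24.78}{26.95 / \sqrt{23}} \approx 4.41\), with \(df = 22\),

which matches the R output in Section 5.2.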

5.1 Check Assumptions: Normality of Differences

qqnorm(data$Difference, main = "Q-Q Plot of Difference Scores")
qqline(data$Difference)

Conclusion - The histogram of the differences (Section 4) looked roughly bell-shaped with no major outliers, and in the Q-Q plot the points fell mostly along the diagonal reference line. The assumption of normality is therefore reasonably satisfied for the paired t-test.
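As a complementary formal check (not part of the original analysis), a Shapiro-Wilk test on the difference scores can back up the visual assessment; a p-value above 0.05 would be consistent with normality:

# Formal normality test on the paired differences (base R)
shapiro.test(data$Difference)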

5.2 Test Results

t_testresult <- t.test(data$Post.Rubric, data$Pre.Rubric, paired = TRUE)
t_testresult
## 
##  Paired t-test
## 
## data:  data$Post.Rubric and data$Pre.Rubric
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  13.13028 36.43493
## sample estimates:
## mean difference 
##        24.78261

The analysis yielded the following results:

  • Test Statistic: t = 4.4108

  • Degrees of Freedom = 22

  • P-value: p = 0.0002212

  • 95% Confidence Interval: [13.13, 36.43]

Since the p-value is far below the alpha level of 0.05, we reject the null hypothesis: the improvement in scores is statistically significant, and the evidence indicates the rubric genuinely changed classification accuracy.
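As a sanity check, the reported confidence interval can be reproduced by hand from the descriptive statistics in Section 4 (mean difference ± critical t × standard error):

# Hand-check of the 95% CI reported by t.test()
n_pairs <- nrow(data)
se      <- sd(data$Difference) / sqrt(n_pairs)    # SE of the mean difference
t_crit  <- qt(0.975, df = n_pairs - 1)            # two-sided critical value, df = 22
mean(data$Difference) + c(-1, 1) * t_crit * se    # reproduces [13.13, 36.43]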

6 INTERPRETATION

Effectiveness of the Rubric

The rubric was very effective.

Effect Size - The power analysis showed that the minimum effect size detectable with 80% power at n = 23 was d = 0.61. Our observed effect size was d = 0.92, comfortably above that threshold and conventionally regarded as a large effect.

Power - Because the observed effect was so large, the study's achieved power was very high (~98.8%), meaning an effect of this size would almost certainly have been detected; 23 students were ample for this purpose.
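The achieved-power figure can be reproduced by plugging the observed effect size back into pwr (with the caveat that post hoc power computed this way is a direct function of the observed effect, not independent evidence):

library(pwr)
# Achieved power at the observed effect size, n = 23 pairs
pwr.t.test(n = 23, d = 0.92, sig.level = 0.05, type = "paired")$power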

The practical significance is that an improvement from ~46% to ~71% is the difference between failing and passing. In a High Reliability Organization (such as a nuclear plant), correctly classifying 25 percentage points more safety reports could help prevent a serious accident. This tool clearly helps employees better understand and classify safety risks.

SOURCE CODE

data<-read.csv("C:/Users/rohit/Downloads/Data_Pre_Post_Rubric_HRO.csv")
data <- as.data.frame(data)
data
library(pwr)
power_values <- pwr.t.test(n = 23, power = 0.8, sig.level = 0.05, type = "paired")
power_values
effect_sizes <- seq(0.1, 1.5, 0.1)
power_values <- pwr.t.test(n = 23, d = effect_sizes, sig.level = 0.05, type = "paired")$power
plot(effect_sizes, power_values, type = "l", col = "orange", lwd = 3,
     main = "Power Curve (n = 23)", xlab = "Effect Size", ylab = "Statistical Power")
abline(h = 0.80, col = "grey", lty = 2, lwd = 2)
meanpre  <- mean(data$Pre.Rubric)
meanpre
meanpost <- mean(data$Post.Rubric)
meanpost
data$Difference <- data$Post.Rubric - data$Pre.Rubric
meandiff <- mean(data$Difference)
meandiff
sddiff   <- sd(data$Difference)
sddiff
effectsize <- meandiff / sddiff
effectsize
boxplot(data$Pre.Rubric, data$Post.Rubric, names = c("Before", "After"),
        main = "Scores before & after", col = c("yellow", "blue"), ylab = "Score (%)")
hist(data$Difference, main = "Improvement Distribution",
     xlab = "Difference (Post - Pre)", col = "red", border = "blue")
qqnorm(data$Difference, main = "Q-Q Plot of Difference Scores")
qqline(data$Difference)
t_testresult <- t.test(data$Post.Rubric, data$Pre.Rubric, paired = TRUE)
t_testresult