Project Overview

We will participate in a research study designed to evaluate the effectiveness of a classification rubric for safety management. This project will give us hands-on experience with experimental design, hypothesis testing, and power analysis while contributing to real research in safety management.

Background

High Reliability Organizations (HROs) are organizations that operate in high-risk environments yet maintain exceptionally low accident rates (e.g., nuclear power plants, aircraft carriers, air traffic control). Researchers Karl Weick and Kathleen Sutcliffe identified five “hallmarks” or principles that characterize these organizations.

Anticipating failure:

  • Preoccupation with Failure (PF) - Treating any lapse as a symptom that something may be wrong

  • Reluctance to Simplify Interpretations (RS) - Creating complete pictures rather than oversimplifying

  • Sensitivity to Operations (SO) - Staying attentive to the front line where the actual work happens

Responding to disruptions:

  • Commitment to Resilience (CR) - Developing capabilities to detect, contain, and bounce back from errors

  • Deference to Expertise (DE) - Pushing decisions to people with expertise regardless of rank

Organizations trying to improve safety often collect proactive safety reports - narratives about near-misses, hazards, unsafe conditions, and safety suggestions written by employees. These reports can be classified by the HRO hallmark they relate to, but the original definitions by Weick and Sutcliffe were based on anecdotal observations and can be difficult to apply consistently.

Research Question: Does a scientifically grounded classification rubric improve the accuracy of classifying proactive safety reports by HRO hallmarks?

Study Context

We will classify safety reports from Precision Metal Components Inc., a mid-sized manufacturing facility that produces custom metal parts. The facility includes:

  • CNC machining operations

  • Welding and fabrication areas

  • Quality control lab

  • Material handling (forklifts, cranes)

  • Maintenance operations

  • Chemical storage and treatment

Safety protocols include lockout/tagout procedures, PPE requirements, machine guarding, forklift certification, and daily equipment inspections.

Experimental Design

  • Design Type: Paired (within-subjects) design

  • Independent Variable: Availability of the classification rubric (Before vs. After)

  • Dependent Variable: Percentage of reports correctly classified (compared to an expert gold standard)

  • Sample Size: Approximately 23 students

Procedure:

Phase 1 - Before Rubric: We will receive brief definitions of the five HRO hallmarks (the original Weick & Sutcliffe definitions) and then read 10 proactive safety reports. We will classify each report as exactly ONE of the five hallmarks: PF, RS, SO, CR, or DE. These classifications will be recorded.

Phase 2 - After Rubric: We will receive a detailed classification rubric based on models of human information processing and situational awareness. After studying the rubric and working through example classifications, we will read 10 DIFFERENT proactive safety reports and classify each one using the rubric. These classifications will also be recorded.

Randomization: The order in which we receive the two sets of 10 reports will be randomized to control for differences in report difficulty; a minimal sketch of one way to implement this assignment appears below.
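The following is an illustrative sketch of such a randomization in R. The seed, set labels, and assignment scheme are hypothetical placeholders, not the actual study procedure:

# Hypothetical sketch: randomly assign which report set each participant sees first
set.seed(42)                          # placeholder seed, chosen for reproducibility
n_participants <- 23
# Each participant gets report set "A" or "B" in Phase 1; the other set follows in Phase 2
first_set <- sample(c("A", "B"), n_participants, replace = TRUE)
data.frame(subject = 1:n_participants, phase1_set = first_set)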

Design Choice:

We selected a paired (within-subjects) design analyzed with a paired t-test because each participant classifies reports both before and after seeing the rubric. The outcome is therefore a person-level change score, which isolates the improvement attributable to the rubric rather than differences between people. Baseline accuracy matters here: skills, safety knowledge, and risk-assessment criteria vary from person to person, so an independent-groups design would inflate between-person variance and require a larger sample to run this analysis. Because the paired analysis exploits the within-person correlation, it can detect the same effect with fewer participants, as the sketch below illustrates.
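To make the sample-size advantage concrete, here is a rough sketch using the pwr package. The effect size (d = 0.5) and the pre/post correlation (r = 0.5) are assumed benchmark values for illustration, not estimates from our study:

library(pwr)
d <- 0.5      # assumed standardized effect on the between-person scale
r <- 0.5      # assumed correlation between pre and post scores
# The SD of change scores is sqrt(2 * (1 - r)) times the between-person SD,
# so the paired effect size d_z grows as the correlation increases:
d_z <- d / sqrt(2 * (1 - r))
# Participants needed per GROUP for an independent-groups design:
pwr.t.test(d = d, sig.level = 0.05, power = 0.80, type = "two.sample")$n
# Pairs needed for the paired design (each pair = one participant):
pwr.t.test(d = d_z, sig.level = 0.05, power = 0.80, type = "paired")$n

Under these assumptions the independent-groups design needs roughly 64 participants per group (about 128 in total), while the paired design needs roughly 34 participants in total.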

Potential limitations and confounds:

Although we believe the paired design is the best choice for this analysis, it still carries some potential limitations and confounds.

  • Practice and familiarity effects: Accuracy may rise in Phase 2 not only because of the rubric but also because students are more practiced at the task.

  • Phase 1 vs. Phase 2 report difficulty: If either report set is systematically easier or harder, it will impose a different cognitive load on participants, which could confound the data we collect.

  • Participant expectations: Students may try harder in the post-rubric phase because they expect to perform better, which could introduce a small bias into the data.

  • Homogeneous sample: All participants share a similar academic background and field of study, which may limit variability in cognitive approaches and expectations. In contrast, real-world HRO environments typically comprise individuals from diverse disciplines with varying cognitive abilities and perspectives, making our sample less representative.

Input Data

# Accuracy scores (%) for each of the 23 participants, before and after the rubric
subject     <- 1:23
pre_rubric  <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_rubric <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
dat <- data.frame(subject, pre_rubric, post_rubric)
dat
##    subject pre_rubric post_rubric
## 1        1         30         100
## 2        2         70          90
## 3        3         40          80
## 4        4         70          50
## 5        5         30         100
## 6        6         30          70
## 7        7         30          40
## 8        8         30          70
## 9        9         70          80
## 10      10         50          70
## 11      11         50          50
## 12      12         50          50
## 13      13         50          50
## 14      14         20          40
## 15      15         70          50
## 16      16         30          60
## 17      17         30          50
## 18      18         60         100
## 19      19         30         100
## 20      20         80          90
## 21      21         40          50
## 22      22         30         100
## 23      23         70          90
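Before any modeling, a few descriptive summaries give useful context (a small sketch added here; the exact printed output is omitted):

# Group means: pre-rubric ≈ 46.1, post-rubric ≈ 70.9
colMeans(dat[, c("pre_rubric", "post_rubric")])
# Spread of each phase, and the distribution of per-person change
apply(dat[, c("pre_rubric", "post_rubric")], 2, sd)
summary(dat$post_rubric - dat$pre_rubric)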

State Null Hypothesis and Alternative Hypothesis

\(H_0:\mu_\text{diff}=0\)

\(H_1:\mu_\text{diff}\neq 0\)

\(H_0\): Providing a detailed HRO classification rubric does not change participants’ accuracy in classifying proactive safety reports compared to classifications made using only the brief hallmark definitions.

\(H_1\): Providing a detailed HRO classification rubric changes participants’ accuracy in classifying proactive safety reports compared to classifications made using only the brief hallmark definitions; the average post-rubric score differs from the pre-rubric score.

Effect Size, Power Calculation, and Visualization:

library(pwr)
pwr.t.test(n = 23,d= NULL, sig.level = 0.05,power= .80, type = "paired",alternative = "two.sided")
## 
##      Paired t test power calculation 
## 
##               n = 23
##               d = 0.6112775
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*

Here we solved for the minimum detectable effect size given 80% power, 23 paired observations, and a significance level of 0.05, obtaining d ≈ 0.611. The planned analysis therefore indicates that, with 23 participants, we can detect a medium-to-large effect (d ≈ 0.61).
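Viewed the other way around, the same function can solve for the number of pairs needed to detect a conventional “medium” effect (d = 0.5, a benchmark value rather than a study estimate) at 80% power:

# Solve for n (number of pairs) instead of d
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80,
           type = "paired", alternative = "two.sided")

This returns roughly n ≈ 34 pairs, so our 23 participants are adequate only if the true effect is somewhat larger than medium.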

Calculating the actual effect size from the sample data to obtain the achieved power of our analysis:

# Calculating the observed effect size (Cohen's d for paired data, d_z)
mean_prerubricdata  <- mean(pre_rubric)
mean_postrubricdata <- mean(post_rubric)
difference_between_mean <- mean_postrubricdata - mean_prerubricdata
paired_difference <- pre_rubric - post_rubric   # per-person change (sign does not affect the SD)
sd_of_differences <- sd(paired_difference)      # SD of the paired differences (not a pooled SD)
cohens_d <- difference_between_mean / sd_of_differences
cohens_d
## [1] 0.9197138
# Computing power at n = 23
power_at_23 <- pwr.t.test(n = 23, d = cohens_d, sig.level = 0.05,
                          type = "paired", alternative = "two.sided")
power_at_23
## 
##      Paired t test power calculation 
## 
##               n = 23
##               d = 0.9197138
##       sig.level = 0.05
##           power = 0.9878144
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*
# SIMPLE POWER CURVE (Effect size vs Power)
d_grid <- seq(0.1, 1.5, by = 0.01)
power_d <- sapply(d_grid, function(d)
  pwr.t.test(n = 23, d = d, sig.level = 0.05,
             type = "paired", alternative = "two.sided")$power)

plot(d_grid, power_d, type = "l", lwd = 2,
     xlab = "Effect Size (Cohen's d_z)",
     ylab = "Power (α = 0.05, two-sided)",
     main = "Power vs Effect Size for n = 23 (Paired t-test)")
abline(h = 0.8, lty = 2)
abline(v = cohens_d, lty = 3)
points(cohens_d, power_at_23$power, pch = 19, cex = 1.2)
text(cohens_d, power_at_23$power,
     labels = sprintf("d = %.2f, power = %.2f", cohens_d, power_at_23$power),
     pos = 4)

From the above calculation, the effect size computed from the actual data (d ≈ 0.92) is even larger than the minimum detectable effect, so the test achieves correspondingly higher power, roughly 98%.

The power curve above makes this power analysis easy to visualize.
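A complementary view, sketched below, holds the observed effect size fixed and varies the number of pairs instead, showing how quickly power accumulates as participants are added (the grid range and labels are our choices):

# Power as a function of sample size at the observed effect size
n_grid <- seq(5, 40, by = 1)
power_n <- sapply(n_grid, function(n)
  pwr.t.test(n = n, d = cohens_d, sig.level = 0.05,
             type = "paired", alternative = "two.sided")$power)
plot(n_grid, power_n, type = "l", lwd = 2,
     xlab = "Number of pairs (n)",
     ylab = "Power (α = 0.05, two-sided)",
     main = "Power vs Sample Size at Observed d (Paired t-test)")
abline(h = 0.8, lty = 2)   # conventional 80% power threshold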

Checking Correlation:

cor(dat$pre_rubric, dat$post_rubric)
## [1] 0.1109376

From the correlation check above we find that pre-rubric and post-rubric scores for the same individuals have almost no linear relationship (r = 0.11); a quick numerical consequence of this is sketched after the list below.

That means:

  • Participants who scored high before didn’t necessarily score high after.

  • Some low-scorers improved a lot, while others barely changed or even dropped.
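The practical consequence of this low correlation follows from the identity Var(diff) = Var(post) + Var(pre) − 2·r·SD(post)·SD(pre): with r ≈ 0.11 the pairing shrinks the variance of the differences only slightly. A quick numerical check of the identity on our data:

# Verify the variance identity for paired differences
r <- cor(dat$pre_rubric, dat$post_rubric)
var_implied <- var(dat$post_rubric) + var(dat$pre_rubric) -
  2 * r * sd(dat$post_rubric) * sd(dat$pre_rubric)
var_direct <- var(dat$post_rubric - dat$pre_rubric)
c(implied = var_implied, direct = var_direct)   # the two values should agree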

Data Visualization and Assumption Checks:

Assumptions Check

# Q-Q plot to assess normality of the paired differences
diff <- post_rubric - pre_rubric
qqnorm(diff, main = "Normal Q-Q Plot of Differences", col = "darkblue", pch = 19)
qqline(diff, col = "red", lwd = 2)

Analysis of Assumptions:

Normality of the paired differences was evaluated using a Q–Q plot. The points in the plot fell reasonably close to the reference line, showing only minor deviations at the extreme tails. This pattern indicates that the differences between post-rubric and pre-rubric scores are approximately normally distributed. Therefore, the normality assumption for the paired t-test is considered satisfied, and it is appropriate to proceed with the test.

Normality check with the Shapiro–Wilk test, histogram, and boxplot

shapiro.test(diff)        
## 
##  Shapiro-Wilk normality test
## 
## data:  diff
## W = 0.92204, p-value = 0.07369
hist(diff, main="Histogram of Differences", col= "lightblue")

boxplot(diff, main="Boxplot of Differences",col="lightblue")

The Shapiro–Wilk test gives p = 0.0737 > 0.05, meaning there is no significant evidence against normality.

The histogram also looks roughly symmetric, though perhaps slightly right-skewed, with no extreme outlier cluster; the visual pattern supports approximate normality.

The boxplot of differences tells the same story: the median is near the middle of the box, there are no obvious outliers beyond the whiskers, and the spread looks reasonable rather than strongly skewed. This again supports that the differences are not severely non-normal, giving us a solid foundation to proceed with the standard paired t-test (had normality failed, a nonparametric fallback is sketched below).
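Had the Shapiro–Wilk test or the plots indicated clear non-normality, a standard nonparametric alternative would be the Wilcoxon signed-rank test. A minimal sketch, not run as part of our analysis since the t-test assumptions held:

# Nonparametric fallback if the differences were clearly non-normal
wilcox.test(post_rubric, pre_rubric, paired = TRUE,
            alternative = "two.sided")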

Paired t-test

# Paired t-test comparing post- and pre-rubric scores
# (var.equal is ignored for paired tests, so it is omitted here)
t.test(post_rubric, pre_rubric, alternative = "two.sided",
       mu = 0, paired = TRUE, conf.level = 0.95)
## 
##  Paired t-test
## 
## data:  post_rubric and pre_rubric
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  13.13028 36.43493
## sample estimates:
## mean difference 
##        24.78261
#boxplot
boxplot(pre_rubric ,post_rubric,main="Visualization of Pre and Post Rubric test", names = c("Pre Rubric Spread","Post Rubric Spread"),
        ylab= "Score", col= c("skyblue", "lightgreen"))

par(mfrow=c(1,2))

# Pre-rubric histogram
hist(pre_rubric,
     main="Pre-Rubric Scores",
     xlab="Score",
     col="skyblue",
     xlim=c(0, 110),
     breaks=seq(0,110,10),
     ylim=c(0,23))

# Post-rubric histogram
hist(post_rubric,
     main="Post-Rubric Scores",
     xlab="Score",
     col="lightgreen",
     xlim=c(0, 110),
     breaks=seq(0,110,10),
     ylim=c(0,23))

Analysis of Paired t-test

A paired-samples t-test was conducted to compare pre-rubric and post-rubric scores. Results showed that rubric use significantly improved classification accuracy, t(22) = 4.41, p = 0.00022. The mean improvement was 24.78 percentage points, with a 95% confidence interval of [13.13, 36.43]. These findings provide strong evidence that the rubric had a positive effect on performance.
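As a sanity check on t.test(), the same statistic can be computed by hand from the formula t = mean(diff) / (sd(diff)/√n) with df = n − 1:

# Hand computation of the paired t statistic
diff <- post_rubric - pre_rubric
n <- length(diff)
t_stat <- mean(diff) / (sd(diff) / sqrt(n))   # should reproduce t ≈ 4.41
t_stat
2 * pt(-abs(t_stat), df = n - 1)              # two-sided p-value, ≈ 0.00022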

Additionally, the comparison histograms show the post-rubric distribution shifted rightward, with more scores in the higher bins. The boxplot comparison likewise shows a much higher post-rubric median, with the entire box (IQR) shifted upward, consistent with the rubric substantially improving scores.

Based on all of the statistical analysis above, we conclude that:

“Providing a detailed HRO classification rubric significantly improves participants’ accuracy in classifying proactive safety reports compared to classifications made using only brief hallmark definitions. The average post-rubric score differs from the pre-rubric score.”
