This research study was designed to evaluate the effectiveness of a classification rubric for safety management.
High Reliability Organizations (HROs) are organizations that operate in high-risk environments but maintain exceptionally low accident rates (e.g., nuclear power plants, aircraft carriers, air traffic control). Researchers Karl Weick and Kathleen Sutcliffe identified five “hallmarks” or principles that characterize these organizations:
Anticipating Failure:
Preoccupation with Failure (PF) - Treating any lapse as a symptom that something may be wrong
Reluctance to Simplify Interpretations (RS) - Creating complete pictures rather than oversimplifying
Sensitivity to Operations (SO) - Staying attentive to the front line where the actual work happens
Responding to Disruptions:
Commitment to Resilience (CR) - Developing capabilities to detect, contain, and bounce back from errors
Deference to Expertise (DE) - Pushing decisions to people with expertise regardless of rank
Organizations trying to improve safety often collect proactive safety reports - narratives about near-misses, hazards, unsafe conditions, and safety suggestions written by employees. These reports can be classified by which HRO hallmark they relate to, but the original definitions by Weick and Sutcliffe were based on anecdotal observations and can be difficult to apply consistently.
Research Question: Does a scientifically-grounded classification rubric improve the accuracy of classifying proactive safety reports by HRO hallmarks?
Participants classified safety reports from Precision Metal Components Inc., a mid-sized manufacturing facility that produces custom metal parts.
A paired t-test is well suited to this experiment because it directly tests whether the mean of the within-subject differences is different from zero. In our study, the two paired measurements are the percentage of safety reports each participant classified correctly pre-rubric and post-rubric. Participants first classified a set of safety reports before any exposure to the rubric, relying only on their existing knowledge of HROs and the hallmarks; this pre-rubric classification performance was measured as the percentage of correct classifications.
They then classified a different set of safety reports after being taught the rubric, which embeds the human information processing model and the situational awareness model.
Our goal is to measure the effect of the rubric on classification performance by collecting pre-rubric and post-rubric observations from the same set of people. Because each individual completes both the pre- and post-rubric classification tasks, the change we measure is tied more directly to the rubric itself; pairing removes between-subject differences and makes the effect of the rubric clearer.
The paired t-test therefore controls for individual variability caused by differences in safety and reliability knowledge, reading comprehension skill, and attention level. Letting each participant serve as their own control yields a cleaner test, as the short sketch below illustrates.
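To make the pairing idea concrete, here is a minimal sketch with simulated scores (hypothetical data, not the study data) showing that a paired t-test is numerically identical to a one-sample t-test on the within-subject differences:
# Simulated pre/post percentage-correct scores for 23 hypothetical participants
set.seed(1)
pre <- round(runif(23, 20, 80), -1)
post <- pmin(pre + rnorm(23, mean = 20, sd = 15), 100)
t.test(post, pre, paired = TRUE)$statistic # paired t-test
t.test(post - pre)$statistic # one-sample t-test on the differences gives the same t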
The more pairs of observations we collect, the higher the statistical power and the more precise the estimate of the effect.
Limitations/confounds: time-related threats such as fatigue, history, and motivation; for example, participants may perform better or worse depending on the time of day.
dat <- read.csv("Data_Pre_Post_Rubric_HRO.csv", header=TRUE)
dat
## Subject Pre.Rubric Post.Rubric
## 1 1 30 100
## 2 2 70 90
## 3 3 40 80
## 4 4 70 50
## 5 5 30 100
## 6 6 30 70
## 7 7 30 40
## 8 8 30 70
## 9 9 70 80
## 10 10 50 70
## 11 11 50 50
## 12 12 50 50
## 13 13 50 50
## 14 14 20 40
## 15 15 70 50
## 16 16 30 60
## 17 17 30 50
## 18 18 60 100
## 19 19 30 100
## 20 20 80 90
## 21 21 40 50
## 22 22 30 100
## 23 23 70 90
cor(dat$Pre.Rubric, dat$Post.Rubric)
## [1] 0.1109376
# power curve
library(pwr)
## Warning: package 'pwr' was built under R version 4.5.2
# number of observations
n <- 23
# List of effect sizes
seq_effect_sizes <- seq(0.1, 1.0, by = 0.05)
# create a numeric empty vector to store powers values
# size of it must equal size of effect sizes sequence vector
powers_seq <- numeric(length(seq_effect_sizes))
# for loop that computes power for each effect size
for (i in 1:length(seq_effect_sizes))
{
  d_val <- seq_effect_sizes[i] # take the ith effect size from the sequence
  result <- pwr.t.test(n = n, d = d_val, sig.level = 0.05, type = "paired")
  powers_seq[i] <- result$power # store the computed power in the ith position of the powers vector
}
# plotting the power curve
plot(seq_effect_sizes, powers_seq, type = "l", lwd = 2,
xlab = "Effect Size",
ylab = "Power",
main = "Power Curve for Paired t-Test ",
ylim = c(0, 1))
#library("pwr")
pwr.t.test(n=23, sig.level=.05, power=0.8, type="paired")
##
## Paired t test power calculation
##
## n = 23
## d = 0.6112775
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number of *pairs*
Since we have a sample of 23 pairs, the minimum effect size we can detect at 80% power and a 0.05 significance level is roughly d = 0.61.
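For comparison, the same function can be solved for the required number of pairs instead; as a hypothetical what-if (not part of the study design), detecting a smaller effect of d = 0.5 at the same power would require roughly 34 pairs:
# What-if: number of pairs needed to detect d = 0.5 at 80% power
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "paired")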
There are many factors acting as sources of bias in this design. Three of these potential confounding variables are discussed below:
First, the pre-rubric and post-rubric report sets were different, so the change in percentage correct cannot be attributed solely to the rubric. If the post-rubric reports happened to be simpler or less ambiguous, and therefore easier to assign to the correct hallmark, that alone could have inflated the post-rubric results.
Second, the concentration participants brought to the pre-rubric task may differ from what they brought to the post-rubric task. If both tasks were completed in one continuous session, concentration tends to decline over time, and this mental fatigue is another potential source of bias.
Third, some participants may have understood the rubric better than others, whether through a firmer grasp of the source material, more practice, prior exposure, or differences in attention and motivation. Some participants were also already more familiar with HRO and proactive-safety concepts. These differences in familiarity and understanding may have influenced the post-rubric results.
pre_rubric_mean <- mean(dat$Pre.Rubric)
post_rubric_mean <- mean(dat$Post.Rubric)
print(paste("Pre Rubric Mean:", pre_rubric_mean, "Post Rubric Mean:", post_rubric_mean))
## [1] "Pre Rubric Mean: 46.0869565217391 Post Rubric Mean: 70.8695652173913"
pre_sd <- sd(dat$Pre.Rubric)
pre_sd
## [1] 18.27545
post_sd <- sd(dat$Post.Rubric)
post_sd
## [1] 21.93234
differences<- dat$Post.Rubric - dat$Pre.Rubric
diff_sd <- sd(differences)
diff_sd
## [1] 26.946
The standard deviation of the pre-rubric scores is 18.28 and of the post-rubric scores is 21.93; the standard deviation of the paired differences is 26.95.
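The standard deviation of the differences follows from the two score standard deviations and the pre/post correlation computed earlier (r ≈ 0.11); because the correlation between pre and post scores is low, the differences are nearly as noisy as the two sets of scores combined:
\[ s_{\delta} = \sqrt{s_{pre}^2 + s_{post}^2 - 2\, r\, s_{pre} s_{post}} = \sqrt{18.28^2 + 21.93^2 - 2(0.111)(18.28)(21.93)} \approx 26.9 \]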
hist(differences,
breaks=10, # suggests roughly 10 bins (R may adjust the exact number)
main = "Histogram of Differences",
xlab = "Differences in Percentage Correct Post-Pre",
col = "lightgreen",
border = "black")
boxplot(dat$Pre.Rubric, dat$Post.Rubric,
names = c("Before", "After"),
main = "Box-plots for Before and After Rubric",
ylab = "Percentage Correct",
col = c("red", "lightblue"))
We define delta as the difference between the corresponding observations for each participant:
\[ \delta_i = post_i - pre_i \]
We define the difference between the means of the two populations, namely the post-rubric and pre-rubric percentage correct, as
\[\mu_{\delta} = \mu_{post} - \mu_{pre} \] Our null hypothesis is that the population mean of the differences is zero, i.e., that the rubric does not improve classification accuracy. The null hypothesis is written as:
\[ H_0: \mu_{\delta} = 0 \]
The alternative hypothesis is that the population mean of the differences is not zero. This is written as:
\[ H_a: \mu_{\delta} \neq 0 \]
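For reference, the paired t statistic computed below is simply the mean difference divided by its standard error, with n - 1 degrees of freedom; plugging in the values obtained above gives
\[ t = \frac{\bar{\delta}}{s_{\delta}/\sqrt{n}} = \frac{24.78}{26.95/\sqrt{23}} \approx 4.41, \qquad df = n - 1 = 22 \]
which matches the t.test output reported below.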
qqnorm(dat$Pre.Rubric)
qqline(dat$Pre.Rubric, col = "blue")
qqnorm(dat$Post.Rubric)
qqline(dat$Post.Rubric, col = "blue")
x<-c(dat$Pre.Rubric,dat$Post.Rubric)
qqnorm(x)
qqline(x, col="blue")
diff <- dat$Post.Rubric - dat$Pre.Rubric
qqnorm(diff)
qqline(diff, col = "blue")
Most of the points lie fairly close to the straight line, indicating that the difference values are approximately normally distributed. Some deviation is visible at the extremes, but not enough to violate the normality assumption.
Thus, we move forward to conduct the paired t-test.
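As an optional supplementary check (not part of the original analysis), a Shapiro-Wilk test on the differences can back up the visual assessment; a p-value above 0.05 would be consistent with normality:
# Numeric normality check for the paired differences
shapiro.test(diff)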
t.test(dat$Post.Rubric, dat$Pre.Rubric, paired=TRUE)
##
## Paired t-test
##
## data: dat$Post.Rubric and dat$Pre.Rubric
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 13.13028 36.43493
## sample estimates:
## mean difference
## 24.78261
The results obtained are as follows:
test statistic: t = 4.4108
degrees of freedom (number of pairs - 1): 22
p-value: 0.0002212
95% confidence interval: [13.13, 36.43]
mean difference: 24.78
From the results of the paired t-test, we reject the null hypothesis that there is no difference between the mean percentage correct for pre- and post-rubric classification.
The 95% confidence interval, [13.13, 36.43], is entirely positive and the mean difference is 24.78, indicating that post-rubric classification accuracy improved. With a p-value of 0.00022, far below the 0.05 significance level, we conclude that the post-rubric scores were significantly higher than the pre-rubric scores.
Thus, we can state that the use of a rubric allowed for a meaningful and reliable increase in classification accuracy of the reports.
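As a sanity check (a short sketch, not part of the original analysis), the confidence interval can be reproduced by hand from the mean and standard error of the differences:
# Reconstruct the 95% CI for the mean difference from first principles
n_pairs <- length(differences)
se_diff <- diff_sd / sqrt(n_pairs) # standard error of the mean difference
mean(differences) + c(-1, 1) * qt(0.975, df = n_pairs - 1) * se_diff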
mean_difference <- 24.78
observed_d <- mean_difference / diff_sd
observed_d
## [1] 0.919617
Our minimum detectable effect size was approximately 0.61. The observed Cohen's d is 0.92, substantially higher than that threshold, meaning the rubric's effect on classification accuracy was considerably larger than the smallest effect our design could reliably detect.
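Equivalently (a small check using the unrounded mean difference rather than the rounded 24.78 above), Cohen's d can be computed directly from the data:
# Cohen's d for paired data, computed from the unrounded mean difference
d_exact <- mean(differences) / sd(differences)
d_exact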
observed_power <- pwr.t.test(n = 23,
d = 0.92,
sig.level = 0.05,
type = "paired")$power
observed_power
## [1] 0.9878557
Our power analysis targeted 80% power for detecting an effect size of 0.61. The power associated with the observed effect size of 0.92 is about 99%, well above that target, so the study had more statistical power for an effect of this size than we had planned for.
Because the improvement effect is strong, our t-test had more than enough statistical power to detect it.
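Looked at the other way (a hypothetical what-if, not part of the study design), an effect as large as d = 0.92 could have been detected at 80% power with far fewer pairs, roughly a dozen:
# What-if: pairs needed to detect the observed effect size at 80% power
pwr.t.test(d = 0.92, sig.level = 0.05, power = 0.8, type = "paired")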
The large observed effect size of 0.92, the observed power of about 0.99, and the p-value of 0.00022 (far below the 0.05 significance level) together confirm that the rubric significantly improved the classification of reports and that the change was not just random variation. The improvement is statistically reliable.
The improvement is also relevant in practice. An increase of roughly 25 percentage points is substantial: on average, about 25 more reports out of every 100 are now classified into their correct HRO hallmark. This gain could be realized by training LLMs to apply the rubric, or by having expert report readers adopt it in practice. Better classification of safety issues improves decision-making and an organization's reliability.