1 Design Choice

Because we are interested in how much each individual improves, the same group of people should appear in both conditions. A paired (within-subjects) design directly measures the change in percentage for each subject. For the same sample size, a paired design generally has higher statistical power than a between-subjects design because the error term is smaller.
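
As a rough illustration of this power advantage, the pwr package used later in this report can compare the two designs at the same sample size. The effect size d = 0.6 below is an assumption chosen only for illustration, and d is not strictly comparable across designs because the paired version is standardized by the standard deviation of the differences.

library(pwr)
pwr.t.test(n = 23, d = 0.6, sig.level = 0.05, type = "paired")$power      # within-subjects power
pwr.t.test(n = 23, d = 0.6, sig.level = 0.05, type = "two.sample")$power  # between-subjects power (n per group)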

A paired design also controls for individual differences. Subjects differ in many respects, such as prior knowledge, reading ability, and general test-taking ability. In a between-subjects design, those individual differences add noise to the comparison between groups. In the paired design, each student serves as their own control, so stable subject-level differences cancel out when we analyze the difference scores. This makes the analysis simpler than comparing two separate groups, each of which experiences only one condition.

One limitation is that differences in report difficulty between the two phases may affect accuracy: the “pre” and “post” phases use two different sets of safety reports, and those sets may not be perfectly matched in difficulty.

2 Power Analysis

library(pwr)

power_values <- seq(0.10, 0.99, by = 0.01)    # power values array

effect_sizes <- array(data = 0, dim = length(power_values)) # all output d values

# For-loop 
for (i in 1:length(power_values)) {
  
  p <- power_values[i] # Extract the i-th power value
  
  result <- pwr.t.test(
    n = 23,
    power = p,
    sig.level = 0.05,
    type = "paired",
    alternative = "two.sided"
  )
  
  effect_sizes[i] <- result$d # store the effect size for this power level
}

# Plot the power curve
plot(power_values, effect_sizes,
     type = "l",
     xlab = "Power",
     ylab = "Effect Size",
     main = "Power Curve"
)

# Effect size needed for 80% power
pwr.t.test(
  n = 23,
  power = 0.80,
  sig.level = 0.05,
  type = "paired",
  alternative = "two.sided")
## 
##      Paired t test power calculation 
## 
##               n = 23
##               d = 0.6112775
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*
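
The same function can also be run in the other direction to estimate how many pairs would be needed for a smaller effect; the d = 0.5 below is an assumed moderate effect size for illustration, not a value taken from the data.

# Pairs required for 80% power at an assumed moderate effect size of d = 0.5
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05,
           type = "paired", alternative = "two.sided")   # n comes out to roughly 34 pairs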

3 Potential Confounds

Difficulty level of the reports

The “pre” and “post” phases use two different sets of safety reports. Even though the assignment randomizes the order of the sets across participants, the two sets may not be perfectly matched in difficulty. If the “Post-Rubric” report set contains simpler or more obvious cases, subjects may perform more accurately after learning the rubric for reasons unrelated to the rubric itself. Conversely, if the second set is more difficult, this may underestimate the effect of the rubric.

The impact of fatigue/time-on-task on results.

  • How fatigue/time-on-task impacts results:

Participants may experience fatigue or distraction by the time Phase 2 is performed, which could lower their performance and partially mask the benefit of the rubric (a negative bias). Conversely, if Phase 1 is rushed and performed too quickly, pre-rubric scores may be depressed, which would exaggerate the apparent improvement in Phase 2.

  • Ways to limit the impact of fatigue/time-on-task on results:

Reduce the length of each session and allow breaks between reports; randomize the order in which participants enter the study; conduct sessions at a consistent time of day where possible; and record the time spent on each report so it can be included in the data set as a covariate to help separate time-on-task effects from the rubric effect, as sketched below.
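
A minimal sketch of that covariate adjustment, using entirely hypothetical data since time-on-task was not collected in this study:

# Hypothetical illustration only: simulate improvement scores and time-on-task,
# then fit a linear model treating time-on-task as a covariate.
set.seed(1)
sim_improvement  <- rnorm(23, mean = 25, sd = 27)    # hypothetical post - pre improvements
sim_time_on_task <- runif(23, min = 5, max = 20)     # hypothetical minutes spent per report set
summary(lm(sim_improvement ~ sim_time_on_task))      # slope estimates the time-on-task effect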

Reports viewed out of the intended order

A potential confound is that there is no mechanism to ensure that the assignment is completed in the intended order. The final video could be viewed out of order, which would skew the pre-phase numbers. A mechanism to enforce the viewing order is needed.

4 Descriptive Statistics

pre <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
diff_response <- post - pre
summary(diff_response)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -20.00   10.00   20.00   24.78   40.00   70.00

Above is a statistical summary of the differences between the post-responses and pre-responses (post − pre) in our data. The standard deviation of the differences is shown below.

sd(diff_response)
## [1] 26.946

The summaries of the individual groups, pre-response and post-response, are shown below, starting with the pre-responses:

summary(pre)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   30.00   40.00   46.09   65.00   80.00
sd(pre)
## [1] 18.27545

The summary for the post-response is shown below:

summary(post)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.00   50.00   70.00   70.87   90.00  100.00
sd(post)
## [1] 21.93234

To check the normality of the difference data, we use the Shapiro-Wilk test. Since the obtained p-value (0.074) is greater than 0.05, we fail to reject the null hypothesis of normality and treat the differences as approximately normal.

shapiro.test(diff_response)
## 
##  Shapiro-Wilk normality test
## 
## data:  diff_response
## W = 0.92204, p-value = 0.07369

A histogram of the difference data is also shown below.

# Create a histogram
hist(diff_response,
     main = "Histogram of Response Differences",
     xlab = "Values",
     col = "darkgreen",
     border = "white")

Additionally, we produced normal probability (Q-Q) plots to check that the data are approximately normal.

# Create the Q-Q plot for the differences
qqnorm(diff_response, xlab = "Differences of Responses", main = "Normal Probability Dist. of Differences")
qqline(diff_response, col = "darkgreen", lwd = 2)

# Create the Q-Q plot for the pre-responses
qqnorm(pre, xlab = "Pre-Response", main = "Normal Probability Dist. of Pre-Responses")
qqline(pre, col = "blue", lwd = 2)

# Create the Q-Q plot for the post-responses
qqnorm(post, xlab = "Post-Response", main = "Normal Probability Dist. of Post-Responses")
qqline(post, col = "red", lwd = 2)

In all three normal probability plots the points fall approximately along the reference line, so the data appear approximately normal.

The box plot shows a clear difference between the pre- and post-responses, with a higher mean for the post-responses.

# Check variance graphically with side-by-side box plots
boxplot(pre, post,
        names = c("Pre-Responses", "Post-Responses"),  # Labels under each box
        main = "Comparison of Pre and Post Responses", # Main title
        col = c("darkblue", "red"),                    # Color for each box
        ylab = "Score")                                # Label for y-axis

5 Paired t-test

5.1 Null hypothesis:

đ»0:𝜇1=𝜇2 , There is no difference in the mean percentage of correct classifications before and after using the rubric.

5.2 Alternative hypothesis:

đ»đ‘Ž:𝜇1≠𝜇2 , There is difference in the mean percentage of correct classifications before and after using the rubric.

5.3 Testing the Hypothesis Using a Paired t-test

t.test(pre, post, alternative = "two.sided", paired = TRUE, conf.level = 0.95)
## 
##  Paired t-test
## 
## data:  pre and post
## t = -4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -36.43493 -13.13028
## sample estimates:
## mean difference 
##       -24.78261
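
As a quick sanity check on the reported statistic, the paired t value can be reproduced directly from the difference scores:

# Hand-check: t = mean(pre - post) / (sd(pre - post) / sqrt(n))
mean(pre - post) / (sd(pre - post) / sqrt(length(pre)))   # approximately -4.41, matching the output above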

6 Conclusion:

Since the p-value is less than 0.05, we reject the null hypothesis that there is no difference in mean performance. The interpretation below expounds on this conclusion.

7 Interpretation:

7.1 The Effectiveness of the Rubric

Based on the results of the paired t-test, the hypothesis that introducing the rubric improved the average accuracy of classifying proactive safety reports according to HRO hallmarks was supported (p < .05). This provides evidence that implementing a rubric positively influences participants’ ability to correctly classify proactive safety reports into the HRO hallmark categories.

The consistent increase in post-rubric performance across the majority of participants suggests that the rubric clarified ambiguities and helped participants make better decisions when classifying a proactive safety report.

7.2 Comparing the Measured Effect Size to the Expected Effect Size

The assumption prior to the analysis was that structured guidance would improve performance but not eliminate all classification errors, so a moderate effect size was anticipated.

The effect size from the study indicates that the rubric had a moderate to large impact on the mean performance of participants, as reflected in the average difference between pre-rubric and post-rubric scores. The actual effect size was therefore greater than anticipated, suggesting that the rubric provided more structured guidance and clarity than originally expected.
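
For reference, the observed standardized effect size can be computed directly from the data; one common choice for paired data is Cohen's d based on the difference scores (other definitions exist):

# Observed effect size for the paired differences: mean difference / SD of differences
d_obs <- mean(post - pre) / sd(post - pre)
d_obs   # approximately 0.92, larger than the d = 0.61 detectable at 80% power with n = 23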

7.3 Observed Power vs. Power Analysis

According to the initial power analysis, with 23 subjects the study had sufficient power (80%) to detect an effect size of about d = 0.61. Given the significant result and the magnitude of the observed effect, the observed power is likely greater than this planned level. The study design and sample size therefore appear sufficient to reliably demonstrate the effect the rubric had on improving classification accuracy.
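
A rough post-hoc check, using the observed effect size d_obs computed above (post-hoc power largely restates the p-value, so it is shown for illustration only):

# Power implied at n = 23 for the observed effect size d_obs
pwr.t.test(n = 23, d = d_obs, sig.level = 0.05, type = "paired",
           alternative = "two.sided")$power   # well above 0.80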

7.4 Practical vs. Statistical Significance

Statistical significance confirms that the observed improvement was unlikely to be due to chance, as indicated by the hypothesis test. Practical significance indicates that the improvement in classification accuracy is meaningful, with potential implications for safety management and organizational learning.

For example, the improved consistency in classifications may lead to:

  • Identification of systemic safety issues more accurately

  • Targeting specific corrective actions

  • Better alignment with HRO principles

In conclusion, the results can be considered both statistically significant and practically significant. Thus, the adoption of the rubric for use in opportunity reporting systems is supported both statistically and operationally.

8 Source Code

#IE5342 Project  
# Power Analysis


library(pwr)

power_values <- seq(0.10, 0.99, by = 0.01)    # power values array

effect_sizes <- array(data = 0, dim = length(power_values)) # all output d values

# For-loop 
for (i in 1:length(power_values)) {
  
  p <- power_values[i] # Extract the i-th power value
  
  result <- pwr.t.test(
    n = 23,
    power = p,
    sig.level = 0.05,
    type = "paired",
    alternative = "two.sided"
  )
  
  effect_sizes[i] <- result$d # store the effect size for this power level
}

# Plot the power curve
plot(power_values, effect_sizes,
     type = "l",
     xlab = "Power",
     ylab = "Effect Size",
     main = "Power Curve"
)

# Effect size needed for 80% power
pwr.t.test(
  n = 23,
  power = 0.80,
  sig.level = 0.05,
  type = "paired",
  alternative = "two.sided")

# Here is our data
pre <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)


# Descriptive Statistics

#Calculate the differences and do the summary statistics
diff_response <- post - pre
summary(diff_response)
sd(diff_response)

#Summary of Pre Data
summary(pre)
sd(pre)

#Summary of Post Data
summary(post)
sd(post)

# Use the Shapiro test to check for Normality of Differences
shapiro.test(diff_response)

# Create a histogram
hist(diff_response,
     main = "Histogram of Response Differences",
     xlab = "Values",
     col = "darkgreen",
     border = "white")


## NPP plots for the difference, pre, and post data
# Create the Q-Q plot for the differences
qqnorm(diff_response, xlab = "Differences of Responses", main = "Normal Probability Dist. of Differences")
qqline(diff_response, col = "darkgreen", lwd = 2)

# Create the Q-Q plot for the pre-responses
qqnorm(pre, xlab = "Pre-Response", main = "Normal Probability Dist. of Pre-Responses")
qqline(pre, col = "blue", lwd = 2)

# Create the Q-Q plot for the post-responses
qqnorm(post, xlab = "Post-Response", main = "Normal Probability Dist. of Post-Responses")
qqline(post, col = "red", lwd = 2)



# Check variance graphically with side-by-side box plots
boxplot(pre, post,
        names = c("Pre-Responses", "Post-Responses"),  # Labels under each box
        main = "Comparison of Pre and Post Responses", # Main title
        col = c("darkblue", "red"),                    # Color for each box
        ylab = "Score")                                # Label for y-axis


# Paired t-test

## Testing the hypothesis using a paired t-test
t.test(pre, post, alternative = "two.sided", paired = TRUE, conf.level = 0.95)
