Project Overview

This project examines whether a detailed classification rubric improves students' ability to sort proactive safety reports into the five hallmarks of a High Reliability Organization (HRO). Twenty-three students classified reports in two phases: first with only short descriptions of the hallmarks (pre-rubric), and then after studying a structured rubric with examples (post-rubric). For each student, the percentage of reports classified correctly in each phase was recorded, and the two phases were compared using a paired (within-subjects) design, a paired t-test, and a power analysis.

Design Choice

A paired (within-subjects) design is suitable here because the research question concerns how each student's performance changes once they are given the rubric, not how separate groups differ. Each student completes the classification task twice: once with the short hallmark definitions and once after studying the detailed rubric. This structure lets us compare each student's post-rubric accuracy with that student's own baseline, which matters particularly when students differ in reading level, language proficiency, and background knowledge of HRO concepts.

Advantages compared to an independent group design

The paired approach has two key benefits over an independent-groups design. First, it controls for stable individual differences: because each student acts as their own control, traits such as overall ability or comfort with reading incident reports affect both phases equally and therefore do not confound the before-and-after comparison. Second, it can improve statistical power with a relatively small sample, because the analysis is based on within-person change rather than on the variability between two different groups of students.

Potential Limitations or Confounds

Although a paired design suits the question, it has limitations. First, students may simply learn through practice: by the second phase they have already classified eight reports, so some of the improvement may reflect familiarity rather than the rubric. Second, if the two phases are completed in a single session, fatigue or shifts in attention over time can push post-rubric scores above or below what the rubric alone would produce. Third, after seeing the detailed rubric and examples, students may try to guess what the correct answer is expected to look like, which can raise agreement with the standard for strategic reasons rather than through better understanding. Finally, although assignment of the two report sets is randomized, the sets may differ slightly in difficulty, which can bias the magnitude of the pre-post difference.

Data Entry

subject<-c(1:23)
pre_rubric<-c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_rubric<-c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
dat<-data.frame(subject,pre_rubric,post_rubric)
dat$diff<-dat$post_rubric-dat$pre_rubric

Power Analysis

library(pwr)
d_test <- seq(0.1, 2.0, by = 0.05)

power_seq <- sapply(d_test, function(d)
  pwr.t.test(n        = 23,
             d        = d,
             sig.level = 0.05,
             type      = "paired",
             alternative = "two.sided")$power)


plot(d_test, power_seq, type = "o",
     xlab = "Effect size (Cohen's d)",
     ylab = "Statistical power",
     main = "Power curve for paired t-test (n = 23, alpha = 0.05)")
abline(h = 0.80, lty = 2, col = "red")

power<-pwr.t.test(n        = 23,
           d        = NULL,
           power    = 0.80,
           sig.level = 0.05,
           type      = "paired",
           alternative = "two.sided")
min_effect_size<-power$d
min_effect_size

## [1] 0.6112775

Observation from the Power Curve

Before interpreting the t-test, we checked the sensitivity of our study with n = 23. We computed the statistical power of a paired, two-sided test at alpha = 0.05 using pwr.t.test and plotted the power curve shown above. The dashed red line marks 80 percent power. The curve crosses this line at an effect size of about d = 0.61, which is the smallest effect our design can detect with 80 percent power.

Potential confounds or sources of bias in this design

Looking at the results, it is tempting to conclude that scores increased and therefore the rubric worked. However, in a real classroom setting other things happen at the same time as the rubric is introduced, and they may affect the pre-post difference. Some realistic confounds are described below from the perspective of what a student actually experiences.

Getting used to the task

In the first phase, students are still learning how the task works: how the reports are written, which information matters, and how the five hallmarks relate to real situations. By the second phase, they have already classified eight reports and are more comfortable with the style of the reports and the decision process. Even without the rubric, this additional practice could make them more accurate. In other words, some of the gain in post-rubric scores could simply reflect familiarity with the task, which would make the rubric appear more effective than it is.

Change in energy and focus over time

If both phases take place in the same session, students read and classify a total of 16 reports. They may be fresh and focused at the start; later, some may become tired, bored, or rushed, while others become more confident or efficient. If fatigue dominates in the second phase, it may lower post-rubric scores and hide part of the rubric's real benefit. If a warmed-up feeling dominates, it may raise scores and inflate the apparent benefit. In either case, changes in energy and attention can distort the estimated effect of the rubric.

Differences in report difficulty between the two sets

Although the order of the two sets of eight reports is randomized, one set may be somewhat easier: clearer narratives, stronger clues to the appropriate hallmark, or fewer ambiguous cases. If the easier set happens to be used more often in the post-rubric phase, students will score higher partly because of the easier reports and not solely because of the rubric, which exaggerates the effect. If the harder set is used more often after the rubric, the reverse occurs and the rubric's effect is partially masked. Randomization reduces this problem but cannot eliminate it with a small sample.

Descriptive Statistics

Mean_pre<-mean(dat$pre_rubric)
Mean_post<-mean(dat$post_rubric)
Std_diff<-sd(dat$diff)

The mean percentage correct before the rubric is approximately 46.1 percent, suggesting that students, as beginners, often could not match reports to the correct hallmark. After the rubric was introduced, the mean percentage correct increased to approximately 70.9 percent, an improvement of about 24.8 percentage points.

The standard deviation of the differences (post - pre) was approximately 26.9 percentage points. This tells us that, although most students improved, the amount of improvement varied considerably from student to student.
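The figures quoted above can be gathered into one small table. The following is a minimal sketch in base R that rebuilds the score vectors from the Data Entry section so it runs on its own; the name `summary_tab` and the added median column are illustrative and not part of the original analysis.

```r
# Scores copied from the Data Entry section
pre_rubric  <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_rubric <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
diff_scores <- post_rubric - pre_rubric

# One row per measure: mean, median and standard deviation
summary_tab <- data.frame(
  measure = c("pre_rubric", "post_rubric", "diff"),
  mean    = c(mean(pre_rubric), mean(post_rubric), mean(diff_scores)),
  median  = c(median(pre_rubric), median(post_rubric), median(diff_scores)),
  sd      = c(sd(pre_rubric), sd(post_rubric), sd(diff_scores))
)
print(summary_tab, digits = 3)
```

The `diff` row is the one that feeds the paired t-test, since the test works on the per-student differences rather than on the two columns separately.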

boxplot(dat$pre_rubric,dat$post_rubric,
        main = "Pre and post rubric percentage",
        names = c("Pre-rubric", "Post-rubric"),
        col = c("skyblue", "pink"))

The boxplot compares the percentage of correctly classified reports in the pre-rubric and post-rubric phases. The post-rubric scores are higher: the median rises from approximately 40 percent before the rubric to approximately 70 percent after it. The entire post-rubric box is shifted upward, although the scores still show some spread. This pattern indicates that most students classified reports more accurately once they had the rubric, although not all improved by the same amount.

qqnorm(dat$pre_rubric,main= "Normality plot of pre-rubric",xlab= "Pre-rubric")
qqline(dat$pre_rubric, col= "skyblue")

qqnorm(dat$post_rubric,main= "Normality plot of post-rubric",xlab= "Post-rubric")
qqline(dat$post_rubric, col= "pink")

hist(dat$pre_rubric,
     main = "Histogram plot of pre-rubric percentage",
     xlab = "Pre rubric percentage",
     col = "skyblue")

hist(dat$post_rubric,
     main = "Histogram plot of post-rubric percentage",
     xlab = "Post rubric percentage",
     col = "pink")

From the Q-Q plots, we can see that most points fall close to the straight reference line, with some bending only at the lowest and highest scores for both the pre- and post-rubric data. The histograms show that the pre-rubric scores mostly lie in the lower range, with only a few students above 60%, whereas the post-rubric scores shift clearly upward, with many students between 70% and 100%. Neither distribution is perfectly bell-shaped, but neither is extremely skewed.

hist(dat$diff,
     main = "Histogram of difference in percentage correct",
     xlab = "Difference between post and pre rubric",
     col = "yellow")

The histogram shows each student's difference in percentage correct (post - pre). Most of the bars lie to the right of zero, which means that most students scored higher after using the rubric than before. Three students showed no change and two scored 20 points lower, while most gains fall between 10 and 40 percentage points and four students gained 70 points. The plot indicates that the rubric generally improved classification accuracy, but the amount of improvement differs among students.
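What the histogram shows visually can also be tallied directly. The sketch below uses base R only and rebuilds the score vectors so it runs on its own; the names `diff_scores` and `direction` are illustrative and not part of the original analysis.

```r
# Scores copied from the Data Entry section
pre_rubric  <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_rubric <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
diff_scores <- post_rubric - pre_rubric

# How many students improved, stayed the same, or declined
direction <- c(improved  = sum(diff_scores > 0),
               unchanged = sum(diff_scores == 0),
               declined  = sum(diff_scores < 0))
direction

# Frequency of each distinct difference value
table(diff_scores)
```

With these data, 18 of the 23 students improved, 3 were unchanged, and 2 declined.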

Paired t-test

Null hypothesis: The rubric does not change the mean percent correct.

\(H_{0}: \mu_{pre} = \mu_{post}\)

Alternative hypothesis: The rubric changes the mean percent correct.

\(H_{a}: \mu _{pre}\neq\mu _{post}\)

where,
\(\mu_{pre}\) = the mean percentage correct before the rubric
\(\mu_{post}\) = the mean percentage correct after the rubric

Also, for the difference, we have:
D= post - pre

Null hypothesis:
\(H_{0}: \mu _{D} = 0\)

Alternative hypothesis:

\(H_{a}: \mu _{D} \neq 0\)

Normality of differences

qqnorm(dat$diff, main= "Normality plot of rubric percentage difference",
       xlab= "Rubric % difference")
qqline(dat$diff, col = "blue")

In the Q-Q plot of the difference scores (post - pre), most points lie close to the reference line, with only minor deviations at the lowest and highest values. This suggests that the difference scores are approximately normally distributed, so the normality assumption of the paired t-test is reasonable.

Paired t-test

We ran a paired t-test comparing post-rubric and pre-rubric percentages:

t.test(dat$post_rubric,dat$pre_rubric,
       alternative = "two.sided",
       paired = TRUE)

## 
##  Paired t-test
## 
## data:  dat$post_rubric and dat$pre_rubric
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  13.13028 36.43493
## sample estimates:
## mean difference 
##        24.78261

The test is based on the difference:
D = post_rubric - pre_rubric

The Report

test-statistic (t): 4.4108
degrees of freedom (df) : 22
p-value: 0.0002212
95% confidence interval: [13.13028, 36.43493]

This shows that the mean post-rubric score is about 24.78 percentage points higher than the mean pre-rubric score, and that the improvement is statistically significant at the 0.05 level.
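The reported values can be reproduced by hand from the difference scores, which makes the logic of the paired test explicit: it is a one-sample t-test on D = post - pre, with t = mean(D) / (sd(D) / sqrt(n)) and n - 1 degrees of freedom. The sketch below is a minimal check using base R only; the variable names (`diff_scores`, `se_d`, and so on) are illustrative and do not appear in the original analysis.

```r
# Scores copied from the Data Entry section
pre_rubric  <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_rubric <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
diff_scores <- post_rubric - pre_rubric

n      <- length(diff_scores)        # number of paired observations
mean_d <- mean(diff_scores)          # mean difference
se_d   <- sd(diff_scores) / sqrt(n)  # standard error of the mean difference
t_stat <- mean_d / se_d              # paired t statistic
df     <- n - 1                      # degrees of freedom

# Two-sided p-value and 95% confidence interval for the mean difference
p_value <- 2 * pt(-abs(t_stat), df)
ci_95   <- mean_d + c(-1, 1) * qt(0.975, df) * se_d

c(t = t_stat, df = df, p = p_value, lower = ci_95[1], upper = ci_95[2])
```

These values should agree with the `t.test` output above up to rounding.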

Paired t-test effect size:

d_paired<- mean(dat$diff)/Std_diff
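The line above stores the effect size without printing it. As a minimal, self-contained sketch, the value can be displayed and cross-checked against the t statistic, because for a paired design this form of Cohen's d (mean difference divided by the standard deviation of the differences, sometimes written d_z) equals t / sqrt(n). The names `diff_scores` and `d_from_t` are illustrative and not part of the original analysis.

```r
# Scores copied from the Data Entry section
pre_rubric  <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_rubric <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
diff_scores <- post_rubric - pre_rubric

# Effect size as defined in the report: mean difference / SD of differences
d_paired <- mean(diff_scores) / sd(diff_scores)

# Same quantity recovered from the paired t statistic
t_stat   <- unname(t.test(post_rubric, pre_rubric, paired = TRUE)$statistic)
d_from_t <- t_stat / sqrt(length(diff_scores))

round(c(d_paired = d_paired, d_from_t = d_from_t), 3)
```

Both routes give d of about 0.92, which is the value interpreted in the next section.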

Interpretation

Overall conclusion about the rubric

The rubric appears to be effective. Mean accuracy improved from approximately 46.09% before the rubric to 70.87% after it, and the difference is statistically significant (t(22) = 4.41, p = 0.0002). This means that students classified more reports correctly when they had the rubric.

Effect size and comparison with expectations

The pre-post change corresponds to a large effect size (Cohen's d ≈ 0.92, computed as the mean difference divided by the standard deviation of the differences). Our power analysis identified d ≈ 0.61 as the smallest effect the design could detect with 80% power. The observed effect is therefore larger than this minimum detectable effect, which indicates that the rubric had a stronger influence than the smallest effect we were positioned to detect.

Power relative to the power analysis

Our power analysis with n = 23 and alpha = 0.05 indicated that the design could detect an effect of at least d ≈ 0.61 with approximately 80% power. Since the observed effect size (d ≈ 0.92) is larger than that, the power to detect an effect of the observed size exceeds 80%, which is consistent with the very small p-value.
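The claim that power exceeds 80% at the observed effect size can be checked numerically. The sketch below swaps in base R's `power.t.test` (from the stats package) for `pwr.t.test` so that it runs without extra packages; setting `sd = 1` makes `delta` a standardized effect size. The names `d_obs`, `min_d`, and `obs_power` are illustrative, and power computed from an observed effect should be read as a consistency check rather than as independent evidence.

```r
# Scores copied from the Data Entry section
pre_rubric  <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_rubric <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
diff_scores <- post_rubric - pre_rubric
d_obs <- mean(diff_scores) / sd(diff_scores)  # observed standardized effect

# Smallest standardized effect detectable with 80% power (n = 23 pairs)
min_d <- power.t.test(n = 23, power = 0.80, sd = 1, sig.level = 0.05,
                      type = "paired", alternative = "two.sided")$delta

# Power of the same design if the true effect equalled the observed one
obs_power <- power.t.test(n = 23, delta = d_obs, sd = 1, sig.level = 0.05,
                          type = "paired", alternative = "two.sided")$power

round(c(min_d = min_d, d_obs = d_obs, obs_power = obs_power), 3)
```

The minimum detectable effect should match the 0.61 reported earlier, and the power at the observed effect should be well above the 0.80 reference line on the power curve.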

Practical versus statistical significance

The result is both statistically significant and meaningful in practice. A gain of about 24.78 percentage points in accuracy is substantial for students who are new to this subject. Nevertheless, the magnitude of the effect should be interpreted with some caution, since practice with the task, changes in attention, and expectations formed after viewing the rubric could also have contributed to the improvement.

Complete R-code

subject<-c(1:23)
pre_rubric<-c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_rubric<-c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
dat<-data.frame(subject,pre_rubric,post_rubric)
dat$diff<-dat$post_rubric-dat$pre_rubric

library(pwr)
d_test <- seq(0.1, 2.0, by = 0.05)

power_seq <- sapply(d_test, function(d)
  pwr.t.test(n        = 23,
             d        = d,
             sig.level = 0.05,
             type      = "paired",
             alternative = "two.sided")$power)


plot(d_test, power_seq, type = "o",
     xlab = "Effect size (Cohen's d)",
     ylab = "Statistical power",
     main = "Power curve for paired t-test (n = 23, alpha = 0.05)")
abline(h = 0.80, lty = 2, col = "red")
power<-pwr.t.test(n        = 23,
           d        = NULL,
           power    = 0.80,
           sig.level = 0.05,
           type      = "paired",
           alternative = "two.sided")
min_effect_size<-power$d
min_effect_size
Mean_pre<-mean(dat$pre_rubric)
Mean_post<-mean(dat$post_rubric)
Std_diff<-sd(dat$diff)

boxplot(dat$pre_rubric,dat$post_rubric,
        main = "Pre and post rubric percentage",
        names = c("Pre-rubric", "Post-rubric"),
        col = c("skyblue", "pink"))

qqnorm(dat$pre_rubric,main= "Normality plot of pre-rubric",xlab= "Pre-rubric")
qqline(dat$pre_rubric, col= "skyblue")

qqnorm(dat$post_rubric,main= "Normality plot of post-rubric",xlab= "Post-rubric")
qqline(dat$post_rubric, col= "pink")


hist(dat$pre_rubric,
     main = "Histogram plot of pre-rubric percentage",
     xlab = "Pre rubric percentage",
     col = "skyblue")

hist(dat$post_rubric,
     main = "Histogram plot of post-rubric percentage",
     xlab = "Post rubric percentage",
     col = "pink")

hist(dat$diff,
     main = "Histogram of difference in percentage correct",
     xlab = "Difference between post and pre rubric",
     col = "yellow")

qqnorm(dat$diff, main= "Normality plot of rubric percentage difference",
       xlab= "Rubric % difference")

qqline(dat$diff, col = "blue")

t.test(dat$post_rubric,dat$pre_rubric,
       alternative = "two.sided",
       paired = TRUE)

“Project Group 7” - Classification of Proactive Safety Reports by the High Reliability Organization (HRO) Hallmarks

Prateeksha Singh, Manas Verma, Samuel Martinez

2025-11-29