Introduction:

We analyze students' quiz marks before and after the rubric was introduced. By comparing each student's pre-rubric and post-rubric scores, we examine what changes occurred and answer the project questions accordingly.

Design Setup:

library(pwr)

pre_scores  <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_scores <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)

quiz_data <- data.frame( Pre  = pre_scores, Post = post_scores)

quiz_data$Diff <- quiz_data$Post - quiz_data$Pre



n_actual <- nrow(quiz_data)   # actual sample size = 23
n_actual             
## [1] 23

Design Choice:

Appropriateness of a Paired Design

A paired (within-subjects) design is the most appropriate choice for this study because the goal is to examine how each individual student’s accuracy changes after using the rubric, rather than to compare performance between two separate groups. Since students differ substantially in prior knowledge, reading ability, confidence, and speed of understanding the hallmarks, an independent-groups design would introduce unnecessary variability. By pairing each student’s post-rubric score with their own pre-rubric score, we control for these individual differences and isolate the effect of the rubric itself. This approach is especially suitable for small class sizes, where creating two large, comparable groups is neither practical nor statistically efficient. The paired design therefore provides stronger internal validity and clearer interpretation of the rubric’s impact.

Advantages Over an Independent-Groups Design

A paired design offers several advantages over an independent-groups design. Because each student serves as their own control, individual differences in reading skills, comprehension speed, and prior knowledge are held constant, resulting in more precise estimates of the rubric’s effect. It also works well with small class sizes, since it does not require forming two comparable groups. Additionally, paired designs provide higher statistical power because within-person variability is smaller than variability between different students. Overall, this approach gives a cleaner and more direct measurement of improvement attributable to the rubric.
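A minimal sketch of this power advantage, using purely hypothetical numbers (a between-condition effect of d = 0.5 and an assumed pre/post correlation of 0.6, neither estimated from this study): because the standardized effect on the difference scores is d divided by the square root of 2(1 − rho), the paired design reaches 80% power with far fewer students than an independent-groups design.

library(pwr)

# Hypothetical illustration (values assumed, not estimated from this study)
d_between <- 0.5   # standardized difference between conditions
rho       <- 0.6   # assumed correlation between a student's pre and post scores

# Effect size on the difference scores: d_z = d / sqrt(2 * (1 - rho))
d_paired <- d_between / sqrt(2 * (1 - rho))

# Pairs needed for 80% power with the paired design
pwr.t.test(d = d_paired, power = 0.80, sig.level = 0.05, type = "paired")$n

# Students needed per group for 80% power with an independent-groups design
pwr.t.test(d = d_between, power = 0.80, sig.level = 0.05, type = "two.sample")$n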

Potential Limitations or Confounds

Despite its strengths, a paired design may still be influenced by practice, fatigue, or expectations since students complete both phases in close sequence. Any improvement may partly reflect increased familiarity with the task, while decreased performance could reflect tiredness. Because both conditions occur for the same individuals, these effects cannot be completely separated. Practice effects may lead students to improve simply because they have seen the task once before. Fatigue could lower accuracy in the second phase, while demand characteristics might cause students to adjust their answers to match what they think the instructor wants. Although report sets were randomized, order effects may still overlap with the true impact of the rubric.

Power Analysis

library(pwr)

# Planned sample size (approximate class size)
n_power <- 40   
# power curve across many d values
# Sequence of effect sizes for a smooth curve
d_seq <- seq(0.1, 1.2, by = 0.01)
# Power for each effect size
power_seq <- sapply(d_seq, function(d)
  pwr.t.test(n = n_power,
             d = d,
             sig.level = 0.05,
             type = "paired",
             alternative = "two.sided")$power
)


#Minimum effect size for 80% power

min_d_80 <- pwr.t.test(n        = n_power,
                       power    = 0.80,
                       sig.level = 0.05,
                       type      = "paired",
                       alternative = "two.sided")

min_d <- min_d_80$d  # numeric value for plotting

# Plot:  curve with benchmarks of 80% line
d_vals <- c(0.2, 0.5, 0.8, 1.0)
power_vals <- sapply(d_vals, function(d)
  pwr.t.test(n = n_power,
             d = d,
             sig.level = 0.05,
             type = "paired",
             alternative = "two.sided")$power
)

plot(d_seq, power_seq,
     type = "l",
     lwd  = 2,
     xlab = "Effect Size (Cohen's d)",
     ylab = "Power",
     main = "Power Curve for Paired t-test (n = 40, alpha = 0.05)")

abline(h = 0.80, lty = 2, col = "red")

abline(v = min_d, lty = 3, col = "blue")

points(d_vals, power_vals,
       pch = 16,
       col = "darkgreen")

text(x = min_d, y = 0.1,
     labels = paste0("d ≈ ", round(min_d, 2)),
     pos = 4,
     col = "blue")

grid()

Using the planned sample size of ~40 students, we computed statistical power for several effect sizes. Power was low for small effects (d = 0.20), moderate-to-high for medium effects (d = 0.50), and very high for large effects (d ≥ 0.80). Solving for the threshold, the minimum effect size detectable with 80% power is approximately Cohen’s d ≈ 0.44. Thus, with n ≈ 40, the study would be adequately powered to detect moderate improvements in quiz performance.
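For reference, the benchmark values behind this summary can be printed directly from the d_vals and power_vals vectors computed above (run the line to see power at d = 0.2, 0.5, 0.8, and 1.0):

# Benchmark power values at n = 40, labeled by effect size
round(setNames(power_vals, paste0("d = ", d_vals)), 2)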

library(pwr)

# Actual analyzed sample size
n_actual <- 23   

# Power for benchmark effect sizes with n = 23
p_small_23  <- pwr.t.test(n = n_actual, d = 0.2,
                          sig.level = 0.05,
                          type = "paired",
                          alternative = "two.sided")
p_medium_23 <- pwr.t.test(n = n_actual, d = 0.5,
                          sig.level = 0.05,
                          type = "paired",
                          alternative = "two.sided")
p_large_23  <- pwr.t.test(n = n_actual, d = 0.8,
                          sig.level = 0.05,
                          type = "paired",
                          alternative = "two.sided")

p_small_23
## 
##      Paired t test power calculation 
## 
##               n = 23
##               d = 0.2
##       sig.level = 0.05
##           power = 0.1506641
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*
p_medium_23
## 
##      Paired t test power calculation 
## 
##               n = 23
##               d = 0.5
##       sig.level = 0.05
##           power = 0.6302205
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*
p_large_23
## 
##      Paired t test power calculation 
## 
##               n = 23
##               d = 0.8
##       sig.level = 0.05
##           power = 0.9558497
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*
# Minimum effect size detectable with 80% power (n = 23)
min_d_80_23 <- pwr.t.test(n        = n_actual,
                          power    = 0.80,
                          sig.level = 0.05,
                          type      = "paired",
                          alternative = "two.sided")

min_d_80_23$d
## [1] 0.6112775

Although the planned power analysis assumed a sample size of 40 students (minimum detectable effect d ≈ 0.44 at 80% power), only 23 students provided complete pre–post data. A follow-up power analysis with n = 23 shows that the study is less sensitive: with 80% power it can only detect effects of roughly d ≈ 0.61 or larger.

Potential Confounds:

- Different quiz questions: the pre- and post-rubric quizzes use different items, so differences in question difficulty may over- or under-estimate the rubric’s effect.

- Learning/practice: students may improve because of studying or simply from having seen the task once before, so the gain may not be caused by the rubric alone.

- Motivation and testing conditions: stress, fatigue, motivation, or time of day may differ between sessions, producing performance changes unrelated to the rubric.

Descriptive Statistics:

mean_pre  <- mean(quiz_data$Pre)
mean_post <- mean(quiz_data$Post)
sd_diff   <- sd(quiz_data$Diff)

mean_pre
## [1] 46.08696
mean_post
## [1] 70.86957
sd_diff
## [1] 26.946
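The mean improvement quoted below is the mean of the difference scores; it can be obtained by adding one line to the chunk above (using the Diff column already in quiz_data):

mean_diff <- mean(quiz_data$Diff)   # mean improvement (Post - Pre)
mean_diff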

- Mean percentage correct before the rubric: ≈ 46.1%

- Mean percentage correct after the rubric: ≈ 70.9%

- Mean difference (Post − Pre): ≈ 24.8 percentage points

- Standard deviation of differences: ≈ 26.9 points

Visualizations

# Visualization Set for Paired t-test Report

n_actual   <- nrow(quiz_data)
mean_pre   <- mean(quiz_data$Pre)
mean_post  <- mean(quiz_data$Post)

#Normality Check — Using Differences

qqnorm(quiz_data$Diff, main = "Normal Q-Q Plot of Differences (Post - Pre)", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(quiz_data$Diff, col = "blue")

#Q-Q plots of Pre and Post separately

qqnorm(quiz_data$Pre, main = "Normal Q-Q Plot of Pre-rubric Scores", xlab = "Pre-rubric")
qqline(quiz_data$Pre, col = "blue")

qqnorm(quiz_data$Post, main = "Normal Q-Q Plot of Post-rubric Scores", xlab = "Post-rubric")
qqline(quiz_data$Post, col = "green")

#Histograms of Pre and Post Scores


hist(quiz_data$Pre, main = "Histogram of Pre-rubric Percentage", xlab = "Pre-rubric Percentage", col  = "blue", breaks = 10)

hist(quiz_data$Post, main = "Histogram of Post-rubric Percentage", xlab = "Post-rubric Percentage", col  = "green", breaks = 10)

#Histogram of Difference Scores


hist(quiz_data$Diff,
     main = "Histogram of Score Improvements (Post - Pre)",
     xlab = "Improvement (Points)",
     col  = "gray",
     breaks = 10)

#Boxplot: Pre vs Post

boxplot(quiz_data$Pre, quiz_data$Post,
        names = c("Pre", "Post"),
        ylab  = "Score (%)",
        main  = "Boxplot of Pre vs Post Quiz Scores",
        col   = c("blue", "green"))

#Barplot: Mean Pre vs Post

mean_values <- c(mean_pre, mean_post)

barplot(mean_values,names = c("Pre", "Post"), ylab  = "Mean Score (%)", main  = "Mean Quiz Scores Before and After Rubric",  col   = c("blue", "green"))

#Scatter Plot: Pre vs Post

plot(quiz_data$Pre, quiz_data$Post,
     main = "Scatter Plot: Pre vs Post Quiz Scores",
     xlab = "Pre Score (%)",
     ylab = "Post Score (%)",
     pch  = 16)
abline(0, 1, col = "red", lty = 2)  # Reference line: Post = Pre

#Individual Line Plot (Spaghetti Plot)

plot(c(1,2), range(c(quiz_data$Pre, quiz_data$Post)),  type = "n", xaxt = "n", xlab = "", ylab = "Score (%)", main = "Individual Score Changes (Pre → Post)")
axis(1, at = c(1,2), labels = c("Pre","Post"))

#Draw change lines for each student
segments(1, quiz_data$Pre,
         2, quiz_data$Post,
         col = "gray")

# Add points
points(rep(1, n_actual), quiz_data$Pre,  pch = 16)
points(rep(2, n_actual), quiz_data$Post, pch = 16)

Paired t-test:

State null and alternative hypotheses;

Null hypothesis: The rubric does not change the mean percent correct.

\(H_{o}: \mu_{pre}= \mu_{post}\)

Alternative hypothesis: The rubric changes the mean percent correct.

\(H_{a}: \mu_{pre}\neq\mu _{post}\)

where,

\(\mu_{pre}\) = the mean percentage correct before the rubric

\(\mu_{post}\) = the mean percentage correct after the rubric

Also, for the difference, we have;

D = post - pre

Null hypothesis:

\(H_{o}: \mu _{D} = 0\)

Alternative hypothesis:

\(H_{a}: \mu _{D} \neq 0\)
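Plugging the descriptive statistics above into the paired test statistic gives a worked check of the result reported below (using \(\bar{D} \approx 24.78\), \(s_D \approx 26.95\), and \(n = 23\)):

\[
t = \frac{\bar{D}}{s_D/\sqrt{n}} = \frac{24.78}{26.95/\sqrt{23}} \approx 4.41, \qquad df = n - 1 = 22
\]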

Check assumptions (normality of differences)

shapiro.test(quiz_data$Diff)
## 
##  Shapiro-Wilk normality test
## 
## data:  quiz_data$Diff
## W = 0.92204, p-value = 0.07369
qqnorm(quiz_data$Diff, main="Q-Q Plot of Score Differences")
qqline(quiz_data$Diff)

Since the Shapiro–Wilk p-value (0.074) is greater than 0.05 and the Q-Q plot is reasonably straight, the normality assumption for the differences is acceptable for a paired t-test.

Paired t-test in R

t_res <- t.test(quiz_data$Post, quiz_data$Pre, paired=TRUE, alternative="two.sided")

t_res
## 
##  Paired t-test
## 
## data:  quiz_data$Post and quiz_data$Pre
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  13.13028 36.43493
## sample estimates:
## mean difference 
##        24.78261

Test Statistic, Degrees of Freedom, and Confidence Interval

- test-statistic (t): 4.4108

- degrees of freedom (df) : 22

- p-value: 0.0002212

- 95% confidence interval: [13.13028, 36.43493]

This shows that the mean post-rubric score is about 24.8 percentage points higher than the mean pre-rubric score, and the improvement is statistically significant.
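As a sanity check, the reported interval matches the usual formula \(\bar{D} \pm t_{0.975,\,22} \cdot s_D/\sqrt{n}\), with \(t_{0.975,\,22} \approx 2.074\):

\[
24.78 \pm 2.074 \times \frac{26.95}{\sqrt{23}} \approx 24.78 \pm 11.65 = [13.13,\ 36.43]
\]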

Interpretation:

Evaluating the Effectiveness of the Rubric

The paired t-test shows a statistically significant increase in quiz scores after the rubric (p < 0.001). On average, students improved by about 25 percentage points. This suggests the rubric is associated with a substantial improvement in performance.

mean_diff  <- mean(quiz_data$Diff)
sd_diff    <- sd(quiz_data$Diff)
d_observed <- mean_diff / sd_diff
d_observed
## [1] 0.9197138

Comparison of Effect Sizes

The observed Cohen’s d ≈ 0.92, which is a large effect.

If we had predicted a medium effect (d ≈ 0.50), the observed effect is larger than expected.
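A complementary check (sketch only; output not shown) solves for how many pairs would be needed to reach 80% power if the true effect were as large as the observed one:

# Pairs needed for 80% power at the observed effect size
pwr.t.test(d = d_observed, power = 0.80, sig.level = 0.05,
           type = "paired", alternative = "two.sided")$n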

pwr_observed <- pwr.t.test(n=n_actual,d=d_observed, sig.level=0.05, type="paired")
pwr_observed
## 
##      Paired t test power calculation 
## 
##               n = 23
##               d = 0.9197138
##       sig.level = 0.05
##           power = 0.9878144
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*

Evaluating Observed Power Relative to Power Analysis

With n = 23 and d ≈ 0.92, the observed power is very high (≈ 0.99). This is well above the 80% target used in the planning power analysis with n = 40.

Practical Significance vs. Statistical Significance

Statistical significance:

The p-value is far below 0.05, so we reject \(H_{o}\) and conclude that the rubric is associated with higher scores.

Practical significance:

An average improvement of ~25 percentage points is large and meaningful in an educational context (can change a failing grade to a passing grade).

Therefore, the rubric appears to be both statistically and practically significant, although design confounds (quiz difficulty, learning over time, motivation) mean we should be cautious about claiming causality.

COMPLETE CODE

library(pwr)

pre_scores  <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_scores <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)

quiz_data <- data.frame( Pre  = pre_scores, Post = post_scores)

quiz_data$Diff <- quiz_data$Post - quiz_data$Pre



n_actual <- nrow(quiz_data)   # actual sample size = 23
n_actual 

library(pwr)

# Planned sample size (approximate class size)
n_power <- 40   
# power curve across many d values
# Sequence of effect sizes for a smooth curve
d_seq <- seq(0.1, 1.2, by = 0.01)
# Power for each effect size
power_seq <- sapply(d_seq, function(d)
  pwr.t.test(n = n_power,
             d = d,
             sig.level = 0.05,
             type = "paired",
             alternative = "two.sided")$power
)


#Minimum effect size for 80% power

min_d_80 <- pwr.t.test(n        = n_power,
                       power    = 0.80,
                       sig.level = 0.05,
                       type      = "paired",
                       alternative = "two.sided")

min_d <- min_d_80$d  # numeric value for plotting

# Plot:  curve with benchmarks of 80% line
d_vals <- c(0.2, 0.5, 0.8, 1.0)
power_vals <- sapply(d_vals, function(d)
  pwr.t.test(n = n_power,
             d = d,
             sig.level = 0.05,
             type = "paired",
             alternative = "two.sided")$power
)

plot(d_seq, power_seq,
     type = "l",
     lwd  = 2,
     xlab = "Effect Size (Cohen's d)",
     ylab = "Power",
     main = "Power Curve for Paired t-test (n = 40, alpha = 0.05)")

abline(h = 0.80, lty = 2, col = "red")

abline(v = min_d, lty = 3, col = "blue")

points(d_vals, power_vals,
       pch = 16,
       col = "darkgreen")

text(x = min_d, y = 0.1,
     labels = paste0("d ≈ ", round(min_d, 2)),
     pos = 4,
     col = "blue")

grid()


mean_pre  <- mean(quiz_data$Pre)
mean_post <- mean(quiz_data$Post)
sd_diff   <- sd(quiz_data$Diff)

mean_pre
mean_post
sd_diff

# Visualization Set for Paired t-test Report

n_actual   <- nrow(quiz_data)
mean_pre   <- mean(quiz_data$Pre)
mean_post  <- mean(quiz_data$Post)

#Normality Check — Using Differences

qqnorm(quiz_data$Diff, main = "Normal Q-Q Plot of Differences (Post - Pre)", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(quiz_data$Diff, col = "blue")

#Q-Q plots of Pre and Post separately

qqnorm(quiz_data$Pre, main = "Normal Q-Q Plot of Pre-rubric Scores", xlab = "Pre-rubric")
qqline(quiz_data$Pre, col = "blue")

qqnorm(quiz_data$Post, main = "Normal Q-Q Plot of Post-rubric Scores", xlab = "Post-rubric")
qqline(quiz_data$Post, col = "green")

#Histograms of Pre and Post Scores


hist(quiz_data$Pre, main = "Histogram of Pre-rubric Percentage", xlab = "Pre-rubric Percentage", col  = "blue", breaks = 10)

hist(quiz_data$Post, main = "Histogram of Post-rubric Percentage", xlab = "Post-rubric Percentage", col  = "green", breaks = 10)


#Histogram of Difference Scores


hist(quiz_data$Diff,
     main = "Histogram of Score Improvements (Post - Pre)",
     xlab = "Improvement (Points)",
     col  = "gray",
     breaks = 10)


#Boxplot: Pre vs Post

boxplot(quiz_data$Pre, quiz_data$Post,
        names = c("Pre", "Post"),
        ylab  = "Score (%)",
        main  = "Boxplot of Pre vs Post Quiz Scores",
        col   = c("blue", "green"))

#Barplot: Mean Pre vs Post

mean_values <- c(mean_pre, mean_post)

barplot(mean_values,names = c("Pre", "Post"), ylab  = "Mean Score (%)", main  = "Mean Quiz Scores Before and After Rubric",  col   = c("blue", "green"))


#Scatter Plot: Pre vs Post

plot(quiz_data$Pre, quiz_data$Post,
     main = "Scatter Plot: Pre vs Post Quiz Scores",
     xlab = "Pre Score (%)",
     ylab = "Post Score (%)",
     pch  = 16)
abline(0, 1, col = "red", lty = 2)  # Reference line: Post = Pre

#Individual Line Plot (Spaghetti Plot)

plot(c(1,2), range(c(quiz_data$Pre, quiz_data$Post)),  type = "n", xaxt = "n", xlab = "", ylab = "Score (%)", main = "Individual Score Changes (Pre → Post)")
axis(1, at = c(1,2), labels = c("Pre","Post"))

#Draw change lines for each student
segments(1, quiz_data$Pre,
         2, quiz_data$Post,
         col = "gray")

# Add points
points(rep(1, n_actual), quiz_data$Pre,  pch = 16)
points(rep(2, n_actual), quiz_data$Post, pch = 16)

shapiro.test(quiz_data$Diff)

qqnorm(quiz_data$Diff, main="Q-Q Plot of Score Differences")
qqline(quiz_data$Diff)


t_res <- t.test(quiz_data$Post, quiz_data$Pre, paired=TRUE, alternative="two.sided")

t_res

mean_diff  <- mean(quiz_data$Diff)
sd_diff    <- sd(quiz_data$Diff)
d_observed <- mean_diff / sd_diff
d_observed

pwr_observed <- pwr.t.test(n=n_actual,d=d_observed, sig.level=0.05, type="paired")
pwr_observed