Introduction:
We compare students' quiz scores before and after they were given a
grading rubric, analyze how the scores changed, and answer the project
questions based on that analysis.
Design Setup:
library(pwr)
pre_scores <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_scores <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
quiz_data <- data.frame( Pre = pre_scores, Post = post_scores)
quiz_data$Diff <- quiz_data$Post - quiz_data$Pre
n_actual <- nrow(quiz_data) # actual sample size = 23
n_actual
## [1] 23
Design Choice:
Appropriateness of a Paired Design
A paired (within-subjects) design is the most appropriate choice
for this study because the goal is to examine how each individual
student’s accuracy changes after using the rubric, rather than to
compare performance between two separate groups. Since students differ
substantially in prior knowledge, reading ability, confidence, and speed
of understanding the hallmarks, an independent-groups design would
introduce unnecessary variability. By pairing each student’s post-rubric
score with their own pre-rubric score, we control for these individual
differences and isolate the effect of the rubric itself. This approach
is especially suitable for small class sizes, where creating two large,
comparable groups is neither practical nor statistically efficient. The
paired design therefore provides stronger internal validity and clearer
interpretation of the rubric’s impact.
Advantages Over an Independent-Groups Design
A paired design offers several advantages over an
independent-groups design. Because each student serves as their own
control, individual differences in reading skills, comprehension speed,
and prior knowledge are held constant, resulting in more precise
estimates of the rubric’s effect. It also works well with small class
sizes, since it does not require forming two comparable groups.
Additionally, paired designs provide higher statistical power because
within-person variability is smaller than variability between different
students. Overall, this approach gives a cleaner and more direct
measurement of improvement attributable to the rubric.
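To make the power advantage concrete, the sketch below compares paired
and independent-groups power under assumed values: a between-groups
effect of d = 0.5 and a pre-post correlation of r = 0.6, both
hypothetical rather than taken from our data.
# Sketch (assumed values): pairing converts a between-groups effect size
# d_b into a larger paired effect size d_p = d_b / sqrt(2 * (1 - r)),
# where r is the assumed correlation between pre and post scores.
library(pwr)
d_between <- 0.5  # hypothetical medium between-groups effect
r <- 0.6          # hypothetical pre-post correlation
d_paired <- d_between / sqrt(2 * (1 - r))
pwr.t.test(n = 23, d = d_between, sig.level = 0.05, type = "two.sample")$power  # 23 per group
pwr.t.test(n = 23, d = d_paired, sig.level = 0.05, type = "paired")$power      # 23 pairs
Under these assumptions the paired test is substantially more powerful,
even though the independent-groups version would need twice as many
students (23 per group).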
Potential Limitations or Confounds
Despite its strengths, a paired design may still be influenced by
practice, fatigue, or expectations, since students complete both phases
in close sequence. Practice effects may lead students to improve simply
because they have seen the task once before; fatigue could lower
accuracy in the second phase; and demand characteristics might cause
students to adjust their answers to match what they think the
instructor wants. Because both conditions occur for the same
individuals, these effects cannot be completely separated from the
effect of the rubric. Although report sets were randomized, order
effects may still overlap with the true impact of the rubric.
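As an illustration of the randomization mentioned above, the sketch
below shows one way to counterbalance which report set each student
sees first; the set labels "A" and "B" and the assignment scheme are
hypothetical, not taken from the actual study.
# Hypothetical sketch: counterbalance the order of report sets so that
# roughly half the class sees each set first (labels are made up).
set.seed(123)  # reproducible assignment
students <- paste0("S", 1:23)
first_set <- sample(rep(c("A", "B"), length.out = 23))
assignment <- data.frame(Student = students, FirstSet = first_set)
head(assignment)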
Power Analysis
library(pwr)
# Planned sample size (approximate class size)
n_power <- 40
# Sequence of effect sizes for a smooth power curve
d_seq <- seq(0.1, 1.2, by = 0.01)
# Power for each effect size
power_seq <- sapply(d_seq, function(d)
  pwr.t.test(n = n_power,
             d = d,
             sig.level = 0.05,
             type = "paired",
             alternative = "two.sided")$power
)
# Minimum effect size for 80% power
min_d_80 <- pwr.t.test(n = n_power,
                       power = 0.80,
                       sig.level = 0.05,
                       type = "paired",
                       alternative = "two.sided")
min_d <- min_d_80$d  # numeric value for plotting
# Plot: power curve with 80% benchmark line
d_vals <- c(0.2, 0.5, 0.8, 1.0)
power_vals <- sapply(d_vals, function(d)
  pwr.t.test(n = n_power,
             d = d,
             sig.level = 0.05,
             type = "paired",
             alternative = "two.sided")$power
)
plot(d_seq, power_seq,
     type = "l",
     lwd = 2,
     xlab = "Effect Size (Cohen's d)",
     ylab = "Power",
     main = "Power Curve for Paired t-test (n = 40, alpha = 0.05)")
abline(h = 0.80, lty = 2, col = "red")
abline(v = min_d, lty = 3, col = "blue")
points(d_vals, power_vals,
       pch = 16,
       col = "darkgreen")
text(x = min_d, y = 0.1,
     labels = paste0("d ≈ ", round(min_d, 2)),
     pos = 4,
     col = "blue")
grid()

Although the planned power analysis assumed a sample size of 40
students (minimum detectable effect d ≈ 0.45 at 80% power), only 23
students provided complete pre–post data. A follow-up power analysis
with n = 23 shows that the study is less sensitive and can only detect
relatively larger effects with 80% power.
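A minimal sketch of that follow-up calculation, reusing n_actual from
the design setup (pwr solves for d using the noncentral t
distribution; the result is roughly d ≈ 0.61):
# Minimum detectable effect with the actual sample size (23 pairs)
min_d_actual <- pwr.t.test(n = n_actual,
                           power = 0.80,
                           sig.level = 0.05,
                           type = "paired",
                           alternative = "two.sided")
min_d_actual$d  # roughly 0.61, versus about 0.45 with the planned n = 40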
Potential Confounds:
- Different questions: the pre and post quizzes may differ in
difficulty, which can over- or underestimate the rubric effect.
- Learning/practice: students may improve because of studying, so the
improvement may not be caused by the rubric alone.
- Motivation/testing conditions: stress, fatigue, motivation, or time
of day may differ between sessions, producing performance changes
unrelated to the rubric.
Descriptive Statistics:
mean_pre <- mean(quiz_data$Pre)
mean_post <- mean(quiz_data$Post)
sd_diff <- sd(quiz_data$Diff)
mean_pre
## [1] 46.08696
mean_post
## [1] 70.86957
sd_diff
## [1] 26.946
- Mean percentage correct before the rubric: ≈ 46.1%
- Mean percentage correct after the rubric: ≈ 70.9%
- Mean difference: ≈ 24.8 percentage points
- Standard deviation of the differences: ≈ 26.9 points
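As a quick sanity check, these summary statistics already determine
the 95% confidence interval for the mean improvement; the hand
computation below should match the t.test() output reported later.
# 95% CI for the mean difference: mean(Diff) ± t* × SD(Diff) / sqrt(n)
mean_diff <- mean(quiz_data$Diff)
se_diff <- sd(quiz_data$Diff) / sqrt(n_actual)
t_crit <- qt(0.975, df = n_actual - 1)  # critical t with df = 22
mean_diff + c(-1, 1) * t_crit * se_diff  # approx [13.13, 36.43]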
Visualizations
# Visualization Set for Paired t-test Report
n_actual <- nrow(quiz_data)
mean_pre <- mean(quiz_data$Pre)
mean_post <- mean(quiz_data$Post)
# Normality check using the differences
qqnorm(quiz_data$Diff, main = "Normal Q-Q Plot of Differences (Post - Pre)", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(quiz_data$Diff, col = "blue")

# Q-Q plots of Pre and Post separately
qqnorm(quiz_data$Pre, main = "Normal Q-Q Plot of Pre-rubric Scores", xlab = "Pre-rubric")
qqline(quiz_data$Pre, col = "blue")

qqnorm(quiz_data$Post, main = "Normal Q-Q Plot of Post-rubric Scores", xlab = "Post-rubric")
qqline(quiz_data$Post, col = "green")

# Histograms of Pre and Post scores
hist(quiz_data$Pre, main = "Histogram of Pre-rubric Percentage", xlab = "Pre-rubric Percentage", col = "blue", breaks = 10)

hist(quiz_data$Post, main = "Histogram of Post-rubric Percentage", xlab = "Post-rubric Percentage", col = "green", breaks = 10)

# Histogram of difference scores
hist(quiz_data$Diff,
     main = "Histogram of Score Improvements (Post - Pre)",
     xlab = "Improvement (Points)",
     col = "gray",
     breaks = 10)

# Boxplot: Pre vs Post
boxplot(quiz_data$Pre, quiz_data$Post,
        names = c("Pre", "Post"),
        ylab = "Score (%)",
        main = "Boxplot of Pre vs Post Quiz Scores",
        col = c("blue", "green"))

# Barplot: mean Pre vs Post
mean_values <- c(mean_pre, mean_post)
barplot(mean_values, names = c("Pre", "Post"), ylab = "Mean Score (%)",
        main = "Mean Quiz Scores Before and After Rubric",
        col = c("blue", "green"))

# Scatter plot: Pre vs Post
plot(quiz_data$Pre, quiz_data$Post,
     main = "Scatter Plot: Pre vs Post Quiz Scores",
     xlab = "Pre Score (%)",
     ylab = "Post Score (%)",
     pch = 16)
abline(0, 1, col = "red", lty = 2) # Reference line: Post = Pre

# Individual line plot (spaghetti plot)
plot(c(1, 2), range(c(quiz_data$Pre, quiz_data$Post)),
     type = "n", xaxt = "n", xlab = "", ylab = "Score (%)",
     main = "Individual Score Changes (Pre → Post)")
axis(1, at = c(1, 2), labels = c("Pre", "Post"))
# Draw change lines for each student
segments(1, quiz_data$Pre,
         2, quiz_data$Post,
         col = "gray")
# Add points
points(rep(1, n_actual), quiz_data$Pre, pch = 16)
points(rep(2, n_actual), quiz_data$Post, pch = 16)

Paired t-test:
State the null and alternative hypotheses:
Null hypothesis: the rubric does not change the mean percent correct.
\(H_0: \mu_{pre} = \mu_{post}\)
Alternative hypothesis: the rubric changes the mean percent correct.
\(H_a: \mu_{pre} \neq \mu_{post}\)
where \(\mu_{pre}\) is the mean percentage correct before the rubric
and \(\mu_{post}\) is the mean percentage correct after the rubric.
Equivalently, in terms of the differences D = Post - Pre:
Null hypothesis: \(H_0: \mu_D = 0\)
Alternative hypothesis: \(H_a: \mu_D \neq 0\)
Check assumptions (normality of differences)
shapiro.test(quiz_data$Diff)
##
## Shapiro-Wilk normality test
##
## data: quiz_data$Diff
## W = 0.92204, p-value = 0.07369
qqnorm(quiz_data$Diff, main="Q-Q Plot of Score Differences")
qqline(quiz_data$Diff)

Since the p-value (0.074) is greater than 0.05 and the Q-Q plot is
reasonably straight, the normality assumption for the differences is
acceptable for a paired t-test.
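Had the normality check been borderline, a nonparametric Wilcoxon
signed-rank test on the paired scores could serve as a robustness
check; a minimal sketch (not part of the original analysis):
# Robustness check (sketch): Wilcoxon signed-rank test, which does not
# assume normal differences. With tied and zero differences, R will
# warn that an exact p-value cannot be computed; that is expected here.
wilcox.test(quiz_data$Post, quiz_data$Pre, paired = TRUE,
            alternative = "two.sided")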
Paired t-test in R
t_res <- t.test(quiz_data$Post, quiz_data$Pre, paired=TRUE, alternative="two.sided")
t_res
##
## Paired t-test
##
## data: quiz_data$Post and quiz_data$Pre
## t = 4.4108, df = 22, p-value = 0.0002212
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 13.13028 36.43493
## sample estimates:
## mean difference
## 24.78261
Test Statistic, Degrees of Freedom, and Confidence
Interval
- test-statistic (t): 4.4108
- degrees of freedom (df) : 22
- p-value: 0.0002212
- 95% confidence interval: [13.13028, 36.43493]
On average, the post-rubric scores are about 24.8 percentage points
higher than the pre-rubric scores, and this improvement is
statistically significant.
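The reported t statistic can also be reproduced directly from the
difference scores, which is a useful arithmetic check:
# Hand check: t = mean(Diff) / (SD(Diff) / sqrt(n))
t_manual <- mean(quiz_data$Diff) / (sd(quiz_data$Diff) / sqrt(n_actual))
t_manual  # approx 4.41, matching the t.test() output above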
Interpretation:
Evaluating the Effectiveness of the Rubric
Comparison of Effect Sizes
The observed Cohen’s d ≈ 0.92, which is a large
effect.
If we had predicted a medium effect (d ≈ 0.50), the observed
effect is larger than expected.
mean_diff <- mean(quiz_data$Diff)
sd_diff <- sd(quiz_data$Diff)
# Cohen's d for paired data: mean of differences / SD of differences
d_observed <- mean_diff / sd_diff
d_observed
## [1] 0.9197138
pwr_observed <- pwr.t.test(n = n_actual, d = d_observed, sig.level = 0.05, type = "paired")
pwr_observed
##
## Paired t test power calculation
##
## n = 23
## d = 0.9197138
## sig.level = 0.05
## power = 0.9878144
## alternative = two.sided
##
## NOTE: n is number of *pairs*
Evaluating Observed Power Relative to Power Analysis
With n_actual = 23 and d ≈ 0.92, the observed power is very high
(≈ 0.99). This is higher than the target 80% power used in the planning
analysis with n_power = 40.
Practical Significance vs. Statistical Significance
Statistical significance:
The p-value is far below 0.05, so we reject \(H_0\) and conclude that
the rubric is associated with higher scores.
Practical significance:
An average improvement of ~25 percentage points is large and
meaningful in an educational context (can change a failing grade to a
passing grade).
Therefore, the rubric appears to be both statistically and
practically significant, although design confounds (quiz difficulty,
learning over time, motivation) mean we should be cautious about
claiming causality.
COMPLETE CODE
library(pwr)

# Design setup
pre_scores <- c(30,70,40,70,30,30,30,30,70,50,50,50,50,20,70,30,30,60,30,80,40,30,70)
post_scores <- c(100,90,80,50,100,70,40,70,80,70,50,50,50,40,50,60,50,100,100,90,50,100,90)
quiz_data <- data.frame(Pre = pre_scores, Post = post_scores)
quiz_data$Diff <- quiz_data$Post - quiz_data$Pre
n_actual <- nrow(quiz_data)  # actual sample size = 23
n_actual

# Power analysis: planned sample size (approximate class size)
n_power <- 40
# Sequence of effect sizes for a smooth power curve
d_seq <- seq(0.1, 1.2, by = 0.01)
# Power for each effect size
power_seq <- sapply(d_seq, function(d)
  pwr.t.test(n = n_power,
             d = d,
             sig.level = 0.05,
             type = "paired",
             alternative = "two.sided")$power
)
# Minimum effect size for 80% power
min_d_80 <- pwr.t.test(n = n_power,
                       power = 0.80,
                       sig.level = 0.05,
                       type = "paired",
                       alternative = "two.sided")
min_d <- min_d_80$d  # numeric value for plotting
# Plot: power curve with 80% benchmark line
d_vals <- c(0.2, 0.5, 0.8, 1.0)
power_vals <- sapply(d_vals, function(d)
  pwr.t.test(n = n_power,
             d = d,
             sig.level = 0.05,
             type = "paired",
             alternative = "two.sided")$power
)
plot(d_seq, power_seq,
     type = "l",
     lwd = 2,
     xlab = "Effect Size (Cohen's d)",
     ylab = "Power",
     main = "Power Curve for Paired t-test (n = 40, alpha = 0.05)")
abline(h = 0.80, lty = 2, col = "red")
abline(v = min_d, lty = 3, col = "blue")
points(d_vals, power_vals, pch = 16, col = "darkgreen")
text(x = min_d, y = 0.1,
     labels = paste0("d ≈ ", round(min_d, 2)),
     pos = 4,
     col = "blue")
grid()

# Descriptive statistics
mean_pre <- mean(quiz_data$Pre)
mean_post <- mean(quiz_data$Post)
sd_diff <- sd(quiz_data$Diff)
mean_pre
mean_post
sd_diff

# Visualization set for the paired t-test report
# Normality check using the differences
qqnorm(quiz_data$Diff, main = "Normal Q-Q Plot of Differences (Post - Pre)",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(quiz_data$Diff, col = "blue")
# Q-Q plots of Pre and Post separately
qqnorm(quiz_data$Pre, main = "Normal Q-Q Plot of Pre-rubric Scores", xlab = "Pre-rubric")
qqline(quiz_data$Pre, col = "blue")
qqnorm(quiz_data$Post, main = "Normal Q-Q Plot of Post-rubric Scores", xlab = "Post-rubric")
qqline(quiz_data$Post, col = "green")
# Histograms of Pre and Post scores
hist(quiz_data$Pre, main = "Histogram of Pre-rubric Percentage",
     xlab = "Pre-rubric Percentage", col = "blue", breaks = 10)
hist(quiz_data$Post, main = "Histogram of Post-rubric Percentage",
     xlab = "Post-rubric Percentage", col = "green", breaks = 10)
# Histogram of difference scores
hist(quiz_data$Diff,
     main = "Histogram of Score Improvements (Post - Pre)",
     xlab = "Improvement (Points)",
     col = "gray",
     breaks = 10)
# Boxplot: Pre vs Post
boxplot(quiz_data$Pre, quiz_data$Post,
        names = c("Pre", "Post"),
        ylab = "Score (%)",
        main = "Boxplot of Pre vs Post Quiz Scores",
        col = c("blue", "green"))
# Barplot: mean Pre vs Post
mean_values <- c(mean_pre, mean_post)
barplot(mean_values, names = c("Pre", "Post"), ylab = "Mean Score (%)",
        main = "Mean Quiz Scores Before and After Rubric",
        col = c("blue", "green"))
# Scatter plot: Pre vs Post
plot(quiz_data$Pre, quiz_data$Post,
     main = "Scatter Plot: Pre vs Post Quiz Scores",
     xlab = "Pre Score (%)",
     ylab = "Post Score (%)",
     pch = 16)
abline(0, 1, col = "red", lty = 2)  # reference line: Post = Pre
# Individual line plot (spaghetti plot)
plot(c(1, 2), range(c(quiz_data$Pre, quiz_data$Post)),
     type = "n", xaxt = "n", xlab = "", ylab = "Score (%)",
     main = "Individual Score Changes (Pre → Post)")
axis(1, at = c(1, 2), labels = c("Pre", "Post"))
# Draw change lines for each student
segments(1, quiz_data$Pre,
         2, quiz_data$Post,
         col = "gray")
# Add points
points(rep(1, n_actual), quiz_data$Pre, pch = 16)
points(rep(2, n_actual), quiz_data$Post, pch = 16)

# Assumption check and paired t-test
shapiro.test(quiz_data$Diff)
qqnorm(quiz_data$Diff, main = "Q-Q Plot of Score Differences")
qqline(quiz_data$Diff)
t_res <- t.test(quiz_data$Post, quiz_data$Pre, paired = TRUE, alternative = "two.sided")
t_res

# Observed effect size (Cohen's d for paired data) and observed power
mean_diff <- mean(quiz_data$Diff)
sd_diff <- sd(quiz_data$Diff)
d_observed <- mean_diff / sd_diff
d_observed
pwr_observed <- pwr.t.test(n = n_actual, d = d_observed, sig.level = 0.05, type = "paired")
pwr_observed