This report evaluates the prerequisites for conducting a Multivariate Analysis of Covariance (MANCOVA). We aim to determine if an educational intervention significantly affected Math and Reading scores while controlling for baseline performance.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'MVN' was built under R version 4.5.3
## Warning: package 'biotools' was built under R version 4.5.3
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
##
## ---
## biotools version 4.3
## Warning: package 'ggcorrplot' was built under R version 4.5.3
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
##
## Attaching package: 'patchwork'
##
## The following object is masked from 'package:MASS':
##
## area
## Warning: package 'heplots' was built under R version 4.5.3
## Loading required package: broom
##
## Attaching package: 'heplots'
##
## The following object is masked from 'package:biotools':
##
## boxM
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
library(patchwork)
education_data <- read.csv("education_intervention_10000.csv")
names(education_data) <- make.names(names(education_data))
dv_cols <- c("post_math", "post_reading")
iv_col <- "intervention"
covariate_cols <- c("baseline_math", "baseline_reading")
data_clean <- education_data |>
drop_na(all_of(c(dv_cols, iv_col, covariate_cols)))
cont_cols <- c(dv_cols, covariate_cols)
remove_outliers_iqr <- function(df, cols) {
df_filtered <- df
for (col in cols) {
Q1 <- quantile(df_filtered[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df_filtered[[col]], 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
df_filtered <- df_filtered |>
filter(.data[[col]] >= lower_bound & .data[[col]] <= upper_bound)
}
return(df_filtered)
}
data_no_outliers <- remove_outliers_iqr(data_clean, cont_cols)
cat("Removed", nrow(data_clean) - nrow(data_no_outliers), "outliers using IQR method.\n\n")## Removed 79 outliers using IQR method.
set.seed(42)
data_clean <- data_no_outliers |>
group_by(!!sym(iv_col)) |>
sample_n(30) |>
ungroup() |>
as.data.frame()
print("Education Intervention Dataframe Preview (IQR Outliers Removed & Balanced):")## [1] "Education Intervention Dataframe Preview (IQR Outliers Removed & Balanced):"
## intervention post_math post_reading baseline_math baseline_reading
## 1 blended 91.8 83.8 80.2 71.5
## 2 blended 52.2 58.3 41.4 48.2
## 3 blended 99.8 100.0 76.9 88.6
## 4 blended 91.9 93.1 77.9 80.4
## 5 blended 97.7 93.1 86.4 81.9
## 6 blended 85.7 50.9 80.6 44.7
We load the necessary libraries and prepare the data by balancing the groups, ensuring a clean start for the multivariate tests.
desc_stats <- data_clean |>
group_by(intervention) |>
summarise(across(all_of(cont_cols), list(mean = mean, sd = sd), .names = "{.col}_{.fn}")) |>
bind_rows(
data_clean |>
summarise(across(all_of(cont_cols), list(mean = mean, sd = sd), .names = "{.col}_{.fn}")) |>
mutate(intervention = "Total")
)
print(desc_stats)## # A tibble: 5 × 9
## intervention post_math_mean post_math_sd post_reading_mean post_reading_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 blended 83.7 13.8 79.7 13.2
## 2 control 73.8 15.5 78.7 12.5
## 3 flipped 81.3 11.8 81.8 12.8
## 4 tutoring 77.2 12.9 78.4 12.2
## 5 Total 79.0 13.9 79.6 12.6
## # ℹ 4 more variables: baseline_math_mean <dbl>, baseline_math_sd <dbl>,
## # baseline_reading_mean <dbl>, baseline_reading_sd <dbl>
plot_post_math <- ggplot(data_clean, aes(x = intervention, y = post_math, fill = intervention)) +
geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 1.5) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
scale_fill_brewer(palette = "Set2") +
theme_minimal(base_size = 12) +
theme(legend.position = "none") +
labs(title = "Post Math Score", x = "Intervention", y = "Score")
plot_post_reading <- ggplot(data_clean, aes(x = intervention, y = post_reading, fill = intervention)) +
geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 1.5) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
scale_fill_brewer(palette = "Set2") +
theme_minimal(base_size = 12) +
theme(legend.position = "none") +
labs(title = "Post Reading Score", x = "Intervention", y = "Score")
plot_baseline_math <- ggplot(data_clean, aes(x = intervention, y = baseline_math, fill = intervention)) +
geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 1.5) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
scale_fill_brewer(palette = "Set2") +
theme_minimal(base_size = 12) +
theme(legend.position = "none") +
labs(title = "Baseline Math Score", x = "Intervention", y = "Score")
plot_baseline_reading <- ggplot(data_clean, aes(x = intervention, y = baseline_reading, fill = intervention)) +
geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 1.5) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
scale_fill_brewer(palette = "Set2") +
theme_minimal(base_size = 12) +
theme(legend.position = "none") +
labs(title = "Baseline Reading Score", x = "Intervention", y = "Score")
combined_plot <- (plot_post_math + plot_post_reading) / (plot_baseline_math + plot_baseline_reading) +
plot_annotation(title = "Score Distribution by Intervention Method")
print(combined_plot)MANCOVA is only appropriate if the Dependent Variables (Post Math and Post Reading) are significantly correlated. If they were independent, the analysis would not provide any multivariate advantage over separate ANOVAs.
In order to simplify, hypothesis of H0 and H1 need to conducted:
H0: DV not correlated with each other
H1: DV correlated with each other
In order to pass the test, we need to successfully reject H0.
cor_matrix <- cor(data_clean[dv_cols])
n <- nrow(data_clean)
p <- length(dv_cols)
det_R <- det(cor_matrix)
bartlett_chi <- -((n - 1) - (2 * p + 5) / 6) * log(det_R)
p_val_bartlett <- pchisq(bartlett_chi, df = p*(p-1)/2, lower.tail = FALSE)
cat("Bartlett's Sphericity p-value:", p_val_bartlett)## Bartlett's Sphericity p-value: 0.01371306
We seek a p-value < 0.05. A significant result confirms that: we reject H0; there is indeed correlation between each DV. Math and reading scores share enough variance to be analyzed as a unified educational outcome, satisfying the first core assumption.
Explanation: This test ensures that the “spread” (variance) and the relationship (covariance) between the test scores are equal across all Intervention groups (e.g., Tutoring, Flipped, Blended).
H0: All groups share the same covariance matrix
H1: At least one of the group have differ covariance matrix
In order to pass, we need to fail to reject H0. Since we want the data to be have homogeneity accross group
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: data_clean[dv_cols] by data_clean[[iv_col]]
## Chi-Sq (approx.) = 7.5046, df = 9, p-value = 0.5847
Following the publication’s standards, we look for a p-value > 0.05.
The result confirms that: we successfully fail to reject H0. Because Box’s M is highly sensitive to sample size, a p-value above this strict threshold indicates that the group variances are sufficiently homogenous for a valid MANCOVA.
We check if the combination of Math and Reading scores follows a multivariate normal distribution by looking at Skewness (symmetry) and Kurtosis (peak height).
H0: Data follows a multivariate normal distribution
H1: Data not follows a multivariate normal distribution
In order to pass the test, we need to fail to reject H0
# Creating distributions for Math and Reading scores
plot_math <- ggplot(data_clean, aes(x = post_math)) +
geom_histogram(fill = "steelblue", color = "white", bins = 30) +
theme_minimal() +
labs(title = "Post Math Distribution")
plot_reading <- ggplot(data_clean, aes(x = post_reading)) +
geom_histogram(fill = "darkgreen", color = "white", bins = 30) +
theme_minimal() +
labs(title = "Post Reading Distribution")
# Displaying them side-by-side
print(plot_math + plot_reading)##
## ── Multivariate Normality Test Results ─────────────────────────────────────────
## Test Statistic p.value Method MVN
## 1 Henze-Zirkler 0.907 0.081 asymptotic ✓ Normal
## $multivariate_normality
## Test Statistic p.value Method MVN
## 1 Henze-Zirkler 0.907 0.081 asymptotic ✓ Normal
##
## $univariate_normality
## Test Variable Statistic p.value Normality
## 1 Anderson-Darling post_math 0.818 0.033 ✗ Not normal
## 2 Anderson-Darling post_reading 0.626 0.101 ✓ Normal
##
## $descriptives
## Variable n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis
## 1 post_math 120 78.998 13.932 79.90 46.0 100 68.35 89.600 -0.248 2.182
## 2 post_reading 120 79.629 12.574 79.05 50.9 100 69.95 90.375 -0.168 2.280
##
## $data
## post_math post_reading
## 1 91.8 83.8
## 2 52.2 58.3
## 3 99.8 100.0
## 4 91.9 93.1
## 5 97.7 93.1
## 6 85.7 50.9
## 7 68.7 98.2
## 8 59.6 60.1
## 9 100.0 94.0
## 10 73.9 72.4
## 11 90.7 93.2
## 12 79.7 83.7
## 13 85.6 82.8
## 14 77.0 82.4
## 15 100.0 83.4
## 16 86.8 88.3
## 17 94.8 89.8
## 18 83.8 98.8
## 19 62.1 73.0
## 20 72.6 77.1
## 21 71.7 69.3
## 22 65.6 64.8
## 23 92.4 86.0
## 24 82.6 71.3
## 25 100.0 69.2
## 26 89.9 91.5
## 27 99.2 67.1
## 28 67.3 65.7
## 29 100.0 68.7
## 30 88.9 79.8
## 31 57.1 69.3
## 32 57.7 60.5
## 33 80.9 100.0
## 34 69.1 77.9
## 35 72.2 100.0
## 36 83.3 77.9
## 37 49.9 75.3
## 38 70.6 98.2
## 39 59.0 94.3
## 40 79.6 71.2
## 41 64.0 55.5
## 42 95.3 86.1
## 43 91.2 59.9
## 44 71.2 91.9
## 45 66.5 90.3
## 46 84.3 67.9
## 47 89.0 84.4
## 48 57.9 75.0
## 49 81.6 69.2
## 50 96.1 87.7
## 51 94.6 75.2
## 52 88.8 72.9
## 53 65.5 90.8
## 54 48.1 80.6
## 55 100.0 70.1
## 56 87.5 72.1
## 57 67.0 67.4
## 58 46.0 70.8
## 59 58.7 97.3
## 60 80.4 70.6
## 61 100.0 88.8
## 62 83.7 100.0
## 63 83.6 82.7
## 64 89.5 100.0
## 65 59.6 63.2
## 66 70.4 90.7
## 67 85.7 63.8
## 68 68.8 83.5
## 69 100.0 94.3
## 70 82.9 71.2
## 71 67.1 97.3
## 72 63.0 68.1
## 73 96.2 79.0
## 74 100.0 82.1
## 75 87.3 67.8
## 76 88.8 83.7
## 77 74.6 79.1
## 78 93.3 99.9
## 79 77.6 89.1
## 80 100.0 84.6
## 81 79.1 90.6
## 82 69.2 90.9
## 83 61.4 79.0
## 84 71.5 90.9
## 85 76.3 78.3
## 86 76.3 66.7
## 87 79.5 93.2
## 88 84.2 54.1
## 89 84.7 83.4
## 90 84.5 57.7
## 91 81.8 80.3
## 92 92.2 89.4
## 93 73.1 52.2
## 94 75.8 75.4
## 95 65.1 90.6
## 96 76.9 51.7
## 97 98.4 81.7
## 98 57.0 69.5
## 99 66.3 80.5
## 100 66.9 82.8
## 101 80.1 68.6
## 102 63.5 74.7
## 103 97.0 78.2
## 104 85.3 96.6
## 105 52.6 77.0
## 106 67.1 74.8
## 107 85.9 97.7
## 108 73.5 68.6
## 109 86.6 76.1
## 110 100.0 100.0
## 111 93.8 87.4
## 112 93.0 62.2
## 113 70.3 67.1
## 114 75.6 77.6
## 115 86.8 100.0
## 116 61.3 77.7
## 117 57.7 77.5
## 118 78.1 70.1
## 119 77.5 88.0
## 120 76.7 77.7
##
## $subset
## NULL
##
## $outlierMethod
## [1] "none"
##
## attr(,"class")
## [1] "mvn"
The Henze-Zirkler test produced a test statistic of 0.907 with p = 0.081, which is above the 0.05 threshold. We succesfully fail to reject H0, meaning the joint distribution of post_math and post_reading satisfies multivariate normality. At the univariate level, post_reading is normal (p = 0.101), while post_math shows a minor violation (p = 0.033). Given the balanced sample size of n = 120 and the low skewness values for both variables (post_math: -0.248, post_reading: -0.168), this violation is considered negligible. The multivariate normality assumption is met.
This ensures a linear relationship between the Covariates (Baseline scores) and the DVs (Post-test scores). It also checks if the relationship is consistent across different intervention groups.
H0: There is no linearity between covariate and DV
H1: There is linearity between covariates and DV
In order to pass the test we need to reject H0.
for (dv in dv_cols) {
formula_str <- paste(dv, "~", paste(covariate_cols, collapse = " + "))
fit <- lm(as.formula(formula_str), data = data_clean)
cat("\nLinearity Analysis for", dv, ":\n")
print(summary(fit))
cat("R-squared:", summary(fit)$r.squared, "\n")
# Extract p-values
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|t|)"]
cat("\nP-values for", dv, ":\n")
print(p_values)
# Overall model p-value
overall_p <- pf(summary(fit)$fstatistic[1],
summary(fit)$fstatistic[2],
summary(fit)$fstatistic[3],
lower.tail = FALSE)
cat("Overall model p-value:", overall_p, "\n")
}##
## Linearity Analysis for post_math :
##
## Call:
## lm(formula = as.formula(formula_str), data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1819 -3.2368 -0.0691 2.6842 12.1383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.31294 3.13383 3.291 0.00132 **
## baseline_math 0.94523 0.03357 28.161 < 2e-16 ***
## baseline_reading 0.06129 0.03622 1.692 0.09330 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.909 on 117 degrees of freedom
## Multiple R-squared: 0.8779, Adjusted R-squared: 0.8758
## F-statistic: 420.7 on 2 and 117 DF, p-value: < 2.2e-16
##
## R-squared: 0.8779264
##
## P-values for post_math :
## (Intercept) baseline_math baseline_reading
## 1.320574e-03 6.021460e-54 9.329803e-02
## Overall model p-value: 3.692913e-54
##
## Linearity Analysis for post_reading :
##
## Call:
## lm(formula = as.formula(formula_str), data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4737 -3.0823 0.8732 3.1575 7.4170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.525692 2.754323 4.911 2.97e-06 ***
## baseline_math 0.001621 0.029500 0.055 0.956
## baseline_reading 0.935309 0.031835 29.379 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.315 on 117 degrees of freedom
## Multiple R-squared: 0.8842, Adjusted R-squared: 0.8822
## F-statistic: 446.8 on 2 and 117 DF, p-value: < 2.2e-16
##
## R-squared: 0.8842265
##
## P-values for post_reading :
## (Intercept) baseline_math baseline_reading
## 2.972857e-06 9.562718e-01 7.801254e-56
## Overall model p-value: 1.663921e-55
The assumption is met if the regression models are significant (p < 0.05). The results says that we successfully reject H0. This proves that the baseline scores and study hours are valid predictors that can effectively “clean” the post-test data in the MANCOVA model.
This assumption requires that each student’s score is independent and not influenced by others. Since there is no specific statistical test for this, we examine the residuals for random distribution.
res_fit <- lm(as.matrix(data_clean[dv_cols]) ~ intervention + baseline_math, data = data_clean)
res_data <- residuals(res_fit)[,1]
plot(res_data, main="Independence Check: Residual Plot",
xlab="Student Index", ylab="Residuals", col="darkblue")
abline(h = 0, col = "red", lwd = 2)A random scatter of points around the zero line indicates that there is no temporal or group-based correlation between student records. This confirms the Independence of Observations, fulfilling the final requirement for the analysis.
Before running any formal tests, we must understand the shape and behavior of our data across both outcomes. Exploratory Data Analysis helps us visualize the raw performance differences between the intervention groups and the relationship between the baseline and post-test scores for both math and reading.
data_clean |>
pivot_longer(cols = c(post_math, post_reading), names_to = 'outcome', values_to = 'score') |>
mutate(outcome = case_when(
outcome == 'post_math' ~ 'Post Math',
outcome == 'post_reading' ~ 'Post Reading'
)) |>
ggplot(aes(x = intervention, y = score, fill = intervention)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~ outcome, scales = 'free_y') +
theme_minimal() +
labs(
title = 'Raw Post-Test Scores by Intervention Group',
x = 'Intervention Type',
y = 'Score'
) +
theme(legend.position = 'none')For math, blended and flipped groups show visibly higher medians compared to control, with blended in particular sitting notably above the rest. The control group also displays a wider interquartile range, suggesting more variability in math outcomes among students who received no active intervention. For reading, the picture is strikingly different, as all four groups show heavily overlapping boxes with medians clustered closely together, indicating that intervention type produces little visible separation in raw reading outcomes. Notably, the reading plots also show wider spreads overall compared to math, meaning individual reading scores vary considerably regardless of group. Both observations are based on unadjusted raw scores however, so the apparent math advantage of blended and flipped may simply reflect the fact that those students started with stronger baseline math ability rather than benefiting more from the instruction itself.
data_clean |>
pivot_longer(
cols = c(baseline_math, baseline_reading),
names_to = 'baseline_outcome',
values_to = 'baseline_score'
) |>
mutate(
post_score = ifelse(baseline_outcome == 'baseline_math', post_math, post_reading),
outcome = case_when(
baseline_outcome == 'baseline_math' ~ 'Math',
baseline_outcome == 'baseline_reading' ~ 'Reading'
)
) |>
ggplot(aes(x = baseline_score, y = post_score, color = intervention)) +
geom_point(alpha = 0.5) +
geom_smooth(method = 'lm', se = FALSE) +
facet_wrap(~ outcome, scales = 'free') +
theme_minimal() +
labs(
title = 'Trajectory: Baseline vs Post-Test Scores',
x = 'Baseline Score',
y = 'Post-Test Score'
)## `geom_smooth()` using formula = 'y ~ x'
Both plots confirm a strong positive linear relationship between baseline and post-test scores for math and reading alike, meaning students who entered the study with higher prior ability consistently scored higher on the post-test regardless of which intervention they received. The regression lines across all four intervention groups are strikingly parallel and run in the same direction without crossing, which provides early visual evidence that the covariate behaves consistently across groups for both outcomes. For math, the control group line sits visibly lower than the active intervention groups across most of the baseline range, hinting that the intervention effect may be meaningful once baseline is accounted for. For reading, the four lines are even more tightly clustered together, reinforcing the earlier boxplot observation that group separation in reading is minimal. The consistent slopes across both outcomes are an encouraging early sign that the homogeneity of regression slopes assumption, which will be formally tested later, is likely to be satisfied.
data_clean |>
pivot_longer(cols = c(post_math, post_reading), names_to = 'outcome', values_to = 'score') |>
mutate(outcome = case_when(
outcome == 'post_math' ~ 'Post Math',
outcome == 'post_reading' ~ 'Post Reading'
)) |>
ggplot(aes(x = score, fill = intervention)) +
geom_density(alpha = 0.5) +
facet_wrap(~ outcome, scales = 'free') +
theme_minimal() +
labs(
title = 'Post-Test Score Distribution by Intervention Group',
x = 'Post-Test Score',
y = 'Density'
)For math, the density curves show meaningful separation between groups, with blended peaking furthest to the right around 90 and control spreading broadly across a lower range, consistent with the boxplot pattern observed earlier. Flipped and tutoring sit in between, with their peaks clustering around 80. The control group notably produces the flattest and widest curve, reflecting greater score variability among students with no active intervention. For reading, the picture is more chaotic, with all four curves overlapping heavily and peaks scattered across a narrow band between 75 and 85. Tutoring produces a sharp narrow peak while control spreads broadly, but no group consistently dominates the higher score range the way blended does in math. The overall reading distributions are largely indistinguishable from one another, further confirming that intervention type carries very little signal for reading outcomes when baseline ability is not controlled for.
data_clean |>
pivot_longer(cols = c(baseline_math, baseline_reading), names_to = 'outcome', values_to = 'score') |>
mutate(outcome = case_when(
outcome == 'baseline_math' ~ 'Baseline Math',
outcome == 'baseline_reading' ~ 'Baseline Reading'
)) |>
ggplot(aes(x = intervention, y = score, fill = intervention)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~ outcome, scales = 'free_y') +
theme_minimal() +
labs(
title = 'Baseline Scores by Intervention Group',
x = 'Intervention Type',
y = 'Score'
) +
theme(legend.position = 'none')For baseline math, the blended group enters the study with a noticeably higher median compared to control, while flipped also sits slightly above control and tutoring. This means the groups were not on equal footing before the intervention even began, so any raw post-test math advantage observed for blended cannot be confidently attributed to the instruction alone. For baseline reading, the four groups are considerably more comparable, with all medians sitting close together around 70 and heavily overlapping interquartile ranges across groups. The one exception is a single outlier visible in the tutoring group reaching near 100. Overall, the baseline math imbalance is the more concerning of the two, as it directly explains why the raw post-math differences we observed earlier need to be treated with caution. This confirms that statistically controlling for baseline scores in ANCOVA and MANCOVA is not just a methodological formality but a necessary correction given the pre-existing differences between groups.
data_clean |>
group_by(intervention) |>
summarise(
n = n(),
mean_post_math = round(mean(post_math), 2),
sd_post_math = round(sd(post_math), 2),
mean_baseline_math = round(mean(baseline_math), 2),
sd_baseline_math = round(sd(baseline_math), 2),
mean_post_reading = round(mean(post_reading), 2),
sd_post_reading = round(sd(post_reading), 2),
mean_baseline_reading = round(mean(baseline_reading), 2),
sd_baseline_reading = round(sd(baseline_reading), 2)
)Before controlling for any baseline differences, we first examine whether intervention type alone produces significantly different mean scores across groups. This establishes a raw baseline comparison that will later be contrasted against the covariate-adjusted results from ANCOVA.
anova_math <- aov(post_math ~ intervention, data = data_clean)
cat('--- One-Way ANOVA: Post Math ---\n')## --- One-Way ANOVA: Post Math ---
## Df Sum Sq Mean Sq F value Pr(>F)
## intervention 3 1748 582.7 3.166 0.0271 *
## Residuals 116 21350 184.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_reading <- aov(post_reading ~ intervention, data = data_clean)
cat('--- One-Way ANOVA: Post Reading ---\n')## --- One-Way ANOVA: Post Reading ---
## Df Sum Sq Mean Sq F value Pr(>F)
## intervention 3 213 71.13 0.444 0.722
## Residuals 116 18600 160.34
For post-math scores, the intervention effect is significant (F(3, 116) = 3.17, p = .027), meaning that intervention group membership is associated with different math outcomes even without baseline adjustment. For post-reading, the effect is non-significant (F(3, 116) = 0.44, p = .722), suggesting that reading outcomes do not differ meaningfully across groups when prior ability is ignored.
We therefore reject H0 for math but fail to reject H0 for reading.
This asymmetry is expected rather than concerning. Math outcomes tend to be more immediately responsive to instructional differences, while reading development is slower and more dependent on accumulated prior ability. Crucially, neither result can be fully trusted at face value.
Students entering different intervention programs likely carried different baseline ability levels into the study, meaning these raw group differences may reflect pre-existing advantages rather than the true effect of the instruction itself. This fundamental limitation of unadjusted ANOVA motivates the transition to ANCOVA, where baseline math and reading scores are statistically controlled to isolate the intervention’s genuine contribution to each outcome independently.
Partial eta squared quantifies how much of the total variance in each outcome is explained by intervention group membership alone, before any baseline adjustment is applied.
The partial eta squared for post-math indicates that intervention group membership explains approximately 7.6% of variance in math scores before any baseline adjustment, a rather small-to-medium effect that aligns with the significant omnibus result. For post-reading, η² = 0.011, a negligible effect consistent with the non-significant ANOVA. These unadjusted effect sizes likely underestimate the true intervention impact, since a substantial portion of outcome variance is driven by students’ pre-existing ability levels rather than the instruction they received. Once baseline scores are controlled in ANCOVA, the proportion of variance uniquely attributable to the intervention is expected to become clearer and more precise.
Since the omnibus ANOVA was significant for post-math (p = .027), Bonferroni-adjusted pairwise comparisons are conducted to identify which specific intervention pairs are driving the difference. These results serve as confirmatory evidence of group separation in math outcomes.
## contrast estimate SE df t.ratio p.value
## blended - control 9.96 3.5 116 2.844 0.0316
## blended - flipped 2.44 3.5 116 0.697 1.0000
## blended - tutoring 6.54 3.5 116 1.866 0.3873
## control - flipped -7.52 3.5 116 -2.148 0.2029
## control - tutoring -3.43 3.5 116 -0.978 1.0000
## flipped - tutoring 4.10 3.5 116 1.170 1.0000
##
## P value adjustment: bonferroni method for 6 tests
emm_math_anova <- as.data.frame(emmeans(anova_math, ~ intervention))
emm_math_anova$outcome <- 'Post Math'
names(emm_math_anova)[names(emm_math_anova) == 'emmean'] <- 'mean'
ggplot(emm_math_anova, aes(x = intervention, y = mean, fill = intervention)) +
geom_bar(stat = 'identity', width = 0.6) +
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
labs(
title = 'Group Means by Intervention (ANOVA)',
subtitle = 'Post Math',
x = 'Intervention', y = 'Mean Score'
) +
theme_minimal() +
theme(legend.position = 'none')For post-math scores, the only significant pairwise contrast after Bonferroni adjustment was blended vs. control (difference = 9.96, p = .032), where the blended group scored notably higher than the control group. All other pairwise comparisons were non-significant, suggesting that flipped and tutoring groups did not differ meaningfully from either the control or each other after correction.
The bar chart visually confirms this pattern. Blended and flipped sit visibly higher than control, while tutoring falls closer to the middle, yet only the blended-control gap is large enough to survive the Bonferroni correction. These unadjusted group means still reflect whatever baseline ability differences existed between students before the intervention began, which is precisely why ANCOVA is needed to determine whether this blended advantage holds after leveling the playing field.
Although the omnibus ANOVA for post-reading was non-significant (p = .722), we present Bonferroni-adjusted pairwise comparisons as an exploratory exercise to examine the direction of group differences. These results should be interpreted with caution and are not intended as confirmatory evidence.
## contrast estimate SE df t.ratio p.value
## blended - control 0.983 3.27 116 0.301 1.0000
## blended - flipped -2.130 3.27 116 -0.651 1.0000
## blended - tutoring 1.270 3.27 116 0.388 1.0000
## control - flipped -3.113 3.27 116 -0.952 1.0000
## control - tutoring 0.287 3.27 116 0.088 1.0000
## flipped - tutoring 3.400 3.27 116 1.040 1.0000
##
## P value adjustment: bonferroni method for 6 tests
emm_reading_anova <- as.data.frame(emmeans(anova_reading, ~ intervention))
emm_reading_anova$outcome <- 'Post Reading'
names(emm_reading_anova)[names(emm_reading_anova) == 'emmean'] <- 'mean'
ggplot(emm_reading_anova, aes(x = intervention, y = mean, fill = intervention)) +
geom_bar(stat = 'identity', width = 0.6) +
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
labs(
title = 'Group Means by Intervention (ANOVA)',
subtitle = 'Post Reading',
x = 'Intervention', y = 'Mean Score'
) +
theme_minimal() +
theme(legend.position = 'none')As expected from the non-significant omnibus result, no pairwise comparison reached significance for post-reading scores, all p-values were 1.000 after Bonferroni adjustment. The bar chart visually encourages this, showing four nearly identical bar heights with heavily overlapping confidence intervals across all intervention groups. Reading outcomes appear entirely uniform regardless of which intervention a student received, suggesting that intervention type carries no detectable signal for reading performance when baseline ability is not accounted for. This is consistent with the earlier discussion that reading development is slower and more resistant to short-term instructional differences, and further supports the need for ANCOVA to determine whether any reading signal emerges once baseline scores are properly controlled.
Now we move to ANCOVA. The goal of ANCOVA is to answer one critical question: if all students had started with the exact same baseline scores, would the intervention they received still make a significant difference in their final academic performance? By statistically controlling for each student’s prior math and reading ability, ANCOVA isolates the intervention’s true effect on each outcome independently, removing the noise caused by pre-existing differences between students before the instruction even began.
ANCOVA extends the earlier ANOVA by statistically removing baseline differences before comparing groups. We run one model per DV, each controlling for its own respective baseline covariate.
ancova_math <- lm(post_math ~ baseline_math + intervention, data = data_clean)
cat('--- ANCOVA: Post Math ---\n') ## --- ANCOVA: Post Math ---
ancova_reading <- lm(post_reading ~ baseline_reading + intervention, data = data_clean)
cat('--- ANCOVA: Post Reading ---\n') ## --- ANCOVA: Post Reading ---
After statistically controlling for prior ability, the ANCOVA results reveal a clear and consistent picture across both outcomes.
For post-math scores, baseline math is an overwhelmingly strong predictor (F(1, 115) = 871.31, p < .001), confirming that prior math ability dominates student outcomes. Critically, even after removing this baseline effect, the intervention remains highly significant (F(3, 115) = 6.15, p < .001), meaning the type of instruction a student received had a genuine impact on their final math score independent of where they started.
The same pattern holds for post-reading scores, where baseline reading is again the dominant predictor (F(1, 115) = 969.33, p < .001), yet the intervention effect remains significant (F(3, 115) = 3.99, p = .010).
Therefore we reject H0 for both outcomes.
These results demonstrate that a student’s prior ability is by far the strongest driver of their final score, but the type of instruction they received still contributes a meaningful and statistically significant effect on top of that. This also sets up an important question: do these intervention effects hold when both outcomes are examined simultaneously while controlling for baselines across the board? That is precisely what MANCOVA will address.
The partial eta squared values reinforce the ANCOVA findings. For post-math, baseline math accounts for an overwhelming 88.3% of variance in final math scores (η² = 0.883), while the intervention explains an additional 13.8% after that baseline is controlled (η² = 0.138). For post-reading, the pattern is nearly identical — baseline reading accounts for 89.4% of variance (η² = 0.894), with the intervention explaining a further 9.4% on top (η² = 0.094). While the intervention effect sizes are modest relative to the covariate, a 10-14% explained variance after already accounting for prior ability is considered practically meaningful in an educational context, where baseline performance naturally dominates outcomes.
Pairwise comparisons using adjusted means. This is the core value of ANCOVA, since it compares groups as if they all started from the same baseline.
## contrast estimate SE df t.ratio p.value
## blended - control 3.867 1.22 115 3.172 0.0116
## blended - flipped 1.645 1.20 115 1.369 1.0000
## blended - tutoring -0.939 1.23 115 -0.765 1.0000
## control - flipped -2.221 1.21 115 -1.829 0.4200
## control - tutoring -4.806 1.20 115 -3.998 0.0007
## flipped - tutoring -2.585 1.22 115 -2.114 0.2198
##
## P value adjustment: bonferroni method for 6 tests
## contrast estimate SE df t.ratio p.value
## blended - control 3.020 1.07 115 2.818 0.0341
## blended - flipped -0.266 1.07 115 -0.248 1.0000
## blended - tutoring 1.426 1.07 115 1.334 1.0000
## control - flipped -3.285 1.07 115 -3.072 0.0159
## control - tutoring -1.593 1.07 115 -1.487 0.8377
## flipped - tutoring 1.692 1.07 115 1.580 0.7007
##
## P value adjustment: bonferroni method for 6 tests
emm_ancova_math <- as.data.frame(emmeans(ancova_math, ~ intervention))
emm_ancova_reading <- as.data.frame(emmeans(ancova_reading, ~ intervention))
emm_ancova_math$outcome <- 'Post Math'
emm_ancova_reading$outcome <- 'Post Reading'
names(emm_ancova_math)[names(emm_ancova_math) == 'emmean'] <- 'adj_mean'
names(emm_ancova_reading)[names(emm_ancova_reading) == 'emmean'] <- 'adj_mean'
emm_ancova <- rbind(emm_ancova_math, emm_ancova_reading)
ggplot(emm_ancova, aes(x = intervention, y = adj_mean, fill = intervention)) +
geom_bar(stat = 'identity', width = 0.6) +
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
facet_wrap(~ outcome) +
labs(
title = 'Adjusted Means by Intervention (ANCOVA)',
subtitle = 'Each DV controlled for its own baseline',
x = 'Intervention', y = 'Adjusted Mean Score'
) +
theme_minimal() +
theme(legend.position = 'none')After adjusting for baseline scores, the pairwise comparisons reveal specific group differences that were not visible in the raw data. For post-math, two contrasts reached significance: blended outperformed control (difference = 3.87, p = .012) and tutoring outperformed control (difference = 4.81, p < .001), while all other math pairs were non-significant. For post-reading, blended outperformed control (difference = 3.02, p = .034) and flipped outperformed control (difference = 3.29, p = .016), with remaining pairs non-significant.
The bar chart visually confirms this pattern. The control group consistently sits lower than the active intervention groups across both outcomes, while blended, flipped, and tutoring cluster closely together with heavily overlapping confidence intervals. The tight confidence intervals reflect the reduced error variance achieved by controlling for baseline scores, giving us a more precise estimate of each group’s true performance. Across both subjects, the consistent takeaway is that any form of active intervention outperforms no intervention, but no single intervention method stands out as superior to the others.
Before trusting the ANCOVA results, we must verify a key assumption: homogeneity of regression slopes. This assumption requires that the relationship between the covariate and the dependent variable is consistent across all intervention groups. In other words, the baseline-to-post-test slope should be roughly parallel for every group. If the slopes differ significantly across groups, it means the covariate affects each group differently, and a standard ANCOVA would be inappropriate.
H0: The regression slopes are equal across all intervention groups
H1: At least one group differ
ancova_slope_math <- lm(post_math ~ baseline_math * intervention, data = data_clean)
cat("--- Homogeneity of Regression Slopes: Post Math ---\n")## --- Homogeneity of Regression Slopes: Post Math ---
ancova_slope_reading <- lm(post_reading ~ baseline_reading * intervention, data = data_clean)
cat("--- Homogeneity of Regression Slopes: Post Reading ---\n")## --- Homogeneity of Regression Slopes: Post Reading ---
p1 <- ggplot(data_clean, aes(x = baseline_math, y = post_math, color = intervention)) +
geom_point(alpha = 0.4, size = 2) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
labs(
title = "Regression Slopes: Post Math",
x = "Baseline Math",
y = "Post Math",
color = "Intervention"
) +
theme_minimal()
p2 <- ggplot(data_clean, aes(x = baseline_reading, y = post_reading, color = intervention)) +
geom_point(alpha = 0.4, size = 2) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
labs(
title = "Regression Slopes: Post Reading",
x = "Baseline Reading",
y = "Post Reading",
color = "Intervention"
) +
theme_minimal()
p1 + p2 + plot_layout(guides = "collect")## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The interaction term between the covariate and intervention group is non-significant for both post-math (F(3, 112) = 0.97, p = .408) and post-reading (F(3, 112) = 0.69, p = .561), meaning we fail to reject H0. This confirms that the regression slopes are parallel across all four intervention groups for both outcomes: the relationship between baseline and post-test scores behaves consistently regardless of which intervention a student received. The assumption of homogeneity of regression slopes is therefore satisfied, and we can proceed with ANCOVA with confidence that the covariate adjustment is being implemented equally and fairly across all groups.
The ANCOVA successfully proved that the intervention works for Mathematics when controlling for prior knowledge. However, education is multidimensional. Did the math tutoring inadvertently hurt their reading scores? Did the flipped classroom boost both equally?
Because we want to know the impact of the intervention on the combined educational profile (Math AND Reading simultaneously) without necessarily controlling for baselines in every single scenario, our next logical step is to perform a Multivariate Analysis of Variance (MANOVA).
The goal here is to test whether intervention type simultaneously affects post_math and post_reading without controlling for any baseline. Unlike ANCOVA, no covariate adjustment is made, we are comparing the raw group differences across both outcomes at once.
manova_model <- manova(
cbind(post_math, post_reading) ~ intervention,
data = data_clean
)
cat('--- MANOVA: Pillai\'s Trace ---\n')## --- MANOVA: Pillai's Trace ---
## Df Pillai approx F num Df den Df Pr(>F)
## intervention 3 0.08326 1.6796 6 232 0.1268
## Residuals 116
## --- MANOVA: Wilks' Lambda ---
## Df Wilks approx F num Df den Df Pr(>F)
## intervention 3 0.91731 1.6904 6 230 0.1242
## Residuals 116
The MANOVA revealed a non-significant multivariate effect of intervention type on the combined outcome of post-test math and reading scores (Pillai’s Trace = 0.083, F(6, 232) = 1.68, p = .127; Wilks’ Lambda = 0.917, F(6, 230) = 1.69, p = .124). Both test statistics are consistent with each other and both exceed the α = 0.05 threshold. Therefore, we fail to reject H0, meaning that when baseline differences between students are not accounted for, the intervention groups do not appear to differ significantly in their combined academic outcomes.
This result is not surprising, as students entering different intervention programs likely had varying levels of prior knowledge that obscure the true effect of the instruction itself. This limitation motivates the transition to MANCOVA, where baseline math and reading scores are statistically controlled, allowing us to isolate the intervention’s genuine contribution to student outcomes.
After confirming a significant multivariate effect, we break it down to see which individual outcome (math or reading) is driving the result.
## Response post_math :
## Df Sum Sq Mean Sq F value Pr(>F)
## intervention 3 1748.1 582.69 3.1659 0.02714 *
## Residuals 116 21349.9 184.05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response post_reading :
## Df Sum Sq Mean Sq F value Pr(>F)
## intervention 3 213.4 71.129 0.4436 0.7223
## Residuals 116 18600.0 160.344
Breaking down the multivariate result into individual outcomes reveals that the two DVs behave differently. For post-math scores, the intervention effect is significant (F(3, 116) = 3.17, p = .027), meaning that even without baseline adjustment, intervention group membership is associated with different math outcomes. For post-reading scores however, the effect is clearly non-significant (F(3, 116) = 0.44, p = .722), suggesting that reading outcomes do not differ meaningfully across groups when baseline is ignored.
This asymmetry explains why the overall MANOVA was non-significant. The reading outcome carries no group signal, diluting the combined multivariate effect. It also reinforces why controlling for baseline matters: the math signal we see here may partly reflect pre-existing differences rather than the true effect of the intervention, which again motivates the use of MANCOVA.
Partial eta squared tells us how much variance in the combined outcome is explained by intervention group membership alone, before any baseline adjustment.
The partial eta squared value for intervention group membership is η² = 0.042, indicating that approximately 4.2% of the total variance in the combined post-test outcomes is explained by intervention type alone. By conventional benchmarks this is a small effect size, which is consistent with the non-significant overall MANOVA result. Notably, this unadjusted figure likely underestimates the true impact of the intervention, since a substantial portion of the outcome variance is driven by students’ pre-existing ability levels rather than the instruction they received. This further illustrates the value of MANCOVA, by partialling out baseline scores, the proportion of variance uniquely attributable to the intervention becomes much clearer.
Although the overall MANOVA was non-significant (p = .127), the univariate follow-up revealed a significant effect for post-math scores specifically (p = .027). As an exploratory exercise, we present Bonferroni-adjusted pairwise comparisons per outcome to examine which specific intervention pairs may be driving the math signal. These results should be interpreted with caution given the non-significant omnibus test, and are intended to illustrate the direction of group differences rather than serve as confirmatory evidence.
pairs(emmeans(lm(post_math ~ intervention, data = data_clean), ~ intervention), adjust = 'bonferroni')## contrast estimate SE df t.ratio p.value
## blended - control 9.96 3.5 116 2.844 0.0316
## blended - flipped 2.44 3.5 116 0.697 1.0000
## blended - tutoring 6.54 3.5 116 1.866 0.3873
## control - flipped -7.52 3.5 116 -2.148 0.2029
## control - tutoring -3.43 3.5 116 -0.978 1.0000
## flipped - tutoring 4.10 3.5 116 1.170 1.0000
##
## P value adjustment: bonferroni method for 6 tests
pairs(emmeans(lm(post_reading ~ intervention, data = data_clean), ~ intervention), adjust = 'bonferroni')## contrast estimate SE df t.ratio p.value
## blended - control 0.983 3.27 116 0.301 1.0000
## blended - flipped -2.130 3.27 116 -0.651 1.0000
## blended - tutoring 1.270 3.27 116 0.388 1.0000
## control - flipped -3.113 3.27 116 -0.952 1.0000
## control - tutoring 0.287 3.27 116 0.088 1.0000
## flipped - tutoring 3.400 3.27 116 1.040 1.0000
##
## P value adjustment: bonferroni method for 6 tests
emm_math_manova <- as.data.frame(emmeans(lm(post_math ~ intervention, data = data_clean), ~ intervention))
emm_reading_manova <- as.data.frame(emmeans(lm(post_reading ~ intervention, data = data_clean), ~ intervention))
emm_math_manova$outcome <- 'Post Math'
emm_reading_manova$outcome <- 'Post Reading'
names(emm_math_manova)[names(emm_math_manova) == 'emmean'] <- 'mean'
names(emm_reading_manova)[names(emm_reading_manova) == 'emmean'] <- 'mean'
emm_manova <- rbind(emm_math_manova, emm_reading_manova)
ggplot(emm_manova, aes(x = intervention, y = mean, fill = intervention)) +
geom_bar(stat = 'identity', width = 0.6) +
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
facet_wrap(~ outcome) +
labs(
title = 'Group Means by Intervention (MANOVA)',
x = 'Intervention', y = 'Mean Score'
) +
theme_minimal() +
theme(legend.position = 'none')For post-math scores, the only significant pairwise difference was between blended and control (difference = 9.96, p = .032), where the blended group scored notably higher. All other math comparisons were non-significant after Bonferroni adjustment, suggesting the remaining intervention pairs produce comparable math outcomes. For post-reading scores, no pairwise comparison reached significance (all p-values were 1.000) which is consistent with the non-significant univariate reading result from the follow-up. Visually, the bar chart confirms this pattern: math scores show more visible separation between groups, particularly blended vs. control, while reading scores are strikingly uniform across all four groups with heavily overlapping confidence intervals.
However, these group mean differences are unadjusted. They reflect whatever baseline ability differences already existed between students before the intervention began. It is entirely possible that the blended group simply started with higher math ability rather than benefiting more from the instruction.
This is the fundamental limitation of MANOVA, and precisely why we proceed to MANCOVA. By statistically controlling for each student’s baseline math and reading scores, we can determine whether these group differences hold up after leveling the playing field.
MANCOVA and MANOVA is the same method that answer a question about “how do the groups score differently across multiple outcomes?” The difference is that MANCOVA adjusts for each student started before comparing groups. Without that adjustment, there is risk giving the intervention credit for score differences that already existed before it even began.
dv_matrix <- cbind(data_clean$post_math, data_clean$post_reading)
mancova_model <- manova(
dv_matrix ~ baseline_math + baseline_reading + intervention,
data = data_clean
)
# Summary using Pillai's Trace (most robust)
summary(mancova_model, test = "Pillai")## Df Pillai approx F num Df den Df Pr(>F)
## baseline_math 1 0.89620 487.80 2 113 < 2.2e-16 ***
## baseline_reading 1 0.89113 462.45 2 113 < 2.2e-16 ***
## intervention 3 0.23931 5.16 6 228 5.318e-05 ***
## Residuals 114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Wilks approx F num Df den Df Pr(>F)
## baseline_math 1 0.10380 487.80 2 113 < 2.2e-16 ***
## baseline_reading 1 0.10887 462.45 2 113 < 2.2e-16 ***
## intervention 3 0.77138 5.22 6 226 4.708e-05 ***
## Residuals 114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After controlling for baseline math and reading scores, the MANCOVA revealed a significant multivariate effect of intervention type on the combined outcome of post-test math and reading scores (Pillai’s Trace = 0.239, F(6, 228) = 5.16, p < .001; Wilks’ Lambda = 0.771, F(6, 226) = 5.22, p < .001). Both covariates also showed significant multivariate effects, confirming that prior scores strongly predicted post-test performance.
These results indicate that students in different intervention groups performed differently overall, even after accounting for where they started. The significance of both test statistics adds confidence to this conclusion.
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline_math 1 20209.3 20209.3 963.1185 < 2.2e-16 ***
## baseline_reading 1 69.0 69.0 3.2882 0.0724085 .
## intervention 3 427.6 142.5 6.7922 0.0002967 ***
## Residuals 114 2392.1 21.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline_math 1 566.7 566.7 32.8887 8.14e-08 ***
## baseline_reading 1 16068.6 16068.6 932.6107 < 2.2e-16 ***
## intervention 3 213.9 71.3 4.1382 0.00797 **
## Residuals 114 1964.2 17.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Univariate follow-up tests showed that the intervention effect was significant for both post-test math (F(3, 114) = 6.79, p < .001) and post-test reading (F(3, 114) = 4.14, p = .008). Baseline math was the dominant predictor of post-math scores, while baseline reading was the dominant predictor of post-reading scores, each explaining the vast majority of variance in their respective outcomes.
This means the intervention did not just affect one subject area. Both math and reading outcomes differed across groups, suggesting the type of instruction had a broad academic impact rather than a narrow one.
Partial eta squared values indicated that the covariates accounted for a large proportion of variance: baseline math explained 88.1% of variance in the outcomes, and baseline reading explained 89.2%. The intervention itself explained a smaller but practically meaningful 11.9% of variance after controlling for baseline scores.
While the intervention effect size is modest compared to the covariates, a roughly 12% explained variance in an educational setting is considered meaningful, particularly given that prior achievement already dominates student outcomes.
# Adjusted means for post_math
emm_math <- emmeans(
lm(post_math ~ baseline_math + baseline_reading + intervention, data = data_clean),
~ intervention
)
pairs(emm_math, adjust = "bonferroni")## contrast estimate SE df t.ratio p.value
## blended - control 4.110 1.21 114 3.410 0.0054
## blended - flipped 1.802 1.19 114 1.520 0.7877
## blended - tutoring -0.823 1.21 114 -0.680 1.0000
## control - flipped -2.309 1.20 114 -1.929 0.3370
## control - tutoring -4.933 1.19 114 -4.163 0.0004
## flipped - tutoring -2.625 1.20 114 -2.181 0.1876
##
## P value adjustment: bonferroni method for 6 tests
# Adjusted means for post_reading
emm_reading <- emmeans(
lm(post_reading ~ baseline_math + baseline_reading + intervention, data = data_clean),
~ intervention
)
pairs(emm_reading, adjust = "bonferroni")## contrast estimate SE df t.ratio p.value
## blended - control 3.16 1.09 114 2.893 0.0274
## blended - flipped -0.24 1.07 114 -0.224 1.0000
## blended - tutoring 1.59 1.10 114 1.449 0.9010
## control - flipped -3.40 1.08 114 -3.136 0.0131
## control - tutoring -1.57 1.07 114 -1.464 0.8764
## flipped - tutoring 1.83 1.09 114 1.677 0.5781
##
## P value adjustment: bonferroni method for 6 tests
# Fit univariate models
model_math <- lm(post_math ~ baseline_math + baseline_reading + intervention, data = data_clean)
model_reading <- lm(post_reading ~ baseline_math + baseline_reading + intervention, data = data_clean)
# Extract adjusted means
emm_math <- as.data.frame(emmeans(model_math, ~ intervention))
emm_reading <- as.data.frame(emmeans(model_reading, ~ intervention))
emm_math$outcome <- "Post Math"
emm_reading$outcome <- "Post Reading"
# Rename emmean column to 'adjusted_mean'
names(emm_math)[names(emm_math) == "emmean"] <- "adjusted_mean"
names(emm_reading)[names(emm_reading) == "emmean"] <- "adjusted_mean"
emm_combined <- rbind(emm_math, emm_reading)
# Plot
ggplot(emm_combined, aes(x = intervention, y = adjusted_mean, fill = intervention)) +
geom_bar(stat = "identity", position = "dodge", width = 0.6) +
geom_errorbar(
aes(ymin = lower.CL, ymax = upper.CL),
width = 0.2
) +
facet_wrap(~ outcome) +
labs(
title = "Adjusted Means by Intervention Group",
subtitle = "Controlling for baseline math and reading scores",
x = "Intervention",
y = "Adjusted Mean Score",
fill = "Intervention"
) +
theme_minimal() +
theme(legend.position = "none")Bonferroni-adjusted pairwise comparisons for post-math scores showed that the blended group scored significantly higher than the control group (difference = 4.11, p = .005), and tutoring scored significantly higher than control (difference = 4.93, p < .001). For post-reading, blended outperformed control (difference = 3.16, p = .027), and flipped outperformed control (difference = 3.40, p = .013).
Across both outcomes, the control group consistently scored lower than the active intervention groups. No significant differences emerged between blended, flipped, and tutoring groups, suggesting these three approaches produced comparable results relative to one another.
mancova_interaction_test <- manova(
dv_matrix ~ baseline_math * intervention + baseline_reading * intervention,
data = data_clean
)
summary(mancova_interaction_test, test = "Pillai")## Df Pillai approx F num Df den Df Pr(>F)
## baseline_math 1 0.90397 503.61 2 107 < 2.2e-16 ***
## intervention 3 0.21139 4.25 6 216 0.0004503 ***
## baseline_reading 1 0.89771 469.53 2 107 < 2.2e-16 ***
## baseline_math:intervention 3 0.09457 1.79 6 216 0.1029986
## intervention:baseline_reading 3 0.03819 0.70 6 216 0.6491808
## Residuals 108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Math
p1 <- ggplot(data_clean, aes(x = baseline_math, y = post_math, color = intervention)) +
geom_point(alpha = 0.4, size = 2) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
labs(
title = "Regression Slopes: Math",
x = "Baseline Math",
y = "Post Math",
color = "Intervention"
) +
theme_minimal()
# Reading
p2 <- ggplot(data_clean, aes(x = baseline_reading, y = post_reading, color = intervention)) +
geom_point(alpha = 0.4, size = 2) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
labs(
title = "Regression Slopes: Reading",
x = "Baseline Reading",
y = "Post Reading",
color = "Intervention"
) +
theme_minimal()
p1 + p2 + plot_layout(guides = "collect")## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The interaction between each covariate and intervention group was non-significant for both baseline math (Pillai = 0.095, F(6, 216) = 1.79, p = .103) and baseline reading (Pillai = 0.038, F(6, 216) = 0.70, p = .649). This confirms that the assumption of homogeneity of regression slopes was met, meaning the relationship between baseline and post-test scores was consistent across all four intervention groups.
The regression slope plots visually support this finding. Lines across all four groups run roughly parallel for both math and reading, with no meaningful crossing or fanning, which validates the use of MANCOVA for this dataset.
ANOVA initially suggested that Math scores differed significantly across intervention groups while Reading showed no effect, but this was misleading because groups started with unequal baselines, particularly with the blended group having higher initial Math scores. ANCOVA addressed this by controlling for prior ability, revealing significant intervention effects in both Math (F(3,115)=6.15, p<.001, η²=.138) and Reading (F(3,115)=3.99, p=.010, η²=.094), with blended and tutoring outperforming control in Math, and blended and flipped outperforming control in Reading, though no intervention was superior to another.
In contrast, MANOVA without covariate adjustment found no significant multivariate effect (Pillai=.083, p=.127), highlighting how baseline differences can obscure results. Finally, MANCOVA, which controlled for both baseline Math and Reading, showed a strong multivariate intervention effect (Pillai=.239, F(6,228)=5.16, p<.001), confirming that the intervention significantly improved both outcomes, with assumptions such as homogeneity of regression slopes satisfied.
The final picture is this: prior academic ability is by far the strongest determinant of student outcomes, accounting for roughly 88 to 89 percent of variance in each subject. Yet even after that dominant influence is removed, the type of instruction a student received still explains an additional 12 percent of variance in their combined academic performance. Across both subjects and across every analytic framework that controlled for baseline, the control group consistently finished below every active intervention condition. Blended learning in particular showed statistically significant advantages over control in both Math and Reading, making it the most consistently effective method in this dataset, though tutoring and flipped classroom were competitive across specific outcomes as well.
To conclude, the educational intervention worked. It worked after leveling for prior ability, it worked across multiple outcomes simultaneously, and it worked according to every statistical framework designed to isolate its true contribution from pre-existing student differences. The recommendation that follows from this analysis is straightforward: passive control conditions should be replaced with active instructional methods, and future studies should prioritize baseline-adjusted designs from the outset to avoid systematically underestimating the effect of instruction.