This report presents an Analysis of Covariance (ANCOVA) applied to student academic performance data sourced from the Students Performance in Exams dataset which is publicly available on Kaggle. The dataset contains records of 1,000 students and captures their scores across three subjects namely mathematics, reading and writing along with several demographic and contextual background variables.
The dataset includes eight variables in total. Gender is recorded as female or male while race/ethnicity is grouped into five categories from Group A to Group E. Parental level of education ranges from some high school up to master’s degree and lunch type serves as a proxy for socioeconomic status distinguishing between standard and free or reduced lunch. Test preparation course completion is also recorded as either completed or none. The three outcome variables are math score, reading score and writing score each measured on a continuous scale from 0 to 100.
In this analysis gender serves as the independent variable and parental level of education serves as the covariate which is encoded ordinally from 1 for some high school up to 6 for master’s degree. The remaining variables such as race/ethnicity, lunch type and test preparation are not included in the current models but are acknowledged as potential confounders that may influence academic outcomes.
ANCOVA extends ANOVA by incorporating a covariate allowing for a more accurate estimation of group differences after controlling for external influences. The main objective of this analysis is to examine whether gender differences in academic performance exist after adjusting for parental education level. Three separate one-way ANCOVA models are fitted with one model per dependent variable:
model_info <- data.frame(
Model = c("Model 1", "Model 2", "Model 3"),
DV = c("Math Score", "Reading Score", "Writing Score"),
IV_Factor = c("Gender", "Gender", "Gender"),
Covariate = c("Parental Education (Ordinal)",
"Parental Education (Ordinal)",
"Parental Education (Ordinal)")
)
knitr::kable(model_info,
caption = "ANCOVA Model Specification",
align = "lccc") %>%
kableExtra::kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover"))| Model | DV | IV_Factor | Covariate |
|---|---|---|---|
| Model 1 | Math Score | Gender | Parental Education (Ordinal) |
| Model 2 | Reading Score | Gender | Parental Education (Ordinal) |
| Model 3 | Writing Score | Gender | Parental Education (Ordinal) |
The general model applied for each dependent variable can be expressed as & DV ~ parental_edu_ord + gender. This specification indicates that academic performance is modeled as a function of gender while adjusting for parental education level.
df_raw <- read.csv("StudentsPerformance.csv", stringsAsFactors = FALSE)
colnames(df_raw) <- c(
"gender", "race_ethnicity", "parental_education",
"lunch", "test_prep", "math_score", "reading_score", "writing_score"
)
cat(sprintf("Rows: %d | Columns: %d\n", nrow(df_raw), ncol(df_raw)))## Rows: 1000 | Columns: 8
kable(head(df_raw, 10), caption = "Preview: First 10 Rows of Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
scroll_box(width = "100%")| gender | race_ethnicity | parental_education | lunch | test_prep | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|
| female | group B | bachelor’s degree | standard | none | 72 | 72 | 74 |
| female | group C | some college | standard | completed | 69 | 90 | 88 |
| female | group B | master’s degree | standard | none | 90 | 95 | 93 |
| male | group A | associate’s degree | free/reduced | none | 47 | 57 | 44 |
| male | group C | some college | standard | none | 76 | 78 | 75 |
| female | group B | associate’s degree | standard | none | 71 | 83 | 78 |
| female | group B | some college | standard | completed | 88 | 95 | 92 |
| male | group B | some college | free/reduced | none | 40 | 43 | 39 |
| male | group D | high school | free/reduced | completed | 64 | 64 | 67 |
| female | group B | high school | free/reduced | none | 38 | 60 | 50 |
df_raw$gender <- factor(df_raw$gender)
edu_order <- c(
"some high school" = 1,
"high school" = 2,
"some college" = 3,
"associate's degree" = 4,
"bachelor's degree" = 5,
"master's degree" = 6
)
df_raw$parental_edu_ord <- as.integer(edu_order[df_raw$parental_education])
edu_mapping <- data.frame(
Category = names(edu_order),
Ordinal_Code = unname(edu_order)
)
knitr::kable(
edu_mapping,
caption = "Ordinal Encoding of Parental Education Level",
align = c("l", "c")
) %>%
kableExtra::kable_styling(
full_width = FALSE,
bootstrap_options = c("striped", "hover")
)| Category | Ordinal_Code |
|---|---|
| some high school | 1 |
| high school | 2 |
| some college | 3 |
| associate’s degree | 4 |
| bachelor’s degree | 5 |
| master’s degree | 6 |
group_df <- data.frame(
Factor = c(rep("Gender", 2), rep("Parental Education", 6)),
Level = c(levels(df_raw$gender), names(edu_order)),
n = c(as.integer(table(df_raw$gender)),
as.integer(table(factor(df_raw$parental_education,
levels = names(edu_order)))))
)
rownames(group_df) <- NULL
knitr::kable(
group_df,
caption = "Sample Distribution Across Factors",
col.names = c("Factor", "Level", "n"),
align = c("l", "l", "c")
) %>%
kableExtra::kable_styling(
full_width = FALSE,
bootstrap_options = c("striped", "hover")
)| Factor | Level | n |
|---|---|---|
| Gender | female | 518 |
| Gender | male | 482 |
| Parental Education | some high school | 179 |
| Parental Education | high school | 196 |
| Parental Education | some college | 226 |
| Parental Education | associate’s degree | 222 |
| Parental Education | bachelor’s degree | 118 |
| Parental Education | master’s degree | 59 |
dvs <- c("math_score", "reading_score", "writing_score")
desc_all <- data.frame(
Variable = dvs,
N = sapply(df_raw[dvs], length),
Mean = round(sapply(df_raw[dvs], mean), 3),
SD = round(sapply(df_raw[dvs], sd), 3),
Min = sapply(df_raw[dvs], min),
Q1 = sapply(df_raw[dvs], quantile, 0.25),
Median = sapply(df_raw[dvs], median),
Q3 = sapply(df_raw[dvs], quantile, 0.75),
Max = sapply(df_raw[dvs], max),
Skewness = round(sapply(df_raw[dvs], skewness), 3),
Kurtosis = round(sapply(df_raw[dvs], kurtosis), 3)
)
rownames(desc_all) <- NULL
knitr::kable(
desc_all,
caption = "Descriptive Statistics of Dependent Variables",
align = c("l", rep("c", 10))
) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = FALSE
) %>%
kableExtra::scroll_box(width = "100%")| Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|
| math_score | 1000 | 66.089 | 15.163 | 0 | 57.00 | 66 | 77 | 100 | -0.279 | 3.268 |
| reading_score | 1000 | 69.169 | 14.600 | 17 | 59.00 | 70 | 79 | 100 | -0.259 | 2.926 |
| writing_score | 1000 | 68.054 | 15.196 | 10 | 57.75 | 69 | 79 | 100 | -0.289 | 2.961 |
desc_gender <- lapply(dvs, function(dv) {
df_raw %>%
group_by(gender) %>%
summarise(
DV = dv,
N = n(),
Mean = round(mean(.data[[dv]]), 3),
Median = round(median(.data[[dv]]), 3),
SD = round(sd(.data[[dv]]), 3),
Min = min(.data[[dv]]),
Max = max(.data[[dv]]),
.groups = "drop"
)
}) %>% bind_rows()
knitr::kable(
desc_gender,
caption = "Descriptive Statistics per Gender x DV",
align = "lccccccc"
) %>%
kableExtra::kable_styling(
full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed"),
font_size = 11
)| gender | DV | N | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|---|---|
| female | math_score | 518 | 63.633 | 65 | 15.491 | 0 | 100 |
| male | math_score | 482 | 68.728 | 69 | 14.356 | 27 | 100 |
| female | reading_score | 518 | 72.608 | 73 | 14.378 | 17 | 100 |
| male | reading_score | 482 | 65.473 | 66 | 13.932 | 23 | 100 |
| female | writing_score | 518 | 72.467 | 74 | 14.845 | 10 | 100 |
| male | writing_score | 482 | 63.311 | 64 | 14.114 | 15 | 100 |
df_long <- df_raw %>%
select(math_score, reading_score, writing_score) %>%
pivot_longer(everything(), names_to = "Subject", values_to = "Score") %>%
mutate(Subject = case_when(
Subject == "math_score" ~ "Math",
Subject == "reading_score" ~ "Reading",
TRUE ~ "Writing"
))
ggplot(df_long, aes(x = Score, fill = Subject)) +
geom_histogram(aes(y = after_stat(density)), bins = 25,
color = "white", alpha = 0.75) +
geom_density(aes(color = Subject), linewidth = 0.9, fill = NA) +
facet_wrap(~Subject, ncol = 3) +
scale_fill_brewer(palette = "Set1", guide = "none") +
scale_color_brewer(palette = "Set1", guide = "none") +
labs(title = "Score Distributions by Subject",
subtitle = "Histogram with kernel density estimate",
x = "Score", y = "Density") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"))df_gender_long <- df_raw %>%
select(gender, math_score, reading_score, writing_score) %>%
pivot_longer(-gender, names_to = "Subject", values_to = "Score") %>%
mutate(Subject = case_when(
Subject == "math_score" ~ "Math",
Subject == "reading_score" ~ "Reading",
TRUE ~ "Writing"
))
ggplot(df_gender_long, aes(x = gender, y = Score, fill = gender)) +
geom_boxplot(outlier.shape = 21, outlier.size = 1.5, alpha = 0.75) +
geom_jitter(aes(color = gender), width = 0.15, alpha = 0.15, size = 0.8) +
facet_wrap(~Subject, ncol = 3) +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1", guide = "none") +
labs(title = "Score Distribution by Gender",
x = "Gender", y = "Score", fill = "Gender") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"))df_cov_long <- df_raw %>%
select(parental_edu_ord, math_score, reading_score, writing_score) %>%
pivot_longer(-parental_edu_ord, names_to = "Subject", values_to = "Score") %>%
mutate(Subject = case_when(
Subject == "math_score" ~ "Math",
Subject == "reading_score" ~ "Reading",
TRUE ~ "Writing"
))
ggplot(df_cov_long, aes(x = parental_edu_ord, y = Score, color = Subject)) +
geom_jitter(alpha = 0.25, size = 0.9, width = 0.15) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
facet_wrap(~Subject, ncol = 3) +
scale_color_brewer(palette = "Set1", guide = "none") +
scale_x_continuous(breaks = 1:6,
labels = c("Some HS", "HS", "Some Col", "Assoc", "Bach", "Master")) +
labs(title = "Parental Education (Covariate) vs. Academic Scores",
subtitle = "Linear trend with 95% confidence band",
x = "Parental Education Level (Ordinal)", y = "Score") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 20, hjust = 1))The three ANCOVA models are fitted here early so all objects are available throughout the document, including in assumption tests and interpretation sections.
model_math <- lm(math_score ~ parental_edu_ord + gender, data = df_raw)
model_reading <- lm(reading_score ~ parental_edu_ord + gender, data = df_raw)
model_writing <- lm(writing_score ~ parental_edu_ord + gender, data = df_raw)
models_list <- list(
"math_score" = model_math,
"reading_score" = model_reading,
"writing_score" = model_writing
)
cat("Models fitted successfully.\n")## Models fitted successfully.
cat(sprintf(" Model 1: math_score ~ parental_edu_ord + gender | df = %d\n",
df.residual(model_math)))## Model 1: math_score ~ parental_edu_ord + gender | df = 997
cat(sprintf(" Model 2: reading_score ~ parental_edu_ord + gender | df = %d\n",
df.residual(model_reading)))## Model 2: reading_score ~ parental_edu_ord + gender | df = 997
cat(sprintf(" Model 3: writing_score ~ parental_edu_ord + gender | df = %d\n",
df.residual(model_writing)))## Model 3: writing_score ~ parental_edu_ord + gender | df = 997
Theoretical Basis:
ANCOVA assumes a linear relationship between the covariate (Parental Education) and the dependent variable (Student Scores). This assumption is critical because ANCOVA uses linear regression to “adjust” the group means based on the covariate. If the relationship is non-linear, the resulting adjusted means (EMMs) will be biased, and the statistical power of the test will be compromised.
Statistical Hypotheses:
\(H_0: \rho = 0\): There is no linear relationship between the covariate and the dependent variable.
\(H_1: \rho \neq 0\): There is a significant linear relationship between the covariate and the dependent variable.
lin_results <- lapply(dvs, function(dv) {
label <- case_when(
dv == "math_score" ~ "Math Score",
dv == "reading_score" ~ "Reading Score",
dv == "writing_score" ~ "Writing Score"
)
ct <- cor.test(df_raw$parental_edu_ord, df_raw[[dv]])
data.frame(
DV = label,
r = round(unname(ct$estimate), 4),
t_stat = round(unname(ct$statistic), 4),
p_value = round(ct$p.value, 6),
Sig = ifelse(ct$p.value < 0.001, "***",
ifelse(ct$p.value < 0.01, "**",
ifelse(ct$p.value < 0.05, "*", "ns")))
)
}) %>% bind_rows()
rownames(lin_results) <- NULL
r_math <- lin_results$r[lin_results$DV == "Math Score"]
r_reading <- lin_results$r[lin_results$DV == "Reading Score"]
r_writing <- lin_results$r[lin_results$DV == "Writing Score"]
kable(lin_results,
caption = "Pearson Correlation: Parental Education (Covariate) vs. DV",
col.names = c("Dependent Variable", "r", "t statistic", "p-value", "Sig.")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),, full_width = FALSE)| Dependent Variable | r | t statistic | p-value | Sig. |
|---|---|---|---|---|
| Math Score | 0.1594 | 5.1019 | 0 | *** |
| Reading Score | 0.1909 | 6.1440 | 0 | *** |
| Writing Score | 0.2367 | 7.6969 | 0 | *** |
p_lin <- lapply(dvs, function(dv) {
label <- case_when(dv == "math_score" ~ "Math",
dv == "reading_score" ~ "Reading", TRUE ~ "Writing")
ggplot(df_raw, aes(x = parental_edu_ord, y = .data[[dv]])) +
geom_jitter(alpha = 0.3, color = "#2E86AB", width = 0.15, size = 1.2) +
geom_smooth(method = "lm", se = TRUE, color = "#E84855", linewidth = 1.1) +
scale_x_continuous(breaks = 1:6) +
labs(title = paste("Linearity:", label),
x = "Parental Education (Ordinal)", y = paste(label, "Score")) +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"))
})
grid.arrange(grobs = p_lin, ncol = 3)Interpretation: Pearson correlations are all significant (p < 0.001): math r = 0.1594, reading r = 0.1909, writing r = 0.2367. The linearity assumption is met for all models.
Theoretical Basis:
The Homogeneity of Regression Slopes assumption requires that the relationship between the covariate (Parental Education) and the dependent variable (Student Scores) is consistent across all levels of the independent variable (Gender). Visually, this means the regression lines for each group should be approximately parallel. If an interaction exists, it indicates that the effect of parental education on academic performance varies depending on the student’s gender, which would render the ANCOVA model’s main effects uninterpretable.
Statistical Hypotheses:
H0: The regression slopes are equal across groups (No interaction between the IV and Covariate).
H1: The regression slopes are significantly different (A significant interaction exists).
slope_results <- lapply(dvs, function(dv) {
model_int <- lm(as.formula(paste(dv, "~ parental_edu_ord * gender")),
data = df_raw)
aov_int <- anova(model_int)
int_row <- aov_int["parental_edu_ord:gender", ]
data.frame(
DV = dv,
F_value = round(int_row$`F value`, 4),
p_value = round(int_row$`Pr(>F)`, 4),
Decision = ifelse(int_row$`Pr(>F)` > 0.05,
"Homogeneous", "Heterogeneous (ANCOVA invalid)")
)
}) %>% bind_rows()
rownames(slope_results) <- NULL
kable(slope_results,
caption = "Homogeneity of Regression Slopes: Interaction Test",
col.names = c("DV", "F value", "p-value", "Decision")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
row_spec(which(grepl("OK", slope_results$Decision)), color = "darkgreen")| DV | F value | p-value | Decision |
|---|---|---|---|
| math_score | 0.8128 | 0.3675 | Homogeneous |
| reading_score | 0.4402 | 0.5072 | Homogeneous |
| writing_score | 0.1621 | 0.6873 | Homogeneous |
p_slope <- lapply(dvs, function(dv) {
label <- case_when(dv == "math_score" ~ "Math",
dv == "reading_score" ~ "Reading", TRUE ~ "Writing")
ggplot(df_raw, aes(x = parental_edu_ord, y = .data[[dv]], color = gender)) +
geom_jitter(alpha = 0.25, width = 0.15, size = 1.1) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
scale_color_brewer(palette = "Set1") +
scale_x_continuous(breaks = 1:6) +
labs(title = paste("Regression Slopes by Gender:", label),
subtitle = "Parallel lines = homogeneous slopes",
x = "Parental Education (Ordinal)",
y = paste(label, "Score"), color = "Gender") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"))
})
grid.arrange(grobs = p_slope, ncol = 3)Interpretation: The interaction term parental_edu_ord x gender is not significant (p > 0.05) for all DVs. Slopes are parallel across gender groups. ANCOVA is valid to proceed.
Theoretical Basis:
ANCOVA assumes that the residuals (the differences between the observed values and the values predicted by the model) are normally distributed. This is essential for the validity of the \(F\)-test and the accuracy of the \(p\)-values. However, under the Central Limit Theorem, ANCOVA is generally robust to violations of normality when the sample size is sufficiently large (e.g., \(N > 30\) per group), as is the case in this study (\(N = 1000\)).
Statistical Hypotheses:
We aim to fail to reject \(H_0\) (\(p > 0.05\)). If \(H_0\) is rejected, we rely on the large sample size and visual inspection of Q-Q plots to justify the robustness of the model.
norm_results <- lapply(names(models_list), function(nm) {
resids <- residuals(models_list[[nm]])
sw <- shapiro.test(resids)
label <- case_when(nm == "math_score" ~ "Math",
nm == "reading_score" ~ "Reading", TRUE ~ "Writing")
data.frame(
DV = label,
W = round(sw$statistic, 5),
p_value = round(sw$p.value, 6),
Decision = ifelse(sw$p.value > 0.05, "Normal",
"Non-normal (robust for n=1000)")
)
}) %>% bind_rows()
rownames(norm_results) <- NULL
kable(norm_results,
caption = "Shapiro-Wilk Test: Normality of Residuals",
col.names = c("DV", "W statistic", "p-value", "Decision")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| DV | W statistic | p-value | Decision |
|---|---|---|---|
| Math | 0.99551 | 0.004961 | Non-normal (robust for n=1000) |
| Reading | 0.99307 | 0.000129 | Non-normal (robust for n=1000) |
| Writing | 0.99040 | 0.000004 | Non-normal (robust for n=1000) |
plot_list_norm <- lapply(names(models_list), function(nm) {
label <- case_when(nm == "math_score" ~ "Math",
nm == "reading_score" ~ "Reading", TRUE ~ "Writing")
resids <- data.frame(resid = residuals(models_list[[nm]]))
p_qq <- ggplot(resids, aes(sample = resid)) +
stat_qq(alpha = 0.5, size = 1.1, color = "#2E86AB") +
stat_qq_line(color = "#E84855", linewidth = 0.9) +
labs(title = paste("Q-Q Plot:", label),
x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"))
p_hist <- ggplot(resids, aes(x = resid)) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
fill = "#2E86AB", color = "white", alpha = 0.75) +
geom_density(color = "#E84855", linewidth = 1) +
labs(title = paste("Residual Distribution:", label),
x = "Residuals", y = "Density") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"))
list(p_qq, p_hist)
})
grid.arrange(
plot_list_norm[[1]][[1]], plot_list_norm[[1]][[2]],
plot_list_norm[[2]][[1]], plot_list_norm[[2]][[2]],
plot_list_norm[[3]][[1]], plot_list_norm[[3]][[2]],
ncol = 2
)Interpretation: Shapiro-Wilk yields p < 0.05, but with n = 1,000 the Central Limit Theorem ensures approximate normality. ANCOVA is robust at this sample size.
Theoretical Basis:
This assumption, also known as homoscedasticity, requires that the variance of the dependent variable is equal across all groups (Gender). If the variances are significantly different, the \(F\)-test becomes unreliable because the error term will be biased, potentially leading to Type I or Type II errors.
Statistical Hypotheses:
We aim to fail to reject \(H_0\) (\(p > 0.05\)). A non-significant Levene’s test indicates that the assumption of equal variances is met, ensuring the stability of the ANCOVA results.
lev_results <- lapply(dvs, function(dv) {
lt <- leveneTest(df_raw[[dv]] ~ df_raw$gender)
label <- case_when(dv == "math_score" ~ "Math",
dv == "reading_score" ~ "Reading", TRUE ~ "Writing")
data.frame(
DV = label,
F_value = round(lt$`F value`[1], 4),
p_value = round(lt$`Pr(>F)`[1], 4),
Decision = ifelse(lt$`Pr(>F)`[1] > 0.05, "Met", "Violated")
)
}) %>% bind_rows()
rownames(lev_results) <- NULL
kable(lev_results,
caption = "Levene's Test: Homogeneity of Variance",
col.names = c("DV", "F value", "p-value", "Decision")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
row_spec(which(lev_results$Decision == "Met"), color = "darkgreen")| DV | F value | p-value | Decision |
|---|---|---|---|
| Math | 0.3464 | 0.5563 | Met |
| Reading | 0.0188 | 0.8911 | Met |
| Writing | 0.0069 | 0.9336 | Met |
ggplot(df_gender_long, aes(x = gender, y = Score, fill = gender)) +
geom_boxplot(outlier.shape = 21, outlier.size = 1.5, alpha = 0.75) +
stat_summary(fun = mean, geom = "point", shape = 23,
size = 3, fill = "white", color = "black") +
facet_wrap(~Subject, ncol = 3) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Variance Comparison by Gender",
subtitle = "Diamond = group mean",
x = "Gender", y = "Score", fill = "Gender") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"))Interpretation: All three DVs pass Levene’s test (p > 0.05). Homogeneity of variance is met.
assump_summary <- data.frame(
No = 1:4,
Assumption = c("Linearity (Covariate-DV)",
"Homogeneity of Regression Slopes",
"Normality of Residuals",
"Homogeneity of Variance (Levene)"),
Test = c("Pearson Correlation",
"Covariate x IV Interaction (F-test)",
"Shapiro-Wilk + Q-Q Plot",
"Levene's Test"),
Math = c(paste0("r = ", r_math, ", p < .001"),
paste0("F = ", slope_results$F_value[1], ", p = ", slope_results$p_value[1]),
paste0("W = ", norm_results$W[norm_results$DV == "Math"],
", robust n=1000"),
paste0("F = ", lev_results$F_value[lev_results$DV == "Math"],
", p = ", lev_results$p_value[lev_results$DV == "Math"])),
Reading = c(paste0("r = ", r_reading, ", p < .001"),
paste0("F = ", slope_results$F_value[2], ", p = ", slope_results$p_value[2]),
paste0("W = ", norm_results$W[norm_results$DV == "Reading"],
", robust n=1000"),
paste0("F = ", lev_results$F_value[lev_results$DV == "Reading"],
", p = ", lev_results$p_value[lev_results$DV == "Reading"])),
Writing = c(paste0("r = ", r_writing, ", p < .001"),
paste0("F = ", slope_results$F_value[3], ", p = ", slope_results$p_value[3]),
paste0("W = ", norm_results$W[norm_results$DV == "Writing"],
", robust n=1000"),
paste0("F = ", lev_results$F_value[lev_results$DV == "Writing"],
", p = ", lev_results$p_value[lev_results$DV == "Writing"])),
Status = c("Met", "Met", "Robust (n=1000)", "Met")
)
kable(assump_summary,
caption = "ANCOVA Assumption Testing Summary",
col.names = c("No.", "Assumption", "Test", "Math", "Reading",
"Writing", "Status")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
scroll_box(width = "100%")| No. | Assumption | Test | Math | Reading | Writing | Status |
|---|---|---|---|---|---|---|
| 1 | Linearity (Covariate-DV) | Pearson Correlation | r = 0.1594, p < .001 | r = 0.1909, p < .001 | r = 0.2367, p < .001 | Met |
| 2 | Homogeneity of Regression Slopes | Covariate x IV Interaction (F-test) | F = 0.8128, p = 0.3675 | F = 0.4402, p = 0.5072 | F = 0.1621, p = 0.6873 | Met |
| 3 | Normality of Residuals | Shapiro-Wilk + Q-Q Plot | W = 0.99551, robust n=1000 | W = 0.99307, robust n=1000 | W = 0.9904, robust n=1000 | Robust (n=1000) |
| 4 | Homogeneity of Variance (Levene) | Levene’s Test | F = 0.3464, p = 0.5563 | F = 0.0188, p = 0.8911 | F = 0.0069, p = 0.9336 | Met |
anova_math <- anova(model_math)
kable(round(anova_math, 4),
caption = "ANCOVA Table: Math Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| parental_edu_ord | 1 | 5838.353 | 5838.3529 | 26.8484 | 0 |
| gender | 1 | 7046.756 | 7046.7562 | 32.4054 | 0 |
| Residuals | 997 | 216803.970 | 217.4563 | NA | NA |
sm_math <- summary(model_math)
fit_math <- data.frame(
Metric = c("R-squared", "Adjusted R-squared", "F statistic", "p-value (overall)"),
Value = c(round(sm_math$r.squared, 4),
round(sm_math$adj.r.squared, 4),
round(sm_math$fstatistic[1], 3),
round(pf(sm_math$fstatistic[1], sm_math$fstatistic[2],
sm_math$fstatistic[3], lower.tail = FALSE), 6))
)
kable(fit_math, caption = "Model Fit: Math Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Metric | Value |
|---|---|
| R-squared | 0.0561 |
| Adjusted R-squared | 0.0542 |
| F statistic | 29.6270 |
| p-value (overall) | 0.0000 |
coef_math <- as.data.frame(summary(model_math)$coefficients)
coef_math$Term <- rownames(coef_math)
coef_math <- coef_math %>%
select(Term, Estimate, `Std. Error`, `t value`, `Pr(>|t|)`) %>%
mutate(across(where(is.numeric), ~round(.x, 4)),
Sig = ifelse(`Pr(>|t|)` < 0.001, "***",
ifelse(`Pr(>|t|)` < 0.01, "**",
ifelse(`Pr(>|t|)` < 0.05, "*", "ns"))))
rownames(coef_math) <- NULL
kable(coef_math,
caption = "Regression Coefficients: Model 1 (Math Score)",
col.names = c("Term", "Estimate", "SE", "t value", "p-value", "Sig.")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Term | Estimate | SE | t value | p-value | Sig. |
|---|---|---|---|---|---|
| (Intercept) | 58.1791 | 1.1958 | 48.6524 | 0 | *** |
| parental_edu_ord | 1.7354 | 0.3198 | 5.4266 | 0 | *** |
| gendermale | 5.3177 | 0.9342 | 5.6926 | 0 | *** |
eta_math <- eta_squared(model_math, partial = TRUE)
kable(as.data.frame(eta_math) %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Partial Eta-Squared: Model 1 (Math Score)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Parameter | Eta2_partial | CI | CI_low | CI_high |
|---|---|---|---|---|
| parental_edu_ord | 0.0262 | 0.95 | 0.0123 | 1 |
| gender | 0.0315 | 0.95 | 0.0161 | 1 |
emm_math_obj <- emmeans(model_math, ~ gender)
emm_math_df <- as.data.frame(emm_math_obj)
kable(emm_math_df %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Estimated Marginal Means: Math Score (adjusted for parental education)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| gender | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| female | 63.5259 | 0.6482 | 997 | 62.2538 | 64.7979 |
| male | 68.8436 | 0.6720 | 997 | 67.5249 | 70.1623 |
contrast_math <- as.data.frame(pairs(emm_math_obj))
kable(contrast_math %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Pairwise Contrast: Female vs. Male, Math Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| female - male | -5.3177 | 0.9342 | 997 | -5.6926 | 0 |
anova_reading <- anova(model_reading)
kable(round(anova_reading, 4),
caption = "ANCOVA Table: Reading Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| parental_edu_ord | 1 | 7761.257 | 7761.2573 | 40.0278 | 0 |
| gender | 1 | 11876.017 | 11876.0168 | 61.2491 | 0 |
| Residuals | 997 | 193315.165 | 193.8969 | NA | NA |
sm_reading <- summary(model_reading)
fit_reading <- data.frame(
Metric = c("R-squared", "Adjusted R-squared", "F statistic", "p-value (overall)"),
Value = c(round(sm_reading$r.squared, 4),
round(sm_reading$adj.r.squared, 4),
round(sm_reading$fstatistic[1], 3),
round(pf(sm_reading$fstatistic[1], sm_reading$fstatistic[2],
sm_reading$fstatistic[3], lower.tail = FALSE), 6))
)
kable(fit_reading, caption = "Model Fit: Reading Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Metric | Value |
|---|---|
| R-squared | 0.0922 |
| Adjusted R-squared | 0.0904 |
| F statistic | 50.6380 |
| p-value (overall) | 0.0000 |
coef_reading <- as.data.frame(summary(model_reading)$coefficients)
coef_reading$Term <- rownames(coef_reading)
coef_reading <- coef_reading %>%
select(Term, Estimate, `Std. Error`, `t value`, `Pr(>|t|)`) %>%
mutate(across(where(is.numeric), ~round(.x, 4)),
Sig = ifelse(`Pr(>|t|)` < 0.001, "***",
ifelse(`Pr(>|t|)` < 0.01, "**",
ifelse(`Pr(>|t|)` < 0.05, "*", "ns"))))
rownames(coef_reading) <- NULL
kable(coef_reading,
caption = "Regression Coefficients: Model 2 (Reading Score)",
col.names = c("Term", "Estimate", "SE", "t value", "p-value", "Sig.")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Term | Estimate | SE | t value | p-value | Sig. |
|---|---|---|---|---|---|
| (Intercept) | 66.9357 | 1.1292 | 59.2784 | 0 | *** |
| parental_edu_ord | 1.8048 | 0.3020 | 5.9768 | 0 | *** |
| gendermale | -6.9035 | 0.8821 | -7.8262 | 0 | *** |
eta_reading <- eta_squared(model_reading, partial = TRUE)
kable(as.data.frame(eta_reading) %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Partial Eta-Squared: Model 2 (Reading Score)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Parameter | Eta2_partial | CI | CI_low | CI_high |
|---|---|---|---|---|
| parental_edu_ord | 0.0386 | 0.95 | 0.0214 | 1 |
| gender | 0.0579 | 0.95 | 0.0366 | 1 |
emm_reading_obj <- emmeans(model_reading, ~ gender)
emm_reading_df <- as.data.frame(emm_reading_obj)
kable(emm_reading_df %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Estimated Marginal Means: Reading Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| gender | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| female | 72.4965 | 0.6121 | 997 | 71.2953 | 73.6976 |
| male | 65.5930 | 0.6346 | 997 | 64.3478 | 66.8383 |
contrast_reading <- as.data.frame(pairs(emm_reading_obj))
kable(contrast_reading %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Pairwise Contrast: Female vs. Male, Reading Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| female - male | 6.9035 | 0.8821 | 997 | 7.8262 | 0 |
anova_writing <- anova(model_writing)
kable(round(anova_writing, 4),
caption = "ANCOVA Table: Writing Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| parental_edu_ord | 1 | 12925.78 | 12925.7766 | 65.0192 | 0 |
| gender | 1 | 19548.24 | 19548.2353 | 98.3314 | 0 |
| Residuals | 997 | 198203.07 | 198.7995 | NA | NA |
sm_writing <- summary(model_writing)
fit_writing <- data.frame(
Metric = c("R-squared", "Adjusted R-squared", "F statistic", "p-value (overall)"),
Value = c(round(sm_writing$r.squared, 4),
round(sm_writing$adj.r.squared, 4),
round(sm_writing$fstatistic[1], 3),
round(pf(sm_writing$fstatistic[1], sm_writing$fstatistic[2],
sm_writing$fstatistic[3], lower.tail = FALSE), 6))
)
kable(fit_writing, caption = "Model Fit: Writing Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Metric | Value |
|---|---|
| R-squared | 0.1408 |
| Adjusted R-squared | 0.1391 |
| F statistic | 81.6750 |
| p-value (overall) | 0.0000 |
coef_writing <- as.data.frame(summary(model_writing)$coefficients)
coef_writing$Term <- rownames(coef_writing)
coef_writing <- coef_writing %>%
select(Term, Estimate, `Std. Error`, `t value`, `Pr(>|t|)`) %>%
mutate(across(where(is.numeric), ~round(.x, 4)),
Sig = ifelse(`Pr(>|t|)` < 0.001, "***",
ifelse(`Pr(>|t|)` < 0.01, "**",
ifelse(`Pr(>|t|)` < 0.05, "*", "ns"))))
rownames(coef_writing) <- NULL
kable(coef_writing,
caption = "Regression Coefficients: Model 3 (Writing Score)",
col.names = c("Term", "Estimate", "SE", "t value", "p-value", "Sig.")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Term | Estimate | SE | t value | p-value | Sig. |
|---|---|---|---|---|---|
| (Intercept) | 65.1445 | 1.1434 | 56.9762 | 0 | *** |
| parental_edu_ord | 2.3300 | 0.3058 | 7.6200 | 0 | *** |
| gendermale | -8.8570 | 0.8932 | -9.9162 | 0 | *** |
eta_writing <- eta_squared(model_writing, partial = TRUE)
kable(as.data.frame(eta_writing) %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Partial Eta-Squared: Model 3 (Writing Score)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Parameter | Eta2_partial | CI | CI_low | CI_high |
|---|---|---|---|---|
| parental_edu_ord | 0.0612 | 0.95 | 0.0393 | 1 |
| gender | 0.0898 | 0.95 | 0.0636 | 1 |
emm_writing_obj <- emmeans(model_writing, ~ gender)
emm_writing_df <- as.data.frame(emm_writing_obj)
kable(emm_writing_df %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Estimated Marginal Means: Writing Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| gender | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| female | 72.3231 | 0.6198 | 997 | 71.1068 | 73.5393 |
| male | 63.4661 | 0.6425 | 997 | 62.2052 | 64.7270 |
contrast_writing <- as.data.frame(pairs(emm_writing_obj))
kable(contrast_writing %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Pairwise Contrast: Female vs. Male, Writing Score") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| female - male | 8.857 | 0.8932 | 997 | 9.9162 | 0 |
extract_anova_row <- function(aov_table, term, dv_name) {
data.frame(
DV = dv_name,
Term = term,
Df = aov_table[term, "Df"],
SS = round(aov_table[term, "Sum Sq"], 3),
MS = round(aov_table[term, "Mean Sq"], 3),
F_value = round(aov_table[term, "F value"], 3),
p_value = round(aov_table[term, "Pr(>F)"], 4),
Sig = ifelse(aov_table[term, "Pr(>F)"] < 0.001, "***",
ifelse(aov_table[term, "Pr(>F)"] < 0.01, "**",
ifelse(aov_table[term, "Pr(>F)"] < 0.05, "*", "ns")))
)
}
full_comparison <- bind_rows(
extract_anova_row(anova_math, "parental_edu_ord", "Math Score"),
extract_anova_row(anova_math, "gender", "Math Score"),
extract_anova_row(anova_reading, "parental_edu_ord", "Reading Score"),
extract_anova_row(anova_reading, "gender", "Reading Score"),
extract_anova_row(anova_writing, "parental_edu_ord", "Writing Score"),
extract_anova_row(anova_writing, "gender", "Writing Score")
)
kable(full_comparison,
caption = "ANCOVA Results: All Three Models",
col.names = c("DV", "Term", "df", "SS", "MS", "F", "p-value", "Sig.")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
row_spec(which(full_comparison$Term == "gender"), background = "#f0f8ff")| DV | Term | df | SS | MS | F | p-value | Sig. |
|---|---|---|---|---|---|---|---|
| Math Score | parental_edu_ord | 1 | 5838.353 | 5838.353 | 26.848 | 0 | *** |
| Math Score | gender | 1 | 7046.756 | 7046.756 | 32.405 | 0 | *** |
| Reading Score | parental_edu_ord | 1 | 7761.257 | 7761.257 | 40.028 | 0 | *** |
| Reading Score | gender | 1 | 11876.017 | 11876.017 | 61.249 | 0 | *** |
| Writing Score | parental_edu_ord | 1 | 12925.777 | 12925.777 | 65.019 | 0 | *** |
| Writing Score | gender | 1 | 19548.235 | 19548.235 | 98.331 | 0 | *** |
model_fit_all <- data.frame(
DV = c("Math Score", "Reading Score", "Writing Score"),
R2 = round(c(sm_math$r.squared, sm_reading$r.squared, sm_writing$r.squared), 4),
Adj_R2 = round(c(sm_math$adj.r.squared, sm_reading$adj.r.squared, sm_writing$adj.r.squared), 4),
F_overall = round(c(sm_math$fstatistic[1], sm_reading$fstatistic[1], sm_writing$fstatistic[1]), 3),
p_overall = round(c(
pf(sm_math$fstatistic[1], sm_math$fstatistic[2], sm_math$fstatistic[3], lower.tail = FALSE),
pf(sm_reading$fstatistic[1], sm_reading$fstatistic[2], sm_reading$fstatistic[3], lower.tail = FALSE),
pf(sm_writing$fstatistic[1], sm_writing$fstatistic[2], sm_writing$fstatistic[3], lower.tail = FALSE)
), 6)
)
kable(model_fit_all,
caption = "Model Fit Statistics: All Three ANCOVA Models",
col.names = c("DV", "R-squared", "Adj. R-squared",
"F (overall)", "p (overall)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| DV | R-squared | Adj. R-squared | F (overall) | p (overall) |
|---|---|---|---|---|
| Math Score | 0.0561 | 0.0542 | 29.627 | 0 |
| Reading Score | 0.0922 | 0.0904 | 50.638 | 0 |
| Writing Score | 0.1408 | 0.1391 | 81.675 | 0 |
# eta_all built here, before it is referenced in the Interpretation section
eta_all <- bind_rows(
as.data.frame(eta_math) %>% mutate(DV = "Math Score"),
as.data.frame(eta_reading) %>% mutate(DV = "Reading Score"),
as.data.frame(eta_writing) %>% mutate(DV = "Writing Score")
) %>%
mutate(
Magnitude = case_when(
Eta2_partial >= 0.14 ~ "Large",
Eta2_partial >= 0.06 ~ "Medium",
Eta2_partial >= 0.01 ~ "Small",
TRUE ~ "Negligible"
)
)
kable(eta_all %>% mutate(across(where(is.numeric), ~round(.x, 4))),
caption = "Partial Eta-Squared: All Three ANCOVA Models") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Parameter | Eta2_partial | CI | CI_low | CI_high | DV | Magnitude |
|---|---|---|---|---|---|---|
| parental_edu_ord | 0.0262 | 0.95 | 0.0123 | 1 | Math Score | Small |
| gender | 0.0315 | 0.95 | 0.0161 | 1 | Math Score | Small |
| parental_edu_ord | 0.0386 | 0.95 | 0.0214 | 1 | Reading Score | Small |
| gender | 0.0579 | 0.95 | 0.0366 | 1 | Reading Score | Small |
| parental_edu_ord | 0.0612 | 0.95 | 0.0393 | 1 | Writing Score | Medium |
| gender | 0.0898 | 0.95 | 0.0636 | 1 | Writing Score | Medium |
eta_gender <- eta_all %>% filter(Parameter == "gender")
ggplot(eta_gender, aes(x = DV, y = Eta2_partial, fill = DV)) +
geom_col(alpha = 0.85, color = "white", width = 0.55) +
geom_text(aes(label = paste0("eta2 = ", round(Eta2_partial, 4))),
vjust = -0.5, size = 3.8, fontface = "bold") +
geom_hline(yintercept = 0.01, linetype = "dashed", color = "gray60") +
geom_hline(yintercept = 0.06, linetype = "dashed", color = "#E84855") +
geom_hline(yintercept = 0.14, linetype = "dashed", color = "#2E86AB") +
annotate("text", x = 0.6, y = 0.014, label = "Small (0.01)", size = 2.8, color = "gray50") +
annotate("text", x = 0.6, y = 0.074, label = "Medium (0.06)", size = 2.8, color = "#E84855") +
annotate("text", x = 0.6, y = 0.154, label = "Large (0.14)", size = 2.8, color = "#2E86AB") +
scale_fill_brewer(palette = "Set1", guide = "none") +
ylim(0, 0.20) +
labs(title = "Gender Effect Size (Partial eta-squared) per ANCOVA Model",
subtitle = "After controlling for parental education level",
x = "Dependent Variable", y = "Partial eta-squared") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))cohens_d_df <- lapply(dvs, function(dv) {
female <- df_raw[[dv]][df_raw$gender == "female"]
male <- df_raw[[dv]][df_raw$gender == "male"]
pooled_sd <- sqrt(((length(female) - 1) * var(female) +
(length(male) - 1) * var(male)) /
(length(female) + length(male) - 2))
d <- (mean(female) - mean(male)) / pooled_sd
data.frame(
DV = case_when(dv == "math_score" ~ "Math",
dv == "reading_score" ~ "Reading", TRUE ~ "Writing"),
Mean_Female = round(mean(female), 2),
Mean_Male = round(mean(male), 2),
Difference = round(mean(female) - mean(male), 2),
Cohen_d = round(d, 4),
Magnitude = ifelse(abs(d) >= 0.8, "Large",
ifelse(abs(d) >= 0.5, "Medium",
ifelse(abs(d) >= 0.2, "Small", "Negligible")))
)
}) %>% bind_rows()
rownames(cohens_d_df) <- NULL
kable(cohens_d_df,
caption = "Cohen's d: Unadjusted Gender Effect per DV",
col.names = c("DV", "Mean Female", "Mean Male",
"Difference", "Cohen's d", "Magnitude")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| DV | Mean Female | Mean Male | Difference | Cohen’s d | Magnitude |
|---|---|---|---|---|---|
| Math | 63.63 | 68.73 | -5.10 | -0.3407 | Small |
| Reading | 72.61 | 65.47 | 7.14 | 0.5037 | Medium |
| Writing | 72.47 | 63.31 | 9.16 | 0.6316 | Medium |
emm_all_plot <- bind_rows(
as.data.frame(emm_math_obj) %>% mutate(DV = "Math", Type = "Adjusted"),
as.data.frame(emm_reading_obj) %>% mutate(DV = "Reading", Type = "Adjusted"),
as.data.frame(emm_writing_obj) %>% mutate(DV = "Writing", Type = "Adjusted")
)
raw_means <- df_raw %>%
group_by(gender) %>%
summarise(
Math = mean(math_score),
Reading = mean(reading_score),
Writing = mean(writing_score),
.groups = "drop"
) %>%
pivot_longer(-gender, names_to = "DV", values_to = "emmean") %>%
mutate(
Type = "Unadjusted",
lower.CL = emmean - 1.96 * 2,
upper.CL = emmean + 1.96 * 2
)
combined <- bind_rows(
emm_all_plot %>% select(gender, DV, emmean, lower.CL, upper.CL, Type),
raw_means %>% select(gender, DV, emmean, lower.CL, upper.CL, Type)
)
ggplot(combined, aes(x = gender, y = emmean, fill = Type,
ymin = lower.CL, ymax = upper.CL)) +
geom_col(position = position_dodge(0.7), width = 0.6,
alpha = 0.85, color = "white") +
geom_errorbar(position = position_dodge(0.7), width = 0.2, linewidth = 0.7) +
facet_wrap(~DV, ncol = 3) +
scale_fill_manual(values = c("Adjusted" = "#2E86AB", "Unadjusted" = "#E84855")) +
labs(title = "Adjusted vs. Unadjusted Means by Gender",
subtitle = "Adjusted = parental education held at mean",
x = "Gender", y = "Mean Score", fill = "Type") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"))diag_plots <- lapply(names(models_list), function(nm) {
label <- case_when(nm == "math_score" ~ "Math",
nm == "reading_score" ~ "Reading", TRUE ~ "Writing")
resid_df <- data.frame(
Fitted = fitted(models_list[[nm]]),
Residuals = residuals(models_list[[nm]])
)
p_rf <- ggplot(resid_df, aes(x = Fitted, y = Residuals)) +
geom_point(alpha = 0.3, size = 1, color = "#2E86AB") +
geom_hline(yintercept = 0, color = "#E84855",
linetype = "dashed", linewidth = 0.8) +
geom_smooth(method = "loess", se = FALSE,
color = "gray40", linewidth = 0.7) +
labs(title = paste(label, "Residuals vs Fitted"),
x = "Fitted", y = "Residuals") +
theme_minimal(base_size = 9) +
theme(plot.title = element_text(face = "bold", size = 9))
p_qq <- ggplot(resid_df, aes(sample = Residuals)) +
stat_qq(alpha = 0.45, size = 1, color = "#2E86AB") +
stat_qq_line(color = "#E84855", linewidth = 0.8) +
labs(title = paste(label, "Normal Q-Q"),
x = "Theoretical", y = "Sample") +
theme_minimal(base_size = 9) +
theme(plot.title = element_text(face = "bold", size = 9))
list(p_rf, p_qq)
})
grid.arrange(
diag_plots[[1]][[1]], diag_plots[[1]][[2]],
diag_plots[[2]][[1]], diag_plots[[2]][[2]],
diag_plots[[3]][[1]], diag_plots[[3]][[2]],
ncol = 2
)final_summary <- data.frame(
Model = c("Model 1: Math Score",
"Model 2: Reading Score",
"Model 3: Writing Score"),
F_Covariate = c(f_cov_math, f_cov_reading, f_cov_writing),
p_Covariate = c("< 0.001", "< 0.001", "< 0.001"),
F_Gender = c(f_gender_math, f_gender_reading, f_gender_writing),
p_Gender = c("< 0.001", "< 0.001", "< 0.001"),
Eta2 = c(eta_g_math, eta_g_reading, eta_g_writing),
Direction = c("Male > Female", "Female > Male", "Female > Male"),
Effect = c("Small", "Small-Medium", "Medium")
)
kable(final_summary,
caption = "Final ANCOVA Summary: All Three Models",
col.names = c("Model", "F (Covariate)", "p (Covariate)",
"F (Gender)", "p (Gender)",
"Partial eta2", "Direction", "Effect Size")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
scroll_box(width = "100%")| Model | F (Covariate) | p (Covariate) | F (Gender) | p (Gender) | Partial eta2 | Direction | Effect Size |
|---|---|---|---|---|---|---|---|
| Model 1: Math Score | 26.848 | < 0.001 | 32.405 | < 0.001 | 0.0315 | Male > Female | Small |
| Model 2: Reading Score | 40.028 | < 0.001 | 61.249 | < 0.001 | 0.0579 | Female > Male | Small-Medium |
| Model 3: Writing Score | 65.019 | < 0.001 | 98.331 | < 0.001 | 0.0898 | Female > Male | Medium |
The ANCOVA summary indicates that both parental education and gender have a statistically significant effect on all academic outcomes (Math, Reading and Writing) with p-values < 0.001. In Model 1 (Math Score), parental education shows a significant effect (F = 26.848), and gender is also significant (F = 32.405), where male students score higher in mathematics after controlling for the covariate, although the effect size is small (η² = 0.0315). In Model 2 (Reading Score), parental education has a stronger influence (F = 40.028) and gender remains significant (F = 61.249), with female students outperforming male students in reading. The effect size falls in the small to small-to-medium range (η² = 0.0579).
In Model 3 (Writing Score), parental education has the strongest effect among all models (F = 65.019), while gender also shows its largest influence here (F = 98.331), with female students again outperforming male students. The effect size for gender is in the medium range (η² = 0.0898) that indicate a more practically meaningful difference in writing compared to the other domains. Overall, although gender differences remain statistically significant across all academic subjects after controlling for parental education, the magnitude of these effects is relatively small to moderate, suggesting that other unmeasured factors may also contribute to students’ academic performance.
note_table <- data.frame(
No = 1:5,
Note = c(
"Ordinal encoding of parental education assumes equal intervals",
"Normality of residuals technically violated",
"All interaction terms non-significant",
"Dataset is cross-sectional and simulated",
"Lunch type and test prep not controlled"
),
Implication = c(
"May not capture non-linear educational returns",
"Robust at n = 1,000 due to Central Limit Theorem",
"ANCOVA valid; slopes parallel across gender groups",
"Causal claims not warranted",
"See Full Factorial MANCOVA for more comprehensive model"
)
)
knitr::kable(note_table,
caption = "Methodological Notes",
align = "c") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
) %>%
kableExtra::column_spec(2, width = "40%") %>%
kableExtra::column_spec(3, width = "40%")| No | Note | Implication |
|---|---|---|
| 1 | Ordinal encoding of parental education assumes equal intervals | May not capture non-linear educational returns |
| 2 | Normality of residuals technically violated | Robust at n = 1,000 due to Central Limit Theorem |
| 3 | All interaction terms non-significant | ANCOVA valid; slopes parallel across gender groups |
| 4 | Dataset is cross-sectional and simulated | Causal claims not warranted |
| 5 | Lunch type and test prep not controlled | See Full Factorial MANCOVA for more comprehensive model |
package_table <- data.frame(
Package = c("tidyverse",
"knitr + kableExtra",
"car",
"emmeans",
"effectsize",
"gridExtra",
"moments"),
Purpose = c("Data wrangling and visualization",
"Table formatting",
"Levene's test",
"Estimated marginal means and pairwise contrasts",
"Partial eta-squared",
"Multi-panel plot layout",
"Skewness and kurtosis")
)
kable(package_table,
caption = "R Packages Used in the Analysis",
align = c("l", "l")) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
)| Package | Purpose |
|---|---|
| tidyverse | Data wrangling and visualization |
| knitr + kableExtra | Table formatting |
| car | Levene’s test |
| emmeans | Estimated marginal means and pairwise contrasts |
| effectsize | Partial eta-squared |
| gridExtra | Multi-panel plot layout |
| moments | Skewness and kurtosis |
Date: 20 April 2026
S1 Data Science | FMIPA UNESA | 2026