This analysis explores patterns in a fictional student performance dataset. We investigate how variables such as gender, parental education, lunch type, and test preparation relate to academic scores in math, reading, and writing. Visualizations are created using base R, ggplot2, and plotly.
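# Packages inferred from the functions used below (melt() is assumed to come from reshape2)
library(tidyverse); library(janitor); library(reshape2)
students <- read_csv("C:/Users/vyrus/Documents/Senior Year/25 Spring/Data Vis/final proj/Expanded_data_with_more_features.csv") %>%
  clean_names()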
## New names:
## • `` -> `...1`
## Rows: 30641 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Gender, EthnicGroup, ParentEduc, LunchType, TestPrep, ParentMarita...
## dbl  (5): ...1, NrSiblings, MathScore, ReadingScore, WritingScore
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Reshape for long format where needed
long_scores <- melt(students,
                    id.vars = c("gender", "parent_educ", "lunch_type", "test_prep"),
                    measure.vars = c("math_score", "reading_score", "writing_score"),
                    variable.name = "subject",
                    value.name = "score")
avg_scores <- aggregate(score ~ test_prep + subject, data = long_scores, mean)
Investigation: Does completing a test preparation course lead to better scores in math, reading, and writing?
{
  # Prepare the height matrix (rows = subjects, columns = test prep status)
  bar_heights <- tapply(avg_scores$score, list(avg_scores$subject, avg_scores$test_prep), identity)
  # Widen the right margin and allow drawing outside the plot region for the legend
  par(mar = c(5, 4, 4, 8), xpd = TRUE)
  # Create the grouped barplot
  barplot(height = bar_heights,
          beside = TRUE,
          col = c("skyblue", "orange", "seagreen"),
          main = "Average Scores by Test Prep Course",
          xlab = "Test Preparation Status",
          ylab = "Average Score")
  # Add the legend outside the plot area
  legend("topright", inset = c(-0.25, 0),  # adjust as needed
         legend = rownames(bar_heights),
         fill = c("skyblue", "orange", "seagreen"),
         bty = "n")  # no box around the legend
}
Discussion: The bar plot shows higher average scores for students who completed the test preparation course than for those who did not, in all three subjects (math, reading, and writing). The gap is most pronounced in writing and reading, where the prep group’s averages exceed 75 while those without preparation sit closer to 65–70. Math scores are also higher, though by a slightly smaller margin. This is consistent with the investigation’s hypothesis: completing a test preparation course is positively associated with academic outcomes, which suggests, but does not by itself establish, that such interventions boost student performance across multiple domains.
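To put numbers behind the visual comparison, the gap between the two groups can be computed per subject. The following is a minimal sketch, assuming `test_prep` takes the values "completed" and "none" (the actual labels are not shown above). `prep_gap()` is a hypothetical helper, and the `toy` data frame holds made-up illustration values, not the real averages.

```r
# Hypothetical helper: difference in mean score (completed minus none) per subject.
# Assumes a data frame shaped like avg_scores: columns test_prep, subject, score.
prep_gap <- function(avg, done = "completed", not_done = "none") {
  # Build a subject x test_prep matrix of mean scores
  m <- tapply(avg$score, list(avg$subject, avg$test_prep), mean)
  # Named vector with one gap per subject
  m[, done] - m[, not_done]
}

# Made-up numbers, purely to show the shape of the result
toy <- data.frame(
  test_prep = rep(c("completed", "none"), each = 3),
  subject   = rep(c("math_score", "reading_score", "writing_score"), times = 2),
  score     = c(70, 74, 75, 65, 67, 66)
)
gaps <- prep_gap(toy)
print(round(gaps, 1))
```

On the real data, `prep_gap(avg_scores)` would return the three gaps directly, provided the group labels match the assumed ones.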
Investigation: How does a parent’s level of education affect student performance in different subjects?
long_scores %>%
  ggplot(aes(x = parent_educ, y = score, fill = subject)) +
  geom_boxplot() +
  coord_flip() +
  facet_wrap(~ subject) +
  theme_minimal() +
  labs(title = "Scores by Parental Education Level",
       x = "Parental Education",
       y = "Test Score")
Discussion: This set of boxplots illustrates how student performance varies by parental education level across three subjects: math, reading, and writing. Setting aside entries with missing data, a clear trend emerges: students whose parents attained higher levels of education tend to achieve better scores in all subjects. Notably, those whose parents have bachelor’s or master’s degrees show higher medians and less score variability than students whose parents completed only high school or some college. This consistent pattern suggests that parental education is positively associated with academic performance, possibly reflecting greater academic support, expectations, or resources available at home. The data highlight educational background as a key contextual factor in student achievement.
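The claims about medians and variability can be checked numerically rather than read off the boxes. Below is a minimal base R sketch, assuming a long data frame shaped like `long_scores`. `spread_summary()` is a hypothetical helper, and the `toy` values and education labels are made up for illustration.

```r
# Hypothetical helper: median and IQR of score per parental-education level and subject.
# Assumes columns parent_educ, subject, score (as in long_scores).
# Note: aggregate() with a formula drops rows with missing values by default.
spread_summary <- function(long) {
  med <- aggregate(score ~ parent_educ + subject, data = long, FUN = median)
  iqr <- aggregate(score ~ parent_educ + subject, data = long, FUN = IQR)
  names(med)[3] <- "median"
  names(iqr)[3] <- "iqr"
  # One row per education level and subject, with both statistics side by side
  merge(med, iqr, by = c("parent_educ", "subject"))
}

# Made-up scores, purely to show the shape of the result
toy <- data.frame(
  parent_educ = rep(c("high school", "master's degree"), each = 4),
  subject     = "math_score",
  score       = c(50, 60, 70, 80, 70, 74, 76, 80)
)
res <- spread_summary(toy)
print(res)
```

A smaller `iqr` for a group corresponds to a narrower box in the plot, so `spread_summary(long_scores)` gives a direct way to compare variability across education levels.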
Investigation: Which factors outside of studying are associated with higher test scores? How do sibling count and test preparation each relate to average scores?
students <- students %>%
  mutate(avg_score = (math_score + reading_score + writing_score) / 3)
clean_students <- students %>%
  filter(!is.na(nr_siblings), !is.na(test_prep))
library(plotly)
plot_ly(data = clean_students,
        x = ~interaction(nr_siblings, test_prep),
        y = ~avg_score,
        color = ~test_prep,
        type = "box",
        boxpoints = "all",
        jitter = 0.3,
        pointpos = -1.8,
        hoverinfo = "text",
        text = ~paste("Siblings:", nr_siblings,
                      "<br>Prep:", test_prep,
                      "<br>Avg Score:", round(avg_score, 1))) %>%
  layout(title = "Average Test Scores by Sibling Count and Test Preparation",
         xaxis = list(title = "Sibling Count + Test Prep",
                      tickangle = -45),
         yaxis = list(title = "Average Test Score"))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Discussion: This boxplot presents the distribution of average test scores (the mean of math, reading, and writing) across sibling counts, split by whether students completed a test preparation course. A consistent trend emerges: students who completed the test prep course generally outperform those who did not, regardless of how many siblings they have. The median scores for the “completed” group are visibly higher across all sibling categories, with tighter interquartile ranges suggesting less variability in their performance. While the number of siblings introduces some fluctuation in the score distribution, it does not appear to strongly influence performance on its own. This visualization is consistent with the hypothesis that test preparation is more strongly associated with average scores than sibling count is.
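The comparison between the two factors can be made concrete by tabulating the median of each box. This is a rough sketch, assuming a data frame shaped like `clean_students` and the group labels "completed" and "none" (the second label is an assumption). `group_medians()` is a hypothetical helper, and the `toy` values are invented for illustration.

```r
# Hypothetical helper: median avg_score for every sibling-count x test-prep cell.
# Assumes columns nr_siblings, test_prep, avg_score (as in clean_students).
group_medians <- function(df) {
  tapply(df$avg_score,
         list(siblings = df$nr_siblings, prep = df$test_prep),
         median)
}

# Made-up scores, purely to show the shape of the result
toy <- data.frame(
  nr_siblings = rep(c(0, 1, 2), times = 4),
  test_prep   = rep(c("completed", "none"), each = 6),
  avg_score   = c(72, 74, 73, 76, 74, 75, 65, 66, 64, 67, 68, 66)
)
med <- group_medians(toy)
print(med)

# Gap between prep groups within each sibling count
prep_gap_by_sibs <- med[, "completed"] - med[, "none"]
print(prep_gap_by_sibs)

# Range of medians across sibling counts within each prep group
sib_spread_by_prep <- apply(med, 2, function(col) diff(range(col)))
print(sib_spread_by_prep)
```

If the prep gaps are consistently larger than the spread across sibling counts, that would agree with the reading of the plot given above. A regression on both variables would be a more formal check.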