First step to exploratory data analysis (EDA) is to load the necessary libraries for different data handling.
The libraries used in this analysis support different stages of the EDA process. The readxl package is used to import the dataset from Excel, while tidyverse provides essential tools for data manipulation and visualization, including ggplot2. The janitor package helps clean and standardize column names, making the data easier to work with. Additionally, skimr is used for quick summary statistics, naniar for identifying and visualizing missing data, GGally for exploring relationships between variables, and scales for improving the readability of plots. Together, these libraries enable efficient data cleaning, exploration, and visualization.
library(readxl) # loading dataaset from Excel to R
library(tidyverse) # data wrangling + ggplot2
library(janitor) # cleaning column names
library(skimr) # quick data summary
library(naniar) # missing data summaries/plots
library(GGally) # pair plots/correlation style visuals
library(scales) # better axis labels
Of course, EDA requires a dataset, so this step involves loading the dataset that I got from the National Assessment of Educational Progress (NAEP) Data Explorer website. The website provides national and state assessments results in all core subjects that are assessed. Since I am a middle school teacher and one of the subjects I teach is mathematics, I picked to get data in both national and New York state level in 8th grade mathematics, especially Algebra. The data provides insights of the average scale score of different groups of students that are categorized by race and ethnicity which will be comparable and easy to find trends.
NAEP_8Math <- read_excel("NDECoreExcel_Mathematics.xlsx",
skip=7,
n_max=206)
Before moving on to EDA, let’s check if the dataset loaded correctly:
head(NAEP_8Math, 10)
## # A tibble: 10 × 4
## Year Jurisdiction Race/ethnicity used to report tren…¹ `Average scale score`
## <chr> <chr> <chr> <chr>
## 1 2024 National White 293.27971245966
## 2 2024 National Black 260.59177267955101
## 3 2024 National Hispanic 265.84186852976899
## 4 2024 National Asian/Pacific Islander 315.183504238252
## 5 2024 National American Indian/Alaska Native 257.90389371060098
## 6 2024 National Two or more races 286.27011213518199
## 7 2024 New York White 292.82165256179502
## 8 2024 New York Black 266.66589978477498
## 9 2024 New York Hispanic 264.56017274169801
## 10 2024 New York Asian/Pacific Islander 314.30916955953899
## # ℹ abbreviated name: ¹​`Race/ethnicity used to report trends, school-reported`
glimpse(NAEP_8Math)
## Rows: 204
## Columns: 4
## $ Year <chr> "2024", "2024"…
## $ Jurisdiction <chr> "National", "N…
## $ `Race/ethnicity used to report trends, school-reported` <chr> "White", "Blac…
## $ `Average scale score` <chr> "293.279712459…
The data set has 204 observations and 4 variables. The variables are
the year when the assessment is performed, the
jurisdiction, which is where the results of the tests are
produced either nationally or in the state. The next variable is race
and ethnicity of the participant students. Then there is the average
scale score.
This section focuses on preparing the dataset for analysis by cleaning column names, correcting data types, and handling missing values. Proper data cleaning ensures that the dataset is accurate, consistent, and suitable for meaningful analysis.
The column names were simplified to remove spaces and special characters, making them easier to reference and manipulate during analysis. This step improves code readability and reduces potential errors.
NAEP_8Math <- NAEP_8Math %>%
rename(Race_Ethnicity = `Race/ethnicity used to report trends, school-reported`) %>%
clean_names()
head(NAEP_8Math) #to check the new column names if they are clean
## # A tibble: 6 × 4
## year jurisdiction race_ethnicity average_scale_score
## <chr> <chr> <chr> <chr>
## 1 2024 National White 293.27971245966
## 2 2024 National Black 260.59177267955101
## 3 2024 National Hispanic 265.84186852976899
## 4 2024 National Asian/Pacific Islander 315.183504238252
## 5 2024 National American Indian/Alaska Native 257.90389371060098
## 6 2024 National Two or more races 286.27011213518199
glimpse(NAEP_8Math) shows that the variables year and
average_scale_score were originally stored as character values so they
need to be converted to numeric formats. This allows for proper
statistical analysis and visualization of trends over time.
NAEP_8Math <- NAEP_8Math %>%
clean_names() %>%
mutate(
year_raw = year,
average_scale_score = na_if(average_scale_score, "‡"),
average_scale_score = na_if(average_scale_score, "—"),
year = gsub("[^0-9]", "", year),
year = as.integer(year),
average_scale_score = as.numeric(average_scale_score)
)
glimpse(NAEP_8Math) # to check if the type of the data is converted correctly
## Rows: 204
## Columns: 5
## $ year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
## $ jurisdiction <chr> "National", "National", "National", "National", "N…
## $ race_ethnicity <chr> "White", "Black", "Hispanic", "Asian/Pacific Islan…
## $ average_scale_score <dbl> 293.2797, 260.5918, 265.8419, 315.1835, 257.9039, …
## $ year_raw <chr> "2024", "2024", "2024", "2024", "2024", "2024", "2…
Missing values, represented by special symbols in the original dataset, were removed to ensure accurate analysis. A cleaned subset of the dataset was created to preserve the original data while enabling reliable computations and visualizations.
NAEP_clean <- NAEP_8Math %>%
drop_na(average_scale_score)
summary(NAEP_clean)
## year jurisdiction race_ethnicity average_scale_score
## Min. :1990 Length:158 Length:158 Min. :234.8
## 1st Qu.:2000 Class :character Class :character 1st Qu.:265.6
## Median :2009 Mode :character Mode :character Median :274.8
## Mean :2008 Mean :278.3
## 3rd Qu.:2017 3rd Qu.:293.2
## Max. :2024 Max. :320.1
## year_raw
## Length:158
## Class :character
## Mode :character
##
##
##
colSums(is.na(NAEP_clean))
## year jurisdiction race_ethnicity average_scale_score
## 0 0 0 0
## year_raw
## 0
Now that the dataset is clean and ready for analysis, I can start the EDA.
The cleaned dataset contains observations from 1990 to 2024 across two jurisdictions: National and New York. It includes multiple racial and ethnic groups, allowing for meaningful comparisons and trend analysis across both time and demographic categories.
dim(NAEP_clean)
## [1] 158 5
unique(NAEP_clean$jurisdiction)
## [1] "National" "New York"
unique(NAEP_clean$race_ethnicity)
## [1] "White" "Black"
## [3] "Hispanic" "Asian/Pacific Islander"
## [5] "American Indian/Alaska Native" "Two or more races"
range(NAEP_clean$year)
## [1] 1990 2024
The distribution of average scale scores (histogram below) shows that most values fall within a moderate range, with a slight spread across groups. The boxplots reveal differences in score distributions among racial and ethnic groups, with some groups consistently scoring higher than others.
ggplot(NAEP_clean, aes(x = average_scale_score)) +
geom_histogram(binwidth = 5)
ggplot(NAEP_clean, aes(y = average_scale_score)) +
geom_boxplot()
ggplot(NAEP_clean,
aes(x = reorder(race_ethnicity,
average_scale_score, median),
y = average_scale_score)) +
geom_boxplot() +
coord_flip()
Now, let’s look at the trends of the scores at national level and at the state of New York level.
The trend analysis shows how average scale scores have changed over time for different racial and ethnic groups. Overall, scores have generally improved from 1990 to the early 2010s, followed by slight declines in recent years. Additionally, consistent gaps between groups can be observed, with some groups maintaining higher performance levels over time.
Notably, Asian/Pacific Islander students consistently achieve the highest scores, while other groups show gradual improvement but remain below this level.
NAEP_clean %>%
filter(jurisdiction == "National") %>%
ggplot(aes(x = year, y = average_scale_score,
color = race_ethnicity)) +
geom_line() +
geom_point()
NAEP_clean %>%
filter(jurisdiction == "New York") %>%
ggplot(aes(x = year, y = average_scale_score,
color = race_ethnicity)) +
geom_line() +
geom_point()
These trends suggest that improvements over time have not been equally distributed across all groups, highlighting persistent structural disparities in educational outcomes.
The comparison between National and New York results highlights similarities and differences across groups. While overall trends are comparable, some variations suggest that New York performs slightly differently for certain groups, indicating regional differences in educational outcomes.
NAEP_clean %>%
filter(year == 2024) %>%
ggplot(aes(x = race_ethnicity,
y = average_scale_score,
fill = jurisdiction)) +
geom_col(position = "dodge") +
coord_flip()
The summary statistics indicate that the average score is approximately 278, with a moderate spread around the mean. The relatively close values of the mean and median suggest that the distribution is fairly balanced without extreme skewness.
NAEP_clean %>%
summarise(
mean = mean(average_scale_score),
median = median(average_scale_score),
sd = sd(average_scale_score),
min = min(average_scale_score),
max = max(average_scale_score)
)
## # A tibble: 1 × 5
## mean median sd min max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 278. 275. 19.7 235. 320.
The boxplot analysis does not reveal extreme outliers, but it highlights variations between groups. These differences likely reflect genuine disparities in performance rather than data errors. Furthermore, these variations are consistent with the patterns observed in the previous line graphs, which show persistent differences between groups over time.
ggplot(NAEP_clean,
aes(x = race_ethnicity,
y = average_scale_score)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
The achievement gap analysis reveals persistent differences in performance between groups over time. The gap between White and Black students, for example, remains consistent across years, indicating that disparities in educational outcomes have not been fully addressed despite overall improvements.
Although all groups show some improvement over time, the persistence of these gaps suggests systemic challenges that require targeted educational interventions.
NAEP_clean <- NAEP_8Math %>%
filter(!grepl("¹", year_raw)) %>%
drop_na(average_scale_score)
gap <- NAEP_clean %>%
filter(race_ethnicity %in% c("White", "Black")) %>%
select(year, jurisdiction,
race_ethnicity,
average_scale_score) %>%
pivot_wider(names_from = race_ethnicity,
values_from = average_scale_score) %>%
mutate(gap = White - Black)
ggplot(gap, aes(x = year, y = gap, color = jurisdiction)) +
geom_line() +
geom_point()
Correlation analysis is limited in this dataset because most variables are categorical, so the most meaningful relationship examined is the trend between year and average scale score within demographic groups. There is no clear data leakage concern in this EDA because the variables describe observed assessment outcomes rather than future-known predictors. For data preparation, missing-value symbols should be retained as NA and removed only when necessary for analysis, while superscript-year duplicate rows should be excluded carefully. Because some demographic groups have fewer reported observations, subgroup imbalance should be acknowledged when interpreting results. Since the dataset contains only a few variables, dimension reduction is not necessary, and group-based comparisons are more informative than reducing features.
My use of LLM consisted of correcting some codes when I encountered errors, one example is the following prompt:
It turns out that I got that error
because of the duplicate year value after cleaning. When I do
pivot_wider(), there was more than one score for the same
combination of year, jurisdiction and race_ethnicity and R doesn’t know
what to use. So I had to remove the rows with superscript years.
My other use of the LLM was to paste my R Markdown into the chat along with the assignment and the rubric. I then prompted the LLM to check my work against the assignment requirements.