1 EDA

First step to exploratory data analysis (EDA) is to load the necessary libraries for different data handling.

1.1 Loading the necessary libraries

The libraries used in this analysis support different stages of the EDA process. The readxl package is used to import the dataset from Excel, while tidyverse provides essential tools for data manipulation and visualization, including ggplot2. The janitor package helps clean and standardize column names, making the data easier to work with. Additionally, skimr is used for quick summary statistics, naniar for identifying and visualizing missing data, GGally for exploring relationships between variables, and scales for improving the readability of plots. Together, these libraries enable efficient data cleaning, exploration, and visualization.

library(readxl)      # loading dataaset from Excel to R
library(tidyverse)   # data wrangling + ggplot2
library(janitor)     # cleaning column names
library(skimr)       # quick data summary
library(naniar)      # missing data summaries/plots
library(GGally)      # pair plots/correlation style visuals
library(scales)      # better axis labels

1.2 Loading the dataset

Of course, EDA requires a dataset, so this step involves loading the dataset that I got from the National Assessment of Educational Progress (NAEP) Data Explorer website. The website provides national and state assessments results in all core subjects that are assessed. Since I am a middle school teacher and one of the subjects I teach is mathematics, I picked to get data in both national and New York state level in 8th grade mathematics, especially Algebra. The data provides insights of the average scale score of different groups of students that are categorized by race and ethnicity which will be comparable and easy to find trends.

NAEP_8Math <- read_excel("NDECoreExcel_Mathematics.xlsx",
                         skip=7,
                         n_max=206)

Before moving on to EDA, let’s check if the dataset loaded correctly:

head(NAEP_8Math, 10)
## # A tibble: 10 × 4
##    Year  Jurisdiction Race/ethnicity used to report tren…¹ `Average scale score`
##    <chr> <chr>        <chr>                                <chr>                
##  1 2024  National     White                                293.27971245966      
##  2 2024  National     Black                                260.59177267955101   
##  3 2024  National     Hispanic                             265.84186852976899   
##  4 2024  National     Asian/Pacific Islander               315.183504238252     
##  5 2024  National     American Indian/Alaska Native        257.90389371060098   
##  6 2024  National     Two or more races                    286.27011213518199   
##  7 2024  New York     White                                292.82165256179502   
##  8 2024  New York     Black                                266.66589978477498   
##  9 2024  New York     Hispanic                             264.56017274169801   
## 10 2024  New York     Asian/Pacific Islander               314.30916955953899   
## # ℹ abbreviated name: ¹​`Race/ethnicity used to report trends, school-reported`
glimpse(NAEP_8Math)
## Rows: 204
## Columns: 4
## $ Year                                                    <chr> "2024", "2024"…
## $ Jurisdiction                                            <chr> "National", "N…
## $ `Race/ethnicity used to report trends, school-reported` <chr> "White", "Blac…
## $ `Average scale score`                                   <chr> "293.279712459…

The data set has 204 observations and 4 variables. The variables are the year when the assessment is performed, the jurisdiction, which is where the results of the tests are produced either nationally or in the state. The next variable is race and ethnicity of the participant students. Then there is the average scale score.

1.3 Cleaning the Data

This section focuses on preparing the dataset for analysis by cleaning column names, correcting data types, and handling missing values. Proper data cleaning ensures that the dataset is accurate, consistent, and suitable for meaningful analysis.

1.3.1 Clean column names

The column names were simplified to remove spaces and special characters, making them easier to reference and manipulate during analysis. This step improves code readability and reduces potential errors.

NAEP_8Math <- NAEP_8Math %>%
  rename(Race_Ethnicity = `Race/ethnicity used to report trends, school-reported`) %>%
  clean_names()
head(NAEP_8Math) #to check the new column names if they are clean
## # A tibble: 6 × 4
##   year  jurisdiction race_ethnicity                average_scale_score
##   <chr> <chr>        <chr>                         <chr>              
## 1 2024  National     White                         293.27971245966    
## 2 2024  National     Black                         260.59177267955101 
## 3 2024  National     Hispanic                      265.84186852976899 
## 4 2024  National     Asian/Pacific Islander        315.183504238252   
## 5 2024  National     American Indian/Alaska Native 257.90389371060098 
## 6 2024  National     Two or more races             286.27011213518199

1.3.2 Convert data types

glimpse(NAEP_8Math) shows that the variables year and average_scale_score were originally stored as character values so they need to be converted to numeric formats. This allows for proper statistical analysis and visualization of trends over time.

NAEP_8Math <- NAEP_8Math %>%
  clean_names() %>%
  mutate(
    year_raw = year,
    average_scale_score = na_if(average_scale_score, "‡"),
    average_scale_score = na_if(average_scale_score, "—"),
    year = gsub("[^0-9]", "", year),
    year = as.integer(year),
    average_scale_score = as.numeric(average_scale_score)
  )
glimpse(NAEP_8Math) # to check if the type of the data is converted correctly
## Rows: 204
## Columns: 5
## $ year                <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
## $ jurisdiction        <chr> "National", "National", "National", "National", "N…
## $ race_ethnicity      <chr> "White", "Black", "Hispanic", "Asian/Pacific Islan…
## $ average_scale_score <dbl> 293.2797, 260.5918, 265.8419, 315.1835, 257.9039, …
## $ year_raw            <chr> "2024", "2024", "2024", "2024", "2024", "2024", "2…

1.3.3 Remove missing values

Missing values, represented by special symbols in the original dataset, were removed to ensure accurate analysis. A cleaned subset of the dataset was created to preserve the original data while enabling reliable computations and visualizations.

NAEP_clean <- NAEP_8Math %>%
  drop_na(average_scale_score)
summary(NAEP_clean)
##       year      jurisdiction       race_ethnicity     average_scale_score
##  Min.   :1990   Length:158         Length:158         Min.   :234.8      
##  1st Qu.:2000   Class :character   Class :character   1st Qu.:265.6      
##  Median :2009   Mode  :character   Mode  :character   Median :274.8      
##  Mean   :2008                                         Mean   :278.3      
##  3rd Qu.:2017                                         3rd Qu.:293.2      
##  Max.   :2024                                         Max.   :320.1      
##    year_raw        
##  Length:158        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
colSums(is.na(NAEP_clean))
##                year        jurisdiction      race_ethnicity average_scale_score 
##                   0                   0                   0                   0 
##            year_raw 
##                   0

Now that the dataset is clean and ready for analysis, I can start the EDA.

1.4 Exploratory Data Analysis (EDA)

1.4.1 Overview of the clean data

The cleaned dataset contains observations from 1990 to 2024 across two jurisdictions: National and New York. It includes multiple racial and ethnic groups, allowing for meaningful comparisons and trend analysis across both time and demographic categories.

dim(NAEP_clean)
## [1] 158   5
unique(NAEP_clean$jurisdiction)
## [1] "National" "New York"
unique(NAEP_clean$race_ethnicity)
## [1] "White"                         "Black"                        
## [3] "Hispanic"                      "Asian/Pacific Islander"       
## [5] "American Indian/Alaska Native" "Two or more races"
range(NAEP_clean$year)
## [1] 1990 2024

1.4.2 Distribution of the clean data

The distribution of average scale scores (histogram below) shows that most values fall within a moderate range, with a slight spread across groups. The boxplots reveal differences in score distributions among racial and ethnic groups, with some groups consistently scoring higher than others.

ggplot(NAEP_clean, aes(x = average_scale_score)) +
  geom_histogram(binwidth = 5)

ggplot(NAEP_clean, aes(y = average_scale_score)) +
  geom_boxplot()

ggplot(NAEP_clean,
       aes(x = reorder(race_ethnicity,
                       average_scale_score, median),
           y = average_scale_score)) +
  geom_boxplot() +
  coord_flip()

1.4.4 Compare groups

The comparison between National and New York results highlights similarities and differences across groups. While overall trends are comparable, some variations suggest that New York performs slightly differently for certain groups, indicating regional differences in educational outcomes.

NAEP_clean %>%
  filter(year == 2024) %>%
  ggplot(aes(x = race_ethnicity,
             y = average_scale_score,
             fill = jurisdiction)) +
  geom_col(position = "dodge") +
  coord_flip()

1.4.5 Central Tendency

The summary statistics indicate that the average score is approximately 278, with a moderate spread around the mean. The relatively close values of the mean and median suggest that the distribution is fairly balanced without extreme skewness.

NAEP_clean %>%
  summarise(
    mean = mean(average_scale_score),
    median = median(average_scale_score),
    sd = sd(average_scale_score),
    min = min(average_scale_score),
    max = max(average_scale_score)
  )
## # A tibble: 1 × 5
##    mean median    sd   min   max
##   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1  278.   275.  19.7  235.  320.

1.4.6 Outliers

The boxplot analysis does not reveal extreme outliers, but it highlights variations between groups. These differences likely reflect genuine disparities in performance rather than data errors. Furthermore, these variations are consistent with the patterns observed in the previous line graphs, which show persistent differences between groups over time.

ggplot(NAEP_clean,
       aes(x = race_ethnicity,
           y = average_scale_score)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

1.4.7 Achievement gaps

The achievement gap analysis reveals persistent differences in performance between groups over time. The gap between White and Black students, for example, remains consistent across years, indicating that disparities in educational outcomes have not been fully addressed despite overall improvements.

Although all groups show some improvement over time, the persistence of these gaps suggests systemic challenges that require targeted educational interventions.

NAEP_clean <- NAEP_8Math %>%
  filter(!grepl("¹", year_raw)) %>%
  drop_na(average_scale_score)
gap <- NAEP_clean %>%
  filter(race_ethnicity %in% c("White", "Black")) %>%
  select(year, jurisdiction,
         race_ethnicity,
         average_scale_score) %>%
  pivot_wider(names_from = race_ethnicity,
              values_from = average_scale_score) %>%
  mutate(gap = White - Black)
ggplot(gap, aes(x = year, y = gap, color = jurisdiction)) +
  geom_line() +
  geom_point()

Correlation analysis is limited in this dataset because most variables are categorical, so the most meaningful relationship examined is the trend between year and average scale score within demographic groups. There is no clear data leakage concern in this EDA because the variables describe observed assessment outcomes rather than future-known predictors. For data preparation, missing-value symbols should be retained as NA and removed only when necessary for analysis, while superscript-year duplicate rows should be excluded carefully. Because some demographic groups have fewer reported observations, subgroup imbalance should be acknowledged when interpreting results. Since the dataset contains only a few variables, dimension reduction is not necessary, and group-based comparisons are more informative than reducing features.

1.5 Appendix

My use of LLM consisted of correcting some codes when I encountered errors, one example is the following prompt:

It turns out that I got that error because of the duplicate year value after cleaning. When I do pivot_wider(), there was more than one score for the same combination of year, jurisdiction and race_ethnicity and R doesn’t know what to use. So I had to remove the rows with superscript years.

My other use of the LLM was to paste my R Markdown into the chat along with the assignment and the rubric. I then prompted the LLM to check my work against the assignment requirements.