Question: How does the enrollment amount differ between racial groups, and how has that changed from 2023 to 2025? I will compare enrollment amounts across racial groups and look at how those amounts changed between 2023 and 2025.
I found the dataset through GitHub, which led me to its source, data.nysed.gov. The information is published by the New York State Education Department and covers the public schools of the state of New York in the United States. It includes student enrollment data along with county, need/resource category, group, district, and public school information. This dataset provides insight into New York’s public schools from 2023 to 2025.
Because my research question focuses on racial groups, I am using the demographic factors data set. This set reports a variety of demographics (race, gender, economic situation, housing situation, and migrant status) for different entities of the public school system (districts, schools, locations, etc.) over the course of three years. The variables I will be focusing on are NUM_WHITE, NUM_HISP, NUM_BLACK, NUM_ASIAN, NUM_Multi, and NUM_AM_IND, all of which are numeric. Each variable is a column of enrollment counts for the entities listed. I will also be using the variable YEAR, which gives the year of each enrollment count. My dataset includes 16,606 observations and 35 variables.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
dem_factors <- read_csv("C:/DATA101/demographics.csv")
## Rows: 16606 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): ENTITY_CD, ENTITY_NAME
## dbl (33): YEAR, NUM_ELL, PER_ELL, NUM_AM_IND, PER_AM_IND, NUM_BLACK, PER_BLA...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(dem_factors)
## spc_tbl_ [16,606 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ENTITY_CD : chr [1:16606] "000000000001" "000000000002" "000000000003" "000000000004" ...
## $ ENTITY_NAME : chr [1:16606] "NYC Public Schools" "Large Cities" "High Need/Resource Urban-Suburban Districts" "High Need/Resource Rural Districts" ...
## $ YEAR : num [1:16606] 2023 2023 2023 2023 2023 ...
## $ NUM_ELL : num [1:16606] 129782 14300 35438 2143 33962 ...
## $ PER_ELL : num [1:16606] 16 16 18 2 5 4 9 6 0 2 ...
## $ NUM_AM_IND : num [1:16606] 9743 509 580 2136 2529 ...
## $ PER_AM_IND : num [1:16606] 1 1 0 2 0 0 1 0 0 0 ...
## $ NUM_BLACK : num [1:16606] 159551 34103 39603 3894 44836 ...
## $ PER_BLACK : num [1:16606] 20 38 21 3 7 4 49 20 1 9 ...
## $ NUM_HISP : num [1:16606] 336708 29646 87002 11476 126167 ...
## $ PER_HISP : num [1:16606] 42 33 45 9 18 16 39 11 2 8 ...
## $ NUM_ASIAN : num [1:16606] 151113 6779 7905 1055 29061 ...
## $ PER_ASIAN : num [1:16606] 19 8 4 1 4 14 4 11 1 4 ...
## $ NUM_WHITE : num [1:16606] 125091 13658 46914 108669 457875 ...
## $ PER_WHITE : num [1:16606] 16 15 24 82 66 63 6 51 94 71 ...
## $ NUM_Multi : num [1:16606] 15638 4027 9895 5173 28665 ...
## $ PER_Multi : num [1:16606] 2 5 5 4 4 4 2 6 2 8 ...
## $ NUM_SWD : num [1:16606] 189036 19399 32099 22936 109933 ...
## $ PER_SWD : num [1:16606] 24 22 17 17 16 15 18 14 18 15 ...
## $ NUM_FEMALE : num [1:16606] 384400 43293 92919 64894 336061 ...
## $ PER_FEMALE : num [1:16606] 48 49 48 49 49 49 51 49 49 49 ...
## $ NUM_MALE : num [1:16606] 413331 45403 98945 67415 352691 ...
## $ PER_MALE : num [1:16606] 52 51 52 51 51 51 49 51 51 51 ...
## $ NUM_NONBINARY: num [1:16606] 113 26 35 94 381 135 33 16 5 24 ...
## $ PER_NONBINARY: num [1:16606] 0 0 0 0 0 0 0 0 0 0 ...
## $ NUM_ECDIS : num [1:16606] 601178 74564 142102 79437 301330 ...
## $ PER_ECDIS : num [1:16606] 75 84 74 60 44 19 82 46 56 54 ...
## $ NUM_MIGRANT : num [1:16606] 23 23 286 775 737 98 14 20 3 1 ...
## $ PER_MIGRANT : num [1:16606] 0 0 0 1 0 0 0 0 0 0 ...
## $ NUM_HOMELESS : num [1:16606] 73675 3833 8274 2952 9279 ...
## $ PER_HOMELESS : num [1:16606] 9 4 4 2 1 0 8 3 1 3 ...
## $ NUM_FOSTER : num [1:16606] 4196 254 782 553 1537 ...
## $ PER_FOSTER : num [1:16606] 1 0 0 0 0 0 0 0 0 0 ...
## $ NUM_ARMED : num [1:16606] 3215 17 286 3206 2021 ...
## $ PER_ARMED : num [1:16606] 0 0 0 2 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ENTITY_CD = col_character(),
## .. ENTITY_NAME = col_character(),
## .. YEAR = col_double(),
## .. NUM_ELL = col_double(),
## .. PER_ELL = col_double(),
## .. NUM_AM_IND = col_double(),
## .. PER_AM_IND = col_double(),
## .. NUM_BLACK = col_double(),
## .. PER_BLACK = col_double(),
## .. NUM_HISP = col_double(),
## .. PER_HISP = col_double(),
## .. NUM_ASIAN = col_double(),
## .. PER_ASIAN = col_double(),
## .. NUM_WHITE = col_double(),
## .. PER_WHITE = col_double(),
## .. NUM_Multi = col_double(),
## .. PER_Multi = col_double(),
## .. NUM_SWD = col_double(),
## .. PER_SWD = col_double(),
## .. NUM_FEMALE = col_double(),
## .. PER_FEMALE = col_double(),
## .. NUM_MALE = col_double(),
## .. PER_MALE = col_double(),
## .. NUM_NONBINARY = col_double(),
## .. PER_NONBINARY = col_double(),
## .. NUM_ECDIS = col_double(),
## .. PER_ECDIS = col_double(),
## .. NUM_MIGRANT = col_double(),
## .. PER_MIGRANT = col_double(),
## .. NUM_HOMELESS = col_double(),
## .. PER_HOMELESS = col_double(),
## .. NUM_FOSTER = col_double(),
## .. PER_FOSTER = col_double(),
## .. NUM_ARMED = col_double(),
## .. PER_ARMED = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(dem_factors)
## # A tibble: 6 × 35
## ENTITY_CD ENTITY_NAME YEAR NUM_ELL PER_ELL NUM_AM_IND PER_AM_IND NUM_BLACK
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 000000000001 NYC Public… 2023 129782 16 9743 1 159551
## 2 000000000002 Large Citi… 2023 14300 16 509 1 34103
## 3 000000000003 High Need/… 2023 35438 18 580 0 39603
## 4 000000000004 High Need/… 2023 2143 2 2136 2 3894
## 5 000000000005 Average Ne… 2023 33962 5 2529 0 44836
## 6 000000000006 Low Need D… 2023 13643 4 584 0 14339
## # ℹ 27 more variables: PER_BLACK <dbl>, NUM_HISP <dbl>, PER_HISP <dbl>,
## # NUM_ASIAN <dbl>, PER_ASIAN <dbl>, NUM_WHITE <dbl>, PER_WHITE <dbl>,
## # NUM_Multi <dbl>, PER_Multi <dbl>, NUM_SWD <dbl>, PER_SWD <dbl>,
## # NUM_FEMALE <dbl>, PER_FEMALE <dbl>, NUM_MALE <dbl>, PER_MALE <dbl>,
## # NUM_NONBINARY <dbl>, PER_NONBINARY <dbl>, NUM_ECDIS <dbl>, PER_ECDIS <dbl>,
## # NUM_MIGRANT <dbl>, PER_MIGRANT <dbl>, NUM_HOMELESS <dbl>,
## # PER_HOMELESS <dbl>, NUM_FOSTER <dbl>, PER_FOSTER <dbl>, NUM_ARMED <dbl>, …
summary(dem_factors)
## ENTITY_CD ENTITY_NAME YEAR NUM_ELL
## Length:16606 Length:16606 Min. :2023 Min. : 0
## Class :character Class :character 1st Qu.:2023 1st Qu.: 8
## Mode :character Mode :character Median :2024 Median : 28
## Mean :2024 Mean : 283
## 3rd Qu.:2025 3rd Qu.: 77
## Max. :2025 Max. :280312
## NA's :1292
## PER_ELL NUM_AM_IND PER_AM_IND NUM_BLACK
## Min. : 0.00 Min. : 0.00 Min. : 0.0000 Min. : 0.0
## 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 7.0
## Median : 6.00 Median : 1.00 Median : 0.0000 Median : 33.0
## Mean : 10.54 Mean : 17.89 Mean : 0.7473 Mean : 364.7
## 3rd Qu.: 15.00 3rd Qu.: 4.00 3rd Qu.: 1.0000 3rd Qu.: 106.0
## Max. :100.00 Max. :18122.00 Max. :99.0000 Max. :382380.0
## NA's :1292
## PER_BLACK NUM_HISP PER_HISP NUM_ASIAN
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 1.00 1st Qu.: 29.0 1st Qu.: 6.00 1st Qu.: 3.0
## Median : 6.00 Median : 89.0 Median : 19.00 Median : 12.0
## Mean : 15.71 Mean : 719.2 Mean : 27.33 Mean : 255.6
## 3rd Qu.: 22.00 3rd Qu.: 225.0 3rd Qu.: 43.00 3rd Qu.: 49.0
## Max. :100.00 Max. :744672.0 Max. :100.00 Max. :257714.0
##
## PER_ASIAN NUM_WHITE PER_WHITE NUM_Multi
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 23.0 1st Qu.: 6.00 1st Qu.: 4.00
## Median : 2.000 Median : 197.0 Median : 46.00 Median : 13.00
## Mean : 7.291 Mean : 888.2 Mean : 45.22 Mean : 78.47
## 3rd Qu.: 8.000 3rd Qu.: 388.0 3rd Qu.: 81.00 3rd Qu.: 29.00
## Max. :94.000 Max. :980161.0 Max. :100.00 Max. :87228.00
##
## PER_Multi NUM_SWD PER_SWD NUM_FEMALE
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0
## 1st Qu.: 1.000 1st Qu.: 55.0 1st Qu.: 15.00 1st Qu.: 144
## Median : 3.000 Median : 85.0 Median : 18.00 Median : 215
## Mean : 3.513 Mean : 455.5 Mean : 20.27 Mean : 1131
## 3rd Qu.: 5.000 3rd Qu.: 138.0 3rd Qu.: 23.00 3rd Qu.: 355
## Max. :25.000 Max. :480579.0 Max. :100.00 Max. :1179812
## NA's :89 NA's :89
## PER_FEMALE NUM_MALE PER_MALE NUM_NONBINARY
## Min. : 0.00 Min. : 0 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 47.00 1st Qu.: 153 1st Qu.: 49.00 1st Qu.: 0.0000
## Median : 49.00 Median : 228 Median : 51.00 Median : 0.0000
## Mean : 48.51 Mean : 1192 Mean : 51.31 Mean : 0.8777
## 3rd Qu.: 51.00 3rd Qu.: 370 3rd Qu.: 53.00 3rd Qu.: 0.0000
## Max. :100.00 Max. :1242577 Max. :100.00 Max. :1023.0000
##
## PER_NONBINARY NUM_ECDIS PER_ECDIS NUM_MIGRANT
## Min. : 0.00000 Min. : 0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00000 1st Qu.: 144 1st Qu.: 39.00 1st Qu.: 0.000
## Median : 0.00000 Median : 250 Median : 58.00 Median : 0.000
## Mean : 0.02391 Mean : 1400 Mean : 58.82 Mean : 2.739
## 3rd Qu.: 0.00000 3rd Qu.: 429 3rd Qu.: 84.00 3rd Qu.: 0.000
## Max. :14.00000 Max. :1440928 Max. :100.00 Max. :2388.000
## NA's :63 NA's :63 NA's :5262
## PER_MIGRANT NUM_HOMELESS PER_HOMELESS NUM_FOSTER
## Min. : 0.0000 Min. : 0.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 5.0 1st Qu.: 1.000 1st Qu.: 0.00
## Median : 0.0000 Median : 16.0 Median : 3.000 Median : 1.00
## Mean : 0.1444 Mean : 153.9 Mean : 6.529 Mean : 11.11
## 3rd Qu.: 0.0000 3rd Qu.: 45.0 3rd Qu.: 9.000 3rd Qu.: 4.00
## Max. :14.0000 Max. :155242.0 Max. :84.000 Max. :8637.00
## NA's :5262 NA's :1654 NA's :1654 NA's :4676
## PER_FOSTER NUM_ARMED PER_ARMED
## Min. : 0.0000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.00 Median : 0.0000
## Mean : 0.4381 Mean : 12.93 Mean : 0.4614
## 3rd Qu.: 1.0000 3rd Qu.: 2.00 3rd Qu.: 0.0000
## Max. :100.0000 Max. :11317.00 Max. :80.0000
## NA's :4676 NA's :4787 NA's :4787
Analysis: I began by cleaning the data with basic EDA functions. I changed the variable names to lowercase for easier reading and checked for NAs in the variables I planned to use. I then selected the variables needed for the analysis (the race counts, not the other demographics). I found the average enrollment per race and rounded each average to two decimal places, again for easier reading. Averaging the enrollment for each race let me summarize this large amount of data in a single row, and because every race is summarized with the same function, the comparison between races is fair. For my summary table, I used the same method, except that I first grouped the data by year so I could look for patterns and changes over time. With the averages, I was able to compare the groups directly and note patterns and changes.
# Clean variable names by changing them to lowercase for easier reading
names(dem_factors) <- tolower(names(dem_factors))
# Check for any NAs in the variables I will be using
colSums(is.na(dem_factors))
## entity_cd entity_name year num_ell per_ell
## 0 0 0 1292 1292
## num_am_ind per_am_ind num_black per_black num_hisp
## 0 0 0 0 0
## per_hisp num_asian per_asian num_white per_white
## 0 0 0 0 0
## num_multi per_multi num_swd per_swd num_female
## 0 0 89 89 0
## per_female num_male per_male num_nonbinary per_nonbinary
## 0 0 0 0 0
## num_ecdis per_ecdis num_migrant per_migrant num_homeless
## 63 63 5262 5262 1654
## per_homeless num_foster per_foster num_armed per_armed
## 1654 4676 4676 4787 4787
Note: There were no NAs in any of the race variables (or in year), so the averages below do not need na.rm = TRUE.
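The NA check above covers all 35 columns. As a minimal sketch, the same check can be limited to the six race columns. The object names `race_cols`, `sample_rows`, and `na_counts` are my own, and `sample_rows` holds only the first five rows shown in the str() output, so that the example runs on its own.

```r
library(dplyr)

# Hypothetical helper vector: the six race columns (names already lowercased).
race_cols <- c("num_am_ind", "num_black", "num_hisp",
               "num_asian", "num_white", "num_multi")

# First five rows as printed by str(); a stand-in for the full dem_factors.
sample_rows <- tibble::tibble(
  num_ell    = c(129782, 14300, 35438, 2143, 33962),
  num_am_ind = c(9743, 509, 580, 2136, 2529),
  num_black  = c(159551, 34103, 39603, 3894, 44836),
  num_hisp   = c(336708, 29646, 87002, 11476, 126167),
  num_asian  = c(151113, 6779, 7905, 1055, 29061),
  num_white  = c(125091, 13658, 46914, 108669, 457875),
  num_multi  = c(15638, 4027, 9895, 5173, 28665)
)

# Count missing values in the race columns only.
na_counts <- sample_rows |>
  summarise(across(all_of(race_cols), ~ sum(is.na(.x))))

na_counts
```

On the full data, `sample_rows` would be replaced with `dem_factors`.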
avg_race <- dem_factors |>
  select(num_black, num_am_ind, num_asian, num_hisp, num_white, num_multi) |>
  summarise(avg_black = round(mean(num_black), digits = 2),
            avg_ind = round(mean(num_am_ind), digits = 2),
            avg_asian = round(mean(num_asian), digits = 2),
            avg_hisp = round(mean(num_hisp), digits = 2),
            avg_white = round(mean(num_white), digits = 2),
            avg_multi = round(mean(num_multi), digits = 2))
avg_race
## # A tibble: 1 × 6
## avg_black avg_ind avg_asian avg_hisp avg_white avg_multi
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 365. 17.9 256. 719. 888. 78.5
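Since every race is summarized with the same function, the six repeated mean() calls could also be written once with across(). This is only a sketch of an alternative, not the code used for the results in this report. `sample_rows` and `avg_race_sample` are names I made up, and the data are just the first five rows from the str() output, so the averages here will not match the ones above.

```r
library(dplyr)

# First five rows as printed by str(); a stand-in for the full dem_factors.
sample_rows <- tibble::tibble(
  num_am_ind = c(9743, 509, 580, 2136, 2529),
  num_black  = c(159551, 34103, 39603, 3894, 44836),
  num_hisp   = c(336708, 29646, 87002, 11476, 126167),
  num_asian  = c(151113, 6779, 7905, 1055, 29061),
  num_white  = c(125091, 13658, 46914, 108669, 457875),
  num_multi  = c(15638, 4027, 9895, 5173, 28665)
)

# Apply the same rounded mean to every column whose name starts with "num_".
# The output columns are named avg_num_black, avg_num_hisp, and so on.
avg_race_sample <- sample_rows |>
  summarise(across(starts_with("num_"),
                   ~ round(mean(.x), digits = 2),
                   .names = "avg_{.col}"))

avg_race_sample
```

On the full data, the race columns would need to be selected first, because dem_factors has other columns that start with "num_" (such as num_ell and num_swd).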
# Bar plot to see the differences between the races
barplot(c(avg_race$avg_white, avg_race$avg_hisp, avg_race$avg_black,
          avg_race$avg_asian, avg_race$avg_multi, avg_race$avg_ind),
        names.arg = c("White", "Hispanic", "Black", "Asian", "Multiracial", "Am_Indian"),
        col = "purple", ylab = "Average Enrollment Amount", xlab = "Races")
# Summary table to see the changes across the different years
summary_table <- dem_factors |>
  group_by(year) |>
  summarise(avg_black = round(mean(num_black), digits = 2),
            avg_ind = round(mean(num_am_ind), digits = 2),
            avg_asian = round(mean(num_asian), digits = 2),
            avg_hisp = round(mean(num_hisp), digits = 2),
            avg_white = round(mean(num_white), digits = 2),
            avg_multi = round(mean(num_multi), digits = 2))
summary_table
## # A tibble: 3 × 7
## year avg_black avg_ind avg_asian avg_hisp avg_white avg_multi
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2023 371. 17.7 252. 704. 907. 74.5
## 2 2024 361. 17.8 255. 721. 887. 79.0
## 3 2025 362. 18.1 260. 733. 870. 81.9
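To make the change from 2023 to 2025 easier to read, the yearly averages can be turned into a difference and a percent change per group. The sketch below re-enters the averages as they are printed in the table above (so they are rounded to the displayed digits), and `change_table`, `change`, and `pct_change` are names I chose for illustration.

```r
library(dplyr)
library(tidyr)

# Yearly averages copied from the printed summary table (rounded as displayed).
summary_table <- tibble::tribble(
  ~year, ~avg_black, ~avg_ind, ~avg_asian, ~avg_hisp, ~avg_white, ~avg_multi,
  2023,  371,        17.7,     252,        704,       907,        74.5,
  2024,  361,        17.8,     255,        721,       887,        79.0,
  2025,  362,        18.1,     260,        733,       870,        81.9
)

change_table <- summary_table |>
  filter(year %in% c(2023, 2025)) |>
  # One row per year and group ...
  pivot_longer(-year, names_to = "group", values_to = "avg") |>
  # ... then one row per group with a column for each year.
  pivot_wider(names_from = year, values_from = avg, names_prefix = "avg_") |>
  mutate(change = avg_2025 - avg_2023,
         pct_change = round(100 * change / avg_2023, 1)) |>
  # Largest absolute change first.
  arrange(desc(abs(change)))

change_table
```

Sorting by the absolute change puts the groups that moved the most at the top, while pct_change shows how large each change is relative to the group's 2023 average.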
Conclusion: The groups with the highest average enrollment are White (about 888), Hispanic (about 719), and Black (about 365) students. The average White enrollment is roughly three and a half times the average Asian enrollment (about 888 compared to 256). Over the three years, average Black enrollment decreased (371 to 362), American Indian enrollment increased slightly (17.7 to 18.1), Asian enrollment increased (252 to 260), Hispanic enrollment increased (704 to 733), White enrollment decreased (907 to 870), and multiracial enrollment increased (74.5 to 81.9). The two groups whose averages decreased, White and Black, are both among the three largest groups, while the average for every other group increased, which suggests that diversity increased. In relation to my research question, there are indeed differences in average enrollment between racial groups, ranging from about 18 for American Indian students to about 888 for White students. These averages changed only slightly over the three years for most groups. The largest absolute changes were the drop in White enrollment and the rise in Hispanic enrollment.
Having this information is important because it lets the State Education Department know the diversity of the students it is working with and how that diversity is changing. Noticing trends in which groups are enrolling more can help with decisions such as allocating resources, making policy, and planning to increase diversity. This dataset also provides information on other demographics, such as gender and economic status, which future studies could combine with race for an even better understanding of how enrollment differs between racial groups. A similar analysis could group the data by entity to see how enrollment differs across races from one entity to another, which would help narrow down which factors affect enrollment for different races.
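As a rough sketch of what the entity-level follow-up could look like, the code below computes each racial group's share of enrollment within an entity. It uses only the first four 2023 rows shown in the str() output, and `entity_sample`, `entity_shares`, `enrolled`, and `share` are hypothetical names. It also assumes the six race counts add up to the entity's total enrollment.

```r
library(dplyr)
library(tidyr)

# First four 2023 rows as printed by str() (column names lowercased).
entity_sample <- tibble::tribble(
  ~entity_name, ~year, ~num_am_ind, ~num_black, ~num_hisp, ~num_asian, ~num_white, ~num_multi,
  "NYC Public Schools", 2023, 9743, 159551, 336708, 151113, 125091, 15638,
  "Large Cities", 2023, 509, 34103, 29646, 6779, 13658, 4027,
  "High Need/Resource Urban-Suburban Districts", 2023, 580, 39603, 87002, 7905, 46914, 9895,
  "High Need/Resource Rural Districts", 2023, 2136, 3894, 11476, 1055, 108669, 5173
)

entity_shares <- entity_sample |>
  # One row per entity, year, and racial group.
  pivot_longer(starts_with("num_"), names_to = "group", values_to = "enrolled") |>
  group_by(entity_name, year) |>
  # Percent of the entity's students (summed over the six groups) in each group.
  mutate(share = round(100 * enrolled / sum(enrolled), 1)) |>
  ungroup()

entity_shares
```

Comparing shares instead of raw counts keeps a very large entity such as NYC Public Schools from dominating the comparison.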
References: https://data.nysed.gov/downloads.php