DATA 101 - Project 1

Introduction

Question: How does enrollment differ between racial groups in New York's public schools, and how has it changed from 2023 to 2025?

I found the dataset through GitHub, which pointed me to data.nysed.gov. The data are published by the New York State Education Department and cover the public schools of New York State in the United States. The site includes data on student enrollment, county, Need/Resource category, student group, district, and individual public schools. This dataset provides insight into New York’s public schools from 2023 to 2025.

Because my research question is about racial groups, I am using the demographic factors data set. It records a variety of demographics (race, gender, economic situation, housing situation, and migrant status) for different entities of the public school system (districts, schools, locations, etc.) over three years. The variables I will be focusing on are NUM_WHITE, NUM_HISP, NUM_BLACK, NUM_ASIAN, NUM_Multi, and NUM_AM_IND, all of which are numeric. Each variable is a column of enrollment counts for the entities listed. I will also be using the variable YEAR, which gives the year of each enrollment count. My dataset includes 16,606 observations and 35 variables.

1. Importing my dataset and libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

dem_factors <- read_csv("C:/DATA101/demographics.csv")

## Rows: 16606 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): ENTITY_CD, ENTITY_NAME
## dbl (33): YEAR, NUM_ELL, PER_ELL, NUM_AM_IND, PER_AM_IND, NUM_BLACK, PER_BLA...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Exploring my data

str(dem_factors)

## spc_tbl_ [16,606 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ENTITY_CD    : chr [1:16606] "000000000001" "000000000002" "000000000003" "000000000004" ...
##  $ ENTITY_NAME  : chr [1:16606] "NYC Public Schools" "Large Cities" "High Need/Resource Urban-Suburban Districts" "High Need/Resource Rural Districts" ...
##  $ YEAR         : num [1:16606] 2023 2023 2023 2023 2023 ...
##  $ NUM_ELL      : num [1:16606] 129782 14300 35438 2143 33962 ...
##  $ PER_ELL      : num [1:16606] 16 16 18 2 5 4 9 6 0 2 ...
##  $ NUM_AM_IND   : num [1:16606] 9743 509 580 2136 2529 ...
##  $ PER_AM_IND   : num [1:16606] 1 1 0 2 0 0 1 0 0 0 ...
##  $ NUM_BLACK    : num [1:16606] 159551 34103 39603 3894 44836 ...
##  $ PER_BLACK    : num [1:16606] 20 38 21 3 7 4 49 20 1 9 ...
##  $ NUM_HISP     : num [1:16606] 336708 29646 87002 11476 126167 ...
##  $ PER_HISP     : num [1:16606] 42 33 45 9 18 16 39 11 2 8 ...
##  $ NUM_ASIAN    : num [1:16606] 151113 6779 7905 1055 29061 ...
##  $ PER_ASIAN    : num [1:16606] 19 8 4 1 4 14 4 11 1 4 ...
##  $ NUM_WHITE    : num [1:16606] 125091 13658 46914 108669 457875 ...
##  $ PER_WHITE    : num [1:16606] 16 15 24 82 66 63 6 51 94 71 ...
##  $ NUM_Multi    : num [1:16606] 15638 4027 9895 5173 28665 ...
##  $ PER_Multi    : num [1:16606] 2 5 5 4 4 4 2 6 2 8 ...
##  $ NUM_SWD      : num [1:16606] 189036 19399 32099 22936 109933 ...
##  $ PER_SWD      : num [1:16606] 24 22 17 17 16 15 18 14 18 15 ...
##  $ NUM_FEMALE   : num [1:16606] 384400 43293 92919 64894 336061 ...
##  $ PER_FEMALE   : num [1:16606] 48 49 48 49 49 49 51 49 49 49 ...
##  $ NUM_MALE     : num [1:16606] 413331 45403 98945 67415 352691 ...
##  $ PER_MALE     : num [1:16606] 52 51 52 51 51 51 49 51 51 51 ...
##  $ NUM_NONBINARY: num [1:16606] 113 26 35 94 381 135 33 16 5 24 ...
##  $ PER_NONBINARY: num [1:16606] 0 0 0 0 0 0 0 0 0 0 ...
##  $ NUM_ECDIS    : num [1:16606] 601178 74564 142102 79437 301330 ...
##  $ PER_ECDIS    : num [1:16606] 75 84 74 60 44 19 82 46 56 54 ...
##  $ NUM_MIGRANT  : num [1:16606] 23 23 286 775 737 98 14 20 3 1 ...
##  $ PER_MIGRANT  : num [1:16606] 0 0 0 1 0 0 0 0 0 0 ...
##  $ NUM_HOMELESS : num [1:16606] 73675 3833 8274 2952 9279 ...
##  $ PER_HOMELESS : num [1:16606] 9 4 4 2 1 0 8 3 1 3 ...
##  $ NUM_FOSTER   : num [1:16606] 4196 254 782 553 1537 ...
##  $ PER_FOSTER   : num [1:16606] 1 0 0 0 0 0 0 0 0 0 ...
##  $ NUM_ARMED    : num [1:16606] 3215 17 286 3206 2021 ...
##  $ PER_ARMED    : num [1:16606] 0 0 0 2 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ENTITY_CD = col_character(),
##   ..   ENTITY_NAME = col_character(),
##   ..   YEAR = col_double(),
##   ..   NUM_ELL = col_double(),
##   ..   PER_ELL = col_double(),
##   ..   NUM_AM_IND = col_double(),
##   ..   PER_AM_IND = col_double(),
##   ..   NUM_BLACK = col_double(),
##   ..   PER_BLACK = col_double(),
##   ..   NUM_HISP = col_double(),
##   ..   PER_HISP = col_double(),
##   ..   NUM_ASIAN = col_double(),
##   ..   PER_ASIAN = col_double(),
##   ..   NUM_WHITE = col_double(),
##   ..   PER_WHITE = col_double(),
##   ..   NUM_Multi = col_double(),
##   ..   PER_Multi = col_double(),
##   ..   NUM_SWD = col_double(),
##   ..   PER_SWD = col_double(),
##   ..   NUM_FEMALE = col_double(),
##   ..   PER_FEMALE = col_double(),
##   ..   NUM_MALE = col_double(),
##   ..   PER_MALE = col_double(),
##   ..   NUM_NONBINARY = col_double(),
##   ..   PER_NONBINARY = col_double(),
##   ..   NUM_ECDIS = col_double(),
##   ..   PER_ECDIS = col_double(),
##   ..   NUM_MIGRANT = col_double(),
##   ..   PER_MIGRANT = col_double(),
##   ..   NUM_HOMELESS = col_double(),
##   ..   PER_HOMELESS = col_double(),
##   ..   NUM_FOSTER = col_double(),
##   ..   PER_FOSTER = col_double(),
##   ..   NUM_ARMED = col_double(),
##   ..   PER_ARMED = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(dem_factors)

## # A tibble: 6 × 35
##   ENTITY_CD    ENTITY_NAME  YEAR NUM_ELL PER_ELL NUM_AM_IND PER_AM_IND NUM_BLACK
##   <chr>        <chr>       <dbl>   <dbl>   <dbl>      <dbl>      <dbl>     <dbl>
## 1 000000000001 NYC Public…  2023  129782      16       9743          1    159551
## 2 000000000002 Large Citi…  2023   14300      16        509          1     34103
## 3 000000000003 High Need/…  2023   35438      18        580          0     39603
## 4 000000000004 High Need/…  2023    2143       2       2136          2      3894
## 5 000000000005 Average Ne…  2023   33962       5       2529          0     44836
## 6 000000000006 Low Need D…  2023   13643       4        584          0     14339
## # ℹ 27 more variables: PER_BLACK <dbl>, NUM_HISP <dbl>, PER_HISP <dbl>,
## #   NUM_ASIAN <dbl>, PER_ASIAN <dbl>, NUM_WHITE <dbl>, PER_WHITE <dbl>,
## #   NUM_Multi <dbl>, PER_Multi <dbl>, NUM_SWD <dbl>, PER_SWD <dbl>,
## #   NUM_FEMALE <dbl>, PER_FEMALE <dbl>, NUM_MALE <dbl>, PER_MALE <dbl>,
## #   NUM_NONBINARY <dbl>, PER_NONBINARY <dbl>, NUM_ECDIS <dbl>, PER_ECDIS <dbl>,
## #   NUM_MIGRANT <dbl>, PER_MIGRANT <dbl>, NUM_HOMELESS <dbl>,
## #   PER_HOMELESS <dbl>, NUM_FOSTER <dbl>, PER_FOSTER <dbl>, NUM_ARMED <dbl>, …

summary(dem_factors)

##   ENTITY_CD         ENTITY_NAME             YEAR         NUM_ELL      
##  Length:16606       Length:16606       Min.   :2023   Min.   :     0  
##  Class :character   Class :character   1st Qu.:2023   1st Qu.:     8  
##  Mode  :character   Mode  :character   Median :2024   Median :    28  
##                                        Mean   :2024   Mean   :   283  
##                                        3rd Qu.:2025   3rd Qu.:    77  
##                                        Max.   :2025   Max.   :280312  
##                                                       NA's   :1292    
##     PER_ELL         NUM_AM_IND         PER_AM_IND        NUM_BLACK       
##  Min.   :  0.00   Min.   :    0.00   Min.   : 0.0000   Min.   :     0.0  
##  1st Qu.:  2.00   1st Qu.:    0.00   1st Qu.: 0.0000   1st Qu.:     7.0  
##  Median :  6.00   Median :    1.00   Median : 0.0000   Median :    33.0  
##  Mean   : 10.54   Mean   :   17.89   Mean   : 0.7473   Mean   :   364.7  
##  3rd Qu.: 15.00   3rd Qu.:    4.00   3rd Qu.: 1.0000   3rd Qu.:   106.0  
##  Max.   :100.00   Max.   :18122.00   Max.   :99.0000   Max.   :382380.0  
##  NA's   :1292                                                            
##    PER_BLACK         NUM_HISP           PER_HISP        NUM_ASIAN       
##  Min.   :  0.00   Min.   :     0.0   Min.   :  0.00   Min.   :     0.0  
##  1st Qu.:  1.00   1st Qu.:    29.0   1st Qu.:  6.00   1st Qu.:     3.0  
##  Median :  6.00   Median :    89.0   Median : 19.00   Median :    12.0  
##  Mean   : 15.71   Mean   :   719.2   Mean   : 27.33   Mean   :   255.6  
##  3rd Qu.: 22.00   3rd Qu.:   225.0   3rd Qu.: 43.00   3rd Qu.:    49.0  
##  Max.   :100.00   Max.   :744672.0   Max.   :100.00   Max.   :257714.0  
##                                                                         
##    PER_ASIAN        NUM_WHITE          PER_WHITE        NUM_Multi       
##  Min.   : 0.000   Min.   :     0.0   Min.   :  0.00   Min.   :    0.00  
##  1st Qu.: 1.000   1st Qu.:    23.0   1st Qu.:  6.00   1st Qu.:    4.00  
##  Median : 2.000   Median :   197.0   Median : 46.00   Median :   13.00  
##  Mean   : 7.291   Mean   :   888.2   Mean   : 45.22   Mean   :   78.47  
##  3rd Qu.: 8.000   3rd Qu.:   388.0   3rd Qu.: 81.00   3rd Qu.:   29.00  
##  Max.   :94.000   Max.   :980161.0   Max.   :100.00   Max.   :87228.00  
##                                                                         
##    PER_Multi         NUM_SWD            PER_SWD         NUM_FEMALE     
##  Min.   : 0.000   Min.   :     0.0   Min.   :  0.00   Min.   :      0  
##  1st Qu.: 1.000   1st Qu.:    55.0   1st Qu.: 15.00   1st Qu.:    144  
##  Median : 3.000   Median :    85.0   Median : 18.00   Median :    215  
##  Mean   : 3.513   Mean   :   455.5   Mean   : 20.27   Mean   :   1131  
##  3rd Qu.: 5.000   3rd Qu.:   138.0   3rd Qu.: 23.00   3rd Qu.:    355  
##  Max.   :25.000   Max.   :480579.0   Max.   :100.00   Max.   :1179812  
##                   NA's   :89         NA's   :89                        
##    PER_FEMALE        NUM_MALE          PER_MALE      NUM_NONBINARY      
##  Min.   :  0.00   Min.   :      0   Min.   :  0.00   Min.   :   0.0000  
##  1st Qu.: 47.00   1st Qu.:    153   1st Qu.: 49.00   1st Qu.:   0.0000  
##  Median : 49.00   Median :    228   Median : 51.00   Median :   0.0000  
##  Mean   : 48.51   Mean   :   1192   Mean   : 51.31   Mean   :   0.8777  
##  3rd Qu.: 51.00   3rd Qu.:    370   3rd Qu.: 53.00   3rd Qu.:   0.0000  
##  Max.   :100.00   Max.   :1242577   Max.   :100.00   Max.   :1023.0000  
##                                                                         
##  PER_NONBINARY        NUM_ECDIS         PER_ECDIS       NUM_MIGRANT      
##  Min.   : 0.00000   Min.   :      0   Min.   :  0.00   Min.   :   0.000  
##  1st Qu.: 0.00000   1st Qu.:    144   1st Qu.: 39.00   1st Qu.:   0.000  
##  Median : 0.00000   Median :    250   Median : 58.00   Median :   0.000  
##  Mean   : 0.02391   Mean   :   1400   Mean   : 58.82   Mean   :   2.739  
##  3rd Qu.: 0.00000   3rd Qu.:    429   3rd Qu.: 84.00   3rd Qu.:   0.000  
##  Max.   :14.00000   Max.   :1440928   Max.   :100.00   Max.   :2388.000  
##                     NA's   :63        NA's   :63       NA's   :5262      
##   PER_MIGRANT       NUM_HOMELESS       PER_HOMELESS      NUM_FOSTER     
##  Min.   : 0.0000   Min.   :     0.0   Min.   : 0.000   Min.   :   0.00  
##  1st Qu.: 0.0000   1st Qu.:     5.0   1st Qu.: 1.000   1st Qu.:   0.00  
##  Median : 0.0000   Median :    16.0   Median : 3.000   Median :   1.00  
##  Mean   : 0.1444   Mean   :   153.9   Mean   : 6.529   Mean   :  11.11  
##  3rd Qu.: 0.0000   3rd Qu.:    45.0   3rd Qu.: 9.000   3rd Qu.:   4.00  
##  Max.   :14.0000   Max.   :155242.0   Max.   :84.000   Max.   :8637.00  
##  NA's   :5262      NA's   :1654       NA's   :1654     NA's   :4676     
##    PER_FOSTER         NUM_ARMED          PER_ARMED      
##  Min.   :  0.0000   Min.   :    0.00   Min.   : 0.0000  
##  1st Qu.:  0.0000   1st Qu.:    0.00   1st Qu.: 0.0000  
##  Median :  0.0000   Median :    0.00   Median : 0.0000  
##  Mean   :  0.4381   Mean   :   12.93   Mean   : 0.4614  
##  3rd Qu.:  1.0000   3rd Qu.:    2.00   3rd Qu.: 0.0000  
##  Max.   :100.0000   Max.   :11317.00   Max.   :80.0000  
##  NA's   :4676       NA's   :4787       NA's   :4787

3. Utilizing EDA functions to clean my data

Analysis: I began by cleaning the data with EDA functions. I converted the variable names to lowercase for easier reading and checked for NAs in the variables I planned to use. I then selected the variables needed for the analysis (the race counts, not the other demographics). I computed the average enrollment per race and rounded each average to two decimal places, again for easier reading. Averaging lets me summarize this large amount of data compactly and compare the races fairly, since the same summary function is applied to each one. For my summary table, I used the same method but first grouped the data by year so I could look for patterns and changes over time.

#Clean variable names by changing them to lowercase for easier reading
names(dem_factors) <- tolower(names(dem_factors))

#Check for any NAs for the variables I will be using
colSums(is.na(dem_factors))

##     entity_cd   entity_name          year       num_ell       per_ell 
##             0             0             0          1292          1292 
##    num_am_ind    per_am_ind     num_black     per_black      num_hisp 
##             0             0             0             0             0 
##      per_hisp     num_asian     per_asian     num_white     per_white 
##             0             0             0             0             0 
##     num_multi     per_multi       num_swd       per_swd    num_female 
##             0             0            89            89             0 
##    per_female      num_male      per_male num_nonbinary per_nonbinary 
##             0             0             0             0             0 
##     num_ecdis     per_ecdis   num_migrant   per_migrant  num_homeless 
##            63            63          5262          5262          1654 
##  per_homeless    num_foster    per_foster     num_armed     per_armed 
##          1654          4676          4676          4787          4787

Note: There were no NAs in the variables I am using (the six race counts and year).
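
The NA check above covers all 35 columns. As a minimal sketch, the same check can be limited to the columns the analysis actually uses. The data frame `toy` and the vector `race_cols` below are hypothetical stand-ins with made-up values, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data (made-up values) standing in for dem_factors
toy <- data.frame(
  year       = c(2023, 2024, 2025),
  num_am_ind = c(5, 6, 7),
  num_black  = c(30, NA, 32),   # one NA on purpose, to show a non-zero count
  num_ell    = c(NA, NA, 12)    # not used in the analysis, so not checked
)

# Hypothetical helper: names of the columns to check
race_cols <- c("num_am_ind", "num_black")

# Count NAs only in the selected columns
na_counts <- toy |>
  summarise(across(all_of(race_cols), ~ sum(is.na(.x))))
na_counts
```

On the real data, `race_cols` would list all six race columns, and every count should come back as 0 to match the output above.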

4. Selecting and summarizing necessary data

avg_race <- dem_factors |>
  select(num_black, num_am_ind, num_asian, num_hisp, num_white, num_multi) |>
  summarise(avg_black = round(mean(num_black), digits = 2),
            avg_ind = round(mean(num_am_ind), digits = 2),
            avg_asian = round(mean(num_asian), digits = 2),
            avg_hisp = round(mean(num_hisp), digits = 2),
            avg_white = round(mean(num_white), digits = 2),
            avg_multi = round(mean(num_multi), digits = 2))
avg_race

## # A tibble: 1 × 6
##   avg_black avg_ind avg_asian avg_hisp avg_white avg_multi
##       <dbl>   <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
## 1      365.    17.9      256.     719.      888.      78.5
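
The six near-identical `round(mean(...))` lines can also be written once with `across()`. The sketch below is an alternative, not the method used above, and it runs on a hypothetical toy data frame with made-up values so that it is self-contained.

```r
library(dplyr)

# Hypothetical toy data (made-up values) standing in for dem_factors
toy <- data.frame(
  num_black = c(10, 20, 30),
  num_white = c(100, 200, 301),
  num_asian = c(1, 2, 4)
)

# Hypothetical helper: the columns to average
race_cols <- c("num_black", "num_white", "num_asian")

# Apply the same rounded mean to every selected column,
# then rename num_* to avg_* to mirror the naming used above
avg_race_toy <- toy |>
  summarise(across(all_of(race_cols), ~ round(mean(.x), digits = 2))) |>
  rename_with(~ sub("^num_", "avg_", .x))
avg_race_toy
```

Because the summary function is written only once, adding or removing a group means editing `race_cols` rather than a line of the `summarise()` call.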

5. Displaying Information

#Bar plot to see the differences between the races
barplot(c(avg_race$avg_white, avg_race$avg_hisp, avg_race$avg_black,
          avg_race$avg_asian, avg_race$avg_multi, avg_race$avg_ind),
        names.arg = c("White", "Hispanic", "Black", "Asian", "Multiracial", "Am_Indian"),
        col = "purple", ylab = "Average Enrollment Amount", xlab = "Races")
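
# A possible ggplot2 version of the same bar plot, shown only as a sketch.
# The values are the column means printed by summary() above; the data frame
# avg_long and the object p are hypothetical names introduced for this example.

```r
library(ggplot2)

# Column means copied from the summary() output (NUM_WHITE, NUM_HISP, ...)
avg_long <- data.frame(
  group = c("White", "Hispanic", "Black", "Asian", "Multiracial", "Am. Indian"),
  avg   = c(888.2, 719.2, 364.7, 255.6, 78.47, 17.89)
)

# Bars ordered from highest to lowest average enrollment
p <- ggplot(avg_long, aes(x = reorder(group, -avg), y = avg)) +
  geom_col(fill = "purple") +
  labs(x = "Race", y = "Average Enrollment Amount")
p
```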

#Summary table to see the changes across the different years
summary_table <- dem_factors |>
  group_by(year) |>
  summarise(avg_black = round(mean(num_black), digits = 2),
            avg_ind = round(mean(num_am_ind), digits = 2),
            avg_asian = round(mean(num_asian), digits = 2),
            avg_hisp = round(mean(num_hisp), digits = 2),
            avg_white = round(mean(num_white), digits = 2),
            avg_multi = round(mean(num_multi), digits = 2))
summary_table

## # A tibble: 3 × 7
##    year avg_black avg_ind avg_asian avg_hisp avg_white avg_multi
##   <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
## 1  2023      371.    17.7      252.     704.      907.      74.5
## 2  2024      361.    17.8      255.     721.      887.      79.0
## 3  2025      362.    18.1      260.     733.      870.      81.9
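
To make the change from 2023 to 2025 explicit, the table can be reshaped so each group has one row with its difference and percent change. This is a sketch only: the values are typed in from the printed output above (so they carry three significant digits, not full precision), and `summary_printed` and `change_table` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Values copied from the printed summary_table output (rounded as displayed)
summary_printed <- data.frame(
  year      = c(2023, 2024, 2025),
  avg_black = c(371, 361, 362),
  avg_ind   = c(17.7, 17.8, 18.1),
  avg_asian = c(252, 255, 260),
  avg_hisp  = c(704, 721, 733),
  avg_white = c(907, 887, 870),
  avg_multi = c(74.5, 79.0, 81.9)
)

# One row per group, with 2023 and 2025 side by side
change_table <- summary_printed |>
  filter(year %in% c(2023, 2025)) |>
  pivot_longer(-year, names_to = "group", values_to = "avg") |>
  pivot_wider(names_from = year, values_from = avg, names_prefix = "y") |>
  mutate(change     = y2025 - y2023,
         pct_change = round(100 * change / y2023, 1))
change_table
```

On these printed values, the White average changes by -37 and the Hispanic average by +29. In the real analysis, `summary_table` itself could be passed in instead of the typed-in copy.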

Conclusion

I found that the groups with the highest average enrollment are White, Hispanic, and Black students. The average number of White students enrolled (about 888) is roughly three and a half times the average number of Asian students (about 256). Over the three years, average Black enrollment decreased, American Indian enrollment increased slightly, Asian enrollment increased, Hispanic enrollment increased, White enrollment decreased, and multiracial enrollment increased. In other words, the White and Black averages fell while the average for every other group rose, which suggests increasing diversity. In relation to my research question, there are indeed differences in average enrollment between racial groups, with some groups very low and others very high. These averages changed only slightly for most groups over the three years; the largest shifts were the drop in the White average (from about 907 to 870) and the rise in the Hispanic average (from about 704 to 733). Because the dataset mixes individual schools, districts, and aggregate rows such as "NYC Public Schools", these figures are averages per listed entity rather than statewide totals.

This information is important because it shows the State Education Department how diverse its student population is and how that diversity is shifting. Noticing which groups are enrolling in greater numbers can inform decisions about resource allocation, policy, and plans to increase diversity. The dataset also covers other demographics, such as gender and economic status, which future studies could combine with race for a better understanding of how enrollment differs between racial groups. A similar analysis could group the data by entity to see how enrollment differs across races from one entity to another, which would help narrow down the factors that affect enrollment for each group.
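
As a rough sketch of the entity-level follow-up described above, the data could be grouped by entity and the first and last years compared. The entity names and counts below are made up for illustration, and `by_entity` is a hypothetical name.

```r
library(dplyr)

# Hypothetical toy data: two made-up entities observed in 2023 and 2025
toy <- data.frame(
  entity_name = c("School A", "School A", "School B", "School B"),
  year        = c(2023, 2025, 2023, 2025),
  num_white   = c(100, 90, 40, 44),
  num_hisp    = c(50, 60, 80, 88)
)

# Change in each count between the earliest and latest year, per entity
by_entity <- toy |>
  group_by(entity_name) |>
  arrange(year, .by_group = TRUE) |>
  summarise(white_change = last(num_white) - first(num_white),
            hisp_change  = last(num_hisp) - first(num_hisp))
by_entity
```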

References: https://data.nysed.gov/downloads.php