This project will look at California public school administrative data to study the relationship between race and free/reduced lunch eligibility. Additionally, I will see if the location of the school or the type of school has a relationship with student eligibility for free/reduced lunch. I want to look specifically at the relationship between race and free/reduced lunch because I believe race plays a significant role in poverty among families, which then results in a child’s eligibility for free or reduced school lunches at public schools in the U.S. I’m also interested in whether other factors affect free/reduced lunch eligibility, such as school location, as school location type may reflect community poverty levels.
Questions that I’m interested in pursuing in this project include: what is the white vs. non-white race demographics in California public schools? Do California schools with non-white students have more students eligible for free/reduced lunch than schools with white students? Is the proportion of non-white students at a school a predictive factor for the proportion of students eligible for free/reduced lunch? Does the location type of the school affect how many students eligible for free/reduced lunch? Do rural and city schools have higher proportions of students eligible for free/reduced lunch than suburban or town locations? Are there differences in free/reduced lunch eligibility across school types (elementary/middle/high)?
I expect to find a positive relationship between non-white races and eligibility for free/reduced school lunch. I expect city and rural locations to have more students eligible for free/reduced school lunch compared to suburb and town locations.
The original dataset for this project is from the National Center for Education Statistics https://catalog.data.gov/dataset/public-school-characteristics-2020-21. The dataset contains administrative data about U.S. public schools from the 2020-2021 school year. This dataset has 100,722 rows and 79 variables (named “pubsch”).
pubsch <- read.csv("~/1. CMU Coursework/2. Spring 2023/94842 Programming R for Analytics/4. Final Project/pubsch.csv", header=TRUE, stringsAsFactors=TRUE)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.1 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(ggplot2)
I completed all cleaning/pre-processing in R. For this project, I want to reduce the scope of the dataset, so I first selected variables from the dataset that I thought would be meaningful to consider in this project. From the original dataset, I selected variables on school level (elementary, middle, high, etc), location type (rural, town, suburban, city), number of students eligible for free or reduced lunch, total number of students in the school, and race categories. This filtered out 64 other variables/columns, which included unneeded data, such as detailed address data, grade level data, and gender data. I chose to use the variable that combined free and reduced lunch eligibility data, rather than keeping those variables separate. This was to help simplify analyses. In future research or for a larger project, these additional variables could be meaningful to include, but was outside the scope of this project.
Then, I edited variable names for ease of use in my analyses.
I wanted to reduce the dataset from a national-level to a state-level scope. I chose to focus on California because it has a large and racially diverse population, so it would allow me to complete meaningful analyses using race variables while I work with a smaller scope.
I filtered the school type to only include elementary, middle or high schools. This schools that fall outside these traditional levels, such as adult education. I don’t think these values will be meaningful for my analyses, as free/reduced lunch is typically more of a concern for younger students (i.e. children).
I combined sub-levels of the 4 major location types to simplify my analyses. Each location originally each contained 3 sub-levels, and I mutated the data to remove this detail to simply my analysis.
My final step of data cleaning checked for NAs and changed blanks in numerical data to 0s to allow for easier analysis. I also removed any schools that did not have any students in the 2020-2021 school year, as this would not be meaningful for my analysis.
#removing unnecessary variables & selecting variables of interest
pubsch_a <- pubsch[, c("OBJECTID", "LSTATE", "SCHOOL_LEVEL", "SY_STATUS_TEXT", "ULOCALE", "TOTFRL", "TOTAL", "AM", "AS", "BL", "HP", "HI", "TR", "WH")]
#changing variable names
colnames(pubsch_a) <- c("id", "state", "type", "operational", "location", "totalfreered", "totalstud", "native", "asian", "blk", "pacific", "hispanic", "twoplus", "wht")
#removing other states
pubsch_a <- filter(pubsch_a, state == "CA")
#selecting only elementary, middle, high schools
pubsch_a <- filter(pubsch_a, type == "Elementary" | type == "Middle" | type == "High")
#merging location levels
pubsch_a$location <- ifelse(pubsch_a$location == "41-Rural: Fringe" | pubsch_a$location == "42-Rural: Distant" | pubsch_a$location == "43-Rural: Remote", "Rural",
ifelse(pubsch_a$location == "31-Town: Fringe" | pubsch_a$location == "32-Town: Distant" | pubsch_a$location == "33-Town: Remote", "Town",
ifelse(pubsch_a$location == "21-Suburb: Large" | pubsch_a$location == "22-Suburb: Mid-size" | pubsch_a$location == "23-Suburb: Small", "Suburb",
ifelse(pubsch_a$location == "11-City: Large" | pubsch_a$location == "12-City: Mid-size" | pubsch_a$location == "13-City: Small", "City", NA))))
#changing NAs in student population columns to 0s
#first, check to make sure no NAs exist in certain columns
which(is.na(pubsch_a[, c("id", "state", "type", "operational", "location")]))
## integer(0)
#change NAs to 0
pubsch_a[is.na(pubsch_a)] <- 0
#remove any schools with no students in them
pubsch_a <- subset(pubsch_a, totalstud != 0)
To compare white vs. non-white California students, I combined the non-white race categories and create a new race variable, nonwht.
#creating a new variable, "nonwht", that combines all minority race variables
pubsch_a <- mutate(pubsch_a, nonwht = native + asian + blk + pacific + hispanic + twoplus)
The dataset has 9,613 rows and 15 columns. The dataset variables include the type of school (elementary, middle, or high), location type of the school (rural, town, suburb, or city), operational status in 2020-2021 (with all schools in the cleaned dataset currently operational). The dataset also include the total number of students in each school and how many students are eligible for free/reduced lunch in each school. Race data is broken out by 7 race categories, plus the additional “nonwht” variable I created, for a total of 8 race categories.
str(pubsch_a)
## 'data.frame': 9441 obs. of 15 variables:
## $ id : int 5629 5630 5631 5634 5635 5636 5637 5638 5641 5642 ...
## $ state : Factor w/ 56 levels "AK","AL","AR",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ type : Factor w/ 10 levels "Adult Education",..: 3 2 4 2 4 2 2 2 3 2 ...
## $ operational : Factor w/ 7 levels "Currently operational ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ location : chr "Rural" "Rural" "Rural" "Suburb" ...
## $ totalfreered: num 190 169 158 40 71 17 31 20 31 100 ...
## $ totalstud : num 358 287 274 217 670 292 318 255 50 133 ...
## $ native : num 2 2 0 1 3 0 0 1 1 0 ...
## $ asian : num 4 2 3 6 22 6 19 4 0 2 ...
## $ blk : num 2 2 4 1 9 3 4 2 0 0 ...
## $ pacific : num 2 0 0 0 1 0 0 0 0 0 ...
## $ hispanic : num 173 145 131 41 83 29 32 25 44 103 ...
## $ twoplus : num 4 11 11 18 31 28 32 37 0 3 ...
## $ wht : num 171 125 125 150 520 226 231 185 5 25 ...
## $ nonwht : num 187 162 149 67 149 66 87 69 45 108 ...
summary(pubsch_a)
## id state type
## Min. : 5629 CA :9441 Elementary :5922
## 1st Qu.: 8373 AK : 0 High :2127
## Median :10876 AL : 0 Middle :1392
## Mean :10855 AR : 0 Adult Education: 0
## 3rd Qu.:13345 AS : 0 Not Applicable : 0
## Max. :99741 AZ : 0 Not Reported : 0
## (Other): 0 (Other) : 0
## operational location
## Currently operational :9439 Length:9441
## New school : 2 Class :character
## School has changed agency : 0 Mode :character
## School has reopened : 0
## School temporarily closed : 0
## School to be operational within two years : 0
## School was operational but not reported last year: 0
## totalfreered totalstud native asian
## Min. : 0.0 Min. : 1.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 118.0 1st Qu.: 316.0 1st Qu.: 0.000 1st Qu.: 3.00
## Median : 277.0 Median : 492.0 Median : 1.000 Median : 19.00
## Mean : 356.3 Mean : 601.1 Mean : 2.786 Mean : 72.62
## 3rd Qu.: 475.0 3rd Qu.: 700.0 3rd Qu.: 3.000 3rd Qu.: 70.00
## Max. :3947.0 Max. :4692.0 Max. :378.000 Max. :2619.00
##
## blk pacific hispanic twoplus
## Min. : 0.00 Min. : 0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 0.000 1st Qu.: 99.0 1st Qu.: 5.00
## Median : 10.00 Median : 1.000 Median : 251.0 Median : 16.00
## Mean : 30.69 Mean : 2.621 Mean : 337.2 Mean : 29.04
## 3rd Qu.: 34.00 3rd Qu.: 3.000 3rd Qu.: 452.0 3rd Qu.: 40.00
## Max. :671.00 Max. :109.000 Max. :3804.0 Max. :467.00
##
## wht nonwht
## Min. : 0 Min. : 0.0
## 1st Qu.: 14 1st Qu.: 205.0
## Median : 53 Median : 388.0
## Mean : 126 Mean : 474.9
## 3rd Qu.: 170 3rd Qu.: 594.0
## Max. :1884 Max. :4167.0
##
First, I wanted to look at the California students eligible or not for free/reduced lunch.
I used this site for code on how to make stylistic changes to the data visualzations: http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/
I used this site for color names https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf
#summing total students eligible for free/reduced lunch to calculate total students not eligible
sum_totalfree <- sum(pubsch_a$totalfreered)
sum_totalall <- sum(pubsch_a$totalstud)
sum_totalnon <- sum_totalall - sum_totalfree
#creating a dataframe to create a barchart
stud_lunch_data <- data.frame(lunch = c("Eligible for Free/Reduced Lunches", "Not Eligible for Free/Reduced Lunches"),
sums = c(sum_totalfree, sum_totalnon))
ggplot(stud_lunch_data, aes(x = lunch, y = sums, fill = lunch)) + geom_bar(stat = "identity", width = 0.3) + labs(title = "California Public School Students Eligible or Not for Free/Reduced Lunches", x = "", y = "Value") + theme(legend.position = "none") + scale_y_continuous(labels = scales::comma)
Next, I looked at total California students in each race category to see what the racial make-up of the population was.
#summing race category totals
sum_native <- sum(pubsch_a$native)
sum_asian <- sum(pubsch_a$asian)
sum_blk <- sum(pubsch_a$blk)
sum_pacific <- sum(pubsch_a$pacific)
sum_hispanic <- sum(pubsch_a$hispanic)
sum_twoplus <- sum(pubsch_a$twoplus)
sum_wht <- sum(pubsch_a$wht)
#creating a dataframe to create a barchart
sum_races <- data.frame(race = c("Native American", "Asian", "Black", "Pacific", "Hispanic", "Two+ Races", "White"),
count = c(sum_native, sum_asian, sum_blk, sum_pacific, sum_hispanic, sum_twoplus, sum_wht))
ggplot(sum_races, aes(x = race, y = count)) + geom_bar(stat = "identity", fill = "goldenrod1", ) + labs(title = "California Public School Students by Ethnicity", x = "", y = "Number of Students") + scale_y_continuous(labels = scales::comma)
California has a very large Hispanic student population, much larger than any other race category.
Next, I looked at white vs. non-white California students.
sum_nonwht <- sum(pubsch_a$nonwht)
sum_wht <- sum(pubsch_a$wht)
#creating a dataframe to create a barchart
sum_wht_non <- data.frame(race = c("Non-White", "White"),
count = c(sum_nonwht, sum_wht))
ggplot(sum_wht_non, aes(x = race, y = count)) + geom_bar(stat = "identity", fill = "yellowgreen", width = 0.3) + labs(title = "White vs. Non-White California Public School Students", x = "", y = "Number of Students") + scale_y_continuous(labels = scales::comma)
Consistent with the prior chart, this exemplifies how diverse the California student population is. Comparing the free/reduced lunch eligibility chart and the non-white vs. white charts, I suspect there is a positive relationship between eligibility for free/reduced lunch and non-white students.
For my analyses, I wanted to see the proportion of non-white students at each school in California and the proportion of students eligible for free/reduced lunch, so I calculated these variables.
#creating a new variables of proportion of students in each school that are non-white and eligible for free/reduced lucnh
pubsch_a <- mutate(pubsch_a, prononwht = round(nonwht/totalstud, 2))
pubsch_a <- mutate(pubsch_a, profreered = round(totalfreered/totalstud, 2))
I investigate my question about the relationship about race and eligibility for free/reduced lunch, I ran a correlation to see the relationship between the proportion of non-white students and proportion of students eligible for free/reduced lunch. I expected to see a positive relationship between these variables.
#running a correlation analysis to determine the relationship of schools with high percentage of non-white students and free/reduced lunch eligibility
cor(pubsch_a$prononwht, pubsch_a$profreered)
## [1] 0.6315537
The results of the correlation test indicate a moderate positive relationship between the proportion of non-white students and proportion of students eligible for free/reduced school lunch.
With the findings of a moderate positive relationship, I next ran a regression model with the proportion of non-white students and proportion of students eligible for free/reduced school lunch.
regress <- lm(pubsch_a$profreered ~ pubsch_a$prononwht)
summary(regress)
##
## Call:
## lm(formula = pubsch_a$profreered ~ pubsch_a$prononwht)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78912 -0.13319 0.03641 0.15262 0.93628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.004210 0.008069 -0.522 0.602
## pubsch_a$prononwht 0.793328 0.010025 79.138 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2204 on 9439 degrees of freedom
## Multiple R-squared: 0.3989, Adjusted R-squared: 0.3988
## F-statistic: 6263 on 1 and 9439 DF, p-value: < 2.2e-16
This regression indicates a high level of significance between the proportion of non-white students at a school and the proportion of students eligible for free/reduced lunch, which support my original hypothesis. The adjusted R square of 0.3988 indicates that 39% of the variance in the proportion of students eligible for free/reduced lunch is explained by the proportion of non-white students at a school.
regress_plot <- ggplot(pubsch_a, aes(x = profreered, y = prononwht, color = "tomato"))
regress_plot + geom_point() + xlab("Proportion of Students Eligible for Free/Reduced Lunch") + ylab("Proportion of Non-White Students") + labs(title="Free/Reduced Lunch by Non-White Student Proportion") + geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "royalblue") + theme(legend.position = "none")
Because certain types of communities are thought to be wealthier (suburbs, towns) than others (city, rural), I wanted to look at at the proportion of free/reduced lunch eligible students based on school location type. I expected to see differences across location type, mostly a lower proportion of suburban and town students eligible for free/reduced lunch and a higher proportion of city and rural students eligible for free/reduced lunch.
ggplot(pubsch_a, aes(x = location, y = profreered)) +
geom_boxplot() +
labs(x = "Location Type", y = "Proportion of Students Eligible for Free/Reduced Lunch", title = "Free/Reduced Lunch Eligibility based on 4 Location Types")
This barplot is interesting because the suburban level has a larger bar than the others and the town level has a smaller bar than the others, which is not what I expected in my analysis. This indicates that there is a greater frequency of suburban students eligible for free/reduced lunch compared to the other location types. And, that there is a lower frequency of town students eligible for free/reduced lunch compared to the other location types.
Next, I wanted to look at the proportion of free/reduced lunch eligible students based on school type. Although there could be differences across school types, I do not expect the proportion of free/reduce lunch eligible students to vary meaningfully across school type because I expect that once students are eligible for free/reduced lunch, this eligibility does not change as they age and attend other school types.
ggplot(pubsch_a, aes(x = type, y = profreered)) +
geom_boxplot() +
labs(x = "School Type", y = "Proportion of Students Eligible for Free/Reduced Lunch", title = "Free/Reduced Lunch Eligibility based on School Types")
Consistent with my expectation, there is little variation across school type. If the dataset included data on actual uptake of free/reduced lunch across school types, I would expect larger differences, as older students, maybe middle schoolers but likely high schoolers, tend to utilize free/reduced lunch less compared to younger students due to more salient social stigma about poverty and social service use.
Based on the barplots, I wanted to further explore the relationship between free/reduced lunch eligibility, race, and school location type. I ran another regression, this time controlling for location type.
local_ctrl <- lm(pubsch_a$profreered ~ pubsch_a$prononwht+pubsch_a$location)
summary(local_ctrl)
##
## Call:
## lm(formula = pubsch_a$profreered ~ pubsch_a$prononwht + pubsch_a$location)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94692 -0.11932 0.02764 0.15154 1.03246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.082905 0.009138 -9.073 < 2e-16 ***
## pubsch_a$prononwht 0.869619 0.010227 85.030 < 2e-16 ***
## pubsch_a$locationRural 0.131754 0.007768 16.962 < 2e-16 ***
## pubsch_a$locationSuburb -0.018252 0.004880 -3.740 0.000185 ***
## pubsch_a$locationTown 0.160207 0.008690 18.436 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2125 on 9436 degrees of freedom
## Multiple R-squared: 0.441, Adjusted R-squared: 0.4408
## F-statistic: 1861 on 4 and 9436 DF, p-value: < 2.2e-16
location_ctrl <- ggplot(pubsch_a, aes(x = profreered, y = prononwht, color = "tomato"))
location_ctrl + geom_point() + facet_wrap(~location) + xlab("Proportion of Students Eligible for Free/Reduced Lunch") + ylab("Proportion of Non-White Students") + labs(title="Free/Reduced Lunch by Non-White Student Proportion, by Location Type") + geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "royalblue") + theme(legend.position = "none")
This regression again shows a strong positive relationship between the proportion of non-white students and the proportion of students eligible for free/reduced lunch. This relationship is stronger in rural areas than cities, but weaker in the suburbs than cities which is consistent with my hypotheses. These analyses overall are consistent with my predictions regarding the relationship between non-white students and free/reduced lunch eligibility.
This research question is important for state budgeting for free/reduced lunch and also illuminates the relationship between race and family income and child poverty. Particularly interesting is that this dataset is from the first year of the COVID-19 pandemic; during this time, students were away from school and the usual services they receive at school, such as free/reduced lunch. This illustrates another way that the effect of race and poverty can worsen and that people in this intersection are highly vulnerable to disruptions is social services. Continuing to explore and understand this relationship is important for social policy across all states, but particularly racially diverse states like California.