The State of New York has a budget known as School Aid that provides funding to k-12 school districts state-wide. This aid provides funding for a variety of areas including, but not limited to: BOCES, Academic Improvment Aid, Transportation Aid, and expenses like textbooks, software, hardware, and library materials. Full details regarding School Aid can be found in this executive summary for ’17-’18 school years.
New York State has 62 counties, over 720 individual school districts with over 4,400 public schools, and over 2.5 Million k-12 students enrolled on average. The purpose of this observational study is to answer the question:
Do NY State counties who receive a higher amount of state aid per student also have the lowest graduation rates?
Three packages will be used to pull, clense, format, and combine the data:
library(RCurl)
library(tidyverse)
library(scales)
Each case will be a county with a varying number of districts.
Grad Data: 2015-2016 data covering cohorts 2010,11, and 12. This database contains annual graduation, and dropout data for the state as well as by county. The data was obtained from this source from NYSE.gov.
School Aid Data: reflects State aid funding for each school district. The data covers multiple years but I’ll be focused on the 2012-2015 enacted budget years to roughly cover the cohorts graduating per the 2015-2016 data.
Population Data: is re-purposed here from an additional reference file for another NYS analysis for interpreting vital statistics death rates. The data can be found here.
All data will be combined at the county level and averaged by district information where appropriate.
First I download the 3 different datasets and clean them up to combine them:
# Download .csv file, clean and format:
schl_data_url <- "https://raw.githubusercontent.com/jbrnbrg/proj_proposal_606/master/Data/GRAD_RATE_AND_OUTCOMES_2016.csv"
my_school_data <- read.csv(text = getURL(schl_data_url),
colClasses = c(AGGREGATION_CODE = "character", LEA_BEDS = "character"),
stringsAsFactors = FALSE,
na.strings = c("-"))
schl_dat_gen_school <-my_school_data %>%
filter(MEMBERSHIP_DESC != "2012 Total Cohort - 4 Year Outcome",
AGGREGATION_INDEX == 4, #schools
SUBGROUP_CODE == 2 | SUBGROUP_CODE == 3) %>% #male, female
select(REPORT_SCHOOL_YEAR:SUBGROUP_NAME, ENROLL_CNT, GRAD_CNT, DROPOUT_CNT) %>%
na.omit() %>%
select(REPORT_SCHOOL_YEAR, AGGREGATION_NAME,LEA_NAME,LEA_BEDS,
NRC_DESC, COUNTY_NAME, MEMBERSHIP_CODE,
MEMBERSHIP_DESC, SUBGROUP_NAME, ENROLL_CNT, GRAD_CNT, DROPOUT_CNT) %>%
mutate(COUNTY_NAME = tolower(COUNTY_NAME),
HS_NAME = AGGREGATION_NAME,
GENDER = SUBGROUP_NAME,
DISTRICT_NAME = LEA_NAME,
BEDS_CODE = LEA_BEDS,
COHORT = ifelse(MEMBERSHIP_CODE == 6, "c_2010_6yr",
ifelse(MEMBERSHIP_CODE == 8, "c_2011_5yr",
ifelse(MEMBERSHIP_CODE == 11, "c_2012_4yr",
MEMBERSHIP_CODE)))) %>%
select(COUNTY_NAME, HS_NAME,DISTRICT_NAME, LEA_BEDS, NRC_DESC, COHORT,
GENDER,ENROLL_CNT, GRAD_CNT, DROPOUT_CNT) %>%
arrange(COUNTY_NAME, desc(COHORT))
# obtain avg grad rate per county:
schl_enroll_grad_drop <- schl_dat_gen_school %>%
select(COUNTY_NAME, COHORT, ENROLL_CNT, GRAD_CNT, DROPOUT_CNT) %>%
mutate(COUNTY_NAME = ifelse(COUNTY_NAME == "bronx" | COUNTY_NAME == "richmond" |
COUNTY_NAME == "queens" | COUNTY_NAME == "kings" |
COUNTY_NAME == "new york", "new york city", COUNTY_NAME)) %>%
group_by(COUNTY_NAME, COHORT) %>%
summarise_each(funs(sum)) %>%
mutate(GRAD_RT = GRAD_CNT / ENROLL_CNT, DROP_RT = DROPOUT_CNT/ENROLL_CNT) %>%
select(COUNTY_NAME, ENROLL_CNT:DROP_RT) %>%
group_by(COUNTY_NAME) %>%
summarise_each(funs(mean)) %>%
select(COUNTY_NAME, GRAD_RT, DROP_RT)
#boxplot(schl_dat_gen_school$GRAD_CNT~schl_dat_gen_school$GENDER)
#unique(schl_enroll_grad_drop$COUNTY_NAME)
#length(unique(schl_dat_gen_school$DISTRICT_NAME))
nys_aid_url <- "https://raw.githubusercontent.com/jbrnbrg/proj_proposal_606/master/Data/New_York_State_School_Aid__Beginning_School_Year_1996-97.csv"
nys_school_aid <- read.csv(text = getURL(nys_aid_url), stringsAsFactors = FALSE)
# avg aid $'s per county over 3 years
school_aid_avg <- nys_school_aid %>%
filter(Event == "2012-13 Enacted Budget" |
Event == "2013-14 Enacted Budget" |
Event == "2014-15 Enacted Budget",
Aid.Category == "Sum of Above Aid Categories") %>%
mutate(COUNTY_NAME = tolower(ifelse(County == "St. Lawrence", "Saint Lawrence", County))) %>%
select(Event, COUNTY_NAME, School.Year) %>%
group_by(Event, COUNTY_NAME) %>%
summarise(USD_PER_CNTY = sum(School.Year)) %>%
group_by(COUNTY_NAME) %>%
summarise(AVG_AID = mean(USD_PER_CNTY))
nys_pop_url <- "https://raw.githubusercontent.com/jbrnbrg/proj_proposal_606/master/Data/NYS_Pop_Data_Beg_2003.csv"
nys_pop <- read.csv(text = getURL(nys_pop_url), stringsAsFactors = FALSE) %>%
filter(Age.Group.Code != 0, Gender.Code != 0, Race.Ethnicity.Code != 0, # no totals
County.Name != "New York State", # use NYC combined like Student Aid $ file
County.Name != "Bronx", County.Name != "Richmond",
County.Name != "Queens", County.Name != "Kings",
County.Name != "New York", County.Name != "Rest of State" ) %>%
mutate(COUNTY_NAME = tolower(ifelse(County.Name == "St Lawrence", "Saint Lawrence", County.Name)))
# estimate the NYS student population:
nys_pop_students <- nys_pop %>%
filter(Year >= 2011, Age.Group.Code == 3) %>%
select(Year, COUNTY_NAME, Population) %>%
group_by(Year, COUNTY_NAME) %>%
summarise(Population = sum(Population)) %>%
group_by(COUNTY_NAME) %>%
summarise(STU_POP_4YRAVG = round(mean(Population), digits = 0))
combined_data <- school_aid_avg %>%
inner_join(nys_pop_students, by = "COUNTY_NAME") %>%
mutate(USD_AID_PER_STU = AVG_AID / STU_POP_4YRAVG) %>%
inner_join(schl_enroll_grad_drop, by = "COUNTY_NAME")
Here are the columns of data within the dataset combined_data
:
Let’s review some summary stats for reference:
review_county <- combined_data %>%
arrange(desc(AVG_AID)) %>%
ungroup() %>%
mutate(index = row_number())
review_county <- transform(review_county,
COUNTY_NAME = reorder(COUNTY_NAME, AVG_AID))
county_noNYC <- review_county %>%
filter(COUNTY_NAME != "new york city")
summary(review_county)
## COUNTY_NAME AVG_AID STU_POP_4YRAVG USD_AID_PER_STU
## hamilton: 1 Min. :2.809e+06 Min. : 460 Min. : 3826
## yates : 1 1st Qu.:8.184e+07 1st Qu.: 6662 1st Qu.: 8251
## schuyler: 1 Median :1.186e+08 Median : 10932 Median : 9899
## essex : 1 Mean :3.607e+08 Mean : 42720 Mean :10276
## wyoming : 1 3rd Qu.:2.149e+08 3rd Qu.: 25354 3rd Qu.:12648
## greene : 1 Max. :8.296e+09 Max. :954600 Max. :18180
## (Other) :52
## GRAD_RT DROP_RT index
## Min. :0.7563 Min. :0.01907 Min. : 1.00
## 1st Qu.:0.8614 1st Qu.:0.05555 1st Qu.:15.25
## Median :0.8794 Median :0.07452 Median :29.50
## Mean :0.8774 Mean :0.07274 Mean :29.50
## 3rd Qu.:0.9057 3rd Qu.:0.08256 3rd Qu.:43.75
## Max. :0.9456 Max. :0.16236 Max. :58.00
##
The output of summary
, above, provdes a decent overview of information provided in this dataset.
E.g. across all counties, I see that:
DROP_RT
is between 2% and 16%GRAD_RT
is between 76% and 95%USD_AID_PER_STU
is between 3,800 and $18,200.For reference, here’s the head
of primary dataset, review_county
:
knitr::kable(head(review_county), caption = "")
COUNTY_NAME | AVG_AID | STU_POP_4YRAVG | USD_AID_PER_STU | GRAD_RT | DROP_RT | index |
---|---|---|---|---|---|---|
new york city | 8296490932 | 954600 | 8691.065 | 0.7756557 | 0.1183864 | 1 |
suffolk | 1679895731 | 207504 | 8095.727 | 0.9255756 | 0.0391216 | 2 |
erie | 1183227634 | 118317 | 10000.487 | 0.8625001 | 0.0835583 | 3 |
monroe | 958907030 | 101244 | 9471.248 | 0.8730557 | 0.0801537 | 4 |
nassau | 863504365 | 184023 | 4692.372 | 0.9388640 | 0.0229229 | 5 |
westchester | 607677078 | 133016 | 4568.451 | 0.9179992 | 0.0384439 | 6 |
Now lets visualize some of this data: This is the annual average Student Aid per county.
# p1<-ggplot(data = review_county, aes(x=COUNTY_NAME, y=AVG_AID)) +
# geom_bar(stat = "identity", alpha = .8) +
# coord_flip() + scale_y_continuous(label = dollar_format()) +
# theme(panel.grid.major = element_line(color = "gray")) +
# ggtitle(expression(atop("Avg Annual Student Aid $",
# atop(italic("Per NYS County 2012-2015 average"), ""))))
#
p2<-ggplot(data = county_noNYC, aes(x=COUNTY_NAME, y=AVG_AID)) +
geom_bar(stat = "identity", alpha = .8) + #+ scale_y_continuous(labels = "comma") #+
coord_flip() + scale_y_continuous(label = dollar_format()) +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Avg Annual Student Aid $ by NYS County (without NYC)")
# grid.arrange(p1,p2, nrow = 2)
p2
As expected, the second-highest county for average annual Student Aid recipient is Suffolk.
A per-student view provides a clearer comparison with NYC included:
# save df as reordered:
review_county <- transform(review_county,
COUNTY_NAME = reorder(COUNTY_NAME, USD_AID_PER_STU))
ggplot(data = review_county, aes(x=COUNTY_NAME, y=USD_AID_PER_STU, fill = GRAD_RT)) +
geom_bar(stat = "identity", alpha = .8) +
coord_flip() + scale_y_continuous(label = dollar_format()) +
ggtitle(expression(atop("Avg Annual Student Aid per Student by NYS County",
atop(italic("Per NYS County 2012-2015 average"), "")))) +
theme(panel.grid.major = element_line(color = "gray")) +
labs(y = "Average Student Aid $ per Student")
As shown above, when we arrange by USD_AID_PER_STU
, the counties appear to be more closely comparable. Even with the fill
that includes the GRAD_RT
, it’s hard to tell whether more student aid $’s per student impacts graduation rates.
To take a deeper dive I’ll plot GRAD_RT
by USD_AID_PER_STU
to investigate whether there’s a relationship there:
ggplot(data = review_county, aes(x=USD_AID_PER_STU, y=GRAD_RT)) +
geom_point(size = 2, alpha = .4) +
scale_x_continuous(label = dollar_format()) +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Average Annual Student Aid per Student vs. County Grad Rates:",
x = "Annual Student Aid per Student by County",
y = "HS Grad Rates by County")
There appears to be some trend above - here’s the correlation:
sm_rc <- review_county #to make the name shorter
cor(sm_rc$GRAD_RT, sm_rc$USD_AID_PER_STU)
## [1] -0.3522534
But I need to ask - is there a relationship between these variables? Using inference by linear regression could be a good approach.
\(H_{0}:\) Student aid per student and graduation rate are unrelated i.e. \(\beta_{1} = 0\)
\(H_{A}:\) Student aid per student and graduation rate has a negative slope i.e. \(\beta_{1} < 0\)
If I demonstrate a linear relationship that has a non-zero slope, I’d provide some support that would help me answer:
Do NY State counties who receive a higher amount of state aid per student also have the lowest graduation rates?
First I create a linear model: jmb_m1
:
jmb_m1 <- lm(GRAD_RT~USD_AID_PER_STU, data = sm_rc)
Next, I’ll check for linearity, constant variability, and nearly normal residuals to support that this linear method is appropriate:
Linearity: As shown by the plot below, there appears to be a trend. I’ve included geom_smooth
to put in the linear trend line:
ggplot(data = review_county, aes(x=USD_AID_PER_STU, y=GRAD_RT)) +
geom_point(size = 2, alpha = .4) +
scale_x_continuous(label = dollar_format()) +
#geom_abline(intercept = 0.9238, slope = -0.000004511, color = "blue") +
geom_smooth(method = "lm", se = FALSE) +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Average Annual Student Aid per Student vs. County Grad Rates:",
x = "Annual Student Aid per Student by County",
y = "HS Grad Rates by County")
Constant Variability: The below plot depicts the residuals of my model and they appear regularly dispersed about zero and I am happy to consider this data has constant variability.
ggplot(data = review_county, aes(x=USD_AID_PER_STU, y=jmb_m1$residuals)) +
geom_point(size = 2, alpha = .4) +
scale_x_continuous(label = dollar_format()) +
geom_abline(intercept = 0, slope = 0, color = "blue") +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Average Annual Student Aid per Student vs. Model Residuals:",
x = "Annual Student Aid per Student by County",
y = "Model Residuals")
Nearly Normal Residuals: below, the qqnorm
normal probability plot of the linear model’s residuals supports a normal determination as well as the histogram of the residuals:
qqnorm(jmb_m1$residuals)
qqline(jmb_m1$residuals)
hist(jmb_m1$residuals)
Now that I am happy I have a reliable model on my hands, I can continue:
I’ll use summary
on my model jmb_m1
to test my hypothesis:
summary(jmb_m1)
##
## Call:
## lm(formula = GRAD_RT ~ USD_AID_PER_STU, data = sm_rc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.108906 -0.015082 0.003821 0.024533 0.063275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.238e-01 1.724e-02 53.590 < 2e-16 ***
## USD_AID_PER_STU -4.511e-06 1.602e-06 -2.817 0.00669 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03903 on 56 degrees of freedom
## Multiple R-squared: 0.1241, Adjusted R-squared: 0.1084
## F-statistic: 7.933 on 1 and 56 DF, p-value: 0.006692
To review the given linear model with a bit clearer formatting:
\[ \hat{y} = 0.9238 - 4.411\times 10^{-6} * AidPerStudent \]
This allows me to reject \(H_{0}\) in favor of \(H_{A}\) - with a p value close to zero, I am confident the model describes this relationship well.
This makes intuitive sense to me - i.e. it seems logical that counties with lower grad rates would also be the recipients of more student funding - I suspect the idea being that more funding is going to the locations that are struggling the most.
For further information and context, please review the below links: