DATA 606 Data Project Final

Part 1 - Introduction:

The State of New York has a budget known as School Aid that provides funding to k-12 school districts state-wide. This aid provides funding for a variety of areas including, but not limited to: BOCES, Academic Improvment Aid, Transportation Aid, and expenses like textbooks, software, hardware, and library materials. Full details regarding School Aid can be found in this executive summary for ’17-’18 school years.

New York State has 62 counties, over 720 individual school districts with over 4,400 public schools, and over 2.5 Million k-12 students enrolled on average. The purpose of this observational study is to answer the question:

Do NY State counties who receive a higher amount of state aid per student also have the lowest graduation rates?

Setup:

Three packages will be used to pull, clense, format, and combine the data:

library(RCurl)
library(tidyverse)
library(scales)

Part 2 - Data:

Each case will be a county with a varying number of districts.

Grad Data: 2015-2016 data covering cohorts 2010,11, and 12. This database contains annual graduation, and dropout data for the state as well as by county. The data was obtained from this source from NYSE.gov.
School Aid Data: reflects State aid funding for each school district. The data covers multiple years but I’ll be focused on the 2012-2015 enacted budget years to roughly cover the cohorts graduating per the 2015-2016 data.
Population Data: is re-purposed here from an additional reference file for another NYS analysis for interpreting vital statistics death rates. The data can be found here.

All data will be combined at the county level and averaged by district information where appropriate.

First I download the 3 different datasets and clean them up to combine them:

Graduation Data:

# Download .csv file, clean and format: 
schl_data_url <- "https://raw.githubusercontent.com/jbrnbrg/proj_proposal_606/master/Data/GRAD_RATE_AND_OUTCOMES_2016.csv"
my_school_data <- read.csv(text = getURL(schl_data_url), 
                           colClasses = c(AGGREGATION_CODE = "character", LEA_BEDS = "character"),
                           stringsAsFactors = FALSE,
                           na.strings = c("-"))

schl_dat_gen_school <-my_school_data %>% 
  filter(MEMBERSHIP_DESC != "2012 Total Cohort - 4 Year Outcome",
         AGGREGATION_INDEX == 4, #schools
         SUBGROUP_CODE == 2 | SUBGROUP_CODE == 3) %>% #male, female
  select(REPORT_SCHOOL_YEAR:SUBGROUP_NAME, ENROLL_CNT, GRAD_CNT, DROPOUT_CNT) %>%
  na.omit() %>% 
  select(REPORT_SCHOOL_YEAR, AGGREGATION_NAME,LEA_NAME,LEA_BEDS, 
         NRC_DESC, COUNTY_NAME, MEMBERSHIP_CODE,
         MEMBERSHIP_DESC, SUBGROUP_NAME, ENROLL_CNT, GRAD_CNT, DROPOUT_CNT) %>% 
  mutate(COUNTY_NAME = tolower(COUNTY_NAME),
         HS_NAME = AGGREGATION_NAME,
         GENDER = SUBGROUP_NAME,
         DISTRICT_NAME = LEA_NAME,
         BEDS_CODE = LEA_BEDS,
         COHORT = ifelse(MEMBERSHIP_CODE == 6, "c_2010_6yr",
                         ifelse(MEMBERSHIP_CODE == 8, "c_2011_5yr",
                                ifelse(MEMBERSHIP_CODE == 11, "c_2012_4yr", 
                                       MEMBERSHIP_CODE)))) %>% 
  select(COUNTY_NAME, HS_NAME,DISTRICT_NAME, LEA_BEDS, NRC_DESC, COHORT,
         GENDER,ENROLL_CNT, GRAD_CNT, DROPOUT_CNT) %>% 
  arrange(COUNTY_NAME, desc(COHORT))

# obtain avg grad rate per county: 
schl_enroll_grad_drop <- schl_dat_gen_school %>% 
  select(COUNTY_NAME, COHORT, ENROLL_CNT, GRAD_CNT, DROPOUT_CNT) %>% 
  mutate(COUNTY_NAME = ifelse(COUNTY_NAME == "bronx" | COUNTY_NAME == "richmond" |
                                COUNTY_NAME == "queens" | COUNTY_NAME == "kings" | 
                                COUNTY_NAME == "new york", "new york city", COUNTY_NAME)) %>% 
  group_by(COUNTY_NAME, COHORT) %>% 
  summarise_each(funs(sum)) %>% 
  mutate(GRAD_RT = GRAD_CNT / ENROLL_CNT, DROP_RT = DROPOUT_CNT/ENROLL_CNT) %>% 
  select(COUNTY_NAME, ENROLL_CNT:DROP_RT) %>% 
  group_by(COUNTY_NAME) %>% 
  summarise_each(funs(mean)) %>% 
  select(COUNTY_NAME, GRAD_RT, DROP_RT)

#boxplot(schl_dat_gen_school$GRAD_CNT~schl_dat_gen_school$GENDER)
#unique(schl_enroll_grad_drop$COUNTY_NAME)
#length(unique(schl_dat_gen_school$DISTRICT_NAME))

Student Aid Data:

nys_aid_url <- "https://raw.githubusercontent.com/jbrnbrg/proj_proposal_606/master/Data/New_York_State_School_Aid__Beginning_School_Year_1996-97.csv"
nys_school_aid <- read.csv(text = getURL(nys_aid_url), stringsAsFactors = FALSE)

# avg aid $'s per county over 3 years 
school_aid_avg <- nys_school_aid %>% 
  filter(Event == "2012-13 Enacted Budget" | 
           Event == "2013-14 Enacted Budget" | 
           Event == "2014-15 Enacted Budget",
         Aid.Category == "Sum of Above Aid Categories") %>% 
  mutate(COUNTY_NAME = tolower(ifelse(County == "St. Lawrence", "Saint Lawrence", County))) %>% 
  select(Event, COUNTY_NAME, School.Year) %>% 
  group_by(Event, COUNTY_NAME) %>% 
  summarise(USD_PER_CNTY = sum(School.Year)) %>%
  group_by(COUNTY_NAME) %>% 
  summarise(AVG_AID = mean(USD_PER_CNTY))

Student Population Data:

nys_pop_url <- "https://raw.githubusercontent.com/jbrnbrg/proj_proposal_606/master/Data/NYS_Pop_Data_Beg_2003.csv"
nys_pop <- read.csv(text = getURL(nys_pop_url), stringsAsFactors = FALSE) %>%
  filter(Age.Group.Code != 0, Gender.Code != 0, Race.Ethnicity.Code != 0,  # no totals
         County.Name != "New York State",                                  # use NYC combined like Student Aid $ file
         County.Name != "Bronx", County.Name != "Richmond", 
         County.Name != "Queens", County.Name != "Kings", 
         County.Name != "New York", County.Name != "Rest of State" ) %>% 
  mutate(COUNTY_NAME = tolower(ifelse(County.Name == "St Lawrence", "Saint Lawrence", County.Name)))

# estimate the NYS student population:  
nys_pop_students <- nys_pop %>%
  filter(Year >= 2011, Age.Group.Code == 3) %>% 
  select(Year, COUNTY_NAME, Population) %>% 
  group_by(Year, COUNTY_NAME) %>% 
  summarise(Population = sum(Population)) %>% 
  group_by(COUNTY_NAME) %>% 
  summarise(STU_POP_4YRAVG = round(mean(Population), digits = 0))

Combined Data:

combined_data <- school_aid_avg %>% 
  inner_join(nys_pop_students, by = "COUNTY_NAME") %>% 
  mutate(USD_AID_PER_STU = AVG_AID / STU_POP_4YRAVG) %>% 
  inner_join(schl_enroll_grad_drop, by = "COUNTY_NAME")

Part 3 - Exploratory data analysis:

Data Contents:

Here are the columns of data within the dataset combined_data:

COUNTY_NAME: one of 62 NYS counties
AVG_AID: a 3-year average funding in USD per country by district average for NYS
STU_POP_4YRAVG: An estimate of the average student population for NYS
USD_AID_PER_STU: AVG_AID / STU_POP_4YRAVG (to get student aid $’s per student average)
GRAT_RT: Graduation rate by county based on district average for NYS
DROP_RT: Drop-out rate by county based on district average for NYS

Let’s review some summary stats for reference:

review_county <- combined_data %>% 
  arrange(desc(AVG_AID)) %>% 
  ungroup() %>% 
  mutate(index = row_number())

review_county <- transform(review_county, 
                           COUNTY_NAME = reorder(COUNTY_NAME, AVG_AID)) 

county_noNYC <- review_county %>% 
  filter(COUNTY_NAME != "new york city")

summary(review_county)

##    COUNTY_NAME    AVG_AID          STU_POP_4YRAVG   USD_AID_PER_STU
##  hamilton: 1   Min.   :2.809e+06   Min.   :   460   Min.   : 3826  
##  yates   : 1   1st Qu.:8.184e+07   1st Qu.:  6662   1st Qu.: 8251  
##  schuyler: 1   Median :1.186e+08   Median : 10932   Median : 9899  
##  essex   : 1   Mean   :3.607e+08   Mean   : 42720   Mean   :10276  
##  wyoming : 1   3rd Qu.:2.149e+08   3rd Qu.: 25354   3rd Qu.:12648  
##  greene  : 1   Max.   :8.296e+09   Max.   :954600   Max.   :18180  
##  (Other) :52                                                       
##     GRAD_RT          DROP_RT            index      
##  Min.   :0.7563   Min.   :0.01907   Min.   : 1.00  
##  1st Qu.:0.8614   1st Qu.:0.05555   1st Qu.:15.25  
##  Median :0.8794   Median :0.07452   Median :29.50  
##  Mean   :0.8774   Mean   :0.07274   Mean   :29.50  
##  3rd Qu.:0.9057   3rd Qu.:0.08256   3rd Qu.:43.75  
##  Max.   :0.9456   Max.   :0.16236   Max.   :58.00  
##

The output of summary, above, provdes a decent overview of information provided in this dataset.

E.g. across all counties, I see that:

DROP_RT is between 2% and 16%
GRAD_RT is between 76% and 95%
USD_AID_PER_STU is between 3,800 and $18,200.

For reference, here’s the head of primary dataset, review_county:

knitr::kable(head(review_county), caption = "")

COUNTY_NAME	AVG_AID	STU_POP_4YRAVG	USD_AID_PER_STU	GRAD_RT	DROP_RT	index
new york city	8296490932	954600	8691.065	0.7756557	0.1183864	1
suffolk	1679895731	207504	8095.727	0.9255756	0.0391216	2
erie	1183227634	118317	10000.487	0.8625001	0.0835583	3
monroe	958907030	101244	9471.248	0.8730557	0.0801537	4
nassau	863504365	184023	4692.372	0.9388640	0.0229229	5
westchester	607677078	133016	4568.451	0.9179992	0.0384439	6

Average Annual Aid per County:

Now lets visualize some of this data: This is the annual average Student Aid per county.

# p1<-ggplot(data = review_county, aes(x=COUNTY_NAME, y=AVG_AID)) + 
#   geom_bar(stat = "identity", alpha = .8) + 
#   coord_flip() + scale_y_continuous(label = dollar_format()) + 
#   theme(panel.grid.major = element_line(color = "gray")) +
#   ggtitle(expression(atop("Avg Annual Student Aid $", 
#                           atop(italic("Per NYS County 2012-2015 average"), ""))))
# 
p2<-ggplot(data = county_noNYC, aes(x=COUNTY_NAME, y=AVG_AID)) + 
  geom_bar(stat = "identity", alpha = .8) + #+ scale_y_continuous(labels = "comma") #+ 
  coord_flip() + scale_y_continuous(label = dollar_format()) + 
  theme(panel.grid.major = element_line(color = "gray")) +
  labs(title = "Avg Annual Student Aid $ by NYS County (without NYC)")
# grid.arrange(p1,p2, nrow = 2)

p2

As expected, the second-highest county for average annual Student Aid recipient is Suffolk.

Average Annual Aid per County per Student:

A per-student view provides a clearer comparison with NYC included:

# save df as reordered:  
review_county <- transform(review_county, 
                           COUNTY_NAME = reorder(COUNTY_NAME, USD_AID_PER_STU))

ggplot(data = review_county, aes(x=COUNTY_NAME, y=USD_AID_PER_STU, fill = GRAD_RT)) + 
  geom_bar(stat = "identity", alpha = .8) + 
  coord_flip() + scale_y_continuous(label = dollar_format()) + 
  ggtitle(expression(atop("Avg Annual Student Aid per Student by NYS County", 
                          atop(italic("Per NYS County 2012-2015 average"), "")))) + 
  theme(panel.grid.major = element_line(color = "gray")) +
  labs(y = "Average Student Aid $ per Student")

As shown above, when we arrange by USD_AID_PER_STU, the counties appear to be more closely comparable. Even with the fill that includes the GRAD_RT, it’s hard to tell whether more student aid $’s per student impacts graduation rates.

Average Annual Student Aid per Student vs. County Grad Rates:

To take a deeper dive I’ll plot GRAD_RT by USD_AID_PER_STU to investigate whether there’s a relationship there:

ggplot(data = review_county, aes(x=USD_AID_PER_STU, y=GRAD_RT)) + 
  geom_point(size = 2, alpha = .4) + 
  scale_x_continuous(label = dollar_format()) + 
  theme(panel.grid.major = element_line(color = "gray")) +
  labs(title = "Average Annual Student Aid per Student vs. County Grad Rates:", 
       x = "Annual Student Aid per Student by County", 
       y = "HS Grad Rates by County")

There appears to be some trend above - here’s the correlation:

sm_rc <- review_county #to make the name shorter
cor(sm_rc$GRAD_RT, sm_rc$USD_AID_PER_STU)

## [1] -0.3522534

But I need to ask - is there a relationship between these variables? Using inference by linear regression could be a good approach.

Part 4 - Inference:

$H_{0}:$ Student aid per student and graduation rate are unrelated i.e. $\beta_{1} = 0$
$H_{A}:$ Student aid per student and graduation rate has a negative slope i.e. $\beta_{1} < 0$

If I demonstrate a linear relationship that has a non-zero slope, I’d provide some support that would help me answer:

Do NY State counties who receive a higher amount of state aid per student also have the lowest graduation rates?

First I create a linear model: jmb_m1:

jmb_m1 <- lm(GRAD_RT~USD_AID_PER_STU, data = sm_rc)

Next, I’ll check for linearity, constant variability, and nearly normal residuals to support that this linear method is appropriate:

Linearity: As shown by the plot below, there appears to be a trend. I’ve included geom_smooth to put in the linear trend line:

ggplot(data = review_county, aes(x=USD_AID_PER_STU, y=GRAD_RT)) + 
  geom_point(size = 2, alpha = .4) + 
  scale_x_continuous(label = dollar_format()) +
  #geom_abline(intercept = 0.9238, slope = -0.000004511, color = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  theme(panel.grid.major = element_line(color = "gray")) +
  labs(title = "Average Annual Student Aid per Student vs. County Grad Rates:", 
       x = "Annual Student Aid per Student by County", 
       y = "HS Grad Rates by County")

Constant Variability: The below plot depicts the residuals of my model and they appear regularly dispersed about zero and I am happy to consider this data has constant variability.

ggplot(data = review_county, aes(x=USD_AID_PER_STU, y=jmb_m1$residuals)) + 
  geom_point(size = 2, alpha = .4) + 
  scale_x_continuous(label = dollar_format()) +
  geom_abline(intercept = 0, slope = 0, color = "blue") +
  theme(panel.grid.major = element_line(color = "gray")) +
  labs(title = "Average Annual Student Aid per Student vs. Model Residuals:", 
       x = "Annual Student Aid per Student by County", 
       y = "Model Residuals")

Nearly Normal Residuals: below, the qqnorm normal probability plot of the linear model’s residuals supports a normal determination as well as the histogram of the residuals:

qqnorm(jmb_m1$residuals)
qqline(jmb_m1$residuals)

hist(jmb_m1$residuals)

Now that I am happy I have a reliable model on my hands, I can continue:

Part 5 - Conclusion:

I’ll use summary on my model jmb_m1 to test my hypothesis:

summary(jmb_m1)

## 
## Call:
## lm(formula = GRAD_RT ~ USD_AID_PER_STU, data = sm_rc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.108906 -0.015082  0.003821  0.024533  0.063275 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.238e-01  1.724e-02  53.590  < 2e-16 ***
## USD_AID_PER_STU -4.511e-06  1.602e-06  -2.817  0.00669 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03903 on 56 degrees of freedom
## Multiple R-squared:  0.1241, Adjusted R-squared:  0.1084 
## F-statistic: 7.933 on 1 and 56 DF,  p-value: 0.006692

To review the given linear model with a bit clearer formatting:

\[ \hat{y} = 0.9238 - 4.411\times 10^{-6} * AidPerStudent \]

This allows me to reject $H_{0}$ in favor of $H_{A}$ - with a p value close to zero, I am confident the model describes this relationship well.

This makes intuitive sense to me - i.e. it seems logical that counties with lower grad rates would also be the recipients of more student funding - I suspect the idea being that more funding is going to the locations that are struggling the most.

References:

Appendix:

For further information and context, please review the below links: