My motivation for this project was to gain some insight into public education in the United States. Specifically, I wanted to see whether finances, in the form of either revenue or expenditure, act as a predictor of school performance.
Libraries used:
library(jsonlite)
library(httr)
library(RCurl)
library(XML)
library(plyr)
library(dplyr)
library(reshape2)
The finance data for every public school district in the USA was available at the United States Census Bureau’s website. I began with the 2013 data, the most recent dataset, but wound up switching to the 2011 data for reasons that will become clear shortly. The dataset description for 2011 can be found here.
The data provides an almost overwhelming level of detail in the way it breaks down different revenue streams and expenditure categories. Federal, State, and Local Revenue categories are provided, with another level of detail below each. The second group of data elements is Expenditures, which includes totals and detail-level figures for categories such as instructor salaries, benefit programs, capital outlays, administration, and more. Finally, the data set includes some ratios built from the first two categories, such as percent of revenue from federal sources and spending per pupil.
The data in the above directory is in a relational format, but as an Excel spreadsheet. CSV files are a little more predictable to read into R, so I downloaded the file, converted it to CSV, and loaded it into R.
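(In hindsight, the manual conversion step could probably be skipped by reading the spreadsheet directly. A minimal sketch, assuming the readxl package and a local copy of the download; the file name here is a placeholder:)

# Hypothetical alternative: read the Excel file directly with readxl,
# skipping the manual CSV conversion. "elsec11.xls" is a placeholder for
# wherever the downloaded Census file lives.
library(readxl)
fed_data <- read_excel("elsec11.xls")

In any event, here is the CSV route I took: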
fed_data <- read.csv('https://raw.githubusercontent.com/cjf4/IS-607-Data-Acquisition-and-Management/master/Final%20Project/elsec11.csv',
stringsAsFactors = FALSE)
tbl_df(fed_data)
## Source: local data frame [14,491 x 141]
##
## STATE IDCENSUS NAME CONUM CSA CBSA
## (int) (dbl) (chr) (int) (chr) (chr)
## 1 1 1.5001e+12 AUTAUGA COUNTY SCHOOL DISTRICT 1001 388 33860
## 2 1 1.5002e+12 BALDWIN COUNTY SCHOOL DISTRICT 1003 380 19300
## 3 1 1.5003e+12 BARBOUR COUNTY SCHOOL DISTRICT 1005 N 21640
## 4 1 1.5003e+12 EUFAULA CITY SCHOOL DISTRICT 1005 N 21640
## 5 1 1.5004e+12 BIBB COUNTY SCHOOL DISTRICT 1007 142 13820
## 6 1 1.5005e+12 BLOUNT COUNTY SCHOOL DISTRICT 1009 142 13820
## 7 1 1.5005e+12 ONEONTA CITY SCHOOL DISTRICT 1009 142 13820
## 8 1 1.5006e+12 BULLOCK COUNTY SCHOOL DISTRICT 1011 N N
## 9 1 1.5007e+12 BUTLER COUNTY SCHOOL DISTRICT 1013 N N
## 10 1 1.5008e+12 ANNISTON CITY SCHOOL DISTRICT 1015 N 11500
## .. ... ... ... ... ... ...
## Variables not shown: SCHLEV (int), NCESID (chr), YRDATA (int), V33 (int),
## TOTALREV (int), TFEDREV (int), C14 (int), C15 (int), C16 (int), C17
## (int), C19 (int), B11 (int), C20 (int), C25 (int), C36 (int), B10 (int),
## B12 (int), B13 (int), TSTREV (int), C01 (int), C04 (int), C05 (int), C06
## (int), C07 (int), C08 (int), C09 (int), C10 (int), C11 (int), C12 (int),
## C13 (int), C24 (int), C35 (int), C38 (int), C39 (int), TLOCREV (int),
## T02 (int), T06 (int), T09 (int), T15 (int), T40 (int), T99 (int), D11
## (int), D23 (int), A07 (int), A08 (int), A09 (int), A11 (int), A13 (int),
## A15 (int), A20 (int), A40 (int), U11 (int), U22 (int), U30 (int), U50
## (int), U97 (int), TOTALEXP (int), TCURELSC (int), TCURINST (int), E13
## (int), J13 (int), J12 (int), J14 (int), V91 (int), V92 (int), TCURSSVC
## (int), E17 (int), E07 (int), E08 (int), E09 (int), V40 (int), V45 (int),
## V90 (int), V85 (int), J17 (int), J07 (int), J08 (int), J09 (int), J40
## (int), J45 (int), J90 (int), J11 (int), J96 (int), TCUROTH (int), E11
## (int), V60 (int), V65 (int), J10 (int), J97 (int), NONELSEC (int), V70
## (int), V75 (int), V80 (int), J98 (int), TCAPOUT (int), F12 (int), G15
## (int), K09 (int), K10 (int), K11 (int), J99 (int), L12 (int), M12 (int),
## Q11 (int), I86 (int), Z32 (int), Z33 (int), V11 (int), V13 (int), V15
## (int), V17 (int), V21 (int), V23 (int), V37 (int), V29 (int), Z34 (int),
## V10 (int), V12 (int), V14 (int), V16 (int), V18 (int), V22 (int), V24
## (int), V38 (int), V30 (int), V32 (int), X_19H (int), X_21F (int), X_31F
## (int), X_41F (int), X_61V (int), X_66V (int), W01 (int), W31 (int), W61
## (int)
Simple enough. However, in the context of my desired analysis, the above data would serve as explanatory variables. I wanted to measure performance as my response variable, and that would have to come from elsewhere.
The first place I looked was the federal Department of Education, which provides data in both relational data files and APIs. An education professional I consulted had mentioned that 9th grade algebra scores are often used as a KPI (Key Performance Indicator) in education. This link appeared promising, but wound up having a couple of issues that forced me to explore other options.
The first issue was that there was no apparent dataset description. The above link contained a truncated “dataset abstract”, but even after some searching, I was not able to find the full description. While some of the columns seemed self-evident, others were not, and it would be impossible to rely on this data without a working knowledge of what it represented.
The second issue was with the web API. At first glance, the API appeared to be clean and simple, and to allow the powerful option of embedding SQL queries in the HTTP request. The API documentation can be found here.
However, separate attempts to access the data via each of the documented methods proved unsuccessful, usually resulting in HTTP 409 (Conflict) errors.
I tried posting a JSON object as the request body, which is mentioned as an option on the API page; however, this did not work.
req <- POST('https://inventory.data.gov/api/action/datastore_search',
            body = list(resource_id = 'b4e7f148-3afc-4782-84a3-dfaca3156e8b',
                        limit = 5,
                        q = 'jones'),
            encode = 'json')  # pass a list so httr serializes it to valid JSON
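To see exactly what came back, httr’s response helpers are useful (a quick check worth running after any failed request):

status_code(req)           # numeric HTTP status; here it came back 409
content(req, as = "text")  # the error body returned by the API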
A URL with an SQL query received a 409 error as well.
req <- POST("https://inventory.data.gov/api/action/datastore_search_sql?api_key=W7KRCyAf29uIuiklIvJaSgJliE6GrRpR01ObMUh6&sql=SELECT * from 'b4e7f148-3afc-4782-84a3-dfaca3156e8b' WHERE 'leanm11' LIKE 'jones'")
Growing suspicious, I tried the SQL example given on the API page itself:

https://inventory.data.gov/api/action/datastore_search_sql?sql=SELECT * from "89fec729-9ab9-43d5-8dcd-e65dfab2a17c" WHERE title LIKE 'jones'

This also resulted in a 409 error. At that point I decided the API was not reliable enough to work with, even if some of the failures (quite possibly) stemmed from mistakes on my end.
After searching for another source of data that reported math performance by state, I found a promising set at the National Center for Education Statistics (NCES), a sub-organization of the US Department of Education. Each state has a summary performance page covering a variety of subjects and grade levels. Here is New York’s page.
Fortunately, this page included 8th Grade Math performance by state, and the data was from 2011 (thus explaining why I chose the 2011 financial data above). Of all the subjects, Math was the most recently collected, which was a little curious given that 2011 was almost five years ago. C’est la vie (when it comes to bureaucracies, anyway).
The other attractive thing about these web pages was that the URLs followed a recognizable pattern for each state: a two-digit code from 01 to 55 (a few of the intermediate codes did not return a state, for whatever reason, but that caused no problems). This allowed me to build a character vector of “01” through “55”, which I would go on to paste onto a “base” URL to construct the address of each state’s page.
# Build zero-padded state IDs "01" through "55"
state_ids <- as.character(1:55)
state_ids <- unlist(lapply(state_ids, function(x) if (nchar(x) < 2) paste0("0", x) else x))
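# Note: sprintf("%02d", 1:55) would produce the same zero-padded vector
# in a single line.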
url <- 'https://nces.ed.gov/programs/stateprofiles/sresult.asp?mode=full&displaycat=7&s1='
I started with the first state, Alabama, and worked to figure out a repeatable pattern for downloading the data from each URL. With the pattern determined, I could write functions and simply apply them to each URL.
My first function takes a two-digit state ID, builds the corresponding URL, and uses RCurl’s getURL function to issue an HTTP GET request to the server. It then converts all the HTML tables in the response to data frames using the readHTMLTable function. Finally, it pulls out the germane data frame and returns it.
per_data_dl <- function(state_id) {
  # Fetch the state's profile page and parse every HTML table on it
  state_uri <- getURL(paste0(url, state_id))
  state_table <- readHTMLTable(state_uri, stringsAsFactors = FALSE)
  # The ninth table on the page holds the performance figures
  state_df <- state_table[[9]]
  return(state_df)
}
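Since a few of the IDs don’t correspond to a state, a more defensive version could guard against malformed pages. A sketch, not needed in practice here since even the non-state pages parsed:

# Defensive variant (sketch): return NULL when the request fails or the
# page lacks the expected ninth table, instead of erroring out.
per_data_dl_safe <- function(state_id) {
  tryCatch({
    tables <- readHTMLTable(getURL(paste0(url, state_id)),
                            stringsAsFactors = FALSE)
    if (length(tables) < 9) NULL else tables[[9]]
  }, error = function(e) NULL)
}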
This part jumps ahead to transformation a little, but I thought it made more sense to include it in immediate succession to the above work.
For each state’s data frame (the object returned by the function above), I needed to clean things up a little to get a coherent result. The state is identified in one of the column names of the data frame, so it was easy to obtain. Next, I looked up each of the relevant math figures and stored it in a variable. Finally, I combined the variables into a one-observation data frame for each state.
data_extract <- function(df) {
  state_name <- colnames(df)[3]        # the state's name sits in the header
  math_scale <- as.integer(df[6, 3])   # average math scale score
  math_basic <- as.integer(df[7, 3])   # % at or above Basic
  math_pro   <- as.integer(df[8, 3])   # % at or above Proficient
  math_adv   <- as.integer(df[9, 3])   # % at Advanced
  return(data.frame(state_name, math_scale, math_basic, math_pro, math_adv))
}
With the above two functions written, I could apply them to the vector of state IDs created earlier.
df_list <- lapply(state_ids, per_data_dl)
extracted <- lapply(df_list, data_extract)
Finally, since each data frame was created by the same function, it was easy to combine them into one big data frame of performance data. The V3 rows are the mysterious IDs that didn’t correspond to a state; they didn’t cause any problems.
math_data <- ldply(extracted, data.frame)
head(math_data)
## state_name math_scale math_basic math_pro math_adv
## 1 Alabama 269 58 20 4
## 2 Alaska 283 75 33 6
## 3 V3 NA NA NA NA
## 4 Arizona 277 67 29 6
## 5 Arkansas 276 67 27 4
## 6 California 270 59 23 5
Now that I had everything (or almost everything, as would become apparent) in R, I was ready to start getting my data into a workable format.
The first task was to summarize the finance data. Each observation in the finance data set is a district, but I was comparing states, so I needed to sum each variable I cared about by state. Using the group_by and summarise functions from dplyr, this was painless. I decided to look at just total revenue and total expenditures to keep things simple. I wasn’t able to use the provided ratios, since they were by district; averaging them would mean taking an average of averages without weighting appropriately, with a 500-student district counting the same as a 500,000-student one. Instead, I summed the enrollment figure for each district, giving me the total enrollment for each state.
finance_sum <- group_by(fed_data, STATE)
summed_fin <- summarise(finance_sum, sum_rev = sum(TOTALREV), sum_exp = sum(TOTALEXP), total_enroll = sum(V33))
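For what it’s worth, the same two steps read a little more cleanly as a single dplyr pipeline:

# Equivalent summary written as one dplyr pipeline (same result)
summed_fin <- fed_data %>%
  group_by(STATE) %>%
  summarise(sum_rev = sum(TOTALREV),
            sum_exp = sum(TOTALEXP),
            total_enroll = sum(V33))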
However, at this point it became apparent that the STATE variable did not contain state names, but rather state codes from the data definition file. Using that file, I created a .csv key and read it into R.
state_key <- read.csv('https://raw.githubusercontent.com/cjf4/IS-607-Data-Acquisition-and-Management/master/Final%20Project/state_key.csv', stringsAsFactors = FALSE)
This data frame served as a key of sorts, and allowed me to attach the name to each summed state observation in the finance data frame. I used the merge function for this task.
school_finance_per <- merge(summed_fin, state_key, by.x = "STATE", by.y = "state_id")
Again using the merge function, I was finally able to combine my financial data with my performance data.
school_finance_per <- merge(school_finance_per, math_data, by = "state_name")
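Because merge silently drops non-matching rows, a quick sanity check after joining is worthwhile:

# How many state rows survived both merges, and are any values missing?
nrow(school_finance_per)
sum(!complete.cases(school_finance_per))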
Finally, since the absolute revenue and spending levels of states with different populations would be difficult to compare, I created revenue and expenditure per pupil variables. (The Census finance figures are reported in thousands of dollars, so these per-pupil values are in thousands as well: Alabama’s 9.92, for example, is roughly $9,920 per pupil.)
school_finance_per$exp_per_pup <- school_finance_per$sum_exp / school_finance_per$total_enroll
school_finance_per$rev_per_pup <- school_finance_per$sum_rev / school_finance_per$total_enroll
head(school_finance_per)
## state_name STATE sum_rev sum_exp total_enroll math_scale math_basic
## 1 Alabama 1 7376725 7411768 746933 269 58
## 2 Alaska 2 2357828 2443898 131704 283 75
## 3 Arizona 3 8429186 8610417 946571 277 67
## 4 Arkansas 4 5176362 5669431 475551 276 67
## 5 California 5 70953331 69847705 6202604 270 59
## 6 Colorado 6 8747471 8676451 834832 287 76
## math_pro math_adv exp_per_pup rev_per_pup
## 1 20 4 9.922936 9.87602
## 2 33 6 18.555989 17.90248
## 3 29 6 9.096430 8.90497
## 4 27 4 11.921815 10.88498
## 5 23 5 11.261029 11.43928
## 6 40 10 10.393050 10.47812
I’ll start by taking a look at some easy-to-calculate statistics. First up: do schools, in aggregate, stay on budget?
str(subset(school_finance_per, exp_per_pup > rev_per_pup))
## 'data.frame': 28 obs. of 11 variables:
## $ state_name : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ STATE : int 1 2 3 4 8 9 10 17 18 19 ...
## $ sum_rev : int 7376725 2357828 8429186 5176362 1759170 1283071 26453693 5538869 7104589 8022917 ...
## $ sum_exp : int 7411768 2443898 8610417 5669431 1808905 1297175 26991946 5662146 7253850 8218870 ...
## $ total_enroll: int 746933 131704 946571 475551 119878 44199 2636404 482692 672958 665109 ...
## $ math_scale : int 269 283 277 276 284 254 279 289 279 272 ...
## $ math_basic : int 58 75 67 67 75 40 70 79 70 62 ...
## $ math_pro : int 20 33 29 27 32 11 29 39 27 20 ...
## $ math_adv : int 4 6 6 4 6 2 6 8 5 4 ...
## $ exp_per_pup : num 9.92 18.56 9.1 11.92 15.09 ...
## $ rev_per_pup : num 9.88 17.9 8.9 10.88 14.67 ...
So we can see that 28 of the 51 states (50 states plus DC) ran deficits in aggregate in 2011.
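The count can also be computed directly rather than read off the str() output:

# Count the deficit-running states directly (str() above showed 28 obs.)
nrow(subset(school_finance_per, exp_per_pup > rev_per_pup))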
Another quick question: which states spend the most per pupil on education?
head(top_spenders <- school_finance_per[order(school_finance_per$exp_per_pup, decreasing = TRUE),])
## state_name STATE sum_rev sum_exp total_enroll math_scale
## 9 District of Columbia 9 1283071 1297175 44199 254
## 33 New York 33 57753776 59446908 2677412 283
## 46 Vermont 46 1686499 1722515 86982 293
## 2 Alaska 2 2357828 2443898 131704 283
## 31 New Jersey 31 26242376 25716298 1399409 293
## 7 Connecticut 7 9783168 9439182 530132 289
## math_basic math_pro math_adv exp_per_pup rev_per_pup
## 9 40 11 2 29.34851 29.02941
## 33 73 34 8 22.20312 21.57075
## 46 81 43 13 19.80312 19.38906
## 2 75 33 6 18.55599 17.90248
## 31 80 44 14 18.37654 18.75247
## 7 78 40 10 17.80534 18.45421
And the least?
tail(top_spenders <- school_finance_per[order(school_finance_per$exp_per_pup, decreasing = TRUE),])
## state_name STATE sum_rev sum_exp total_enroll math_scale math_basic
## 3 Arizona 3 8429186 8610417 946571 277 67
## 43 Tennessee 43 8679893 8961443 986382 275 65
## 25 Mississippi 25 4516742 4345783 490526 265 54
## 37 Oklahoma 37 5835133 5644988 658163 276 68
## 45 Utah 45 4141253 4195916 545395 284 75
## 13 Idaho 13 2089982 1960818 265713 287 78
## math_pro math_adv exp_per_pup rev_per_pup
## 3 29 6 9.096430 8.904970
## 43 25 4 9.085165 8.799728
## 25 15 2 8.859435 9.207956
## 37 24 3 8.576884 8.865787
## 45 35 7 7.693353 7.593126
## 13 38 8 7.379458 7.865562
Now it’s time to look at the relationship between spending and math performance. We will plot expenditure per pupil against basic math proficiency, which is defined as “partial mastery of prerequisite knowledge and skills that are fundamental for proficient work at each grade.”
plot(school_finance_per$exp_per_pup, school_finance_per$math_basic, xlab = "Expenditure per pupil", ylab = "Basic Math Proficiency Percentage")
The scatterplot suggests a modest positive relationship between expenditure and Basic math proficiency. Of particular interest, though, is the outlier that spends more per pupil than any other state/district yet achieves the worst performance, both by wide margins. From the listing of highest-spending states above, we know this is Washington DC.
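To make that explicit, the outlier can be labeled directly on the plot; a small sketch that picks out the row with the lowest Basic percentage (which, as established, is DC):

# Label the high-spending, low-performing outlier on the scatterplot
dc <- school_finance_per[which.min(school_finance_per$math_basic), ]
plot(school_finance_per$exp_per_pup, school_finance_per$math_basic,
     xlab = "Expenditure per pupil", ylab = "Basic Math Proficiency Percentage")
text(dc$exp_per_pup, dc$math_basic, labels = dc$state_name, pos = 2)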
Since there appears to be a linear relationship, I will try to fit a regression line to the data.
basic_math.lm <- lm(math_basic ~ exp_per_pup, school_finance_per)
summary(basic_math.lm)
##
## Call:
## lm(formula = math_basic ~ exp_per_pup, data = school_finance_per)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.055 -5.353 2.765 5.778 13.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.1488 4.3809 16.926 <2e-16 ***
## exp_per_pup -0.1395 0.3234 -0.431 0.668
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.959 on 48 degrees of freedom
## Multiple R-squared: 0.003861, Adjusted R-squared: -0.01689
## F-statistic: 0.186 on 1 and 48 DF, p-value: 0.6682
Interesting. The regression actually estimates a negative relationship between the two variables, though the slope is nowhere near statistically significant. Washington DC must be throwing off the model.
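Overlaying the fitted line on the scatterplot makes the (lack of) slope visible:

# Draw the nearly flat, slightly negative regression line over the data
plot(school_finance_per$exp_per_pup, school_finance_per$math_basic,
     xlab = "Expenditure per pupil", ylab = "Basic Math Proficiency Percentage")
abline(basic_math.lm)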
I’ll try again, only this time excluding Washington DC.
# math_basic > 50 excludes exactly one observation: DC, at 40 percent
schools_no_DC <- subset(school_finance_per, math_basic > 50)
basic_math.lm <- lm(math_basic ~ exp_per_pup, schools_no_DC)
summary(basic_math.lm)
##
## Call:
## lm(formula = math_basic ~ exp_per_pup, data = schools_no_DC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6852 -4.8835 0.9517 5.4695 12.1225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.9043 4.2065 14.72 < 2e-16 ***
## exp_per_pup 0.8783 0.3229 2.72 0.00913 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.175 on 47 degrees of freedom
## Multiple R-squared: 0.136, Adjusted R-squared: 0.1176
## F-statistic: 7.397 on 1 and 47 DF, p-value: 0.009127
That gives us a positive relationship: for every additional $1,000 spent per pupil, the model predicts a state’s percentage of students achieving basic proficiency to increase by about 0.88 points. However, the R-squared on this model is quite low, and this does not establish a causal relationship, so it would be imprudent to advise states to start pouring money into their education budgets (on this analysis alone, at least).
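To put the slope in concrete terms, the fitted model can be evaluated at two spending levels $1,000 apart (recall that the per-pupil figures are in thousands of dollars):

# Predicted Basic percentage at $10,000 vs. $11,000 per pupil; from the
# printed coefficients, 61.90 + 0.8783*10 ≈ 70.7 and 61.90 + 0.8783*11 ≈ 71.6
predict(basic_math.lm, newdata = data.frame(exp_per_pup = c(10, 11)))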
The curious case of Washington DC illustrates the need for further analysis. No other state spent so much yet got so little for its money. This is likely related to the fact that Washington DC is exclusively an urban district; other inner-city districts that may otherwise be comparable to Washington have their results mixed in with rural and suburban districts. If nothing else, this illustrates the need to study inner-city districts separately from the rest of the country.