My motivation for this project was to gain some insight into public education in the United States. Specifically, I want to see if finances, whether it’s revenue or expenditure, act as a predictor for school performance.

Libraries used:

library(jsonlite)
library(httr)
library(RCurl)
library(XML)
library(plyr)
library(dplyr)
library(reshape2)

Acquisition

School Financial Data

The finance data for every public school district in the USA was avaialable at the United States Censu Bureau’s website. I began with the 2013 data, which is the most recent dataset, but wound up switching to the 2011 data, for reasons that will be elucidated on in due course. The dataset description for 2011 can be found here.

The data provides an almost overwhelming level of detail in the way it breaks down different revenue streams and expenditure categories. Federal, State, and Local Revenue categories are provided, with another level of detail below each. The second group of data elements are Expenditures, which include totals and detail level for categories such as instructor salaries, benefit programs, capital outlays, administration, and more. Finally, the data set includes some ratios using the first two categories, such as % revenue from federal sources and spending per pupil.

The data in the above directory is in a relational format, but as an excel spreadsheet. CSV files are a little bit more predictable to read into R, so I downloaded the file, converted it to csv, and loaded it into R.

fed_data <- read.csv('https://raw.githubusercontent.com/cjf4/IS-607-Data-Acquisition-and-Management/master/Final%20Project/elsec11.csv', 
                     stringsAsFactors = FALSE)
tbl_df(fed_data)
## Source: local data frame [14,491 x 141]
## 
##    STATE   IDCENSUS                           NAME CONUM   CSA  CBSA
##    (int)      (dbl)                          (chr) (int) (chr) (chr)
## 1      1 1.5001e+12 AUTAUGA COUNTY SCHOOL DISTRICT  1001   388 33860
## 2      1 1.5002e+12 BALDWIN COUNTY SCHOOL DISTRICT  1003   380 19300
## 3      1 1.5003e+12 BARBOUR COUNTY SCHOOL DISTRICT  1005     N 21640
## 4      1 1.5003e+12   EUFAULA CITY SCHOOL DISTRICT  1005     N 21640
## 5      1 1.5004e+12    BIBB COUNTY SCHOOL DISTRICT  1007   142 13820
## 6      1 1.5005e+12  BLOUNT COUNTY SCHOOL DISTRICT  1009   142 13820
## 7      1 1.5005e+12   ONEONTA CITY SCHOOL DISTRICT  1009   142 13820
## 8      1 1.5006e+12 BULLOCK COUNTY SCHOOL DISTRICT  1011     N     N
## 9      1 1.5007e+12  BUTLER COUNTY SCHOOL DISTRICT  1013     N     N
## 10     1 1.5008e+12  ANNISTON CITY SCHOOL DISTRICT  1015     N 11500
## ..   ...        ...                            ...   ...   ...   ...
## Variables not shown: SCHLEV (int), NCESID (chr), YRDATA (int), V33 (int),
##   TOTALREV (int), TFEDREV (int), C14 (int), C15 (int), C16 (int), C17
##   (int), C19 (int), B11 (int), C20 (int), C25 (int), C36 (int), B10 (int),
##   B12 (int), B13 (int), TSTREV (int), C01 (int), C04 (int), C05 (int), C06
##   (int), C07 (int), C08 (int), C09 (int), C10 (int), C11 (int), C12 (int),
##   C13 (int), C24 (int), C35 (int), C38 (int), C39 (int), TLOCREV (int),
##   T02 (int), T06 (int), T09 (int), T15 (int), T40 (int), T99 (int), D11
##   (int), D23 (int), A07 (int), A08 (int), A09 (int), A11 (int), A13 (int),
##   A15 (int), A20 (int), A40 (int), U11 (int), U22 (int), U30 (int), U50
##   (int), U97 (int), TOTALEXP (int), TCURELSC (int), TCURINST (int), E13
##   (int), J13 (int), J12 (int), J14 (int), V91 (int), V92 (int), TCURSSVC
##   (int), E17 (int), E07 (int), E08 (int), E09 (int), V40 (int), V45 (int),
##   V90 (int), V85 (int), J17 (int), J07 (int), J08 (int), J09 (int), J40
##   (int), J45 (int), J90 (int), J11 (int), J96 (int), TCUROTH (int), E11
##   (int), V60 (int), V65 (int), J10 (int), J97 (int), NONELSEC (int), V70
##   (int), V75 (int), V80 (int), J98 (int), TCAPOUT (int), F12 (int), G15
##   (int), K09 (int), K10 (int), K11 (int), J99 (int), L12 (int), M12 (int),
##   Q11 (int), I86 (int), Z32 (int), Z33 (int), V11 (int), V13 (int), V15
##   (int), V17 (int), V21 (int), V23 (int), V37 (int), V29 (int), Z34 (int),
##   V10 (int), V12 (int), V14 (int), V16 (int), V18 (int), V22 (int), V24
##   (int), V38 (int), V30 (int), V32 (int), X_19H (int), X_21F (int), X_31F
##   (int), X_41F (int), X_61V (int), X_66V (int), W01 (int), W31 (int), W61
##   (int)

Simple enough. However, in the context of my desired analysis, the above served as explanatory variables. I want to measure performance as my response variable, so that would have to come elsewhere.

School Performance Data

First Attempt

The first place I looked was the federal Department of Education. The DOE provides data in both relational data files and APIs. An education professional I consulted had mentioned that 9th grade algebra scores are often used as a KPI (Key Performanc Metric) in education. This link appeared promising, but wound up having a couple of issues that forced me to explore other options.

The first issue was that there was no apparent dataset description. The above link contained a truncated “dataset abstract”, but after some searching, I was not able to find the full description. While some of the columns seemed to be self evident, some were not. It would be impossible to rely on this data without a working knowledge of what the data represented.

The second issue was with the web API. At first glance, the API appeared to be clean, simple, and allow the powerful option of including SQL queries as part of the http request. The API documentation can be found here.

However, separate attempts to access the data via all the mentioned methods proved unsuccessful, usually resulting in 409 errors (access forbidden).

I tried post a JSON object as a request, which is mentioned as an option on the API page, however this did not work.

req <- POST('https://inventory.data.gov/api/action/datastore_search',
            body = "{
                        resource_id: 'b4e7f148-3afc-4782-84a3-dfaca3156e8b', 
                        limit: 5, 
                        q: 'jones' 
                     }",
            encode = 'json'

          
)

A URL with an SQL query received a 409 error as well.

req <- POST("https://inventory.data.gov/api/action/datastore_search_sql?api_key=W7KRCyAf29uIuiklIvJaSgJliE6GrRpR01ObMUh6&sql=SELECT * from 'b4e7f148-3afc-4782-84a3-dfaca3156e8b' WHERE 'leanm11' LIKE 'jones'")

Suspicious, I used the SQL example:

https://inventory.data.gov/api/action/datastore_search_sql?sql=SELECT * from “89fec729-9ab9-43d5-8dcd-e65dfab2a17c” WHERE title LIKE ‘jones’

mentioned in the API page, which also resulted in a 409 error. It was at that point that I decided that the API was not reliable enough for me to work with, even if it (likely) included some mistakes on my end.

Alternate Math Performance Data Source

After searching for another source of data that had Math performance listed by state, I found a promising set at the National Center for Education Statistics (NCES), a sub organization of the US Department of Education. Each state had a summary performance page for a variety of subjects and grade levels. Here is New York’s page.

Fortunately, this page included 8th Grade Math peformance by state, and the data was from 2011 (thus explaining why I chose 2011 financial data above). Of all the subjects, Math was the most recently collected data, which was a little curious since 2011 was almost 5 years ago. C’est la vie (when it comes to beauracracies, anyway).

The other attractive thing about these web pages was that the URLs followed a recognizable pattern for each state: two digits, from 01 to 55 (some of the intermittent two digits did not return a state, for some reason, but that did not cause a problem). What this allowed me to do was build up a character vector of “01” to “55”, which I would go on to use to construct URLs for each state’s page by pasting to a “base” url.

state_ids <- as.character(rep(1:55))
state_ids <- unlist(lapply(state_ids, function(x) if(nchar(x) < 2) paste0("0", x) else x ))

url <- 'https://nces.ed.gov/programs/stateprofiles/sresult.asp?mode=full&displaycat=7&s1='

I started with the first state, Alabama, and worked to figure out a repeatable pattern for downloading the data from each URL. With the pattern determined, I could write functions and simply apply them to each URL.

My first function takes a two digit character vector and creates a URL and uses the getURL function to issue a HTTP get request to the server. With the request, it converts all the HTML tables to data frames using the readHTMLTable function. Finally, it pulls the germane data frame out and returns it.

per_data_dl <- function(state_id) {
  state_uri <- getURL(paste0(url, state_id))
  state_table <- readHTMLTable(state_uri, stringsAsFactors = FALSE)
  state_df <- state_table[[9]]
  return(state_df)
}

This part jumps ahead to tranformation a little bit, but I thought it made more sense to include in immediate succesion to the above work

For each state’s dataframe (the object returned in the above function), I would need to clean up the data a little bit so I could get it into a coherent data frame. The state was identified as one of the column names in the data frame, so I was able to easily obtain that. Next, I looked for all of the relevant math data, and stored each in a variable. Finally, I took all the created variables and created a one observation data frame for each state.

data_extract <- function(df) {
  state_name <- colnames(df)[3]
  math_scale <- as.integer(df[6,3])
  math_basic <- as.integer(df[7,3])
  math_pro <- as.integer(df[8,3])
  math_adv <- as.integer(df[9,3])
  return(data.frame(state_name, math_scale, math_basic, math_pro, math_adv))
}

With the above two functions, I could call them on my list of state numbers created earlier.

df_list <- lapply(state_ids, per_data_dl)
extracted <- lapply(df_list, data_extract)

Finally, since each data frame was created with the same function, it was easy to combine them into one big data frame with the performance data. The V3’s were the mysterious non specific states, which didn’t provide any problems.

math_data <- ldply(extracted, data.frame)
head(math_data)
##   state_name math_scale math_basic math_pro math_adv
## 1    Alabama        269         58       20        4
## 2     Alaska        283         75       33        6
## 3         V3         NA         NA       NA       NA
## 4    Arizona        277         67       29        6
## 5   Arkansas        276         67       27        4
## 6 California        270         59       23        5

Transformations

Now that I had everything (or almost everything, as will become apparent) in R, I was ready to start getting my data in a workable format.

The first task was to summarize the finance data. Each observation in the finance data set was a district, but I was comparing states. So, I needed to sum up each variable that I cared about by state. Using the group_by and summarise functions from dplyr, this was painless. I decided to just look at the total revenue and total expenditures to keep things simple. I wasn’t able to use the provided ratios, since they were by district. This would mean I would be taking the average of an average and not weighting appropriately. Instead, I summed the total enrollment figure for each each district, giving me the total enrollment for each state.

finance_sum <- group_by(fed_data, STATE)
summed_fin <- summarise(finance_sum, sum_rev = sum(TOTALREV), sum_exp = sum(TOTALEXP), total_enroll = sum(V33))

However at this point it became apparent that the State variable did not contain state names, but instead state codes from the data definition file. Using the file, I created .csv file and read it in to R.

state_key <- read.csv('https://raw.githubusercontent.com/cjf4/IS-607-Data-Acquisition-and-Management/master/Final%20Project/state_key.csv', stringsAsFactors = FALSE)

This data frame served as a key of sorts, and allowed me to attach the name to each summed state observation in the finance data frame. I used the merge function for this task.

school_finance_per <- merge(summed_fin, state_key, by.x = "STATE", by.y = "state_id")

Again, using the merge function, I was finally able to combine by financial data with my performance data.

school_finance_per <- merge(school_finance_per, math_data, by = "state_name")

Finally, since the absolute revenue/spending difference between states of different populations would be difficult to compare, I created revenue and expenditure per pupil variables.

school_finance_per$exp_per_pup <- school_finance_per$sum_exp / school_finance_per$total_enroll
school_finance_per$rev_per_pup <- school_finance_per$sum_rev / school_finance_per$total_enroll
head(school_finance_per)
##   state_name STATE  sum_rev  sum_exp total_enroll math_scale math_basic
## 1    Alabama     1  7376725  7411768       746933        269         58
## 2     Alaska     2  2357828  2443898       131704        283         75
## 3    Arizona     3  8429186  8610417       946571        277         67
## 4   Arkansas     4  5176362  5669431       475551        276         67
## 5 California     5 70953331 69847705      6202604        270         59
## 6   Colorado     6  8747471  8676451       834832        287         76
##   math_pro math_adv exp_per_pup rev_per_pup
## 1       20        4    9.922936     9.87602
## 2       33        6   18.555989    17.90248
## 3       29        6    9.096430     8.90497
## 4       27        4   11.921815    10.88498
## 5       23        5   11.261029    11.43928
## 6       40       10   10.393050    10.47812

Analysis

I’ll start by taking a look at some easy to calculate statistics. First up, do schools on aggregate stay on budget?

str(subset(school_finance_per, exp_per_pup > rev_per_pup))
## 'data.frame':    28 obs. of  11 variables:
##  $ state_name  : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ STATE       : int  1 2 3 4 8 9 10 17 18 19 ...
##  $ sum_rev     : int  7376725 2357828 8429186 5176362 1759170 1283071 26453693 5538869 7104589 8022917 ...
##  $ sum_exp     : int  7411768 2443898 8610417 5669431 1808905 1297175 26991946 5662146 7253850 8218870 ...
##  $ total_enroll: int  746933 131704 946571 475551 119878 44199 2636404 482692 672958 665109 ...
##  $ math_scale  : int  269 283 277 276 284 254 279 289 279 272 ...
##  $ math_basic  : int  58 75 67 67 75 40 70 79 70 62 ...
##  $ math_pro    : int  20 33 29 27 32 11 29 39 27 20 ...
##  $ math_adv    : int  4 6 6 4 6 2 6 8 5 4 ...
##  $ exp_per_pup : num  9.92 18.56 9.1 11.92 15.09 ...
##  $ rev_per_pup : num  9.88 17.9 8.9 10.88 14.67 ...

So we can see 29 of the 51 states (plus DC) ran deficits in aggreagate in 2011.

Another quick qestion would be which stats spend the most per pupil on education?

head(top_spenders <- school_finance_per[order(school_finance_per$exp_per_pup, decreasing = TRUE),])
##              state_name STATE  sum_rev  sum_exp total_enroll math_scale
## 9  District of Columbia     9  1283071  1297175        44199        254
## 33             New York    33 57753776 59446908      2677412        283
## 46              Vermont    46  1686499  1722515        86982        293
## 2                Alaska     2  2357828  2443898       131704        283
## 31           New Jersey    31 26242376 25716298      1399409        293
## 7           Connecticut     7  9783168  9439182       530132        289
##    math_basic math_pro math_adv exp_per_pup rev_per_pup
## 9          40       11        2    29.34851    29.02941
## 33         73       34        8    22.20312    21.57075
## 46         81       43       13    19.80312    19.38906
## 2          75       33        6    18.55599    17.90248
## 31         80       44       14    18.37654    18.75247
## 7          78       40       10    17.80534    18.45421

And the least?

tail(top_spenders <- school_finance_per[order(school_finance_per$exp_per_pup, decreasing = TRUE),])
##     state_name STATE sum_rev sum_exp total_enroll math_scale math_basic
## 3      Arizona     3 8429186 8610417       946571        277         67
## 43   Tennessee    43 8679893 8961443       986382        275         65
## 25 Mississippi    25 4516742 4345783       490526        265         54
## 37    Oklahoma    37 5835133 5644988       658163        276         68
## 45        Utah    45 4141253 4195916       545395        284         75
## 13       Idaho    13 2089982 1960818       265713        287         78
##    math_pro math_adv exp_per_pup rev_per_pup
## 3        29        6    9.096430    8.904970
## 43       25        4    9.085165    8.799728
## 25       15        2    8.859435    9.207956
## 37       24        3    8.576884    8.865787
## 45       35        7    7.693353    7.593126
## 13       38        8    7.379458    7.865562

Now it’s time to look at the relationship between spending and math performance. We will look at expenditure per pupil against basic math proficiency, which is defined as “implies partial mastery of prerequisite knowledge and skills that are fundamental for proficient work at each grade.”

plot(school_finance_per$exp_per_pup, school_finance_per$math_basic, xlab = "Expenditure per pupil", ylab = "Basic Math Profiency Percentage")

We can see a clear linear trend of correlation between expenditure and Basic Math proficiency. However, of interest is the outlier on this graph, that spends more per pupil than any other state/district, yet achieves the worst performance, both by wide margins. From the above listing of highest spending states, we know this is Washington DC.

Since there appears to be a linear relationship, I will try to fit a regression line using the two sets of data.

basic_math.lm <- lm(math_basic ~ exp_per_pup, school_finance_per)
summary(basic_math.lm)
## 
## Call:
## lm(formula = math_basic ~ exp_per_pup, data = school_finance_per)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.055  -5.353   2.765   5.778  13.753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  74.1488     4.3809  16.926   <2e-16 ***
## exp_per_pup  -0.1395     0.3234  -0.431    0.668    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.959 on 48 degrees of freedom
## Multiple R-squared:  0.003861,   Adjusted R-squared:  -0.01689 
## F-statistic: 0.186 on 1 and 48 DF,  p-value: 0.6682

Interesting. The regression line actually predicts a negative relationship between the two variables. Washington DC must be throwing off the model.

I’ll try again, only this time excluding Washington DC.

schools_no_DC <- subset(school_finance_per, math_basic > 50)
basic_math.lm <- lm(math_basic ~ exp_per_pup, schools_no_DC)
summary(basic_math.lm)
## 
## Call:
## lm(formula = math_basic ~ exp_per_pup, data = schools_no_DC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6852  -4.8835   0.9517   5.4695  12.1225 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.9043     4.2065   14.72  < 2e-16 ***
## exp_per_pup   0.8783     0.3229    2.72  0.00913 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.175 on 47 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.1176 
## F-statistic: 7.397 on 1 and 47 DF,  p-value: 0.009127

That gives us a positive relationship. For every additional 1,000 spent per pupil, we would predict a state’s percentage of students who achieve basic profiecieny to increase by .87. However, the R-squared on this model is quite low, and this is not establishing a causal relationship, so it would be imprudent to advise states to start pouring money into their education budgets (on this analysis alone, at least).

The curious case of Washington DC illustrates a need for futher analysis. No other state spent so much, yet get so little for their money. This is likely related to the fact that Washington DC is exclusively an urban district. Other inner city districts that may be otherwise comparable to Washington would have their results mixed in with rural and suburban districts. If nothing else, this illustrates the need to study inner city districts separately from the rest of the country.