library(tidyverse)
library(educationdata)
library(skimr)
library(GGally)
library(readxl)
library(stringr)
library(janitor)
library(tigris)
library(jsonlite)607 Final Project Report
Abstract
This project explores publicly available data on Massachusetts public school districts to identify patterns and relationships between district characteristics, expenditures, and course offerings. Motivated by personal inquiry into factors influencing school quality and potential relocation decisions, this analysis focuses on comparing metrics beyond standardized test scores, such as the prevalence of arts and computer science courses and class sizes.
Using data from the Urban Institute’s Education Data Explorer (educationdata R package) and supplemental data from the Massachusetts Department of Education and US Census data, this exploratory analysis merges 2022 data for Pre-K to 12th grade non-charter school districts. The analysis employs various data acquisition and cleaning techniques, including processing strings and text, creating a function to perform a repetitive operation on multiple dataframes, dataset joins, tidying, and exploratory visualizations. Metrics such as per-pupil expenditures and percentages enrolled in various course types provide insights into how districts vary in their offerings and resources.
The COVID-19 pandemic introduced substantial gaps in assessment-related data, and as such, this analysis avoids test scores and instead examines district factors related to course availability and resource allocation. While no definitive conclusions are drawn about what makes a “good” district, the findings highlight significant variability in offerings and raise questions about the impact of resources, especially expenditures and median income levels, on educational opportunities. This exploration emphasizes the importance of transparency and contextual data in informing educational and personal decisions.
Introduction
The motivation for this project stems from a personal inquiry: my husband and I plan to move back to Massachusetts and eventually have children. As we consider towns for relocation, we are curious about how public school districts compare across a range of metrics. Popular real estate platforms like Zillow and Redfin provide unreliable and biased information, such as “school ratings” based on self-reported parent surveys or opaque algorithms. These data points lack transparency, fail to account for district-wide trends, and cannot answer broader questions like, What are some features that make a school district a good option for us?
This project builds upon findings from my DATA-606 coursework, where I examined a simple linear regression between instructional expenditures per student and midpoint math proficiency scores using 2020 data from the Urban Institute’s educationdata package. While the relationship was statistically significant, the model’s low R-squared value revealed limited explanatory power, emphasizing that statistical relationships do not always translate into practical significance. Inspired by these limitations, I aim to explore additional variables and broader district-level metrics.
For this analysis, I merge data from the Urban Institute’s Education Data Portal (2022 directory) and the Massachusetts Department of Education, in addition to data from the US Census for 2022. The dataset includes Pre-K to 12th grade non-charter school districts, as defined by the Common Core of Data (CCD). Although the pandemic disrupted the collection of assessment-related data, this project shifts the focus to variables like course enrollment in arts and computer science, and class sizes. By examining these factors, I aim to uncover patterns in district offerings and resource allocation rather than predict outcomes or define “best” districts.
The analysis workflow involves a variety of data science techniques, including: - Data acquisition using the educationdata package, Massachusetts Department of Education downloaded data, and US Census data. - Data transformations, such as creating calculated fields and creating functions to format csv imports. - Processing strings and text to merge data sources. - Visualizations to highlight observations and categorical trends.
Although exploratory, this project illustrates the variability in resources and offerings across districts and demonstrates the potential of publicly available data to inform nuanced, data-driven inquiries. However, it also underscores the challenges of interpreting incomplete, aggregated, and anonymized data, especially in the realm of education-related data, and the importance of context when drawing conclusions. While there does appear to be a relationship between expenditures and median income levels on “outcomes” such as percentage of students passing grade 9, it is unclear and outside the scope of this analysis to determine the cause/effect of such findings, in particular for a finite year at the tail-end of a global pandemic which so significantly impacted education and housing, among many other factors.
Data Sources
Sources overview
Urban Institute Education Data Portal R package eduationdata
Relevant links regarding this data source: - Education Data Explorer - API FAQs - R package overview on GitHub - Data dictionary is available here
Massachusetts Department of Education
Relevant links regarding these data sources: - Expenditures Report - Digital Literacy Coursetaking Report - Arts Coursetaking Report - Grade 9 Passing Report - Teacher Data Report
Note on Mass. DoE Data
As noted previously, a major source of data challenge is the COVID-19 global pandemic and its impacts on the collection of various data points. For example, in reviewing the “Accountability Report” from the Massachusetts Department of Education (intended to classify schools and districts based on performance and improvement targets), I found that for the years 2019–2022 in the data source), nearly 100% of the rows lacked accountability determinations. These omissions stemmed from the pandemic’s disruption to state assessments and school operations.
Given these limitations, this analysis focuses on exploring non-assessment-related metrics such as course offerings and expenditures. While 2022 data from the Massachusetts Department of Education is less comprehensive due to these disruptions, it still provides meaningful insights into resource allocation and opportunities at the district level. This project underscores the need for transparent, reliable, and consistently collected data to inform decisions at both the personal and policy levels.
Census API Data
U.S. Census Bureau. “Median Household Income in the Past 12 Months (in 2022 Inflation-Adjusted Dollars).” American Community Survey, ACS 1-Year Estimates Detailed Tables, Table B19013, 2022, https://data.census.gov/table/ACSDT1Y2022.B19013?q=B19013&g=040XX00US25$9700000&y=2022. Accessed on December 8, 2024.
Load Libraries
Load and Prepare Data: Clean and Transform
Urban Institute Directory
For the data from the educationdata R package, I will load data from the Urban Institute Education Data Portal using their get_education_data() function for the year 2022 and the state of Massachusetts, which has the FIPS code 25. Data from the Urban Institute Education Data Portal is available via a package in R called educationdata, which I have installed, which essentially calls their API to get data.
directory2022 <- get_education_data(level = 'school-districts',
source = 'ccd',
topic = 'directory',
filters = list(year = 2022,
fips = '25'),
add_labels = TRUE
)I will now look at general summary overview of this dataframe using skim to ensure that we know what we’re looking for. For now we want to look at the ‘general information’ from the directory, like the latitude and longitude. Eventually we want to get to a “foundation” dataframe which includes ‘general’ values from this and other sources – we will eventually want to merge them using the state_leaid or leaid column, which based on the data dictionary means the “Local Education Agency Identification number”.
skim(directory2022)directory2022 |>
group_by(lowest_grade_offered, highest_grade_offered) |>
summarise(n = n()) |>
arrange(desc(by = n))`summarise()` has grouped output by 'lowest_grade_offered'. You can override
using the `.groups` argument.
# A tibble: 20 × 3
# Groups: lowest_grade_offered [11]
lowest_grade_offered highest_grade_offered n
<fct> <fct> <int>
1 Pre-K 12 213
2 9 12 44
3 Pre-K 6 38
4 <NA> <NA> 29
5 Pre-K 8 22
6 Kindergarten 12 15
7 6 12 13
8 7 12 13
9 Kindergarten 8 11
10 5 12 7
11 Pre-K 5 6
12 Kindergarten 6 4
13 6 8 4
14 Kindergarten 5 3
15 4 8 1
16 5 9 1
17 6 10 1
18 11 12 1
19 Ungraded Ungraded 1
20 Not applicable Not applicable 1
In order to compare measures that we’ll find in other datasets, we want to make sure that what we’re comparing schools that offer equivalent grade spans. We see that the most common grades offered is a lowest_grades_offered of Pre-K and a highest_grade_offered of 12. Let’s limit our dataframe to that subset of school districts that offers opportunities from Pre-K to 12th grade.
directory2022 <- directory2022 |>
filter(lowest_grade_offered == "Pre-K" & highest_grade_offered == "12")Now, let’s look at the agency_types included in the filtered dataframe. Based on the Urban Institute’s data dictionary we know that this differentiates Charter schools and special school districts from what’s considered a ‘normal’ public school district.
directory2022 |>
group_by(agency_type) |>
summarise(n = n()) # A tibble: 2 × 2
agency_type n
<fct> <int>
1 Regular local school district 209
2 Charter agency 4
We want to ensure that we’re comparing school districts of similar type, so we’ll filter out Charter agencies.
directory2022 <- directory2022 |>
filter(agency_type == "Regular local school district")directory2022There are some redundant and unnecessary variables in this dataframe, including the “year” (which will always be 2022, since that’s what we selected for), state (fips) name, and other values.
We are going to create a subset of directory2022 to include just the columns we think we’ll need going forward. We’ll rename it directory so that if we ever want to get some of the other variables we will be able to find and join them more easily.
We are going to select for public school districts that offer pre-K to 12th grade that are non-charter schools. We’ll want to get information about their name, district codes, location via latitude and longitude values, and some other ‘general’ fields to act as the base in the directory. The reduction based on offering pre-K to 12th grade plus non-Charter schools reduced us down from nearly 400 districts to just over 200 (209 to be exact.)
directory <- directory2022 |>
select(leaid, lea_name, state_leaid, city_location, latitude, longitude,
urban_centric_locale, number_of_schools)Looking at types of districts, it’s interesting that many are classified as large suburbs. Perhaps we can do some more interesting EDA down the line if we split urban_centric_locale into “urban” and “locale” so that the ‘types’ (suburb, city, town, rural) are distinguished from the locale (large, midsize, small, fringe, distant).
directory |>
group_by(urban_centric_locale) |>
summarise(n = n()) |>
arrange(urban_centric_locale)# A tibble: 9 × 2
urban_centric_locale n
<fct> <int>
1 City large 1
2 City midsize 3
3 City small 5
4 Suburb large 147
5 Suburb midsize 16
6 Town fringe 4
7 Town distant 3
8 Rural fringe 26
9 Rural distant 4
directory <- directory |>
separate(urban_centric_locale, into = c("urban", "locale"), sep = " ")directory |>
group_by(urban) |>
summarise(n = n()) # A tibble: 4 × 2
urban n
<chr> <int>
1 City 9
2 Rural 30
3 Suburb 163
4 Town 7
directory |>
group_by(locale) |>
summarise(n = n())# A tibble: 5 × 2
locale n
<chr> <int>
1 distant 7
2 fringe 30
3 large 148
4 midsize 19
5 small 5
While it’s not a complete break from the dominance of “large suburb” this does allow us to see patterns in a different way.
ggplot(directory, aes(x = urban, fill = locale)) +
geom_bar() +
labs(
title = "Distribution of Urban and Locale Categories",
x = "Urban Category",
y = "Count",
fill = "Locale"
)Mass. DOE:
All datasets have been downloaded for the year 2022 from this main page: Statewide Reports
The data output is pretty consistent: they all have a merged “header” row that has the Report year. That’s why for the following code chunks I will be skipping row 1, which enables me to see the column names. When I downloaded each data file, I put it into a folder specifying the year, “2022”.
Expenditures and Pupil Enrollment Data
Description: - Per Pupil expenditures - By District, 2022. Data dictionary for variables Source Link: - Per Pupil expenditures
This also includes the number of enrolled pupils.
expenditures2022 <- read_xlsx("Data/2022/PerPupilExpenditures.xlsx", skip = 1)New names:
• `` -> `...9`
head(expenditures2022)# A tibble: 6 × 9
`District Name` `District Code` In-District Expendit…¹ Total In-district FT…²
<chr> <chr> <chr> <chr>
1 Abby Kelley Fos… 04450000 $22,455,445.00 1,423.0
2 Abington 00010000 $33,667,339.55 2,151.3
3 Academy Of the … 04120000 $11,731,129.95 504.3
4 Acton-Boxborough 06000000 $94,671,026.58 5,242.6
5 Acushnet 00030000 $14,347,713.00 942.8
6 Advanced Math a… 04300000 $14,353,114.97 965.3
# ℹ abbreviated names: ¹`In-District Expenditures`, ²`Total In-district FTEs`
# ℹ 5 more variables: `In-District Expenditures per Pupil` <chr>,
# `Total Expenditures` <chr>, `Total Pupil FTEs` <chr>,
# `Total Expenditures per Pupil` <chr>, ...9 <lgl>
It looks like this brought in a full blank 9th column which was named ...9. We can see it’s entirely blank, so we’ll drop it when we also rename the columns to snake_case using the clean_names() function from the janitor package.
expenditures2022 <- expenditures2022 |>
clean_names() |>
select(1:8)We see a pattern that will recur for all the dataframes we’ve downloaded from the Massachusetts Department of Education: the district_code consistently has an extra four 0’s at the end of it. This is presumably to enable school-specific IDs to have the first four characters match, and unique variation using the last four characters. Since we are analyzing at the district level only across all our dataframes, we’re going to use str_sub() to extract just the first four characters for this district_code and all the Mass. Dept. of Education dataframes that follow. This will enable easier joining later on.
So, we will select for only the first 4 characters in the district_code:
expenditures2022 <- expenditures2022 |>
mutate(district_code = str_sub(district_code, 1, 4))head(expenditures2022)# A tibble: 6 × 8
district_name district_code in_district_expendit…¹ total_in_district_ft…²
<chr> <chr> <chr> <chr>
1 Abby Kelley Foste… 0445 $22,455,445.00 1,423.0
2 Abington 0001 $33,667,339.55 2,151.3
3 Academy Of the Pa… 0412 $11,731,129.95 504.3
4 Acton-Boxborough 0600 $94,671,026.58 5,242.6
5 Acushnet 0003 $14,347,713.00 942.8
6 Advanced Math and… 0430 $14,353,114.97 965.3
# ℹ abbreviated names: ¹in_district_expenditures, ²total_in_district_ft_es
# ℹ 4 more variables: in_district_expenditures_per_pupil <chr>,
# total_expenditures <chr>, total_pupil_ft_es <chr>,
# total_expenditures_per_pupil <chr>
We can see a few errors in the way that the clean_names() modified the acronym in some of these columns, “FTEs” which means “Full Time Equivalent” pupils. We want the reference to “FTEs” to read ftes rather than ft_es.
expenditures2022 <- expenditures2022 |>
rename(total_in_district_ftes = total_in_district_ft_es,
total_pupil_ftes = total_pupil_ft_es)For reference, from the data dictionary linked in this section:
Per pupil expenditures are calculated by dividing a district’s operating expenditures by its average pupil membership, including in-district expenditures per pupil and total expenditures per pupil. The following data elements are included in these calculations:
In-District Full-Time Equivalents (FTE) Average Membership: Average enrollment across the school year, for pupils enrolled at local schools.
In-District Expenditures: All of a district’s operating expenditures for in-district programs. All expenditures and all funding sources are included, except community services (6000 series), fixed assets (7000 series), debt service (8000 series), and out-of-district expenditures (9000 series).
In-District Expenditures Per Pupil: Total in-district expenditures divided by in-district FTE average membership.
Total FTE Average Membership: Average enrollment across the school year for students enrolled at local schools and those publicly-funded students enrolled at other districts, including charter schools, special education collaboratives, and private special education schools.
Total Expenditures: All district operating expenditures for in-district programs and out-of-district placements. All expenditures and funding sources are included, except community services (6000 series), fixed assets (7000 series), and debt service (8000 series).
Total Expenditures Per Pupil: Total expenditures divided by total FTE average membership.
With this in mind, we’ll want to only (for this and others) look at “in district” variables, rather than the “total” variables. This is a choice we are deliberately making for our analysis: they sometimes represent the same data (total_in_district_ftes and total_pupil_ftes) but sometimes they do not. We also have noticed that sometimes the ‘total pupil’ number differs from the ‘total student’ number in other dataframes we aim to merge with. Knowing these are slightly different variables, we will seek again to look just at the ‘total’ values and also just the total percentages for each figure of interest later on, such as the percent of students enrolled in an art course. The difference between ‘total FTEs’ and ‘total pupils’ or ‘total students’ does not often vary, but sometimes does - and it is for this reason that we are choosing to move forward with our analysis out of interest rather than seeking total precision in the subject matter. (If it isn’t already clear, I am not a subject matter expert in education data - rather, I am a very curious person who seeks validity and value in data I am exploring, and hope to clarify all uncertainties with interested parties - to clarify nuances or inconsistencies that are evident in the data.)
The reasoning for this and elsewhere is that we care more about the “in district” spending for this specific analysis. The “Total”, when different, indicates spending on “out-of-distrct placements” in addition to in-district programs. This means we would potentially be mixing district-level data, so we’re sticking with in-district measures.
We’ll now create a new dataframe called simply expenditures that subsets the expenditures2022 dataframe to include just the “in-district” data. We no longer need the district_name either because we have the district_code.
expenditures <- expenditures2022 |>
select(district_code, in_district_expenditures, total_in_district_ftes,
in_district_expenditures_per_pupil)We also want to make some columns that are ‘characters’ into ‘numeric’ so we can more feasibly perform EDA later on.
We notice that converting the large currency values with as_numeric() rounded the numbers to the nearest whole number, which in some cases might be an issue, but we are OK with it here, knowing that’s what happens with the transformation – and knowing that we are mostly interested in the ‘per pupil’ figure, which isn’t rounded as a result of the below transformation.
expenditures <- expenditures |>
mutate(
total_in_district_ftes = as.numeric(
str_replace_all(total_in_district_ftes, ",", "")),
in_district_expenditures =
as.numeric(str_replace_all(in_district_expenditures, "[\\$,]", "")),
in_district_expenditures_per_pupil =
as.numeric(str_replace_all(
in_district_expenditures_per_pupil, "[\\$,]", ""))
)head(expenditures)# A tibble: 6 × 4
district_code in_district_expenditures total_in_district_ftes
<chr> <dbl> <dbl>
1 0445 22455445 1423
2 0001 33667340. 2151.
3 0412 11731130. 504.
4 0600 94671027. 5243.
5 0003 14347713 943.
6 0430 14353115. 965.
# ℹ 1 more variable: in_district_expenditures_per_pupil <dbl>
Great! This is exactly what we were looking for.
Digital Literacy & Computer Science Coursetaking
Description: - Count of Students Taking At Least 1 Digital Literacy and/or Computer Science Course - By District, 2022 Source Link: - Digital Literacy and Computer Science Coursetaking Report
digital2022 <- read_xlsx("Data/2022/dlcs_course_percent.xlsx", skip = 1)head(digital2022)# A tibble: 6 × 17
`District Name` `District Code` K `01` `02` `03` `04` `05` `06`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Abby Kelley Foster … 04450000 100.0 100.0 100.0 100.0 100.0 100.0 100.0
2 Abington 00010000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Academy Of the Paci… 04120000 <NA> <NA> <NA> <NA> <NA> 0.0 0.0
4 Acton-Boxborough 06000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 Acushnet 00030000 100.0 100.0 100.0 100.0 100.0 100.0 100.0
6 Advanced Math and S… 04300000 <NA> <NA> <NA> <NA> <NA> <NA> 100.0
# ℹ 8 more variables: `07` <chr>, `08` <chr>, `09` <chr>, `10` <chr>,
# `11` <chr>, `12` <chr>, `All Grades` <chr>, `Total Students` <chr>
Since we know we’ll want to perform the same clean_names() and mutate of district code to include the first 4 characters, we can create a function to do so:
process_dataframe <- function(df) {
df |>
clean_names() |>
mutate(district_code = str_sub(district_code, 1, 4))
}digital2022 <- process_dataframe(digital2022) We’ll also just select for the ‘total’ percentage of students who are taking a digital literacy or computer science course.
Like with the expenditures smaller dataframe, we’ll create a digital dataframe that’s created out of the digital2022 dataframe, and just select for the all_grades percentage – and we’ll rename the variable so it’s more clear what it represents.
digital <- digital2022 |>
select(district_code, all_grades) |>
rename(all_grades_percent_digital = all_grades) |>
mutate(all_grades_percent_digital = as.numeric(all_grades_percent_digital))head(digital)# A tibble: 6 × 2
district_code all_grades_percent_digital
<chr> <dbl>
1 0445 76.4
2 0001 22.4
3 0412 5.4
4 0600 21.6
5 0003 100
6 0430 89.5
Arts Coursetaking Report
Description: - Percent and Number of students enrolled in an arts course, by District, 2022 Link: - https://profiles.doe.mass.edu/statereport/artcourse.aspx
arts2022 <- read_xlsx("Data/2022/artcourse_percent.xlsx", skip = 1)arts2022 <- process_dataframe(arts2022)arts <- arts2022 |>
select(district_code, all_grades) |>
rename(all_grades_percent_arts = all_grades) |>
mutate(all_grades_percent_arts = as.numeric(all_grades_percent_arts))head(arts)# A tibble: 6 × 2
district_code all_grades_percent_arts
<chr> <dbl>
1 0445 95.6
2 0001 85.8
3 0412 89.9
4 0600 81
5 0003 69.4
6 0430 81.4
Grade 9 Passing Data
Source Link: - https://profiles.doe.mass.edu/statereport/gradeninecoursepass.aspx Description: - Grade Nine Course Passing Report (District) - All Students - All Subjects
grade9pass2022 <- read_xlsx("Data/2022/gradeninecoursepasss.xlsx", skip = 1)grade9pass2022 <- process_dataframe(grade9pass2022)head(grade9pass2022)# A tibble: 6 × 5
district_name district_code number_grade_nine_st…¹ number_passing_all_c…²
<chr> <chr> <chr> <chr>
1 Abby Kelley Foste… 0445 89 72
2 Abington 0001 140 107
3 Academy Of the Pa… 0412 64 59
4 Acton-Boxborough 0600 418 408
5 Advanced Math and… 0430 137 129
6 Agawam 0005 314 240
# ℹ abbreviated names: ¹number_grade_nine_students, ²number_passing_all_courses
# ℹ 1 more variable: percent_passing_all_courses <chr>
grade9pass <- grade9pass2022 |>
select(district_code, percent_passing_all_courses) |>
mutate(percent_passing_all_courses = as.numeric(percent_passing_all_courses))head(grade9pass)# A tibble: 6 × 2
district_code percent_passing_all_courses
<chr> <dbl>
1 0445 80.9
2 0001 76.4
3 0412 92.2
4 0600 97.6
5 0430 94.2
6 0005 76.4
Teacher Data
Source Link: - https://profiles.doe.mass.edu/statereport/teacherdata.aspx Description: - Includes information including student-teacher ratio (number of students to 1 teacher) overall. Additionally includes a percentage of experienced teachers, which is defined as “Percent of Experienced Teachers: Percent of teachers who have been teaching in a Massachusetts public school for at least three years.”
teachers2022 <- read_xlsx("Data/2022/teacherdata.xlsx", skip = 1)teachers2022 <- process_dataframe(teachers2022)head(teachers2022)# A tibble: 6 × 8
district_name district_code total_number_of_teac…¹ percent_of_teachers_…²
<chr> <chr> <chr> <chr>
1 Abby Kelley Foste… 0445 119.6 99.2
2 Abington 0001 153.8 97.1
3 Academy Of the Pa… 0412 49.6 65.6
4 Acton-Boxborough 0600 395.4 100.0
5 Acushnet 0003 76.0 100.0
6 Advanced Math and… 0430 80.5 92.3
# ℹ abbreviated names: ¹total_number_of_teachers_fte,
# ²percent_of_teachers_licensed
# ℹ 4 more variables: student_teacher_ratio <chr>,
# percent_of_experienced_teachers <chr>,
# percent_of_teachers_without_waiver_or_provisional_license <chr>,
# percent_teaching_in_field <chr>
We can see that the variable student_teacher_ratio is a character that reads with values like “11.9 to 1”, “13.9 to 1”, etc. We want to extract the part before the “to” and convert to numeric:
teachers2022 <- teachers2022 |>
mutate(
student_teacher_ratio_clean =
str_extract(student_teacher_ratio, "^(.*) to"),
student_teacher_ratio_clean =
str_remove(student_teacher_ratio_clean, " to"),
student_teacher_ratio_clean =
as.numeric(student_teacher_ratio_clean)
)We also want to convert another <ch>r variable to <dbl>, and make a smaller dataframe comprised of variables of interest:
teachers <- teachers2022 |>
mutate(
percent_of_experienced_teachers =
as.numeric(percent_of_experienced_teachers),
percent_of_teachers_licensed =
as.numeric(percent_of_teachers_licensed)
) |>
select(district_code, percent_of_experienced_teachers,
percent_of_teachers_licensed, student_teacher_ratio_clean) |>
rename(student_teacher_ratio = student_teacher_ratio_clean)head(teachers)# A tibble: 6 × 4
district_code percent_of_experienced_teachers percent_of_teachers_licensed
<chr> <dbl> <dbl>
1 0445 81.5 99.2
2 0001 81.3 97.1
3 0412 37.3 65.6
4 0600 88.3 100
5 0003 78.9 100
6 0430 80.8 92.3
# ℹ 1 more variable: student_teacher_ratio <dbl>
Census Data
This does not come from the Urban Institute nor the Massachusetts Department of Education.
Downloaded for the year 2022, Massachusetts school districts median household income report (table B19013). https://data.census.gov/table/ACSDT5Y2022.B19013?q=B19013&g=040XX00US25$9700000&y=2022&moe=false&tp=true
To execute, I added my census key where this says “MY_API_KEY”: https://api.census.gov/data/2022/acs/acs5?get=NAME,B19013_001E&for=school%20district%20(unified):*&in=state:25&key=MY_API_KEY and downloaded it as a json.
I saved the output as a json which I’ll read into Rstudio. The column named “B19013_001E” represents the median hosuehold income per school district. I’ll be able to join this with the directory data using part of its leaid with the “school district (unified)”
censusdata2022 <- fromJSON("Data/2022/acs5.json")str(censusdata2022) chr [1:213, 1:4] "NAME" "Quabbin School District, Massachusetts" ...
censusdata <- as.data.frame(censusdata2022)head(censusdata) V1
1 NAME
2 Quabbin School District, Massachusetts
3 Spencer-East Brookfield School District, Massachusetts
4 Southwick-Tolland-Granville Regional School District, Massachusetts
5 Manchester Essex Regional School District, Massachusetts
6 Ayer-Shirley School District, Massachusetts
V2 V3 V4
1 B19013_001E state school district (unified)
2 97094 25 00001
3 76636 25 00002
4 105987 25 00013
5 183854 25 00067
6 103189 25 00542
We can see that the first row represents the column names, so we’ll fix this by assigning the first row as column names and then removing the first row.
colnames(censusdata) <- censusdata[1, ]
censusdata <- censusdata[-1, ]head(censusdata) NAME
2 Quabbin School District, Massachusetts
3 Spencer-East Brookfield School District, Massachusetts
4 Southwick-Tolland-Granville Regional School District, Massachusetts
5 Manchester Essex Regional School District, Massachusetts
6 Ayer-Shirley School District, Massachusetts
7 Monomoy Regional School District, Massachusetts
B19013_001E state school district (unified)
2 97094 25 00001
3 76636 25 00002
4 105987 25 00013
5 183854 25 00067
6 103189 25 00542
7 83125 25 00544
Great! That worked. Now let’s rename the column names and modify the types. We’ll name the school district variable to “leaid” which matches what we’ll need to match on in the directory dataframe.
censusdata <- censusdata |>
rename(
median_income = B19013_001E,
leaid = `school district (unified)`,
district_name = NAME
)Now we’ll just select the columns we need - we don’t need the code 25, since we know all values are from Massachusetts. And similar to the datasets from the Massachusetts Department of Education, we don’t need the district name string values to allow us to join data together.
censusdata <- censusdata |>
select(leaid, median_income)Now let’s make the median_income variable into a numeric, since it represents each district’s median income (a numeric value).
censusdata$median_income <- as.numeric(censusdata$median_income)head(censusdata) leaid median_income
2 00001 97094
3 00002 76636
4 00013 105987
5 00067 183854
6 00542 103189
7 00544 83125
Great! Now it looks exactly as we’d like it to prior to merging with other datasets. When we merge this with the other datasets, we can see that if we similarly remove the prefix of “25” from the leaid from the directory dataset, the school_district code from censusdata will allow us to join.
Merge Dataframes
Transform Directory
The directory dataframe from the Urban Institute has a field state_leaid that appears with values “MA-” followed by four digits. We know from the data dictionary that this means “State Local Education Agency Identification number”. When we look at the data downloaded from the Massachusetts Department of education, we can see that there are “District Codes” which present as eight digits, but the last four digits are always 0’s. The 4 digits following “MA-” in the state_leaid from directory is the key to join with the first four digits of the “District Code” renamed as leaid values from the Massachusetts Department of Education dataframes.
So, first we’ll modify the directory dataframe to retain the numeric portion of the state_leaid. We’ll want to retain this as a character string, though, because there are important leading zeros that we will need to join the first four characters of the District Code from other dataframes. The "\\d+" in the code below references the digit (numeric) component of this field.
directory <- directory |>
mutate(district_code = str_extract(directory$state_leaid, "\\d+")) |>
select(district_code, leaid, lea_name, city_location, longitude, latitude,
urban, locale, number_of_schools)We will also want to modify the leaid from directory to join in the Census data. To do this we’ll use another type of regex reference to the beginning of the string with “^25” to remove the 25 at the beginning of each leaid which represents the Massachusetts FIPS code 25.
directory$leaid <- sub("^25", "", directory$leaid)Merge dataframes
Now we’ll merge the datasets with the directory as our base, because we only want to look at the school districts that offer pre-K to 12th grade – the 209 school districts in the directory dataframe. We can now also use the created district_code field in order to join the other dataframes we’ve transformed above.
directory_merged <- directory |>
left_join(arts, by = "district_code") |>
left_join(digital, by = "district_code") |>
left_join(teachers, by = "district_code") |>
left_join(expenditures, by = "district_code") |>
left_join(grade9pass, by = "district_code") |>
left_join(censusdata, by = "leaid")head(directory_merged) district_code leaid lea_name
1 0753 00001 Quabbin
2 0767 00002 Spencer-E Brookfield
3 0766 00013 Southwick-Tolland-Granville Regional School District
4 0698 00067 Manchester Essex Regional
5 0616 00542 Ayer Shirley School District
6 0712 00544 Monomoy Regional School District
city_location longitude latitude urban locale number_of_schools
1 Barre -72.11355 42.39849 Rural distant 7
2 Spencer -71.97686 42.25104 Suburb large 4
3 Southwick -72.75376 42.06204 Rural fringe 3
4 Manchester -70.76566 42.58115 Suburb large 4
5 Ayer -71.57926 42.56630 Suburb large 4
6 Chatham -69.97116 41.69448 Suburb midsize 4
all_grades_percent_arts all_grades_percent_digital
1 88.7 61.8
2 90.1 2.9
3 84.6 69.7
4 85.6 77.0
5 97.4 90.3
6 89.2 71.4
percent_of_experienced_teachers percent_of_teachers_licensed
1 84.6 99.6
2 79.4 98.2
3 94.2 100.0
4 91.1 100.0
5 85.0 99.3
6 86.6 98.8
student_teacher_ratio in_district_expenditures total_in_district_ftes
1 14.6 38543729 2222.6
2 12.4 23307703 1380.3
3 10.9 24210887 1364.5
4 9.9 28134235 1258.0
5 11.2 30090579 1611.1
6 10.7 39269662 1771.0
in_district_expenditures_per_pupil percent_passing_all_courses median_income
1 17341.73 47.7 97094
2 16885.97 47.7 76636
3 17743.41 92.0 105987
4 22364.26 100.0 183854
5 18677.04 92.1 103189
6 22173.72 85.8 83125
Awesome! The processing of our dataframes made merging so much easier.
Let’s rename a few variables to make them a bit shorter and therefore more readable later on. - We’ll create an abbreviation in this merged dataset, where pct_ means “percent”. - We’ll abbreviate the word “expenditures” to “expend”. - For the variables that have specified “in-district”, we’ll remove this notation, since we know that all varibles are in regard to “in district”
These changes simply make our variables easier to read and distinguish in their setting.
directory_merged <- directory_merged |>
rename(
pct_experienced_teachers = percent_of_experienced_teachers,
pct_grade9_passing = percent_passing_all_courses,
pct_teachers_licensed = percent_of_teachers_licensed,
expend = in_district_expenditures,
expend_per_pupil = in_district_expenditures_per_pupil,
total_ftes = total_in_district_ftes,
all_grades_pct_arts = all_grades_percent_arts,
all_grades_pct_digital = all_grades_percent_digital
)Now we can move on to more EDA with visualization.
EDA Visualizations
GGPairs and Variable Relationships
ggpairs(
directory_merged |>
select(-district_code, -leaid, -lea_name, -city_location, -locale, -urban,
-number_of_schools, -longitude, -latitude,
-pct_teachers_licensed, -expend,
-total_ftes),
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", size = 0.5))
) +
labs(title = "2022 Massachusetts School Districts (Pre-K to 12) Data") +
theme(
strip.text = element_text(size = 6, face = "bold",
margin = margin(t = 2, r = 0, b = 1, l = 0)
),
axis.text.y = element_text(size = 8, face = "bold"),
axis.text.x = element_text(size = 8, face = "bold", angle = 45, hjust = 1),
)# Save the plot with adjusted height
# ggsave("ggpairs_plot.png", height = 10, width = 14)Unsurprisingly, there is a strong negative correlation between the in-district expenditures per pupil and the student-teacher ratio. This is unsurprising to me because, as the student-teacher ratio increases (more students to a teacher) the in-district expenditures per pupil decrease. It’s important to remember that expenditures on instruction is not limited merely to salaries. It may include things like a teacher lounge or other items that – with more teachers, the overall “cost” or expenditures goes down. This is perhaps more a matter of the economies of scale than anything to do with the quality or ‘value’ of education. The highest number we see in the student teacher ratio overall is 16:1, which makes sense when spread across pre-K to 12th grade. And the overall lower student-teacher ratio, the more per pupil a district is spending – we can see the spread of urban here as well.
ggplot(
directory_merged,
aes(x = student_teacher_ratio, y = expend_per_pupil)
) +
labs(title = "Expenditures per Pupil and Student-Teacher Ratio") +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()`geom_smooth()` using formula = 'y ~ x'
cor(directory_merged$student_teacher_ratio, directory_merged$expend_per_pupil)[1] -0.6891134
Another item we can see in this ggpairs visualization is that, for every district in the pre-K to 12 non-Charter Massachusetts school districts observed, the percentage of students taking an art course is well above 60%.
ggplot(
directory_merged,
aes(x = all_grades_pct_arts)
) +
labs(title = "Distribution of % of Students in an Arts Course") +
geom_histogram(color = "#0D0887", fill = "#ED7953")`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(
directory_merged,
aes(x = all_grades_pct_arts, y = median_income)
) +
labs(title = "Median Income and Percent of Students in Arts Courses") +
geom_point(color = "#ED7953") +
geom_smooth(method = "lm") `geom_smooth()` using formula = 'y ~ x'
cor(directory_merged$all_grades_pct_arts, directory_merged$median_income)[1] 0.2011804
Another item of note is that there’s a relationship observed between median income and percent of experienced teachers, which is an indicator of the number of teachers who have been teaching in a public school for more than 3 years.
ggplot(
directory_merged,
aes(x = pct_experienced_teachers, y = median_income)
) +
labs(title = "Median Income and Percent Experienced Teachers") +
geom_point(color = "#CB4679") +
geom_smooth(method = "lm") `geom_smooth()` using formula = 'y ~ x'
cor(directory_merged$pct_experienced_teachers, directory_merged$median_income)[1] 0.4419556
Along the same lines, there’s a relationship that appears between the median income and the percent of grade 9 passing - meaning, in districts with higher median incomes, we are more likely to see a higher share of grade 9 students passing.
ggplot(
directory_merged,
aes(x = pct_grade9_passing, y = median_income)
) +
labs(title = "Median Income and Percent of Grade 9 Passing") +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()`geom_smooth()` using formula = 'y ~ x'
cor(directory_merged$pct_grade9_passing, directory_merged$median_income)[1] 0.6868479
Of course, correlation is not causation, for this relationship or any of the others. But we do not see many exceptions where there is a high median income and low grade 9 passing percentage.
A surprise to me is that there does not appear to be, immediately, a strong positive relationship between expenditures per student and the US Census estimated median income per district.
ggplot(
directory_merged,
aes(x = expend_per_pupil, y = median_income, color = student_teacher_ratio)
) +
labs(title = "Median Income and Expenditures per Pupil") +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
There is probably more to unpack if we were to look more into this relationship, such as the way that expenditures are spread out at different levels of education, and more. We can see here that the student teacher ratio is not a very illustrative variable, especially seeing as it’s over the total district level.
Within the percent of students enrolled in a digital literacy or computer science course (the second ‘column’ of data in the above ggpairs visualization), there appear to be two ‘peaks’ of where the data is gathered. The digital literacy / computer science variable is perhaps the most baffling and seems like there might be some issues with the data - it’s unclear yet. Let’s take a closer look.
ggplot(
directory_merged,
aes(x = all_grades_pct_digital, y = expend_per_pupil,
color = urban
)
) +
geom_point() This is surprising, as I would’ve presumed that high in-district expenditures per pupil would correlate with high percentage of students enrolled in digital literacy or computer science courses. After all, technology is expensive, both the hardware/software and the specialization/expertise required of teachers to teach it, or the cost of online special programs in the place of teacher experts.
Another item of note along these lines is that the city data point near the top left quadrant of this visualization seems odd: very high in-district expenditures per pupil, yet very low enrollment in digital literacy or computer science courses. Which city might this be?
directory_merged |>
arrange(desc(expend_per_pupil)) |>
select(city_location, urban, locale, all_grades_pct_digital,
expend_per_pupil, student_teacher_ratio) |>
head() city_location urban locale all_grades_pct_digital expend_per_pupil
1 Cambridge City midsize 6.6 34937.94
2 Roxbury City large 23.4 30106.66
3 Weston Suburb large 33.2 29285.55
4 Sheffield Rural distant 15.8 29175.64
5 Shelburne Falls Rural distant 25.6 26569.42
6 Nantucket Town distant 36.7 26482.26
student_teacher_ratio
1 8.7
2 10.7
3 11.9
4 8.1
5 10.6
6 10.7
This is interesting: the datapoint we saw in the scatterplot above, which appeared to be an outlier, is the school district of Cambridge, Massachusetts. The Cambridge, Massachusetts, where numerous high-tech companies and top universities are based, among them the Massachusetts Institute of Technology (MIT) and Harvard. This low percentage for all_grades_pct_digital seems unusual, and potentially unreliable, based on what we know about the location as well as the high in-district expenditures per pupil. Perhaps this was misreported, or there is simply more to the data and what counts as a “digital literacy or computer science course”. Or, perhaps digital literacy and computer science education are so ingrained in the daily life and every class in this tech-hub of Cambridge that there is no “dedicated” computer science course, and the curricula is embedded in courses like statistics or biology (at least at the high-school level). We now know that, having observed Cambridge’s in-district expenditures per pupil around $35,000, it is the most extreme case in this negative correlation.
This makes us think: what would happen if we focused specifically on Suburbs, especially since those comprise the vast majority of our data points?
ggplot(directory, aes(x = urban)) +
geom_bar() +
labs(
title = "Distribution of Urban Categories",
x = "Urban Category",
y = "Count",
fill = "Locale"
)ggpairs(
directory_merged |>
filter(urban == "Suburb") |>
select(-district_code, -leaid, -lea_name, -city_location, -locale, -urban,
-number_of_schools, -longitude, -latitude,
-pct_grade9_passing, #-pct_experienced_teachers,
-pct_teachers_licensed, -expend,
-total_ftes),
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", size = 0.5))
) +
labs(title =
"2022 Suburban Massachusetts School Districts (Pre-K to 12) Data") +
theme(
strip.text = element_text(size = 6, face = "bold",
margin = margin(t = 2, r = 0, b = 1, l = 0)
),
axis.text.y = element_text(size = 8, face = "bold"),
axis.text.x = element_text(size = 8, face = "bold")
) We see changes to the correlations between variables, but not in such a significant way that we would want to re-do the analysis above.
Map
Now, we’ll load the Massachusetts state boundary
ggplot() +
geom_sf(data = massachusetts_map, fill = "white", color = "black") +
geom_point(data = directory_merged,
aes(x = longitude, y = latitude,
color = student_teacher_ratio,
size = expend_per_pupil),
alpha = 0.8) +
scale_color_viridis_c(
option = "magma", name = "Student-Teacher Ratio") +
scale_size_continuous(name = "Expenditures per Pupil ($)") +
labs(title =
"Geographic Distribution of Student-Teacher Ratio and Expenditures",
subtitle = "Massachusetts Public School Districts",
x = "Longitude", y = "Latitude") +
theme_minimal()ggplot() +
geom_sf(data = massachusetts_map, fill = "white", color = "black") +
geom_point(data = directory_merged,
aes(x = longitude, y = latitude,
color = all_grades_pct_digital,
size = expend_per_pupil),
alpha = 0.8) +
scale_color_viridis_c(
option = "magma", name = "Students in Digital (%)") +
scale_size_continuous(name = "Expenditures per Pupil ($)") +
labs(title =
"Geographic Distribution of % Students in Digital and Expenditures",
subtitle = "Massachusetts Public School Districts",
x = "Longitude", y = "Latitude") +
theme_minimal()ggplot() +
geom_sf(data = massachusetts_map, fill = "white", color = "black") +
geom_point(data = directory_merged,
aes(x = longitude, y = latitude,
color = all_grades_pct_arts,
size = expend_per_pupil),
alpha = 0.8) +
scale_color_viridis_c(
option = "magma", name = "Students in Arts (%)") +
scale_size_continuous(name = "Expenditures per Pupil ($)") +
labs(title =
"Geographic Distribution of % Students in Arts and Expenditures",
subtitle = "Massachusetts Public School Districts",
x = "Longitude", y = "Latitude") +
theme_minimal()ggplot() +
geom_sf(data = massachusetts_map, fill = "white", color = "black") +
geom_point(data = directory_merged,
aes(x = longitude, y = latitude,
color = all_grades_pct_arts,
size = median_income),
alpha = 0.8) +
scale_color_viridis_c(
option = "magma", name = "Students in Arts (%)") +
scale_size_continuous(name = "Median Income ($)") +
labs(title =
"Geographic Distribution of % Students in Arts and Median Income",
subtitle = "Massachusetts Public School Districts",
x = "Longitude", y = "Latitude") +
theme_minimal()As we saw previously, the prevalence of arts courses is much more widespread than digital/computer-science courses. This is very visible in the map view.
ggplot() +
geom_sf(data = massachusetts_map, fill = "white", color = "black") +
geom_point(data = directory_merged,
aes(x = longitude, y = latitude,
color = all_grades_pct_digital,
size = median_income),
alpha = 0.8) +
scale_color_viridis_c(
option = "plasma", name = "% in Digital") +
scale_size_continuous(name = "Median Income ($)") +
labs(title =
"Geographic Distribution of Median Income and % in Digital",
subtitle = "Massachusetts Public School Districts",
x = "Longitude", y = "Latitude") +
theme_minimal()Unsurprisingly, the highest median incomes are around the Boston area.
ggplot() +
geom_sf(data = massachusetts_map, fill = "white", color = "black") +
geom_point(data = directory_merged,
aes(x = longitude, y = latitude,
color = pct_grade9_passing,
size = median_income),
alpha = 0.8) +
scale_color_viridis_c(
option = "plasma", name = "% Grade 9 Passing") +
scale_size_continuous(name = "Median Income ($)") +
labs(title =
"Geographic Distribution of Median Income and % Grade 9 Passing",
subtitle = "Massachusetts Public School Districts",
x = "Longitude", y = "Latitude") +
theme_minimal()I think that we can expect the number of students enrolled in digital literacy and computer science classes to continue to climb - or perhaps we can simply expect that digital literacy and computer science courses will continue to be a part of our lives as well as our children’s lives (in my and my husband’s case, our future children).
Conclusion
This analysis highlights that determining what makes a “good” school district is multifaceted and cannot be derived from these data points alone. Metrics like expenditures per pupil, median household income, and digital course offerings provide valuable insights, but they represent only part of the picture. Notably, the correlation between higher median household income and a greater percentage of Grade 9 students passing all courses suggests socioeconomic factors may play a significant role in student outcomes. However, the lack of a strong relationship between median income and expenditures per pupil indicates that financial resources are not uniformly allocated or may not directly influence certain outcomes.
The findings also show a high prevalence of arts course enrollment across districts, contrasting with the variability in digital literacy and computer science offerings. This disparity raises important questions about access to modern educational opportunities in technology and underscores potential areas for policy intervention. It also raises the question of the value of the data points in a world where “digital” is everywhere.
The limitations of using 2022 data, which is still influenced by the disruptions of the COVID-19 pandemic, should be noted. Missing accountability metrics and other gaps in the Massachusetts Department of Education and Urban Institute datasets pose challenges to drawing definitive conclusions for future actions, whether personal choices like where to live or policy decisions at the state or district level. Future studies would benefit from examining more detailed trends in expenditures, curriculum offerings, and teacher experience/expertise to better understand their impact on overall district metrics and measures.
While this analysis may not directly inform decisions about where to live, it provides a framework for exploring public school data and raises critical questions about resource allocation and educational opportunity. Future iterations could incorporate more granular data points, such as parental reviews, advanced placement course availability, median SAT scores or MCAS scores, or extracurricular offerings, to create a more comprehensive understanding of school district characteristics.
At the end of the day, evaluating school districts requires both data-driven insights and personal experience. By refining this analysis with additional data and perspectives, we can better inform public policy and personal decisions alike, in the great state of Massachusetts or elsewhere.