607 Final Project Report

Author

Amanda Rose Knudsen

Abstract

This project explores publicly available data on Massachusetts public school districts to identify patterns and relationships between district characteristics, expenditures, and course offerings. Motivated by personal inquiry into factors influencing school quality and potential relocation decisions, this analysis focuses on comparing metrics beyond standardized test scores, such as the prevalence of arts and computer science courses and class sizes.

Using data from the Urban Institute’s Education Data Explorer (educationdata R package) and supplemental data from the Massachusetts Department of Education and US Census data, this exploratory analysis merges 2022 data for Pre-K to 12th grade non-charter school districts. The analysis employs various data acquisition and cleaning techniques, including processing strings and text, creating a function to perform a repetitive operation on multiple dataframes, dataset joins, tidying, and exploratory visualizations. Metrics such as per-pupil expenditures and percentages enrolled in various course types provide insights into how districts vary in their offerings and resources.

The COVID-19 pandemic introduced substantial gaps in assessment-related data, and as such, this analysis avoids test scores and instead examines district factors related to course availability and resource allocation. While no definitive conclusions are drawn about what makes a “good” district, the findings highlight significant variability in offerings and raise questions about the impact of resources, especially expenditures and median income levels, on educational opportunities. This exploration emphasizes the importance of transparency and contextual data in informing educational and personal decisions.

Introduction

The motivation for this project stems from a personal inquiry: my husband and I plan to move back to Massachusetts and eventually have children. As we consider towns for relocation, we are curious about how public school districts compare across a range of metrics. Popular real estate platforms like Zillow and Redfin provide unreliable and biased information, such as “school ratings” based on self-reported parent surveys or opaque algorithms. These data points lack transparency, fail to account for district-wide trends, and cannot answer broader questions like, What are some features that make a school district a good option for us?

This project builds upon findings from my DATA-606 coursework, where I examined a simple linear regression between instructional expenditures per student and midpoint math proficiency scores using 2020 data from the Urban Institute’s educationdata package. While the relationship was statistically significant, the model’s low R-squared value revealed limited explanatory power, emphasizing that statistical relationships do not always translate into practical significance. Inspired by these limitations, I aim to explore additional variables and broader district-level metrics.

For this analysis, I merge data from the Urban Institute’s Education Data Portal (2022 directory) and the Massachusetts Department of Education, in addition to data from the US Census for 2022. The dataset includes Pre-K to 12th grade non-charter school districts, as defined by the Common Core of Data (CCD). Although the pandemic disrupted the collection of assessment-related data, this project shifts the focus to variables like course enrollment in arts and computer science, and class sizes. By examining these factors, I aim to uncover patterns in district offerings and resource allocation rather than predict outcomes or define “best” districts.

The analysis workflow involves a variety of data science techniques, including: - Data acquisition using the educationdata package, Massachusetts Department of Education downloaded data, and US Census data. - Data transformations, such as creating calculated fields and creating functions to format csv imports. - Processing strings and text to merge data sources. - Visualizations to highlight observations and categorical trends.

Although exploratory, this project illustrates the variability in resources and offerings across districts and demonstrates the potential of publicly available data to inform nuanced, data-driven inquiries. However, it also underscores the challenges of interpreting incomplete, aggregated, and anonymized data, especially in the realm of education-related data, and the importance of context when drawing conclusions. While there does appear to be a relationship between expenditures and median income levels on “outcomes” such as percentage of students passing grade 9, it is unclear and outside the scope of this analysis to determine the cause/effect of such findings, in particular for a finite year at the tail-end of a global pandemic which so significantly impacted education and housing, among many other factors.

Data Sources

Sources overview

Urban Institute Education Data Portal R package `eduationdata`

Relevant links regarding this data source: - Education Data Explorer - API FAQs - R package overview on GitHub - Data dictionary is available here

Massachusetts Department of Education

Relevant links regarding these data sources: - Expenditures Report - Digital Literacy Coursetaking Report - Arts Coursetaking Report - Grade 9 Passing Report - Teacher Data Report

Note on Mass. DoE Data

As noted previously, a major source of data challenge is the COVID-19 global pandemic and its impacts on the collection of various data points. For example, in reviewing the “Accountability Report” from the Massachusetts Department of Education (intended to classify schools and districts based on performance and improvement targets), I found that for the years 2019–2022 in the data source), nearly 100% of the rows lacked accountability determinations. These omissions stemmed from the pandemic’s disruption to state assessments and school operations.

Given these limitations, this analysis focuses on exploring non-assessment-related metrics such as course offerings and expenditures. While 2022 data from the Massachusetts Department of Education is less comprehensive due to these disruptions, it still provides meaningful insights into resource allocation and opportunities at the district level. This project underscores the need for transparent, reliable, and consistently collected data to inform decisions at both the personal and policy levels.

Census API Data

U.S. Census Bureau. “Median Household Income in the Past 12 Months (in 2022 Inflation-Adjusted Dollars).” American Community Survey, ACS 1-Year Estimates Detailed Tables, Table B19013, 2022, https://data.census.gov/table/ACSDT1Y2022.B19013?q=B19013&g=040XX00US25$9700000&y=2022. Accessed on December 8, 2024.

Load Libraries

library(tidyverse)
library(educationdata)
library(skimr)
library(GGally)
library(readxl)
library(stringr)
library(janitor)
library(tigris)
library(jsonlite)

Load and Prepare Data: Clean and Transform

Urban Institute Directory

For the data from the educationdata R package, I will load data from the Urban Institute Education Data Portal using their get_education_data() function for the year 2022 and the state of Massachusetts, which has the FIPS code 25. Data from the Urban Institute Education Data Portal is available via a package in R called educationdata, which I have installed, which essentially calls their API to get data.

directory2022 <- get_education_data(level = 'school-districts',
                                 source = 'ccd',
                                 topic = 'directory',
                                 filters = list(year = 2022,
                                                fips = '25'),
                                 add_labels = TRUE
                                 )

I will now look at general summary overview of this dataframe using skim to ensure that we know what we’re looking for. For now we want to look at the ‘general information’ from the directory, like the latitude and longitude. Eventually we want to get to a “foundation” dataframe which includes ‘general’ values from this and other sources – we will eventually want to merge them using the state_leaid or leaid column, which based on the data dictionary means the “Local Education Agency Identification number”.

skim(directory2022)

directory2022 |> 
  group_by(lowest_grade_offered, highest_grade_offered) |> 
  summarise(n = n()) |> 
  arrange(desc(by = n))

`summarise()` has grouped output by 'lowest_grade_offered'. You can override
using the `.groups` argument.

# A tibble: 20 × 3
# Groups:   lowest_grade_offered [11]
   lowest_grade_offered highest_grade_offered     n
   <fct>                <fct>                 <int>
 1 Pre-K                12                      213
 2 9                    12                       44
 3 Pre-K                6                        38
 4 <NA>                 <NA>                     29
 5 Pre-K                8                        22
 6 Kindergarten         12                       15
 7 6                    12                       13
 8 7                    12                       13
 9 Kindergarten         8                        11
10 5                    12                        7
11 Pre-K                5                         6
12 Kindergarten         6                         4
13 6                    8                         4
14 Kindergarten         5                         3
15 4                    8                         1
16 5                    9                         1
17 6                    10                        1
18 11                   12                        1
19 Ungraded             Ungraded                  1
20 Not applicable       Not applicable            1

In order to compare measures that we’ll find in other datasets, we want to make sure that what we’re comparing schools that offer equivalent grade spans. We see that the most common grades offered is a lowest_grades_offered of Pre-K and a highest_grade_offered of 12. Let’s limit our dataframe to that subset of school districts that offers opportunities from Pre-K to 12th grade.

directory2022 <- directory2022 |> 
  filter(lowest_grade_offered == "Pre-K" & highest_grade_offered == "12")

Now, let’s look at the agency_types included in the filtered dataframe. Based on the Urban Institute’s data dictionary we know that this differentiates Charter schools and special school districts from what’s considered a ‘normal’ public school district.

directory2022 |> 
  group_by(agency_type) |> 
  summarise(n = n())

# A tibble: 2 × 2
  agency_type                       n
  <fct>                         <int>
1 Regular local school district   209
2 Charter agency                    4

We want to ensure that we’re comparing school districts of similar type, so we’ll filter out Charter agencies.

directory2022 <- directory2022 |> 
  filter(agency_type == "Regular local school district")

directory2022

There are some redundant and unnecessary variables in this dataframe, including the “year” (which will always be 2022, since that’s what we selected for), state (fips) name, and other values.

We are going to create a subset of directory2022 to include just the columns we think we’ll need going forward. We’ll rename it directory so that if we ever want to get some of the other variables we will be able to find and join them more easily.

We are going to select for public school districts that offer pre-K to 12th grade that are non-charter schools. We’ll want to get information about their name, district codes, location via latitude and longitude values, and some other ‘general’ fields to act as the base in the directory. The reduction based on offering pre-K to 12th grade plus non-Charter schools reduced us down from nearly 400 districts to just over 200 (209 to be exact.)

directory <- directory2022 |> 
  select(leaid, lea_name, state_leaid, city_location, latitude, longitude,
         urban_centric_locale, number_of_schools)

Looking at types of districts, it’s interesting that many are classified as large suburbs. Perhaps we can do some more interesting EDA down the line if we split urban_centric_locale into “urban” and “locale” so that the ‘types’ (suburb, city, town, rural) are distinguished from the locale (large, midsize, small, fringe, distant).

directory |> 
  group_by(urban_centric_locale) |> 
  summarise(n = n()) |> 
  arrange(urban_centric_locale)

# A tibble: 9 × 2
  urban_centric_locale     n
  <fct>                <int>
1 City large               1
2 City midsize             3
3 City small               5
4 Suburb large           147
5 Suburb midsize          16
6 Town fringe              4
7 Town distant             3
8 Rural fringe            26
9 Rural distant            4

directory <- directory |> 
  separate(urban_centric_locale, into = c("urban", "locale"), sep = " ")

directory |> 
  group_by(urban) |> 
  summarise(n = n())

# A tibble: 4 × 2
  urban      n
  <chr>  <int>
1 City       9
2 Rural     30
3 Suburb   163
4 Town       7

directory |> 
  group_by(locale) |> 
  summarise(n = n())

# A tibble: 5 × 2
  locale      n
  <chr>   <int>
1 distant     7
2 fringe     30
3 large     148
4 midsize    19
5 small       5

While it’s not a complete break from the dominance of “large suburb” this does allow us to see patterns in a different way.

ggplot(directory, aes(x = urban, fill = locale)) +
  geom_bar() +
  labs(
    title = "Distribution of Urban and Locale Categories",
    x = "Urban Category",
    y = "Count",
    fill = "Locale"
  )

Mass. DOE:

All datasets have been downloaded for the year 2022 from this main page: Statewide Reports

The data output is pretty consistent: they all have a merged “header” row that has the Report year. That’s why for the following code chunks I will be skipping row 1, which enables me to see the column names. When I downloaded each data file, I put it into a folder specifying the year, “2022”.

Expenditures and Pupil Enrollment Data

Description: - Per Pupil expenditures - By District, 2022. Data dictionary for variables Source Link: - Per Pupil expenditures

This also includes the number of enrolled pupils.

expenditures2022 <- read_xlsx("Data/2022/PerPupilExpenditures.xlsx", skip = 1)

New names:
• `` -> `...9`

head(expenditures2022)

# A tibble: 6 × 9
  `District Name`  `District Code` In-District Expendit…¹ Total In-district FT…²
  <chr>            <chr>           <chr>                  <chr>                 
1 Abby Kelley Fos… 04450000        $22,455,445.00         1,423.0               
2 Abington         00010000        $33,667,339.55         2,151.3               
3 Academy Of the … 04120000        $11,731,129.95         504.3                 
4 Acton-Boxborough 06000000        $94,671,026.58         5,242.6               
5 Acushnet         00030000        $14,347,713.00         942.8                 
6 Advanced Math a… 04300000        $14,353,114.97         965.3                 
# ℹ abbreviated names: ¹`In-District Expenditures`, ²`Total In-district FTEs`
# ℹ 5 more variables: `In-District Expenditures per Pupil` <chr>,
#   `Total Expenditures` <chr>, `Total Pupil FTEs` <chr>,
#   `Total Expenditures per Pupil` <chr>, ...9 <lgl>

It looks like this brought in a full blank 9th column which was named ...9. We can see it’s entirely blank, so we’ll drop it when we also rename the columns to snake_case using the clean_names() function from the janitor package.

expenditures2022 <- expenditures2022 |> 
  clean_names() |> 
  select(1:8)

We see a pattern that will recur for all the dataframes we’ve downloaded from the Massachusetts Department of Education: the district_code consistently has an extra four 0’s at the end of it. This is presumably to enable school-specific IDs to have the first four characters match, and unique variation using the last four characters. Since we are analyzing at the district level only across all our dataframes, we’re going to use str_sub() to extract just the first four characters for this district_code and all the Mass. Dept. of Education dataframes that follow. This will enable easier joining later on.

So, we will select for only the first 4 characters in the district_code:

expenditures2022 <- expenditures2022 |> 
  mutate(district_code = str_sub(district_code, 1, 4))

head(expenditures2022)

# A tibble: 6 × 8
  district_name      district_code in_district_expendit…¹ total_in_district_ft…²
  <chr>              <chr>         <chr>                  <chr>                 
1 Abby Kelley Foste… 0445          $22,455,445.00         1,423.0               
2 Abington           0001          $33,667,339.55         2,151.3               
3 Academy Of the Pa… 0412          $11,731,129.95         504.3                 
4 Acton-Boxborough   0600          $94,671,026.58         5,242.6               
5 Acushnet           0003          $14,347,713.00         942.8                 
6 Advanced Math and… 0430          $14,353,114.97         965.3                 
# ℹ abbreviated names: ¹in_district_expenditures, ²total_in_district_ft_es
# ℹ 4 more variables: in_district_expenditures_per_pupil <chr>,
#   total_expenditures <chr>, total_pupil_ft_es <chr>,
#   total_expenditures_per_pupil <chr>

We can see a few errors in the way that the clean_names() modified the acronym in some of these columns, “FTEs” which means “Full Time Equivalent” pupils. We want the reference to “FTEs” to read ftes rather than ft_es.

expenditures2022 <- expenditures2022 |> 
  rename(total_in_district_ftes = total_in_district_ft_es,
         total_pupil_ftes = total_pupil_ft_es)

For reference, from the data dictionary linked in this section:

Per pupil expenditures are calculated by dividing a district’s operating expenditures by its average pupil membership, including in-district expenditures per pupil and total expenditures per pupil. The following data elements are included in these calculations:
- In-District Full-Time Equivalents (FTE) Average Membership: Average enrollment across the school year, for pupils enrolled at local schools.
- In-District Expenditures: All of a district’s operating expenditures for in-district programs. All expenditures and all funding sources are included, except community services (6000 series), fixed assets (7000 series), debt service (8000 series), and out-of-district expenditures (9000 series).
- In-District Expenditures Per Pupil: Total in-district expenditures divided by in-district FTE average membership.
- Total FTE Average Membership: Average enrollment across the school year for students enrolled at local schools and those publicly-funded students enrolled at other districts, including charter schools, special education collaboratives, and private special education schools.
- Total Expenditures: All district operating expenditures for in-district programs and out-of-district placements. All expenditures and funding sources are included, except community services (6000 series), fixed assets (7000 series), and debt service (8000 series).
- Total Expenditures Per Pupil: Total expenditures divided by total FTE average membership.

With this in mind, we’ll want to only (for this and others) look at “in district” variables, rather than the “total” variables. This is a choice we are deliberately making for our analysis: they sometimes represent the same data (total_in_district_ftes and total_pupil_ftes) but sometimes they do not. We also have noticed that sometimes the ‘total pupil’ number differs from the ‘total student’ number in other dataframes we aim to merge with. Knowing these are slightly different variables, we will seek again to look just at the ‘total’ values and also just the total percentages for each figure of interest later on, such as the percent of students enrolled in an art course. The difference between ‘total FTEs’ and ‘total pupils’ or ‘total students’ does not often vary, but sometimes does - and it is for this reason that we are choosing to move forward with our analysis out of interest rather than seeking total precision in the subject matter. (If it isn’t already clear, I am not a subject matter expert in education data - rather, I am a very curious person who seeks validity and value in data I am exploring, and hope to clarify all uncertainties with interested parties - to clarify nuances or inconsistencies that are evident in the data.)

The reasoning for this and elsewhere is that we care more about the “in district” spending for this specific analysis. The “Total”, when different, indicates spending on “out-of-distrct placements” in addition to in-district programs. This means we would potentially be mixing district-level data, so we’re sticking with in-district measures.

We’ll now create a new dataframe called simply expenditures that subsets the expenditures2022 dataframe to include just the “in-district” data. We no longer need the district_name either because we have the district_code.

expenditures <- expenditures2022 |> 
  select(district_code, in_district_expenditures, total_in_district_ftes, 
         in_district_expenditures_per_pupil)

We also want to make some columns that are ‘characters’ into ‘numeric’ so we can more feasibly perform EDA later on.

We notice that converting the large currency values with as_numeric() rounded the numbers to the nearest whole number, which in some cases might be an issue, but we are OK with it here, knowing that’s what happens with the transformation – and knowing that we are mostly interested in the ‘per pupil’ figure, which isn’t rounded as a result of the below transformation.

expenditures <- expenditures |> 
  mutate(
    total_in_district_ftes = as.numeric(
      str_replace_all(total_in_district_ftes, ",", "")),
    in_district_expenditures = 
      as.numeric(str_replace_all(in_district_expenditures, "[\\$,]", "")),
    in_district_expenditures_per_pupil =
      as.numeric(str_replace_all(
        in_district_expenditures_per_pupil, "[\\$,]", ""))
         )

head(expenditures)

# A tibble: 6 × 4
  district_code in_district_expenditures total_in_district_ftes
  <chr>                            <dbl>                  <dbl>
1 0445                         22455445                   1423 
2 0001                         33667340.                  2151.
3 0412                         11731130.                   504.
4 0600                         94671027.                  5243.
5 0003                         14347713                    943.
6 0430                         14353115.                   965.
# ℹ 1 more variable: in_district_expenditures_per_pupil <dbl>

Great! This is exactly what we were looking for.

Digital Literacy & Computer Science Coursetaking

Description: - Count of Students Taking At Least 1 Digital Literacy and/or Computer Science Course - By District, 2022 Source Link: - Digital Literacy and Computer Science Coursetaking Report

digital2022 <- read_xlsx("Data/2022/dlcs_course_percent.xlsx", skip = 1)

head(digital2022)

# A tibble: 6 × 17
  `District Name`      `District Code` K     `01`  `02`  `03`  `04`  `05`  `06` 
  <chr>                <chr>           <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Abby Kelley Foster … 04450000        100.0 100.0 100.0 100.0 100.0 100.0 100.0
2 Abington             00010000        0.0   0.0   0.0   0.0   0.0   0.0   0.0  
3 Academy Of the Paci… 04120000        <NA>  <NA>  <NA>  <NA>  <NA>  0.0   0.0  
4 Acton-Boxborough     06000000        0.0   0.0   0.0   0.0   0.0   0.0   0.0  
5 Acushnet             00030000        100.0 100.0 100.0 100.0 100.0 100.0 100.0
6 Advanced Math and S… 04300000        <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  100.0
# ℹ 8 more variables: `07` <chr>, `08` <chr>, `09` <chr>, `10` <chr>,
#   `11` <chr>, `12` <chr>, `All Grades` <chr>, `Total Students` <chr>

Since we know we’ll want to perform the same clean_names() and mutate of district code to include the first 4 characters, we can create a function to do so:

process_dataframe <- function(df) {
  df |> 
    clean_names() |> 
    mutate(district_code = str_sub(district_code, 1, 4))
}

digital2022 <- process_dataframe(digital2022)

We’ll also just select for the ‘total’ percentage of students who are taking a digital literacy or computer science course.

Like with the expenditures smaller dataframe, we’ll create a digital dataframe that’s created out of the digital2022 dataframe, and just select for the all_grades percentage – and we’ll rename the variable so it’s more clear what it represents.

digital <- digital2022 |> 
  select(district_code, all_grades) |> 
  rename(all_grades_percent_digital = all_grades) |> 
  mutate(all_grades_percent_digital = as.numeric(all_grades_percent_digital))

head(digital)

# A tibble: 6 × 2
  district_code all_grades_percent_digital
  <chr>                              <dbl>
1 0445                                76.4
2 0001                                22.4
3 0412                                 5.4
4 0600                                21.6
5 0003                               100  
6 0430                                89.5

Arts Coursetaking Report

Description: - Percent and Number of students enrolled in an arts course, by District, 2022 Link: - https://profiles.doe.mass.edu/statereport/artcourse.aspx

arts2022 <- read_xlsx("Data/2022/artcourse_percent.xlsx", skip = 1)

arts2022 <- process_dataframe(arts2022)

arts <- arts2022 |> 
  select(district_code, all_grades) |> 
  rename(all_grades_percent_arts = all_grades) |> 
  mutate(all_grades_percent_arts = as.numeric(all_grades_percent_arts))

head(arts)

# A tibble: 6 × 2
  district_code all_grades_percent_arts
  <chr>                           <dbl>
1 0445                             95.6
2 0001                             85.8
3 0412                             89.9
4 0600                             81  
5 0003                             69.4
6 0430                             81.4

Grade 9 Passing Data

Source Link: - https://profiles.doe.mass.edu/statereport/gradeninecoursepass.aspx Description: - Grade Nine Course Passing Report (District) - All Students - All Subjects

grade9pass2022 <- read_xlsx("Data/2022/gradeninecoursepasss.xlsx", skip = 1)

grade9pass2022 <- process_dataframe(grade9pass2022)

head(grade9pass2022)

# A tibble: 6 × 5
  district_name      district_code number_grade_nine_st…¹ number_passing_all_c…²
  <chr>              <chr>         <chr>                  <chr>                 
1 Abby Kelley Foste… 0445          89                     72                    
2 Abington           0001          140                    107                   
3 Academy Of the Pa… 0412          64                     59                    
4 Acton-Boxborough   0600          418                    408                   
5 Advanced Math and… 0430          137                    129                   
6 Agawam             0005          314                    240                   
# ℹ abbreviated names: ¹number_grade_nine_students, ²number_passing_all_courses
# ℹ 1 more variable: percent_passing_all_courses <chr>

grade9pass <- grade9pass2022 |> 
  select(district_code, percent_passing_all_courses) |> 
  mutate(percent_passing_all_courses = as.numeric(percent_passing_all_courses))

head(grade9pass)

# A tibble: 6 × 2
  district_code percent_passing_all_courses
  <chr>                               <dbl>
1 0445                                 80.9
2 0001                                 76.4
3 0412                                 92.2
4 0600                                 97.6
5 0430                                 94.2
6 0005                                 76.4

Teacher Data

Source Link: - https://profiles.doe.mass.edu/statereport/teacherdata.aspx Description: - Includes information including student-teacher ratio (number of students to 1 teacher) overall. Additionally includes a percentage of experienced teachers, which is defined as “Percent of Experienced Teachers: Percent of teachers who have been teaching in a Massachusetts public school for at least three years.”

teachers2022 <- read_xlsx("Data/2022/teacherdata.xlsx", skip = 1)

teachers2022 <- process_dataframe(teachers2022)

head(teachers2022)

# A tibble: 6 × 8
  district_name      district_code total_number_of_teac…¹ percent_of_teachers_…²
  <chr>              <chr>         <chr>                  <chr>                 
1 Abby Kelley Foste… 0445          119.6                  99.2                  
2 Abington           0001          153.8                  97.1                  
3 Academy Of the Pa… 0412          49.6                   65.6                  
4 Acton-Boxborough   0600          395.4                  100.0                 
5 Acushnet           0003          76.0                   100.0                 
6 Advanced Math and… 0430          80.5                   92.3                  
# ℹ abbreviated names: ¹total_number_of_teachers_fte,
#   ²percent_of_teachers_licensed
# ℹ 4 more variables: student_teacher_ratio <chr>,
#   percent_of_experienced_teachers <chr>,
#   percent_of_teachers_without_waiver_or_provisional_license <chr>,
#   percent_teaching_in_field <chr>

We can see that the variable student_teacher_ratio is a character that reads with values like “11.9 to 1”, “13.9 to 1”, etc. We want to extract the part before the “to” and convert to numeric:

teachers2022 <- teachers2022 |> 
  mutate(
    student_teacher_ratio_clean = 
      str_extract(student_teacher_ratio, "^(.*) to"),
    student_teacher_ratio_clean = 
      str_remove(student_teacher_ratio_clean, " to"),
    student_teacher_ratio_clean = 
      as.numeric(student_teacher_ratio_clean)
    )

We also want to convert another <ch>r variable to <dbl>, and make a smaller dataframe comprised of variables of interest:

teachers <- teachers2022 |> 
  mutate(
    percent_of_experienced_teachers = 
      as.numeric(percent_of_experienced_teachers),
    percent_of_teachers_licensed =
      as.numeric(percent_of_teachers_licensed)
  ) |> 
  select(district_code, percent_of_experienced_teachers, 
         percent_of_teachers_licensed, student_teacher_ratio_clean) |> 
  rename(student_teacher_ratio = student_teacher_ratio_clean)

head(teachers)

# A tibble: 6 × 4
  district_code percent_of_experienced_teachers percent_of_teachers_licensed
  <chr>                                   <dbl>                        <dbl>
1 0445                                     81.5                         99.2
2 0001                                     81.3                         97.1
3 0412                                     37.3                         65.6
4 0600                                     88.3                        100  
5 0003                                     78.9                        100  
6 0430                                     80.8                         92.3
# ℹ 1 more variable: student_teacher_ratio <dbl>

Census Data

This does not come from the Urban Institute nor the Massachusetts Department of Education.

Downloaded for the year 2022, Massachusetts school districts median household income report (table B19013). https://data.census.gov/table/ACSDT5Y2022.B19013?q=B19013&g=040XX00US25$9700000&y=2022&moe=false&tp=true

To execute, I added my census key where this says “MY_API_KEY”: https://api.census.gov/data/2022/acs/acs5?get=NAME,B19013_001E&for=school%20district%20(unified):*&in=state:25&key=MY_API_KEY and downloaded it as a json.

I saved the output as a json which I’ll read into Rstudio. The column named “B19013_001E” represents the median hosuehold income per school district. I’ll be able to join this with the directory data using part of its leaid with the “school district (unified)”

censusdata2022 <- fromJSON("Data/2022/acs5.json")

str(censusdata2022)

 chr [1:213, 1:4] "NAME" "Quabbin School District, Massachusetts" ...

censusdata <- as.data.frame(censusdata2022)

head(censusdata)

                                                                   V1
1                                                                NAME
2                              Quabbin School District, Massachusetts
3              Spencer-East Brookfield School District, Massachusetts
4 Southwick-Tolland-Granville Regional School District, Massachusetts
5            Manchester Essex Regional School District, Massachusetts
6                         Ayer-Shirley School District, Massachusetts
           V2    V3                        V4
1 B19013_001E state school district (unified)
2       97094    25                     00001
3       76636    25                     00002
4      105987    25                     00013
5      183854    25                     00067
6      103189    25                     00542

We can see that the first row represents the column names, so we’ll fix this by assigning the first row as column names and then removing the first row.

colnames(censusdata) <- censusdata[1, ] 
censusdata <- censusdata[-1, ]

head(censusdata)

                                                                 NAME
2                              Quabbin School District, Massachusetts
3              Spencer-East Brookfield School District, Massachusetts
4 Southwick-Tolland-Granville Regional School District, Massachusetts
5            Manchester Essex Regional School District, Massachusetts
6                         Ayer-Shirley School District, Massachusetts
7                     Monomoy Regional School District, Massachusetts
  B19013_001E state school district (unified)
2       97094    25                     00001
3       76636    25                     00002
4      105987    25                     00013
5      183854    25                     00067
6      103189    25                     00542
7       83125    25                     00544

Great! That worked. Now let’s rename the column names and modify the types. We’ll name the school district variable to “leaid” which matches what we’ll need to match on in the directory dataframe.

censusdata <- censusdata |> 
  rename(
    median_income = B19013_001E,
    leaid = `school district (unified)`,
    district_name = NAME
  )

Now we’ll just select the columns we need - we don’t need the code 25, since we know all values are from Massachusetts. And similar to the datasets from the Massachusetts Department of Education, we don’t need the district name string values to allow us to join data together.

censusdata <- censusdata |> 
  select(leaid, median_income)

Now let’s make the median_income variable into a numeric, since it represents each district’s median income (a numeric value).

censusdata$median_income <- as.numeric(censusdata$median_income)

head(censusdata)

  leaid median_income
2 00001         97094
3 00002         76636
4 00013        105987
5 00067        183854
6 00542        103189
7 00544         83125

Great! Now it looks exactly as we’d like it to prior to merging with other datasets. When we merge this with the other datasets, we can see that if we similarly remove the prefix of “25” from the leaid from the directory dataset, the school_district code from censusdata will allow us to join.

Merge Dataframes

Transform Directory

The directory dataframe from the Urban Institute has a field state_leaid that appears with values “MA-” followed by four digits. We know from the data dictionary that this means “State Local Education Agency Identification number”. When we look at the data downloaded from the Massachusetts Department of education, we can see that there are “District Codes” which present as eight digits, but the last four digits are always 0’s. The 4 digits following “MA-” in the state_leaid from directory is the key to join with the first four digits of the “District Code” renamed as leaid values from the Massachusetts Department of Education dataframes.

So, first we’ll modify the directory dataframe to retain the numeric portion of the state_leaid. We’ll want to retain this as a character string, though, because there are important leading zeros that we will need to join the first four characters of the District Code from other dataframes. The "\\d+" in the code below references the digit (numeric) component of this field.

directory <- directory |> 
  mutate(district_code = str_extract(directory$state_leaid, "\\d+")) |> 
  select(district_code, leaid, lea_name, city_location, longitude, latitude, 
         urban, locale, number_of_schools)

We will also want to modify the leaid from directory to join in the Census data. To do this we’ll use another type of regex reference to the beginning of the string with “^25” to remove the 25 at the beginning of each leaid which represents the Massachusetts FIPS code 25.

directory$leaid <- sub("^25", "", directory$leaid)

Merge dataframes

Now we’ll merge the datasets with the directory as our base, because we only want to look at the school districts that offer pre-K to 12th grade – the 209 school districts in the directory dataframe. We can now also use the created district_code field in order to join the other dataframes we’ve transformed above.

directory_merged <- directory |> 
  left_join(arts, by = "district_code") |> 
  left_join(digital, by = "district_code") |> 
  left_join(teachers, by = "district_code") |> 
  left_join(expenditures, by = "district_code") |> 
  left_join(grade9pass, by = "district_code") |> 
  left_join(censusdata, by = "leaid")

head(directory_merged)

  district_code leaid                                             lea_name
1          0753 00001                                              Quabbin
2          0767 00002                                 Spencer-E Brookfield
3          0766 00013 Southwick-Tolland-Granville Regional School District
4          0698 00067                            Manchester Essex Regional
5          0616 00542                         Ayer Shirley School District
6          0712 00544                     Monomoy Regional School District
  city_location longitude latitude  urban  locale number_of_schools
1         Barre -72.11355 42.39849  Rural distant                 7
2       Spencer -71.97686 42.25104 Suburb   large                 4
3     Southwick -72.75376 42.06204  Rural  fringe                 3
4    Manchester -70.76566 42.58115 Suburb   large                 4
5          Ayer -71.57926 42.56630 Suburb   large                 4
6       Chatham -69.97116 41.69448 Suburb midsize                 4
  all_grades_percent_arts all_grades_percent_digital
1                    88.7                       61.8
2                    90.1                        2.9
3                    84.6                       69.7
4                    85.6                       77.0
5                    97.4                       90.3
6                    89.2                       71.4
  percent_of_experienced_teachers percent_of_teachers_licensed
1                            84.6                         99.6
2                            79.4                         98.2
3                            94.2                        100.0
4                            91.1                        100.0
5                            85.0                         99.3
6                            86.6                         98.8
  student_teacher_ratio in_district_expenditures total_in_district_ftes
1                  14.6                 38543729                 2222.6
2                  12.4                 23307703                 1380.3
3                  10.9                 24210887                 1364.5
4                   9.9                 28134235                 1258.0
5                  11.2                 30090579                 1611.1
6                  10.7                 39269662                 1771.0
  in_district_expenditures_per_pupil percent_passing_all_courses median_income
1                           17341.73                        47.7         97094
2                           16885.97                        47.7         76636
3                           17743.41                        92.0        105987
4                           22364.26                       100.0        183854
5                           18677.04                        92.1        103189
6                           22173.72                        85.8         83125

Awesome! The processing of our dataframes made merging so much easier.

Let’s rename a few variables to make them a bit shorter and therefore more readable later on. - We’ll create an abbreviation in this merged dataset, where pct_ means “percent”. - We’ll abbreviate the word “expenditures” to “expend”. - For the variables that have specified “in-district”, we’ll remove this notation, since we know that all varibles are in regard to “in district”

These changes simply make our variables easier to read and distinguish in their setting.

directory_merged <- directory_merged |> 
  rename(
   pct_experienced_teachers = percent_of_experienced_teachers,
   pct_grade9_passing = percent_passing_all_courses,
   pct_teachers_licensed = percent_of_teachers_licensed,
   expend = in_district_expenditures,
   expend_per_pupil = in_district_expenditures_per_pupil,
   total_ftes = total_in_district_ftes,
   all_grades_pct_arts = all_grades_percent_arts,
   all_grades_pct_digital = all_grades_percent_digital
  )

Now we can move on to more EDA with visualization.

EDA Visualizations

GGPairs and Variable Relationships

ggpairs(
  directory_merged |> 
    select(-district_code, -leaid, -lea_name, -city_location, -locale, -urban, 
           -number_of_schools, -longitude, -latitude, 
           -pct_teachers_licensed, -expend, 
           -total_ftes),
    upper = list(continuous = wrap("cor", size = 3)),
    lower = list(continuous = wrap("points", size = 0.5))
  )  +
  labs(title = "2022 Massachusetts School Districts (Pre-K to 12) Data") +
  theme(
    strip.text = element_text(size = 6, face = "bold",
                              margin = margin(t = 2, r = 0, b = 1, l = 0)
                              ),
    axis.text.y = element_text(size = 8, face = "bold"),
    axis.text.x = element_text(size = 8, face = "bold", angle = 45, hjust = 1),
  )

# Save the plot with adjusted height
# ggsave("ggpairs_plot.png", height = 10, width = 14)

Unsurprisingly, there is a strong negative correlation between the in-district expenditures per pupil and the student-teacher ratio. This is unsurprising to me because, as the student-teacher ratio increases (more students to a teacher) the in-district expenditures per pupil decrease. It’s important to remember that expenditures on instruction is not limited merely to salaries. It may include things like a teacher lounge or other items that – with more teachers, the overall “cost” or expenditures goes down. This is perhaps more a matter of the economies of scale than anything to do with the quality or ‘value’ of education. The highest number we see in the student teacher ratio overall is 16:1, which makes sense when spread across pre-K to 12th grade. And the overall lower student-teacher ratio, the more per pupil a district is spending – we can see the spread of urban here as well.

ggplot(
  directory_merged, 
  aes(x = student_teacher_ratio, y = expend_per_pupil)
) + 
  labs(title = "Expenditures per Pupil and Student-Teacher Ratio") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

cor(directory_merged$student_teacher_ratio, directory_merged$expend_per_pupil)

[1] -0.6891134

Another item we can see in this ggpairs visualization is that, for every district in the pre-K to 12 non-Charter Massachusetts school districts observed, the percentage of students taking an art course is well above 60%.

ggplot(
  directory_merged, 
  aes(x = all_grades_pct_arts)
) + 
  labs(title = "Distribution of % of Students in an Arts Course") +
  geom_histogram(color = "#0D0887", fill = "#ED7953")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(
  directory_merged, 
  aes(x = all_grades_pct_arts, y = median_income)
) + 
  labs(title = "Median Income and Percent of Students in Arts Courses") +
  geom_point(color = "#ED7953") +
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

cor(directory_merged$all_grades_pct_arts, directory_merged$median_income)

[1] 0.2011804

Another item of note is that there’s a relationship observed between median income and percent of experienced teachers, which is an indicator of the number of teachers who have been teaching in a public school for more than 3 years.

ggplot(
  directory_merged, 
  aes(x = pct_experienced_teachers, y = median_income)
) + 
  labs(title = "Median Income and Percent Experienced Teachers") +
  geom_point(color = "#CB4679") + 
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

cor(directory_merged$pct_experienced_teachers, directory_merged$median_income)

[1] 0.4419556

Along the same lines, there’s a relationship that appears between the median income and the percent of grade 9 passing - meaning, in districts with higher median incomes, we are more likely to see a higher share of grade 9 students passing.

ggplot(
  directory_merged, 
  aes(x = pct_grade9_passing, y = median_income)
) + 
  labs(title = "Median Income and Percent of Grade 9 Passing") +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

cor(directory_merged$pct_grade9_passing, directory_merged$median_income)

[1] 0.6868479

Of course, correlation is not causation, for this relationship or any of the others. But we do not see many exceptions where there is a high median income and low grade 9 passing percentage.

A surprise to me is that there does not appear to be, immediately, a strong positive relationship between expenditures per student and the US Census estimated median income per district.

ggplot(
  directory_merged, 
  aes(x = expend_per_pupil, y = median_income, color = student_teacher_ratio)
) + 
  labs(title = "Median Income and Expenditures per Pupil") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

There is probably more to unpack if we were to look more into this relationship, such as the way that expenditures are spread out at different levels of education, and more. We can see here that the student teacher ratio is not a very illustrative variable, especially seeing as it’s over the total district level.

Within the percent of students enrolled in a digital literacy or computer science course (the second ‘column’ of data in the above ggpairs visualization), there appear to be two ‘peaks’ of where the data is gathered. The digital literacy / computer science variable is perhaps the most baffling and seems like there might be some issues with the data - it’s unclear yet. Let’s take a closer look.

ggplot(
  directory_merged, 
  aes(x = all_grades_pct_digital, y = expend_per_pupil,
      color = urban
      )
) + 
  geom_point()

This is surprising, as I would’ve presumed that high in-district expenditures per pupil would correlate with high percentage of students enrolled in digital literacy or computer science courses. After all, technology is expensive, both the hardware/software and the specialization/expertise required of teachers to teach it, or the cost of online special programs in the place of teacher experts.

Another item of note along these lines is that the city data point near the top left quadrant of this visualization seems odd: very high in-district expenditures per pupil, yet very low enrollment in digital literacy or computer science courses. Which city might this be?

directory_merged |> 
  arrange(desc(expend_per_pupil)) |> 
  select(city_location, urban, locale, all_grades_pct_digital, 
         expend_per_pupil, student_teacher_ratio) |> 
  head()

    city_location  urban  locale all_grades_pct_digital expend_per_pupil
1       Cambridge   City midsize                    6.6         34937.94
2         Roxbury   City   large                   23.4         30106.66
3          Weston Suburb   large                   33.2         29285.55
4       Sheffield  Rural distant                   15.8         29175.64
5 Shelburne Falls  Rural distant                   25.6         26569.42
6       Nantucket   Town distant                   36.7         26482.26
  student_teacher_ratio
1                   8.7
2                  10.7
3                  11.9
4                   8.1
5                  10.6
6                  10.7

This is interesting: the datapoint we saw in the scatterplot above, which appeared to be an outlier, is the school district of Cambridge, Massachusetts. The Cambridge, Massachusetts, where numerous high-tech companies and top universities are based, among them the Massachusetts Institute of Technology (MIT) and Harvard. This low percentage for all_grades_pct_digital seems unusual, and potentially unreliable, based on what we know about the location as well as the high in-district expenditures per pupil. Perhaps this was misreported, or there is simply more to the data and what counts as a “digital literacy or computer science course”. Or, perhaps digital literacy and computer science education are so ingrained in the daily life and every class in this tech-hub of Cambridge that there is no “dedicated” computer science course, and the curricula is embedded in courses like statistics or biology (at least at the high-school level). We now know that, having observed Cambridge’s in-district expenditures per pupil around $35,000, it is the most extreme case in this negative correlation.

This makes us think: what would happen if we focused specifically on Suburbs, especially since those comprise the vast majority of our data points?

ggplot(directory, aes(x = urban)) +
  geom_bar() +
  labs(
    title = "Distribution of Urban Categories",
    x = "Urban Category",
    y = "Count",
    fill = "Locale"
  )

ggpairs(
  directory_merged |> 
    filter(urban == "Suburb") |> 
    select(-district_code, -leaid, -lea_name, -city_location, -locale, -urban, 
           -number_of_schools, -longitude, -latitude, 
           -pct_grade9_passing, #-pct_experienced_teachers, 
           -pct_teachers_licensed, -expend, 
           -total_ftes),
    upper = list(continuous = wrap("cor", size = 3)),
    lower = list(continuous = wrap("points", size = 0.5))
  )  +
  labs(title = 
         "2022 Suburban Massachusetts School Districts (Pre-K to 12) Data") +
  theme(
    strip.text = element_text(size = 6, face = "bold",
                              margin = margin(t = 2, r = 0, b = 1, l = 0)
                              ),
    axis.text.y = element_text(size = 8, face = "bold"),
    axis.text.x = element_text(size = 8, face = "bold")
    )

We see changes to the correlations between variables, but not in such a significant way that we would want to re-do the analysis above.

Map

Now, we’ll load the Massachusetts state boundary

ggplot() +
  geom_sf(data = massachusetts_map, fill = "white", color = "black") + 
  geom_point(data = directory_merged, 
             aes(x = longitude, y = latitude, 
                 color = student_teacher_ratio,
                 size = expend_per_pupil),
             alpha = 0.8) +
  scale_color_viridis_c(
    option = "magma", name = "Student-Teacher Ratio") + 
  scale_size_continuous(name = "Expenditures per Pupil ($)") + 
  labs(title = 
         "Geographic Distribution of Student-Teacher Ratio and Expenditures",
       subtitle = "Massachusetts Public School Districts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()

ggplot() +
  geom_sf(data = massachusetts_map, fill = "white", color = "black") + 
  geom_point(data = directory_merged, 
             aes(x = longitude, y = latitude, 
                 color = all_grades_pct_digital,
                 size = expend_per_pupil),
             alpha = 0.8) +
  scale_color_viridis_c(
    option = "magma", name = "Students in Digital (%)") + 
  scale_size_continuous(name = "Expenditures per Pupil ($)") + 
  labs(title = 
         "Geographic Distribution of % Students in Digital and Expenditures",
       subtitle = "Massachusetts Public School Districts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()

ggplot() +
  geom_sf(data = massachusetts_map, fill = "white", color = "black") + 
  geom_point(data = directory_merged, 
             aes(x = longitude, y = latitude, 
                 color = all_grades_pct_arts,
                 size = expend_per_pupil),
             alpha = 0.8) +
  scale_color_viridis_c(
    option = "magma", name = "Students in Arts (%)") + 
  scale_size_continuous(name = "Expenditures per Pupil ($)") + 
  labs(title = 
         "Geographic Distribution of % Students in Arts and Expenditures",
       subtitle = "Massachusetts Public School Districts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()

ggplot() +
  geom_sf(data = massachusetts_map, fill = "white", color = "black") + 
  geom_point(data = directory_merged, 
             aes(x = longitude, y = latitude, 
                 color = all_grades_pct_arts,
                 size = median_income),
             alpha = 0.8) +
  scale_color_viridis_c(
    option = "magma", name = "Students in Arts (%)") + 
  scale_size_continuous(name = "Median Income ($)") + 
  labs(title = 
         "Geographic Distribution of % Students in Arts and Median Income",
       subtitle = "Massachusetts Public School Districts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()

As we saw previously, the prevalence of arts courses is much more widespread than digital/computer-science courses. This is very visible in the map view.

ggplot() +
  geom_sf(data = massachusetts_map, fill = "white", color = "black") + 
  geom_point(data = directory_merged, 
             aes(x = longitude, y = latitude, 
                 color = all_grades_pct_digital,
                 size = median_income),
             alpha = 0.8) +
  scale_color_viridis_c(
    option = "plasma", name = "% in Digital") + 
  scale_size_continuous(name = "Median Income ($)") + 
  labs(title = 
         "Geographic Distribution of Median Income and % in Digital",
       subtitle = "Massachusetts Public School Districts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()

Unsurprisingly, the highest median incomes are around the Boston area.

ggplot() +
  geom_sf(data = massachusetts_map, fill = "white", color = "black") + 
  geom_point(data = directory_merged, 
             aes(x = longitude, y = latitude, 
                 color = pct_grade9_passing,
                 size = median_income),
             alpha = 0.8) +
  scale_color_viridis_c(
    option = "plasma", name = "% Grade 9 Passing") + 
  scale_size_continuous(name = "Median Income ($)") + 
  labs(title = 
         "Geographic Distribution of Median Income and % Grade 9 Passing",
       subtitle = "Massachusetts Public School Districts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()

I think that we can expect the number of students enrolled in digital literacy and computer science classes to continue to climb - or perhaps we can simply expect that digital literacy and computer science courses will continue to be a part of our lives as well as our children’s lives (in my and my husband’s case, our future children).

Conclusion

This analysis highlights that determining what makes a “good” school district is multifaceted and cannot be derived from these data points alone. Metrics like expenditures per pupil, median household income, and digital course offerings provide valuable insights, but they represent only part of the picture. Notably, the correlation between higher median household income and a greater percentage of Grade 9 students passing all courses suggests socioeconomic factors may play a significant role in student outcomes. However, the lack of a strong relationship between median income and expenditures per pupil indicates that financial resources are not uniformly allocated or may not directly influence certain outcomes.

The findings also show a high prevalence of arts course enrollment across districts, contrasting with the variability in digital literacy and computer science offerings. This disparity raises important questions about access to modern educational opportunities in technology and underscores potential areas for policy intervention. It also raises the question of the value of the data points in a world where “digital” is everywhere.

The limitations of using 2022 data, which is still influenced by the disruptions of the COVID-19 pandemic, should be noted. Missing accountability metrics and other gaps in the Massachusetts Department of Education and Urban Institute datasets pose challenges to drawing definitive conclusions for future actions, whether personal choices like where to live or policy decisions at the state or district level. Future studies would benefit from examining more detailed trends in expenditures, curriculum offerings, and teacher experience/expertise to better understand their impact on overall district metrics and measures.

While this analysis may not directly inform decisions about where to live, it provides a framework for exploring public school data and raises critical questions about resource allocation and educational opportunity. Future iterations could incorporate more granular data points, such as parental reviews, advanced placement course availability, median SAT scores or MCAS scores, or extracurricular offerings, to create a more comprehensive understanding of school district characteristics.

At the end of the day, evaluating school districts requires both data-driven insights and personal experience. By refining this analysis with additional data and perspectives, we can better inform public policy and personal decisions alike, in the great state of Massachusetts or elsewhere.