library(tidyverse)
library(educationdata)
library(skimr)
library(GGally)
library(tigris)
606 Final Project Report
Introduction
Pre-Abstract Context: We are interested in Massachusetts public school districts and whether there is a relationship between their expenditures on instruction per enrolled student and their midpoint math proficiency scores. The reason for this analysis is a personal interest. I was born and raised in Massachusetts, where I attended public school from kindergarten to 12th grade. My husband and I would like to eventually move back to Massachusetts and are interested in knowing which school districts offer the best opportunities for our future kids. Knowing that there are many ways to understand “outcomes”, “performance”, “opportunities” and more, this inquiry uses data available from the Urban Institute’s Education Data Portal to identify relationships between variables from its finance data (such as total expenditures per public school district) and its assessments data (the midpoint of math scores on statewide assessments, which I took while in the public school system). I have relied on the well-documented data dictionary from the Urban Institute, the related sources available from its website and Data Explorer, and the Urban Institute’s educationdata R package on GitHub.
Abstract
This analysis investigates the relationship between instructional expenditures per enrolled student and math proficiency midpoint scores in Massachusetts public school districts. The research question explores whether increased spending on instruction correlates with higher student performance, as measured by the midpoint of the range of students scoring proficient on standardized mathematics assessments.
The dataset, obtained through the Urban Institute’s R package educationdata, combines district-level data from the National Center for Education Statistics’ Common Core of Data (CCD) and the U.S. Department of Education’s EDFacts for the year 2020. The dependent variable is math_test_pct_prof_midpt, representing the midpoint of the reported range of students achieving proficiency in math (0–100 scale). The primary independent variable, exp_instruction_per_enrolled, reflects instructional expenditures per student, created by dividing total instructional spending by total student enrollment. A total of 208 non-charter, Pre-K–12 school districts were analyzed using a simple linear regression model.
Results indicate a statistically significant but modest positive relationship between instructional spending and math proficiency midpoints (p=0.003). Each additional $1,000 in instructional spending per student is associated with an increase of approximately 1.6 percentage points in the proficiency midpoint. However, the model explains only 4.2% of the variance in proficiency scores, highlighting the influence of unmeasured factors, such as socioeconomic conditions, teacher quality, attendance rates, or numerous other factors not captured in the data.
This analysis underscores the limitations of using midpoint proficiency scores, which are approximations based on anonymized ranges. While instructional spending shows a measurable association with proficiency, its practical significance is limited, and caution is advised in interpreting the results. The findings suggest that further research incorporating additional explanatory variables is necessary to better understand the determinants of student performance.
Data
Loading the Data
I will load data from the Urban Institute Education Data Portal for the year 2020 and the state of Massachusetts, which has the FIPS code ‘25’. Data from the Urban Institute Education Data Portal is available via API and other methods as well as via a package in R called educationdata, which I have installed.
directory <- get_education_data(level = 'school-districts',
                                source = 'ccd',
                                topic = 'directory',
                                filters = list(year = 2020,
                                               fips = '25'),
                                add_labels = TRUE
)
finance <- get_education_data(level = 'school-districts',
                              source = 'ccd',
                              topic = 'finance',
                              filters = list(year = 2020,
                                             fips = '25'),
                              add_labels = TRUE
)
assessments <- get_education_data(level = 'school-districts',
                                  source = 'edfacts',
                                  topic = 'assessments',
                                  filters = list(year = 2020,
                                                 fips = '25'),
                                  add_labels = TRUE
)
Directory 2020
skim(directory)
directory |>
  group_by(lowest_grade_offered, highest_grade_offered) |>
  summarise(n = n()) |>
  arrange(desc(n))
`summarise()` has grouped output by 'lowest_grade_offered'. You can override
using the `.groups` argument.
# A tibble: 22 × 3
# Groups: lowest_grade_offered [11]
lowest_grade_offered highest_grade_offered n
<fct> <fct> <int>
1 Pre-K 12 211
2 9 12 43
3 Pre-K 6 34
4 <NA> <NA> 28
5 Pre-K 8 22
6 Kindergarten 12 15
7 7 12 14
8 6 12 12
9 Kindergarten 8 11
10 Kindergarten 6 10
# ℹ 12 more rows
directory2 <- directory |>
  filter(lowest_grade_offered == "Pre-K" & highest_grade_offered == "12")
directory2 |>
  arrange(desc(lea_name))
directory3 <- directory2 |>
  filter(agency_type == "Regular local school district")
directory3
total_directory2020 <- directory3 |>
  select(leaid, lea_name, state_leaid, city_location, latitude, longitude,
         urban_centric_locale, enrollment, teachers_total_fte)
Finance 2020
skim(finance)
total_finance2020 <- finance |>
  select(leaid, exp_total, exp_current_instruction_total)
Assessments 2020
skim(assessments)
assessments |>
  group_by(grade_edfacts) |>
  summarise(n = n())
# A tibble: 8 × 2
grade_edfacts n
<fct> <int>
1 3 314
2 4 315
3 5 323
4 6 332
5 7 305
6 8 304
7 Grades 9-12 313
8 Total 399
total_assessments2020 <- assessments |>
  filter(grade_edfacts %in% "Total") |>
  select(leaid, read_test_pct_prof_midpt, math_test_pct_prof_midpt)
Join to One Dataframe to use for Statistical Analysis
Now we merge the datasets, with the ‘directory’ as our base.
education_data_2020 <- total_directory2020 |>
  inner_join(total_assessments2020, by = "leaid") |>
  inner_join(total_finance2020, by = "leaid")
nrow(education_data_2020)
[1] 208
education_data_2020
This represents 208 school districts that in 2020 offered Pre-K through grade 12 and were not charter schools.
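Because an inner join silently drops any district without a match, and repeats rows when a key appears more than once, it can help to inspect the keys before merging. The following is a minimal sketch in base R, so it runs without extra packages; the toy data frames and their leaid values are made up for illustration and are not the district data used in this report.

```r
# Toy stand-ins for the three tables; the leaid values below are made up.
directory_toy   <- data.frame(leaid = c("A1", "A2", "A3", "A4"))
assessments_toy <- data.frame(leaid = c("A1", "A2", "A2", "A4"),
                              math_test_pct_prof_midpt = c(40, 55, 57, 30))
finance_toy     <- data.frame(leaid = c("A1", "A2", "A3"),
                              exp_total = c(1e6, 2e6, 3e6))

# Keys that appear more than once would duplicate rows in the join.
dup_keys <- unique(assessments_toy$leaid[duplicated(assessments_toy$leaid)])

# Directory districts with no match in a table are dropped by an inner join.
no_assessment <- setdiff(directory_toy$leaid, assessments_toy$leaid)
no_finance    <- setdiff(directory_toy$leaid, finance_toy$leaid)

print(dup_keys)
print(no_assessment)
print(no_finance)
```

Running the same checks on the real leaid columns would show which districts, if any, are lost or duplicated on the way to the 208 rows.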
Create Variables
Now we’ll calculate a value for expenditure per enrolled student, expenditure on instruction per enrolled student, and the ratio of teachers to students.
education_data_2020 <- education_data_2020 |>
  mutate(
    exp_per_enrolled = round(exp_total / enrollment, 2),
    exp_instruction_per_enrolled = round(exp_current_instruction_total / enrollment, 2),
    student_teacher_ratio = round(enrollment / teachers_total_fte, 2)
  )
education_data_2020
education_data_2020 |>
  group_by(read_test_pct_prof_midpt) |>
  summarise(n = n()) |>
  arrange(desc(read_test_pct_prof_midpt))
# A tibble: 58 × 2
read_test_pct_prof_midpt n
<dbl> <int>
1 79 2
2 78 2
3 77 4
4 76 1
5 75 4
6 74 2
7 73 4
8 72 5
9 70 2
10 69 1
# ℹ 48 more rows
We use only the midpoint data because some school districts do not report the ‘low’ and ‘high’ ends of the range. This is the closest we can get to the underlying results, because ranges are reported to anonymize test-takers. Fortunately, we do not have to remove any further rows here (a range reported as -3 would mean those rows had to be removed).
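A guard like the one below is one way to keep special codes out of the derived per-student variables. It is only a sketch: it assumes, as the note above suggests, that negative values such as -3 are codes rather than real measurements, and the three toy rows are hypothetical.

```r
# Hypothetical district rows; -3 is assumed here to mark a suppressed value.
toy <- data.frame(
  leaid = c("A1", "A2", "A3"),
  enrollment = c(1200, -3, 0),
  exp_current_instruction_total = c(12000000, 9000000, 500000)
)

# Treat negative codes and zero enrollment as missing before dividing,
# so the ratio is NA rather than a negative or infinite number.
clean_enrollment <- ifelse(toy$enrollment > 0, toy$enrollment, NA)
toy$exp_instruction_per_enrolled <-
  round(toy$exp_current_instruction_total / clean_enrollment, 2)

print(toy)
```

Rows that end up NA can then be dropped or reported explicitly, instead of entering the regression with a misleading value.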
Midpoint of Proficiency Range: What does it mean and is it valid?
The variable we will use as the dependent variable is math_test_pct_prof_midpt, which is defined in the data source as the “Midpoint of range used to report the share of students scoring proficient on a mathematics assessment (0-100 scale)”. It is a metric used in educational data reporting to simplify and anonymize performance results for student groups. It approximates the percentage of students who scored at or above the proficiency level in mathematics on the statewide standardized assessment, and it is reported on a scale from 0 to 100.
The reason a midpoint is used in the data instead of an exact percentage is to protect privacy: an actual percentage or number could risk identifying individual performance or smaller groups, so it is standard for the assessment results to be reported in ranges in EDFacts, which is where the assessment data comes from.
We will use math_test_pct_prof_midpt in this analysis. Based on the sources that explain the data, the purpose of this method of reporting is to balance utility with privacy, allowing stakeholders to analyze trends and performance without exposing sensitive details about specific students or student groups. The midpoint is the standard proxy for the actual performance level, so we need to take this data with a grain of salt.
The midpoint serves as an approximate representation of performance; it is the best we have available. Midpoints are not exact values of the underlying data: they are midpoints of the provided ranges. As a side note, many school districts do not report the low and high points, just the midpoints. The ranges themselves are either not provided or unequal in width (e.g. one school district reports a low of 60, a high of 64 and a midpoint of 62, while another reports a low of 50, a high of 56 and a midpoint of 53, and another reports no low and no high and a midpoint of 68). This means that the midpoints we use may obscure nonlinear relationships. Because the range widths are unknown or vary, the approximation error differs from district to district, which can contribute to heteroscedasticity (unequal variance). We state this clearly before the next steps in our analysis.
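To make the midpoint idea concrete, the sketch below computes the midpoint and width of a few reported ranges. The districts and numbers are hypothetical, and the column names low and high are placeholders rather than the names used in the data source.

```r
# Hypothetical reported ranges; NA means the low/high bounds were not reported.
ranges <- data.frame(
  district = c("District A", "District B", "District C"),
  low      = c(60, 40, NA),
  high     = c(64, 49, NA)
)

# Midpoint and width of each range; both are NA when the bounds are missing.
ranges$midpoint <- (ranges$low + ranges$high) / 2
ranges$width    <- ranges$high - ranges$low

print(ranges)
```

The true share of proficient students can sit anywhere within half the width of the midpoint, so a wider range means a less precise value for that district, and a missing range means the precision is unknown.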
Simply put, we do not have raw data available for the number of students who scored proficient or higher on math tests. This does introduce some potential limitations in interpretation of the data, and must be taken into consideration in our analysis that follows.
Our sample size is large (208) but we do know that using midpoints in regression is a tradeoff that simplifies our use-case but potentially introduces approximation errors. Since we do not have another proxy for assessment scores or outcomes, we are limited in what options we have available to use in this data. We are choosing to proceed clearly stating these limitations. We strongly recommend against using the results of the analysis to make decisions such as buying a house in a particular location simply based on the regression analysis alone.
Analysis of Interest
Our dependent variable will be the math_test_pct_prof_midpt which is defined in the data source as the “Midpoint of range used to report the share of students scoring proficient on a mathematics assessment (0-100 scale)”.
Variables of interest, in addition to assessment scores, include the number of full time equivalent (FTE) teachers and the total expenditures per district. This is why we created variables like exp_instruction_per_enrolled (expenditure on instruction per enrolled student). Is this variable related to assessment scores in math?
In particular, is exp_instruction_per_enrolled (expenditure on instruction per enrolled student) a predictor of math_test_pct_prof_midpt?
Research question
Is there a relationship between the created variable exp_instruction_per_enrolled (expenditure on instruction per enrolled student) and the midpoint assessment scores in math, math_test_pct_prof_midpt?
Cases
The cases are school districts (non-Charter school) that offer grades Pre-K to grade 12 in the state of Massachusetts in the year 2020. There are 208 cases (observations) we will be looking at, based on the dataset merges and availability / non-suppression of data as outlined above.
Data collection
This district-level data is from the Urban Institute’s Education Data Portal, which provides datasets on school districts that here come from two sources: the National Center for Education Statistics’ Common Core of Data (CCD) and the US Department of Education’s EDFacts.
Type of study
This is an observational study.
Data Source
Urban Institute Education Data Portal. For this project, data was extracted using their Education Data Package in R.
Relevant summary statistics
summary(education_data_2020$math_test_pct_prof_midpt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5.00   26.00   36.00   37.89   50.00   74.00
summary(education_data_2020$exp_per_enrolled)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  16267   20194   22244   24233   26213   56190
ggplot(education_data_2020, aes(x = math_test_pct_prof_midpt)) +
  geom_histogram(color = "#0D0887", fill = "#CB4679") +
  labs(title = "Midpoint of range used to report the share of students scoring
proficient on a mathematics assessment (0-100 scale)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(education_data_2020, aes(x = exp_instruction_per_enrolled)) +
  geom_histogram(color = "#0D0887", fill = "#ED7953") +
  labs(title = "Total Expenditure on Instruction per number enrolled")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(education_data_2020, aes(x = student_teacher_ratio)) +
  geom_histogram(color = "#0D0887", fill = "#5B02A3") +
  labs(title = "Student-teacher ratio")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
education_data_2020_model <- education_data_2020 |>
  select(leaid, math_test_pct_prof_midpt,
         exp_instruction_per_enrolled, student_teacher_ratio)
Can we perform Regression analysis? (Check)
ggpairs(education_data_2020_model |> select(-leaid, -student_teacher_ratio))
The conditions for linear regression are:
Linearity: linear relationship between the dependent (response) variable (math_test_pct_prof_midpt) and the predictor (independent) variable exp_instruction_per_enrolled.
Normality: Nearly normal residuals with mean 0 - checked using a normal probability plot and histogram of residuals
Variability: constant variability of residuals - checked using residuals plots of residuals vs. y-hat, and residuals vs. each x
Independence: Observations that are independent of each other – this is met; we can confirm that each district is separate and there is no overlap
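The normality and variability checks listed above can be sketched with base R graphics. This is an illustration on simulated data, not the district data: the variable names spending and proficiency, the seed, and the simulated coefficients are all assumptions chosen only to resemble the scale of this analysis.

```r
set.seed(606)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in data: spending in dollars, proficiency in percent.
n <- 200
spending    <- runif(n, min = 8000, max = 20000)
proficiency <- 18 + 0.0016 * spending + rnorm(n, sd = 15)
fit <- lm(proficiency ~ spending)

res <- resid(fit)

# Normality: points close to the reference line suggest near-normal residuals.
qqnorm(res)
qqline(res)

# Constant variability: look for an even band around zero with no funnel shape.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Least-squares residuals average to zero by construction (up to rounding).
print(mean(res))
```

Because the residual mean is zero for any least-squares fit with an intercept, the informative parts of these checks are the shape of the normal probability plot and the spread of the residuals across fitted values.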
Linearity
First, we look at linearity. There appears to be a weak positive linear relationship here, which is also visible in the correlation reported in the ggpairs output above.
ggplot(education_data_2020_model,
       aes(x = exp_instruction_per_enrolled, y = math_test_pct_prof_midpt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Math Proficiency Midpoint and Expenditures on Instruction, per District")
`geom_smooth()` using formula = 'y ~ x'
Model: Simple Linear Regression
model2020 <- lm(math_test_pct_prof_midpt ~ exp_instruction_per_enrolled,
                data = education_data_2020_model)
summary(model2020)
Call:
lm(formula = math_test_pct_prof_midpt ~ exp_instruction_per_enrolled,
data = education_data_2020_model)
Residuals:
Min 1Q Median 3Q Max
-31.988 -12.228 -0.815 11.167 34.856
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.783e+01 6.773e+00 2.633 0.00911 **
exp_instruction_per_enrolled 1.636e-03 5.453e-04 3.000 0.00304 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.35 on 206 degrees of freedom
Multiple R-squared: 0.04185, Adjusted R-squared: 0.0372
F-statistic: 8.997 on 1 and 206 DF, p-value: 0.003037
Simple Linear Regression
par(mfrow = c(2, 2))
plot(model2020)
ggplot(model2020, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
Interpret the results
This model attempts to predict the midpoint of math proficiency rates (math_test_pct_prof_midpt) using expenditures on instruction per enrolled student (exp_instruction_per_enrolled). The intercept (17.83) represents the predicted math proficiency midpoint when the instructional expenditures per enrolled student = 0. This is not meaningful in practice because such values don’t occur in real-world data (there is never literally $0 expenditures on instruction), but it acts as a baseline for the predictions in the model.
Coefficient for exp_instruction_per_enrolled (0.001636):
Interpretation: For every additional dollar spent on instruction per student, the predicted math proficiency midpoint increases by 0.001636 percentage points. This means that an increase of $1,000 in instructional expenditures per student is associated with a predicted increase of 0.001636 * 1000 = 1.636 percentage points in the math proficiency midpoint.
Significance: The p-value is 0.00304, which indicates a statistically significant relationship (p < 0.01), meaning that there is evidence that instructional spending is positively associated with math proficiency midpoints.
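One way to make the slope easier to read is to express the predictor in thousands of dollars, so the coefficient is already on the per-$1,000 scale. The sketch below shows this on simulated data; the variable names, seed, and simulated values are assumptions for illustration and do not reproduce the district data.

```r
set.seed(1)  # arbitrary seed for the simulated example

# Simulated stand-in data, not the district data from this report.
spending    <- runif(150, min = 8000, max = 20000)   # dollars per student
proficiency <- 18 + 0.0016 * spending + rnorm(150, sd = 15)

fit_dollars   <- lm(proficiency ~ spending)
fit_thousands <- lm(proficiency ~ I(spending / 1000))

slope_dollars   <- unname(coef(fit_dollars)[2])
slope_thousands <- unname(coef(fit_thousands)[2])

# The per-$1,000 slope is 1,000 times the per-dollar slope.
print(c(per_dollar = slope_dollars, per_thousand = slope_thousands))

# A 95% confidence interval conveys the uncertainty around the slope.
print(confint(fit_thousands))
```

Rescaling changes only the units of the coefficient; the fit, the t value, and the p-value stay the same.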
Model Fit
The multiple R-squared is 0.04185 which means that the model explains 4.185% of the variance in math proficiency midpoint scores. This is a low R-squared value, and from this we can interpret that most of the variation in math proficiency rates is due to factors that aren’t included in this model, such as: socioeconomic makeup of the district, quality of math teachers, longevity of teachers in the district, school environment and preparedness for testing, or numerous other factors. The adjusted R-squared is also low: 0.0372, which also indicates that the model’s explanatory power is limited.
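In a regression with a single predictor, R-squared is the square of the correlation between the two variables, so the reported value implies a correlation of roughly 0.2. The sketch below does that arithmetic and then checks the identity on simulated data; the simulated x and y and the seed are assumptions for illustration only.

```r
# Correlation implied by the reported R-squared (the slope is positive).
r_implied <- sqrt(0.04185)
print(round(r_implied, 3))

# Check of the identity R^2 = r^2 on simulated stand-in data.
set.seed(2)  # arbitrary seed
x <- rnorm(100)
y <- 0.3 * x + rnorm(100)
fit <- lm(y ~ x)
print(all.equal(summary(fit)$r.squared, cor(x, y)^2))
```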
Goodness of Fit
Residual Standard Error (15.35): This value represents the typical deviation of observed proficiency midpoints from the model’s predictions. A residual standard error of 15.35 means that, on average, the model’s predictions differ from the actual values by about 15 percentage points. This is pretty substantial, especially since the stated midpoint proficiency values are between 0 and 100!
F-Statistic (8.997, p = 0.003037): The overall model is statistically significant, meaning that instructional expenditures per student as a predictor has a significant relationship with math proficiency midpoint.
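With only one predictor, the F-test and the t-test on the slope are the same test: the F-statistic equals the squared t value. The short sketch below checks this using the numbers printed in the model summary above; the small gap comes from rounding in the printed t value.

```r
t_value <- 3.000   # slope t value from the summary output
f_value <- 8.997   # F-statistic from the summary output

# With one predictor, F equals t squared (here up to rounding of the printed t).
print(t_value^2)

# p-value of the F-statistic on 1 and 206 degrees of freedom.
p_value <- pf(f_value, df1 = 1, df2 = 206, lower.tail = FALSE)
print(p_value)
```

This is why the overall model p-value matches the p-value reported for the slope.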
Conclusion
Our analysis reveals a positive and statistically significant relationship between instructional expenditures per student and math proficiency midpoint scores in Massachusetts. However, the relationship is modest, as the model only explains 4.185% of the variance in the math proficiency scores. This indicates that there are other important factors influencing math proficiency, such as socioeconomic conditions, teacher quality, or other district-specific attributes that are not included in the model.
While the results show that higher instructional spending is associated with better proficiency scores, the effect size is small – requiring large increases in spending to achieve meaningful improvements as predicted by the model. In addition, the data reveals districts with high per-student spending but low proficiency midpoints, which further underscores the takeaway that expenditures alone do not determine outcomes.
The model is not sufficient to fully explain or predict math proficiency midpoint scores across districts or to predict performance across districts solely based on expenditures per enrolled student. These findings highlight the need for more comprehensive data and additional variables to better understand and model the factors that drive student achievement outcomes. As such, decisions such as where to send future kids to school should not rely solely on expenditure metrics without considering a broader context and additional data.
# Load Massachusetts state boundary
massachusetts_map <- states(cb = TRUE, resolution = "20m") |>
  filter(NAME == "Massachusetts")  # Filter to Massachusetts only
Retrieving data for the year 2021
ggplot() +
  geom_sf(data = massachusetts_map, fill = "white", color = "black") +
  geom_point(data = education_data_2020,
             aes(x = longitude, y = latitude,
                 color = math_test_pct_prof_midpt,
                 size = exp_instruction_per_enrolled),
             alpha = 0.8) +
  scale_color_viridis_c(option = "plasma", name = "Math Proficiency Midpoint (%)") +
  scale_size_continuous(name = "Expenditures per Student ($)") +
  labs(title = "Geographic Distribution of Math Proficiency Midpoint Scores and Expenditures",
       subtitle = "Massachusetts Public School Districts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()
ggplot() +
  geom_sf(data = massachusetts_map, fill = "white", color = "black") +
  geom_point(data = education_data_2020,
             aes(x = longitude, y = latitude,
                 color = math_test_pct_prof_midpt),
             alpha = 0.8) +
  scale_color_viridis_c(option = "plasma", name = "Math Proficiency Midpoint (%)") +
  labs(title = "Geographic Distribution of Math Proficiency Midpoint Scores",
       subtitle = "Massachusetts Public School Districts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()
ggplot() +
  geom_sf(data = massachusetts_map, fill = "white", color = "black") +
  geom_point(data = education_data_2020,
             aes(x = longitude, y = latitude,
                 color = exp_instruction_per_enrolled),
             alpha = 0.8) +
  scale_color_viridis_c(option = "plasma", name = "Expenditures per Enrolled ($)") +
  labs(title = "Geographic Distribution of Expenditures on Instruction per Enrolled Student",
       subtitle = "Massachusetts Public School Districts",
       x = "Longitude", y = "Latitude") +
  theme_minimal()