The Maryland Senate Districts-Socioeconomic Characteristics, ACS 5-year Estimates (2019-2023) dataset includes socioeconomic characteristics of all 47 Maryland Senate Districts. The American Community Survey (ACS) 5-year Estimates are U.S. Census Bureau survey data that summarize housing and population characteristics averaged over the five-year period from 2019 to 2023.
This dataset comes from the Maryland Open Data Portal, where the state collects and publishes data from all departments. I have linked the source here.
My main research question for this dataset is: During 2019-2023, how do median household income and the percentage of families in poverty affect the rental burden (percent of renters paying more than 35% of income toward rent) across all 47 Maryland Senate Districts?
There are 36 variables: 4 categorical and 32 quantitative. Most variable names are self-explanatory, such as “POPULATION - Pct. Foreign born”, “HOUSEHOLD - Pct. Receiving SNAP (food stamp)”, and “COMMUTERS-Pct. Public Transportation”. The variables I will focus on in this project are:
“State legislative district”: The unique number identifying each of Maryland’s 47 Senate districts.
“HOUSEHOLD - Median income”: The median household income for the district.
“Pct. Of FAMILIES in Poverty”: The percentage of family units falling below the federal poverty threshold.
“HOUSING - Pct. Renter-Occupied paying more than 35 percent of income on rent”: The percentage of renter-occupied households that spend more than 35 percent of their income on rent. This is the measure of rental burden used in this project (renamed housing_burden_pct during cleaning).
Loading libraries and dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
setwd("C:/Users/hwang/OneDrive/Documents/MC stuff/Spring 2026/DATA 110 Data Visualization and Communication/Projects/Project 1 stuff")
mdsenatedata <- read_csv("Maryland_Senate_Districts-Socioeconomic_Characteristics,_ACS_5-year_Estimates_(2019-2023).csv")
Rows: 47 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Date Created, SENATE DISTRICT
dbl (24): POPULATION - Median age (years), POPULATION - Pct. Foreign born, L...
num (10): POPULATION - Total, POPULATION - 65 years and over, HOUSEHOLD - Me...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Retrieving list of variables
names(mdsenatedata)
[1] "Date Created"
[2] "SENATE DISTRICT"
[3] "POPULATION - Total"
[4] "POPULATION - Median age (years)"
[5] "POPULATION - 65 years and over"
[6] "POPULATION - Pct. Foreign born"
[7] "LANGUAGE - Pct. Language spoken at home other than English"
[8] "POPULATION - Pct. Living alone, Male 65 years and over"
[9] "POPULATION - Pct. Living alone, Female 65 years and over"
[10] "HOUSEHOLD - Pct. Female householder, no spouse, with children under 18 years"
[11] "EDUCATION-Pct High School diploma or higher"
[12] "EDUCATION - Pct. Bachelor's Degree or higher"
[13] "HOUSEHOLD - Median income"
[14] "Pct. Of FAMILIES in Poverty"
[15] "Civilian Labor Force - Pct. no health insurance coverage"
[16] "Unemployment rate"
[17] "HOUSEHOLD - Pct. Receiving SNAP (food stamp)"
[18] "Pct. Civilian population with disability"
[19] "HOUSING - Total Units"
[20] "HOUSING - Pct. Owner Occupied Housing Units"
[21] "HOUSING - Median value"
[22] "HOUSEHOLD - Pct. Who spend 35 percent or more of income on housing cost"
[23] "Median Gross Rent"
[24] "HOUSING - Pct. Renter-Occupied paying more than 35 percent of income on rent"
[25] "COMMUTERS-Mean travel time (minutes)"
[26] "COMMUTERS-16 years and over"
[27] "COMMUTERS-Pct. Public Transportation"
[28] "COMMUTERS-Pct. Drove Alone or carpool"
[29] "COMMUTERS-Pct. Worked from home"
[30] "HOUSING - Pct. No vehicles"
[31] "Total Houeholds"
[32] "Pct. of Civilian Labor Force-Employed"
[33] "HOUSEHOLD-with computer"
[34] "HOUSEHOLD-Pct with computer"
[35] "State"
[36] "State legislative district"
Cleaning the dataset
# Make all variable characters lowercase
names(mdsenatedata) <- tolower(names(mdsenatedata))
# Replace spaces and dashes with underscores
names(mdsenatedata) <- gsub(" ", "_", names(mdsenatedata), fixed = TRUE)
names(mdsenatedata) <- gsub("-", "_", names(mdsenatedata), fixed = TRUE)
# Remove commas and periods
names(mdsenatedata) <- gsub(",", "", names(mdsenatedata), fixed = TRUE)
names(mdsenatedata) <- gsub(".", "", names(mdsenatedata), fixed = TRUE)
# Collapse multiple underscores into one
names(mdsenatedata) <- gsub("_+", "_", names(mdsenatedata))
# Renaming long variable names to shorter ones
mdsenatedata <- mdsenatedata %>%
  rename(
    housing_burden_pct = `housing_pct_renter_occupied_paying_more_than_35_percent_of_income_on_rent`,
    poverty_pct = `pct_of_families_in_poverty`
  )
names(mdsenatedata)
# A tibble: 10 × 5
   maryland_region        avg_income avg_rent_burden avg_poverty district_count
   <chr>                       <dbl>           <dbl>       <dbl>          <int>
 1 Anne Arundel County       145200.            40.4        3.65              4
 2 Baltimore City             78069.            44.8        14.8              5
 3 Baltimore Region          131260.            41.8        5.73             10
 4 Eastern Shore              98361             42.8        7.67              3
 5 Frederick/Carroll         141406.            36.5         4.5              3
 6 Harford/Cecil             125556             39.2           5              2
 7 Montgomery County         161073.            42.4        4.99              8
 8 Prince George's County    118117.            43.2        6.71              7
 9 Southern Maryland         140508.            38.8         4.4              3
10 Western Maryland           87510.            37.6         8.9              2
Regression Analysis
Regression Model
fit_md <- lm(housing_burden_pct ~ household_median_income + poverty_pct, data = mdsenatedata)
summary(fit_md)
Call:
lm(formula = housing_burden_pct ~ household_median_income + poverty_pct,
data = mdsenatedata)
Residuals:
    Min      1Q  Median      3Q     Max
-13.508  -2.645  -0.244   1.834   9.629
Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              4.611e+01  5.130e+00   8.989 1.61e-11 ***
household_median_income -4.103e-05  2.865e-05  -1.432    0.159
poverty_pct              9.885e-02  2.568e-01   0.385    0.702
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.273 on 44 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.122
F-statistic: 4.197 on 2 and 44 DF, p-value: 0.02147
par(mfrow = c(2, 2))
plot(fit_md)
Equation
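Written out from the coefficient estimates in the regression summary, the fitted model is approximately the following. The coefficients are rounded from the printed output, and I am assuming income is measured in dollars and both percentages are on a 0-100 scale, which is how the values appear in the dataset summary.

```latex
\widehat{\text{housing\_burden\_pct}}
  = 46.11
  - 0.00004103 \times \text{household\_median\_income}
  + 0.09885 \times \text{poverty\_pct}
```

Read this way, holding the poverty percentage fixed, a district with a $10,000 higher median household income has a predicted rent burden about 0.41 percentage points lower.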
P-values: Neither income (p = 0.159) nor poverty (p = 0.702) reached the conventional significance level of 0.05. Although the income coefficient is slightly negative and the poverty coefficient is slightly positive, neither variable is a statistically significant predictor of housing burden across Maryland Senate districts once the other is included in the model.
Adjusted R-squared: The adjusted R-squared is 0.122. This means the model explains only about 12.2% of the variation in housing burden after adjusting for the number of predictors. The remaining 87.8% is unexplained and may reflect factors not in this model, such as local zoning, proximity to D.C. and Baltimore, or the supply of apartments.
Diagnostic Plots: The Residuals vs Fitted plot was used to check linearity, and it suggests that income and poverty have a roughly straight-line relationship with rent burden. The Q-Q plot was used to check the normality of the residuals, which look approximately normal apart from a few outliers: Senate Districts 46 (South Baltimore City) and 47 (Prince George’s County on the border of D.C.) have a rent burden that is much higher or lower than their income levels would suggest. The Scale-Location plot was used to check for constant variance. The Residuals vs Leverage plot identifies districts that exert disproportionate influence on the model’s coefficients. In this case, Districts 16 (South Montgomery County), 40 (East Baltimore City), and 46 have a notable influence on the regression results.
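As a complement to reading the Residuals vs Leverage plot by eye, influential observations can also be listed numerically with Cook's distance. The sketch below is only an illustration: flag_influential() is a hypothetical helper name, the 4/n cutoff is a common rule of thumb rather than something used in this project, and the demonstration runs on R's built-in mtcars data, not the Maryland data.

```r
# Flag observations whose Cook's distance exceeds the 4/n rule of thumb.
# `fit` can be any lm object; the 4/n cutoff is an assumed convention.
flag_influential <- function(fit) {
  d <- cooks.distance(fit)
  cutoff <- 4 / length(d)
  flagged <- which(d > cutoff)
  data.frame(row = unname(flagged), cooks_distance = unname(d[flagged]))
}

# Demonstration on the built-in mtcars data (not the Maryland data).
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
influential <- flag_influential(demo_fit)
print(influential)
```

With the model from this report, the equivalent call would be flag_influential(fit_md), and the returned row numbers could then be matched to state_legislative_district.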
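While the overall model is statistically significant (F-test p-value = 0.021), the low adjusted R-squared of 0.122 suggests that median income and poverty rates together account for only about 12% of the variation in housing burden across Maryland. A significant overall test alongside non-significant individual coefficients can occur when the predictors are correlated with each other, as income and poverty plausibly are. Overall, this indicates that other geographic or policy-driven factors are likely influencing rent burden in the state.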
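The adjusted R-squared, F-statistic, and overall p-value reported above can be recomputed by hand from the printed R-squared, which is a useful sanity check on how they relate. This is a minimal sketch using the rounded values from the summary output, so the results match the reported figures only up to rounding.

```r
# Values copied from the summary(fit_md) output above.
r_squared <- 0.1602   # Multiple R-squared
n <- 47               # number of Senate districts (rows)
p <- 2                # predictors: median income and poverty percentage

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Overall F-statistic and its p-value on (p, n - p - 1) degrees of freedom.
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(round(adj_r_squared, 3))
print(round(f_stat, 2))
print(round(p_value, 3))
```

The residual degrees of freedom, n - p - 1 = 44, are the same 44 shown in the summary output.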
Conclusion and Insights
Dataset clean-up
For this project’s dataset, I first cleaned the variable names. The default names contained a mix of lowercase and uppercase letters, periods, commas, dashes, and spaces. To fix this, I used tolower() to convert all names to lowercase and gsub() to replace spaces and dashes with underscores, remove commas and periods, and collapse repeated underscores into one. I then renamed the longest names to short identifiers like housing_burden_pct and poverty_pct so they are easier to read and handle. Additionally, I performed categorical data binning. Using the case_when() function, I grouped the 47 state_legislative_district numbers into 10 distinct geographic regions (e.g., “Western Maryland,” “Montgomery County”). This feature engineering allowed for a more detailed regional analysis.
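The name-cleaning steps can be sketched as a single reusable function. clean_column_names() is a hypothetical helper name introduced only for illustration; it applies the same tolower() and gsub() calls in the same order as the cleaning code, here to three of the original column names.

```r
# Hypothetical helper that bundles the cleaning steps used in this project.
clean_column_names <- function(x) {
  x <- tolower(x)                        # lowercase everything
  x <- gsub(" ", "_", x, fixed = TRUE)   # spaces -> underscores
  x <- gsub("-", "_", x, fixed = TRUE)   # dashes -> underscores
  x <- gsub(",", "", x, fixed = TRUE)    # drop commas
  x <- gsub(".", "", x, fixed = TRUE)    # drop periods (fixed = TRUE, so "." is literal)
  gsub("_+", "_", x)                     # collapse runs of underscores
}

raw_names <- c(
  "HOUSEHOLD - Median income",
  "Pct. Of FAMILIES in Poverty",
  "COMMUTERS-Pct. Public Transportation"
)
cleaned_names <- clean_column_names(raw_names)
print(cleaned_names)
```

Note that fixed = TRUE matters for the period: without it, "." is a regular expression that matches every character.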
Visualization
The main scatter plot visualizes the socioeconomic profile of Maryland’s Senate districts during 2019-2023. The most striking pattern is the geographic clustering of districts. The “Baltimore City” districts (pink) are clustered at the high end of the y-axis, representing a high rental burden at lower income levels. In contrast, “Montgomery County” (yellow) and “Prince George’s County” (brown) show a horizontal spread, suggesting that while their incomes vary significantly, their rent burden remains relatively high. This is likely due to the high cost of living in the D.C. metropolitan area.
An interesting pattern in this data is the variance among the middle-income districts. This may reflect the economic distortions of the COVID-19 pandemic. One possible explanation is that, during this period, Maryland saw a surge in remote work, which allowed higher-income residents to migrate to traditionally “affordable” regions, potentially driving up rents in those areas and decoupling rent burden from local median incomes. The inclusion of poverty as a third variable shows that housing stress is not strictly a low-income issue. The data suggests that COVID-19’s impact on the housing market may have inflated costs to the point that even higher-income households struggle to keep up with rent, while the burden on lower-income households grows worse.
Technical Challenges and Future Studies
One technical hurdle I faced was making the entire legend fit. With the ten distinct regions I created, the vertical legend initially pushed past the plot margins. I resolved this by shrinking the legend text and manually expanding the plot.margin on the right side.
If I were to expand this project, I would like to include data from before 2019, such as 2014–2018. Looking at this “pre-crisis” period would show whether housing burden followed a more predictable linear relationship with income at that time. Additionally, comparing data from 2024 to the present could reveal whether the trends we identified have become a permanent feature of Maryland’s economy, whether the market is beginning to stabilize, or whether different trends are emerging.
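The legend fix described above can be sketched with two theme() settings. This is an illustration only: the text size (7) and right margin (40 pt) are assumed values rather than the ones used in my plot, and the example uses ggplot2's built-in mpg data instead of the Maryland data.

```r
library(ggplot2)

# Demonstration on ggplot2's built-in mpg data (not the Maryland data).
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme(
    # Shrink the legend labels so a long vertical legend takes less room.
    legend.text = element_text(size = 7),
    # Widen only the right margin; the other sides use 5.5 pt.
    plot.margin = margin(t = 5.5, r = 40, b = 5.5, l = 5.5, unit = "pt")
  )
p
```

Both settings are adjusted by trial and error until the legend sits fully inside the figure at the chosen output size.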