project3

Author

Jong-Geon Park

DATA110 Final Project

Image 1: https://www.redfin.com/blog/what-is-baltimore-md-known-for/
Image 2: https://en.wikipedia.org/wiki/Penn-North,_Baltimore
Image 3: https://the1a.org/segments/americas-most-corrupt-police-squad-baltimore-and-the-gun-trace-task-force/

Project Overview

The primary objective of this project is to identify whether crime density trends vary significantly across different Community Statistical Areas (CSAs) in Baltimore. After establishing these regional patterns, the project analyzes the specific influence of urban decay, measured primarily by vacant housing density. The investigation focuses on whether the burden of housing affordability drives crime directly, or whether it operates through a pathway of neighborhood abandonment (high vacancy rates) that ultimately compromises public safety.

Research Questions

I plan to explore these main questions:

How are severe urban decay (vacant buildings) and high crime density geographically distributed across Baltimore’s Community Statistical Areas (CSAs)?
Does housing affordability directly predict crime density, or is the relationship mediated by neighborhood abandonment (high vacancy rates)?
Based on the integrated data, which specific CSAs offer the optimal balance of economic stability and public safety for prospective residents?

Data Overview

The data used in this study is sourced from the official Baltimore City Open Data Portal (https://data.baltimorecity.gov/). To explore the underlying pathways between economic burden, urban decay, and crime, three primary datasets were collected and integrated:

Affordability Index (Rent by Community Statistical Area): This dataset provides the rent affordability index across Baltimore’s 55 Community Statistical Areas (CSAs). It serves as the primary measure of economic burden and includes the geospatial boundaries (Polygons) necessary for regional mapping.
BPD Arrests: This dataset contains incident-level records of arrests made by the Baltimore Police Department. It includes exact geographic coordinates (Latitude/Longitude) for each incident, representing the dependent variable of public safety.
Vacant Building Rehabs: This dataset tracks the locations of vacant, abandoned, or rehabilitating buildings across the city. Using its precise spatial coordinates, this data serves as the critical mediating variable representing physical urban decay.

+ Affordability Index Geolocation (geojson)

I also strengthened the linkage between these datasets through the steps described below.

Data Integrity & Methodology

To ensure the highest level of data accuracy and focus purely on the systemic relationship between spatial economics and public safety, I implemented a rigorous multi-stage data cleaning and integration process:

Ethical Filtering & Geolocation Accuracy: For the BPD Arrests dataset, I removed sensitive demographic variables (e.g., Gender and Race) to maintain ethical data practices, as the focus of this study is strictly geographic and economic. Furthermore, to ensure the validity of the spatial analysis, any records marked as “Address-unknown” or lacking precise geographic coordinates (Null Latitude/Longitude) were strictly eliminated.
Coordinate Standardization & Data Harmonization: During the initial data exploration, I identified a critical discrepancy in the Coordinate Reference Systems (CRS) of the raw datasets: some used a local projection (Maryland State Plane), while others used the global standard (WGS84). To prevent spatial distortion, I utilized Python’s geospatial libraries to reproject all coordinates into a standardized WGS84 (Latitude/Longitude) format. Following this spatial alignment, I programmatically resolved hidden text formatting issues (such as trailing whitespace) in the CSV headers so that they match the GeoJSON’s spatial key (CSA2010) exactly. This standardization allowed the spatial join to match every record without any data loss.
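The reprojection and header cleanup described above can be sketched in R with the sf package used elsewhere in this report. This is only an illustration, not the code that was actually run for the project: the sample point, the messy column names, and the choice of EPSG:2248 (NAD83 / Maryland State Plane, US feet) as the source CRS are all assumptions.

```r
library(sf)

# Hypothetical raw record: one point in State Plane feet (values are made up)
raw <- data.frame(x = 1420000, y = 590000, csa = "Toy Area")

# Simulate headers that carry stray whitespace, as can happen in exported CSVs
names(raw) <- c(" X ", "Y ", "CSA2010 ")

# Strip leading/trailing whitespace so the headers match the expected keys
names(raw) <- trimws(names(raw))

# Declare the source CRS (assumed here to be EPSG:2248), then reproject to WGS84
pts_state_plane <- st_as_sf(raw, coords = c("X", "Y"), crs = 2248)
pts_wgs84 <- st_transform(pts_state_plane, 4326)

# Inspect the result: the CRS code and the longitude/latitude pair
print(st_crs(pts_wgs84)$epsg)
print(st_coordinates(pts_wgs84))
```

Declaring the source CRS before transforming matters: assigning EPSG:4326 directly to State Plane coordinates would relabel the numbers without converting them.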

#1 How are vacant buildings and high crime density geographically distributed across Baltimore’s Community Statistical Areas (CSAs)?

https://public.tableau.com/app/profile/jong.geon.park/viz/BaltimoreMap_BPDarrestAffindexVacant/Sheet1

For the geospatial visualization in Tableau, I strategically selected colors to ensure both symbolic clarity and analytical precision. BPD arrests are mapped in blue, a color universally associated with law enforcement. Vacant buildings are represented in grey to evoke a sense of abandonment and inactivity. To provide underlying context without distracting from these primary data points, the Affordability Index was applied as a high-transparency green background.

This layered approach allowed for a clear identification of spatial trends. By overlaying these datasets, it becomes visually evident that clusters of vacant buildings and reported crime locations show a significant geographic overlap. To quantitatively verify these observations, I performed the following regression analysis.

#2-1 Data Integration Process

The goal of this process was to combine three different datasets into one master table to analyze the relationship between neighborhood economics and crime.

First, I loaded the Baltimore neighborhood map (Affordability_Index.geojson) along with the rent and arrest datasets. I removed any arrest records that were missing GPS coordinates to ensure accuracy. Then, I used a Spatial Join to link each individual crime point to its specific neighborhood on the map. After counting the total number of crimes for each area, I merged this information with the rent affordability data using the neighborhood names. This created a single dataset that includes geography, crime counts, and economic indicators.

*Since our curriculum has not yet covered the handling of JSON/GeoJSON file structures or complex Coordinate Reference System (CRS) transformations, I consulted AI for part of this work to ensure these technical steps were executed without data distortion.

library(sf)

Warning: package 'sf' was built under R version 4.5.3

Linking to GEOS 3.14.1, GDAL 3.12.1, PROJ 9.7.1; sf_use_s2() is TRUE

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.3

Warning: package 'ggplot2' was built under R version 4.5.3

Warning: package 'readr' was built under R version 4.5.3

Warning: package 'dplyr' was built under R version 4.5.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

baltimore_csa <- st_read("Affordability_Index.geojson")

Reading layer `Affordability_Index_-_Rent_-_Community_Statistical_Area' from data source `C:\Users\gun26\OneDrive\Desktop\data3\Affordability_Index.geojson' 
  using driver `GeoJSON'
Simple feature collection with 55 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -76.71141 ymin: 39.19724 xmax: -76.52968 ymax: 39.37201
Geodetic CRS:  WGS 84

df_arrests <- read.csv("BPD_Arrests.csv")
df_affordability <- read.csv("Affordability_Index.csv")

df_arrests <- df_arrests %>% 
  filter(!is.na(Longitude) & !is.na(Latitude))

arrests_sf <- st_as_sf(df_arrests, coords = c("Longitude", "Latitude"), crs = 4326)
joined_data <- st_join(arrests_sf, baltimore_csa, join = st_within)

csa_summary <- joined_data %>%
  group_by(CSA2010) %>% 
  summarise(Arrest_Count = n())

final_analysis_df <- left_join(csa_summary, df_affordability, by = c("CSA2010" = "Community.Statistical.Area..2010.."))

print(head(final_analysis_df))

Simple feature collection with 6 features and 11 fields
Geometry type: MULTIPOINT
Dimension:     XY
Bounding box:  xmin: -76.7109 ymin: 39.2007 xmax: -76.5299 ymax: 39.3551
Geodetic CRS:  WGS 84
# A tibble: 6 × 12
  CSA2010             Arrest_Count                  geometry OBJECTID X2015.2019
  <chr>                      <int>          <MULTIPOINT [°]>    <int>      <dbl>
1 Allendale/Irvingto…         1613 ((-76.6942 39.2756), (-7…        1       43.3
2 Beechfield/Ten Hil…          549 ((-76.6925 39.2971), (-7…        2       49.5
3 Belair-Edison               1839 ((-76.5555 39.3177), (-7…        3       60.0
4 Brooklyn/Curtis Ba…         3162 ((-76.5431 39.2102), (-7…        4       47.7
5 Canton                       256 ((-76.5868 39.284), (-76…        5       26.9
6 Cedonia/Frankford           1912 ((-76.5489 39.3171), (-7…        6       48.1
# ℹ 7 more variables: Community.Statistical.Area..2020. <chr>,
#   X2016.2020 <dbl>, X2017.2021 <dbl>, X2018.2022 <dbl>, X2019.2023 <dbl>,
#   Shape__Area <dbl>, Shape__Length <dbl>
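Because the merge above depends on neighborhood names matching exactly, a quick diagnostic for unmatched keys can be useful before trusting the joined table. The sketch below uses dplyr's anti_join() on small toy tables; the table names are hypothetical, and the trailing space in "Canton " is planted purely for demonstration.

```r
library(dplyr)

# Keys as they might appear in the boundary file
boundary_keys <- data.frame(
  CSA2010 = c("Canton", "Belair-Edison", "Cedonia/Frankford")
)

# Hypothetical CSV rows; "Canton " carries a planted trailing space
csv_rows <- data.frame(
  CSA2010 = c("Canton ", "Belair-Edison", "Cedonia/Frankford"),
  X2019.2023 = c(26.9, 60.0, 48.1)
)

# Rows of the boundary table that find no partner in the CSV
unmatched <- anti_join(boundary_keys, csv_rows, by = "CSA2010")
print(unmatched)
```

An empty result from a check like this is one way to back up a claim that every area matched during the merge.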

#2-2 Geospatial Data Integration & Analysis Workflow

The primary goal of this process was to construct a master analytical table by integrating three distinct datasets (housing affordability, criminal arrests, and vacant building rehabilitations) to identify how economic burden and urban decay influence public safety in Baltimore.

The workflow was executed in the following steps:

Data Loading & Preprocessing: I imported the neighborhood boundaries (GeoJSON) and the primary datasets (CSVs). I removed any records missing GPS coordinates and standardized the neighborhood names to ensure a 100% match during merging.
Spatial Join (Point-in-Polygon): Since arrests and vacant buildings are recorded as individual points, I used a Spatial Join to mathematically assign each incident to its respective neighborhood (CSA) on the map.
Data Aggregation & Normalization: I counted the total number of incidents for each area. To ensure a fair comparison regardless of a neighborhood’s physical size, I calculated Arrest Density and Vacant Density by dividing the counts by the geographic area.
Statistical Analysis & Visualization: I performed a Multiple Linear Regression to test the relationship between these variables and created an interactive scatter plot with a regression line to visualize the trends.
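The normalization step can be illustrated on a toy polygon. The report divides counts by the Shape__Area column; as an alternative sketch, the area can be computed from the geometry itself with st_area() after assigning a projected CRS, which makes the units explicit. The polygon, the count of 50, and the use of EPSG:26985 (NAD83 / Maryland, meters) are assumptions chosen for illustration only.

```r
library(sf)

# Toy neighborhood: a 1000 m x 1000 m square in a meter-based projected CRS
square <- st_polygon(list(rbind(
  c(0, 0), c(1000, 0), c(1000, 1000), c(0, 1000), c(0, 0)
)))
toy_csa <- st_sf(
  CSA2010 = "Toy Area",
  Arrest_Count = 50,
  geometry = st_sfc(square, crs = 26985)
)

# st_area() returns square meters for this CRS; convert to square kilometers
toy_csa$area_km2 <- as.numeric(st_area(toy_csa)) / 1e6

# Density expressed as arrests per square kilometer
toy_csa$Arrest_per_km2 <- toy_csa$Arrest_Count / toy_csa$area_km2
print(st_drop_geometry(toy_csa))
```

Expressing density per square kilometer keeps the values in a readable range, which avoids figures on the order of 1e-05.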

library(sf)
library(tidyverse)
library(shiny)

Warning: package 'shiny' was built under R version 4.5.3

library(ggplot2)
library(dplyr)

baltimore_csa <- st_read("Affordability_Index.geojson")

Reading layer `Affordability_Index_-_Rent_-_Community_Statistical_Area' from data source `C:\Users\gun26\OneDrive\Desktop\data3\Affordability_Index.geojson' 
  using driver `GeoJSON'
Simple feature collection with 55 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -76.71141 ymin: 39.19724 xmax: -76.52968 ymax: 39.37201
Geodetic CRS:  WGS 84

df_arrests <- read.csv("BPD_Arrests.csv")
df_vacants <- read.csv("Vacant_Building_Rehabs_Converted.csv")
df_affordability <- read.csv("Affordability_Index.csv")


# Standardize Affordability Index column name (to prevent errors)
# Overwrite the 2nd column as "CSA2010" to match the GeoJSON key and avoid special character issues.
# ------------------------------------------
colnames(df_affordability)[2] <- "CSA2010"

# Spatial Join and Aggregation of Arrests Data
# Remove incomplete records (NA Lat/Long) and convert to spatial object (sf)
# ------------------------------------------
df_arrests <- df_arrests %>% filter(!is.na(Longitude) & !is.na(Latitude))
arrests_sf <- st_as_sf(df_arrests, coords = c("Longitude", "Latitude"), crs = 4326)

# Identify which CSA each arrest point belongs to (st_within) and aggregate counts
arrests_summary <- st_join(arrests_sf, baltimore_csa, join = st_within) %>%
  group_by(CSA2010) %>%
  summarise(Arrest_Count = n()) %>%
  st_drop_geometry()

# Spatial Join and Aggregation of Vacant Building Data
# ------------------------------------------
df_vacants <- df_vacants %>% filter(!is.na(Longitude) & !is.na(Latitude))
vacants_sf <- st_as_sf(df_vacants, coords = c("Longitude", "Latitude"), crs = 4326)

vacants_summary <- st_join(vacants_sf, baltimore_csa, join = st_within) %>%
  group_by(CSA2010) %>%
  summarise(Vacant_Count = n()) %>%
  st_drop_geometry()

# Master Join
# ------------------------------------------
# Drop duplicate columns (Shape__Area, Shape__Length) to prevent ".x" and ".y" suffix errors.
df_affordability_clean <- df_affordability %>%
  select(-Shape__Area, -Shape__Length)

final_analysis_df <- baltimore_csa %>%
  left_join(df_affordability_clean, by = "CSA2010") %>%
  left_join(arrests_summary, by = "CSA2010") %>%
  left_join(vacants_summary, by = "CSA2010")

# Data Cleaning and Density Calculation
# ------------------------------------------
final_analysis_df <- final_analysis_df %>%
  mutate(
    # Replace NA values with 0 for areas with no reported arrests or vacants
    Arrest_Count = replace_na(Arrest_Count, 0),
    Vacant_Count = replace_na(Vacant_Count, 0),
    
    # Calculate density using standardized Shape__Area
    Arrest_Density = Arrest_Count / Shape__Area,
    Vacant_Density = Vacant_Count / Shape__Area
  )
head(final_analysis_df)

Simple feature collection with 6 features and 30 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -76.7112 ymin: 39.19724 xmax: -76.52969 ymax: 39.35562
Geodetic CRS:  WGS 84
  OBJECTID.x                           CSA2010 affordr10 affordr11 affordr12
1          1     Allendale/Irvington/S. Hilton  57.35865  53.27185  49.81489
2          2   Beechfield/Ten Hills/West Hills  49.63691  52.05539  43.27044
3          3                     Belair-Edison  60.55936  65.33958  44.53620
4          4 Brooklyn/Curtis Bay/Hawkins Point  48.62108  51.61522  38.75193
5          5                            Canton  40.65599  32.25309  38.20652
6          6                 Cedonia/Frankford  60.34902  62.78549  53.82775
  affordr13 affordr14 affordr15 affordr16 affordr17 affordr18 affordr19
1  44.08907   51.0082  49.36186  46.94220  45.89676  46.26437  43.29058
2  42.27171   48.3589  45.03367  42.61978  42.91548  43.21036  49.54955
3  41.05823   71.5012  69.87633  68.03914  68.14159  65.09544  60.01765
4  36.14551   53.0124  56.59743  50.67377  51.49360  52.46277  47.69175
5  32.86986   33.4202  30.25974  28.19325  28.25230  24.28181  26.85051
6  48.25789   58.8472  57.62279  53.62288  54.77651  50.64703  48.11299
                            CSA2020 affordr20 affordr21 affordr22 affordr23
1     Allendale/Irvington/S. Hilton  47.48573  50.52363  52.11888  45.59376
2   Beechfield/Ten Hills/West Hills  56.75294  66.36364  65.93291  69.20684
3                     Belair-Edison  63.10534  64.91477  63.34776  63.58569
4 Brooklyn/Curtis Bay/Hawkins Point  52.09647  61.64617  57.12199  56.58451
5                            Canton  26.66174  30.04967  24.36739  21.81146
6                 Cedonia/Frankford  46.69229  49.44476  51.62690  56.23432
  Shape__Area Shape__Length OBJECTID.y X2015.2019
1    63770462      38770.17          1   43.29058
2    47882528      37524.95          2   49.54955
3    44950030      31307.31          3   60.01765
4   176077743     150987.70          4   47.69175
5    15408538      23338.61          5   26.85051
6    71541340      39962.55          6   48.11299
  Community.Statistical.Area..2020. X2016.2020 X2017.2021 X2018.2022 X2019.2023
1     Allendale/Irvington/S. Hilton   47.48573   50.52363   52.11888   45.59376
2   Beechfield/Ten Hills/West Hills   56.75294   66.36364   65.93291   69.20684
3                     Belair-Edison   63.10534   64.91477   63.34776   63.58569
4 Brooklyn/Curtis Bay/Hawkins Point   52.09647   61.64617   57.12199   56.58451
5                            Canton   26.66174   30.04967   24.36739   21.81146
6                 Cedonia/Frankford   46.69229   49.44476   51.62690   56.23432
  Arrest_Count Vacant_Count                       geometry Arrest_Density
1         1613          353 MULTIPOLYGON (((-76.65726 3...   2.529384e-05
2          549           71 MULTIPOLYGON (((-76.69479 3...   1.146556e-05
3         1839          436 MULTIPOLYGON (((-76.56761 3...   4.091210e-05
4         3162          360 MULTIPOLYGON (((-76.58867 3...   1.795798e-05
5          256           61 MULTIPOLYGON (((-76.5714 39...   1.661417e-05
6         1912          262 MULTIPOLYGON (((-76.52972 3...   2.672581e-05
  Vacant_Density
1   5.535478e-06
2   1.482796e-06
3   9.699660e-06
4   2.044551e-06
5   3.958844e-06
6   3.662218e-06

# ------------------------------------------
# Regression Analysis
# ------------------------------------------
# Hypothesis: Higher rent burden (Affordability Index) will correlate with higher crime and vacant density.
# DV: Arrest_Density / IV: X2019.2023 (Affordability Index), Vacant_Density
regression_model <- lm(Arrest_Density ~ X2019.2023 + Vacant_Density, data = final_analysis_df)
print(summary(regression_model))


Call:
lm(formula = Arrest_Density ~ X2019.2023 + Vacant_Density, data = final_analysis_df)

Residuals:
       Min         1Q     Median         3Q        Max 
-4.662e-05 -2.099e-05 -1.390e-05  5.608e-06  2.031e-04 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.448e-05  2.649e-05   0.924    0.360    
X2019.2023     -4.294e-08  5.149e-07  -0.083    0.934    
Vacant_Density  3.548e+00  4.108e-01   8.636 1.28e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.081e-05 on 52 degrees of freedom
Multiple R-squared:  0.5933,    Adjusted R-squared:  0.5777 
F-statistic: 37.94 on 2 and 52 DF,  p-value: 6.915e-11

# Visualization
# ------------------------------------------
# Multiply density by 1,000,000 to improve readability (avoiding scientific notation/e-05)
plot_data <- final_analysis_df %>%
  mutate(
    Arrest_Vis = Arrest_Density * 1000000,
    Vacant_Vis = Vacant_Density * 1000000
  )

# User Interface (UI)
ui <- fluidPage(
  titlePanel("Baltimore Urban Safety Analysis Dashboard"),

  sidebarLayout(
    sidebarPanel(
      radioButtons(
        "x_var", "1. Select Independent Variable (X-axis):",
        choices = c(
          "Affordability Index (Rent Burden)" = "X2019.2023",
          "Vacant Building Density" = "Vacant_Density"
        ),
        selected = "X2019.2023"
      ),
      hr(),
      helpText("Data source: Open Baltimore Portal")
    ),

    mainPanel(
      plotOutput("trendPlot", click = "plot_click", height = "450px"),
      h4("Neighborhood Name (Click on points):"),
      verbatimTextOutput("clickInfo")
    )
  )
)

# Server Logic
server <- function(input, output, session) {
  filtered_data <- reactive({
    df <- final_analysis_df

    # When X is vacant density, exclude the extreme outlier (above 7e-05)
    if (input$x_var == "Vacant_Density") {
      df <- df %>% filter(Vacant_Density < 0.00007)
    }
    df
  })

  # Render the interactive scatter plot
  output$trendPlot <- renderPlot({
    # .data[[...]] replaces the deprecated aes_string()
    ggplot(filtered_data(), aes(x = .data[[input$x_var]], y = Arrest_Density)) +
      geom_point(aes(size = Vacant_Density, color = Vacant_Density), alpha = 0.6) +
      geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
      scale_color_viridis_c() +
      labs(
        title = "Drivers of Crime Density in Baltimore",
        caption = "Data Source: Open Baltimore Portal",
        y = "Arrest Density",
        x = ifelse(input$x_var == "X2019.2023",
                   "Affordability Index (Rent Burden)",
                   "Vacant Building Density"),
        color = "Vacant Density",
        size = "Vacant Density"
      ) +
      theme_minimal()
  })

  # Click logic: show the name of the neighborhood nearest to the click
  output$clickInfo <- renderPrint({
    click_data <- nearPoints(filtered_data(),
                             input$plot_click,
                             xvar = input$x_var,
                             yvar = "Arrest_Density")
    if (nrow(click_data) > 0) {
      print(click_data$CSA2010)
    } else {
      print("Click on a point to see the neighborhood name.")
    }
  })
}

shinyApp(ui = ui, server = server)

Shiny applications not supported in static R Markdown documents

#3-1 Linear Regression Analysis

The multiple linear regression model was conducted to examine the impact of housing affordability and urban decay (vacant buildings) on crime density.

Model Fit (Explanatory Power): The Adjusted R-squared is 0.5777, indicating that approximately 58% of the variation in arrest density across Baltimore neighborhoods can be explained by these two variables. This represents a strong and statistically significant model.

Multiple R-squared:  0.5933,    Adjusted R-squared:  0.5777
F-statistic: 37.94 on 2 and 52 DF,  p-value: 6.915e-11

2-1. Significant Predictor (Vacant Density): The variable [Vacant_Density] has a p-value of 1.28e-11 ***, which is far below the 0.05 threshold. This confirms that vacant housing is a critical and highly significant predictor of crime density. For every unit increase in vacant density, arrest density is expected to rise by approximately 3.55 units, holding other factors constant.

2-2. Non-Significant Predictor (Affordability Index): Interestingly, the Affordability Index [X2019.2023] yielded a p-value of 0.934, suggesting no direct statistical relationship with crime density in this model.

                 Estimate Std. Error t value Pr(>|t|)     
(Intercept)     2.448e-05  2.649e-05   0.924    0.360     
X2019.2023     -4.294e-08  5.149e-07  -0.083    0.934     
Vacant_Density  3.548e+00  4.108e-01   8.636 1.28e-11 ***
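The research question asks whether vacancy mediates the link between affordability and crime, which a single regression cannot fully answer on its own. One common way to probe this is a three-regression decomposition into total, direct, and indirect effects. The sketch below runs it on simulated data only: the variable names, sample size, and effect sizes are invented for illustration and say nothing about the Baltimore results.

```r
set.seed(110)
n <- 55

# Simulated toy data in which rent burden affects arrests only through vacancy
rent_burden <- rnorm(n, mean = 50, sd = 10)
vacancy <- 0.5 * rent_burden + rnorm(n, sd = 5)
arrests <- 2 * vacancy + rnorm(n, sd = 5)
toy <- data.frame(rent_burden, vacancy, arrests)

# Total effect of rent burden on arrests
fit_total <- lm(arrests ~ rent_burden, data = toy)
# Path a: rent burden -> vacancy
fit_a <- lm(vacancy ~ rent_burden, data = toy)
# Path b (vacancy) and direct effect (rent burden) estimated together
fit_b <- lm(arrests ~ rent_burden + vacancy, data = toy)

total <- unname(coef(fit_total)["rent_burden"])
direct <- unname(coef(fit_b)["rent_burden"])
indirect <- unname(coef(fit_a)["rent_burden"] * coef(fit_b)["vacancy"])

# For ordinary least squares, total equals direct plus indirect
print(c(total = total, direct = direct, indirect = indirect))
```

Applied to the real table, the analogous check would be whether the Affordability Index predicts Vacant_Density at all (path a), since without that link an indirect pathway is not supported.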

#3-2 Visualization (Shiny) Analysis

Dynamic Outlier Handling (Reactive Filtering): While coding, I identified a severe outlier in the Vacant Building Density metric (approaching 8e-05) that was disproportionately skewing the linear regression trend line.

To ensure the visualization accurately reflects the general trend of the broader neighborhood clusters, I implemented a reactive data filter within the Shiny server logic. When the user toggles the independent variable to [Vacant Building Density], the application applies a filter (Vacant_Density < 0.00007, i.e. 7e-05) to exclude this extreme case. This conditional exclusion prevents the single outlier from pulling the regression line artificially upward, resulting in a more reliable visual representation of the core data distribution. The complete dataset remains fully intact and unmanipulated when viewing the Affordability Index.
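The effect of a single extreme point on a fitted line can be checked numerically as well as visually. The sketch below uses simulated data, not the Baltimore table, with one planted high-leverage point; it compares the slope with and without that point and uses Cook's distance to flag it. All values are made up for illustration.

```r
set.seed(110)

# Twenty well-behaved points on a line, plus one extreme point far to the right
x <- c(1:20, 80)
y <- 3 * x + rnorm(21, sd = 2)
y[21] <- 400  # planted outlier, well above the underlying trend
toy <- data.frame(x, y)

# Fit with all points, then with the planted outlier removed
fit_all <- lm(y ~ x, data = toy)
fit_trim <- lm(y ~ x, data = toy[-21, ])

# Cook's distance measures how strongly each point influences the fit
cooks <- cooks.distance(fit_all)
most_influential <- unname(which.max(cooks))

print(most_influential)
print(c(slope_all = unname(coef(fit_all)["x"]),
        slope_trim = unname(coef(fit_trim)["x"])))
```

An influence measure like this gives a data-driven reason for the exclusion, instead of relying on a fixed cutoff chosen by eye.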

The relationship between rent burden and crime density shows no distinct pattern. The red dashed regression line is nearly flat (slope near zero), and the data points are widely scattered across the plot. Additionally, the broad grey confidence interval indicates a high level of uncertainty. This visually demonstrates that housing affordability alone does not have a direct, meaningful correlation with arrest density.

In contrast, the plot for vacant building density reveals a steep, upward-trending regression line. The data points cluster much more tightly along this path, indicating a strong positive correlation. The narrow confidence band further supports the reliability of this trend.

#4 Conclusion

Based on the regression analysis in R, I found that the financial status of residents (specifically the affordability index) showed no significant association with arrest density. In contrast, vacant building density was a strong and highly significant predictor. This leads to the conclusion that areas with higher vacancy rates are significantly more susceptible to criminal activity.

By overlaying these findings with Tableau geospatial visualizations, I was able to clearly identify which areas of Baltimore are relatively safe and which are high-risk for residents. Neighborhoods such as the Southeast coast (Fells Point and Canton), the South coast (Federal Hill and Locust Point), and Northern residential communities (Towson and Poplar Hill) consistently show low vacancy and crime rates, making them highly suitable for living. Conversely, my analysis indicates that areas like Midway/Coldstream, Clifton-Berea, Seton Hill, and Sandtown-Winchester/Harlem Park should be strictly avoided due to their high vacancy and crime density.

The city center (Inner Harbor) presents a unique case; it shows a high arrest density despite having a very low vacancy rate. This should be interpreted as a result of proactive policing and high public activity in a major commercial hub, rather than a symptom of systemic urban decay. Therefore, the Inner Harbor is presumed to be significantly safer than other clusters characterized by both high crime and high vacancy rates. Lastly, Druid Hill Park in the central region and the large-scale industrial complexes in the Southeast were excluded from these recommendations as they are primarily non-residential zones.