project1

Author

jongpark

Project Overview

This project explores the intersection of public safety and city administrative actions in Baltimore City. By integrating towing records with police arrest data, I aim to investigate how criminal activity levels and demographics relate to vehicle impoundment patterns and the associated financial costs for citizens.

Data Source

The data used in this study is sourced from the Baltimore City Open Data portal (https://data.baltimorecity.gov/). Two primary datasets were merged:

1. Baltimore Towing Data: Records of all non-consensual tows within the city.
2. BPD Arrests Data: Records of arrests made by the Baltimore Police Department.

Research Questions

I plan to explore three main questions:

Is there a correlation between crime rates and towing counts by district?
Is there a correlation between theft counts and the offenders’ age by district?
Is there any correlation between district and the amount paid to retrieve towed cars?

Variable Definitions

To explore these relationships, I analyzed the following key variables:

1. District (Categorical): The 9 specific Baltimore Police Districts (e.g., Central, Northern, Southeast) where the incidents occurred.
2. TotalPaid (Quantitative): The total fee paid by the vehicle owner to retrieve their car from the impound lot (in USD).
3. Age (Quantitative): The age of individuals arrested within a specific district.
4. StolenVehicleFlag (Categorical): A binary indicator (1 for Yes, 0 for No) identifying if a towed vehicle was a recovered stolen vehicle.
5. Towing Count (Quantitative): The total volume of towing incidents aggregated by district.

Data Processing & Methodology

One of the core challenges of this project will be the inconsistency in address formatting between the two datasets. To ensure high data integrity, I will implement a multi-stage mapping process. I determined that successfully mapping at least 95% of the 38,176 total observations is essential for this project to be statistically meaningful and reliable. To achieve this goal, I plan to actively utilize specialized mapping software alongside AI-assisted tools to bridge the gap between the raw towing data and the BPD district standards.
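The 95% target can be made concrete with a small check. The sketch below is illustrative only: `mapping_rate()` is a hypothetical helper, and `toy_districts` is made-up data standing in for the `District_Mapped` column. It treats a record as mapped when its district label is neither missing nor empty, which matches the filters used later in the analysis.

```r
# Hypothetical helper: percentage of records with a usable district label.
mapping_rate <- function(district) {
  mapped <- !is.na(district) & district != ""
  mean(mapped) * 100
}

# Toy vector standing in for the District_Mapped column (not real data).
toy_districts <- c("Central", "Northern", NA, "", "Southeast",
                   "Western", "Eastern", "Central", "Southern", "Northwest")
rate <- mapping_rate(toy_districts)  # 8 of the 10 toy records are mapped

# Minimum number of mapped records needed to reach 95% of 38,176 observations.
min_mapped <- ceiling(0.95 * 38176)

print(c(toy_rate_pct = rate, min_mapped = min_mapped))
```

Applied to the real towing file, the same helper would take `towing_data$District_Mapped` as its input.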

Quarto

1. Is there a correlation between crime rates and towing counts by district?

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.5.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

towing_data <- readr::read_csv("Balt_Towing_mapped.csv")

Rows: 38176 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): `, PickupType, TowedFromLocation, Status, ReleaseDateTime, ReleaseT...
dbl (3): ew, StolenVehicleFlag, TotalPaid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

arrest_data <- readr::read_csv("BPD_Arrests.csv")

Rows: 22414 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): ArrestDateTime, ArrestLocation, IncidentLocation, ChargeDescription...
dbl (1): Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# I filter out NAs and empty strings to ensure the visualization only includes valid districts.
tow_sum <- towing_data %>% 
  group_by(District_Mapped) %>% 
  summarise(tow_count = n()) %>% 
  rename(District = District_Mapped) %>% 
  filter(!is.na(District) & District != "")
arrest_sum <- arrest_data %>% 
  group_by(District) %>% 
  summarise(arrest_count = n()) %>% 
  filter(!is.na(District) & District != "")

# Join the two datasets by District to compare towing activity with arrest volume.
analysis_1 <- tow_sum %>% inner_join(arrest_sum, by = "District")

# I created a manual 3x3 grid layout to represent Baltimore's geography roughly.
# This helps the audience visualize the spatial distribution more intuitively.
grid_layout <- data.frame(
  District = c("Northwest", "Northern", "Northeast", "Western", "Central", "Eastern", "Southwest", "Southern", "Southeast"),
  x = c(1, 2, 3, 1, 2, 3, 1, 2, 3), y = c(3, 3, 3, 2, 2, 2, 1, 1, 1))

plot_df <- analysis_1 %>% inner_join(grid_layout, by = "District")

# Visualization
# Number = towing count; color and size = arrest count
# Overlaying the towing count text allows for a direct comparison of the two variables.
ggplot(plot_df, aes(x = x, y = y)) +
  geom_point(aes(size = arrest_count, color = arrest_count), alpha = 1) +
  geom_text(aes(label = tow_count), color = "white", size = 5.5) +
  geom_text(aes(label = District), vjust = 3.5, size = 4.5) +
  scale_size_continuous(range = c(18, 37)) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_void() +
  theme(legend.position = "none") +
  xlim(0.5, 3.5) +
  ylim(0.5, 3.8) +
  labs(title = "Relationship between Towing Counts and Arrest Counts by District",
       subtitle = "Number: Towing Count | Color & Size: Arrest Count - Blue (Low) ~ Red (High)",
       caption = "https://data.baltimorecity.gov/",
       size = "Arrest Count", color = "Arrest Count") +
  theme(plot.title = element_text(hjust = 0.5, size = 16),
        plot.subtitle = element_text(hjust = 0.5))

# Linear Regression
# Performing a linear regression to statistically test if arrest volume predicts towing count.
model1 <- lm(tow_count ~ arrest_count, data = analysis_1)
summary(model1)


Call:
lm(formula = tow_count ~ arrest_count, data = analysis_1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1633.0  -456.9   261.0   562.0  1356.6 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -569.1129  1481.8183  -0.384   0.7123  
arrest_count    2.0146     0.6084   3.311   0.0129 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 945.1 on 7 degrees of freedom
Multiple R-squared:  0.6104,    Adjusted R-squared:  0.5547 
F-statistic: 10.97 on 1 and 7 DF,  p-value: 0.01292

plot(model1)

2. Is there a correlation between theft counts and the offenders’ age by district?

library(tidyverse)

towing_data2 <- readr::read_csv("Balt_Towing_mapped.csv")

Rows: 38176 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): `, PickupType, TowedFromLocation, Status, ReleaseDateTime, ReleaseT...
dbl (3): ew, StolenVehicleFlag, TotalPaid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

arrest_data2 <- readr::read_csv("BPD_Arrests.csv")

Rows: 22414 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): ArrestDateTime, ArrestLocation, IncidentLocation, ChargeDescription...
dbl (1): Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# In this section, my null hypothesis (H0) is that there is no correlation 
# between the average age of offenders and the vehicle theft rate in Baltimore districts.


#1. Age by district, Stolen rates by district
age_sum2 <- arrest_data2 %>% 
  filter(!is.na(Age) & Age > 0) %>% # Excluding invalid age records for accuracy.
  group_by(District) %>% 
  summarise(avg_age = mean(Age)) %>% 
  filter(District != "")
stolen_sum <- towing_data2 %>% 
  group_by(District_Mapped) %>%
  summarise(
    total_towed = n(),
    stolen_count = sum(StolenVehicleFlag == 1, na.rm = TRUE),
    stolen_rate = (stolen_count / total_towed) * 100
  ) %>%
  rename(District = District_Mapped) %>%
  filter(District != "")

analysis_2 <- age_sum2 %>% 
  inner_join(stolen_sum, by = "District")

# Visualization
# geom_smooth(method = "lm") is added to visualize the negative trend clearly.
ggplot(analysis_2, aes(x = avg_age, y = stolen_rate)) +
  geom_point(aes(color = District), size = 5) + 
  geom_smooth(method = "lm", color = "black", se = TRUE) + 
  geom_text(aes(label = District), vjust = -1.5, fontface = "bold") +
  theme_minimal() +
  labs(title = "Correlation: Avg Criminal Age vs. Vehicle Theft Rate",
       subtitle = "Does a younger criminal demographic correlate with higher auto theft?",
       caption = "https://data.baltimorecity.gov/",
       x = "Average Age of Arrested Individuals", y = "Vehicle Theft Rate (%)") +
  theme(plot.title = element_text(hjust = 0.5, size = 16),
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none")

`geom_smooth()` using formula = 'y ~ x'

# Linear Regression
# Testing the 'Stolen Rate ~ Age' model to see if the age variable is a significant predictor.
model2 <- lm(stolen_rate ~ avg_age, data = analysis_2)
summary(model2)


Call:
lm(formula = stolen_rate ~ avg_age, data = analysis_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.9722  -0.7223   0.9886   3.1017   4.9182 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  238.731     89.544   2.666   0.0322 *
avg_age       -6.451      2.765  -2.333   0.0524 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.962 on 7 degrees of freedom
Multiple R-squared:  0.4375,    Adjusted R-squared:  0.3572 
F-statistic: 5.445 on 1 and 7 DF,  p-value: 0.05235

plot(model2)

3. Is there any correlation between district and the amount paid to retrieve towed cars?

library(tidyverse)

towing_data3 <- readr::read_csv("Balt_Towing_mapped.csv")

Rows: 38176 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): `, PickupType, TowedFromLocation, Status, ReleaseDateTime, ReleaseT...
dbl (3): ew, StolenVehicleFlag, TotalPaid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Filtering
# The intention here is to focus on the typical range of retrieval costs ($0 - $1,000).
# Removing extreme outliers helps to visualize the median and distribution more clearly
plot_data3 <- towing_data3 %>% 
  filter(!is.na(TotalPaid) & TotalPaid > 0 & TotalPaid < 1000) %>% 
  filter(!is.na(District_Mapped) & District_Mapped != "")

# Visualization
# Using side-by-side boxplots to identify financial disparities between districts.
# outlier.color is set to red to make the extreme cases visible for the audience.
ggplot(plot_data3, aes(x = District_Mapped, y = TotalPaid, fill = District_Mapped)) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 1, alpha = 0.7) +
  labs(x = "District", y = "Paid amount ($)", 
       title = "Amount paid to retrieve towed cars, by district", 
       caption = "Source: https://data.baltimorecity.gov/") +
  theme_minimal() + theme(legend.position = "none") # Legend removed as x-axis labels identify districts.

This project investigates the link between Baltimore’s public safety data and city towing operations. Using datasets from Baltimore City Open Data, I analyzed variables like police districts, towing counts, offender ages, and retrieval costs to identify geographic and demographic patterns.

To ensure data integrity, I implemented a multi-stage mapping process. First, I established a baseline using Baltimore Police Department (BPD) district standards and mapped 5,000 initial addresses with the pro trial version of “Mapline” as a seed dataset. An AI model was then trained on this data to recognize and standardize the complex, inconsistent address patterns found in the raw towing records. After 15 cycles of iterative refinement and cross-validation between AI models, I mapped 38,030 of the 38,176 records to a district, a mapping rate of 99.62%, which exceeds the 95% target set at the start of the project.
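The reported rate follows directly from the two counts given above. This short check uses only those figures plus the 95% target stated in the methodology; the variable names are illustrative.

```r
# Counts reported in the text above.
total_records  <- 38176
mapped_records <- 38030

unmapped_records <- total_records - mapped_records             # records left without a district
mapped_pct <- round(mapped_records / total_records * 100, 2)   # share mapped, in percent
meets_target <- mapped_pct >= 95                                # compare with the 95% goal

print(list(unmapped = unmapped_records, mapped_pct = mapped_pct, meets_target = meets_target))
```

The 146 unmapped records are dropped by the `District != ""` and `!is.na()` filters in the analysis code.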

Spatial Relationship between Towing and Crime

The 3x3 grid visualization shows a spatial overlap between arrest counts and towing counts across Baltimore’s districts. Districts with more arrests, marked by larger and redder circles, tend to show higher towing counts. The regression supports this pattern: the slope for arrest count is positive and statistically significant (p = 0.0129), and arrest count explains about 61% of the variation in towing counts (R^2 = 0.6104). With only nine districts the estimate is imprecise, but it suggests that towing activity is not purely administrative and is tied to the city’s policing landscape.

Equation: TowingCount = -569.11 + 2.01*(ArrestCount)
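To show how the fitted equation is read, the sketch below plugs one arrest count into it and recovers the correlation implied by the model. The coefficients and R-squared are copied from the `summary(model1)` output; `predict_tows()` is a hypothetical helper, and the arrest count of 2,000 is an assumed value for illustration, not an observed district total.

```r
# Coefficients copied from the summary(model1) output.
intercept <- -569.1129
slope     <- 2.0146

# Hypothetical helper, not part of the original analysis.
predict_tows <- function(arrest_count) intercept + slope * arrest_count

# 2000 arrests is an assumed illustrative value, not a real district count.
pred_2000 <- predict_tows(2000)

# In simple linear regression, r is the square root of Multiple R-squared,
# carrying the sign of the slope.
r <- sign(slope) * sqrt(0.6104)

print(c(predicted_tows = pred_2000, r = r))
```

Each additional arrest is associated with about two additional tows on average. This describes an association across nine districts and does not establish that arrests cause tows.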

Age vs. Vehicle Theft Rate

The scatter plot compares the average age of arrested individuals with the share of towed vehicles flagged as stolen in each district. The downward-sloping regression line shows a negative association, suggesting that districts with a younger arrest population tend to have higher vehicle theft rates. However, the p-value of 0.052 is slightly above the conventional 0.05 threshold, so I fail to reject the null hypothesis. The adjusted R^2 of 0.357 (multiple R^2 of 0.4375) indicates that average age accounts for roughly 36% to 44% of the variation in theft rates between districts. With only nine districts, this is suggestive evidence that age matters, not a conclusive finding.

Equation: Stolen rate = 238.731 - 6.451*(average age)
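The borderline p-value can also be seen through a confidence interval for the slope. The sketch below uses the estimate, standard error, and residual degrees of freedom from the `summary(model2)` output; `predict_rate()` is a hypothetical helper, and the ages 33 and 34 are assumed values chosen only to illustrate a one-year difference.

```r
# Values copied from the summary(model2) output.
intercept <- 238.731
slope     <- -6.451
se_slope  <- 2.765
df_resid  <- 7

# Hypothetical helper, not part of the original analysis.
predict_rate <- function(avg_age) intercept + slope * avg_age

# Ages 33 and 34 are assumed illustrative values.
diff_one_year <- predict_rate(34) - predict_rate(33)

# 95% confidence interval for the slope: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df_resid)
ci <- slope + c(-1, 1) * t_crit * se_slope

print(list(diff_one_year = diff_one_year, slope_ci = ci))
```

The interval narrowly includes zero, which agrees with a p-value just above 0.05.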

Cost Distribution by District

The boxplot highlights differences in vehicle retrieval costs across the city. With the data restricted to payments below $1,000, the visualization shows that the Northeast district has a noticeably wider spread of paid amounts than other districts, indicating greater variability in costs. While most districts have relatively low median costs, the numerous high-value outliers suggest that retrieval fees are not distributed evenly across Baltimore. These variations may reflect differing regional policies or administrative practices, although this analysis does not formally test that explanation.
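The spread described here could be backed with numbers by computing the median and interquartile range per district. The sketch below uses base R on made-up values (`toy` is not real data) and applies the same filter as the plot, keeping payments between $0 and $1,000.

```r
# Toy data standing in for the towing records (values are invented for illustration).
toy <- data.frame(
  District_Mapped = c("Central", "Central", "Central",
                      "Northeast", "Northeast", "Northeast", "Northeast"),
  TotalPaid = c(130, 150, 170, 100, 300, 700, 2500)
)

# Same filter as the plot: keep payments above $0 and below $1,000.
kept <- toy[!is.na(toy$TotalPaid) & toy$TotalPaid > 0 & toy$TotalPaid < 1000, ]

# Median and interquartile range of the paid amount for each district.
med <- tapply(kept$TotalPaid, kept$District_Mapped, median)
iqr <- tapply(kept$TotalPaid, kept$District_Mapped, IQR)

print(data.frame(district = names(med), median = as.numeric(med), iqr = as.numeric(iqr)))
```

Replacing `toy` with `towing_data3` would give the actual figures. A larger interquartile range for a district corresponds to a taller box in the boxplot.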

In conclusion, this analysis suggests that towing in Baltimore is meaningfully connected to the city’s policing landscape. From the initial mapping challenges to the final visualizations, the results indicate that geographic location is related to both how often vehicles are towed and the financial burden of recovering them. I also believe that incorporating district-level data on living standards, such as rent and income levels, could have led to even more meaningful results. While we haven’t formally studied ANOVA in our current coursework, the visual evidence and regression models suggest that these patterns are systematic rather than random.