Project 1 Data 110

Author

Javier Mantilla

Introduction

This project analyzes crime data from the dataset “Violent Crime & Property Crime by Municipality (2000–Present).” The dataset includes multiple variables related to crime across different jurisdictions over time. The key variables used in this analysis are larceny/theft (larceny_theft) and burglary, also called breaking and entering (burglary), which are both quantitative variables measuring property crime. I also used the year and jurisdiction variables.

The purpose of this analysis is to explore whether there is a relationship between larceny/theft and burglary, focusing on the Baltimore jurisdiction. Specifically, this project investigates whether higher levels of larceny/theft are associated with higher levels of burglary and whether larceny/theft can be used as a predictor of burglary. Understanding this relationship can provide insight into patterns of property crime and how different types of offenses are connected. The dataset comes from maryland.gov.

Variables

burglary : The count of burglary crimes.
larceny_theft : The count of larceny crimes.
year : The year in which the crime occurred.
jurisdiction : The jurisdiction in which the crime occurred (filtered down to Baltimore only).

Research Question

What is the relationship between larceny/theft and burglary counts in Baltimore alone, and can larceny/theft be used to predict burglary levels in Baltimore?

Loading the Dataset and Libraries

library(readr)
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.1.0
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Documents/Data 110")
crime_data <- read_csv("Violent_Crime_&_Property_Crime_by_Municipality__2000_to_Present_20260323.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 4284 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): JURISDICTION, COUNTY, PERCENT CHANGE, VIOLENT CRIME PERCENT, VIOLE...
dbl  (4): YEAR, MURDER, RAPE, RAPE PER 100,000 PEOPLE
num (18): POPULATION, ROBBERY, AGG. ASSAULT, B & E, LARCENY THEFT, M/V THEFT...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(crime_data)
# A tibble: 6 × 32
  JURISDICTION COUNTY      YEAR POPULATION MURDER  RAPE ROBBERY `AGG. ASSAULT`
  <chr>        <chr>      <dbl>      <dbl>  <dbl> <dbl>   <dbl>          <dbl>
1 Galestown    Dorchester  2008        101      0     0       0              0
2 Goldsboro    Caroline    2014        247      0     0       0              0
3 Henderson    Caroline    1997         66      0     0       0              0
4 Keedysville  Washington  1990        464      0     0       0              0
5 Kitzmiller   Garrett     1996        275      0     0       0              0
6 Lonaconing   Allegany    1991       1122      0     0       0              0
# ℹ 24 more variables: `B & E` <dbl>, `LARCENY THEFT` <dbl>, `M/V THEFT` <dbl>,
#   `GRAND TOTAL` <dbl>, `PERCENT CHANGE` <chr>, `VIOLENT CRIME TOTAL` <dbl>,
#   `VIOLENT CRIME PERCENT` <chr>, `VIOLENT CRIME PERCENT CHANGE` <chr>,
#   `PROPERTY CRIME TOTALS` <dbl>, `PROPERTY CRIME PERCENT` <chr>,
#   `PROPERTY CRIME PERCENT CHANGE` <chr>,
#   `OVERALL CRIME RATE PER 100,000 PEOPLE` <dbl>,
#   `OVERALL PERCENT CHANGE PER 100,000 PEOPLE` <chr>, …

Cleaning the Dataset

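# Make every column name lowercase
names(crime_data) <- tolower(names(crime_data))
# Replace spaces with underscores
names(crime_data) <- gsub(" ","_",names(crime_data))
# Rename the breaking-and-entering column to "burglary"
names(crime_data) <- gsub("b_&_e","burglary",names(crime_data))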
head(crime_data)
# A tibble: 6 × 32
  jurisdiction county      year population murder  rape robbery agg._assault
  <chr>        <chr>      <dbl>      <dbl>  <dbl> <dbl>   <dbl>        <dbl>
1 Galestown    Dorchester  2008        101      0     0       0            0
2 Goldsboro    Caroline    2014        247      0     0       0            0
3 Henderson    Caroline    1997         66      0     0       0            0
4 Keedysville  Washington  1990        464      0     0       0            0
5 Kitzmiller   Garrett     1996        275      0     0       0            0
6 Lonaconing   Allegany    1991       1122      0     0       0            0
# ℹ 24 more variables: burglary <dbl>, larceny_theft <dbl>, `m/v_theft` <dbl>,
#   grand_total <dbl>, percent_change <chr>, violent_crime_total <dbl>,
#   violent_crime_percent <chr>, violent_crime_percent_change <chr>,
#   property_crime_totals <dbl>, property_crime_percent <chr>,
#   property_crime_percent_change <chr>,
#   `overall_crime_rate_per_100,000_people` <dbl>,
#   `overall_percent_change_per_100,000_people` <chr>, …
baltimore_crime <- crime_data |>
  select(year, burglary, larceny_theft, jurisdiction) |>
  filter(jurisdiction == "Baltimore")
head(baltimore_crime)
# A tibble: 6 × 4
   year burglary larceny_theft jurisdiction
  <dbl>    <dbl>         <dbl> <chr>       
1  1990    14867         36333 Baltimore   
2  1991    16394         40406 Baltimore   
3  1992    16503         41836 Baltimore   
4  1993    18076         42814 Baltimore   
5  1994    16026         43636 Baltimore   
6  1995    16705         46750 Baltimore   

Diagnostic Plot

ggplot(baltimore_crime, aes(x = larceny_theft, y = burglary)) +
  geom_point() + 
  geom_smooth( method = "lm")
`geom_smooth()` using formula = 'y ~ x'

This diagnostic plot shows an upward, positive trend.

Linear Regression Model
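
Before fitting the model, the strength of the linear trend can be summarized with a Pearson correlation. The sketch below is only an illustration. It uses the six Baltimore rows printed by head(baltimore_crime) above rather than the full data, so its value will not match the full model, and the name baltimore_preview is my own and not part of the original analysis.

```r
# Illustrative subset only: the six rows shown by head(baltimore_crime).
# baltimore_preview is a hypothetical name used just for this sketch.
baltimore_preview <- data.frame(
  year          = 1990:1995,
  burglary      = c(14867, 16394, 16503, 18076, 16026, 16705),
  larceny_theft = c(36333, 40406, 41836, 42814, 43636, 46750)
)

# Pearson correlation between the two property-crime counts
r_preview <- cor(baltimore_preview$larceny_theft, baltimore_preview$burglary)
r_preview
```

With the full dataset, the same call would be cor(baltimore_crime$larceny_theft, baltimore_crime$burglary).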

linear_crime <- lm( burglary ~ larceny_theft, data = baltimore_crime)
summary(linear_crime)

Call:
lm(formula = burglary ~ larceny_theft, data = baltimore_crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-1729.4  -822.8  -247.8   822.5  2406.9 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.119e+03  5.164e+02   2.167   0.0386 *  
larceny_theft 3.398e-01  1.803e-02  18.844   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1089 on 29 degrees of freedom
Multiple R-squared:  0.9245,    Adjusted R-squared:  0.9219 
F-statistic: 355.1 on 1 and 29 DF,  p-value: < 2.2e-16

Equation: Burglary = 1119 + 0.3398(Larceny/Theft)
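
To show how the fitted line turns a larceny/theft count into a predicted burglary count, here is a small sketch that uses the rounded coefficients printed by summary(linear_crime) (intercept 1.119e+03, slope 3.398e-01). The function name predict_burglary and the input value of 40000 are my own choices for illustration.

```r
# Rounded coefficients copied from the summary(linear_crime) output
intercept <- 1119
slope     <- 0.3398

# Hypothetical helper: predicted burglary count for a given larceny/theft count
predict_burglary <- function(larceny_theft) {
  intercept + slope * larceny_theft
}

predict_burglary(40000)
# 1119 + 0.3398 * 40000 = 14711
```

With the actual model object, predict(linear_crime, newdata = data.frame(larceny_theft = 40000)) should give nearly the same answer, differing only because the coefficients here are rounded.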

Visualization

ggplot(baltimore_crime, aes(x = larceny_theft, y = burglary, color = factor(year))) +
  geom_jitter(width = 0.5, height = 0.5, size = 2, alpha = 0.5) +  # jittered points
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = 2, linewidth = 0.8) + # single regression line
  labs(
    title = "Larceny vs Burglary Across All Years",
    x = "Larceny/Theft",
    y = "Burglary (Breaking & Entering)",
    color = "Year",
    caption = "Source: Violent Crime Dataset"
  ) +
  theme_minimal(base_size = 14, base_family = "serif")
`geom_smooth()` using formula = 'y ~ x'

General Conclusion

Cleaning Conclusion

The first thing I did to clean the dataset was to make all of the column names lowercase using tolower. Then, I used gsub to replace the spaces in the variable names with underscores. Lastly, one of my main variables was at that point named “b_&_e”. Because the ampersand is a special character, R would not recognize the column in select or filter unless the name was wrapped in backticks, which made it awkward to work with. So, I used gsub one more time to rename the column to “burglary”.
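
The cleaning steps can be sketched on a tiny made-up data frame with the same style of column names as the raw file. The names raw and clean are hypothetical, and the two rows are just the first Baltimore values shown earlier. A name like `B & E` is non-syntactic in R, so it can still be reached with backticks; renaming it simply avoids that extra typing.

```r
# Toy data frame mimicking the raw column names (check.names = FALSE keeps them as-is)
raw <- data.frame(
  JURISDICTION    = c("Baltimore", "Baltimore"),
  YEAR            = c(1990, 1991),
  `B & E`         = c(14867, 16394),
  `LARCENY THEFT` = c(36333, 40406),
  check.names = FALSE
)

# Before cleaning, the column is only reachable with backticks
raw$`B & E`

# Same three steps as in the cleaning section
clean <- raw
names(clean) <- tolower(names(clean))                 # lowercase
names(clean) <- gsub(" ", "_", names(clean))          # spaces -> underscores
names(clean) <- gsub("b_&_e", "burglary", names(clean), fixed = TRUE)  # rename

names(clean)
# "jurisdiction" "year" "burglary" "larceny_theft"
```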

Linear Regression Conclusion

The regression analysis shows a strong positive relationship between larceny/theft and burglary. This suggests that years with higher levels of larceny/theft in Baltimore also tend to have higher levels of burglary. The p-value for the model is less than 0.001 (reported as < 2.2e-16), indicating that the relationship is statistically significant. This means it is extremely unlikely that a relationship this strong would appear by random chance alone.

The adjusted R² value of 0.9219 (multiple R² of 0.9245) indicates that approximately 92% of the variation in burglary can be explained by larceny/theft. This is a very high value, suggesting a strong linear relationship between the two variables. Overall, the model demonstrates that larceny/theft is a highly effective predictor of burglary levels in Baltimore.
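
As a quick sanity check, the numbers in the summary(linear_crime) output can be tied together by hand. The sketch below only uses values printed in that output (multiple R² of 0.9245, 29 residual degrees of freedom, and a t value of 18.844 for larceny_theft), and the variable names are my own.

```r
# Values copied from the summary(linear_crime) output
r2       <- 0.9245   # multiple R-squared
df_resid <- 29       # residual degrees of freedom
t_slope  <- 18.844   # t value for larceny_theft

n <- df_resid + 2    # one predictor plus the intercept, so n = 31 years

# Adjusted R-squared for a model with one predictor
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)
round(adj_r2, 4)     # 0.9219

# With one predictor, the correlation is the square root of R-squared
# (positive here because the slope is positive)
r <- sqrt(r2)
round(r, 4)          # 0.9615

# With one predictor, the F-statistic equals the squared t value
round(t_slope^2, 1)  # 355.1
```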

Visualization Conclusion

For my visualization, I decided to make a scatterplot. Each point represents one year's burglary and larceny/theft counts for the Baltimore jurisdiction. I added a linear regression line so that we can see how each year lines up with the trend from the linear model. One thing that really surprised me about this graph was how strongly correlated the two variables are. I did not expect the upward trend to be so clear in my scatterplot, and the different colors make the graph very easy to interpret. The main pattern in my scatterplot is the upward trend and the correlation between the larceny_theft variable and the burglary (breaking and entering) variable.

Obstacles During Project 1

Something I wish I had been able to do for this project was to look at all of the jurisdictions and not just Baltimore. The dataset did not have many observations for any of the other jurisdictions, so in my first graphs the data was heavily skewed and very hard to interpret. In the future, I would also like to look at the relationship between violent crimes and property crimes across all jurisdictions. Lastly, I want to say that I had fun exploring the data and creating this project.