This project analyzes crime data from the dataset “Violent Crime & Property Crime by Municipality (2000–Present).” The dataset includes multiple variables related to crime across different jurisdictions over time. The key variables used in this analysis are larceny/theft (larceny_theft) and burglary, also called breaking and entering (burglary); both are quantitative counts of property crime. I also used the year and jurisdiction variables.
The purpose of this analysis is to explore whether there is a relationship between larceny/theft and burglary in Baltimore over time. Specifically, this project investigates whether higher levels of larceny/theft are associated with higher levels of burglary and whether larceny/theft can be used as a predictor of burglary. Understanding this relationship can provide insight into patterns of property crime and how different types of offenses are connected. The dataset comes from maryland.gov.
Variables
burglary: The count of burglary (breaking and entering) crimes.
larceny_theft: The count of larceny/theft crimes.
year: The year in which the crimes occurred.
jurisdiction: The jurisdiction in which the crimes occurred (filtered down to Baltimore only).
Research Question
What is the relationship between larceny/theft and burglary rates solely in Baltimore, and can larceny/theft be used to predict burglary levels in Baltimore?
Loading the Dataset and libraries
library(readr)
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ purrr 1.1.0
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 4284 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): JURISDICTION, COUNTY, PERCENT CHANGE, VIOLENT CRIME PERCENT, VIOLE...
dbl (4): YEAR, MURDER, RAPE, RAPE PER 100,000 PEOPLE
num (18): POPULATION, ROBBERY, AGG. ASSAULT, B & E, LARCENY THEFT, M/V THEFT...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
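The parsing warning and column-specification message above come from readr. As a minimal sketch of one way to make the import explicit, the example below declares column types up front and then inspects any parsing issues with `problems()`. The temporary file, the object names `csv_path`, `crime_raw`, and `parse_issues`, and all of the values are assumptions for illustration; only the header names are taken from the output above.

```r
library(readr)

# Hypothetical stand-in for the downloaded CSV; the rows below are made-up values.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "JURISDICTION,YEAR,B & E,LARCENY THEFT",
  "Baltimore,2000,\"11,000\",\"40,000\"",
  "Baltimore,2001,\"10,500\",\"38,500\""
), csv_path)

# Declaring the column types avoids type guessing and quiets the
# column-specification message. col_number() accepts values with
# thousands separators such as "11,000".
crime_raw <- read_csv(
  csv_path,
  col_types = cols(
    JURISDICTION = col_character(),
    YEAR = col_integer(),
    `B & E` = col_number(),
    `LARCENY THEFT` = col_number()
  )
)

# problems() returns one row per value that could not be parsed.
parse_issues <- problems(crime_raw)
nrow(parse_issues)
```

With the real file, any rows listed by `problems()` would be worth checking before modeling, since values that fail to parse are read in as NA.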
ggplot(baltimore_crime, aes(x = larceny_theft, y = burglary, color = factor(year))) +
  geom_jitter(width = 0.5, height = 0.5, size = 2, alpha = 0.5) +  # jittered points
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = 2, linewidth = 0.8) +  # single regression line
  labs(
    title = "Larceny vs Burglary Across All Years",
    x = "Larceny/Theft",
    y = "Burglary (Breaking & Entering)",
    color = "Year",
    caption = "Source: Violent Crime Dataset"
  ) +
  theme_minimal(base_size = 14, base_family = "serif")
`geom_smooth()` using formula = 'y ~ x'
General Conclusion
Cleaning Conclusion
The first thing I did to clean the dataset was to make all of the column names lowercase using gsub. Then I used gsub again to replace the spaces in the variable names with underscores. Lastly, one of my main variables was originally named “b_&_e”. Because the ampersand is a special character, I was not able to filter or select on that column. So I used gsub one last time to rename the column to “burglary”.
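The cleaning steps described above can be sketched as follows. This is an illustrative reconstruction, not the original code: the toy data frame, its values, and the exact gsub patterns are assumptions, and only the column names and the `baltimore_crime` object name come from the report.

```r
# Toy data frame with headers shaped like the raw file (values are made up).
crime <- data.frame(
  JURISDICTION = c("Baltimore", "Baltimore", "Elsewhere"),
  YEAR = c(2000, 2001, 2000),
  `B & E` = c(100, 90, 5),
  `LARCENY THEFT` = c(400, 380, 20),
  check.names = FALSE  # keep the spaces and ampersand in the names
)

# 1. Lowercase every capital letter in the column names.
#    With perl = TRUE, \\L converts the rest of the replacement to lower case.
names(crime) <- gsub("([A-Z])", "\\L\\1", names(crime), perl = TRUE)

# 2. Replace spaces with underscores: "b & e" becomes "b_&_e".
names(crime) <- gsub(" ", "_", names(crime))

# 3. Rename the awkward column; fixed = TRUE matches the text literally.
names(crime) <- gsub("b_&_e", "burglary", names(crime), fixed = TRUE)

# Keep only the Baltimore rows ("Baltimore" is a placeholder label here).
baltimore_crime <- crime[crime$jurisdiction == "Baltimore", ]
names(baltimore_crime)
```

As an aside, `tolower(names(crime))` would handle the first step more directly, and a column with a special character in its name can still be referenced by wrapping the name in backticks.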
Linear Regression Conclusion
The regression analysis shows a strong positive relationship between larceny/theft and burglary. This suggests that years with higher levels of larceny/theft in Baltimore also tend to have higher levels of burglary. The p-value for the model is less than 0.001, indicating that the relationship is statistically significant. This means a relationship this strong would be extremely unlikely to appear by random chance if the two variables were not actually related.
The adjusted R² value of 0.9848 indicates that approximately 98.48% of the variation in burglary can be explained by larceny/theft. This is an exceptionally high value, suggesting a very strong relationship between the two variables. Overall, the model demonstrates that larceny/theft is a highly effective predictor of burglary levels in Baltimore.
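For readers who want to see where numbers like these come from, here is a minimal sketch of fitting a simple linear regression and pulling out the slope, adjusted R², and p-value. The data frame below is an invented stand-in for the Baltimore subset, so its results will not match the values reported above; the object names `model`, `model_summary`, `slope`, `adj_r2`, and `slope_p` are also assumptions.

```r
# Toy stand-in for the Baltimore subset; values are invented for illustration only.
baltimore_crime <- data.frame(
  year = 2000:2005,
  larceny_theft = c(400, 380, 350, 330, 300, 290),
  burglary = c(102, 95, 90, 82, 77, 72)
)

# Fit burglary as a linear function of larceny/theft.
model <- lm(burglary ~ larceny_theft, data = baltimore_crime)
model_summary <- summary(model)

# Slope: expected change in burglary for one additional larceny/theft.
slope <- coef(model)[["larceny_theft"]]

# Adjusted R-squared: share of variation in burglary explained by the model.
adj_r2 <- model_summary$adj.r.squared

# p-value for the slope, taken from the coefficient table.
slope_p <- coef(model_summary)["larceny_theft", "Pr(>|t|)"]

# Predicted burglary count for a hypothetical larceny/theft count of 360.
predict(model, newdata = data.frame(larceny_theft = 360))
```

In a model with a single predictor, the p-value for the slope is the same as the overall model p-value that `summary()` prints.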
Visualization Conclusion
For my visualization, I decided to make a scatterplot. The scatterplot shows the burglary and larceny/theft values for each year in the Baltimore jurisdiction. I added a linear regression line because it shows how closely each year follows the trend from the linear model. One thing that really surprised me about this graph was how strongly correlated the two variables are. I did not expect the upward trend to be so clear in my scatterplot, and the different colors make the graph very easy to interpret. The main pattern in my scatterplot is the upward trend, which reflects the correlation between the larceny_theft variable and the burglary (breaking and entering) variable.
Obstacles During Project 1
One thing I wish I could have done in this project is look at all of the jurisdictions and not just Baltimore. The dataset did not have many observations for any of the other jurisdictions, so in my first graphs the data was skewed and very hard to interpret. I would also like to look at the relationship between violent crimes and property crimes across all jurisdictions. Lastly, I want to say that I had fun exploring the data and creating this project.