library(tidyverse)
library(ggfortify)
library(ggplot2)
library(highcharter)
library(scales)
library(RColorBrewer)
Violent Crimes
Introduction
For my final project, I chose the data set “Violent Crime & Property Crime by Municipality 2000 to Present.” The data was originally collected by the Maryland Statistical Analysis Center (MSAC). I chose this topic because I am still new to Maryland, and understanding crime patterns helps me learn more about the different communities here. I am also interested in crime data sets in general. This data set has 4,284 observations and 32 variables.
Variables
| Variable Name | Meaning | Data type |
|---|---|---|
| Jurisdiction | The Maryland municipality where the crime data was reported | categorical |
| Year | The year the crime counts were recorded | numerical |
| Population | Total population of the jurisdiction in that year | numerical |
| violent_crime_rate_per_100_000_people | Violent crimes per 100,000 population | numerical |
| property_crime_rate_per_100_000_people | Property crimes per 100,000 population | numerical |
How the data was collected
The crime data in this data set are collected each year by local police departments and reported to the Maryland State Police through the FBI’s Uniform Crime Reporting (UCR) Program. MSAC then compiles these numbers by municipality and year. The data set does not include a detailed ReadMe file, so no information is available about specific cleaning steps or how missing data were handled.
Research Question
What factors best predict violent crime rate across Maryland municipalities?
Source
Maryland Statistical Analysis Center (MSAC), within the Governor’s Office of Crime Control and Prevention (GOCCP)
Loading the libraries
Loading the data set
violent <- read_csv("Violent_Crime_&_Property_Crime_by_Municipality__2000_to_Present_20251203.csv")
To look at the structure and first 6 rows
#str(violent)
head(violent)
# A tibble: 6 × 32
JURISDICTION COUNTY YEAR POPULATION MURDER RAPE ROBBERY `AGG. ASSAULT`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Galestown Dorchester 2008 101 0 0 0 0
2 Goldsboro Caroline 2014 247 0 0 0 0
3 Henderson Caroline 1997 66 0 0 0 0
4 Keedysville Washington 1990 464 0 0 0 0
5 Kitzmiller Garrett 1996 275 0 0 0 0
6 Lonaconing Allegany 1991 1122 0 0 0 0
# ℹ 24 more variables: `B & E` <dbl>, `LARCENY THEFT` <dbl>, `M/V THEFT` <dbl>,
# `GRAND TOTAL` <dbl>, `PERCENT CHANGE` <chr>, `VIOLENT CRIME TOTAL` <dbl>,
# `VIOLENT CRIME PERCENT` <chr>, `VIOLENT CRIME PERCENT CHANGE` <chr>,
# `PROPERTY CRIME TOTALS` <dbl>, `PROPERTY CRIME PERCENT` <chr>,
# `PROPERTY CRIME PERCENT CHANGE` <chr>,
# `OVERALL CRIME RATE PER 100,000 PEOPLE` <dbl>,
# `OVERALL PERCENT CHANGE PER 100,000 PEOPLE` <chr>, …
Cleaning the data set
names(violent) <- tolower(names(violent))
names(violent) <- gsub(" ","_",names(violent))
names(violent) <- gsub("[.]", "", names(violent))
names(violent) <- gsub("[/,&]","_" ,names(violent))
head(violent)
# A tibble: 6 × 32
jurisdiction county year population murder rape robbery agg_assault b___e
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Galestown Dorchest… 2008 101 0 0 0 0 0
2 Goldsboro Caroline 2014 247 0 0 0 0 0
3 Henderson Caroline 1997 66 0 0 0 0 0
4 Keedysville Washingt… 1990 464 0 0 0 0 2
5 Kitzmiller Garrett 1996 275 0 0 0 0 0
6 Lonaconing Allegany 1991 1122 0 0 0 0 0
# ℹ 23 more variables: larceny_theft <dbl>, m_v_theft <dbl>, grand_total <dbl>,
# percent_change <chr>, violent_crime_total <dbl>,
# violent_crime_percent <chr>, violent_crime_percent_change <chr>,
# property_crime_totals <dbl>, property_crime_percent <chr>,
# property_crime_percent_change <chr>,
# overall_crime_rate_per_100_000_people <dbl>,
# overall_percent_change_per_100_000_people <chr>, …
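To make the renaming steps easier to follow, here is a small sketch that applies the same four calls to a few of the raw column names shown in the first `head()` output. The vector names `raw_names` and `clean_names` are only illustrative and are not part of the project code.

```r
# A few column names copied from the raw header printed earlier
raw_names <- c("AGG. ASSAULT", "B & E", "M/V THEFT",
               "OVERALL CRIME RATE PER 100,000 PEOPLE")

clean_names <- tolower(raw_names)                # lower-case everything
clean_names <- gsub(" ", "_", clean_names)       # spaces -> underscores
clean_names <- gsub("[.]", "", clean_names)      # drop periods
clean_names <- gsub("[/,&]", "_", clean_names)   # slash, comma, ampersand -> underscores

print(clean_names)
# "agg_assault" "b___e" "m_v_theft" "overall_crime_rate_per_100_000_people"
```

The triple underscore in `b___e` comes from the two spaces around the ampersand plus the ampersand itself, each of which becomes one underscore.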
Checking NAs
colSums(is.na(violent))
jurisdiction                                             0
county                                                   0
year                                                     0
population                                               0
murder                                                   0
rape                                                     0
robbery                                                  0
agg_assault                                              0
b___e                                                    0
larceny_theft                                            0
m_v_theft                                                0
grand_total                                              0
percent_change                                         149
violent_crime_total                                      0
violent_crime_percent                                    0
violent_crime_percent_change                           149
property_crime_totals                                    0
property_crime_percent                                   0
property_crime_percent_change                          149
overall_crime_rate_per_100_000_people                    1
overall_percent_change_per_100_000_people              149
violent_crime_rate_per_100_000_people                    0
violent_crime_rate_percent_change_per_100_000_people   149
property_crime_rate_per_100_000_people                   0
property_crime_rate_percent_change_per_100_000_people  149
murder_per_100_000_people                                0
rape_per_100_000_people                                  2
robbery_per_100_000_people                               0
agg_assault_per_100_000_people                           0
b___e_per_100_000_people                                 0
larceny_theft_per_100_000_people                         0
m_v_theft_per_100_000_people                             0
There are no NAs in variables that I am going to use for my project.
Removing unused columns
violent_clean <- violent|>
select(-agg_assault,-b___e,
-larceny_theft,-m_v_theft,-b___e_per_100_000_people,-larceny_theft_per_100_000_people,-m_v_theft_per_100_000_people)
head(violent_clean)
# A tibble: 6 × 25
jurisdiction county year population murder rape robbery grand_total
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Galestown Dorchester 2008 101 0 0 0 0
2 Goldsboro Caroline 2014 247 0 0 0 0
3 Henderson Caroline 1997 66 0 0 0 0
4 Keedysville Washington 1990 464 0 0 0 4
5 Kitzmiller Garrett 1996 275 0 0 0 0
6 Lonaconing Allegany 1991 1122 0 0 0 0
# ℹ 17 more variables: percent_change <chr>, violent_crime_total <dbl>,
# violent_crime_percent <chr>, violent_crime_percent_change <chr>,
# property_crime_totals <dbl>, property_crime_percent <chr>,
# property_crime_percent_change <chr>,
# overall_crime_rate_per_100_000_people <dbl>,
# overall_percent_change_per_100_000_people <chr>,
# violent_crime_rate_per_100_000_people <dbl>, …
Selecting the columns that I need for regression model
violent_model <- violent_clean |>
select(jurisdiction,year,county,population,violent_crime_rate_per_100_000_people,property_crime_rate_per_100_000_people)
head(violent_model)
# A tibble: 6 × 6
jurisdiction year county population violent_crime_rate_per_100_000_people
<chr> <dbl> <chr> <dbl> <dbl>
1 Galestown 2008 Dorchester 101 0
2 Goldsboro 2014 Caroline 247 0
3 Henderson 1997 Caroline 66 0
4 Keedysville 1990 Washington 464 0
5 Kitzmiller 1996 Garrett 275 0
6 Lonaconing 1991 Allegany 1122 0
# ℹ 1 more variable: property_crime_rate_per_100_000_people <dbl>
Why I chose these predictors
To explore potential predictors, I examined a correlation heatmap of all numeric variables. The plot showed very strong correlations between violent crime rate and individual crime categories such as robbery and aggravated assault. However, these variables are components of violent crime itself and therefore were not appropriate predictors for the final model.
Property crime rate showed a strong positive relationship with violent crime rate while remaining independent of the violent crime definition. Population also showed a moderate relationship and helps control for jurisdiction size. Year was included to account for overall trends over time, even though its correlation was weaker. Based on this exploration, property crime rate, population, and year were selected as the most appropriate predictors.
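The heatmap code is not shown here, so the following is only a sketch of how the underlying correlation matrix could be computed. It uses a small made-up data frame (`toy`, with shortened column names) rather than the real data; with the project data, the cleaned data frame would take its place.

```r
# Toy stand-in for the cleaned data; the rows are illustrative, not real records
toy <- data.frame(
  county        = c("Dorchester", "Caroline", "Harford", "Harford",
                    "Baltimore City", "Carroll"),
  population    = c(101, 247, 13087, 14048, 658023, 5469),
  year          = c(2008, 2014, 1990, 2001, 2005, 2010),
  property_rate = c(0, 0, 5200, 4800, 6489, 2117),
  violent_rate  = c(0, 0, 657, 797, 2050, 261)
)

# Keep only the numeric columns, then compute pairwise correlations
numeric_cols <- toy[sapply(toy, is.numeric)]
cor_mat <- cor(numeric_cols, use = "pairwise.complete.obs")

# Correlation of every numeric variable with the violent crime rate
round(cor_mat["violent_rate", ], 2)
```

Reading one row of the matrix this way gives the same information as one row of a heatmap: which candidate predictors move most closely with the outcome.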
Multiple Linear Regression Model
fit1 <- lm(violent_crime_rate_per_100_000_people~ population + year + property_crime_rate_per_100_000_people, data = violent_model)
autoplot(fit1, 1:4, nrow = 2, ncol = 2) ## To look at the diagnostic plots
summary(fit1)
Call:
lm(formula = violent_crime_rate_per_100_000_people ~ population +
year + property_crime_rate_per_100_000_people, data = violent_model)
Residuals:
Min 1Q Median 3Q Max
-2412.2 -190.0 -62.7 127.3 7059.0
Coefficients:
                                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                             8.316e+02  1.392e+03   0.597     0.55
population                              1.754e-03  1.063e-04  16.500   <2e-16 ***
year                                   -3.840e-01  6.936e-01  -0.554     0.58
property_crime_rate_per_100_000_people  1.274e-01  2.094e-03  60.852   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 389.2 on 4280 degrees of freedom
Multiple R-squared: 0.5186, Adjusted R-squared: 0.5182
F-statistic: 1537 on 3 and 4280 DF, p-value: < 2.2e-16
Linear Regression Analysis
The multiple linear regression model predicts the violent crime rate per 100,000 people across Maryland municipalities based on population, year, and property crime rate per 100,000 people.
Model Equation
Because this is a multiple linear regression, the model follows the form:
y = a + b1x1 + b2x2 + b3x3
Where:
y = violent crime rate per 100,000 people
a = intercept
b1 = effect of population (x1)
b2 = effect of year (x2)
b3 = effect of property crime rate (x3)
Based on the model output, the fitted equation is:
violent_crime_rate_per_100_000_people = 831.6 + 0.001754(population) - 0.384(year) + 0.1274(property_crime_rate_per_100_000_people)
This means that:
Violent crime rate increases slightly as population increases.
Violent crime rate increases as property crime rate increases.
Year has little effect on violent crime rate.
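As a worked example of the fitted equation, the sketch below plugs in a hypothetical municipality. The function name `predict_violent_rate` and the input values are made up for illustration; the coefficients are the rounded ones from the model output.

```r
# Coefficients copied (rounded) from the summary(fit1) output
predict_violent_rate <- function(population, year, property_rate) {
  831.6 + 0.001754 * population - 0.384 * year + 0.1274 * property_rate
}

# Hypothetical municipality: 14,000 residents, year 2005,
# property crime rate of 4,000 per 100,000 people
predict_violent_rate(population = 14000, year = 2005, property_rate = 4000)
# about 595.8 violent crimes per 100,000 people

# Effect of 10,000 more residents, holding the other inputs fixed
0.001754 * 10000   # 17.54

# Effect of 100 more property crimes per 100,000 people
0.1274 * 100       # 12.74
```

The last two lines show why the population effect is described as small: a difference of 10,000 residents shifts the prediction by about as much as a difference of roughly 140 property crimes per 100,000 people.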
P-value and Adjusted r-squared analysis
The property crime rate per 100,000 people is a very strong and statistically significant predictor of violent crime rate (p < 2e-16). This indicates that municipalities with higher property crime rates tend to also experience higher violent crime rates.
Population is also statistically significant (p < 2e-16), showing that jurisdictions with larger populations generally have slightly higher violent crime rates. The effect size is small: holding the other predictors constant, each additional 1,000 residents is associated with about 1.75 more violent crimes per 100,000 people.
The variable year has a high p-value (p = 0.58), suggesting that once population and property crime rate are accounted for, year does not have a statistically significant effect on violent crime rate.
The adjusted r-squared value for this model is 0.5182, meaning the model explains approximately 51.8% of the variation in violent crime rate across Maryland municipalities. The overall model is statistically significant (p < 2.2e-16).
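The adjusted value can be checked by hand from the multiple R-squared using the standard adjustment formula. This is a small sketch with the rounded numbers from the output, so the result only matches the reported value to about three decimal places.

```r
r_squared <- 0.5186  # Multiple R-squared from the model output (rounded)
n <- 4284            # number of observations
p <- 3               # number of predictors

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 3)
```

Here n - p - 1 equals 4280, which is the residual degrees of freedom reported in the output. Because n is so large compared with p, the adjustment barely changes the value.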
Diagnostic plots
Residuals vs Fitted: The residuals show a slight curve and increasing spread at higher fitted values, suggesting mild nonlinearity and heteroscedasticity. However, the pattern is not extreme.
Normal Q-Q plot: Most points follow the diagonal line, indicating that residuals are approximately normally distributed, though some extreme values deviate from normality.
Scale-Location plot: The spread of residuals increases as fitted values increase, indicating mild heteroscedasticity.
Cook’s Distance: A few influential observations (such as 2531 and 3017) stand out. These likely represent municipalities with unusually high crime levels and may have some influence on the model.
Overall, while the regression assumptions are not perfectly met, the model is appropriate for exploratory analysis and provides meaningful insight into factors associated with violent crime rates.
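Influential observations like the ones named in the Cook's distance plot can also be listed numerically. The sketch below uses simulated data with one deliberately extreme point, not the project data, and the 4 / n cutoff is a common rule of thumb assumed here rather than a threshold taken from this analysis.

```r
set.seed(1)

# Simulated data: a clean linear trend plus one extreme observation
toy <- data.frame(x = 1:30)
toy$y <- 2 * toy$x + rnorm(30)
toy$y[30] <- 200   # deliberately far above the trend

toy_fit <- lm(y ~ x, data = toy)
cooks <- cooks.distance(toy_fit)

# Rule of thumb (an assumption, not a hard cutoff): flag values above 4 / n
flagged <- which(cooks > 4 / nrow(toy))
flagged
```

With the real model, the same two calls on `fit1` would return the row numbers to inspect, which could then be looked up in the model data to see which municipalities and years they correspond to.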
Grouping the year variable into 3 categories
violent_model <- violent_model |>
mutate(year_group = case_when(
year >= 2000 & year <= 2007 ~ "2000-2007",
year >= 2008 & year <= 2015 ~ "2008-2015",
year >= 2016 ~ "2016-Present",
TRUE ~ "other"
))
violent_model
# A tibble: 4,284 × 7
jurisdiction year county population violent_crime_rate_per_100_000_peo…¹
<chr> <dbl> <chr> <dbl> <dbl>
1 Galestown 2008 Dorchester 101 0
2 Goldsboro 2014 Caroline 247 0
3 Henderson 1997 Caroline 66 0
4 Keedysville 1990 Washington 464 0
5 Kitzmiller 1996 Garrett 275 0
6 Lonaconing 1991 Allegany 1122 0
7 Aberdeen 1990 Harford 13087 657.
8 Aberdeen 1991 Harford 13301 752.
9 Aberdeen 1992 Harford 13432 715.
10 Aberdeen 1993 Harford 13703 1022.
# ℹ 4,274 more rows
# ℹ abbreviated name: ¹violent_crime_rate_per_100_000_people
# ℹ 2 more variables: property_crime_rate_per_100_000_people <dbl>,
# year_group <chr>
I created these groups only for the years 2000 to present, and I labeled every year before 2000 as "other".
Removing others in the year group
violent_model2 <- violent_model |>
filter(year_group != "other")
violent_model2
# A tibble: 2,965 × 7
jurisdiction year county population violent_crime_rate_per_100_000_peo…¹
<chr> <dbl> <chr> <dbl> <dbl>
1 Galestown 2008 Dorchester 101 0
2 Goldsboro 2014 Caroline 247 0
3 Aberdeen 2000 Harford 13842 607.
4 Aberdeen 2001 Harford 14048 797.
5 Aberdeen 2002 Harford 14264 883.
6 Aberdeen 2003 Harford 14148 643.
7 Aberdeen 2004 Harford 14311 692.
8 Aberdeen 2005 Harford 14312 1111
9 Aberdeen 2006 Harford 14133 722.
10 Aberdeen 2007 Harford 14187 811.
# ℹ 2,955 more rows
# ℹ abbreviated name: ¹violent_crime_rate_per_100_000_people
# ℹ 2 more variables: property_crime_rate_per_100_000_people <dbl>,
# year_group <chr>
I did this because I am only concerned with the years from 2000 to present.
Visualization 1
ggplot(violent_model2,aes(x=property_crime_rate_per_100_000_people,
y=violent_crime_rate_per_100_000_people)) +
geom_point(aes(color = year_group),alpha =0.3)+
geom_smooth(method = "lm",se = FALSE , color ="black",linewidth= 0.5,linetype = "dashed")+ ##Added a smoothed trend line to show overall relationship and got that from dslabs and highcharter tutorial
facet_wrap(~year_group)+
labs(title = "Violent Crime vs Property Crime Rates Across Maryland Municipalities (2000-Present)",
x = "Property crime rate (per 100,000 people)",
y = "Violent crime rate (per 100,000 people)",
color = "Year Group",
caption = "Source: Maryland Statistical Analysis Center (MSAC),GOCCP") +
scale_color_manual(values=c("2000-2007" = "#8B3A62",
"2008-2015" = "#7D26CD",
"2016-Present" = "#CD4F39"))+
guides(color = guide_legend(override.aes = list(alpha = 1, size = 4))) + ## https://aosmith.rbind.io/2020/07/09/ggplot2-override-aes/ to change the legend appearance
theme_minimal(base_size = 12, base_family = "serif") + ##changed minimal theme with a serif font and size 12
theme(plot.title = element_text(face = "bold",size = 14, hjust = 0.5),
legend.key.size = unit(1.2, "lines"),
legend.text = element_text(size = 11),
legend.title = element_text(face = "bold")
)
`geom_smooth()` using formula = 'y ~ x'
Explanation
Visualization 1 shows the relationship between property crime rates and violent crime rates across Maryland municipalities from 2000 to the present, grouped into three time periods: 2000-2007, 2008-2015, and 2016-present. Each point represents one municipality in one year, with color indicating the year group.
During 2000-2007, municipalities show a wide spread in both property and violent crime rates, with several jurisdictions experiencing very high values in both categories. The regression line indicates a strong positive relationship, suggesting that higher property crime rates were closely associated with higher violent crime rates during this period. In 2008-2015, the overall pattern remains similar, but the points become slightly more concentrated, with fewer extreme high-crime outliers. This suggests that while the relationship between property and violent crime persisted, crime rates became somewhat less extreme across municipalities. In the 2016-present period, the relationship continues to be positive, but the overall levels of both property and violent crime appear lower and more tightly clustered.
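The visual comparison across periods could be backed up with a few numbers per year group. The sketch below is one way to do that, using a small made-up data frame (`toy`, with shortened column names) instead of `violent_model2`; the values are illustrative only.

```r
library(dplyr)

# Made-up values for illustration; not taken from the data set
toy <- data.frame(
  year_group    = rep(c("2000-2007", "2008-2015", "2016-Present"), each = 4),
  property_rate = c(1000, 3000, 5000, 7000,
                    900, 2500, 4200, 6000,
                    800, 2000, 3100, 4500),
  violent_rate  = c(150, 420, 700, 1100,
                    120, 300, 610, 800,
                    90, 260, 380, 600)
)

by_period <- toy |>
  group_by(year_group) |>
  summarize(
    r = cor(property_rate, violent_rate),                           # strength of the relationship
    slope = cov(property_rate, violent_rate) / var(property_rate),  # OLS slope for one predictor
    median_violent = median(violent_rate)                           # typical level in the period
  )

by_period
```

A table like this separates two claims that the plot shows together: whether the relationship stays positive (`r` and `slope`) and whether overall levels fall over time (`median_violent`).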
Preparing the data for highcharter
md_violent <- violent_clean |>
group_by(county) |>
summarize(
violent_rate = mean(violent_crime_rate_per_100_000_people, na.rm = TRUE),
property_rate = mean(property_crime_rate_per_100_000_people,na.rm=TRUE),
population_rate = mean(population,na.rm=TRUE)
)
head(md_violent)
# A tibble: 6 × 4
county violent_rate property_rate population_rate
<chr> <dbl> <dbl> <dbl>
1 Allegany 234. 1820. 5828.
2 Anne Arundel 1022. 4720. 36710.
3 Baltimore City 2050. 6489. 658023.
4 Calvert 444. 2708. 2868.
5 Caroline 485. 2590. 1280.
6 Carroll 261. 2117. 5469.
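One thing to watch: `violent_clean` is built from the full file, which (judging by the rows printed earlier, with years such as 1990 and 1997) still includes years before 2000, so the averages above appear to span every year in the file. If the county averages are meant to cover only 2000 to present, one option is to filter before grouping. The sketch below shows the idea on a made-up data frame (`toy_clean`); it is an assumption about the intended scope, not a change to the results shown.

```r
library(dplyr)

# Made-up rows for illustration; counties "A" and "B" are hypothetical
toy_clean <- data.frame(
  county = c("A", "A", "A", "B", "B"),
  year   = c(1995, 2005, 2015, 1998, 2010),
  violent_crime_rate_per_100_000_people = c(900, 600, 300, 400, 200)
)

county_2000_on <- toy_clean |>
  filter(year >= 2000) |>   # drop the pre-2000 rows before averaging
  group_by(county) |>
  summarize(
    violent_rate = mean(violent_crime_rate_per_100_000_people, na.rm = TRUE)
  )

county_2000_on
```

In this toy case county A averages 450 after filtering, compared with 600 if the 1995 row were included, which shows how much earlier years can shift a county-level mean.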
Grouping crime into 3 groups
md_violent <- md_violent |>
mutate(
crime_level = case_when(
violent_rate < 400 ~ "Low Crime",
violent_rate < 900 ~ "Medium Crime",
TRUE ~ "High Crime"
)
)
md_violent
# A tibble: 22 × 5
county violent_rate property_rate population_rate crime_level
<chr> <dbl> <dbl> <dbl> <chr>
1 Allegany 234. 1820. 5828. Low Crime
2 Anne Arundel 1022. 4720. 36710. High Crime
3 Baltimore City 2050. 6489. 658023. High Crime
4 Calvert 444. 2708. 2868. Medium Crime
5 Caroline 485. 2590. 1280. Medium Crime
6 Carroll 261. 2117. 5469. Low Crime
7 Cecil 454. 2985. 3069. Medium Crime
8 Charles 576. 2953. 5757. Medium Crime
9 Dorchester 434. 2666. 3179. Medium Crime
10 Frederick 219. 1443. 8182. Low Crime
# ℹ 12 more rows
Just to look at how many counties fall into each group
table(md_violent$crime_level)
High Crime Low Crime Medium Crime
3 9 10
Visualization 2
highchart() |>
hc_add_series(
data = md_violent,
type = "bubble",
hcaes(
x = property_rate,
y = violent_rate,
z = population_rate,
group = crime_level,
name = county)) |>
hc_title(text = "Violent vs Property Crime Rates by County in Maryland")|>
hc_subtitle(text = "Bubble size represents average population (2000–Present)") |>
hc_xAxis(title = list(text = "Property Crime Rate (per 100,000 people)"))|>
hc_yAxis(min=0,title = list(text = "Violent Crime Rate (per 100,000 people)"))|> ## https://www.highcharts.com/forum/viewtopic.php?t=9626 I used this to remove negative values in y axis, because there shouldn't be negative crime rates
hc_tooltip(
borderColor = "black",
pointFormat = paste(
"<b>{point.name}</b><br>",
"<b>Crime Level:</b> {series.name}<br>",
"<b>Violent Crime Rate:</b> {point.y:.1f}<br>",
"<b>Property Crime Rate:</b> {point.x:.1f}<br>",
"<b>Population:</b> {point.z:,.0f}")) |>
## Bubble styling
hc_plotOptions(
bubble = list(
minSize = 10,
maxSize = 55,
opacity = 0.85,
marker = list(
lineWidth = 1,
lineColor = "black"))) |>
hc_colors(c("#B22222","#EEB422","#BA55D3")) |>
hc_legend(title = list(text = "Violent Crime Level")) |>
hc_add_theme(hc_theme_flatdark())
This was mainly inspired by Jackie’s and Sam’s project 2 visualization.
Explanation
This bubble chart shows the relationship between property crime rates and violent crime rates across Maryland counties, using county-level averages from 2000 to the present. Each bubble represents a single county. The x-axis displays the average property crime rate per 100,000 people, while the y-axis shows the average violent crime rate per 100,000 people. The size of each bubble represents the county’s average population, meaning larger bubbles correspond to more populous counties. The color of the bubbles categorizes counties into low, medium, and high violent crime levels, making it easier to compare patterns across different crime intensities.
Overall, the visualization shows a clear positive relationship between property crime and violent crime: counties with higher property crime rates tend to also have higher violent crime rates. Most counties fall into the low to medium crime categories and cluster toward the lower left of the plot, indicating relatively lower crime rates and smaller populations. In contrast, Baltimore City stands out clearly as a high-crime county, with both very high property and violent crime rates as well as the largest population, which is reflected by its noticeably larger bubble.
Citations for highcharter
https://www.geeksforgeeks.org/r-language/data-visualization-with-highcharter-in-r/
https://www.rdocumentation.org/packages/highcharter/versions/0.9.4
https://rpubs.com/codytps/1325459
https://communicate-data-with-r.netlify.app/docs/visualisation/2htmlwidgets/highcharter/
Background Research
Violent crime has been a long-standing public concern in Maryland, especially in urban areas where population density and socioeconomic inequality are higher. According to statewide crime reports, Maryland has historically experienced higher violent crime rates than the national average, with Baltimore City consistently reporting the highest levels of violent crime in the state. Researchers often link these patterns to factors such as poverty, unemployment, housing instability, and long-term disinvestment in certain neighborhoods. These structural conditions increase both the opportunity for crime and the likelihood of repeated victimization within the same communities.
Studies also show that violent crime and property crime tend to rise together, suggesting that they are influenced by similar underlying conditions. Areas with high property crime often experience higher violent crime because both types of crime are associated with economic stress, limited access to resources, and reduced social cohesion. In Maryland, recent reports indicate that while some counties have seen declines in property crime over time, violent crime trends remain uneven, with sharp differences between counties and municipalities. This makes Maryland an important case study for examining how crime varies across jurisdictions and how factors like population size and property crime relate to violent crime rates.
References
Governor’s Office of Crime Control and Prevention. (2023). Maryland crime trends and statistics. State of Maryland. https://goccp.maryland.gov/data-and-reports/
Federal Bureau of Investigation. (2022). Uniform crime reporting (UCR) program. https://www.fbi.gov/services/cjis/ucr
Sampson, R. J., & Groves, W. B. (1989). Community structure and crime: Testing social-disorganization theory. American Journal of Sociology, 94(4), 774–802.
Conclusion
The two visualizations help explain patterns in violent crime across Maryland municipalities. For my second visualization, I originally planned to use Tableau, but it did not work the way I expected for showing relationships between multiple variables. I also briefly explored using GIS, but I realized that a map was not necessary for answering my research question. Instead, I chose a highcharter bubble chart, which better represents the relationship between crime rates and population.
In the first visualization, there is a clear positive relationship between violent crime and property crime across all three time periods. This means that municipalities with higher property crime rates generally also experience higher violent crime rates. One surprising result is that the 2016–present period shows lower overall crime rates compared to the earlier time periods. I expected crime to increase in more recent years, so this pattern was unexpected.
The second visualization summarizes average crime rates by county. Bubble size represents population, making it easier to see how population relates to crime. Larger counties, especially Baltimore City, stand out with higher crime rates, while smaller and medium-sized counties cluster at lower levels. This shows that population plays a role, but it does not fully explain differences in crime across counties. Overall, these visualizations support the regression results and help illustrate crime trends across Maryland.