Violent Crimes

Author

Thiloni Konara

Introduction

For my final project, I chose the data set “Violent Crime & Property Crime by Municipality 2000 to Present.” The data was originally collected by the Maryland Statistical Analysis Center (MSAC). I chose this topic because I am still new to Maryland, and understanding crime patterns helps me learn more about the different communities here. I am also interested in crime data sets in general. This data set has 4284 observations and 32 variables.

Variables

| Variable Name | Meaning | Data type |
|---|---|---|
| Jurisdiction | The Maryland municipality where the crime data was reported | categorical |
| Year | The year the crime counts were recorded | numerical |
| Population | Total population of the jurisdiction in that year | numerical |
| violent_crime_rate_per_100_000_people | Violent crimes per 100,000 population | numerical |
| property_crime_rate_per_100_000_people | Property crimes per 100,000 population | numerical |
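As a small sketch of what a "rate per 100,000" column means, the helper below converts a raw count into a rate. It assumes the rate columns follow the usual definition (count divided by population, times 100,000); the function name `rate_per_100k` and the numbers are hypothetical and are not taken from the data set.

```r
# Hypothetical helper: converts a raw crime count into a rate per 100,000 residents.
# Assumes the usual definition: count / population * 100,000.
rate_per_100k <- function(count, population) {
  count / population * 100000
}

# Illustrative numbers only (not from the data set):
# 50 crimes in a town of 10,000 people
medium_town <- rate_per_100k(50, 10000)
medium_town

# A single crime in a very small town of 101 people
tiny_town <- rate_per_100k(1, 101)
tiny_town
```

The second call shows why very small municipalities can produce extreme rates: one incident in a town of about 100 people already gives a rate close to 1,000 per 100,000.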

How the data was collected

The crime data in this data set are collected each year by local police departments and reported to the Maryland State Police through the FBI’s Uniform Crime Reporting (UCR) Program. MSAC then compiles these numbers by municipality and year. The data set does not include a detailed ReadMe file, so information about specific cleaning steps or missing data handling is not provided.

Research Question

What factors best predict violent crime rate across Maryland municipalities?

Source

Maryland Statistical Analysis Center (MSAC), within the Governor’s Office of Crime Control and Prevention (GOCCP)

Loading the libraries

library(tidyverse)
library(ggfortify)
library(ggplot2)
library(highcharter)
library(scales)
library(RColorBrewer)

Loading the data set

violent <- read_csv("Violent_Crime_&_Property_Crime_by_Municipality__2000_to_Present_20251203.csv")

Looking at the structure and the first 6 rows

#str(violent)
head(violent)
# A tibble: 6 × 32
  JURISDICTION COUNTY      YEAR POPULATION MURDER  RAPE ROBBERY `AGG. ASSAULT`
  <chr>        <chr>      <dbl>      <dbl>  <dbl> <dbl>   <dbl>          <dbl>
1 Galestown    Dorchester  2008        101      0     0       0              0
2 Goldsboro    Caroline    2014        247      0     0       0              0
3 Henderson    Caroline    1997         66      0     0       0              0
4 Keedysville  Washington  1990        464      0     0       0              0
5 Kitzmiller   Garrett     1996        275      0     0       0              0
6 Lonaconing   Allegany    1991       1122      0     0       0              0
# ℹ 24 more variables: `B & E` <dbl>, `LARCENY THEFT` <dbl>, `M/V THEFT` <dbl>,
#   `GRAND TOTAL` <dbl>, `PERCENT CHANGE` <chr>, `VIOLENT CRIME TOTAL` <dbl>,
#   `VIOLENT CRIME PERCENT` <chr>, `VIOLENT CRIME PERCENT CHANGE` <chr>,
#   `PROPERTY CRIME TOTALS` <dbl>, `PROPERTY CRIME PERCENT` <chr>,
#   `PROPERTY CRIME PERCENT CHANGE` <chr>,
#   `OVERALL CRIME RATE PER 100,000 PEOPLE` <dbl>,
#   `OVERALL PERCENT CHANGE PER 100,000 PEOPLE` <chr>, …

Cleaning the data set

names(violent) <- tolower(names(violent))
names(violent) <- gsub(" ","_",names(violent))
names(violent) <- gsub("[.]", "", names(violent))
names(violent) <- gsub("[/,&]","_" ,names(violent))

head(violent)
# A tibble: 6 × 32
  jurisdiction county     year population murder  rape robbery agg_assault b___e
  <chr>        <chr>     <dbl>      <dbl>  <dbl> <dbl>   <dbl>       <dbl> <dbl>
1 Galestown    Dorchest…  2008        101      0     0       0           0     0
2 Goldsboro    Caroline   2014        247      0     0       0           0     0
3 Henderson    Caroline   1997         66      0     0       0           0     0
4 Keedysville  Washingt…  1990        464      0     0       0           0     2
5 Kitzmiller   Garrett    1996        275      0     0       0           0     0
6 Lonaconing   Allegany   1991       1122      0     0       0           0     0
# ℹ 23 more variables: larceny_theft <dbl>, m_v_theft <dbl>, grand_total <dbl>,
#   percent_change <chr>, violent_crime_total <dbl>,
#   violent_crime_percent <chr>, violent_crime_percent_change <chr>,
#   property_crime_totals <dbl>, property_crime_percent <chr>,
#   property_crime_percent_change <chr>,
#   overall_crime_rate_per_100_000_people <dbl>,
#   overall_percent_change_per_100_000_people <chr>, …
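To make the renaming steps easier to follow, here is a minimal sketch that applies the same four substitutions to a few of the original column names. The wrapper `clean_names_manual` is a hypothetical name used only for illustration; the last line is an optional extra step that is not part of the original cleaning code.

```r
# Hypothetical wrapper around the same four steps used in the cleaning code
clean_names_manual <- function(x) {
  x <- tolower(x)              # lower-case everything
  x <- gsub(" ", "_", x)       # spaces -> underscores
  x <- gsub("[.]", "", x)      # drop periods ("agg." -> "agg")
  gsub("[/,&]", "_", x)        # slashes, commas, ampersands -> underscores
}

raw <- c("AGG. ASSAULT", "B & E", "M/V THEFT",
         "OVERALL CRIME RATE PER 100,000 PEOPLE")
cleaned <- clean_names_manual(raw)
cleaned

# Optional extra step (NOT in the original code): collapse repeated underscores
collapsed <- gsub("_+", "_", cleaned)
collapsed
```

This shows where the triple underscore in `b___e` comes from: the two spaces and the ampersand each become an underscore. Collapsing them would give `b_e`, but every later reference to `b___e` would then need to change as well.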

Checking NAs

colSums(is.na(violent))
                                         jurisdiction 
                                                    0 
                                               county 
                                                    0 
                                                 year 
                                                    0 
                                           population 
                                                    0 
                                               murder 
                                                    0 
                                                 rape 
                                                    0 
                                              robbery 
                                                    0 
                                          agg_assault 
                                                    0 
                                                b___e 
                                                    0 
                                        larceny_theft 
                                                    0 
                                            m_v_theft 
                                                    0 
                                          grand_total 
                                                    0 
                                       percent_change 
                                                  149 
                                  violent_crime_total 
                                                    0 
                                violent_crime_percent 
                                                    0 
                         violent_crime_percent_change 
                                                  149 
                                property_crime_totals 
                                                    0 
                               property_crime_percent 
                                                    0 
                        property_crime_percent_change 
                                                  149 
                overall_crime_rate_per_100_000_people 
                                                    1 
            overall_percent_change_per_100_000_people 
                                                  149 
                violent_crime_rate_per_100_000_people 
                                                    0 
 violent_crime_rate_percent_change_per_100_000_people 
                                                  149 
               property_crime_rate_per_100_000_people 
                                                    0 
property_crime_rate_percent_change_per_100_000_people 
                                                  149 
                            murder_per_100_000_people 
                                                    0 
                              rape_per_100_000_people 
                                                    2 
                           robbery_per_100_000_people 
                                                    0 
                       agg_assault_per_100_000_people 
                                                    0 
                             b___e_per_100_000_people 
                                                    0 
                     larceny_theft_per_100_000_people 
                                                    0 
                         m_v_theft_per_100_000_people 
                                                    0 

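There are no NAs in the variables that I am going to use for my project. The missing values are limited to the percent-change columns and to two rate columns (overall crime rate and rape rate).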

Removing unused columns

violent_clean <- violent |>
  select(-agg_assault, -b___e, -larceny_theft, -m_v_theft,
         -b___e_per_100_000_people, -larceny_theft_per_100_000_people,
         -m_v_theft_per_100_000_people)

head(violent_clean)
# A tibble: 6 × 25
  jurisdiction county      year population murder  rape robbery grand_total
  <chr>        <chr>      <dbl>      <dbl>  <dbl> <dbl>   <dbl>       <dbl>
1 Galestown    Dorchester  2008        101      0     0       0           0
2 Goldsboro    Caroline    2014        247      0     0       0           0
3 Henderson    Caroline    1997         66      0     0       0           0
4 Keedysville  Washington  1990        464      0     0       0           4
5 Kitzmiller   Garrett     1996        275      0     0       0           0
6 Lonaconing   Allegany    1991       1122      0     0       0           0
# ℹ 17 more variables: percent_change <chr>, violent_crime_total <dbl>,
#   violent_crime_percent <chr>, violent_crime_percent_change <chr>,
#   property_crime_totals <dbl>, property_crime_percent <chr>,
#   property_crime_percent_change <chr>,
#   overall_crime_rate_per_100_000_people <dbl>,
#   overall_percent_change_per_100_000_people <chr>,
#   violent_crime_rate_per_100_000_people <dbl>, …

Selecting the columns that I need for the regression model

violent_model <- violent_clean |>
  select(jurisdiction,year,county,population,violent_crime_rate_per_100_000_people,property_crime_rate_per_100_000_people)

head(violent_model)
# A tibble: 6 × 6
  jurisdiction  year county     population violent_crime_rate_per_100_000_people
  <chr>        <dbl> <chr>           <dbl>                                 <dbl>
1 Galestown     2008 Dorchester        101                                     0
2 Goldsboro     2014 Caroline          247                                     0
3 Henderson     1997 Caroline           66                                     0
4 Keedysville   1990 Washington        464                                     0
5 Kitzmiller    1996 Garrett           275                                     0
6 Lonaconing    1991 Allegany         1122                                     0
# ℹ 1 more variable: property_crime_rate_per_100_000_people <dbl>

Why I chose these predictors

To explore potential predictors, I examined a correlation heatmap of all numeric variables. The plot showed very strong correlations between violent crime rate and individual crime categories such as robbery and aggravated assault. However, these variables are components of violent crime itself and therefore were not appropriate predictors for the final model. Property crime rate showed a strong positive relationship with violent crime rate while remaining independent of the violent crime definition. Population also showed a moderate relationship and helps control for jurisdiction size. Year was included to account for overall trends over time, even though its correlation was weaker. Based on this exploration, property crime rate, population, and year were selected as the most appropriate predictors.
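The heatmap code itself is not shown in this report, so the following is only a sketch of one way such a correlation check could be done in base R. It uses a small simulated data frame (`toy`, with made-up column names and values) rather than the real data.

```r
# Simulated stand-in data (assumption: NOT the real crime data)
set.seed(1)
n <- 200
toy <- data.frame(
  population    = round(runif(n, 100, 50000)),
  year          = sample(2000:2020, n, replace = TRUE),
  property_rate = runif(n, 0, 8000)
)
toy$violent_rate <- 0.13 * toy$property_rate + rnorm(n, sd = 300)

# Keep numeric columns only and compute pairwise correlations
num_cols <- toy[sapply(toy, is.numeric)]
cor_mat  <- cor(num_cols, use = "pairwise.complete.obs")
round(cor_mat, 2)

# Simple base-R heatmap of the matrix, without clustering or rescaling
heatmap(cor_mat, Rowv = NA, Colv = NA, symm = TRUE, scale = "none")
```

With the real data, the same two steps would be applied to the numeric columns of `violent`; the component variables (robbery, aggravated assault, and so on) would then show up as the cells most strongly correlated with the violent crime rate.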

Multiple Linear Regression Model

fit1 <- lm(violent_crime_rate_per_100_000_people~ population + year + property_crime_rate_per_100_000_people, data = violent_model)

autoplot(fit1, 1:4,nrow=2,ncol=2) ## To look at the diagnostic plots

summary(fit1)

Call:
lm(formula = violent_crime_rate_per_100_000_people ~ population + 
    year + property_crime_rate_per_100_000_people, data = violent_model)

Residuals:
    Min      1Q  Median      3Q     Max 
-2412.2  -190.0   -62.7   127.3  7059.0 

Coefficients:
                                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                             8.316e+02  1.392e+03   0.597     0.55
population                              1.754e-03  1.063e-04  16.500   <2e-16
year                                   -3.840e-01  6.936e-01  -0.554     0.58
property_crime_rate_per_100_000_people  1.274e-01  2.094e-03  60.852   <2e-16
                                          
(Intercept)                               
population                             ***
year                                      
property_crime_rate_per_100_000_people ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 389.2 on 4280 degrees of freedom
Multiple R-squared:  0.5186,    Adjusted R-squared:  0.5182 
F-statistic:  1537 on 3 and 4280 DF,  p-value: < 2.2e-16

Linear Regression Analysis

The multiple linear regression model predicts the violent crime rate per 100,000 people across Maryland municipalities based on population, year, and property crime rate per 100,000 people.

Model Equation

Because this is a multiple linear regression, the model follows the form:

y = a + b1x1 + b2x2 + b3x3

Where:

y = violent crime rate per 100,000 people

a = intercept

b1 = coefficient for population (x1)

b2 = coefficient for year (x2)

b3 = coefficient for property crime rate per 100,000 people (x3)

Based on the model output, the fitted equation is:

violent_crime_rate_per_100_000_people = 831.6 + 0.001754(population) - 0.384(year) + 0.1274(property_crime_rate_per_100_000_people)

This means that:

Violent crime rate increases slightly as population increases, holding the other predictors constant.

Violent crime rate increases as property crime rate increases, holding the other predictors constant.

Year has little effect on violent crime rate.
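To make the fitted equation concrete, the sketch below plugs in one set of values. The coefficients are copied from the model output above; the function name `predict_violent_rate` and the input values (a town of 14,000 people in 2005 with a property crime rate of 4,000) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(fit1) output
b0     <- 831.6      # intercept
b_pop  <- 0.001754   # population
b_year <- -0.384     # year
b_prop <- 0.1274     # property crime rate per 100,000

# Hypothetical helper that evaluates the fitted equation by hand
predict_violent_rate <- function(population, year, property_rate) {
  b0 + b_pop * population + b_year * year + b_prop * property_rate
}

# Illustrative input values (not a row from the data set)
example_pred <- predict_violent_rate(population = 14000, year = 2005,
                                     property_rate = 4000)
example_pred

# Change in predicted violent rate for 1,000 more property crimes per 100,000
prop_effect <- b_prop * 1000
prop_effect
```

For these inputs the equation gives a predicted violent crime rate of about 596 per 100,000, and raising the property crime rate by 1,000 adds about 127 to the prediction. With the fitted model object available, `predict(fit1, newdata = ...)` would do the same calculation using the unrounded coefficients.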

P-value and Adjusted r-squared analysis

The property crime rate per 100,000 people is a very strong and statistically significant predictor of violent crime rate (p < 2e-16). This indicates that municipalities with higher property crime rates tend to also experience higher violent crime rates.

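Population is also statistically significant (p < 2e-16), showing that jurisdictions with larger populations generally have slightly higher violent crime rates. The effect size is small: the coefficient of 0.001754 corresponds to about 1.75 more violent crimes per 100,000 people for every additional 1,000 residents.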

The variable year has a high p-value (p=0.58), suggesting that once population and property crime rate are accounted for, year does not have a meaningful effect on violent crime rate.

The adjusted r-squared value for this model is 0.5182, meaning the model explains approximately 51.8% of the variation in violent crime rate across Maryland municipalities. The overall model is statistically significant (p < 2.2e-16).
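As a quick check on how the adjusted r-squared relates to the ordinary r-squared, the sketch below applies the standard adjustment formula to the numbers reported in the model output (n = 4284 observations, 3 predictors, r-squared = 0.5186). Because the r-squared printed by `summary()` is already rounded, the result can differ from the printed adjusted value in the last decimal place.

```r
# Values taken from the model output above
r2 <- 0.5186   # multiple R-squared (rounded as printed)
n  <- 4284     # number of observations
p  <- 3        # number of predictors

# Standard adjustment: penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2
```

The result is about 0.518. With more than 4,000 observations and only 3 predictors the penalty is tiny, which is why the two values are nearly identical.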

Diagnostic plots

Residuals vs Fitted: The residuals show a slight curve and increasing spread at higher fitted values, suggesting mild nonlinearity and heteroscedasticity. However, the pattern is not extreme.

Normal Q-Q plot: Most points follow the diagonal line, indicating that residuals are approximately normally distributed, though some extreme values deviate from normality.

Scale-Location plot: The spread of residuals increases as fitted values increase, indicating mild heteroscedasticity.

Cook’s Distance: A few influential observations (such as observations 2531 and 3017) stand out. These likely represent municipalities with unusually high crime levels and may have some influence on the model.

Overall, while the regression assumptions are not perfectly met, the model is appropriate for exploratory analysis and provides meaningful insight into factors associated with violent crime rates.
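The Cook's distance plot identifies influential rows visually; the same values can also be listed directly. The sketch below does this on simulated data (`toy`, with one deliberately extreme row added), so the numbers are not from the real model. The 4/n cutoff is only a common rule of thumb and is an assumption here, not something used in this report.

```r
# Simulated stand-in data (assumption: NOT the real crime data)
set.seed(42)
n <- 100
toy <- data.frame(property_rate = runif(n, 0, 6000))
toy$violent_rate <- 0.13 * toy$property_rate + rnorm(n, sd = 200)

# Make the last row deliberately extreme so it acts as an influential point
toy$property_rate[n] <- 8000
toy$violent_rate[n]  <- 7000

fit_toy <- lm(violent_rate ~ property_rate, data = toy)

# Cook's distance for every observation, largest first
cd <- cooks.distance(fit_toy)
head(sort(cd, decreasing = TRUE), 3)

# Rows above a rule-of-thumb cutoff of 4 / n
flagged <- which(cd > 4 / n)
flagged
```

Applied to the real model, `cooks.distance(fit1)` would return one value per row, and the flagged row numbers could be looked up in `violent_model` to see which municipalities and years they correspond to.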

Grouping the year variable into 3 categories

violent_model <- violent_model |>
  mutate(year_group = case_when(
    year >= 2000 & year <= 2007 ~ "2000-2007",
    year >= 2008 & year <= 2015 ~ "2008-2015",
    year >= 2016 ~ "2016-Present",
    TRUE ~ "other"
  ))
violent_model
# A tibble: 4,284 × 7
   jurisdiction  year county     population violent_crime_rate_per_100_000_peo…¹
   <chr>        <dbl> <chr>           <dbl>                                <dbl>
 1 Galestown     2008 Dorchester        101                                   0 
 2 Goldsboro     2014 Caroline          247                                   0 
 3 Henderson     1997 Caroline           66                                   0 
 4 Keedysville   1990 Washington        464                                   0 
 5 Kitzmiller    1996 Garrett           275                                   0 
 6 Lonaconing    1991 Allegany         1122                                   0 
 7 Aberdeen      1990 Harford         13087                                 657.
 8 Aberdeen      1991 Harford         13301                                 752.
 9 Aberdeen      1992 Harford         13432                                 715.
10 Aberdeen      1993 Harford         13703                                1022.
# ℹ 4,274 more rows
# ℹ abbreviated name: ¹​violent_crime_rate_per_100_000_people
# ℹ 2 more variables: property_crime_rate_per_100_000_people <dbl>,
#   year_group <chr>

I created these groups only for the years from 2000 to the present, and I labeled the years before 2000 as "other".
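To check how the boundary years are assigned, here is a small base R sketch that reproduces the same grouping with `cut()` instead of `case_when()`. This is an alternative shown for illustration, not the method used above, and the `years` vector is made up.

```r
# A few made-up years, including every boundary value
years <- c(1990, 1999, 2000, 2007, 2008, 2015, 2016, 2020)

# Intervals are (-Inf, 1999], (1999, 2007], (2007, 2015], (2015, Inf)
year_group <- cut(
  years,
  breaks = c(-Inf, 1999, 2007, 2015, Inf),
  labels = c("other", "2000-2007", "2008-2015", "2016-Present")
)

data.frame(years, year_group)
```

One difference from `case_when()` is that `cut()` returns a factor whose levels are already in chronological order, which can be convenient when the groups are later used for facets or legends.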

Removing the "other" year group

violent_model2 <- violent_model |>
  filter(year_group != "other")
violent_model2
# A tibble: 2,965 × 7
   jurisdiction  year county     population violent_crime_rate_per_100_000_peo…¹
   <chr>        <dbl> <chr>           <dbl>                                <dbl>
 1 Galestown     2008 Dorchester        101                                   0 
 2 Goldsboro     2014 Caroline          247                                   0 
 3 Aberdeen      2000 Harford         13842                                 607.
 4 Aberdeen      2001 Harford         14048                                 797.
 5 Aberdeen      2002 Harford         14264                                 883.
 6 Aberdeen      2003 Harford         14148                                 643.
 7 Aberdeen      2004 Harford         14311                                 692.
 8 Aberdeen      2005 Harford         14312                                1111 
 9 Aberdeen      2006 Harford         14133                                 722.
10 Aberdeen      2007 Harford         14187                                 811.
# ℹ 2,955 more rows
# ℹ abbreviated name: ¹​violent_crime_rate_per_100_000_people
# ℹ 2 more variables: property_crime_rate_per_100_000_people <dbl>,
#   year_group <chr>

I did this because I am only concerned with the years from 2000 to the present.

Visualization 1

ggplot(violent_model2,aes(x=property_crime_rate_per_100_000_people,
                         y=violent_crime_rate_per_100_000_people)) +
  geom_point(aes(color = year_group),alpha =0.3)+
  geom_smooth(method = "lm",se = FALSE , color ="black",linewidth= 0.5,linetype = "dashed")+ ##Added a smoothed trend line to show overall relationship and got that from dslabs and highcharter tutorial
    facet_wrap(~year_group)+
  labs(title = "Violent Crime vs Property Crime Rates Across Maryland Municipalities (2000-Present)",
       x = "Property crime rate (per 100,000 people)",
       y = "Violent crime rate (per 100,000 people)",
       color = "Year Group",
       caption = "Source: Maryland Statistical Analysis Center (MSAC),GOCCP") +
    scale_color_manual(values=c("2000-2007" = "#8B3A62",
                                "2008-2015" = "#7D26CD",
                                "2016-Present" = "#CD4F39"))+
  guides(color = guide_legend(override.aes = list(alpha = 1, size = 4))) + ## https://aosmith.rbind.io/2020/07/09/ggplot2-override-aes/ to change the legend appearance
  theme_minimal(base_size = 12, base_family = "serif") + ##changed minimal theme with a serif font and size 12
  theme(plot.title = element_text(face = "bold",size = 14, hjust = 0.5),
        legend.key.size = unit(1.2, "lines"),
        legend.text = element_text(size = 11),
        legend.title = element_text(face = "bold")
)
`geom_smooth()` using formula = 'y ~ x'

Explanation

Visualization 1 shows the relationship between property crime rates and violent crime rates across Maryland municipalities from 2000 to the present, grouped into three time periods: 2000-2007, 2008-2015, and 2016-present. Each point represents one municipality in one year, with color indicating the year group. During 2000-2007, municipalities show a wide spread in both property and violent crime rates, with several jurisdictions experiencing very high values in both categories. The regression line indicates a strong positive relationship, suggesting that increases in property crime were closely associated with higher violent crime rates during this period. In 2008-2015, the overall pattern remains similar, but the spread of points becomes slightly more concentrated, with fewer extreme high-crime outliers. This suggests that while the relationship between property and violent crime persisted, crime rates became somewhat less extreme across municipalities. In the 2016-present period, the relationship continues to be positive, but the overall levels of both property and violent crime appear lower and more tightly clustered.

Preparing the data for highcharter

md_violent <- violent_clean |>
  group_by(county) |>
  summarize(
    violent_rate = mean(violent_crime_rate_per_100_000_people, na.rm = TRUE),
    property_rate = mean(property_crime_rate_per_100_000_people,na.rm=TRUE),
    population_rate = mean(population,na.rm=TRUE)
  )
head(md_violent)
# A tibble: 6 × 4
  county         violent_rate property_rate population_rate
  <chr>                 <dbl>         <dbl>           <dbl>
1 Allegany               234.         1820.           5828.
2 Anne Arundel          1022.         4720.          36710.
3 Baltimore City        2050.         6489.         658023.
4 Calvert                444.         2708.           2868.
5 Caroline               485.         2590.           1280.
6 Carroll                261.         2117.           5469.
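The summary above is computed from `violent_clean`, which still contains rows from before 2000 (for example the 1990 Keedysville row shown earlier). If the county averages are meant to cover only 2000 onward, the rows would need to be filtered first. The sketch below shows the difference on a tiny made-up data frame (`toy`), using base R so that it runs on its own.

```r
# Made-up stand-in data (assumption: NOT the real crime data)
toy <- data.frame(
  county       = c("A", "A", "A", "B", "B"),
  year         = c(1995, 2005, 2010, 1998, 2012),
  violent_rate = c(900, 500, 300, 400, 200)
)

# Average over every year in the data
all_years <- aggregate(violent_rate ~ county, data = toy, FUN = mean)
all_years

# Average over 2000 onward only
recent <- aggregate(violent_rate ~ county,
                    data = subset(toy, year >= 2000), FUN = mean)
recent
```

In the pipeline above, the equivalent change would be adding `filter(year >= 2000)` before `group_by(county)`; the averages printed in this report would then change, so they would need to be regenerated.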

Grouping crime into 3 groups

md_violent <- md_violent |>
  mutate(
   crime_level = case_when(
     violent_rate  < 400 ~ "Low Crime",
     violent_rate < 900 ~ "Medium Crime",
     TRUE ~ "High Crime"
   )
 )
md_violent
# A tibble: 22 × 5
   county         violent_rate property_rate population_rate crime_level 
   <chr>                 <dbl>         <dbl>           <dbl> <chr>       
 1 Allegany               234.         1820.           5828. Low Crime   
 2 Anne Arundel          1022.         4720.          36710. High Crime  
 3 Baltimore City        2050.         6489.         658023. High Crime  
 4 Calvert                444.         2708.           2868. Medium Crime
 5 Caroline               485.         2590.           1280. Medium Crime
 6 Carroll                261.         2117.           5469. Low Crime   
 7 Cecil                  454.         2985.           3069. Medium Crime
 8 Charles                576.         2953.           5757. Medium Crime
 9 Dorchester             434.         2666.           3179. Medium Crime
10 Frederick              219.         1443.           8182. Low Crime   
# ℹ 12 more rows

Just to look at how many counties fall into each crime level

table(md_violent$crime_level)

  High Crime    Low Crime Medium Crime 
           3            9           10 
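The table above lists the groups alphabetically (High, Low, Medium) because `crime_level` is stored as text. As a minimal sketch, converting it to a factor with explicit levels puts the groups in their natural order. The vector below is made up for illustration and does not reproduce the real counts.

```r
# Made-up crime levels (assumption: NOT the real county assignments)
crime_level <- c("Low Crime", "High Crime", "Medium Crime",
                 "Low Crime", "Medium Crime")

# Default: character values are tabulated in alphabetical order
table(crime_level)

# Explicit level order: Low -> Medium -> High
crime_level_f <- factor(
  crime_level,
  levels = c("Low Crime", "Medium Crime", "High Crime")
)
level_counts <- table(crime_level_f)
level_counts
```

The counts are the same either way; only the ordering changes, which makes the table easier to read.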

Visualization 2

highchart() |>
  hc_add_series(
    data = md_violent,
    type = "bubble",
    hcaes(
      x = property_rate,
      y = violent_rate,
      z = population_rate,
      group = crime_level,
      name = county)) |>
  hc_title(text = "Violent vs Property Crime Rates by County in Maryland")|>
  hc_subtitle(text = "Bubble size represents average municipality population (all years in the data)") |>
  hc_xAxis(title = list(text = "Property Crime Rate (per 100,000 people)"))|>
  hc_yAxis(min=0,title = list(text = "Violent Crime Rate (per 100,000 people)"))|> ## https://www.highcharts.com/forum/viewtopic.php?t=9626 I used this to remove negative values in y axis, because there shouldn't be negative crime rates
  hc_tooltip(
    borderColor = "black",
    pointFormat = paste(
      "<b>{point.name}</b><br>",
      "<b>Crime Level:</b> {series.name}<br>",
      "<b>Violent Crime Rate:</b> {point.y:.1f}<br>",
      "<b>Property Crime Rate:</b> {point.x:.1f}<br>",
      "<b>Population:</b> {point.z:,.0f}")) |>
   ## Bubble styling
  hc_plotOptions(
    bubble = list(
      minSize = 10,
      maxSize = 55,
      opacity = 0.85,
      marker = list(
        lineWidth = 1,
        lineColor = "black"))) |>
  hc_colors(c("#B22222","#EEB422","#BA55D3")) |>
  hc_legend(title = list(text = "Violent Crime Level")) |>
  hc_add_theme(hc_theme_flatdark())

This was mainly inspired by Jackie’s and Sam’s project 2 visualization.

Explanation

This bubble chart shows the relationship between property crime rates and violent crime rates across Maryland counties, using county-level averages. Because the averages are computed from violent_clean, which was not filtered by year, they cover every year in the data set, including the years before 2000. Each bubble represents a single county. The x-axis displays the average property crime rate per 100,000 people, while the y-axis shows the average violent crime rate per 100,000 people. The size of each bubble represents the average population of the reporting municipalities in that county, meaning larger bubbles correspond to counties with more populous municipalities. The color of the bubbles categorizes counties into low, medium, and high violent crime levels, making it easier to compare patterns across different crime intensities. Overall, the visualization shows a clear positive relationship between property crime and violent crime: counties with higher property crime rates tend to also have higher violent crime rates. Most counties fall into the low to medium crime categories and cluster toward the lower left of the plot, indicating relatively lower crime rates and smaller populations. In contrast, Baltimore City stands out clearly as a high-crime county, with both very high property and violent crime rates as well as the largest population, which is reflected by its noticeably larger bubble.

Citations for highcharter

https://www.geeksforgeeks.org/r-language/data-visualization-with-highcharter-in-r/

https://www.rdocumentation.org/packages/highcharter/versions/0.9.4

https://rpubs.com/codytps/1325459

https://communicate-data-with-r.netlify.app/docs/visualisation/2htmlwidgets/highcharter/

Background Research

Violent crime has been a long-standing public concern in Maryland, especially in urban areas where population density and socioeconomic inequality are higher. According to statewide crime reports, Maryland has historically experienced higher violent crime rates than the national average, with Baltimore City consistently reporting the highest levels of violent crime in the state. Researchers often link these patterns to factors such as poverty, unemployment, housing instability, and long-term disinvestment in certain neighborhoods. These structural conditions increase both the opportunity for crime and the likelihood of repeated victimization within the same communities.

Studies also show that violent crime and property crime tend to rise together, suggesting that they are influenced by similar underlying conditions. Areas with high property crime often experience higher violent crime because both types of crime are associated with economic stress, limited access to resources, and reduced social cohesion. In Maryland, recent reports indicate that while some counties have seen declines in property crime over time, violent crime trends remain uneven, with sharp differences between counties and municipalities. This makes Maryland an important case study for examining how crime varies across jurisdictions and how factors like population size and property crime relate to violent crime rates.

References

Governor’s Office of Crime Control and Prevention. (2023). Maryland crime trends and statistics. State of Maryland. https://goccp.maryland.gov/data-and-reports/

Federal Bureau of Investigation. (2022). Uniform crime reporting (UCR) program. https://www.fbi.gov/services/cjis/ucr

Sampson, R. J., & Groves, W. B. (1989). Community structure and crime: Testing social-disorganization theory. American Journal of Sociology, 94(4), 774–802.

Conclusion

The two visualizations help explain patterns in violent crime across Maryland municipalities. For my second visualization, I originally planned to use Tableau, but it did not work the way I expected for showing relationships between multiple variables. I also briefly explored using GIS, but I realized that a map was not necessary for answering my research question. Instead, I chose a highcharter bubble chart, which better represents the relationship between crime rates and population.

In the first visualization, there is a clear positive relationship between violent crime and property crime across all three time periods. This means that municipalities with higher property crime rates generally also experience higher violent crime rates. One surprising result is that the 2016–present period shows lower overall crime rates compared to the earlier time periods. I expected crime to increase in more recent years, so this pattern was unexpected.

The second visualization summarizes average crime rates by county. Bubble size represents population, making it easier to see how population relates to crime. Larger counties, especially Baltimore City, stand out with higher crime rates, while smaller and medium-sized counties cluster at lower levels. This shows that population plays a role, but it does not fully explain differences in crime across counties. Overall, these visualizations support the regression results and help illustrate crime trends across Maryland.