Project 1 - Data 110

Author

Ricardo Zavaleta

Introduction

This dataset reports counts of violent crime and property crime for each municipality in the state of Maryland. It comes from opendata.maryland.gov. It lists the population, city, and county of each jurisdiction, along with counts of crimes such as murder, rape, robbery, and breaking and entering (B & E). For this project I will use two variables: the property crime rate per 100,000 people and the violent crime rate per 100,000 people. Because the full dataset covers so many cities and counties, I will focus on Montgomery County and Prince George’s County. I will fit a linear regression of the violent crime rate on the property crime rate for the jurisdictions in these two counties and compare the counties with each other.

# load the libraries
library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
library(ggthemes)

Warning: package 'ggthemes' was built under R version 4.5.2

library(ggrepel)

Warning: package 'ggrepel' was built under R version 4.5.2

# load the dataset (the CSV file must be in the working directory)
dataset <- read_csv("Violent_Crime_&_Property_Crime_by_Municipality__2000_to_Present_20260323.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 4284 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): JURISDICTION, COUNTY, PERCENT CHANGE, VIOLENT CRIME PERCENT, VIOLE...
dbl  (4): YEAR, MURDER, RAPE, RAPE PER 100,000 PEOPLE
num (18): POPULATION, ROBBERY, AGG. ASSAULT, B & E, LARCENY THEFT, M/V THEFT...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
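
The parsing warning above can be followed up with `problems()`, as the message suggests. The sketch below is only an illustration on a made-up three-row CSV (the names `toy_csv` and `toy` are hypothetical, and the values are not from the Maryland file); it shows what `problems()` reports when one value cannot be read as a number.

```r
library(readr)

# Made-up CSV for illustration: "n/a" cannot be parsed as a number.
toy_csv <- I("jurisdiction,population\nA,101\nB,n/a\nC,247\n")

toy <- read_csv(
  toy_csv,
  col_types = cols(
    jurisdiction = col_character(),
    population   = col_double()
  )
)

problems(toy)     # one row per value that failed to parse
toy$population    # the value that failed is stored as NA
```

Running `problems(dataset)` on the real data would show which rows and columns triggered the warning, which helps decide whether the affected columns matter for the analysis.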

head(dataset)

# A tibble: 6 × 32
  JURISDICTION COUNTY      YEAR POPULATION MURDER  RAPE ROBBERY `AGG. ASSAULT`
  <chr>        <chr>      <dbl>      <dbl>  <dbl> <dbl>   <dbl>          <dbl>
1 Galestown    Dorchester  2008        101      0     0       0              0
2 Goldsboro    Caroline    2014        247      0     0       0              0
3 Henderson    Caroline    1997         66      0     0       0              0
4 Keedysville  Washington  1990        464      0     0       0              0
5 Kitzmiller   Garrett     1996        275      0     0       0              0
6 Lonaconing   Allegany    1991       1122      0     0       0              0
# ℹ 24 more variables: `B & E` <dbl>, `LARCENY THEFT` <dbl>, `M/V THEFT` <dbl>,
#   `GRAND TOTAL` <dbl>, `PERCENT CHANGE` <chr>, `VIOLENT CRIME TOTAL` <dbl>,
#   `VIOLENT CRIME PERCENT` <chr>, `VIOLENT CRIME PERCENT CHANGE` <chr>,
#   `PROPERTY CRIME TOTALS` <dbl>, `PROPERTY CRIME PERCENT` <chr>,
#   `PROPERTY CRIME PERCENT CHANGE` <chr>,
#   `OVERALL CRIME RATE PER 100,000 PEOPLE` <dbl>,
#   `OVERALL PERCENT CHANGE PER 100,000 PEOPLE` <chr>, …

# clean the dataset
names(dataset) <- tolower(names(dataset))
names(dataset) <- gsub(" ","_",names(dataset))
names(dataset) <- gsub("[(). //-]", "_", names(dataset))
names(dataset) <- gsub(",", "", names(dataset))
head(dataset)

# A tibble: 6 × 32
  jurisdiction county  year population murder  rape robbery agg__assault `b_&_e`
  <chr>        <chr>  <dbl>      <dbl>  <dbl> <dbl>   <dbl>        <dbl>   <dbl>
1 Galestown    Dorch…  2008        101      0     0       0            0       0
2 Goldsboro    Carol…  2014        247      0     0       0            0       0
3 Henderson    Carol…  1997         66      0     0       0            0       0
4 Keedysville  Washi…  1990        464      0     0       0            0       2
5 Kitzmiller   Garre…  1996        275      0     0       0            0       0
6 Lonaconing   Alleg…  1991       1122      0     0       0            0       0
# ℹ 23 more variables: larceny_theft <dbl>, m_v_theft <dbl>, grand_total <dbl>,
#   percent_change <chr>, violent_crime_total <dbl>,
#   violent_crime_percent <chr>, violent_crime_percent_change <chr>,
#   property_crime_totals <dbl>, property_crime_percent <chr>,
#   property_crime_percent_change <chr>,
#   overall_crime_rate_per_100000_people <dbl>,
#   overall_percent_change_per_100000_people <chr>, …
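
The cleaning steps above can also be written as one small function. This is a minimal sketch, not the method used in this project: `clean_names_sketch` is a hypothetical helper, and it deliberately behaves a little differently from the code above, because it collapses repeated separators and also replaces `&`. With it, `AGG. ASSAULT` would become `agg_assault` rather than `agg__assault`, and `B & E` would become `b_e`, which no longer needs backticks.

```r
# Hypothetical helper (not part of the original analysis): lower-case the
# names, drop commas, then turn every run of other non-alphanumeric
# characters into a single underscore and trim underscores at the ends.
clean_names_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub(",", "", x, fixed = TRUE)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_|_$", "", x)
}

# A few column names copied from the output of the raw dataset
raw_names <- c("AGG. ASSAULT", "B & E", "M/V THEFT",
               "OVERALL CRIME RATE PER 100,000 PEOPLE")

clean_names_sketch(raw_names)
```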

# filter to only show the year 2020 (year is numeric, so compare to a number)
dataset_2020 <- dataset |>
  filter(year == 2020)
head(dataset_2020)

# A tibble: 6 × 32
  jurisdiction county  year population murder  rape robbery agg__assault `b_&_e`
  <chr>        <chr>  <dbl>      <dbl>  <dbl> <dbl>   <dbl>        <dbl>   <dbl>
1 Aberdeen     Harfo…  2020      16140      0     6      23           78      43
2 Accident     Garre…  2020        335      0     0       0            2       1
3 Annapolis    Anne …  2020      39315      6    19      51          176     109
4 Baltimore    Balti…  2020     588593    334   324    3418         5326    4099
5 Barclay      Queen…  2020        167      0     0       0            0       0
6 Barnesville  Montg…  2020        180      0     0       0            0       0
# ℹ 23 more variables: larceny_theft <dbl>, m_v_theft <dbl>, grand_total <dbl>,
#   percent_change <chr>, violent_crime_total <dbl>,
#   violent_crime_percent <chr>, violent_crime_percent_change <chr>,
#   property_crime_totals <dbl>, property_crime_percent <chr>,
#   property_crime_percent_change <chr>,
#   overall_crime_rate_per_100000_people <dbl>,
#   overall_percent_change_per_100000_people <chr>, …

# first plot: property vs. violent crime rate, coloured by county
# (theme_tufte() replaces the whole theme, so no other theme call is needed)
p1 <- ggplot(dataset_2020, aes(x = property_crime_rate_per_100000_people,
                               y = violent_crime_rate_per_100000_people)) +
  geom_point(aes(colour = county), size = 1.5) +
  labs(title = "Property Crime Vs. Violent Crime In Maryland \n Per 100,000",
       caption = "Source: opendata.maryland.gov",
       x = "Property Crime rates in Maryland per 100,000 (2020)",
       y = "Violent Crime rates in Maryland per 100,000 (2020)") +
  theme_tufte()
p1

# label each point with its jurisdiction (city or town)
p2 <- ggplot(dataset_2020, aes(x = property_crime_rate_per_100000_people,
                               y = violent_crime_rate_per_100000_people)) +
  geom_point(aes(colour = county), size = 1.5) +
  geom_text_repel(aes(label = jurisdiction), nudge_x = 0.5, size = 1.8) +
  labs(title = "Property Crime Vs. Violent Crime In Maryland \n Per 100,000",
       caption = "Source: opendata.maryland.gov",
       x = "Property Crime rates in Maryland per 100,000 (2020)",
       y = "Violent Crime rates in Maryland per 100,000 (2020)") +
  theme_tufte()
p2

Warning: ggrepel: 114 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
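
The warning above means ggrepel skipped labels it could not place without overlap. One option is to raise `max.overlaps`, an argument of `geom_text_repel()`. The sketch below uses made-up points (the data frame `toy` and the plot name `p_labeled` are hypothetical) that sit very close together, so labels would otherwise collide.

```r
library(ggplot2)
library(ggrepel)

# Made-up points, deliberately close together so that labels collide.
toy <- data.frame(
  x    = c(1, 1.01, 1.02, 1.03, 1.04),
  y    = c(1, 1.01, 1.02, 1.03, 1.04),
  name = c("A", "B", "C", "D", "E")
)

p_labeled <- ggplot(toy, aes(x, y, label = name)) +
  geom_point() +
  # max.overlaps = Inf asks ggrepel to draw every label, however crowded
  geom_text_repel(max.overlaps = Inf, size = 3)
p_labeled
```

With well over a hundred jurisdictions on one plot, drawing every label may be hard to read, so labeling only a subset of points is another reasonable choice.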

# focus on only Montgomery County and PG County
moco_pg <- dataset_2020 |>
  filter(county %in% c("Montgomery", "Prince George's"))
head(moco_pg)

# A tibble: 6 × 32
  jurisdiction county  year population murder  rape robbery agg__assault `b_&_e`
  <chr>        <chr>  <dbl>      <dbl>  <dbl> <dbl>   <dbl>        <dbl>   <dbl>
1 Barnesville  Montg…  2020        180      0     0       0            0       0
2 Berwyn Heig… Princ…  2020       3269      0     0       4            1       5
3 Bladensburg  Princ…  2020       9455      0     3      18           42      22
4 Bowie        Princ…  2020      59008      1    11      18           44      68
5 Brentwood    Princ…  2020       3485      0     1       6            3      10
6 Brookeville  Montg…  2020        144      0     0       0            0       0
# ℹ 23 more variables: larceny_theft <dbl>, m_v_theft <dbl>, grand_total <dbl>,
#   percent_change <chr>, violent_crime_total <dbl>,
#   violent_crime_percent <chr>, violent_crime_percent_change <chr>,
#   property_crime_totals <dbl>, property_crime_percent <chr>,
#   property_crime_percent_change <chr>,
#   overall_crime_rate_per_100000_people <dbl>,
#   overall_percent_change_per_100000_people <chr>, …

# Similar graph that shows only Montgomery County (MOCO) and Prince George's County (PG)
p6 <- ggplot(moco_pg, aes(x = property_crime_rate_per_100000_people,
                          y = violent_crime_rate_per_100000_people,
                          label = jurisdiction)) +
  geom_point(aes(colour = county), size = 2) +
  geom_text_repel(nudge_x = 0.5, size = 2) +
  labs(title = "Property Crime Vs. Violent Crime In Maryland \n Per 100,000 (MOCO and PG)",
       caption = "Source: opendata.maryland.gov",
       x = "Property Crime rates in Maryland per 100,000 (2020)",
       y = "Violent Crime rates in Maryland per 100,000 (2020)") +
  theme_tufte() +
  scale_color_brewer(palette = "Oranges")
p6

Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

# linear model
p7 <- p6 + geom_smooth(method='lm',formula=y~x)
p7

Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
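
The line added by `geom_smooth()` here appears to be fitted to both counties pooled together, because `colour = county` is mapped only inside `geom_point()`. To compare the two counties, a separate regression can be fitted to each. The sketch below shows one way to do that in base R on made-up rates (the data frame `toy` and its columns `property` and `violent` are hypothetical stand-ins, not values from the Maryland data).

```r
# Made-up rates for illustration only; not values from the Maryland data.
toy <- data.frame(
  county   = rep(c("Montgomery", "Prince George's"), each = 4),
  property = c(500, 1000, 1500, 2000, 500, 1000, 1500, 2000),
  violent  = c(60, 110, 160, 210, 150, 300, 450, 600)
)

# One regression per county: split the rows by county, fit lm() on each
# piece, and collect the intercept and slope of every fit.
fits <- lapply(
  split(toy, toy$county),
  function(d) lm(violent ~ property, data = d)
)
sapply(fits, coef)
```

Applied to `moco_pg`, the same pattern would give one slope per county, which can then be compared directly.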

# linear regression of violent crime rate (y) on property crime rate (x)
fit5 <- lm(violent_crime_rate_per_100000_people ~ property_crime_rate_per_100000_people, data = moco_pg)
summary(fit5)


Call:
lm(formula = violent_crime_rate_per_100000_people ~ property_crime_rate_per_100000_people, 
    data = moco_pg)

Residuals:
    Min      1Q  Median      3Q     Max 
-481.22 -128.20  -25.07   63.13  526.85 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           47.34135   49.99854   0.947    0.349    
property_crime_rate_per_100000_people  0.12119    0.02164   5.600 1.39e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 196.2 on 43 degrees of freedom
Multiple R-squared:  0.4217,    Adjusted R-squared:  0.4083 
F-statistic: 31.36 on 1 and 43 DF,  p-value: 1.393e-06
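
To make the coefficients easier to read, the short calculation below plugs the estimates from the summary above into the fitted line. The property crime rate of 2,000 per 100,000 is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the summary(fit5) output above.
intercept <- 47.34135
slope     <- 0.12119

# Expected change in the violent crime rate for 100 more property crimes
# per 100,000 people.
slope * 100

# Fitted violent crime rate at a hypothetical property crime rate of 2,000.
predicted <- intercept + slope * 2000
predicted

# With a single predictor, the correlation is the square root of R-squared
# (positive here because the slope is positive).
r <- sqrt(0.4217)
r
```

In words: each additional 100 property crimes per 100,000 people is associated with about 12 more violent crimes per 100,000, and the correlation between the two rates is roughly 0.65.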

I cleaned this dataset with tolower and gsub, converting the column names to lowercase, replacing spaces, periods, parentheses, slashes, and hyphens with underscores, and removing commas. I then used filter to keep only the year 2020 and only the jurisdictions in Montgomery County and Prince George’s County. The visualization suggests that jurisdictions in Prince George’s County mostly have higher violent crime rates than those in Montgomery County: their points, shown in a different color, sit higher on the plot. The linear model shows a statistically significant positive relationship between the property crime rate and the violent crime rate per 100,000 (slope = 0.121, p = 1.39e-06). Jurisdictions with higher property crime rates tend to have higher violent crime rates as well, although the property crime rate explains only about 42% of the variation in the violent crime rate (R-squared = 0.42), and the association by itself does not show that one causes the other.
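
The comparison between the two counties is read off the plot; a numeric summary could back it up. The sketch below shows one way to compute a median rate per county with base R. The numbers are made up for illustration (the data frame `toy` and its `violent` column are hypothetical), so the output says nothing about the real counties.

```r
# Made-up violent crime rates for illustration only.
toy <- data.frame(
  county  = c("Montgomery", "Montgomery", "Montgomery",
              "Prince George's", "Prince George's", "Prince George's"),
  violent = c(100, 150, 200, 300, 400, 500)
)

# Median violent crime rate per county
county_medians <- aggregate(violent ~ county, data = toy, FUN = median)
county_medians
```

Using the median rather than the mean keeps a single very small or very large town from dominating the comparison.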