Final Project

For my final project I looked at crime in the DMV (the District of Columbia, Maryland, and Virginia) from 1960 to 2019. I examined both the totals and the rates for property and violent crimes. I chose this topic because I am extremely interested in criminal justice and love to see what stories the data tell. Keep in mind that crime data covers only the crimes that are reported and might not represent the full picture. The dataset I used comes from the CORGIS collection hosted on GitHub (https://corgis-edu.github.io/corgis/csv/state_crime/). It was built from public information from the Uniform Crime Reporting Statistics, in collaboration with the U.S. Department of Justice and the Federal Bureau of Investigation. Before we get started, here are the descriptions of my key variables as provided by the authors on GitHub.

  1. Data.Population - The number of people living in this state at the time the report was created.

  2. Data.Rates.Property.All - Rates are the number of reported offenses per 100,000 population. This property reflects all of the Property-related crimes, including burglaries, larcenies, and motor crimes.

  3. Data.Rates.Violent.All - Rates are the number of reported offenses per 100,000 population. This property reflects all of the Violent crimes, including assaults, murders, rapes, and robberies.

  4. Data.Totals.Property.All - This property reflects all of the Property-related crimes, including burglaries, larcenies, and motor crimes.

  5. Data.Totals.Violent.All - This property reflects all of the Violent crimes, including assaults, murders, rapes, and robberies.
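As a quick sanity check on the rate definition, a rate can be recomputed from a total and the population. The sketch below uses the District of Columbia's 1960 figures that appear in the previews later in this report (population 763,956; 16,495 property crimes; 4,230 violent crimes). The helper name rate_per_100k is my own and is not part of the dataset.

```r
# Hypothetical helper (not from the dataset): reported offenses per 100,000 residents.
rate_per_100k <- function(total, population) {
  total / population * 100000
}

# District of Columbia, 1960 (values taken from the previews in this report).
dc_property_rate <- rate_per_100k(16495, 763956)
dc_violent_rate  <- rate_per_100k(4230, 763956)

round(dc_property_rate)  # 2159
round(dc_violent_rate)   # 554
```

Both results agree with the rounded rates shown for DC in 1960, which suggests the rate columns are derived from the totals and the population.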

Load in libraries and dataset

As always I first load in my libraries and then my csv file. I then use head to preview the dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.3.1
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.1
library(cowplot)
## Warning: package 'cowplot' was built under R version 4.3.1
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:patchwork':
## 
##     align_plots
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
library(dplyr)

state_crime <- read_csv("state_crime.csv")
## Rows: 3115 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): State
## dbl (20): Year, Data.Population, Data.Rates.Property.All, Data.Rates.Propert...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(state_crime)
## # A tibble: 6 × 21
##   State    Year Data.Population Data.Rates.Property.All Data.Rates.Property.Bu…¹
##   <chr>   <dbl>           <dbl>                   <dbl>                    <dbl>
## 1 Alabama  1960         3266740                   1035.                     356.
## 2 Alabama  1961         3302000                    986.                     339.
## 3 Alabama  1962         3358000                   1067                      349.
## 4 Alabama  1963         3347000                   1151.                     377.
## 5 Alabama  1964         3407000                   1359.                     467.
## 6 Alabama  1965         3462000                   1393.                     474.
## # ℹ abbreviated name: ¹​Data.Rates.Property.Burglary
## # ℹ 16 more variables: Data.Rates.Property.Larceny <dbl>,
## #   Data.Rates.Property.Motor <dbl>, Data.Rates.Violent.All <dbl>,
## #   Data.Rates.Violent.Assault <dbl>, Data.Rates.Violent.Murder <dbl>,
## #   Data.Rates.Violent.Rape <dbl>, Data.Rates.Violent.Robbery <dbl>,
## #   Data.Totals.Property.All <dbl>, Data.Totals.Property.Burglary <dbl>,
## #   Data.Totals.Property.Larceny <dbl>, Data.Totals.Property.Motor <dbl>, …

Cleaning chunks

First I change the column names to lowercase and replace the periods with underscores. I use head to make sure everything worked.

colnames(state_crime) <- tolower(names(state_crime))
colnames(state_crime) <- str_replace_all(colnames(state_crime), "\\.", "_")

head(state_crime)
## # A tibble: 6 × 21
##   state    year data_population data_rates_property_all data_rates_property_bu…¹
##   <chr>   <dbl>           <dbl>                   <dbl>                    <dbl>
## 1 Alabama  1960         3266740                   1035.                     356.
## 2 Alabama  1961         3302000                    986.                     339.
## 3 Alabama  1962         3358000                   1067                      349.
## 4 Alabama  1963         3347000                   1151.                     377.
## 5 Alabama  1964         3407000                   1359.                     467.
## 6 Alabama  1965         3462000                   1393.                     474.
## # ℹ abbreviated name: ¹​data_rates_property_burglary
## # ℹ 16 more variables: data_rates_property_larceny <dbl>,
## #   data_rates_property_motor <dbl>, data_rates_violent_all <dbl>,
## #   data_rates_violent_assault <dbl>, data_rates_violent_murder <dbl>,
## #   data_rates_violent_rape <dbl>, data_rates_violent_robbery <dbl>,
## #   data_totals_property_all <dbl>, data_totals_property_burglary <dbl>,
## #   data_totals_property_larceny <dbl>, data_totals_property_motor <dbl>, …
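The two renaming steps above can also be folded into a single call. This is a minimal sketch, assuming dplyr 1.0.0 or later for rename_with(); the one-row tibble named toy is made up for illustration and only borrows two column names from the dataset.

```r
library(dplyr)
library(stringr)

# Toy tibble with two of the original column names (one illustrative row).
toy <- tibble(State = "Alabama", Data.Rates.Property.All = 1035)

# Lowercase each name and swap literal periods for underscores in one pass.
# fixed(".") matches a literal period, so no regex escaping is needed.
toy_clean <- rename_with(toy, ~ str_replace_all(tolower(.x), fixed("."), "_"))

names(toy_clean)
```

The same rename_with() call should work on state_crime directly, since it only touches the column names.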

Filter by the correct States

I then filter by the DMV and preview the results.

filtered_state_crime <- state_crime %>%
  filter(state %in% c("District of Columbia", "Maryland", "Virginia"))

head(filtered_state_crime)
## # A tibble: 6 × 21
##   state       year data_population data_rates_property_…¹ data_rates_property_…²
##   <chr>      <dbl>           <dbl>                  <dbl>                  <dbl>
## 1 District …  1960          763956                  2159.                   600.
## 2 District …  1961          763956                  2237.                   642.
## 3 District …  1962          784000                  2227.                   641.
## 4 District …  1963          798000                  2612                    875.
## 5 District …  1964          808000                  3122.                  1103.
## 6 District …  1965          803000                  3497                   1231.
## # ℹ abbreviated names: ¹​data_rates_property_all, ²​data_rates_property_burglary
## # ℹ 16 more variables: data_rates_property_larceny <dbl>,
## #   data_rates_property_motor <dbl>, data_rates_violent_all <dbl>,
## #   data_rates_violent_assault <dbl>, data_rates_violent_murder <dbl>,
## #   data_rates_violent_rape <dbl>, data_rates_violent_robbery <dbl>,
## #   data_totals_property_all <dbl>, data_totals_property_burglary <dbl>,
## #   data_totals_property_larceny <dbl>, data_totals_property_motor <dbl>, …

Filter by the correct columns

Next I select the columns I want to focus on, using head once again to check my work.

filtered_state_crime <- filtered_state_crime %>%
  select(state, year, data_population, data_rates_property_all, data_rates_violent_all, data_totals_property_all, data_totals_violent_all)

head(filtered_state_crime)
## # A tibble: 6 × 7
##   state       year data_population data_rates_property_…¹ data_rates_violent_all
##   <chr>      <dbl>           <dbl>                  <dbl>                  <dbl>
## 1 District …  1960          763956                  2159.                   554.
## 2 District …  1961          763956                  2237.                   588.
## 3 District …  1962          784000                  2227.                   606.
## 4 District …  1963          798000                  2612                    594 
## 5 District …  1964          808000                  3122.                   633.
## 6 District …  1965          803000                  3497                    723.
## # ℹ abbreviated name: ¹​data_rates_property_all
## # ℹ 2 more variables: data_totals_property_all <dbl>,
## #   data_totals_violent_all <dbl>

Outlier analysis

First I show a summary of my variables. I skip the state and year variables because state is a label and year is just the time index, so outliers are not meaningful for them. For the remaining five variables, I chose to make box plots.

From my box plots I find potential outliers in two variables: the property rate and the violent rate.

summary(filtered_state_crime)
##     state                year      data_population   data_rates_property_all
##  Length:180         Min.   :1960   Min.   : 519000   Min.   :1469           
##  Class :character   1st Qu.:1975   1st Qu.: 721250   1st Qu.:2752           
##  Mode  :character   Median :1990   Median :4385000   Median :4023           
##                     Mean   :1990   Mean   :3874850   Mean   :4296           
##                     3rd Qu.:2004   3rd Qu.:5701108   3rd Qu.:5177           
##                     Max.   :2019   Max.   :8535519   Max.   :9512           
##  data_rates_violent_all data_totals_property_all data_totals_violent_all
##  Min.   : 151.3         Min.   : 16495           Min.   : 4230          
##  1st Qu.: 305.0         1st Qu.: 47831           1st Qu.:10164          
##  Median : 641.5         Median :153503           Median :16749          
##  Mean   : 842.3         Mean   :134370           Mean   :19616          
##  3rd Qu.:1250.6         3rd Qu.:206980           3rd Qu.:26474          
##  Max.   :2921.8         Max.   :267625           Max.   :49757

Population Box Plot

As you can see, no potential outliers are present.

boxplot(filtered_state_crime$data_population,
  ylab = "Population")

Property Rates Box Plot

As you can see, two potential outliers are identified.

boxplot(filtered_state_crime$data_rates_property_all,
  ylab = "Property rate")

Violent Rates Box Plot

As you can see, two potential outliers are identified.

boxplot(filtered_state_crime$data_rates_violent_all,
  ylab = "Violent rate")

Property Total Box Plot

As you can see, no potential outliers are present.

boxplot(filtered_state_crime$data_totals_property_all,
  ylab = "Property Total")

Violent Total Box Plot

As you can see, no potential outliers are present.

boxplot(filtered_state_crime$data_totals_violent_all,
  ylab = "Violent Total")
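The points a box plot draws beyond the whiskers are the values more than 1.5 times the interquartile range away from the box, and boxplot.stats() returns those same values so they can be listed instead of read off the picture. The sketch below uses a made-up vector x, not the crime data, just to show the mechanics.

```r
# Made-up values for illustration only; not taken from the crime dataset.
x <- c(10, 11, 12, 12, 13, 50)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() draws.
# Here the hinges are 11 and 13, so anything below 8 or above 16 is flagged.
flagged <- boxplot.stats(x)$out
flagged  # 50
```

On the real data, calling boxplot.stats() on a column such as filtered_state_crime$data_rates_property_all should list the flagged rates, which makes it easier to look up which state and year they belong to.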

Start Plotting

I first want to make a plot showing the general population over time.

popplot <- ggplot(filtered_state_crime, aes(x = year, y = data_population)) +
  geom_point()
popplot

Editing the Graph

I add labels and a title as well as changing the shape of the points to reflect the state they represent. I also change the Y axis scale. I then change the theme to black and white.

popplot1 <- ggplot(filtered_state_crime, aes(x = year, y = data_population)) +
  geom_point(shape = ifelse(filtered_state_crime$state == "Maryland", 17, 
                     ifelse(filtered_state_crime$state == "Virginia", 18, 16))) +
  scale_y_continuous(labels = scales::comma, breaks = seq(0, max(filtered_state_crime$data_population), 1000000)) +
  labs(x = "Year", y = "Population", title = "DMV Population by Year") +
  theme_bw()
popplot1

Adding a Legend

Lastly, I add a legend to my graph so the shapes can be read.

popplot2 <- ggplot(filtered_state_crime, aes(x = year, y = data_population)) +
  geom_point(aes(shape = state)) +
  scale_y_continuous(labels = scales::comma, breaks = seq(0, max(filtered_state_crime$data_population), 1000000)) +
  labs(x = "Year", y = "Population", title = "DMV Population by Year") +
  theme_bw() +
  scale_shape_manual(values = c("District of Columbia" = 16, "Maryland" = 17, "Virginia" = 18), guide = guide_legend(title = "State Shapes"))
popplot2  

Gathering the totals data

I gather the property and violent totals and store them under crime_type and crime_total.

gathered_data1 <- filtered_state_crime %>%
  select(state, year, starts_with("data_totals_")) %>%
  pivot_longer(cols = starts_with("data_totals_"), names_to = "crime_type", values_to = "crime_total")

head(gathered_data1)
## # A tibble: 6 × 4
##   state                 year crime_type               crime_total
##   <chr>                <dbl> <chr>                          <dbl>
## 1 District of Columbia  1960 data_totals_property_all       16495
## 2 District of Columbia  1960 data_totals_violent_all         4230
## 3 District of Columbia  1961 data_totals_property_all       17093
## 4 District of Columbia  1961 data_totals_violent_all         4491
## 5 District of Columbia  1962 data_totals_property_all       17458
## 6 District of Columbia  1962 data_totals_violent_all         4750
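The reshaping step above turns one row per state and year into one row per state, year, and crime type, which is what later lets the plots color by crime_type. As a minimal sketch of the same idea, the example below pivots two District of Columbia rows copied from the preview above and also uses the names_prefix argument of pivot_longer() to drop the shared "data_totals_" prefix. The object names wide and long are my own.

```r
library(dplyr)
library(tidyr)

# Two District of Columbia rows copied from the preview above.
wide <- tibble(
  state = "District of Columbia",
  year = c(1960, 1961),
  data_totals_property_all = c(16495, 17093),
  data_totals_violent_all = c(4230, 4491)
)

long <- wide %>%
  pivot_longer(
    cols = starts_with("data_totals_"),
    names_to = "crime_type",
    names_prefix = "data_totals_",  # drop the shared prefix while pivoting
    values_to = "crime_total"
  )

long  # 4 rows: one per year and crime type
```

Dropping the prefix here is optional; I keep the full names in my own pipeline and relabel them in a later step.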

Gathering the rates data

I gather the property and violent rates and store them under crime_type and crime_rate.

gathered_data <- filtered_state_crime %>%
  select(state, year, starts_with("data_rates_")) %>%
  pivot_longer(cols = starts_with("data_rates_"), names_to = "crime_type", values_to = "crime_rate")

head(gathered_data)
## # A tibble: 6 × 4
##   state                 year crime_type              crime_rate
##   <chr>                <dbl> <chr>                        <dbl>
## 1 District of Columbia  1960 data_rates_property_all      2159.
## 2 District of Columbia  1960 data_rates_violent_all        554.
## 3 District of Columbia  1961 data_rates_property_all      2237.
## 4 District of Columbia  1961 data_rates_violent_all        588.
## 5 District of Columbia  1962 data_rates_property_all      2227.
## 6 District of Columbia  1962 data_rates_violent_all        606.

Changing the crime type names

I chose to change the names for crime rate here for a cleaner appearance.

modified_gathered_data <- gathered_data %>%
  mutate(crime_type = case_when(crime_type == "data_rates_property_all" ~ "Property Crime Rate",
    crime_type == "data_rates_violent_all" ~ "Violent Crime Rate", TRUE ~ crime_type))

head(modified_gathered_data)
## # A tibble: 6 × 4
##   state                 year crime_type          crime_rate
##   <chr>                <dbl> <chr>                    <dbl>
## 1 District of Columbia  1960 Property Crime Rate      2159.
## 2 District of Columbia  1960 Violent Crime Rate        554.
## 3 District of Columbia  1961 Property Crime Rate      2237.
## 4 District of Columbia  1961 Violent Crime Rate        588.
## 5 District of Columbia  1962 Property Crime Rate      2227.
## 6 District of Columbia  1962 Violent Crime Rate        606.

Changing the crime type names

I chose to change the names for total crime here for a cleaner appearance.

modified_gathered_data1 <- gathered_data1 %>%
  mutate(crime_type = case_when(crime_type == "data_totals_property_all" ~ "Total Property Crime",
    crime_type == "data_totals_violent_all" ~ "Total Violent Crime", TRUE ~ crime_type))

head(modified_gathered_data1)
## # A tibble: 6 × 4
##   state                 year crime_type           crime_total
##   <chr>                <dbl> <chr>                      <dbl>
## 1 District of Columbia  1960 Total Property Crime       16495
## 2 District of Columbia  1960 Total Violent Crime         4230
## 3 District of Columbia  1961 Total Property Crime       17093
## 4 District of Columbia  1961 Total Violent Crime         4491
## 5 District of Columbia  1962 Total Property Crime       17458
## 6 District of Columbia  1962 Total Violent Crime         4750
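Both relabeling steps follow the same pattern, so another option is to keep all four labels in one named vector and look them up. This is only a sketch of that alternative in base R; the names crime_labels and raw_types are hypothetical.

```r
# Named lookup: names are the original column names, values are display labels.
crime_labels <- c(
  data_rates_property_all  = "Property Crime Rate",
  data_rates_violent_all   = "Violent Crime Rate",
  data_totals_property_all = "Total Property Crime",
  data_totals_violent_all  = "Total Violent Crime"
)

# Index the lookup by the raw names to get the labels in the same order.
raw_types <- c("data_totals_violent_all", "data_rates_property_all")
unname(crime_labels[raw_types])
# "Total Violent Crime" "Property Crime Rate"
```

Unlike the case_when() version, a name that is missing from the lookup comes back as NA instead of being passed through unchanged, so the two approaches are not identical.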

Plot 1

I start with graphing the crime rate data with the rate per 100,000 being the Y axis and the year being the X axis. I chose to color by crime type so a clear difference is seen.

p1 <- ggplot(modified_gathered_data, aes(x = year, y = crime_rate, color = crime_type)) +
  geom_point() +
  labs(x = "Year", y = "Crime Rate per 100,000") +
  ggtitle("Crime Rates in the DMV by Year")
p1

Plot 2

I then change the shape of the points to reflect the states they represent. After that, I change the color of the points and the theme of the graph. I chose red for violent crimes and blue for property crimes because I feel that red is normally used to indicate more severity.

p2 <- ggplot(modified_gathered_data, aes(x = year, y = crime_rate, color = crime_type, shape = state)) +
  geom_point() +
  scale_shape_manual(values = c("District of Columbia" = 16, "Maryland" = 17, "Virginia" = 18)) +
  labs(x = "Year", y = "Crime Rate per 100,000", shape = "State", color = "Crime Type") +
  ggtitle("Crime Rates in the DMV by Year") +
  scale_color_manual(values = c("blue", "red")) +
  theme_light()
p2
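One thing to be aware of: an unnamed c("blue", "red") is assigned to the crime types in sorted order, so property crime ends up blue only because "Property" sorts before "Violent". A minimal sketch of a safer variant names each color explicitly. The data frame rates below holds District of Columbia rates for 1960 to 1962, rounded from the preview earlier in this report, and crime_colors is a name I made up.

```r
library(ggplot2)

# District of Columbia rates for 1960-1962, rounded from the earlier preview.
rates <- data.frame(
  year = rep(c(1960, 1961, 1962), each = 2),
  crime_type = rep(c("Property Crime Rate", "Violent Crime Rate"), times = 3),
  crime_rate = c(2159, 554, 2237, 588, 2227, 606)
)

# Naming each value pins the color to a label instead of to sort order.
crime_colors <- c("Property Crime Rate" = "blue", "Violent Crime Rate" = "red")

p <- ggplot(rates, aes(x = year, y = crime_rate, color = crime_type)) +
  geom_point() +
  scale_color_manual(values = crime_colors)
p
```

With named values the colors stay attached to the right crime type even if the labels are reworded or a third type is added later.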

Plot 3

I then add trend lines to indicate trends over time.

p3 <- ggplot(modified_gathered_data, aes(x = year, y = crime_rate, color = crime_type, shape = state)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_shape_manual(values = c("District of Columbia" = 16, "Maryland" = 17, "Virginia" = 18)) +
  labs(x = "Year", y = "Crime Rate per 100,000", shape = "State", color = "Crime Type") +
  ggtitle("Crime Rates in the DMV by Year") +
  scale_color_manual(values = c("blue", "red")) +
  theme_light()
p3
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
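The message printed with the plot just reports the defaults that geom_smooth() picked. Because both color and shape are mapped, I believe ggplot2 fits one loess curve per combination of state and crime type. The sketch below, which uses the built-in mtcars data as a stand-in for my crime data, shows how stating the method and formula explicitly keeps the same curve while silencing the message.

```r
library(ggplot2)

# mtcars is a stand-in dataset here; it has nothing to do with the crime data.
p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Spell out the defaults so ggplot2 has nothing to report.
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
p_smooth
```

The same two arguments could be added to the geom_smooth() calls in my plots without changing how they look.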

Plot 4

I then use facet wrap to split the graphs so you can focus on one type of crime at a time.

p4 <- ggplot(modified_gathered_data, aes(x = year, y = crime_rate, color = crime_type, shape = state)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_shape_manual(values = c("District of Columbia" = 16, "Maryland" = 17, "Virginia" = 18)) +
  labs(x = "Year", y = "Crime Rate per 100,000", shape = "State", color = "Crime Type") +
  ggtitle("Crime Rates in the DMV by Year") +
  scale_color_manual(values = c("blue", "red")) +
  theme_light() +
  facet_wrap(~ crime_type, ncol = 2) +
  guides(color = "none")
p4
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Plot 5

I start with graphing the total crime data with the total crime being the Y axis and the year being the X axis. I chose to color by crime type so a clear difference is seen.

p5 <- ggplot(modified_gathered_data1, aes(x = year, y = crime_total, color = crime_type)) +
  geom_point() +
  labs(x = "Year", y = "Crime Totals") +
  ggtitle("Crime Totals in the DMV by Year")
p5

Plot 6

I then change the shape of the points to reflect the states they represent. After that, I change the color of the points and the theme of the graph. I chose red for violent crimes and blue for property crimes because I feel that red is normally used to indicate more severity. I also set the Y axis breaks at intervals of 50,000.

p6 <- ggplot(modified_gathered_data1, aes(x = year, y = crime_total, color = crime_type, shape = state)) +
  geom_point() +
  scale_y_continuous(breaks = seq(0, max(gathered_data1$crime_total), 50000), labels = scales::comma) +
  scale_shape_manual(values = c("District of Columbia" = 16, "Maryland" = 17, "Virginia" = 18)) +
  labs(x = "Year", y = "Crime Totals", shape = "State", color = "Crime Type") +
  ggtitle("Crime Totals in the DMV by Year") +
  scale_color_manual(values = c("blue", "red")) +
  theme_light()
p6

Plot 7

I then add trend lines to indicate trends over time.

p7 <- ggplot(modified_gathered_data1, aes(x = year, y = crime_total, color = crime_type, shape = state)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_y_continuous(breaks = seq(0, max(gathered_data1$crime_total), 50000), labels = scales::comma) +
  scale_shape_manual(values = c("District of Columbia" = 16, "Maryland" = 17, "Virginia" = 18)) +
  labs(x = "Year", y = "Crime Totals", shape = "State", color = "Crime Type") +
  ggtitle("Crime Totals in the DMV by Year") + 
  scale_color_manual(values = c("blue", "red")) +
  theme_light()
p7
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Plot 8

I then use facet wrap to split the graphs so you can focus on one type of crime at a time.

p8 <- ggplot(modified_gathered_data1, aes(x = year, y = crime_total, color = crime_type, shape = state)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_y_continuous(breaks = seq(0, max(gathered_data1$crime_total), 50000), labels = scales::comma) +
  scale_shape_manual(values = c("District of Columbia" = 16, "Maryland" = 17, "Virginia" = 18)) +
  labs(x = "Year", y = "Crime Totals", shape = "State", color = "Crime Type") +
  ggtitle("Crime Totals in the DMV by Year") +
  scale_color_manual(values = c("blue", "red")) +
  theme_light() +
  facet_wrap(~ crime_type, ncol = 2) +
  guides(color = "none")
p8
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Plot 9 Final Graph 1

I now use cowplot to combine my population graph and my crime rates graph.

combined_plot <- plot_grid(popplot1, p3, ncol = 2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
combined_plot

Plot 10 Final Graph 2

I now use cowplot to combine my population graph and my crime totals graph.

combined_plot1 <- plot_grid(popplot1, p7, ncol = 2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
combined_plot1

Plot 11 Final Graph 3

I now use highcharter to make my crime rates information interactive.

hc <- highchart() %>%
  hc_xAxis(categories = unique(modified_gathered_data$year)) %>%
  hc_yAxis(title = list(text = "Crime Rates per 100,000")) %>%
  hc_add_series(data = modified_gathered_data, type = "scatter",
    hcaes(x = year, y = crime_rate, group = state, color = crime_type)) %>%
  hc_legend(enabled = TRUE) %>%
  hc_title(text = "Crime Rates in the DMV by Year") %>%
  hc_tooltip(pointFormat = "Year: {point.x}<br/>Crime Rate: {point.y}<br/>Crime Type: {point.crime_type}")
hc

Plot 12 Final Graph 4

I now use highcharter to make my crime totals information interactive.

hc1 <- highchart() %>%
  hc_xAxis(categories = unique(modified_gathered_data1$year)) %>%
  hc_yAxis(title = list(text = "Crime Totals")) %>%
  hc_add_series(data = modified_gathered_data1, type = "scatter",
    hcaes(x = year, y = crime_total, group = state, color = crime_type)) %>%
  hc_legend(enabled = TRUE) %>%
  hc_title(text = "Crime Totals in the DMV by Year") %>%
  hc_tooltip(pointFormat = "Year: {point.x}<br/>Crime Total: {point.y}<br/>Crime Type: {point.crime_type}")
hc1

Analysis

I think that when you combine the crime rates graph and the crime totals graph with the population growth graph, you get very interesting insights. Comparing the crime totals graph with the population graph, you would expect to see a gradual increase in both total property crime and total violent crime for Virginia and Maryland, and you would expect DC's totals to stay about the same because its population did not grow significantly. What you actually see is that Maryland's and Virginia's totals rise sharply starting in the 1960s and continuing until about the mid-1990s. DC saw a slight increase in the number of property crimes in the first decade, but it does not follow the same trend as Virginia and Maryland. For violent crimes, Maryland has a gradual increase until around 1995, while Virginia and DC remain rather constant.

However, when you compare the crime rates graph with the population graph, you see that DC's rate of crimes per 100,000 is drastically higher than Virginia's or Maryland's. This is true for both property and violent crime. The increase in crime is disproportionate to the growth in population: while DC's population remained relatively constant, until about 1996 it saw drastic increases in property crime and steady increases in violent crime. It is also interesting to note how little Virginia's violent crime changed over this time span.

It is interesting to think about what factors might have led to this drastic increase in crime from about 1960 to 1995, and then what might have led to the decline in crime across the board. The three decades leading into 1960 saw a dramatic fall in both property and violent crime, and 1960 marks the start of a more than three-decade uptrend in crime overall. The crime explosion is often linked to the demographic shift of the post-World War II baby boomers, but I would like to keep exploring this data and topic further. Some factors that might have played a role in the decrease in crime starting around 1995-2000 are better economic opportunities, less alcohol consumption, and more modern policing techniques.
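One way to make the comparison between population growth and crime growth more concrete is to express both as a percentage of their 1960 value. The sketch below is only an illustration of that idea: it uses the three District of Columbia rows (1960 to 1962) that appear in the previews earlier in this report, and the column names population_index and property_index are my own.

```r
library(dplyr)

# District of Columbia, 1960-1962, copied from the previews earlier in this report.
dc <- tibble(
  state = "District of Columbia",
  year = c(1960, 1961, 1962),
  data_population = c(763956, 763956, 784000),
  data_totals_property_all = c(16495, 17093, 17458)
)

# Express each series as a percentage of its 1960 value within each state.
dc_index <- dc %>%
  group_by(state) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    population_index = data_population / first(data_population) * 100,
    property_index = data_totals_property_all / first(data_totals_property_all) * 100
  ) %>%
  ungroup()

dc_index
```

Even in these first three years the property crime index moves ahead of the population index. Running the same steps on filtered_state_crime would extend the comparison to all three states and the full 1960 to 2019 range.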

Sources:
https://corgis-edu.github.io/corgis/csv/state_crime/
https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/downloads/download-printable-files
https://bjs.ojp.gov/content/pub/pdf/aus8009.pdf
https://bjs.ojp.gov/content/pub/pdf/htus8008.pdf
https://www.disastercenter.com/crime/dccrime.htm
https://quod.lib.umich.edu/h/humfig/11217607.0002.206/--decivilization-in-the-1960s?rgn=main;view=fulltext