IS460 Module 1 Final

My Dataset - Baltimore Crime

For this assignment, I was tasked with working with a data set of my choice and creating 5 visualizations that convey something about the data. Being that my data set is on Baltimore crime, it contains a lot of information about a specific instance of a crime, containing variables such as the time, date, location, crime type, and more. In the data set, there are 321,414 observations of 17 variables, there are 10 districts and also 14 different types of crime, leaving us with much to look at.

Graph 1

To start off, I wanted to ask and answer some basic questions about the data, like who, what, when, where, and why. With the restrictions of the data set, I couldn’t really figure out the who or why, so I focused what, when, and where. My first graph addresses the what, what types of crimes, and the where, where did they occur. It contains the data pertaining to the amount of crimes committed in each district and breaks it down to show which crimes occurred more or less. After looking at it for a little while, what struck me was the size of the larceny and larceny from auto crimes being committed in the Southeast district, as it is proportional much bigger than the other larceny buckets.

library(dplyr)
library(data.table)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(ggrepel)
library(lubridate)

mypath <- "C:/Users/nrhar/OneDrive/Documents/IS460"
setwd(mypath)

df <- fread("BPD_Part_1_Victim_Based_Crime_Data.csv")

df$CrimeDate <- mdy(df$CrimeDate)
df$CrimeTime <- hms(df$CrimeTime)
df$year <- year(df$CrimeDate)

df1 <- df %>%
  filter(District != "", Description != "") %>%
  data.frame()

df2 <- df1 %>%
  filter(Description != "RAPE") %>%
  select(District, Description) %>%
  group_by(District, Description) %>%
  summarise(n = length(District), .groups = 'keep') %>%
  data.frame()

df2totals <- df2 %>%
  select(District, n) %>%
  group_by(District) %>%
  summarise(total = sum(n), .groups = 'keep') %>%
  data.frame()

ggplot(df2, aes(x = District, y = n, fill = Description)) +
  geom_bar(stat = "identity") +
  labs(title = "Crime Types by District", x = "District", y = "Number of Crimes", fill = "Crime Type") +
  scale_y_continuous(labels = comma) +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(data = df2totals, aes(x = District, y = total, label = scales::comma(total), fill = NULL), vjust = -0.3)

Graph 2

Next, I wanted some more information about when the crimes were happening, so I broke the data down so that all crimes were connected to the hour of the day they were committed and plotted them. As you can see below, there are a very low number of crimes committed in the morning but they begin to rise at noon until they hit a peak level from roughly 4PM until midnight.

df3 <- df1 %>%
  select(CrimeTime) %>%
  mutate(hours = hour(hms(CrimeTime))) %>%
  group_by(hours) %>%
  summarise(m = length(CrimeTime), .groups = 'keep') %>%
  data.frame()

ggplot(df3, aes(x = hours, y = m)) +
  geom_line(color = 'black', size = 1) +
  geom_point(shape = 21, size = 4, color = 'blue', fill = 'blue') +
  labs(title = "Crimes by Hour", x = "Hour of the Day", y = "Number of Crimes") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma,
                     breaks = seq(1, 24, by = 1),
                     limits = c(1, 24)) +
  geom_label_repel(aes(label = scales::comma(m)),
                   box.padding = 1,
                   point.padding = 1, size = 4, color = "Grey50", segment.color = "darkblue")

Graph 3

With this information about Larceny in the Southeast district as well as the peak crime hours, I decided to zoom in further on those exact terms. So, in my following graph, I took just the reports of Larceny in the Southeast district between the hours of 4PM and midnight. It modeled a very similar curve to the above graph when looking at the time frame (4PM-Midnight) featured in the histogram, a rise to a peak at 6PM followed a steady decline, showing us how the Southeast district can be an accurate model for the rest of the city.

df4 <- df1 %>%
  filter(Description == "LARCENY" | Description == "LARCENY FROM AUTO", District == "SOUTHEAST") %>%
  select(CrimeTime) %>%
  mutate(df4hours = hour(hms(CrimeTime))) %>%
  data.frame()

df5 <- df4 %>%
  filter(df4hours >= 16) %>%
  data.frame()

x_labels <- min(df5$df4hours):max(df5$df4hours)

ggplot(df5, aes(x = df4hours)) +
  geom_histogram(bins = 8, color = "darkblue", fill = "blue") +
  labs(title = "Larceny by Hour in Southeast District, 4PM to Midnight", x = "Hour", y = "Larceny Count") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = x_labels, breaks = x_labels)

Graph 4

Next, I wanted to see if Larceny really is such a big problem in the Southeast district compared to others, as it is not the highest crime district overall, Northeast is (See Graph 1). To do this I made a trellis bar chart breaking down the data to larcenies by district and year. As you can see, the Southeast does year in and year out have significantly more Larcenies than any other district, however another trend can be seen, as in all districts the amount of larcenies is steadily decreasing.

df6 <- df1 %>%
  filter(Description == "LARCENY" | Description == "LARCENY FROM AUTO") %>%
  select(District, year) %>%
  group_by(year, District) %>%
  summarise(n = length(year), .groups = 'keep') %>%
  data.frame()

x_axis_labels <- min(df6$year):max(df6$year)

ggplot(df6, aes(x = year, y = n, fill = District)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma) +
  labs(title = "Larceny by Year by District", x = "Year", y = "Larceny Count", fill = "District") +
  scale_fill_brewer(palette = "Set3") +
  facet_wrap(~District, ncol = 5, nrow = 2)

Graph 5

So after noticing the trend of larcenies steadily decreasing, I wanted to see if that was just because of a crime drop in general or just in this specific type of crime. To do this, I constructed a trellis pie chart that broke down the percent of larcenies versus other crime by year from 2014 to 2020. This graph also shows a relatively steady decrease in the percent of crimes that are larceny compared to all others, which would suggest larcenies are decreasing independently from other types crime.

df7 <- df1 %>%
  filter(District == "SOUTHEAST", Description != "RAPE") %>%
  select(Description, year) %>%
  mutate(CrimeType = ifelse(Description == "LARCENY" | Description == "LARCENY FROM AUTO", "Larceny", "Other")) %>%
  group_by(year, CrimeType) %>%
  summarise(n = length(year), .groups = 'keep') %>%
  group_by(year) %>%
  mutate(percent_total = round(100*n/sum(n), 1)) %>%
  data.frame()

ggplot(df7, aes(x = "", y = n, fill = CrimeType)) +
  geom_bar(stat = "identity", position = "fill") +
  coord_polar(theta = "y", start = 0) +
  labs(fill = "States", x =NULL, y = NULL, title = "Crime Type by Year") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  facet_wrap(~year, ncol = 4, nrow = 2) +
  scale_fill_brewer(palette = "Reds") +
  geom_text(aes(x = 1.7, label = paste0(percent_total, "%")),
             size = 4,
             position = position_fill(vjust = 0.1))