Chicago Crime Data

library(data.table)
library(plyr)
library(dplyr)
library(lubridate)
library(scales)
library(ggthemes)
library(ggplot2)
library(ggrepel) 
library(RColorBrewer)
library(DescTools)

setwd("C:/Users/Michael/Documents/Grad loyola fall 2021/GB 736")
filename <- "Crimes 2001 to Present.csv"
# Treat both the literal string "NA" and empty fields as missing values.
df <- fread(filename, na.strings = c("NA", ""))

Introduction to Chicago Crime

The city of Chicago is often considered the crown jewel of the Midwest, sitting prominently on the shores of Lake Michigan. It has been the crossroads of the United States since the birth of the railways. Its position as a hub of trade and travel on the national rail network drove rapid growth, and after a large influx of people over the years, Chicago now has the third-largest population of any city in the United States.

Dataset

The Chicago Crime data set records various aspects of each crime committed. The main components shown in the visualizations are the police districts where crimes were committed, the type of crime, and the days of the week, year, and time of day of the crimes. Focusing on these aspects can show which parts of the city experience more crime and suggest possible explanations. The data set contains over seven million observations, which allows it to tell a fleshed-out story of the crime patterns in Chicago. It spans twenty years, from 2001 to 2020, and that length of time adds depth because crime trends take years to emerge. The data set contains 22 columns and is 1.58 GB in size. The variables I chose to build my graphs contained very few NA rows; the only one that contained any NAs was the District column. This gives confidence that each visualization is a true representation of crime within Chicago for the years 2001-2020. I obtained this data set through my professor, Paul Tallon. If you would like to obtain the data set, please contact Prof. Tallon for more information.
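The claim about missing values can be checked with a per-column count of NAs. The following is a minimal sketch in base R on an invented toy table (the `toy` data frame and its values are assumptions for illustration, not rows from the real file); the same two calls could be pointed at `df`.

```r
# Toy stand-in for the crime table; the values are invented for illustration.
toy <- data.frame(
  District = c(8, 11, NA, 7, 8),
  Year     = c(2001, 2005, 2010, 2015, 2020),
  Type     = c("THEFT", "BATTERY", "THEFT", "THEFT", "BATTERY")
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of rows with a missing District, as a percentage.
pct_missing_district <- 100 * mean(is.na(toy$District))
print(pct_missing_district)
```

Reporting the share as well as the count makes it easier to judge whether dropping the missing rows could distort a chart.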

Findings

Crimes by District

This graph shows the ten police districts with the highest crime totals from 2001 to 2020. I think it is a good place to begin looking at crime in Chicago because it establishes where in the city crime is being reported. Because Chicago is a massive city by both population and area, limiting the chart to the top ten police districts gives a cleaner, more detailed view. The top three districts, Chicago Lawn, Harrison, and Englewood, are on the South and West sides of Chicago. These areas are generally seen as the rough parts of the city, so it is unsurprising that their police districts have the highest crime totals. This graph can highlight the areas of the city that someone might want to avoid if they did not want to fall victim to a crime.

DistrictCount <- data.frame(dplyr::count(df, District))
DistrictCount <- DistrictCount[order(DistrictCount$n, decreasing = TRUE), ]
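# Drop the NA district with a logical filter; negative indexing with an empty
# which() result would silently remove every row.
DistrictCount <- DistrictCount[!is.na(DistrictCount$District), ]
rownames(DistrictCount) <- seq_len(nrow(DistrictCount))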
DistrictCount$District <- as.character(DistrictCount$District) 
DistrictCount$n <- as.numeric(DistrictCount$n)
# Names are assigned by position, so they must follow the sorted row order above.
DistrictCount$District_Name <- c("Chicago Lawn", "Harrison", "Englewood", "Gresham", "Grand Central", "South Chicago", 
                                  "Grand Crossing", "Deering", "Near West", "Wentworth", "Town Hall", "Near North",
                                  "Calumet", "Austin", "Ogden", "Central", "Shakespeare", "Jefferson Park", "Morgan Park",
                                  "Rogers Park", "Albany Park", "Lincoln", "NA", "NA")
max_x <- round_any(max(DistrictCount$n), 550000, ceiling)
agg_tot_crimes <- DistrictCount[1:10, ] %>%
  select(District_Name, n) %>%
  group_by(District_Name) %>%
  summarise(tot = sum(n), .groups = 'keep') %>%
  data.frame()
ggplot(DistrictCount[1:10,], aes(x = reorder(District_Name, n), y = n)) +
  geom_bar(colour = "black", fill = "lightblue", stat = "identity") +
  labs(title = "Number of Crimes by Police District (Top Ten)", x = "Police District", y ="Number of Crimes") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma, 
                     breaks = seq(0, max_x, by = 100000),
                     limits = c(0, max_x)) +
  coord_flip() +
  geom_text(data = agg_tot_crimes, aes(x = District_Name, y = tot, label = scales::comma(tot), fill = NULL), hjust = -0.1, size = 4)
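Assigning district names by position works only while the sorted row order stays exactly as expected. One alternative is to key the names by district number. The sketch below uses invented counts, and the three number-to-name pairs are assumptions that should be checked against the Chicago Police Department's district list before use.

```r
# Invented counts; the number-to-name pairs are assumptions to verify.
toy_counts <- data.frame(District = c(11, 8, 7), n = c(300, 500, 100))

# Named vector: names are district numbers, values are district names.
district_names <- c("7" = "Englewood", "8" = "Chicago Lawn", "11" = "Harrison")

# Look names up by key, so sorting or dropping rows cannot mislabel a district.
toy_counts$District_Name <- unname(district_names[as.character(toy_counts$District)])

# Rank the districts from most to fewest crimes.
toy_counts <- toy_counts[order(toy_counts$n, decreasing = TRUE), ]
print(toy_counts)
```

With a keyed lookup, a district missing from the table yields NA for its name rather than shifting every later label by one row.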

Crimes Per Year

After viewing the police districts with the highest crime totals, one might ask, “How many crimes are committed per year in Chicago?” Surprisingly, 2020 recorded the lowest number of crimes in the data set at 184,951. This is a significant drop from the 2002 high of 486,764, a 62% decrease in crimes per year from 2002 to 2020. This graph provides a baseline understanding of the overall amount of crime recorded in Chicago.
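The percentage quoted above follows directly from the two yearly totals. As a quick check, using only the figures given in the text:

```r
# Yearly totals quoted in the text above.
crimes_2002 <- 486764
crimes_2020 <- 184951

# Percent decrease relative to the 2002 peak.
pct_drop <- 100 * (crimes_2002 - crimes_2020) / crimes_2002
print(round(pct_drop, 1))
```

The decrease is measured against the 2002 total, so it reads as the share of the peak-year volume that had disappeared by 2020.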

# The axis limit should come from the largest yearly count, not from the year value itself.
max_y <- round_any(max(table(df$Year)), 600000, ceiling)
Crimes_per_year <- ggplot(df, aes(x = Year)) +
  geom_histogram(bins = 20, color = "black", fill = "lightgreen") +
  labs(title = "Crimes per year in Chicago", x = "Year", y = "Count of Crimes") +
  scale_y_continuous(labels=comma, 
                     breaks = seq(0, max_y, by = 100000),
                     limits = c(0, max_y)) +
  coord_flip() +
  stat_bin(binwidth = 1, geom = 'text', color = 'black', aes(label = scales::comma(after_stat(count))), hjust = -0.4) +
  theme(plot.title = element_text(hjust = 0.5)) 
x_axis_labels <- min(df$Year):max(df$Year)
Crimes_per_year <- Crimes_per_year + scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels)
Crimes_per_year

Crime Type

In this graph, crime is broken down into the top ten crime types, with all remaining crimes totaled into an eleventh category, All Other Crimes. One interesting feature of this data set is that the Chicago Police Department has its own category named Other Offense. Because that category comes with no explanation, it is hard to know what constitutes an “other offense”, but it is different from All Other Crimes, which holds the remaining crimes that are not in the top ten. For each category, the number at the end of the bar is the sum across every year in the data set. Theft is the clear leader by almost 200,000. Theft and battery are the two largest categories, which is understandable because a wide variety of crimes can fall under a general label such as theft or battery. In the graph, the blocks representing each year get smaller moving from 2001 to 2020. This is further evidence of the falling number of crimes committed in Chicago.
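The top-N-plus-remainder grouping can be illustrated on a small scale. This is a minimal base R sketch with invented records, keeping the top two types rather than ten; it also shows that the police department's own "OTHER OFFENSE" label is just another type that may or may not land in the catch-all bucket.

```r
# Invented toy records; the real data has many more crime types.
type <- c("THEFT", "THEFT", "THEFT", "BATTERY", "BATTERY",
          "OTHER OFFENSE", "ARSON", "GAMBLING")

# Rank the types by frequency and keep the top two (the report keeps ten).
counts    <- sort(table(type), decreasing = TRUE)
top_types <- names(counts)[1:2]

# Everything outside the top set is relabeled into one catch-all bucket.
lumped <- ifelse(type %in% top_types, type, "ALL OTHER CRIMES")
print(table(lumped))
```

In this toy case "OTHER OFFENSE" falls outside the top two and is absorbed into the bucket, whereas in the report it ranks in the top ten and keeps its own bar.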

df_change <- df %>%
  rename(Type = "Primary Type")
df_reasons <- count(df_change, Type)
df_reasons <- df_reasons[order(df_reasons$n, decreasing = TRUE), ]
top_reasons <- df_reasons$Type[1:10]
new_df <- df_change %>%
  filter(Type %in% top_reasons) %>%
  select(Year, Type) %>%
  group_by(Type, Year) %>%
  summarise(n = length(Type), .groups = 'keep') %>%
  data.frame()
other_df <- df_change %>%
  filter(!Type %in% top_reasons) %>%
  select(Year) %>%
  mutate(Type = "ALL OTHER CRIMES") %>%
  group_by(Type, Year) %>%
  summarise(n = length(Type), .groups = 'keep') %>%
  data.frame()
stacked_df <- rbind(new_df, other_df)
agg_tot <- stacked_df %>%
  select(Type, n) %>%
  group_by(Type) %>%
  summarise(tot = sum(n), .groups = 'keep') %>%
  data.frame()
stacked_df$Year <- as.factor(stacked_df$Year)  
max_y <- round_any(max(agg_tot$tot), 500000, ceiling)
colorcount <- length(unique(stacked_df$Year))
getpalette <- colorRampPalette(brewer.pal(9, "Set1"))
ggplot(stacked_df, aes(x = reorder(Type, n, sum), y = n, fill = Year)) +
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
  coord_flip() +
  labs(title = "Crimes Count by Crime Type", x = "", y = "Crime Count", fill = "Year") +
  theme_hc() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = getpalette(colorcount)) +
  geom_text(data = agg_tot, aes(x = Type, y = tot, label = scales::comma(tot), fill = NULL), hjust = -0.1, size = 4) +
  scale_y_continuous(labels = comma,
                     breaks = seq(0, max_y, by = 500000),
                     limits = c(0, max_y))

Type of Crime by Year

In this graph, the types of crime are broken out by year to give the viewer an under-the-hood look at which crimes are committed each year. It combines the decreasing crime totals from 2001 to 2020 with the mix of crime types in each year. The top row shows the first five years in the data set, while the bottom row shows the last five. There is a visible difference between the amount of crime committed in 2001 and in 2020. In 2001, theft and battery were each nearly 100,000, whereas in 2020 both are below 50,000. This level of detail shows crime decreasing, just as the second visualization does, and breaks each year's total out by type of crime. The user can see which crimes, such as theft and battery, make up the majority of crimes in each year.

# Bars in this chart are filled by crime type, so size the palette by Type rather than Year.
colorcount <- length(unique(stacked_df$Type))
getpalette <- colorRampPalette(brewer.pal(9, "Set1"))
ggplot(stacked_df, aes(x = Type, y = n, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_blank()) +
  scale_y_continuous(labels = comma) +
  labs(title = "Total Type of Crime by Year",
       x = "Type of Crime",
       y = "Crime Count",
       fill = "Type of Crime") +
  scale_fill_manual(values = getpalette(colorcount)) +
  facet_wrap(~Year, ncol = 5, nrow = 4)
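The faceted chart is in effect a type-by-year cross-tabulation drawn as small multiples. A minimal sketch of that underlying table, using invented records in base R rather than the report's dplyr pipeline:

```r
# Invented records: one row per reported crime.
toy <- data.frame(
  Year = c(2001, 2001, 2001, 2020, 2020),
  Type = c("THEFT", "THEFT", "BATTERY", "THEFT", "BATTERY")
)

# Cross-tabulate: rows are crime types, columns are years.
by_type_year <- table(toy$Type, toy$Year)
print(by_type_year)

# Column sums recover the yearly totals.
yearly_totals <- colSums(by_type_year)
print(yearly_totals)
```

Each column of the table corresponds to one facet, and the column sums match the yearly totals shown in the crimes-per-year chart.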

Crimes by Hour

This line plot of crimes by hour of the day shows when crimes are committed. A reasonable assumption is that nighttime would be the peak period, since it is considered the typical time for nefarious activity. Although the hours from 7pm to 12am are on the higher side of the graph, the largest number of crimes was committed during the 12pm-1pm hour. This is the time of day when the most people are active in the city, whether they are on a lunch break, shopping, or just out and about. At the opposite end, the lowest count occurs during the 5am-6am hour. At just under 100,000 crimes, it suggests that most people in Chicago are still asleep. As the sun comes up, the crime count rises steadily until it reaches the noon high of 414,008 crimes.
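The hourly counts depend on extracting the hour from each timestamp. The report does this with lubridate; the sketch below is a base R alternative on invented timestamps, assuming a month/day/year format with a 12-hour clock and an AM/PM marker.

```r
# AM/PM parsing depends on the time locale, so pin it to "C" for this sketch.
invisible(Sys.setlocale("LC_TIME", "C"))

# Invented timestamps in an assumed month/day/year 12-hour format.
stamps <- c("01/01/2020 12:15:00 PM", "01/01/2020 12:45:00 PM",
            "01/02/2020 05:30:00 AM", "01/02/2020 11:00:00 PM",
            "01/03/2020 12:05:00 AM")

# Parse the strings and pull out the hour on a 24-hour clock (0 to 23).
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
hours  <- as.integer(format(parsed, "%H"))

# Count records per hour, keeping empty hours so all 24 appear.
per_hour  <- table(factor(hours, levels = 0:23))
peak_hour <- as.integer(names(per_hour)[which.max(per_hour)])
print(peak_hour)
```

Note that 12:05 AM maps to hour 0 and 12:15 PM to hour 12, so the "12pm-1pm" peak in the chart corresponds to hour 12 on the 24-hour axis.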

hours_df <- df %>%
  select(Date) %>%
  mutate(hour24 = hour(mdy_hms(Date))) %>%
  group_by(hour24) %>%
  summarise(n = length(Date), .groups = 'keep') %>%
  data.frame()
x_axis_labels <- min(hours_df$hour24):max(hours_df$hour24)
hi_low <- hours_df %>%
  filter(n == min(n) | n == max(n)) %>%
  data.frame()
ggplot(hours_df, aes(x = hour24, y = n)) +
  geom_line(color = 'black', size = 1) +
  geom_point(shape = 21, size = 4, color = 'blue', fill = 'white') +
  labs(x = "Hour", y = "Crimes Count", title = "Crimes by Hour") +
  scale_y_continuous(labels = comma) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels, minor_breaks = NULL) +
  geom_point(data = hi_low, aes(x = hour24, y = n), shape = 21, size = 4, fill = 'blue', color = 'blue') +
  geom_label_repel(aes(label = ifelse(n == max(n) | n == min(n), scales::comma(n), "")),
                   box.padding = 1, 
                   point.padding = 1, 
                   size = 4, 
                   color = 'Grey50', 
                   segment.color = 'darkblue')

Conclusion

This data offers several ways to look at crime in the city of Chicago. Breaking crime out by police district shows where in the city crime is being committed. From there, it's a good idea to look at the total crime per year along with the types of crimes being committed. These two graphs give a deep dive into the falling number of crimes per year, and the breakdown by type helps the user understand the nature of the crime. The data revealed that theft and battery were the two most common crimes. The next graph combined aspects of the previous two to show the decreasing number of crimes over the years alongside the types of crimes per year, letting the user see the makeup of the crimes committed each year. The final graph shows the times of day at which the most and fewest crimes are committed. It revealed that the middle of the day has the largest number of crimes, whereas the 5am hour has the lowest, when most people are asleep.