Chicago Crime Dataset Analysis

Introduction

In this assignment we learned how to code in RStudio and build visualizations for many kinds of data. Data from an Excel sheet or a CSV file can be read in and represented through graphs such as heatmaps, bar charts, or line graphs.

The objective of this report is to examine trends in the chosen data and see how they changed between 2001 and 2021.

# Load all the libraries that were needed, set the working directory to where the files for this project are saved, and read in the data.

library(data.table)
library(dplyr)
library(ggplot2)
library(lubridate)
library(scales)
library(ggthemes)
library(ggrepel)
library(RColorBrewer)
library(plotly)

setwd("C:/Users/VictoriaKwortnik/Desktop/GB 736/R_datafiles")

filename = "Crimes 2001 to Present.csv"

df = fread(filename, na.strings = c("NA", ""))

Information on Dataset

The data set that I chose covers crime in Chicago. It comes from Professor Tallon’s Dropbox folder of datasets made available for class purposes. It originally came from the Chicago Data Portal and covers incidents of crime committed in the City of Chicago from 2001 through 2021. According to the portal, the data was extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

The data set has about 7.91 million rows and 22 columns, one column per variable. Each row is one reported crime along with its details. Key variables include a unique ID, the case number, the date when the incident occurred, the type of incident, the location description, whether an arrest was made, and the latitude/longitude coordinates of the event.

By looking at this data, we can see how crime in Chicago changed over time. This includes trends in how often each type of crime is committed over the years, as well as the times of day and days of the week when crimes are committed more often. We can also look at how many of these crimes ended in an arrest and where they took place within the city of Chicago.

Findings

The idea of this report was to link the visualizations so that together they tell a connected story of what crime looked like in Chicago. It starts generally, counting the number of crimes per year. It then looks at which kinds of crimes happen most often. Next it breaks each year down by day of the week to show on which day(s) crime happened more often. Narrowing in further, it shows the hours of the day when crime is most and least common. It then turns to the outcomes of these crimes and how many ended in arrests. Lastly, it looks at where these crimes took place over the years, whether on a street, in a residence, or somewhere else. Together, these visuals show what crime has been like in Chicago over these years.

# Gets rid of NA's that were found in the "Location Description" column of the data frame. A new dataframe without the NA's is made, to later be used for a graph.

loc_desc = df[(!is.na(df$`Location Description`))]
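
The NA filter above can be checked on a small example. The following is a minimal sketch in base R, assuming a made-up four-row table (`toy`, hypothetical) in place of the real data; only the `Location Description` column name comes from this report.

```r
# Hypothetical stand-in for the crime table; the real `df` comes from fread().
toy <- data.frame(
  ID = 1:4,
  `Location Description` = c("STREET", NA, "RESIDENCE", NA),
  check.names = FALSE,        # keep the space in the column name
  stringsAsFactors = FALSE
)

# Count missing values in every column before filtering.
na_per_column <- colSums(is.na(toy))

# Keep only the rows where the location description is present.
toy_loc <- toy[!is.na(toy$`Location Description`), ]

print(na_per_column)
print(nrow(toy_loc))
```

Comparing the row count before and after the filter is a quick way to confirm how many records were dropped.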

Crime Count By Year

This visualization is a histogram of crimes committed per year. Because each bin is exactly one year wide, it reads like a bar chart: for each year in the data set, the height of the bar is the number of crimes reported that year. The X axis shows each year from 2001 through 2021, and the Y axis shows the corresponding count of crimes. The counts follow a mostly decreasing pattern over the years: 485,821 crimes were reported in 2001, compared with only 203,527 in 2021. The highest count was in 2002, with 486,778 crimes reported that year. Each bar is labeled at the top so that any reader can easily see the count of crimes for that year.

# Code for a histogram to look at the count of crimes within each year

df$year = year(mdy_hms(df$Date))
p1 = ggplot(df, aes(x=year)) +
  geom_histogram(binwidth = 1, color="black", fill="lavender") +
  labs(title = "Histogram of Crimes by Year", x="Year", y="Count of Crimes") +
  scale_y_continuous(labels=comma) +
  stat_bin(binwidth = 1, geom='text', color='purple', aes(label=scales::comma(after_stat(count))), vjust=-0.5, size=2.4)

x_axis_labels = min(df$year):max(df$year)

p1 = p1 + scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels)
p1
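
The size of the decline described above can be made concrete with a little arithmetic. The sketch below uses only the two endpoint counts quoted in this section (2001 and 2021); the variable names are illustrative, not part of the original analysis.

```r
# Yearly counts quoted in the text for the first and last year of the data.
count_2001 <- 485821
count_2021 <- 203527

# Percent decline from 2001 to 2021, rounded to one decimal place.
pct_decline <- round(100 * (count_2001 - count_2021) / count_2001, 1)
print(pct_decline)  # 58.1
```

On these figures, reported crimes in 2021 were roughly 58% below the 2001 level.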

Types of Crimes and Count

This visualization is a bar chart of crime counts by type rather than by year. The X axis shows the types of crimes committed. All the types of crime in the data set wouldn’t fit onto one chart, so it covers only ten of the most frequent types: battery, criminal damage, narcotics, assault, burglary, motor vehicle theft, deceptive practice, robbery, criminal trespass, and other offenses (those not covered by the other categories). Note that the code plots rows 2 through 11 of the sorted counts, so the single most frequent type in the data set is left out of the chart. The Y axis shows the crime count. Among the ten types shown, battery occurred most often, at 1,369,596 times from 2001 through 2021, and criminal trespass occurred least often. Again, each bar has a label at the top so the count is easy to read.

# Count each type of crime, sort the counts in descending order, and create a bar graph of type and count. Rows 2 through 11 of the sorted table are plotted, which skips the first (most frequent) row.

crimecount = data.frame(count(df, `Primary Type`))
crimecount = crimecount[order(crimecount$n, decreasing = TRUE),]

ggplot(crimecount[2:11,], aes(x = reorder(Primary.Type, -n), y = n)) +
  geom_bar(colour="black", fill="purple", stat = "identity") +
  labs(title = "Number of Crimes (Top 10)", x = "Type of Crimes Committed", y = "Crime Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma) +
  scale_x_discrete(guide=guide_axis(angle = 45)) +
  geom_text(aes(label = scales::comma(n)), vjust = -0.5, hjust = 0.5, size = 3)
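
Because the chart depends on which rows of the sorted table are selected, it can help to see the selection on a tiny example. This is a sketch in base R with made-up crime types and counts (`toy_counts` is hypothetical); it only illustrates how `head()` and an explicit row range pick different rows.

```r
# Hypothetical counts per crime type, deliberately unsorted.
toy_counts <- data.frame(
  type = c("C", "A", "D", "B"),
  n    = c(20, 40, 10, 30),
  stringsAsFactors = FALSE
)

# Sort from most to least frequent.
toy_counts <- toy_counts[order(toy_counts$n, decreasing = TRUE), ]

top3         <- head(toy_counts, 3)   # ranks 1 through 3
ranks_2_to_4 <- toy_counts[2:4, ]     # ranks 2 through 4, skips the top row

print(top3$type)
print(ranks_2_to_4$type)
```

The first selection starts at the most frequent type, while the second starts one row lower, which is the same pattern as selecting rows 2 through 11 instead of 1 through 10.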

Crime Count By Day of Week and Year

A heatmap was chosen for this view, which shows crime counts by day of the week for each year from 2011 through 2021. Higher counts appear in a darker purple and lower counts in a lighter purple. The days of the week, Monday through Sunday, run down the left-hand side, and the chosen years run along the bottom. The legend on the right-hand side shows the range of crime counts that each color represents: the darkest purple is around 50,000, while the lightest is closer to 30,000. The darkest tile is Friday in 2011, with 53,104 crimes committed on Fridays that year. The lowest count in the map is Thursday in 2021, with 27,847 crimes. This is a good way to compare the years while also seeing how the number of crimes differs by day of the week.

# Create a dataframe with the info needed and then create a heatmap of the count of crimes for each day of the week and each year from 2011 through 2021.

days_df = df %>%
  select(Date) %>%
  dplyr::mutate(year = year(mdy_hms(Date)),
         dayoftheweek = weekdays(mdy_hms(Date), abbreviate=TRUE)) %>%
  group_by(year, dayoftheweek) %>%
  summarise(n = length(Date), .groups = 'keep') %>%
  filter(year >= 2011 & year <= 2021) %>%
  data.frame()

days_df$year = as.factor(days_df$year)

# Order the days Monday through Sunday so the heatmap rows are not sorted alphabetically.
mylevels = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')
days_df$dayoftheweek = factor(days_df$dayoftheweek, levels = mylevels)

g = ggplot(days_df, aes(x = year, y = dayoftheweek, fill = n)) +
  geom_tile(color="black") +
  geom_text(aes(label=comma(n))) +
  coord_equal(ratio=1) +
  labs(title="Heatmap: Crimes by Day of the Week",
       x = "Year",
       y = "Days of the Week",
       fill = "Crime Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_discrete(limits = rev(levels(days_df$dayoftheweek))) +
  scale_fill_continuous(low="white", high="purple", labels=comma) +
  guides(fill = guide_legend(reverse=TRUE, override.aes = list(colour="black"))) 

g
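
The day labels above come from `weekdays()`, whose output depends on the system locale, so the English factor levels only match on an English-language setup. The sketch below shows one locale-independent alternative in base R, assuming three made-up dates; it is an illustration, not the method used for the heatmap.

```r
# Hypothetical dates: a Monday, a Friday, and a Sunday in March 2021.
dates <- as.Date(c("2021-03-01", "2021-03-05", "2021-03-07"))

# POSIXlt stores the weekday as a number: 0 = Sunday, ..., 6 = Saturday.
wday_num <- as.POSIXlt(dates)$wday

# Map the numbers to fixed English labels, then order Monday through Sunday.
labels_sun_first <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day <- factor(labels_sun_first[wday_num + 1],
              levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

print(as.character(day))
print(table(day))
```

Since the labels are hard-coded, the resulting factor has the same levels regardless of the language settings of the machine.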

Crime Count By Hour

The graph pictured here is a line chart of the number of crimes that occur in each hour of a 24-hour day. The X axis shows every hour from 0 (midnight) through 23 (11pm). The Y axis shows the crime count, which is the total number of crimes recorded in that hour across the whole data set. Two points are labeled on the graph: the lowest point (minimum) and the highest point (maximum). The fewest crimes occur at 5am, with 101,512 crimes during that hour. The most occur at noon, with 428,134 crimes. This kind of makes sense because noon is right in the middle of the day. Later into the night the count stays pretty high, which also fits the stereotype that crimes happen late at night. In the early morning, fewer crimes occur.

# Dataframe for hours of day. Then create a line graph to show how many crimes occur at every hour in a day.

hours_df = df %>%
  select(Date) %>%
  mutate(hour24 = hour(mdy_hms(Date))) %>%
  group_by(hour24) %>%
  summarise(n = length(Date), .groups = 'keep') %>%
  data.frame()

x_axis_labels = min(hours_df$hour24):max(hours_df$hour24)

hi_lo = hours_df %>%
  filter(n == min(n) | n == max(n)) %>%
  data.frame()

ggplot(hours_df, aes(x = hour24, y = n)) +
  geom_line(color='lavender', linewidth=1) +
  geom_point(shape=21, size=4, color='purple', fill='white') +
  labs(x="Hour", y="Crime Count", title = "Crimes by Hour", caption = "Source: Chicago Data Portal") +
  scale_y_continuous(labels=comma) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(labels=x_axis_labels, breaks = x_axis_labels, minor_breaks = NULL) +
  geom_point(data = hi_lo, aes(x = hour24, y = n), shape = 21, size=4, fill='purple', color='purple') +
  geom_label_repel(aes(label =  ifelse(n == max(n) | n == min(n), scales::comma(n), "")), 
                   box.padding = 1, 
                   point.padding = 1, 
                   size=4, color='Grey50', 
                   segment.color='darkblue')
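
The minimum and maximum hours labeled on the chart can also be found directly from the hourly counts. Here is a minimal sketch in base R, assuming a short made-up vector of incident hours (`hours` is hypothetical and far smaller than the real data).

```r
# Hypothetical hour-of-day values (0-23), one per incident.
hours <- c(5, 12, 12, 12, 18, 18, 0, 0, 23, 23)

# Count incidents per observed hour.
counts <- table(hours)

# Hours with the most and the fewest incidents (first one wins on ties).
busiest  <- as.integer(names(counts)[which.max(counts)])
quietest <- as.integer(names(counts)[which.min(counts)])

print(busiest)
print(quietest)
```

In this toy data the busiest hour is 12 and the quietest is 5, mirroring the pattern reported for the real data set.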

Amount of Arrests Made By Year

This nested pie chart visualizes arrest outcomes for the crimes covered in this data. Whether an arrest was made for each crime is recorded as either true or false. Purple is false, which means no arrest was made, and grey is true, which means an arrest was made. There are three rings, one per year: the innermost ring is 2019, the middle ring is 2020, and the outermost ring is 2021. The graph is interactive, so you can hover over each ring and color to see more details. In 2019, 21.5% of crimes committed ended in an arrest; hovering shows that this is 56,114 crimes. In 2020, 16% of crimes (33,929) resulted in an arrest, and in 2021 the share was 11.8%. So in each of these three years, most crimes committed in Chicago did not end in an arrest.

# Code for a nested pie chart that shows how many arrests were made in the years 2019, 2020, and 2021.

arrests = count(df, `Arrest`)

crime_df = df%>%
  select(`Arrest`, Date) %>%
  mutate(year = year(mdy_hms(Date)), 
         result = ifelse(`Arrest` == "TRUE", "TRUE", "FALSE")) %>%
  group_by(year, result) %>%
  summarise(n=length(result), .groups = 'keep') %>%
  group_by(year) %>%
  mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()

colors <- c('rgb(144,103,167)', 'rgb(128,133,133)')


fig = plot_ly(hole = 0.7) %>%
  layout(title= "Arrests Made (2019-2021)") %>%
  add_trace(data = crime_df[crime_df$year == 2021,],
            labels = ~result,
            values = ~crime_df[crime_df$year == 2021,"n"],
            type = "pie",
            textposition = "inside", marker = list(colors = colors),
            hovertemplate = "Year: 2021<br>Arrest: %{label}<br>Percent: %{percent}<br>Arrest Count: %{value}<extra></extra>") %>%
  add_trace(data = crime_df[crime_df$year == 2020,],
            labels = ~result,
            values = ~crime_df[crime_df$year == 2020,"n"],
            type = "pie",
            textposition = "inside",
            hovertemplate = "Year: 2020<br>Arrest: %{label}<br>Percent: %{percent}<br>Arrest Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.16,0.84),
              y = c(0.16,0.84))) %>%
  add_trace(data = crime_df[crime_df$year == 2019,],
            labels = ~result,
            values = ~crime_df[crime_df$year == 2019,"n"],
            type = "pie",
            textposition = "inside",
            hovertemplate = "Year: 2019<br>Arrest: %{label}<br>Percent: %{percent}<br>Arrest Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.27,0.73),
              y = c(0.27,0.73)))

fig
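
The percentages in the hover text are each year's arrests divided by that year's total. The sketch below reproduces that calculation in base R with `prop.table()`, assuming a made-up nine-row table (`toy` is hypothetical); the dplyr pipeline above computes the same quantity as `percent_of_total`.

```r
# Hypothetical incidents: 4 in 2019 (1 arrest) and 5 in 2020 (1 arrest).
toy <- data.frame(
  year   = c(2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020),
  arrest = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Cross-tabulate year by arrest outcome.
tab <- table(toy$year, toy$arrest)

# Row-wise shares: each year's rows sum to 100 percent.
pct <- round(100 * prop.table(tab, margin = 1), 1)
print(pct)
```

With `margin = 1` the shares are computed within each year, so the arrest rate of one year is not affected by how many crimes occurred in another.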

Location Description of Crimes

In this visualization, the locations of crimes are analyzed for four selected years using a trellis of 4 pie charts. The years chosen were 2005, 2010, 2015, and 2020, to show whether anything changed over time. Each pie is divided into grey, dark purple, and lavender sections that stand for different locations. Dark purple means that the crime was committed at a residence, lavender means that it occurred in the street, and grey represents all other locations, so the crime happened somewhere other than the street or a residence. The chart is another interactive visual, so a viewer can hover over each section and see additional details. The top left pie chart represents crimes committed in 2005. Most crimes happened somewhere other than the street or a residence, but 27.3% of crimes (or 123,767 crimes) occurred in the street. The other years show a similar pattern, so this is pretty consistent. In 2020, the share of crimes committed in a residence increased slightly, to 18.4% or 38,679 crimes.

# Code to show a trellis of pie charts of the years 2005, 2010, 2015, and 2020. Covers where crimes were committed in these years.

top_locations = count(loc_desc, `Location Description`)
top_locations = top_locations[order(-n),]

location_df = loc_desc%>%
  select(`Location Description`, Date) %>%
  mutate(year = year(mdy_hms(Date)), 
         location_desc = ifelse(`Location Description` == "STREET", "Street", ifelse(`Location Description` =="RESIDENCE", "Residence", "Other"))) %>%
  group_by(year, location_desc) %>%
  summarise(n=length(location_desc), .groups = 'keep') %>%
  group_by(year) %>%
  mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()

colors = c("grey", 'purple', 'lavender')


plot_ly(textposition="inside", labels = ~location_desc, values = ~n, marker = list(colors = colors)) %>%
  add_pie(data=location_df[location_df$year == 2005,],
          name="2005", title="2005", domain=list(row=0, column=0)) %>%
  add_pie(data=location_df[location_df$year == 2010,],
          name="2010", title="2010", domain=list(row=0, column=1)) %>%
  add_pie(data=location_df[location_df$year == 2015,],
          name="2015", title="2015", domain=list(row=1, column=0)) %>%
  add_pie(data=location_df[location_df$year == 2020,],
          name="2020", title="2020", domain=list(row=1, column=1)) %>%
  layout(title="Count of Crimes Committed at Various Locations (2005, 2010, 2015, 2020)", showlegend = TRUE,
         grid=list(rows=2, columns=2))
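
The three location groups are produced by recoding the raw `Location Description` values. The sketch below shows an equivalent recode in base R using a named lookup vector instead of nested `ifelse()` calls, assuming a made-up vector of descriptions (`loc` is hypothetical); the "STREET" and "RESIDENCE" labels are the ones used in the code above.

```r
# Hypothetical raw location descriptions.
loc <- c("STREET", "RESIDENCE", "APARTMENT", "SIDEWALK", "STREET")

# Named lookup: raw value -> display label. Anything else becomes "Other".
lookup <- c(STREET = "Street", RESIDENCE = "Residence")
group <- unname(lookup[loc])
group[is.na(group)] <- "Other"

print(group)
print(table(group))
```

A lookup vector keeps the mapping in one place, which makes it easier to add another named location later without nesting a further `ifelse()`.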

Conclusion

Through this project, an in-depth analysis was done on crime in Chicago. According to this data set, the amount of crime in Chicago declined over time, at least through 2021. Battery is the most frequent of the crime types charted, and in the years examined most crimes did not result in an arrest. We were also able to see when crime is more likely to happen, specifically on which day of the week and at what time of day. A decent share of crimes occur in the street or at a residence, but many occur somewhere other than these locations. These can all be important things to know, both for the police department in anticipating future crimes and for Chicago citizens who want to be aware of and informed about crime in the city. Overall, this report combined broad information for a high-level outlook with more specific graphs that look at detailed aspects of crime in Chicago.