Overview: The Chicago Crimes 2017 data has more than 200,000 crime records with several parameters of interest, including the type of crime, location coordinates, and whether an arrest was made. The aim of this project is simple: to provide a succinct summary of the data using R.


Initialisation

knitr::opts_chunk$set(echo = TRUE)

Load Packages

library(readxl)
library(tidyverse)
## ── Attaching packages ────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggmap)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date

Load the Chicago Crimes 2017 Data

crimes_17<-read_excel("crimes_17.xlsx")

Describing The Data

In this section, I will attempt to provide snapshots of the Chicago crimes data through various parameters.

Crimes by District

Which district recorded the most crimes in 2017?

#converting type of crime and district to factor levels (categorical variables)
crimes_17$primary_type<-as.factor(crimes_17$primary_type)
crimes_17$District<-as.factor(crimes_17$District)

#filtering out district '31' because that bracket is empty
clean_df<-crimes_17%>%filter(!is.na(District), District != 31)
ggplot(clean_df, aes(District)) + geom_bar()

From the plot, it is clear that districts 11, 6, and 8 have the highest numbers of recorded crimes. District 20 has the fewest.
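The ranking read off the bar chart can also be tabulated directly. The following is a minimal sketch, assuming a small made-up data frame (`toy_crimes`, not part of the original analysis) in place of `clean_df`; `count()` with `sort = TRUE` comes from dplyr.

```r
library(dplyr)

# Toy stand-in for clean_df; the district values are made up for illustration
toy_crimes <- data.frame(District = factor(c("11", "11", "11", "6", "6", "20")))

# One row per district, ordered from most to fewest recorded crimes
district_counts <- toy_crimes %>% count(District, sort = TRUE)
district_counts
```

Applied to `clean_df`, the same call should list the busiest districts at the top of the table, which makes the comparison less dependent on reading bar heights.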

Plotting Crime Locations on Geographic Maps

#loading the Chicago map shows too much of the lake, so I've loaded the neighbouring location 'Cicero' map

map <-get_map(location  = 'Cicero',  zoom = 11)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Cicero&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Cicero&sensor=false
crime_plot<-ggmap(map) + geom_point(aes(x = longitude, y = latitude), data =crimes_17, size = 0.2, alpha = 0.02, color = "blue")
crime_plot
## Warning: Removed 11832 rows containing missing values (geom_point).
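The warning above reports rows that `geom_point()` dropped. A minimal sketch for counting rows with missing coordinates ahead of plotting follows, assuming a made-up data frame (`toy_crimes`) in place of `crimes_17`. Whether this count matches the warning exactly is not confirmed here.

```r
library(dplyr)

# Toy stand-in for crimes_17; the coordinates are made up for illustration
toy_crimes <- data.frame(
  longitude = c(-87.63, NA, -87.70, -87.65),
  latitude  = c(41.88, 41.90, NA, 41.85)
)

# Number of rows that cannot be placed on the map
n_missing <- sum(is.na(toy_crimes$longitude) | is.na(toy_crimes$latitude))

# Keep only rows with both coordinates before passing the data to geom_point()
mappable <- toy_crimes %>% filter(!is.na(longitude), !is.na(latitude))
```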

From the map, it is apparent that downtown Chicago recorded far more crimes than Chicago’s suburbs. This may reflect a heavier police presence in the city than in the suburbs, in which case crimes in the suburbs could be just as frequent but go undocumented. This hypothesis, however, cannot be verified with the given data.

Rate of Arrests

Not every one of the 200,000+ recorded crimes resulted in an arrest. In fact, as the output below shows, only 19.5% did.

clean_df%>%summarise(prop_arrest = sum(arrest == TRUE, na.rm = TRUE)/n())
## # A tibble: 1 x 1
##   prop_arrest
##         <dbl>
## 1       0.195
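The `summarise()` call above divides by `n()`, so any row with a missing `arrest` value still counts in the denominator. The sketch below, on a made-up logical vector, shows how that differs from taking the mean over known values only. It assumes `arrest` is stored as a logical column.

```r
# Toy stand-in for the arrest column; the values are made up for illustration
arrest <- c(TRUE, FALSE, FALSE, NA, TRUE)

# Same form as above: the NA row stays in the denominator
prop_all_rows <- sum(arrest == TRUE, na.rm = TRUE) / length(arrest)

# Mean of a logical vector: the NA row is dropped from the denominator as well
prop_known_rows <- mean(arrest, na.rm = TRUE)
```

If `arrest` has no missing values, the two figures coincide.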

In the following sections, I’ve examined the rate of arrests by type of crime and by district, which gives a clearer picture of arrests.

Rate of Arrests By The Type of Crime

arrest_by_type<-crimes_17%>%group_by(primary_type) %>% summarise(prop_arrest = sum(arrest == TRUE)/n(), total = n()) %>% arrange(desc(prop_arrest))

arrest_by_type
## # A tibble: 32 x 3
##    primary_type                      prop_arrest total
##    <fct>                                   <dbl> <int>
##  1 GAMBLING                                1       191
##  2 LIQUOR LAW VIOLATION                    1       191
##  3 PROSTITUTION                            1       734
##  4 PUBLIC INDECENCY                        1        10
##  5 NARCOTICS                               1.000 11621
##  6 CONCEALED CARRY LICENSE VIOLATION       0.957    69
##  7 INTERFERENCE WITH PUBLIC OFFICER        0.948  1085
##  8 OBSCENITY                               0.814    86
##  9 WEAPONS VIOLATION                       0.784  4685
## 10 PUBLIC PEACE VIOLATION                  0.682  1497
## # ... with 22 more rows
ggplot(arrest_by_type, aes(x = primary_type, y = prop_arrest)) + geom_col() + theme(axis.text.x=element_text(angle=40, hjust=1)) + labs(x = "Type of Crime", y = "Proportion of Arrests")

Gambling, liquor law violation, prostitution, and public indecency are all crime types in which every recorded case led to an arrest, with narcotics close behind (1.000 after rounding). By comparison, Deceptive Practice, Battery, and Robbery are the types with the lowest arrest rates among documented cases.
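The bar chart above places crime types in alphabetical order, which makes the highest and lowest rates harder to pick out. The following sketch sorts the bars by arrest rate with base R's `reorder()`, using a made-up data frame (`toy_rates`) in place of `arrest_by_type`.

```r
library(ggplot2)

# Toy stand-in for arrest_by_type; the numbers are made up for illustration
toy_rates <- data.frame(
  primary_type = factor(c("BATTERY", "GAMBLING", "ROBBERY")),
  prop_arrest  = c(0.20, 1.00, 0.08)
)

# reorder() sorts the factor levels by prop_arrest;
# the minus sign puts the highest rate first
toy_rates$primary_type <- reorder(toy_rates$primary_type, -toy_rates$prop_arrest)

p <- ggplot(toy_rates, aes(x = primary_type, y = prop_arrest)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
  labs(x = "Type of Crime", y = "Proportion of Arrests")
p
```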

The seriousness of a crime may be correlated with the rate of arrests, but more data would be needed to draw further conclusions. In particular, if the crime control department introduced a new categorical variable classifying each type of crime by seriousness, it would be easier to analyse how arrests depend on seriousness.

Rate of Arrests By District

arrest_by_district<-crimes_17%>%filter(!is.na(District), District !=31)%>%group_by(District) %>% summarise(prop_arrest = sum(arrest == TRUE)/n(), total = n()) %>% arrange(desc(prop_arrest))

ggplot(arrest_by_district, aes(x = District, y = prop_arrest)) + geom_col() 

District 11, which had the most recorded crimes, also has the highest rate of arrests. District 17 appears to have the lowest rate of arrests.

Exploring the Timeline

To identify the time of year when crimes occur most often, it helps to plot crimes by date.

The Date variable stores the date and time of the crime, but it holds dates in multiple formats, which need to be standardised. The following code separates the time from the date and standardises the date format.

#separating dates and times
buffer<-separate(crimes_17, Date, c("date_buffer", "time_buffer"), sep = "T")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 163502 rows
## [1, 3, 4, 7, 9, 12, 16, 19, 21, 22, 24, 25, 26, 27, 28, 29, 32, 33, 34,
## 37, ...].
crimes_17_clean<-separate(buffer, date_buffer, c("date", "time"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 162002 rows
## [1, 3, 4, 7, 9, 12, 16, 19, 22, 24, 25, 26, 27, 28, 29, 32, 33, 34, 37,
## 40, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 105434 rows
## [2, 5, 6, 8, 10, 11, 13, 14, 15, 17, 18, 20, 21, 23, 30, 31, 35, 36, 38,
## 39, ...].
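#standardising the date format: parse each string under both formats
a <- as.Date(crimes_17_clean$date,format="%m/%d/%Y")
b <- as.Date(crimes_17_clean$date,format="%Y-%m-%d")
#fill the gaps in a from the same positions in b, so rows stay aligned
missing_a <- is.na(a)
a[missing_a] <- b[missing_a]
crimes_17_clean$date<-a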
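The two-format approach above can be generalised. The sketch below uses a hypothetical helper, `parse_mixed_dates()` (not part of the original analysis), which tries each format in turn and fills only the positions that are still missing. It also makes it easy to count strings that match no format.

```r
# Hypothetical helper: try each format in turn,
# filling only the positions that have not been parsed yet
parse_mixed_dates <- function(x, formats = c("%m/%d/%Y", "%Y-%m-%d")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

# Toy input; the last value matches neither format and stays NA
dates <- parse_mixed_dates(c("01/05/2017", "2017-03-09", "not a date"))

# Number of strings that could not be parsed
sum(is.na(dates))
```

Checking the count of unparsed values before plotting is a cheap way to confirm that no format was overlooked.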

Now that the dates are in a standardised format, it will be easy to plot crimes by date.

ggplot(crimes_17_clean, aes(x = date)) + geom_bar() + labs(x = "Date", y = "Frequency")

It might make more sense to plot the number of crimes on a monthly basis.

#extracting month from the date
crimes_17_clean<-mutate(crimes_17_clean, month = month(date))

crimes_17_clean$month<-as.factor(crimes_17_clean$month)
ggplot(crimes_17_clean, aes(x = month)) + geom_bar() +ggtitle("Crimes Per Month")

A clear trend emerges over the months. The summer months see more crimes than the winter months. January, curiously, doesn’t follow the trend: it records more crimes than any other month. Further data would help explain this anomaly.
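Months differ in length, so raw monthly totals are not strictly comparable. The following sketch computes crimes per day for each month, using a made-up data frame (`toy_crimes`) in place of `crimes_17_clean`; `days_in_month()` comes from lubridate. It does not by itself explain the January figure.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for crimes_17_clean; the dates are made up for illustration
toy_crimes <- data.frame(
  date = as.Date(c("2017-01-03", "2017-01-20", "2017-02-11",
                   "2017-02-12", "2017-02-28"))
)

monthly <- toy_crimes %>%
  mutate(month = month(date)) %>%
  group_by(month) %>%
  summarise(
    total   = n(),
    # days_in_month() accounts for months of different lengths
    days    = as.integer(days_in_month(first(date))),
    per_day = total / days
  )
monthly
```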