Overview: The Chicago Crimes 2017 data has 200,000+ crime data entries with numerous parameters of interest, including the type of crime, location coordinates, and whether arrests were made or not. The aim of this project is simple: to provide a succinct summary of the data using R.
knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(tidyverse)
## ── Attaching packages ────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggmap)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
crimes_17<-read_excel("crimes_17.xlsx")
In this section, I will attempt to provide snapshots of the Chicago crimes data through various parameters.
Which district recorded the most crimes in 2017?
#converting type of crime and district to factor levels (categorical variables)
crimes_17$primary_type<-as.factor(crimes_17$primary_type)
crimes_17$District<-as.factor(crimes_17$District)
#filtering out district '31' because that bracket is empty
clean_df<-crimes_17%>%filter(!is.na(District), District != 31)
ggplot(clean_df, aes(District)) + geom_bar()
From the plot, districts 11, 6, and 8 stand out with very high numbers of recorded crimes. District 20 recorded the fewest crimes.
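The ranking read off the bar chart can also be checked numerically. The following is a minimal base R sketch on invented toy data (the `toy` data frame and its values are assumptions for illustration, not the real dataset); the same idea would apply to `clean_df$District`.

```r
# Toy data: these district labels are invented for illustration only
toy <- data.frame(District = c("11", "6", "11", "6", "11", "20"))

# Count records per district and sort from most to fewest
district_counts <- sort(table(toy$District), decreasing = TRUE)
district_counts

# Names of the districts with the most and the fewest records in the toy data
busiest  <- names(district_counts)[1]
quietest <- names(district_counts)[length(district_counts)]
```

A sorted table like this makes it easier to confirm which districts sit at the extremes than judging bar heights by eye.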
#loading the Chicago map shows too much of the lake, so I've loaded the neighbouring location 'Cicero' map
map <-get_map(location = 'Cicero', zoom = 11)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Cicero&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Cicero&sensor=false
crime_plot<-ggmap(map) + geom_point(aes(x = longitude, y = latitude), data =crimes_17, size = 0.2, alpha = 0.02, color = "blue")
crime_plot
## Warning: Removed 11832 rows containing missing values (geom_point).
From the map, it is apparent that far more crimes were recorded in downtown Chicago than in Chicago’s suburbs. One possible explanation is a higher presence of security forces in the city: crime rates may be just as high in the suburbs, but fewer of those crimes were documented. This hypothesis, however, cannot be verified with the given data.
Even with 200,000+ recorded crimes, not all crimes resulted in arrests. In fact, as seen from the table below, only 19.5% of all crimes resulted in arrests.
clean_df%>%summarise(prop_arrest = sum(arrest == TRUE, na.rm = TRUE)/n())
## # A tibble: 1 x 1
## prop_arrest
## <dbl>
## 1 0.195
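The summary above removes missing values from the numerator but still divides by `n()`, which counts every row. Whether the `arrest` column actually contains missing values is not stated here, so the following is only a sketch on invented toy values showing how the two possible denominators differ.

```r
# Toy arrest flags; the NA stands for a record with no arrest information
arrest <- c(TRUE, FALSE, FALSE, NA, TRUE)

# Same form as the summary above: NAs are dropped from the numerator only,
# so the denominator still counts the incomplete record
prop_all_rows <- sum(arrest == TRUE, na.rm = TRUE) / length(arrest)

# Alternative: drop NAs from both numerator and denominator
prop_known <- mean(arrest, na.rm = TRUE)
```

If the column has no missing values, the two quantities coincide; otherwise the first one understates the arrest rate among records with known outcomes.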
In the following sections, I examine the rate of arrests by type of crime and by district, which gives a clearer picture of arrests.
arrest_by_type<-crimes_17%>%group_by(primary_type) %>% summarise(prop_arrest = sum(arrest == TRUE)/n(), total = n()) %>% arrange(desc(prop_arrest))
arrest_by_type
## # A tibble: 32 x 3
## primary_type prop_arrest total
## <fct> <dbl> <int>
## 1 GAMBLING 1 191
## 2 LIQUOR LAW VIOLATION 1 191
## 3 PROSTITUTION 1 734
## 4 PUBLIC INDECENCY 1 10
## 5 NARCOTICS 1.000 11621
## 6 CONCEALED CARRY LICENSE VIOLATION 0.957 69
## 7 INTERFERENCE WITH PUBLIC OFFICER 0.948 1085
## 8 OBSCENITY 0.814 86
## 9 WEAPONS VIOLATION 0.784 4685
## 10 PUBLIC PEACE VIOLATION 0.682 1497
## # ... with 22 more rows
ggplot(arrest_by_type, aes(x = primary_type, y = prop_arrest)) + geom_col() + theme(axis.text.x=element_text(angle=40, hjust=1)) + labs(x = "Type of Crime", y = "Proportion of Arrests")
Gambling, liquor law violation, prostitution, and public indecency are crime types in which every recorded case resulted in an arrest. By contrast, Deceptive Practice, Battery, and Robbery are the types with the lowest proportions of arrests among documented cases.
It is possible that the seriousness of a crime is correlated with the rate of arrests, but more data would be needed to draw further conclusions. In particular, if the crime control department introduced a new categorical variable classifying each type of crime by seriousness, the dependence of arrests on seriousness would be easier to analyse.
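To illustrate what such a variable would allow, here is a rough base R sketch. The arrest proportions are copied from the table above, but the `seriousness_lookup` table and its labels are entirely hypothetical placeholders, not an official classification.

```r
# Arrest proportions for four crime types, copied from the table above
arrest_by_type_toy <- data.frame(
  primary_type = c("GAMBLING", "OBSCENITY", "WEAPONS VIOLATION",
                   "PUBLIC PEACE VIOLATION"),
  prop_arrest  = c(1, 0.814, 0.784, 0.682)
)

# HYPOTHETICAL lookup: these seriousness labels are invented placeholders
seriousness_lookup <- data.frame(
  primary_type = c("GAMBLING", "OBSCENITY", "WEAPONS VIOLATION",
                   "PUBLIC PEACE VIOLATION"),
  seriousness  = c("low", "low", "high", "medium")
)

# Attach the seriousness label to each crime type
joined <- merge(arrest_by_type_toy, seriousness_lookup, by = "primary_type")

# Unweighted mean arrest proportion per seriousness level
mean_by_seriousness <- aggregate(prop_arrest ~ seriousness,
                                 data = joined, FUN = mean)
mean_by_seriousness
```

Note that this averages the per-type proportions without weighting by the number of cases; a fuller analysis would likely join the labels onto the individual records instead.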
arrest_by_district<-crimes_17%>%filter(!is.na(District), District !=31)%>%group_by(District) %>% summarise(prop_arrest = sum(arrest == TRUE)/n(), total = n()) %>% arrange(desc(prop_arrest))
ggplot(arrest_by_district, aes(x = District, y = prop_arrest)) + geom_col()
District 11, which had the most recorded crimes, also has the highest rate of arrests. District 17 appears to have the lowest rate of arrests.
To identify the time of year when crimes occur most often, it helps to plot crimes by date.
The Date variable stores the date and time of the crime, but it contains dates in multiple formats, which need to be standardised. The following code separates the time from the date and standardises the date format.
#separating dates and times
buffer<-separate(crimes_17, Date, c("date_buffer", "time_buffer"), sep = "T")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 163502 rows
## [1, 3, 4, 7, 9, 12, 16, 19, 21, 22, 24, 25, 26, 27, 28, 29, 32, 33, 34,
## 37, ...].
crimes_17_clean<-separate(buffer, date_buffer, c("date", "time"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 162002 rows
## [1, 3, 4, 7, 9, 12, 16, 19, 22, 24, 25, 26, 27, 28, 29, 32, 33, 34, 37,
## 40, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 105434 rows
## [2, 5, 6, 8, 10, 11, 13, 14, 15, 17, 18, 20, 21, 23, 30, 31, 35, 36, 38,
## 39, ...].
#standardising the date format
a <- as.Date(crimes_17_clean$date,format="%m/%d/%Y")
b <- as.Date(crimes_17_clean$date,format="%Y-%m-%d")
#filling gaps in a from the same positions in b, so rows stay aligned
a[is.na(a)] <- b[is.na(a)]
crimes_17_clean$date<-a
Now that the dates are in a standardised format, it will be easy to plot crimes by date.
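As an aside, the two-format parsing step can be checked on a small vector before trusting it on the full data. This is a minimal sketch on invented toy strings (the `raw` vector is an assumption for illustration); it fills gaps by matching positions, which stays aligned even if some string matches neither format.

```r
# Toy strings in the two date formats handled above
raw <- c("01/15/2017", "2017-03-02", "12/31/2017", "2017-07-04")

# Parse with each format; strings that do not match yield NA
a <- as.Date(raw, format = "%m/%d/%Y")
b <- as.Date(raw, format = "%Y-%m-%d")

# Fill the gaps left by the first format from the same positions in `b`
parsed <- a
parsed[is.na(parsed)] <- b[is.na(parsed)]
parsed
```

Any element still `NA` after this step would indicate a third, unhandled format.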
ggplot(crimes_17_clean, aes(x = date)) + geom_bar() + labs(x = "Date", y = "Frequency")
It might make more sense to plot the number of crimes on a monthly basis.
#extracting month from the date
crimes_17_clean<-mutate(crimes_17_clean, month = month(date))
crimes_17_clean$month<-as.factor(crimes_17_clean$month)
ggplot(crimes_17_clean, aes(x = month)) + geom_bar() +ggtitle("Crimes Per Month")
A clear trend emerges over the months. The summer months see more crimes than the winter months. January, curiously, doesn’t follow the trend: it records more crimes than any other month. Further data would be beneficial to explore the reason for this anomaly.
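Because months differ in length, one simple check when comparing monthly totals is to convert them to crimes per day. The following base R sketch uses invented toy dates (the `dates` vector is an assumption for illustration, not the real data); it would not by itself explain the January figure, but it removes month length as a factor.

```r
# Toy dates, invented for illustration
dates <- as.Date(c("2017-01-03", "2017-01-20", "2017-02-11",
                   "2017-02-12", "2017-02-25", "2017-07-04"))

# Crimes per calendar month, keeping months with zero records
month_num <- as.integer(format(dates, "%m"))
per_month <- table(factor(month_num, levels = 1:12))

# Days in each month of 2017 (not a leap year)
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Average crimes per day in each month
per_day <- as.numeric(per_month) / days_in_month
per_day
```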