Chicago Crimes 2001-2017 Time Series Analysis
Introduction
Happy citizens are citizens who can feel safe wherever they want to be. Sadly, crime occurs in many forms that can harm people. Since it is a recurring act that can affect anyone, is it possible to predict and avoid it? Are there times when a crime is more likely to occur than others?
Using Time Series Analysis on a dataset of crime incidents reported in the city of Chicago from 2001 to early 2017, we will try to answer two questions: is there a pattern in when crimes are likely to happen, and at what times should you be more careful to reduce the chance of a crime happening to you?
The data is extracted from Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) System.
Import Data
library(dplyr)
library(prophet)
library(xts)
library(highcharter)
library(lubridate)
library(ggplot2)
library(plotly)
library(TSstudio)
library(forecast)
library(MLmetrics)
library(viridisLite)
library(viridis)
library(padr)
library(rmdformats)
library(tidyr)
library(tidyverse)
<- read.csv("data/Chicago_Crimes_2001_to_2004.csv")
cc20012004 <- read.csv("data/Chicago_Crimes_2005_to_2007.csv")
cc20052007 <- read.csv("data/Chicago_Crimes_2008_to_2011.csv")
cc20082011 <- read.csv("data/Chicago_Crimes_2012_to_2017.csv") cc20122017
Because the data is large and computing on all of it takes considerable resources, I will inspect just one of the data frames above.
glimpse(cc20122017)
Rows: 1,456,714
Columns: 23
$ X <int> 3, 89, 197, 673, 911, 1108, 1130, 1801, 1868, 189~
$ ID <int> 10508693, 10508695, 10508697, 10508698, 10508699,~
$ Case.Number <chr> "HZ250496", "HZ250409", "HZ250503", "HZ250424", "~
$ Date <chr> "05/03/2016 11:40:00 PM", "05/03/2016 09:40:00 PM~
$ Block <chr> "013XX S SAWYER AVE", "061XX S DREXEL AVE", "053X~
$ IUCR <chr> "0486", "0486", "0470", "0460", "0820", "041A", "~
$ Primary.Type <chr> "BATTERY", "BATTERY", "PUBLIC PEACE VIOLATION", "~
$ Description <chr> "DOMESTIC BATTERY SIMPLE", "DOMESTIC BATTERY SIMP~
$ Location.Description <chr> "APARTMENT", "RESIDENCE", "STREET", "SIDEWALK", "~
$ Arrest <chr> "True", "False", "False", "False", "False", "Fals~
$ Domestic <chr> "True", "True", "False", "False", "True", "False"~
$ Beat <int> 1022, 313, 1524, 1532, 1523, 631, 133, 215, 2432,~
$ District <dbl> 10, 3, 15, 15, 15, 6, 1, 2, 24, 7, 3, 18, 12, 1, ~
$ Ward <dbl> 24, 20, 37, 28, 28, 8, 3, 3, 40, 17, 7, 42, 2, 4,~
$ Community.Area <dbl> 29, 42, 25, 25, 25, 44, 35, 38, 1, 67, 43, 8, 28,~
$ FBI.Code <chr> "08B", "08B", "24", "08B", "06", "04B", "08B", "0~
$ X.Coordinate <dbl> 1154907, 1183066, 1140789, 1143223, 1139890, 1183~
$ Y.Coordinate <dbl> 1893681, 1864330, 1904819, 1901475, 1901675, 1850~
$ Year <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2~
$ Updated.On <chr> "05/10/2016 03:56:50 PM", "05/10/2016 03:56:50 PM~
$ Latitude <dbl> 41.86407, 41.78292, 41.89491, 41.88569, 41.88630,~
$ Longitude <dbl> -87.70682, -87.60436, -87.75837, -87.74952, -87.7~
$ Location <chr> "(41.864073157, -87.706818608)", "(41.782921527, ~
The result above gives an overview of the information recorded in the data frame. Since we will conduct a Time Series Analysis, we only need the Date column and the type of crime recorded in Primary.Type.
cc20012004 <- cc20012004[, c('Date', 'Primary.Type')]
cc20052007 <- cc20052007[, c('Date', 'Primary.Type')]
cc20082011 <- cc20082011[, c('Date', 'Primary.Type')]
cc20122017 <- cc20122017[, c('Date', 'Primary.Type')]
chicagocrimes <- rbind(cc20012004, cc20052007, cc20082011, cc20122017)
glimpse(chicagocrimes)
Rows: 6,664,949
Columns: 2
$ Date <chr> "01/01/2004 12:01:00 AM", "03/01/2003 12:00:00 AM", "06/2~
$ Primary.Type <chr> "THEFT", "OTHER OFFENSE", "OFFENSE INVOLVING CHILDREN", "~
Creating Time Series
cc20122017$Date <- as.Date(cc20122017$Date, "%m/%d/%Y %I:%M:%S %p")
by_Date <- na.omit(cc20122017) %>% group_by(Date) %>% summarise(Total = n())
tseries <- xts(by_Date$Total, order.by = as.POSIXct(by_Date$Date))

df <- cc20122017 %>% group_by(Date) %>% summarise(y = n()) %>% mutate(y = log(y))
names(df) <- c("ds", "y")
df$ds <- factor(df$ds)
Time Series Plot of Chicago Crimes 2012-2017
# Time Series plot of Chicago Crimes 2012-2017
hchart(tseries, name = "Crimes") %>%
hc_add_theme(hc_theme_darkunica()) %>%
hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>%
hc_title(text = "Times Series plot of Chicago Crimes") %>%
hc_legend(enabled = TRUE)
Data Wrangling
We will clean the data by omitting the unnecessary information, keeping only two variables: Date and Primary.Type.
crime_clean <- chicagocrimes %>%
  mutate(Date = mdy_hms(Date),
         Primary.Type = as.factor(Primary.Type)) %>%
  mutate(Date = floor_date(Date, unit = "hours")) %>%
  arrange(Date)
By aggregating the data, we can see which kinds of crime are most likely to happen in the city of Chicago.
Plot1 <- crime_clean %>%
  group_by(Primary.Type) %>%
  summarise(count = n()) %>%
  arrange(-count) %>%
  head(5) %>%
  ggplot(aes(x = count, y = reorder(Primary.Type, count),
             text = paste("Crime: ", Primary.Type, "<br>",
                          "Happened: ", count, " times"))) +
  geom_col(aes(fill = Primary.Type), show.legend = F) +
  scale_fill_viridis(option = "D", discrete = T, direction = 1, begin = 0.9, end = 0.3) +
  labs(title = "Top 5 Crimes in Chicago", x = "Number of Crimes", y = "Crime") +
  theme_gray()
ggplotly(Plot1, tooltip = "text")
Based on the plot above, these are the top 5 crimes that happened in Chicago. For this Time Series Analysis, we will pick one type of crime, Theft, and conduct the analysis on data from September 2014 to September 2016, building a model to forecast future theft counts.
# Create Prediction Time Frame
crime_theft <- crime_clean %>%
  filter(Primary.Type == 'THEFT') %>%
  group_by(Date) %>%
  summarise(Theft = n()) %>%
  filter(Date > "2014-09-01" & Date < "2016-09-02") %>%
  slice(-c(1:6)) %>%
  ungroup()
head(crime_theft, 10)
# A tibble: 10 x 2
Date Theft
<dttm> <int>
1 2014-09-01 00:00:00 10
2 2014-09-01 01:00:00 11
3 2014-09-01 02:00:00 2
4 2014-09-01 03:00:00 1
5 2014-09-01 04:00:00 2
6 2014-09-01 05:00:00 1
7 2014-09-01 06:00:00 1
8 2014-09-01 07:00:00 2
9 2014-09-01 08:00:00 5
10 2014-09-01 09:00:00 11
tail(crime_theft,20)
# A tibble: 20 x 2
Date Theft
<dttm> <int>
1 2016-08-31 21:00:00 7
2 2016-08-31 22:00:00 14
3 2016-08-31 23:00:00 12
4 2016-09-01 00:00:00 14
5 2016-09-01 01:00:00 3
6 2016-09-01 02:00:00 2
7 2016-09-01 03:00:00 2
8 2016-09-01 04:00:00 5
9 2016-09-01 05:00:00 2
10 2016-09-01 06:00:00 9
11 2016-09-01 07:00:00 6
12 2016-09-01 08:00:00 12
13 2016-09-01 09:00:00 12
14 2016-09-01 10:00:00 11
15 2016-09-01 11:00:00 7
16 2016-09-01 12:00:00 14
17 2016-09-01 13:00:00 11
18 2016-09-01 14:00:00 14
19 2016-09-01 15:00:00 11
20 2016-09-01 16:00:00 15
Next, we check whether there are any missing intervals in our data frame by padding the data and then checking for NA values created by the padding.
crime_theft <- crime_theft %>%
  pad(start_val = ymd_hms("2014-09-01 00:00:00"), end_val = ymd_hms("2016-08-31 23:00:00")) %>%
  mutate(Theft = replace_na(Theft, 0))
anyNA(crime_theft)
[1] FALSE
Cross Validation
After finishing our time frame, we will conduct cross-validation by splitting our data into two sets: train data, on which the model will be fitted, and test data, which provides an unbiased evaluation of the model's fit on the training set. Our test data will consist of the last 365 hourly observations of reported crime in our data frame (which spans early September 2014 to the end of August 2016).
theft_train <- head(crime_theft, nrow(crime_theft) - 365)

theft_test <- tail(crime_theft, 365)
Create Time Series Object
To create a time series model, we need to create a time series object from our train data. The object will be based on the Theft column, as that is the variable we are going to predict, and we set the frequency to 24 since there are 24 hourly observations of reported crime per day.
theft_ts <- ts(theft_train$Theft, frequency = 24)
Next, we will decompose our time series object and inspect the trend and seasonality by plotting it with autoplot().
# Decompose the past year of observations
theft_ts_dec <- theft_ts %>%
  tail(365) %>%
  decompose()

theft_ts_dec %>%
  autoplot()
As we can see from the plot, the trend still shows a repeating pattern similar to our seasonal component, which indicates there are additional seasonality patterns that have not been captured. To overcome this, we will create a multi-seasonal time series object.
# Create MSTS Object
theft_multi <- msts(theft_train$Theft, seasonal.periods = c(24,      # Daily
                                                            24*7,    # Weekly
                                                            24*30))  # Monthly
# Decompose MSTS Object
theft_multi_dec <- theft_multi %>%
  mstl()

theft_multi_dec %>%
  tail(365) %>%
  autoplot()
From the plot above, we can see that the trend of the Theft crime is now smooth, and the trend itself increases over the last 365 observations shown.
Seasonality Analysis
# Create a data frame based on MSTS Object
df_theft_multi <- as.data.frame(theft_multi_dec)
1. Daily Seasonality
This is the plot of the daily seasonality of Theft crime in Chicago.
Plot2 <- df_theft_multi %>%
  mutate(day = theft_train$Date) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*2) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_point(col = "maroon") + geom_line(col = "red") +
  theme_minimal()

Plot2
From the visualization, reported theft cases increase from approximately 06:00 AM to 12:00 PM. We can say that the range from 6 in the morning to noon is when people are out, active, and meeting others, while from midnight until the start of the day people tend to rest and have little activity. Daytime hours, when nobody is at home, also give offenders an opportunity to commit a crime, in this case theft.
2. Weekly Seasonality
This is the plot of the weekly seasonality of Theft crime in Chicago.
Plot3 <- df_theft_multi %>%
  mutate(day = theft_train$Date) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*7) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_point(col = "maroon") + geom_line(col = "red") +
  theme_minimal()

Plot3
We can see from the plot that the pattern is quite similar to the previous one; the difference is a slight increase in the number of reported crimes on the weekend. Although the cause is unknown, it is safe to say that people need to be more cautious about the chance of crime at those times.
3. Monthly Seasonality
This is the plot of the monthly seasonality of Theft crime in Chicago.
Plot4 <- df_theft_multi %>%
  mutate(day = theft_train$Date, month = month(theft_train$Date, label = T)) %>%
  group_by(month) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*30) %>%
  ggplot(aes(x = month, y = seasonal)) +
  geom_point() + geom_col() +
  theme_minimal()

Plot4
As we can see from the plot above, the crime tends to increase in April, July, October, and December, with the highest level in July. These months can be seen as a warning signal to people around the city that the chance of being harmed is higher around these times.
Modelling
After successfully creating the MSTS object, we will fit it with several models:
- Multi-Seasonal Holt-Winters
- Multi-Seasonal ARIMA
- Seasonal naive
Multi-Seasonal Holt-Winters
theft_hw <- stlm(theft_multi, method = "ets")

f_hw <- forecast(theft_hw, h = 365)

theft_multi %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_hw, series = "Multi-Seasonal Holt-Winters", color = "blue")
Multi-Seasonal ARIMA
theft_arima <- stlm(theft_multi, method = "arima")

f_arima <- forecast(theft_arima, h = 365)

theft_multi %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_arima, series = "Multi-Seasonal ARIMA", color = "red")
Seasonal Naive Method
theft_snaive <- snaive(y = theft_multi, h = 365)

f_snaive <- forecast(object = theft_snaive)

theft_multi %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_snaive, series = "Seasonal Naive", color = "#663366")
Model Comparison
To evaluate our models, we will judge them based on the Mean Absolute Error (MAE). We use MAE because it is commonly used in time series analysis and easy to interpret: it measures the average magnitude of the errors in a set of predictions. (Note that MAPE is unusable here, as the accuracy output below shows, because many hourly counts are zero, which makes the percentage error infinite.)
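As a quick illustration of the metric (a minimal sketch with made-up numbers, not our actual forecasts):

# MAE is the mean of the absolute differences between actual and predicted values
actual <- c(10, 11, 2, 1, 2)
predicted <- c(8, 12, 3, 1, 4)
mean(abs(actual - predicted)) # 1.2

# equivalently, via the MLmetrics package loaded earlier
MAE(y_pred = predicted, y_true = actual)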
# Multiseasonal HW
accuracy(f_hw, theft_test$Theft)
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.00005510394 2.121741 1.637138 NaN Inf 0.5495089 0.03621836
Test set -0.23828268612 3.561546 2.679921 -Inf Inf 0.8995209 NA
# Multiseasonal ARIMA
accuracy(f_arima, theft_test$Theft)
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.001814602 2.111024 1.628264 NaN Inf 0.5465301 0.00008691143
Test set 0.046691421 3.552433 2.632579 -Inf Inf 0.8836304 NA
# Seasonal Naive Method
accuracy(f_snaive, theft_test$Theft)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.02363449 4.065962 3.115803 -Inf Inf 1.045826 0.1769116
Test set -0.26849315 4.556495 3.534247 -Inf Inf 1.186277 NA
From the results above, we can conclude that our Multi-Seasonal ARIMA model has the lowest test error, with an MAE of 2.63. Hence, we will use the Multi-Seasonal ARIMA model as our prediction model and check the residual assumptions.
Assumption
The assumptions in a time series model are tested to measure whether the residuals obtained from the modeling are good enough to describe and capture the information in the data. Why use the residuals? Because the residuals combine information from the actual data with information from the model's predictions. A good forecasting method produces residuals with the following properties:
- Uncorrelated residuals. If the residuals are correlated, there is still information left that should be used to compute the forecasts.
- Residuals with a mean of 0 that are normally distributed (normality of residuals).
To ensure the resulting residuals meet the criteria above, we test the following assumptions.
No-autocorrelation residual
\(H_0\): residuals have no autocorrelation
\(H_1\): residuals have autocorrelation
desired p-value > 0.05 (alpha): no autocorrelation
To check for the presence or absence of autocorrelation in the forecast residuals, you can use several methods, namely:
- using the ACF residual plot visualization with the function acf(residual_model)
- performing a Ljung-Box test with the function Box.test(residual_model, type = "Ljung-Box")
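For instance, a minimal sketch of the visual check (assuming the fitted model's residuals are stored in theft_arima$residuals, as they are below):

# Plot the residual autocorrelations for up to one week of hourly lags;
# bars crossing the dashed confidence bounds suggest leftover autocorrelation
acf(theft_arima$residuals, lag.max = 24*7)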
In this case we will use the Ljung-Box test for autocorrelation and, as explained below, the Anderson-Darling test for normality. We can conclude that we have a good and optimal model provided the residuals fulfill these assumptions.
Box.test(theft_arima$residuals, type = "Ljung-Box", lag = 2*365)
Box-Ljung test
data: theft_arima$residuals
X-squared = 4787.4, df = 730, p-value < 0.00000000000000022
p-value > 0.05 : no autocorrelation in the residuals
p-value < 0.05 : autocorrelation present in the residuals
Based on the result, our p-value is < 0.05, which means there is autocorrelation in our model residuals.
Residual Normality
\(H_0\): residuals are normally distributed
\(H_1\): residuals are not normally distributed
desired p-value > 0.05 (alpha): residuals are normally distributed
To check the normality of the residuals, we can perform a normality test (Shapiro-Wilk) using the shapiro.test(residual_model) function. However, since the Shapiro-Wilk test cannot be run on more than 5000 records and this dataset has more than 5000 records, I will use the Anderson-Darling normality test from the nortest package instead.
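As a side note, another workaround (an assumption on my part, not used in this analysis) would be to apply the Shapiro-Wilk test to a random subsample of at most 5000 residuals:

# Shapiro-Wilk on a random subsample of 5000 residuals; the p-value will vary
# slightly with the seed, so this is only a rough check
set.seed(100)
shapiro.test(sample(as.numeric(theft_arima$residuals), 5000))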
library(nortest)
ad.test(theft_arima$residuals)
Anderson-Darling normality test
data: theft_arima$residuals
A = 27.966, p-value < 0.00000000000000022
p-value > 0.05 : residuals normally distributed
p-value < 0.05 : residuals not normally distributed
Based on the result, our p-value is < 0.05, which means our residuals are not normally distributed.
As neither assumption has been fulfilled, it can be said that our model can be tuned to be more optimal. We can try to optimize it by transforming the data.
Model Improvement
The assumption checks showed that our model's residuals failed to fulfill the assumptions, meaning the model is not optimal and can be improved. To fulfill these assumptions, we will try to transform our data: using the recipes package, we will transform the data to its square-root value and do some scaling.
# Data transformation
library(recipes)
rec <- recipe(Theft ~ ., data = theft_train) %>%
  step_sqrt(all_outcomes()) %>%
  step_center(all_outcomes()) %>%
  step_scale(all_outcomes()) %>%
  prep()
# change data train
theft_train_rec <- juice(rec)

# change data validation
theft_test_rec <- bake(rec, theft_test)
We can see that our Theft column has been transformed into its square-root value and scaled; this data will be used as our new train data.
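A quick peek at the transformed data (a minimal check):

# Inspect the first rows of the transformed training data; Theft should now be
# a centered and scaled square-root value instead of a raw count
head(theft_train_rec)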
# Revert Function
rec_rev <- function(x, rec) {
  # Retrieve the centering mean and scaling sd learned by the recipe
  means <- rec$steps[[2]]$means[["Theft"]]
  sds <- rec$steps[[3]]$sds[["Theft"]]

  # Undo the scaling and centering, then square to undo step_sqrt
  x <- (x * sds + means)^2
  # Round back to whole counts and clip negative values to zero
  x <- round(x)
  x <- ifelse(x < 0, 0, x)
  # (use exp(x) here instead if a log transform was applied)
  x
}
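As a rough sanity check (a sketch, assuming the recipe steps are ordered sqrt, center, scale as above), reverting the transformed training target should reproduce the original hourly counts:

# Revert the transformed target and compare it with the original counts
all.equal(rec_rev(theft_train_rec$Theft, rec), as.numeric(theft_train$Theft))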
Create Time Series Object
We will create a new MSTS object based on the theft_train_rec data, setting daily, weekly, and monthly frequencies.
# Create MSTS Object
theft_multi_rec <- msts(theft_train_rec$Theft, seasonal.periods = c(24, 24*7, 24*30))

# Decompose MSTS Object
theft_multi_decomp_rec <- theft_multi_rec %>%
  mstl()

theft_multi_decomp_rec %>%
  tail(365) %>%
  autoplot()
Modelling
Next, we model the transformed data. As we have already chosen Multi-Seasonal ARIMA as our prediction model, we will fit another one on our new MSTS object.
theft_arima_rec <- stlm(theft_multi_rec, method = "arima")

f_arima_rec <- forecast(theft_arima_rec, h = 365)

rec_rev_arima <- rec_rev(f_arima_rec$mean, rec = rec)

theft_multi_rec %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_arima_rec, series = "Multi-Seasonal ARIMA", color = "red")
# Model Accuracy
accuracy(rec_rev_arima, theft_test$Theft)
ME RMSE MAE MPE MAPE
Test set 0.2657534 3.609349 2.676712 -Inf Inf
From the output, we can see that transforming the data increased the MAE by about 0.04, to 2.67. Compared with the previous ARIMA model, the first one, fitted without transforming the data, is better. Let's check the model assumptions.
Assumption
We will check the autocorrelation and normality of our new model's residuals with the same tests we used previously.
# Autocorrelation
Box.test(theft_arima_rec$residuals, type = "Ljung-Box", lag = 2*365)
Box-Ljung test
data: theft_arima_rec$residuals
X-squared = 4819, df = 730, p-value < 0.00000000000000022
Based on the result, our p-value is lower than 0.05, which means there is autocorrelation in our model residuals.
# Normality
ad.test(theft_arima_rec$residuals)
Anderson-Darling normality test
data: theft_arima_rec$residuals
A = 10.456, p-value < 0.00000000000000022
Based on the result, our p-value is lower than 0.05, which means our residuals are not normally distributed.
As both assumptions are still not fulfilled, our model can be tuned further to be more optimal. Strategies to overcome this include using a different kind of model that can produce better predictions without these residual assumptions, or conducting a time series seasonality adjustment, as sketched below.
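As an illustration of the latter strategy (a sketch only, not run in this analysis; seasadj() from the forecast package subtracts the estimated seasonal components):

# Seasonally adjust the MSTS series by removing the estimated seasonal
# components, then fit an ARIMA model on the adjusted series
theft_adj <- seasadj(mstl(theft_multi))
theft_arima_adj <- auto.arima(theft_adj)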
Conclusion
After trying to optimize the model by transforming the data, the MAE rose slightly to 2.67, and the assumptions still could not be fulfilled. Compared with that model, our first ARIMA model, fitted without transforming the data, remains the better choice, with an MAE of 2.63. Adjusting the time series seasonality might further optimize the model and help fulfill the assumptions.
Based on the analysis and our model performance, the first Multi-Seasonal ARIMA model is able to predict the hourly theft crime rate in the city of Chicago with an average error of about 2.63 cases, but it can still be optimized in several ways, as the model does not fulfill the autocorrelation and normality assumptions.
As for the seasonal analysis, it is safe to say that theft crime is likely to increase from around 6 o'clock in the morning towards noon (12:00 PM), with even higher case counts reported on weekends. The crime is also more likely to happen in certain months, namely April, July, October, and December, with the highest count in July.