Chicago Crimes 2001-2017 Time Series Analysis
Introduction
Happy citizens are citizens who can feel safe wherever they want to be. Sadly, crime occurs in many forms that can harm people. Since it is a recurring act that can affect anyone, is it possible to predict and avoid it? Are there times when a crime is more likely to occur than others?
Using Time Series Analysis on a dataset of crime incidents reported in the city of Chicago from 2001 to early 2017, we will try to answer two questions: is there a pattern in when crimes are likely to happen, and at what times should you be more careful to reduce the chance of a crime happening to you?
The data is extracted from Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) System.
Import Data
library(dplyr)
library(prophet)
library(xts)
library(highcharter)
library(lubridate)
library(ggplot2)
library(plotly)
library(TSstudio)
library(forecast)
library(MLmetrics)
library(viridisLite)
library(viridis)
library(padr)
library(rmdformats)
library(tidyr)
library(tidyverse)
<- read.csv("data/Chicago_Crimes_2001_to_2004.csv")
cc20012004 <- read.csv("data/Chicago_Crimes_2005_to_2007.csv")
cc20052007 <- read.csv("data/Chicago_Crimes_2008_to_2011.csv")
cc20082011 <- read.csv("data/Chicago_Crimes_2012_to_2017.csv") cc20122017
Because the data is large and computing on all of it takes considerable resources, I will inspect just one of the data frames above.
glimpse(cc20122017)
Rows: 1,456,714
Columns: 23
$ X <int> 3, 89, 197, 673, 911, 1108, 1130, 1801, 1868, 189~
$ ID <int> 10508693, 10508695, 10508697, 10508698, 10508699,~
$ Case.Number <chr> "HZ250496", "HZ250409", "HZ250503", "HZ250424", "~
$ Date <chr> "05/03/2016 11:40:00 PM", "05/03/2016 09:40:00 PM~
$ Block <chr> "013XX S SAWYER AVE", "061XX S DREXEL AVE", "053X~
$ IUCR <chr> "0486", "0486", "0470", "0460", "0820", "041A", "~
$ Primary.Type <chr> "BATTERY", "BATTERY", "PUBLIC PEACE VIOLATION", "~
$ Description <chr> "DOMESTIC BATTERY SIMPLE", "DOMESTIC BATTERY SIMP~
$ Location.Description <chr> "APARTMENT", "RESIDENCE", "STREET", "SIDEWALK", "~
$ Arrest <chr> "True", "False", "False", "False", "False", "Fals~
$ Domestic <chr> "True", "True", "False", "False", "True", "False"~
$ Beat <int> 1022, 313, 1524, 1532, 1523, 631, 133, 215, 2432,~
$ District <dbl> 10, 3, 15, 15, 15, 6, 1, 2, 24, 7, 3, 18, 12, 1, ~
$ Ward <dbl> 24, 20, 37, 28, 28, 8, 3, 3, 40, 17, 7, 42, 2, 4,~
$ Community.Area <dbl> 29, 42, 25, 25, 25, 44, 35, 38, 1, 67, 43, 8, 28,~
$ FBI.Code <chr> "08B", "08B", "24", "08B", "06", "04B", "08B", "0~
$ X.Coordinate <dbl> 1154907, 1183066, 1140789, 1143223, 1139890, 1183~
$ Y.Coordinate <dbl> 1893681, 1864330, 1904819, 1901475, 1901675, 1850~
$ Year <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2~
$ Updated.On <chr> "05/10/2016 03:56:50 PM", "05/10/2016 03:56:50 PM~
$ Latitude <dbl> 41.86407, 41.78292, 41.89491, 41.88569, 41.88630,~
$ Longitude <dbl> -87.70682, -87.60436, -87.75837, -87.74952, -87.7~
$ Location <chr> "(41.864073157, -87.706818608)", "(41.782921527, ~
The result above gives an overview of the information recorded in the data frame. Since we will conduct a Time Series Analysis, we only need the Date column and the type of crime recorded in Primary.Type.
cc20012004 <- cc20012004[, c('Date', 'Primary.Type')]
cc20052007 <- cc20052007[, c('Date', 'Primary.Type')]
cc20082011 <- cc20082011[, c('Date', 'Primary.Type')]
cc20122017 <- cc20122017[, c('Date', 'Primary.Type')]
chicagocrimes <- rbind(cc20012004, cc20052007, cc20082011, cc20122017)
glimpse(chicagocrimes)
Rows: 6,664,949
Columns: 2
$ Date <chr> "01/01/2004 12:01:00 AM", "03/01/2003 12:00:00 AM", "06/2~
$ Primary.Type <chr> "THEFT", "OTHER OFFENSE", "OFFENSE INVOLVING CHILDREN", "~
Creating Time Series
cc20122017$Date <- as.Date(cc20122017$Date, "%m/%d/%Y %I:%M:%S %p")
by_Date <- na.omit(cc20122017) %>% group_by(Date) %>% summarise(Total = n())
tseries <- xts(by_Date$Total, order.by = as.POSIXct(by_Date$Date))

df <- cc20122017 %>% group_by(Date) %>% summarise(y = n()) %>% mutate(y = log(y))
names(df) <- c("ds", "y")
df$ds <- factor(df$ds)
Time Series Plot of Chicago Crimes 2012-2017
# Time Series plot of Chicago Crimes 2012-2017
hchart(tseries, name = "Crimes") %>%
hc_add_theme(hc_theme_darkunica()) %>%
hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>%
hc_title(text = "Times Series plot of Chicago Crimes") %>%
hc_legend(enabled = TRUE)
Data Wrangling
We will clean the data by omitting the unnecessary information, keeping only two variables: Date and Primary.Type.
crime_clean <- chicagocrimes %>%
  mutate(Date = mdy_hms(Date),
         Primary.Type = as.factor(Primary.Type)) %>%
  mutate(Date = floor_date(Date, unit = "hours")) %>%
  arrange(Date)
By aggregating the data, we can see which kinds of crime are most likely to happen in the city of Chicago.
Plot1 <- crime_clean %>%
  group_by(Primary.Type) %>%
  summarise(count = n()) %>%
  arrange(-count) %>%
  head(5) %>%
  ggplot(aes(x = count, y = reorder(Primary.Type, count),
             text = paste("Crime: ", Primary.Type, "<br>",
                          "Happened: ", count, " times"))) +
  geom_col(aes(fill = Primary.Type), show.legend = F) +
  scale_fill_viridis(option = "D", discrete = T, direction = 1, begin = 0.9, end = 0.3) +
  labs(title = "Top 5 Crimes in Chicago", x = "Number of Crimes", y = "Crime") +
  theme_gray()
ggplotly(Plot1, tooltip = "text")
Based on the plot above, these are the top 5 crimes that happened in Chicago. For this Time Series Analysis, we will pick one type of crime, Theft, and conduct the analysis on data from September 2014 to September 2016, building a model to forecast future theft counts.
# Create Prediction Time Frame
crime_theft <- crime_clean %>%
  filter(Primary.Type == 'THEFT') %>%
  group_by(Date) %>%
  summarise(Theft = n()) %>%
  filter(Date > "2014-09-01" & Date < "2016-09-02") %>%
  slice(-c(1:6)) %>%
  ungroup()
head(crime_theft, 10)
# A tibble: 10 x 2
Date Theft
<dttm> <int>
1 2014-09-01 00:00:00 10
2 2014-09-01 01:00:00 11
3 2014-09-01 02:00:00 2
4 2014-09-01 03:00:00 1
5 2014-09-01 04:00:00 2
6 2014-09-01 05:00:00 1
7 2014-09-01 06:00:00 1
8 2014-09-01 07:00:00 2
9 2014-09-01 08:00:00 5
10 2014-09-01 09:00:00 11
tail(crime_theft,20)
# A tibble: 20 x 2
Date Theft
<dttm> <int>
1 2016-08-31 21:00:00 7
2 2016-08-31 22:00:00 14
3 2016-08-31 23:00:00 12
4 2016-09-01 00:00:00 14
5 2016-09-01 01:00:00 3
6 2016-09-01 02:00:00 2
7 2016-09-01 03:00:00 2
8 2016-09-01 04:00:00 5
9 2016-09-01 05:00:00 2
10 2016-09-01 06:00:00 9
11 2016-09-01 07:00:00 6
12 2016-09-01 08:00:00 12
13 2016-09-01 09:00:00 12
14 2016-09-01 10:00:00 11
15 2016-09-01 11:00:00 7
16 2016-09-01 12:00:00 14
17 2016-09-01 13:00:00 11
18 2016-09-01 14:00:00 14
19 2016-09-01 15:00:00 11
20 2016-09-01 16:00:00 15
Next, we check whether there are any missing intervals in our data frame by padding the data and then checking for NA values created by the padding.
crime_theft <- crime_theft %>%
  pad(start_val = ymd_hms("2014-09-01 00:00:00"), end_val = ymd_hms("2016-08-31 23:00:00")) %>%
  mutate(Theft = replace_na(Theft, 0))
anyNA(crime_theft)
[1] FALSE
Cross Validation
After finishing our time frame, we will conduct cross-validation by splitting our data into two sets: train data, on which the model will be fitted, and test data, which provides an unbiased evaluation of the model's fit on the training set. Our test data will consist of the last 365 hourly observations of reported crime in our data frame (which spans early September 2014 to the end of August 2016).
theft_train <- head(crime_theft, nrow(crime_theft) - 365)

theft_test <- tail(crime_theft, 365)
Create Time Series Object
To create a time series model, we need to create a time series object from our train data. The object will be based on the Theft column, as that is the variable we are going to predict, and we set the frequency to 24 since there are 24 hourly observations of reported crime per day.
theft_ts <- ts(theft_train$Theft, frequency = 24)
Next, we will decompose our time series object and inspect the trend and seasonality by plotting it with autoplot().
# Decompose the past year of observations
theft_ts_dec <- theft_ts %>%
  tail(365) %>%
  decompose()

theft_ts_dec %>%
  autoplot()
As we can see from the plot, the trend still shows a repeating pattern similar to our seasonal component, which indicates there are additional seasonality patterns that have not been captured. To overcome this, we will create a multi-seasonal time series object.
# Create MSTS Object
theft_multi <- msts(theft_train$Theft, seasonal.periods = c(24,      # Daily
                                                            24*7,    # Weekly
                                                            24*30))  # Monthly
# Decompose MSTS Object
theft_multi_dec <- theft_multi %>%
  mstl()

theft_multi_dec %>%
  tail(365) %>%
  autoplot()
From the plot above, we can see that the trend of the Theft crime is now smooth, and the trend itself increases over the last 365 observations shown.
Seasonality Analysis
# Create a data frame based on MSTS Object
df_theft_multi <- as.data.frame(theft_multi_dec)
1. Daily Seasonality
This is the plot of the daily seasonality of Theft crime in Chicago.
Plot2 <- df_theft_multi %>%
  mutate(day = theft_train$Date) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*2) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_point(col = "maroon") + geom_line(col = "red") +
  theme_minimal()

Plot2
From the visualization, reported theft cases increase from approximately 06:00 AM to 12:00 PM. We can say that the range from 6 in the morning to noon is when people are out, active, and meeting others, while from midnight until the start of the day people tend to rest and have little activity. Daytime hours, when nobody is at home, also give offenders an opportunity to commit a crime, in this case theft.
2. Weekly Seasonality
This is the plot of the weekly seasonality of Theft crime in Chicago.
Plot3 <- df_theft_multi %>%
  mutate(day = theft_train$Date) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*7) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_point(col = "maroon") + geom_line(col = "red") +
  theme_minimal()

Plot3
We can see from the plot that the pattern is quite similar to the previous one; the difference is a slight increase in the number of reported crimes on the weekend. Although the cause is unknown, it is safe to say that people need to be more cautious about the chance of crime at those times.
3. Monthly Seasonality
This is the plot of the monthly seasonality of Theft crime in Chicago.
Plot4 <- df_theft_multi %>%
  mutate(day = theft_train$Date, month = month(theft_train$Date, label = T)) %>%
  group_by(month) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*30) %>%
  ggplot(aes(x = month, y = seasonal)) +
  geom_point() + geom_col() +
  theme_minimal()

Plot4
As we can see from the plot above, the crime tends to increase in April, July, October, and December, with the highest level in July. These months can be seen as a warning signal to people around the city that the chance of being harmed is higher around these times.
Modelling
After successfully creating the MSTS object, we will fit it with several models:
- Multi-Seasonal Holt-Winters
- Multi-Seasonal ARIMA
- Seasonal naive
Multi-Seasonal Holt-Winters
theft_hw <- stlm(theft_multi, method = "ets")

f_hw <- forecast(theft_hw, h = 365)

theft_multi %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_hw, series = "Multi-Seasonal Holt-Winters", color = "blue")
Multi-Seasonal ARIMA
theft_arima <- stlm(theft_multi, method = "arima")

f_arima <- forecast(theft_arima, h = 365)

theft_multi %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_arima, series = "Multi-Seasonal ARIMA", color = "red")
Seasonal Naive Method
theft_snaive <- snaive(y = theft_multi, h = 365)

f_snaive <- forecast(object = theft_snaive)

theft_multi %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_snaive, series = "Seasonal Naive", color = "#663366")
Model Comparison
To evaluate our models, we will judge them based on the Mean Absolute Error (MAE). We use MAE because it is commonly used in time series analysis and easy to interpret: it measures the average magnitude of the errors in a set of predictions. (Note that MAPE is unusable here, as the accuracy output below shows, because many hourly counts are zero, which makes the percentage error infinite.)
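As a quick illustration of the metric (a minimal sketch with made-up numbers, not our actual forecasts):

# MAE is the mean of the absolute differences between actual and predicted values
actual <- c(10, 11, 2, 1, 2)
predicted <- c(8, 12, 3, 1, 4)
mean(abs(actual - predicted)) # 1.2

# equivalently, via the MLmetrics package loaded earlier
MAE(y_pred = predicted, y_true = actual)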
# Multiseasonal HW
accuracy(f_hw, theft_test$Theft)
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.00005510394 2.121741 1.637138 NaN Inf 0.5495089 0.03621836
Test set -0.23828268612 3.561546 2.679921 -Inf Inf 0.8995209 NA
# Multiseasonal ARIMA
accuracy(f_arima, theft_test$Theft)
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.001814602 2.111024 1.628264 NaN Inf 0.5465301 0.00008691143
Test set 0.046691421 3.552433 2.632579 -Inf Inf 0.8836304 NA
# Seasonal Naive Method
accuracy(f_snaive, theft_test$Theft)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.02363449 4.065962 3.115803 -Inf Inf 1.045826 0.1769116
Test set -0.26849315 4.556495 3.534247 -Inf Inf 1.186277 NA
From the results above, we can conclude that our Multi-Seasonal ARIMA model has the lowest test error, with an MAE of 2.63. Hence, we will use the Multi-Seasonal ARIMA model as our prediction model and check the residual assumptions.
Assumption
The assumptions in a time series model are tested to measure whether the residuals obtained from the modeling are good enough to describe and capture the information in the data. Why use the residuals? Because the residuals combine information from the actual data with information from the model's predictions. A good forecasting method produces residuals with the following properties:
- Uncorrelated residuals. If the residuals are correlated, there is still information left that should be used to compute the forecasts.
- Residuals with a mean of 0 that are normally distributed (normality of residuals).
To ensure the resulting residuals meet the criteria above, we test the following assumptions.
No-autocorrelation residual
\(H_0\): residuals have no autocorrelation
\(H_1\): residuals have autocorrelation
desired p-value > 0.05 (alpha): no autocorrelation
To check for the presence or absence of autocorrelation in the forecast residuals, you can use several methods, namely:
- using the ACF residual plot visualization with the function acf(residual_model)
- performing a Ljung-Box test with the function Box.test(residual_model, type = "Ljung-Box")
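For instance, a minimal sketch of the visual check (assuming the fitted model's residuals are stored in theft_arima$residuals, as they are below):

# Plot the residual autocorrelations for up to one week of hourly lags;
# bars crossing the dashed confidence bounds suggest leftover autocorrelation
acf(theft_arima$residuals, lag.max = 24*7)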
In this case we will use the Ljung-Box test for autocorrelation and, as explained below, the Anderson-Darling test for normality. We can conclude that we have a good and optimal model provided the residuals fulfill these assumptions.
Box.test(theft_arima$residuals, type = "Ljung-Box", lag = 2*365)
Box-Ljung test
data: theft_arima$residuals
X-squared = 4787.4, df = 730, p-value < 0.00000000000000022
p-value > 0.05 : no autocorrelation in the residuals
p-value < 0.05 : autocorrelation present in the residuals
Based on the result, our p-value is < 0.05, which means there is autocorrelation in our model residuals.
Residual Normality
\(H_0\): residuals are normally distributed
\(H_1\): residuals are not normally distributed
desired p-value > 0.05 (alpha): residuals are normally distributed
To check the normality of the residuals, we can perform a normality test (Shapiro-Wilk) using the shapiro.test(residual_model) function. However, since the Shapiro-Wilk test cannot be run on more than 5000 records and this dataset has more than 5000 records, I will use the Anderson-Darling normality test from the nortest package instead.
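As a side note, another workaround (an assumption on my part, not used in this analysis) would be to apply the Shapiro-Wilk test to a random subsample of at most 5000 residuals:

# Shapiro-Wilk on a random subsample of 5000 residuals; the p-value will vary
# slightly with the seed, so this is only a rough check
set.seed(100)
shapiro.test(sample(as.numeric(theft_arima$residuals), 5000))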
library(nortest)
ad.test(theft_arima$residuals)
Anderson-Darling normality test
data: theft_arima$residuals
A = 27.966, p-value < 0.00000000000000022
p-value > 0.05 : residuals normally distributed
p-value < 0.05 : residuals not normally distributed
Based on the result, our p-value is < 0.05, which means our residuals are not normally distributed.
As neither assumption has been fulfilled, it can be said that our model can be tuned to be more optimal. We can try to optimize it by transforming the data.
Model Improvement
The assumption checks showed that our model's residuals failed to fulfill the assumptions, meaning the model is not optimal and can be improved. To fulfill these assumptions, we will try to transform our data: using the recipes package, we will transform the data to its square-root value and do some scaling.
# Data transformation
library(recipes)
rec <- recipe(Theft ~ ., data = theft_train) %>%
  step_sqrt(all_outcomes()) %>%
  step_center(all_outcomes()) %>%
  step_scale(all_outcomes()) %>%
  prep()
# change data train
theft_train_rec <- juice(rec)

# change data validation
theft_test_rec <- bake(rec, theft_test)
We can see that our Theft column has been transformed into its square-root value and scaled; this data will be used as our new train data.
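A quick peek at the transformed data (a minimal check):

# Inspect the first rows of the transformed training data; Theft should now be
# a centered and scaled square-root value instead of a raw count
head(theft_train_rec)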
# Revert Function
rec_rev <- function(x, rec) {
  # Retrieve the centering mean and scaling sd learned by the recipe
  means <- rec$steps[[2]]$means[["Theft"]]
  sds <- rec$steps[[3]]$sds[["Theft"]]

  # Undo the scaling and centering, then square to undo step_sqrt
  x <- (x * sds + means)^2
  # Round back to whole counts and clip negative values to zero
  x <- round(x)
  x <- ifelse(x < 0, 0, x)
  # (use exp(x) here instead if a log transform was applied)
  x
}
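As a rough sanity check (a sketch, assuming the recipe steps are ordered sqrt, center, scale as above), reverting the transformed training target should reproduce the original hourly counts:

# Revert the transformed target and compare it with the original counts
all.equal(rec_rev(theft_train_rec$Theft, rec), as.numeric(theft_train$Theft))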
Create Time Series Object
We will create a new MSTS object based on the theft_train_rec data, setting daily, weekly, and monthly frequencies.
# Create MSTS Object
theft_multi_rec <- msts(theft_train_rec$Theft, seasonal.periods = c(24, 24*7, 24*30))

# Decompose MSTS Object
theft_multi_decomp_rec <- theft_multi_rec %>%
  mstl()

theft_multi_decomp_rec %>%
  tail(365) %>%
  autoplot()
Modelling
Next, we model the transformed data. As we have already chosen Multi-Seasonal ARIMA as our prediction model, we will fit another one on our new MSTS object.
theft_arima_rec <- stlm(theft_multi_rec, method = "arima")

f_arima_rec <- forecast(theft_arima_rec, h = 365)

rec_rev_arima <- rec_rev(f_arima_rec$mean, rec = rec)

theft_multi_rec %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_arima_rec, series = "Multi-Seasonal ARIMA", color = "red")
# Model Accuracy
accuracy(rec_rev_arima, theft_test$Theft)
ME RMSE MAE MPE MAPE
Test set 0.2657534 3.609349 2.676712 -Inf Inf
From the output, we can see that transforming the data increased the MAE by about 0.04, to 2.67. Compared with the previous ARIMA model, the first one, fitted without transforming the data, is better. Let's check the model assumptions.
Assumption
We will check the autocorrelation and normality of our new model's residuals with the same tests we used previously.
# Autocorrelation
Box.test(theft_arima_rec$residuals, type = "Ljung-Box", lag = 2*365)
Box-Ljung test
data: theft_arima_rec$residuals
X-squared = 4819, df = 730, p-value < 0.00000000000000022
Based on the result, our p-value is lower than 0.05, which means there is autocorrelation in our model residuals.
# Normality
ad.test(theft_arima_rec$residuals)
Anderson-Darling normality test
data: theft_arima_rec$residuals
A = 10.456, p-value < 0.00000000000000022
Based on the result, our p-value is lower than 0.05, which means our residuals are not normally distributed.
As both assumptions are still not fulfilled, our model can be tuned further to be more optimal. Strategies to overcome this include using a different kind of model that can produce better predictions without these residual assumptions, or conducting a time series seasonality adjustment, as sketched below.
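As an illustration of the latter strategy (a sketch only, not run in this analysis; seasadj() from the forecast package subtracts the estimated seasonal components):

# Seasonally adjust the MSTS series by removing the estimated seasonal
# components, then fit an ARIMA model on the adjusted series
theft_adj <- seasadj(mstl(theft_multi))
theft_arima_adj <- auto.arima(theft_adj)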
Conclusion
After trying to optimize the model by transforming the data, the MAE rose slightly to 2.67, and the assumptions still could not be fulfilled. Compared with that model, our first ARIMA model, fitted without transforming the data, remains the better choice, with an MAE of 2.63. Adjusting the time series seasonality might further optimize the model and help fulfill the assumptions.
Based on the analysis and our model performance, the first Multi-Seasonal ARIMA model is able to predict the hourly theft crime rate in the city of Chicago with an average error of about 2.63 cases, but it can still be optimized in several ways, as the model does not fulfill the autocorrelation and normality assumptions.
As for the seasonal analysis, it is safe to say that theft crime is likely to increase from around 6 o'clock in the morning towards noon (12:00 PM), with even higher case counts reported on weekends. The crime is also more likely to happen in certain months, namely April, July, October, and December, with the highest count in July.