Introduction

A happy citizens are the citizen that can be safe wherever they wanted to be. Sadly crime can also occur in many ways that can potentially harm people. As it is a recurring act that could affect you in many form, is it possible to predict and avoid it? Is there any time that a crime can likely to occur more than any other time?

With a data that reflects reported crime incidents that happened in the city of Chicago from 2001 to early 2020 Using Time Series Analysis, we will try to predict and analyze whether is there any pattern for a crime to likely to happen? What are the best time that require you to be more careful so the crime is not likely to happen to you?

The data is extracted from Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) System.

Attaching Necessary Library

library(tidyr)
library(tidyverse)
library(lubridate)
library(ggplot2)
library(plotly)
library(TSstudio)
library(forecast)
library(MLmetrics)
library(dplyr)
library(viridisLite)
library(viridis)
library(padr)
library(rmdformats)
## Warning: package 'rmdformats' was built under R version 4.0.3
library(recipes)

Data Import

crime <- read.csv("Crimes_-_2001_to_Present.csv")
glimpse(crime)
## Rows: 7,222,382
## Columns: 22
## $ ID                   <int> 11034701, 11227287, 11227583, 11227293, 112276...
## $ Case.Number          <chr> "JA366925", "JB147188", "JB147595", "JB147230"...
## $ Date                 <chr> "01/01/2001 11:00:00 AM", "10/08/2017 03:00:00...
## $ Block                <chr> "016XX E 86TH PL", "092XX S RACINE AVE", "026X...
## $ IUCR                 <chr> "1153", "0281", "0620", "0810", "0281", "1751"...
## $ Primary.Type         <chr> "DECEPTIVE PRACTICE", "CRIM SEXUAL ASSAULT", "...
## $ Description          <chr> "FINANCIAL IDENTITY THEFT OVER $ 300", "NON-AG...
## $ Location.Description <chr> "RESIDENCE", "RESIDENCE", "OTHER", "RESIDENCE"...
## $ Arrest               <chr> "false", "false", "false", "false", "false", "...
## $ Domestic             <chr> "false", "false", "false", "false", "false", "...
## $ Beat                 <int> 412, 2222, 835, 313, 122, 813, 1033, 1432, 123...
## $ District             <int> 4, 22, 8, 3, 1, 8, 10, 14, 1, 6, 17, 4, 24, 4,...
## $ Ward                 <int> 8, 21, 18, 20, 42, 13, 12, 32, 2, 8, 39, 10, 4...
## $ Community.Area       <int> 45, 73, 70, 42, 32, 65, 30, 22, 32, 44, 13, 52...
## $ FBI.Code             <chr> "11", "02", "05", "06", "02", "17", "02", "06"...
## $ X.Coordinate         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Y.Coordinate         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Year                 <int> 2001, 2017, 2017, 2017, 2017, 2015, 2017, 2017...
## $ Updated.On           <chr> "08/05/2017 03:50:08 PM", "02/11/2018 03:57:41...
## $ Latitude             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Longitude            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Location             <chr> "", "", "", "", "", "", "", "", "", "", "", ""...

Above us is an overview of what kind of information that recorded in the data frame. As we will conduct Time Series Analysis, we are going to use only the necessary Information such as Date and the kind of crime that recorded in the Primary.Type.

Data Wrangling

First, we will clean the data by omitting the unnecessary information. We will only use 2 variables: - Date - Primary.Type

Then we will change the data type of these variables into the data type it should have been and aggregate the time into an hourly interval.

crime_clean <- crime %>%
  select(Date, Primary.Type) %>%
  mutate(Date = mdy_hms(Date),
         Primary.Type = as.factor(Primary.Type)) %>%
  mutate(Date = floor_date(Date, unit = "hours")) %>%
  arrange(Date)

By conducting data aggregation, we will see what kind of crime that likely to happen in the city of Chicago.

Plot1 <- crime_clean %>%
  group_by(Primary.Type) %>%
  summarise(count = n()) %>%
  arrange(-count) %>%
  head(5) %>%
  ggplot(aes(x = count, y = reorder(Primary.Type, count), text = paste("Crime: ", Primary.Type, "<br>",
                                                                       "Happened: ", count, " times"))) +
  geom_col(aes(fill = Primary.Type), show.legend = F) + 
  scale_fill_viridis(option = "D", discrete = T, direction = 1, begin = 0.9, end = 0.3) +
  labs(title = "Top 5 Crime in Chicago", x = "Number of Crime", y = "Crime") +
  theme_gray()
## `summarise()` ungrouping output (override with `.groups` argument)
ggplotly(Plot1, tooltip = "text")

Based on the plot above, these are the Top 5 Crimes that happened in Chicago. For this Time Series Analysis, We will pick one type of Crime and conduct the analysis based on data from September 2017 to September 2019. and we will create a model to predict the crime rate in the 2020

In this analysis we will choose Battery Crime

# Create Prediction Time Frame

crime_bat <- crime_clean %>% 
  filter(Primary.Type == 'BATTERY') %>%
  group_by(Date) %>%
  summarise(Battery = n()) %>%
  filter(Date > "2017-09-01" & Date < "2020-09-02") %>%
  slice(-c(1:6)) %>%
  ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
# Check The Beginning and Ending of the Time Frame

head(crime_bat, 10)
## # A tibble: 10 x 2
##    Date                Battery
##    <dttm>                <int>
##  1 2017-09-01 00:00:00       8
##  2 2017-09-01 01:00:00       2
##  3 2017-09-01 02:00:00       6
##  4 2017-09-01 03:00:00       2
##  5 2017-09-01 04:00:00       4
##  6 2017-09-01 05:00:00       1
##  7 2017-09-01 06:00:00       2
##  8 2017-09-01 07:00:00       8
##  9 2017-09-01 08:00:00       2
## 10 2017-09-01 09:00:00       3
# Check The Benginning and Ending of the Time Frame

tail(crime_bat, 20)
## # A tibble: 20 x 2
##    Date                Battery
##    <dttm>                <int>
##  1 2020-08-31 20:00:00       6
##  2 2020-08-31 21:00:00      11
##  3 2020-08-31 22:00:00      13
##  4 2020-08-31 23:00:00       9
##  5 2020-09-01 00:00:00       7
##  6 2020-09-01 01:00:00       8
##  7 2020-09-01 02:00:00       5
##  8 2020-09-01 03:00:00       3
##  9 2020-09-01 05:00:00       1
## 10 2020-09-01 06:00:00       1
## 11 2020-09-01 07:00:00       2
## 12 2020-09-01 08:00:00       4
## 13 2020-09-01 09:00:00       3
## 14 2020-09-01 10:00:00       3
## 15 2020-09-01 11:00:00       3
## 16 2020-09-01 12:00:00       5
## 17 2020-09-01 13:00:00       1
## 18 2020-09-01 14:00:00       6
## 19 2020-09-01 15:00:00       3
## 20 2020-09-01 16:00:00       2

As we seen from the data frame, the end of our prediction time frame still not correct. Our Time Frame will be from the 1st September 2017 into the 1st September 2020.

crime_bat <- crime_bat %>% 
  slice(-c(25679:25694))
# Rechecking The End of the Time Frame

tail(crime_bat)
## # A tibble: 6 x 2
##   Date                Battery
##   <dttm>                <int>
## 1 2020-08-31 18:00:00      10
## 2 2020-08-31 19:00:00       8
## 3 2020-08-31 20:00:00       6
## 4 2020-08-31 21:00:00      11
## 5 2020-08-31 22:00:00      13
## 6 2020-08-31 23:00:00       9

After our Time Frame is already correct, we will see whether any missing interval in our data frame by padding our data and check any NA Value that created from our padding

crime_bat <- crime_bat %>%
  pad(start_val = ymd_hms("2017-09-01 00:00:00"), end_val = ymd_hms("2020-08-31 23:00:00")) %>%
  mutate(Battery = replace_na(Battery, 0))
## pad applied on the interval: hour
anyNA(crime_bat)
## [1] FALSE

Cross Validation

After finishing our time frame, we will continue to conduct the cross validation activities. we will split our data into two types of data. train data as a data that will be trained in to our model, and test data as a data that provide an unbiased evaluation of a model fit on the training data set.

We will split our train data into two type of data set named train and test'. Ourtest` data will be a set of data consist of last 365 days of reported crime from our data frame (early Sept 2019 to the last August of 2020).

bat_train <- head(crime_bat, nrow(crime_bat) -  365)

bat_test <- tail(crime_bat, 365)

Create Time Series Object

To create a time series model, we need to create a time series object from our train data. time series object will be based on the Battery column as it is the one that we are going to predict, we set the frequency to be 24 as it is total hour of reported crime for 1 day.

bat_ts <- ts(bat_train$Battery, frequency = 24)
# Battery_plot <- bat_train %>%
#   ggplot(aes(x = Date, y = Battery)) +
#   geom_line(aes(color = "Battery")) +
#   scale_x_datetime(name = "Date", date_breaks = "2 year") +
#   scale_y_continuous(breaks = seq(0, 400, 100)) + 
#   theme_minimal() +
#   labs(title = "Chicago Battery Crime", subtitle = "2017 - 2020")
# 
# ggplotly(Battery_plot)

Next step, we will decompose our time series object and try to see the trend and seasonality by put it into a plot using autoplot()

#Decompose 1 tahun ke belakang
bat_ts_dec <- bat_ts %>%
  tail(365) %>%
  decompose()


bat_ts_dec %>%
  autoplot()

As we can see from the plot, the trend still shown us some pattern like our seasonal, this indicate there are other seasonality pattern that have not been caught by the plot. To overcome this, we will create a Multi Seasonal Time Series Object.

# Create MSTS Object
Battery_multi <- msts(bat_train$Battery, seasonal.periods = c(24, # Daily
                                                            24*7, # Weekly
                                                            24*30)) # Monthly

# Decompose MSTS Object
Battery_multi_dec <- Battery_multi %>%
  mstl()

Battery_multi_dec %>%
  tail(365) %>%
  autoplot()

From the plot above, we can see the trend of the Battery Crime is already going smooth. The Battery Crime trend itself is decreasing in the last 365 days.

Seasonality Analysis

# Create a data frame based on MSTS Object

df_Battery_multi <- as.data.frame(Battery_multi_dec)

1. Daily Seasonality

These are the plot of Daily Seasonality of Battery Crime in Chicago

Plot2 <- df_Battery_multi %>%
  mutate(day = bat_train$Date) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*2) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_point(col = "maroon") + geom_line(col = "red") +
  theme_minimal()
## `summarise()` ungrouping output (override with `.groups` argument)
Plot2

As you can see, The reported case of battery crime are increase approximately from 06:00 AM into 12:00 AM in the midnight. It safe to say that the range of 6 in the morning into midnight time is the time where people have activities and meet other people outside, while from midnight into the start of the day people tend to rest and does not have any activity, which is less likely to them to be physically harmed by the others.

2. Weekly Seasonality

These are the plot of Weekly Seasonality of Battery Crime in Chicago

Plot3 <- df_Battery_multi %>%
  mutate(day = bat_train$Date) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*7) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_point(col = "maroon") + geom_line(col = "red") +
  theme_minimal()
## `summarise()` ungrouping output (override with `.groups` argument)
Plot3

As we can see from the plot that it is quiet the same pattern from the plot before, the difference is that there are a slightly increase number of reported crime in the midweek. Though it still unknown how it came out. It is safe to say that people needed to be more cautious about chance of the crime could happen in those times.

3. Monthly Seasonality

These are the plot of monthly Seasonality of Battery Crime in Chicago

Plot4 <- df_Battery_multi %>%
  mutate(day = bat_train$Date, month = month(bat_train$Date, label = T)) %>%
  group_by(month) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*30) %>%
  ggplot(aes(x = month, y = seasonal)) +
  geom_point() + geom_col() +
  theme_minimal()
## `summarise()` ungrouping output (override with `.groups` argument)
Plot4

As we can see from the plot above, The crime tend to increase in the month of January, March, June, September, and December. These can be seen as a warning signal to people around the city that the chance of them getting harmed is high around these time.

Modelling

After we succeed creating the MSTS object, we will try to put our object into several models. Some models that we will try to create is

  • Multi-Seasonal Holt-Winter
  • Multi-Seasonal ARIMA model
  • Seasonal naive

Time Series Model: Multi-Seasonal Holt-Winters

Battery_hw <- stlm(Battery_multi, method = "ets")

f_hw <- forecast(Battery_hw, h = 365)

Battery_multi %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_hw, series = "Multi-Seasonal Holt_Winter", color = "blue") 

Time Series Model: Multi-Seasonal ARIMA

Battery_arima <- stlm(Battery_multi, method = "arima")

f_arima <- forecast(Battery_arima, h = 365)

Battery_multi %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_arima, series = "Multi-Seasonal ARIMA", color = "red") 

Time Series Model: Seasonal Naive Method

Battery_snaive <- snaive(y = Battery_multi, h = 365)

f_snaive <- forecast(object = Battery_snaive)

Battery_multi %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_snaive, series = "Multi-Seasonal ARIMA", color = "orange") 

Model Comparison

To evaluate our model,we will judge it based on the Mean Absolute Error of the model, we use MAE as it is commonly used in time series analysis and easier to interpret. MAE measure the average magnitude of the error in a set of prediction.

source : https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d

# Model Comparison

# Multiseasonal HW

accuracy(f_hw, bat_test$Battery)
##                         ME     RMSE      MAE  MPE MAPE      MASE       ACF1
## Training set -5.987372e-05 1.875405 1.455464  NaN  Inf 0.5515241 0.04885024
## Test set     -8.167615e-02 2.776759 2.146960 -Inf  Inf 0.8135552         NA
# Multiseasonal ARIMA

accuracy(f_arima, bat_test$Battery)  
##                         ME     RMSE      MAE  MPE MAPE      MASE          ACF1
## Training set -0.0003133535 1.863536 1.447478  NaN  Inf 0.5484981 -0.0002176535
## Test set     -0.1029392770 2.777750 2.150787 -Inf  Inf 0.8150051            NA
# Seasonal Naive Method

accuracy(f_snaive, bat_test$Battery)
##                       ME     RMSE      MAE  MPE MAPE     MASE      ACF1
## Training set -0.02030215 3.947713 2.993061 -Inf  Inf 1.134171 0.2750124
## Test set      0.13972603 3.609349 2.736986 -Inf  Inf 1.037136        NA

from MAE value, we can conclude that our Multi-Seasonal Holt-Winter model have the lowest error value by comparison with the value of 2.146960.

therefore, we will use Multi-Seasonal Holt-Winter model as our prediction model and check the assumptions of the residuals.

Assumption Checking

There are two kind of assumption that will be checked for our model residual:

  • Autocorrelation
  • Normality

we will use Ljung-Box test for our autocorrelation test and Kolgomorov-Smirnov test for our normality test. It can be conclude that we have a good and optimal model supposed the residual able to fulfill these assumptions.

Autocorrelation Test

# Autocorrelation

Box.test(Battery_hw$residuals, type = "Ljung-Box", lag = 2*365)
## 
##  Box-Ljung test
## 
## data:  Battery_hw$residuals
## X-squared = 7721.1, df = 730, p-value < 2.2e-16

p-value > 0.05 : No Autocorrelation in residual

p-value < 0.05 : There are Autocorrelation in residual

Based on the result, our p-value is lower than 0.05, which means there are autocorrelations in our model residual.

Normality Test

# Normality

ks.test(Battery_hw$residuals, y = "pnorm", alternative = "two.sided")
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  Battery_hw$residuals
## D = 0.1394, p-value < 2.2e-16
## alternative hypothesis: two-sided

p-value > 0.05 : Residual normally distributed

p-value < 0.05 : Residual not normally distributed

Based on the result, our p-value is lower than 0.05, which means our residual not normally distributed.

Model Tuning

The assumption checking shown us that our model’s residual failed to fulfill the assumptions, means that our model is not optimal enough and can be improved.

To fulfill these assumptions, we will try to transform our data. Using recipes package, we will transform our data into it square-root value and do some scalling.

# Data transformation

rec <- recipe(Battery ~., data = bat_train) %>%
   step_sqrt(all_outcomes()) %>% 
   step_center(all_outcomes()) %>%
   step_scale(all_outcomes()) %>%
   prep()

# Ubah data train

bat_train_rec <- juice(rec)


# Ubah data val

bat_test_rec <- bake(rec, bat_test)

We can see below that our Batteryr column have been transformed into in root-square value and already being scaled. this data will be used as our new train data.

# Revert Function

rec_rev <- function(x, rec) {
   
   means <- rec$steps[[2]]$means[["Battery"]]
   sds <- rec$steps[[3]]$sds[["Battery"]]
   
   x <- (x * sds + means)^2
   x <- round(x)
   #x <- exp(x)
   x <- ifelse(x < 0, 0, x)
   
   #X <- exp(x) kalo jadi log
   
   x
   
}

Create Time Series Object based on Transformed Data

We will create a new MSTS object, based on bat_train_rec data. we will set it into daily, weekly, and monthly frequency.

# Create MSTS Obect

Battery_multi_rec <- msts(bat_train_rec$Battery, seasonal.periods = c(24, 24*7, 24*30))

# Decompose MSTS Object

Battery_multi_decomp_rec <- Battery_multi_rec %>%
  mstl()

Battery_multi_decomp_rec %>%
  tail(365) %>%
  autoplot()

## Modelling based on Transformed Data

As we already choose Multi-Seasonal Holt-Winter as our prediction model, we will create another Multi-Seasonal Holt-Winter model but based on our new MSTS object.

Battery_hw_rec <- stlm(Battery_multi_rec, method = "ets")

f_hw_rec <- forecast(Battery_hw_rec, h = 365)

rec_rev_hw <- rec_rev(f_hw_rec$mean, rec = rec)

Battery_multi_rec %>%
  tail(365) %>%
  autoplot() +
  autolayer(f_hw_rec, series = "Multi-Seasonal Holt-Winter", color = "blue") 

# Model Accuracy

accuracy(rec_rev_hw, bat_test$Battery)
##                  ME     RMSE      MAE  MPE MAPE
## Test set 0.09041096 2.803618 2.117808 -Inf  Inf

we can see that by transforming our data, We able to decrease the MAE aroun 0.3%, with value of MAE 2.117808

Assumption Checking

We will check the Autocorrelation and Normality of our new model residual with the same test we used previously

# Autocorrelation

Box.test(Battery_hw_rec$residuals, type = "Ljung-Box", lag = 2*365)
## 
##  Box-Ljung test
## 
## data:  Battery_hw_rec$residuals
## X-squared = 7521.1, df = 730, p-value < 2.2e-16

p-value > 0.05 : No Autocorrelation in residual

p-value < 0.05 : There are Autocorrelation in residual

Based on the result, our p-value is lower than 0.05, which means there are autocorrelations in our model residual.

# Normality
ks.test(Battery_hw$residuals, y = "pnorm", alternative = "two.sided")
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  Battery_hw$residuals
## D = 0.1394, p-value < 2.2e-16
## alternative hypothesis: two-sided

p-value > 0.05 : Residual normally distributed

p-value < 0.05 : Residual not normally distributed

Based on the result, our p-value is lower than 0.05, which means our residual not normally distributed.

As both of the assumptions still has not been fulfilled it can be said that our model can be tuned to be more optimal.

Some strategy to overcome this problem are using different kind of model that can create a better prediction and avoif the assumption checking, another way is by conduct time series seasonality adjustment.

Conclusion

Based on the analysis and our model performance, our model able to predict the battery crime rate in the city of Chicago with the error around 2.11%. Still, it can be optimized thoroughly with several way as our model does not fulfill the Autocorrelation and Normality assumptions.

As for the seasonal analysis, it is safe to say that the battery crime is likely to increase around 6 o’clock in the morning towards the midnight time and higher case even reported in the midweek. The crime itself is more likely to happen in several months such as January, March, June, September, and December.