Chicago Crime: A Time Series Analysis
Introduction
A happy citizens are the citizen that can be safe wherever they wanted to be. Sadly crime can also occur in many ways that can potentially harm people. As it is a recurring act that could affect you in many form, is it possible to predict and avoid it? Is there any time that a crime can likely to occur more than any other time?
With a data that reflects reported crime incidents that happened in the city of Chicago from 2001 to early 2020 Using Time Series Analysis, we will try to predict and analyze whether is there any pattern for a crime to likely to happen? What are the best time that require you to be more careful so the crime is not likely to happen to you?
The data is extracted from Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) System.
Attaching Necessary Library
library(tidyr)
library(tidyverse)
library(lubridate)
library(ggplot2)
library(plotly)
library(TSstudio)
library(forecast)
library(MLmetrics)
library(dplyr)
library(viridisLite)
library(viridis)
library(padr)
library(rmdformats)## Warning: package 'rmdformats' was built under R version 4.0.3
Data Import
## Rows: 7,222,382
## Columns: 22
## $ ID <int> 11034701, 11227287, 11227583, 11227293, 112276...
## $ Case.Number <chr> "JA366925", "JB147188", "JB147595", "JB147230"...
## $ Date <chr> "01/01/2001 11:00:00 AM", "10/08/2017 03:00:00...
## $ Block <chr> "016XX E 86TH PL", "092XX S RACINE AVE", "026X...
## $ IUCR <chr> "1153", "0281", "0620", "0810", "0281", "1751"...
## $ Primary.Type <chr> "DECEPTIVE PRACTICE", "CRIM SEXUAL ASSAULT", "...
## $ Description <chr> "FINANCIAL IDENTITY THEFT OVER $ 300", "NON-AG...
## $ Location.Description <chr> "RESIDENCE", "RESIDENCE", "OTHER", "RESIDENCE"...
## $ Arrest <chr> "false", "false", "false", "false", "false", "...
## $ Domestic <chr> "false", "false", "false", "false", "false", "...
## $ Beat <int> 412, 2222, 835, 313, 122, 813, 1033, 1432, 123...
## $ District <int> 4, 22, 8, 3, 1, 8, 10, 14, 1, 6, 17, 4, 24, 4,...
## $ Ward <int> 8, 21, 18, 20, 42, 13, 12, 32, 2, 8, 39, 10, 4...
## $ Community.Area <int> 45, 73, 70, 42, 32, 65, 30, 22, 32, 44, 13, 52...
## $ FBI.Code <chr> "11", "02", "05", "06", "02", "17", "02", "06"...
## $ X.Coordinate <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Y.Coordinate <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Year <int> 2001, 2017, 2017, 2017, 2017, 2015, 2017, 2017...
## $ Updated.On <chr> "08/05/2017 03:50:08 PM", "02/11/2018 03:57:41...
## $ Latitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Longitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Location <chr> "", "", "", "", "", "", "", "", "", "", "", ""...
Above us is an overview of what kind of information that recorded in the data frame. As we will conduct Time Series Analysis, we are going to use only the necessary Information such as Date and the kind of crime that recorded in the Primary.Type.
Data Wrangling
First, we will clean the data by omitting the unnecessary information. We will only use 2 variables: - Date - Primary.Type
Then we will change the data type of these variables into the data type it should have been and aggregate the time into an hourly interval.
crime_clean <- crime %>%
select(Date, Primary.Type) %>%
mutate(Date = mdy_hms(Date),
Primary.Type = as.factor(Primary.Type)) %>%
mutate(Date = floor_date(Date, unit = "hours")) %>%
arrange(Date)By conducting data aggregation, we will see what kind of crime that likely to happen in the city of Chicago.
Plot1 <- crime_clean %>%
group_by(Primary.Type) %>%
summarise(count = n()) %>%
arrange(-count) %>%
head(5) %>%
ggplot(aes(x = count, y = reorder(Primary.Type, count), text = paste("Crime: ", Primary.Type, "<br>",
"Happened: ", count, " times"))) +
geom_col(aes(fill = Primary.Type), show.legend = F) +
scale_fill_viridis(option = "D", discrete = T, direction = 1, begin = 0.9, end = 0.3) +
labs(title = "Top 5 Crime in Chicago", x = "Number of Crime", y = "Crime") +
theme_gray()## `summarise()` ungrouping output (override with `.groups` argument)
Based on the plot above, these are the Top 5 Crimes that happened in Chicago. For this Time Series Analysis, We will pick one type of Crime and conduct the analysis based on data from September 2017 to September 2019. and we will create a model to predict the crime rate in the 2020
In this analysis we will choose Battery Crime
# Create Prediction Time Frame
crime_bat <- crime_clean %>%
filter(Primary.Type == 'BATTERY') %>%
group_by(Date) %>%
summarise(Battery = n()) %>%
filter(Date > "2017-09-01" & Date < "2020-09-02") %>%
slice(-c(1:6)) %>%
ungroup()## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 2
## Date Battery
## <dttm> <int>
## 1 2017-09-01 00:00:00 8
## 2 2017-09-01 01:00:00 2
## 3 2017-09-01 02:00:00 6
## 4 2017-09-01 03:00:00 2
## 5 2017-09-01 04:00:00 4
## 6 2017-09-01 05:00:00 1
## 7 2017-09-01 06:00:00 2
## 8 2017-09-01 07:00:00 8
## 9 2017-09-01 08:00:00 2
## 10 2017-09-01 09:00:00 3
## # A tibble: 20 x 2
## Date Battery
## <dttm> <int>
## 1 2020-08-31 20:00:00 6
## 2 2020-08-31 21:00:00 11
## 3 2020-08-31 22:00:00 13
## 4 2020-08-31 23:00:00 9
## 5 2020-09-01 00:00:00 7
## 6 2020-09-01 01:00:00 8
## 7 2020-09-01 02:00:00 5
## 8 2020-09-01 03:00:00 3
## 9 2020-09-01 05:00:00 1
## 10 2020-09-01 06:00:00 1
## 11 2020-09-01 07:00:00 2
## 12 2020-09-01 08:00:00 4
## 13 2020-09-01 09:00:00 3
## 14 2020-09-01 10:00:00 3
## 15 2020-09-01 11:00:00 3
## 16 2020-09-01 12:00:00 5
## 17 2020-09-01 13:00:00 1
## 18 2020-09-01 14:00:00 6
## 19 2020-09-01 15:00:00 3
## 20 2020-09-01 16:00:00 2
As we seen from the data frame, the end of our prediction time frame still not correct. Our Time Frame will be from the 1st September 2017 into the 1st September 2020.
## # A tibble: 6 x 2
## Date Battery
## <dttm> <int>
## 1 2020-08-31 18:00:00 10
## 2 2020-08-31 19:00:00 8
## 3 2020-08-31 20:00:00 6
## 4 2020-08-31 21:00:00 11
## 5 2020-08-31 22:00:00 13
## 6 2020-08-31 23:00:00 9
After our Time Frame is already correct, we will see whether any missing interval in our data frame by padding our data and check any NA Value that created from our padding
crime_bat <- crime_bat %>%
pad(start_val = ymd_hms("2017-09-01 00:00:00"), end_val = ymd_hms("2020-08-31 23:00:00")) %>%
mutate(Battery = replace_na(Battery, 0))## pad applied on the interval: hour
## [1] FALSE
Cross Validation
After finishing our time frame, we will continue to conduct the cross validation activities. we will split our data into two types of data. train data as a data that will be trained in to our model, and test data as a data that provide an unbiased evaluation of a model fit on the training data set.
We will split our train data into two type of data set named train and test'. Ourtest` data will be a set of data consist of last 365 days of reported crime from our data frame (early Sept 2019 to the last August of 2020).
Create Time Series Object
To create a time series model, we need to create a time series object from our train data. time series object will be based on the Battery column as it is the one that we are going to predict, we set the frequency to be 24 as it is total hour of reported crime for 1 day.
# Battery_plot <- bat_train %>%
# ggplot(aes(x = Date, y = Battery)) +
# geom_line(aes(color = "Battery")) +
# scale_x_datetime(name = "Date", date_breaks = "2 year") +
# scale_y_continuous(breaks = seq(0, 400, 100)) +
# theme_minimal() +
# labs(title = "Chicago Battery Crime", subtitle = "2017 - 2020")
#
# ggplotly(Battery_plot)Next step, we will decompose our time series object and try to see the trend and seasonality by put it into a plot using autoplot()
#Decompose 1 tahun ke belakang
bat_ts_dec <- bat_ts %>%
tail(365) %>%
decompose()
bat_ts_dec %>%
autoplot()As we can see from the plot, the trend still shown us some pattern like our seasonal, this indicate there are other seasonality pattern that have not been caught by the plot. To overcome this, we will create a Multi Seasonal Time Series Object.
# Create MSTS Object
Battery_multi <- msts(bat_train$Battery, seasonal.periods = c(24, # Daily
24*7, # Weekly
24*30)) # Monthly
# Decompose MSTS Object
Battery_multi_dec <- Battery_multi %>%
mstl()
Battery_multi_dec %>%
tail(365) %>%
autoplot()From the plot above, we can see the trend of the Battery Crime is already going smooth. The Battery Crime trend itself is decreasing in the last 365 days.
Seasonality Analysis
1. Daily Seasonality
These are the plot of Daily Seasonality of Battery Crime in Chicago
Plot2 <- df_Battery_multi %>%
mutate(day = bat_train$Date) %>%
group_by(day) %>%
summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
head(24*2) %>%
ggplot(aes(x = day, y = seasonal)) +
geom_point(col = "maroon") + geom_line(col = "red") +
theme_minimal()## `summarise()` ungrouping output (override with `.groups` argument)
As you can see, The reported case of battery crime are increase approximately from 06:00 AM into 12:00 AM in the midnight. It safe to say that the range of 6 in the morning into midnight time is the time where people have activities and meet other people outside, while from midnight into the start of the day people tend to rest and does not have any activity, which is less likely to them to be physically harmed by the others.
2. Weekly Seasonality
These are the plot of Weekly Seasonality of Battery Crime in Chicago
Plot3 <- df_Battery_multi %>%
mutate(day = bat_train$Date) %>%
group_by(day) %>%
summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
head(24*7) %>%
ggplot(aes(x = day, y = seasonal)) +
geom_point(col = "maroon") + geom_line(col = "red") +
theme_minimal()## `summarise()` ungrouping output (override with `.groups` argument)
As we can see from the plot that it is quiet the same pattern from the plot before, the difference is that there are a slightly increase number of reported crime in the midweek. Though it still unknown how it came out. It is safe to say that people needed to be more cautious about chance of the crime could happen in those times.
3. Monthly Seasonality
These are the plot of monthly Seasonality of Battery Crime in Chicago
Plot4 <- df_Battery_multi %>%
mutate(day = bat_train$Date, month = month(bat_train$Date, label = T)) %>%
group_by(month) %>%
summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
head(24*30) %>%
ggplot(aes(x = month, y = seasonal)) +
geom_point() + geom_col() +
theme_minimal()## `summarise()` ungrouping output (override with `.groups` argument)
As we can see from the plot above, The crime tend to increase in the month of January, March, June, September, and December. These can be seen as a warning signal to people around the city that the chance of them getting harmed is high around these time.
Modelling
After we succeed creating the MSTS object, we will try to put our object into several models. Some models that we will try to create is
- Multi-Seasonal Holt-Winter
- Multi-Seasonal ARIMA model
- Seasonal naive
Time Series Model: Multi-Seasonal Holt-Winters
Battery_hw <- stlm(Battery_multi, method = "ets")
f_hw <- forecast(Battery_hw, h = 365)
Battery_multi %>%
tail(365) %>%
autoplot() +
autolayer(f_hw, series = "Multi-Seasonal Holt_Winter", color = "blue") Time Series Model: Multi-Seasonal ARIMA
Battery_arima <- stlm(Battery_multi, method = "arima")
f_arima <- forecast(Battery_arima, h = 365)
Battery_multi %>%
tail(365) %>%
autoplot() +
autolayer(f_arima, series = "Multi-Seasonal ARIMA", color = "red") Time Series Model: Seasonal Naive Method
Battery_snaive <- snaive(y = Battery_multi, h = 365)
f_snaive <- forecast(object = Battery_snaive)
Battery_multi %>%
tail(365) %>%
autoplot() +
autolayer(f_snaive, series = "Multi-Seasonal ARIMA", color = "orange") Model Comparison
To evaluate our model,we will judge it based on the Mean Absolute Error of the model, we use MAE as it is commonly used in time series analysis and easier to interpret. MAE measure the average magnitude of the error in a set of prediction.
source : https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -5.987372e-05 1.875405 1.455464 NaN Inf 0.5515241 0.04885024
## Test set -8.167615e-02 2.776759 2.146960 -Inf Inf 0.8135552 NA
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.0003133535 1.863536 1.447478 NaN Inf 0.5484981 -0.0002176535
## Test set -0.1029392770 2.777750 2.150787 -Inf Inf 0.8150051 NA
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.02030215 3.947713 2.993061 -Inf Inf 1.134171 0.2750124
## Test set 0.13972603 3.609349 2.736986 -Inf Inf 1.037136 NA
from MAE value, we can conclude that our Multi-Seasonal Holt-Winter model have the lowest error value by comparison with the value of 2.146960.
therefore, we will use Multi-Seasonal Holt-Winter model as our prediction model and check the assumptions of the residuals.
Assumption Checking
There are two kind of assumption that will be checked for our model residual:
- Autocorrelation
- Normality
we will use Ljung-Box test for our autocorrelation test and Kolgomorov-Smirnov test for our normality test. It can be conclude that we have a good and optimal model supposed the residual able to fulfill these assumptions.
Autocorrelation Test
##
## Box-Ljung test
##
## data: Battery_hw$residuals
## X-squared = 7721.1, df = 730, p-value < 2.2e-16
p-value > 0.05 : No Autocorrelation in residual
p-value < 0.05 : There are Autocorrelation in residual
Based on the result, our p-value is lower than 0.05, which means there are autocorrelations in our model residual.
Normality Test
##
## One-sample Kolmogorov-Smirnov test
##
## data: Battery_hw$residuals
## D = 0.1394, p-value < 2.2e-16
## alternative hypothesis: two-sided
p-value > 0.05 : Residual normally distributed
p-value < 0.05 : Residual not normally distributed
Based on the result, our p-value is lower than 0.05, which means our residual not normally distributed.
Model Tuning
The assumption checking shown us that our model’s residual failed to fulfill the assumptions, means that our model is not optimal enough and can be improved.
To fulfill these assumptions, we will try to transform our data. Using recipes package, we will transform our data into it square-root value and do some scalling.
# Data transformation
rec <- recipe(Battery ~., data = bat_train) %>%
step_sqrt(all_outcomes()) %>%
step_center(all_outcomes()) %>%
step_scale(all_outcomes()) %>%
prep()
# Ubah data train
bat_train_rec <- juice(rec)
# Ubah data val
bat_test_rec <- bake(rec, bat_test)We can see below that our Batteryr column have been transformed into in root-square value and already being scaled. this data will be used as our new train data.
# Revert Function
rec_rev <- function(x, rec) {
means <- rec$steps[[2]]$means[["Battery"]]
sds <- rec$steps[[3]]$sds[["Battery"]]
x <- (x * sds + means)^2
x <- round(x)
#x <- exp(x)
x <- ifelse(x < 0, 0, x)
#X <- exp(x) kalo jadi log
x
}Create Time Series Object based on Transformed Data
We will create a new MSTS object, based on bat_train_rec data. we will set it into daily, weekly, and monthly frequency.
# Create MSTS Obect
Battery_multi_rec <- msts(bat_train_rec$Battery, seasonal.periods = c(24, 24*7, 24*30))
# Decompose MSTS Object
Battery_multi_decomp_rec <- Battery_multi_rec %>%
mstl()
Battery_multi_decomp_rec %>%
tail(365) %>%
autoplot() ## Modelling based on Transformed Data
As we already choose Multi-Seasonal Holt-Winter as our prediction model, we will create another Multi-Seasonal Holt-Winter model but based on our new MSTS object.
Battery_hw_rec <- stlm(Battery_multi_rec, method = "ets")
f_hw_rec <- forecast(Battery_hw_rec, h = 365)
rec_rev_hw <- rec_rev(f_hw_rec$mean, rec = rec)
Battery_multi_rec %>%
tail(365) %>%
autoplot() +
autolayer(f_hw_rec, series = "Multi-Seasonal Holt-Winter", color = "blue") ## ME RMSE MAE MPE MAPE
## Test set 0.09041096 2.803618 2.117808 -Inf Inf
we can see that by transforming our data, We able to decrease the MAE aroun 0.3%, with value of MAE 2.117808
Assumption Checking
We will check the Autocorrelation and Normality of our new model residual with the same test we used previously
##
## Box-Ljung test
##
## data: Battery_hw_rec$residuals
## X-squared = 7521.1, df = 730, p-value < 2.2e-16
p-value > 0.05 : No Autocorrelation in residual
p-value < 0.05 : There are Autocorrelation in residual
Based on the result, our p-value is lower than 0.05, which means there are autocorrelations in our model residual.
##
## One-sample Kolmogorov-Smirnov test
##
## data: Battery_hw$residuals
## D = 0.1394, p-value < 2.2e-16
## alternative hypothesis: two-sided
p-value > 0.05 : Residual normally distributed
p-value < 0.05 : Residual not normally distributed
Based on the result, our p-value is lower than 0.05, which means our residual not normally distributed.
As both of the assumptions still has not been fulfilled it can be said that our model can be tuned to be more optimal.
Some strategy to overcome this problem are using different kind of model that can create a better prediction and avoif the assumption checking, another way is by conduct time series seasonality adjustment.
Conclusion
Based on the analysis and our model performance, our model able to predict the battery crime rate in the city of Chicago with the error around 2.11%. Still, it can be optimized thoroughly with several way as our model does not fulfill the Autocorrelation and Normality assumptions.
As for the seasonal analysis, it is safe to say that the battery crime is likely to increase around 6 o’clock in the morning towards the midnight time and higher case even reported in the midweek. The crime itself is more likely to happen in several months such as January, March, June, September, and December.