Crime Rate Multiple Seasonality Forecasting
Crime is one of the most terrifying experiences a person can go through, and everyone tries to avoid anything that might bring crime into their life. Crime rates keep climbing, so taking anticipatory action is crucial. One thing we can do is predict when and how much crime will happen in the near future.
Knowing when and how much crime will happen helps police officers prepare and act before the worst case occurs, and it also helps prevent people from becoming victims. This time I'll try to predict the total number of crimes that will happen in Chicago in each hour of the next 7 days. The dataset contains every crime from January 1, 2023 until April 10, 2023, and I'll focus on theft cases since theft has the highest count of any crime type in Chicago. The dataset used was obtained from here: dataset
Import Libraries
library(dplyr)
library(lubridate)
library(padr)
library(forecast)
library(tidyr)
library(ggplot2)
library(plotly)
Data Preparation
crime <- read.csv("Crimes_-_2001_to_Present.csv")
rmarkdown::paged_table(crime)
Column description:
- ID: Unique identifier for the record.
- Case.Number: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
- Date: Date when the incident occurred. This is sometimes a best estimate.
- Block: The partially redacted address where the incident occurred, placing it on the same block as the actual address.
- IUCR: The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description.
- Primary.Type: The primary description of the IUCR code.
- Description: The secondary description of the IUCR code, a subcategory of the primary description.
- Location.Description: Description of the location where the incident occurred.
- Arrest: Indicates whether an arrest was made.
- Domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
- Beat: Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts.
- District: Indicates the police district where the incident occurred.
- Ward: The ward (City Council district) where the incident occurred.
- Community.Area: Indicates the community area where the incident occurred. Chicago has 77 community areas.
- FBI.Code: Indicates the crime classification as outlined in the FBI’s National Incident-Based Reporting System (NIBRS).
- X.Coordinate: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
- Y.Coordinate: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
- Year: Year the incident occurred.
- Updated.On: Date and time the record was last updated.
- Latitude: The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- Longitude: The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- Location: The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.
Data Preprocessing
table(crime$Primary.Type)
#>
#> ARSON ASSAULT
#> 112 5505
#> BATTERY BURGLARY
#> 10674 1921
#> CONCEALED CARRY LICENSE VIOLATION CRIMINAL DAMAGE
#> 53 7469
#> CRIMINAL SEXUAL ASSAULT CRIMINAL TRESPASS
#> 373 1323
#> DECEPTIVE PRACTICE GAMBLING
#> 4171 2
#> HOMICIDE HUMAN TRAFFICKING
#> 130 2
#> INTERFERENCE WITH PUBLIC OFFICER INTIMIDATION
#> 169 80
#> KIDNAPPING LIQUOR LAW VIOLATION
#> 30 52
#> MOTOR VEHICLE THEFT NARCOTICS
#> 8070 1486
#> OBSCENITY OFFENSE INVOLVING CHILDREN
#> 11 530
#> OTHER NARCOTIC VIOLATION OTHER OFFENSE
#> 2 4197
#> PROSTITUTION PUBLIC INDECENCY
#> 85 1
#> PUBLIC PEACE VIOLATION ROBBERY
#> 202 2409
#> SEX OFFENSE STALKING
#> 309 118
#> THEFT WEAPONS VIOLATION
#> 13551 2259
From the frequency table above, THEFT has the highest count of any crime type in Chicago, so we will predict the number of thefts in the next 7 days (a week).
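As a quick aside, here is a small sketch that confirms the top crime type programmatically rather than scanning the table by eye, using dplyr's count():

crime %>%
  count(Primary.Type, sort = TRUE) %>% # tally each type, largest first
  head(3)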
Before we build the TS object, we first need to do some data preparation, namely aggregating and summarising the data. This time we will aggregate the data into hourly frequency.
Data Aggregation
theft <- crime %>%
filter(Primary.Type == "THEFT") %>%
select(Date) %>%
mutate(Date = mdy_hms(Date),
Date = floor_date(Date, unit = "hour")) %>%
group_by(Date) %>%
  summarise(total_cases = n()) %>%
ungroup()
theft
#> # A tibble: 2,318 × 2
#> Date total_cases
#> <dttm> <int>
#> 1 2023-01-01 00:00:00 12
#> 2 2023-01-01 01:00:00 8
#> 3 2023-01-01 02:00:00 4
#> 4 2023-01-01 03:00:00 7
#> 5 2023-01-01 04:00:00 3
#> 6 2023-01-01 05:00:00 2
#> 7 2023-01-01 07:00:00 1
#> 8 2023-01-01 08:00:00 6
#> 9 2023-01-01 09:00:00 5
#> 10 2023-01-01 10:00:00 6
#> # ℹ 2,308 more rows
Now that we have the total THEFT cases for each hour, we next need to make sure there are no missing time intervals. Since our data has 2,318 rows in total, it is really difficult to check manually, so we'll use the pad() function to fill the missing intervals and replace the resulting total_cases missing values with zero. A quick sanity check below counts how many hourly slots are actually missing.
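A minimal sketch of that check, assuming the full range runs from 2023-01-01 00:00 to 2023-04-10 23:00 (100 days, i.e. 2,400 hours):

# Build the complete hourly sequence and compare its length to our data
expected_hours <- seq(from = ymd_hms("2023-01-01 00:00:00"),
                      to   = ymd_hms("2023-04-10 23:00:00"),
                      by   = "hour")
length(expected_hours) - nrow(theft) # hourly slots with no recorded theft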
Padding Data
theft_pad <- theft %>%
arrange(Date) %>%
pad(start_val = ymd_hms("2023-01-01 00:00:00"),
end_val = ymd_hms("2023-04-10 23:00:00")) %>%
  mutate(total_cases = replace_na(total_cases, 0))
colSums(is.na(theft_pad))
#> Date total_cases
#> 0 0
summary(theft_pad)
#> Date total_cases
#> Min. :2023-01-01 00:00:00 Min. : 0.000
#> 1st Qu.:2023-01-25 23:45:00 1st Qu.: 3.000
#> Median :2023-02-19 23:30:00 Median : 5.000
#> Mean :2023-02-19 23:30:00 Mean : 5.646
#> 3rd Qu.:2023-03-16 23:15:00 3rd Qu.: 8.000
#> Max. :2023-04-10 23:00:00 Max. :22.000
Build TS Object
First, I'll use the ts() command to build an initial ts object so we can explore whether our time series has trend and seasonal properties (single-seasonal or multi-seasonal).
#First model
theft_ts <- ts(theft_pad$total_cases,
               frequency = 24) # daily seasonality
Plotting TS Object
theft_ts %>%
decompose() %>%
  autoplot()
From the plot above, we can see that our trend component still has a seasonal pattern captured in it, so I'll try another method to build a TS object that can handle multi-seasonal data.
I'll try to use msts() with a few combinations of seasonal frequencies.
TS Object Using MSTS
#Second model
theft_pad$total_cases %>%
msts(seasonal.periods = c(24, 24*7)) %>%
mstl() %>%
  autoplot()
Our trend component still captures a seasonal pattern, so I'll try changing the seasonal periods.
#Third model
theft_pad$total_cases %>%
msts(seasonal.periods = c(24*7, 24*7*4)) %>%
mstl() %>%
  autoplot()
The third model has the best decomposition of the three, so we will use its seasonal periods for time series model building. One way to back this visual judgment with a number is sketched below.
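As a rough quantitative check (a sketch; remainder_sd is a hypothetical helper, and the "Remainder" column name follows mstl()'s output convention), we can compare the standard deviation of the remainder component across candidate period sets, where a lower value suggests a cleaner decomposition:

# Decompose with the given seasonal periods and measure what is left over
remainder_sd <- function(x, periods) {
  comps <- x %>% msts(seasonal.periods = periods) %>% mstl()
  sd(comps[, "Remainder"])
}
remainder_sd(theft_pad$total_cases, c(24, 24*7))     # second model
remainder_sd(theft_pad$total_cases, c(24*7, 24*7*4)) # third model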
theft_msts <- theft_pad$total_cases %>%
  msts(seasonal.periods = c(24*7, 24*7*4))
Cross Validation
I split the data into train and test sets for validation purposes. The test set covers one week of hourly observations, while the rest serves as training data.
data_test <- tail(theft_msts, 24*7)
data_train <- head(theft_msts, -24*7)
Model Building
For model building I'll use STL decomposition combined with ETS (exponential smoothing) and with ARIMA, via stlm(). I chose these because my data contain both trend and multiple seasonality.
theft_ets <- stlm(data_train, method = "ets")
theft_arima <- stlm(data_train, method = "arima")
Forecast and Model Evaluation
Forecast
I will predict the total crime cases for the next seven days and compare the predictions with the actual data (data_test).
theft_ets_f <- forecast(theft_ets, h = 24*7)
theft_arima_f <- forecast(theft_arima, h = 24*7)
Plotting Data
Plotting the forecasting results against the actual data:
data_test %>%
autoplot() +
autolayer(theft_ets_f$mean, series = "ETS") +
autolayer(theft_arima_f$mean, series = "ARIMA")
Accuracy Test
In this accuracy test, we focus only on RMSE and MAE. RMSE measures how far the predictions deviate from the actual values, penalizing large errors more heavily because the differences are squared. MAE also measures how far the predictions deviate from the actual values, but it averages the absolute differences between prediction and actual value.
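For reference, both metrics compare predictions $\hat{y}_t$ with actual values $y_t$ over $n$ test points:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|
$$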
accuracy(theft_ets_f, data_test)
#> ME RMSE MAE MPE MAPE MASE ACF1
#> Training set 0.0002433248 1.824809 1.416801 NaN Inf 0.5441187 0.03429500
#> Test set -0.9008203365 3.220461 2.414883 -Inf Inf 0.9274293 -0.07028675
#> Theil's U
#> Training set NA
#> Test set 0
accuracy(theft_arima_f, data_test)
#> ME RMSE MAE MPE MAPE MASE ACF1
#> Training set 0.00004705387 1.819897 1.415452 NaN Inf 0.5436005 -0.00008742695
#> Test set -0.85509476908 3.204828 2.403232 NaN Inf 0.9229548 -0.07195926338
#> Theil's U
#> Training set NA
#> Test set 0
From the RMSE and MAE values, we can conclude that ARIMA performs slightly better than ETS; a compact side-by-side comparison is sketched below.
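A minimal sketch that pulls the test-set metrics into one table (row and column names follow forecast::accuracy()'s output convention):

# Extract only the test-set RMSE and MAE for each model
rbind(
  ETS   = accuracy(theft_ets_f, data_test)["Test set", c("RMSE", "MAE")],
  ARIMA = accuracy(theft_arima_f, data_test)["Test set", c("RMSE", "MAE")]
)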
Assumption Check
Normality Test
- Normality test: Shapiro-Wilk test
  H0: Residuals are normally distributed
  H1: Residuals are not normally distributed
We want to accept H0 (p-value > 0.05).
shapiro.test(theft_arima_f$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: theft_arima_f$residuals
#> W = 0.99603, p-value = 0.00001274
hist(theft_arima_f$residuals, breaks = 10)
Autocorrelation Test
- Autocorrelation test: Box.test (Ljung-Box)
  H0: No autocorrelation in the forecast residuals
  H1: Autocorrelation detected in the forecast residuals
Box.test(theft_arima_f$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: theft_arima_f$residuals
#> X-squared = 0.000017083, df = 1, p-value = 0.9967
Conclusion: the residuals of our forecast model are not normally distributed, since the p-value is smaller than 0.05, so we reject the null hypothesis (accept the alternative hypothesis). I'll try building the model with Triple Exponential Smoothing (Holt-Winters) to see whether it can solve this problem and outperform ARIMA. Meanwhile, the autocorrelation test shows we accept the null hypothesis, since its p-value is greater than 0.05.
Building TS Object with Another Method
Holt-Winters
theft_hw <- HoltWinters(data_train)
theft_hw_f <- forecast(theft_hw, h = 24*7)
shapiro.test(theft_hw_f$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: theft_hw_f$residuals
#> W = 0.96638, p-value < 0.00000000000000022
Using Log Transformation
We change all 0 values into 1 to prevent -Inf values from appearing, which would cause errors when building the ARIMA model.
theft_pad_log <- theft_pad %>%
mutate(total_cases = if_else(total_cases == 0, 1, total_cases),
total_cases = log(total_cases)
  )
summary(theft_pad_log)
#> Date total_cases
#> Min. :2023-01-01 00:00:00 Min. :0.000
#> 1st Qu.:2023-01-25 23:45:00 1st Qu.:1.099
#> Median :2023-02-19 23:30:00 Median :1.609
#> Mean :2023-02-19 23:30:00 Mean :1.507
#> 3rd Qu.:2023-03-16 23:15:00 3rd Qu.:2.079
#> Max. :2023-04-10 23:00:00 Max. :3.091
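As a side note, base R's log1p(), which computes log(1 + x), handles zeros natively (mapping 0 to 0) and can be inverted with expm1(). A minimal sketch of this alternative:

# log1p() avoids -Inf without manually replacing zeros with ones
theft_pad_log1p <- theft_pad %>%
  mutate(total_cases = log1p(total_cases))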
theft_log_msts <- theft_pad_log$total_cases %>%
  msts(seasonal.periods = c(24*7, 24*7*4))
data_test_log <- tail(theft_log_msts, 24*7)
data_train_log <- head(theft_log_msts, -24*7)
theft_log_arima <- stlm(data_train_log, method = "arima")
theft_log_arima_f <- forecast(theft_log_arima, h = 24*7)
shapiro.test(exp(theft_log_arima_f$residuals))
#>
#> Shapiro-Wilk normality test
#>
#> data: exp(theft_log_arima_f$residuals)
#> W = 0.93956, p-value < 0.00000000000000022
Even after the log transformation, our residuals still do not appear normally distributed, so I decided to use the earlier STL + ARIMA model (without the log transformation) for the prediction of the next 7 days.
Forecasting for the Next 7 Days
Model Building and Plotting
theft_arima_final <- stlm(theft_msts, method = "arima")
theft_final_f <- forecast(theft_arima_final, h = 24*7)
theft_final_f$mean %>%
  autoplot()
Create New Data Frame
# Create a date vector from 2023-04-11 to 2023-04-17
df <- data.frame(Date = seq(from = ymd_hms("2023-04-11 00:00:00"), to = ymd_hms("2023-04-17 23:00:00"), by = "hour"))
df_forcasting <- df %>%
mutate(theft_predict = theft_final_f$mean,
theft_predict = ceiling(theft_predict),
Date = ymd_hms(Date))
rmarkdown::paged_table(df_forcasting)
Plotting Our Result
plot1 <- ggplot(data = df_forcasting, aes(x = Date, y = theft_predict)) +
geom_col(aes(fill = theft_predict))+
theme_minimal()
plot1
Our plot is too crowded, so it is not intuitive. I'll aggregate by hour to see which hours have the highest expected number of thefts.
df_agg <- df_forcasting %>%
mutate(hour = hour(Date)) %>%
select(hour, theft_predict) %>%
group_by(hour) %>%
  summarise(theft_predict = mean(theft_predict))
plot2 <- ggplot(data = df_agg, aes(x = hour, y = theft_predict)) +
geom_col(aes(fill = theft_predict))+
theme_minimal()
plot2
From the plot above, we can conclude that theft is most likely to happen in the early hours of the day, around midday, and in the afternoon.
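Since plotly is already loaded but unused so far, the aggregated bar chart can also be made interactive with a single call (hovering then shows the exact predicted count per hour):

# Wrap the existing ggplot object in an interactive plotly widget
ggplotly(plot2)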