Crime Rate Multiple Seasonality Forecasting
Crime is one of the most terrifying experiences a person can go through, and everyone tries to avoid anything that might bring crime into their life. Crime rates keep climbing, so taking anticipatory action is crucial. One thing we can do is predict when and how much crime will happen in the near future.
Knowing when and how much crime will happen helps police officers prepare and act before the worst case occurs, and it also helps prevent people from becoming victims. This time I'll try to predict the total number of crimes that will happen in Chicago in each hour of the next 7 days. The dataset contains every crime from January 1, 2023 until April 10, 2023, and I'll focus on theft cases since theft has the highest count of any crime type in Chicago. The dataset used was obtained from here: dataset
Import Libraries
library(dplyr)
library(lubridate)
library(padr)
library(forecast)
library(tidyr)
library(ggplot2)
library(plotly)
Data Preparation
crime <- read.csv("Crimes_-_2001_to_Present.csv")
rmarkdown::paged_table(crime)
Column description:
- ID: Unique identifier for the record.
- Case.Number: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
- Date: Date when the incident occurred. This is sometimes a best estimate.
- Block: The partially redacted address where the incident occurred, placing it on the same block as the actual address.
- IUCR: The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description.
- Primary.Type: The primary description of the IUCR code.
- Description: The secondary description of the IUCR code, a subcategory of the primary description.
- Location.Description: Description of the location where the incident occurred.
- Arrest: Indicates whether an arrest was made.
- Domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
- Beat: Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts.
- District: Indicates the police district where the incident occurred.
- Ward: The ward (City Council district) where the incident occurred.
- Community.Area: Indicates the community area where the incident occurred. Chicago has 77 community areas.
- FBI.Code: Indicates the crime classification as outlined in the FBI’s National Incident-Based Reporting System (NIBRS).
- X.Coordinate: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
- Y.Coordinate: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
- Year: Year the incident occurred.
- Updated.On: Date and time the record was last updated.
- Latitude: The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- Longitude: The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- Location: The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.
Data Preprocessing
table(crime$Primary.Type)
#>
#> ARSON ASSAULT
#> 112 5505
#> BATTERY BURGLARY
#> 10674 1921
#> CONCEALED CARRY LICENSE VIOLATION CRIMINAL DAMAGE
#> 53 7469
#> CRIMINAL SEXUAL ASSAULT CRIMINAL TRESPASS
#> 373 1323
#> DECEPTIVE PRACTICE GAMBLING
#> 4171 2
#> HOMICIDE HUMAN TRAFFICKING
#> 130 2
#> INTERFERENCE WITH PUBLIC OFFICER INTIMIDATION
#> 169 80
#> KIDNAPPING LIQUOR LAW VIOLATION
#> 30 52
#> MOTOR VEHICLE THEFT NARCOTICS
#> 8070 1486
#> OBSCENITY OFFENSE INVOLVING CHILDREN
#> 11 530
#> OTHER NARCOTIC VIOLATION OTHER OFFENSE
#> 2 4197
#> PROSTITUTION PUBLIC INDECENCY
#> 85 1
#> PUBLIC PEACE VIOLATION ROBBERY
#> 202 2409
#> SEX OFFENSE STALKING
#> 309 118
#> THEFT WEAPONS VIOLATION
#> 13551 2259
From the frequency table above, THEFT has the highest count of any crime type in Chicago, so we will predict the number of thefts in the next 7 days (a week).
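As a quick aside, here is a small sketch that confirms the top crime type programmatically rather than scanning the table by eye, using dplyr's count():

crime %>%
  count(Primary.Type, sort = TRUE) %>% # tally each type, largest first
  head(3)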
Before we build the TS object, we first need to do some data preparation, namely aggregating and summarising the data. This time we will aggregate the data into hourly frequency.
Data Aggregation
theft <- crime %>%
filter(Primary.Type == "THEFT") %>%
select(Date) %>%
mutate(Date = mdy_hms(Date),
Date = floor_date(Date, unit = "hour")) %>%
group_by(Date) %>%
  summarise(total_cases = n()) %>%
ungroup()
theft
#> # A tibble: 2,318 × 2
#> Date total_cases
#> <dttm> <int>
#> 1 2023-01-01 00:00:00 12
#> 2 2023-01-01 01:00:00 8
#> 3 2023-01-01 02:00:00 4
#> 4 2023-01-01 03:00:00 7
#> 5 2023-01-01 04:00:00 3
#> 6 2023-01-01 05:00:00 2
#> 7 2023-01-01 07:00:00 1
#> 8 2023-01-01 08:00:00 6
#> 9 2023-01-01 09:00:00 5
#> 10 2023-01-01 10:00:00 6
#> # ℹ 2,308 more rows
Now that we have the total THEFT cases for each hour, we next need to make sure there are no missing time intervals. Since our data has 2,318 rows in total, it is really difficult to check manually, so we'll use the pad() function to fill the missing intervals and replace the resulting total_cases missing values with zero. A quick sanity check below counts how many hourly slots are actually missing.
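A minimal sketch of that check, assuming the full range runs from 2023-01-01 00:00 to 2023-04-10 23:00 (100 days, i.e. 2,400 hours):

# Build the complete hourly sequence and compare its length to our data
expected_hours <- seq(from = ymd_hms("2023-01-01 00:00:00"),
                      to   = ymd_hms("2023-04-10 23:00:00"),
                      by   = "hour")
length(expected_hours) - nrow(theft) # hourly slots with no recorded theft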
Padding Data
theft_pad <- theft %>%
arrange(Date) %>%
pad(start_val = ymd_hms("2023-01-01 00:00:00"),
end_val = ymd_hms("2023-04-10 23:00:00")) %>%
  mutate(total_cases = replace_na(total_cases, 0))
colSums(is.na(theft_pad))
#> Date total_cases
#> 0 0
summary(theft_pad)
#> Date total_cases
#> Min. :2023-01-01 00:00:00 Min. : 0.000
#> 1st Qu.:2023-01-25 23:45:00 1st Qu.: 3.000
#> Median :2023-02-19 23:30:00 Median : 5.000
#> Mean :2023-02-19 23:30:00 Mean : 5.646
#> 3rd Qu.:2023-03-16 23:15:00 3rd Qu.: 8.000
#> Max. :2023-04-10 23:00:00 Max. :22.000
Build TS Object
First, I'll use the ts() command to build an initial ts object so we can explore whether our time series has trend and seasonal properties (single-seasonal or multi-seasonal).
#First model
theft_ts <- ts(theft_pad$total_cases,
               frequency = 24) # daily seasonality
Plotting TS Object
theft_ts %>%
decompose() %>%
  autoplot()
From the plot above, we can see that our trend component still has a seasonal pattern captured in it, so I'll try another method to build a TS object that can handle multi-seasonal data.
I'll try to use msts() with a few combinations of seasonal frequencies.
TS Object Using MSTS
#Second model
theft_pad$total_cases %>%
msts(seasonal.periods = c(24, 24*7)) %>%
mstl() %>%
  autoplot()
Our trend component still captures a seasonal pattern, so I'll try changing the seasonal periods.
#Third model
theft_pad$total_cases %>%
msts(seasonal.periods = c(24*7, 24*7*4)) %>%
mstl() %>%
  autoplot()
The third model has the best decomposition of the three, so we will use its seasonal periods for time series model building. One way to back this visual judgment with a number is sketched below.
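As a rough quantitative check (a sketch; remainder_sd is a hypothetical helper, and the "Remainder" column name follows mstl()'s output convention), we can compare the standard deviation of the remainder component across candidate period sets, where a lower value suggests a cleaner decomposition:

# Decompose with the given seasonal periods and measure what is left over
remainder_sd <- function(x, periods) {
  comps <- x %>% msts(seasonal.periods = periods) %>% mstl()
  sd(comps[, "Remainder"])
}
remainder_sd(theft_pad$total_cases, c(24, 24*7))     # second model
remainder_sd(theft_pad$total_cases, c(24*7, 24*7*4)) # third model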
theft_msts <- theft_pad$total_cases %>%
  msts(seasonal.periods = c(24*7, 24*7*4))
Cross Validation
I split the data into train and test sets for validation purposes. The test set covers one week of hourly observations, while the rest serves as training data.
data_test <- tail(theft_msts, 24*7)
data_train <- head(theft_msts, -24*7)
Model Building
For model building I'll use STL decomposition combined with ETS (exponential smoothing) and with ARIMA, via stlm(). I chose these because my data contain both trend and multiple seasonality.
theft_ets <- stlm(data_train, method = "ets")
theft_arima <- stlm(data_train, method = "arima")
Forecast and Model Evaluation
Forecast
I will predict the total crime cases for the next seven days and compare the predictions with the actual data (data_test).
theft_ets_f <- forecast(theft_ets, h = 24*7)
theft_arima_f <- forecast(theft_arima, h = 24*7)
Plotting Data
Plotting the forecasting results against the actual data:
data_test %>%
autoplot() +
autolayer(theft_ets_f$mean, series = "ETS") +
autolayer(theft_arima_f$mean, series = "ARIMA")
Accuracy Test
In this accuracy test, we focus only on RMSE and MAE. RMSE measures how far the predictions deviate from the actual values, penalizing large errors more heavily because the differences are squared. MAE also measures how far the predictions deviate from the actual values, but it averages the absolute differences between prediction and actual value.
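For reference, both metrics compare predictions $\hat{y}_t$ with actual values $y_t$ over $n$ test points:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|
$$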
accuracy(theft_ets_f, data_test)
#> ME RMSE MAE MPE MAPE MASE ACF1
#> Training set 0.0002433248 1.824809 1.416801 NaN Inf 0.5441187 0.03429500
#> Test set -0.9008203365 3.220461 2.414883 -Inf Inf 0.9274293 -0.07028675
#> Theil's U
#> Training set NA
#> Test set 0
accuracy(theft_arima_f, data_test)
#> ME RMSE MAE MPE MAPE MASE ACF1
#> Training set 0.00004705387 1.819897 1.415452 NaN Inf 0.5436005 -0.00008742695
#> Test set -0.85509476908 3.204828 2.403232 NaN Inf 0.9229548 -0.07195926338
#> Theil's U
#> Training set NA
#> Test set 0
From the RMSE and MAE values, we can conclude that ARIMA performs slightly better than ETS; a compact side-by-side comparison is sketched below.
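A minimal sketch that pulls the test-set metrics into one table (row and column names follow forecast::accuracy()'s output convention):

# Extract only the test-set RMSE and MAE for each model
rbind(
  ETS   = accuracy(theft_ets_f, data_test)["Test set", c("RMSE", "MAE")],
  ARIMA = accuracy(theft_arima_f, data_test)["Test set", c("RMSE", "MAE")]
)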
Assumption Check
Normality Test
- Normality test: Shapiro-Wilk test
  H0: Residuals are normally distributed
  H1: Residuals are not normally distributed
We want to accept H0 (p-value > 0.05).
shapiro.test(theft_arima_f$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: theft_arima_f$residuals
#> W = 0.99603, p-value = 0.00001274
hist(theft_arima_f$residuals, breaks = 10)
Autocorrelation Test
- Autocorrelation test: Box.test (Ljung-Box)
  H0: No autocorrelation in the forecast residuals
  H1: Autocorrelation detected in the forecast residuals
Box.test(theft_arima_f$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: theft_arima_f$residuals
#> X-squared = 0.000017083, df = 1, p-value = 0.9967
Conclusion: the residuals of our forecast model are not normally distributed, since the p-value is smaller than 0.05, so we reject the null hypothesis (accept the alternative hypothesis). I'll try building the model with Triple Exponential Smoothing (Holt-Winters) to see whether it can solve this problem and outperform ARIMA. Meanwhile, the autocorrelation test shows we accept the null hypothesis, since its p-value is greater than 0.05.
Building TS Object with Another Method
Holt-Winters
theft_hw <- HoltWinters(data_train)
theft_hw_f <- forecast(theft_hw, h = 24*7)
shapiro.test(theft_hw_f$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: theft_hw_f$residuals
#> W = 0.96638, p-value < 0.00000000000000022
Using Log Transformation
We change all 0 values into 1 to prevent -Inf values from appearing, which would cause errors when building the ARIMA model.
theft_pad_log <- theft_pad %>%
mutate(total_cases = if_else(total_cases == 0, 1, total_cases),
total_cases = log(total_cases)
  )
summary(theft_pad_log)
#> Date total_cases
#> Min. :2023-01-01 00:00:00 Min. :0.000
#> 1st Qu.:2023-01-25 23:45:00 1st Qu.:1.099
#> Median :2023-02-19 23:30:00 Median :1.609
#> Mean :2023-02-19 23:30:00 Mean :1.507
#> 3rd Qu.:2023-03-16 23:15:00 3rd Qu.:2.079
#> Max. :2023-04-10 23:00:00 Max. :3.091
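As a side note, base R's log1p(), which computes log(1 + x), handles zeros natively (mapping 0 to 0) and can be inverted with expm1(). A minimal sketch of this alternative:

# log1p() avoids -Inf without manually replacing zeros with ones
theft_pad_log1p <- theft_pad %>%
  mutate(total_cases = log1p(total_cases))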
theft_log_msts <- theft_pad_log$total_cases %>%
  msts(seasonal.periods = c(24*7, 24*7*4))
data_test_log <- tail(theft_log_msts, 24*7)
data_train_log <- head(theft_log_msts, -24*7)
theft_log_arima <- stlm(data_train_log, method = "arima")
theft_log_arima_f <- forecast(theft_log_arima, h = 24*7)
shapiro.test(exp(theft_log_arima_f$residuals))
#>
#> Shapiro-Wilk normality test
#>
#> data: exp(theft_log_arima_f$residuals)
#> W = 0.93956, p-value < 0.00000000000000022
Even after the log transformation, our residuals still do not appear normally distributed, so I decided to use the earlier STL + ARIMA model (without the log transformation) for the prediction of the next 7 days.
Forecasting for the Next 7 Days
Model Building and Plotting
theft_arima_final <- stlm(theft_msts, method = "arima")
theft_final_f <- forecast(theft_arima_final, h = 24*7)
theft_final_f$mean %>%
  autoplot()
Create New Data Frame
# Create a date vector from 2023-04-11 to 2023-04-17
df <- data.frame(Date = seq(from = ymd_hms("2023-04-11 00:00:00"), to = ymd_hms("2023-04-17 23:00:00"), by = "hour"))
df_forcasting <- df %>%
mutate(theft_predict = theft_final_f$mean,
theft_predict = ceiling(theft_predict),
Date = ymd_hms(Date))
rmarkdown::paged_table(df_forcasting)
Plotting Our Result
plot1 <- ggplot(data = df_forcasting, aes(x = Date, y = theft_predict)) +
geom_col(aes(fill = theft_predict))+
theme_minimal()
plot1
Our plot is too crowded, so it is not intuitive. I'll aggregate by hour to see which hours have the highest expected number of thefts.
df_agg <- df_forcasting %>%
mutate(hour = hour(Date)) %>%
select(hour, theft_predict) %>%
group_by(hour) %>%
  summarise(theft_predict = mean(theft_predict))
plot2 <- ggplot(data = df_agg, aes(x = hour, y = theft_predict)) +
geom_col(aes(fill = theft_predict))+
theme_minimal()
plot2
From the plot above, we can conclude that theft is most likely to happen in the early hours of the day, around midday, and in the afternoon.
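Since plotly is already loaded but unused so far, the aggregated bar chart can also be made interactive with a single call (hovering then shows the exact predicted count per hour):

# Wrap the existing ggplot object in an interactive plotly widget
ggplotly(plot2)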