About Dataset
This dataset reflects reported incidents of crime (with the exception of murders, where data exists for each victim) that occurred in the City of Chicago from 2001 to the present (2022), minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
Using the Chicago crime dataset, we will build a model to forecast crime frequency with time series analysis, and examine whether there are patterns in when crimes are likely to happen and which times call for extra caution.
library(dplyr)
library(lubridate)
library(tidyr)
library(tidyverse)
library(padr)
library(forecast)
library(tseries)
library(gridExtra)
library(MLmetrics)
library(nortest)

crime <- read.csv("data/Crimes_2001.csv")
head(crime)

glimpse(crime)
## Rows: 7,514,910
## Columns: 22
## $ ï..ID <int> 10224738, 10224739, 11646166, 10224740, 10224741,~
## $ Case.Number <chr> "HY411648", "HY411615", "JC213529", "HY411595", "~
## $ Date <chr> "09/05/2015 01:30:00 PM", "09/04/2015 11:30:00 AM~
## $ Block <chr> "043XX S WOOD ST", "008XX N CENTRAL AVE", "082XX ~
## $ IUCR <chr> "0486", "0870", "0810", "2023", "0560", "0610", "~
## $ Primary.Type <chr> "BATTERY", "THEFT", "THEFT", "NARCOTICS", "ASSAUL~
## $ Description <chr> "DOMESTIC BATTERY SIMPLE", "POCKET-PICKING", "OVE~
## $ Location.Description <chr> "RESIDENCE", "CTA BUS", "RESIDENCE", "SIDEWALK", ~
## $ Arrest <chr> "false", "false", "false", "true", "false", "fals~
## $ Domestic <chr> "true", "false", "true", "false", "true", "false"~
## $ Beat <int> 924, 1511, 631, 1412, 1522, 614, 1434, 1034, 1222~
## $ District <int> 9, 15, 6, 14, 15, 6, 14, 10, 12, 8, 8, 16, 5, 2, ~
## $ Ward <int> 12, 29, 8, 35, 28, 21, 32, 25, 27, 15, 13, 45, 34~
## $ Community.Area <int> 61, 25, 44, 21, 25, 71, 24, 31, 27, 63, 65, 11, 4~
## $ FBI.Code <chr> "08B", "06", "06", "18", "08A", "05", "05", "06",~
## $ X.Coordinate <int> 1165074, 1138875, NA, 1152037, 1141706, 1168430, ~
## $ Y.Coordinate <int> 1875917, 1904869, NA, 1920384, 1900086, 1850165, ~
## $ Year <int> 2015, 2015, 2018, 2015, 2015, 2015, 2015, 2015, 2~
## $ Updated.On <chr> "02/10/2018 03:50:01 PM", "02/10/2018 03:50:01 PM~
## $ Latitude <dbl> 41.81512, 41.89508, NA, 41.93741, 41.88190, 41.74~
## $ Longitude <dbl> -87.67000, -87.76540, NA, -87.71665, -87.75512, -~
## $ Location <chr> "(41.815117282, -87.669999562)", "(41.895080471, ~
Column description: since we will conduct a time series analysis, we will use only the necessary information, namely Date and the kind of crime recorded in Primary.Type.
crimes <- crime %>%
select(c(Date, Primary.Type)) %>%
mutate(Primary.Type = as.factor(Primary.Type),
Date = mdy_hms(Date),
Date = floor_date(Date, unit = "hours")) %>% #takes a date-time object and rounds it down to hours unit
arrange(Date)

To limit the scope of the problem, we will choose one crime type to analyze. To determine it, we plot the top 5 crimes in Chicago.
crimes %>%
count(Primary.Type, sort = T) %>%
head(5) %>%
ggplot(aes(x = n, y = reorder(Primary.Type, n))) +
geom_col()+
labs(title = 'Top 5 Crime in Chicago',
x = 'Number of Crime',
y = 'Crime')

From the chart above, theft has the highest count among all crime types. Therefore, in this analysis we will build a time series forecast for theft cases.
range(crimes$Date)
## [1] "2001-01-01 00:00:00 UTC" "2022-03-27 23:00:00 UTC"
For this analysis, we will limit the data to the number of theft crimes from the beginning of 2018 to the end of 2021.
theft <- crimes %>%
filter(Primary.Type == 'THEFT') %>%
group_by(Date) %>%
summarise(Theft = n()) %>%
filter(Date > '2018-01-01' & Date < '2022-01-02')

head(theft, 5)

There are still 6 observations from 2017, so we will use slice() to remove them.
theft <- theft %>%
  slice(-(1:6))

tail(theft, 5)

There are still 17 observations from 2022, so we will use slice() to remove them as well.
length(theft$Date)
## [1] 33647
theft <- theft %>%
  slice(1:(n() - 17)) # drop the last 17 rows (the 2022 observations)

range(theft$Date)
## [1] "2018-01-01 00:00:00 UTC" "2021-12-31 23:00:00 UTC"
Now we have the correct range of data. We will fix the missing intervals in the pre-processing step.

The data provisions that must be met for a time series are, among others:
- no missing intervals
- no missing values
- data ordered by time
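As a quick sanity check (a minimal sketch, assuming the hourly data frame theft built above), the first and third provisions can be verified directly; the missing-value check follows in the next chunk.

# sketch: verify provisions on the hourly data frame `theft`
all(as.numeric(diff(theft$Date), units = "hours") == 1) # no missing intervals: gaps of exactly 1 hour
!is.unsorted(theft$Date) # data ordered by time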
colSums(is.na(theft))
##  Date Theft
##     0     0
theft <- theft %>%
  pad(start_val = ymd_hms("2018-01-01 00:00:00"), end_val = ymd_hms("2021-12-31 23:00:00")) %>%
  replace(., is.na(.), 0) # padded hours have no recorded theft, so fill with 0

## pad applied on the interval: hour
range(theft$Date)
## [1] "2018-01-01 00:00:00 UTC" "2021-12-31 23:00:00 UTC"
theft_ts <- ts(data = theft$Theft,
               start = min(theft$Date),
               frequency = 24) # hourly data with daily seasonality

In this section, we explore whether our time series object has trend and seasonal properties (single-seasonal or multi-seasonal).
# decompose ts object
theft_deco <- decompose(theft_ts)
autoplot(theft_deco)

Based on the plot, the trend still shows an up-and-down pattern, which indicates a multi-seasonal time series. Therefore, we re-analyze the data as a multi-seasonal object using additional seasonal frequencies.
a. First Trial

theft$Theft %>%
  msts(seasonal.periods = c(24, 24*7)) %>% # multiseasonal ts (daily, weekly)
  mstl() %>% # multiseasonal ts decomposition
  autoplot()

b. Second Trial

theft$Theft %>%
  msts(seasonal.periods = c(24, 24*7, 24*7*4)) %>% # multiseasonal ts (daily, weekly, monthly)
  mstl() %>% # multiseasonal ts decomposition
  autoplot()

c. Third Trial
theft$Theft %>%
  msts(seasonal.periods = c(24, 24*7, 24*7*4, 24*7*4*12)) %>% # multiseasonal ts (daily, weekly, monthly, yearly)
  mstl() %>% # multiseasonal ts decomposition
  autoplot()

The last ts object shows the best decomposition among the three trials and will therefore be used for model building. Additionally, note that the data forms an additive time series, which is valuable information for the model building later.
# assign final ts object
theft_msts <- theft$Theft %>%
msts(seasonal.periods = c(24, 24*7, 24*7*4, 24*7*4*12))
# check for stationarity
adf.test(theft_msts)

## Warning in adf.test(theft_msts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: theft_msts
## Dickey-Fuller = -15.133, Lag order = 32, p-value = 0.01
## alternative hypothesis: stationary
Based on the Augmented Dickey-Fuller test (adf.test) result, the p-value is below alpha (0.05), so the data is already stationary. Therefore, for model building using SARIMA, we do not need to difference the data first.
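If the test had not indicated stationarity, the data would need differencing first; below is a minimal sketch of that fallback (not required here), using forecast::ndiffs to estimate the differencing order.

d <- ndiffs(theft_msts) # estimated number of first differences; returns 0 for a stationary series
if (d > 0) {
  theft_diff <- diff(theft_msts, differences = d)
  adf.test(theft_diff) # re-check stationarity after differencing
}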
Having fixed our time frame, we continue with cross-validation. We split the data into two sets: a train set on which the model is fitted, and a test set that provides an unbiased evaluation of the model fit on the training data.

Our test set will consist of the last 24*7*4*12 = 8,064 hours of the series (roughly the final year of reported crime in our data frame).
length(theft$Date)
## [1] 35064
The cross-validation scheme for a time series should not be sampled randomly, but split sequentially.
theft_train <- theft_msts %>% head(length(theft_msts) - 24*7*4*12)
theft_test <- theft_msts %>% tail(24*7*4*12)

head(theft_msts)
## Multi-Seasonal Time Series:
## Start: 1 1
## Seasonal Periods: 24 168 672 8064
## Data:
## [1] 14 8 4 5 2 2
# Decompose MSTS Object
theft_multi_dec <- theft_msts %>%
mstl()
theft_multi_dec %>%
tail(24*7*4*12) %>%
  autoplot()

# Create a data frame based on the MSTS object
df_theft_multi <- as.data.frame(theft_multi_dec)

glimpse(df_theft_multi)
## Rows: 35,064
## Columns: 7
## $ Data <dbl> 14, 8, 4, 5, 2, 2, 2, 4, 3, 10, 2, 9, 14, 2, 3, 7, 9, 7, ~
## $ Trend <dbl> 7.036908, 7.037026, 7.037144, 7.037262, 7.037380, 7.03749~
## $ Seasonal24 <dbl> 0.89948115, -2.08040832, -3.88387650, -3.66446487, -4.308~
## $ Seasonal168 <dbl> -0.06862864, 0.51751197, 0.18372466, -0.86785271, 0.37094~
## $ Seasonal672 <dbl> -0.26438916, -0.04622184, 0.02135439, 0.81289349, 0.42051~
## $ Seasonal8064 <dbl> 1.1048611, 0.9843744, 0.6471740, 0.7171870, 0.8164987, 0.~
## $ Remainder <dbl> 5.291767724, 1.587717896, -0.005520492, 0.964975106, -2.3~
df_theft_multi %>%
mutate(day = theft$Date) %>%
group_by(day) %>%
summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
head(24*7) %>%
ggplot(aes(x = day, y = seasonal)) +
geom_point(col = "maroon") +
geom_line(col = "blue") +
  theme_minimal()

Based on the hourly plot over one week, the average theft count starts to rise at 10 am and reaches its peak at 5 pm (after office hours). The figure then fluctuates and starts to fall after midnight. During these hours the public can be advised to be more careful.
df_theft_multi %>%
mutate(day = wday(theft$Date, label = T)) %>%
group_by(day) %>%
summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
ggplot(aes(x = day, y = seasonal)) +
geom_col() +
  theme_minimal()

As we can see from the plot above, theft tends to increase on Wednesday, Thursday, Friday, and Saturday.
df_theft_multi %>%
mutate(month = month(theft$Date, label = T)) %>%
group_by(month) %>%
summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
ggplot(aes(x = month, y = seasonal)) +
geom_col() +
  theme_minimal()

Based on the monthly plot, theft tends to increase from June until October.
For model building, I will compare two time series models widely used in business and industry: ETS (Holt-Winters) and seasonal ARIMA. I use ETS Holt-Winters because the data contains both seasonality and trend, and I include seasonal ARIMA to check whether it gives better forecasting performance.
# ets Holt-Winters
theft_ets <- stlm(theft_train, method = "ets") # no log transformation for additive data
# ARIMA
theft_arima <- stlm(theft_train, method = "arima")

# forecast
theft_ets_f <- forecast(theft_ets, h = 24*7*4*12)
theft_arima_f <- forecast(theft_arima, h = 24*7*4*12)

# visualization
a <- autoplot(theft_ets_f, series = "ETS", fcol = "red") +
autolayer(theft_msts, series = "Actual", color = "black") +
labs(subtitle = "Number of Theft Case at Chicago, from Jan 2018 - Dec 2021",
y = "Theft Frequency") +
theme_minimal()
b <- autoplot(theft_arima_f, series = "ARIMA", fcol = "blue") +
autolayer(theft_msts, series = "Actual", color = "black") +
labs(subtitle = "Number of Theft Case at Chicago, from Jan 2018 - Dec 2021",
y = "Theft Frequency") +
theme_minimal()
grid.arrange(a, b)

# model evaluation: mean absolute error (MAE)
data.frame(ETS = MAE(theft_ets_f$mean, theft_test), ARIMA = MAE(theft_arima_f$mean, theft_test))

From the analysis above, we can conclude that we have successfully forecast theft frequency and found ARIMA to be the best model, with the lowest error (an MAE of about 2.318 theft cases per hour, although not very different from the ETS model).
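As a side note, MLmetrics provides other error metrics as well; the sketch below (an optional extension, not part of the original evaluation) adds RMSE to the comparison. MAPE is skipped because some hourly counts are zero, which makes percentage errors unreliable.

# sketch: compare both models on MAE and RMSE
data.frame(metric = c("MAE", "RMSE"),
           ETS = c(MAE(theft_ets_f$mean, theft_test), RMSE(theft_ets_f$mean, theft_test)),
           ARIMA = c(MAE(theft_arima_f$mean, theft_test), RMSE(theft_arima_f$mean, theft_test)))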
Normality: Anderson-Darling test
H0: residuals are normally distributed
H1: residuals are not normally distributed
ad.test(theft_arima_f$residuals)

##
## Anderson-Darling normality test
##
## data: theft_arima_f$residuals
## A = 35.599, p-value < 2.2e-16
hist(theft_arima_f$residuals, breaks = 20)

Autocorrelation: Ljung-Box test
H0: no autocorrelation in the forecast residuals
H1: autocorrelation exists in the forecast residuals

Box.test(theft_arima_f$residuals, type = "Ljung-Box")

##
## Box-Ljung test
##
## data: theft_arima_f$residuals
## X-squared = 0.080186, df = 1, p-value = 0.777
Based on the assumption checks, there is no autocorrelation in our forecast residuals (p-value > 0.05). However, the residuals are not normally distributed, so they do not cluster neatly around their mean, as seen in the histogram.
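For a single-call diagnostic, the forecast package also offers checkresiduals(), which plots the residuals, their ACF, and a histogram, and runs a Ljung-Box test; a minimal sketch on our residuals:

# sketch: combined residual diagnostics in one call
checkresiduals(theft_arima_f$residuals)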
In a time series, such errors can emerge from various unpredictable events and are, in practice, quite unavoidable. One strategy to cope with them is to analyze what kinds of unpredictable events occur, and occur frequently. This can be done through time series analysis using seasonality adjustment.
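One hedged illustration of that idea (assuming forecast's seasadj() supports the mstl object theft_multi_dec computed earlier): inspect the seasonally adjusted series, i.e. the data with the estimated seasonal components removed, to see which fluctuations remain unexplained.

# sketch: seasonally adjusted series = trend + remainder
theft_adj <- seasadj(theft_multi_dec)
autoplot(theft_adj) + labs(title = "Seasonally adjusted hourly theft counts")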
Based on the analysis and our model's performance, we have successfully forecast theft frequency and found ARIMA to be the best model, with the lowest error (an MAE of about 2.318 theft cases per hour, although not very different from the ETS model). The model can still be optimized in several ways, as it does not yet fulfill the normality assumption.
As for the seasonal analysis, it is safe to say that theft is likely to increase from around 10 am, reach its peak at 5 pm (after office hours), and remain elevated until around midnight. Theft is also more likely to happen from June until October.