This project was created to fulfill the Learning by Building (LBB) assignment on the TS - Time Series material in the Machine Learning Specialization Course. The dataset used in this project is the Chicago Crime Dataset. This dataset represents recorded occurrences of criminal activities (excluding murder cases, as individual victim data is available) within the City of Chicago. The data spans from 2001 to the current date, excluding the last seven days. The information is sourced from the CLEAR (Citizen Law Enforcement Analysis and Reporting) system of the Chicago Police Department.
With the Chicago crime dataset, our aim is to develop a prediction model using time series analysis to forecast crime frequency. Through this analysis, we will investigate if there are discernible patterns that indicate the likelihood of a crime occurring and identify specific time periods that demand increased vigilance and caution.
Here are the description of all the columns.
ID: Unique identifier for the record.Case.Number: The Chicago Police Department RD Number
(Records Division Number), which is unique to the
incident.Date: Date when the incident occurred. this is
sometimes a best estimate.Block: The partially redacted address where the
incident occurred, placing it on the same block as the actual
address.IUCR: The Illinois Unifrom Crime Reporting
code. This is directly linked to the Primary Type and Description.Primary.Type: The primary description of the IUCR
code.Description: The secondary description of the IUCR
code, a subcategory of the primary description.Location.Description: Description of the location where
the incident occurred.Arrest: Indicates whether an arrest was made.Domestic: Indicates whether the incident was
domestic-related as defined by the Illinois Domestic Violence Act.Beat: Indicates the beat where the incident occurred. A
beat is the smallest police geographic areaDistrict: Indicates the police district where the
incident occurred.Ward: The ward (City Council district) where
the incident occurred.Community.Area: Indicates the community area where the
incident occurred. Chicago has 77 community areas.FBI.Code: Indicates the crime classification as
outlined in the FBI’s National Incident-Based Reporting System
(NIBRS).X.Coordinate: The x coordinate of the location where
the incident occurred in State Plane Illinois East NAD 1983
projection.Y.Coordinate: The y coordinate of the location where
the incident occurred in State Plane Illinois East NAD 1983
projection.Year: Year the incident occurred.Updated.On: Date and time the record was last
updated.Latitude: The latitude of the location where the
incident occurred.Longitude: The longitude of the location where the
incident occurred.Location: The location where the incident
occurred.We will use only the necessary information or columns for the time
series analysis such as Date and
Primary.Type.
crimes <- crime %>%
select(c(Date, Primary.Type)) %>%
mutate(Date = mdy_hms(Date),
Date = floor_date(Date, unit = "hours"),
Primary.Type = as.factor(Primary.Type)) %>%
arrange(Date)To address the constraints of our analysis, we will focus on studying a specific crime case. To select this case, we generated a plot showcasing information on the five most prevalent crimes in Chicago. This plot helped us identify the crime category that will be further analyzed in-depth.
top_5_crimes <- crimes %>%
count(Primary.Type, sort = TRUE) %>%
head(5) %>%
mutate(Primary.Type = factor(Primary.Type, levels = rev(Primary.Type)))
ggplot(top_5_crimes, aes(x = n, y = reorder(Primary.Type, n), fill = n)) +
geom_col() +
scale_fill_gradient(low = "#f3ff82", high = "#1f4260", labels= scales::comma) +
scale_x_continuous(labels = scales::comma) +
labs(title = 'Top 5 Crime Case in Chicago',
x = 'Number of Crime',
y = 'Crime',
fill = 'Number of Crimes') +
theme(plot.title = element_text(hjust = 0.5))
Based on the plot presented earlier, it is evident that theft has the
highest occurrence of crimes compared to other types. Consequently, for
this analysis, our focus will be on conducting a time series prediction
analysis specifically for the theft case as the top 1
crime case.
For the purpose of this analysis, we will narrow down the data to focus solely on the number of theft crimes occurring between the beginning of 2018 and the end of 2022.
theft <- crimes %>%
filter(Primary.Type == 'THEFT') %>%
group_by(Date) %>%
summarise(Theft = n()) %>%
filter(Date > '2018-01-01' & Date < '2023-01-02')There are 6 data from the end of 2018, so we will use
slice to remove them.
Don’t forget to check the tail of data too.
There are 17 data from the beginning of 2023, so we will also use
slice to remove them.
Make sure once again for the range of the Date.
## [1] "2018-01-01 00:00:00 UTC" "2022-12-31 23:00:00 UTC"
Check the missing value due to the data that used for time series analysis must be treated with no missing values.
## Date Theft
## 0 0
Next step, the data should be ordered by time.
theft <- theft %>%
pad(start_val = ymd_hms("2018-01-01 00:00:00"), end_val = ymd_hms("2022-12-31 23:00:00")) %>%
replace(., is.na(.), 0)## [1] "2018-01-01 00:00:00 UTC" "2022-12-31 23:00:00 UTC"
theft_ts <- ts(data = theft$Theft,
start = min(theft$Date),
frequency = 24) # daily seasonality
autoplot(theft_ts)
From the plot, we got insight that the data is does not have a trend
(down trend or up trend). Further analysis uses the decomposed plot to
determine the properties of the data.
Considering the plot, it is evident that the pattern exhibits
fluctuations or oscillations, suggesting a multiseasonal time series
phenomenon. As a result, to gain deeper insights, we should reanalyze
this data using another seasonal frequency that may reveal additional
seasonal patterns beyond what we have currently identified.
### Daily, Weekly, Monthly, Annualy Seasonality
The latest time series object displayed the most optimal decomposition
among the three attempts seasonality, making it the preferred choice for
building the time series model. It is crucial to acknowledge that the
data follows an additive time series pattern, which will be valuable
information in the subsequent steps of model building.
The cross-validation scheme for time series should be splitted sequentially.
# Decompose MSTS Object
theft_multi_dec <- theft_ts %>%
mstl()
theft_multi_dec %>%
tail(24*7*4*12) %>%
autoplot()
Create the msts object as a dataframe.
df_theft_multi %>%
mutate(day = theft$Date) %>%
group_by(day) %>%
summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
head(24*7) %>%
ggplot(aes(x = day, y = seasonal)) +
geom_point(col = "#1f4260") +
geom_line(col = "#17b79c") +
theme_minimal()
Insight: Based on the daily plot with a time interval of 1 week, we
observe that the average theft crime exhibits a gradual increase from 10
am, reaching its highest point at 5 pm, which corresponds to the hours
after office hours. Subsequently, the figure displays fluctuations,
gradually decreasing after midnight. During this time, it would be
prudent to advise the public to exercise heightened caution and
vigilance.
df_theft_multi %>%
mutate(day = wday(theft$Date, label = TRUE)) %>%
group_by(day) %>%
summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
ggplot(aes(x = day, y = seasonal, group = 1)) +
geom_line(col = "#17b79c") +
geom_point(col = "#1f4260") +
theme_minimal()
As observed from the plot, there is a notable trend of increased theft
incidents on Wednesday and Friday.
df_theft_multi %>%
mutate(month = month(theft$Date, label = TRUE)) %>%
group_by(month) %>%
summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
ggplot(aes(x = month, y = seasonal, group = 1)) +
geom_line(col = "#17b79c") +
geom_point(col = "#1f4260") +
theme_minimal()
As observed from the plot, there is a notable trend of increased theft
incidents from April until August.
For the model building phase, I will conduct a comparison between Simple Exponential Smoothing (SES), ETS Holt-Winters and Seasonal ARIMA.
## ETS(A,N,N)
##
## Call:
## ets(y = theft_train, model = "ANN")
##
## Smoothing parameters:
## alpha = 0.5128
##
## Initial states:
## l = 11.1006
##
## sigma: 3.3013
##
## AIC AICc BIC
## 460351.2 460351.2 460376.7
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.0003627239 3.30125 2.542178 -Inf Inf 0.818224 -0.03970625
## Length Class Mode
## stl 250320 mstl numeric
## model 19 ets list
## modelfunction 1 -none- function
## lambda 0 -none- NULL
## x 35760 msts numeric
## series 1 -none- character
## m 1 -none- numeric
## fitted 35760 ts numeric
## residuals 35760 ts numeric
## Length Class Mode
## stl 250320 mstl numeric
## model 18 forecast_ARIMA list
## modelfunction 1 -none- function
## lambda 0 -none- NULL
## x 35760 msts numeric
## series 1 -none- character
## m 1 -none- numeric
## fitted 35760 ts numeric
## residuals 35760 ts numeric
theft_ses_f <- forecast(theft_ses, h = 24*7*4*12)
theft_ets_f <- forecast(theft_ets, h = 24*7*4*12)
theft_arima_f <- forecast(theft_arima, h = 24*7*4*12)Visualize the forecast
ses <- autoplot(theft_ses_f, series = "SES", fcol = "#abeb88") +
autolayer(theft_ts, series = "Actual", color = "black") +
labs(subtitle = "Number of Theft Case at Chicago from Jan 2018 - Dec 2022",
y = "Frequency") +
theme_minimal()
ets <- autoplot(theft_ets_f, series = "ETS", fcol = "#67d294") +
autolayer(theft_ts, series = "Actual", color = "black") +
labs(subtitle = "Number of Theft Case at Chicago from Jan 2018 - Dec 2022",
y = "Frequency") +
theme_minimal()
arima <- autoplot(theft_arima_f, series = "ARIMA", fcol = "#009a9c") +
autolayer(theft_ts, series = "Actual", color = "black") +
labs(subtitle = "Number of Theft Case at Chicago from Jan 2018 - Dec 2022",
y = "Frequency") +
theme_minimal()
grid.arrange(ses,ets,arima)
### MAE Value
data.frame(SES= MAE(theft_ses_f$mean, theft_test), ETS = MAE(theft_ets_f$mean, theft_test), ARIMA = MAE(theft_arima_f$mean, theft_test))Based on the analysis conducted, we have successfully forecasted theft frequency, and the ARIMA model emerged as the most effective choice because ARIMA model demonstrated slightly better accuracy in predicting theft incidents. Therefore, the ARIMA model is deemed as the preferable option for this particular time series forecasting task.
##
## Anderson-Darling normality test
##
## data: theft_arima_f$residuals
## A = 57.652, p-value < 2.2e-16
##
## Box-Ljung test
##
## data: theft_arima_f$residuals
## X-squared = 0.54887, df = 1, p-value = 0.4588
Based on the assumption check, there is no autocorrelation on our forecast residuals (p-value > 0.05).
Based on the analysis and the performance evaluation of the models, we have successfully forecasted theft frequency, and the ARIMA model has been identified as the best model. It demonstrated the best performance among the models considered for this particular time series forecasting task. If from the seasonality analysis side, and the theft crime is likely to increase around 10 am and there is a notable trend of increased theft incidents from April until August.