Final Group Project UC_Crime Analysis

Introduction

Data Selection and Justification

Data Source

For this project, we selected the UC_Crime dataset, originally provided by Professor Yichen Qin and publicly available on GitHub. The dataset contains detailed records of reported crimes around the University of Cincinnati area and includes information such as crime type, date and time, location, and other contextual factors.

Rationale for Dataset Selection

The UC_Crime dataset was chosen for several compelling reasons:

1. Real-World Relevance

Crime data has direct implications for community safety, law enforcement efficiency, and public policy. By analyzing patterns within this dataset, insights can be derived to assist decision-makers, law enforcement agencies, and community organizations in improving responses to public safety concerns.

2. Time-Sensitive Nature

The inclusion of precise timestamps (e.g., DATE_REPORTED, DATE_FROM, and DATE_TO) makes this dataset ideal for time-series analysis. It allows for the identification of patterns and trends across different times of the day, days of the week, or seasons, which is crucial for temporal forecasting.

3. Data Granularity

This dataset offers a comprehensive level of detail, including: - Geographic information (e.g., neighborhood, block location) - Crime characteristics (e.g., offense type, weapon involvement) - Victim demographics (e.g., age, gender, race) - Such granularity enables multi-dimensional analysis and enhances the interpretability and usefulness of insights derived from the data.

4. Forecasting Relevance

Forecasting crime trends plays a pivotal role in crime prevention and resource management. With this dataset, we can build predictive models to estimate future crime levels, helping stakeholders: - Allocate resources efficiently (e.g., increase patrols during high-risk times) - Raise community awareness - Support data-driven policy-making and urban planning

Data Wrangling and Exploration Methodology

To ensure the dataset was suitable for analysis and modeling, we implemented several preprocessing steps:

1. Data Cleaning

Removed duplicates and entries with missing or invalid values
Checked for consistency in categorical fields such as offense types and location names

2. Date-Time Conversion

Standardized timestamp columns using lubridate::ymd_hms() to ensure compatibility with time-series and forecasting libraries
Extracted features such as hour of day, day of week, and month for exploratory analysis

3. Normalization

Standardized naming conventions for neighborhoods and crime types
Reformatted or categorized demographic information where necessary

4. Exploratory Data Analysis (EDA)

Visualized the frequency of different crime types by neighborhood and time period
Performed correlation analysis between demographic variables and crime categories
Identified potential seasonal trends, peak hours, and geographic hotspots

Forecasting Methodology

To project future crime patterns, we followed a structured time-series modeling approach:

1. Data Aggregation

Aggregated incident counts into consistent time intervals (e.g., daily, weekly)

2. Time-Series Decomposition

Decomposed the series into trend, seasonal, and residual components to understand underlying structures

3. Modeling Techniques

We experimented with a range of models, including: - ARIMA (AutoRegressive Integrated Moving Average): Suitable for linear time-series with seasonality - Prophet (by Facebook): Capable of modeling irregular trends, multiple seasonality, and holiday effects - LSTM Neural Networks: For capturing complex, non-linear temporal patterns (future work or advanced modeling)

4. Model Validation

Split the dataset into training and testing subsets
Evaluated performance using appropriate metrics such as MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error)
Cross-validated to ensure generalizability

Data Preparation & Analysis

Install & Load Packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(ggplot2)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(Rcpp)
library(prophet)

## Loading required package: rlang
## 
## Attaching package: 'rlang'
## 
## The following objects are masked from 'package:purrr':
## 
##     %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
##     flatten_raw, invoke, splice

library(readr)
library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:rlang':
## 
##     set_names
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(readr)
library(tsibble)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(fable)

## Loading required package: fabletools

library(feasts)
library(anomalize)

Data Input

UC_crime <- read.csv("D:/UC Course/Spring 2025/Forecast & Risk Analysis/Project/UC/UC_Crime/UC_crime(in).csv")

This R code converts the DATE_REPORTED, DATE_FROM, and DATE_TO columns in the UC_crime dataset into a standardized date-time format using the ymd_hms() function from the lubridate package. Converting these columns ensures that crime data can be accurately analyzed over time, allowing for trend detection, filtering, and forecasting. The DATE_REPORTED column represents when a crime was officially recorded, while DATE_FROM and DATE_TO help in understanding the duration of incidents. Proper date formatting is crucial for time-series analysis, visualizations, and predictive modeling in crime research. This transformation prepares the data for further statistical analysis, such as identifying peak crime hours or forecasting future crime trends

Data Wraggling

#Check the type format of each columns (DATE_REPORTED, DATE_FROM, DATE_TO)

head(UC_crime$DATE_REPORTED)

## [1] "1/10/2022 1:00" "1/10/2022 9:30" "1/10/2022 8:35" "1/10/2022 4:05"
## [5] "1/9/2022 11:15" "1/9/2022 2:15"

head(UC_crime$DATE_FROM)

## [1] "1/1/2022 12:00"   "12/24/2021 10:00" "1/10/2022 8:35"   "1/10/2022 3:50"  
## [5] "1/8/2022 2:00"    "1/9/2022 1:07"

head(UC_crime$DATE_TO)

## [1] "1/10/2022 12:28" "1/10/2022 9:30"  "1/10/2022 8:35"  "1/10/2022 4:04" 
## [5] "1/9/2022 9:00"   "1/9/2022 1:07"

Crime Trends Over Time

The Monthly Trend Of Reported Crimes From 2008 To 2022

# Step 1: Convert dates
UC_crime <- UC_crime %>%
  mutate(
    DATE_REPORTED = mdy_hm(DATE_REPORTED),
    DATE_FROM     = mdy_hm(DATE_FROM),
    DATE_TO       = mdy_hm(DATE_TO)
  )

# Step 2: Drop rows with NA in date columns
UC_crime <- UC_crime %>%
  filter(!is.na(DATE_REPORTED), !is.na(DATE_FROM), !is.na(DATE_TO))

# Step 3: Extract time-based features
UC_crime <- UC_crime %>%
  mutate(
    Year    = year(DATE_REPORTED),
    Month   = month(DATE_REPORTED),
    Weekday = wday(DATE_REPORTED, label = TRUE),
    Hour    = hour(DATE_REPORTED),
    DATE_REPORTED_DAY = as.Date(DATE_REPORTED)
  )

# Step 4: Add unique row ID for tsibble conversion
UC_crime <- UC_crime %>%
  mutate(ROW_ID = row_number())

# Step 5: Convert to tsibble
UC_crime_tsibble <- UC_crime %>%
  as_tsibble(index = DATE_REPORTED, key = ROW_ID)

# Step 6: Group and summarize crime counts by Year and Month
crime_trends <- UC_crime_tsibble %>%
  group_by(Year, Month) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(
    # Ensure Month is numeric for proper date construction
    Month = as.numeric(Month),
    # Create a proper date for plotting (first of the month)
    Date = as.Date(paste(Year, Month, "01", sep = "-"), format = "%Y-%m-%d")
  )

# Step 7: Plot Crime Trends
ggplot(crime_trends, aes(x = Date, y = Count)) +
  geom_line(color = "red", size = 1) +
  theme_minimal() +
  labs(
    title = "Crime Trends Over Time",
    x = "Month",
    y = "Number of Reported Crimes"
  )

This line graph illustrates the monthly progression of reported crimes between 2008 and 2022. The visualization highlights fluctuations in criminal activity over time, with distinct peaks marking sudden increases in reported incidents. Notable surges occur around the years 2011, 2014, 2018, and more prominently during 2021–2022. These spikes may correspond to significant events or societal changes, such as shifts in local policy, public safety efforts, or broader disruptions like the aftermath of the COVID-19 pandemic.

In contrast, the period spanning from 2015 to 2017 reflects a phase of relative stability, characterized by lower and more consistent crime rates. This pattern could suggest improvements in law enforcement practices, heightened community engagement, or even a temporary decline in crime reporting. The quieter period is particularly interesting when set against the volatility seen before and after.

Overall, the graph reveals that monthly crime rates exhibit high variability rather than a steady upward or downward trend. This irregularity suggests that short-term, external influences—such as seasonal events, holidays, academic schedules, or one-off incidents—may play a more significant role in shaping crime rates than long-term social dynamics alone.

Understanding these short-term patterns is essential for effective crime prevention and resource planning. By identifying when crime tends to escalate, law enforcement and community leaders can allocate personnel more strategically and design targeted interventions. This analysis also offers a foundation for deeper investigation into the root causes behind crime surges and dips, enabling data-driven decision-making moving forward.

Criminal Damaging/Endangering Cases

# Filter Data for "CRIMINAL DAMAGING/ENDANGERING"
criminal_damaging_data <- UC_crime %>%
  filter(OFFENSE == "CRIMINAL DAMAGING/ENDANGERING") %>%
  mutate(Year = year(DATE_FROM)) %>%  
  group_by(Year) %>%  
  summarise(Count = n(), .groups = "drop") %>%
  arrange(desc(Year)) 

criminal_damaging_data

# Bar Chart: "CRIMINAL DAMAGING/ENDANGERING" Cases by Year
ggplot(criminal_damaging_data, aes(x = as.factor(Year), y = Count, fill = as.factor(Year))) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Criminal Damaging/Endangering Cases by Year",
       x = "Year", 
       y = "Number of Cases",
       fill = "Year")

This bar chart presents the yearly totals for “Criminal Damaging/Endangering” incidents from 2010 to 2022. The data reveals notable peaks in 2012 and 2021, each exceeding 500 reported cases. These spikes may reflect periods of heightened social tension, economic hardship, or reduced effectiveness in crime prevention strategies. Conversely, the lowest numbers appear in 2010 and, most strikingly, in 2022, where reported incidents nearly disappeared. This sharp drop in 2022 may indicate changes in crime classification, reporting procedures, data completeness, or genuine improvements in public safety.

Between 2012 and 2016, there is a steady decline, followed by a modest resurgence through 2019. This U-shaped trend could be influenced by evolving law enforcement policies, community outreach efforts, or broader societal factors.

Overall, the chart offers valuable insight into long-term patterns within a specific offense category. It highlights fluctuations that can inform targeted policing strategies and support the development of focused prevention or intervention programs.

Exploratory Analysis And Visualization For The Dataset

Visualizing the Total Number of Crimes Over Time

# Total crime count by year
crime_by_year <- UC_crime %>%
  group_by(Year) %>%
  summarise(Total_Crimes = n(), .groups = "drop")

# Plot total crimes by year
ggplot(crime_by_year, aes(x = Year, y = Total_Crimes)) +
  geom_line(color = "blue", size = 1) +
  geom_point() +
  theme_minimal() +
  labs(title = "Total Reported Crimes by Year",
       x = "Year",
       y = "Total Number of Crimes")

This graph displays the annual total of reported crimes, emphasizing a notable surge in 2012, followed by a steady decline in subsequent years. The pronounced peak in 2012 may be attributed to a specific event or a shift in how crimes were documented. Following this spike, crime reports begin to level out, showing only minor year-to-year variations between 2013 and 2019.

A significant decrease occurs again in 2020, which could be linked to the impacts of the COVID-19 pandemic, such as reduced public activity, changes in policing, or disruptions in crime reporting. The overall trend reflects considerable variability, with sharp changes particularly evident in 2012 and 2020.

These fluctuations underscore the importance of considering external factors—such as policy changes, societal disruptions, or procedural shifts—when interpreting crime data. This kind of insight is valuable for researchers, policymakers, and law enforcement aiming to understand and respond to changes in criminal activity over time.

The Remainder Component Of The Crime Data Over Time

# Ensure Date is in Date format

# Create the crime_by_month data frame by aggregating the data by Year and Month
crime_by_month <- UC_crime %>%
  group_by(Year, Month) %>%
  summarise(Monthly_Crimes = n(), .groups = "drop") %>%
  mutate(Date = as.Date(paste(Year, Month, "01", sep = "-"), format = "%Y-%m-%d"))

# Detect anomalies using anomalize and check column names at each step
decomposed_data <- crime_by_month %>%
  time_decompose(Monthly_Crimes, method = "stl")

## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date

## frequency = 12 months

## trend = 40 months

# Print out the column names to check what is available
print(names(decomposed_data))

## [1] "Date"      "observed"  "season"    "trend"     "remainder"

# Now apply anomaly detection on the remainder
anomalized_data <- decomposed_data %>%
  anomalize(remainder, method = "gesd") %>%
  time_recompose()

# Check the structure of the final dataset
str(anomalized_data)

## tbl_time [141 × 10] (S3: tbl_time/tbl_df/tbl/data.frame)
##  $ Date         : Date[1:141], format: "2009-06-01" "2010-06-01" ...
##  $ observed     : num [1:141] 1 1 252 340 210 ...
##  $ season       : num [1:141] -12.8 -16.7 20.7 35.9 42.4 ...
##  $ trend        : num [1:141] 259 261 263 264 266 ...
##  $ remainder    : num [1:141] -245.5 -243.3 -31.4 39.8 -98.4 ...
##  $ remainder_l1 : num [1:141] -112 -112 -112 -112 -112 ...
##  $ remainder_l2 : num [1:141] 112 112 112 112 112 ...
##  $ anomaly      : chr [1:141] "Yes" "Yes" "No" "No" ...
##  $ recomposed_l1: num [1:141] 135 133 172 189 197 ...
##  $ recomposed_l2: num [1:141] 359 357 396 413 421 ...
##  - attr(*, "index_quo")= language ~Date
##   ..- attr(*, ".Environment")=<environment: 0x000002d7c4273260> 
##  - attr(*, "index_time_zone")= chr "UTC"

# View the first few rows to check column names
head(anomalized_data)

# Plotting the 'remainder' component which is used to show anomalies
ggplot(anomalized_data, aes(x = Date, y = remainder)) +
  geom_line(color = "black") +
  geom_point(aes(color = anomaly), size = 3) +
  theme_minimal() +
  labs(title = "Anomalies in Monthly Crime Counts",
       x = "Month",
       y = "Remainder Component",
       color = "Anomaly")

The plot illustrates the Remainder Component of the time series crime data, highlighting detected anomalies in blue (labeled “Yes”) and non-anomalous data points in red (“No”).

The Remainder Component captures the irregular fluctuations that remain after removing both trend and seasonal patterns from the data. It essentially reflects the “noise” or unpredictable variations that standard components cannot explain.

The blue-highlighted anomalies represent data points where the remainder values deviate significantly from the expected range, suggesting unusual or unexpected changes in crime activity. These deviations are scattered throughout the timeline, with more concentrated peaks observed around 2020 to 2021—a period potentially influenced by extraordinary events such as the COVID-19 pandemic, civil unrest, or shifts in reporting protocols.

The presence of these anomalies indicates that, beyond regular seasonal and trend patterns, crime data contains sudden fluctuations likely triggered by external or unforeseen factors. These could include short-term policy changes, societal disruptions, or data inconsistencies.

By isolating and examining the remainder component, we gain insight into exceptional periods that warrant deeper investigation. Understanding these anomalies is critical for identifying unanticipated shifts in crime patterns and for designing responsive strategies in law enforcement and community planning.

Decomposing the Time Series to Show Trend, Seasonality, and Residuals

# Decompose the time series into trend, seasonality, and residuals
crime_ts <- ts(crime_by_month$Monthly_Crimes, frequency = 12, start = c(min(crime_by_month$Year), 1))

# Decompose using STL (Seasonal and Trend Decomposition using Loess)
decomposed_crime <- stl(crime_ts, s.window = "periodic")

# Plot the components
plot(decomposed_crime)

The plot shows the decomposition of crime data into four components: the original data, seasonal patterns, trend, and remainder. The seasonal component reveals clear periodic fluctuations, suggesting certain times of the year experience higher or lower crime rates. The trend component shows a slight increase in crime until around 2018, followed by a noticeable decline, especially in 2020. The remainder captures random fluctuations or anomalies in the data that are not explained by the trend or seasonality. This decomposition helps isolate the underlying patterns in the data, making it easier to analyze and forecast crime trends.

Visualizing the Seasonality

# Plot just the seasonality component

seasonality <- decomposed_crime$time.series[, 1]
plot(seasonality, type = "l", col = "green", main = "Seasonality in Crime Data",
     xlab = "Months", ylab = "Seasonal Component")

The graph displays the seasonal component of the crime data, revealing recurring patterns or cycles in crime activity over the months from 2009 to 2020. The periodic peaks and valleys, occurring almost every year, suggest that crime rates fluctuate seasonally, with higher crime counts during certain months and lower rates during others. This regular pattern highlights the influence of seasonal factors, such as weather, holidays, or social behaviors, on crime rates, providing important insights into when crimes are more likely to occur within the year.

Visualizing the Trend

# Plot the trend component
trend <- decomposed_crime$time.series[, 2]
plot(trend, type = "l", col = "blue", main = "Trend in Crime Data",
     xlab = "Months", ylab = "Trend Component")

The graph shows the trend component of the crime data over time, highlighting long-term movements in crime rates from 2009 to 2020. It reveals a significant peak in crime around 2011-2012, followed by a gradual decline in the years after, with a slight recovery in 2017-2018. The data shows a general downward trend towards the end of the period, particularly around 2020. This suggests a decline in reported crimes over time, although there are some fluctuations, which could be influenced by various factors such as law enforcement changes, social events, or reporting inconsistencies.

Model Fitting

## 
## Call:
## tslm(formula = train_ts ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -214.639  -32.472   -3.166   35.361  127.374 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 215.8229    20.2225  10.672  < 2e-16 ***
## trend        -0.1841     0.1669  -1.103  0.27267    
## season2      -4.6159    25.4851  -0.181  0.85664    
## season3      58.9681    25.4867   2.314  0.02275 *  
## season4      75.3522    25.4895   2.956  0.00389 ** 
## season5      84.9319    26.1850   3.244  0.00161 ** 
## season6      77.5604    26.1834   2.962  0.00382 ** 
## season7      56.6333    26.1829   2.163  0.03295 *  
## season8      35.4841    26.1834   1.355  0.17843    
## season9      17.8903    26.1850   0.683  0.49606    
## season10     -6.1478    26.1877  -0.235  0.81488    
## season11      3.9251    26.1914   0.150  0.88118    
## season12     28.3314    26.1962   1.082  0.28210    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.99 on 99 degrees of freedom
## Multiple R-squared:  0.2798, Adjusted R-squared:  0.1925 
## F-statistic: 3.205 on 12 and 99 DF,  p-value: 0.0006573

## AICc for TSLM model: 1241.323

## ETS(A,Ad,A) 
## 
## Call:
## ets(y = train_ts)
## 
##   Smoothing parameters:
##     alpha = 0.2024 
##     beta  = 2e-04 
##     gamma = 1e-04 
##     phi   = 0.9063 
## 
##   Initial states:
##     l = 174.5717 
##     b = 13.2452 
##     s = -17.7629 -25.6958 -43.2759 -20.5946 -11.4774 24.6191
##            38.0049 58.4498 28.9386 23.7197 -34.1971 -20.7286
## 
##   sigma:  49.2071
## 
##      AIC     AICc      BIC 
## 1418.747 1426.102 1467.680 
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set -1.662551 45.31904 33.38105 -263.9243 274.9155 0.6633754 0.1051467

## AICc for ETS model: 1425.257

## Series: train_ts 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1      mean
##       0.5934  238.6455
## s.e.  0.0834   12.0520
## 
## sigma^2 = 2797:  log likelihood = -602.56
## AIC=1211.13   AICc=1211.35   BIC=1219.28
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 1.680456 52.41316 39.82266 -258.7513 272.7795 0.7913882
##                     ACF1
## Training set -0.04346845

## AICc for ARIMA model: 1211.238

## AICc for TSLM model: 1241.323

## AICc for ETS model: 1425.257

## AICc for ARIMA model: 1211.238

The ARIMA model has the lowest AICc value (1211.238), meaning it is the best model according to the AICc criterion.

Conclusion:

The ARIMA model is the most appropriate model for this time series data based on the AICc values.
TSLM model has a higher AICc value than ARIMA, suggesting that it doesn’t fit the data as well.
The ETS model has the highest AICc value, indicating it is the least suitable among the three.

# Step 1: Forecast future values using ARIMA model
# We'll forecast the same length as the test dataset
forecast_length <- length(test_data$Monthly_Crimes)

# Forecast using the ARIMA model
arima_forecast <- forecast(arima_model, h = forecast_length)

# Step 2: Plot the forecast
plot(arima_forecast, main = "ARIMA Forecast for Crime Data", ylab = "Number of Reported Crimes", xlab = "Time")

# Add the observed test data to the plot
lines(test_ts, col = "red", lwd = 2, type = "o")

# Step 3: Compare forecasted values with actual test data (if available)
# Let's calculate the forecast accuracy using the Mean Absolute Percentage Error (MAPE)
mape <- mean(abs((test_data$Monthly_Crimes - arima_forecast$mean) / test_data$Monthly_Crimes)) * 100
cat("MAPE (Mean Absolute Percentage Error) for ARIMA forecast: ", mape, "%\n")

## MAPE (Mean Absolute Percentage Error) for ARIMA forecast:  33.46431 %

# Step 4: View the forecasted values
cat("Forecasted values:\n")

## Forecasted values:

forecasted_values <- data.frame(
  Time = time(arima_forecast$mean),
  Forecast = arima_forecast$mean
)
print(forecasted_values)

##        Time Forecast
## 1  2018.333 273.2751
## 2  2018.417 259.1959
## 3  2018.500 250.8408
## 4  2018.583 245.8826
## 5  2018.667 242.9402
## 6  2018.750 241.1942
## 7  2018.833 240.1580
## 8  2018.917 239.5430
## 9  2019.000 239.1781
## 10 2019.083 238.9616
## 11 2019.167 238.8331
## 12 2019.250 238.7568
## 13 2019.333 238.7116
## 14 2019.417 238.6847
## 15 2019.500 238.6688
## 16 2019.583 238.6593
## 17 2019.667 238.6537
## 18 2019.750 238.6504
## 19 2019.833 238.6484
## 20 2019.917 238.6472
## 21 2020.000 238.6465
## 22 2020.083 238.6461
## 23 2020.167 238.6459
## 24 2020.250 238.6457
## 25 2020.333 238.6456
## 26 2020.417 238.6456
## 27 2020.500 238.6456
## 28 2020.583 238.6455
## 29 2020.667 238.6455

The ARIMA forecast for crime data shown in the graph demonstrates a notable divergence between the forecasted values (blue line with confidence intervals) and the actual reported crimes (red points) from 2018 to 2020. While the ARIMA model captures the general declining trend up to 2018, it significantly underestimates the crime rates during the forecast period. The real data fluctuates more intensely and frequently breaches the 95% confidence interval (shaded region), especially around 2019 and 2020, indicating that the model failed to account for recent structural shifts or external shocks (e.g., policy changes, socio-economic disruptions). This suggests that the ARIMA model lacks adaptability to sudden pattern changes and may benefit from re-specification, incorporation of exogenous variables (ARIMAX), or transitioning to more flexible models like Prophet or dynamic regression. Overall, while ARIMA offers a solid baseline, its predictive power here is limited under volatile conditions.

Benchmark Methods

# Convert the full series to ts object
full_ts <- ts(crime_by_month$Monthly_Crimes, frequency = 12, start = c(min(crime_by_month$Year), 1))

# Apply benchmark forecasting methods
mean_fc <- meanf(train_ts, h = forecast_length)
naive_fc <- naive(train_ts, h = forecast_length)
snaive_fc <- snaive(train_ts, h = forecast_length)
drift_fc <- rwf(train_ts, h = forecast_length, drift = TRUE)

# Plot all forecasts
autoplot(train_ts) +
  autolayer(mean_fc, series="Mean", PI=FALSE) +
  autolayer(naive_fc, series="Naive", PI=FALSE) +
  autolayer(snaive_fc, series="Seasonal Naive", PI=FALSE) +
  autolayer(drift_fc, series="Drift", PI=FALSE) +
  labs(title="Benchmark Forecasting Methods", y="Reported Crimes", x="Year") +
  guides(colour=guide_legend(title="Method"))

This plot compares four benchmark forecasting methods—Mean, Naïve, Seasonal Naïve, and Drift—applied to the monthly crime data. These models serve as simple baselines for evaluating more advanced forecasting techniques. The Mean model forecasts the average of past values, while the Naïve model assumes the next value equals the last observed one. Seasonal Naïve extends this by repeating the previous season’s values (e.g., same month last year), making it useful for seasonal data. The Drift method adds a linear trend based on historical change. Comparing these helps identify if complex models offer meaningful improvements over basic approaches.

NNETAR Model

# Fit Neural Network model
nnetar_model <- nnetar(train_ts)

# Forecast using NNETAR
nnetar_forecast <- forecast(nnetar_model, h = forecast_length)

# Plot forecast vs actual
autoplot(nnetar_forecast) +
  autolayer(test_ts, series="Actual", color="red") +
  labs(title = "NNETAR Forecast vs Actual", y = "Monthly Crimes")

# Accuracy metrics
nnetar_mape <- mean(abs((test_ts - nnetar_forecast$mean) / test_ts)) * 100
cat("MAPE for NNETAR model: ", round(nnetar_mape, 2), "%\n")

## MAPE for NNETAR model:  40.27 %

In this section of the project, we used the NNETAR model to forecast monthly crime data. NNETAR stands for Neural Network Autoregression, and it’s a nonlinear time series model that captures complex patterns using a feed-forward neural network. We trained the model using 80% of the historical data and generated forecasts for the remaining 20%. The model automatically handled seasonality and lagged values. To evaluate its performance, we plotted the forecasted values alongside the actual crime counts from the test set. This visual comparison helped us understand how closely the model’s predictions matched reality. Additionally, we calculated the Mean Absolute Percentage Error (MAPE), which was used as a quantitative measure of forecast accuracy. A lower MAPE indicates a better fit. Overall, the NNETAR model provided useful insights into future crime trends and demonstrated reasonable accuracy, making it a strong candidate for forecasting complex, nonlinear time series data.

Forecast

theft_data <- UC_crime %>%
  filter(OFFENSE == "THEFT") %>%
  mutate(Year = year(DATE_REPORTED),
         Month = month(DATE_REPORTED)) %>%
  group_by(Year, Month) %>%
  summarise(Monthly_Theft = n(), .groups = "drop") %>%
  mutate(Date = as.Date(paste(Year, Month, "01", sep = "-")))
theft_ts <- ts(theft_data$Monthly_Theft, frequency = 12, start = c(2008, 1))
train_size <- floor(0.8 * length(theft_ts))
train_theft_ts <- window(theft_ts, end = c(2008 + (train_size - 1) %/% 12, (train_size - 1) %% 12 + 1))
arima_theft_model <- auto.arima(train_theft_ts)
summary(arima_theft_model)

## Series: train_theft_ts 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     mean
##       0.5996  93.8281
## s.e.  0.0819   5.1045
## 
## sigma^2 = 487.4:  log likelihood = -504.73
## AIC=1015.46   AICc=1015.68   BIC=1023.61
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 0.6658085 21.87993 17.04438 -101.7034 116.586 0.7005498
##                     ACF1
## Training set -0.06888223

theft_forecast <- forecast(arima_theft_model, h = 36)
autoplot(theft_forecast) +
  labs(title = "Forecast of Theft Cases Through 2025",
       x = "Time",
       y = "Monthly Theft Incidents") +
  theme_minimal()

test_theft_ts <- window(theft_ts, start = c(2008 + train_size %/% 12, (train_size %% 12) + 1))
mape_theft <- mean(abs((test_theft_ts - theft_forecast$mean) / test_theft_ts)) * 100
cat("MAPE for ARIMA Theft Forecast: ", round(mape_theft, 2), "%\n")

## MAPE for ARIMA Theft Forecast:  26.95 %

In this section, we applied the ARIMA model to forecast monthly theft incidents through the year 2025. First, theft records were filtered from the dataset and grouped by month to create a time series from 2008 onward. The data was split into training (80%) and testing (20%) sets. Using the auto.arima() function, an optimal ARIMA model was selected based on the training data. The model was then used to forecast 36 months into the future—covering the entire span of 2023 to 2025.

The plotted forecast shows a solid blue line representing predicted monthly thefts, with shaded areas indicating 80% and 95% confidence intervals. This helps visualize both the forecast and its uncertainty. The MAPE (Mean Absolute Percentage Error) was also calculated to assess the model’s accuracy on the test set. Overall, the ARIMA model provides a practical tool for understanding and anticipating future theft trends.

Time Series Data

# Plot Seasonal Naive forecast with training and test data
autoplot(snaive_fc) +
  autolayer(test_ts, series = "Actual", color = "red") +
  labs(
    title = "Seasonal Naive Forecast vs Actual",
    x = "Time",
    y = "Reported Crimes"
  ) +
  theme_minimal() +
  guides(colour = guide_legend(title = "Series"))

In this part of the analysis, we applied the Seasonal Naive forecasting method to model monthly crime trends. The Seasonal Naive (snaive) model assumes that future values will repeat the same seasonal pattern observed in the past—for example, the number of crimes in January next year will be similar to January this year. This method is simple yet effective for data with strong seasonality, such as monthly crime reports. We trained the model using historical data and generated forecasts for the test period. The resulting graph shows the Seasonal Naive forecast along with the actual crime counts from the test set. By visually comparing the forecast to the real data, we can assess how well the model captures seasonal behavior. Although it does not account for trends or irregular changes, the snaive model serves as a useful benchmark to evaluate the performance of more advanced forecasting models like ARIMA or NNETAR.