Table of contents

  1. Business Objectives
  2. Data Preparation
  3. Data Transformation
  4. Data Understanding and Variables
  5. Type of Model
  6. Modeling Process
  7. Evaluation
  8. Forecasting Crimes for 31 May 2019
  9. Conclusion

Business Objectives

The objective of this study is to build a predictive model of crime occurrences in order to help the Houston Police Department (HPD) manage its available human resources effectively.

Specifically, we aim to predict the number of offenses for a given beat/day based on historical crime data amassed by HPD. This insight will aid HPD in determining the manpower required in each beat to handle the crime levels forecast for the next day.

Scope

The dataset covers the period 1 January 2019 to 30 May 2019; accordingly, the purpose of the study is to predict the number of offenses in Houston by beat for 31 May 2019.

Assumption: Although HPD publishes its data to the public at a monthly interval, it is assumed that HPD can access the data in a daily batch, so that a forecast can be made for the next day.

Data Preparation

HPD maintains monthly crime data based on police incident reports across Houston. The data is published on the HPD website, with each month stored as a separate Excel file. We connect to the HPD web source to extract the data for every month and then append all the datasets into a single table.

The code below displays the framework of this process. R pre-processing routines read the Excel files directly from the web source to retrieve the required information.

library(readxl)  # read_excel()
library(dplyr)   # select() and the %>% pipe

# Each monthly file is read with the first 10 rows skipped (header metadata);
# unnamed filler columns ("...2", "...4", ...) are then dropped.
url1 <- "01-2019.NIBRS_Public_Data_Group_A&B.xlsx"
Jan <- as.data.frame(read_excel(url1, sheet = 1, col_names = TRUE, skip = 10))
Jan <- Jan%>%dplyr::select(-"...16", -"...11", -"...13", -"...14", -"...2", -"...4", -"...6")


url2 <- "02-2019.NIBRS_Public_Data_Group_A&B.xlsx"
feb <- as.data.frame(read_excel(url2, sheet = 1, col_names = TRUE, skip = 10))
feb <- feb%>%dplyr::select(-"...16", -"...11", -"...13", -"...14", -"...2", -"...4", -"...6")


url3 <- "03-2019.NIBRS_Public_Data_Group_A&B.xlsx"
March <- as.data.frame(read_excel(url3, sheet = 1, col_names = TRUE, skip = 10))
March <- March%>%dplyr::select(-"...16", -"...11", -"...13", -"...14", -"...2", -"...4", -"...6")


# The April and May files have a slightly different layout (an extra ZIP
# column), so the dropped columns differ.
url4 <- "4-2019.NIBRS_Public_Data_Group_A&B.XLSX"
April <- as.data.frame(read_excel(url4, sheet = 1, col_names = TRUE, skip = 10))
April <- April%>%dplyr::select(-"...17", -"...11", -"...13", -"...14", -"ZIP", -"...2", -"...5", -"...6")


url5 <- "5-2019.NIBRS_Public_Data_Group_A&B.xlsx"
May <- as.data.frame(read_excel(url5, sheet = 1, col_names = TRUE, skip = 10))
May <- May%>%dplyr::select(-"...17", -"...11", -"...13", -"...14", -"ZIP", -"...2", -"...5", -"...6")


#### Appending Data into a DataFrame

library(tidyr)  # separate()

dataa <- rbind(Jan, feb, March, April, May)
# Split "Occurrence Date" into numeric Year / Month / Day columns
dataa1 <- dataa%>%separate(col = "Occurrence Date",
                           into=c("Year","Month", "Day"),
                           convert = TRUE)

Data Transformation

The dataset covers the period of January to May 2019 with variables available as follows:

• Types of incidents (NIBRS description)

• Types of Premises

• Date: Month, Day, Day of Week

• Hour of day

• Police beats

• Address (Block range + street name + street type + suffix)

• Zip codes (only available in the April and May 2019 files)

Remove Beats with NA values

The summary below shows that some crimes are assigned to unknown beats (NA). However, the crimes recorded against unknown beats represent only 0.1% of all observations.

### Cleaning Data

library(lubridate)    # ymd(), week()
library(formattable)  # formattable()

# Rebuild a proper Date column and derive the weekday and week number
dataa1$date <- paste(dataa1$Year, dataa1$Month, dataa1$Day, sep="-") %>% ymd() %>% as.Date()
datar <- as.data.frame(dataa1%>%
           mutate(Weekday = weekdays(date), week = lubridate::week(date)))%>%
         separate(Premise, c("Premise", "Location2"), sep= ",", extra = "merge", fill = "right")
colnames(datar) <- c("Year", "Month", "Day", "Hour", "NIBRS", "Count", "Beat",
                     "Premise", "Location2", "BlockRange", "Street_Name",
                     "Street_Type", "Suffix", "date", "Weekday", "Week")
datar <- datar%>%dplyr::select(date, Weekday, Week, Beat, Count)

# Share of offenses recorded against an unknown (NA) beat
TotalCount <- datar%>%summarise(Count = sum(Count))
Na <- as.data.frame(datar%>%group_by(Beat)%>%summarise(Count = sum(Count))%>%
        arrange(Beat)%>%filter(is.na(Beat)))%>%
      mutate(Total_Count= TotalCount$Count)%>%
      mutate(Perc.Na = round(Count/Total_Count*100,1))
formattable(head(Na))
Beat Count Total_Count Perc.Na
NA 129 104884 0.1

Given that this portion of unknown beats is insignificant, all observations with Beat recorded as NA were removed from the dataset. The data was then aggregated as in the table below, with the crime count identified as the target variable.

### Cleaning Data
datarn <- na.omit(datar)
datar1 <- as.data.frame(datarn%>%group_by(date, Weekday, Week, Beat)%>%summarise(Count = sum(Count))%>%arrange(Beat,date))
formattable(head(datar1))
date Weekday Week Beat Count
2019-01-01 Tuesday 1 10H10 5
2019-01-02 Wednesday 1 10H10 1
2019-01-03 Thursday 1 10H10 3
2019-01-05 Saturday 1 10H10 3
2019-01-07 Monday 1 10H10 5
2019-01-08 Tuesday 2 10H10 2

Input 0 for days with no offense counts

Additionally, we note that the dataset does not contain offense counts for every beat on every day - there are beat/day combinations with no criminal activity recorded during the period January 2019 to May 2019. These missing rows would distort calculations in the model. To overcome this issue, we applied the functions [expand.grid] and [merge] to generate every possible combination of day and beat for the period.

The expand.grid table was then merged with the initial aggregated data so that every day is represented, with a count of 0 allocated to days with no recorded criminal activity. Finally, we introduced two further variables (week number and month number) to the aggregated data (see code below).

### Adding 0 to dates with no data

# The date range extends one day past the data (to 31 May 2019) so that
# placeholder rows exist for the forecast day
date_range <- seq(min(as.Date(datar1$date)), as.Date("2019-05-31"), by = 'days')
beats <- unique(datar1$Beat)
dataframe <- expand.grid(date_range, beats)   # every beat/day combination
colnames(dataframe) <- c("date", "Beat")
cleantable <- merge(x= dataframe, y= datar1, by=c("date", "Beat"), all.x = T)
cleantable <- cleantable%>%dplyr::select(-Weekday, -Week)
cleantable1 <- cleantable%>%mutate(Weekday = weekdays(date), week = lubridate::week(date), Month = lubridate::month(date))%>%arrange(Beat, date)%>%
               mutate(Count = ifelse(is.na(Count), 0, Count))   # days with no offenses get 0
formattable(head(cleantable1))
date Beat Count Weekday week Month
2019-01-01 10H10 5 Tuesday 1 1
2019-01-02 10H10 1 Wednesday 1 1
2019-01-03 10H10 3 Thursday 1 1
2019-01-04 10H10 0 Friday 1 1
2019-01-05 10H10 3 Saturday 1 1
2019-01-06 10H10 0 Sunday 1 1

Data Understanding and Variables

Target variable

The target variable for the model is [Offense Count]; the objective of the predictive model is to forecast the number of offenses on 31 May 2019 for a given police beat.

Predictive variables

Our preliminary review of the dataset suggests a distinct relationship between offense counts and time variables (namely, month and day of week). The graph below illustrates this relationship: offense counts move through peaks and troughs over time, reflecting variation in crime incidents across days of the week and across months.

library(ggplot2)

dataa3 <- as.data.frame(cleantable1%>%dplyr::select(date, Count)%>%group_by(date)%>%summarise(Count = sum(Count)))
dataa3 <- dataa3%>%filter(date != "2019-05-31")   # drop the empty forecast-day placeholder


ggplot(data = dataa3, aes(x = date, y = Count)) +
  geom_line(color = "red", size = 1.5) +
  ggtitle("Crimes From 01 Jan 2019 to 30 May 2019") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 22))

Across months, we observe that the number of offenses dipped slightly in February but rose over the subsequent three months. This may reflect a trend or seasonality.

Given the implied relationship, the variables [month] and [day of week] have been chosen as predictor variables in the model. However, it would not be appropriate to simply assume that the same level of crime will occur on every particular day of the week; there needs to be a reference to recent trends. Therefore, lag variables were introduced using the lag function to capture the number of crimes on the previous day, one week prior, and two weeks prior.

To account for events within the week, we also included the rolling averages of crimes over the last 4, 7, 14 and 30 days as additional predictors in the model.

The graph below illustrates that the peaks and troughs generally fall on the same day of the week, and that the offense count does not stray far from its level in the previous week or two. A variable [Trend] was therefore added to the model.

This trend is the ratio of the average daily crime count over the last week to the average over the last two weeks. It indicates whether crime has been increasing or decreasing recently and accounts for the effect of such movements on crime rates in the immediate future.
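Concretely, writing $y_t$ for a beat's offense count on day $t$, this definition amounts to:

$$\mathrm{Trend}_t = \frac{\tfrac{1}{7}\sum_{i=1}^{7} y_{t-i}}{\tfrac{1}{14}\sum_{i=1}^{14} y_{t-i}}$$

with $\mathrm{Trend}_t$ set to 0 when the denominator is 0. Values above 1 flag a recent rise in crime; values below 1, a decline.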

# Offense counts in beat 17E10 during May 2019, colored by weekday
season <- datar1%>%dplyr::select(Weekday,date,Beat, Count)%>%
  mutate(Month = lubridate::month(date))%>%
  group_by(Weekday,date,Beat, Month)%>%
  summarise(Count = sum(Count))%>%
  filter(Month == "5", Beat == "17E10", date != "2019-05-31")%>%
  arrange(date)

ggplot(season, aes(x=date,y=Count)) + geom_line() +
  geom_point(aes(color=  factor(Weekday, levels =  c("Monday", "Tuesday", "Wednesday",
                                                     "Thursday", "Friday", "Saturday", "Sunday"))),size=5) +
  labs(x='Day of Month',
       y='Number of Crimes',
       color = "Weekdays",
       title='Crimes in Beat 17E10 from 1 May 2019 to 30 May 2019') +   theme(plot.title = element_text(size = 15, hjust = 0.5, face = "bold"))

A summary of possible Predictor Variables to be used is as follows:

• Month

• Weekday

• Lag1 (Crime count on the day before)

• Lag7 (Crime count on the same day last week)

• Lag14 (Crime count on the same day two weeks ago)

• Last4 (Rolling average last 4 days)

• Last7 (Rolling average last 7 days)

• Last14 (Rolling average last 14 days)

• Last30 (Rolling average last 30 days)

• Trend (Last7/Last14)

The code below displays the introduction of the new predictors.

library(zoo)  # rollapply()

# Lag and rolling-average predictors. cleantable1 is one stacked table arranged
# by Beat then date, so lag() and rollapply() run across beat boundaries; the
# second block below zeroes out the contaminated warm-up rows for each beat.
datar2 <- cleantable1%>%mutate(lag1 = lag(Count, 1))%>%mutate(lag1 = ifelse(is.na(lag1), 0, lag1))%>%
  mutate(lag7 = lag(Count, 7))%>%mutate(lag7 = ifelse(is.na(lag7), 0, lag7))%>%
  mutate(lag14 = lag(Count, 14))%>%mutate(lag14 = ifelse(is.na(lag14), 0, lag14))%>%
  mutate(last4 = rollapply(data = lag1, width = 4, FUN = mean, align = "right", fill = 0, na.rm = T))%>%
  mutate(last7 = rollapply(data = lag1, width = 7, FUN = mean, align = "right", fill = 0, na.rm = T))%>%
  mutate(last14 = rollapply(data = lag1, width = 14, FUN = mean, align = "right", fill = 0, na.rm = T))%>%
  mutate(last30 = rollapply(data = lag1, width = 30, FUN = mean, align = "right", fill = 0, na.rm = T))%>%
  mutate(trend = ifelse(last14 == 0, 0, last7/last14))


# Every beat's series starts on 2019-01-01, so these date filters hit exactly
# the rows whose lags would otherwise leak in values from the previous beat.
datar2 <- datar2%>%mutate(lag1 = ifelse(date == "2019-01-01", 0, lag1))%>%
                mutate(lag7  = ifelse(date >= "2019-01-01" & date <= "2019-01-07", 0, lag7))%>%
                mutate(lag14  = ifelse(date >= "2019-01-01" & date <= "2019-01-14", 0, lag14))%>%
                mutate(last4 = ifelse(date >= "2019-01-01" & date <= "2019-01-04", 0, last4))%>%
                mutate(last7 = ifelse(date >= "2019-01-01" & date <= "2019-01-07", 0, last7))%>%
                mutate(last14 = ifelse(date >= "2019-01-01" & date <= "2019-01-14", 0, last14))%>%
                mutate(last30 = ifelse(date >= "2019-01-01" & date <= "2019-01-30", 0, last30))
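The manual zeroing works because every beat shares the same date range after the expand.grid step. A hedged alternative sketch, not used in this study, would compute the same features per beat with group_by(), which keeps lag() and rollapply() inside one beat's series automatically (edge values in the warm-up window may differ slightly from the manual zeroing):

library(dplyr)
library(zoo)

datar2_alt <- cleantable1 %>%
  group_by(Beat) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    lag1  = lag(Count, 1,  default = 0),
    lag7  = lag(Count, 7,  default = 0),
    lag14 = lag(Count, 14, default = 0),
    # rolling means of lag1, as above, now confined to one beat's series
    last4  = rollapply(lag1, 4,  mean, align = "right", fill = 0),
    last7  = rollapply(lag1, 7,  mean, align = "right", fill = 0),
    last14 = rollapply(lag1, 14, mean, align = "right", fill = 0),
    last30 = rollapply(lag1, 30, mean, align = "right", fill = 0),
    trend  = ifelse(last14 == 0, 0, last7 / last14)
  ) %>%
  ungroup()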

Type of Model

The target variable is a discrete count rather than continuous data. As count data, its distribution is positively skewed, with many observations at 0. The large number of 0s makes it difficult to transform the skewed distribution into a normal one, so applying an OLS model to the count data would not be appropriate. An OLS regression is also likely to produce negative predicted values, which are impossible for count data.

Therefore, a Poisson or Negative Binomial model should be employed to overcome the normality assumption that OLS models require. However, the study is not limited to just these 2 models: other models will also be employed as a basis of comparison, to aid selection of the most accurate and appropriate model.

Accuracy is measured using a simple train/test split. Observations on 30 May 2019 were set aside as holdout data, while the dates from 1 January 2019 to 29 May 2019 were used as the training dataset.

Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) have been selected as the measures of model performance:

MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.

RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It is the square root of the average of squared differences between prediction and actual observation.
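For completeness, with $n$ holdout observations, actual counts $y_i$, and predictions $\hat{y}_i$, these two measures are:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$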

A lower RMSE/MAE is associated with better accuracy.

Modeling Process

Methodology

In order to get a better understanding of how the independent variables affect the target variable, the model was first tested using only one police beat. This was done to avoid complications from other beats where each may have different characteristics.

Observations on 30 May 2019 were set aside as holdout data, while the dates from 1 January 2019 to 29 May 2019 were used as the training dataset. The result of each model is then evaluated using MAE/RMSE.

Having identified the models with the lowest MAE, we then applied them to the other beats to forecast the offense counts for each beat on 31 May 2019.

Modeling Process for the Busiest Beat, 17E10

As previously mentioned, the target variable is a discrete count, so either a Poisson or a Negative Binomial model is suitable. Given that the mean count is not equal to the variance (the mean is 14.18 and the variance is 17.04), we have chosen the negative binomial as the initial model.

library(knitr)       # kable()
library(kableExtra)  # kable_styling()

beat17E10 <-  datar2%>%filter(Beat == "17E10")
mean <- round(mean(beat17E10$Count),2)
variance <- round(var(beat17E10$Count), 2)   # equal to sd()^2
meanvar <- as.data.frame(cbind(mean, variance))
kable(meanvar) %>% kable_styling(c("striped", "bordered"))
mean variance
14.18 17.04
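This overdispersion matters because a Poisson model forces $\mathrm{Var}(y) = \mu$, whereas the negative binomial (as fitted by glm.nb below) relaxes this to:

$$\mathrm{Var}(y) = \mu + \frac{\mu^2}{\theta}$$

where $\theta$ is a dispersion parameter estimated from the data.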

Using a negative binomial model with all the predictors, we observe an MAE and RMSE of 4.24, as shown below. (Because the holdout contains a single observation - beat 17E10 on 30 May 2019 - MAE and RMSE coincide: both reduce to the absolute error on that one day.)

library(MASS)  # glm.nb(); note MASS masks dplyr's select(), hence the explicit dplyr::select() calls

trainbeat17E10 <- beat17E10%>%filter(date != "2019-05-30", date != "2019-05-31" )
test17E10 <- beat17E10%>%filter(date == "2019-05-30")
modelnb <- glm.nb(Count ~ Weekday + week + Month + last4 + last7 + last14 + last30 +lag1 + lag7 + lag14 + trend, data = trainbeat17E10)
predict1 <- predict(modelnb, test17E10, type = "response" )
maenb <- mean(abs(test17E10$Count - predict1))
RMSEnb <- sqrt(mean((test17E10$Count - predict1)^2))
Measures_Beat_17E10 <- c("MAE", "RMSE")
Negative_Binomial_All_Var <- round(c(maenb, RMSEnb), 2)
results17E10 <- data.frame(Measures_Beat_17E10, Negative_Binomial_All_Var)
kable(results17E10) %>% kable_styling(c("striped", "bordered"))
Measures_Beat_17E10 Negative_Binomial_All_Var
MAE 4.24
RMSE 4.24

There is scope to improve the forecast further. To verify whether all the independent variables should be included in the model, we refer to the correlation plot below, which illustrates the correlation between the dependent variable and the independent variables.

library(corrplot)

cordata <- trainbeat17E10%>%dplyr::select(-date, -Beat, -Weekday)
mydata.cor = cor(cordata)
corrplot(mydata.cor)
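As a small supplement (not in the original analysis), the same information can be read numerically by printing the Count row of the correlation matrix computed above:

# Correlation of each candidate predictor with the target, largest first
sort(round(mydata.cor["Count", ], 2), decreasing = TRUE)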

The plot indicates that the variables last4, last7, last14, lag7 and trend have some degree of correlation with the dependent variable. The model was revised using only these predictors (plus Weekday), and the MAE improved from 4.24 to 2.06, as shown in the table below. Restricting the model to the independent variables that correlate with the dependent variable has therefore improved accuracy.

model2 <- glm.nb(Count ~ Weekday + last4  + last7 + lag7 + last14 + trend, data = trainbeat17E10)
predict2 <- predict(model2, test17E10, type = "response" )
maenb2 <- mean(abs(test17E10$Count - predict2))
RMSEnb2 <-  sqrt(mean((test17E10$Count - predict2)^2))
Negative_Binomial_Selected_Pred <- c(maenb2, RMSEnb2)
results17E10 <- results17E10%>%mutate(Negative_Binomial_Selected_Pred = round(Negative_Binomial_Selected_Pred,2))
formattable(results17E10)
Measures_Beat_17E10 Negative_Binomial_All_Var Negative_Binomial_Selected_Pred
MAE 4.24 2.06
RMSE 4.24 2.06

To find the model with the best forecast, a range of other models was also employed for comparison: decision trees, support vector machines, exponential smoothing (time series), and a neural network. These models were fitted with the following code.

## Decision Trees

library(rpart)  # rpart()

model3 <- rpart(Count ~ Weekday + last4  + last7 + lag7 + last14 + trend, data = trainbeat17E10)
predict3 <- predict(model3, test17E10)
maedt <- round(mean(abs(test17E10$Count - predict3)),2)
RMSEdt <-  round(sqrt(mean((test17E10$Count - predict3)^2)),2)
Decision_Trees <- c(maedt, RMSEdt)

# Support Vector Machine Methodology

library(e1071)  # svm()

# Exclude both the 30 May holdout and the empty 31 May placeholder rows,
# consistent with the other models
trainbeat17E102 <- beat17E10%>%filter(date != "2019-05-30", date != "2019-05-31")%>%dplyr::select(Count, last4, last7, lag7, last14, trend)
test17E102 <- beat17E10%>%filter(date == "2019-05-30")%>%dplyr::select(Count, last4, last7, lag7, last14, trend)
model4 <- svm(Count ~ last4  + last7 + lag7 + last14 + trend, data = trainbeat17E102, cost = 10, epsilon=0.02)
predict4 <- predict(model4, test17E102)
maesvm <- round(mean(abs(test17E102$Count - predict4)),2)
RMSEsvm <- round(sqrt(mean((test17E102$Count - predict4)^2)),2)
SVM <- c(maesvm, RMSEsvm)


# Exponential Smoothing Methodology

library(forecast)  # ets(), forecast(), accuracy()

series <- beat17E10%>%dplyr::select(Count)
crime_ts <- ts(series, start = c(2019, 1), frequency = 365.25)
train <- window(crime_ts, end = c(2019, 149))   # through 29 May 2019
fc1 <- ets(train)%>%forecast(h= 1)              # one step ahead: 30 May 2019
MAEETS <-  round(accuracy(fc1, crime_ts)["Test set", "MAE"],2)
RMSEETS<- round(accuracy(fc1, crime_ts)["Test set", "RMSE"],2)
ETS <- c(MAEETS, RMSEETS)


# Neural Network

library(nnet)  # nnet()

set.seed(20)
model5 <- nnet(Count ~ Weekday + last4  + last7 + lag7 + last14 + trend, data = trainbeat17E10, trace = FALSE, size=6, linout = TRUE)
ps <- predict(model5, test17E10)
maenn <- round(mean(abs(test17E10$Count - ps)),2)
RMSEnn <- round(sqrt(mean((test17E10$Count - ps)^2)),2)
NeuralN <- c(maenn, RMSEnn)

The resulting MAE/RMSE figures for each model are shown below. Based on the MAE/RMSE values, the negative binomial with selected predictors has the best performance, followed by the decision tree and the neural network.

# Table with the accuracy metrics for all the models
results17E10 <- results17E10%>%mutate(Decision_Trees = Decision_Trees, SVM = SVM, ETS = ETS, NeuralN = NeuralN)
formattable(results17E10, align =c("l","c","c","c","c", "c"), list(
  `Measures_Beat_17E10` = formatter("span", style = ~ style(color = "grey",font.weight = "bold")), 
  `Negative_Binomial_Selected_Pred`= color_tile("#fdfeaa", "#fdfeaa")))
Measures_Beat_17E10 Negative_Binomial_All_Var Negative_Binomial_Selected_Pred Decision_Trees SVM ETS NeuralN
MAE 4.24 2.06 2.67 4.77 4.3 3.2
RMSE 4.24 2.06 2.67 4.77 4.3 3.2

Having identified the negative binomial, decision tree and neural network as the three best models, we then applied them to all the beats, with the resulting MAE/RMSE performance shown below.

train <- datar2%>%filter(date != "2019-05-30", date != "2019-05-31" )
test <- datar2%>%filter(date == "2019-05-30")



### Negative Binomial

modelnb_all <- glm.nb(Count ~ Weekday + last4  + last7 + lag7 + last14 + trend, data = train)  # avoid masking stats::lm with the name "lm"
Neg.bin_Predic <- round(predict(modelnb_all, test, type = "response"),0)
testall <- cbind(test, Neg.bin_Predic)
RMSENBall <- sqrt(mean((testall$Count - testall$Neg.bin_Predic)^2))
MAENBall <- mean(abs(testall$Count - testall$Neg.bin_Predic))
Measures_All_Beats <- c("MAE", "RMSE")
Negative_Binomial <- c(MAENBall, RMSENBall)
resultsall <- data.frame(Measures_All_Beats, Negative_Binomial)


## Decision Trees


modeldt <- rpart(Count ~  Weekday + last4  + last7 + lag7 + last14 + trend, data = train)
Des.Tree.Predic <- round(predict(modeldt,test), 0)
testdt <- cbind(test, Des.Tree.Predic)
RMSEalldt <- sqrt(mean((testdt$Count - testdt$Des.Tree.Predic)^2))
maealldt <- mean(abs(testdt$Count - testdt$Des.Tree.Predic))
Decision.Trees <- c(maealldt, RMSEalldt)
resultsall <- resultsall%>%mutate(Decision.Trees = Decision.Trees)


## Neural Network
set.seed(20)
model6 <- nnet(Count ~ Weekday + last4  + last7 + lag7 + last14 + trend, data = train, trace = FALSE, size=6, linout = TRUE)
nn <- round(predict(model6, test),0)
testnn <- cbind(test, nn)
RMSEnnall <- sqrt(mean((testnn$Count - testnn$nn)^2))   # renamed: RMSENBall belongs to the NB block above
MAEnn <- mean(abs(testnn$Count - testnn$nn))

neuralnet <- c(MAEnn, RMSEnnall)
resultsall <- resultsall%>%mutate(Neural.Network = neuralnet)
formattable(resultsall)
Measures_All_Beats Negative_Binomial Decision.Trees Neural.Network
MAE 2.992063 2.698413 2.658730
RMSE 4.241705 3.741657 3.682865

Evaluation

Analyzing Errors and Model Performance

Across all beats, we note that the negative binomial has the highest MAE. Further analysis of its forecasts shows that the negative binomial model tends to make more extreme predictions, which leads to larger absolute errors in some beats. This can be seen below in an extract of the observations where the absolute error is greater than 10.

# Negative Binomial Errors
errors_nb <- testall%>%mutate(Error = abs(Count - Neg.bin_Predic))%>%arrange(desc(Error))%>%dplyr::select(Count, Neg.bin_Predic, Error)%>%filter(Error > 10)
formattable(errors_nb) 
Count Neg.bin_Predic Error
8 23 15
22 37 15
19 6 13
18 7 11
18 7 11

By comparison, the decision tree model shows a more moderate tendency toward large absolute errors: only one observation has a difference between forecast and actual greater than 10, while the neural network model records two.

# Decision Trees Errors 
errors_dt <- testdt%>%mutate(Error = abs(Count - Des.Tree.Predic))%>%arrange(desc(Error))%>%dplyr::select(Count, Des.Tree.Predic, Error)%>%filter(Error > 10)
formattable(errors_dt) 
Count Des.Tree.Predic Error
19 8 11
# Neural Network Errors
errors_nn <- testnn%>%mutate(Error = abs(Count - nn))%>%arrange(desc(Error))%>%dplyr::select(Count, nn, Error)%>%filter(Error > 10)
formattable(errors_nn) 
Count nn Error
19 6 13
18 7 11

Based on the low MAE/RMSE figures, the Neural Network and Decision Tree models appear to perform well across all the beats. Hence, these 2 models are selected to forecast the number of crimes for 31 May 2019.

Forecasting Crimes for 31 May 2019

To leverage the two models, we employ a simple ensemble learning method: averaging the two algorithms to improve forecast performance. The final forecast is thus the average of the decision tree and neural network forecasts.
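Before committing to the ensemble, one can sanity-check the simple average on the 30 May holdout. This is a hedged sketch, not part of the original evaluation; it assumes the row-aligned testdt and testnn tables built earlier are still in scope.

# Simple-average ensemble evaluated on the 30 May 2019 holdout
ensemble_pred <- (testdt$Des.Tree.Predic + testnn$nn) / 2
MAE_ens  <- mean(abs(test$Count - ensemble_pred))
RMSE_ens <- sqrt(mean((test$Count - ensemble_pred)^2))
c(MAE = MAE_ens, RMSE = RMSE_ens)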

The table below records the respective 31 May 2019 crime forecasts for each beat based on the Decision Trees model.

#Decision Trees
# Train on all days with actual data (1 Jan - 30 May). The 31 May rows exist
# only as placeholders (Count = 0), but their lag/rolling predictors were
# built from actual data through 30 May, so they can be scored.
trainv <- datar2%>%filter(date != "2019-05-31" )
testv <- datar2%>%filter(date == "2019-05-31")

model2 <- rpart(Count ~  Weekday + last4  + last7 + lag7 + last14 + trend, data = trainv)
Decision_Trees_31_05_01 <- round(predict(model2,testv),0)
forecast2 <- cbind(testv, Decision_Trees_31_05_01)
Decision_Trees_Prediction <- forecast2%>%dplyr::select(Beat, Decision_Trees_31_05_01 )
Decision_Trees_Prediction

In the table below, the forecasts from the neural network model are included in the column next to the decision tree forecasts, and the final forecast in the right-most column is the average of the two, rounded up to the nearest integer.

#Neural Network
trainv <- datar2%>%filter(date != "2019-05-31" )
testv <- datar2%>%filter(date == "2019-05-31")
set.seed(20)
model2 <- nnet(Count ~ Weekday + last4  + last7 + lag7 + last14 + trend, data = trainv, trace = FALSE, size=6, linout = TRUE)
Neural_Network_31_05_01 <- round(predict(model2,testv),0)
forecast2 <- Decision_Trees_Prediction%>%mutate(Neural_Network_31_05_01  = Neural_Network_31_05_01)%>%dplyr::select(Beat, Decision_Trees_31_05_01, Neural_Network_31_05_01)
forecast2 <- forecast2%>%mutate(Ensemble_Prediction = ceiling((Decision_Trees_31_05_01 + Neural_Network_31_05_01)/2))
forecast2

Conclusion

This exercise aims to predict the number of offenses on the next day based on the following predictor variables: month, weekday, the crime count on the previous day, the crime count on the same day last week, the crime count on the same day two weeks ago, the rolling averages over the past 4, 7, 14 and 30 days, and the trend (the ratio of the 7-day rolling average to the 14-day rolling average).

We have tested six models (negative binomial with all attributes, negative binomial with selected attributes, decision tree, support vector machine, exponential smoothing, and neural network). After evaluation, we find that the decision tree and neural network models both perform well in terms of MAE and RMSE: the decision tree has an MAE of 2.70 and an RMSE of 3.74, while the neural network has an MAE of 2.66 and an RMSE of 3.68. A simple-average ensemble of the two algorithms is then employed to produce the final model for implementation.

Using this implemented model, HPD can use the predicted number of offenses to determine the human resources to deploy to each beat. For example, on 31 May 2019, beats with more than 10 predicted offenses, such as 12D10, 13D20, etc., should have more officers deployed.