Violent crime is a serious social problem that affects the social order and economic development of a community. However, almost all law enforcement and crime prevention agencies are consistently constrained by limited personnel, resources, and budgetary allocations. To manage violent crime under these perennial constraints, it is prudent to plan resource allocation in advance by predicting violent crime patterns for a given community. I propose to predict violent crimes for a specified area through regression using the provided crime dataset. I will fit a random forest regression model and assess its goodness of fit.
Predicting violent crime patterns is an indispensable tool for law enforcement agencies. It facilitates mapping out crime-prone regions so that limited resources can be distributed effectively, and it provides valuable information for optimizing available resources ahead of time. According to the FBI [1], violent crimes involve the use of force or the threat of force, and include rape, murder and non-negligent manslaughter, robbery, and aggravated assault. In 2013, 1,163,146 violent crimes were reported, an average of 367.9 per 100k inhabitants, which translates to a national average of one violent crime every 27.1 seconds. It is with these considerations that I propose to predict the number of violent crimes per 100k population using regression techniques.
The provided crime dataset contains information on crimes in specific communities across the USA, combining socioeconomic and law enforcement data.
An overview of the data
library(dplyr) # used for data manipulation and joining data
library(ggplot2) # used for plotting
crime_raw <- read.csv("crimedata.csv", na.strings = "?") # reading the datafile
dim(crime_raw)
## [1] 2215 147
There are 2215 observations of 147 variables. Four attributes are non-predictive, carrying information about the community name, county code, community code, and fold. The target variable is the number of violent crimes per 100k population. Since it is continuous, it can be visualised by plotting its histogram.
p1 <- ggplot(crime_raw, aes(ViolentCrimesPerPop)) +
  geom_histogram(binwidth = 80, fill = "darkgreen") +
  xlab("Violent Crimes per 100k Population")
p1
## Warning: Removed 221 rows containing non-finite values (stat_bin).
The distribution is right-skewed, so a transformation may be required to treat the skewness.
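As a hedged illustration (not applied in the modelling below), a log1p transform is one common way to compress such a right tail; p1_log is a name introduced here:
# log1p handles the zero counts that a plain log would break on
p1_log <- ggplot(crime_raw, aes(log1p(ViolentCrimesPerPop))) +
  geom_histogram(bins = 30, fill = "darkgreen") +
  xlab("log(1 + Violent Crimes per 100k Population)")
p1_log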
First we check for duplicates among the city (communityname) names. If the number of unique names equals the number of observations in the crime dataset, then there are no duplicates.
# find number of unique city names
length(unique(crime_raw$communityname)) #selecting unique city name values
## [1] 2018
There are 2018 unique values, which is lower than the 2215 observations in the crime dataset, so there are duplicates among the city names. Now we identify which city names are duplicated.
# name_freq lists all community names alphabetically with their frequency
# (renamed from 'table' to avoid shadowing the base function)
name_freq <- data.frame(table(crime_raw$communityname))
##name_freq
# subset for the community names that appear more than once
duplicates <- name_freq[name_freq$Freq != 1, ]
##duplicates
dim(duplicates)
## [1] 144 2
There are 144 duplicated city names. Hence the city name alone is not a unique identifier.
Now we subset the crime data, keeping only the duplicated city names, to investigate further.
#subset the data for duplicate city names
crime_dups <- crime_raw[crime_raw$communityname %in% duplicates$Var1,]
#arranging the subset in alphabetical order
crime_dups <- arrange(crime_dups, communityname)
##crime_dups
dim(crime_dups)
## [1] 341 147
There are 341 observations with duplicated city names. We can see that the duplication arises because multiple states can have a city with the same name.
We can create a unique identifier by considering both the city name and the state, which can be done by concatenating the two.
#create a new variable by concatenating city name and state
crime_raw <- crime_raw %>%
mutate(citystate = paste(communityname, state, sep = " "))
There are 2215 unique values when considering both community name and state. Since this equals the number of observations in the crime data, there are no duplicates.
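A one-line check of this claim (per the text it returns 2215, matching nrow(crime_raw)):
# count distinct city/state identifiers; equality with nrow means no duplicates
length(unique(crime_raw$citystate))
## [1] 2215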
Let's look at the variable attributes using the str function.
str(crime_raw)
## 'data.frame': 2215 obs. of 148 variables:
## $ communityname : Factor w/ 2018 levels "~","Aberdeencity",..: 152 1035 1781 666 142 1700 1272 42 567 1860 ...
## $ state : Factor w/ 48 levels "AK","AL","AR",..: 29 36 35 32 23 24 19 15 27 41 ...
## $ countyCode : int 39 45 NA 35 7 NA 21 NA 17 NA ...
## $ communityCode : int 5320 47616 NA 29443 5068 NA 50250 NA 25700 NA ...
## $ fold : int 1 1 1 1 1 1 1 1 1 1 ...
## $ population : int 11980 23123 29344 16656 11245 140494 28700 59459 74111 103590 ...
## $ householdsize : num 3.1 2.82 2.43 2.4 2.76 2.45 2.6 2.45 2.46 2.62 ...
## $ racepctblack : num 1.37 0.8 0.74 1.7 0.53 ...
## $ racePctWhite : num 91.8 95.6 94.3 97.3 89.2 ...
## $ racePctAsian : num 6.5 3.44 3.43 0.5 1.17 0.9 1.47 0.4 1.25 0.92 ...
## $ racePctHisp : num 1.88 0.85 2.35 0.7 0.52 ...
## $ agePct12t21 : num 12.5 11 11.4 12.6 24.5 ...
## $ agePct12t29 : num 21.4 21.3 25.9 25.2 40.5 ...
## $ agePct16t24 : num 10.9 10.5 11 12.2 28.7 ...
## $ agePct65up : num 11.3 17.2 10.3 17.6 12.6 ...
## $ numbUrban : int 11980 23123 29344 0 0 140494 28700 59449 74115 103590 ...
## $ pctUrban : num 100 100 100 0 0 100 100 100 100 100 ...
## $ medIncome : int 75122 47917 35669 20580 17390 21577 42805 23221 25326 17852 ...
## $ pctWWage : num 89.2 79 82 68.2 69.3 ...
## $ pctWFarmSelf : num 1.55 1.11 1.15 0.24 0.55 1 0.39 0.67 2.93 0.86 ...
## $ pctWInvInc : num 70.2 64.1 55.7 39 42.8 ...
## $ pctWSocSec : num 23.6 35.5 22.2 39.5 32.2 ...
## $ pctWPubAsst : num 1.03 2.75 2.94 11.71 11.21 ...
## $ pctWRetire : num 18.4 22.9 14.6 18.3 14.4 ...
## $ medFamInc : int 79584 55323 42112 26501 24018 27705 50394 28901 34269 24058 ...
## $ perCapInc : int 29711 20148 16946 10810 8483 11878 18193 12161 13554 10195 ...
## $ whitePerCap : int 30233 20191 17103 10909 9009 12029 18276 12599 13727 12126 ...
## $ blackPerCap : int 13600 18137 16644 9984 887 7382 17342 9820 8852 5715 ...
## $ indianPerCap : int 5725 0 21606 4941 4425 10264 21482 6634 5344 11313 ...
## $ AsianPerCap : int 27101 20074 15528 3541 3352 10753 12639 8802 8011 5770 ...
## $ OtherPerCap : int 5115 5250 5954 2451 3000 7192 21852 7428 5332 7320 ...
## $ HispPerCap : int 22838 12222 8405 4391 1328 8104 22594 6187 5174 6984 ...
## $ NumUnderPov : int 227 885 1389 2831 2855 23223 1126 10320 9603 27767 ...
## $ PctPopUnderPov : num 1.96 3.98 4.75 17.23 29.99 ...
## $ PctLess9thGrade : num 5.81 5.61 2.8 11.05 12.15 ...
## $ PctNotHSGrad : num 9.9 13.72 9.09 33.68 23.06 ...
## $ PctBSorMore : num 48.2 29.9 30.1 10.8 25.3 ...
## $ PctUnemployed : num 2.7 2.43 4.01 9.86 9.08 5.72 4.85 8.19 4.18 8.39 ...
## $ PctEmploy : num 64.5 62 69.8 54.7 52.4 ...
## $ PctEmplManu : num 14.65 12.26 15.95 31.22 6.89 ...
## $ PctEmplProfServ : num 28.8 29.3 21.5 27.4 36.5 ...
## $ PctOccupManu : num 5.49 6.39 8.79 26.76 10.94 ...
## $ PctOccupMgmtProf : num 50.7 37.6 32.5 22.7 27.8 ...
## $ MalePctDivorce : num 3.67 4.23 10.1 10.98 7.51 ...
## $ MalePctNevMarr : num 26.4 28 25.8 28.1 50.7 ...
## $ FemalePctDiv : num 5.22 6.45 14.76 14.47 11.64 ...
## $ TotalPctDiv : num 4.47 5.42 12.55 12.91 9.73 ...
## $ PersPerFam : num 3.22 3.11 2.95 2.98 2.98 2.89 3.14 2.95 3 3.11 ...
## $ PctFam2Par : num 91.4 86.9 78.5 64 58.6 ...
## $ PctKids2Par : num 90.2 85.3 78.8 62.4 55.2 ...
## $ PctYoungKids2Par : num 95.8 96.8 92.4 65.4 66.5 ...
## $ PctTeen2Par : num 95.8 86.5 75.7 67.4 79.2 ...
## $ PctWorkMomYoungKids : num 44.6 51.1 66.1 59.6 61.2 ...
## $ PctWorkMom : num 58.9 62.4 74.2 70.3 68.9 ...
## $ NumKidsBornNeverMar : int 31 43 164 561 402 1511 263 2368 751 3537 ...
## $ PctKidsBornNeverMar : num 0.36 0.24 0.88 3.84 4.7 1.58 1.18 4.66 1.64 4.71 ...
## $ NumImmig : int 1277 1920 1468 339 196 2091 2637 517 1474 4793 ...
## $ PctImmigRecent : num 8.69 5.21 16.42 13.86 46.94 ...
## $ PctImmigRec5 : num 13 8.65 23.98 13.86 56.12 ...
## $ PctImmigRec8 : num 21 13.3 32.1 15.3 67.9 ...
## $ PctImmigRec10 : num 30.9 22.5 35.6 15.3 69.9 ...
## $ PctRecentImmig : num 0.93 0.43 0.82 0.28 0.82 0.32 1.05 0.11 0.47 0.72 ...
## $ PctRecImmig5 : num 1.39 0.72 1.2 0.28 0.98 0.45 1.49 0.2 0.67 1.07 ...
## $ PctRecImmig8 : num 2.24 1.11 1.61 0.31 1.18 0.57 2.2 0.25 0.93 1.63 ...
## $ PctRecImmig10 : num 3.3 1.87 1.78 0.31 1.22 0.68 2.55 0.29 1.07 2.31 ...
## $ PctSpeakEnglOnly : num 85.7 87.8 93.1 95 94.6 ...
## $ PctNotSpeakEnglWell : num 1.37 1.81 1.14 0.56 0.39 0.6 0.6 0.28 0.43 2.51 ...
## $ PctLargHouseFam : num 4.81 4.25 2.97 3.93 5.23 3.08 5.08 3.85 2.59 6.7 ...
## $ PctLargHouseOccup : num 4.17 3.34 2.05 2.56 3.11 1.92 3.46 2.55 1.54 4.1 ...
## $ PersPerOccupHous : num 2.99 2.7 2.42 2.37 2.35 2.28 2.55 2.36 2.32 2.45 ...
## $ PersPerOwnOccHous : num 3 2.83 2.69 2.51 2.55 2.37 2.89 2.42 2.77 2.47 ...
## $ PersPerRentOccHous : num 2.84 1.96 2.06 2.2 2.12 2.16 2.09 2.27 1.91 2.44 ...
## $ PctPersOwnOccup : num 91.5 89 64.2 58.2 58.1 ...
## $ PctPersDenseHous : num 0.39 1.01 2.03 1.21 2.94 2.11 1.47 1.9 1.67 6.14 ...
## $ PctHousLess3BR : num 11.1 23.6 47.5 45.7 55.6 ...
## $ MedNumBR : int 3 3 3 3 2 2 3 2 2 2 ...
## $ HousVacant : int 64 240 544 669 333 5119 566 2051 1562 5606 ...
## $ PctHousOccup : num 98.4 97.2 95.7 91.2 92.5 ...
## $ PctHousOwnOcc : num 91 84.9 57.8 54.9 53.6 ...
## $ PctVacantBoarded : num 3.12 0 0.92 2.54 3.9 2.09 1.41 6.39 0.45 5.64 ...
## $ PctVacMore6Mos : num 37.5 18.33 7.54 57.85 42.64 ...
## $ MedYrHousBuilt : int 1959 1958 1976 1939 1958 1966 1956 1954 1971 1960 ...
## $ PctHousNoPhone : num 0 0.31 1.55 7 7.45 ...
## $ PctWOFullPlumb : num 0.28 0.14 0.12 0.87 0.82 0.31 0.28 0.49 0.19 0.33 ...
## $ OwnOccLowQuart : int 215900 136300 74700 36400 30600 37700 155100 26300 54500 28600 ...
## $ OwnOccMedVal : int 262600 164200 90400 49600 43200 53900 179000 37000 70300 43100 ...
## $ OwnOccHiQuart : int 326900 199900 112000 66500 59500 73100 215500 52400 93700 67400 ...
## $ OwnOccQrange : int 111000 63600 37300 30100 28900 35400 60400 26100 39200 38800 ...
## $ RentLowQ : int 685 467 370 195 202 215 463 186 241 192 ...
## $ RentMedian : int 1001 560 428 250 283 280 669 253 321 281 ...
## $ RentHighQ : int 1001 672 520 309 362 349 824 325 387 369 ...
## $ RentQrange : int 316 205 150 114 160 134 361 139 146 177 ...
## $ MedRent : int 1001 627 484 333 332 340 736 338 355 353 ...
## $ MedRentPctHousInc : num 23.8 27.6 24.1 28.7 32.2 26.4 24.4 26.3 25.2 29.6 ...
## $ MedOwnCostPctInc : num 21.1 20.7 21.7 20.6 23.2 17.3 20.8 15.1 20.7 19.4 ...
## $ MedOwnCostPctIncNoMtg: num 14 12.5 11.6 14.5 12.9 11.7 12.5 12.2 12.8 13 ...
## $ NumInShelters : int 11 0 16 0 2 327 0 21 125 43 ...
## $ NumStreet : int 0 0 0 0 0 4 0 0 15 4 ...
## $ PctForeignBorn : num 10.66 8.3 5 2.04 1.74 ...
## [list output truncated]
Most of the variables have been read in as factors and integers and need to be reclassified. We redefine 'fold', 'countyCode' and 'communityCode' as factors, drop the helper 'citystate' column, and convert columns 6 onwards to numeric.
# classify the identifier columns 'fold', 'countyCode' and 'communityCode' as factors
crime_raw$fold = as.factor(crime_raw$fold)
crime_raw$countyCode = as.factor(crime_raw$countyCode)
crime_raw$communityCode = as.factor(crime_raw$communityCode)
# remove the citystate helper variable (column 148)
crime_raw <- select(crime_raw,-c(148))
# reclassify columns 6 onwards: first as character, then as numeric
crime_raw[,6:ncol(crime_raw)] = sapply(crime_raw[,6:ncol(crime_raw)], as.character)
crime_raw[,6:ncol(crime_raw)] = sapply(crime_raw[,6:ncol(crime_raw)], as.numeric)
str(crime_raw)
## 'data.frame': 2215 obs. of 147 variables:
## $ communityname : Factor w/ 2018 levels "~","Aberdeencity",..: 152 1035 1781 666 142 1700 1272 42 567 1860 ...
## $ state : Factor w/ 48 levels "AK","AL","AR",..: 29 36 35 32 23 24 19 15 27 41 ...
## $ countyCode : Factor w/ 114 levels "1","3","5","7",..: 20 23 NA 18 4 NA 11 NA 9 NA ...
## $ communityCode : Factor w/ 959 levels "70","100","440",..: 58 491 NA 297 52 NA 528 NA 261 NA ...
## $ fold : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ population : num 11980 23123 29344 16656 11245 ...
## $ householdsize : num 3.1 2.82 2.43 2.4 2.76 2.45 2.6 2.45 2.46 2.62 ...
## $ racepctblack : num 1.37 0.8 0.74 1.7 0.53 ...
## $ racePctWhite : num 91.8 95.6 94.3 97.3 89.2 ...
## $ racePctAsian : num 6.5 3.44 3.43 0.5 1.17 0.9 1.47 0.4 1.25 0.92 ...
## $ racePctHisp : num 1.88 0.85 2.35 0.7 0.52 ...
## $ agePct12t21 : num 12.5 11 11.4 12.6 24.5 ...
## $ agePct12t29 : num 21.4 21.3 25.9 25.2 40.5 ...
## $ agePct16t24 : num 10.9 10.5 11 12.2 28.7 ...
## $ agePct65up : num 11.3 17.2 10.3 17.6 12.6 ...
## $ numbUrban : num 11980 23123 29344 0 0 ...
## $ pctUrban : num 100 100 100 0 0 100 100 100 100 100 ...
## $ medIncome : num 75122 47917 35669 20580 17390 ...
## $ pctWWage : num 89.2 79 82 68.2 69.3 ...
## $ pctWFarmSelf : num 1.55 1.11 1.15 0.24 0.55 1 0.39 0.67 2.93 0.86 ...
## $ pctWInvInc : num 70.2 64.1 55.7 39 42.8 ...
## $ pctWSocSec : num 23.6 35.5 22.2 39.5 32.2 ...
## $ pctWPubAsst : num 1.03 2.75 2.94 11.71 11.21 ...
## $ pctWRetire : num 18.4 22.9 14.6 18.3 14.4 ...
## $ medFamInc : num 79584 55323 42112 26501 24018 ...
## $ perCapInc : num 29711 20148 16946 10810 8483 ...
## $ whitePerCap : num 30233 20191 17103 10909 9009 ...
## $ blackPerCap : num 13600 18137 16644 9984 887 ...
## $ indianPerCap : num 5725 0 21606 4941 4425 ...
## $ AsianPerCap : num 27101 20074 15528 3541 3352 ...
## $ OtherPerCap : num 5115 5250 5954 2451 3000 ...
## $ HispPerCap : num 22838 12222 8405 4391 1328 ...
## $ NumUnderPov : num 227 885 1389 2831 2855 ...
## $ PctPopUnderPov : num 1.96 3.98 4.75 17.23 29.99 ...
## $ PctLess9thGrade : num 5.81 5.61 2.8 11.05 12.15 ...
## $ PctNotHSGrad : num 9.9 13.72 9.09 33.68 23.06 ...
## $ PctBSorMore : num 48.2 29.9 30.1 10.8 25.3 ...
## $ PctUnemployed : num 2.7 2.43 4.01 9.86 9.08 5.72 4.85 8.19 4.18 8.39 ...
## $ PctEmploy : num 64.5 62 69.8 54.7 52.4 ...
## $ PctEmplManu : num 14.65 12.26 15.95 31.22 6.89 ...
## $ PctEmplProfServ : num 28.8 29.3 21.5 27.4 36.5 ...
## $ PctOccupManu : num 5.49 6.39 8.79 26.76 10.94 ...
## $ PctOccupMgmtProf : num 50.7 37.6 32.5 22.7 27.8 ...
## $ MalePctDivorce : num 3.67 4.23 10.1 10.98 7.51 ...
## $ MalePctNevMarr : num 26.4 28 25.8 28.1 50.7 ...
## $ FemalePctDiv : num 5.22 6.45 14.76 14.47 11.64 ...
## $ TotalPctDiv : num 4.47 5.42 12.55 12.91 9.73 ...
## $ PersPerFam : num 3.22 3.11 2.95 2.98 2.98 2.89 3.14 2.95 3 3.11 ...
## $ PctFam2Par : num 91.4 86.9 78.5 64 58.6 ...
## $ PctKids2Par : num 90.2 85.3 78.8 62.4 55.2 ...
## $ PctYoungKids2Par : num 95.8 96.8 92.4 65.4 66.5 ...
## $ PctTeen2Par : num 95.8 86.5 75.7 67.4 79.2 ...
## $ PctWorkMomYoungKids : num 44.6 51.1 66.1 59.6 61.2 ...
## $ PctWorkMom : num 58.9 62.4 74.2 70.3 68.9 ...
## $ NumKidsBornNeverMar : num 31 43 164 561 402 ...
## $ PctKidsBornNeverMar : num 0.36 0.24 0.88 3.84 4.7 1.58 1.18 4.66 1.64 4.71 ...
## $ NumImmig : num 1277 1920 1468 339 196 ...
## $ PctImmigRecent : num 8.69 5.21 16.42 13.86 46.94 ...
## $ PctImmigRec5 : num 13 8.65 23.98 13.86 56.12 ...
## $ PctImmigRec8 : num 21 13.3 32.1 15.3 67.9 ...
## $ PctImmigRec10 : num 30.9 22.5 35.6 15.3 69.9 ...
## $ PctRecentImmig : num 0.93 0.43 0.82 0.28 0.82 0.32 1.05 0.11 0.47 0.72 ...
## $ PctRecImmig5 : num 1.39 0.72 1.2 0.28 0.98 0.45 1.49 0.2 0.67 1.07 ...
## $ PctRecImmig8 : num 2.24 1.11 1.61 0.31 1.18 0.57 2.2 0.25 0.93 1.63 ...
## $ PctRecImmig10 : num 3.3 1.87 1.78 0.31 1.22 0.68 2.55 0.29 1.07 2.31 ...
## $ PctSpeakEnglOnly : num 85.7 87.8 93.1 95 94.6 ...
## $ PctNotSpeakEnglWell : num 1.37 1.81 1.14 0.56 0.39 0.6 0.6 0.28 0.43 2.51 ...
## $ PctLargHouseFam : num 4.81 4.25 2.97 3.93 5.23 3.08 5.08 3.85 2.59 6.7 ...
## $ PctLargHouseOccup : num 4.17 3.34 2.05 2.56 3.11 1.92 3.46 2.55 1.54 4.1 ...
## $ PersPerOccupHous : num 2.99 2.7 2.42 2.37 2.35 2.28 2.55 2.36 2.32 2.45 ...
## $ PersPerOwnOccHous : num 3 2.83 2.69 2.51 2.55 2.37 2.89 2.42 2.77 2.47 ...
## $ PersPerRentOccHous : num 2.84 1.96 2.06 2.2 2.12 2.16 2.09 2.27 1.91 2.44 ...
## $ PctPersOwnOccup : num 91.5 89 64.2 58.2 58.1 ...
## $ PctPersDenseHous : num 0.39 1.01 2.03 1.21 2.94 2.11 1.47 1.9 1.67 6.14 ...
## $ PctHousLess3BR : num 11.1 23.6 47.5 45.7 55.6 ...
## $ MedNumBR : num 3 3 3 3 2 2 3 2 2 2 ...
## $ HousVacant : num 64 240 544 669 333 ...
## $ PctHousOccup : num 98.4 97.2 95.7 91.2 92.5 ...
## $ PctHousOwnOcc : num 91 84.9 57.8 54.9 53.6 ...
## $ PctVacantBoarded : num 3.12 0 0.92 2.54 3.9 2.09 1.41 6.39 0.45 5.64 ...
## $ PctVacMore6Mos : num 37.5 18.33 7.54 57.85 42.64 ...
## $ MedYrHousBuilt : num 1959 1958 1976 1939 1958 ...
## $ PctHousNoPhone : num 0 0.31 1.55 7 7.45 ...
## $ PctWOFullPlumb : num 0.28 0.14 0.12 0.87 0.82 0.31 0.28 0.49 0.19 0.33 ...
## $ OwnOccLowQuart : num 215900 136300 74700 36400 30600 ...
## $ OwnOccMedVal : num 262600 164200 90400 49600 43200 ...
## $ OwnOccHiQuart : num 326900 199900 112000 66500 59500 ...
## $ OwnOccQrange : num 111000 63600 37300 30100 28900 35400 60400 26100 39200 38800 ...
## $ RentLowQ : num 685 467 370 195 202 215 463 186 241 192 ...
## $ RentMedian : num 1001 560 428 250 283 ...
## $ RentHighQ : num 1001 672 520 309 362 ...
## $ RentQrange : num 316 205 150 114 160 134 361 139 146 177 ...
## $ MedRent : num 1001 627 484 333 332 ...
## $ MedRentPctHousInc : num 23.8 27.6 24.1 28.7 32.2 26.4 24.4 26.3 25.2 29.6 ...
## $ MedOwnCostPctInc : num 21.1 20.7 21.7 20.6 23.2 17.3 20.8 15.1 20.7 19.4 ...
## $ MedOwnCostPctIncNoMtg: num 14 12.5 11.6 14.5 12.9 11.7 12.5 12.2 12.8 13 ...
## $ NumInShelters : num 11 0 16 0 2 327 0 21 125 43 ...
## $ NumStreet : num 0 0 0 0 0 4 0 0 15 4 ...
## $ PctForeignBorn : num 10.66 8.3 5 2.04 1.74 ...
## [list output truncated]
The first five columns are now classified as factors and the rest as numeric.
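For reference, the same reclassification can be written more compactly with dplyr; this is a sketch assuming dplyr >= 1.0.0, which provides across() (the version loaded above may predate it):
# convert columns 6 onwards to numeric in a single step
crime_raw <- crime_raw %>%
  mutate(across(6:last_col(), ~ as.numeric(as.character(.x))))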
Let's find the total number of missing values.
# number of missing values in the dataset
sum(is.na(crime_raw))
## [1] 44592
There are 44592 missing values in the dataset. Variables with more than 50% missing data will be dropped. To drop these columns, I keep only the columns whose count of missing values is below 1103 (roughly half of the 2215 observations).
#dropping columns with more than half of the values missing
crime_new <- crime_raw %>% select_if(~sum(is.na(.)) < 1103)
##dim(crime_new)
##sum(is.na(crime_new))
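As a quick side check (a sketch, output omitted), the columns removed by this threshold can be listed directly:
# names of the columns at or above the missingness threshold
names(crime_raw)[colSums(is.na(crime_raw)) >= 1103]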
Since the number of violent crimes per 100k population is the target variable, all observations with a missing value for this attribute are dropped.
#dropping observations with missing value for target variable
crime_new <- crime_new[!is.na(crime_new$ViolentCrimesPerPop),]
dim(crime_new)
## [1] 1994 123
sum(is.na(crime_new))
## [1] 283
There are now 1994 observations of 123 variables, with 283 missing values remaining.
The remaining categorical variables 'communityname', 'state' and 'fold' are non-predictive attributes for a regression model, so they are dropped from the dataset before modelling.
#dropping non predictive attributes
crime <- select(crime_new,-c(communityname, state, fold))
dim(crime)
## [1] 1994 120
sum(is.na(crime))
## [1] 283
There are now 1994 observations of 120 variables, with 283 missing values.
The cleaned dataset still has 120 variables, which is a high number of attributes to work with. I will apply a random forest learner to predict the violent crimes per 100k population and to identify the most important features for the prediction.
# Data Partition to train and test datasets
set.seed(3455)
ind <- sample(2, nrow(crime), replace = TRUE, prob = c(0.7, 0.3))
train <- crime[ind==1,]
test <- crime[ind==2,]
# Random Forest regression model
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
set.seed(3455)
m1 <- randomForest(
formula = ViolentCrimesPerPop~.,
data = train,
importance=TRUE,
na.action=na.omit
)
m1
##
## Call:
## randomForest(formula = ViolentCrimesPerPop ~ ., data = train, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 39
##
## Mean of squared residuals: 12048.63
## % Var explained: 96.43
plot(m1)
The initial model has 500 trees with 39 variables tried at each split, explaining 96.43% of the variation with a mean of squared residuals of 12048.63. Plotting the model shows the error rate stabilising at around 150 trees, though it continues to decrease slowly up to around 400 trees. The lowest error is reached at 414 trees, giving an RMSE of about 109 violent crimes per 100k population.
# number of trees with lowest MSE
which.min(m1$mse)
## [1] 414
# RMSE of this optimal random forest
sqrt(m1$mse[which.min(m1$mse)])
## [1] 109.0252
To tune the random forest model, I start with mtry = 5 and increase it by a factor of 1.5 until the OOB error stops improving by at least 1%.
Since NAs are not permitted in the predictors, I drop all observations with missing values for now.
# omit observations with a missing value
train2 <- (na.omit(train))
test2 <- (na.omit(test))
# names of features
features <- setdiff(names(train2), "ViolentCrimesPerPop")
set.seed(3455)
m2 <- tuneRF(
x = train2[features],
y = train2$ViolentCrimesPerPop,
ntreeTry = 500,
mtryStart = 5,
stepFactor = 1.5,
improve = 0.01,
trace = FALSE # to not show real-time progress
)
## -0.1010102 0.01
## 0.1437576 0.01
## 0.2143459 0.01
## 0.2561191 0.01
## 0.2942643 0.01
## 0.3129827 0.01
## 0.2720634 0.01
## 0.2649567 0.01
## 0.04718406 0.01
## 0.04800755 0.01
m2
## mtry OOBError
## 4 4 61087.188
## 5 5 55482.852
## 7 7 47506.769
## 10 10 37323.889
## 15 15 27764.527
## 22 22 19594.418
## 33 33 13461.705
## 49 49 9799.267
## 73 73 7202.885
## 109 109 6863.024
## 119 119 6533.547
plot(m2)
The tuning run shows the OOB error decreasing rapidly and stabilising from around mtry = 70. However, the lowest error is achieved when all 119 variables are available for random sampling at each split.
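Since tuneRF returns a matrix of mtry values and OOB errors, the best value can also be extracted programmatically (best_mtry is a name introduced here):
# pull the mtry with the lowest OOB error out of the tuneRF result
best_mtry <- m2[which.min(m2[, "OOBError"]), "mtry"]
best_mtry
## [1] 119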
I will try to improve on this by performing a larger grid search across several hyperparameters using the 'ranger' package. To achieve this, a grid is created and each hyperparameter combination is evaluated in a loop.
library(ranger)
## Warning: package 'ranger' was built under R version 3.5.3
##
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
##
## importance
# hyperparameter grid search
hyper_grid <- expand.grid(
mtry = seq(70, 80, by = 2),
node_size = seq(3, 9, by = 2),
sample_size = c(.55, .632, .70, .80),
OOB_RMSE = 0
)
# total number of combinations
nrow(hyper_grid)
## [1] 96
for(i in 1:nrow(hyper_grid)) {
# train model
model <- ranger(
formula = ViolentCrimesPerPop ~ .,
data = train2,
num.trees = 500,
mtry = hyper_grid$mtry[i],
min.node.size = hyper_grid$node_size[i],
sample.fraction = hyper_grid$sample_size[i],
seed = 3455
)
# add OOB error to grid
hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)
}
hyper_grid %>%
dplyr::arrange(OOB_RMSE) %>%
head(10)
## mtry node_size sample_size OOB_RMSE
## 1 80 5 0.8 88.10813
## 2 78 5 0.8 88.30338
## 3 80 7 0.8 88.40110
## 4 78 7 0.8 88.57293
## 5 78 3 0.8 88.58484
## 6 80 3 0.8 88.59815
## 7 78 9 0.8 89.05391
## 8 80 9 0.8 89.14533
## 9 74 3 0.8 89.27055
## 10 78 5 0.7 89.32704
The top 10 performing models have OOB RMSEs of around 88-89 violent crimes per 100k population, with mtry values clustered around 70-80 variables.
Hence, for the selected model we pick mtry = 80, node size = 5 and sample size = 0.8.
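As a hedged aside, the same combination can be fitted directly with ranger, which exposes the tuned node size and sample fraction explicitly (rf_ranger is a name introduced here; per the grid above its OOB RMSE is about 88):
# refit the winning hyperparameter combination with ranger
rf_ranger <- ranger(
  formula = ViolentCrimesPerPop ~ .,
  data = train2,
  num.trees = 500,
  mtry = 80,
  min.node.size = 5,
  sample.fraction = 0.8,
  seed = 3455
)
sqrt(rf_ranger$prediction.error) # OOB RMSE, roughly 88 per the grid search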
#selected model
set.seed(3455)
rf_crime <- randomForest(
formula = ViolentCrimesPerPop~.,
data = train,
ntree = 100,
mtry = 80,
importance=TRUE,
na.action=na.omit
)
rf_crime
##
## Call:
## randomForest(formula = ViolentCrimesPerPop ~ ., data = train, ntree = 100, mtry = 80, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 80
##
## Mean of squared residuals: 7470.607
## % Var explained: 97.78
plot(rf_crime)
The selected model has 100 trees with 80 variables tried at each split, explaining 97.78% of the variation with a mean of squared residuals of 7470.607.
# Variable Importance
varImpPlot(rf_crime,
sort = T,
n.var = 10,
type = 2,
cex=0.6,
main = "Top 10 - Variable Importance")
varUsed(rf_crime)
## [1] 284 268 412 427 266 245 308 297 275 289 225 89 207 250
## [15] 446 311 311 352 354 195 170 215 303 331 357 355 306 259
## [29] 283 277 278 275 324 296 385 361 260 266 299 345 283 271
## [43] 224 305 325 307 381 364 344 370 418 236 318 306 293 304
## [57] 225 187 159 177 277 254 274 259 219 254 292 323 291 317
## [71] 73 314 346 312 364 348 280 271 359 187 153 155 261 189
## [85] 157 143 261 170 333 246 364 200 129 186 342 327 316 361
## [99] 341 378 303 90 183 322 489 1083 563 2710 448 3333 288 542
## [113] 301 363 263 472 279 393 375
According to the model, the variables with the most impact are, in descending order: the number of assaults per 100k population, the number of robberies per 100k population, the number of assaults, and the number of burglaries per 100k population.
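The same ranking can be read off programmatically from the fitted model (a sketch, output omitted; imp is a name introduced here):
# node-impurity importance (type = 2, as in the plot), sorted descending
imp <- importance(rf_crime, type = 2)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)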
We can now use the preferred model to make predictions. The prediction is made first on the training data and then on the test data to assess the model's goodness of fit.
# predict train data
p1 <- predict(rf_crime, train2)
residuals1 = p1 - train2$ViolentCrimesPerPop
#MAE and RMSE for train data
MAE1 = mean(abs(residuals1))
MAE1
## [1] 13.78192
RMSE1 = sqrt(mean((residuals1)^2))
RMSE1
## [1] 34.71283
#predict test data
p2 <- predict(rf_crime, test2)
residuals2 = p2 - test2$ViolentCrimesPerPop
#MAE and RMSE for test data
MAE2 = mean(abs(residuals2))
MAE2
## [1] 41.20281
RMSE2 = sqrt(mean((residuals2)^2))
RMSE2
## [1] 121.5006
The RMSE is 34.7 for predictions on the training data and 121.5 on the test data; the much larger test error suggests the model fits the training data considerably more closely than unseen data.
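For reuse, the two metric computations above can be wrapped in a small helper (a sketch; rf_metrics is a name introduced here):
# compute MAE and RMSE for any fitted model and data frame
rf_metrics <- function(model, data, target = "ViolentCrimesPerPop") {
  res <- predict(model, data) - data[[target]]
  c(MAE = mean(abs(res)), RMSE = sqrt(mean(res^2)))
}
rf_metrics(rf_crime, train2) # MAE 13.8, RMSE 34.7
rf_metrics(rf_crime, test2)  # MAE 41.2, RMSE 121.5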