Violent crime is a serious social problem that affects the social order and economic development of a community. However, almost all law enforcement and crime prevention agencies are consistently constrained by limited personnel, resources, and budgetary allocations. To manage violent crime under these perennial constraints, it is prudent to plan resource allocation in advance by predicting violent crime patterns for a given community. I propose to predict violent crimes for a specified area through regression using the provided crime dataset. I will fit a random forest regression model and assess its goodness of fit.
Predicting violent crime patterns is an indispensable tool for law enforcement agencies. It facilitates mapping out crime-prone regions so that limited resources can be distributed effectively, and it provides valuable information for optimizing available resources ahead of time. According to the FBI [1], violent crimes involve the use of force or the threat of force, and include rape, murder and non-negligent manslaughter, robbery, and aggravated assault. In 2013, 1,163,146 violent crimes were reported, an average of 367.9 per 100k inhabitants, which translates to a national average of one violent crime every 27.1 seconds. It is with these considerations that I propose to predict the number of violent crimes per 100k population using regression techniques.
The provided crime dataset contains information on crimes in specific communities across the USA, combining socioeconomic and law enforcement data.
An overview of the data
library(dplyr) # used for data manipulation and joining data
library(ggplot2) # used for plotting
crime_raw <- read.csv("crimedata.csv", na.strings = "?") # reading the datafile
dim(crime_raw)
## [1] 2215 147
There are 2215 observations of 147 variables. Four attributes are non-predictive, carrying information about the community name, county code, community code, and fold. The target variable is the number of violent crimes per 100k population. Since it is continuous, it can be visualised by plotting its histogram.
p1 <- ggplot(crime_raw, aes(ViolentCrimesPerPop)) +
  geom_histogram(binwidth = 80, fill = "darkgreen") +
  xlab("Violent Crimes per 100k Population")
p1
## Warning: Removed 221 rows containing non-finite values (stat_bin).
The distribution is right-skewed, so a transformation may be required to treat the skewness.
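As a hedged illustration (not applied in the modelling below), a log1p transform is one common way to compress such a right tail; p1_log is a name introduced here:
# log1p handles the zero counts that a plain log would break on
p1_log <- ggplot(crime_raw, aes(log1p(ViolentCrimesPerPop))) +
  geom_histogram(bins = 30, fill = "darkgreen") +
  xlab("log(1 + Violent Crimes per 100k Population)")
p1_log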
First we check for duplicates among the city (communityname) names. If the number of unique names equals the number of observations in the crime dataset, then there are no duplicates.
# find number of unique city names
length(unique(crime_raw$communityname)) #selecting unique city name values
## [1] 2018
There are 2018 unique values, which is lower than the 2215 observations in the crime dataset, so there are duplicates among the city names. Now we identify which city names are duplicated.
# name_freq lists all community names alphabetically with their frequency
# (renamed from 'table' to avoid shadowing the base function)
name_freq <- data.frame(table(crime_raw$communityname))
##name_freq
# subset for the community names that appear more than once
duplicates <- name_freq[name_freq$Freq != 1, ]
##duplicates
dim(duplicates)
## [1] 144 2
There are 144 duplicated city names. Hence the city name alone is not a unique identifier.
Now we subset the crime data, keeping only the duplicated city names, to investigate further.
#subset the data for duplicate city names
crime_dups <- crime_raw[crime_raw$communityname %in% duplicates$Var1,]
#arranging the subset in alphabetical order
crime_dups <- arrange(crime_dups, communityname)
##crime_dups
dim(crime_dups)
## [1] 341 147
There are 341 observations with duplicated city names. We can see that the duplication arises because multiple states can have a city with the same name.
We can create a unique identifier by considering both the city name and the state, which can be done by concatenating the two.
#create a new variable by concatenating city name and state
crime_raw <- crime_raw %>%
mutate(citystate = paste(communityname, state, sep = " "))
There are 2215 unique values when considering both community name and state. Since this equals the number of observations in the crime data, there are no duplicates.
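A one-line check of this claim (per the text it returns 2215, matching nrow(crime_raw)):
# count distinct city/state identifiers; equality with nrow means no duplicates
length(unique(crime_raw$citystate))
## [1] 2215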
Let's look at the variable attributes using the str function.
str(crime_raw)
## 'data.frame': 2215 obs. of 148 variables:
## $ communityname : Factor w/ 2018 levels "~","Aberdeencity",..: 152 1035 1781 666 142 1700 1272 42 567 1860 ...
## $ state : Factor w/ 48 levels "AK","AL","AR",..: 29 36 35 32 23 24 19 15 27 41 ...
## $ countyCode : int 39 45 NA 35 7 NA 21 NA 17 NA ...
## $ communityCode : int 5320 47616 NA 29443 5068 NA 50250 NA 25700 NA ...
## $ fold : int 1 1 1 1 1 1 1 1 1 1 ...
## $ population : int 11980 23123 29344 16656 11245 140494 28700 59459 74111 103590 ...
## $ householdsize : num 3.1 2.82 2.43 2.4 2.76 2.45 2.6 2.45 2.46 2.62 ...
## $ racepctblack : num 1.37 0.8 0.74 1.7 0.53 ...
## $ racePctWhite : num 91.8 95.6 94.3 97.3 89.2 ...
## $ racePctAsian : num 6.5 3.44 3.43 0.5 1.17 0.9 1.47 0.4 1.25 0.92 ...
## $ racePctHisp : num 1.88 0.85 2.35 0.7 0.52 ...
## $ agePct12t21 : num 12.5 11 11.4 12.6 24.5 ...
## $ agePct12t29 : num 21.4 21.3 25.9 25.2 40.5 ...
## $ agePct16t24 : num 10.9 10.5 11 12.2 28.7 ...
## $ agePct65up : num 11.3 17.2 10.3 17.6 12.6 ...
## $ numbUrban : int 11980 23123 29344 0 0 140494 28700 59449 74115 103590 ...
## $ pctUrban : num 100 100 100 0 0 100 100 100 100 100 ...
## $ medIncome : int 75122 47917 35669 20580 17390 21577 42805 23221 25326 17852 ...
## $ pctWWage : num 89.2 79 82 68.2 69.3 ...
## $ pctWFarmSelf : num 1.55 1.11 1.15 0.24 0.55 1 0.39 0.67 2.93 0.86 ...
## $ pctWInvInc : num 70.2 64.1 55.7 39 42.8 ...
## $ pctWSocSec : num 23.6 35.5 22.2 39.5 32.2 ...
## $ pctWPubAsst : num 1.03 2.75 2.94 11.71 11.21 ...
## $ pctWRetire : num 18.4 22.9 14.6 18.3 14.4 ...
## $ medFamInc : int 79584 55323 42112 26501 24018 27705 50394 28901 34269 24058 ...
## $ perCapInc : int 29711 20148 16946 10810 8483 11878 18193 12161 13554 10195 ...
## $ whitePerCap : int 30233 20191 17103 10909 9009 12029 18276 12599 13727 12126 ...
## $ blackPerCap : int 13600 18137 16644 9984 887 7382 17342 9820 8852 5715 ...
## $ indianPerCap : int 5725 0 21606 4941 4425 10264 21482 6634 5344 11313 ...
## $ AsianPerCap : int 27101 20074 15528 3541 3352 10753 12639 8802 8011 5770 ...
## $ OtherPerCap : int 5115 5250 5954 2451 3000 7192 21852 7428 5332 7320 ...
## $ HispPerCap : int 22838 12222 8405 4391 1328 8104 22594 6187 5174 6984 ...
## $ NumUnderPov : int 227 885 1389 2831 2855 23223 1126 10320 9603 27767 ...
## $ PctPopUnderPov : num 1.96 3.98 4.75 17.23 29.99 ...
## $ PctLess9thGrade : num 5.81 5.61 2.8 11.05 12.15 ...
## $ PctNotHSGrad : num 9.9 13.72 9.09 33.68 23.06 ...
## $ PctBSorMore : num 48.2 29.9 30.1 10.8 25.3 ...
## $ PctUnemployed : num 2.7 2.43 4.01 9.86 9.08 5.72 4.85 8.19 4.18 8.39 ...
## $ PctEmploy : num 64.5 62 69.8 54.7 52.4 ...
## $ PctEmplManu : num 14.65 12.26 15.95 31.22 6.89 ...
## $ PctEmplProfServ : num 28.8 29.3 21.5 27.4 36.5 ...
## $ PctOccupManu : num 5.49 6.39 8.79 26.76 10.94 ...
## $ PctOccupMgmtProf : num 50.7 37.6 32.5 22.7 27.8 ...
## $ MalePctDivorce : num 3.67 4.23 10.1 10.98 7.51 ...
## $ MalePctNevMarr : num 26.4 28 25.8 28.1 50.7 ...
## $ FemalePctDiv : num 5.22 6.45 14.76 14.47 11.64 ...
## $ TotalPctDiv : num 4.47 5.42 12.55 12.91 9.73 ...
## $ PersPerFam : num 3.22 3.11 2.95 2.98 2.98 2.89 3.14 2.95 3 3.11 ...
## $ PctFam2Par : num 91.4 86.9 78.5 64 58.6 ...
## $ PctKids2Par : num 90.2 85.3 78.8 62.4 55.2 ...
## $ PctYoungKids2Par : num 95.8 96.8 92.4 65.4 66.5 ...
## $ PctTeen2Par : num 95.8 86.5 75.7 67.4 79.2 ...
## $ PctWorkMomYoungKids : num 44.6 51.1 66.1 59.6 61.2 ...
## $ PctWorkMom : num 58.9 62.4 74.2 70.3 68.9 ...
## $ NumKidsBornNeverMar : int 31 43 164 561 402 1511 263 2368 751 3537 ...
## $ PctKidsBornNeverMar : num 0.36 0.24 0.88 3.84 4.7 1.58 1.18 4.66 1.64 4.71 ...
## $ NumImmig : int 1277 1920 1468 339 196 2091 2637 517 1474 4793 ...
## $ PctImmigRecent : num 8.69 5.21 16.42 13.86 46.94 ...
## $ PctImmigRec5 : num 13 8.65 23.98 13.86 56.12 ...
## $ PctImmigRec8 : num 21 13.3 32.1 15.3 67.9 ...
## $ PctImmigRec10 : num 30.9 22.5 35.6 15.3 69.9 ...
## $ PctRecentImmig : num 0.93 0.43 0.82 0.28 0.82 0.32 1.05 0.11 0.47 0.72 ...
## $ PctRecImmig5 : num 1.39 0.72 1.2 0.28 0.98 0.45 1.49 0.2 0.67 1.07 ...
## $ PctRecImmig8 : num 2.24 1.11 1.61 0.31 1.18 0.57 2.2 0.25 0.93 1.63 ...
## $ PctRecImmig10 : num 3.3 1.87 1.78 0.31 1.22 0.68 2.55 0.29 1.07 2.31 ...
## $ PctSpeakEnglOnly : num 85.7 87.8 93.1 95 94.6 ...
## $ PctNotSpeakEnglWell : num 1.37 1.81 1.14 0.56 0.39 0.6 0.6 0.28 0.43 2.51 ...
## $ PctLargHouseFam : num 4.81 4.25 2.97 3.93 5.23 3.08 5.08 3.85 2.59 6.7 ...
## $ PctLargHouseOccup : num 4.17 3.34 2.05 2.56 3.11 1.92 3.46 2.55 1.54 4.1 ...
## $ PersPerOccupHous : num 2.99 2.7 2.42 2.37 2.35 2.28 2.55 2.36 2.32 2.45 ...
## $ PersPerOwnOccHous : num 3 2.83 2.69 2.51 2.55 2.37 2.89 2.42 2.77 2.47 ...
## $ PersPerRentOccHous : num 2.84 1.96 2.06 2.2 2.12 2.16 2.09 2.27 1.91 2.44 ...
## $ PctPersOwnOccup : num 91.5 89 64.2 58.2 58.1 ...
## $ PctPersDenseHous : num 0.39 1.01 2.03 1.21 2.94 2.11 1.47 1.9 1.67 6.14 ...
## $ PctHousLess3BR : num 11.1 23.6 47.5 45.7 55.6 ...
## $ MedNumBR : int 3 3 3 3 2 2 3 2 2 2 ...
## $ HousVacant : int 64 240 544 669 333 5119 566 2051 1562 5606 ...
## $ PctHousOccup : num 98.4 97.2 95.7 91.2 92.5 ...
## $ PctHousOwnOcc : num 91 84.9 57.8 54.9 53.6 ...
## $ PctVacantBoarded : num 3.12 0 0.92 2.54 3.9 2.09 1.41 6.39 0.45 5.64 ...
## $ PctVacMore6Mos : num 37.5 18.33 7.54 57.85 42.64 ...
## $ MedYrHousBuilt : int 1959 1958 1976 1939 1958 1966 1956 1954 1971 1960 ...
## $ PctHousNoPhone : num 0 0.31 1.55 7 7.45 ...
## $ PctWOFullPlumb : num 0.28 0.14 0.12 0.87 0.82 0.31 0.28 0.49 0.19 0.33 ...
## $ OwnOccLowQuart : int 215900 136300 74700 36400 30600 37700 155100 26300 54500 28600 ...
## $ OwnOccMedVal : int 262600 164200 90400 49600 43200 53900 179000 37000 70300 43100 ...
## $ OwnOccHiQuart : int 326900 199900 112000 66500 59500 73100 215500 52400 93700 67400 ...
## $ OwnOccQrange : int 111000 63600 37300 30100 28900 35400 60400 26100 39200 38800 ...
## $ RentLowQ : int 685 467 370 195 202 215 463 186 241 192 ...
## $ RentMedian : int 1001 560 428 250 283 280 669 253 321 281 ...
## $ RentHighQ : int 1001 672 520 309 362 349 824 325 387 369 ...
## $ RentQrange : int 316 205 150 114 160 134 361 139 146 177 ...
## $ MedRent : int 1001 627 484 333 332 340 736 338 355 353 ...
## $ MedRentPctHousInc : num 23.8 27.6 24.1 28.7 32.2 26.4 24.4 26.3 25.2 29.6 ...
## $ MedOwnCostPctInc : num 21.1 20.7 21.7 20.6 23.2 17.3 20.8 15.1 20.7 19.4 ...
## $ MedOwnCostPctIncNoMtg: num 14 12.5 11.6 14.5 12.9 11.7 12.5 12.2 12.8 13 ...
## $ NumInShelters : int 11 0 16 0 2 327 0 21 125 43 ...
## $ NumStreet : int 0 0 0 0 0 4 0 0 15 4 ...
## $ PctForeignBorn : num 10.66 8.3 5 2.04 1.74 ...
## [list output truncated]
Most of the variables have been read in as factors and integers and need to be reclassified. We redefine 'fold', 'countyCode' and 'communityCode' as factors, drop the helper 'citystate' column, and convert columns 6 onwards to numeric.
# classify the identifier columns 'fold', 'countyCode' and 'communityCode' as factors
crime_raw$fold = as.factor(crime_raw$fold)
crime_raw$countyCode = as.factor(crime_raw$countyCode)
crime_raw$communityCode = as.factor(crime_raw$communityCode)
# remove the citystate helper variable (column 148)
crime_raw <- select(crime_raw,-c(148))
# reclassify columns 6 onwards: first as character, then as numeric
crime_raw[,6:ncol(crime_raw)] = sapply(crime_raw[,6:ncol(crime_raw)], as.character)
crime_raw[,6:ncol(crime_raw)] = sapply(crime_raw[,6:ncol(crime_raw)], as.numeric)
str(crime_raw)
## 'data.frame': 2215 obs. of 147 variables:
## $ communityname : Factor w/ 2018 levels "~","Aberdeencity",..: 152 1035 1781 666 142 1700 1272 42 567 1860 ...
## $ state : Factor w/ 48 levels "AK","AL","AR",..: 29 36 35 32 23 24 19 15 27 41 ...
## $ countyCode : Factor w/ 114 levels "1","3","5","7",..: 20 23 NA 18 4 NA 11 NA 9 NA ...
## $ communityCode : Factor w/ 959 levels "70","100","440",..: 58 491 NA 297 52 NA 528 NA 261 NA ...
## $ fold : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ population : num 11980 23123 29344 16656 11245 ...
## $ householdsize : num 3.1 2.82 2.43 2.4 2.76 2.45 2.6 2.45 2.46 2.62 ...
## $ racepctblack : num 1.37 0.8 0.74 1.7 0.53 ...
## $ racePctWhite : num 91.8 95.6 94.3 97.3 89.2 ...
## $ racePctAsian : num 6.5 3.44 3.43 0.5 1.17 0.9 1.47 0.4 1.25 0.92 ...
## $ racePctHisp : num 1.88 0.85 2.35 0.7 0.52 ...
## $ agePct12t21 : num 12.5 11 11.4 12.6 24.5 ...
## $ agePct12t29 : num 21.4 21.3 25.9 25.2 40.5 ...
## $ agePct16t24 : num 10.9 10.5 11 12.2 28.7 ...
## $ agePct65up : num 11.3 17.2 10.3 17.6 12.6 ...
## $ numbUrban : num 11980 23123 29344 0 0 ...
## $ pctUrban : num 100 100 100 0 0 100 100 100 100 100 ...
## $ medIncome : num 75122 47917 35669 20580 17390 ...
## $ pctWWage : num 89.2 79 82 68.2 69.3 ...
## $ pctWFarmSelf : num 1.55 1.11 1.15 0.24 0.55 1 0.39 0.67 2.93 0.86 ...
## $ pctWInvInc : num 70.2 64.1 55.7 39 42.8 ...
## $ pctWSocSec : num 23.6 35.5 22.2 39.5 32.2 ...
## $ pctWPubAsst : num 1.03 2.75 2.94 11.71 11.21 ...
## $ pctWRetire : num 18.4 22.9 14.6 18.3 14.4 ...
## $ medFamInc : num 79584 55323 42112 26501 24018 ...
## $ perCapInc : num 29711 20148 16946 10810 8483 ...
## $ whitePerCap : num 30233 20191 17103 10909 9009 ...
## $ blackPerCap : num 13600 18137 16644 9984 887 ...
## $ indianPerCap : num 5725 0 21606 4941 4425 ...
## $ AsianPerCap : num 27101 20074 15528 3541 3352 ...
## $ OtherPerCap : num 5115 5250 5954 2451 3000 ...
## $ HispPerCap : num 22838 12222 8405 4391 1328 ...
## $ NumUnderPov : num 227 885 1389 2831 2855 ...
## $ PctPopUnderPov : num 1.96 3.98 4.75 17.23 29.99 ...
## $ PctLess9thGrade : num 5.81 5.61 2.8 11.05 12.15 ...
## $ PctNotHSGrad : num 9.9 13.72 9.09 33.68 23.06 ...
## $ PctBSorMore : num 48.2 29.9 30.1 10.8 25.3 ...
## $ PctUnemployed : num 2.7 2.43 4.01 9.86 9.08 5.72 4.85 8.19 4.18 8.39 ...
## $ PctEmploy : num 64.5 62 69.8 54.7 52.4 ...
## $ PctEmplManu : num 14.65 12.26 15.95 31.22 6.89 ...
## $ PctEmplProfServ : num 28.8 29.3 21.5 27.4 36.5 ...
## $ PctOccupManu : num 5.49 6.39 8.79 26.76 10.94 ...
## $ PctOccupMgmtProf : num 50.7 37.6 32.5 22.7 27.8 ...
## $ MalePctDivorce : num 3.67 4.23 10.1 10.98 7.51 ...
## $ MalePctNevMarr : num 26.4 28 25.8 28.1 50.7 ...
## $ FemalePctDiv : num 5.22 6.45 14.76 14.47 11.64 ...
## $ TotalPctDiv : num 4.47 5.42 12.55 12.91 9.73 ...
## $ PersPerFam : num 3.22 3.11 2.95 2.98 2.98 2.89 3.14 2.95 3 3.11 ...
## $ PctFam2Par : num 91.4 86.9 78.5 64 58.6 ...
## $ PctKids2Par : num 90.2 85.3 78.8 62.4 55.2 ...
## $ PctYoungKids2Par : num 95.8 96.8 92.4 65.4 66.5 ...
## $ PctTeen2Par : num 95.8 86.5 75.7 67.4 79.2 ...
## $ PctWorkMomYoungKids : num 44.6 51.1 66.1 59.6 61.2 ...
## $ PctWorkMom : num 58.9 62.4 74.2 70.3 68.9 ...
## $ NumKidsBornNeverMar : num 31 43 164 561 402 ...
## $ PctKidsBornNeverMar : num 0.36 0.24 0.88 3.84 4.7 1.58 1.18 4.66 1.64 4.71 ...
## $ NumImmig : num 1277 1920 1468 339 196 ...
## $ PctImmigRecent : num 8.69 5.21 16.42 13.86 46.94 ...
## $ PctImmigRec5 : num 13 8.65 23.98 13.86 56.12 ...
## $ PctImmigRec8 : num 21 13.3 32.1 15.3 67.9 ...
## $ PctImmigRec10 : num 30.9 22.5 35.6 15.3 69.9 ...
## $ PctRecentImmig : num 0.93 0.43 0.82 0.28 0.82 0.32 1.05 0.11 0.47 0.72 ...
## $ PctRecImmig5 : num 1.39 0.72 1.2 0.28 0.98 0.45 1.49 0.2 0.67 1.07 ...
## $ PctRecImmig8 : num 2.24 1.11 1.61 0.31 1.18 0.57 2.2 0.25 0.93 1.63 ...
## $ PctRecImmig10 : num 3.3 1.87 1.78 0.31 1.22 0.68 2.55 0.29 1.07 2.31 ...
## $ PctSpeakEnglOnly : num 85.7 87.8 93.1 95 94.6 ...
## $ PctNotSpeakEnglWell : num 1.37 1.81 1.14 0.56 0.39 0.6 0.6 0.28 0.43 2.51 ...
## $ PctLargHouseFam : num 4.81 4.25 2.97 3.93 5.23 3.08 5.08 3.85 2.59 6.7 ...
## $ PctLargHouseOccup : num 4.17 3.34 2.05 2.56 3.11 1.92 3.46 2.55 1.54 4.1 ...
## $ PersPerOccupHous : num 2.99 2.7 2.42 2.37 2.35 2.28 2.55 2.36 2.32 2.45 ...
## $ PersPerOwnOccHous : num 3 2.83 2.69 2.51 2.55 2.37 2.89 2.42 2.77 2.47 ...
## $ PersPerRentOccHous : num 2.84 1.96 2.06 2.2 2.12 2.16 2.09 2.27 1.91 2.44 ...
## $ PctPersOwnOccup : num 91.5 89 64.2 58.2 58.1 ...
## $ PctPersDenseHous : num 0.39 1.01 2.03 1.21 2.94 2.11 1.47 1.9 1.67 6.14 ...
## $ PctHousLess3BR : num 11.1 23.6 47.5 45.7 55.6 ...
## $ MedNumBR : num 3 3 3 3 2 2 3 2 2 2 ...
## $ HousVacant : num 64 240 544 669 333 ...
## $ PctHousOccup : num 98.4 97.2 95.7 91.2 92.5 ...
## $ PctHousOwnOcc : num 91 84.9 57.8 54.9 53.6 ...
## $ PctVacantBoarded : num 3.12 0 0.92 2.54 3.9 2.09 1.41 6.39 0.45 5.64 ...
## $ PctVacMore6Mos : num 37.5 18.33 7.54 57.85 42.64 ...
## $ MedYrHousBuilt : num 1959 1958 1976 1939 1958 ...
## $ PctHousNoPhone : num 0 0.31 1.55 7 7.45 ...
## $ PctWOFullPlumb : num 0.28 0.14 0.12 0.87 0.82 0.31 0.28 0.49 0.19 0.33 ...
## $ OwnOccLowQuart : num 215900 136300 74700 36400 30600 ...
## $ OwnOccMedVal : num 262600 164200 90400 49600 43200 ...
## $ OwnOccHiQuart : num 326900 199900 112000 66500 59500 ...
## $ OwnOccQrange : num 111000 63600 37300 30100 28900 35400 60400 26100 39200 38800 ...
## $ RentLowQ : num 685 467 370 195 202 215 463 186 241 192 ...
## $ RentMedian : num 1001 560 428 250 283 ...
## $ RentHighQ : num 1001 672 520 309 362 ...
## $ RentQrange : num 316 205 150 114 160 134 361 139 146 177 ...
## $ MedRent : num 1001 627 484 333 332 ...
## $ MedRentPctHousInc : num 23.8 27.6 24.1 28.7 32.2 26.4 24.4 26.3 25.2 29.6 ...
## $ MedOwnCostPctInc : num 21.1 20.7 21.7 20.6 23.2 17.3 20.8 15.1 20.7 19.4 ...
## $ MedOwnCostPctIncNoMtg: num 14 12.5 11.6 14.5 12.9 11.7 12.5 12.2 12.8 13 ...
## $ NumInShelters : num 11 0 16 0 2 327 0 21 125 43 ...
## $ NumStreet : num 0 0 0 0 0 4 0 0 15 4 ...
## $ PctForeignBorn : num 10.66 8.3 5 2.04 1.74 ...
## [list output truncated]
The first five columns are now classified as factors and the rest as numeric.
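For reference, the same reclassification can be written more compactly with dplyr; this is a sketch assuming dplyr >= 1.0.0, which provides across() (the version loaded above may predate it):
# convert columns 6 onwards to numeric in a single step
crime_raw <- crime_raw %>%
  mutate(across(6:last_col(), ~ as.numeric(as.character(.x))))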
Let's find the total number of missing values.
# number of missing values in the dataset
sum(is.na(crime_raw))
## [1] 44592
There are 44592 missing values in the dataset. Variables with more than 50% missing data will be dropped. To drop these columns, I keep only the columns whose count of missing values is below 1103 (roughly half of the 2215 observations).
#dropping columns with more than half of the values missing
crime_new <- crime_raw %>% select_if(~sum(is.na(.)) < 1103)
##dim(crime_new)
##sum(is.na(crime_new))
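As a quick side check (a sketch, output omitted), the columns removed by this threshold can be listed directly:
# names of the columns at or above the missingness threshold
names(crime_raw)[colSums(is.na(crime_raw)) >= 1103]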
Since the number of violent crimes per 100k population is the target variable, all observations with a missing value for this attribute are dropped.
#dropping observations with missing value for target variable
crime_new <- crime_new[!is.na(crime_new$ViolentCrimesPerPop),]
dim(crime_new)
## [1] 1994 123
sum(is.na(crime_new))
## [1] 283
There are now 1994 observations of 123 variables, with 283 missing values remaining.
The remaining categorical variables 'communityname', 'state' and 'fold' are non-predictive attributes for a regression model, so they are dropped from the dataset before modelling.
#dropping non predictive attributes
crime <- select(crime_new,-c(communityname, state, fold))
dim(crime)
## [1] 1994 120
sum(is.na(crime))
## [1] 283
There are now 1994 observations of 120 variables, with 283 missing values.
The cleaned dataset still has 120 variables, which is a high number of attributes to work with. I will apply a random forest learner to predict the violent crimes per 100k population and to identify the most important features for the prediction.
# Data Partition to train and test datasets
set.seed(3455)
ind <- sample(2, nrow(crime), replace = TRUE, prob = c(0.7, 0.3))
train <- crime[ind==1,]
test <- crime[ind==2,]
# Random Forest regression model
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
set.seed(3455)
m1 <- randomForest(
formula = ViolentCrimesPerPop~.,
data = train,
importance=TRUE,
na.action=na.omit
)
m1
##
## Call:
## randomForest(formula = ViolentCrimesPerPop ~ ., data = train, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 39
##
## Mean of squared residuals: 12048.63
## % Var explained: 96.43
plot(m1)
The initial model has 500 trees with 39 variables tried at each split, explaining 96.43% of the variation with a mean of squared residuals of 12048.63. Plotting the model shows the error rate stabilising at around 150 trees, though it continues to decrease slowly up to around 400 trees. The lowest error is reached at 414 trees, giving an RMSE of about 109 violent crimes per 100k population.
# number of trees with lowest MSE
which.min(m1$mse)
## [1] 414
# RMSE of this optimal random forest
sqrt(m1$mse[which.min(m1$mse)])
## [1] 109.0252
To tune the random forest model, I start with mtry = 5 and increase it by a factor of 1.5 until the OOB error stops improving by at least 1%.
Since NAs are not permitted in the predictors, I drop all observations with missing values for now.
# omit observations with a missing value
train2 <- (na.omit(train))
test2 <- (na.omit(test))
# names of features
features <- setdiff(names(train2), "ViolentCrimesPerPop")
set.seed(3455)
m2 <- tuneRF(
x = train2[features],
y = train2$ViolentCrimesPerPop,
ntreeTry = 500,
mtryStart = 5,
stepFactor = 1.5,
improve = 0.01,
trace = FALSE # to not show real-time progress
)
## -0.1010102 0.01
## 0.1437576 0.01
## 0.2143459 0.01
## 0.2561191 0.01
## 0.2942643 0.01
## 0.3129827 0.01
## 0.2720634 0.01
## 0.2649567 0.01
## 0.04718406 0.01
## 0.04800755 0.01
m2
## mtry OOBError
## 4 4 61087.188
## 5 5 55482.852
## 7 7 47506.769
## 10 10 37323.889
## 15 15 27764.527
## 22 22 19594.418
## 33 33 13461.705
## 49 49 9799.267
## 73 73 7202.885
## 109 109 6863.024
## 119 119 6533.547
plot(m2)
The tuning run shows the OOB error decreasing rapidly and stabilising from around mtry = 70. However, the lowest error is achieved when all 119 variables are available for random sampling at each split.
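Since tuneRF returns a matrix of mtry values and OOB errors, the best value can also be extracted programmatically (best_mtry is a name introduced here):
# pull the mtry with the lowest OOB error out of the tuneRF result
best_mtry <- m2[which.min(m2[, "OOBError"]), "mtry"]
best_mtry
## [1] 119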
I will try to improve on this by performing a larger grid search across several hyperparameters using the 'ranger' package. To achieve this, a grid is created and each hyperparameter combination is evaluated in a loop.
library(ranger)
## Warning: package 'ranger' was built under R version 3.5.3
##
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
##
## importance
# hyperparameter grid search
hyper_grid <- expand.grid(
mtry = seq(70, 80, by = 2),
node_size = seq(3, 9, by = 2),
sample_size = c(.55, .632, .70, .80),
OOB_RMSE = 0
)
# total number of combinations
nrow(hyper_grid)
## [1] 96
for(i in 1:nrow(hyper_grid)) {
# train model
model <- ranger(
formula = ViolentCrimesPerPop ~ .,
data = train2,
num.trees = 500,
mtry = hyper_grid$mtry[i],
min.node.size = hyper_grid$node_size[i],
sample.fraction = hyper_grid$sample_size[i],
seed = 3455
)
# add OOB error to grid
hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)
}
hyper_grid %>%
dplyr::arrange(OOB_RMSE) %>%
head(10)
## mtry node_size sample_size OOB_RMSE
## 1 80 5 0.8 88.10813
## 2 78 5 0.8 88.30338
## 3 80 7 0.8 88.40110
## 4 78 7 0.8 88.57293
## 5 78 3 0.8 88.58484
## 6 80 3 0.8 88.59815
## 7 78 9 0.8 89.05391
## 8 80 9 0.8 89.14533
## 9 74 3 0.8 89.27055
## 10 78 5 0.7 89.32704
The top 10 performing models have OOB RMSEs of around 88-89 violent crimes per 100k population, with mtry values clustered around 70-80 variables.
Hence, for the selected model we pick mtry = 80, node size = 5 and sample size = 0.8.
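As a hedged aside, the same combination can be fitted directly with ranger, which exposes the tuned node size and sample fraction explicitly (rf_ranger is a name introduced here; per the grid above its OOB RMSE is about 88):
# refit the winning hyperparameter combination with ranger
rf_ranger <- ranger(
  formula = ViolentCrimesPerPop ~ .,
  data = train2,
  num.trees = 500,
  mtry = 80,
  min.node.size = 5,
  sample.fraction = 0.8,
  seed = 3455
)
sqrt(rf_ranger$prediction.error) # OOB RMSE, roughly 88 per the grid search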
#selected model
set.seed(3455)
rf_crime <- randomForest(
formula = ViolentCrimesPerPop~.,
data = train,
ntree = 100,
mtry = 80,
importance=TRUE,
na.action=na.omit
)
rf_crime
##
## Call:
## randomForest(formula = ViolentCrimesPerPop ~ ., data = train, ntree = 100, mtry = 80, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 80
##
## Mean of squared residuals: 7470.607
## % Var explained: 97.78
plot(rf_crime)
The selected model has 100 trees with 80 variables tried at each split, explaining 97.78% of the variation with a mean of squared residuals of 7470.607.
# Variable Importance
varImpPlot(rf_crime,
sort = T,
n.var = 10,
type = 2,
cex=0.6,
main = "Top 10 - Variable Importance")
varUsed(rf_crime)
## [1] 284 268 412 427 266 245 308 297 275 289 225 89 207 250
## [15] 446 311 311 352 354 195 170 215 303 331 357 355 306 259
## [29] 283 277 278 275 324 296 385 361 260 266 299 345 283 271
## [43] 224 305 325 307 381 364 344 370 418 236 318 306 293 304
## [57] 225 187 159 177 277 254 274 259 219 254 292 323 291 317
## [71] 73 314 346 312 364 348 280 271 359 187 153 155 261 189
## [85] 157 143 261 170 333 246 364 200 129 186 342 327 316 361
## [99] 341 378 303 90 183 322 489 1083 563 2710 448 3333 288 542
## [113] 301 363 263 472 279 393 375
According to the model, the variables with the most impact are, in descending order: the number of assaults per 100k population, the number of robberies per 100k population, the number of assaults, and the number of burglaries per 100k population.
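The same ranking can be read off programmatically from the fitted model (a sketch, output omitted; imp is a name introduced here):
# node-impurity importance (type = 2, as in the plot), sorted descending
imp <- importance(rf_crime, type = 2)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)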
We can now use the preferred model to make predictions. The prediction is made first on the training data and then on the test data to assess the model's goodness of fit.
# predict train data
p1 <- predict(rf_crime, train2)
residuals1 = p1 - train2$ViolentCrimesPerPop
#MAE and RMSE for train data
MAE1 = mean(abs(residuals1))
MAE1
## [1] 13.78192
RMSE1 = sqrt(mean((residuals1)^2))
RMSE1
## [1] 34.71283
#predict test data
p2 <- predict(rf_crime, test2)
residuals2 = p2 - test2$ViolentCrimesPerPop
#MAE and RMSE for test data
MAE2 = mean(abs(residuals2))
MAE2
## [1] 41.20281
RMSE2 = sqrt(mean((residuals2)^2))
RMSE2
## [1] 121.5006
The RMSE is 34.7 for predictions on the training data and 121.5 on the test data; the much larger test error suggests the model fits the training data considerably more closely than unseen data.
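For reuse, the two metric computations above can be wrapped in a small helper (a sketch; rf_metrics is a name introduced here):
# compute MAE and RMSE for any fitted model and data frame
rf_metrics <- function(model, data, target = "ViolentCrimesPerPop") {
  res <- predict(model, data) - data[[target]]
  c(MAE = mean(abs(res)), RMSE = sqrt(mean(res^2)))
}
rf_metrics(rf_crime, train2) # MAE 13.8, RMSE 34.7
rf_metrics(rf_crime, test2)  # MAE 41.2, RMSE 121.5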