Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(class)
library(rpart)
library(rpart.plot)
library(naniar)
library(xgboost)
library(usmap)
library(DiagrammeR)
library(earth)
library(plotly)
library(wordcloud)
library(RColorBrewer)
library(glmnet)
library(Hmisc)
library(car)

Background

For the final project, we will be working with the Communities & Crime dataset, covering communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.

Problem Statement

Every major violent crime that makes it to the news brings new notions and beliefs that tend to alter public opinion about the safety of their neighborhood and the socio-economic factors that influence the normal functioning of society. For example, with the increase in the number of race-related crimes in our society, it has become imperative to understand whether the racial profile of a region has anything to do with its crime rate. Other concerns besides race include socio-economic status, religious beliefs, and the percentage of the population composed of immigrants.

Understanding where violent crimes occur in terms of the socio-economic, environmental, and demographic characteristics of the reported regions can be crucial to deciphering why these crimes occur. These characteristics may very well help in predicting in advance where violent crimes are likely to occur, through predictive models that quantify the risk associated with a region. This would greatly help city planning by allowing police forces to be deployed effectively where the risk is highest. Determining the key drivers behind rises in violent crime will provide invaluable input for urban development and the design of policing practices.

Dataset Overview

The ‘Communities and Crime’ dataset made available by the University of California, Irvine’s Machine Learning Repository provides an excellent opportunity to test some of the preconceived notions prevalent in our society today with regard to race and crime. It can also be used to build predictive models that support better city planning and crime reduction. Although the sources in this dataset date back to 1990 and 1995, the directional insights we get from it should still hold true.

Many variables are included in the dataset so that algorithms that select or learn weights for attributes could be tested. However, clearly unrelated attributes were not included; attributes were picked if there was any plausible connection to crime (N=122), plus the attribute to be predicted (Per Capita Violent Crimes). The variables describe the community (e.g., the percentage of the population considered urban, the median family income) as well as law enforcement (e.g., the per capita number of police officers, the percentage of officers assigned to drug units).

The per capita violent crimes variable was calculated using population and the sum of the crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. There was apparently some controversy in some states concerning the counting of rapes, which resulted in missing rape values and therefore incorrect per capita violent crime values; those communities are not included in the dataset. Many of the omitted communities were from the midwestern USA.
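Based on that description and the data dictionary entry for the target (‘total number of violent crimes per 100K population’), the raw target was presumably computed along these lines, before being normalized to the 0-1 range described below:

$$\text{ViolentCrimesPerPop} = \frac{\text{murders} + \text{rapes} + \text{robberies} + \text{assaults}}{\text{population}} \times 100{,}000$$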

The data is described below based on the original values. All numeric data was normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. Attributes retain their distribution and skew (hence, for example, the population attribute has a mean value of 0.06 because most communities are small). An attribute described as ‘mean people per household’ is therefore actually the normalized (0-1) version of that value.

The normalization preserves rough ratios of values WITHIN an attribute (e.g. double the value for double the population, within the available precision), except for extreme values: all values more than 3 SD above the mean are normalized to 1.00, and all values more than 3 SD below the mean are normalized to 0.00.

However, the normalization does not preserve relationships between values BETWEEN attributes (e.g. it would not be meaningful to compare the value for whitePerCap with the value for blackPerCap for a community).
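For intuition only, here is a minimal R sketch of the kind of normalization described above; this is our own illustration of the stated rule, not the original preprocessing code:

# Illustrative only: equal-interval 0-1 normalization with +/- 3 SD clipping,
# approximating the preprocessing described above (not the original code)
normalize_attribute <- function(x) {
  mu <- mean(x, na.rm = TRUE)
  s  <- sd(x, na.rm = TRUE)
  lo <- mu - 3 * s
  hi <- mu + 3 * s
  x  <- pmin(pmax(x, lo), hi)       # values beyond 3 SD are pinned to the ends
  round((x - lo) / (hi - lo), 2)    # rescale to the 0.00-1.00 range at two-decimal precision
}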

A limitation was that the LEMAS survey was of the police departments with at least 100 officers, plus a random sample of smaller departments. For our purposes, communities not found in both census and crime datasets were omitted. Many communities are missing LEMAS data.

The target variable is ‘Violent Crimes Per Population’.

Data Dictionary

The crime dataset has 128 attributes. Below is a brief description of all attributes along with data type details -

datadict <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/data_dictionary.csv')

datadict %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="100%",height="500px")
No.  Column Description Data Type
1 state US state (by number) - not counted as predictive above, but if considered, should be considered nominal nominal
2 county numeric code for county - not predictive, and many missing values numeric
3 community numeric code for community - not predictive and many missing values numeric
4 communityname community name - not predictive - for information only string
5 fold fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive numeric
6 population population for community numeric - decimal
7 householdsize mean people per household numeric - decimal
8 racepctblack percentage of population that is african american numeric - decimal
9 racePctWhite percentage of population that is caucasian numeric - decimal
10 racePctAsian percentage of population that is of asian heritage numeric - decimal
11 racePctHisp percentage of population that is of hispanic heritage numeric - decimal
12 agePct12t21 percentage of population that is 12-21 in age numeric - decimal
13 agePct12t29 percentage of population that is 12-29 in age numeric - decimal
14 agePct16t24 percentage of population that is 16-24 in age numeric - decimal
15 agePct65up percentage of population that is 65 and over in age numeric - decimal
16 numbUrban number of people living in areas classified as urban numeric - decimal
17 pctUrban percentage of people living in areas classified as urban numeric - decimal
18 medIncome median household income numeric - decimal
19 pctWWage percentage of households with wage or salary income in 1989 numeric - decimal
20 pctWFarmSelf percentage of households with farm or self employment income in 1989 numeric - decimal
21 pctWInvInc percentage of households with investment / rent income in 1989 numeric - decimal
22 pctWSocSec percentage of households with social security income in 1989 numeric - decimal
23 pctWPubAsst percentage of households with public assistance income in 1989 numeric - decimal
24 pctWRetire percentage of households with retirement income in 1989 numeric - decimal
25 medFamInc median family income (differs from household income for non-family households) numeric - decimal
26 perCapInc per capita income numeric - decimal
27 whitePerCap per capita income for caucasians numeric - decimal
28 blackPerCap per capita income for african americans numeric - decimal
29 indianPerCap per capita income for native americans numeric - decimal
30 AsianPerCap per capita income for people with asian heritage numeric - decimal
31 OtherPerCap per capita income for people with ‘other’ heritage numeric - decimal
32 HispPerCap per capita income for people with hispanic heritage numeric - decimal
33 NumUnderPov number of people under the poverty level numeric - decimal
34 PctPopUnderPov percentage of people under the poverty level numeric - decimal
35 PctLess9thGrade percentage of people 25 and over with less than a 9th grade education numeric - decimal
36 PctNotHSGrad percentage of people 25 and over that are not high school graduates numeric - decimal
37 PctBSorMore percentage of people 25 and over with a bachelors degree or higher education numeric - decimal
38 PctUnemployed percentage of people 16 and over, in the labor force, and unemployed numeric - decimal
39 PctEmploy percentage of people 16 and over who are employed numeric - decimal
40 PctEmplManu percentage of people 16 and over who are employed in manufacturing numeric - decimal
41 PctEmplProfServ percentage of people 16 and over who are employed in professional services numeric - decimal
42 PctOccupManu percentage of people 16 and over who are employed in manufacturing numeric - decimal
43 PctOccupMgmtProf percentage of people 16 and over who are employed in management or professional occupations numeric - decimal
44 MalePctDivorce percentage of males who are divorced numeric - decimal
45 MalePctNevMarr percentage of males who have never married numeric - decimal
46 FemalePctDiv percentage of females who are divorced numeric - decimal
47 TotalPctDiv percentage of population who are divorced numeric - decimal
48 PersPerFam mean number of people per family numeric - decimal
49 PctFam2Par percentage of families (with kids) that are headed by two parents numeric - decimal
50 PctKids2Par percentage of kids in family housing with two parents numeric - decimal
51 PctYoungKids2Par percent of kids 4 and under in two parent households numeric - decimal
52 PctTeen2Par percent of kids age 12-17 in two parent households numeric - decimal
53 PctWorkMomYoungKids percentage of moms of kids 6 and under in labor force numeric - decimal
54 PctWorkMom percentage of moms of kids under 18 in labor force numeric - decimal
55 NumIlleg number of kids born to never married numeric - decimal
56 PctIlleg percentage of kids born to never married numeric - decimal
57 NumImmig total number of people known to be foreign born numeric - decimal
58 PctImmigRecent percentage of immigrants who immigrated within last 3 years numeric - decimal
59 PctImmigRec5 percentage of immigrants who immigrated within last 5 years numeric - decimal
60 PctImmigRec8 percentage of immigrants who immigrated within last 8 years numeric - decimal
61 PctImmigRec10 percentage of immigrants who immigrated within last 10 years numeric - decimal
62 PctRecentImmig percent of population who have immigrated within the last 3 years numeric - decimal
63 PctRecImmig5 percent of population who have immigrated within the last 5 years numeric - decimal
64 PctRecImmig8 percent of population who have immigrated within the last 8 years numeric - decimal
65 PctRecImmig10 percent of population who have immigrated within the last 10 years numeric - decimal
66 PctSpeakEnglOnly percent of people who speak only English numeric - decimal
67 PctNotSpeakEnglWell percent of people who do not speak English well numeric - decimal
68 PctLargHouseFam percent of family households that are large (6 or more) numeric - decimal
69 PctLargHouseOccup percent of all occupied households that are large (6 or more people) numeric - decimal
70 PersPerOccupHous mean persons per household numeric - decimal
71 PersPerOwnOccHous mean persons per owner occupied household numeric - decimal
72 PersPerRentOccHous mean persons per rental household numeric - decimal
73 PctPersOwnOccup percent of people in owner occupied households numeric - decimal
74 PctPersDenseHous percent of persons in dense housing (more than 1 person per room) numeric - decimal
75 PctHousLess3BR percent of housing units with less than 3 bedrooms numeric - decimal
76 MedNumBR median number of bedrooms numeric - decimal
77 HousVacant number of vacant households numeric - decimal
78 PctHousOccup percent of housing occupied numeric - decimal
79 PctHousOwnOcc percent of households owner occupied numeric - decimal
80 PctVacantBoarded percent of vacant housing that is boarded up numeric - decimal
81 PctVacMore6Mos percent of vacant housing that has been vacant more than 6 months numeric - decimal
82 MedYrHousBuilt median year housing units built numeric - decimal
83 PctHousNoPhone percent of occupied housing units without phone (in 1990, this was rare!) numeric - decimal
84 PctWOFullPlumb percent of housing without complete plumbing facilities numeric - decimal
85 OwnOccLowQuart owner occupied housing - lower quartile value numeric - decimal
86 OwnOccMedVal owner occupied housing - median value numeric - decimal
87 OwnOccHiQuart owner occupied housing - upper quartile value numeric - decimal
88 RentLowQ rental housing - lower quartile rent numeric - decimal
89 RentMedian rental housing - median rent (Census variable H32B from file STF1A) numeric - decimal
90 RentHighQ rental housing - upper quartile rent numeric - decimal
91 MedRent median gross rent (Census variable H43A from file STF3A - includes utilities) numeric - decimal
92 MedRentPctHousInc median gross rent as a percentage of household income numeric - decimal
93 MedOwnCostPctInc median owners cost as a percentage of household income - for owners with a mortgage numeric - decimal
94 MedOwnCostPctIncNoMtg median owners cost as a percentage of household income - for owners without a mortgage numeric - decimal
95 NumInShelters number of people in homeless shelters numeric - decimal
96 NumStreet number of homeless people counted in the street numeric - decimal
97 PctForeignBorn percent of people foreign born numeric - decimal
98 PctBornSameState percent of people born in the same state as currently living numeric - decimal
99 PctSameHouse85 percent of people living in the same house as in 1985 (5 years before) numeric - decimal
100 PctSameCity85 percent of people living in the same city as in 1985 (5 years before) numeric - decimal
101 PctSameState85 percent of people living in the same state as in 1985 (5 years before) numeric - decimal
102 LemasSwornFT number of sworn full time police officers numeric - decimal
103 LemasSwFTPerPop sworn full time police officers per 100K population numeric - decimal
104 LemasSwFTFieldOps number of sworn full time police officers in field operations (on the street as opposed to administrative etc) numeric - decimal
105 LemasSwFTFieldPerPop sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population numeric - decimal
106 LemasTotalReq total requests for police numeric - decimal
107 LemasTotReqPerPop total requests for police per 100K population numeric - decimal
108 PolicReqPerOffic total requests for police per police officer numeric - decimal
109 PolicPerPop police officers per 100K population numeric - decimal
110 RacialMatchCommPol a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar numeric - decimal
111 PctPolicWhite percent of police that are caucasian numeric - decimal
112 PctPolicBlack percent of police that are african american numeric - decimal
113 PctPolicHisp percent of police that are hispanic numeric - decimal
114 PctPolicAsian percent of police that are asian numeric - decimal
115 PctPolicMinor percent of police that are minority of any kind numeric - decimal
116 OfficAssgnDrugUnits number of officers assigned to special drug units numeric - decimal
117 NumKindsDrugsSeiz number of different kinds of drugs seized numeric - decimal
118 PolicAveOTWorked police average overtime worked numeric - decimal
119 LandArea land area in square miles numeric - decimal
120 PopDens population density in persons per square mile numeric - decimal
121 PctUsePubTrans percent of people using public transit for commuting numeric - decimal
122 PolicCars number of police cars numeric - decimal
123 PolicOperBudg police operating budget numeric - decimal
124 LemasPctPolicOnPatr percent of sworn full time police officers on patrol numeric - decimal
125 LemasGangUnitDeploy gang unit deployed numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time
126 LemasPctOfficDrugUn percent of officers assigned to drug units numeric - decimal
127 PolicBudgPerPop police operating budget per population numeric - decimal
128 ViolentCrimesPerPop total number of violent crimes per 100K population (numeric - decimal) GOAL attribute to be predicted

Supplementary Data

We also loaded a supplementary state-level dataset from the UCI Machine Learning Repository to enrich our crime dataset with state-level information -

states <- read.csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/states.csv')

names(states) <- c("state_code","state","stateName","stateENS")

states %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="80%",height="300px")
state_code state stateName stateENS
1 AL Alabama 1779775
2 AK Alaska 1785533
4 AZ Arizona 1779777
5 AR Arkansas 68085
6 CA California 1779778
8 CO Colorado 1779779
9 CT Connecticut 1779780
10 DE Delaware 1779781
11 DC District of Columbia 1702382
12 FL Florida 294478
13 GA Georgia 1705317
15 HI Hawaii 1779782
16 ID Idaho 1779783
17 IL Illinois 1779784
18 IN Indiana 448508
19 IA Iowa 1779785
20 KS Kansas 481813
21 KY Kentucky 1779786
22 LA Louisiana 1629543
23 ME Maine 1779787
24 MD Maryland 1714934
25 MA Massachusetts 606926
26 MI Michigan 1779789
27 MN Minnesota 662849
28 MS Mississippi 1779790
29 MO Missouri 1779791
30 MT Montana 767982
31 NE Nebraska 1779792
32 NV Nevada 1779793
33 NH New Hampshire 1779794
34 NJ New Jersey 1779795
35 NM New Mexico 897535
36 NY New York 1779796
37 NC North Carolina 1027616
38 ND North Dakota 1779797
39 OH Ohio 1085497
40 OK Oklahoma 1102857
41 OR Oregon 1155107
42 PA Pennsylvania 1779798
44 RI Rhode Island 1219835
45 SC South Carolina 1779799
46 SD South Dakota 1785534
47 TN Tennessee 1325873
48 TX Texas 1779801
49 UT Utah 1455989
50 VT Vermont 1779802
51 VA Virginia 1779803
53 WA Washington 1779804
54 WV West Virginia 1779805
55 WI Wisconsin 1779806
56 WY Wyoming 1779807
60 AS American Samoa 1802701
66 GU Guam 1802705
69 MP Northern Mariana Islands 1779809
72 PR Puerto Rico 1779808
74 UM U.S. Minor Outlying Islands 1878752
78 VI U.S. Virgin Islands 1802710

Dataset Preview

dataset <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/communities.data.csv')
head(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="100%",height="300px")
state county community communityname fold population householdsize racepctblack racePctWhite racePctAsian racePctHisp agePct12t21 agePct12t29 agePct16t24 agePct65up numbUrban pctUrban medIncome pctWWage pctWFarmSelf pctWInvInc pctWSocSec pctWPubAsst pctWRetire medFamInc perCapInc whitePerCap blackPerCap indianPerCap AsianPerCap OtherPerCap HispPerCap NumUnderPov PctPopUnderPov PctLess9thGrade PctNotHSGrad PctBSorMore PctUnemployed PctEmploy PctEmplManu PctEmplProfServ PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr FemalePctDiv TotalPctDiv PersPerFam PctFam2Par PctKids2Par PctYoungKids2Par PctTeen2Par PctWorkMomYoungKids PctWorkMom NumIlleg PctIlleg NumImmig PctImmigRecent PctImmigRec5 PctImmigRec8 PctImmigRec10 PctRecentImmig PctRecImmig5 PctRecImmig8 PctRecImmig10 PctSpeakEnglOnly PctNotSpeakEnglWell PctLargHouseFam PctLargHouseOccup PersPerOccupHous PersPerOwnOccHous PersPerRentOccHous PctPersOwnOccup PctPersDenseHous PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded PctVacMore6Mos MedYrHousBuilt PctHousNoPhone PctWOFullPlumb OwnOccLowQuart OwnOccMedVal OwnOccHiQuart RentLowQ RentMedian RentHighQ MedRent MedRentPctHousInc MedOwnCostPctInc MedOwnCostPctIncNoMtg NumInShelters NumStreet PctForeignBorn PctBornSameState PctSameHouse85 PctSameCity85 PctSameState85 LemasSwornFT LemasSwFTPerPop LemasSwFTFieldOps LemasSwFTFieldPerPop LemasTotalReq LemasTotReqPerPop PolicReqPerOffic PolicPerPop RacialMatchCommPol PctPolicWhite PctPolicBlack PctPolicHisp PctPolicAsian PctPolicMinor OfficAssgnDrugUnits NumKindsDrugsSeiz PolicAveOTWorked LandArea PopDens PctUsePubTrans PolicCars PolicOperBudg LemasPctPolicOnPatr LemasGangUnitDeploy LemasPctOfficDrugUn PolicBudgPerPop ViolentCrimesPerPop
8 ? ? Lakewoodcity 1 0.19 0.33 0.02 0.90 0.12 0.17 0.34 0.47 0.29 0.32 0.20 1.0 0.37 0.72 0.34 0.60 0.29 0.15 0.43 0.39 0.40 0.39 0.32 0.27 0.27 0.36 0.41 0.08 0.19 0.10 0.18 0.48 0.27 0.68 0.23 0.41 0.25 0.52 0.68 0.40 0.75 0.75 0.35 0.55 0.59 0.61 0.56 0.74 0.76 0.04 0.14 0.03 0.24 0.27 0.37 0.39 0.07 0.07 0.08 0.08 0.89 0.06 0.14 0.13 0.33 0.39 0.28 0.55 0.09 0.51 0.5 0.21 0.71 0.52 0.05 0.26 0.65 0.14 0.06 0.22 0.19 0.18 0.36 0.35 0.38 0.34 0.38 0.46 0.25 0.04 0 0.12 0.42 0.50 0.51 0.64 0.03 0.13 0.96 0.17 0.06 0.18 0.44 0.13 0.94 0.93 0.03 0.07 0.1 0.07 0.02 0.57 0.29 0.12 0.26 0.20 0.06 0.04 0.9 0.5 0.32 0.14 0.20
53 ? ? Tukwilacity 1 0.00 0.16 0.12 0.74 0.45 0.07 0.26 0.59 0.35 0.27 0.02 1.0 0.31 0.72 0.11 0.45 0.25 0.29 0.39 0.29 0.37 0.38 0.33 0.16 0.30 0.22 0.35 0.01 0.24 0.14 0.24 0.30 0.27 0.73 0.57 0.15 0.42 0.36 1.00 0.63 0.91 1.00 0.29 0.43 0.47 0.60 0.39 0.46 0.53 0.00 0.24 0.01 0.52 0.62 0.64 0.63 0.25 0.27 0.25 0.23 0.84 0.10 0.16 0.10 0.17 0.29 0.17 0.26 0.20 0.82 0.0 0.02 0.79 0.24 0.02 0.25 0.65 0.16 0.00 0.21 0.20 0.21 0.42 0.38 0.40 0.37 0.29 0.32 0.18 0.00 0 0.21 0.50 0.34 0.60 0.52 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.02 0.12 0.45 ? ? ? ? 0.00 ? 0.67
24 ? ? Aberdeentown 1 0.00 0.42 0.49 0.56 0.17 0.04 0.39 0.47 0.28 0.32 0.00 0.0 0.30 0.58 0.19 0.39 0.38 0.40 0.84 0.28 0.27 0.29 0.27 0.07 0.29 0.28 0.39 0.01 0.27 0.27 0.43 0.19 0.36 0.58 0.32 0.29 0.49 0.32 0.63 0.41 0.71 0.70 0.45 0.42 0.44 0.43 0.43 0.71 0.67 0.01 0.46 0.00 0.07 0.06 0.15 0.19 0.02 0.02 0.04 0.05 0.88 0.04 0.20 0.20 0.46 0.52 0.43 0.42 0.15 0.51 0.5 0.01 0.86 0.41 0.29 0.30 0.52 0.47 0.45 0.18 0.17 0.16 0.27 0.29 0.27 0.31 0.48 0.39 0.28 0.00 0 0.14 0.49 0.54 0.67 0.56 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.01 0.21 0.02 ? ? ? ? 0.00 ? 0.43
34 5 81440 Willingborotownship 1 0.04 0.77 1.00 0.08 0.12 0.10 0.51 0.50 0.34 0.21 0.06 1.0 0.58 0.89 0.21 0.43 0.36 0.20 0.82 0.51 0.36 0.40 0.39 0.16 0.25 0.36 0.44 0.01 0.10 0.09 0.25 0.31 0.33 0.71 0.36 0.45 0.37 0.39 0.34 0.45 0.49 0.44 0.75 0.65 0.54 0.83 0.65 0.85 0.86 0.03 0.33 0.02 0.11 0.20 0.30 0.31 0.05 0.08 0.11 0.11 0.81 0.08 0.56 0.62 0.85 0.77 1.00 0.94 0.12 0.01 0.5 0.01 0.97 0.96 0.60 0.47 0.52 0.11 0.11 0.24 0.21 0.19 0.75 0.70 0.77 0.89 0.63 0.51 0.47 0.00 0 0.19 0.30 0.73 0.64 0.65 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.02 0.39 0.28 ? ? ? ? 0.00 ? 0.12
42 95 6096 Bethlehemtownship 1 0.01 0.55 0.02 0.95 0.09 0.05 0.38 0.38 0.23 0.36 0.02 0.9 0.50 0.72 0.16 0.68 0.44 0.11 0.71 0.46 0.43 0.41 0.28 0.00 0.74 0.51 0.48 0.00 0.06 0.25 0.30 0.33 0.12 0.65 0.67 0.38 0.42 0.46 0.22 0.27 0.20 0.21 0.51 0.91 0.91 0.89 0.85 0.40 0.60 0.00 0.06 0.00 0.03 0.07 0.20 0.27 0.01 0.02 0.04 0.05 0.88 0.05 0.16 0.19 0.59 0.60 0.37 0.89 0.02 0.19 0.5 0.01 0.89 0.87 0.04 0.55 0.73 0.05 0.14 0.31 0.31 0.30 0.40 0.36 0.38 0.38 0.22 0.51 0.21 0.00 0 0.11 0.72 0.64 0.61 0.53 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.04 0.09 0.02 ? ? ? ? 0.00 ? 0.03
6 ? ? SouthPasadenacity 1 0.02 0.28 0.06 0.54 1.00 0.25 0.31 0.48 0.27 0.37 0.04 1.0 0.52 0.68 0.20 0.61 0.28 0.15 0.25 0.62 0.72 0.76 0.77 0.28 0.52 0.48 0.60 0.01 0.12 0.13 0.12 0.80 0.10 0.65 0.19 0.77 0.06 0.91 0.49 0.57 0.61 0.58 0.44 0.62 0.69 0.87 0.53 0.30 0.43 0.00 0.11 0.04 0.30 0.35 0.43 0.47 0.50 0.50 0.56 0.57 0.45 0.28 0.25 0.19 0.29 0.53 0.18 0.39 0.26 0.73 0.0 0.02 0.84 0.30 0.16 0.28 0.25 0.02 0.05 0.94 1.00 1.00 0.67 0.63 0.68 0.62 0.47 0.59 0.11 0.00 0 0.70 0.42 0.49 0.73 0.64 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.01 0.58 0.10 ? ? ? ? 0.00 ? 0.14
dim(dataset)
## [1] 1994  128

The dataset has 1994 observations and 128 variables.

Descriptive Dataset Summary

The dataset contains no duplicate records, but it does contain many missing values, represented by “?”. We need to convert these entries to NA values before beginning our analysis.

The first 5 non-predictive variables represent the name of the community, the state in which it is located, the county code, the community code, and the fold number for non-random 10 fold cross validation. A summary of the remaining variables is below:

# 
dataset[dataset == '?'] = NA

summary(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
state county community communityname fold population householdsize racepctblack racePctWhite racePctAsian racePctHisp agePct12t21 agePct12t29 agePct16t24 agePct65up numbUrban pctUrban medIncome pctWWage pctWFarmSelf pctWInvInc pctWSocSec pctWPubAsst pctWRetire medFamInc perCapInc whitePerCap blackPerCap indianPerCap AsianPerCap OtherPerCap HispPerCap NumUnderPov PctPopUnderPov PctLess9thGrade PctNotHSGrad PctBSorMore PctUnemployed PctEmploy PctEmplManu PctEmplProfServ PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr FemalePctDiv TotalPctDiv PersPerFam PctFam2Par PctKids2Par PctYoungKids2Par PctTeen2Par PctWorkMomYoungKids PctWorkMom NumIlleg PctIlleg NumImmig PctImmigRecent PctImmigRec5 PctImmigRec8 PctImmigRec10 PctRecentImmig PctRecImmig5 PctRecImmig8 PctRecImmig10 PctSpeakEnglOnly PctNotSpeakEnglWell PctLargHouseFam PctLargHouseOccup PersPerOccupHous PersPerOwnOccHous PersPerRentOccHous PctPersOwnOccup PctPersDenseHous PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded PctVacMore6Mos MedYrHousBuilt PctHousNoPhone PctWOFullPlumb OwnOccLowQuart OwnOccMedVal OwnOccHiQuart RentLowQ RentMedian RentHighQ MedRent MedRentPctHousInc MedOwnCostPctInc MedOwnCostPctIncNoMtg NumInShelters NumStreet PctForeignBorn PctBornSameState PctSameHouse85 PctSameCity85 PctSameState85 LemasSwornFT LemasSwFTPerPop LemasSwFTFieldOps LemasSwFTFieldPerPop LemasTotalReq LemasTotReqPerPop PolicReqPerOffic PolicPerPop RacialMatchCommPol PctPolicWhite PctPolicBlack PctPolicHisp PctPolicAsian PctPolicMinor OfficAssgnDrugUnits NumKindsDrugsSeiz PolicAveOTWorked LandArea PopDens PctUsePubTrans PolicCars PolicOperBudg LemasPctPolicOnPatr LemasGangUnitDeploy LemasPctOfficDrugUn PolicBudgPerPop ViolentCrimesPerPop
Min. : 1.00 Length:1994 Length:1994 Length:1994 Min. : 1.000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Length:1994 Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Min. :0.00000 Min. :0.0000 Min. :0.0000 Length:1994 Length:1994 Length:1994 Length:1994 Min. :0.00000 Length:1994 Min. :0.000
1st Qu.:12.00 Class :character Class :character Class :character 1st Qu.: 3.000 1st Qu.:0.01000 1st Qu.:0.3500 1st Qu.:0.0200 1st Qu.:0.6300 1st Qu.:0.0400 1st Qu.:0.010 1st Qu.:0.3400 1st Qu.:0.4100 1st Qu.:0.2500 1st Qu.:0.3000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.2000 1st Qu.:0.4400 1st Qu.:0.1600 1st Qu.:0.3700 1st Qu.:0.3500 1st Qu.:0.1425 1st Qu.:0.3600 1st Qu.:0.2300 1st Qu.:0.2200 1st Qu.:0.240 1st Qu.:0.1725 1st Qu.:0.1100 1st Qu.:0.1900 Class :character 1st Qu.:0.2600 1st Qu.:0.01000 1st Qu.:0.110 1st Qu.:0.1600 1st Qu.:0.2300 1st Qu.:0.2100 1st Qu.:0.2200 1st Qu.:0.3800 1st Qu.:0.2500 1st Qu.:0.3200 1st Qu.:0.2400 1st Qu.:0.3100 1st Qu.:0.3300 1st Qu.:0.3100 1st Qu.:0.3600 1st Qu.:0.3600 1st Qu.:0.4000 1st Qu.:0.4900 1st Qu.:0.4900 1st Qu.:0.530 1st Qu.:0.4800 1st Qu.:0.3900 1st Qu.:0.4200 1st Qu.:0.00000 1st Qu.:0.09 1st Qu.:0.00000 1st Qu.:0.1600 1st Qu.:0.2000 1st Qu.:0.2500 1st Qu.:0.2800 1st Qu.:0.0300 1st Qu.:0.0300 1st Qu.:0.0300 1st Qu.:0.0300 1st Qu.:0.7300 1st Qu.:0.0300 1st Qu.:0.1500 1st Qu.:0.1400 1st Qu.:0.3400 1st Qu.:0.3900 1st Qu.:0.2700 1st Qu.:0.4400 1st Qu.:0.0600 1st Qu.:0.4000 1st Qu.:0.0000 1st Qu.:0.01000 1st Qu.:0.6300 1st Qu.:0.4300 1st Qu.:0.0600 1st Qu.:0.2900 1st Qu.:0.3500 1st Qu.:0.0600 1st Qu.:0.1000 1st Qu.:0.0900 1st Qu.:0.0900 1st Qu.:0.0900 1st Qu.:0.1700 1st Qu.:0.2000 1st Qu.:0.220 1st Qu.:0.2100 1st Qu.:0.3700 1st Qu.:0.3200 1st Qu.:0.2500 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0600 1st Qu.:0.4700 1st Qu.:0.4200 1st Qu.:0.5200 1st Qu.:0.5600 Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.:0.02000 1st Qu.:0.1000 1st Qu.:0.0200 Class :character Class :character Class :character Class :character 1st Qu.:0.00000 Class :character 1st Qu.:0.070
Median :34.00 Mode :character Mode :character Mode :character Median : 5.000 Median :0.02000 Median :0.4400 Median :0.0600 Median :0.8500 Median :0.0700 Median :0.040 Median :0.4000 Median :0.4800 Median :0.2900 Median :0.4200 Median :0.03000 Median :1.0000 Median :0.3200 Median :0.5600 Median :0.2300 Median :0.4800 Median :0.4750 Median :0.2600 Median :0.4700 Median :0.3300 Median :0.3000 Median :0.320 Median :0.2500 Median :0.1700 Median :0.2800 Mode :character Median :0.3450 Median :0.02000 Median :0.250 Median :0.2700 Median :0.3600 Median :0.3100 Median :0.3200 Median :0.5100 Median :0.3700 Median :0.4100 Median :0.3700 Median :0.4000 Median :0.4700 Median :0.4000 Median :0.5000 Median :0.5000 Median :0.4700 Median :0.6300 Median :0.6400 Median :0.700 Median :0.6100 Median :0.5100 Median :0.5400 Median :0.01000 Median :0.17 Median :0.01000 Median :0.2900 Median :0.3400 Median :0.3900 Median :0.4300 Median :0.0900 Median :0.0800 Median :0.0900 Median :0.0900 Median :0.8700 Median :0.0600 Median :0.2000 Median :0.1900 Median :0.4400 Median :0.4800 Median :0.3600 Median :0.5600 Median :0.1100 Median :0.5100 Median :0.5000 Median :0.03000 Median :0.7700 Median :0.5400 Median :0.1300 Median :0.4200 Median :0.5200 Median :0.1850 Median :0.1900 Median :0.1800 Median :0.1700 Median :0.1800 Median :0.3100 Median :0.3300 Median :0.370 Median :0.3400 Median :0.4800 Median :0.4500 Median :0.3700 Median :0.00000 Median :0.00000 Median :0.1300 Median :0.6300 Median :0.5400 Median :0.6700 Median :0.7000 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median :0.04000 Median :0.1700 Median :0.0700 Mode :character Mode :character Mode :character Mode :character Median :0.00000 Mode :character Median :0.150
Mean :28.68 NA NA NA Mean : 5.494 Mean :0.05759 Mean :0.4634 Mean :0.1796 Mean :0.7537 Mean :0.1537 Mean :0.144 Mean :0.4242 Mean :0.4939 Mean :0.3363 Mean :0.4232 Mean :0.06407 Mean :0.6963 Mean :0.3611 Mean :0.5582 Mean :0.2916 Mean :0.4957 Mean :0.4711 Mean :0.3178 Mean :0.4792 Mean :0.3757 Mean :0.3503 Mean :0.368 Mean :0.2911 Mean :0.2035 Mean :0.3224 NA Mean :0.3863 Mean :0.05551 Mean :0.303 Mean :0.3158 Mean :0.3833 Mean :0.3617 Mean :0.3635 Mean :0.5011 Mean :0.3964 Mean :0.4406 Mean :0.3912 Mean :0.4413 Mean :0.4612 Mean :0.4345 Mean :0.4876 Mean :0.4943 Mean :0.4877 Mean :0.6109 Mean :0.6207 Mean :0.664 Mean :0.5829 Mean :0.5014 Mean :0.5267 Mean :0.03629 Mean :0.25 Mean :0.03006 Mean :0.3202 Mean :0.3606 Mean :0.3991 Mean :0.4279 Mean :0.1814 Mean :0.1821 Mean :0.1848 Mean :0.1829 Mean :0.7859 Mean :0.1506 Mean :0.2676 Mean :0.2519 Mean :0.4621 Mean :0.4944 Mean :0.4041 Mean :0.5626 Mean :0.1863 Mean :0.4952 Mean :0.3147 Mean :0.07682 Mean :0.7195 Mean :0.5487 Mean :0.2045 Mean :0.4333 Mean :0.4942 Mean :0.2645 Mean :0.2431 Mean :0.2647 Mean :0.2635 Mean :0.2689 Mean :0.3464 Mean :0.3725 Mean :0.423 Mean :0.3841 Mean :0.4901 Mean :0.4498 Mean :0.4038 Mean :0.02944 Mean :0.02278 Mean :0.2156 Mean :0.6089 Mean :0.5351 Mean :0.6264 Mean :0.6515 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Mean :0.06523 Mean :0.2329 Mean :0.1617 NA NA NA NA Mean :0.09405 NA Mean :0.238
3rd Qu.:42.00 NA NA NA 3rd Qu.: 8.000 3rd Qu.:0.05000 3rd Qu.:0.5400 3rd Qu.:0.2300 3rd Qu.:0.9400 3rd Qu.:0.1700 3rd Qu.:0.160 3rd Qu.:0.4700 3rd Qu.:0.5400 3rd Qu.:0.3600 3rd Qu.:0.5300 3rd Qu.:0.07000 3rd Qu.:1.0000 3rd Qu.:0.4900 3rd Qu.:0.6900 3rd Qu.:0.3700 3rd Qu.:0.6200 3rd Qu.:0.5800 3rd Qu.:0.4400 3rd Qu.:0.5800 3rd Qu.:0.4800 3rd Qu.:0.4300 3rd Qu.:0.440 3rd Qu.:0.3800 3rd Qu.:0.2500 3rd Qu.:0.4000 NA 3rd Qu.:0.4800 3rd Qu.:0.05000 3rd Qu.:0.450 3rd Qu.:0.4200 3rd Qu.:0.5100 3rd Qu.:0.4600 3rd Qu.:0.4800 3rd Qu.:0.6275 3rd Qu.:0.5200 3rd Qu.:0.5300 3rd Qu.:0.5100 3rd Qu.:0.5400 3rd Qu.:0.5900 3rd Qu.:0.5000 3rd Qu.:0.6200 3rd Qu.:0.6300 3rd Qu.:0.5600 3rd Qu.:0.7600 3rd Qu.:0.7800 3rd Qu.:0.840 3rd Qu.:0.7200 3rd Qu.:0.6200 3rd Qu.:0.6500 3rd Qu.:0.02000 3rd Qu.:0.32 3rd Qu.:0.02000 3rd Qu.:0.4300 3rd Qu.:0.4800 3rd Qu.:0.5300 3rd Qu.:0.5600 3rd Qu.:0.2300 3rd Qu.:0.2300 3rd Qu.:0.2300 3rd Qu.:0.2300 3rd Qu.:0.9400 3rd Qu.:0.1600 3rd Qu.:0.3100 3rd Qu.:0.2900 3rd Qu.:0.5500 3rd Qu.:0.5800 3rd Qu.:0.4900 3rd Qu.:0.7000 3rd Qu.:0.2200 3rd Qu.:0.6000 3rd Qu.:0.5000 3rd Qu.:0.07000 3rd Qu.:0.8600 3rd Qu.:0.6700 3rd Qu.:0.2700 3rd Qu.:0.5600 3rd Qu.:0.6700 3rd Qu.:0.4200 3rd Qu.:0.3300 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.:0.3800 3rd Qu.:0.4900 3rd Qu.:0.5200 3rd Qu.:0.590 3rd Qu.:0.5300 3rd Qu.:0.5900 3rd Qu.:0.5800 3rd Qu.:0.5100 3rd Qu.:0.01000 3rd Qu.:0.00000 3rd Qu.:0.2800 3rd Qu.:0.7775 3rd Qu.:0.6600 3rd Qu.:0.7700 3rd Qu.:0.7900 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3rd Qu.:0.07000 3rd Qu.:0.2800 3rd Qu.:0.1900 NA NA NA NA 3rd Qu.:0.00000 NA 3rd Qu.:0.330
Max. :56.00 NA NA NA Max. :10.000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 NA Max. :1.0000 Max. :1.00000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Max. :1.00000 Max. :1.0000 Max. :1.0000 NA NA NA NA Max. :1.00000 NA Max. :1.000

Pre-Processing

Missing Value Analysis

Based on the above descriptive data summary, there are quite a few variables with missing values. We therefore analyzed the missing values across all attributes to identify the proper treatment (imputation or removal).

## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL

dataset_missing_counts <- dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`))

dataset_missing_counts  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature NA_Count NA_Percentage
LemasSwornFT 1675 84.0020060
LemasSwFTPerPop 1675 84.0020060
LemasSwFTFieldOps 1675 84.0020060
LemasSwFTFieldPerPop 1675 84.0020060
LemasTotalReq 1675 84.0020060
LemasTotReqPerPop 1675 84.0020060
PolicReqPerOffic 1675 84.0020060
PolicPerPop 1675 84.0020060
RacialMatchCommPol 1675 84.0020060
PctPolicWhite 1675 84.0020060
PctPolicBlack 1675 84.0020060
PctPolicHisp 1675 84.0020060
PctPolicAsian 1675 84.0020060
PctPolicMinor 1675 84.0020060
OfficAssgnDrugUnits 1675 84.0020060
NumKindsDrugsSeiz 1675 84.0020060
PolicAveOTWorked 1675 84.0020060
PolicCars 1675 84.0020060
PolicOperBudg 1675 84.0020060
LemasPctPolicOnPatr 1675 84.0020060
LemasGangUnitDeploy 1675 84.0020060
PolicBudgPerPop 1675 84.0020060
community 1177 59.0270812
county 1174 58.8766299
OtherPerCap 1 0.0501505
ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) + 
  geom_bar(stat = 'identity', fill = 'steelblue') +
  geom_label(aes(label = NA_Count)) +
  labs(title = 'Missing Counts') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())

There are 22 variables missing 84% of their data and 2 variables missing roughly 59% of their data. Variables with a high proportion of NAs (greater than 50%) were removed, as they were deemed practically useless for prediction or data exploration; imputing these variables is risky because the majority of the data is missing. Most of these variables came from the 1990 Law Enforcement Management and Administrative Statistics (LEMAS) survey.

Also, we are going to remove the ‘fold’ variable originally added for cross-validation.

colremove <- dataset_missing_counts %>%
  filter(NA_Percentage > 50) %>% 
  select(Feature) %>%
  as.list()

dataset <- dataset[,!(colnames(dataset) %in% colremove$Feature)]


# removing the folds variable as it was placed for cross validation
dataset <- dataset[,names(dataset) != 'fold']

We also checked for the presence of degenerate (near-zero variance) variables in the dataset and removed them.

# capturing the degenerate variables
degenCols <- nearZeroVar(dataset)

# identifying them 
colnames(dataset[,degenCols])
## [1] "LemasPctOfficDrugUn"
# removing from the dataset
dataset <- dataset[,-degenCols]

Next, we converted the ‘OtherPerCap’ column to numeric and removed the single row with an NA value in that column (the ‘communityname’ identifier is retained for exploratory analysis and dropped later, before modeling) -

#dataset <- dataset[,!colnames(dataset) == 'communityname']

dataset <- dataset %>%
  mutate(OtherPerCap = as.numeric(OtherPerCap))

# remove the row with a missing OtherPerCap value
dataset <- dataset[!is.na(dataset$OtherPerCap), ]

dim(dataset)
## [1] 1993  102

After removing these variables and the single incomplete row, our dataset is reduced to 102 columns: 101 explanatory variables (including the ‘communityname’ identifier) plus the target.

Exploratory Data Analysis

To start our visual exploratory data analysis, we joined the crime dataset with the state-level enrichment dataset -

# Join with states dataframe
dataset1 <- left_join(dataset, states, 
              by = c("state" = "state_code"))

dataset1 <- rename(dataset1, state_abbr = state.y)

Violent Crime Rate Analysis

The District of Columbia leads the United States in violent crime rate. Louisiana has the second-highest rate after DC, followed by South Carolina, Maryland, and Florida. DC, however, has a much higher crime rate than the other top states, whose rates are roughly comparable. New York and Florida, by contrast, have much larger populations than the others.

# aggregating the Average ViolentCrimesPerPop by State ID (FIPS)
crime_rate_by_state <- dataset1 %>%
  group_by(state_abbr) %>%
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>% 
  mutate(rank = dense_rank(desc(AvgCrimesRate)))

crime_rate_by_state %>% filter(rank <=5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="60%",height="250px")
state_abbr AvgCrimesRate rank
DC 1.0000000 1
LA 0.5045455 2
SC 0.4867857 3
MD 0.4800000 4
FL 0.4583333 5
#Violent Crime Rates by States
crime_rate_by_state$hover <- with(crime_rate_by_state, paste(state_abbr,'<br>', "Crime Rank:",rank))
l <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

p <- crime_rate_by_state %>% plot_geo(locationmode = 'USA-states') %>%
  add_trace(
    z = ~AvgCrimesRate, text = ~hover, locations = ~state_abbr, 
    color = ~AvgCrimesRate, colors = 'Blues'
  ) %>%
 colorbar(title = "Crime Rate/100K Population") %>%
  layout(
    title = 'Violent Crime Rates in the United States',
    geo = g
  )

p

Map of percentage of black population in the US

More than 65% of the population of DC belongs to the African-American community. Other states with a high percentage of African-American residents are Mississippi, Georgia, Louisiana, and Delaware. Several of these also rank among the states with the highest violent crime rates, which helps explain why crime so often brings perceptions of race into the discussion. We need to investigate this further.

#Percentage of African-American population by state
black_pop_by_state <- dataset1 %>%
  group_by(state_abbr) %>%
  summarise(AvgBlackPopRate = mean(racepctblack )) %>% 
  mutate(rank = dense_rank(desc(AvgBlackPopRate)))

black_pop_by_state %>% filter(rank <=5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="60%",height="250px")
state_abbr AvgBlackPopRate rank
DC 1.0000000 1
MS 0.6900000 2
GA 0.6608108 3
LA 0.6245455 4
DE 0.6000000 5
black_pop_by_state$hover <- with(black_pop_by_state, paste(state_abbr,'<br>', "Black Population  Rank:",rank))
l <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

p <- black_pop_by_state %>% plot_geo(locationmode = 'USA-states') %>%
  add_trace(
    z = ~AvgBlackPopRate, text = ~hover, locations = ~state_abbr, 
    color = ~AvgBlackPopRate, colors = 'Greens'
  ) %>%
 colorbar(title = "%Black") %>%
  layout(
    title = 'Percentage of African-American Population',
    geo = g
  )
p

Word Cloud of most dangerous communities

We conclude our initial visual data exploration by taking a look at the communities with the highest violent crime rates. Interestingly, Camden City in New Jersey appears at the top of the list of the most dangerous communities in the US in terms of violent crime rate. Other communities with very high violent crime rates include Baton Rouge and Alexandria in Louisiana; Baltimore; Atlanta; Anniston, Bessemer, and Birmingham in Alabama; and Atlantic City.

#Word Cloud of Communities with the highest crime rates

communityCrime <- dataset1 %>% select(communityname,state_abbr, population, ViolentCrimesPerPop) %>% 
  mutate(Place = paste(communityname,state_abbr, sep = ", ")) %>% 
  group_by(Place) %>% 
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>% 
  arrange(desc(AvgCrimesRate)) %>% 
  head(15)

par(mfrow = c(1,1))
par(mar = rep(0,4))
set.seed(1)
wordcloud(words = communityCrime$Place, freq = communityCrime$AvgCrimesRate, scale=c(3,0.01),
          random.order=FALSE,
          colors=brewer.pal(8, "Dark2"))

Variable Removal

We still have an extremely high number of variables in the dataset. To reduce them further, we measure the correlation of each independent variable with the dependent variable and remove those with low correlation; specifically, we removed variables with an absolute correlation of less than 0.25 with the dependent variable. This vastly reduces the number of variables in the dataset to 54. We removed these variables because we felt they would contribute very little to our final model; in addition, working with fewer variables makes the model easier to interpret.

dataset <- dataset[,!colnames(dataset) == 'communityname']

#Running a correlation matrix
res2 <- rcorr(as.matrix(dataset))

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
  )
}

#Finding variables with low correlation to dependent variable
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% filter(column == "ViolentCrimesPerPop") %>% arrange(cor)
remove <- correlation %>% filter(cor >= -0.25 & cor <= 0.25) %>% select(row)


low.corr.dep <- c('HispPerCap','PctSpeakEnglOnly','RentMedian','MedRent','RentHighQ','state',
'OwnOccLowQuart','whitePerCap','OwnOccMedVal','OwnOccHiQuart','AsianPerCap','PctSameHouse85',
'pctWFarmSelf','PctWorkMom','OtherPerCap','PersPerOwnOccHous','MedYrHousBuilt','pctWRetire',
'indianPerCap','PctBornSameState','PctEmplProfServ','PctEmplManu','PersPerOccupHous','householdsize','PctWorkMomYoungKids','PctSameState85','PctVacMore6Mos','racePctAsian','MedOwnCostPctIncNoMtg','agePct12t21','MedOwnCostPctInc','agePct65up','PctSameCity85','pctUrban','agePct16t24',
'pctWSocSec','PersPerFam','agePct12t29','PctUsePubTrans','PctImmigRecent','PctForeignBorn',
'LandArea','PctImmigRec5','PctRecentImmig','PctRecImmig5','PctImmigRec8','PersPerRentOccHous'
)

dataset <- dataset %>% select(-all_of(low.corr.dep))

dim(dataset)
## [1] 1993   54

Multi-collinearity

To remove multicollinearity, we selected pairs of independent variables with absolute correlations greater than 0.9. To decide which variable to remove from each pair, we regressed the dependent variable on the remaining independent variables, computed the variance inflation factor (VIF) of each variable using the vif function from the car package in R, and removed the member of each multicollinear pair with the higher VIF. This reduced the dataset to 39 predictors plus the target variable. The logic is that the two variables in a multicollinear pair contribute largely the same variance information to a model, so only one of them (the one with the lower variance inflation factor) is needed for modeling.
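For reference, the variance inflation factor of predictor $j$ is derived from the $R_j^2$ obtained by regressing that predictor on all of the other predictors; values above roughly 5-10 are commonly treated as a sign of problematic collinearity:

$$VIF_j = \frac{1}{1 - R_j^2}$$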

#Now let's check for multicollinearity
res2 <- rcorr(as.matrix(dataset))
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% mutate(abs_cor = abs(cor)) %>% arrange(desc(abs_cor)) %>% filter(abs_cor > 0.9) %>% select(row, column)

correlation %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")
row column
PctRecImmig8 PctRecImmig10
population numbUrban
PctFam2Par PctKids2Par
PctLargHouseFam PctLargHouseOccup
FemalePctDiv TotalPctDiv
PctPersOwnOccup PctHousOwnOcc
medIncome medFamInc
MalePctDivorce TotalPctDiv
PctBSorMore PctOccupMgmtProf
population NumUnderPov
PctLess9thGrade PctNotHSGrad
NumUnderPov NumIlleg
medFamInc perCapInc
numbUrban NumUnderPov
PctFam2Par PctYoungKids2Par
PctKids2Par PctYoungKids2Par
MalePctDivorce FemalePctDiv
PctFam2Par PctTeen2Par
PctKids2Par PctTeen2Par
#Let's run a simple linear regression to see if any more variables can be removed based on VIF

LR_test <- lm(ViolentCrimesPerPop ~ ., dataset)

vif_df <- data.frame(vif(LR_test)) 
colnames(vif_df) <- c("VIF_Score")

vif_df %>% arrange(desc(`VIF_Score`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")
VIF_Score
TotalPctDiv 928.212429
FemalePctDiv 295.188771
MalePctDivorce 203.199697
PctRecImmig10 165.892005
PctRecImmig8 147.580998
PctLargHouseOccup 129.488362
PctLargHouseFam 124.023935
population 117.672823
PctFam2Par 103.736936
numbUrban 101.604576
medIncome 98.892368
PctKids2Par 97.813837
medFamInc 95.037614
PctPersOwnOccup 78.204266
PctHousOwnOcc 76.714946
PctNotHSGrad 37.715581
NumUnderPov 33.986232
perCapInc 27.944187
PctBSorMore 25.418161
PctPersDenseHous 23.163968
PctOccupMgmtProf 23.097460
PctLess9thGrade 21.029438
PctNotSpeakEnglWell 19.853248
PctPopUnderPov 18.129689
racePctWhite 15.166439
racepctblack 14.945811
NumIlleg 14.432404
pctWWage 13.271613
HousVacant 12.765602
PctYoungKids2Par 12.151434
pctWInvInc 12.081673
PctEmploy 11.608310
PctIlleg 11.490275
racePctHisp 9.970625
pctWPubAsst 9.197796
PctHousLess3BR 8.210647
RentLowQ 8.141949
PctHousNoPhone 7.424820
PctTeen2Par 7.264545
PctOccupManu 6.099481
PctUnemployed 6.099450
MalePctNevMarr 5.330162
NumImmig 5.007225
NumInShelters 4.710435
PctHousOccup 3.543461
PopDens 2.833126
MedNumBR 2.657976
NumStreet 2.473209
MedRentPctHousInc 2.470435
PctImmigRec10 2.372429
PctVacantBoarded 2.096964
blackPerCap 2.028969
PctWOFullPlumb 1.847284

Next, we scan the above list of multicollinear pairs and remove the member of each pair with the higher VIF -

remove.multicollinear <- c('PctRecImmig10', 'population', 'PctFam2Par', 'PctLargHouseOccup', 'FemalePctDiv', 'PctPersOwnOccup', 'medIncome', 'TotalPctDiv', 'PctBSorMore', 'PctNotHSGrad', 'NumUnderPov', 'medFamInc', 'numbUrban', 'PctKids2Par')

dataset <- dataset %>% select(-all_of(remove.multicollinear))

dim(dataset)
## [1] 1993   40

After removing the variables with high multicollinearity, we end up with 40 columns: 39 predictors plus the target variable.

Correlation Plots

As a final step before modeling, we created correlation plots to assess in advance the independent variables that will be most important in predicting the violent crime rates.

Plot1:
corrMatrix <- round(cor(dataset[,c(1:13,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Plot2:
corrMatrix <- round(cor(dataset[,c(14:26,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Plot3:
corrMatrix <- round(cor(dataset[,c(27:39,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Using the correlation plots, we can say that the following socio-economic factors in a community influence the level of violent crime rates:

  • Percentage of the population that is African-American
  • Percentage of the population living under poverty line
  • Percentage of kids born to never-married parents
  • Percentage of households with public assistance income
  • Percentage of people 16 and over, in the labor force, and unemployed
  • Percentage of males who are divorced
  • Percentage of occupied housing units without a phone
  • Percentage of vacant housing that is boarded up

Splitting Data: Train/Test

We are going to do a 75-25% split for training and test purposes.

sample = sample.split(dataset$ViolentCrimesPerPop, SplitRatio = 0.75)


train = subset(dataset, sample == TRUE) %>% as.matrix()
test = subset(dataset, sample == FALSE) %>% as.matrix()

y_train <- train[,40]
y_test <- test[,40]

X_train <- train[,-40]
X_test <- test[,-40]

#head(train)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Modeling

Re-usable Code Components

Before we start modeling, we define some reusable code components in the form of R functions -

## Function for plotting the most important variables

printVarImportance <- function(model_name, vImpDF, varNo){
  # Rename columns
  varImportance <-  tibble::rownames_to_column(vImpDF, "Variable")
  colnames(varImportance)[2] <- "Importance"
  
  # Limit to Top N important variables 
  varImportance <- varImportance %>% arrange(desc(`Importance`)) %>% head(varNo)
  
  # Create and assign Rank
  rankImportance <- varImportance %>% mutate(Rank = paste0('#',dense_rank(desc(Importance))))
  
  # Plot output
  ggplot(rankImportance, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat='identity', fill = 'steelblue') +
  geom_label(aes(label = round(Importance,2))) +
  geom_text(aes(x = Variable, y = 0.5, label = Rank),
           hjust=1.2, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variable') +
  ggtitle(paste("Variable Importance in ", model_name, " Model")) +
  coord_flip()
}
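For illustration, here is a minimal usage sketch with a small synthetic importance table in the shape returned by varImp(); the importance values below are placeholders, not results from this report:

# Illustrative only: a tiny hand-made importance table (placeholder values)
demo_imp <- data.frame(Overall = c(0.82, 0.61, 0.35),
                       row.names = c("racepctblack", "PctIlleg", "pctWPubAsst"))
printVarImportance("Demo", demo_imp, 3)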

Model Group A: Linear Regression Models - Ridge, LASSO, and Elastic Net

The glmnet package allows us to fit both ridge and lasso regression models (as well as the elastic net). It requires the response variable to be a vector and the predictor variables to be supplied as a matrix (e.g. via data.matrix or as.matrix).

Model1: Ridge Regression

Ridge regression is typically used “when the number of predictor variables in a set exceeds the number of observations”, or when a dataset suffers from multicollinearity. When predictor variables are highly correlated, the regression coefficient of any one variable can depend on which other predictor variables are included in the model and which ones are left out; ridge regression stabilizes these estimates by shrinking the coefficients.
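In its standard form, ridge regression minimizes the residual sum of squares plus an L2 penalty on the coefficients, with the tuning parameter $\lambda$ controlling how strongly the coefficients are shrunk toward zero:

$$\hat{\beta}^{\,ridge} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 \right\}$$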

Now we’ll use the cv.glmnet() function to fit the ridge regression model, specifying alpha=0.

fit.ridge <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), alpha = 0, type.measure = "mse", family="gaussian")

To identify what value to use for lambda, we could plot the cross-validation curve and inspect fit.ridge$lambda.min, but below we simply pass the s=“lambda.min” argument when predicting. The cross-validation plots are shown after all of the models.

fitted.ridge.train <- predict(fit.ridge, newx = data.matrix(X_train), s="lambda.min")


fitted.ridge.test <- predict(fit.ridge, newx = data.matrix(X_test), s="lambda.min")

cat("Train coefficient: ", cor(as.matrix(y_train), fitted.ridge.train)[1])
## Train coefficient:  0.8141226
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.ridge.test)[1])
## 
## Test coefficient:  0.7635008

As we can see above, the train and test correlation coefficients are approximately 0.81 and 0.76.

Prediction Performance
ridgePredPerf <- postResample(pred = fitted.ridge.test , obs = y_test)
ridgePredPerf['Family'] <- 'Linear Regression'
ridgePredPerf['Model'] <- 'Ridge Regression'

Model2: LASSO Regression - Least Absolute Shrinkage and Selection Operator

According to statisticshowto.com, lasso regression is a type of linear regression that uses shrinkage - which shrinks data points towards a central point, such as the mean.
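The difference from ridge lies in the penalty: the lasso uses an L1 penalty, which can shrink some coefficients exactly to zero and therefore also performs variable selection:

$$\hat{\beta}^{\,lasso} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j| \right\}$$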

Now we’ll use the cv.glmnet() function to fit the LASSO regression model, specifying alpha=1. We’ll use s=“lambda.min” when predicting to select the best lambda.

fit.lasso <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=1, family="gaussian")


fitted.lasso.train <- predict(fit.lasso, newx = data.matrix(X_train), s="lambda.min")

fitted.lasso.test <- predict(fit.lasso, newx = data.matrix(X_test), s="lambda.min")

cat("Train coefficient: ", cor(as.matrix(y_train), fitted.lasso.train)[1])
## Train coefficient:  0.8157457
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.lasso.test)[1])
## 
## Test coefficient:  0.7689392

As we can see above, the train and test correlation coefficients are approximately 0.82 and 0.77, respectively.
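Because the L1 penalty can set coefficients exactly to zero, we can optionally check how many predictors the LASSO actually retained at lambda.min (a quick diagnostic, not part of the original workflow; output not shown):

# Coefficients at the CV-selected lambda; zero entries were dropped by the L1 penalty
lasso_coefs <- as.matrix(coef(fit.lasso, s = "lambda.min"))
retained <- setdiff(rownames(lasso_coefs)[lasso_coefs[, 1] != 0], "(Intercept)")
cat("LASSO retained", length(retained), "of", ncol(X_train), "predictors\n")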

Prediction Performance
lassoPredPerf <- postResample(pred = fitted.lasso.test , obs = y_test)
lassoPredPerf['Family'] <- 'Linear Regression'
lassoPredPerf['Model'] <- 'Lasso Regression'

Model3: Elastic Net Regression

Elastic Net is an extension of linear regression that adds both the L1 (LASSO) and L2 (ridge) regularization penalties to the loss function during training; the mixing parameter alpha controls the balance between the two, and here we use alpha = 0.5.
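For reference, the objective that glmnet minimizes for the Gaussian family combines the two penalties, with the mixing parameter α controlling the balance (α = 0 is pure ridge, α = 1 is pure LASSO; we use α = 0.5 here):

\[\min_{\beta_0,\ \beta}\ \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{T}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]\]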

fit.elnet <- glmnet(as.matrix(X_train), as.matrix(y_train), family="gaussian", alpha=.5)

fit.elnet.cv <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=.5,
                          family="gaussian")

fitted.elnet.train <- predict(fit.elnet.cv, newx = data.matrix(X_train), s="lambda.min")

fitted.elnet.test <- predict(fit.elnet.cv, newx = data.matrix(X_test), s="lambda.min")

cat("Train coefficient: ", cor(as.matrix(y_train), fitted.elnet.train)[1])
## Train coefficient:  0.8158622
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.elnet.test)[1])
## 
## Test coefficient:  0.7685533
Prediction Performance
elnetPredPerf <- postResample(pred = fitted.elnet.test , obs = y_test)
elnetPredPerf['Family'] <- 'Linear Regression'
elnetPredPerf['Model'] <- 'Elastic Net Regression'

Plot MSE

par(mfrow=c(3,1))

plot(fit.lasso)     # LASSO: cross-validated MSE vs log(lambda)
plot(fit.ridge)     # ridge
plot(fit.elnet.cv)  # elastic net

Model Group B: Non-Linear Regression Models

Non-linear regression models do not assume the linear y = ax + b form; they allow the flexibility to fit non-linear trends in the data.

We will go over four examples of non linear regression: Neural Networks, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Multivariate Adaptive Regression Splines (MARS).

Model4: Neural Networks

Neural networks are regression techniques inspired by theories of how the brain works (neurons signalling to one another). The outcome is modeled through an intermediate layer of unobserved "hidden units", each a linear combination of the original predictors passed through a non-linear activation function.

The tuning parameters include the number of hidden units (size), the weight decay penalty (decay), the maximum number of iterations used to estimate the weights (maxit), and a cap on the total number of weights allowed in the model (MaxNWts).

# simple average Neural Network with 5 hidden units
nnetAvg <- avNNet(X_train, y_train,
                  size = 5,
                  decay = 0.01,
                  repeats = 5,
                  linout = TRUE,
                  trace = FALSE,
                  maxit = 500,
                  MaxNWts = 5 * (ncol(X_train) + 1) +5+1)
nnetAvg
## Model Averaged Neural Network with 5 Repeats  
## 
## a 39-5-1 network with 206 weights
## options were - linear output units  decay=0.01
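As a quick sanity check on MaxNWts: a single-hidden-layer network with p = 39 inputs and 5 hidden units has 5 × (39 + 1) weights into the hidden layer plus (5 + 1) weights into the output unit, i.e. 5 × 40 + 6 = 206 weights, which matches the printed model summary.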
Variable Importance
varimp <- varImp(nnetAvg)
varimp %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Overall
X292 7.6370160
X12 5.9451966
X114 5.5187905
X26 5.2518596
X54 5.1049880
(remaining 190 rows of the scrollable importance table omitted; varImp() on the averaged network labels the scores with generic column indices rather than the original variable names)

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("Neural Network", varImp(nnetAvg), 10)

Prediction Performance
nnetPred <- predict(nnetAvg, newdata = X_test)
nnetPredPerf <- postResample(pred = nnetPred, obs = y_test)
nnetPredPerf['Family'] <- 'Non-Linear Regression'
nnetPredPerf['Model'] <- 'Neural Network'

Model5: K-Nearest Neighbors

The KNN model predicts a new sample in regression by averaging the responses of the K closest points in the training set. Closeness is measured with the Minkowski distance shown below; with q = 2 it reduces to the familiar Euclidean (straight-line) distance.

\[(\sum^p_{j=1}(x_{aj}-x_{bj})^q)^{1/q}\]

We tune K with the train function in the caret package, using method = 'knn', centering and scaling the predictors, and 10-fold cross-validation.

knnFit <- train(X_train, y_train,
                   method = 'knn',
                   preProc = c('center','scale'),
                   tuneLength = 14,
                   trControl = trainControl(method = 'cv'))

knnFit
## k-Nearest Neighbors 
## 
## 1499 samples
##   39 predictor
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1349, 1349, 1348, 1347, 1349, 1350, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE       
##    5  0.1470551  0.6085563  0.10018005
##    7  0.1454218  0.6186791  0.09854605
##    9  0.1440191  0.6269384  0.09784016
##   11  0.1430909  0.6322653  0.09764411
##   13  0.1426350  0.6358800  0.09697388
##   15  0.1434319  0.6321638  0.09772261
##   17  0.1431849  0.6351700  0.09726898
##   19  0.1439635  0.6320082  0.09767728
##   21  0.1443801  0.6305995  0.09764054
##   23  0.1439431  0.6342258  0.09739809
##   25  0.1444537  0.6324687  0.09746166
##   27  0.1448037  0.6308266  0.09757062
##   29  0.1451915  0.6295031  0.09796384
##   31  0.1454098  0.6292510  0.09811286
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
KNN Plot
plot(knnFit)

Variable Importance
varimp <- varImp(knnFit)
varimp$importance %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Overall
PctIlleg 100.0000000
racePctWhite 82.6566345
PctYoungKids2Par 80.2817251
PctTeen2Par 78.6232333
racepctblack 70.3212797
pctWInvInc 55.5310000
pctWPubAsst 55.1605903
MalePctDivorce 47.8211990
PctPopUnderPov 44.0230884
PctUnemployed 39.7884001
PctHousNoPhone 35.8730201
PctVacantBoarded 33.5763619
PctHousOwnOcc 33.3328822
PctHousLess3BR 32.1642291
PctPersDenseHous 29.2459802
NumIlleg 29.2056695
PctLess9thGrade 21.1231604
HousVacant 20.6816310
PctLargHouseFam 15.3341087
NumInShelters 14.9067179
MedNumBR 13.6315013
PctWOFullPlumb 12.9198016
perCapInc 12.1955112
PctOccupMgmtProf 11.2535144
PctEmploy 10.3417599
NumStreet 10.3378306
MedRentPctHousInc 9.0280115
PctHousOccup 8.5714403
pctWWage 6.9674327
MalePctNevMarr 6.9603290
PctNotSpeakEnglWell 5.4220236
PctOccupManu 5.3190000
PctImmigRec10 4.9203000
PopDens 4.8423181
racePctHisp 4.2799730
NumImmig 3.9294479
blackPerCap 1.6454549
PctRecImmig8 0.3601896
RentLowQ 0.0000000

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("KNN", varImp(knnFit)$importance, 10)

Prediction Performance
knnPred <- predict(knnFit, newdata = X_test)
knnPredPerf <- postResample(pred = knnPred, obs = y_test)
knnPredPerf['Family'] <- 'Non-Linear Regression'
knnPredPerf['Model'] <- 'KNN'

Model6: Support Vector Machines

SVM regression tries to find a function such that most training points lie within a tolerance ε of the fit; points inside that margin contribute no error, and the model minimizes the size of the coefficients subject to this constraint, with the cost parameter C controlling how strongly violations are penalized. Because the solution depends only on the support vectors, the fit is relatively robust to outlying samples.

We will again use the caret package and apply an SVM with a radial basis function kernel, which allows the model to capture non-linear trends in the data.
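For reference, the radial basis function kernel measures the similarity between two samples as

\[K(x, x') = \exp\left(-\sigma\,\lVert x - x'\rVert^2\right)\]

where σ is the kernel width reported as the sigma tuning parameter in the output below.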

svmRFit <- train(X_train, y_train,
                   method = 'svmRadial',
                   preProc = c('center','scale'),
                   tuneLength = 14,
                   trControl = trainControl(method = 'cv'))
svmRFit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1499 samples
##   39 predictor
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1349, 1348, 1350, 1350, 1349, 1349, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE       
##      0.25  0.1481972  0.6322960  0.09523909
##      0.50  0.1430390  0.6461203  0.09275885
##      1.00  0.1402594  0.6525146  0.09164402
##      2.00  0.1396685  0.6519714  0.09198224
##      4.00  0.1419226  0.6393452  0.09431469
##      8.00  0.1468268  0.6150199  0.09876876
##     16.00  0.1546668  0.5773157  0.10495410
##     32.00  0.1647599  0.5308168  0.11282599
##     64.00  0.1739939  0.4929692  0.11996609
##    128.00  0.1801527  0.4699293  0.12551726
##    256.00  0.1803473  0.4699574  0.12603844
##    512.00  0.1803473  0.4699574  0.12603844
##   1024.00  0.1803473  0.4699574  0.12603844
##   2048.00  0.1803473  0.4699574  0.12603844
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02507349
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02507349 and C = 2.
Variable Importance
varimp <- varImp(svmRFit)
varimp$importance %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Overall
PctIlleg 100.0000000
racePctWhite 82.6566345
PctYoungKids2Par 80.2817251
PctTeen2Par 78.6232333
racepctblack 70.3212797
pctWInvInc 55.5310000
pctWPubAsst 55.1605903
MalePctDivorce 47.8211990
PctPopUnderPov 44.0230884
PctUnemployed 39.7884001
PctHousNoPhone 35.8730201
PctVacantBoarded 33.5763619
PctHousOwnOcc 33.3328822
PctHousLess3BR 32.1642291
PctPersDenseHous 29.2459802
NumIlleg 29.2056695
PctLess9thGrade 21.1231604
HousVacant 20.6816310
PctLargHouseFam 15.3341087
NumInShelters 14.9067179
MedNumBR 13.6315013
PctWOFullPlumb 12.9198016
perCapInc 12.1955112
PctOccupMgmtProf 11.2535144
PctEmploy 10.3417599
NumStreet 10.3378306
MedRentPctHousInc 9.0280115
PctHousOccup 8.5714403
pctWWage 6.9674327
MalePctNevMarr 6.9603290
PctNotSpeakEnglWell 5.4220236
PctOccupManu 5.3190000
PctImmigRec10 4.9203000
PopDens 4.8423181
racePctHisp 4.2799730
NumImmig 3.9294479
blackPerCap 1.6454549
PctRecImmig8 0.3601896
RentLowQ 0.0000000

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("SVM", varImp(svmRFit)$importance, 10)

Prediction Performance
svmRPred <- predict(svmRFit, newdata = X_test)
svmRPredPerf <- postResample(pred = svmRPred, obs = y_test)
svmRPredPerf['Family'] <- 'Non-Linear Regression'
svmRPredPerf['Model'] <- 'SVM'

Model7: Multivariate Adaptive Regression Splines (MARS)

The MARS model places knots to build a piecewise linear model in which each new feature models an isolated portion of the data. At each step, the predictor and cut-point combination with the smallest error is added, and the process repeats until a stopping criterion is reached.
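Each MARS feature is a hinge function of the form below; the h(·) terms in the model summary further down are exactly these hinges at the selected cut points c:

\[h(x - c) = \max(0,\ x - c), \qquad h(c - x) = \max(0,\ c - x)\]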

We will use the earth package to generate this MARS model.

# mars
marsFit <- earth(X_train, y_train)
summary(marsFit)
## Call: earth(x=X_train, y=y_train)
## 
##                          coefficients
## (Intercept)                 0.7263794
## h(0.54-racepctblack)       -0.3101390
## h(0.73-pctWWage)            0.1249294
## h(pctWWage-0.73)           -0.2401202
## h(PctPopUnderPov-0.67)     -0.5455969
## h(MalePctDivorce-0.39)      0.2138445
## h(0.51-PctTeen2Par)         0.2990705
## h(0.39-PctIlleg)           -0.1481046
## h(PctIlleg-0.39)            0.2448670
## h(0.88-PctPersDenseHous)   -0.3320613
## h(0.16-HousVacant)         -0.4654248
## h(0.31-NumStreet)          -0.4153084
## 
## Selected 12 of 21 terms, and 9 of 39 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: racepctblack, PctIlleg, MalePctDivorce, PctPersDenseHous, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 0.01815316    RSS 26.38294    GRSq 0.6714617    RSq 0.6810408
Visualizing MARS Model
plotmo(marsFit)
##  plotmo grid:    racepctblack racePctWhite racePctHisp pctWWage pctWInvInc
##                          0.06         0.85        0.04     0.57       0.48
##  pctWPubAsst perCapInc blackPerCap PctPopUnderPov PctLess9thGrade PctUnemployed
##         0.26      0.31        0.26           0.24            0.27          0.32
##  PctEmploy PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr
##       0.51         0.37              0.4           0.47            0.4
##  PctYoungKids2Par PctTeen2Par NumIlleg PctIlleg NumImmig PctImmigRec10
##               0.7        0.61     0.01     0.17     0.01          0.43
##  PctRecImmig8 PctNotSpeakEnglWell PctLargHouseFam PctPersDenseHous
##          0.09                0.06            0.21             0.11
##  PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded
##            0.51      0.5       0.03         0.77          0.55             0.13
##  PctHousNoPhone PctWOFullPlumb RentLowQ MedRentPctHousInc NumInShelters
##            0.18           0.19     0.32              0.48             0
##  NumStreet PopDens
##          0    0.17

Variable Importance
varimp <- varImp(marsFit)
varimp %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Overall
racepctblack 100.00000
PctIlleg 65.01171
MalePctDivorce 47.29172
PctPersDenseHous 47.29172
HousVacant 29.26562
pctWWage 22.81121
PctPopUnderPov 20.13595
NumStreet 15.49737
PctTeen2Par 11.49760

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("MARS", varImp(marsFit), 10)

Prediction Performance
marsPred <- predict(marsFit, newdata = X_test)
marsPredPerf <- postResample(pred = marsPred, obs = y_test)
marsPredPerf['Family'] <- 'Non-Linear Regression'
marsPredPerf['Model'] <- 'MARS'

Model Group C: Trees and Boosting Model

Model8: Decision Tree

Decision trees are a type of supervised machine learning method in which the data is repeatedly split according to the values of the predictors. The tree can be explained by two entities: decision nodes, where the data is split, and leaves, which hold the final outcomes. Since our target variable is continuous, we will be building a regression tree.

Decision Tree1

We first build a decision tree using all of the variables.

dt <- rpart(ViolentCrimesPerPop~ ., 
               data=as.data.frame(train))
rpart.plot(dt, nn=TRUE)

Here are some of the details of the tree:

printcp(dt)
## 
## Regression tree:
## rpart(formula = ViolentCrimesPerPop ~ ., data = as.data.frame(train))
## 
## Variables actually used in tree construction:
## [1] MalePctDivorce   NumStreet        PctIlleg         PctPersDenseHous
## [5] PctPopUnderPov  
## 
## Root node error: 82.716/1499 = 0.055181
## 
## n= 1499 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.412904      0   1.00000 1.00143 0.050454
## 2 0.063292      1   0.58710 0.59142 0.029666
## 3 0.054263      2   0.52380 0.54890 0.028311
## 4 0.022946      3   0.46954 0.50016 0.026669
## 5 0.019715      4   0.44660 0.48908 0.026478
## 6 0.011323      5   0.42688 0.46579 0.025620
## 7 0.010558      6   0.41556 0.48008 0.026145
## 8 0.010000      7   0.40500 0.47461 0.026067

To improve accuracy, we prune the tree using cross-validation.

plotcp(dt)

dt$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.41290395      0 1.0000000 1.0014260 0.05045354
## 2 0.06329165      1 0.5870960 0.5914199 0.02966628
## 3 0.05426321      2 0.5238044 0.5489023 0.02831096
## 4 0.02294604      3 0.4695412 0.5001609 0.02666861
## 5 0.01971462      4 0.4465952 0.4890764 0.02647826
## 6 0.01132276      5 0.4268805 0.4657860 0.02561957
## 7 0.01055755      6 0.4155578 0.4800769 0.02614502
## 8 0.01000000      7 0.4050002 0.4746088 0.02606696

The plot above shows the cross-validated error against the complexity parameter. The curve bottoms out (cross-validated error of roughly 0.47) for trees with around 5 to 7 splits, so we prune the tree using cp = 0.0104.
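Rather than reading the value off the plot, the complexity parameter at the minimum cross-validated error can also be pulled from the cptable directly (an optional alternative; we keep the manually chosen cp below):

# cp of the split count with the lowest cross-validated error
best_cp <- dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"]
best_cp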

Decision Tree 2

prune_dt=prune(dt,cp=0.0104)
rpart.plot(prune_dt)

Predicting Pruned Tree on Test Data
test2<-as.data.frame(test)
test2<- na.omit(test2)
Prune_pred <- predict(prune_dt, 
                   test2)
Model Accuracy
confMat <- table(test2$ViolentCrimesPerPop,Prune_pred)
accuracy <- sum(diag(confMat))/sum(confMat)
accuracy
## [1] 0.006072874

The accuracy computed this way is very low even after pruning, but the metric itself is not meaningful here: a confusion matrix assumes a categorical outcome, and with a continuous response exact matches between predictions and observations are essentially impossible, so we rely on the regression metrics from postResample below instead. More generally, single trees rarely give the best prediction errors, so we next build a random forest, which tends to outperform individual trees.

Variable Importance
## Call Variable Importance Plot Function
printVarImportance("Decision Tree", data.frame(prune_dt$variable.importance), 10)

Prediction Performance
dtreePredPerf <- postResample(pred = Prune_pred, obs = y_test)
dtreePredPerf['Family'] <- 'Trees & Boosting'
dtreePredPerf['Model'] <- 'Decision Tree'

Model9: Random Forest

Random forests are one of the most popular machine learning algorithms. They can be used for classification or regression, handle a large number of features, and generally succeed because they offer good predictive performance, a low incidence of over-fitting, and reasonable interpretability through variable importance measures.

In building the best random forest model, we want to minimize the out-of-bag (OOB) error by finding the optimal number of variables considered at each split, known as mtry. The code below searches over candidate mtry values to find the best one.

train2 <- as.data.frame(train)
n_pred <- ncol(train2) - 1   # number of predictors; mtry cannot exceed this
oob.err <- double(n_pred)
test.err <- double(n_pred)
for (mtry in 1:n_pred){
  fit <- randomForest(ViolentCrimesPerPop ~ ., data = train2, mtry = mtry,
                      ntree = 400, importance = TRUE, proximity = TRUE)
  oob.err[mtry] <- fit$mse[400]                      # OOB error after 400 trees
  pred <- predict(fit, test2)                        # predictions on the held-out test set (test2 built above)
  test.err[mtry] <- with(test2, mean((ViolentCrimesPerPop - pred)^2))
  cat(mtry, " ")
}
## 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39
matplot(1:n_pred, cbind(test.err, oob.err), pch = 19, col = c("red", "blue"),
        type = "b", ylab = "Mean Squared Error")
legend("topright", legend = c("Test", "OOB"), pch = 19, col = c("red", "blue"))

Ideally OOB error and test errors should line up - but in this case the test error seems to be higher than the OOB.

Variable Importance
varImpPlot(fit)

## Call Variable Importance Plot Function
printVarImportance("Random Forest", varImp(fit), 10)

The graphs above illustrate the importance of the variables used to predict violent crimes per population. The %IncMSE panel shows how much the model's mean squared error increases when a variable is randomly permuted, while the IncNodePurity panel shows the total reduction in residual sum of squares from splits on that variable. In both panels, the percentage of kids born to never-married parents (PctIlleg) stands out as the most important variable by a wide margin.

print(fit)
## 
## Call:
##  randomForest(formula = ViolentCrimesPerPop ~ ., data = train2,      mtry = mtry, ntree = 400, proximity = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 400
## No. of variables tried at each split: 39
## 
##           Mean of squared residuals: 0.01920442
##                     % Var explained: 65.2
Prediction Performance
rfPredPerf <- postResample(pred = pred, obs = y_test)
rfPredPerf['Family'] <- 'Trees & Boosting'
rfPredPerf['Model'] <- 'Random Forest'

Model10: Gradient Boosting Method

The boosting method uses trees in a sequential pattern: each successive tree is developed using information from the previous trees in order to reduce the remaining error. Boosting has several tuning parameters, including the number of trees, the shrinkage (learning rate) parameter, and the number of splits in each tree.

We will be using stochastic gradient boosting in our model. This approach resamples the observations and columns in each iteration. We'll use the caret workflow, which invokes the xgboost package, to automatically tune the model parameters and fit the final boosted model that best explains our data.
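The chunk that trains gbm_model is not echoed above; a minimal sketch of how such a model could be fit with the caret/xgboost workflow is given below (the seed, resampling scheme, and grid size are illustrative assumptions, not the exact settings used).

# Sketch (assumed settings): let caret search a default xgbTree grid with 10-fold CV
set.seed(100)                                   # hypothetical seed
gbm_model <- train(x = X_train, y = y_train,
                   method = "xgbTree",          # stochastic gradient boosting via xgboost
                   trControl = trainControl(method = "cv", number = 10),
                   tuneLength = 3)              # small illustrative grid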

Here, we can see our parameters that are used in the model -

# Best tuning parameter
gbm_model$bestTune %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
      nrounds  max_depth  eta  gamma  colsample_bytree  min_child_weight  subsample
34    50       2          0.3  0      0.8               1                 1
Variable Importance

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("XGBoost", varImp(gbm_model)$importance, 10)

Visualization of the first 3 trees used in the model.

xgb.plot.tree(model = gbm_model$finalModel,trees = 1:3)
xgbmPred <- predict(gbm_model, newdata = X_test)
xgbmPredPerf <- postResample(pred = xgbmPred, obs = y_test)
xgbmPredPerf['Family'] <- 'Trees & Boosting'
xgbmPredPerf['Model'] <- 'XGBoost'

Model Summary

modelSummaryDF <- rbind(ridgePredPerf, lassoPredPerf, elnetPredPerf, nnetPredPerf, knnPredPerf, svmRPredPerf, marsPredPerf, dtreePredPerf, rfPredPerf, xgbmPredPerf)

SummaryDF <- modelSummaryDF[, c(4, 5, 1, 2, 3)]

rownames(SummaryDF) <- NULL

SummaryDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="100%",height="500px")
Family                  Model                    RMSE     Rsquared  MAE
Linear Regression       Ridge Regression         0.1340   0.6521    0.0931
Linear Regression       Lasso Regression         0.1343   0.6502    0.0935
Linear Regression       Elastic Net Regression   0.1344   0.6500    0.0935
Non-Linear Regression   Neural Network           0.1381   0.6318    0.0920
Non-Linear Regression   KNN                      0.1377   0.6432    0.0935
Non-Linear Regression   SVM                      0.1374   0.6434    0.0904
Non-Linear Regression   MARS                     0.1352   0.6454    0.0932
Trees & Boosting        Decision Tree            0.1534   0.5454    0.1072
Trees & Boosting        Random Forest            NA       0.0024    NA
Trees & Boosting        XGBoost                  0.1365   0.6389    0.0920

All three penalized linear regression models we built (Ridge, Lasso, and Elastic Net) performed well, with test-set R2 above 0.65. The non-linear regression models (MARS, SVM, and KNN) also reached R2 of roughly 0.64 and above. Among the tree-based models, XGBoost performed best with an R2 of about 0.64. All of these models have predictive power in the same range; the single decision tree is the easiest to interpret but significantly under-performs every other model.

The random forest model, evaluated on its out-of-bag samples, explains 65.2% of the variance with a low mean squared residual (0.019).

Purely on the basis of R2, the Ridge regression model is the best pick of all the models we tried.

Conclusion

News of violent crimes shakes the foundations of our society from time to time and makes us question our beliefs about the prophesied cultural, economic, and racial harmony in our society. With the increase in recent incidents of violent crime, it has become imperative to re-evaluate the causes of these crimes in light of the socio-economic fabric of our society. Increased computational power has enabled us to examine a large number of factors at once and determine how the relationships among them influence crime rates. A data-driven approach is paramount to plan for the future and build our communities on the principles of trust and harmony.

The Communities and Crime normalized dataset from the UCI Machine Learning Repository provides an excellent opportunity to look at how a myriad of socio-economic factors influence violent crime rates in our society. Using a range of machine learning approaches, we built predictive models with reasonably high accuracy that pinpointed several factors influencing these crimes. One theme that arises when assessing the most important factors is marriage. Successful marriages contribute to the healthy development of children, who in turn become responsible citizens. Areas with a high percentage of kids born to never-married parents, dense housing, and little-to-no schooling tend to exhibit high violent crime rates. Although the models point to racial factors, we believe this has more to do with the relatively poor economic conditions of African Americans in comparison to those of Caucasians.

Despite these contributions, the models developed here carry a serious risk of over-fitting. Because we prioritized the generation of insights over building a deployable predictive model, the entire dataset was used for both training and validation, except in the case of the random forest model, which has been proposed for scaling purposes. The random forest model could be further optimized through hyper-parameter tuning to improve its predictive power. Once scaled, the model would need to be retrained periodically as more data becomes available. Further study in this area could explore other approaches, such as deeper or more carefully tuned neural network architectures, to better predict violent crime rates.

References

1. UCI Machine Learning Repository: Communities and Crime Data Set. https://archive.ics.uci.edu/ml/datasets/communities+and+crime