Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(naniar)
library(xgboost)
library(usmap)
library(DiagrammeR)
library(earth)
library(plotly)
library(wordcloud)
library(RColorBrewer)
library(glmnet)
library(Hmisc)
library(car)
library(class)
library(rpart)
library(rpart.plot)

Abstract

This is a dataset of 2018 US communities, the demographics of each community, and their crime rates. The dataset has 146 variables: the first four columns identify the community and its location, the middle features are demographic information about each community such as population, age, race, and income, and the final columns are types of crimes and overall crime rates. The goal of the project is to understand where violent crime occurs in terms of the socioeconomic and demographic characteristics of a region. These features can help predict ahead of time where violent crime is likely to occur, through predictive models that quantify the risk associated with a region.

Introduction

Approaching the problem of crime in the different states of the United States requires investigating and analyzing the crime rates in each state, as well as the factors that may be contributing to those rates. One of the factors that has been studied in relation to crime is the socioeconomic level of a community. There is evidence to suggest that communities with low socioeconomic levels have a higher incidence of crime compared to more prosperous communities. In addition, other socioeconomic factors, such as unemployment, poverty, and a lack of educational and job opportunities, have also been linked to an increased risk of crime. These factors can negatively affect people’s quality of life and increase their vulnerability to crime. However, it is important to note that other factors can also influence the crime rate, such as culture, law enforcement policies, the availability of guns, and other environmental and demographic factors. In summary, socioeconomic conditions can play an important role in the occurrence of crime in different states of the United States, but it is important to consider multiple factors when addressing this complex problem. The analysis carried out in this work could be useful for building predictive models that support urban planning and crime reduction.

Dataset Overview

The dataset selected for this analysis is the ‘Crimes in US Communities Dataset’, published by Michael Bryant (owner).

The dataset is very complete. For each community we can see data such as population, the percentage of the population in four age groups, the percentage of the population by race, the percentage of people using public transit for commuting, and many more fields that will allow us to carry out a solid analysis.

This is a dataset of 2018 US communities. Numeric (decimal) values have been normalized and are reported to two decimal places. Our target variable is ‘Violent Crimes by Population’ (the GOAL attribute). Our crime dataset has 128 attributes. The following table lists each variable, its description, and its data type.

No.  Column Description Data Type
1 state US state (by number) - not counted as predictive above, but if considered, should be considered nominal nominal
2 county numeric code for county - not predictive, and many missing values numeric
3 community numeric code for community - not predictive and many missing values numeric
4 communityname community name - not predictive - for information only string
5 fold fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive numeric
6 population population for community numeric - decimal
7 householdsize mean people per household numeric - decimal
8 racepctblack percentage of population that is african american numeric - decimal
9 racePctWhite percentage of population that is caucasian numeric - decimal
10 racePctAsian percentage of population that is of asian heritage numeric - decimal
11 racePctHisp percentage of population that is of hispanic heritage numeric - decimal
12 agePct12t21 percentage of population that is 12-21 in age numeric - decimal
13 agePct12t29 percentage of population that is 12-29 in age numeric - decimal
14 agePct16t24 percentage of population that is 16-24 in age numeric - decimal
15 agePct65up percentage of population that is 65 and over in age numeric - decimal
16 numbUrban number of people living in areas classified as urban numeric - decimal
17 pctUrban percentage of people living in areas classified as urban numeric - decimal
18 medIncome median household income numeric - decimal
19 pctWWage percentage of households with wage or salary income in 1989 numeric - decimal
20 pctWFarmSelf percentage of households with farm or self employment income in 1989 numeric - decimal
21 pctWInvInc percentage of households with investment / rent income in 1989 numeric - decimal
22 pctWSocSec percentage of households with social security income in 1989 numeric - decimal
23 pctWPubAsst percentage of households with public assistance income in 1989 numeric - decimal
24 pctWRetire percentage of households with retirement income in 1989 numeric - decimal
25 medFamInc median family income (differs from household income for non-family households) numeric - decimal
26 perCapInc per capita income numeric - decimal
27 whitePerCap per capita income for caucasians numeric - decimal
28 blackPerCap per capita income for african americans numeric - decimal
29 indianPerCap per capita income for native americans numeric - decimal
30 AsianPerCap per capita income for people with asian heritage numeric - decimal
31 OtherPerCap per capita income for people with ‘other’ heritage numeric - decimal
32 HispPerCap per capita income for people with hispanic heritage numeric - decimal
33 NumUnderPov number of people under the poverty level numeric - decimal
34 PctPopUnderPov percentage of people under the poverty level numeric - decimal
35 PctLess9thGrade percentage of people 25 and over with less than a 9th grade education numeric - decimal
36 PctNotHSGrad percentage of people 25 and over that are not high school graduates numeric - decimal
37 PctBSorMore percentage of people 25 and over with a bachelors degree or higher education numeric - decimal
38 PctUnemployed percentage of people 16 and over, in the labor force, and unemployed numeric - decimal
39 PctEmploy percentage of people 16 and over who are employed numeric - decimal
40 PctEmplManu percentage of people 16 and over who are employed in manufacturing numeric - decimal
41 PctEmplProfServ percentage of people 16 and over who are employed in professional services numeric - decimal
42 PctOccupManu percentage of people 16 and over who are employed in manufacturing numeric - decimal
43 PctOccupMgmtProf percentage of people 16 and over who are employed in management or professional occupations numeric - decimal
44 MalePctDivorce percentage of males who are divorced numeric - decimal
45 MalePctNevMarr percentage of males who have never married numeric - decimal
46 FemalePctDiv percentage of females who are divorced numeric - decimal
47 TotalPctDiv percentage of population who are divorced numeric - decimal
48 PersPerFam mean number of people per family numeric - decimal
49 PctFam2Par percentage of families (with kids) that are headed by two parents numeric - decimal
50 PctKids2Par percentage of kids in family housing with two parents numeric - decimal
51 PctYoungKids2Par percent of kids 4 and under in two parent households numeric - decimal
52 PctTeen2Par percent of kids age 12-17 in two parent households numeric - decimal
53 PctWorkMomYoungKids percentage of moms of kids 6 and under in labor force numeric - decimal
54 PctWorkMom percentage of moms of kids under 18 in labor force numeric - decimal
55 NumIlleg number of kids born to never married numeric - decimal
56 PctIlleg percentage of kids born to never married numeric - decimal
57 NumImmig total number of people known to be foreign born numeric - decimal
58 PctImmigRecent percentage of immigrants who immigrated within last 3 years numeric - decimal
59 PctImmigRec5 percentage of immigrants who immigrated within last 5 years numeric - decimal
60 PctImmigRec8 percentage of immigrants who immigrated within last 8 years numeric - decimal
61 PctImmigRec10 percentage of immigrants who immigrated within last 10 years numeric - decimal
62 PctRecentImmig percent of population who have immigrated within the last 3 years numeric - decimal
63 PctRecImmig5 percent of population who have immigrated within the last 5 years numeric - decimal
64 PctRecImmig8 percent of population who have immigrated within the last 8 years numeric - decimal
65 PctRecImmig10 percent of population who have immigrated within the last 10 years numeric - decimal
66 PctSpeakEnglOnly percent of people who speak only English numeric - decimal
67 PctNotSpeakEnglWell percent of people who do not speak English well numeric - decimal
68 PctLargHouseFam percent of family households that are large (6 or more) numeric - decimal
69 PctLargHouseOccup percent of all occupied households that are large (6 or more people) numeric - decimal
70 PersPerOccupHous mean persons per household numeric - decimal
71 PersPerOwnOccHous mean persons per owner occupied household numeric - decimal
72 PersPerRentOccHous mean persons per rental household numeric - decimal
73 PctPersOwnOccup percent of people in owner occupied households numeric - decimal
74 PctPersDenseHous percent of persons in dense housing (more than 1 person per room) numeric - decimal
75 PctHousLess3BR percent of housing units with less than 3 bedrooms numeric - decimal
76 MedNumBR median number of bedrooms numeric - decimal
77 HousVacant number of vacant households numeric - decimal
78 PctHousOccup percent of housing occupied numeric - decimal
79 PctHousOwnOcc percent of households owner occupied numeric - decimal
80 PctVacantBoarded percent of vacant housing that is boarded up numeric - decimal
81 PctVacMore6Mos percent of vacant housing that has been vacant more than 6 months numeric - decimal
82 MedYrHousBuilt median year housing units built numeric - decimal
83 PctHousNoPhone percent of occupied housing units without phone (in 1990, this was rare!) numeric - decimal
84 PctWOFullPlumb percent of housing without complete plumbing facilities numeric - decimal
85 OwnOccLowQuart owner occupied housing - lower quartile value numeric - decimal
86 OwnOccMedVal owner occupied housing - median value numeric - decimal
87 OwnOccHiQuart owner occupied housing - upper quartile value numeric - decimal
88 RentLowQ rental housing - lower quartile rent numeric - decimal
89 RentMedian rental housing - median rent (Census variable H32B from file STF1A) numeric - decimal
90 RentHighQ rental housing - upper quartile rent numeric - decimal
91 MedRent median gross rent (Census variable H43A from file STF3A - includes utilities) numeric - decimal
92 MedRentPctHousInc median gross rent as a percentage of household income numeric - decimal
93 MedOwnCostPctInc median owners cost as a percentage of household income - for owners with a mortgage numeric - decimal
94 MedOwnCostPctIncNoMtg median owners cost as a percentage of household income - for owners without a mortgage numeric - decimal
95 NumInShelters number of people in homeless shelters numeric - decimal
96 NumStreet number of homeless people counted in the street numeric - decimal
97 PctForeignBorn percent of people foreign born numeric - decimal
98 PctBornSameState percent of people born in the same state as currently living numeric - decimal
99 PctSameHouse85 percent of people living in the same house as in 1985 (5 years before) numeric - decimal
100 PctSameCity85 percent of people living in the same city as in 1985 (5 years before) numeric - decimal
101 PctSameState85 percent of people living in the same state as in 1985 (5 years before) numeric - decimal
102 LemasSwornFT number of sworn full time police officers numeric - decimal
103 LemasSwFTPerPop sworn full time police officers per 100K population numeric - decimal
104 LemasSwFTFieldOps number of sworn full time police officers in field operations (on the street as opposed to administrative etc) numeric - decimal
105 LemasSwFTFieldPerPop sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population numeric - decimal
106 LemasTotalReq total requests for police numeric - decimal
107 LemasTotReqPerPop total requests for police per 100K population numeric - decimal
108 PolicReqPerOffic total requests for police per police officer numeric - decimal
109 PolicPerPop police officers per 100K population numeric - decimal
110 RacialMatchCommPol a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar numeric - decimal
111 PctPolicWhite percent of police that are caucasian numeric - decimal
112 PctPolicBlack percent of police that are african american numeric - decimal
113 PctPolicHisp percent of police that are hispanic numeric - decimal
114 PctPolicAsian percent of police that are asian numeric - decimal
115 PctPolicMinor percent of police that are minority of any kind numeric - decimal
116 OfficAssgnDrugUnits number of officers assigned to special drug units numeric - decimal
117 NumKindsDrugsSeiz number of different kinds of drugs seized numeric - decimal
118 PolicAveOTWorked police average overtime worked numeric - decimal
119 LandArea land area in square miles numeric - decimal
120 PopDens population density in persons per square mile numeric - decimal
121 PctUsePubTrans percent of people using public transit for commuting numeric - decimal
122 PolicCars number of police cars numeric - decimal
123 PolicOperBudg police operating budget numeric - decimal
124 LemasPctPolicOnPatr percent of sworn full time police officers on patrol numeric - decimal
125 LemasGangUnitDeploy gang unit deployed numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time
126 LemasPctOfficDrugUn percent of officers assigned to drug units numeric - decimal
127 PolicBudgPerPop police operating budget per population numeric - decimal
128 ViolentCrimesPerPop total number of violent crimes per 100K population (numeric - decimal) GOAL attribute to be predicted

We also load the data set where each State, State code and State name are described.

state_code state stateName stateENS
1 AL Alabama 1779775
2 AK Alaska 1785533
4 AZ Arizona 1779777
5 AR Arkansas 68085
6 CA California 1779778
8 CO Colorado 1779779
9 CT Connecticut 1779780
10 DE Delaware 1779781
11 DC District of Columbia 1702382
12 FL Florida 294478
13 GA Georgia 1705317
15 HI Hawaii 1779782
16 ID Idaho 1779783
17 IL Illinois 1779784
18 IN Indiana 448508
19 IA Iowa 1779785
20 KS Kansas 481813
21 KY Kentucky 1779786
22 LA Louisiana 1629543
23 ME Maine 1779787
24 MD Maryland 1714934
25 MA Massachusetts 606926
26 MI Michigan 1779789
27 MN Minnesota 662849
28 MS Mississippi 1779790
29 MO Missouri 1779791
30 MT Montana 767982
31 NE Nebraska 1779792
32 NV Nevada 1779793
33 NH New Hampshire 1779794
34 NJ New Jersey 1779795
35 NM New Mexico 897535
36 NY New York 1779796
37 NC North Carolina 1027616
38 ND North Dakota 1779797
39 OH Ohio 1085497
40 OK Oklahoma 1102857
41 OR Oregon 1155107
42 PA Pennsylvania 1779798
44 RI Rhode Island 1219835
45 SC South Carolina 1779799
46 SD South Dakota 1785534
47 TN Tennessee 1325873
48 TX Texas 1779801
49 UT Utah 1455989
50 VT Vermont 1779802
51 VA Virginia 1779803
53 WA Washington 1779804
54 WV West Virginia 1779805
55 WI Wisconsin 1779806
56 WY Wyoming 1779807
60 AS American Samoa 1802701
66 GU Guam 1802705
69 MP Northern Mariana Islands 1779809
72 PR Puerto Rico 1779808
74 UM U.S. Minor Outlying Islands 1878752
78 VI U.S. Virgin Islands 1802710

In the following table, we can see, for each state and community, the population, the percentage of the population by age and race, the total number of violent crimes per 100K population, and other data that may be useful for our analysis.

state county community communityname fold population householdsize racepctblack racePctWhite racePctAsian racePctHisp agePct12t21 agePct12t29 agePct16t24 agePct65up numbUrban pctUrban medIncome pctWWage pctWFarmSelf pctWInvInc pctWSocSec pctWPubAsst pctWRetire medFamInc perCapInc whitePerCap blackPerCap indianPerCap AsianPerCap OtherPerCap HispPerCap NumUnderPov PctPopUnderPov PctLess9thGrade PctNotHSGrad PctBSorMore PctUnemployed PctEmploy PctEmplManu PctEmplProfServ PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr FemalePctDiv TotalPctDiv PersPerFam PctFam2Par PctKids2Par PctYoungKids2Par PctTeen2Par PctWorkMomYoungKids PctWorkMom NumIlleg PctIlleg NumImmig PctImmigRecent PctImmigRec5 PctImmigRec8 PctImmigRec10 PctRecentImmig PctRecImmig5 PctRecImmig8 PctRecImmig10 PctSpeakEnglOnly PctNotSpeakEnglWell PctLargHouseFam PctLargHouseOccup PersPerOccupHous PersPerOwnOccHous PersPerRentOccHous PctPersOwnOccup PctPersDenseHous PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded PctVacMore6Mos MedYrHousBuilt PctHousNoPhone PctWOFullPlumb OwnOccLowQuart OwnOccMedVal OwnOccHiQuart RentLowQ RentMedian RentHighQ MedRent MedRentPctHousInc MedOwnCostPctInc MedOwnCostPctIncNoMtg NumInShelters NumStreet PctForeignBorn PctBornSameState PctSameHouse85 PctSameCity85 PctSameState85 LemasSwornFT LemasSwFTPerPop LemasSwFTFieldOps LemasSwFTFieldPerPop LemasTotalReq LemasTotReqPerPop PolicReqPerOffic PolicPerPop RacialMatchCommPol PctPolicWhite PctPolicBlack PctPolicHisp PctPolicAsian PctPolicMinor OfficAssgnDrugUnits NumKindsDrugsSeiz PolicAveOTWorked LandArea PopDens PctUsePubTrans PolicCars PolicOperBudg LemasPctPolicOnPatr LemasGangUnitDeploy LemasPctOfficDrugUn PolicBudgPerPop ViolentCrimesPerPop
8 ? ? Lakewoodcity 1 0.19 0.33 0.02 0.90 0.12 0.17 0.34 0.47 0.29 0.32 0.20 1.0 0.37 0.72 0.34 0.60 0.29 0.15 0.43 0.39 0.40 0.39 0.32 0.27 0.27 0.36 0.41 0.08 0.19 0.10 0.18 0.48 0.27 0.68 0.23 0.41 0.25 0.52 0.68 0.40 0.75 0.75 0.35 0.55 0.59 0.61 0.56 0.74 0.76 0.04 0.14 0.03 0.24 0.27 0.37 0.39 0.07 0.07 0.08 0.08 0.89 0.06 0.14 0.13 0.33 0.39 0.28 0.55 0.09 0.51 0.5 0.21 0.71 0.52 0.05 0.26 0.65 0.14 0.06 0.22 0.19 0.18 0.36 0.35 0.38 0.34 0.38 0.46 0.25 0.04 0 0.12 0.42 0.50 0.51 0.64 0.03 0.13 0.96 0.17 0.06 0.18 0.44 0.13 0.94 0.93 0.03 0.07 0.1 0.07 0.02 0.57 0.29 0.12 0.26 0.20 0.06 0.04 0.9 0.5 0.32 0.14 0.20
53 ? ? Tukwilacity 1 0.00 0.16 0.12 0.74 0.45 0.07 0.26 0.59 0.35 0.27 0.02 1.0 0.31 0.72 0.11 0.45 0.25 0.29 0.39 0.29 0.37 0.38 0.33 0.16 0.30 0.22 0.35 0.01 0.24 0.14 0.24 0.30 0.27 0.73 0.57 0.15 0.42 0.36 1.00 0.63 0.91 1.00 0.29 0.43 0.47 0.60 0.39 0.46 0.53 0.00 0.24 0.01 0.52 0.62 0.64 0.63 0.25 0.27 0.25 0.23 0.84 0.10 0.16 0.10 0.17 0.29 0.17 0.26 0.20 0.82 0.0 0.02 0.79 0.24 0.02 0.25 0.65 0.16 0.00 0.21 0.20 0.21 0.42 0.38 0.40 0.37 0.29 0.32 0.18 0.00 0 0.21 0.50 0.34 0.60 0.52 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.02 0.12 0.45 ? ? ? ? 0.00 ? 0.67
24 ? ? Aberdeentown 1 0.00 0.42 0.49 0.56 0.17 0.04 0.39 0.47 0.28 0.32 0.00 0.0 0.30 0.58 0.19 0.39 0.38 0.40 0.84 0.28 0.27 0.29 0.27 0.07 0.29 0.28 0.39 0.01 0.27 0.27 0.43 0.19 0.36 0.58 0.32 0.29 0.49 0.32 0.63 0.41 0.71 0.70 0.45 0.42 0.44 0.43 0.43 0.71 0.67 0.01 0.46 0.00 0.07 0.06 0.15 0.19 0.02 0.02 0.04 0.05 0.88 0.04 0.20 0.20 0.46 0.52 0.43 0.42 0.15 0.51 0.5 0.01 0.86 0.41 0.29 0.30 0.52 0.47 0.45 0.18 0.17 0.16 0.27 0.29 0.27 0.31 0.48 0.39 0.28 0.00 0 0.14 0.49 0.54 0.67 0.56 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.01 0.21 0.02 ? ? ? ? 0.00 ? 0.43
34 5 81440 Willingborotownship 1 0.04 0.77 1.00 0.08 0.12 0.10 0.51 0.50 0.34 0.21 0.06 1.0 0.58 0.89 0.21 0.43 0.36 0.20 0.82 0.51 0.36 0.40 0.39 0.16 0.25 0.36 0.44 0.01 0.10 0.09 0.25 0.31 0.33 0.71 0.36 0.45 0.37 0.39 0.34 0.45 0.49 0.44 0.75 0.65 0.54 0.83 0.65 0.85 0.86 0.03 0.33 0.02 0.11 0.20 0.30 0.31 0.05 0.08 0.11 0.11 0.81 0.08 0.56 0.62 0.85 0.77 1.00 0.94 0.12 0.01 0.5 0.01 0.97 0.96 0.60 0.47 0.52 0.11 0.11 0.24 0.21 0.19 0.75 0.70 0.77 0.89 0.63 0.51 0.47 0.00 0 0.19 0.30 0.73 0.64 0.65 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.02 0.39 0.28 ? ? ? ? 0.00 ? 0.12
42 95 6096 Bethlehemtownship 1 0.01 0.55 0.02 0.95 0.09 0.05 0.38 0.38 0.23 0.36 0.02 0.9 0.50 0.72 0.16 0.68 0.44 0.11 0.71 0.46 0.43 0.41 0.28 0.00 0.74 0.51 0.48 0.00 0.06 0.25 0.30 0.33 0.12 0.65 0.67 0.38 0.42 0.46 0.22 0.27 0.20 0.21 0.51 0.91 0.91 0.89 0.85 0.40 0.60 0.00 0.06 0.00 0.03 0.07 0.20 0.27 0.01 0.02 0.04 0.05 0.88 0.05 0.16 0.19 0.59 0.60 0.37 0.89 0.02 0.19 0.5 0.01 0.89 0.87 0.04 0.55 0.73 0.05 0.14 0.31 0.31 0.30 0.40 0.36 0.38 0.38 0.22 0.51 0.21 0.00 0 0.11 0.72 0.64 0.61 0.53 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.04 0.09 0.02 ? ? ? ? 0.00 ? 0.03
6 ? ? SouthPasadenacity 1 0.02 0.28 0.06 0.54 1.00 0.25 0.31 0.48 0.27 0.37 0.04 1.0 0.52 0.68 0.20 0.61 0.28 0.15 0.25 0.62 0.72 0.76 0.77 0.28 0.52 0.48 0.60 0.01 0.12 0.13 0.12 0.80 0.10 0.65 0.19 0.77 0.06 0.91 0.49 0.57 0.61 0.58 0.44 0.62 0.69 0.87 0.53 0.30 0.43 0.00 0.11 0.04 0.30 0.35 0.43 0.47 0.50 0.50 0.56 0.57 0.45 0.28 0.25 0.19 0.29 0.53 0.18 0.39 0.26 0.73 0.0 0.02 0.84 0.30 0.16 0.28 0.25 0.02 0.05 0.94 1.00 1.00 0.67 0.63 0.68 0.62 0.47 0.59 0.11 0.00 0 0.70 0.42 0.49 0.73 0.64 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0.01 0.58 0.10 ? ? ? ? 0.00 ? 0.14

Preparation & Exploration

## [1] 1994  128

The dataset has 1994 observations and 128 variables. There are missing values in the dataset, encoded as ‘?’; we are going to convert them to ‘NA’ in order to carry out our analysis.
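A minimal sketch of that conversion, assuming the raw file is read with read.csv() and the data frame is called ‘dataset’ (the file name is an assumption):

# Read the raw file, treating the "?" placeholders as missing (file name is an assumption)
dataset <- read.csv("crimedata.csv", na.strings = "?")

# Equivalent post-hoc fix if the data are already loaded with "?" stored as text
dataset[dataset == "?"] <- NA
dataset <- dataset %>% mutate(across(-communityname, as.numeric))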

A summary of the variables is below:

state county community communityname fold population householdsize racepctblack racePctWhite racePctAsian racePctHisp agePct12t21 agePct12t29 agePct16t24 agePct65up numbUrban pctUrban medIncome pctWWage pctWFarmSelf pctWInvInc pctWSocSec pctWPubAsst pctWRetire medFamInc perCapInc whitePerCap blackPerCap indianPerCap AsianPerCap OtherPerCap HispPerCap NumUnderPov PctPopUnderPov PctLess9thGrade PctNotHSGrad PctBSorMore PctUnemployed PctEmploy PctEmplManu PctEmplProfServ PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr FemalePctDiv TotalPctDiv PersPerFam PctFam2Par PctKids2Par PctYoungKids2Par PctTeen2Par PctWorkMomYoungKids PctWorkMom NumIlleg PctIlleg NumImmig PctImmigRecent PctImmigRec5 PctImmigRec8 PctImmigRec10 PctRecentImmig PctRecImmig5 PctRecImmig8 PctRecImmig10 PctSpeakEnglOnly PctNotSpeakEnglWell PctLargHouseFam PctLargHouseOccup PersPerOccupHous PersPerOwnOccHous PersPerRentOccHous PctPersOwnOccup PctPersDenseHous PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded PctVacMore6Mos MedYrHousBuilt PctHousNoPhone PctWOFullPlumb OwnOccLowQuart OwnOccMedVal OwnOccHiQuart RentLowQ RentMedian RentHighQ MedRent MedRentPctHousInc MedOwnCostPctInc MedOwnCostPctIncNoMtg NumInShelters NumStreet PctForeignBorn PctBornSameState PctSameHouse85 PctSameCity85 PctSameState85 LemasSwornFT LemasSwFTPerPop LemasSwFTFieldOps LemasSwFTFieldPerPop LemasTotalReq LemasTotReqPerPop PolicReqPerOffic PolicPerPop RacialMatchCommPol PctPolicWhite PctPolicBlack PctPolicHisp PctPolicAsian PctPolicMinor OfficAssgnDrugUnits NumKindsDrugsSeiz PolicAveOTWorked LandArea PopDens PctUsePubTrans PolicCars PolicOperBudg LemasPctPolicOnPatr LemasGangUnitDeploy LemasPctOfficDrugUn PolicBudgPerPop ViolentCrimesPerPop
Min. : 1.00 Length:1994 Length:1994 Length:1994 Min. : 1.000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Length:1994 Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Length:1994 Min. :0.00000 Min. :0.0000 Min. :0.0000 Length:1994 Length:1994 Length:1994 Length:1994 Min. :0.00000 Length:1994 Min. :0.000
1st Qu.:12.00 Class :character Class :character Class :character 1st Qu.: 3.000 1st Qu.:0.01000 1st Qu.:0.3500 1st Qu.:0.0200 1st Qu.:0.6300 1st Qu.:0.0400 1st Qu.:0.010 1st Qu.:0.3400 1st Qu.:0.4100 1st Qu.:0.2500 1st Qu.:0.3000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.2000 1st Qu.:0.4400 1st Qu.:0.1600 1st Qu.:0.3700 1st Qu.:0.3500 1st Qu.:0.1425 1st Qu.:0.3600 1st Qu.:0.2300 1st Qu.:0.2200 1st Qu.:0.240 1st Qu.:0.1725 1st Qu.:0.1100 1st Qu.:0.1900 Class :character 1st Qu.:0.2600 1st Qu.:0.01000 1st Qu.:0.110 1st Qu.:0.1600 1st Qu.:0.2300 1st Qu.:0.2100 1st Qu.:0.2200 1st Qu.:0.3800 1st Qu.:0.2500 1st Qu.:0.3200 1st Qu.:0.2400 1st Qu.:0.3100 1st Qu.:0.3300 1st Qu.:0.3100 1st Qu.:0.3600 1st Qu.:0.3600 1st Qu.:0.4000 1st Qu.:0.4900 1st Qu.:0.4900 1st Qu.:0.530 1st Qu.:0.4800 1st Qu.:0.3900 1st Qu.:0.4200 1st Qu.:0.00000 1st Qu.:0.09 1st Qu.:0.00000 1st Qu.:0.1600 1st Qu.:0.2000 1st Qu.:0.2500 1st Qu.:0.2800 1st Qu.:0.0300 1st Qu.:0.0300 1st Qu.:0.0300 1st Qu.:0.0300 1st Qu.:0.7300 1st Qu.:0.0300 1st Qu.:0.1500 1st Qu.:0.1400 1st Qu.:0.3400 1st Qu.:0.3900 1st Qu.:0.2700 1st Qu.:0.4400 1st Qu.:0.0600 1st Qu.:0.4000 1st Qu.:0.0000 1st Qu.:0.01000 1st Qu.:0.6300 1st Qu.:0.4300 1st Qu.:0.0600 1st Qu.:0.2900 1st Qu.:0.3500 1st Qu.:0.0600 1st Qu.:0.1000 1st Qu.:0.0900 1st Qu.:0.0900 1st Qu.:0.0900 1st Qu.:0.1700 1st Qu.:0.2000 1st Qu.:0.220 1st Qu.:0.2100 1st Qu.:0.3700 1st Qu.:0.3200 1st Qu.:0.2500 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0600 1st Qu.:0.4700 1st Qu.:0.4200 1st Qu.:0.5200 1st Qu.:0.5600 Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.:0.02000 1st Qu.:0.1000 1st Qu.:0.0200 Class :character Class :character Class :character Class :character 1st Qu.:0.00000 Class :character 1st Qu.:0.070
Median :34.00 Mode :character Mode :character Mode :character Median : 5.000 Median :0.02000 Median :0.4400 Median :0.0600 Median :0.8500 Median :0.0700 Median :0.040 Median :0.4000 Median :0.4800 Median :0.2900 Median :0.4200 Median :0.03000 Median :1.0000 Median :0.3200 Median :0.5600 Median :0.2300 Median :0.4800 Median :0.4750 Median :0.2600 Median :0.4700 Median :0.3300 Median :0.3000 Median :0.320 Median :0.2500 Median :0.1700 Median :0.2800 Mode :character Median :0.3450 Median :0.02000 Median :0.250 Median :0.2700 Median :0.3600 Median :0.3100 Median :0.3200 Median :0.5100 Median :0.3700 Median :0.4100 Median :0.3700 Median :0.4000 Median :0.4700 Median :0.4000 Median :0.5000 Median :0.5000 Median :0.4700 Median :0.6300 Median :0.6400 Median :0.700 Median :0.6100 Median :0.5100 Median :0.5400 Median :0.01000 Median :0.17 Median :0.01000 Median :0.2900 Median :0.3400 Median :0.3900 Median :0.4300 Median :0.0900 Median :0.0800 Median :0.0900 Median :0.0900 Median :0.8700 Median :0.0600 Median :0.2000 Median :0.1900 Median :0.4400 Median :0.4800 Median :0.3600 Median :0.5600 Median :0.1100 Median :0.5100 Median :0.5000 Median :0.03000 Median :0.7700 Median :0.5400 Median :0.1300 Median :0.4200 Median :0.5200 Median :0.1850 Median :0.1900 Median :0.1800 Median :0.1700 Median :0.1800 Median :0.3100 Median :0.3300 Median :0.370 Median :0.3400 Median :0.4800 Median :0.4500 Median :0.3700 Median :0.00000 Median :0.00000 Median :0.1300 Median :0.6300 Median :0.5400 Median :0.6700 Median :0.7000 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median :0.04000 Median :0.1700 Median :0.0700 Mode :character Mode :character Mode :character Mode :character Median :0.00000 Mode :character Median :0.150
Mean :28.68 NA NA NA Mean : 5.494 Mean :0.05759 Mean :0.4634 Mean :0.1796 Mean :0.7537 Mean :0.1537 Mean :0.144 Mean :0.4242 Mean :0.4939 Mean :0.3363 Mean :0.4232 Mean :0.06407 Mean :0.6963 Mean :0.3611 Mean :0.5582 Mean :0.2916 Mean :0.4957 Mean :0.4711 Mean :0.3178 Mean :0.4792 Mean :0.3757 Mean :0.3503 Mean :0.368 Mean :0.2911 Mean :0.2035 Mean :0.3224 NA Mean :0.3863 Mean :0.05551 Mean :0.303 Mean :0.3158 Mean :0.3833 Mean :0.3617 Mean :0.3635 Mean :0.5011 Mean :0.3964 Mean :0.4406 Mean :0.3912 Mean :0.4413 Mean :0.4612 Mean :0.4345 Mean :0.4876 Mean :0.4943 Mean :0.4877 Mean :0.6109 Mean :0.6207 Mean :0.664 Mean :0.5829 Mean :0.5014 Mean :0.5267 Mean :0.03629 Mean :0.25 Mean :0.03006 Mean :0.3202 Mean :0.3606 Mean :0.3991 Mean :0.4279 Mean :0.1814 Mean :0.1821 Mean :0.1848 Mean :0.1829 Mean :0.7859 Mean :0.1506 Mean :0.2676 Mean :0.2519 Mean :0.4621 Mean :0.4944 Mean :0.4041 Mean :0.5626 Mean :0.1863 Mean :0.4952 Mean :0.3147 Mean :0.07682 Mean :0.7195 Mean :0.5487 Mean :0.2045 Mean :0.4333 Mean :0.4942 Mean :0.2645 Mean :0.2431 Mean :0.2647 Mean :0.2635 Mean :0.2689 Mean :0.3464 Mean :0.3725 Mean :0.423 Mean :0.3841 Mean :0.4901 Mean :0.4498 Mean :0.4038 Mean :0.02944 Mean :0.02278 Mean :0.2156 Mean :0.6089 Mean :0.5351 Mean :0.6264 Mean :0.6515 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Mean :0.06523 Mean :0.2329 Mean :0.1617 NA NA NA NA Mean :0.09405 NA Mean :0.238
3rd Qu.:42.00 NA NA NA 3rd Qu.: 8.000 3rd Qu.:0.05000 3rd Qu.:0.5400 3rd Qu.:0.2300 3rd Qu.:0.9400 3rd Qu.:0.1700 3rd Qu.:0.160 3rd Qu.:0.4700 3rd Qu.:0.5400 3rd Qu.:0.3600 3rd Qu.:0.5300 3rd Qu.:0.07000 3rd Qu.:1.0000 3rd Qu.:0.4900 3rd Qu.:0.6900 3rd Qu.:0.3700 3rd Qu.:0.6200 3rd Qu.:0.5800 3rd Qu.:0.4400 3rd Qu.:0.5800 3rd Qu.:0.4800 3rd Qu.:0.4300 3rd Qu.:0.440 3rd Qu.:0.3800 3rd Qu.:0.2500 3rd Qu.:0.4000 NA 3rd Qu.:0.4800 3rd Qu.:0.05000 3rd Qu.:0.450 3rd Qu.:0.4200 3rd Qu.:0.5100 3rd Qu.:0.4600 3rd Qu.:0.4800 3rd Qu.:0.6275 3rd Qu.:0.5200 3rd Qu.:0.5300 3rd Qu.:0.5100 3rd Qu.:0.5400 3rd Qu.:0.5900 3rd Qu.:0.5000 3rd Qu.:0.6200 3rd Qu.:0.6300 3rd Qu.:0.5600 3rd Qu.:0.7600 3rd Qu.:0.7800 3rd Qu.:0.840 3rd Qu.:0.7200 3rd Qu.:0.6200 3rd Qu.:0.6500 3rd Qu.:0.02000 3rd Qu.:0.32 3rd Qu.:0.02000 3rd Qu.:0.4300 3rd Qu.:0.4800 3rd Qu.:0.5300 3rd Qu.:0.5600 3rd Qu.:0.2300 3rd Qu.:0.2300 3rd Qu.:0.2300 3rd Qu.:0.2300 3rd Qu.:0.9400 3rd Qu.:0.1600 3rd Qu.:0.3100 3rd Qu.:0.2900 3rd Qu.:0.5500 3rd Qu.:0.5800 3rd Qu.:0.4900 3rd Qu.:0.7000 3rd Qu.:0.2200 3rd Qu.:0.6000 3rd Qu.:0.5000 3rd Qu.:0.07000 3rd Qu.:0.8600 3rd Qu.:0.6700 3rd Qu.:0.2700 3rd Qu.:0.5600 3rd Qu.:0.6700 3rd Qu.:0.4200 3rd Qu.:0.3300 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.:0.3800 3rd Qu.:0.4900 3rd Qu.:0.5200 3rd Qu.:0.590 3rd Qu.:0.5300 3rd Qu.:0.5900 3rd Qu.:0.5800 3rd Qu.:0.5100 3rd Qu.:0.01000 3rd Qu.:0.00000 3rd Qu.:0.2800 3rd Qu.:0.7775 3rd Qu.:0.6600 3rd Qu.:0.7700 3rd Qu.:0.7900 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3rd Qu.:0.07000 3rd Qu.:0.2800 3rd Qu.:0.1900 NA NA NA NA 3rd Qu.:0.00000 NA 3rd Qu.:0.330
Max. :56.00 NA NA NA Max. :10.000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 NA Max. :1.0000 Max. :1.00000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Max. :1.00000 Max. :1.0000 Max. :1.0000 NA NA NA NA Max. :1.00000 NA Max. :1.000

For the variables that have missing values, we will perform the analysis to identify and apply the appropriate imputation technique.

We are going to review how many missing values we have for each attribute.
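A sketch of how the per-attribute summary below can be computed (the output column names follow the table):

# Count and percentage of missing values per attribute, sorted in descending order
na_summary <- dataset %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "NA_Count") %>%
  mutate(NA_Percentage = 100 * NA_Count / nrow(dataset)) %>%
  filter(NA_Count > 0) %>%
  arrange(desc(NA_Count))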

Feature NA_Count NA_Percentage
LemasSwornFT 1675 84.0020060
LemasSwFTPerPop 1675 84.0020060
LemasSwFTFieldOps 1675 84.0020060
LemasSwFTFieldPerPop 1675 84.0020060
LemasTotalReq 1675 84.0020060
LemasTotReqPerPop 1675 84.0020060
PolicReqPerOffic 1675 84.0020060
PolicPerPop 1675 84.0020060
RacialMatchCommPol 1675 84.0020060
PctPolicWhite 1675 84.0020060
PctPolicBlack 1675 84.0020060
PctPolicHisp 1675 84.0020060
PctPolicAsian 1675 84.0020060
PctPolicMinor 1675 84.0020060
OfficAssgnDrugUnits 1675 84.0020060
NumKindsDrugsSeiz 1675 84.0020060
PolicAveOTWorked 1675 84.0020060
PolicCars 1675 84.0020060
PolicOperBudg 1675 84.0020060
LemasPctPolicOnPatr 1675 84.0020060
LemasGangUnitDeploy 1675 84.0020060
PolicBudgPerPop 1675 84.0020060
community 1177 59.0270812
county 1174 58.8766299
OtherPerCap 1 0.0501505

Let’s graph the amount of missing data:

Reviewing the variables that have missing data, we find that two variables (county and community) are missing in approximately 59% of the rows and 22 variables are missing in 84% of the rows. Variables with this amount of missing data are not useful for our analysis, so we will remove them.

We also remove the fold variable, as it was placed only for cross-validation and is not predictive.

We are going to identify the degenerate variables and remove the degenerate variables from the data set for our analysis.
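A sketch of this step using caret’s near-zero-variance filter (an assumption about how the variable below was identified):

# Identify degenerate (near-zero-variance) predictors with caret
degenerate_vars <- nearZeroVar(dataset, names = TRUE)
degenerate_vars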

## [1] "LemasPctOfficDrugUn"

We are going to eliminate the variable “LemasPctOfficDrugUn” - percent of officers assigned to drug units. We are also going to remove from the dataset the variable ‘OtherPerCap’ - per capita income for people with ‘other’ heritage.

## [1] 1993  102

Our dataset now has 102 variables after removing the variables we mentioned above.

Exploratory Data Analysis

Let’s join the crime dataset with the state lookup table by state code.
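A sketch of the join, assuming the state lookup table is stored as ‘states’ and that its numeric state_code matches the crime data’s state column:

# Attach the state abbreviation and name to every community record
dataset <- dataset %>%
  left_join(states %>% rename(state_abbr = state),
            by = c("state" = "state_code"))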

We are going to start with the analysis of the data for our project:

Violent Crime Rate Analysis

To obtain a violent crime rate per state, we compute the average of the violent crime rate across the communities in each state.
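A sketch of that aggregation, assuming the joined data carries the state_abbr column added above (the output column names follow the table below):

# Average violent crime rate per state, ranked from highest to lowest
crime_by_state <- dataset %>%
  group_by(state_abbr) %>%
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop, na.rm = TRUE)) %>%
  arrange(desc(AvgCrimesRate)) %>%
  mutate(rank = row_number())

head(crime_by_state, 10)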

In the following table we can see the 10 states with the highest rate of violent crimes:

state_abbr AvgCrimesRate rank
DC 1.0000000 1
LA 0.5045455 2
SC 0.4867857 3
MD 0.4800000 4
FL 0.4583333 5
NC 0.4019565 6
AL 0.3937209 7
GA 0.3840541 8
DE 0.3700000 9
KS 0.3600000 10

In first place we have the District of Columbia, in second place we have Louisiana and in third place South Carolina.

Violent Crime Rates by State:

Black population Analysis

In the following table we can see the states with the highest percentage of African-American population.

state_abbr AvgBlackPopRate rank
DC 1.0000000 1
MS 0.6900000 2
GA 0.6608108 3
LA 0.6245455 4
DE 0.6000000 5
NC 0.5580435 6
SC 0.5339286 7
AL 0.4702326 8
MD 0.4466667 9
VA 0.3972727 10

In first place we have the District of Columbia, in second place we have Mississippi and in third place Georgia.

Percentage of African-American population by state:

We can see that the District of Columbia ranks first in both crime rate and the percentage of African-American people. We might think that there is a relationship between these two aspects.

The crime rate in a state is not necessarily related to the perception of race based on the percentage of the population of a particular race. While the crime rate may be higher in some urban areas than in others, there is no evidence to suggest that the crime rate is related to the race of the population.

It is important to note that the perception of race in the United States has historically been influenced by a variety of factors, including education, the media, and politics. Although crime can be a factor in influencing racial perceptions, it is only one of many factors that can influence how race is perceived in a given area.

Most dangerous communities

We review which communities have the highest rates of violent crime.

Word Cloud of Communities with the highest crime rates:

The city of Camden in the state of New Jersey occupies first place with the highest rate of violent crimes. Within the top 10 communities with the highest violent crime rates, three communities are located in the state of Alabama.

Correlation

To select the best variables for our analysis, we measure the correlation of each independent variable with the dependent variable and eliminate the variables with low correlation to it. Specifically, we eliminate the variables whose absolute correlation with the dependent variable is less than 0.25.

Finding variables with low correlation to dependent variable:
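A sketch of the filter, using the 0.25 threshold stated above (object names are assumptions):

# Correlation of every numeric predictor with the target
numeric_vars <- dataset %>% select(where(is.numeric)) %>% select(-ViolentCrimesPerPop)
cors <- sapply(numeric_vars, cor, y = dataset$ViolentCrimesPerPop, use = "pairwise.complete.obs")

# Keep only the predictors whose absolute correlation with the target is at least 0.25
keep <- names(cors)[abs(cors) >= 0.25]
dataset <- dataset %>% select(all_of(keep), ViolentCrimesPerPop)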

## [1] 1993   54

We now have 54 variables.

Multi-collinearity

Let’s check for multicollinearity. We selected pairs of independent variables with absolute correlations greater than 0.9.

We then use the variance inflation factor (VIF) to eliminate the multicollinear variables with the highest inflation factors.
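A sketch of both steps, using a simple pair search for the 0.9 threshold and car’s vif() on a full linear model for the inflation factors; this is a sketch under those assumptions, not necessarily the exact code used:

# Pairs of predictors with absolute correlation above 0.9 (row/column indices)
predictors <- dataset %>% select(where(is.numeric)) %>% select(-ViolentCrimesPerPop)
corMat <- cor(predictors, use = "pairwise.complete.obs")
highCorrPairs <- which(abs(corMat) > 0.9 & upper.tri(corMat), arr.ind = TRUE)

# Variance inflation factors from a linear model on all remaining variables
vif_scores <- vif(lm(ViolentCrimesPerPop ~ ., data = dataset %>% select(where(is.numeric))))
sort(vif_scores, decreasing = TRUE)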

row column
PctRecImmig8 PctRecImmig10
population numbUrban
PctFam2Par PctKids2Par
PctLargHouseFam PctLargHouseOccup
FemalePctDiv TotalPctDiv
PctPersOwnOccup PctHousOwnOcc
medIncome medFamInc
MalePctDivorce TotalPctDiv
PctBSorMore PctOccupMgmtProf
population NumUnderPov
PctLess9thGrade PctNotHSGrad
NumUnderPov NumIlleg
medFamInc perCapInc
numbUrban NumUnderPov
PctFam2Par PctYoungKids2Par
PctKids2Par PctYoungKids2Par
MalePctDivorce FemalePctDiv
PctFam2Par PctTeen2Par
PctKids2Par PctTeen2Par
VIF_Score
TotalPctDiv 928.212429
FemalePctDiv 295.188771
MalePctDivorce 203.199697
PctRecImmig10 165.892005
PctRecImmig8 147.580998
PctLargHouseOccup 129.488362
PctLargHouseFam 124.023935
population 117.672823
PctFam2Par 103.736936
numbUrban 101.604576
medIncome 98.892368
PctKids2Par 97.813837
medFamInc 95.037614
PctPersOwnOccup 78.204266
PctHousOwnOcc 76.714946
PctNotHSGrad 37.715581
NumUnderPov 33.986232
perCapInc 27.944187
PctBSorMore 25.418161
PctPersDenseHous 23.163968
PctOccupMgmtProf 23.097460
PctLess9thGrade 21.029438
PctNotSpeakEnglWell 19.853248
PctPopUnderPov 18.129689
racePctWhite 15.166439
racepctblack 14.945811
NumIlleg 14.432404
pctWWage 13.271613
HousVacant 12.765602
PctYoungKids2Par 12.151434
pctWInvInc 12.081673
PctEmploy 11.608310
PctIlleg 11.490275
racePctHisp 9.970625
pctWPubAsst 9.197796
PctHousLess3BR 8.210647
RentLowQ 8.141949
PctHousNoPhone 7.424820
PctTeen2Par 7.264545
PctOccupManu 6.099481
PctUnemployed 6.099450
MalePctNevMarr 5.330162
NumImmig 5.007225
NumInShelters 4.710435
PctHousOccup 3.543461
PopDens 2.833126
MedNumBR 2.657976
NumStreet 2.473209
MedRentPctHousInc 2.470435
PctImmigRec10 2.372429
PctVacantBoarded 2.096964
blackPerCap 2.028969
PctWOFullPlumb 1.847284
## [1] 1993   40

We now have 40 variables.

Correlation Plots

We are going to create correlation plots to evaluate the most important variables in predicting violent crime rates.

# Correlation matrix of a subset of the remaining predictors (columns 20-39) and the target (column 40)
corrMatrix <- round(cor(dataset[, c(20:39, 40)]), 4)

# Lower-triangle correlation plot, clustered, with coefficients printed in black
corrMatrix %>% corrplot(method = "color",
         type = "lower", order = "hclust",
         addCoef.col = "black",
         tl.col = "blue", tl.srt = 45,
         sig.level = 0.01, insig = "blank",
         diag = FALSE, number.cex = 0.8)

We can observe that some of the variables that influence violent crime rates are:

racepctblack: percentage of population that is african american
MalePctDivorce: percentage of males who are divorced
PctPopUnderPov: percentage of people under the poverty level
pctWPubAsst: percentage of households with public assistance income in 1989
PctUnemployed: percentage of people 16 and over, in the labor force, and unemployed
PctLess9thGrade: percentage of people 25 and over with less than a 9th grade education
PctOccupManu: percentage of people 16 and over who are employed in manufacturing
PctHousNoPhone: percent of occupied housing units without phone (in 1990, this was rare!)
PctHousLess3BR: percent of housing units with less than 3 bedrooms

Training and Test Partition of data

Let’s partition the data. We will use 75% of the data for training and 25% for validation.

# Split the data on the target: 75% training, 25% validation (caTools::sample.split)
sample = sample.split(dataset$ViolentCrimesPerPop, SplitRatio = 0.75)
train = subset(dataset, sample == TRUE) %>% as.matrix()
test = subset(dataset, sample == FALSE) %>% as.matrix()

# Column 40 is the target (ViolentCrimesPerPop); the remaining 39 columns are predictors
y_train <- train[,40]
y_test <- test[,40]
X_train <- train[,-40]
X_test <- test[,-40]

Modeling

Ridge Regression

Ridge regression is a regularization technique used in statistical modeling to address the problem of multicollinearity in data. It is used when there is a high degree of correlation between the independent variables in a multiple linear regression model, which can lead to unstable estimates of model coefficients and unreliable prediction.

Ridge regression works by adding a regularization term to the model’s objective function, which penalizes the extreme values of the model’s coefficients and reduces their magnitude. This helps to reduce the variance of the model and improve its generalizability to new data.

Now we’ll use the cv.glmnet() function to fit the ridge regression model, specifying alpha=0.

set.seed(1801)
fit.ridge <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), alpha = 0, type.measure = "mse", family="gaussian")

To choose the value of lambda, we use s=“lambda.min”, the value of lambda that minimizes the cross-validated MSE.

set.seed(1802)
fitted.ridge.train <- predict(fit.ridge, newx = data.matrix(X_train), s="lambda.min")
fitted.ridge.test <- predict(fit.ridge, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.ridge.train)[1])
## Train coefficient:  0.8163905
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.ridge.test)[1])
## 
## Test coefficient:  0.7438792

The train coefficient is 0.816 and the test coefficient is 0.744.

Prediction Performance
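Mirroring the LASSO block further below, a sketch of the same performance summary for the ridge fit:

# RMSE, R-squared and MAE of the ridge predictions on the hold-out set
ridgePredPerf <- postResample(pred = fitted.ridge.test, obs = y_test)
ridgePredPerf['Family'] <- 'Linear Regression'
ridgePredPerf['Model'] <- 'Ridge Regression'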

LASSO Regression

LASSO Regression (Least Absolute Shrinkage and Selection Operator) is another regularization technique used in statistical modeling to address the problem of multicollinearity in data and to perform variable selection.

LASSO regression is used when there are a large number of predictor variables in a multiple linear regression model and you want to reduce the complexity of the model by removing irrelevant or less important variables in predicting the response variable.

LASSO regression shrinks the coefficients of some variables exactly to zero, allowing you to identify the most important variables for prediction. It is used in situations where there are a large number of predictor variables and a simpler, more easily interpretable linear regression model is needed.

Now we’ll use the cv.glmnet() function to fit the LASSO regression model, specifying alpha=1.

set.seed(1803)
fit.lasso <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=1, family="gaussian")

Again, to choose the value of lambda we use s=“lambda.min”.

set.seed(1804)
fitted.lasso.train <- predict(fit.lasso, newx = data.matrix(X_train), s="lambda.min")
fitted.lasso.test <- predict(fit.lasso, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.lasso.train)[1])
## Train coefficient:  0.8172372
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.lasso.test)[1])
## 
## Test coefficient:  0.7475433

The train coefficient is 0.817 and the test coefficient is 0.748.

Prediction Performance

lassoPredPerf <- postResample(pred = fitted.lasso.test , obs = y_test)
lassoPredPerf['Family'] <- 'Linear Regression'
lassoPredPerf['Model'] <- 'Lasso Regression'

Elastic Net Regression

Elastic Net Regression is used in situations where there is a high correlation between the predictor variables and a more stable linear regression model with variable selection is needed, taking advantage of the properties of Ridge regression and LASSO regression.

set.seed(1805)
fit.elnet <- glmnet(as.matrix(X_train), as.matrix(y_train), family="gaussian", alpha=.5)
fit.elnet.cv <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=.5,
                          family="gaussian")
fitted.elnet.train <- predict(fit.elnet.cv, newx = data.matrix(X_train), s="lambda.min")
fitted.elnet.test <- predict(fit.elnet.cv, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.elnet.train)[1])
## Train coefficient:  0.8171584
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.elnet.test)[1])
## 
## Test coefficient:  0.7475421

The train coefficient is 0.817 and the test coefficient is 0.748.

Prediction Performance

Plot MSE
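A sketch of the MSE plot; cv.glmnet objects have a built-in plot method that shows the cross-validated MSE as a function of log(lambda):

# Cross-validated MSE curves for the three penalized fits
par(mfrow = c(1, 3))
plot(fit.ridge)
plot(fit.lasso)
plot(fit.elnet.cv)
par(mfrow = c(1, 1))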

Non Linear Regression Model

Nonlinear regression is used when the relationship between the response variable and the predictor variables in a regression model cannot be modeled by a linear function. In this case, a nonlinear function is needed to describe the relationship between the variables.

For this analysis we will use: Neural Networks, KNN (K-Nearest Neighbors), SVM (Support Vector Machines), and MARS (Multivariate Adaptive Regression Splines).

Neural Networks

Neural Networks consist of layers of artificial neurons that process input information and generate output. Each neuron receives an input, performs a nonlinear transformation, and produces an output that is used as input to the next layer of neurons. The final output of the network is used to predict the output variable.

Let’s use a neural network with 4 hidden units:
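The exact call is not shown in the report; a hedged sketch of how a model matching the printed summary (a 39-4-1 network, 4 repeats, decay 0.01, linear output units) could be fit with caret:

set.seed(1806)                                  # seed value is an assumption
fit.nnet <- train(
  x = X_train, y = y_train,
  method    = "avNNet",                         # model-averaged neural network
  tuneGrid  = expand.grid(size = 4, decay = 0.01, bag = FALSE),
  repeats   = 4,                                # 4 averaged repeats, as in the summary
  linout    = TRUE,                             # linear output units for regression
  trace     = FALSE,
  trControl = trainControl(method = "cv", number = 10)
)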

## Model Averaged Neural Network with 4 Repeats  
## 
## a 39-4-1 network with 165 weights
## options were - linear output units  decay=0.01

Variable Importance

Overall
X381 6.7398694
X51 6.3103055
X61 6.2222431
X203 5.9591480
X6 5.8826213
X201 5.5498754
X263 5.4310372
X5 5.1515020
X382 5.0220359
X43 4.8912471
X20 4.8783533
X112 4.8140426
X1 4.7783263
X110 4.7690152
X63 4.7159701
X122 4.5094830
X162 4.3341217
X35 4.2289551
X52 4.1469174
X292 4.1228266
X314 4.0869773
X131 3.9967216
X24 3.9724875
X261 3.9636423
X152 3.9390688
X42 3.9161420
X132 3.8229929
X16 3.8053785
X352 3.8035943
X3 3.7820565
X115 3.7322764
X262 3.6579088
X19 3.6429910
X373 3.6126446
X53 3.6026097
X353 3.5938262
X153 3.4994534
X363 3.4346810
X253 3.4101175
X41 3.3460698
X29 3.3287481
X133 3.2731705
X114 3.2546826
X192 3.2128204
X23 3.2027186
X111 3.1942049
X273 3.1761039
X272 3.1058894
X242 3.0439990
X291 3.0126065
X11 3.0082926
X333 2.9858225
X4 2.9815636
X151 2.9397707
X13 2.9268070
X231 2.9239169
X313 2.8951252
X62 2.7860604
X351 2.7686765
X241 2.7621327
X310 2.7524022
X283 2.6893620
X38 2.6521169
X15 2.6489619
X372 2.6380148
X26 2.6321252
X103 2.6213772
X73 2.6080194
X18 2.5629981
X312 2.5572842
X331 2.5019254
X123 2.4748309
X22 2.4552364
X252 2.4517814
X282 2.4235741
X251 2.3844613
X302 2.3809152
X10 2.3107638
X202 2.3019756
X72 2.2820524
X362 2.2604559
X243 2.2578241
X82 2.2283661
X32 2.2205998
X221 2.2081898
X271 2.1931332
X33 2.1907483
X71 2.1906483
X12 2.1723410
X383 2.1610762
X113 2.0981033
X392 2.0904249
X14 2.0831274
X101 2.0732813
X163 2.0365666
X183 2.0361661
X371 2.0226587
X343 2.0117824
X27 1.9930812
X7 1.9908562
X181 1.9892146
X393 1.9765550
X17 1.9303000
X322 1.8715190
X171 1.8714280
X212 1.8527905
X36 1.8132621
X8 1.7472199
X141 1.7260972
X222 1.7246995
X91 1.6764881
X161 1.6482376
X172 1.6109991
X25 1.5993161
X34 1.5983855
X293 1.5953200
X37 1.5940642
X321 1.4812944
X121 1.4779253
X191 1.4501799
X214 1.3917492
X311 1.3491091
X281 1.3386165
X9 1.3281336
X233 1.3232861
X361 1.2996879
X301 1.2184606
X323 1.2016866
X213 1.1748907
X93 1.1346396
X215 1.1077592
X341 1.1071128
X143 1.0938873
X211 1.0883222
X2 1.0867673
X173 1.0835380
X315 1.0732571
X31 1.0189325
X28 0.9983832
X21 0.9924483
X223 0.9916175
X303 0.9913695
X332 0.9855879
X102 0.9755889
X81 0.9578084
X210 0.8946268
X232 0.8433051
X83 0.7853355
X193 0.6932265
X92 0.6482776
X391 0.5996399
X142 0.5815349
X182 0.5648157
X30 0.5363145
X342 0.3200149
X39 0.2727154

Prediction Performance

KNN model

The KNN model (K-Nearest Neighbors) is used in supervised machine learning for classification and regression. The KNN model is used to predict the class or value of a new data instance based on the characteristics or attributes of the nearest neighboring instances in the training set.
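A sketch consistent with the summary below (centering and scaling, 10-fold cross-validation, k tuned from 5 to 23); the exact call is an assumption:

set.seed(1807)                                  # seed value is an assumption
fit.knn <- train(
  x = X_train, y = y_train,
  method     = "knn",
  preProcess = c("center", "scale"),
  tuneLength = 10,                              # evaluates k = 5, 7, ..., 23
  trControl  = trainControl(method = "cv", number = 10)
)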

## k-Nearest Neighbors 
## 
## 1499 samples
##   39 predictor
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1349, 1350, 1349, 1350, 1348, 1350, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE       
##    5  0.1470650  0.6164570  0.10158827
##    7  0.1439163  0.6323808  0.09915134
##    9  0.1443167  0.6312507  0.09954053
##   11  0.1431314  0.6398771  0.09805782
##   13  0.1433472  0.6397882  0.09817429
##   15  0.1434775  0.6403601  0.09847477
##   17  0.1433104  0.6436613  0.09814664
##   19  0.1431045  0.6460178  0.09809048
##   21  0.1433757  0.6459280  0.09837959
##   23  0.1427781  0.6499200  0.09771248
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 23.

KNN Plot

Variable Importance

Overall
PctIlleg 100.000000
racePctWhite 84.463999
PctTeen2Par 78.070630
PctYoungKids2Par 76.886371
racepctblack 67.653183
pctWPubAsst 58.263375
pctWInvInc 56.862002
PctPopUnderPov 43.912891
MalePctDivorce 42.763202
PctUnemployed 42.212224
NumIlleg 41.200287
PctVacantBoarded 37.308768
PctHousLess3BR 35.571028
PctHousOwnOcc 35.007781
PctHousNoPhone 34.742189
PctPersDenseHous 33.972668
HousVacant 29.811069
PctLess9thGrade 24.391977
PctLargHouseFam 20.652709
NumInShelters 20.190012
PctWOFullPlumb 15.783587
NumStreet 14.856954
perCapInc 13.844531
MedNumBR 13.694854
PctOccupMgmtProf 11.999077
MedRentPctHousInc 10.471355
PctEmploy 10.447415
PctNotSpeakEnglWell 10.337888
racePctHisp 8.862438
MalePctNevMarr 8.618872
PopDens 7.875297
NumImmig 7.649018
pctWWage 6.623132
PctOccupManu 6.297929
PctHousOccup 6.259769
PctImmigRec10 4.935118
blackPerCap 3.242936
PctRecImmig8 3.227065
RentLowQ 0.000000

Prediction Performance

SVM model

The SVM model (Support Vector Machine) is used in supervised machine learning for classification and regression. It is particularly useful when the data is linearly or nearly linearly separable in the feature space, and it is very efficient at classifying high-dimensional data and small to medium data sets.
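A sketch consistent with the summary below (radial kernel, centering and scaling, 10-fold cross-validation, C tuned from 0.25 to 128); the exact call is an assumption:

set.seed(1808)                                  # seed value is an assumption
fit.svm <- train(
  x = X_train, y = y_train,
  method     = "svmRadial",                     # radial basis function kernel
  preProcess = c("center", "scale"),
  tuneLength = 10,                              # C = 0.25, 0.5, ..., 128; sigma estimated once
  trControl  = trainControl(method = "cv", number = 10)
)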

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1499 samples
##   39 predictor
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1348, 1349, 1350, 1348, 1349, 1350, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE       
##     0.25  0.1496910  0.6267657  0.09824670
##     0.50  0.1439197  0.6428549  0.09545459
##     1.00  0.1404450  0.6500871  0.09391998
##     2.00  0.1397640  0.6489056  0.09407907
##     4.00  0.1420422  0.6383513  0.09634762
##     8.00  0.1446649  0.6268358  0.09911138
##    16.00  0.1487816  0.6078767  0.10298394
##    32.00  0.1540362  0.5845711  0.10856409
##    64.00  0.1603825  0.5586436  0.11452927
##   128.00  0.1638883  0.5461915  0.11822571
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02599365
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02599365 and C = 2.

Variable Importance

Overall
PctIlleg 100.000000
racePctWhite 84.463999
PctTeen2Par 78.070630
PctYoungKids2Par 76.886371
racepctblack 67.653183
pctWPubAsst 58.263375
pctWInvInc 56.862002
PctPopUnderPov 43.912891
MalePctDivorce 42.763202
PctUnemployed 42.212224
NumIlleg 41.200287
PctVacantBoarded 37.308768
PctHousLess3BR 35.571028
PctHousOwnOcc 35.007781
PctHousNoPhone 34.742189
PctPersDenseHous 33.972668
HousVacant 29.811069
PctLess9thGrade 24.391977
PctLargHouseFam 20.652709
NumInShelters 20.190012
PctWOFullPlumb 15.783587
NumStreet 14.856954
perCapInc 13.844531
MedNumBR 13.694854
PctOccupMgmtProf 11.999077
MedRentPctHousInc 10.471355
PctEmploy 10.447415
PctNotSpeakEnglWell 10.337888
racePctHisp 8.862438
MalePctNevMarr 8.618872
PopDens 7.875297
NumImmig 7.649018
pctWWage 6.623132
PctOccupManu 6.297929
PctHousOccup 6.259769
PctImmigRec10 4.935118
blackPerCap 3.242936
PctRecImmig8 3.227065
RentLowQ 0.000000

Prediction Performance

MARS model

The MARS model (Multivariate Adaptive Regression Splines) is used in supervised machine learning for regression and classification. It is a non-parametric technique that uses combinations of simple functions to approximate a complex function.
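The call shown in the summary below can be reproduced directly with the earth package:

set.seed(1809)                                  # seed value is an assumption
fit.mars <- earth(x = X_train, y = y_train)     # additive MARS model (degree 1 by default)
summary(fit.mars)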

## Call: earth(x=X_train, y=y_train)
## 
##                        coefficients
## (Intercept)               0.4494353
## h(racepctblack-0.86)     -0.9662119
## h(0.21-racePctWhite)      0.5923169
## h(racePctWhite-0.21)     -0.2197684
## h(pctWWage-0.73)         -0.4280440
## h(0.47-pctWInvInc)        0.3971654
## h(0.57-MalePctDivorce)   -0.1789002
## h(MalePctDivorce-0.57)    0.1684589
## h(0.55-PctTeen2Par)       0.2757925
## h(0.4-PctIlleg)          -0.3027351
## h(PctIlleg-0.4)           0.1214990
## h(0.61-PctHousLess3BR)    0.1500186
## h(PctHousLess3BR-0.61)    0.3421317
## h(0.12-HousVacant)       -0.6673085
## h(HousVacant-0.12)        0.1483440
## h(0.02-NumStreet)        -2.1035280
## h(NumStreet-0.02)         0.1050918
## 
## Selected 17 of 21 terms, and 10 of 39 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: PctIlleg, PctTeen2Par, NumStreet, pctWInvInc, HousVacant, ...
## Number of terms at each degree of interaction: 1 16 (additive model)
## GCV 0.01872543    RSS 26.84714    GRSq 0.6611047    RSq 0.6754289

MARS plot

##  plotmo grid:    racepctblack racePctWhite racePctHisp pctWWage pctWInvInc
##                          0.06         0.85        0.04     0.57       0.48
##  pctWPubAsst perCapInc blackPerCap PctPopUnderPov PctLess9thGrade PctUnemployed
##         0.26      0.31        0.25           0.24            0.27          0.33
##  PctEmploy PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr
##       0.51         0.37              0.4           0.46            0.4
##  PctYoungKids2Par PctTeen2Par NumIlleg PctIlleg NumImmig PctImmigRec10
##               0.7         0.6     0.01     0.17     0.01          0.43
##  PctRecImmig8 PctNotSpeakEnglWell PctLargHouseFam PctPersDenseHous
##          0.09                0.06            0.21             0.11
##  PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded
##            0.51      0.5       0.03         0.77          0.54             0.13
##  PctHousNoPhone PctWOFullPlumb RentLowQ MedRentPctHousInc NumInShelters
##            0.18            0.2     0.32              0.48             0
##  NumStreet PopDens
##          0    0.17

Variable Importance

Overall
PctIlleg 100.00000
PctTeen2Par 40.76356
NumStreet 40.76356
pctWInvInc 35.12809
HousVacant 28.69767
racePctWhite 23.49241
PctHousLess3BR 19.62324
MalePctDivorce 12.42799
pctWWage 10.39611
racepctblack 6.97419

Prediction Performance

Decision Tree model

The Decision Tree model is used in supervised machine learning for classification and regression. It is a modeling technique that builds a decision tree from training data to predict the label or value of a new data instance. It is useful when the data has non-linear relationships or complex interactions between the input features, and it is efficient at processing large data sets.
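The regression tree reported below can be grown and inspected as follows (a sketch matching the printed call):

set.seed(1810)                                  # seed value is an assumption
fit.tree <- rpart(ViolentCrimesPerPop ~ ., data = as.data.frame(train))
printcp(fit.tree)                               # complexity-parameter table shown below
rpart.plot(fit.tree)                            # visualize the fitted tree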

## 
## Regression tree:
## rpart(formula = ViolentCrimesPerPop ~ ., data = as.data.frame(train))
## 
## Variables actually used in tree construction:
## [1] MalePctDivorce NumStreet      PctIlleg       pctWPubAsst    racePctHisp   
## 
## Root node error: 82.716/1499 = 0.055181
## 
## n= 1499 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.415223      0   1.00000 1.00154 0.050389
## 2 0.067533      1   0.58478 0.62276 0.030649
## 3 0.047360      2   0.51724 0.54645 0.026916
## 4 0.021278      3   0.46988 0.52341 0.027361
## 5 0.018565      4   0.44861 0.53202 0.027842
## 6 0.016753      5   0.43004 0.51126 0.027449
## 7 0.013985      6   0.41329 0.50114 0.026976
## 8 0.010000      7   0.39930 0.48388 0.026672

To decide how far to prune the tree, we need to use the cross-validation results stored in the complexity-parameter (cp) table.
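A sketch of how that table and the cross-validation curve are obtained from the fit above:

# The cp table holds the cross-validated error (xerror) for each subtree size;
# plotcp() draws xerror against tree size to help pick the pruning cp.
tree_model$cptable
plotcp(tree_model)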

##           CP nsplit rel error    xerror       xstd
## 1 0.41522304      0 1.0000000 1.0015384 0.05038930
## 2 0.06753305      1 0.5847770 0.6227624 0.03064924
## 3 0.04735950      2 0.5172439 0.5464481 0.02691551
## 4 0.02127802      3 0.4698844 0.5234057 0.02736072
## 5 0.01856464      4 0.4486064 0.5320167 0.02784219
## 6 0.01675309      5 0.4300417 0.5112581 0.02744900
## 7 0.01398494      6 0.4132887 0.5011393 0.02697578
## 8 0.01000000      7 0.3993037 0.4838784 0.02667223

The cross-validation curve reaches its minimum at the largest tree size in the table (8 leaves, i.e. 7 splits); at that point the cp shown on the plot is about 0.0118 and the relative error is about 0.40, so we prune the tree there.
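A sketch of the pruning step at that cp value (tree_model as above):

# Prune the tree at the cp value read off the cross-validation curve
pruned_tree <- prune(tree_model, cp = 0.0118)
rpart.plot(pruned_tree)   # shown as Decision Tree No. 2 below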

Decision Tree No.2

Prediction Performance

Model Accuracy

Model Accuracy is a measure of how well the model fits the training and test data. It is used to evaluate the performance of regression models and to compare different models.

## [1] 0.01012146

The accuracy of the model is low, even after pruning the tree.
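One way to quantify the fit of the pruned tree on the held-out split is through RMSE, R-squared, and MAE via caret::postResample (a sketch; the test object name is an assumption):

library(caret)   # for postResample()

# Compare predictions on the test split with the observed crime rates
test_df <- as.data.frame(test)                                  # held-out split (name assumed)
preds   <- predict(pruned_tree, newdata = test_df)
postResample(pred = preds, obs = test_df$ViolentCrimesPerPop)   # RMSE, Rsquared, MAE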

Variable Importance

Prediction Performance

Random Forest model

Random forest is a supervised machine learning model used for classification and regression. It combines multiple decision trees to improve prediction accuracy and reduce overfitting. It is useful when working with large data sets and complex feature interactions, where a single tree would be prone to overfitting.
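A minimal sketch of the forest fit, consistent with the call reported further below (train2 is the training split; mtry was set earlier in the analysis, 39 in the reported fit):

library(randomForest)

# Grow 200 trees; mtry predictors are considered at each split
rf_model <- randomForest(ViolentCrimesPerPop ~ ., data = train2,
                         mtry = mtry, ntree = 200, proximity = TRUE)
rf_model   # prints the regression summary shown below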


In this case the test error appears to be greater than the OOB error; ideally, the out-of-bag (OOB) error and the test error should line up.
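One way to compare the two errors as trees are added (a sketch; test2 is an assumed name for the held-out split):

# OOB MSE per number of trees is stored in rf_model$mse; compare it with the
# MSE of predictions on the held-out split.
oob_mse  <- rf_model$mse
test_mse <- mean((predict(rf_model, newdata = test2) - test2$ViolentCrimesPerPop)^2)

plot(oob_mse, type = "l", xlab = "Number of trees", ylab = "MSE")
abline(h = test_mse, lty = 2)   # dashed line: test-set MSE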

Variable Importance

## 
## Call:
##  randomForest(formula = ViolentCrimesPerPop ~ ., data = train2,      mtry = mtry, ntree = 200, proximity = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 200
## No. of variables tried at each split: 39
## 
##           Mean of squared residuals: 0.01949456
##                     % Var explained: 64.67

The graphs show the importance of each variable in predicting violent crime. The mean decrease in accuracy shows how much the model's accuracy drops when a variable is removed (permuted). PctIlleg, the percentage of kids born to never-married parents, is the most important variable, by a very large margin over the second-ranked variable.
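A sketch of how these importance measures are typically extracted from a randomForest fit (permutation importance requires importance = TRUE when fitting):

# %IncMSE (mean decrease in accuracy) and IncNodePurity for each predictor
importance(rf_model)
varImpPlot(rf_model)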

Prediction Performance

Gradient Boosting Method

The boosting method is a supervised machine learning algorithm used for regression and classification problems. It is an ensemble technique that combines many weak decision trees, fitted sequentially, into a stronger and more accurate model.

This algorithm is especially useful in complex data analysis and prediction problems where high accuracy and good performance are required. It is most often applied to structured (tabular) data and can handle a large number of features.
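A sketch of how such a model can be tuned with caret's xgbTree method; the selected hyperparameters appear in the table below (the control settings and object names here are assumptions, not necessarily the exact grid used):

library(caret)
library(xgboost)

# Cross-validated tuning of the boosting hyperparameters
ctrl <- trainControl(method = "cv", number = 5)

xgb_model <- train(ViolentCrimesPerPop ~ ., data = as.data.frame(train),
                   method = "xgbTree", trControl = ctrl)

xgb_model$bestTune   # best hyperparameter combination (table below)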

     nrounds  max_depth  eta  gamma  colsample_bytree  min_child_weight  subsample
34        50          2  0.3      0               0.8                 1          1

Variable Importance

According to the graphs, we can see which variables matter most for predicting violent crime. The most important variable is PctIlleg: the percentage of kids born to never-married parents.

Let’s graph the tree models used.
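For the boosted model, a sketch of how the importance plot and an individual tree can be drawn (assuming the caret fit above; xgb.plot.tree uses the DiagrammeR package loaded at the start):

# Importance of each feature in the final xgboost model, plus the first tree
imp <- xgb.importance(model = xgb_model$finalModel)
xgb.plot.importance(imp)
xgb.plot.tree(model = xgb_model$finalModel, trees = 0)   # first boosted tree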

Summary models

Family                 Model                   RMSE    Rsquared  MAE
Linear Regression      Ridge Regression        0.1376  0.6336    0.0930
Linear Regression      Lasso Regression        0.1370  0.6374    0.0928
Linear Regression      Elastic Net Regression  0.1370  0.6372    0.0927
Non-Linear Regression  Neural Network          0.1379  0.6341    0.0894
Non-Linear Regression  KNN                     0.1413  0.6148    0.0935
Non-Linear Regression  SVM                     0.1361  0.6448    0.0877
Non-Linear Regression  MARS                    0.1369  0.6410    0.0916
Trees & Boosting       Decision Tree           0.1690  0.4587    0.1172
Trees & Boosting       Random Forest           NA      0.0006    NA
Trees & Boosting       XGBoost                 0.1398  0.6260    0.0940

Reviewing the summary table of all the models used in our analysis, we can see the following:

In general, the models perform well, with R-squared above 60%, except for the simple decision tree model, which has an R-squared of approximately 46%, and the random forest model, whose reported R-squared is below 1%.

We consider the best model to be the SVM (Support Vector Machine), with an RMSE of 0.136 and an R-squared of about 64.5%.

Conclusion

Based on the data and data exploration conducted, we were able to analyze a large number of factors simultaneously to understand how the relationships between the different factors affect crime rates in our communities and states.

According to the analysis, we observed that the percentage of children born to never-married parents is the factor most strongly associated with the rate of violent crime. While it is true that family structure can have some influence on children's behavior and development, delinquency is a complex problem with multiple causes and cannot be attributed to a single factor. And although the data suggest that other factors, such as race, could also be associated with the rate of violent crime, crime may equally be related to poverty, economic inequality, access to guns, drug and alcohol abuse, discrimination, and racism, among other factors.

Therefore, to address the problem of delinquency it is necessary to take into account a wide variety of factors, including but not limited to family structure.

Violent crime is a complex problem that has multiple causes in the United States, some of the possible causes of violent crime in communities are the following:

Socioeconomic inequality: Economic and social inequality can create tensions and conflicts among members of a community, which in turn can lead to violence.

Unemployment: Unemployment can increase despair and hopelessness, which can lead some people to turn to crime to survive.

Poverty: Poverty may be linked to increased violence, as people living in poverty may have less access to resources and opportunities, which can lead to crime.

Drugs and alcohol: Drug and alcohol abuse can increase a person’s likelihood of committing a violent crime.

Racism and Discrimination: Racism and discrimination can contribute to violence, especially in communities where there are tensions between different racial or ethnic groups.

It is important to note that these are just a few of the possible causes of violent crime in communities across the United States, and that each case is unique and may have multiple contributing factors. Therefore, it is necessary to address these problems holistically and focus on finding specific solutions for each community. It is important to avoid stigmatizing certain groups in society and instead focus on understanding and addressing the underlying causes of crime so that effective and sustainable action can be taken to reduce it.