library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(naniar)
library(xgboost)
library(usmap)
library(DiagrammeR)
library(earth)
library(plotly)
library(wordcloud)
library(RColorBrewer)
library(glmnet)
library(Hmisc)
library(car)
library(class)
library(rpart)
library(rpart.plot)

This is a dataset of 2018 US communities, the demographics of each community, and their crime rates. The dataset has 146 variables: the first four columns identify the community/location, the middle features are demographic characteristics of each community such as population, age, race, and income, and the final columns are types of crimes and overall crime rates. The goal of the project is to understand where violent crime occurs in terms of the socioeconomic and demographic characteristics of a region. These features can help predict ahead of time where violent crime is likely to occur, through predictive models that quantify the risk associated with a region.
Approaching the problem of crime across the different states of the United States requires investigating and analyzing the crime rate in each state, as well as the factors that may contribute to it. One factor that has been studied in relation to crime is the socioeconomic level of a community. There is evidence to suggest that communities with low socioeconomic levels have a higher incidence of crime than more prosperous communities. Other socioeconomic factors, such as unemployment, poverty, and a lack of educational and job opportunities, have also been linked to an increased risk of crime. These factors can negatively affect people's quality of life and increase their vulnerability to crime. However, it is important to note that other factors can also influence the crime rate, such as culture, law enforcement policies, the availability of guns, and other environmental and demographic characteristics. In summary, socioeconomic factors can play an important role in the occurrence of crime across the United States, but multiple factors must be considered when addressing this complex problem. The analysis carried out in this work could be useful for building predictive models that support urban planning and crime reduction.
The dataset selected for this analysis is ‘Crimes in US Communities Dataset’ - Michael Bryant (Owner).
We have a very complete dataset. For each state we can see data such as population per community, percentage of the population in four age groups, percentage of the population by race, percentage of people using public transit for commuting, and many other variables that will allow us to carry out a good analysis.
This is a dataset of 2018 US communities. Numeric-decimal variables have been normalized to the 0.00-1.00 range. Our target variable is 'Violent Crimes by Population' (the GOAL attribute). Our crime dataset has 128 attributes. In the following table we can see each variable, its description, and its data type.
| No. | Column | Description | Data Type |
|---|---|---|---|
| 1 | state | US state (by number) - not counted as predictive, but if considered, should be treated as nominal | nominal |
| 2 | county | numeric code for county - not predictive, and many missing values | numeric |
| 3 | community | numeric code for community - not predictive and many missing values | numeric |
| 4 | communityname | community name - not predictive - for information only | string |
| 5 | fold | fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive | numeric |
| 6 | population | population for community | numeric - decimal |
| 7 | householdsize | mean people per household | numeric - decimal |
| 8 | racepctblack | percentage of population that is african american | numeric - decimal |
| 9 | racePctWhite | percentage of population that is caucasian | numeric - decimal |
| 10 | racePctAsian | percentage of population that is of asian heritage | numeric - decimal |
| 11 | racePctHisp | percentage of population that is of hispanic heritage | numeric - decimal |
| 12 | agePct12t21 | percentage of population that is 12-21 in age | numeric - decimal |
| 13 | agePct12t29 | percentage of population that is 12-29 in age | numeric - decimal |
| 14 | agePct16t24 | percentage of population that is 16-24 in age | numeric - decimal |
| 15 | agePct65up | percentage of population that is 65 and over in age | numeric - decimal |
| 16 | numbUrban | number of people living in areas classified as urban | numeric - decimal |
| 17 | pctUrban | percentage of people living in areas classified as urban | numeric - decimal |
| 18 | medIncome | median household income | numeric - decimal |
| 19 | pctWWage | percentage of households with wage or salary income in 1989 | numeric - decimal |
| 20 | pctWFarmSelf | percentage of households with farm or self employment income in 1989 | numeric - decimal |
| 21 | pctWInvInc | percentage of households with investment / rent income in 1989 | numeric - decimal |
| 22 | pctWSocSec | percentage of households with social security income in 1989 | numeric - decimal |
| 23 | pctWPubAsst | percentage of households with public assistance income in 1989 | numeric - decimal |
| 24 | pctWRetire | percentage of households with retirement income in 1989 | numeric - decimal |
| 25 | medFamInc | median family income (differs from household income for non-family households) | numeric - decimal |
| 26 | perCapInc | per capita income | numeric - decimal |
| 27 | whitePerCap | per capita income for caucasians | numeric - decimal |
| 28 | blackPerCap | per capita income for african americans | numeric - decimal |
| 29 | indianPerCap | per capita income for native americans | numeric - decimal |
| 30 | AsianPerCap | per capita income for people with asian heritage | numeric - decimal |
| 31 | OtherPerCap | per capita income for people with ‘other’ heritage | numeric - decimal |
| 32 | HispPerCap | per capita income for people with hispanic heritage | numeric - decimal |
| 33 | NumUnderPov | number of people under the poverty level | numeric - decimal |
| 34 | PctPopUnderPov | percentage of people under the poverty level | numeric - decimal |
| 35 | PctLess9thGrade | percentage of people 25 and over with less than a 9th grade education | numeric - decimal |
| 36 | PctNotHSGrad | percentage of people 25 and over that are not high school graduates | numeric - decimal |
| 37 | PctBSorMore | percentage of people 25 and over with a bachelors degree or higher education | numeric - decimal |
| 38 | PctUnemployed | percentage of people 16 and over, in the labor force, and unemployed | numeric - decimal |
| 39 | PctEmploy | percentage of people 16 and over who are employed | numeric - decimal |
| 40 | PctEmplManu | percentage of people 16 and over who are employed in manufacturing | numeric - decimal |
| 41 | PctEmplProfServ | percentage of people 16 and over who are employed in professional services | numeric - decimal |
| 42 | PctOccupManu | percentage of people 16 and over who are employed in manufacturing | numeric - decimal |
| 43 | PctOccupMgmtProf | percentage of people 16 and over who are employed in management or professional occupations | numeric - decimal |
| 44 | MalePctDivorce | percentage of males who are divorced | numeric - decimal |
| 45 | MalePctNevMarr | percentage of males who have never married | numeric - decimal |
| 46 | FemalePctDiv | percentage of females who are divorced | numeric - decimal |
| 47 | TotalPctDiv | percentage of population who are divorced | numeric - decimal |
| 48 | PersPerFam | mean number of people per family | numeric - decimal |
| 49 | PctFam2Par | percentage of families (with kids) that are headed by two parents | numeric - decimal |
| 50 | PctKids2Par | percentage of kids in family housing with two parents | numeric - decimal |
| 51 | PctYoungKids2Par | percent of kids 4 and under in two parent households | numeric - decimal |
| 52 | PctTeen2Par | percent of kids age 12-17 in two parent households | numeric - decimal |
| 53 | PctWorkMomYoungKids | percentage of moms of kids 6 and under in labor force | numeric - decimal |
| 54 | PctWorkMom | percentage of moms of kids under 18 in labor force | numeric - decimal |
| 55 | NumIlleg | number of kids born to never married | numeric - decimal |
| 56 | PctIlleg | percentage of kids born to never married | numeric - decimal |
| 57 | NumImmig | total number of people known to be foreign born | numeric - decimal |
| 58 | PctImmigRecent | percentage of immigrants who immigrated within the last 3 years | numeric - decimal |
| 59 | PctImmigRec5 | percentage of immigrants who immigrated within the last 5 years | numeric - decimal |
| 60 | PctImmigRec8 | percentage of immigrants who immigrated within the last 8 years | numeric - decimal |
| 61 | PctImmigRec10 | percentage of immigrants who immigrated within the last 10 years | numeric - decimal |
| 62 | PctRecentImmig | percent of population who have immigrated within the last 3 years | numeric - decimal |
| 63 | PctRecImmig5 | percent of population who have immigrated within the last 5 years | numeric - decimal |
| 64 | PctRecImmig8 | percent of population who have immigrated within the last 8 years | numeric - decimal |
| 65 | PctRecImmig10 | percent of population who have immigrated within the last 10 years | numeric - decimal |
| 66 | PctSpeakEnglOnly | percent of people who speak only English | numeric - decimal |
| 67 | PctNotSpeakEnglWell | percent of people who do not speak English well | numeric - decimal |
| 68 | PctLargHouseFam | percent of family households that are large (6 or more) | numeric - decimal |
| 69 | PctLargHouseOccup | percent of all occupied households that are large (6 or more people) | numeric - decimal |
| 70 | PersPerOccupHous | mean persons per household | numeric - decimal |
| 71 | PersPerOwnOccHous | mean persons per owner occupied household | numeric - decimal |
| 72 | PersPerRentOccHous | mean persons per rental household | numeric - decimal |
| 73 | PctPersOwnOccup | percent of people in owner occupied households | numeric - decimal |
| 74 | PctPersDenseHous | percent of persons in dense housing (more than 1 person per room) | numeric - decimal |
| 75 | PctHousLess3BR | percent of housing units with less than 3 bedrooms | numeric - decimal |
| 76 | MedNumBR | median number of bedrooms | numeric - decimal |
| 77 | HousVacant | number of vacant households | numeric - decimal |
| 78 | PctHousOccup | percent of housing occupied | numeric - decimal |
| 79 | PctHousOwnOcc | percent of households owner occupied | numeric - decimal |
| 80 | PctVacantBoarded | percent of vacant housing that is boarded up | numeric - decimal |
| 81 | PctVacMore6Mos | percent of vacant housing that has been vacant more than 6 months | numeric - decimal |
| 82 | MedYrHousBuilt | median year housing units built | numeric - decimal |
| 83 | PctHousNoPhone | percent of occupied housing units without phone (in 1990, this was rare!) | numeric - decimal |
| 84 | PctWOFullPlumb | percent of housing without complete plumbing facilities | numeric - decimal |
| 85 | OwnOccLowQuart | owner occupied housing - lower quartile value | numeric - decimal |
| 86 | OwnOccMedVal | owner occupied housing - median value | numeric - decimal |
| 87 | OwnOccHiQuart | owner occupied housing - upper quartile value | numeric - decimal |
| 88 | RentLowQ | rental housing - lower quartile rent | numeric - decimal |
| 89 | RentMedian | rental housing - median rent (Census variable H32B from file STF1A) | numeric - decimal |
| 90 | RentHighQ | rental housing - upper quartile rent | numeric - decimal |
| 91 | MedRent | median gross rent (Census variable H43A from file STF3A - includes utilities) | numeric - decimal |
| 92 | MedRentPctHousInc | median gross rent as a percentage of household income | numeric - decimal |
| 93 | MedOwnCostPctInc | median owners cost as a percentage of household income - for owners with a mortgage | numeric - decimal |
| 94 | MedOwnCostPctIncNoMtg | median owners cost as a percentage of household income - for owners without a mortgage | numeric - decimal |
| 95 | NumInShelters | number of people in homeless shelters | numeric - decimal |
| 96 | NumStreet | number of homeless people counted in the street | numeric - decimal |
| 97 | PctForeignBorn | percent of people foreign born | numeric - decimal |
| 98 | PctBornSameState | percent of people born in the same state as currently living | numeric - decimal |
| 99 | PctSameHouse85 | percent of people living in the same house as in 1985 (5 years before) | numeric - decimal |
| 100 | PctSameCity85 | percent of people living in the same city as in 1985 (5 years before) | numeric - decimal |
| 101 | PctSameState85 | percent of people living in the same state as in 1985 (5 years before) | numeric - decimal |
| 102 | LemasSwornFT | number of sworn full time police officers | numeric - decimal |
| 103 | LemasSwFTPerPop | sworn full time police officers per 100K population | numeric - decimal |
| 104 | LemasSwFTFieldOps | number of sworn full time police officers in field operations (on the street as opposed to administrative etc) | numeric - decimal |
| 105 | LemasSwFTFieldPerPop | sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population | numeric - decimal |
| 106 | LemasTotalReq | total requests for police | numeric - decimal |
| 107 | LemasTotReqPerPop | total requests for police per 100K population | numeric - decimal |
| 108 | PolicReqPerOffic | total requests for police per police officer | numeric - decimal |
| 109 | PolicPerPop | police officers per 100K population | numeric - decimal |
| 110 | RacialMatchCommPol | a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar | numeric - decimal |
| 111 | PctPolicWhite | percent of police that are caucasian | numeric - decimal |
| 112 | PctPolicBlack | percent of police that are african american | numeric - decimal |
| 113 | PctPolicHisp | percent of police that are hispanic | numeric - decimal |
| 114 | PctPolicAsian | percent of police that are asian | numeric - decimal |
| 115 | PctPolicMinor | percent of police that are minority of any kind | numeric - decimal |
| 116 | OfficAssgnDrugUnits | number of officers assigned to special drug units | numeric - decimal |
| 117 | NumKindsDrugsSeiz | number of different kinds of drugs seized | numeric - decimal |
| 118 | PolicAveOTWorked | police average overtime worked | numeric - decimal |
| 119 | LandArea | land area in square miles | numeric - decimal |
| 120 | PopDens | population density in persons per square mile | numeric - decimal |
| 121 | PctUsePubTrans | percent of people using public transit for commuting | numeric - decimal |
| 122 | PolicCars | number of police cars | numeric - decimal |
| 123 | PolicOperBudg | police operating budget | numeric - decimal |
| 124 | LemasPctPolicOnPatr | percent of sworn full time police officers on patrol | numeric - decimal |
| 125 | LemasGangUnitDeploy | gang unit deployed | numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time |
| 126 | LemasPctOfficDrugUn | percent of officers assigned to drug units | numeric - decimal |
| 127 | PolicBudgPerPop | police operating budget per population | numeric - decimal |
| 128 | ViolentCrimesPerPop | total number of violent crimes per 100K population - GOAL attribute (to be predicted) | numeric - decimal |
We also load a dataset that describes each state code, state abbreviation, and state name.
| state_code | state | stateName | stateENS |
|---|---|---|---|
| 1 | AL | Alabama | 1779775 |
| 2 | AK | Alaska | 1785533 |
| 4 | AZ | Arizona | 1779777 |
| 5 | AR | Arkansas | 68085 |
| 6 | CA | California | 1779778 |
| 8 | CO | Colorado | 1779779 |
| 9 | CT | Connecticut | 1779780 |
| 10 | DE | Delaware | 1779781 |
| 11 | DC | District of Columbia | 1702382 |
| 12 | FL | Florida | 294478 |
| 13 | GA | Georgia | 1705317 |
| 15 | HI | Hawaii | 1779782 |
| 16 | ID | Idaho | 1779783 |
| 17 | IL | Illinois | 1779784 |
| 18 | IN | Indiana | 448508 |
| 19 | IA | Iowa | 1779785 |
| 20 | KS | Kansas | 481813 |
| 21 | KY | Kentucky | 1779786 |
| 22 | LA | Louisiana | 1629543 |
| 23 | ME | Maine | 1779787 |
| 24 | MD | Maryland | 1714934 |
| 25 | MA | Massachusetts | 606926 |
| 26 | MI | Michigan | 1779789 |
| 27 | MN | Minnesota | 662849 |
| 28 | MS | Mississippi | 1779790 |
| 29 | MO | Missouri | 1779791 |
| 30 | MT | Montana | 767982 |
| 31 | NE | Nebraska | 1779792 |
| 32 | NV | Nevada | 1779793 |
| 33 | NH | New Hampshire | 1779794 |
| 34 | NJ | New Jersey | 1779795 |
| 35 | NM | New Mexico | 897535 |
| 36 | NY | New York | 1779796 |
| 37 | NC | North Carolina | 1027616 |
| 38 | ND | North Dakota | 1779797 |
| 39 | OH | Ohio | 1085497 |
| 40 | OK | Oklahoma | 1102857 |
| 41 | OR | Oregon | 1155107 |
| 42 | PA | Pennsylvania | 1779798 |
| 44 | RI | Rhode Island | 1219835 |
| 45 | SC | South Carolina | 1779799 |
| 46 | SD | South Dakota | 1785534 |
| 47 | TN | Tennessee | 1325873 |
| 48 | TX | Texas | 1779801 |
| 49 | UT | Utah | 1455989 |
| 50 | VT | Vermont | 1779802 |
| 51 | VA | Virginia | 1779803 |
| 53 | WA | Washington | 1779804 |
| 54 | WV | West Virginia | 1779805 |
| 55 | WI | Wisconsin | 1779806 |
| 56 | WY | Wyoming | 1779807 |
| 60 | AS | American Samoa | 1802701 |
| 66 | GU | Guam | 1802705 |
| 69 | MP | Northern Mariana Islands | 1779809 |
| 72 | PR | Puerto Rico | 1779808 |
| 74 | UM | U.S. Minor Outlying Islands | 1878752 |
| 78 | VI | U.S. Virgin Islands | 1802710 |
In the following table we can see, for each state and community, the population, the percentage of the population by age and race, the total number of violent crimes per 100K population, and other data that may be useful for our analysis.
| state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | pctUrban | medIncome | pctWWage | pctWFarmSelf | pctWInvInc | pctWSocSec | pctWPubAsst | pctWRetire | medFamInc | perCapInc | whitePerCap | blackPerCap | indianPerCap | AsianPerCap | OtherPerCap | HispPerCap | NumUnderPov | PctPopUnderPov | PctLess9thGrade | PctNotHSGrad | PctBSorMore | PctUnemployed | PctEmploy | PctEmplManu | PctEmplProfServ | PctOccupManu | PctOccupMgmtProf | MalePctDivorce | MalePctNevMarr | FemalePctDiv | TotalPctDiv | PersPerFam | PctFam2Par | PctKids2Par | PctYoungKids2Par | PctTeen2Par | PctWorkMomYoungKids | PctWorkMom | NumIlleg | PctIlleg | NumImmig | PctImmigRecent | PctImmigRec5 | PctImmigRec8 | PctImmigRec10 | PctRecentImmig | PctRecImmig5 | PctRecImmig8 | PctRecImmig10 | PctSpeakEnglOnly | PctNotSpeakEnglWell | PctLargHouseFam | PctLargHouseOccup | PersPerOccupHous | PersPerOwnOccHous | PersPerRentOccHous | PctPersOwnOccup | PctPersDenseHous | PctHousLess3BR | MedNumBR | HousVacant | PctHousOccup | PctHousOwnOcc | PctVacantBoarded | PctVacMore6Mos | MedYrHousBuilt | PctHousNoPhone | PctWOFullPlumb | OwnOccLowQuart | OwnOccMedVal | OwnOccHiQuart | RentLowQ | RentMedian | RentHighQ | MedRent | MedRentPctHousInc | MedOwnCostPctInc | MedOwnCostPctIncNoMtg | NumInShelters | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LemasSwornFT | LemasSwFTPerPop | LemasSwFTFieldOps | LemasSwFTFieldPerPop | LemasTotalReq | LemasTotReqPerPop | PolicReqPerOffic | PolicPerPop | RacialMatchCommPol | PctPolicWhite | PctPolicBlack | PctPolicHisp | PctPolicAsian | PctPolicMinor | OfficAssgnDrugUnits | NumKindsDrugsSeiz | PolicAveOTWorked | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | ? | ? | Lakewoodcity | 1 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | 0.32 | 0.20 | 1.0 | 0.37 | 0.72 | 0.34 | 0.60 | 0.29 | 0.15 | 0.43 | 0.39 | 0.40 | 0.39 | 0.32 | 0.27 | 0.27 | 0.36 | 0.41 | 0.08 | 0.19 | 0.10 | 0.18 | 0.48 | 0.27 | 0.68 | 0.23 | 0.41 | 0.25 | 0.52 | 0.68 | 0.40 | 0.75 | 0.75 | 0.35 | 0.55 | 0.59 | 0.61 | 0.56 | 0.74 | 0.76 | 0.04 | 0.14 | 0.03 | 0.24 | 0.27 | 0.37 | 0.39 | 0.07 | 0.07 | 0.08 | 0.08 | 0.89 | 0.06 | 0.14 | 0.13 | 0.33 | 0.39 | 0.28 | 0.55 | 0.09 | 0.51 | 0.5 | 0.21 | 0.71 | 0.52 | 0.05 | 0.26 | 0.65 | 0.14 | 0.06 | 0.22 | 0.19 | 0.18 | 0.36 | 0.35 | 0.38 | 0.34 | 0.38 | 0.46 | 0.25 | 0.04 | 0 | 0.12 | 0.42 | 0.50 | 0.51 | 0.64 | 0.03 | 0.13 | 0.96 | 0.17 | 0.06 | 0.18 | 0.44 | 0.13 | 0.94 | 0.93 | 0.03 | 0.07 | 0.1 | 0.07 | 0.02 | 0.57 | 0.29 | 0.12 | 0.26 | 0.20 | 0.06 | 0.04 | 0.9 | 0.5 | 0.32 | 0.14 | 0.20 |
| 53 | ? | ? | Tukwilacity | 1 | 0.00 | 0.16 | 0.12 | 0.74 | 0.45 | 0.07 | 0.26 | 0.59 | 0.35 | 0.27 | 0.02 | 1.0 | 0.31 | 0.72 | 0.11 | 0.45 | 0.25 | 0.29 | 0.39 | 0.29 | 0.37 | 0.38 | 0.33 | 0.16 | 0.30 | 0.22 | 0.35 | 0.01 | 0.24 | 0.14 | 0.24 | 0.30 | 0.27 | 0.73 | 0.57 | 0.15 | 0.42 | 0.36 | 1.00 | 0.63 | 0.91 | 1.00 | 0.29 | 0.43 | 0.47 | 0.60 | 0.39 | 0.46 | 0.53 | 0.00 | 0.24 | 0.01 | 0.52 | 0.62 | 0.64 | 0.63 | 0.25 | 0.27 | 0.25 | 0.23 | 0.84 | 0.10 | 0.16 | 0.10 | 0.17 | 0.29 | 0.17 | 0.26 | 0.20 | 0.82 | 0.0 | 0.02 | 0.79 | 0.24 | 0.02 | 0.25 | 0.65 | 0.16 | 0.00 | 0.21 | 0.20 | 0.21 | 0.42 | 0.38 | 0.40 | 0.37 | 0.29 | 0.32 | 0.18 | 0.00 | 0 | 0.21 | 0.50 | 0.34 | 0.60 | 0.52 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.02 | 0.12 | 0.45 | ? | ? | ? | ? | 0.00 | ? | 0.67 |
| 24 | ? | ? | Aberdeentown | 1 | 0.00 | 0.42 | 0.49 | 0.56 | 0.17 | 0.04 | 0.39 | 0.47 | 0.28 | 0.32 | 0.00 | 0.0 | 0.30 | 0.58 | 0.19 | 0.39 | 0.38 | 0.40 | 0.84 | 0.28 | 0.27 | 0.29 | 0.27 | 0.07 | 0.29 | 0.28 | 0.39 | 0.01 | 0.27 | 0.27 | 0.43 | 0.19 | 0.36 | 0.58 | 0.32 | 0.29 | 0.49 | 0.32 | 0.63 | 0.41 | 0.71 | 0.70 | 0.45 | 0.42 | 0.44 | 0.43 | 0.43 | 0.71 | 0.67 | 0.01 | 0.46 | 0.00 | 0.07 | 0.06 | 0.15 | 0.19 | 0.02 | 0.02 | 0.04 | 0.05 | 0.88 | 0.04 | 0.20 | 0.20 | 0.46 | 0.52 | 0.43 | 0.42 | 0.15 | 0.51 | 0.5 | 0.01 | 0.86 | 0.41 | 0.29 | 0.30 | 0.52 | 0.47 | 0.45 | 0.18 | 0.17 | 0.16 | 0.27 | 0.29 | 0.27 | 0.31 | 0.48 | 0.39 | 0.28 | 0.00 | 0 | 0.14 | 0.49 | 0.54 | 0.67 | 0.56 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.01 | 0.21 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.43 |
| 34 | 5 | 81440 | Willingborotownship | 1 | 0.04 | 0.77 | 1.00 | 0.08 | 0.12 | 0.10 | 0.51 | 0.50 | 0.34 | 0.21 | 0.06 | 1.0 | 0.58 | 0.89 | 0.21 | 0.43 | 0.36 | 0.20 | 0.82 | 0.51 | 0.36 | 0.40 | 0.39 | 0.16 | 0.25 | 0.36 | 0.44 | 0.01 | 0.10 | 0.09 | 0.25 | 0.31 | 0.33 | 0.71 | 0.36 | 0.45 | 0.37 | 0.39 | 0.34 | 0.45 | 0.49 | 0.44 | 0.75 | 0.65 | 0.54 | 0.83 | 0.65 | 0.85 | 0.86 | 0.03 | 0.33 | 0.02 | 0.11 | 0.20 | 0.30 | 0.31 | 0.05 | 0.08 | 0.11 | 0.11 | 0.81 | 0.08 | 0.56 | 0.62 | 0.85 | 0.77 | 1.00 | 0.94 | 0.12 | 0.01 | 0.5 | 0.01 | 0.97 | 0.96 | 0.60 | 0.47 | 0.52 | 0.11 | 0.11 | 0.24 | 0.21 | 0.19 | 0.75 | 0.70 | 0.77 | 0.89 | 0.63 | 0.51 | 0.47 | 0.00 | 0 | 0.19 | 0.30 | 0.73 | 0.64 | 0.65 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.02 | 0.39 | 0.28 | ? | ? | ? | ? | 0.00 | ? | 0.12 |
| 42 | 95 | 6096 | Bethlehemtownship | 1 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | 0.05 | 0.38 | 0.38 | 0.23 | 0.36 | 0.02 | 0.9 | 0.50 | 0.72 | 0.16 | 0.68 | 0.44 | 0.11 | 0.71 | 0.46 | 0.43 | 0.41 | 0.28 | 0.00 | 0.74 | 0.51 | 0.48 | 0.00 | 0.06 | 0.25 | 0.30 | 0.33 | 0.12 | 0.65 | 0.67 | 0.38 | 0.42 | 0.46 | 0.22 | 0.27 | 0.20 | 0.21 | 0.51 | 0.91 | 0.91 | 0.89 | 0.85 | 0.40 | 0.60 | 0.00 | 0.06 | 0.00 | 0.03 | 0.07 | 0.20 | 0.27 | 0.01 | 0.02 | 0.04 | 0.05 | 0.88 | 0.05 | 0.16 | 0.19 | 0.59 | 0.60 | 0.37 | 0.89 | 0.02 | 0.19 | 0.5 | 0.01 | 0.89 | 0.87 | 0.04 | 0.55 | 0.73 | 0.05 | 0.14 | 0.31 | 0.31 | 0.30 | 0.40 | 0.36 | 0.38 | 0.38 | 0.22 | 0.51 | 0.21 | 0.00 | 0 | 0.11 | 0.72 | 0.64 | 0.61 | 0.53 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.04 | 0.09 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.03 |
| 6 | ? | ? | SouthPasadenacity | 1 | 0.02 | 0.28 | 0.06 | 0.54 | 1.00 | 0.25 | 0.31 | 0.48 | 0.27 | 0.37 | 0.04 | 1.0 | 0.52 | 0.68 | 0.20 | 0.61 | 0.28 | 0.15 | 0.25 | 0.62 | 0.72 | 0.76 | 0.77 | 0.28 | 0.52 | 0.48 | 0.60 | 0.01 | 0.12 | 0.13 | 0.12 | 0.80 | 0.10 | 0.65 | 0.19 | 0.77 | 0.06 | 0.91 | 0.49 | 0.57 | 0.61 | 0.58 | 0.44 | 0.62 | 0.69 | 0.87 | 0.53 | 0.30 | 0.43 | 0.00 | 0.11 | 0.04 | 0.30 | 0.35 | 0.43 | 0.47 | 0.50 | 0.50 | 0.56 | 0.57 | 0.45 | 0.28 | 0.25 | 0.19 | 0.29 | 0.53 | 0.18 | 0.39 | 0.26 | 0.73 | 0.0 | 0.02 | 0.84 | 0.30 | 0.16 | 0.28 | 0.25 | 0.02 | 0.05 | 0.94 | 1.00 | 1.00 | 0.67 | 0.63 | 0.68 | 0.62 | 0.47 | 0.59 | 0.11 | 0.00 | 0 | 0.70 | 0.42 | 0.49 | 0.73 | 0.64 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.01 | 0.58 | 0.10 | ? | ? | ? | ? | 0.00 | ? | 0.14 |
## [1] 1994 128
The dataset has 1994 observations and 128 variables. We can see that there are missing values in the dataset, coded as '?'; we are going to convert these entries into 'NA' in order to carry out our analysis.
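A minimal sketch of this conversion is shown below; the data frame name dataset matches the code used later in the document, but the exact step here is an assumption.

# Sketch: replace the "?" markers with NA and restore numeric types for every
# column except the community name.
dataset[dataset == "?"] <- NA
num_cols <- setdiff(names(dataset), "communityname")
dataset[num_cols] <- lapply(dataset[num_cols], function(x) as.numeric(as.character(x)))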
A summary of the variables is below:
| state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | pctUrban | medIncome | pctWWage | pctWFarmSelf | pctWInvInc | pctWSocSec | pctWPubAsst | pctWRetire | medFamInc | perCapInc | whitePerCap | blackPerCap | indianPerCap | AsianPerCap | OtherPerCap | HispPerCap | NumUnderPov | PctPopUnderPov | PctLess9thGrade | PctNotHSGrad | PctBSorMore | PctUnemployed | PctEmploy | PctEmplManu | PctEmplProfServ | PctOccupManu | PctOccupMgmtProf | MalePctDivorce | MalePctNevMarr | FemalePctDiv | TotalPctDiv | PersPerFam | PctFam2Par | PctKids2Par | PctYoungKids2Par | PctTeen2Par | PctWorkMomYoungKids | PctWorkMom | NumIlleg | PctIlleg | NumImmig | PctImmigRecent | PctImmigRec5 | PctImmigRec8 | PctImmigRec10 | PctRecentImmig | PctRecImmig5 | PctRecImmig8 | PctRecImmig10 | PctSpeakEnglOnly | PctNotSpeakEnglWell | PctLargHouseFam | PctLargHouseOccup | PersPerOccupHous | PersPerOwnOccHous | PersPerRentOccHous | PctPersOwnOccup | PctPersDenseHous | PctHousLess3BR | MedNumBR | HousVacant | PctHousOccup | PctHousOwnOcc | PctVacantBoarded | PctVacMore6Mos | MedYrHousBuilt | PctHousNoPhone | PctWOFullPlumb | OwnOccLowQuart | OwnOccMedVal | OwnOccHiQuart | RentLowQ | RentMedian | RentHighQ | MedRent | MedRentPctHousInc | MedOwnCostPctInc | MedOwnCostPctIncNoMtg | NumInShelters | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LemasSwornFT | LemasSwFTPerPop | LemasSwFTFieldOps | LemasSwFTFieldPerPop | LemasTotalReq | LemasTotReqPerPop | PolicReqPerOffic | PolicPerPop | RacialMatchCommPol | PctPolicWhite | PctPolicBlack | PctPolicHisp | PctPolicAsian | PctPolicMinor | OfficAssgnDrugUnits | NumKindsDrugsSeiz | PolicAveOTWorked | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.00 | Length:1994 | Length:1994 | Length:1994 | Min. : 1.000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Min. :0.0000 | Min. :0.00000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Min. :0.00000 | Length:1994 | Min. :0.000 | |
| 1st Qu.:12.00 | Class :character | Class :character | Class :character | 1st Qu.: 3.000 | 1st Qu.:0.01000 | 1st Qu.:0.3500 | 1st Qu.:0.0200 | 1st Qu.:0.6300 | 1st Qu.:0.0400 | 1st Qu.:0.010 | 1st Qu.:0.3400 | 1st Qu.:0.4100 | 1st Qu.:0.2500 | 1st Qu.:0.3000 | 1st Qu.:0.00000 | 1st Qu.:0.0000 | 1st Qu.:0.2000 | 1st Qu.:0.4400 | 1st Qu.:0.1600 | 1st Qu.:0.3700 | 1st Qu.:0.3500 | 1st Qu.:0.1425 | 1st Qu.:0.3600 | 1st Qu.:0.2300 | 1st Qu.:0.2200 | 1st Qu.:0.240 | 1st Qu.:0.1725 | 1st Qu.:0.1100 | 1st Qu.:0.1900 | Class :character | 1st Qu.:0.2600 | 1st Qu.:0.01000 | 1st Qu.:0.110 | 1st Qu.:0.1600 | 1st Qu.:0.2300 | 1st Qu.:0.2100 | 1st Qu.:0.2200 | 1st Qu.:0.3800 | 1st Qu.:0.2500 | 1st Qu.:0.3200 | 1st Qu.:0.2400 | 1st Qu.:0.3100 | 1st Qu.:0.3300 | 1st Qu.:0.3100 | 1st Qu.:0.3600 | 1st Qu.:0.3600 | 1st Qu.:0.4000 | 1st Qu.:0.4900 | 1st Qu.:0.4900 | 1st Qu.:0.530 | 1st Qu.:0.4800 | 1st Qu.:0.3900 | 1st Qu.:0.4200 | 1st Qu.:0.00000 | 1st Qu.:0.09 | 1st Qu.:0.00000 | 1st Qu.:0.1600 | 1st Qu.:0.2000 | 1st Qu.:0.2500 | 1st Qu.:0.2800 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.7300 | 1st Qu.:0.0300 | 1st Qu.:0.1500 | 1st Qu.:0.1400 | 1st Qu.:0.3400 | 1st Qu.:0.3900 | 1st Qu.:0.2700 | 1st Qu.:0.4400 | 1st Qu.:0.0600 | 1st Qu.:0.4000 | 1st Qu.:0.0000 | 1st Qu.:0.01000 | 1st Qu.:0.6300 | 1st Qu.:0.4300 | 1st Qu.:0.0600 | 1st Qu.:0.2900 | 1st Qu.:0.3500 | 1st Qu.:0.0600 | 1st Qu.:0.1000 | 1st Qu.:0.0900 | 1st Qu.:0.0900 | 1st Qu.:0.0900 | 1st Qu.:0.1700 | 1st Qu.:0.2000 | 1st Qu.:0.220 | 1st Qu.:0.2100 | 1st Qu.:0.3700 | 1st Qu.:0.3200 | 1st Qu.:0.2500 | 1st Qu.:0.00000 | 1st Qu.:0.00000 | 1st Qu.:0.0600 | 1st Qu.:0.4700 | 1st Qu.:0.4200 | 1st Qu.:0.5200 | 1st Qu.:0.5600 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.:0.02000 | 1st Qu.:0.1000 | 1st Qu.:0.0200 | Class :character | Class :character | Class :character | Class :character | 1st Qu.:0.00000 | Class :character | 1st Qu.:0.070 | |
| Median :34.00 | Mode :character | Mode :character | Mode :character | Median : 5.000 | Median :0.02000 | Median :0.4400 | Median :0.0600 | Median :0.8500 | Median :0.0700 | Median :0.040 | Median :0.4000 | Median :0.4800 | Median :0.2900 | Median :0.4200 | Median :0.03000 | Median :1.0000 | Median :0.3200 | Median :0.5600 | Median :0.2300 | Median :0.4800 | Median :0.4750 | Median :0.2600 | Median :0.4700 | Median :0.3300 | Median :0.3000 | Median :0.320 | Median :0.2500 | Median :0.1700 | Median :0.2800 | Mode :character | Median :0.3450 | Median :0.02000 | Median :0.250 | Median :0.2700 | Median :0.3600 | Median :0.3100 | Median :0.3200 | Median :0.5100 | Median :0.3700 | Median :0.4100 | Median :0.3700 | Median :0.4000 | Median :0.4700 | Median :0.4000 | Median :0.5000 | Median :0.5000 | Median :0.4700 | Median :0.6300 | Median :0.6400 | Median :0.700 | Median :0.6100 | Median :0.5100 | Median :0.5400 | Median :0.01000 | Median :0.17 | Median :0.01000 | Median :0.2900 | Median :0.3400 | Median :0.3900 | Median :0.4300 | Median :0.0900 | Median :0.0800 | Median :0.0900 | Median :0.0900 | Median :0.8700 | Median :0.0600 | Median :0.2000 | Median :0.1900 | Median :0.4400 | Median :0.4800 | Median :0.3600 | Median :0.5600 | Median :0.1100 | Median :0.5100 | Median :0.5000 | Median :0.03000 | Median :0.7700 | Median :0.5400 | Median :0.1300 | Median :0.4200 | Median :0.5200 | Median :0.1850 | Median :0.1900 | Median :0.1800 | Median :0.1700 | Median :0.1800 | Median :0.3100 | Median :0.3300 | Median :0.370 | Median :0.3400 | Median :0.4800 | Median :0.4500 | Median :0.3700 | Median :0.00000 | Median :0.00000 | Median :0.1300 | Median :0.6300 | Median :0.5400 | Median :0.6700 | Median :0.7000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median :0.04000 | Median :0.1700 | Median :0.0700 | Mode :character | Mode :character | Mode :character | Mode :character | Median :0.00000 | Mode :character | Median :0.150 | |
| Mean :28.68 | NA | NA | NA | Mean : 5.494 | Mean :0.05759 | Mean :0.4634 | Mean :0.1796 | Mean :0.7537 | Mean :0.1537 | Mean :0.144 | Mean :0.4242 | Mean :0.4939 | Mean :0.3363 | Mean :0.4232 | Mean :0.06407 | Mean :0.6963 | Mean :0.3611 | Mean :0.5582 | Mean :0.2916 | Mean :0.4957 | Mean :0.4711 | Mean :0.3178 | Mean :0.4792 | Mean :0.3757 | Mean :0.3503 | Mean :0.368 | Mean :0.2911 | Mean :0.2035 | Mean :0.3224 | NA | Mean :0.3863 | Mean :0.05551 | Mean :0.303 | Mean :0.3158 | Mean :0.3833 | Mean :0.3617 | Mean :0.3635 | Mean :0.5011 | Mean :0.3964 | Mean :0.4406 | Mean :0.3912 | Mean :0.4413 | Mean :0.4612 | Mean :0.4345 | Mean :0.4876 | Mean :0.4943 | Mean :0.4877 | Mean :0.6109 | Mean :0.6207 | Mean :0.664 | Mean :0.5829 | Mean :0.5014 | Mean :0.5267 | Mean :0.03629 | Mean :0.25 | Mean :0.03006 | Mean :0.3202 | Mean :0.3606 | Mean :0.3991 | Mean :0.4279 | Mean :0.1814 | Mean :0.1821 | Mean :0.1848 | Mean :0.1829 | Mean :0.7859 | Mean :0.1506 | Mean :0.2676 | Mean :0.2519 | Mean :0.4621 | Mean :0.4944 | Mean :0.4041 | Mean :0.5626 | Mean :0.1863 | Mean :0.4952 | Mean :0.3147 | Mean :0.07682 | Mean :0.7195 | Mean :0.5487 | Mean :0.2045 | Mean :0.4333 | Mean :0.4942 | Mean :0.2645 | Mean :0.2431 | Mean :0.2647 | Mean :0.2635 | Mean :0.2689 | Mean :0.3464 | Mean :0.3725 | Mean :0.423 | Mean :0.3841 | Mean :0.4901 | Mean :0.4498 | Mean :0.4038 | Mean :0.02944 | Mean :0.02278 | Mean :0.2156 | Mean :0.6089 | Mean :0.5351 | Mean :0.6264 | Mean :0.6515 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean :0.06523 | Mean :0.2329 | Mean :0.1617 | NA | NA | NA | NA | Mean :0.09405 | NA | Mean :0.238 | |
| 3rd Qu.:42.00 | NA | NA | NA | 3rd Qu.: 8.000 | 3rd Qu.:0.05000 | 3rd Qu.:0.5400 | 3rd Qu.:0.2300 | 3rd Qu.:0.9400 | 3rd Qu.:0.1700 | 3rd Qu.:0.160 | 3rd Qu.:0.4700 | 3rd Qu.:0.5400 | 3rd Qu.:0.3600 | 3rd Qu.:0.5300 | 3rd Qu.:0.07000 | 3rd Qu.:1.0000 | 3rd Qu.:0.4900 | 3rd Qu.:0.6900 | 3rd Qu.:0.3700 | 3rd Qu.:0.6200 | 3rd Qu.:0.5800 | 3rd Qu.:0.4400 | 3rd Qu.:0.5800 | 3rd Qu.:0.4800 | 3rd Qu.:0.4300 | 3rd Qu.:0.440 | 3rd Qu.:0.3800 | 3rd Qu.:0.2500 | 3rd Qu.:0.4000 | NA | 3rd Qu.:0.4800 | 3rd Qu.:0.05000 | 3rd Qu.:0.450 | 3rd Qu.:0.4200 | 3rd Qu.:0.5100 | 3rd Qu.:0.4600 | 3rd Qu.:0.4800 | 3rd Qu.:0.6275 | 3rd Qu.:0.5200 | 3rd Qu.:0.5300 | 3rd Qu.:0.5100 | 3rd Qu.:0.5400 | 3rd Qu.:0.5900 | 3rd Qu.:0.5000 | 3rd Qu.:0.6200 | 3rd Qu.:0.6300 | 3rd Qu.:0.5600 | 3rd Qu.:0.7600 | 3rd Qu.:0.7800 | 3rd Qu.:0.840 | 3rd Qu.:0.7200 | 3rd Qu.:0.6200 | 3rd Qu.:0.6500 | 3rd Qu.:0.02000 | 3rd Qu.:0.32 | 3rd Qu.:0.02000 | 3rd Qu.:0.4300 | 3rd Qu.:0.4800 | 3rd Qu.:0.5300 | 3rd Qu.:0.5600 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.9400 | 3rd Qu.:0.1600 | 3rd Qu.:0.3100 | 3rd Qu.:0.2900 | 3rd Qu.:0.5500 | 3rd Qu.:0.5800 | 3rd Qu.:0.4900 | 3rd Qu.:0.7000 | 3rd Qu.:0.2200 | 3rd Qu.:0.6000 | 3rd Qu.:0.5000 | 3rd Qu.:0.07000 | 3rd Qu.:0.8600 | 3rd Qu.:0.6700 | 3rd Qu.:0.2700 | 3rd Qu.:0.5600 | 3rd Qu.:0.6700 | 3rd Qu.:0.4200 | 3rd Qu.:0.3300 | 3rd Qu.:0.4000 | 3rd Qu.:0.3900 | 3rd Qu.:0.3800 | 3rd Qu.:0.4900 | 3rd Qu.:0.5200 | 3rd Qu.:0.590 | 3rd Qu.:0.5300 | 3rd Qu.:0.5900 | 3rd Qu.:0.5800 | 3rd Qu.:0.5100 | 3rd Qu.:0.01000 | 3rd Qu.:0.00000 | 3rd Qu.:0.2800 | 3rd Qu.:0.7775 | 3rd Qu.:0.6600 | 3rd Qu.:0.7700 | 3rd Qu.:0.7900 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.:0.07000 | 3rd Qu.:0.2800 | 3rd Qu.:0.1900 | NA | NA | NA | NA | 3rd Qu.:0.00000 | NA | 3rd Qu.:0.330 | |
| Max. :56.00 | NA | NA | NA | Max. :10.000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | NA | Max. :1.0000 | Max. :1.00000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | NA | NA | NA | NA | Max. :1.00000 | NA | Max. :1.000 |
For the variables that have missing values, we will perform the analysis to identify and apply the appropriate imputation technique.
We are going to review how many missing values we have for each attribute.
| Feature | NA_Count | NA_Percentage |
|---|---|---|
| LemasSwornFT | 1675 | 84.0020060 |
| LemasSwFTPerPop | 1675 | 84.0020060 |
| LemasSwFTFieldOps | 1675 | 84.0020060 |
| LemasSwFTFieldPerPop | 1675 | 84.0020060 |
| LemasTotalReq | 1675 | 84.0020060 |
| LemasTotReqPerPop | 1675 | 84.0020060 |
| PolicReqPerOffic | 1675 | 84.0020060 |
| PolicPerPop | 1675 | 84.0020060 |
| RacialMatchCommPol | 1675 | 84.0020060 |
| PctPolicWhite | 1675 | 84.0020060 |
| PctPolicBlack | 1675 | 84.0020060 |
| PctPolicHisp | 1675 | 84.0020060 |
| PctPolicAsian | 1675 | 84.0020060 |
| PctPolicMinor | 1675 | 84.0020060 |
| OfficAssgnDrugUnits | 1675 | 84.0020060 |
| NumKindsDrugsSeiz | 1675 | 84.0020060 |
| PolicAveOTWorked | 1675 | 84.0020060 |
| PolicCars | 1675 | 84.0020060 |
| PolicOperBudg | 1675 | 84.0020060 |
| LemasPctPolicOnPatr | 1675 | 84.0020060 |
| LemasGangUnitDeploy | 1675 | 84.0020060 |
| PolicBudgPerPop | 1675 | 84.0020060 |
| community | 1177 | 59.0270812 |
| county | 1174 | 58.8766299 |
| OtherPerCap | 1 | 0.0501505 |
Let’s graph the amount of missing data:
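One way to produce such a plot is with the naniar package loaded above; this is only a sketch.

# Sketch: bar chart of the percentage of missing values per variable.
gg_miss_var(dataset, show_pct = TRUE)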
Reviewing the variables that have missing data, we find that 2 variables have approximately 59% missing data and 22 variables have 84% missing data. We consider that variables with this amount of missing data are not useful for our analysis, so we will remove them.
We also remove the fold variable, since it was included only for non-random 10-fold cross-validation.
We are going to identify the degenerate (near-zero-variance) variables and remove them from the dataset.
## [1] "LemasPctOfficDrugUn"
We are going to eliminate the variable "LemasPctOfficDrugUn" (percent of officers assigned to drug units). We are also going to remove the variable 'OtherPerCap' (per capita income for people with 'other' heritage) from the dataset.
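For reference, a degenerate (near-zero-variance) predictor like this one can be flagged with caret's nearZeroVar(); the call below is a sketch with the default cutoffs, not necessarily the exact code used.

# Sketch: indices of near-zero-variance predictors, then their names.
nzv <- nearZeroVar(dataset)
names(dataset)[nzv]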
## [1] 1993 102
Our dataset now has 1993 observations and 102 variables after removing the variables mentioned above.
Let's join our crime dataset with the state lookup table by state code.
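A minimal sketch of the join, assuming the state lookup table is stored in a data frame named states (the object and resulting column names are assumptions):

# Sketch: attach the state abbreviation to each community via the numeric state code.
dataset <- dataset %>%
  left_join(states %>% rename(state_abbr = state),
            by = c("state" = "state_code"))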
We are going to start with the analysis of the data for our project:
To obtain a violent crime rate per state, we average the violent crime rate (ViolentCrimesPerPop) over the communities of each state.
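A minimal dplyr sketch of this aggregation (the column names follow the table below; the exact code is an assumption):

# Sketch: average violent crime rate per state, ranked from highest to lowest.
stateCrime <- dataset %>%
  group_by(state_abbr) %>%
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop, na.rm = TRUE)) %>%
  arrange(desc(AvgCrimesRate)) %>%
  mutate(rank = row_number())
head(stateCrime, 10)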
In the following table we can see the 10 states with the highest rate of violent crimes:
| state_abbr | AvgCrimesRate | rank |
|---|---|---|
| DC | 1.0000000 | 1 |
| LA | 0.5045455 | 2 |
| SC | 0.4867857 | 3 |
| MD | 0.4800000 | 4 |
| FL | 0.4583333 | 5 |
| NC | 0.4019565 | 6 |
| AL | 0.3937209 | 7 |
| GA | 0.3840541 | 8 |
| DE | 0.3700000 | 9 |
| KS | 0.3600000 | 10 |
In first place we have the District of Columbia, in second place we have Louisiana and in third place South Carolina.
Violent Crime Rates by States:
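One possible way to draw this map uses the usmap package loaded above; the color scale and the stateCrime object from the sketch above are assumptions.

# Sketch: choropleth of the average violent crime rate by state.
plot_usmap(data = stateCrime %>% rename(state = state_abbr),
           values = "AvgCrimesRate", color = "white") +
  scale_fill_continuous(low = "lightyellow", high = "darkred",
                        name = "Avg violent crime rate") +
  theme(legend.position = "right")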
In the following table we can see the ten states with the highest average percentage of African-American population.
| state_abbr | AvgBlackPopRate | rank |
|---|---|---|
| DC | 1.0000000 | 1 |
| MS | 0.6900000 | 2 |
| GA | 0.6608108 | 3 |
| LA | 0.6245455 | 4 |
| DE | 0.6000000 | 5 |
| NC | 0.5580435 | 6 |
| SC | 0.5339286 | 7 |
| AL | 0.4702326 | 8 |
| MD | 0.4466667 | 9 |
| VA | 0.3972727 | 10 |
In first place we have the District of Columbia, in second place we have Mississippi and in third place Georgia.
Percentage of African-American population by state:
We can see that the District of Columbia ranks first in both crime rate and the percentage of African-American people. We might think that there is a relationship between these two aspects.
The crime rate in a state is not necessarily related to race simply because of the percentage of the population of a particular race. While the crime rate may be higher in some urban areas than in others, there is no evidence to suggest that the crime rate is caused by the race of the population.
It is important to note that the perception of race in the United States has historically been influenced by a variety of factors, including education, the media, and politics. Although crime can be a factor in influencing racial perceptions, it is only one of many factors that can influence how race is perceived in a given area.
We review which communities have the highest rates of violent crime.
Word Cloud of Communities with the highest crime rates:
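A sketch of how the word cloud might be generated with the wordcloud package; the number of communities shown and the frequency scaling are assumptions.

# Sketch: size each community name by its violent crime rate (top 100 communities).
commCrime <- dataset %>%
  select(communityname, ViolentCrimesPerPop) %>%
  arrange(desc(ViolentCrimesPerPop)) %>%
  head(100)
set.seed(42)
wordcloud(words = commCrime$communityname,
          freq = commCrime$ViolentCrimesPerPop * 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))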
The city of Camden in the state of New Jersey occupies first place with the highest rate of violent crimes. Among the top 10 communities with the highest violent crime rates, three are located in the state of Alabama.
To select the best variables for our analysis, we are going to measure the correlation of each independent variable with the dependent variable and eliminate the variables with low correlation. We will eliminate the variables whose absolute correlation with the dependent variable is less than 0.25.
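A minimal sketch of this filtering step, assuming the remaining predictor columns are numeric:

# Sketch: keep variables whose absolute correlation with the target is at least 0.25.
num_data <- dataset %>% select(where(is.numeric))
cors <- cor(num_data, num_data$ViolentCrimesPerPop, use = "pairwise.complete.obs")
keep <- rownames(cors)[abs(cors[, 1]) >= 0.25]
dataset <- dataset %>% select(all_of(keep))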
Finding variables with low correlation to dependent variable:
## [1] 1993 54
We now have 54 variables.
Let’s check for multicollinearity. We selected pairs of independent variables with absolute correlations greater than 0.9.
We are going to use the variance inflation factor (VIF) to remove multicollinear variables, eliminating those with the highest VIF scores.
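A sketch of this step with the car package is shown below; it assumes the data frame contains only the numeric predictors and the target, and the exact procedure (dropping the member of each highly correlated pair with the larger VIF) is an assumption. The correlated pairs and VIF scores found in the data are listed in the tables that follow.

# Sketch: find pairs with |correlation| > 0.9, compute VIFs on the full model,
# and drop the member of each pair with the larger VIF.
predMat <- dataset %>% select(-ViolentCrimesPerPop)
corMat <- cor(predMat)
highPairs <- which(abs(corMat) > 0.9 & upper.tri(corMat), arr.ind = TRUE)
vifs <- car::vif(lm(ViolentCrimesPerPop ~ ., data = dataset))
drop <- unique(apply(highPairs, 1, function(p)
  names(which.max(vifs[colnames(corMat)[p]]))))
dataset <- dataset %>% select(-all_of(drop))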
| Variable 1 | Variable 2 |
|---|---|
| PctRecImmig8 | PctRecImmig10 |
| population | numbUrban |
| PctFam2Par | PctKids2Par |
| PctLargHouseFam | PctLargHouseOccup |
| FemalePctDiv | TotalPctDiv |
| PctPersOwnOccup | PctHousOwnOcc |
| medIncome | medFamInc |
| MalePctDivorce | TotalPctDiv |
| PctBSorMore | PctOccupMgmtProf |
| population | NumUnderPov |
| PctLess9thGrade | PctNotHSGrad |
| NumUnderPov | NumIlleg |
| medFamInc | perCapInc |
| numbUrban | NumUnderPov |
| PctFam2Par | PctYoungKids2Par |
| PctKids2Par | PctYoungKids2Par |
| MalePctDivorce | FemalePctDiv |
| PctFam2Par | PctTeen2Par |
| PctKids2Par | PctTeen2Par |
| Variable | VIF_Score |
|---|---|
| TotalPctDiv | 928.212429 |
| FemalePctDiv | 295.188771 |
| MalePctDivorce | 203.199697 |
| PctRecImmig10 | 165.892005 |
| PctRecImmig8 | 147.580998 |
| PctLargHouseOccup | 129.488362 |
| PctLargHouseFam | 124.023935 |
| population | 117.672823 |
| PctFam2Par | 103.736936 |
| numbUrban | 101.604576 |
| medIncome | 98.892368 |
| PctKids2Par | 97.813837 |
| medFamInc | 95.037614 |
| PctPersOwnOccup | 78.204266 |
| PctHousOwnOcc | 76.714946 |
| PctNotHSGrad | 37.715581 |
| NumUnderPov | 33.986232 |
| perCapInc | 27.944187 |
| PctBSorMore | 25.418161 |
| PctPersDenseHous | 23.163968 |
| PctOccupMgmtProf | 23.097460 |
| PctLess9thGrade | 21.029438 |
| PctNotSpeakEnglWell | 19.853248 |
| PctPopUnderPov | 18.129689 |
| racePctWhite | 15.166439 |
| racepctblack | 14.945811 |
| NumIlleg | 14.432404 |
| pctWWage | 13.271613 |
| HousVacant | 12.765602 |
| PctYoungKids2Par | 12.151434 |
| pctWInvInc | 12.081673 |
| PctEmploy | 11.608310 |
| PctIlleg | 11.490275 |
| racePctHisp | 9.970625 |
| pctWPubAsst | 9.197796 |
| PctHousLess3BR | 8.210647 |
| RentLowQ | 8.141949 |
| PctHousNoPhone | 7.424820 |
| PctTeen2Par | 7.264545 |
| PctOccupManu | 6.099481 |
| PctUnemployed | 6.099450 |
| MalePctNevMarr | 5.330162 |
| NumImmig | 5.007225 |
| NumInShelters | 4.710435 |
| PctHousOccup | 3.543461 |
| PopDens | 2.833126 |
| MedNumBR | 2.657976 |
| NumStreet | 2.473209 |
| MedRentPctHousInc | 2.470435 |
| PctImmigRec10 | 2.372429 |
| PctVacantBoarded | 2.096964 |
| blackPerCap | 2.028969 |
| PctWOFullPlumb | 1.847284 |
## [1] 1993 40
We now have 40 variables.
We are going to create correlation plots to evaluate the most important variables in predicting violent crime rates.
corrMatrix <- round(cor(dataset[,c(20:39,40)]),4)
corrMatrix %>% corrplot(.,method="color",
type="lower", order="hclust",
addCoef.col = "black",
tl.col="blue", tl.srt=45,
sig.level = 0.01, insig = "blank",
diag=FALSE, number.cex = 0.8)

We can observe that some of the variables that influence violent crime rates are:
- racepctblack: percentage of population that is african american
- MalePctDivorce: percentage of males who are divorced
- PctPopUnderPov: percentage of people under the poverty level
- pctWPubAsst: percentage of households with public assistance income in 1989
- PctUnemployed: percentage of people 16 and over, in the labor force, and unemployed
- PctLess9thGrade: percentage of people 25 and over with less than a 9th grade education
- PctOccupManu: percentage of people 16 and over who are employed in manufacturing
- PctHousNoPhone: percent of occupied housing units without phone (in 1990, this was rare!)
- PctHousLess3BR: percent of housing units with less than 3 bedrooms
Let's partition the dataset: we will use 75% of the data for training and 25% for validation.
sample = sample.split(dataset$ViolentCrimesPerPop, SplitRatio = 0.75)
train = subset(dataset, sample == TRUE) %>% as.matrix()
test = subset(dataset, sample == FALSE) %>% as.matrix()
y_train <- train[,40]
y_test <- test[,40]
X_train <- train[,-40]
X_test <- test[,-40]

Ridge regression is a regularization technique used in statistical modeling to address the problem of multicollinearity in data. It is used when there is a high degree of correlation between the independent variables in a multiple linear regression model, which can lead to unstable coefficient estimates and unreliable predictions.
Ridge regression works by adding a regularization term to the model’s objective function, which penalizes the extreme values of the model’s coefficients and reduces their magnitude. This helps to reduce the variance of the model and improve its generalizability to new data.
Now we'll use the cv.glmnet() function to fit the ridge regression model, specifying alpha = 0.
set.seed(1801)
fit.ridge <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), alpha = 0, type.measure = "mse", family="gaussian")

To identify what value to use for lambda, we'll use s="lambda.min".
set.seed(1802)
fitted.ridge.train <- predict(fit.ridge, newx = data.matrix(X_train), s="lambda.min")
fitted.ridge.test <- predict(fit.ridge, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.ridge.train)[1])## Train coefficient: 0.8163905
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.ridge.test)[1])##
## Test coefficient: 0.7438792
The train coefficient is 0.816 and test coefficient is 0.778
LASSO Regression (Least Absolute Shrinkage and Selection Operator) is another regularization technique used in statistical modeling to address the problem of multicollinearity in data and to perform variable selection.
LASSO regression is used when there are a large number of predictor variables in a multiple linear regression model and you want to reduce the complexity of the model by removing irrelevant or less important variables in predicting the response variable.
LASSO regression shrinks the coefficients of some variables to zero, allowing you to identify the most important variables for prediction. It is used in situations where there are many predictor variables and a simpler, more easily interpretable linear regression model is needed, removing the less important variables.
Now we'll use the cv.glmnet() function to fit the LASSO regression model, specifying alpha = 1.
set.seed(1803)
fit.lasso <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=1, family="gaussian")

To identify what value to use for lambda, we'll use s="lambda.min".
set.seed(1804)
fitted.lasso.train <- predict(fit.lasso, newx = data.matrix(X_train), s="lambda.min")
fitted.lasso.test <- predict(fit.lasso, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.lasso.train)[1])## Train coefficient: 0.8172372
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.lasso.test)[1])##
## Test coefficient: 0.7475433
The train coefficient is 0.818 and test coefficient is 0.782
lassoPredPerf <- postResample(pred = fitted.lasso.test , obs = y_test)
lassoPredPerf['Family'] <- 'Linear Regression'
lassoPredPerf['Model'] <- 'Lasso Regression'

Elastic Net Regression is used in situations where there is a high correlation between the predictor variables and a more stable linear regression model with variable selection is needed, taking advantage of the properties of Ridge regression and LASSO regression.
set.seed(1805)
fit.elnet <- glmnet(as.matrix(X_train), as.matrix(y_train), family="gaussian", alpha=.5)
fit.elnet.cv <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=.5,
family="gaussian")
fitted.elnet.train <- predict(fit.elnet.cv, newx = data.matrix(X_train), s="lambda.min")
fitted.elnet.test <- predict(fit.elnet.cv, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.elnet.train)[1])## Train coefficient: 0.8171584
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.elnet.test)[1])##
## Test coefficient: 0.7475421
The train coefficient is 0.817 and test coefficient is 0.7823
Nonlinear regression is used when the relationship between the response variable and the predictor variables in a regression model cannot be modeled by a linear function. In this case, a nonlinear function is needed to describe the relationship between the variables.
For this analysis we will use: Neural Networks, KNN (K-Nearest Neighbors), SVM (Support Vector Machines), and MARS (Multivariate Adaptive Regression Splines).
Neural Networks consist of layers of artificial neurons that process input information and generate output. Each neuron receives an input, performs a nonlinear transformation, and produces an output that is used as input to the next layer of neurons. The final output of the network is used to predict the output variable.
Let's use a neural network with 4 hidden units:
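The model summarised below could be fitted with caret along the following lines; this is a sketch, not necessarily the exact call, and the seed, bagging option, and iteration limit are assumptions.

# Sketch: model-averaged neural network with 4 hidden units and decay 0.01.
set.seed(1806)
fit.nnet <- train(x = X_train, y = y_train,
                  method = "avNNet",
                  tuneGrid = expand.grid(size = 4, decay = 0.01, bag = FALSE),
                  repeats = 4, linout = TRUE, trace = FALSE, maxit = 500,
                  trControl = trainControl(method = "cv", number = 10))
varImp(fit.nnet)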
## Model Averaged Neural Network with 4 Repeats
##
## a 39-4-1 network with 165 weights
## options were - linear output units decay=0.01
| Variable | Overall |
|---|---|
| X381 | 6.7398694 |
| X51 | 6.3103055 |
| X61 | 6.2222431 |
| X203 | 5.9591480 |
| X6 | 5.8826213 |
| X201 | 5.5498754 |
| X263 | 5.4310372 |
| X5 | 5.1515020 |
| X382 | 5.0220359 |
| X43 | 4.8912471 |
| X20 | 4.8783533 |
| X112 | 4.8140426 |
| X1 | 4.7783263 |
| X110 | 4.7690152 |
| X63 | 4.7159701 |
| X122 | 4.5094830 |
| X162 | 4.3341217 |
| X35 | 4.2289551 |
| X52 | 4.1469174 |
| X292 | 4.1228266 |
| X314 | 4.0869773 |
| X131 | 3.9967216 |
| X24 | 3.9724875 |
| X261 | 3.9636423 |
| X152 | 3.9390688 |
| X42 | 3.9161420 |
| X132 | 3.8229929 |
| X16 | 3.8053785 |
| X352 | 3.8035943 |
| X3 | 3.7820565 |
| X115 | 3.7322764 |
| X262 | 3.6579088 |
| X19 | 3.6429910 |
| X373 | 3.6126446 |
| X53 | 3.6026097 |
| X353 | 3.5938262 |
| X153 | 3.4994534 |
| X363 | 3.4346810 |
| X253 | 3.4101175 |
| X41 | 3.3460698 |
| X29 | 3.3287481 |
| X133 | 3.2731705 |
| X114 | 3.2546826 |
| X192 | 3.2128204 |
| X23 | 3.2027186 |
| X111 | 3.1942049 |
| X273 | 3.1761039 |
| X272 | 3.1058894 |
| X242 | 3.0439990 |
| X291 | 3.0126065 |
| X11 | 3.0082926 |
| X333 | 2.9858225 |
| X4 | 2.9815636 |
| X151 | 2.9397707 |
| X13 | 2.9268070 |
| X231 | 2.9239169 |
| X313 | 2.8951252 |
| X62 | 2.7860604 |
| X351 | 2.7686765 |
| X241 | 2.7621327 |
| X310 | 2.7524022 |
| X283 | 2.6893620 |
| X38 | 2.6521169 |
| X15 | 2.6489619 |
| X372 | 2.6380148 |
| X26 | 2.6321252 |
| X103 | 2.6213772 |
| X73 | 2.6080194 |
| X18 | 2.5629981 |
| X312 | 2.5572842 |
| X331 | 2.5019254 |
| X123 | 2.4748309 |
| X22 | 2.4552364 |
| X252 | 2.4517814 |
| X282 | 2.4235741 |
| X251 | 2.3844613 |
| X302 | 2.3809152 |
| X10 | 2.3107638 |
| X202 | 2.3019756 |
| X72 | 2.2820524 |
| X362 | 2.2604559 |
| X243 | 2.2578241 |
| X82 | 2.2283661 |
| X32 | 2.2205998 |
| X221 | 2.2081898 |
| X271 | 2.1931332 |
| X33 | 2.1907483 |
| X71 | 2.1906483 |
| X12 | 2.1723410 |
| X383 | 2.1610762 |
| X113 | 2.0981033 |
| X392 | 2.0904249 |
| X14 | 2.0831274 |
| X101 | 2.0732813 |
| X163 | 2.0365666 |
| X183 | 2.0361661 |
| X371 | 2.0226587 |
| X343 | 2.0117824 |
| X27 | 1.9930812 |
| X7 | 1.9908562 |
| X181 | 1.9892146 |
| X393 | 1.9765550 |
| X17 | 1.9303000 |
| X322 | 1.8715190 |
| X171 | 1.8714280 |
| X212 | 1.8527905 |
| X36 | 1.8132621 |
| X8 | 1.7472199 |
| X141 | 1.7260972 |
| X222 | 1.7246995 |
| X91 | 1.6764881 |
| X161 | 1.6482376 |
| X172 | 1.6109991 |
| X25 | 1.5993161 |
| X34 | 1.5983855 |
| X293 | 1.5953200 |
| X37 | 1.5940642 |
| X321 | 1.4812944 |
| X121 | 1.4779253 |
| X191 | 1.4501799 |
| X214 | 1.3917492 |
| X311 | 1.3491091 |
| X281 | 1.3386165 |
| X9 | 1.3281336 |
| X233 | 1.3232861 |
| X361 | 1.2996879 |
| X301 | 1.2184606 |
| X323 | 1.2016866 |
| X213 | 1.1748907 |
| X93 | 1.1346396 |
| X215 | 1.1077592 |
| X341 | 1.1071128 |
| X143 | 1.0938873 |
| X211 | 1.0883222 |
| X2 | 1.0867673 |
| X173 | 1.0835380 |
| X315 | 1.0732571 |
| X31 | 1.0189325 |
| X28 | 0.9983832 |
| X21 | 0.9924483 |
| X223 | 0.9916175 |
| X303 | 0.9913695 |
| X332 | 0.9855879 |
| X102 | 0.9755889 |
| X81 | 0.9578084 |
| X210 | 0.8946268 |
| X232 | 0.8433051 |
| X83 | 0.7853355 |
| X193 | 0.6932265 |
| X92 | 0.6482776 |
| X391 | 0.5996399 |
| X142 | 0.5815349 |
| X182 | 0.5648157 |
| X30 | 0.5363145 |
| X342 | 0.3200149 |
| X39 | 0.2727154 |
The KNN (K-Nearest Neighbors) model is a supervised machine learning method for classification and regression. It predicts the class or value of a new data instance from the characteristics of the nearest neighboring instances in the training set.
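A minimal sketch of how this KNN model can be fit with caret follows, again assuming the X_train and y_train objects; with tuneLength = 10, caret evaluates the odd k values 5 through 23 seen in the output.

```r
# Sketch: KNN regression via caret. Predictors are centered and scaled and the
# model is tuned with 10-fold cross-validation, as in the summary below.
library(caret)
set.seed(123)
knn_fit <- train(
  x = X_train, y = y_train,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = trainControl(method = "cv", number = 10),
  tuneLength = 10      # evaluates k = 5, 7, ..., 23
)
knn_fit          # smallest RMSE selects k = 23
varImp(knn_fit)  # variable importance table shown below
```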
## k-Nearest Neighbors
##
## 1499 samples
## 39 predictor
##
## Pre-processing: centered (39), scaled (39)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1349, 1350, 1349, 1350, 1348, 1350, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1470650 0.6164570 0.10158827
## 7 0.1439163 0.6323808 0.09915134
## 9 0.1443167 0.6312507 0.09954053
## 11 0.1431314 0.6398771 0.09805782
## 13 0.1433472 0.6397882 0.09817429
## 15 0.1434775 0.6403601 0.09847477
## 17 0.1433104 0.6436613 0.09814664
## 19 0.1431045 0.6460178 0.09809048
## 21 0.1433757 0.6459280 0.09837959
## 23 0.1427781 0.6499200 0.09771248
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 23.
| Variable | Overall |
|---|---|
| PctIlleg | 100.000000 |
| racePctWhite | 84.463999 |
| PctTeen2Par | 78.070630 |
| PctYoungKids2Par | 76.886371 |
| racepctblack | 67.653183 |
| pctWPubAsst | 58.263375 |
| pctWInvInc | 56.862002 |
| PctPopUnderPov | 43.912891 |
| MalePctDivorce | 42.763202 |
| PctUnemployed | 42.212224 |
| NumIlleg | 41.200287 |
| PctVacantBoarded | 37.308768 |
| PctHousLess3BR | 35.571028 |
| PctHousOwnOcc | 35.007781 |
| PctHousNoPhone | 34.742189 |
| PctPersDenseHous | 33.972668 |
| HousVacant | 29.811069 |
| PctLess9thGrade | 24.391977 |
| PctLargHouseFam | 20.652709 |
| NumInShelters | 20.190012 |
| PctWOFullPlumb | 15.783587 |
| NumStreet | 14.856954 |
| perCapInc | 13.844531 |
| MedNumBR | 13.694854 |
| PctOccupMgmtProf | 11.999077 |
| MedRentPctHousInc | 10.471355 |
| PctEmploy | 10.447415 |
| PctNotSpeakEnglWell | 10.337888 |
| racePctHisp | 8.862438 |
| MalePctNevMarr | 8.618872 |
| PopDens | 7.875297 |
| NumImmig | 7.649018 |
| pctWWage | 6.623132 |
| PctOccupManu | 6.297929 |
| PctHousOccup | 6.259769 |
| PctImmigRec10 | 4.935118 |
| blackPerCap | 3.242936 |
| PctRecImmig8 | 3.227065 |
| RentLowQ | 0.000000 |
The SVM (Support Vector Machine) model is a supervised machine learning method for classification and regression. It is particularly useful when the data is linearly or nearly linearly separable in the feature space, and it is efficient at handling high-dimensional data and small to medium data sets.
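A minimal sketch of the radial-kernel SVM fit via caret follows, under the same X_train / y_train assumption; caret estimates the kernel parameter sigma once and searches the cost C over 0.25, 0.5, ..., 128.

```r
# Sketch: SVM with a radial basis function kernel via caret (kernlab backend).
library(caret)
set.seed(123)
svm_fit <- train(
  x = X_train, y = y_train,
  method     = "svmRadial",
  preProcess = c("center", "scale"),
  trControl  = trainControl(method = "cv", number = 10),
  tuneLength = 10      # C grid from 0.25 to 128, sigma held constant
)
svm_fit          # best model: sigma ~ 0.026, C = 2
varImp(svm_fit)  # variable importance table shown below
```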
## Support Vector Machines with Radial Basis Function Kernel
##
## 1499 samples
## 39 predictor
##
## Pre-processing: centered (39), scaled (39)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1348, 1349, 1350, 1348, 1349, 1350, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1496910 0.6267657 0.09824670
## 0.50 0.1439197 0.6428549 0.09545459
## 1.00 0.1404450 0.6500871 0.09391998
## 2.00 0.1397640 0.6489056 0.09407907
## 4.00 0.1420422 0.6383513 0.09634762
## 8.00 0.1446649 0.6268358 0.09911138
## 16.00 0.1487816 0.6078767 0.10298394
## 32.00 0.1540362 0.5845711 0.10856409
## 64.00 0.1603825 0.5586436 0.11452927
## 128.00 0.1638883 0.5461915 0.11822571
##
## Tuning parameter 'sigma' was held constant at a value of 0.02599365
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02599365 and C = 2.
| Variable | Overall |
|---|---|
| PctIlleg | 100.000000 |
| racePctWhite | 84.463999 |
| PctTeen2Par | 78.070630 |
| PctYoungKids2Par | 76.886371 |
| racepctblack | 67.653183 |
| pctWPubAsst | 58.263375 |
| pctWInvInc | 56.862002 |
| PctPopUnderPov | 43.912891 |
| MalePctDivorce | 42.763202 |
| PctUnemployed | 42.212224 |
| NumIlleg | 41.200287 |
| PctVacantBoarded | 37.308768 |
| PctHousLess3BR | 35.571028 |
| PctHousOwnOcc | 35.007781 |
| PctHousNoPhone | 34.742189 |
| PctPersDenseHous | 33.972668 |
| HousVacant | 29.811069 |
| PctLess9thGrade | 24.391977 |
| PctLargHouseFam | 20.652709 |
| NumInShelters | 20.190012 |
| PctWOFullPlumb | 15.783587 |
| NumStreet | 14.856954 |
| perCapInc | 13.844531 |
| MedNumBR | 13.694854 |
| PctOccupMgmtProf | 11.999077 |
| MedRentPctHousInc | 10.471355 |
| PctEmploy | 10.447415 |
| PctNotSpeakEnglWell | 10.337888 |
| racePctHisp | 8.862438 |
| MalePctNevMarr | 8.618872 |
| PopDens | 7.875297 |
| NumImmig | 7.649018 |
| pctWWage | 6.623132 |
| PctOccupManu | 6.297929 |
| PctHousOccup | 6.259769 |
| PctImmigRec10 | 4.935118 |
| blackPerCap | 3.242936 |
| PctRecImmig8 | 3.227065 |
| RentLowQ | 0.000000 |
The MARS model (Multivariate Adaptive Regression Splines) is a supervised machine learning method for regression and classification. It is a non-parametric technique that approximates a complex function with combinations of simple piecewise-linear (hinge) functions.
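The earth call summarised below uses the X_train / y_train objects directly; a minimal sketch, where the “plotmo grid” printout comes from the diagnostic plots:

```r
library(earth)

# Sketch of the MARS fit summarised below.
mars_fit <- earth(x = X_train, y = y_train)
summary(mars_fit)   # hinge-function coefficients, selected terms, GCV/GRSq/RSq
plotmo(mars_fit)    # response curves; also prints the "plotmo grid" shown below
evimp(mars_fit)     # earth's own variable-importance measure
```

The importance values in the table below appear to be scaled to 0–100 (as caret::varImp reports them); evimp() gives the same ranking in earth’s native units.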
## Call: earth(x=X_train, y=y_train)
##
## coefficients
## (Intercept) 0.4494353
## h(racepctblack-0.86) -0.9662119
## h(0.21-racePctWhite) 0.5923169
## h(racePctWhite-0.21) -0.2197684
## h(pctWWage-0.73) -0.4280440
## h(0.47-pctWInvInc) 0.3971654
## h(0.57-MalePctDivorce) -0.1789002
## h(MalePctDivorce-0.57) 0.1684589
## h(0.55-PctTeen2Par) 0.2757925
## h(0.4-PctIlleg) -0.3027351
## h(PctIlleg-0.4) 0.1214990
## h(0.61-PctHousLess3BR) 0.1500186
## h(PctHousLess3BR-0.61) 0.3421317
## h(0.12-HousVacant) -0.6673085
## h(HousVacant-0.12) 0.1483440
## h(0.02-NumStreet) -2.1035280
## h(NumStreet-0.02) 0.1050918
##
## Selected 17 of 21 terms, and 10 of 39 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: PctIlleg, PctTeen2Par, NumStreet, pctWInvInc, HousVacant, ...
## Number of terms at each degree of interaction: 1 16 (additive model)
## GCV 0.01872543 RSS 26.84714 GRSq 0.6611047 RSq 0.6754289
## plotmo grid: racepctblack racePctWhite racePctHisp pctWWage pctWInvInc
## 0.06 0.85 0.04 0.57 0.48
## pctWPubAsst perCapInc blackPerCap PctPopUnderPov PctLess9thGrade PctUnemployed
## 0.26 0.31 0.25 0.24 0.27 0.33
## PctEmploy PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr
## 0.51 0.37 0.4 0.46 0.4
## PctYoungKids2Par PctTeen2Par NumIlleg PctIlleg NumImmig PctImmigRec10
## 0.7 0.6 0.01 0.17 0.01 0.43
## PctRecImmig8 PctNotSpeakEnglWell PctLargHouseFam PctPersDenseHous
## 0.09 0.06 0.21 0.11
## PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded
## 0.51 0.5 0.03 0.77 0.54 0.13
## PctHousNoPhone PctWOFullPlumb RentLowQ MedRentPctHousInc NumInShelters
## 0.18 0.2 0.32 0.48 0
## NumStreet PopDens
## 0 0.17
| Variable | Overall |
|---|---|
| PctIlleg | 100.00000 |
| PctTeen2Par | 40.76356 |
| NumStreet | 40.76356 |
| pctWInvInc | 35.12809 |
| HousVacant | 28.69767 |
| racePctWhite | 23.49241 |
| PctHousLess3BR | 19.62324 |
| MalePctDivorce | 12.42799 |
| pctWWage | 10.39611 |
| racepctblack | 6.97419 |
The Decision Tree model is used in supervised machine learning for classification and regression. It builds a tree from the training data to predict the label or value of a new data instance. It is useful when the data has non-linear relationships or complex interactions between the input features, and it is efficient at processing large data sets.
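The regression tree whose summary follows was fit with rpart on the training split; a minimal sketch:

```r
library(rpart)
library(rpart.plot)

# Regression tree on the training split, with ViolentCrimesPerPop as the response.
tree_fit <- rpart(ViolentCrimesPerPop ~ ., data = as.data.frame(train))
printcp(tree_fit)      # complexity-parameter (CP) table shown below
rpart.plot(tree_fit)   # draw the unpruned tree
```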
##
## Regression tree:
## rpart(formula = ViolentCrimesPerPop ~ ., data = as.data.frame(train))
##
## Variables actually used in tree construction:
## [1] MalePctDivorce NumStreet PctIlleg pctWPubAsst racePctHisp
##
## Root node error: 82.716/1499 = 0.055181
##
## n= 1499
##
## CP nsplit rel error xerror xstd
## 1 0.415223 0 1.00000 1.00154 0.050389
## 2 0.067533 1 0.58478 0.62276 0.030649
## 3 0.047360 2 0.51724 0.54645 0.026916
## 4 0.021278 3 0.46988 0.52341 0.027361
## 5 0.018565 4 0.44861 0.53202 0.027842
## 6 0.016753 5 0.43004 0.51126 0.027449
## 7 0.013985 6 0.41329 0.50114 0.026976
## 8 0.010000 7 0.39930 0.48388 0.026672
To find the best point at which to prune the tree, we need to use cross-validation and examine the xerror column of the complexity parameter (CP) table.
## CP nsplit rel error xerror xstd
## 1 0.41522304 0 1.0000000 1.0015384 0.05038930
## 2 0.06753305 1 0.5847770 0.6227624 0.03064924
## 3 0.04735950 2 0.5172439 0.5464481 0.02691551
## 4 0.02127802 3 0.4698844 0.5234057 0.02736072
## 5 0.01856464 4 0.4486064 0.5320167 0.02784219
## 6 0.01675309 5 0.4300417 0.5112581 0.02744900
## 7 0.01398494 6 0.4132887 0.5011393 0.02697578
## 8 0.01000000 7 0.3993037 0.4838784 0.02667223
The cross-validation error curve is at its lowest around a tree size of 6, so we will prune our tree to that size. At size 6, the cp is 0.0118 and the error is about 0.40.
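A minimal pruning sketch, assuming the tree_fit object from the sketch above and the cp value read off the cross-validation curve:

```r
# Prune at the cp suggested by the CV curve and re-plot the smaller tree.
pruned_fit <- prune(tree_fit, cp = 0.0118)
rpart.plot(pruned_fit)
```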
Model Accuracy is a measure of how well the model fits the training and test data. It is used to evaluate the performance of regression models and to compare different models.
## [1] 0.01012146
The accuracy of the model is low, even after pruning the tree.
Random forest is a supervised machine learning model used for classification and regression. It combines multiple decision trees to improve prediction accuracy and reduce overfitting. It is useful when working with large data sets and complex features, and when we want to avoid overfitting and improve model accuracy.
In this case the test error appears to be greater than the OOB error; ideally, the out-of-bag and test errors should roughly line up.
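The random forest summarised below can be reproduced with a call like the following sketch; train2 is the training split and mtry is set to the number of predictors (39), which makes the forest behave like bagging.

```r
library(randomForest)

set.seed(123)
mtry   <- 39   # all predictors considered at each split
rf_fit <- randomForest(ViolentCrimesPerPop ~ ., data = train2,
                       mtry = mtry, ntree = 200, proximity = TRUE)
rf_fit              # OOB mean of squared residuals and % variance explained
varImpPlot(rf_fit)  # importance plots discussed below
```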
##
## Call:
## randomForest(formula = ViolentCrimesPerPop ~ ., data = train2, mtry = mtry, ntree = 200, proximity = TRUE)
## Type of random forest: regression
## Number of trees: 200
## No. of variables tried at each split: 39
##
## Mean of squared residuals: 0.01949456
## % Var explained: 64.67
The graphs show the importance of each variable in predicting violent crime. The mean decrease in accuracy shows how much the model’s accuracy drops when a variable is removed. PctIlleg, the percentage of kids born to never-married parents, is the most important variable, by a very large margin over the second-ranked variable.
Boosting is a supervised machine learning method used for regression and classification problems. It is an ensemble technique that combines multiple weak decision trees to form a stronger and more accurate model.
This method is especially useful in complex data analysis and prediction problems where high accuracy and good performance are required. It is suitable for both structured and unstructured data and can handle a large number of features.
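A minimal sketch of the boosted model via caret’s xgbTree method follows; the grid size is an assumption, but the selected parameters reported in the table below fall within caret’s default grid.

```r
# Sketch: gradient boosting (XGBoost) tuned with caret.
library(caret)
library(xgboost)
set.seed(123)
xgb_fit <- train(
  x = X_train, y = y_train,
  method     = "xgbTree",
  trControl  = trainControl(method = "cv", number = 10),
  tuneLength = 3       # assumption; default grid over nrounds, depth, eta, ...
)
xgb_fit$bestTune   # the winning combination, shown in the table below
varImp(xgb_fit)    # PctIlleg dominates, as discussed below
```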
| | nrounds | max_depth | eta | gamma | colsample_bytree | min_child_weight | subsample |
|---|---|---|---|---|---|---|---|
| 34 | 50 | 2 | 0.3 | 0 | 0.8 | 1 | 1 |
According to the graphs, we can see which variable matters most for predicting violent crime. The most important variable is PctIlleg: the percentage of kids born to never-married parents.
Let’s graph the tree models used.
| Family | Model | RMSE | Rsquared | MAE |
|---|---|---|---|---|
| Linear Regression | Ridge Regression | 0.1376 | 0.6336 | 0.0930 |
| Linear Regression | Lasso Regression | 0.1370 | 0.6374 | 0.0928 |
| Linear Regression | Elastic Net Regression | 0.1370 | 0.6372 | 0.0927 |
| Non-Linear Regression | Neural Network | 0.1379 | 0.6341 | 0.0894 |
| Non-Linear Regression | KNN | 0.1413 | 0.6148 | 0.0935 |
| Non-Linear Regression | SVM | 0.1361 | 0.6448 | 0.0877 |
| Non-Linear Regression | MARS | 0.1369 | 0.6410 | 0.0916 |
| Trees & Boosting | Decision Tree | 0.1690 | 0.4587 | 0.1172 |
| Trees & Boosting | Random Forest | NA | 0.0006 | NA |
| Trees & Boosting | XGBoost | 0.1398 | 0.6260 | 0.0940 |
Reviewing the table that summarizes all the models used in our analysis, we can see the following:
In general, all the models perform well, with an R2 above 60%, except for the simple decision tree model, which has an R2 of approximately 45%, and the random forest model, which reports less than 1%.
We consider the best model to be the SVM (Support Vector Machine), with an RMSE of 0.136 and an R2 of approximately 64.5%.
Based on the data and data exploration conducted, we were able to analyze a large number of factors simultaneously to understand how the relationships between the different factors affect crime rates in our communities and states.
According to the analysis, we observed that the percentage of children born to never-married parents is a factor that considerably affects the rate of violent crime. While it is true that family structure can have some influence on children’s behavior and development, delinquency is a complex problem with multiple causes and cannot be attributed to a single factor. And while there are other factors, such as race, that the data suggests could influence the violent crime rate, crime may also be related to poverty, economic inequality, access to guns, drug and alcohol abuse, discrimination, and racism, among other factors.
Therefore, to address the problem of delinquency, it is necessary to take into account a wide variety of factors, including but not limited to family structure.
Violent crime is a complex problem with multiple causes in the United States. Some of the possible causes of violent crime in communities are the following:
Socioeconomic inequality: Economic and social inequality can create tensions and conflicts among members of a community, which in turn can lead to violence.
Unemployment: Unemployment can increase despair and hopelessness, which can lead some people to turn to crime to survive.
Poverty: Poverty may be linked to increased violence, as people living in poverty may have less access to resources and opportunities, which can lead to crime.
Drugs and alcohol: Drug and alcohol abuse can increase a person’s likelihood of committing a violent crime.
Racism and Discrimination: Racism and discrimination can contribute to violence, especially in communities where there are tensions between different racial or ethnic groups.
It is important to note that these are just a few of the possible causes of violent crime in communities across the United States, and that each case is unique and may have multiple contributing factors. Therefore, it is necessary to address these problems holistically and focus on finding specific solutions for each community. It is important to avoid stigmatizing certain groups in society and instead focus on understanding and addressing the underlying causes of crime so that effective and sustainable action can be taken to reduce it.
Resource: https://www.kaggle.com/datasets/michaelbryantds/crimedata?select=crimedata.csv