Bikash-Data-622-Assignment-4.knit

Column

Instructions

Exploratory analysis and essay Assignment

Choose a dataset You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.
Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new dataset selected.
To complete this task:.

Describe the problem you are trying to solve.
Describe your datases and what you did to prepare the data for analysis.
Methodologies you used for analyzing the data
What’s the purpose of the analysis performed
Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Introduction

We will analyze the Communities & Crime dataset, which contains information from communities across the United States. This dataset integrates socio-economic indicators from the 1990 U.S. Census, law enforcement data from the 1990 U.S. LEMAS survey, and crime statistics from the 1995 FBI UCR. Together, these sources provide a comprehensive view of the demographic, economic, and policing factors associated with crime at the community level.

Study Objective

Public perception of neighborhood safety is often shaped by high profile violent crimes reported in the media. These events raise questions about the underlying social and economic factors that may influence crime in different regions. For instance, the rise in race-related incidents has intensified discussions about whether racial composition plays a role in community crime rates. Other factors, such as socio-economic status, immigration levels, and cultural or religious demographics are also commonly examined when considering what drives crime.

Understanding where violent crimes occur, and how they relate to a community’s socio economic, environmental, and demographic characteristics, is essential for identifying the root causes of these incidents. By building predictive models that quantify the risk associated with a community, we can estimate where violent crime is more likely to occur. Such insights can support city planning efforts, guiding efficient allocation of police resources and proactive safety measures. Ultimately, identifying the key drivers of violent crime provides valuable direction for urban development and the refinement of policing strategies.

Load packages

Load all required packages

Overview of the Data

The Communities and Crime dataset from the UCI Machine Learning Repository brings together socio-economic indicators from the 1990 U.S. Census, police department data from the 1990 LEMAS survey, and crime statistics from the 1995 FBI UCR. Although the data is older, it still provides meaningful insights into community characteristics and crime patterns, and supports building predictive models for crime prevention and urban planning.

The dataset contains 122 predictor variables plus the target variable, Violent Crimes per Population. Variables cover community demographics (e.g., percent urban, median family income) and law-enforcement features (e.g., per-capita officers, percent assigned to drug units). Only attributes with a plausible relationship to crime were included.

Violent crime rates were calculated using murder, rape, robbery, and assault. Some states reported rape inconsistently, which caused missing or inaccurate values; those communities were removed from the dataset.

All numeric features were normalized to a 0–1 scale using equal-interval binning. This preserves each attribute’s internal distribution but does not make values comparable across attributes.

Due to survey limitations, some communities lack LEMAS data, and only locations present across all sources are included. The final prediction target for analysis is Violent Crimes Per Population.

Data Definitions

The crime dataset has 128 attributes. Below is a brief description of all attributes along with data type details -

datadict <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/data_dictionary.csv')

datadict %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="100%",height="500px")

No.	Column	Description	Data Type
1	state	US state (by number) - not counted as predictive above, but if considered, should be consided nominal	nominal
2	county	numeric code for county - not predictive, and many missing values	numeric
3	community	numeric code for community - not predictive and many missing values	numeric
4	communityname	community name - not predictive - for information only	string
5	fold	fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive	numeric
6	population	population for community	numeric - decimal
7	householdsize	mean people per household	numeric - decimal
8	racepctblack	percentage of population that is african american	numeric - decimal
9	racePctWhite	percentage of population that is caucasian	numeric - decimal
10	racePctAsian	percentage of population that is of asian heritage	numeric - decimal
11	racePctHisp	percentage of population that is of hispanic heritage	numeric - decimal
12	agePct12t21	percentage of population that is 12-21 in age	numeric - decimal
13	agePct12t29	percentage of population that is 12-29 in age	numeric - decimal
14	agePct16t24	percentage of population that is 16-24 in age	numeric - decimal
15	agePct65up	percentage of population that is 65 and over in age	numeric - decimal
16	numbUrban	number of people living in areas classified as urban	numeric - decimal
17	pctUrban	percentage of people living in areas classified as urban	numeric - decimal
18	medIncome	median household income	numeric - decimal
19	pctWWage	percentage of households with wage or salary income in 1989	numeric - decimal
20	pctWFarmSelf	percentage of households with farm or self employment income in 1989	numeric - decimal
21	pctWInvInc	percentage of households with investment / rent income in 1989	numeric - decimal
22	pctWSocSec	percentage of households with social security income in 1989	numeric - decimal
23	pctWPubAsst	percentage of households with public assistance income in 1989	numeric - decimal
24	pctWRetire	percentage of households with retirement income in 1989	numeric - decimal
25	medFamInc	median family income (differs from household income for non-family households)	numeric - decimal
26	perCapInc	per capita income	numeric - decimal
27	whitePerCap	per capita income for caucasians	numeric - decimal
28	blackPerCap	per capita income for african americans	numeric - decimal
29	indianPerCap	per capita income for native americans	numeric - decimal
30	AsianPerCap	per capita income for people with asian heritage	numeric - decimal
31	OtherPerCap	per capita income for people with ‘other’ heritage	numeric - decimal
32	HispPerCap	per capita income for people with hispanic heritage	numeric - decimal
33	NumUnderPov	number of people under the poverty level	numeric - decimal
34	PctPopUnderPov	percentage of people under the poverty level	numeric - decimal
35	PctLess9thGrade	percentage of people 25 and over with less than a 9th grade education	numeric - decimal
36	PctNotHSGrad	percentage of people 25 and over that are not high school graduates	numeric - decimal
37	PctBSorMore	percentage of people 25 and over with a bachelors degree or higher education	numeric - decimal
38	PctUnemployed	percentage of people 16 and over, in the labor force, and unemployed	numeric - decimal
39	PctEmploy	percentage of people 16 and over who are employed	numeric - decimal
40	PctEmplManu	percentage of people 16 and over who are employed in manufacturing	numeric - decimal
41	PctEmplProfServ	percentage of people 16 and over who are employed in professional services	numeric - decimal
42	PctOccupManu	percentage of people 16 and over who are employed in manufacturing	numeric - decimal
43	PctOccupMgmtProf	percentage of people 16 and over who are employed in management or professional occupations	numeric - decimal
44	MalePctDivorce	percentage of males who are divorced	numeric - decimal
45	MalePctNevMarr	percentage of males who have never married	numeric - decimal
46	FemalePctDiv	percentage of females who are divorced	numeric - decimal
47	TotalPctDiv	percentage of population who are divorced	numeric - decimal
48	PersPerFam	mean number of people per family	numeric - decimal
49	PctFam2Par	percentage of families (with kids) that are headed by two parents	numeric - decimal
50	PctKids2Par	percentage of kids in family housing with two parents	numeric - decimal
51	PctYoungKids2Par	percent of kids 4 and under in two parent households	numeric - decimal
52	PctTeen2Par	percent of kids age 12-17 in two parent households	numeric - decimal
53	PctWorkMomYoungKids	percentage of moms of kids 6 and under in labor force	numeric - decimal
54	PctWorkMom	percentage of moms of kids under 18 in labor force	numeric - decimal
55	NumIlleg	number of kids born to never married	numeric - decimal
56	PctIlleg	percentage of kids born to never married	numeric - decimal
57	NumImmig	total number of people known to be foreign born	numeric - decimal
58	PctImmigRecent	percentage of immigrants who immigated within last 3 years	numeric - decimal
59	PctImmigRec5	percentage of immigrants who immigated within last 5 years	numeric - decimal
60	PctImmigRec8	percentage of immigrants who immigated within last 8 years	numeric - decimal
61	PctImmigRec10	percentage of immigrants who immigated within last 10 years	numeric - decimal
62	PctRecentImmig	percent of population who have immigrated within the last 3 years	numeric - decimal
63	PctRecImmig5	percent of population who have immigrated within the last 5 years	numeric - decimal
64	PctRecImmig8	percent of population who have immigrated within the last 8 years	numeric - decimal
65	PctRecImmig10	percent of population who have immigrated within the last 10 years	numeric - decimal
66	PctSpeakEnglOnly	percent of people who speak only English	numeric - decimal
67	PctNotSpeakEnglWell	percent of people who do not speak English well	numeric - decimal
68	PctLargHouseFam	percent of family households that are large (6 or more)	numeric - decimal
69	PctLargHouseOccup	percent of all occupied households that are large (6 or more people)	numeric - decimal
70	PersPerOccupHous	mean persons per household	numeric - decimal
71	PersPerOwnOccHous	mean persons per owner occupied household	numeric - decimal
72	PersPerRentOccHous	mean persons per rental household	numeric - decimal
73	PctPersOwnOccup	percent of people in owner occupied households	numeric - decimal
74	PctPersDenseHous	percent of persons in dense housing (more than 1 person per room)	numeric - decimal
75	PctHousLess3BR	percent of housing units with less than 3 bedrooms	numeric - decimal
76	MedNumBR	median number of bedrooms	numeric - decimal
77	HousVacant	number of vacant households	numeric - decimal
78	PctHousOccup	percent of housing occupied	numeric - decimal
79	PctHousOwnOcc	percent of households owner occupied	numeric - decimal
80	PctVacantBoarded	percent of vacant housing that is boarded up	numeric - decimal
81	PctVacMore6Mos	percent of vacant housing that has been vacant more than 6 months	numeric - decimal
82	MedYrHousBuilt	median year housing units built	numeric - decimal
83	PctHousNoPhone	percent of occupied housing units without phone (in 1990, this was rare!)	numeric - decimal
84	PctWOFullPlumb	percent of housing without complete plumbing facilities	numeric - decimal
85	OwnOccLowQuart	owner occupied housing - lower quartile value	numeric - decimal
86	OwnOccMedVal	owner occupied housing - median value	numeric - decimal
87	OwnOccHiQuart	owner occupied housing - upper quartile value	numeric - decimal
88	RentLowQ	rental housing - lower quartile rent	numeric - decimal
89	RentMedian	rental housing - median rent (Census variable H32B from file STF1A)	numeric - decimal
90	RentHighQ	rental housing - upper quartile rent	numeric - decimal
91	MedRent	median gross rent (Census variable H43A from file STF3A - includes utilities)	numeric - decimal
92	MedRentPctHousInc	median gross rent as a percentage of household income	numeric - decimal
93	MedOwnCostPctInc	median owners cost as a percentage of household income - for owners with a mortgage	numeric - decimal
94	MedOwnCostPctIncNoMtg	median owners cost as a percentage of household income - for owners without a mortgage	numeric - decimal
95	NumInShelters	number of people in homeless shelters	numeric - decimal
96	NumStreet	number of homeless people counted in the street	numeric - decimal
97	PctForeignBorn	percent of people foreign born	numeric - decimal
98	PctBornSameState	percent of people born in the same state as currently living	numeric - decimal
99	PctSameHouse85	percent of people living in the same house as in 1985 (5 years before)	numeric - decimal
100	PctSameCity85	percent of people living in the same city as in 1985 (5 years before)	numeric - decimal
101	PctSameState85	percent of people living in the same state as in 1985 (5 years before)	numeric - decimal
102	LemasSwornFT	number of sworn full time police officers	numeric - decimal
103	LemasSwFTPerPop	sworn full time police officers per 100K population	numeric - decimal
104	LemasSwFTFieldOps	number of sworn full time police officers in field operations (on the street as opposed to administrative etc)	numeric - decimal
105	LemasSwFTFieldPerPop	sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population	numeric - decimal
106	LemasTotalReq	total requests for police	numeric - decimal
107	LemasTotReqPerPop	total requests for police per 100K popuation	numeric - decimal
108	PolicReqPerOffic	total requests for police per police officer	numeric - decimal
109	PolicPerPop	police officers per 100K population	numeric - decimal
110	RacialMatchCommPol	a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar	numeric - decimal
111	PctPolicWhite	percent of police that are caucasian	numeric - decimal
112	PctPolicBlack	percent of police that are african american	numeric - decimal
113	PctPolicHisp	percent of police that are hispanic	numeric - decimal
114	PctPolicAsian	percent of police that are asian	numeric - decimal
115	PctPolicMinor	percent of police that are minority of any kind	numeric - decimal
116	OfficAssgnDrugUnits	number of officers assigned to special drug units	numeric - decimal
117	NumKindsDrugsSeiz	number of different kinds of drugs seized	numeric - decimal
118	PolicAveOTWorked	police average overtime worked	numeric - decimal
119	LandArea	land area in square miles	numeric - decimal
120	PopDens	population density in persons per square mile	numeric - decimal
121	PctUsePubTrans	percent of people using public transit for commuting	numeric - decimal
122	PolicCars	number of police cars	numeric - decimal
123	PolicOperBudg	police operating budget	numeric - decimal
124	LemasPctPolicOnPatr	percent of sworn full time police officers on patrol	numeric - decimal
125	LemasGangUnitDeploy	gang unit deployed	numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time
126	LemasPctOfficDrugUn	percent of officers assigned to drug units	numeric - decimal
127	PolicBudgPerPop	police operating budget per population	numeric - decimal
128	ViolentCrimesPerPop	total number of violent crimes per 100K popuation (numeric - decimal) GOAL attribute	to be predicted

Supplimentary Data

We also loaded the supplimentary state level dataset from UCI Machine Learning repository to enrich our crime dataset with state level information -

states <- read.csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/states.csv')

names(states) <- c("state_code","state","stateName","stateENS")

states %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="80%",height="300px")

state_code	state	stateName	stateENS
1	AL	Alabama	1779775
2	AK	Alaska	1785533
4	AZ	Arizona	1779777
5	AR	Arkansas	68085
6	CA	California	1779778
8	CO	Colorado	1779779
9	CT	Connecticut	1779780
10	DE	Delaware	1779781
11	DC	District of Columbia	1702382
12	FL	Florida	294478
13	GA	Georgia	1705317
15	HI	Hawaii	1779782
16	ID	Idaho	1779783
17	IL	Illinois	1779784
18	IN	Indiana	448508
19	IA	Iowa	1779785
20	KS	Kansas	481813
21	KY	Kentucky	1779786
22	LA	Louisiana	1629543
23	ME	Maine	1779787
24	MD	Maryland	1714934
25	MA	Massachusetts	606926
26	MI	Michigan	1779789
27	MN	Minnesota	662849
28	MS	Mississippi	1779790
29	MO	Missouri	1779791
30	MT	Montana	767982
31	NE	Nebraska	1779792
32	NV	Nevada	1779793
33	NH	New Hampshire	1779794
34	NJ	New Jersey	1779795
35	NM	New Mexico	897535
36	NY	New York	1779796
37	NC	North Carolina	1027616
38	ND	North Dakota	1779797
39	OH	Ohio	1085497
40	OK	Oklahoma	1102857
41	OR	Oregon	1155107
42	PA	Pennsylvania	1779798
44	RI	Rhode Island	1219835
45	SC	South Carolina	1779799
46	SD	South Dakota	1785534
47	TN	Tennessee	1325873
48	TX	Texas	1779801
49	UT	Utah	1455989
50	VT	Vermont	1779802
51	VA	Virginia	1779803
53	WA	Washington	1779804
54	WV	West Virginia	1779805
55	WI	Wisconsin	1779806
56	WY	Wyoming	1779807
60	AS	American Samoa	1802701
66	GU	Guam	1802705
69	MP	Northern Mariana Islands	1779809
72	PR	Puerto Rico	1779808
74	UM	U.S. Minor Outlying Islands	1878752
78	VI	U.S. Virgin Islands	1802710

Dataset Preview

dataset <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/communities.data.csv')
head(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="100%",height="300px")

state	county	community	communityname	fold	population	householdsize	racepctblack	racePctWhite	racePctAsian	racePctHisp	agePct12t21	agePct12t29	agePct16t24	agePct65up	numbUrban	pctUrban	medIncome	pctWWage	pctWFarmSelf	pctWInvInc	pctWSocSec	pctWPubAsst	pctWRetire	medFamInc	perCapInc	whitePerCap	blackPerCap	indianPerCap	AsianPerCap	OtherPerCap	HispPerCap	NumUnderPov	PctPopUnderPov	PctLess9thGrade	PctNotHSGrad	PctBSorMore	PctUnemployed	PctEmploy	PctEmplManu	PctEmplProfServ	PctOccupManu	PctOccupMgmtProf	MalePctDivorce	MalePctNevMarr	FemalePctDiv	TotalPctDiv	PersPerFam	PctFam2Par	PctKids2Par	PctYoungKids2Par	PctTeen2Par	PctWorkMomYoungKids	PctWorkMom	NumIlleg	PctIlleg	NumImmig	PctImmigRecent	PctImmigRec5	PctImmigRec8	PctImmigRec10	PctRecentImmig	PctRecImmig5	PctRecImmig8	PctRecImmig10	PctSpeakEnglOnly	PctNotSpeakEnglWell	PctLargHouseFam	PctLargHouseOccup	PersPerOccupHous	PersPerOwnOccHous	PersPerRentOccHous	PctPersOwnOccup	PctPersDenseHous	PctHousLess3BR	MedNumBR	HousVacant	PctHousOccup	PctHousOwnOcc	PctVacantBoarded	PctVacMore6Mos	MedYrHousBuilt	PctHousNoPhone	PctWOFullPlumb	OwnOccLowQuart	OwnOccMedVal	OwnOccHiQuart	RentLowQ	RentMedian	RentHighQ	MedRent	MedRentPctHousInc	MedOwnCostPctInc	MedOwnCostPctIncNoMtg	NumInShelters	PctForeignBorn	PctBornSameState	PctSameHouse85	PctSameCity85	PctSameState85	LemasSwornFT	LemasSwFTPerPop	LemasSwFTFieldOps	LemasSwFTFieldPerPop	LemasTotalReq	LemasTotReqPerPop	PolicReqPerOffic	PolicPerPop	RacialMatchCommPol	PctPolicWhite	PctPolicBlack	PctPolicHisp	PctPolicAsian	PctPolicMinor	OfficAssgnDrugUnits	NumKindsDrugsSeiz	PolicAveOTWorked	LandArea	PopDens	PctUsePubTrans	PolicCars	PolicOperBudg	LemasPctPolicOnPatr	LemasGangUnitDeploy	LemasPctOfficDrugUn	PolicBudgPerPop	ViolentCrimesPerPop
8	?	?	Lakewoodcity	1	0.19	0.33	0.02	0.90	0.12	0.17	0.34	0.47	0.29	0.32	0.20	1.0	0.37	0.72	0.34	0.60	0.29	0.15	0.43	0.39	0.40	0.39	0.32	0.27	0.27	0.36	0.41	0.08	0.19	0.10	0.18	0.48	0.27	0.68	0.23	0.41	0.25	0.52	0.68	0.40	0.75	0.75	0.35	0.55	0.59	0.61	0.56	0.74	0.76	0.04	0.14	0.03	0.24	0.27	0.37	0.39	0.07	0.07	0.08	0.08	0.89	0.06	0.14	0.13	0.33	0.39	0.28	0.55	0.09	0.51	0.5	0.21	0.71	0.52	0.05	0.26	0.65	0.14	0.06	0.22	0.19	0.18	0.36	0.35	0.38	0.34	0.38	0.46	0.25	0.04	0.12	0.42	0.50	0.51	0.64	0.03	0.13	0.96	0.17	0.06	0.18	0.44	0.13	0.94	0.93	0.03	0.07	0.1	0.07	0.02	0.57	0.29	0.12	0.26	0.20	0.06	0.04	0.9	0.5	0.32	0.14	0.20
53	?	?	Tukwilacity	1	0.00	0.16	0.12	0.74	0.45	0.07	0.26	0.59	0.35	0.27	0.02	1.0	0.31	0.72	0.11	0.45	0.25	0.29	0.39	0.29	0.37	0.38	0.33	0.16	0.30	0.22	0.35	0.01	0.24	0.14	0.24	0.30	0.27	0.73	0.57	0.15	0.42	0.36	1.00	0.63	0.91	1.00	0.29	0.43	0.47	0.60	0.39	0.46	0.53	0.00	0.24	0.01	0.52	0.62	0.64	0.63	0.25	0.27	0.25	0.23	0.84	0.10	0.16	0.10	0.17	0.29	0.17	0.26	0.20	0.82	0.0	0.02	0.79	0.24	0.02	0.25	0.65	0.16	0.00	0.21	0.20	0.21	0.42	0.38	0.40	0.37	0.29	0.32	0.18	0.00	0.21	0.50	0.34	0.60	0.52	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	0.02	0.12	0.45	?	?	?	?	0.00	?	0.67
24	?	?	Aberdeentown	1	0.00	0.42	0.49	0.56	0.17	0.04	0.39	0.47	0.28	0.32	0.00	0.0	0.30	0.58	0.19	0.39	0.38	0.40	0.84	0.28	0.27	0.29	0.27	0.07	0.29	0.28	0.39	0.01	0.27	0.27	0.43	0.19	0.36	0.58	0.32	0.29	0.49	0.32	0.63	0.41	0.71	0.70	0.45	0.42	0.44	0.43	0.43	0.71	0.67	0.01	0.46	0.00	0.07	0.06	0.15	0.19	0.02	0.02	0.04	0.05	0.88	0.04	0.20	0.20	0.46	0.52	0.43	0.42	0.15	0.51	0.5	0.01	0.86	0.41	0.29	0.30	0.52	0.47	0.45	0.18	0.17	0.16	0.27	0.29	0.27	0.31	0.48	0.39	0.28	0.00	0.14	0.49	0.54	0.67	0.56	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	0.01	0.21	0.02	?	?	?	?	0.00	?	0.43
34	5	81440	Willingborotownship	1	0.04	0.77	1.00	0.08	0.12	0.10	0.51	0.50	0.34	0.21	0.06	1.0	0.58	0.89	0.21	0.43	0.36	0.20	0.82	0.51	0.36	0.40	0.39	0.16	0.25	0.36	0.44	0.01	0.10	0.09	0.25	0.31	0.33	0.71	0.36	0.45	0.37	0.39	0.34	0.45	0.49	0.44	0.75	0.65	0.54	0.83	0.65	0.85	0.86	0.03	0.33	0.02	0.11	0.20	0.30	0.31	0.05	0.08	0.11	0.11	0.81	0.08	0.56	0.62	0.85	0.77	1.00	0.94	0.12	0.01	0.5	0.01	0.97	0.96	0.60	0.47	0.52	0.11	0.11	0.24	0.21	0.19	0.75	0.70	0.77	0.89	0.63	0.51	0.47	0.00	0.19	0.30	0.73	0.64	0.65	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	0.02	0.39	0.28	?	?	?	?	0.00	?	0.12
42	95	6096	Bethlehemtownship	1	0.01	0.55	0.02	0.95	0.09	0.05	0.38	0.38	0.23	0.36	0.02	0.9	0.50	0.72	0.16	0.68	0.44	0.11	0.71	0.46	0.43	0.41	0.28	0.00	0.74	0.51	0.48	0.00	0.06	0.25	0.30	0.33	0.12	0.65	0.67	0.38	0.42	0.46	0.22	0.27	0.20	0.21	0.51	0.91	0.91	0.89	0.85	0.40	0.60	0.00	0.06	0.00	0.03	0.07	0.20	0.27	0.01	0.02	0.04	0.05	0.88	0.05	0.16	0.19	0.59	0.60	0.37	0.89	0.02	0.19	0.5	0.01	0.89	0.87	0.04	0.55	0.73	0.05	0.14	0.31	0.31	0.30	0.40	0.36	0.38	0.38	0.22	0.51	0.21	0.00	0.11	0.72	0.64	0.61	0.53	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	0.04	0.09	0.02	?	?	?	?	0.00	?	0.03
6	?	?	SouthPasadenacity	1	0.02	0.28	0.06	0.54	1.00	0.25	0.31	0.48	0.27	0.37	0.04	1.0	0.52	0.68	0.20	0.61	0.28	0.15	0.25	0.62	0.72	0.76	0.77	0.28	0.52	0.48	0.60	0.01	0.12	0.13	0.12	0.80	0.10	0.65	0.19	0.77	0.06	0.91	0.49	0.57	0.61	0.58	0.44	0.62	0.69	0.87	0.53	0.30	0.43	0.00	0.11	0.04	0.30	0.35	0.43	0.47	0.50	0.50	0.56	0.57	0.45	0.28	0.25	0.19	0.29	0.53	0.18	0.39	0.26	0.73	0.0	0.02	0.84	0.30	0.16	0.28	0.25	0.02	0.05	0.94	1.00	1.00	0.67	0.63	0.68	0.62	0.47	0.59	0.11	0.00	0.70	0.42	0.49	0.73	0.64	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	0.01	0.58	0.10	?	?	?	?	0.00	?	0.14

dim(dataset)

[1] 1994  128

The dataset has 1994 observations and 128 variable.

Descriptive Dataset Summary

The dataset contains no duplicate records, but it contains a lot of missing values, represented by “?”. We need to convert these entries into NA values to begin our analysis.

The first 5 non-predictive variables represent the name of the community, the state in which it is located, the county code, the community code, and the fold number for non-random 10 fold cross validation. A summary of the remaining variables is below:

dataset[dataset == '?'] = NA

summary(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")

state	county	community	communityname	fold	population	householdsize	racepctblack	racePctWhite	racePctAsian	racePctHisp	agePct12t21	agePct12t29	agePct16t24	agePct65up	numbUrban	pctUrban	medIncome	pctWWage	pctWFarmSelf	pctWInvInc	pctWSocSec	pctWPubAsst	pctWRetire	medFamInc	perCapInc	whitePerCap	blackPerCap	indianPerCap	AsianPerCap	OtherPerCap	HispPerCap	NumUnderPov	PctPopUnderPov	PctLess9thGrade	PctNotHSGrad	PctBSorMore	PctUnemployed	PctEmploy	PctEmplManu	PctEmplProfServ	PctOccupManu	PctOccupMgmtProf	MalePctDivorce	MalePctNevMarr	FemalePctDiv	TotalPctDiv	PersPerFam	PctFam2Par	PctKids2Par	PctYoungKids2Par	PctTeen2Par	PctWorkMomYoungKids	PctWorkMom	NumIlleg	PctIlleg	NumImmig	PctImmigRecent	PctImmigRec5	PctImmigRec8	PctImmigRec10	PctRecentImmig	PctRecImmig5	PctRecImmig8	PctRecImmig10	PctSpeakEnglOnly	PctNotSpeakEnglWell	PctLargHouseFam	PctLargHouseOccup	PersPerOccupHous	PersPerOwnOccHous	PersPerRentOccHous	PctPersOwnOccup	PctPersDenseHous	PctHousLess3BR	MedNumBR	HousVacant	PctHousOccup	PctHousOwnOcc	PctVacantBoarded	PctVacMore6Mos	MedYrHousBuilt	PctHousNoPhone	PctWOFullPlumb	OwnOccLowQuart	OwnOccMedVal	OwnOccHiQuart	RentLowQ	RentMedian	RentHighQ	MedRent	MedRentPctHousInc	MedOwnCostPctInc	MedOwnCostPctIncNoMtg	NumInShelters	NumStreet	PctForeignBorn	PctBornSameState	PctSameHouse85	PctSameCity85	PctSameState85	LemasSwornFT	LemasSwFTPerPop	LemasSwFTFieldOps	LemasSwFTFieldPerPop	LemasTotalReq	LemasTotReqPerPop	PolicReqPerOffic	PolicPerPop	RacialMatchCommPol	PctPolicWhite	PctPolicBlack	PctPolicHisp	PctPolicAsian	PctPolicMinor	OfficAssgnDrugUnits	NumKindsDrugsSeiz	PolicAveOTWorked	LandArea	PopDens	PctUsePubTrans	PolicCars	PolicOperBudg	LemasPctPolicOnPatr	LemasGangUnitDeploy	LemasPctOfficDrugUn	PolicBudgPerPop	ViolentCrimesPerPop
Min. : 1.00	Length:1994	Length:1994	Length:1994	Min. : 1.000	Min. :0.00000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.00000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Length:1994	Min. :0.0000	Min. :0.00000	Min. :0.000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.00000	Min. :0.00	Min. :0.00000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.00000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.00000	Min. :0.00000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Length:1994	Min. :0.00000	Min. :0.0000	Min. :0.0000	Length:1994	Length:1994	Length:1994	Length:1994	Min. :0.00000	Length:1994	Min. :0.000
1st Qu.:12.00	Class :character	Class :character	Class :character	1st Qu.: 3.000	1st Qu.:0.01000	1st Qu.:0.3500	1st Qu.:0.0200	1st Qu.:0.6300	1st Qu.:0.0400	1st Qu.:0.010	1st Qu.:0.3400	1st Qu.:0.4100	1st Qu.:0.2500	1st Qu.:0.3000	1st Qu.:0.00000	1st Qu.:0.0000	1st Qu.:0.2000	1st Qu.:0.4400	1st Qu.:0.1600	1st Qu.:0.3700	1st Qu.:0.3500	1st Qu.:0.1425	1st Qu.:0.3600	1st Qu.:0.2300	1st Qu.:0.2200	1st Qu.:0.240	1st Qu.:0.1725	1st Qu.:0.1100	1st Qu.:0.1900	Class :character	1st Qu.:0.2600	1st Qu.:0.01000	1st Qu.:0.110	1st Qu.:0.1600	1st Qu.:0.2300	1st Qu.:0.2100	1st Qu.:0.2200	1st Qu.:0.3800	1st Qu.:0.2500	1st Qu.:0.3200	1st Qu.:0.2400	1st Qu.:0.3100	1st Qu.:0.3300	1st Qu.:0.3100	1st Qu.:0.3600	1st Qu.:0.3600	1st Qu.:0.4000	1st Qu.:0.4900	1st Qu.:0.4900	1st Qu.:0.530	1st Qu.:0.4800	1st Qu.:0.3900	1st Qu.:0.4200	1st Qu.:0.00000	1st Qu.:0.09	1st Qu.:0.00000	1st Qu.:0.1600	1st Qu.:0.2000	1st Qu.:0.2500	1st Qu.:0.2800	1st Qu.:0.0300	1st Qu.:0.0300	1st Qu.:0.0300	1st Qu.:0.0300	1st Qu.:0.7300	1st Qu.:0.0300	1st Qu.:0.1500	1st Qu.:0.1400	1st Qu.:0.3400	1st Qu.:0.3900	1st Qu.:0.2700	1st Qu.:0.4400	1st Qu.:0.0600	1st Qu.:0.4000	1st Qu.:0.0000	1st Qu.:0.01000	1st Qu.:0.6300	1st Qu.:0.4300	1st Qu.:0.0600	1st Qu.:0.2900	1st Qu.:0.3500	1st Qu.:0.0600	1st Qu.:0.1000	1st Qu.:0.0900	1st Qu.:0.0900	1st Qu.:0.0900	1st Qu.:0.1700	1st Qu.:0.2000	1st Qu.:0.220	1st Qu.:0.2100	1st Qu.:0.3700	1st Qu.:0.3200	1st Qu.:0.2500	1st Qu.:0.00000	1st Qu.:0.00000	1st Qu.:0.0600	1st Qu.:0.4700	1st Qu.:0.4200	1st Qu.:0.5200	1st Qu.:0.5600	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	1st Qu.:0.02000	1st Qu.:0.1000	1st Qu.:0.0200	Class :character	Class :character	Class :character	Class :character	1st Qu.:0.00000	Class :character	1st Qu.:0.070
Median :34.00	Mode :character	Mode :character	Mode :character	Median : 5.000	Median :0.02000	Median :0.4400	Median :0.0600	Median :0.8500	Median :0.0700	Median :0.040	Median :0.4000	Median :0.4800	Median :0.2900	Median :0.4200	Median :0.03000	Median :1.0000	Median :0.3200	Median :0.5600	Median :0.2300	Median :0.4800	Median :0.4750	Median :0.2600	Median :0.4700	Median :0.3300	Median :0.3000	Median :0.320	Median :0.2500	Median :0.1700	Median :0.2800	Mode :character	Median :0.3450	Median :0.02000	Median :0.250	Median :0.2700	Median :0.3600	Median :0.3100	Median :0.3200	Median :0.5100	Median :0.3700	Median :0.4100	Median :0.3700	Median :0.4000	Median :0.4700	Median :0.4000	Median :0.5000	Median :0.5000	Median :0.4700	Median :0.6300	Median :0.6400	Median :0.700	Median :0.6100	Median :0.5100	Median :0.5400	Median :0.01000	Median :0.17	Median :0.01000	Median :0.2900	Median :0.3400	Median :0.3900	Median :0.4300	Median :0.0900	Median :0.0800	Median :0.0900	Median :0.0900	Median :0.8700	Median :0.0600	Median :0.2000	Median :0.1900	Median :0.4400	Median :0.4800	Median :0.3600	Median :0.5600	Median :0.1100	Median :0.5100	Median :0.5000	Median :0.03000	Median :0.7700	Median :0.5400	Median :0.1300	Median :0.4200	Median :0.5200	Median :0.1850	Median :0.1900	Median :0.1800	Median :0.1700	Median :0.1800	Median :0.3100	Median :0.3300	Median :0.370	Median :0.3400	Median :0.4800	Median :0.4500	Median :0.3700	Median :0.00000	Median :0.00000	Median :0.1300	Median :0.6300	Median :0.5400	Median :0.6700	Median :0.7000	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Median :0.04000	Median :0.1700	Median :0.0700	Mode :character	Mode :character	Mode :character	Mode :character	Median :0.00000	Mode :character	Median :0.150
Mean :28.68	NA	NA	NA	Mean : 5.494	Mean :0.05759	Mean :0.4634	Mean :0.1796	Mean :0.7537	Mean :0.1537	Mean :0.144	Mean :0.4242	Mean :0.4939	Mean :0.3363	Mean :0.4232	Mean :0.06407	Mean :0.6963	Mean :0.3611	Mean :0.5582	Mean :0.2916	Mean :0.4957	Mean :0.4711	Mean :0.3178	Mean :0.4792	Mean :0.3757	Mean :0.3503	Mean :0.368	Mean :0.2911	Mean :0.2035	Mean :0.3224	NA	Mean :0.3863	Mean :0.05551	Mean :0.303	Mean :0.3158	Mean :0.3833	Mean :0.3617	Mean :0.3635	Mean :0.5011	Mean :0.3964	Mean :0.4406	Mean :0.3912	Mean :0.4413	Mean :0.4612	Mean :0.4345	Mean :0.4876	Mean :0.4943	Mean :0.4877	Mean :0.6109	Mean :0.6207	Mean :0.664	Mean :0.5829	Mean :0.5014	Mean :0.5267	Mean :0.03629	Mean :0.25	Mean :0.03006	Mean :0.3202	Mean :0.3606	Mean :0.3991	Mean :0.4279	Mean :0.1814	Mean :0.1821	Mean :0.1848	Mean :0.1829	Mean :0.7859	Mean :0.1506	Mean :0.2676	Mean :0.2519	Mean :0.4621	Mean :0.4944	Mean :0.4041	Mean :0.5626	Mean :0.1863	Mean :0.4952	Mean :0.3147	Mean :0.07682	Mean :0.7195	Mean :0.5487	Mean :0.2045	Mean :0.4333	Mean :0.4942	Mean :0.2645	Mean :0.2431	Mean :0.2647	Mean :0.2635	Mean :0.2689	Mean :0.3464	Mean :0.3725	Mean :0.423	Mean :0.3841	Mean :0.4901	Mean :0.4498	Mean :0.4038	Mean :0.02944	Mean :0.02278	Mean :0.2156	Mean :0.6089	Mean :0.5351	Mean :0.6264	Mean :0.6515	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Mean :0.06523	Mean :0.2329	Mean :0.1617	NA	NA	NA	NA	Mean :0.09405	NA	Mean :0.238
3rd Qu.:42.00	NA	NA	NA	3rd Qu.: 8.000	3rd Qu.:0.05000	3rd Qu.:0.5400	3rd Qu.:0.2300	3rd Qu.:0.9400	3rd Qu.:0.1700	3rd Qu.:0.160	3rd Qu.:0.4700	3rd Qu.:0.5400	3rd Qu.:0.3600	3rd Qu.:0.5300	3rd Qu.:0.07000	3rd Qu.:1.0000	3rd Qu.:0.4900	3rd Qu.:0.6900	3rd Qu.:0.3700	3rd Qu.:0.6200	3rd Qu.:0.5800	3rd Qu.:0.4400	3rd Qu.:0.5800	3rd Qu.:0.4800	3rd Qu.:0.4300	3rd Qu.:0.440	3rd Qu.:0.3800	3rd Qu.:0.2500	3rd Qu.:0.4000	NA	3rd Qu.:0.4800	3rd Qu.:0.05000	3rd Qu.:0.450	3rd Qu.:0.4200	3rd Qu.:0.5100	3rd Qu.:0.4600	3rd Qu.:0.4800	3rd Qu.:0.6275	3rd Qu.:0.5200	3rd Qu.:0.5300	3rd Qu.:0.5100	3rd Qu.:0.5400	3rd Qu.:0.5900	3rd Qu.:0.5000	3rd Qu.:0.6200	3rd Qu.:0.6300	3rd Qu.:0.5600	3rd Qu.:0.7600	3rd Qu.:0.7800	3rd Qu.:0.840	3rd Qu.:0.7200	3rd Qu.:0.6200	3rd Qu.:0.6500	3rd Qu.:0.02000	3rd Qu.:0.32	3rd Qu.:0.02000	3rd Qu.:0.4300	3rd Qu.:0.4800	3rd Qu.:0.5300	3rd Qu.:0.5600	3rd Qu.:0.2300	3rd Qu.:0.2300	3rd Qu.:0.2300	3rd Qu.:0.2300	3rd Qu.:0.9400	3rd Qu.:0.1600	3rd Qu.:0.3100	3rd Qu.:0.2900	3rd Qu.:0.5500	3rd Qu.:0.5800	3rd Qu.:0.4900	3rd Qu.:0.7000	3rd Qu.:0.2200	3rd Qu.:0.6000	3rd Qu.:0.5000	3rd Qu.:0.07000	3rd Qu.:0.8600	3rd Qu.:0.6700	3rd Qu.:0.2700	3rd Qu.:0.5600	3rd Qu.:0.6700	3rd Qu.:0.4200	3rd Qu.:0.3300	3rd Qu.:0.4000	3rd Qu.:0.3900	3rd Qu.:0.3800	3rd Qu.:0.4900	3rd Qu.:0.5200	3rd Qu.:0.590	3rd Qu.:0.5300	3rd Qu.:0.5900	3rd Qu.:0.5800	3rd Qu.:0.5100	3rd Qu.:0.01000	3rd Qu.:0.00000	3rd Qu.:0.2800	3rd Qu.:0.7775	3rd Qu.:0.6600	3rd Qu.:0.7700	3rd Qu.:0.7900	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3rd Qu.:0.07000	3rd Qu.:0.2800	3rd Qu.:0.1900	NA	NA	NA	NA	3rd Qu.:0.00000	NA	3rd Qu.:0.330
Max. :56.00	NA	NA	NA	Max. :10.000	Max. :1.00000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.00000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.000	Max. :1.0000	Max. :1.0000	Max. :1.0000	NA	Max. :1.0000	Max. :1.00000	Max. :1.000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.00000	Max. :1.00	Max. :1.00000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.00000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.00000	Max. :1.00000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Max. :1.00000	Max. :1.0000	Max. :1.0000	NA	NA	NA	NA	Max. :1.00000	NA	Max. :1.000

Pre-Processing

Missing Value Analysis

Based on the above descriptive data summary, there are quite a few variables with missing values. So we conducted an analysis of all missing values in various attributes to identify proper imputation technique.

## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL

dataset_missing_counts <- dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`))

dataset_missing_counts  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Feature	NA_Count	NA_Percentage
LemasSwornFT	1675	84.0020060
LemasSwFTPerPop	1675	84.0020060
LemasSwFTFieldOps	1675	84.0020060
LemasSwFTFieldPerPop	1675	84.0020060
LemasTotalReq	1675	84.0020060
LemasTotReqPerPop	1675	84.0020060
PolicReqPerOffic	1675	84.0020060
PolicPerPop	1675	84.0020060
RacialMatchCommPol	1675	84.0020060
PctPolicWhite	1675	84.0020060
PctPolicBlack	1675	84.0020060
PctPolicHisp	1675	84.0020060
PctPolicAsian	1675	84.0020060
PctPolicMinor	1675	84.0020060
OfficAssgnDrugUnits	1675	84.0020060
NumKindsDrugsSeiz	1675	84.0020060
PolicAveOTWorked	1675	84.0020060
PolicCars	1675	84.0020060
PolicOperBudg	1675	84.0020060
LemasPctPolicOnPatr	1675	84.0020060
LemasGangUnitDeploy	1675	84.0020060
PolicBudgPerPop	1675	84.0020060
community	1177	59.0270812
county	1174	58.8766299
OtherPerCap	1	0.0501505

ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) + 
  geom_bar(stat = 'identity', fill = 'steelblue') +
  geom_label(aes(label = NA_Count)) +
  labs(title = 'Missing Counts') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())

There are 22 variables missing 84% of data and 2 variables missing roughly 59% of data. These variables with a high proportion of NA’s (greater than 50%) were removed as these were deemed practically useless in prediction or data exploration. Imputation on these varibales is a risk as the majority of the data is missing. Most of these variables came from the 1990 Law Enforcement Management and Admin Stats survey (Lemas).

Also, we are going to remove the ‘fold’ variable orginally added for cross validation.

colremove <- dataset_missing_counts %>%
  filter(NA_Percentage > 50) %>% 
  select(Feature) %>%
  as.list()

dataset <- dataset[,!(colnames(dataset) %in% colremove$Feature)]



# removing the folds variable as it was placed for cross validation
dataset <- dataset[,names(dataset) != 'fold']

We also did check for presence of any degenerate variables in the dataset and removed them.

# capturing the degenerate variables
degenCols <- nearZeroVar(dataset)
# identifying them 
colnames(dataset[,degenCols])

[1] "LemasPctOfficDrugUn"

# removing from the dataset
dataset <- dataset[,-degenCols]

We removed the ‘communityname’ from the dataset and removed rows with NA values in ‘OtherPerCap’ column -

#dataset <- dataset[,!colnames(dataset) == 'communityname']

dataset <- dataset %>%
  mutate(OtherPerCap = as.numeric(OtherPerCap))

#remove missing brand => df3
dataset <- dataset[!is.na(dataset$OtherPerCap), ]

dim(dataset)

[1] 1993  102

After removing these variables, our dataset reduced to 101 variables.

Exploratory Data Analysis

To start with our visual exploratory data analysis, we joined the crime dataset with state level enrichment dataset -

# Join with states dataframe
dataset1 <- left_join(dataset, states, 
              by = c("state" = "state_code"))

dataset1 <- rename(dataset1, state_abbr = state.y)

Violent Crime Rate Analysis

District of Columbia leads in terms of violent crime rates in the United States. Los Angeles has the second highest crime rate after DC, followed by Louisiana, South Carolina, Maryland and Florida. DC, however, have much higher crime rates in comparison to the other states, which have roughly almost the same crime rate. However, New York and Florida have very high populations in comparison to the others.

# aggregating the Average ViolentCrimesPerPop by State ID (FIPS)
crime_rate_by_state <- dataset1 %>%
  group_by(state_abbr) %>%
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>% 
  mutate(rank = dense_rank(desc(AvgCrimesRate)))

crime_rate_by_state %>% filter(rank <=5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="60%",height="250px")

state_abbr	AvgCrimesRate	rank
DC	1.0000000	1
LA	0.5045455	2
SC	0.4867857	3
MD	0.4800000	4
FL	0.4583333	5

#Violent Crime Rates by States
crime_rate_by_state$hover <- with(crime_rate_by_state, paste(state_abbr,'<br>', "Crime Rank:",rank))
l <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

p <- crime_rate_by_state %>% plot_geo(locationmode = 'USA-states') %>%
  add_trace(
    z = ~AvgCrimesRate, text = ~hover, locations = ~state_abbr, 
    color = ~AvgCrimesRate, colors = 'Blues'
  ) %>%
 colorbar(title = "Crime Rate/100K Population") %>%
  layout(
    title = 'Violent Crime Rates in the United States',
    geo = g
  )

p

Map of percentage of black population in the US

More than 65% of the population of DC belongs to the African-American community. Other states with high percentage of African-Americans are Mississippi, Georgia, Louisiana, and Delaware. These are some of the states with the highest violent crime rates. No wonder crimes bring about perceptions of race. We need to investigate this further.

#Percentage of African-American population by state
black_pop_by_state <- dataset1 %>%
  group_by(state_abbr) %>%
  summarise(AvgBlackPopRate = mean(racepctblack )) %>% 
  mutate(rank = dense_rank(desc(AvgBlackPopRate)))

black_pop_by_state %>% filter(rank <=5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="60%",height="250px")

state_abbr	AvgBlackPopRate	rank
DC	1.0000000	1
MS	0.6900000	2
GA	0.6608108	3
LA	0.6245455	4
DE	0.6000000	5

black_pop_by_state$hover <- with(black_pop_by_state, paste(state_abbr,'<br>', "Black Population  Rank:",rank))
l <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

p <- black_pop_by_state %>% plot_geo(locationmode = 'USA-states') %>%
  add_trace(
    z = ~AvgBlackPopRate, text = ~hover, locations = ~state_abbr, 
    color = ~AvgBlackPopRate, colors = 'Greens'
  ) %>%
 colorbar(title = "%Black") %>%
  layout(
    title = 'Percentage of African-American Population',
    geo = g
  )
p

Word Cloud of most dangerous communities

We conclude our initial Visual Data Exploration by taking a look at the communities with the highest violent crime rates. Interestingly, Camden City in New Jersey showed at the top on the list of most dangerous communities in the US in terms of violent crime rates. Other dangerous cities when it comes to violent crime rates are BatonRouge and Alexandria in Louisiana, Baltimore, Atlanta, Anniston, Bessemer and Birmingham in Alabma, Atlantic City.

#Word Cloud of Communities with the highest crime rates

communityCrime <- dataset1 %>% select(communityname,state_abbr, population, ViolentCrimesPerPop) %>% 
  mutate(Place = paste(communityname,state_abbr, sep = ", ")) %>% 
  group_by(Place) %>% 
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>% 
  arrange(desc(AvgCrimesRate)) %>% 
  head(15)

par(mfrow = c(1,1))
par(mar = rep(0,4))
set.seed(1)
wordcloud(words = communityCrime$Place, freq = communityCrime$AvgCrimesRate, scale=c(3,0.01),
          random.order=FALSE,
          colors=brewer.pal(8, "Dark2"))

Variable Removal

We still have an extremely high number of variables in the dataset. To further remove variables, we measure the correlation of each independent variable against the dependent variable and remove the variables with low correlation to the independent variable. We removed variables with an absolute correlation of less than 0.25 to the independent variable. This vastly reduces the number of variables in the dataset to 54. We removed these variables as we felt that these variables would contribute very less to our final model. Besides, working with fewer variables help in better interpretation.

dataset <- dataset[,!colnames(dataset) == 'communityname']

#Running a correlation matrix
res2 <- rcorr(as.matrix(dataset))

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
  )
}

#Finding variables with low correlation to dependent variable
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% filter(column == "ViolentCrimesPerPop") %>% arrange(cor)
remove <- correlation %>% filter(cor >= -0.25 & cor <= 0.25) %>% select(row)


low.corr.dep <- c('HispPerCap','PctSpeakEnglOnly','RentMedian','MedRent','RentHighQ','state',
'OwnOccLowQuart','whitePerCap','OwnOccMedVal','OwnOccHiQuart','AsianPerCap','PctSameHouse85',
'pctWFarmSelf','PctWorkMom','OtherPerCap','PersPerOwnOccHous','MedYrHousBuilt','pctWRetire',
'indianPerCap','PctBornSameState','PctEmplProfServ','PctEmplManu','PersPerOccupHous','householdsize','PctWorkMomYoungKids','PctSameState85','PctVacMore6Mos','racePctAsian','MedOwnCostPctIncNoMtg','agePct12t21','MedOwnCostPctInc','agePct65up','PctSameCity85','pctUrban','agePct16t24',
'pctWSocSec','PersPerFam','agePct12t29','PctUsePubTrans','PctImmigRecent','PctForeignBorn',
'LandArea','PctImmigRec5','PctRecentImmig','PctRecImmig5','PctImmigRec8','PersPerRentOccHous'
)

dataset<- dataset %>% select(-low.corr.dep)

dim(dataset)

[1] 1993   54

Multi-collinearity

For removing multi-collinearity, we selected pairs of independent variables with absolute correlations greater than 0.9. For deciding which variable to remove among the pairs of independent variables, we regressed the remaining independent variables against the dependent variable and found the variance inflation factor of each variable using the vif function from the rms package in R and the removed the independent variable from each pair of multi-collinear variables which had higher variance inflation factors. This reduced the number of variables in the dataset to 39. The logic for this was that the variables in a pair of multi-collinear variables contribute the same variance information to a model and hence only one of them (with the lower variance inflation factor) is required for modeling.

#Now let's check for multicollinearity
res2 <- rcorr(as.matrix(dataset))
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% mutate(abs_cor = abs(cor)) %>% arrange(desc(abs_cor)) %>% filter(abs_cor > 0.9) %>% select(row, column)

correlation %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")

row	column
PctRecImmig8	PctRecImmig10
population	numbUrban
PctFam2Par	PctKids2Par
PctLargHouseFam	PctLargHouseOccup
FemalePctDiv	TotalPctDiv
PctPersOwnOccup	PctHousOwnOcc
medIncome	medFamInc
MalePctDivorce	TotalPctDiv
PctBSorMore	PctOccupMgmtProf
population	NumUnderPov
PctLess9thGrade	PctNotHSGrad
NumUnderPov	NumIlleg
medFamInc	perCapInc
numbUrban	NumUnderPov
PctFam2Par	PctYoungKids2Par
PctKids2Par	PctYoungKids2Par
MalePctDivorce	FemalePctDiv
PctFam2Par	PctTeen2Par
PctKids2Par	PctTeen2Par

#Let's run a simple linear regression to see if any more variables can be removed based on VIF

LR_test <- lm(ViolentCrimesPerPop ~ ., dataset)

vif_df <- data.frame(vif(LR_test)) 
colnames(vif_df) <- c("VIF_Score")

vif_df %>% arrange(desc(`VIF_Score`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")

	VIF_Score
TotalPctDiv	928.212429
FemalePctDiv	295.188771
MalePctDivorce	203.199697
PctRecImmig10	165.892005
PctRecImmig8	147.580998
PctLargHouseOccup	129.488362
PctLargHouseFam	124.023935
population	117.672823
PctFam2Par	103.736936
numbUrban	101.604576
medIncome	98.892368
PctKids2Par	97.813837
medFamInc	95.037614
PctPersOwnOccup	78.204266
PctHousOwnOcc	76.714946
PctNotHSGrad	37.715581
NumUnderPov	33.986232
perCapInc	27.944187
PctBSorMore	25.418161
PctPersDenseHous	23.163968
PctOccupMgmtProf	23.097460
PctLess9thGrade	21.029438
PctNotSpeakEnglWell	19.853248
PctPopUnderPov	18.129689
racePctWhite	15.166439
racepctblack	14.945811
NumIlleg	14.432404
pctWWage	13.271613
HousVacant	12.765602
PctYoungKids2Par	12.151434
pctWInvInc	12.081673
PctEmploy	11.608310
PctIlleg	11.490275
racePctHisp	9.970625
pctWPubAsst	9.197796
PctHousLess3BR	8.210647
RentLowQ	8.141949
PctHousNoPhone	7.424820
PctTeen2Par	7.264545
PctOccupManu	6.099481
PctUnemployed	6.099450
MalePctNevMarr	5.330162
NumImmig	5.007225
NumInShelters	4.710435
PctHousOccup	3.543461
PopDens	2.833126
MedNumBR	2.657976
NumStreet	2.473209
MedRentPctHousInc	2.470435
PctImmigRec10	2.372429
PctVacantBoarded	2.096964
blackPerCap	2.028969
PctWOFullPlumb	1.847284

Next, we scan the above list of multi-collinear variables and remove the ones with higher VIF -

remove.multicollinear <- c('PctRecImmig10', 'population', 'PctFam2Par', 'PctLargHouseOccup', 'FemalePctDiv', 'PctPersOwnOccup', 'medIncome', 'TotalPctDiv', 'PctBSorMore', 'PctNotHSGrad', 'NumUnderPov', 'medFamInc', 'numbUrban', 'PctKids2Par')

dataset <- dataset %>% select(-remove.multicollinear)

dim(dataset)

[1] 1993   40

After removing the variables with high multi-collinearity, we end up with 40 variables.

Correlation Plots

As a final step before modeling, we created correlation plots to assess in advance the independent variables that will be most important in predicting the violent crime rates.

Plot1:

corrMatrix <- round(cor(dataset[,c(1:13,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Plot2:

corrMatrix <- round(cor(dataset[,c(14:26,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Plot3:

corrMatrix <- round(cor(dataset[,c(27:39,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Using the correlation plots, we can say that the following socio-economic factors in a community influence the level of violent crime rates:

Percentage of the population that is African-American Percentage of the population living under poverty line Percentage of kids born to never married Percentage of households with public assistance income Percentage of people 16 and over, in the labor force, and unemployed Percentage of males who are divorced Percentage of occupied housing units without a phone Percentage of vacant housing that is boarded up

Splitting Data: Train/Test

We are going to do a 75-25% split for training and test purposes.

sample = sample.split(dataset$ViolentCrimesPerPop, SplitRatio = 0.75)


train = subset(dataset, sample == TRUE) %>% as.matrix()
test = subset(dataset, sample == FALSE) %>% as.matrix()

y_train <- train[,40]
y_test <- test[,40]

X_train <- train[,-40]
X_test <- test[,-40]

#head(train)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Modeling

Re-usable Code Components

Before we start with modeling, we wanted to define some of the re-usable code components in the form of R functions -

## Function for plotting the most important variables

printVarImportance <- function(model_name, vImpDF, varNo){
  # Rename columns
  varImportance <-  tibble::rownames_to_column(vImpDF, "Variable")
  colnames(varImportance)[2] <- "Importance"
  
  # Limit to Top N important variables 
  varImportance <- varImportance %>% arrange(desc(`Importance`)) %>% head(varNo)
  
  # Create and assign Rank
  rankImportance <- varImportance %>% mutate(Rank = paste0('#',dense_rank(desc(Importance))))
  
  # Plot output
  ggplot(rankImportance, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat='identity', fill = 'steelblue') +
  geom_label(aes(label = round(Importance,2))) +
  geom_text(aes(x = Variable, y = 0.5, label = Rank),
           hjust=1.2, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variable') +
  ggtitle(paste("Variable Importance in ", model_name, " Model")) +
  coord_flip()
}

Model Group A: Linear Regression Model - Ridge and LASSO

The glmnet package allows us to perform both the ridge and lasso regression models. This library requires the response variable to be a vector and the predictor variables to be a class of data.matrix.

Model1: Ridge Regression

Ridge regression is a model used “when the number of predictor variables in a set exceeds the number of observations”, or when a dataset suffers from multicollinearity. According to stats, with ridge regression, often predictor variables used in a regression are highly correlated. When they are, the regression coefficient of any one variable can depend on which other predictor variables are included in the model, and which ones are left out.

Now we’ll use the glmnet() function to fit the ridge regression model and specify alpha=0.

fit.ridge <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), alpha = 0, type.measure = "mse", family="gaussian")

To identify what value to use for lambda, we could plot the function using our model fit.ridge$lambda.min, but below we’ll use the s=“lambda.min” argument to save time. The graph will be shown below after all models.

fitted.ridge.train <- predict(fit.ridge, newx = data.matrix(X_train), s="lambda.min")


fitted.ridge.test <- predict(fit.ridge, newx = data.matrix(X_test), s="lambda.min")

cat("Train coefficient: ", cor(as.matrix(y_train), fitted.ridge.train)[1])

Train coefficient:  0.8141226

cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.ridge.test)[1])


Test coefficient:  0.7635008

As we can see above, the train and test coefficients are 0.81 and 0.76

Prediction Performance

ridgePredPerf <- postResample(pred = fitted.ridge.test , obs = y_test)
ridgePredPerf['Family'] <- 'Linear Regression'
ridgePredPerf['Model'] <- 'Ridge Regression'

Model2: LASSO Regression - Least Absolute Shrinkage and Selection Operator

According to statisticshowto.com, lasso regression is a type of linear regression that uses shrinkage - which shrinks data points towards a central point, such as the mean.

Now we’ll use the glmnet() function to fit the LASSO regression model and specify alpha=1. We’ll specific s=“lambda.min” to find the best lambda.

fit.lasso <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=1, family="gaussian")


fitted.lasso.train <- predict(fit.lasso, newx = data.matrix(X_train), s="lambda.min")

fitted.lasso.test <- predict(fit.lasso, newx = data.matrix(X_test), s="lambda.min")

cat("Train coefficient: ", cor(as.matrix(y_train), fitted.lasso.train)[1])

Train coefficient:  0.8157457

cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.lasso.test)[1])


Test coefficient:  0.7689392

As we can see above, the train and test coefficients are 0.81 and 0.76.

Prediction Performance

lassoPredPerf <- postResample(pred = fitted.lasso.test , obs = y_test)
lassoPredPerf['Family'] <- 'Linear Regression'
lassoPredPerf['Model'] <- 'Lasso Regression'

Model3: Elastic Net Regression

According to machinelearningmastery.com, Elastic Net is an extension of linear regression that will add regulirization penalties to the loss function during training.

fit.elnet <- glmnet(as.matrix(X_train), as.matrix(y_train), family="gaussian", alpha=.5)

fit.elnet.cv <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=.5,
                          family="gaussian")

fitted.elnet.train <- predict(fit.elnet.cv, newx = data.matrix(X_train), s="lambda.min")

fitted.elnet.test <- predict(fit.elnet.cv, newx = data.matrix(X_test), s="lambda.min")

cat("Train coefficient: ", cor(as.matrix(y_train), fitted.elnet.train)[1])

Train coefficient:  0.8158622

cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.elnet.test)[1])


Test coefficient:  0.7685533

Prediction Performance

elnetPredPerf <- postResample(pred = fitted.elnet.test , obs = y_test)
elnetPredPerf['Family'] <- 'Linear Regression'
elnetPredPerf['Model'] <- 'Elastic Net Regression'

Plot MSE

par(mfrow=c(3,1))

plot(fit.lasso, xvar="lambda")
plot(fit.ridge, xvar="lambda")
plot(fit.elnet.cv, xvar="lambda")

Model Group B: Non Linear Regression Model

Non Linear Regression models do not follow the y=ax+b approach. It allows for flexibility in the model to fit non linear trends in data.

We will go over four examples of non linear regression: Neural Networks, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Multivariate Adaptive Regression Splines (MARS).

Model4: Neural Networks

Neural Networks are regression techniques inspired by theories of how the brain works (neurons signalling to one another). The outcome is modeled by an input layer and an intermediary set of unobserved values, hidden units, made up of linear combinations of the original predictors.

The parameters used in the model include number of hidden units, decay, number of iterations to find parameter estimates, and the number of parameters used by the model.

# simple average Neural Network with 5 hidden units
nnetAvg <- avNNet(X_train, y_train,
                  size = 5,
                  decay = 0.01,
                  repeats = 5,
                  linout = TRUE,
                  trace = FALSE,
                  maxit = 500,
                  MaxNWts = 5 * (ncol(X_train) + 1) +5+1)
nnetAvg

Model Averaged Neural Network with 5 Repeats  

a 39-5-1 network with 206 weights
options were - linear output units  decay=0.01

Variable Importance

varimp <- varImp(nnetAvg)
varimp %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")

	Overall
X292	7.6370160
X12	5.9451966
X114	5.5187905
X26	5.2518596
X54	5.1049880
X154	5.0924933
X42	5.0717370
X283	4.9011977
X381	4.6842240
X203	4.6721495
X382	4.6398054
X29	4.5765466
X291	4.5705678
X122	4.4443809
X53	4.3255080
X323	4.2644435
X41	4.1839150
X110	4.1553171
X131	4.1499350
X373	4.1439441
X384	4.1345264
X63	4.1269616
X294	4.1134946
X31	4.0805513
X3	4.0448392
X116	3.9986073
X293	3.9965900
X121	3.9865322
X151	3.8569537
X64	3.8219922
X261	3.7827265
X9	3.7704606
X264	3.7133551
X164	3.6181529
X314	3.6143162
X72	3.6122705
X51	3.5855487
X262	3.5758820
X223	3.5657833
X19	3.5597244
X271	3.5582675
X315	3.5535831
X171	3.5193914
X38	3.4772065
X352	3.4701537
X214	3.4212095
X204	3.3389173
X33	3.3169319
X2	3.2437467
X93	3.2400145
X216	3.2084728
X1	3.1588952
X351	3.1146563
X317	3.0791051
X62	3.0756721
X172	3.0527272
X37	3.0281053
X272	3.0106611
X17	2.9916386
X251	2.9644936
X212	2.9582334
X254	2.9251315
X92	2.9100862
X27	2.8149147
X134	2.7943746
X71	2.7664481
X52	2.7549555
X133	2.7507472
X301	2.7229865
X243	2.7066318
X192	2.6908310
X22	2.6889728
X374	2.6869133
X132	2.6767235
X113	2.6722419
X181	2.6336297
X44	2.6125308
X74	2.5816257
X13	2.5794715
X282	2.5663870
X194	2.5553930
X364	2.5536503
X112	2.5291440
X213	2.4953951
X152	2.4947829
X104	2.4920203
X6	2.4895872
X161	2.4849425
X231	2.4742504
X11	2.4692769
X117	2.4628723
X162	2.4551926
X81	2.4425303
X393	2.4310590
X244	2.4305055
X217	2.3860420
X211	2.3729905
X252	2.3709876
X324	2.3613761
X18	2.3342998
X115	2.3332208
X316	2.3271865
X321	2.3161905
X313	2.3147162
X83	2.3084911
X182	2.3038297
X39	2.2885316
X21	2.2865094
X36	2.2581763
X61	2.2538031
X16	2.2151170
X233	2.2128776
X311	2.1189115
X91	2.1182449
X253	2.1127521
X28	2.1029672
X383	2.0997756
X334	2.0952218
X281	2.0654783
X333	2.0517161
X111	2.0492464
X102	2.0437337
X394	2.0411659
X215	2.0385755
X343	2.0084313
X25	2.0051765
X353	2.0024676
X193	1.9981012
X174	1.9904280
X24	1.9786006
X371	1.9758678
X263	1.9735472
X362	1.9578452
X123	1.9453243
X82	1.9128455
X141	1.9102032
X361	1.8959054
X124	1.8487145
X142	1.8416549
X344	1.8379848
X153	1.8313143
X202	1.8309932
X10	1.8271260
X5	1.8119391
X23	1.7830686
X8	1.7639648
X4	1.7378508
X354	1.7334810
X322	1.7154179
X94	1.7143561
X101	1.7099298
X35	1.7034332
X372	1.6664911
X304	1.6362735
X201	1.6107170
X84	1.6087458
X310	1.5871175
X284	1.5467524
X341	1.5303573
X14	1.4727146
X103	1.4153643
X391	1.4150329
X312	1.4038618
X363	1.3891787
X30	1.3779550
X242	1.3666800
X303	1.3544250
X210	1.3464787
X221	1.3390426
X232	1.3385168
X274	1.3155557
X73	1.3055992
X173	1.3003825
X20	1.2894163
X43	1.2773276
X302	1.2762359
X191	1.2515995
X15	1.2409482
X144	1.2232473
X183	1.2118165
X32	1.1941497
X273	1.1506261
X342	1.1440656
X184	1.0164413
X224	1.0046459
X332	0.9973565
X234	0.9932592
X7	0.9404488
X34	0.8996808
X392	0.8822497
X241	0.8759609
X222	0.8382394
X163	0.7721729
X143	0.6735831
X331	0.6196059

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("Neural Network", varImp(nnetAvg), 10)

Prediction Performance

nnetPred <- predict(nnetAvg, newdata = X_test)
nnetPredPerf <- postResample(pred = nnetPred, obs = y_test)
nnetPredPerf['Family'] <- 'Non-Linear Regression'
nnetPredPerf['Model'] <- 'Neural Network'

Model5: K-Nearest Neighbors

The KNN model predicts a new sample for regression using the K-closest data points from the training set. The method uses the Euclidean distance or the straight-line distance between samples.

The train function in the caret package uses the knn method.

knnFit <- train(X_train, y_train,
                   method = 'knn',
                   preProc = c('center','scale'),
                   tuneLength = 14,
                   trControl = trainControl(method = 'cv'))

knnFit

k-Nearest Neighbors 

1499 samples
  39 predictor

Pre-processing: centered (39), scaled (39) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1349, 1349, 1348, 1347, 1349, 1350, ... 
Resampling results across tuning parameters:

  k   RMSE       Rsquared   MAE       
   5  0.1470551  0.6085563  0.10018005
   7  0.1454218  0.6186791  0.09854605
   9  0.1440191  0.6269384  0.09784016
  11  0.1430909  0.6322653  0.09764411
  13  0.1426350  0.6358800  0.09697388
  15  0.1434319  0.6321638  0.09772261
  17  0.1431849  0.6351700  0.09726898
  19  0.1439635  0.6320082  0.09767728
  21  0.1443801  0.6305995  0.09764054
  23  0.1439431  0.6342258  0.09739809
  25  0.1444537  0.6324687  0.09746166
  27  0.1448037  0.6308266  0.09757062
  29  0.1451915  0.6295031  0.09796384
  31  0.1454098  0.6292510  0.09811286

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 13.

KNN Plot

plot(knnFit)

Variable Importance

varimp <- varImp(knnFit)
varimp$importance %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")

	Overall
PctIlleg	100.0000000
racePctWhite	82.6566345
PctYoungKids2Par	80.2817251
PctTeen2Par	78.6232333
racepctblack	70.3212797
pctWInvInc	55.5310000
pctWPubAsst	55.1605903
MalePctDivorce	47.8211990
PctPopUnderPov	44.0230884
PctUnemployed	39.7884001
PctHousNoPhone	35.8730201
PctVacantBoarded	33.5763619
PctHousOwnOcc	33.3328822
PctHousLess3BR	32.1642291
PctPersDenseHous	29.2459802
NumIlleg	29.2056695
PctLess9thGrade	21.1231604
HousVacant	20.6816310
PctLargHouseFam	15.3341087
NumInShelters	14.9067179
MedNumBR	13.6315013
PctWOFullPlumb	12.9198016
perCapInc	12.1955112
PctOccupMgmtProf	11.2535144
PctEmploy	10.3417599
NumStreet	10.3378306
MedRentPctHousInc	9.0280115
PctHousOccup	8.5714403
pctWWage	6.9674327
MalePctNevMarr	6.9603290
PctNotSpeakEnglWell	5.4220236
PctOccupManu	5.3190000
PctImmigRec10	4.9203000
PopDens	4.8423181
racePctHisp	4.2799730
NumImmig	3.9294479
blackPerCap	1.6454549
PctRecImmig8	0.3601896
RentLowQ	0.0000000

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("KNN", varImp(knnFit)$importance, 10)

Prediction Performance

knnPred <- predict(knnFit, newdata = X_test)
knnPredPerf <- postResample(pred = knnPred, obs = y_test)
knnPredPerf['Family'] <- 'Non-Linear Regression'
knnPredPerf['Model'] <- 'KNN'

Model6: Support Vector Machines

SVM regression models try to find a regression line where each data point is within a threshold. The error is a parameter of the model and it will then try to minimize the coefficients used in the model. This allows the model eliminate features and have greater interpretability.

We will again use the caret package and apply a radial SVM model. This method will allow the model to find trends in the dataset.

svmRFit <- train(X_train, y_train,
                   method = 'svmRadial',
                   preProc = c('center','scale'),
                   tuneLength = 14,
                   trControl = trainControl(method = 'cv'))
svmRFit

Support Vector Machines with Radial Basis Function Kernel 

1499 samples
  39 predictor

Pre-processing: centered (39), scaled (39) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1349, 1348, 1350, 1350, 1349, 1349, ... 
Resampling results across tuning parameters:

  C        RMSE       Rsquared   MAE       
     0.25  0.1481972  0.6322960  0.09523909
     0.50  0.1430390  0.6461203  0.09275885
     1.00  0.1402594  0.6525146  0.09164402
     2.00  0.1396685  0.6519714  0.09198224
     4.00  0.1419226  0.6393452  0.09431469
     8.00  0.1468268  0.6150199  0.09876876
    16.00  0.1546668  0.5773157  0.10495410
    32.00  0.1647599  0.5308168  0.11282599
    64.00  0.1739939  0.4929692  0.11996609
   128.00  0.1801527  0.4699293  0.12551726
   256.00  0.1803473  0.4699574  0.12603844
   512.00  0.1803473  0.4699574  0.12603844
  1024.00  0.1803473  0.4699574  0.12603844
  2048.00  0.1803473  0.4699574  0.12603844

Tuning parameter 'sigma' was held constant at a value of 0.02507349
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.02507349 and C = 2.

Variable Importance

varimp <- varImp(svmRFit)
varimp$importance %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")

	Overall
PctIlleg	100.0000000
racePctWhite	82.6566345
PctYoungKids2Par	80.2817251
PctTeen2Par	78.6232333
racepctblack	70.3212797
pctWInvInc	55.5310000
pctWPubAsst	55.1605903
MalePctDivorce	47.8211990
PctPopUnderPov	44.0230884
PctUnemployed	39.7884001
PctHousNoPhone	35.8730201
PctVacantBoarded	33.5763619
PctHousOwnOcc	33.3328822
PctHousLess3BR	32.1642291
PctPersDenseHous	29.2459802
NumIlleg	29.2056695
PctLess9thGrade	21.1231604
HousVacant	20.6816310
PctLargHouseFam	15.3341087
NumInShelters	14.9067179
MedNumBR	13.6315013
PctWOFullPlumb	12.9198016
perCapInc	12.1955112
PctOccupMgmtProf	11.2535144
PctEmploy	10.3417599
NumStreet	10.3378306
MedRentPctHousInc	9.0280115
PctHousOccup	8.5714403
pctWWage	6.9674327
MalePctNevMarr	6.9603290
PctNotSpeakEnglWell	5.4220236
PctOccupManu	5.3190000
PctImmigRec10	4.9203000
PopDens	4.8423181
racePctHisp	4.2799730
NumImmig	3.9294479
blackPerCap	1.6454549
PctRecImmig8	0.3601896
RentLowQ	0.0000000

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("SVM", varImp(svmRFit)$importance, 10)

Prediction Performance

svmRPred <- predict(svmRFit, newdata = X_test)
svmRPredPerf <- postResample(pred = svmRPred, obs = y_test)
svmRPredPerf['Family'] <- 'Non-Linear Regression'
svmRPredPerf['Model'] <- 'SVM'

Model7:Multivaraite Adaptive Regression Splines (MARS)

The MARS model finds knots to determine piece wise linear model where each new feature models an isolated portion of the data. The predictor / cut point combination that has the smallest error is used for the model in an iterative process until a stopping point is reached.

We will use the earth package to generate this MARS model.

# mars
marsFit <- earth(X_train, y_train)
summary(marsFit)

Call: earth(x=X_train, y=y_train)

                         coefficients
(Intercept)                 0.7263794
h(0.54-racepctblack)       -0.3101390
h(0.73-pctWWage)            0.1249294
h(pctWWage-0.73)           -0.2401202
h(PctPopUnderPov-0.67)     -0.5455969
h(MalePctDivorce-0.39)      0.2138445
h(0.51-PctTeen2Par)         0.2990705
h(0.39-PctIlleg)           -0.1481046
h(PctIlleg-0.39)            0.2448670
h(0.88-PctPersDenseHous)   -0.3320613
h(0.16-HousVacant)         -0.4654248
h(0.31-NumStreet)          -0.4153084

Selected 12 of 21 terms, and 9 of 39 predictors
Termination condition: RSq changed by less than 0.001 at 21 terms
Importance: racepctblack, PctIlleg, MalePctDivorce, PctPersDenseHous, ...
Number of terms at each degree of interaction: 1 11 (additive model)
GCV 0.01815316    RSS 26.38294    GRSq 0.6714617    RSq 0.6810408

Visualizing MARS Model

plotmo(marsFit)

 plotmo grid:    racepctblack racePctWhite racePctHisp pctWWage pctWInvInc
                         0.06         0.85        0.04     0.57       0.48
 pctWPubAsst perCapInc blackPerCap PctPopUnderPov PctLess9thGrade PctUnemployed
        0.26      0.31        0.26           0.24            0.27          0.32
 PctEmploy PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr
      0.51         0.37              0.4           0.47            0.4
 PctYoungKids2Par PctTeen2Par NumIlleg PctIlleg NumImmig PctImmigRec10
              0.7        0.61     0.01     0.17     0.01          0.43
 PctRecImmig8 PctNotSpeakEnglWell PctLargHouseFam PctPersDenseHous
         0.09                0.06            0.21             0.11
 PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded
           0.51      0.5       0.03         0.77          0.55             0.13
 PctHousNoPhone PctWOFullPlumb RentLowQ MedRentPctHousInc NumInShelters
           0.18           0.19     0.32              0.48             0
 NumStreet PopDens
         0    0.17

Variable Importance

varimp <- varImp(marsFit)
varimp %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")

	Overall
racepctblack	100.00000
PctIlleg	65.01171
MalePctDivorce	47.29172
PctPersDenseHous	47.29172
HousVacant	29.26562
pctWWage	22.81121
PctPopUnderPov	20.13595
NumStreet	15.49737
PctTeen2Par	11.49760

Overall

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("MARS", varImp(marsFit), 10)

Prediction Performance

marsPred <- predict(marsFit, newdata = X_test)
marsPredPerf <- postResample(pred = marsPred, obs = y_test)
marsPredPerf['Family'] <- 'Non-Linear Regression'
marsPredPerf['Model'] <- 'MARS'

Model Group C: Trees and Boosting Model

Model8: Decision Tree

Decision Trees are a type of Supervised Machine Learning method where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes and the decision nodes are where the data is split. Since our target variable is continuous, we will be building a regression tree.

Decision Tree 1

We will do the decision tree with all variables first.

dt <- rpart(ViolentCrimesPerPop~ ., 
               data=as.data.frame(train))
rpart.plot(dt, nn=TRUE)

Here are some of the details of the tree:

printcp(dt)


Regression tree:
rpart(formula = ViolentCrimesPerPop ~ ., data = as.data.frame(train))

Variables actually used in tree construction:
[1] MalePctDivorce   NumStreet        PctIlleg         PctPersDenseHous
[5] PctPopUnderPov  

Root node error: 82.716/1499 = 0.055181

n= 1499 

        CP nsplit rel error  xerror     xstd
1 0.412904      0   1.00000 1.00143 0.050454
2 0.063292      1   0.58710 0.59142 0.029666
3 0.054263      2   0.52380 0.54890 0.028311
4 0.022946      3   0.46954 0.50016 0.026669
5 0.019715      4   0.44660 0.48908 0.026478
6 0.011323      5   0.42688 0.46579 0.025620
7 0.010558      6   0.41556 0.48008 0.026145
8 0.010000      7   0.40500 0.47461 0.026067

To achieve improved accuracy, we need to prune the tree using the cross-validation.

plotcp(dt)

dt$cptable

          CP nsplit rel error    xerror       xstd
1 0.41290395      0 1.0000000 1.0014260 0.05045354
2 0.06329165      1 0.5870960 0.5914199 0.02966628
3 0.05426321      2 0.5238044 0.5489023 0.02831096
4 0.02294604      3 0.4695412 0.5001609 0.02666861
5 0.01971462      4 0.4465952 0.4890764 0.02647826
6 0.01132276      5 0.4268805 0.4657860 0.02561957
7 0.01055755      6 0.4155578 0.4800769 0.02614502
8 0.01000000      7 0.4050002 0.4746088 0.02606696

The plot above shows the cross validated errors against the complexity parameters. The curve is at its lowest at 7, so we will prune our tree to a size of 7. At size 7, the error is ~0.46 and cp is 0.0104.

Decision Tree 2

prune_dt=prune(dt,cp=0.0104)
rpart.plot(prune_dt)

Predicting Pruned Tree on Test Data

test2<-as.data.frame(test)
test2<- na.omit(test2)

Prune_pred <- predict(prune_dt, 
                   test2)

Model Accuracy

confMat <- table(test2$ViolentCrimesPerPop,Prune_pred)
accuracy <- sum(diag(confMat))/sum(confMat)
accuracy

[1] 0.006072874

The model accuracy seems to be very low, even after pruning the tree. This might be because our output is continuous and non-binary. However, more often than not, trees do not give very good prediction errors. Therefore, we will build out a random forest models which tend to outperform trees in terms of prediction and miss-classification errors.

Variable Importance

## Call Variable Importance Plot Function
printVarImportance("Decision Tree", data.frame(prune_dt$variable.importance), 10)

Prediction Performance

dtreePredPerf <- postResample(pred = Prune_pred, obs = y_test)
dtreePredPerf['Family'] <- 'Trees & Boosting'
dtreePredPerf['Model'] <- 'Decision Tree'

Model9: Random Forest

Random forests are one the most popular machine learning algorithms. They can be used for classification or regression, deal with a large number of features, and generally are so successful because they provide a good predictive performance, low incidence of over-fitting, and easy interpret-ability.

In creating the best random forest model, we want to minimize the OOB error rate by finding the optimal number of variables selected at each split, known as the mtry. The below code finds the optimal mtry to use in our random forest model.

train2 <- as.data.frame(train)
oob.err=double(101)
test.err=double(101)
for (mtry in 1:101){
  fit=randomForest(ViolentCrimesPerPop~.,data=train2,mtry=mtry,ntree=400, proximity=TRUE)
 oob.err[mtry]=fit$mse[400]
 pred=predict(fit, train2)
 test.err[mtry]=with(train2, mean((ViolentCrimesPerPop-pred)^2))
 cat(mtry," ")}

1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39

matplot(1:mtry,cbind(test.err, oob.err),pch=19, col=c("red", "blue"), type="b",ylab
="Mean Squared Error")
legend("topright",legend=c("00B", "Test"),pch=19,col=c("red", "blue"))

Ideally OOB error and test errors should line up - but in this case the test error seems to be higher than the OOB.

Variable Importance

varImpPlot(fit)

## Call Variable Importance Plot Function
printVarImportance("Random Forest", varImp(fit), 10)

The above graph illustrates the importance of the variables used to predict the Violent Crimes Per Population. The Mean Decrease Accuracy displays how much the model accuracy decreases if we drop the variable. Here, Percentage of kids born to never married is regarded as the most important variable by a wide margin. The Mean Decrease Gini graph displays the variable importance on the Gini impurity index used for splitting trees. Again, Percentage of kids born to never married is the clear leader.

print(fit)


Call:
 randomForest(formula = ViolentCrimesPerPop ~ ., data = train2,      mtry = mtry, ntree = 400, proximity = TRUE) 
               Type of random forest: regression
                     Number of trees: 400
No. of variables tried at each split: 39

          Mean of squared residuals: 0.01920442
                    % Var explained: 65.2

Prediction Performance

rfPredPerf <- postResample(pred = pred, obs = y_test)
rfPredPerf['Family'] <- 'Trees & Boosting'
rfPredPerf['Model'] <- 'Random Forest'

Model10: Gradient Boosting Method The boosting method is uses trees in a sequential pattern. The successive tree is developed using information from the previous tree to minimize its error. The boosting method has three tuning parameters, number of trees, shrinkage parameter, and number of splits in a tree.

We will be using the stochastic gradient boosting in our model. This approach resamples the observations and columns in each iteration. We’ll use the caret workflow, which invokes the xgboost package, to automatically adjust the model parameter values, and fit the final best boosted tree that explains the best our data.

Here, we can see our parameters that are used in the model -

Model Summary

modelSummaryDF <- rbind(ridgePredPerf, lassoPredPerf, elnetPredPerf, nnetPredPerf, knnPredPerf, svmRPredPerf, marsPredPerf, dtreePredPerf, rfPredPerf)

SummaryDF <- modelSummaryDF[, c(4, 5, 1, 2, 3)]



rownames(SummaryDF) <- NULL

SummaryDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="100%",height="500px")

Family	Model	RMSE	Rsquared	MAE
Linear Regression	Ridge Regression	0.134023364559625	0.652096104816265	0.0931015520851205
Linear Regression	Lasso Regression	0.13432650454941	0.650239839854935	0.093460493359668
Linear Regression	Elastic Net Regression	0.134363506716829	0.650048180108082	0.093506799765122
Non-Linear Regression	Neural Network	0.138079979548364	0.631849583828549	0.0920321820156854
Non-Linear Regression	KNN	0.137746336193226	0.64321987228122	0.0935035814388041
Non-Linear Regression	SVM	0.137410561896066	0.64343871694592	0.0903724920709437
Non-Linear Regression	MARS	0.135214031166373	0.645391442964902	0.0932388471415975
Trees & Boosting	Decision Tree	0.153357244851902	0.545361936338709	0.107234983234218
Trees & Boosting	Random Forest	NA	0.00235260760870183	NA

All the linear regression models - Ridge, Lasso and Elastic Net that we built turned out to be very good models with a multiple R2 of 65%+. Non-Linear regression models like MARS, SVM and KNN also had R2 of 64% and above. Amongst the Tree models, XGBoost performed the best with 63.8% R2. All the above-mentioned models have prediction powers in the same range. Decision tree is great for interpretation but significantly under-performs all other models.

The final model, random forest model has a high Percentage of variance explained of 65.2% and a low mean squared residual (0.019).

But purely based on R2, Ridge regression model is the best pick out of all the models that we have tried.

Conclusion

Violent crime stories often dominate the news and shake our collective sense of safety, prompting society to re-examine long-held assumptions about cultural, economic, and racial harmony. As these incidents increase in visibility, it becomes even more important to understand the deeper socio-economic foundations that may contribute to such crimes. With modern computational tools, we are now able to analyze dozens of factors simultaneously, uncovering patterns and relationships that were previously difficult to see. A data-driven perspective is essential if we hope to plan for safer, more resilient communities built on trust and opportunity.

The Communities and Crime normalized dataset from the UCI Machine Learning Repository offers a rich environment for examining how diverse socio-economic variables shape violent crime rates across the United States. Through the application of contemporary machine learning methods, we were able to build predictive models with strong accuracy and identify key drivers of violent crime. A consistent theme that emerged from feature importance rankings was the role of family stability: areas with high rates of children born to never-married parents, limited educational attainment, and dense living conditions tended to experience higher levels of violent crime. While some features related to race surfaced as influential, these signals appear more reflective of long-standing economic disparities between racial groups rather than inherent racial factors themselves.

However, the study is not without limitations. Several models show signs of overfitting due to the use of the full dataset for both training and evaluation. The random forest model—our recommended option for scalability—would benefit from systematic hyperparameter tuning and proper train-test splits to strengthen generalizability. As more data becomes available, ongoing recalibration will be essential. Future work could explore more advanced modeling techniques, such as neural networks or ensemble approaches, to improve predictive performance and deepen our understanding of the socio-economic dynamics behind violent crime.

References

[1) UCI Machine Learning Repository Link] (https://archive.ics.uci.edu/ml/datasets/communities+and+crime)

DATA 622 -Assignment 4:Final Project -Community Demographics and Crime Analysis

Bikash Bhowmik —- 07 Dec 2025

Column

Column