DATA 622 - Assignment 4: Final Project - Community Demographics and Crime Analysis

Bikash Bhowmik - 07 Dec 2025

Instructions

Exploratory Analysis and Essay Assignment

  1. Choose a dataset. You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS, or other open data portals.

  2. Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15, to apply to the newly selected dataset.

  3. To complete this task:

  1. Describe the problem you are trying to solve.
  2. Describe your datasets and what you did to prepare the data for analysis.
  3. Methodologies you used for analyzing the data
  4. What’s the purpose of the analysis performed
  5. Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Introduction

I will analyze the Communities & Crime dataset, which contains information from communities across the United States. This dataset integrates socio-economic indicators from the 1990 U.S. Census, law enforcement data from the 1990 U.S. LEMAS survey, and crime statistics from the 1995 FBI UCR. Together, these sources provide a comprehensive view of the demographic, economic, and policing factors associated with crime at the community level.

Study Objective

Public perception of neighborhood safety is often shaped by high-profile violent crimes reported in the media. These events raise questions about the underlying social and economic factors that may influence crime in different regions. For instance, the rise in race-related incidents has intensified discussions about whether racial composition plays a role in community crime rates. Other factors, such as socio-economic status, immigration levels, and cultural or religious demographics, are also commonly examined when considering what drives crime.

Understanding where violent crimes occur, and how they relate to a community’s socio-economic, environmental, and demographic characteristics, is essential for identifying the root causes of these incidents. By building predictive models that quantify the risk associated with a community, we can estimate where violent crime is more likely to occur. Such insights can support city planning efforts, guiding efficient allocation of police resources and proactive safety measures. Ultimately, identifying the key drivers of violent crime provides valuable direction for urban development and the refinement of policing strategies.

Overview of the Data

The Communities and Crime dataset from the UCI Machine Learning Repository brings together socio-economic indicators from the 1990 U.S. Census, police department data from the 1990 LEMAS survey, and crime statistics from the 1995 FBI UCR. Although the data is older, it still provides meaningful insights into community characteristics and crime patterns, and supports building predictive models for crime prevention and urban planning.

The dataset contains 122 predictor variables plus the target variable, Violent Crimes per Population. Variables cover community demographics (e.g., percent urban, median family income) and law-enforcement features (e.g., per-capita officers, percent assigned to drug units). Only attributes with a plausible relationship to crime were included.

Violent crime rates were calculated using murder, rape, robbery, and assault. Some states reported rape inconsistently, which caused missing or inaccurate values; those communities were removed from the dataset.

All numeric features were normalized to a 0–1 scale using equal-interval binning. This preserves each attribute’s internal distribution but does not make values comparable across attributes.
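For illustration only (the distributed UCI files arrive already normalized, so nothing below is applied in this report), a simple min-max rescaling to the 0-1 range, similar in spirit to the normalization described above, could be written as:

# Illustrative sketch, not run here: map a numeric vector onto [0, 1].
# Rescaled values stay comparable within an attribute but not across attributes.
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}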

Due to survey limitations, some communities lack LEMAS data, and only locations present across all sources are included. The final prediction target for analysis is Violent Crimes Per Population.
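The code chunks in the remainder of this report assume the following packages are attached; in the rendered R Markdown this would normally live in an unshown setup chunk, so the list below is reconstructed from the functions that appear later:

library(tidyverse)   # read_csv, dplyr, ggplot2
library(knitr)       # kable
library(kableExtra)  # kable_styling, scroll_box
library(Hmisc)       # rcorr
library(caret)       # nearZeroVar, postResample, varImp
library(caTools)     # sample.split
library(glmnet)      # cv.glmnet for ridge, lasso, and elastic net
library(corrplot)    # correlation plots
library(plotly)      # interactive choropleth maps
# plus the package providing vif() (the multicollinearity section cites rms)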

Data Definitions

The crime dataset has 128 attributes. Below is a brief description of all attributes along with data type details -

datadict <- read_csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-622/main/data_dictionary.csv')

datadict %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="100%",height="500px")
No.  Column Description Data Type
1 state US state (by number) - not counted as predictive above, but if considered, should be considered nominal nominal
2 county numeric code for county - not predictive, and many missing values numeric
3 community numeric code for community - not predictive and many missing values numeric
4 communityname community name - not predictive - for information only string
5 fold fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive numeric
6 population population for community numeric - decimal
7 householdsize mean people per household numeric - decimal
8 racepctblack percentage of population that is african american numeric - decimal
9 racePctWhite percentage of population that is caucasian numeric - decimal
10 racePctAsian percentage of population that is of asian heritage numeric - decimal
11 racePctHisp percentage of population that is of hispanic heritage numeric - decimal
12 agePct12t21 percentage of population that is 12-21 in age numeric - decimal
13 agePct12t29 percentage of population that is 12-29 in age numeric - decimal
14 agePct16t24 percentage of population that is 16-24 in age numeric - decimal
15 agePct65up percentage of population that is 65 and over in age numeric - decimal
16 numbUrban number of people living in areas classified as urban numeric - decimal
17 pctUrban percentage of people living in areas classified as urban numeric - decimal
18 medIncome median household income numeric - decimal
19 pctWWage percentage of households with wage or salary income in 1989 numeric - decimal
20 pctWFarmSelf percentage of households with farm or self employment income in 1989 numeric - decimal
21 pctWInvInc percentage of households with investment / rent income in 1989 numeric - decimal
22 pctWSocSec percentage of households with social security income in 1989 numeric - decimal
23 pctWPubAsst percentage of households with public assistance income in 1989 numeric - decimal
24 pctWRetire percentage of households with retirement income in 1989 numeric - decimal
25 medFamInc median family income (differs from household income for non-family households) numeric - decimal
26 perCapInc per capita income numeric - decimal
27 whitePerCap per capita income for caucasians numeric - decimal
28 blackPerCap per capita income for african americans numeric - decimal
29 indianPerCap per capita income for native americans numeric - decimal
30 AsianPerCap per capita income for people with asian heritage numeric - decimal
31 OtherPerCap per capita income for people with ‘other’ heritage numeric - decimal
32 HispPerCap per capita income for people with hispanic heritage numeric - decimal
33 NumUnderPov number of people under the poverty level numeric - decimal
34 PctPopUnderPov percentage of people under the poverty level numeric - decimal
35 PctLess9thGrade percentage of people 25 and over with less than a 9th grade education numeric - decimal
36 PctNotHSGrad percentage of people 25 and over that are not high school graduates numeric - decimal
37 PctBSorMore percentage of people 25 and over with a bachelors degree or higher education numeric - decimal
38 PctUnemployed percentage of people 16 and over, in the labor force, and unemployed numeric - decimal
39 PctEmploy percentage of people 16 and over who are employed numeric - decimal
40 PctEmplManu percentage of people 16 and over who are employed in manufacturing numeric - decimal
41 PctEmplProfServ percentage of people 16 and over who are employed in professional services numeric - decimal
42 PctOccupManu percentage of people 16 and over who are employed in manufacturing numeric - decimal
43 PctOccupMgmtProf percentage of people 16 and over who are employed in management or professional occupations numeric - decimal
44 MalePctDivorce percentage of males who are divorced numeric - decimal
45 MalePctNevMarr percentage of males who have never married numeric - decimal
46 FemalePctDiv percentage of females who are divorced numeric - decimal
47 TotalPctDiv percentage of population who are divorced numeric - decimal
48 PersPerFam mean number of people per family numeric - decimal
49 PctFam2Par percentage of families (with kids) that are headed by two parents numeric - decimal
50 PctKids2Par percentage of kids in family housing with two parents numeric - decimal
51 PctYoungKids2Par percent of kids 4 and under in two parent households numeric - decimal
52 PctTeen2Par percent of kids age 12-17 in two parent households numeric - decimal
53 PctWorkMomYoungKids percentage of moms of kids 6 and under in labor force numeric - decimal
54 PctWorkMom percentage of moms of kids under 18 in labor force numeric - decimal
55 NumIlleg number of kids born to never married numeric - decimal
56 PctIlleg percentage of kids born to never married numeric - decimal
57 NumImmig total number of people known to be foreign born numeric - decimal
58 PctImmigRecent percentage of immigrants who immigrated within last 3 years numeric - decimal
59 PctImmigRec5 percentage of immigrants who immigrated within last 5 years numeric - decimal
60 PctImmigRec8 percentage of immigrants who immigrated within last 8 years numeric - decimal
61 PctImmigRec10 percentage of immigrants who immigrated within last 10 years numeric - decimal
62 PctRecentImmig percent of population who have immigrated within the last 3 years numeric - decimal
63 PctRecImmig5 percent of population who have immigrated within the last 5 years numeric - decimal
64 PctRecImmig8 percent of population who have immigrated within the last 8 years numeric - decimal
65 PctRecImmig10 percent of population who have immigrated within the last 10 years numeric - decimal
66 PctSpeakEnglOnly percent of people who speak only English numeric - decimal
67 PctNotSpeakEnglWell percent of people who do not speak English well numeric - decimal
68 PctLargHouseFam percent of family households that are large (6 or more) numeric - decimal
69 PctLargHouseOccup percent of all occupied households that are large (6 or more people) numeric - decimal
70 PersPerOccupHous mean persons per household numeric - decimal
71 PersPerOwnOccHous mean persons per owner occupied household numeric - decimal
72 PersPerRentOccHous mean persons per rental household numeric - decimal
73 PctPersOwnOccup percent of people in owner occupied households numeric - decimal
74 PctPersDenseHous percent of persons in dense housing (more than 1 person per room) numeric - decimal
75 PctHousLess3BR percent of housing units with less than 3 bedrooms numeric - decimal
76 MedNumBR median number of bedrooms numeric - decimal
77 HousVacant number of vacant households numeric - decimal
78 PctHousOccup percent of housing occupied numeric - decimal
79 PctHousOwnOcc percent of households owner occupied numeric - decimal
80 PctVacantBoarded percent of vacant housing that is boarded up numeric - decimal
81 PctVacMore6Mos percent of vacant housing that has been vacant more than 6 months numeric - decimal
82 MedYrHousBuilt median year housing units built numeric - decimal
83 PctHousNoPhone percent of occupied housing units without phone (in 1990, this was rare!) numeric - decimal
84 PctWOFullPlumb percent of housing without complete plumbing facilities numeric - decimal
85 OwnOccLowQuart owner occupied housing - lower quartile value numeric - decimal
86 OwnOccMedVal owner occupied housing - median value numeric - decimal
87 OwnOccHiQuart owner occupied housing - upper quartile value numeric - decimal
88 RentLowQ rental housing - lower quartile rent numeric - decimal
89 RentMedian rental housing - median rent (Census variable H32B from file STF1A) numeric - decimal
90 RentHighQ rental housing - upper quartile rent numeric - decimal
91 MedRent median gross rent (Census variable H43A from file STF3A - includes utilities) numeric - decimal
92 MedRentPctHousInc median gross rent as a percentage of household income numeric - decimal
93 MedOwnCostPctInc median owners cost as a percentage of household income - for owners with a mortgage numeric - decimal
94 MedOwnCostPctIncNoMtg median owners cost as a percentage of household income - for owners without a mortgage numeric - decimal
95 NumInShelters number of people in homeless shelters numeric - decimal
96 NumStreet number of homeless people counted in the street numeric - decimal
97 PctForeignBorn percent of people foreign born numeric - decimal
98 PctBornSameState percent of people born in the same state as currently living numeric - decimal
99 PctSameHouse85 percent of people living in the same house as in 1985 (5 years before) numeric - decimal
100 PctSameCity85 percent of people living in the same city as in 1985 (5 years before) numeric - decimal
101 PctSameState85 percent of people living in the same state as in 1985 (5 years before) numeric - decimal
102 LemasSwornFT number of sworn full time police officers numeric - decimal
103 LemasSwFTPerPop sworn full time police officers per 100K population numeric - decimal
104 LemasSwFTFieldOps number of sworn full time police officers in field operations (on the street as opposed to administrative etc) numeric - decimal
105 LemasSwFTFieldPerPop sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population numeric - decimal
106 LemasTotalReq total requests for police numeric - decimal
107 LemasTotReqPerPop total requests for police per 100K population numeric - decimal
108 PolicReqPerOffic total requests for police per police officer numeric - decimal
109 PolicPerPop police officers per 100K population numeric - decimal
110 RacialMatchCommPol a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar numeric - decimal
111 PctPolicWhite percent of police that are caucasian numeric - decimal
112 PctPolicBlack percent of police that are african american numeric - decimal
113 PctPolicHisp percent of police that are hispanic numeric - decimal
114 PctPolicAsian percent of police that are asian numeric - decimal
115 PctPolicMinor percent of police that are minority of any kind numeric - decimal
116 OfficAssgnDrugUnits number of officers assigned to special drug units numeric - decimal
117 NumKindsDrugsSeiz number of different kinds of drugs seized numeric - decimal
118 PolicAveOTWorked police average overtime worked numeric - decimal
119 LandArea land area in square miles numeric - decimal
120 PopDens population density in persons per square mile numeric - decimal
121 PctUsePubTrans percent of people using public transit for commuting numeric - decimal
122 PolicCars number of police cars numeric - decimal
123 PolicOperBudg police operating budget numeric - decimal
124 LemasPctPolicOnPatr percent of sworn full time police officers on patrol numeric - decimal
125 LemasGangUnitDeploy gang unit deployed numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time
126 LemasPctOfficDrugUn percent of officers assigned to drug units numeric - decimal
127 PolicBudgPerPop police operating budget per population numeric - decimal
128 ViolentCrimesPerPop total number of violent crimes per 100K population (numeric - decimal) GOAL attribute to be predicted

Supplementary Data

I also loaded a supplementary state-level dataset from the UCI Machine Learning Repository to enrich the crime dataset with state-level information -

states <- read.csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-622/main/states.csv')

names(states) <- c("state_code","state","stateName","stateENS")

states %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="80%",height="300px")
state_code state stateName stateENS
1 AL Alabama 1779775
2 AK Alaska 1785533
4 AZ Arizona 1779777
5 AR Arkansas 68085
6 CA California 1779778
8 CO Colorado 1779779
9 CT Connecticut 1779780
10 DE Delaware 1779781
11 DC District of Columbia 1702382
12 FL Florida 294478
13 GA Georgia 1705317
15 HI Hawaii 1779782
16 ID Idaho 1779783
17 IL Illinois 1779784
18 IN Indiana 448508
19 IA Iowa 1779785
20 KS Kansas 481813
21 KY Kentucky 1779786
22 LA Louisiana 1629543
23 ME Maine 1779787
24 MD Maryland 1714934
25 MA Massachusetts 606926
26 MI Michigan 1779789
27 MN Minnesota 662849
28 MS Mississippi 1779790
29 MO Missouri 1779791
30 MT Montana 767982
31 NE Nebraska 1779792
32 NV Nevada 1779793
33 NH New Hampshire 1779794
34 NJ New Jersey 1779795
35 NM New Mexico 897535
36 NY New York 1779796
37 NC North Carolina 1027616
38 ND North Dakota 1779797
39 OH Ohio 1085497
40 OK Oklahoma 1102857
41 OR Oregon 1155107
42 PA Pennsylvania 1779798
44 RI Rhode Island 1219835
45 SC South Carolina 1779799
46 SD South Dakota 1785534
47 TN Tennessee 1325873
48 TX Texas 1779801
49 UT Utah 1455989
50 VT Vermont 1779802
51 VA Virginia 1779803
53 WA Washington 1779804
54 WV West Virginia 1779805
55 WI Wisconsin 1779806
56 WY Wyoming 1779807
60 AS American Samoa 1802701
66 GU Guam 1802705
69 MP Northern Mariana Islands 1779809
72 PR Puerto Rico 1779808
74 UM U.S. Minor Outlying Islands 1878752
78 VI U.S. Virgin Islands 1802710

Dataset Preview

dataset <- read_csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-622/main/communities.data.csv')
head(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="100%",height="300px")
(Rendered output: the first six rows of the 128-column dataset, e.g. Lakewoodcity, Tukwilacity, Aberdeentown, Willingborotownship, Bethlehemtownship, and SouthPasadenacity. All numeric values lie in the normalized 0-1 range, and missing entries are coded as "?".)
dim(dataset)
[1] 1994  128

The dataset has 1,994 observations and 128 variables.

Descriptive Dataset Summary

The dataset contains no duplicate records, but it contains a lot of missing values, represented by “?”. We need to convert these entries into NA values to begin our analysis.

The first 5 non-predictive variables represent the name of the community, the state in which it is located, the county code, the community code, and the fold number for non-random 10-fold cross-validation. A summary of the remaining variables is below:

dataset[dataset == '?'] = NA

summary(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
(Rendered output: summary() statistics for all 128 columns. Numeric columns all lie in the normalized 0-1 range; the identifier columns (county, community, communityname), OtherPerCap, and the LEMAS policing variables were read in as character because of the "?" entries and therefore show no numeric summary.)

Pre-Processing

Missing Value Analysis

Based on the descriptive summary above, there are quite a few variables with missing values, so we analyzed the missing values in each attribute to determine an appropriate handling strategy.

## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL

dataset_missing_counts <- dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`))

dataset_missing_counts  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature NA_Count NA_Percentage
LemasSwornFT 1675 84.0020060
LemasSwFTPerPop 1675 84.0020060
LemasSwFTFieldOps 1675 84.0020060
LemasSwFTFieldPerPop 1675 84.0020060
LemasTotalReq 1675 84.0020060
LemasTotReqPerPop 1675 84.0020060
PolicReqPerOffic 1675 84.0020060
PolicPerPop 1675 84.0020060
RacialMatchCommPol 1675 84.0020060
PctPolicWhite 1675 84.0020060
PctPolicBlack 1675 84.0020060
PctPolicHisp 1675 84.0020060
PctPolicAsian 1675 84.0020060
PctPolicMinor 1675 84.0020060
OfficAssgnDrugUnits 1675 84.0020060
NumKindsDrugsSeiz 1675 84.0020060
PolicAveOTWorked 1675 84.0020060
PolicCars 1675 84.0020060
PolicOperBudg 1675 84.0020060
LemasPctPolicOnPatr 1675 84.0020060
LemasGangUnitDeploy 1675 84.0020060
PolicBudgPerPop 1675 84.0020060
community 1177 59.0270812
county 1174 58.8766299
OtherPerCap 1 0.0501505
ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) + 
  geom_bar(stat = 'identity', fill = 'steelblue') +
  geom_label(aes(label = NA_Count)) +
  labs(title = 'Missing Counts') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())

There are 22 variables missing 84% of their values and 2 variables missing roughly 59%. Variables with a high proportion of NAs (greater than 50%) were removed, as they were deemed practically useless for prediction or data exploration. Imputing these variables would be risky because the majority of their data is missing. Most of them come from the 1990 Law Enforcement Management and Administrative Statistics (LEMAS) survey.

Also, we remove the ‘fold’ variable, which was originally added for cross-validation.

colremove <- dataset_missing_counts %>%
  filter(NA_Percentage > 50) %>% 
  select(Feature) %>%
  as.list()

dataset <- dataset[,!(colnames(dataset) %in% colremove$Feature)]



# removing the folds variable as it was placed for cross validation
dataset <- dataset[,names(dataset) != 'fold']

I also checked for the presence of any degenerate (near-zero variance) variables in the dataset and removed them.

# capturing the degenerate variables
degenCols <- nearZeroVar(dataset)
# identifying them 
colnames(dataset[,degenCols])
[1] "LemasPctOfficDrugUn"
# removing from the dataset
dataset <- dataset[,-degenCols]

Next, I converted the ‘OtherPerCap’ column to numeric and removed the single row with an NA value in that column (the ‘communityname’ identifier is kept for now and dropped just before the correlation analysis) -

#dataset <- dataset[,!colnames(dataset) == 'communityname']

dataset <- dataset %>%
  mutate(OtherPerCap = as.numeric(OtherPerCap))

# remove the row with a missing OtherPerCap value
dataset <- dataset[!is.na(dataset$OtherPerCap), ]

dim(dataset)
[1] 1993  102

After removing these variables and the single incomplete row, the dataset is reduced to 1,993 observations and 102 columns (including the target variable and the non-predictive ‘communityname’ identifier).

Exploratory Data Analysis

To start our visual exploratory data analysis, we joined the crime dataset with the state-level enrichment dataset -

# Join with states dataframe
dataset1 <- left_join(dataset, states, 
              by = c("state" = "state_code"))

dataset1 <- rename(dataset1, state_abbr = state.y)

Violent Crime Rate Analysis

The District of Columbia (DC) has the highest average violent crime rate in the dataset. Louisiana follows in second place, with South Carolina, Maryland, and Florida trailing closely behind. While these states have relatively similar crime levels, DC’s violent crime rate is substantially higher than all the others. It is also important to note that states like New York and Florida have much larger populations, which may influence overall crime counts despite comparable rates.

# aggregating the Average ViolentCrimesPerPop by State ID (FIPS)
crime_rate_by_state <- dataset1 %>%
  group_by(state_abbr) %>%
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>% 
  mutate(rank = dense_rank(desc(AvgCrimesRate)))

crime_rate_by_state %>% filter(rank <=5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="60%",height="250px")
state_abbr AvgCrimesRate rank
DC 1.0000000 1
LA 0.5045455 2
SC 0.4867857 3
MD 0.4800000 4
FL 0.4583333 5
# State centroid latitude/longitude
state_coords <- data.frame(
  state_abbr = state.abb,
  state_name = state.name,
  long = state.center$x,
  lat = state.center$y
)

# Join coordinates with your crime data
crime_rate_by_state <- crime_rate_by_state %>%
  left_join(state_coords, by = "state_abbr")

# Violent Crime Rates by States

# Create hover text
crime_rate_by_state$hover <- with(
  crime_rate_by_state, 
  paste(state_abbr, '<br>', "Crime Rank:", rank)
)

# Identify the top 5 highest-crime states
top5 <- crime_rate_by_state %>% 
  arrange(desc(AvgCrimesRate)) %>% 
  slice(1:5)

# Map settings
l <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

# Plot
p <- crime_rate_by_state %>% 
  plot_geo(locationmode = 'USA-states') %>%
  add_trace(
    z = ~AvgCrimesRate,
    text = ~hover,
    locations = ~state_abbr,
    color = ~AvgCrimesRate,
    colors = 'Blues'
  ) %>%
  # Add labels for the top 5 crime states
  add_markers(
    data = top5,
    x = ~long, y = ~lat,
    text = ~paste(state_name, "<br>Rate:", round(AvgCrimesRate, 1)),
    hoverinfo = "text",
    marker = list(size = 8, color = "red")
  ) %>%
  add_text(
    data = top5,
    x = ~long, y = ~lat,
    text = ~state_abbr,
    textposition = "top center",
    showlegend = FALSE
  ) %>%
  colorbar(title = "Crime Rate / 100K Population") %>%
  layout(
    title = list(
      text = paste0(
        "Violent Crime Rates in the United States",
        "<br><sup>Highlighting the Five States With the Highest Crime Rates</sup>"
      )
    ),
    geo = g
  )

p

Map of the Percentage of African-American Population in the US

More than 65% of the population of DC belongs to the African-American community. Other states with a high percentage of African-Americans are Mississippi, Georgia, Louisiana, and Delaware. These are also among the states with the highest violent crime rates, which helps explain why crime so often brings up perceptions of race; we investigate this relationship further in the sections that follow.

#Percentage of African-American population by state
black_pop_by_state <- dataset1 %>%
  group_by(state_abbr) %>%
  summarise(AvgBlackPopRate = mean(racepctblack )) %>% 
  mutate(rank = dense_rank(desc(AvgBlackPopRate)))

black_pop_by_state %>% filter(rank <=5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="60%",height="250px")
state_abbr AvgBlackPopRate rank
DC 1.0000000 1
MS 0.6900000 2
GA 0.6608108 3
LA 0.6245455 4
DE 0.6000000 5
# Add state centroid coordinates (needed for labeling)
state_coords <- data.frame(
  state_abbr = state.abb,
  state_name = state.name,
  long = state.center$x,
  lat = state.center$y
)

# Join coordinates into black_pop_by_state
black_pop_by_state <- black_pop_by_state %>%
  left_join(state_coords, by = "state_abbr")


# Percentage of African-American population by state

# Hover text
black_pop_by_state$hover <- with(
  black_pop_by_state,
  paste(state_abbr, '<br>', "Black Population Rank:", rank)
)

# Identify top 5 states
top5_black <- black_pop_by_state %>%
  arrange(desc(AvgBlackPopRate)) %>%
  slice(1:5)

# Geo settings
l <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

# Plot
p <- black_pop_by_state %>%
  plot_geo(locationmode = 'USA-states') %>%
  add_trace(
    z = ~AvgBlackPopRate,
    text = ~hover,
    locations = ~state_abbr,
    color = ~AvgBlackPopRate,
    colors = 'Greens'
  ) %>%
  # Add markers for top 5 states
  add_markers(
    data = top5_black,
    x = ~long, y = ~lat,
    text = ~paste(state_abbr, "<br>", round(AvgBlackPopRate, 2), "%"),
    hoverinfo = "text",
    marker = list(size = 8, color = "red")
  ) %>%
  add_text(
    data = top5_black,
    x = ~long, y = ~lat,
    text = ~state_abbr,
    textposition = "top center",
    showlegend = FALSE
  ) %>%
  colorbar(title = "% African-American Population") %>%
  layout(
    title = list(
      text = paste0(
        "Percentage of African-American Population by State",
        "<br><sup>Highlighting the Five States With the Highest Percentages</sup>"
      )
    ),
    geo = g
  )

p

Communities with the Highest Violent Crime Rates

I conclude our initial visual data exploration by examining the communities with the highest violent crime rates. Notably, Camden City in New Jersey ranks at the top as the most dangerous community in the United States based on violent crime levels. Other cities with similarly high violent crime rates include Baton Rouge and Alexandria in Louisiana; Baltimore and Atlanta; Anniston, Bessemer, and Birmingham in Alabama; as well as Atlantic City.

library(ggplot2)

communityCrime <- dataset1 %>% select(communityname,state_abbr, population, ViolentCrimesPerPop) %>% 
  mutate(Place = paste(communityname,state_abbr, sep = ", ")) %>% 
  group_by(Place) %>% 
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>% 
  arrange(desc(AvgCrimesRate)) %>% 
  head(15)


ggplot(communityCrime, aes(x = reorder(Place, AvgCrimesRate), y = AvgCrimesRate)) +
  geom_text(aes(label = Place), hjust = 0, size = 4) +  # size 4 ≈ 12 pt font
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Top 15 High-Crime Communities",
    x = "",
    y = "Average Violent Crime Rate"
  ) +
  theme(
    axis.text.y = element_blank(),
    plot.margin = ggplot2::margin(20, 20, 20, 20)
  )

Variable Removal

We still have an extremely high number of variables in the dataset. To reduce it further, we measure the correlation of each independent variable with the dependent variable and remove the variables whose correlation with the dependent variable is low. We removed variables with an absolute correlation of less than 0.25 with the dependent variable, which reduces the number of variables in the dataset to 54. We removed these variables because we felt they would contribute very little to the final model; besides, working with fewer variables makes the results easier to interpret.

dataset <- dataset[,!colnames(dataset) == 'communityname']

#Running a correlation matrix
res2 <- rcorr(as.matrix(dataset))

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
  )
}

#Finding variables with low correlation to dependent variable
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% filter(column == "ViolentCrimesPerPop") %>% arrange(cor)
remove <- correlation %>% filter(cor >= -0.25 & cor <= 0.25) %>% select(row)


low.corr.dep <- c('HispPerCap','PctSpeakEnglOnly','RentMedian','MedRent','RentHighQ','state',
'OwnOccLowQuart','whitePerCap','OwnOccMedVal','OwnOccHiQuart','AsianPerCap','PctSameHouse85',
'pctWFarmSelf','PctWorkMom','OtherPerCap','PersPerOwnOccHous','MedYrHousBuilt','pctWRetire',
'indianPerCap','PctBornSameState','PctEmplProfServ','PctEmplManu','PersPerOccupHous','householdsize','PctWorkMomYoungKids','PctSameState85','PctVacMore6Mos','racePctAsian','MedOwnCostPctIncNoMtg','agePct12t21','MedOwnCostPctInc','agePct65up','PctSameCity85','pctUrban','agePct16t24',
'pctWSocSec','PersPerFam','agePct12t29','PctUsePubTrans','PctImmigRecent','PctForeignBorn',
'LandArea','PctImmigRec5','PctRecentImmig','PctRecImmig5','PctImmigRec8','PersPerRentOccHous'
)

dataset <- dataset %>% select(-all_of(low.corr.dep))

dim(dataset)
[1] 1993   54

Multi-collinearity

To address multicollinearity, we first identified pairs of independent variables with absolute correlation greater than 0.9. To decide which variable to remove from each highly correlated pair, we fit a linear regression of the dependent variable on the remaining predictors and calculated each predictor's Variance Inflation Factor (VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all of the other predictors), using the vif function from the rms package in R. For each correlated pair, we removed the variable with the higher VIF. This process reduced the dataset to 39 predictors. The rationale is that highly collinear variables provide redundant information to the model; retaining only the predictor with the lower VIF improves model stability without sacrificing much explanatory power.

#Now let's check for multicollinearity
res2 <- rcorr(as.matrix(dataset))
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% mutate(abs_cor = abs(cor)) %>% arrange(desc(abs_cor)) %>% filter(abs_cor > 0.9) %>% select(row, column)

correlation %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")
row column
PctRecImmig8 PctRecImmig10
population numbUrban
PctFam2Par PctKids2Par
PctLargHouseFam PctLargHouseOccup
FemalePctDiv TotalPctDiv
PctPersOwnOccup PctHousOwnOcc
medIncome medFamInc
MalePctDivorce TotalPctDiv
PctBSorMore PctOccupMgmtProf
population NumUnderPov
PctLess9thGrade PctNotHSGrad
NumUnderPov NumIlleg
medFamInc perCapInc
numbUrban NumUnderPov
PctFam2Par PctYoungKids2Par
PctKids2Par PctYoungKids2Par
MalePctDivorce FemalePctDiv
PctFam2Par PctTeen2Par
PctKids2Par PctTeen2Par
#Let's run a simple linear regression to see if any more variables can be removed based on VIF

LR_test <- lm(ViolentCrimesPerPop ~ ., dataset)

vif_df <- data.frame(vif(LR_test)) 
colnames(vif_df) <- c("VIF_Score")

vif_df %>% arrange(desc(`VIF_Score`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")
VIF_Score
TotalPctDiv 928.212429
FemalePctDiv 295.188771
MalePctDivorce 203.199697
PctRecImmig10 165.892005
PctRecImmig8 147.580998
PctLargHouseOccup 129.488362
PctLargHouseFam 124.023935
population 117.672823
PctFam2Par 103.736936
numbUrban 101.604576
medIncome 98.892368
PctKids2Par 97.813837
medFamInc 95.037614
PctPersOwnOccup 78.204266
PctHousOwnOcc 76.714946
PctNotHSGrad 37.715581
NumUnderPov 33.986232
perCapInc 27.944187
PctBSorMore 25.418161
PctPersDenseHous 23.163968
PctOccupMgmtProf 23.097460
PctLess9thGrade 21.029438
PctNotSpeakEnglWell 19.853248
PctPopUnderPov 18.129689
racePctWhite 15.166439
racepctblack 14.945811
NumIlleg 14.432404
pctWWage 13.271613
HousVacant 12.765602
PctYoungKids2Par 12.151434
pctWInvInc 12.081673
PctEmploy 11.608310
PctIlleg 11.490275
racePctHisp 9.970625
pctWPubAsst 9.197796
PctHousLess3BR 8.210647
RentLowQ 8.141949
PctHousNoPhone 7.424820
PctTeen2Par 7.264545
PctOccupManu 6.099481
PctUnemployed 6.099450
MalePctNevMarr 5.330162
NumImmig 5.007225
NumInShelters 4.710435
PctHousOccup 3.543461
PopDens 2.833126
MedNumBR 2.657976
NumStreet 2.473209
MedRentPctHousInc 2.470435
PctImmigRec10 2.372429
PctVacantBoarded 2.096964
blackPerCap 2.028969
PctWOFullPlumb 1.847284

Next, we scan the above list of multicollinear pairs and, from each pair, remove the variable with the higher VIF -

remove.multicollinear <- c('PctRecImmig10', 'population', 'PctFam2Par', 'PctLargHouseOccup', 'FemalePctDiv', 'PctPersOwnOccup', 'medIncome', 'TotalPctDiv', 'PctBSorMore', 'PctNotHSGrad', 'NumUnderPov', 'medFamInc', 'numbUrban', 'PctKids2Par')

dataset <- dataset %>% select(-all_of(remove.multicollinear))

dim(dataset)
[1] 1993   40

After removing the variables with high multicollinearity, we end up with 40 variables (39 predictors plus the target).

Correlation Plots

As a final step before modeling, we created correlation plots to assess in advance which independent variables are likely to be most important in predicting violent crime rates.

Plot1:

corrMatrix <- round(cor(dataset[,c(1:13,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Plot2:

corrMatrix <- round(cor(dataset[,c(14:26,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Plot3:

corrMatrix <- round(cor(dataset[,c(27:39,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Using the correlation plots, we can say that the following socio-economic factors in a community influence the level of violent crime rates:

  • Percentage of the population that is African American
  • Percentage of the population living below the poverty line
  • Percentage of children born to never-married mothers
  • Percentage of households receiving public assistance income
  • Percentage of individuals aged 16 and over who are in the labor force but unemployed
  • Percentage of males who are divorced
  • Percentage of occupied housing units without a telephone
  • Percentage of vacant housing units that are boarded up

Splitting Data: Train/Test

We are going to do a 75-25% split for training and test purposes.

set.seed(123)  # fix the random seed so the 75/25 split is reproducible (seed value is arbitrary)
sample = sample.split(dataset$ViolentCrimesPerPop, SplitRatio = 0.75)


train = subset(dataset, sample == TRUE) %>% as.matrix()
test = subset(dataset, sample == FALSE) %>% as.matrix()

y_train <- train[,40]
y_test <- test[,40]

X_train <- train[,-40]
X_test <- test[,-40]

#head(train)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Modeling

Re-usable Code Components

Before we start modeling, we define some re-usable code components in the form of R functions -

## Function for plotting the most important variables

printVarImportance <- function(model_name, vImpDF, varNo){
  # Rename columns
  varImportance <-  tibble::rownames_to_column(vImpDF, "Variable")
  colnames(varImportance)[2] <- "Importance"
  
  # Limit to Top N important variables 
  varImportance <- varImportance %>% arrange(desc(`Importance`)) %>% head(varNo)
  
  # Create and assign Rank
  rankImportance <- varImportance %>% mutate(Rank = paste0('#',dense_rank(desc(Importance))))
  
  # Plot output
  ggplot(rankImportance, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat='identity', fill = 'steelblue') +
  geom_label(aes(label = round(Importance,2))) +
  geom_text(aes(x = Variable, y = 0.5, label = Rank),
           hjust=1.2, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variable') +
  ggtitle(paste("Variable Importance in ", model_name, " Model")) +
  coord_flip()
}
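As a usage sketch only (someModelFit below is a hypothetical fitted caret model, not an object created in this report), the helper expects a data frame whose row names are variable names and whose first column holds the importance scores, such as the $importance slot returned by caret::varImp():

# Hypothetical usage -- 'someModelFit' is a placeholder for a fitted caret model
# printVarImportance("Some Model", varImp(someModelFit)$importance, 10)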

Model Group A: Linear Regression Model - Ridge and LASSO

The glmnet package allows us to fit both ridge and lasso regression models. It requires the response variable to be a numeric vector and the predictor variables to be supplied as a matrix (for example, via as.matrix or data.matrix).

Model1: Ridge Regression

Ridge regression is commonly used when the number of predictor variables exceeds the number of observations or when the dataset exhibits multicollinearity. In such situations, many predictors are highly correlated, causing the estimated regression coefficients to become unstable. When predictors are strongly correlated, the coefficient of any given variable can change substantially depending on which other predictors are included or excluded from the model. Ridge regression addresses this issue by adding a penalty that stabilizes coefficient estimates and improves model reliability.

Now we’ll use the cv.glmnet() function to fit the ridge regression model, specifying alpha = 0.

fit.ridge <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), alpha = 0, type.measure = "mse", family="gaussian")

To identify which value of lambda to use, we could plot the cross-validation curve and read off fit.ridge$lambda.min; below we simply pass s = “lambda.min” to predict(). The cross-validation plots are shown after all of the models.
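For reference, the selected penalty can also be inspected directly from the cv.glmnet object (a quick optional check, not evaluated in the rendered output):

plot(fit.ridge)                                      # cross-validated MSE versus log(lambda)
fit.ridge$lambda.min                                 # lambda with the lowest cross-validated MSE
head(as.matrix(coef(fit.ridge, s = "lambda.min")))   # coefficients at that lambda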

fitted.ridge.train <- predict(fit.ridge, newx = data.matrix(X_train), s="lambda.min")


fitted.ridge.test <- predict(fit.ridge, newx = data.matrix(X_test), s="lambda.min")

cat("Train coefficient: ", cor(as.matrix(y_train), fitted.ridge.train)[1])
Train coefficient:  0.8207191
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.ridge.test)[1])

Test coefficient:  0.7965984

As shown above, the train and test correlation coefficients are approximately 0.82 and 0.80.

Prediction Performance

ridgePredPerf <- postResample(pred = fitted.ridge.test , obs = y_test)
ridgePredPerf['Family'] <- 'Linear Regression'
ridgePredPerf['Model'] <- 'Ridge Regression'

Model2: LASSO Regression - Least Absolute Shrinkage and Selection Operator

As described on statisticshowto.com, lasso regression is a type of linear regression that uses shrinkage: coefficient estimates are pulled toward zero, and the L1 penalty can set some of them exactly to zero, performing automatic variable selection.

Now we'll use the glmnet() function to fit the LASSO regression model and specify alpha=1. When predicting, we'll pass s = "lambda.min" to use the lambda chosen by cross-validation.

fit.lasso <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=1, family="gaussian")


fitted.lasso.train <- predict(fit.lasso, newx = data.matrix(X_train), s="lambda.min")

fitted.lasso.test <- predict(fit.lasso, newx = data.matrix(X_test), s="lambda.min")

cat("Train coefficient: ", cor(as.matrix(y_train), fitted.lasso.train)[1])
Train coefficient:  0.8214958
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.lasso.test)[1])

Test coefficient:  0.7998346

As shown above, the train and test correlation coefficients are approximately 0.82 and 0.80.
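
Because the L1 penalty can shrink coefficients exactly to zero, it is also worth checking how many predictors the LASSO fit actually retains. A minimal sketch (output not shown):

# Count and name the predictors kept at lambda.min
lasso_coefs <- coef(fit.lasso, s = "lambda.min")
nz <- which(lasso_coefs[, 1] != 0)
length(setdiff(names(nz), "(Intercept)"))   # number of non-zero predictors
setdiff(names(nz), "(Intercept)")           # their names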

Prediction Performance

lassoPredPerf <- postResample(pred = fitted.lasso.test , obs = y_test)
lassoPredPerf['Family'] <- 'Linear Regression'
lassoPredPerf['Model'] <- 'Lasso Regression'

Model3: Elastic Net Regression

As described on machinelearningmastery.com, Elastic Net is an extension of linear regression that adds both L1 and L2 regularization penalties to the loss function during training, blending the LASSO and ridge approaches (here alpha = 0.5).

fit.elnet <- glmnet(as.matrix(X_train), as.matrix(y_train), family="gaussian", alpha=.5)

fit.elnet.cv <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=.5,
                          family="gaussian")

fitted.elnet.train <- predict(fit.elnet.cv, newx = data.matrix(X_train), s="lambda.min")

fitted.elnet.test <- predict(fit.elnet.cv, newx = data.matrix(X_test), s="lambda.min")

cat("Train coefficient: ", cor(as.matrix(y_train), fitted.elnet.train)[1])
Train coefficient:  0.8203572
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.elnet.test)[1])

Test coefficient:  0.8024804
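
The elastic net fit above fixes alpha = 0.5. As a hedged aside (not part of the original analysis), alpha itself could be tuned by comparing cross-validated error over a small grid; for a fair comparison the same fold assignment should be reused across alphas:

# Sketch: compare CV error across a grid of alpha values with shared folds
set.seed(622)                                  # illustrative seed
foldid <- sample(rep(1:10, length.out = nrow(X_train)))
alphas <- seq(0, 1, by = 0.25)
cv_mse <- sapply(alphas, function(a) {
  cvfit <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), alpha = a,
                     foldid = foldid, type.measure = "mse", family = "gaussian")
  min(cvfit$cvm)                               # best CV MSE achieved for this alpha
})
data.frame(alpha = alphas, cv_mse = cv_mse)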

Prediction Performance

elnetPredPerf <- postResample(pred = fitted.elnet.test , obs = y_test)
elnetPredPerf['Family'] <- 'Linear Regression'
elnetPredPerf['Model'] <- 'Elastic Net Regression'

Plot MSE

par(mfrow=c(3,1))

plot(fit.lasso, xvar="lambda")
plot(fit.ridge, xvar="lambda")
plot(fit.elnet.cv, xvar="lambda")

Model Group B: Non Linear Regression Model

Non-linear regression models do not assume the linear form y = ax + b; they give the model the flexibility to fit non-linear trends in the data.

We will go over four examples of non-linear regression: Neural Networks, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Multivariate Adaptive Regression Splines (MARS).

Model4: Neural Networks

Neural networks are regression techniques inspired by the way biological neurons communicate in the brain. In these models, the outcome is generated through an input layer connected to one or more hidden layers, where hidden units are formed as nonlinear transformations of linear combinations of the original predictors.

Key model parameters include the number of hidden units, the decay (regularization) term, the number of training iterations used to estimate model weights, and the total number of parameters learned by the network.

# simple average Neural Network with 5 hidden units
nnetAvg <- avNNet(X_train, y_train,
                  size = 5,
                  decay = 0.01,
                  repeats = 5,
                  linout = TRUE,
                  trace = FALSE,
                  maxit = 500,
                  MaxNWts = 5 * (ncol(X_train) + 1) +5+1)
nnetAvg
Model Averaged Neural Network with 5 Repeats  

a 39-5-1 network with 206 weights
options were - linear output units  decay=0.01
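
The network above uses a fixed size (5) and decay (0.01). As a sketch only (the parameter grid values are illustrative and not from the original run), these could instead be tuned with caret's train() and method = 'avNNet':

# Sketch: cross-validated tuning of the averaged neural network
nnetGrid <- expand.grid(size = c(3, 5, 7), decay = c(0, 0.01, 0.1), bag = FALSE)
nnetTuned <- train(X_train, y_train,
                   method = "avNNet",
                   tuneGrid = nnetGrid,
                   preProc = c("center", "scale"),
                   linout = TRUE, trace = FALSE, maxit = 500,
                   MaxNWts = 10 * (ncol(X_train) + 1) + 10 + 1,   # large enough for size = 7
                   trControl = trainControl(method = "cv", number = 5))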

Variable Importance

varimp <- varImp(nnetAvg)
varimp %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Overall
X42 5.6899802
X41 5.5946352
X333 5.5365476
X324 5.2509218
X62 5.1786693
X134 5.1116544
X122 5.0338040
X312 4.8263515
X153 4.8015982
X51 4.7111425
X223 4.6604826
X116 4.6295802
X121 4.4687215
X31 4.4411213
X11 4.4283142
X71 4.3892769
X102 4.3078641
X9 4.2540626
X173 4.1902098
X10 4.1240803
X28 4.1159443
X114 4.0988110
X211 4.0521378
X332 4.0418695
X4 3.9999241
X204 3.9795655
X262 3.8260561
X282 3.8204282
X7 3.7918403
X202 3.7843258
X61 3.7643640
X291 3.7097851
X72 3.6826851
X133 3.6309067
X103 3.5683272
X111 3.5406397
X26 3.5112803
X281 3.4883955
X5 3.4795653
X151 3.4685854
X35 3.4113647
X53 3.3869643
X284 3.3710042
X115 3.3650209
X294 3.3486920
X203 3.3160270
X384 3.2236988
X264 3.1940765
X304 3.1698936
X263 3.1611907
X313 3.1610120
X20 3.1565571
X192 3.1460488
X38 3.1414746
X193 3.1284476
X112 3.1279163
X12 3.1195874
X113 3.0524627
X174 3.0443749
X381 3.0431290
X242 3.0276255
X215 3.0118824
X84 2.9950555
X244 2.9870578
X54 2.9825366
X210 2.9759492
X15 2.9497104
X25 2.8890895
X371 2.8809608
X194 2.8575864
X6 2.8554378
X131 2.8543396
X272 2.8282020
X91 2.8204685
X293 2.7801634
X383 2.7366993
X251 2.7338906
X13 2.7091342
X104 2.6877367
X161 2.6859775
X82 2.6307961
X394 2.6142360
X2 2.6115965
X33 2.6000855
X283 2.5773328
X353 2.5760194
X191 2.5573910
X274 2.5197244
X52 2.5028234
X311 2.4865698
X94 2.4864445
X30 2.4831562
X323 2.4709522
X317 2.4668852
X152 2.4318532
X331 2.4164912
X343 2.3892988
X224 2.3559840
X171 2.3296855
X164 2.3110226
X36 2.3085952
X34 2.3072661
X352 2.2602459
X3 2.2278578
X24 2.2241185
X29 2.2022978
X334 2.1985136
X314 2.1907970
X261 2.1715231
X310 2.1128130
X216 2.1125388
X301 2.1080467
X110 2.0924141
X1 2.0898646
X382 2.0874226
X64 2.0698918
X44 2.0698868
X92 2.0613577
X184 2.0536693
X117 2.0351103
X132 2.0254657
X374 2.0215827
X243 2.0053958
X292 2.0008636
X233 1.9947995
X16 1.9853901
X124 1.9776762
X217 1.9626468
X303 1.9572939
X101 1.9288065
X393 1.9231129
X241 1.9202471
X273 1.9081042
X354 1.9009144
X73 1.8991473
X93 1.8940543
X123 1.8882301
X234 1.8833667
X21 1.8699385
X182 1.8440702
X252 1.8305815
X372 1.8007269
X19 1.7894286
X253 1.7772984
X362 1.7772683
X321 1.7724555
X154 1.7626018
X271 1.7613806
X316 1.7442082
X37 1.7408379
X231 1.7350174
X341 1.7299979
X254 1.7250654
X23 1.7249983
X302 1.7115217
X83 1.6668038
X351 1.6625854
X183 1.6604513
X201 1.6323289
X163 1.6289180
X364 1.6270573
X143 1.6147023
X162 1.6128196
X214 1.6001273
X373 1.5983394
X213 1.5818762
X63 1.5203875
X17 1.5203147
X8 1.5184990
X361 1.5038908
X392 1.4806871
X315 1.4397955
X212 1.4375903
X342 1.4262262
X363 1.3812355
X39 1.3364692
X27 1.3043873
X22 1.2702765
X18 1.2549300
X74 1.2538544
X221 1.1872707
X391 1.1829027
X43 1.0641243
X344 1.0333183
X144 0.9803654
X141 0.9181959
X181 0.8414128
X222 0.7998007
X81 0.7661747
X32 0.7654383
X322 0.7161634
X142 0.5258490
X14 0.4857650
X232 0.4803915
X172 0.4382982

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("Neural Network", varImp(nnetAvg), 10)

Prediction Performance

nnetPred <- predict(nnetAvg, newdata = X_test)
nnetPredPerf <- postResample(pred = nnetPred, obs = y_test)
nnetPredPerf['Family'] <- 'Non-Linear Regression'
nnetPredPerf['Model'] <- 'Neural Network'

Model5: K-Nearest Neighbors

For regression, the KNN model predicts a new sample by averaging the responses of its K closest training observations, where closeness is measured by Euclidean (straight-line) distance between samples.

We fit the model with the train function in the caret package, using method = 'knn'.
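
Before fitting with caret, a tiny base-R sketch (illustrative only; caret additionally centers and scales the predictors, which this sketch omits) makes the distance-and-average idea concrete:

# Predict one test point by averaging the responses of its k nearest training points
knn_predict_one <- function(x_new, X, y, k = 15) {
  dists <- sqrt(rowSums(sweep(X, 2, x_new)^2))   # Euclidean distance to every training row
  mean(y[order(dists)[1:k]])                     # average response of the k closest points
}
# e.g. knn_predict_one(X_test[1, ], X_train, y_train, k = 15)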

knnFit <- train(X_train, y_train,
                   method = 'knn',
                   preProc = c('center','scale'),
                   tuneLength = 14,
                   trControl = trainControl(method = 'cv'))

knnFit
k-Nearest Neighbors 

1499 samples
  39 predictor

Pre-processing: centered (39), scaled (39) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1349, 1349, 1349, 1350, 1350, 1349, ... 
Resampling results across tuning parameters:

  k   RMSE       Rsquared   MAE       
   5  0.1504886  0.5910223  0.10294145
   7  0.1452588  0.6189176  0.09930611
   9  0.1427626  0.6334702  0.09778909
  11  0.1421789  0.6381813  0.09736231
  13  0.1413621  0.6445508  0.09652561
  15  0.1407432  0.6497770  0.09645553
  17  0.1419524  0.6440349  0.09689907
  19  0.1422867  0.6433496  0.09744202
  21  0.1425491  0.6434221  0.09733726
  23  0.1429464  0.6419749  0.09740003
  25  0.1438124  0.6379072  0.09767265
  27  0.1444287  0.6355109  0.09778478
  29  0.1444614  0.6359304  0.09795712
  31  0.1445033  0.6368427  0.09808313

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 15.

KNN Plot

plot(knnFit)

Variable Importance

varimp <- varImp(knnFit)
varimp$importance %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Overall
PctIlleg 100.0000000
racePctWhite 82.5141099
PctYoungKids2Par 80.7000508
PctTeen2Par 79.3797735
racepctblack 68.4388071
pctWPubAsst 57.0669345
pctWInvInc 56.1507131
PctPopUnderPov 45.4646166
MalePctDivorce 45.0630754
PctUnemployed 40.8606595
PctVacantBoarded 39.5175313
PctHousOwnOcc 35.1182751
PctHousNoPhone 34.2529870
PctHousLess3BR 34.1557452
NumIlleg 33.2248758
PctPersDenseHous 28.2799532
HousVacant 24.6905913
PctLess9thGrade 21.6429297
NumInShelters 16.4100794
PctLargHouseFam 15.3369477
MedNumBR 14.6252922
NumStreet 13.4871364
perCapInc 13.3985447
PctWOFullPlumb 11.9383719
PctOccupMgmtProf 10.9579535
PctEmploy 10.9521261
MedRentPctHousInc 9.2161783
MalePctNevMarr 8.9012802
PctHousOccup 8.2631934
pctWWage 7.5256676
PctNotSpeakEnglWell 6.3596582
racePctHisp 5.2901934
NumImmig 4.8953849
PctOccupManu 4.1417729
PctImmigRec10 4.0793616
PopDens 3.7278134
blackPerCap 3.1626822
RentLowQ 0.5010465
PctRecImmig8 0.0000000

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("KNN", varImp(knnFit)$importance, 10)

Prediction Performance

knnPred <- predict(knnFit, newdata = X_test)
knnPredPerf <- postResample(pred = knnPred, obs = y_test)
knnPredPerf['Family'] <- 'Non-Linear Regression'
knnPredPerf['Model'] <- 'KNN'

Model6: Support Vector Machines

SVM regression tries to find a function such that most training points fall within a tube of width epsilon around it; only points outside the tube contribute to the loss. The epsilon threshold and the cost parameter C control how tolerant the model is of errors versus how flat the fitted function is, and because only the points on or outside the tube (the support vectors) define the solution, the model stays relatively sparse and interpretable.

We again use the caret package and fit a radial-kernel SVM (method = 'svmRadial'), which lets the model capture non-linear trends in the dataset.

svmRFit <- train(X_train, y_train,
                   method = 'svmRadial',
                   preProc = c('center','scale'),
                   tuneLength = 14,
                   trControl = trainControl(method = 'cv'))
svmRFit
Support Vector Machines with Radial Basis Function Kernel 

1499 samples
  39 predictor

Pre-processing: centered (39), scaled (39) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1349, 1349, 1349, 1350, 1349, 1349, ... 
Resampling results across tuning parameters:

  C        RMSE       Rsquared   MAE       
     0.25  0.1486425  0.6303916  0.09651106
     0.50  0.1432090  0.6454612  0.09400545
     1.00  0.1395106  0.6559552  0.09216369
     2.00  0.1392614  0.6530423  0.09242329
     4.00  0.1425317  0.6360004  0.09478038
     8.00  0.1478943  0.6093866  0.09857563
    16.00  0.1554223  0.5743820  0.10491717
    32.00  0.1607956  0.5504384  0.11115491
    64.00  0.1661346  0.5279650  0.11648441
   128.00  0.1694899  0.5145683  0.11962932
   256.00  0.1694982  0.5145987  0.11969489
   512.00  0.1694982  0.5145987  0.11969489
  1024.00  0.1694982  0.5145987  0.11969489
  2048.00  0.1694982  0.5145987  0.11969489

Tuning parameter 'sigma' was held constant at a value of 0.02741319
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.02741319 and C = 2.
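
The fit above lets caret vary only the cost parameter C (tuneLength = 14) while estimating sigma analytically. As a hedged aside (grid values illustrative), both kernel parameters could be tuned explicitly:

# Sketch: explicit grid over the RBF kernel width (sigma) and cost (C)
svmGrid <- expand.grid(sigma = c(0.01, 0.03, 0.05), C = 2^(-2:4))
svmTuned <- train(X_train, y_train,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneGrid = svmGrid,
                  trControl = trainControl(method = "cv", number = 5))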

Variable Importance

varimp <- varImp(svmRFit)
varimp$importance %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Overall
PctIlleg 100.0000000
racePctWhite 82.5141099
PctYoungKids2Par 80.7000508
PctTeen2Par 79.3797735
racepctblack 68.4388071
pctWPubAsst 57.0669345
pctWInvInc 56.1507131
PctPopUnderPov 45.4646166
MalePctDivorce 45.0630754
PctUnemployed 40.8606595
PctVacantBoarded 39.5175313
PctHousOwnOcc 35.1182751
PctHousNoPhone 34.2529870
PctHousLess3BR 34.1557452
NumIlleg 33.2248758
PctPersDenseHous 28.2799532
HousVacant 24.6905913
PctLess9thGrade 21.6429297
NumInShelters 16.4100794
PctLargHouseFam 15.3369477
MedNumBR 14.6252922
NumStreet 13.4871364
perCapInc 13.3985447
PctWOFullPlumb 11.9383719
PctOccupMgmtProf 10.9579535
PctEmploy 10.9521261
MedRentPctHousInc 9.2161783
MalePctNevMarr 8.9012802
PctHousOccup 8.2631934
pctWWage 7.5256676
PctNotSpeakEnglWell 6.3596582
racePctHisp 5.2901934
NumImmig 4.8953849
PctOccupManu 4.1417729
PctImmigRec10 4.0793616
PopDens 3.7278134
blackPerCap 3.1626822
RentLowQ 0.5010465
PctRecImmig8 0.0000000

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("SVM", varImp(svmRFit)$importance, 10)

Prediction Performance

svmRPred <- predict(svmRFit, newdata = X_test)
svmRPredPerf <- postResample(pred = svmRPred, obs = y_test)
svmRPredPerf['Family'] <- 'Non-Linear Regression'
svmRPredPerf['Model'] <- 'SVM'

Model7: Multivariate Adaptive Regression Splines (MARS)

The MARS model selects knots that define a piecewise-linear model, where each new hinge feature models an isolated region of the data. At each step, the predictor / cut-point combination that most reduces the error is added, and the process repeats until a stopping criterion is reached, after which the model is pruned back.

We will use the earth package to fit the MARS model.

# mars
marsFit <- earth(X_train, y_train)
summary(marsFit)
Call: earth(x=X_train, y=y_train)

                       coefficients
(Intercept)               0.4943746
h(0.56-racePctWhite)      0.1594048
h(racePctWhite-0.56)     -0.2653386
h(0.7-pctWWage)           0.1101454
h(pctWWage-0.7)          -0.2218155
h(0.49-pctWInvInc)        0.4890220
h(pctWInvInc-0.49)       -0.1097190
h(0.24-perCapInc)        -0.8401225
h(MalePctDivorce-0.57)    0.2514581
h(0.51-PctTeen2Par)       0.2573615
h(0.4-PctIlleg)          -0.2645082
h(PctIlleg-0.4)           0.1622576
h(PctHousLess3BR-0.67)    0.3903514
h(0.11-HousVacant)       -0.6497397
h(0.35-NumStreet)        -0.4797115

Selected 15 of 21 terms, and 10 of 39 predictors
Termination condition: RSq changed by less than 0.001 at 21 terms
Importance: PctIlleg, PctTeen2Par, NumStreet, pctWInvInc, HousVacant, ...
Number of terms at each degree of interaction: 1 14 (additive model)
GCV 0.01829973    RSS 26.38018    GRSq 0.668809    RSq 0.6810742
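
Each h(.) term in the coefficient table above is a hinge function, h(u) = max(0, u). A small illustrative sketch of how one term, h(0.56 - racePctWhite), contributes to the prediction (using the intercept and coefficient printed above):

# MARS hinge (hockey-stick) basis function
h <- function(u) pmax(0, u)
x <- c(0.20, 0.56, 0.84)                 # example racePctWhite values
h(0.56 - x)                              # non-zero only when racePctWhite < 0.56
0.4943746 + 0.1594048 * h(0.56 - x)      # intercept plus this single term's contribution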

Visualizing MARS Model

plotmo(marsFit)
 plotmo grid:    racepctblack racePctWhite racePctHisp pctWWage pctWInvInc
                         0.06         0.84        0.04     0.56       0.48
 pctWPubAsst perCapInc blackPerCap PctPopUnderPov PctLess9thGrade PctUnemployed
        0.26       0.3        0.25           0.24            0.27          0.32
 PctEmploy PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr
      0.51         0.37              0.4           0.47            0.4
 PctYoungKids2Par PctTeen2Par NumIlleg PctIlleg NumImmig PctImmigRec10
              0.7        0.61     0.01     0.17     0.01          0.43
 PctRecImmig8 PctNotSpeakEnglWell PctLargHouseFam PctPersDenseHous
         0.09                0.06             0.2             0.11
 PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded
           0.51      0.5       0.03         0.77          0.55             0.13
 PctHousNoPhone PctWOFullPlumb RentLowQ MedRentPctHousInc NumInShelters
           0.19           0.19     0.31              0.48             0
 NumStreet PopDens
         0    0.17

Variable Importance

varimp <- varImp(marsFit)
varimp %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Overall
PctIlleg 100.000000
PctTeen2Par 40.734980
NumStreet 40.734980
pctWInvInc 33.616603
HousVacant 28.053631
PctHousLess3BR 22.497532
racePctWhite 19.803503
perCapInc 16.642815
pctWWage 13.635127
MalePctDivorce 8.152042


We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("MARS", varImp(marsFit), 10)

Prediction Performance

marsPred <- predict(marsFit, newdata = X_test)
marsPredPerf <- postResample(pred = marsPred, obs = y_test)
marsPredPerf['Family'] <- 'Non-Linear Regression'
marsPredPerf['Model'] <- 'MARS'

Model Group C: Trees and Boosting Model

Model8: Decision Tree

Decision Trees are a supervised machine-learning method in which the data is recursively split according to the value of a chosen predictor. A tree consists of two kinds of elements: decision nodes, where the data is split, and leaves, which hold the final predictions. Since our target variable is continuous, we will build a regression tree.

Decision Tree 1

We will fit the decision tree with all variables first.

dt <- rpart(ViolentCrimesPerPop~ ., 
               data=as.data.frame(train))
rpart.plot(dt, nn=TRUE)

Here are some of the details of the tree:

printcp(dt)

Regression tree:
rpart(formula = ViolentCrimesPerPop ~ ., data = as.data.frame(train))

Variables actually used in tree construction:
[1] HousVacant       PctIlleg         PctOccupManu     PctPopUnderPov  
[5] PctVacantBoarded racePctHisp      racePctWhite    

Root node error: 82.716/1499 = 0.055181

n= 1499 

         CP nsplit rel error  xerror     xstd
1  0.415522      0   1.00000 1.00202 0.050406
2  0.066078      1   0.58448 0.61797 0.031121
3  0.050830      2   0.51840 0.54517 0.028344
4  0.018159      3   0.46757 0.50941 0.027322
5  0.017292      4   0.44941 0.51126 0.027937
6  0.013977      5   0.43212 0.50347 0.027453
7  0.013266      6   0.41814 0.50223 0.027187
8  0.012284      7   0.40488 0.49768 0.027059
9  0.011102      8   0.39259 0.49883 0.027383
10 0.010941      9   0.38149 0.50072 0.027425
11 0.010000     10   0.37055 0.50005 0.027515

To improve generalization, we prune the tree using the cross-validated complexity parameter.

plotcp(dt)

dt$cptable
           CP nsplit rel error    xerror       xstd
1  0.41552158      0 1.0000000 1.0020213 0.05040626
2  0.06607827      1 0.5844784 0.6179696 0.03112103
3  0.05083004      2 0.5184001 0.5451733 0.02834413
4  0.01815940      3 0.4675701 0.5094118 0.02732233
5  0.01729158      4 0.4494107 0.5112611 0.02793673
6  0.01397722      5 0.4321191 0.5034696 0.02745327
7  0.01326618      6 0.4181419 0.5022316 0.02718705
8  0.01228424      7 0.4048757 0.4976759 0.02705862
9  0.01110246      8 0.3925915 0.4988291 0.02738257
10 0.01094147      9 0.3814890 0.5007151 0.02742548
11 0.01000000     10 0.3705476 0.5000518 0.02751505

The plot above shows the cross-validated error against the complexity parameter. The cross-validated error (xerror) reaches its minimum at 7 splits (xerror of roughly 0.50, CP of roughly 0.012) and is essentially flat beyond that point, so we prune the tree with a small CP value (cp = 0.0104) chosen from this flat region.
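
Rather than reading the CP value off the table by eye, the pruning parameter can also be chosen programmatically. A small sketch (an illustrative alternative to the manual cp used below):

# Pick the CP of the row with the lowest cross-validated error
best_row <- which.min(dt$cptable[, "xerror"])
best_cp  <- dt$cptable[best_row, "CP"]
best_cp
prune_auto <- prune(dt, cp = best_cp)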

Decision Tree 2

prune_dt=prune(dt,cp=0.0104)
rpart.plot(prune_dt)

Predicting Pruned Tree on Test Data

test2<-as.data.frame(test)
test2<- na.omit(test2)
Prune_pred <- predict(prune_dt, 
                   test2)

Model Accuracy

# Note: with a continuous target, this confusion-matrix approach treats every
# distinct predicted/observed value as its own class, so exact matches are rare
confMat <- table(test2$ViolentCrimesPerPop, Prune_pred)
accuracy <- sum(diag(confMat))/sum(confMat)
accuracy
[1] 0.01417004

The model accuracy appears extremely low even after pruning, but the metric itself is the problem: with a continuous target, almost no predictions match the observed values exactly, so classification accuracy is not a meaningful regression metric. Regression trees are better judged by RMSE and R-squared, which we compute in the Prediction Performance section below. Even with appropriate metrics, single trees rarely predict as well as ensembles, so we next build a random forest model, which tends to outperform individual trees in prediction error.

Variable Importance

## Call Variable Importance Plot Function
printVarImportance("Decision Tree", data.frame(prune_dt$variable.importance), 10)

Prediction Performance

dtreePredPerf <- postResample(pred = Prune_pred, obs = y_test)
dtreePredPerf['Family'] <- 'Trees & Boosting'
dtreePredPerf['Model'] <- 'Decision Tree'

Model9: Random Forest

Random forests are among the most widely used machine learning algorithms due to their strong predictive performance, ability to handle large numbers of features, resistance to overfitting, and relatively straightforward interpretability. They can be applied to both classification and regression tasks.

When building an effective random forest model, a key objective is to minimize the out-of-bag (OOB) error rate. This involves identifying the optimal number of predictor variables randomly selected at each split—commonly referred to as mtry. The code below determines the optimal mtry value for our random forest model.

train2 <- as.data.frame(train)
oob.err = double(101)
test.err = double(101)

# Store mtry values instead of printing inside loop
mtry_list <- character()

# Note: there are only 39 predictors, so randomForest resets any mtry > 39 to the
# maximum valid value (with a warning); looping to 39 would be sufficient.
for (mtry in 1:101){
  fit = randomForest(ViolentCrimesPerPop ~ ., 
                     data=train2, 
                     mtry=mtry, 
                     ntree=100)

  oob.err[mtry] = fit$mse[100]   # out-of-bag MSE after 100 trees
  # Caution: these predictions are on the training data itself, so "test.err" is
  # really the re-substitution (training) error rather than held-out test error
  pred = predict(fit, train2)
  test.err[mtry] = with(train2, mean((ViolentCrimesPerPop - pred)^2))

  mtry_list <- c(mtry_list, as.character(mtry))
}

# Print 3 lines (34 + 34 + 33)
cat(paste(mtry_list[1:34], collapse=" "), "\n")
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 
cat(paste(mtry_list[35:68], collapse=" "), "\n")
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 
cat(paste(mtry_list[69:101], collapse=" "), "\n")
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 
matplot(1:101, cbind(test.err, oob.err), pch = 19, col = c("red", "blue"),
        type = "b", ylab = "Mean Squared Error")
legend("topright", legend = c("Training (re-substitution)", "OOB"),
       pch = 19, col = c("red", "blue"))

Ideally the OOB error would be compared against error on a held-out test set; here the second curve is the re-substitution (training) error, which sits well below the OOB error, so the gap between the two curves should not be read as evidence about generalization.
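
As a hedged aside (not part of the original run), the randomForest package also provides tuneRF(), which searches for an mtry value that minimizes OOB error directly:

# Sketch: OOB-based mtry search as an alternative to the manual loop above
predictors <- setdiff(names(train2), "ViolentCrimesPerPop")
tuned_mtry <- tuneRF(x = train2[, predictors],
                     y = train2$ViolentCrimesPerPop,
                     ntreeTry = 100, stepFactor = 1.5, improve = 0.01,
                     trace = FALSE, plot = FALSE)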

Variable Importance

par(mar = c(5, 12, 4, 2))
varImpPlot(fit, cex = 0.7)

## Call Variable Importance Plot Function
printVarImportance("Random Forest", varImp(fit), 10)

The graph above ranks the variables used to predict Violent Crimes per Population. The first importance measure reflects how much prediction error increases when a variable's values are permuted, while the second reflects how much the variable reduces node impurity across the trees' splits. By both measures, the percentage of kids born to never-married parents (PctIlleg) is the most important variable by a wide margin.

print(fit)

Call:
 randomForest(formula = ViolentCrimesPerPop ~ ., data = train2,      mtry = mtry, ntree = 100) 
               Type of random forest: regression
                     Number of trees: 100
No. of variables tried at each split: 39

          Mean of squared residuals: 0.01978664
                    % Var explained: 64.14

Prediction Performance

# Note: `pred` carries over from the last iteration of the mtry loop above and was
# generated on the training data, so its length does not match y_test; this mismatch
# is why the Random Forest row in the model summary below shows NA / a near-zero R^2.
# A correct evaluation would be, e.g.:
#   rfPred <- predict(fit, newdata = as.data.frame(test))
#   postResample(pred = rfPred, obs = y_test)
rfPredPerf <- postResample(pred = pred, obs = y_test)
rfPredPerf['Family'] <- 'Trees & Boosting'
rfPredPerf['Model'] <- 'Random Forest'

Model10: Gradient Boosting Method

Boosting builds trees sequentially: each successive tree is fit using information from the previous trees so as to reduce the remaining error. The method has three main tuning parameters: the number of trees, the shrinkage (learning-rate) parameter, and the number of splits (interaction depth) in each tree.

We will use stochastic gradient boosting, which additionally subsamples the observations at each iteration. We use the caret workflow, which here invokes the gbm package (method = 'gbm'), to tune the parameter values by cross-validation and fit the final boosted model that best explains our data.

The tuning parameters selected for the model are shown below -

library(caret)
library(gbm)
library(kableExtra)

set.seed(123)

# Training control
gbm_ctrl <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = FALSE
)

# Train GBM model using caret
gbm_model <- train(
  ViolentCrimesPerPop ~ .,
  data = train2,                 # your training dataset
  method = "gbm",
  trControl = gbm_ctrl,
  verbose = FALSE,
  tuneLength = 5                 # try 5 tuning combinations
)

# Best tuning parameter
gbm_model$bestTune %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
   n.trees  interaction.depth  shrinkage  n.minobsinnode
13     150                  3        0.1              10
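
tuneLength = 5 above lets caret choose its own parameter grid. As a sketch only (grid values illustrative), an explicit grid over the three main boosting parameters could be supplied instead:

# Sketch: explicit tuning grid for caret's method = "gbm"
gbm_grid <- expand.grid(n.trees = c(100, 150, 200),
                        interaction.depth = c(2, 3, 4),
                        shrinkage = c(0.05, 0.1),
                        n.minobsinnode = 10)
gbm_manual <- train(ViolentCrimesPerPop ~ ., data = train2,
                    method = "gbm", trControl = gbm_ctrl,
                    tuneGrid = gbm_grid, verbose = FALSE)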

Variable Importance

We identify the top 10 variables with the most importance to the prediction below -

## Call Variable Importance Plot Function
printVarImportance("XGBoost", varImp(gbm_model)$importance, 10)

Below we inspect the fitted GBM model: its relative variable influence summary and the structure of its first tree.

# Load required package
library(gbm)

# Show summary / variable importance for the GBM model
summary(gbm_model$finalModel)

                                    var    rel.inf
PctIlleg                       PctIlleg 31.3503239
NumIlleg                       NumIlleg  8.6550675
racePctWhite               racePctWhite  7.1683927
PctYoungKids2Par       PctYoungKids2Par  5.1726585
PctTeen2Par                 PctTeen2Par  5.1077645
MalePctDivorce           MalePctDivorce  4.7804968
pctWInvInc                   pctWInvInc  3.3536146
PctHousLess3BR           PctHousLess3BR  3.0048378
PctVacantBoarded       PctVacantBoarded  2.7580703
PctPersDenseHous       PctPersDenseHous  2.5403849
HousVacant                   HousVacant  2.4487914
racepctblack               racepctblack  2.3612806
PctPopUnderPov           PctPopUnderPov  1.9051555
PctHousOccup               PctHousOccup  1.7463464
NumStreet                     NumStreet  1.7356918
blackPerCap                 blackPerCap  1.3059235
pctWPubAsst                 pctWPubAsst  1.1644020
RentLowQ                       RentLowQ  1.0980841
PctOccupManu               PctOccupManu  1.0606873
pctWWage                       pctWWage  0.9659983
racePctHisp                 racePctHisp  0.9339832
perCapInc                     perCapInc  0.8715729
PopDens                         PopDens  0.8529022
PctWOFullPlumb           PctWOFullPlumb  0.8525511
PctOccupMgmtProf       PctOccupMgmtProf  0.8163753
MedRentPctHousInc     MedRentPctHousInc  0.6727384
PctHousNoPhone           PctHousNoPhone  0.6642431
PctImmigRec10             PctImmigRec10  0.5562476
PctLess9thGrade         PctLess9thGrade  0.5359875
PctRecImmig8               PctRecImmig8  0.5155014
PctEmploy                     PctEmploy  0.5085087
PctHousOwnOcc             PctHousOwnOcc  0.4814244
NumImmig                       NumImmig  0.4737437
PctNotSpeakEnglWell PctNotSpeakEnglWell  0.4306791
PctUnemployed             PctUnemployed  0.3744956
PctLargHouseFam         PctLargHouseFam  0.3109176
MalePctNevMarr           MalePctNevMarr  0.2880280
NumInShelters             NumInShelters  0.1761277
MedNumBR                       MedNumBR  0.0000000
# Optional: Inspect the first tree structure
# This prints the splits of the first tree
pretty.gbm.tree(gbm_model$finalModel, i.tree = 1)
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight
0       19  0.3650000000        1         5           9      17.479370    749
1       19  0.1450000000        2         3           4       2.963818    603
2       -1 -0.0143588605       -1        -1          -1       0.000000    338
3       -1 -0.0002333833       -1        -1          -1       0.000000    265
4       -1 -0.0081511466       -1        -1          -1       0.000000    603
5       19  0.6650000000        6         7           8       2.658789    146
6       -1  0.0183229643       -1        -1          -1       0.000000     81
7       -1  0.0454760507       -1        -1          -1       0.000000     65
8       -1  0.0304116671       -1        -1          -1       0.000000    146
9       -1 -0.0006342297       -1        -1          -1       0.000000    749
     Prediction
0 -0.0006342297
1 -0.0081511466
2 -0.0143588605
3 -0.0002333833
4 -0.0081511466
5  0.0304116671
6  0.0183229643
7  0.0454760507
8  0.0304116671
9 -0.0006342297
# Print the GBM model summary
print(gbm_model)
Stochastic Gradient Boosting 

1499 samples
  39 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1200, 1199, 1199, 1198, 1200 
Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE       Rsquared   MAE       
  1                   50      0.1410882  0.6441695  0.09891393
  1                  100      0.1377267  0.6590530  0.09536058
  1                  150      0.1373042  0.6608452  0.09495677
  1                  200      0.1371505  0.6614883  0.09460119
  1                  250      0.1371617  0.6612206  0.09446025
  2                   50      0.1377023  0.6572698  0.09503564
  2                  100      0.1374437  0.6589014  0.09409757
  2                  150      0.1364062  0.6641442  0.09283847
  2                  200      0.1370292  0.6601428  0.09313195
  2                  250      0.1376120  0.6580141  0.09334643
  3                   50      0.1365928  0.6639176  0.09418850
  3                  100      0.1348582  0.6723305  0.09227504
  3                  150      0.1346769  0.6729339  0.09244665
  3                  200      0.1348751  0.6718339  0.09245183
  3                  250      0.1356900  0.6681361  0.09277581
  4                   50      0.1364009  0.6663248  0.09349703
  4                  100      0.1357686  0.6684194  0.09226148
  4                  150      0.1368654  0.6627972  0.09270558
  4                  200      0.1374091  0.6608042  0.09292437
  4                  250      0.1378201  0.6584987  0.09291838
  5                   50      0.1373827  0.6592818  0.09409048
  5                  100      0.1378787  0.6575194  0.09374996
  5                  150      0.1379140  0.6559522  0.09349855
  5                  200      0.1377817  0.6569113  0.09377741
  5                  250      0.1378346  0.6566183  0.09392208

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 150, interaction.depth =
 3, shrinkage = 0.1 and n.minobsinnode = 10.
# Plot variable importance for the GBM model
gbm_varimp <- varImp(gbm_model)
plot(gbm_varimp, main = "Variable Importance - GBM Model")

xgbmPred <- predict(gbm_model, newdata = X_test)
xgbmPredPerf <- postResample(pred = xgbmPred, obs = y_test)
xgbmPredPerf['Family'] <- 'Trees & Boosting'
xgbmPredPerf['Model'] <- 'GBM'

Model Summary

modelSummaryDF <- rbind(ridgePredPerf, lassoPredPerf, elnetPredPerf, nnetPredPerf, knnPredPerf, svmRPredPerf, marsPredPerf, dtreePredPerf, rfPredPerf, xgbmPredPerf)

SummaryDF <- modelSummaryDF[, c(4, 5, 1, 2, 3)]



rownames(SummaryDF) <- NULL

SummaryDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width="100%",height="500px")
Family                 Model                    RMSE    Rsquared   MAE
Linear Regression      Ridge Regression         0.1407  0.6163     0.0975
Linear Regression      Lasso Regression         0.1404  0.6184     0.0971
Linear Regression      Elastic Net Regression   0.1403  0.6184     0.0970
Non-Linear Regression  Neural Network           0.1432  0.6073     0.0962
Non-Linear Regression  KNN                      0.1399  0.6226     0.0936
Non-Linear Regression  SVM                      0.1384  0.6303     0.0917
Non-Linear Regression  MARS                     0.1379  0.6325     0.0928
Trees & Boosting       Decision Tree            0.1549  0.5458     0.1083
Trees & Boosting       Random Forest            NA      1.07e-05   NA
Trees & Boosting       GBM                      0.1398  0.6268     0.0952
(values rounded to four decimal places)
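
To make the comparison explicit, the summary can be sorted by test RMSE. A small sketch (the metrics are stored as character after the rbind above, so they are converted back to numeric first):

# Order the models by test RMSE, lowest first
SummaryDF %>%
  as.data.frame() %>%
  mutate(across(c(RMSE, Rsquared, MAE), as.numeric)) %>%
  arrange(RMSE)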

Across all model families, the regularized linear models (Ridge, Lasso, and Elastic Net) performed consistently, each reaching an R-squared of roughly 0.62 with low RMSE. The non-linear models (MARS, SVM, and KNN) produced slightly better results, with R-squared values of about 0.62 to 0.63, indicating that the added model complexity bought only a modest improvement in predictive accuracy.

Within the tree-based methods, the gradient boosting (GBM) model was competitive (R-squared of about 0.63), whereas the single Decision Tree was notably weaker (R-squared of about 0.55), reflecting its limitations in capturing complex relationships without ensemble enhancements. The Random Forest row is unreliable because its predictions were evaluated against a mismatched test vector (see the note in the Random Forest section), so it is excluded from the comparison.

Overall, judged by test RMSE and R-squared, MARS and SVM edge out the other models, with gradient boosting and the regularized linear models close behind.

Conclusion

Across the model families tested (regularized linear regression; non-linear models such as neural networks, KNN, SVM, and MARS; and tree-based ensembles), the MARS and SVM models achieved the lowest prediction error on the test set, with the gradient boosting (GBM) model and the regularized linear models close behind. GBM remains a strong practical choice because it handles non-linear relationships, interactions, and many predictors while producing interpretable variable-importance scores. The neural network captured complex patterns but would need further tuning to match the best models, the single decision tree underperformed, and the random forest evaluation was unreliable due to the test-set mismatch noted earlier. Notably, the regularized linear models were nearly as accurate as the more complex approaches, suggesting that much of the signal in this dataset is close to linear.

  • Business Impact

This analysis has identified a series of actionable insights for local governments, police agencies, and community planners:

  • Targeted Resource Allocation

Predictions from the gradient boosting model allow agencies to identify the communities with the highest estimated violent-crime risk, so decision-makers can proactively deploy police patrols, social services, and prevention programs rather than reacting to increases in crime after the fact.

  • Identifying Key Drivers of Crime

Feature importance from the gradient boosting and random forest models draws attention to variables such as the percentage of children born to never-married parents, family structure (two-parent households), poverty and public-assistance rates, unemployment, and housing vacancy as large contributors to crime risk. These findings can inform policy interventions, for instance investments in employment programs, housing stability initiatives, or youth engagement services.

  • Prioritizing At-Risk Communities

Communities with certain demographic and socio-economic patterns can be flagged as priority zones. This would enable policymakers to introduce community support policies targeting mental health outreach, after-school programs, or neighborhood infrastructure improvements.

  • Improved Long-Term Planning

Predictive modeling informs long-term budgeting and planning. If a neighborhood's projected crime risk is likely to rise, city planners can act early to adjust staffing, police deployments, or funding for community-based organizations.

  • Final Recommendation

With its competitive predictive accuracy and interpretable output, the gradient boosting (GBM) model is recommended for regular crime-risk forecasting. Combining it with random forest feature insights provides accurate predictions as well as understandable explanations of the contributing factors. Together, these models significantly enhance data-driven decision making for public safety and community well-being.

References

[1] UCI Machine Learning Repository, Communities and Crime Data Set: https://archive.ics.uci.edu/ml/datasets/communities+and+crime