Instructions
Introduction
Study
Objective
Load
packages
Overview of
the Data
Data
Definitions
Supplimentary
Data
Dataset
Preview
Descriptive
Dataset Summary
Pre-Processing
Missing
Value Analysis
Exploratory
Data Analysis
Violent
Crime Rate Analysis
Map
of percentage of black population in the US
Word
Cloud of most dangerous communities
Variable
Removal
Multi-collinearity
Correlation
Plots
Plot1
Plot2
Plot3
Splitting
Data: Train/Test
Modeling
Re-usable
Code Components
Model Group A: Linear Regression Model - Ridge and LASSO
Model1: Ridge Regression
Prediction Performance
Model2:
LASSO Regression - Least Absolute Shrinkage and Selection Operator
Prediction
Performance
Model3:
Elastic Net Regression
Prediction
Performance
Plot
MSE
Model
Group B: Non Linear Regression Model
Model4:
Neural Networks
Variable
Importance
Prediction
Performance
Model5:
K-Nearest Neighbors
KNN
Plot
Variable
Importance
Prediction
Performance
Model6:
Support Vector Machines
Variable
Importance
Prediction
Performance
Model7:Multivaraite
Adaptive Regression Splines (MARS)
Model
Group C: Trees and Boosting Model
Model
8: Decision Tree
Decision
Tree1
Decision
Tree2
Variable
Importance
Model
9: Random Forest
Model
Summary
Conclusion
References
Exploratory analysis and essay Assignment
Choose a dataset You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.
Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new dataset selected.
To complete this task:.
We will analyze the Communities & Crime dataset, which contains information from communities across the United States. This dataset integrates socio-economic indicators from the 1990 U.S. Census, law enforcement data from the 1990 U.S. LEMAS survey, and crime statistics from the 1995 FBI UCR. Together, these sources provide a comprehensive view of the demographic, economic, and policing factors associated with crime at the community level.
Public perception of neighborhood safety is often shaped by high profile violent crimes reported in the media. These events raise questions about the underlying social and economic factors that may influence crime in different regions. For instance, the rise in race-related incidents has intensified discussions about whether racial composition plays a role in community crime rates. Other factors, such as socio-economic status, immigration levels, and cultural or religious demographics are also commonly examined when considering what drives crime.
Understanding where violent crimes occur, and how they relate to a community’s socio economic, environmental, and demographic characteristics, is essential for identifying the root causes of these incidents. By building predictive models that quantify the risk associated with a community, we can estimate where violent crime is more likely to occur. Such insights can support city planning efforts, guiding efficient allocation of police resources and proactive safety measures. Ultimately, identifying the key drivers of violent crime provides valuable direction for urban development and the refinement of policing strategies.
Load all required packages
The Communities and Crime dataset from the UCI Machine Learning Repository brings together socio-economic indicators from the 1990 U.S. Census, police department data from the 1990 LEMAS survey, and crime statistics from the 1995 FBI UCR. Although the data is older, it still provides meaningful insights into community characteristics and crime patterns, and supports building predictive models for crime prevention and urban planning.
The dataset contains 122 predictor variables plus the target variable, Violent Crimes per Population. Variables cover community demographics (e.g., percent urban, median family income) and law-enforcement features (e.g., per-capita officers, percent assigned to drug units). Only attributes with a plausible relationship to crime were included.
Violent crime rates were calculated using murder, rape, robbery, and assault. Some states reported rape inconsistently, which caused missing or inaccurate values; those communities were removed from the dataset.
All numeric features were normalized to a 0–1 scale using equal-interval binning. This preserves each attribute’s internal distribution but does not make values comparable across attributes.
Due to survey limitations, some communities lack LEMAS data, and only locations present across all sources are included. The final prediction target for analysis is Violent Crimes Per Population.
The crime dataset has 128 attributes. Below is a brief description of all attributes along with data type details -
datadict <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/data_dictionary.csv')
datadict %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width="100%",height="500px")| No. | Column | Description | Data Type |
|---|---|---|---|
| 1 | state | US state (by number) - not counted as predictive above, but if considered, should be consided nominal | nominal |
| 2 | county | numeric code for county - not predictive, and many missing values | numeric |
| 3 | community | numeric code for community - not predictive and many missing values | numeric |
| 4 | communityname | community name - not predictive - for information only | string |
| 5 | fold | fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive | numeric |
| 6 | population | population for community | numeric - decimal |
| 7 | householdsize | mean people per household | numeric - decimal |
| 8 | racepctblack | percentage of population that is african american | numeric - decimal |
| 9 | racePctWhite | percentage of population that is caucasian | numeric - decimal |
| 10 | racePctAsian | percentage of population that is of asian heritage | numeric - decimal |
| 11 | racePctHisp | percentage of population that is of hispanic heritage | numeric - decimal |
| 12 | agePct12t21 | percentage of population that is 12-21 in age | numeric - decimal |
| 13 | agePct12t29 | percentage of population that is 12-29 in age | numeric - decimal |
| 14 | agePct16t24 | percentage of population that is 16-24 in age | numeric - decimal |
| 15 | agePct65up | percentage of population that is 65 and over in age | numeric - decimal |
| 16 | numbUrban | number of people living in areas classified as urban | numeric - decimal |
| 17 | pctUrban | percentage of people living in areas classified as urban | numeric - decimal |
| 18 | medIncome | median household income | numeric - decimal |
| 19 | pctWWage | percentage of households with wage or salary income in 1989 | numeric - decimal |
| 20 | pctWFarmSelf | percentage of households with farm or self employment income in 1989 | numeric - decimal |
| 21 | pctWInvInc | percentage of households with investment / rent income in 1989 | numeric - decimal |
| 22 | pctWSocSec | percentage of households with social security income in 1989 | numeric - decimal |
| 23 | pctWPubAsst | percentage of households with public assistance income in 1989 | numeric - decimal |
| 24 | pctWRetire | percentage of households with retirement income in 1989 | numeric - decimal |
| 25 | medFamInc | median family income (differs from household income for non-family households) | numeric - decimal |
| 26 | perCapInc | per capita income | numeric - decimal |
| 27 | whitePerCap | per capita income for caucasians | numeric - decimal |
| 28 | blackPerCap | per capita income for african americans | numeric - decimal |
| 29 | indianPerCap | per capita income for native americans | numeric - decimal |
| 30 | AsianPerCap | per capita income for people with asian heritage | numeric - decimal |
| 31 | OtherPerCap | per capita income for people with ‘other’ heritage | numeric - decimal |
| 32 | HispPerCap | per capita income for people with hispanic heritage | numeric - decimal |
| 33 | NumUnderPov | number of people under the poverty level | numeric - decimal |
| 34 | PctPopUnderPov | percentage of people under the poverty level | numeric - decimal |
| 35 | PctLess9thGrade | percentage of people 25 and over with less than a 9th grade education | numeric - decimal |
| 36 | PctNotHSGrad | percentage of people 25 and over that are not high school graduates | numeric - decimal |
| 37 | PctBSorMore | percentage of people 25 and over with a bachelors degree or higher education | numeric - decimal |
| 38 | PctUnemployed | percentage of people 16 and over, in the labor force, and unemployed | numeric - decimal |
| 39 | PctEmploy | percentage of people 16 and over who are employed | numeric - decimal |
| 40 | PctEmplManu | percentage of people 16 and over who are employed in manufacturing | numeric - decimal |
| 41 | PctEmplProfServ | percentage of people 16 and over who are employed in professional services | numeric - decimal |
| 42 | PctOccupManu | percentage of people 16 and over who are employed in manufacturing | numeric - decimal |
| 43 | PctOccupMgmtProf | percentage of people 16 and over who are employed in management or professional occupations | numeric - decimal |
| 44 | MalePctDivorce | percentage of males who are divorced | numeric - decimal |
| 45 | MalePctNevMarr | percentage of males who have never married | numeric - decimal |
| 46 | FemalePctDiv | percentage of females who are divorced | numeric - decimal |
| 47 | TotalPctDiv | percentage of population who are divorced | numeric - decimal |
| 48 | PersPerFam | mean number of people per family | numeric - decimal |
| 49 | PctFam2Par | percentage of families (with kids) that are headed by two parents | numeric - decimal |
| 50 | PctKids2Par | percentage of kids in family housing with two parents | numeric - decimal |
| 51 | PctYoungKids2Par | percent of kids 4 and under in two parent households | numeric - decimal |
| 52 | PctTeen2Par | percent of kids age 12-17 in two parent households | numeric - decimal |
| 53 | PctWorkMomYoungKids | percentage of moms of kids 6 and under in labor force | numeric - decimal |
| 54 | PctWorkMom | percentage of moms of kids under 18 in labor force | numeric - decimal |
| 55 | NumIlleg | number of kids born to never married | numeric - decimal |
| 56 | PctIlleg | percentage of kids born to never married | numeric - decimal |
| 57 | NumImmig | total number of people known to be foreign born | numeric - decimal |
| 58 | PctImmigRecent | percentage of immigrants who immigated within last 3 years | numeric - decimal |
| 59 | PctImmigRec5 | percentage of immigrants who immigated within last 5 years | numeric - decimal |
| 60 | PctImmigRec8 | percentage of immigrants who immigated within last 8 years | numeric - decimal |
| 61 | PctImmigRec10 | percentage of immigrants who immigated within last 10 years | numeric - decimal |
| 62 | PctRecentImmig | percent of population who have immigrated within the last 3 years | numeric - decimal |
| 63 | PctRecImmig5 | percent of population who have immigrated within the last 5 years | numeric - decimal |
| 64 | PctRecImmig8 | percent of population who have immigrated within the last 8 years | numeric - decimal |
| 65 | PctRecImmig10 | percent of population who have immigrated within the last 10 years | numeric - decimal |
| 66 | PctSpeakEnglOnly | percent of people who speak only English | numeric - decimal |
| 67 | PctNotSpeakEnglWell | percent of people who do not speak English well | numeric - decimal |
| 68 | PctLargHouseFam | percent of family households that are large (6 or more) | numeric - decimal |
| 69 | PctLargHouseOccup | percent of all occupied households that are large (6 or more people) | numeric - decimal |
| 70 | PersPerOccupHous | mean persons per household | numeric - decimal |
| 71 | PersPerOwnOccHous | mean persons per owner occupied household | numeric - decimal |
| 72 | PersPerRentOccHous | mean persons per rental household | numeric - decimal |
| 73 | PctPersOwnOccup | percent of people in owner occupied households | numeric - decimal |
| 74 | PctPersDenseHous | percent of persons in dense housing (more than 1 person per room) | numeric - decimal |
| 75 | PctHousLess3BR | percent of housing units with less than 3 bedrooms | numeric - decimal |
| 76 | MedNumBR | median number of bedrooms | numeric - decimal |
| 77 | HousVacant | number of vacant households | numeric - decimal |
| 78 | PctHousOccup | percent of housing occupied | numeric - decimal |
| 79 | PctHousOwnOcc | percent of households owner occupied | numeric - decimal |
| 80 | PctVacantBoarded | percent of vacant housing that is boarded up | numeric - decimal |
| 81 | PctVacMore6Mos | percent of vacant housing that has been vacant more than 6 months | numeric - decimal |
| 82 | MedYrHousBuilt | median year housing units built | numeric - decimal |
| 83 | PctHousNoPhone | percent of occupied housing units without phone (in 1990, this was rare!) | numeric - decimal |
| 84 | PctWOFullPlumb | percent of housing without complete plumbing facilities | numeric - decimal |
| 85 | OwnOccLowQuart | owner occupied housing - lower quartile value | numeric - decimal |
| 86 | OwnOccMedVal | owner occupied housing - median value | numeric - decimal |
| 87 | OwnOccHiQuart | owner occupied housing - upper quartile value | numeric - decimal |
| 88 | RentLowQ | rental housing - lower quartile rent | numeric - decimal |
| 89 | RentMedian | rental housing - median rent (Census variable H32B from file STF1A) | numeric - decimal |
| 90 | RentHighQ | rental housing - upper quartile rent | numeric - decimal |
| 91 | MedRent | median gross rent (Census variable H43A from file STF3A - includes utilities) | numeric - decimal |
| 92 | MedRentPctHousInc | median gross rent as a percentage of household income | numeric - decimal |
| 93 | MedOwnCostPctInc | median owners cost as a percentage of household income - for owners with a mortgage | numeric - decimal |
| 94 | MedOwnCostPctIncNoMtg | median owners cost as a percentage of household income - for owners without a mortgage | numeric - decimal |
| 95 | NumInShelters | number of people in homeless shelters | numeric - decimal |
| 96 | NumStreet | number of homeless people counted in the street | numeric - decimal |
| 97 | PctForeignBorn | percent of people foreign born | numeric - decimal |
| 98 | PctBornSameState | percent of people born in the same state as currently living | numeric - decimal |
| 99 | PctSameHouse85 | percent of people living in the same house as in 1985 (5 years before) | numeric - decimal |
| 100 | PctSameCity85 | percent of people living in the same city as in 1985 (5 years before) | numeric - decimal |
| 101 | PctSameState85 | percent of people living in the same state as in 1985 (5 years before) | numeric - decimal |
| 102 | LemasSwornFT | number of sworn full time police officers | numeric - decimal |
| 103 | LemasSwFTPerPop | sworn full time police officers per 100K population | numeric - decimal |
| 104 | LemasSwFTFieldOps | number of sworn full time police officers in field operations (on the street as opposed to administrative etc) | numeric - decimal |
| 105 | LemasSwFTFieldPerPop | sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population | numeric - decimal |
| 106 | LemasTotalReq | total requests for police | numeric - decimal |
| 107 | LemasTotReqPerPop | total requests for police per 100K popuation | numeric - decimal |
| 108 | PolicReqPerOffic | total requests for police per police officer | numeric - decimal |
| 109 | PolicPerPop | police officers per 100K population | numeric - decimal |
| 110 | RacialMatchCommPol | a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar | numeric - decimal |
| 111 | PctPolicWhite | percent of police that are caucasian | numeric - decimal |
| 112 | PctPolicBlack | percent of police that are african american | numeric - decimal |
| 113 | PctPolicHisp | percent of police that are hispanic | numeric - decimal |
| 114 | PctPolicAsian | percent of police that are asian | numeric - decimal |
| 115 | PctPolicMinor | percent of police that are minority of any kind | numeric - decimal |
| 116 | OfficAssgnDrugUnits | number of officers assigned to special drug units | numeric - decimal |
| 117 | NumKindsDrugsSeiz | number of different kinds of drugs seized | numeric - decimal |
| 118 | PolicAveOTWorked | police average overtime worked | numeric - decimal |
| 119 | LandArea | land area in square miles | numeric - decimal |
| 120 | PopDens | population density in persons per square mile | numeric - decimal |
| 121 | PctUsePubTrans | percent of people using public transit for commuting | numeric - decimal |
| 122 | PolicCars | number of police cars | numeric - decimal |
| 123 | PolicOperBudg | police operating budget | numeric - decimal |
| 124 | LemasPctPolicOnPatr | percent of sworn full time police officers on patrol | numeric - decimal |
| 125 | LemasGangUnitDeploy | gang unit deployed | numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time |
| 126 | LemasPctOfficDrugUn | percent of officers assigned to drug units | numeric - decimal |
| 127 | PolicBudgPerPop | police operating budget per population | numeric - decimal |
| 128 | ViolentCrimesPerPop | total number of violent crimes per 100K popuation (numeric - decimal) GOAL attribute | to be predicted |
We also loaded the supplimentary state level dataset from UCI Machine Learning repository to enrich our crime dataset with state level information -
states <- read.csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/states.csv')
names(states) <- c("state_code","state","stateName","stateENS")
states %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width="80%",height="300px")| state_code | state | stateName | stateENS |
|---|---|---|---|
| 1 | AL | Alabama | 1779775 |
| 2 | AK | Alaska | 1785533 |
| 4 | AZ | Arizona | 1779777 |
| 5 | AR | Arkansas | 68085 |
| 6 | CA | California | 1779778 |
| 8 | CO | Colorado | 1779779 |
| 9 | CT | Connecticut | 1779780 |
| 10 | DE | Delaware | 1779781 |
| 11 | DC | District of Columbia | 1702382 |
| 12 | FL | Florida | 294478 |
| 13 | GA | Georgia | 1705317 |
| 15 | HI | Hawaii | 1779782 |
| 16 | ID | Idaho | 1779783 |
| 17 | IL | Illinois | 1779784 |
| 18 | IN | Indiana | 448508 |
| 19 | IA | Iowa | 1779785 |
| 20 | KS | Kansas | 481813 |
| 21 | KY | Kentucky | 1779786 |
| 22 | LA | Louisiana | 1629543 |
| 23 | ME | Maine | 1779787 |
| 24 | MD | Maryland | 1714934 |
| 25 | MA | Massachusetts | 606926 |
| 26 | MI | Michigan | 1779789 |
| 27 | MN | Minnesota | 662849 |
| 28 | MS | Mississippi | 1779790 |
| 29 | MO | Missouri | 1779791 |
| 30 | MT | Montana | 767982 |
| 31 | NE | Nebraska | 1779792 |
| 32 | NV | Nevada | 1779793 |
| 33 | NH | New Hampshire | 1779794 |
| 34 | NJ | New Jersey | 1779795 |
| 35 | NM | New Mexico | 897535 |
| 36 | NY | New York | 1779796 |
| 37 | NC | North Carolina | 1027616 |
| 38 | ND | North Dakota | 1779797 |
| 39 | OH | Ohio | 1085497 |
| 40 | OK | Oklahoma | 1102857 |
| 41 | OR | Oregon | 1155107 |
| 42 | PA | Pennsylvania | 1779798 |
| 44 | RI | Rhode Island | 1219835 |
| 45 | SC | South Carolina | 1779799 |
| 46 | SD | South Dakota | 1785534 |
| 47 | TN | Tennessee | 1325873 |
| 48 | TX | Texas | 1779801 |
| 49 | UT | Utah | 1455989 |
| 50 | VT | Vermont | 1779802 |
| 51 | VA | Virginia | 1779803 |
| 53 | WA | Washington | 1779804 |
| 54 | WV | West Virginia | 1779805 |
| 55 | WI | Wisconsin | 1779806 |
| 56 | WY | Wyoming | 1779807 |
| 60 | AS | American Samoa | 1802701 |
| 66 | GU | Guam | 1802705 |
| 69 | MP | Northern Mariana Islands | 1779809 |
| 72 | PR | Puerto Rico | 1779808 |
| 74 | UM | U.S. Minor Outlying Islands | 1878752 |
| 78 | VI | U.S. Virgin Islands | 1802710 |
dataset <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/communities.data.csv')
head(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width="100%",height="300px")| state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | pctUrban | medIncome | pctWWage | pctWFarmSelf | pctWInvInc | pctWSocSec | pctWPubAsst | pctWRetire | medFamInc | perCapInc | whitePerCap | blackPerCap | indianPerCap | AsianPerCap | OtherPerCap | HispPerCap | NumUnderPov | PctPopUnderPov | PctLess9thGrade | PctNotHSGrad | PctBSorMore | PctUnemployed | PctEmploy | PctEmplManu | PctEmplProfServ | PctOccupManu | PctOccupMgmtProf | MalePctDivorce | MalePctNevMarr | FemalePctDiv | TotalPctDiv | PersPerFam | PctFam2Par | PctKids2Par | PctYoungKids2Par | PctTeen2Par | PctWorkMomYoungKids | PctWorkMom | NumIlleg | PctIlleg | NumImmig | PctImmigRecent | PctImmigRec5 | PctImmigRec8 | PctImmigRec10 | PctRecentImmig | PctRecImmig5 | PctRecImmig8 | PctRecImmig10 | PctSpeakEnglOnly | PctNotSpeakEnglWell | PctLargHouseFam | PctLargHouseOccup | PersPerOccupHous | PersPerOwnOccHous | PersPerRentOccHous | PctPersOwnOccup | PctPersDenseHous | PctHousLess3BR | MedNumBR | HousVacant | PctHousOccup | PctHousOwnOcc | PctVacantBoarded | PctVacMore6Mos | MedYrHousBuilt | PctHousNoPhone | PctWOFullPlumb | OwnOccLowQuart | OwnOccMedVal | OwnOccHiQuart | RentLowQ | RentMedian | RentHighQ | MedRent | MedRentPctHousInc | MedOwnCostPctInc | MedOwnCostPctIncNoMtg | NumInShelters | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LemasSwornFT | LemasSwFTPerPop | LemasSwFTFieldOps | LemasSwFTFieldPerPop | LemasTotalReq | LemasTotReqPerPop | PolicReqPerOffic | PolicPerPop | RacialMatchCommPol | PctPolicWhite | PctPolicBlack | PctPolicHisp | PctPolicAsian | PctPolicMinor | OfficAssgnDrugUnits | NumKindsDrugsSeiz | PolicAveOTWorked | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | ? | ? | Lakewoodcity | 1 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | 0.32 | 0.20 | 1.0 | 0.37 | 0.72 | 0.34 | 0.60 | 0.29 | 0.15 | 0.43 | 0.39 | 0.40 | 0.39 | 0.32 | 0.27 | 0.27 | 0.36 | 0.41 | 0.08 | 0.19 | 0.10 | 0.18 | 0.48 | 0.27 | 0.68 | 0.23 | 0.41 | 0.25 | 0.52 | 0.68 | 0.40 | 0.75 | 0.75 | 0.35 | 0.55 | 0.59 | 0.61 | 0.56 | 0.74 | 0.76 | 0.04 | 0.14 | 0.03 | 0.24 | 0.27 | 0.37 | 0.39 | 0.07 | 0.07 | 0.08 | 0.08 | 0.89 | 0.06 | 0.14 | 0.13 | 0.33 | 0.39 | 0.28 | 0.55 | 0.09 | 0.51 | 0.5 | 0.21 | 0.71 | 0.52 | 0.05 | 0.26 | 0.65 | 0.14 | 0.06 | 0.22 | 0.19 | 0.18 | 0.36 | 0.35 | 0.38 | 0.34 | 0.38 | 0.46 | 0.25 | 0.04 | 0 | 0.12 | 0.42 | 0.50 | 0.51 | 0.64 | 0.03 | 0.13 | 0.96 | 0.17 | 0.06 | 0.18 | 0.44 | 0.13 | 0.94 | 0.93 | 0.03 | 0.07 | 0.1 | 0.07 | 0.02 | 0.57 | 0.29 | 0.12 | 0.26 | 0.20 | 0.06 | 0.04 | 0.9 | 0.5 | 0.32 | 0.14 | 0.20 |
| 53 | ? | ? | Tukwilacity | 1 | 0.00 | 0.16 | 0.12 | 0.74 | 0.45 | 0.07 | 0.26 | 0.59 | 0.35 | 0.27 | 0.02 | 1.0 | 0.31 | 0.72 | 0.11 | 0.45 | 0.25 | 0.29 | 0.39 | 0.29 | 0.37 | 0.38 | 0.33 | 0.16 | 0.30 | 0.22 | 0.35 | 0.01 | 0.24 | 0.14 | 0.24 | 0.30 | 0.27 | 0.73 | 0.57 | 0.15 | 0.42 | 0.36 | 1.00 | 0.63 | 0.91 | 1.00 | 0.29 | 0.43 | 0.47 | 0.60 | 0.39 | 0.46 | 0.53 | 0.00 | 0.24 | 0.01 | 0.52 | 0.62 | 0.64 | 0.63 | 0.25 | 0.27 | 0.25 | 0.23 | 0.84 | 0.10 | 0.16 | 0.10 | 0.17 | 0.29 | 0.17 | 0.26 | 0.20 | 0.82 | 0.0 | 0.02 | 0.79 | 0.24 | 0.02 | 0.25 | 0.65 | 0.16 | 0.00 | 0.21 | 0.20 | 0.21 | 0.42 | 0.38 | 0.40 | 0.37 | 0.29 | 0.32 | 0.18 | 0.00 | 0 | 0.21 | 0.50 | 0.34 | 0.60 | 0.52 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.02 | 0.12 | 0.45 | ? | ? | ? | ? | 0.00 | ? | 0.67 |
| 24 | ? | ? | Aberdeentown | 1 | 0.00 | 0.42 | 0.49 | 0.56 | 0.17 | 0.04 | 0.39 | 0.47 | 0.28 | 0.32 | 0.00 | 0.0 | 0.30 | 0.58 | 0.19 | 0.39 | 0.38 | 0.40 | 0.84 | 0.28 | 0.27 | 0.29 | 0.27 | 0.07 | 0.29 | 0.28 | 0.39 | 0.01 | 0.27 | 0.27 | 0.43 | 0.19 | 0.36 | 0.58 | 0.32 | 0.29 | 0.49 | 0.32 | 0.63 | 0.41 | 0.71 | 0.70 | 0.45 | 0.42 | 0.44 | 0.43 | 0.43 | 0.71 | 0.67 | 0.01 | 0.46 | 0.00 | 0.07 | 0.06 | 0.15 | 0.19 | 0.02 | 0.02 | 0.04 | 0.05 | 0.88 | 0.04 | 0.20 | 0.20 | 0.46 | 0.52 | 0.43 | 0.42 | 0.15 | 0.51 | 0.5 | 0.01 | 0.86 | 0.41 | 0.29 | 0.30 | 0.52 | 0.47 | 0.45 | 0.18 | 0.17 | 0.16 | 0.27 | 0.29 | 0.27 | 0.31 | 0.48 | 0.39 | 0.28 | 0.00 | 0 | 0.14 | 0.49 | 0.54 | 0.67 | 0.56 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.01 | 0.21 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.43 |
| 34 | 5 | 81440 | Willingborotownship | 1 | 0.04 | 0.77 | 1.00 | 0.08 | 0.12 | 0.10 | 0.51 | 0.50 | 0.34 | 0.21 | 0.06 | 1.0 | 0.58 | 0.89 | 0.21 | 0.43 | 0.36 | 0.20 | 0.82 | 0.51 | 0.36 | 0.40 | 0.39 | 0.16 | 0.25 | 0.36 | 0.44 | 0.01 | 0.10 | 0.09 | 0.25 | 0.31 | 0.33 | 0.71 | 0.36 | 0.45 | 0.37 | 0.39 | 0.34 | 0.45 | 0.49 | 0.44 | 0.75 | 0.65 | 0.54 | 0.83 | 0.65 | 0.85 | 0.86 | 0.03 | 0.33 | 0.02 | 0.11 | 0.20 | 0.30 | 0.31 | 0.05 | 0.08 | 0.11 | 0.11 | 0.81 | 0.08 | 0.56 | 0.62 | 0.85 | 0.77 | 1.00 | 0.94 | 0.12 | 0.01 | 0.5 | 0.01 | 0.97 | 0.96 | 0.60 | 0.47 | 0.52 | 0.11 | 0.11 | 0.24 | 0.21 | 0.19 | 0.75 | 0.70 | 0.77 | 0.89 | 0.63 | 0.51 | 0.47 | 0.00 | 0 | 0.19 | 0.30 | 0.73 | 0.64 | 0.65 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.02 | 0.39 | 0.28 | ? | ? | ? | ? | 0.00 | ? | 0.12 |
| 42 | 95 | 6096 | Bethlehemtownship | 1 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | 0.05 | 0.38 | 0.38 | 0.23 | 0.36 | 0.02 | 0.9 | 0.50 | 0.72 | 0.16 | 0.68 | 0.44 | 0.11 | 0.71 | 0.46 | 0.43 | 0.41 | 0.28 | 0.00 | 0.74 | 0.51 | 0.48 | 0.00 | 0.06 | 0.25 | 0.30 | 0.33 | 0.12 | 0.65 | 0.67 | 0.38 | 0.42 | 0.46 | 0.22 | 0.27 | 0.20 | 0.21 | 0.51 | 0.91 | 0.91 | 0.89 | 0.85 | 0.40 | 0.60 | 0.00 | 0.06 | 0.00 | 0.03 | 0.07 | 0.20 | 0.27 | 0.01 | 0.02 | 0.04 | 0.05 | 0.88 | 0.05 | 0.16 | 0.19 | 0.59 | 0.60 | 0.37 | 0.89 | 0.02 | 0.19 | 0.5 | 0.01 | 0.89 | 0.87 | 0.04 | 0.55 | 0.73 | 0.05 | 0.14 | 0.31 | 0.31 | 0.30 | 0.40 | 0.36 | 0.38 | 0.38 | 0.22 | 0.51 | 0.21 | 0.00 | 0 | 0.11 | 0.72 | 0.64 | 0.61 | 0.53 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.04 | 0.09 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.03 |
| 6 | ? | ? | SouthPasadenacity | 1 | 0.02 | 0.28 | 0.06 | 0.54 | 1.00 | 0.25 | 0.31 | 0.48 | 0.27 | 0.37 | 0.04 | 1.0 | 0.52 | 0.68 | 0.20 | 0.61 | 0.28 | 0.15 | 0.25 | 0.62 | 0.72 | 0.76 | 0.77 | 0.28 | 0.52 | 0.48 | 0.60 | 0.01 | 0.12 | 0.13 | 0.12 | 0.80 | 0.10 | 0.65 | 0.19 | 0.77 | 0.06 | 0.91 | 0.49 | 0.57 | 0.61 | 0.58 | 0.44 | 0.62 | 0.69 | 0.87 | 0.53 | 0.30 | 0.43 | 0.00 | 0.11 | 0.04 | 0.30 | 0.35 | 0.43 | 0.47 | 0.50 | 0.50 | 0.56 | 0.57 | 0.45 | 0.28 | 0.25 | 0.19 | 0.29 | 0.53 | 0.18 | 0.39 | 0.26 | 0.73 | 0.0 | 0.02 | 0.84 | 0.30 | 0.16 | 0.28 | 0.25 | 0.02 | 0.05 | 0.94 | 1.00 | 1.00 | 0.67 | 0.63 | 0.68 | 0.62 | 0.47 | 0.59 | 0.11 | 0.00 | 0 | 0.70 | 0.42 | 0.49 | 0.73 | 0.64 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.01 | 0.58 | 0.10 | ? | ? | ? | ? | 0.00 | ? | 0.14 |
[1] 1994 128
The dataset has 1994 observations and 128 variable.
The dataset contains no duplicate records, but it contains a lot of missing values, represented by “?”. We need to convert these entries into NA values to begin our analysis.
The first 5 non-predictive variables represent the name of the community, the state in which it is located, the county code, the community code, and the fold number for non-random 10 fold cross validation. A summary of the remaining variables is below:
dataset[dataset == '?'] = NA
summary(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")| state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | pctUrban | medIncome | pctWWage | pctWFarmSelf | pctWInvInc | pctWSocSec | pctWPubAsst | pctWRetire | medFamInc | perCapInc | whitePerCap | blackPerCap | indianPerCap | AsianPerCap | OtherPerCap | HispPerCap | NumUnderPov | PctPopUnderPov | PctLess9thGrade | PctNotHSGrad | PctBSorMore | PctUnemployed | PctEmploy | PctEmplManu | PctEmplProfServ | PctOccupManu | PctOccupMgmtProf | MalePctDivorce | MalePctNevMarr | FemalePctDiv | TotalPctDiv | PersPerFam | PctFam2Par | PctKids2Par | PctYoungKids2Par | PctTeen2Par | PctWorkMomYoungKids | PctWorkMom | NumIlleg | PctIlleg | NumImmig | PctImmigRecent | PctImmigRec5 | PctImmigRec8 | PctImmigRec10 | PctRecentImmig | PctRecImmig5 | PctRecImmig8 | PctRecImmig10 | PctSpeakEnglOnly | PctNotSpeakEnglWell | PctLargHouseFam | PctLargHouseOccup | PersPerOccupHous | PersPerOwnOccHous | PersPerRentOccHous | PctPersOwnOccup | PctPersDenseHous | PctHousLess3BR | MedNumBR | HousVacant | PctHousOccup | PctHousOwnOcc | PctVacantBoarded | PctVacMore6Mos | MedYrHousBuilt | PctHousNoPhone | PctWOFullPlumb | OwnOccLowQuart | OwnOccMedVal | OwnOccHiQuart | RentLowQ | RentMedian | RentHighQ | MedRent | MedRentPctHousInc | MedOwnCostPctInc | MedOwnCostPctIncNoMtg | NumInShelters | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LemasSwornFT | LemasSwFTPerPop | LemasSwFTFieldOps | LemasSwFTFieldPerPop | LemasTotalReq | LemasTotReqPerPop | PolicReqPerOffic | PolicPerPop | RacialMatchCommPol | PctPolicWhite | PctPolicBlack | PctPolicHisp | PctPolicAsian | PctPolicMinor | OfficAssgnDrugUnits | NumKindsDrugsSeiz | PolicAveOTWorked | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.00 | Length:1994 | Length:1994 | Length:1994 | Min. : 1.000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Min. :0.0000 | Min. :0.00000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Min. :0.00000 | Length:1994 | Min. :0.000 | |
| 1st Qu.:12.00 | Class :character | Class :character | Class :character | 1st Qu.: 3.000 | 1st Qu.:0.01000 | 1st Qu.:0.3500 | 1st Qu.:0.0200 | 1st Qu.:0.6300 | 1st Qu.:0.0400 | 1st Qu.:0.010 | 1st Qu.:0.3400 | 1st Qu.:0.4100 | 1st Qu.:0.2500 | 1st Qu.:0.3000 | 1st Qu.:0.00000 | 1st Qu.:0.0000 | 1st Qu.:0.2000 | 1st Qu.:0.4400 | 1st Qu.:0.1600 | 1st Qu.:0.3700 | 1st Qu.:0.3500 | 1st Qu.:0.1425 | 1st Qu.:0.3600 | 1st Qu.:0.2300 | 1st Qu.:0.2200 | 1st Qu.:0.240 | 1st Qu.:0.1725 | 1st Qu.:0.1100 | 1st Qu.:0.1900 | Class :character | 1st Qu.:0.2600 | 1st Qu.:0.01000 | 1st Qu.:0.110 | 1st Qu.:0.1600 | 1st Qu.:0.2300 | 1st Qu.:0.2100 | 1st Qu.:0.2200 | 1st Qu.:0.3800 | 1st Qu.:0.2500 | 1st Qu.:0.3200 | 1st Qu.:0.2400 | 1st Qu.:0.3100 | 1st Qu.:0.3300 | 1st Qu.:0.3100 | 1st Qu.:0.3600 | 1st Qu.:0.3600 | 1st Qu.:0.4000 | 1st Qu.:0.4900 | 1st Qu.:0.4900 | 1st Qu.:0.530 | 1st Qu.:0.4800 | 1st Qu.:0.3900 | 1st Qu.:0.4200 | 1st Qu.:0.00000 | 1st Qu.:0.09 | 1st Qu.:0.00000 | 1st Qu.:0.1600 | 1st Qu.:0.2000 | 1st Qu.:0.2500 | 1st Qu.:0.2800 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.7300 | 1st Qu.:0.0300 | 1st Qu.:0.1500 | 1st Qu.:0.1400 | 1st Qu.:0.3400 | 1st Qu.:0.3900 | 1st Qu.:0.2700 | 1st Qu.:0.4400 | 1st Qu.:0.0600 | 1st Qu.:0.4000 | 1st Qu.:0.0000 | 1st Qu.:0.01000 | 1st Qu.:0.6300 | 1st Qu.:0.4300 | 1st Qu.:0.0600 | 1st Qu.:0.2900 | 1st Qu.:0.3500 | 1st Qu.:0.0600 | 1st Qu.:0.1000 | 1st Qu.:0.0900 | 1st Qu.:0.0900 | 1st Qu.:0.0900 | 1st Qu.:0.1700 | 1st Qu.:0.2000 | 1st Qu.:0.220 | 1st Qu.:0.2100 | 1st Qu.:0.3700 | 1st Qu.:0.3200 | 1st Qu.:0.2500 | 1st Qu.:0.00000 | 1st Qu.:0.00000 | 1st Qu.:0.0600 | 1st Qu.:0.4700 | 1st Qu.:0.4200 | 1st Qu.:0.5200 | 1st Qu.:0.5600 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.:0.02000 | 1st Qu.:0.1000 | 1st Qu.:0.0200 | Class :character | Class :character | Class :character | Class :character | 1st Qu.:0.00000 | Class :character | 1st Qu.:0.070 | |
| Median :34.00 | Mode :character | Mode :character | Mode :character | Median : 5.000 | Median :0.02000 | Median :0.4400 | Median :0.0600 | Median :0.8500 | Median :0.0700 | Median :0.040 | Median :0.4000 | Median :0.4800 | Median :0.2900 | Median :0.4200 | Median :0.03000 | Median :1.0000 | Median :0.3200 | Median :0.5600 | Median :0.2300 | Median :0.4800 | Median :0.4750 | Median :0.2600 | Median :0.4700 | Median :0.3300 | Median :0.3000 | Median :0.320 | Median :0.2500 | Median :0.1700 | Median :0.2800 | Mode :character | Median :0.3450 | Median :0.02000 | Median :0.250 | Median :0.2700 | Median :0.3600 | Median :0.3100 | Median :0.3200 | Median :0.5100 | Median :0.3700 | Median :0.4100 | Median :0.3700 | Median :0.4000 | Median :0.4700 | Median :0.4000 | Median :0.5000 | Median :0.5000 | Median :0.4700 | Median :0.6300 | Median :0.6400 | Median :0.700 | Median :0.6100 | Median :0.5100 | Median :0.5400 | Median :0.01000 | Median :0.17 | Median :0.01000 | Median :0.2900 | Median :0.3400 | Median :0.3900 | Median :0.4300 | Median :0.0900 | Median :0.0800 | Median :0.0900 | Median :0.0900 | Median :0.8700 | Median :0.0600 | Median :0.2000 | Median :0.1900 | Median :0.4400 | Median :0.4800 | Median :0.3600 | Median :0.5600 | Median :0.1100 | Median :0.5100 | Median :0.5000 | Median :0.03000 | Median :0.7700 | Median :0.5400 | Median :0.1300 | Median :0.4200 | Median :0.5200 | Median :0.1850 | Median :0.1900 | Median :0.1800 | Median :0.1700 | Median :0.1800 | Median :0.3100 | Median :0.3300 | Median :0.370 | Median :0.3400 | Median :0.4800 | Median :0.4500 | Median :0.3700 | Median :0.00000 | Median :0.00000 | Median :0.1300 | Median :0.6300 | Median :0.5400 | Median :0.6700 | Median :0.7000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median :0.04000 | Median :0.1700 | Median :0.0700 | Mode :character | Mode :character | Mode :character | Mode :character | Median :0.00000 | Mode :character | Median :0.150 | |
| Mean :28.68 | NA | NA | NA | Mean : 5.494 | Mean :0.05759 | Mean :0.4634 | Mean :0.1796 | Mean :0.7537 | Mean :0.1537 | Mean :0.144 | Mean :0.4242 | Mean :0.4939 | Mean :0.3363 | Mean :0.4232 | Mean :0.06407 | Mean :0.6963 | Mean :0.3611 | Mean :0.5582 | Mean :0.2916 | Mean :0.4957 | Mean :0.4711 | Mean :0.3178 | Mean :0.4792 | Mean :0.3757 | Mean :0.3503 | Mean :0.368 | Mean :0.2911 | Mean :0.2035 | Mean :0.3224 | NA | Mean :0.3863 | Mean :0.05551 | Mean :0.303 | Mean :0.3158 | Mean :0.3833 | Mean :0.3617 | Mean :0.3635 | Mean :0.5011 | Mean :0.3964 | Mean :0.4406 | Mean :0.3912 | Mean :0.4413 | Mean :0.4612 | Mean :0.4345 | Mean :0.4876 | Mean :0.4943 | Mean :0.4877 | Mean :0.6109 | Mean :0.6207 | Mean :0.664 | Mean :0.5829 | Mean :0.5014 | Mean :0.5267 | Mean :0.03629 | Mean :0.25 | Mean :0.03006 | Mean :0.3202 | Mean :0.3606 | Mean :0.3991 | Mean :0.4279 | Mean :0.1814 | Mean :0.1821 | Mean :0.1848 | Mean :0.1829 | Mean :0.7859 | Mean :0.1506 | Mean :0.2676 | Mean :0.2519 | Mean :0.4621 | Mean :0.4944 | Mean :0.4041 | Mean :0.5626 | Mean :0.1863 | Mean :0.4952 | Mean :0.3147 | Mean :0.07682 | Mean :0.7195 | Mean :0.5487 | Mean :0.2045 | Mean :0.4333 | Mean :0.4942 | Mean :0.2645 | Mean :0.2431 | Mean :0.2647 | Mean :0.2635 | Mean :0.2689 | Mean :0.3464 | Mean :0.3725 | Mean :0.423 | Mean :0.3841 | Mean :0.4901 | Mean :0.4498 | Mean :0.4038 | Mean :0.02944 | Mean :0.02278 | Mean :0.2156 | Mean :0.6089 | Mean :0.5351 | Mean :0.6264 | Mean :0.6515 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean :0.06523 | Mean :0.2329 | Mean :0.1617 | NA | NA | NA | NA | Mean :0.09405 | NA | Mean :0.238 | |
| 3rd Qu.:42.00 | NA | NA | NA | 3rd Qu.: 8.000 | 3rd Qu.:0.05000 | 3rd Qu.:0.5400 | 3rd Qu.:0.2300 | 3rd Qu.:0.9400 | 3rd Qu.:0.1700 | 3rd Qu.:0.160 | 3rd Qu.:0.4700 | 3rd Qu.:0.5400 | 3rd Qu.:0.3600 | 3rd Qu.:0.5300 | 3rd Qu.:0.07000 | 3rd Qu.:1.0000 | 3rd Qu.:0.4900 | 3rd Qu.:0.6900 | 3rd Qu.:0.3700 | 3rd Qu.:0.6200 | 3rd Qu.:0.5800 | 3rd Qu.:0.4400 | 3rd Qu.:0.5800 | 3rd Qu.:0.4800 | 3rd Qu.:0.4300 | 3rd Qu.:0.440 | 3rd Qu.:0.3800 | 3rd Qu.:0.2500 | 3rd Qu.:0.4000 | NA | 3rd Qu.:0.4800 | 3rd Qu.:0.05000 | 3rd Qu.:0.450 | 3rd Qu.:0.4200 | 3rd Qu.:0.5100 | 3rd Qu.:0.4600 | 3rd Qu.:0.4800 | 3rd Qu.:0.6275 | 3rd Qu.:0.5200 | 3rd Qu.:0.5300 | 3rd Qu.:0.5100 | 3rd Qu.:0.5400 | 3rd Qu.:0.5900 | 3rd Qu.:0.5000 | 3rd Qu.:0.6200 | 3rd Qu.:0.6300 | 3rd Qu.:0.5600 | 3rd Qu.:0.7600 | 3rd Qu.:0.7800 | 3rd Qu.:0.840 | 3rd Qu.:0.7200 | 3rd Qu.:0.6200 | 3rd Qu.:0.6500 | 3rd Qu.:0.02000 | 3rd Qu.:0.32 | 3rd Qu.:0.02000 | 3rd Qu.:0.4300 | 3rd Qu.:0.4800 | 3rd Qu.:0.5300 | 3rd Qu.:0.5600 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.9400 | 3rd Qu.:0.1600 | 3rd Qu.:0.3100 | 3rd Qu.:0.2900 | 3rd Qu.:0.5500 | 3rd Qu.:0.5800 | 3rd Qu.:0.4900 | 3rd Qu.:0.7000 | 3rd Qu.:0.2200 | 3rd Qu.:0.6000 | 3rd Qu.:0.5000 | 3rd Qu.:0.07000 | 3rd Qu.:0.8600 | 3rd Qu.:0.6700 | 3rd Qu.:0.2700 | 3rd Qu.:0.5600 | 3rd Qu.:0.6700 | 3rd Qu.:0.4200 | 3rd Qu.:0.3300 | 3rd Qu.:0.4000 | 3rd Qu.:0.3900 | 3rd Qu.:0.3800 | 3rd Qu.:0.4900 | 3rd Qu.:0.5200 | 3rd Qu.:0.590 | 3rd Qu.:0.5300 | 3rd Qu.:0.5900 | 3rd Qu.:0.5800 | 3rd Qu.:0.5100 | 3rd Qu.:0.01000 | 3rd Qu.:0.00000 | 3rd Qu.:0.2800 | 3rd Qu.:0.7775 | 3rd Qu.:0.6600 | 3rd Qu.:0.7700 | 3rd Qu.:0.7900 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.:0.07000 | 3rd Qu.:0.2800 | 3rd Qu.:0.1900 | NA | NA | NA | NA | 3rd Qu.:0.00000 | NA | 3rd Qu.:0.330 | |
| Max. :56.00 | NA | NA | NA | Max. :10.000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | NA | Max. :1.0000 | Max. :1.00000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | NA | NA | NA | NA | Max. :1.00000 | NA | Max. :1.000 |
Based on the above descriptive data summary, there are quite a few variables with missing values. So we conducted an analysis of all missing values in various attributes to identify proper imputation technique.
## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2,function(x) {sum(is.na(x)) / length(x) * 100}))
dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL
dataset_missing_counts <- dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`))
dataset_missing_counts %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")| Feature | NA_Count | NA_Percentage |
|---|---|---|
| LemasSwornFT | 1675 | 84.0020060 |
| LemasSwFTPerPop | 1675 | 84.0020060 |
| LemasSwFTFieldOps | 1675 | 84.0020060 |
| LemasSwFTFieldPerPop | 1675 | 84.0020060 |
| LemasTotalReq | 1675 | 84.0020060 |
| LemasTotReqPerPop | 1675 | 84.0020060 |
| PolicReqPerOffic | 1675 | 84.0020060 |
| PolicPerPop | 1675 | 84.0020060 |
| RacialMatchCommPol | 1675 | 84.0020060 |
| PctPolicWhite | 1675 | 84.0020060 |
| PctPolicBlack | 1675 | 84.0020060 |
| PctPolicHisp | 1675 | 84.0020060 |
| PctPolicAsian | 1675 | 84.0020060 |
| PctPolicMinor | 1675 | 84.0020060 |
| OfficAssgnDrugUnits | 1675 | 84.0020060 |
| NumKindsDrugsSeiz | 1675 | 84.0020060 |
| PolicAveOTWorked | 1675 | 84.0020060 |
| PolicCars | 1675 | 84.0020060 |
| PolicOperBudg | 1675 | 84.0020060 |
| LemasPctPolicOnPatr | 1675 | 84.0020060 |
| LemasGangUnitDeploy | 1675 | 84.0020060 |
| PolicBudgPerPop | 1675 | 84.0020060 |
| community | 1177 | 59.0270812 |
| county | 1174 | 58.8766299 |
| OtherPerCap | 1 | 0.0501505 |
ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) +
geom_bar(stat = 'identity', fill = 'steelblue') +
geom_label(aes(label = NA_Count)) +
labs(title = 'Missing Counts') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())There are 22 variables missing 84% of data and 2 variables missing roughly 59% of data. These variables with a high proportion of NA’s (greater than 50%) were removed as these were deemed practically useless in prediction or data exploration. Imputation on these varibales is a risk as the majority of the data is missing. Most of these variables came from the 1990 Law Enforcement Management and Admin Stats survey (Lemas).
Also, we are going to remove the ‘fold’ variable orginally added for cross validation.
colremove <- dataset_missing_counts %>%
filter(NA_Percentage > 50) %>%
select(Feature) %>%
as.list()
dataset <- dataset[,!(colnames(dataset) %in% colremove$Feature)]
# removing the folds variable as it was placed for cross validation
dataset <- dataset[,names(dataset) != 'fold']We also did check for presence of any degenerate variables in the dataset and removed them.
# capturing the degenerate variables
degenCols <- nearZeroVar(dataset)
# identifying them
colnames(dataset[,degenCols])[1] "LemasPctOfficDrugUn"
We removed the ‘communityname’ from the dataset and removed rows with NA values in ‘OtherPerCap’ column -
#dataset <- dataset[,!colnames(dataset) == 'communityname']
dataset <- dataset %>%
mutate(OtherPerCap = as.numeric(OtherPerCap))
#remove missing brand => df3
dataset <- dataset[!is.na(dataset$OtherPerCap), ]
dim(dataset)[1] 1993 102
After removing these variables, our dataset reduced to 101 variables.
To start with our visual exploratory data analysis, we joined the crime dataset with state level enrichment dataset -
# Join with states dataframe
dataset1 <- left_join(dataset, states,
by = c("state" = "state_code"))
dataset1 <- rename(dataset1, state_abbr = state.y)District of Columbia leads in terms of violent crime rates in the United States. Los Angeles has the second highest crime rate after DC, followed by Louisiana, South Carolina, Maryland and Florida. DC, however, have much higher crime rates in comparison to the other states, which have roughly almost the same crime rate. However, New York and Florida have very high populations in comparison to the others.
# aggregating the Average ViolentCrimesPerPop by State ID (FIPS)
crime_rate_by_state <- dataset1 %>%
group_by(state_abbr) %>%
summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>%
mutate(rank = dense_rank(desc(AvgCrimesRate)))
crime_rate_by_state %>% filter(rank <=5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width="60%",height="250px")| state_abbr | AvgCrimesRate | rank |
|---|---|---|
| DC | 1.0000000 | 1 |
| LA | 0.5045455 | 2 |
| SC | 0.4867857 | 3 |
| MD | 0.4800000 | 4 |
| FL | 0.4583333 | 5 |
#Violent Crime Rates by States
crime_rate_by_state$hover <- with(crime_rate_by_state, paste(state_abbr,'<br>', "Crime Rank:",rank))
l <- list(color = toRGB("white"), width = 2)
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('blue')
)
p <- crime_rate_by_state %>% plot_geo(locationmode = 'USA-states') %>%
add_trace(
z = ~AvgCrimesRate, text = ~hover, locations = ~state_abbr,
color = ~AvgCrimesRate, colors = 'Blues'
) %>%
colorbar(title = "Crime Rate/100K Population") %>%
layout(
title = 'Violent Crime Rates in the United States',
geo = g
)
pMap of percentage of black population in the US
More than 65% of the population of DC belongs to the African-American community. Other states with high percentage of African-Americans are Mississippi, Georgia, Louisiana, and Delaware. These are some of the states with the highest violent crime rates. No wonder crimes bring about perceptions of race. We need to investigate this further.
#Percentage of African-American population by state
black_pop_by_state <- dataset1 %>%
group_by(state_abbr) %>%
summarise(AvgBlackPopRate = mean(racepctblack )) %>%
mutate(rank = dense_rank(desc(AvgBlackPopRate)))
black_pop_by_state %>% filter(rank <=5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width="60%",height="250px")| state_abbr | AvgBlackPopRate | rank |
|---|---|---|
| DC | 1.0000000 | 1 |
| MS | 0.6900000 | 2 |
| GA | 0.6608108 | 3 |
| LA | 0.6245455 | 4 |
| DE | 0.6000000 | 5 |
black_pop_by_state$hover <- with(black_pop_by_state, paste(state_abbr,'<br>', "Black Population Rank:",rank))
l <- list(color = toRGB("white"), width = 2)
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('blue')
)
p <- black_pop_by_state %>% plot_geo(locationmode = 'USA-states') %>%
add_trace(
z = ~AvgBlackPopRate, text = ~hover, locations = ~state_abbr,
color = ~AvgBlackPopRate, colors = 'Greens'
) %>%
colorbar(title = "%Black") %>%
layout(
title = 'Percentage of African-American Population',
geo = g
)
pWord Cloud of most dangerous communities
We conclude our initial Visual Data Exploration by taking a look at the communities with the highest violent crime rates. Interestingly, Camden City in New Jersey showed at the top on the list of most dangerous communities in the US in terms of violent crime rates. Other dangerous cities when it comes to violent crime rates are BatonRouge and Alexandria in Louisiana, Baltimore, Atlanta, Anniston, Bessemer and Birmingham in Alabma, Atlantic City.
#Word Cloud of Communities with the highest crime rates
communityCrime <- dataset1 %>% select(communityname,state_abbr, population, ViolentCrimesPerPop) %>%
mutate(Place = paste(communityname,state_abbr, sep = ", ")) %>%
group_by(Place) %>%
summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>%
arrange(desc(AvgCrimesRate)) %>%
head(15)
par(mfrow = c(1,1))
par(mar = rep(0,4))
set.seed(1)
wordcloud(words = communityCrime$Place, freq = communityCrime$AvgCrimesRate, scale=c(3,0.01),
random.order=FALSE,
colors=brewer.pal(8, "Dark2"))We still have an extremely high number of variables in the dataset. To further remove variables, we measure the correlation of each independent variable against the dependent variable and remove the variables with low correlation to the independent variable. We removed variables with an absolute correlation of less than 0.25 to the independent variable. This vastly reduces the number of variables in the dataset to 54. We removed these variables as we felt that these variables would contribute very less to our final model. Besides, working with fewer variables help in better interpretation.
dataset <- dataset[,!colnames(dataset) == 'communityname']
#Running a correlation matrix
res2 <- rcorr(as.matrix(dataset))
flattenCorrMatrix <- function(cormat, pmat) {
ut <- upper.tri(cormat)
data.frame(
row = rownames(cormat)[row(cormat)[ut]],
column = rownames(cormat)[col(cormat)[ut]],
cor =(cormat)[ut],
p = pmat[ut]
)
}
#Finding variables with low correlation to dependent variable
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% filter(column == "ViolentCrimesPerPop") %>% arrange(cor)
remove <- correlation %>% filter(cor >= -0.25 & cor <= 0.25) %>% select(row)
low.corr.dep <- c('HispPerCap','PctSpeakEnglOnly','RentMedian','MedRent','RentHighQ','state',
'OwnOccLowQuart','whitePerCap','OwnOccMedVal','OwnOccHiQuart','AsianPerCap','PctSameHouse85',
'pctWFarmSelf','PctWorkMom','OtherPerCap','PersPerOwnOccHous','MedYrHousBuilt','pctWRetire',
'indianPerCap','PctBornSameState','PctEmplProfServ','PctEmplManu','PersPerOccupHous','householdsize','PctWorkMomYoungKids','PctSameState85','PctVacMore6Mos','racePctAsian','MedOwnCostPctIncNoMtg','agePct12t21','MedOwnCostPctInc','agePct65up','PctSameCity85','pctUrban','agePct16t24',
'pctWSocSec','PersPerFam','agePct12t29','PctUsePubTrans','PctImmigRecent','PctForeignBorn',
'LandArea','PctImmigRec5','PctRecentImmig','PctRecImmig5','PctImmigRec8','PersPerRentOccHous'
)
dataset<- dataset %>% select(-low.corr.dep)
dim(dataset)[1] 1993 54
For removing multi-collinearity, we selected pairs of independent variables with absolute correlations greater than 0.9. For deciding which variable to remove among the pairs of independent variables, we regressed the remaining independent variables against the dependent variable and found the variance inflation factor of each variable using the vif function from the rms package in R and the removed the independent variable from each pair of multi-collinear variables which had higher variance inflation factors. This reduced the number of variables in the dataset to 39. The logic for this was that the variables in a pair of multi-collinear variables contribute the same variance information to a model and hence only one of them (with the lower variance inflation factor) is required for modeling.
#Now let's check for multicollinearity
res2 <- rcorr(as.matrix(dataset))
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% mutate(abs_cor = abs(cor)) %>% arrange(desc(abs_cor)) %>% filter(abs_cor > 0.9) %>% select(row, column)
correlation %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")| row | column |
|---|---|
| PctRecImmig8 | PctRecImmig10 |
| population | numbUrban |
| PctFam2Par | PctKids2Par |
| PctLargHouseFam | PctLargHouseOccup |
| FemalePctDiv | TotalPctDiv |
| PctPersOwnOccup | PctHousOwnOcc |
| medIncome | medFamInc |
| MalePctDivorce | TotalPctDiv |
| PctBSorMore | PctOccupMgmtProf |
| population | NumUnderPov |
| PctLess9thGrade | PctNotHSGrad |
| NumUnderPov | NumIlleg |
| medFamInc | perCapInc |
| numbUrban | NumUnderPov |
| PctFam2Par | PctYoungKids2Par |
| PctKids2Par | PctYoungKids2Par |
| MalePctDivorce | FemalePctDiv |
| PctFam2Par | PctTeen2Par |
| PctKids2Par | PctTeen2Par |
#Let's run a simple linear regression to see if any more variables can be removed based on VIF
LR_test <- lm(ViolentCrimesPerPop ~ ., dataset)
vif_df <- data.frame(vif(LR_test))
colnames(vif_df) <- c("VIF_Score")
vif_df %>% arrange(desc(`VIF_Score`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")| VIF_Score | |
|---|---|
| TotalPctDiv | 928.212429 |
| FemalePctDiv | 295.188771 |
| MalePctDivorce | 203.199697 |
| PctRecImmig10 | 165.892005 |
| PctRecImmig8 | 147.580998 |
| PctLargHouseOccup | 129.488362 |
| PctLargHouseFam | 124.023935 |
| population | 117.672823 |
| PctFam2Par | 103.736936 |
| numbUrban | 101.604576 |
| medIncome | 98.892368 |
| PctKids2Par | 97.813837 |
| medFamInc | 95.037614 |
| PctPersOwnOccup | 78.204266 |
| PctHousOwnOcc | 76.714946 |
| PctNotHSGrad | 37.715581 |
| NumUnderPov | 33.986232 |
| perCapInc | 27.944187 |
| PctBSorMore | 25.418161 |
| PctPersDenseHous | 23.163968 |
| PctOccupMgmtProf | 23.097460 |
| PctLess9thGrade | 21.029438 |
| PctNotSpeakEnglWell | 19.853248 |
| PctPopUnderPov | 18.129689 |
| racePctWhite | 15.166439 |
| racepctblack | 14.945811 |
| NumIlleg | 14.432404 |
| pctWWage | 13.271613 |
| HousVacant | 12.765602 |
| PctYoungKids2Par | 12.151434 |
| pctWInvInc | 12.081673 |
| PctEmploy | 11.608310 |
| PctIlleg | 11.490275 |
| racePctHisp | 9.970625 |
| pctWPubAsst | 9.197796 |
| PctHousLess3BR | 8.210647 |
| RentLowQ | 8.141949 |
| PctHousNoPhone | 7.424820 |
| PctTeen2Par | 7.264545 |
| PctOccupManu | 6.099481 |
| PctUnemployed | 6.099450 |
| MalePctNevMarr | 5.330162 |
| NumImmig | 5.007225 |
| NumInShelters | 4.710435 |
| PctHousOccup | 3.543461 |
| PopDens | 2.833126 |
| MedNumBR | 2.657976 |
| NumStreet | 2.473209 |
| MedRentPctHousInc | 2.470435 |
| PctImmigRec10 | 2.372429 |
| PctVacantBoarded | 2.096964 |
| blackPerCap | 2.028969 |
| PctWOFullPlumb | 1.847284 |
Next, we scan the above list of multi-collinear variables and remove the ones with higher VIF -
remove.multicollinear <- c('PctRecImmig10', 'population', 'PctFam2Par', 'PctLargHouseOccup', 'FemalePctDiv', 'PctPersOwnOccup', 'medIncome', 'TotalPctDiv', 'PctBSorMore', 'PctNotHSGrad', 'NumUnderPov', 'medFamInc', 'numbUrban', 'PctKids2Par')
dataset <- dataset %>% select(-remove.multicollinear)
dim(dataset)[1] 1993 40
After removing the variables with high multi-collinearity, we end up with 40 variables.
As a final step before modeling, we created correlation plots to assess in advance the independent variables that will be most important in predicting the violent crime rates.
corrMatrix <- round(cor(dataset[,c(1:13,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))corrMatrix <- round(cor(dataset[,c(14:26,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))corrMatrix <- round(cor(dataset[,c(27:39,40)]),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))Using the correlation plots, we can say that the following socio-economic factors in a community influence the level of violent crime rates:
Percentage of the population that is African-American Percentage of the population living under poverty line Percentage of kids born to never married Percentage of households with public assistance income Percentage of people 16 and over, in the labor force, and unemployed Percentage of males who are divorced Percentage of occupied housing units without a phone Percentage of vacant housing that is boarded up
We are going to do a 75-25% split for training and test purposes.
sample = sample.split(dataset$ViolentCrimesPerPop, SplitRatio = 0.75)
train = subset(dataset, sample == TRUE) %>% as.matrix()
test = subset(dataset, sample == FALSE) %>% as.matrix()
y_train <- train[,40]
y_test <- test[,40]
X_train <- train[,-40]
X_test <- test[,-40]
#head(train)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")Before we start with modeling, we wanted to define some of the re-usable code components in the form of R functions -
## Function for plotting the most important variables
printVarImportance <- function(model_name, vImpDF, varNo){
# Rename columns
varImportance <- tibble::rownames_to_column(vImpDF, "Variable")
colnames(varImportance)[2] <- "Importance"
# Limit to Top N important variables
varImportance <- varImportance %>% arrange(desc(`Importance`)) %>% head(varNo)
# Create and assign Rank
rankImportance <- varImportance %>% mutate(Rank = paste0('#',dense_rank(desc(Importance))))
# Plot output
ggplot(rankImportance, aes(x = reorder(Variable, Importance), y = Importance)) +
geom_bar(stat='identity', fill = 'steelblue') +
geom_label(aes(label = round(Importance,2))) +
geom_text(aes(x = Variable, y = 0.5, label = Rank),
hjust=1.2, vjust=0.55, size = 4, colour = 'red') +
labs(x = 'Variable') +
ggtitle(paste("Variable Importance in ", model_name, " Model")) +
coord_flip()
}Model Group A: Linear Regression Model - Ridge and LASSO
The glmnet package allows us to perform both the ridge and lasso regression models. This library requires the response variable to be a vector and the predictor variables to be a class of data.matrix.
Ridge regression is a model used “when the number of predictor variables in a set exceeds the number of observations”, or when a dataset suffers from multicollinearity. According to stats, with ridge regression, often predictor variables used in a regression are highly correlated. When they are, the regression coefficient of any one variable can depend on which other predictor variables are included in the model, and which ones are left out.
Now we’ll use the glmnet() function to fit the ridge regression model and specify alpha=0.
fit.ridge <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), alpha = 0, type.measure = "mse", family="gaussian")To identify what value to use for lambda, we could plot the function using our model fit.ridge$lambda.min, but below we’ll use the s=“lambda.min” argument to save time. The graph will be shown below after all models.
fitted.ridge.train <- predict(fit.ridge, newx = data.matrix(X_train), s="lambda.min")
fitted.ridge.test <- predict(fit.ridge, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.ridge.train)[1])Train coefficient: 0.8141226
Test coefficient: 0.7635008
As we can see above, the train and test coefficients are 0.81 and 0.76
ridgePredPerf <- postResample(pred = fitted.ridge.test , obs = y_test)
ridgePredPerf['Family'] <- 'Linear Regression'
ridgePredPerf['Model'] <- 'Ridge Regression'Model2: LASSO Regression - Least Absolute Shrinkage and Selection Operator
According to statisticshowto.com, lasso regression is a type of linear regression that uses shrinkage - which shrinks data points towards a central point, such as the mean.
Now we’ll use the glmnet() function to fit the LASSO regression model and specify alpha=1. We’ll specific s=“lambda.min” to find the best lambda.
fit.lasso <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=1, family="gaussian")
fitted.lasso.train <- predict(fit.lasso, newx = data.matrix(X_train), s="lambda.min")
fitted.lasso.test <- predict(fit.lasso, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.lasso.train)[1])Train coefficient: 0.8157457
Test coefficient: 0.7689392
As we can see above, the train and test coefficients are 0.81 and 0.76.
lassoPredPerf <- postResample(pred = fitted.lasso.test , obs = y_test)
lassoPredPerf['Family'] <- 'Linear Regression'
lassoPredPerf['Model'] <- 'Lasso Regression'Model3: Elastic Net Regression
According to machinelearningmastery.com, Elastic Net is an extension of linear regression that will add regulirization penalties to the loss function during training.
fit.elnet <- glmnet(as.matrix(X_train), as.matrix(y_train), family="gaussian", alpha=.5)
fit.elnet.cv <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=.5,
family="gaussian")
fitted.elnet.train <- predict(fit.elnet.cv, newx = data.matrix(X_train), s="lambda.min")
fitted.elnet.test <- predict(fit.elnet.cv, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.elnet.train)[1])Train coefficient: 0.8158622
Test coefficient: 0.7685533
elnetPredPerf <- postResample(pred = fitted.elnet.test , obs = y_test)
elnetPredPerf['Family'] <- 'Linear Regression'
elnetPredPerf['Model'] <- 'Elastic Net Regression'par(mfrow=c(3,1))
plot(fit.lasso, xvar="lambda")
plot(fit.ridge, xvar="lambda")
plot(fit.elnet.cv, xvar="lambda")Model Group B: Non Linear Regression Model
Non Linear Regression models do not follow the y=ax+b approach. It allows for flexibility in the model to fit non linear trends in data.
We will go over four examples of non linear regression: Neural Networks, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Multivariate Adaptive Regression Splines (MARS).
Neural Networks are regression techniques inspired by theories of how the brain works (neurons signalling to one another). The outcome is modeled by an input layer and an intermediary set of unobserved values, hidden units, made up of linear combinations of the original predictors.
The parameters used in the model include number of hidden units, decay, number of iterations to find parameter estimates, and the number of parameters used by the model.
# simple average Neural Network with 5 hidden units
nnetAvg <- avNNet(X_train, y_train,
size = 5,
decay = 0.01,
repeats = 5,
linout = TRUE,
trace = FALSE,
maxit = 500,
MaxNWts = 5 * (ncol(X_train) + 1) +5+1)
nnetAvgModel Averaged Neural Network with 5 Repeats
a 39-5-1 network with 206 weights
options were - linear output units decay=0.01
varimp <- varImp(nnetAvg)
varimp %>%
arrange(desc(Overall)) %>%
kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")| Overall | |
|---|---|
| X292 | 7.6370160 |
| X12 | 5.9451966 |
| X114 | 5.5187905 |
| X26 | 5.2518596 |
| X54 | 5.1049880 |
| X154 | 5.0924933 |
| X42 | 5.0717370 |
| X283 | 4.9011977 |
| X381 | 4.6842240 |
| X203 | 4.6721495 |
| X382 | 4.6398054 |
| X29 | 4.5765466 |
| X291 | 4.5705678 |
| X122 | 4.4443809 |
| X53 | 4.3255080 |
| X323 | 4.2644435 |
| X41 | 4.1839150 |
| X110 | 4.1553171 |
| X131 | 4.1499350 |
| X373 | 4.1439441 |
| X384 | 4.1345264 |
| X63 | 4.1269616 |
| X294 | 4.1134946 |
| X31 | 4.0805513 |
| X3 | 4.0448392 |
| X116 | 3.9986073 |
| X293 | 3.9965900 |
| X121 | 3.9865322 |
| X151 | 3.8569537 |
| X64 | 3.8219922 |
| X261 | 3.7827265 |
| X9 | 3.7704606 |
| X264 | 3.7133551 |
| X164 | 3.6181529 |
| X314 | 3.6143162 |
| X72 | 3.6122705 |
| X51 | 3.5855487 |
| X262 | 3.5758820 |
| X223 | 3.5657833 |
| X19 | 3.5597244 |
| X271 | 3.5582675 |
| X315 | 3.5535831 |
| X171 | 3.5193914 |
| X38 | 3.4772065 |
| X352 | 3.4701537 |
| X214 | 3.4212095 |
| X204 | 3.3389173 |
| X33 | 3.3169319 |
| X2 | 3.2437467 |
| X93 | 3.2400145 |
| X216 | 3.2084728 |
| X1 | 3.1588952 |
| X351 | 3.1146563 |
| X317 | 3.0791051 |
| X62 | 3.0756721 |
| X172 | 3.0527272 |
| X37 | 3.0281053 |
| X272 | 3.0106611 |
| X17 | 2.9916386 |
| X251 | 2.9644936 |
| X212 | 2.9582334 |
| X254 | 2.9251315 |
| X92 | 2.9100862 |
| X27 | 2.8149147 |
| X134 | 2.7943746 |
| X71 | 2.7664481 |
| X52 | 2.7549555 |
| X133 | 2.7507472 |
| X301 | 2.7229865 |
| X243 | 2.7066318 |
| X192 | 2.6908310 |
| X22 | 2.6889728 |
| X374 | 2.6869133 |
| X132 | 2.6767235 |
| X113 | 2.6722419 |
| X181 | 2.6336297 |
| X44 | 2.6125308 |
| X74 | 2.5816257 |
| X13 | 2.5794715 |
| X282 | 2.5663870 |
| X194 | 2.5553930 |
| X364 | 2.5536503 |
| X112 | 2.5291440 |
| X213 | 2.4953951 |
| X152 | 2.4947829 |
| X104 | 2.4920203 |
| X6 | 2.4895872 |
| X161 | 2.4849425 |
| X231 | 2.4742504 |
| X11 | 2.4692769 |
| X117 | 2.4628723 |
| X162 | 2.4551926 |
| X81 | 2.4425303 |
| X393 | 2.4310590 |
| X244 | 2.4305055 |
| X217 | 2.3860420 |
| X211 | 2.3729905 |
| X252 | 2.3709876 |
| X324 | 2.3613761 |
| X18 | 2.3342998 |
| X115 | 2.3332208 |
| X316 | 2.3271865 |
| X321 | 2.3161905 |
| X313 | 2.3147162 |
| X83 | 2.3084911 |
| X182 | 2.3038297 |
| X39 | 2.2885316 |
| X21 | 2.2865094 |
| X36 | 2.2581763 |
| X61 | 2.2538031 |
| X16 | 2.2151170 |
| X233 | 2.2128776 |
| X311 | 2.1189115 |
| X91 | 2.1182449 |
| X253 | 2.1127521 |
| X28 | 2.1029672 |
| X383 | 2.0997756 |
| X334 | 2.0952218 |
| X281 | 2.0654783 |
| X333 | 2.0517161 |
| X111 | 2.0492464 |
| X102 | 2.0437337 |
| X394 | 2.0411659 |
| X215 | 2.0385755 |
| X343 | 2.0084313 |
| X25 | 2.0051765 |
| X353 | 2.0024676 |
| X193 | 1.9981012 |
| X174 | 1.9904280 |
| X24 | 1.9786006 |
| X371 | 1.9758678 |
| X263 | 1.9735472 |
| X362 | 1.9578452 |
| X123 | 1.9453243 |
| X82 | 1.9128455 |
| X141 | 1.9102032 |
| X361 | 1.8959054 |
| X124 | 1.8487145 |
| X142 | 1.8416549 |
| X344 | 1.8379848 |
| X153 | 1.8313143 |
| X202 | 1.8309932 |
| X10 | 1.8271260 |
| X5 | 1.8119391 |
| X23 | 1.7830686 |
| X8 | 1.7639648 |
| X4 | 1.7378508 |
| X354 | 1.7334810 |
| X322 | 1.7154179 |
| X94 | 1.7143561 |
| X101 | 1.7099298 |
| X35 | 1.7034332 |
| X372 | 1.6664911 |
| X304 | 1.6362735 |
| X201 | 1.6107170 |
| X84 | 1.6087458 |
| X310 | 1.5871175 |
| X284 | 1.5467524 |
| X341 | 1.5303573 |
| X14 | 1.4727146 |
| X103 | 1.4153643 |
| X391 | 1.4150329 |
| X312 | 1.4038618 |
| X363 | 1.3891787 |
| X30 | 1.3779550 |
| X242 | 1.3666800 |
| X303 | 1.3544250 |
| X210 | 1.3464787 |
| X221 | 1.3390426 |
| X232 | 1.3385168 |
| X274 | 1.3155557 |
| X73 | 1.3055992 |
| X173 | 1.3003825 |
| X20 | 1.2894163 |
| X43 | 1.2773276 |
| X302 | 1.2762359 |
| X191 | 1.2515995 |
| X15 | 1.2409482 |
| X144 | 1.2232473 |
| X183 | 1.2118165 |
| X32 | 1.1941497 |
| X273 | 1.1506261 |
| X342 | 1.1440656 |
| X184 | 1.0164413 |
| X224 | 1.0046459 |
| X332 | 0.9973565 |
| X234 | 0.9932592 |
| X7 | 0.9404488 |
| X34 | 0.8996808 |
| X392 | 0.8822497 |
| X241 | 0.8759609 |
| X222 | 0.8382394 |
| X163 | 0.7721729 |
| X143 | 0.6735831 |
| X331 | 0.6196059 |
We identify the top 10 variables with the most importance to the prediction below -
nnetPred <- predict(nnetAvg, newdata = X_test)
nnetPredPerf <- postResample(pred = nnetPred, obs = y_test)
nnetPredPerf['Family'] <- 'Non-Linear Regression'
nnetPredPerf['Model'] <- 'Neural Network'The KNN model predicts a new sample for regression using the K-closest data points from the training set. The method uses the Euclidean distance or the straight-line distance between samples.
The train function in the caret package uses the knn method.
knnFit <- train(X_train, y_train,
method = 'knn',
preProc = c('center','scale'),
tuneLength = 14,
trControl = trainControl(method = 'cv'))
knnFitk-Nearest Neighbors
1499 samples
39 predictor
Pre-processing: centered (39), scaled (39)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1349, 1349, 1348, 1347, 1349, 1350, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 0.1470551 0.6085563 0.10018005
7 0.1454218 0.6186791 0.09854605
9 0.1440191 0.6269384 0.09784016
11 0.1430909 0.6322653 0.09764411
13 0.1426350 0.6358800 0.09697388
15 0.1434319 0.6321638 0.09772261
17 0.1431849 0.6351700 0.09726898
19 0.1439635 0.6320082 0.09767728
21 0.1443801 0.6305995 0.09764054
23 0.1439431 0.6342258 0.09739809
25 0.1444537 0.6324687 0.09746166
27 0.1448037 0.6308266 0.09757062
29 0.1451915 0.6295031 0.09796384
31 0.1454098 0.6292510 0.09811286
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 13.
varimp <- varImp(knnFit)
varimp$importance %>%
arrange(desc(Overall)) %>%
kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")| Overall | |
|---|---|
| PctIlleg | 100.0000000 |
| racePctWhite | 82.6566345 |
| PctYoungKids2Par | 80.2817251 |
| PctTeen2Par | 78.6232333 |
| racepctblack | 70.3212797 |
| pctWInvInc | 55.5310000 |
| pctWPubAsst | 55.1605903 |
| MalePctDivorce | 47.8211990 |
| PctPopUnderPov | 44.0230884 |
| PctUnemployed | 39.7884001 |
| PctHousNoPhone | 35.8730201 |
| PctVacantBoarded | 33.5763619 |
| PctHousOwnOcc | 33.3328822 |
| PctHousLess3BR | 32.1642291 |
| PctPersDenseHous | 29.2459802 |
| NumIlleg | 29.2056695 |
| PctLess9thGrade | 21.1231604 |
| HousVacant | 20.6816310 |
| PctLargHouseFam | 15.3341087 |
| NumInShelters | 14.9067179 |
| MedNumBR | 13.6315013 |
| PctWOFullPlumb | 12.9198016 |
| perCapInc | 12.1955112 |
| PctOccupMgmtProf | 11.2535144 |
| PctEmploy | 10.3417599 |
| NumStreet | 10.3378306 |
| MedRentPctHousInc | 9.0280115 |
| PctHousOccup | 8.5714403 |
| pctWWage | 6.9674327 |
| MalePctNevMarr | 6.9603290 |
| PctNotSpeakEnglWell | 5.4220236 |
| PctOccupManu | 5.3190000 |
| PctImmigRec10 | 4.9203000 |
| PopDens | 4.8423181 |
| racePctHisp | 4.2799730 |
| NumImmig | 3.9294479 |
| blackPerCap | 1.6454549 |
| PctRecImmig8 | 0.3601896 |
| RentLowQ | 0.0000000 |
We identify the top 10 variables with the most importance to the prediction below -
knnPred <- predict(knnFit, newdata = X_test)
knnPredPerf <- postResample(pred = knnPred, obs = y_test)
knnPredPerf['Family'] <- 'Non-Linear Regression'
knnPredPerf['Model'] <- 'KNN'Model6: Support Vector Machines
SVM regression models try to find a regression line where each data point is within a threshold. The error is a parameter of the model and it will then try to minimize the coefficients used in the model. This allows the model eliminate features and have greater interpretability.
We will again use the caret package and apply a radial SVM model. This method will allow the model to find trends in the dataset.
svmRFit <- train(X_train, y_train,
method = 'svmRadial',
preProc = c('center','scale'),
tuneLength = 14,
trControl = trainControl(method = 'cv'))
svmRFitSupport Vector Machines with Radial Basis Function Kernel
1499 samples
39 predictor
Pre-processing: centered (39), scaled (39)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1349, 1348, 1350, 1350, 1349, 1349, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 0.1481972 0.6322960 0.09523909
0.50 0.1430390 0.6461203 0.09275885
1.00 0.1402594 0.6525146 0.09164402
2.00 0.1396685 0.6519714 0.09198224
4.00 0.1419226 0.6393452 0.09431469
8.00 0.1468268 0.6150199 0.09876876
16.00 0.1546668 0.5773157 0.10495410
32.00 0.1647599 0.5308168 0.11282599
64.00 0.1739939 0.4929692 0.11996609
128.00 0.1801527 0.4699293 0.12551726
256.00 0.1803473 0.4699574 0.12603844
512.00 0.1803473 0.4699574 0.12603844
1024.00 0.1803473 0.4699574 0.12603844
2048.00 0.1803473 0.4699574 0.12603844
Tuning parameter 'sigma' was held constant at a value of 0.02507349
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.02507349 and C = 2.
varimp <- varImp(svmRFit)
varimp$importance %>%
arrange(desc(Overall)) %>%
kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")| Overall | |
|---|---|
| PctIlleg | 100.0000000 |
| racePctWhite | 82.6566345 |
| PctYoungKids2Par | 80.2817251 |
| PctTeen2Par | 78.6232333 |
| racepctblack | 70.3212797 |
| pctWInvInc | 55.5310000 |
| pctWPubAsst | 55.1605903 |
| MalePctDivorce | 47.8211990 |
| PctPopUnderPov | 44.0230884 |
| PctUnemployed | 39.7884001 |
| PctHousNoPhone | 35.8730201 |
| PctVacantBoarded | 33.5763619 |
| PctHousOwnOcc | 33.3328822 |
| PctHousLess3BR | 32.1642291 |
| PctPersDenseHous | 29.2459802 |
| NumIlleg | 29.2056695 |
| PctLess9thGrade | 21.1231604 |
| HousVacant | 20.6816310 |
| PctLargHouseFam | 15.3341087 |
| NumInShelters | 14.9067179 |
| MedNumBR | 13.6315013 |
| PctWOFullPlumb | 12.9198016 |
| perCapInc | 12.1955112 |
| PctOccupMgmtProf | 11.2535144 |
| PctEmploy | 10.3417599 |
| NumStreet | 10.3378306 |
| MedRentPctHousInc | 9.0280115 |
| PctHousOccup | 8.5714403 |
| pctWWage | 6.9674327 |
| MalePctNevMarr | 6.9603290 |
| PctNotSpeakEnglWell | 5.4220236 |
| PctOccupManu | 5.3190000 |
| PctImmigRec10 | 4.9203000 |
| PopDens | 4.8423181 |
| racePctHisp | 4.2799730 |
| NumImmig | 3.9294479 |
| blackPerCap | 1.6454549 |
| PctRecImmig8 | 0.3601896 |
| RentLowQ | 0.0000000 |
We identify the top 10 variables with the most importance to the prediction below -
svmRPred <- predict(svmRFit, newdata = X_test)
svmRPredPerf <- postResample(pred = svmRPred, obs = y_test)
svmRPredPerf['Family'] <- 'Non-Linear Regression'
svmRPredPerf['Model'] <- 'SVM'Model7:Multivaraite Adaptive Regression Splines (MARS)
The MARS model finds knots to determine piece wise linear model where each new feature models an isolated portion of the data. The predictor / cut point combination that has the smallest error is used for the model in an iterative process until a stopping point is reached.
We will use the earth package to generate this MARS model.
Call: earth(x=X_train, y=y_train)
coefficients
(Intercept) 0.7263794
h(0.54-racepctblack) -0.3101390
h(0.73-pctWWage) 0.1249294
h(pctWWage-0.73) -0.2401202
h(PctPopUnderPov-0.67) -0.5455969
h(MalePctDivorce-0.39) 0.2138445
h(0.51-PctTeen2Par) 0.2990705
h(0.39-PctIlleg) -0.1481046
h(PctIlleg-0.39) 0.2448670
h(0.88-PctPersDenseHous) -0.3320613
h(0.16-HousVacant) -0.4654248
h(0.31-NumStreet) -0.4153084
Selected 12 of 21 terms, and 9 of 39 predictors
Termination condition: RSq changed by less than 0.001 at 21 terms
Importance: racepctblack, PctIlleg, MalePctDivorce, PctPersDenseHous, ...
Number of terms at each degree of interaction: 1 11 (additive model)
GCV 0.01815316 RSS 26.38294 GRSq 0.6714617 RSq 0.6810408
Visualizing MARS Model
plotmo grid: racepctblack racePctWhite racePctHisp pctWWage pctWInvInc
0.06 0.85 0.04 0.57 0.48
pctWPubAsst perCapInc blackPerCap PctPopUnderPov PctLess9thGrade PctUnemployed
0.26 0.31 0.26 0.24 0.27 0.32
PctEmploy PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr
0.51 0.37 0.4 0.47 0.4
PctYoungKids2Par PctTeen2Par NumIlleg PctIlleg NumImmig PctImmigRec10
0.7 0.61 0.01 0.17 0.01 0.43
PctRecImmig8 PctNotSpeakEnglWell PctLargHouseFam PctPersDenseHous
0.09 0.06 0.21 0.11
PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded
0.51 0.5 0.03 0.77 0.55 0.13
PctHousNoPhone PctWOFullPlumb RentLowQ MedRentPctHousInc NumInShelters
0.18 0.19 0.32 0.48 0
NumStreet PopDens
0 0.17
Variable Importance
varimp <- varImp(marsFit)
varimp %>%
arrange(desc(Overall)) %>%
kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")| Overall | |
|---|---|
| racepctblack | 100.00000 |
| PctIlleg | 65.01171 |
| MalePctDivorce | 47.29172 |
| PctPersDenseHous | 47.29172 |
| HousVacant | 29.26562 |
| pctWWage | 22.81121 |
| PctPopUnderPov | 20.13595 |
| NumStreet | 15.49737 |
| PctTeen2Par | 11.49760 |
Overall
We identify the top 10 variables with the most importance to the prediction below -
Prediction Performance
marsPred <- predict(marsFit, newdata = X_test)
marsPredPerf <- postResample(pred = marsPred, obs = y_test)
marsPredPerf['Family'] <- 'Non-Linear Regression'
marsPredPerf['Model'] <- 'MARS'Model Group C: Trees and Boosting Model
Decision Trees are a type of Supervised Machine Learning method where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes and the decision nodes are where the data is split. Since our target variable is continuous, we will be building a regression tree.
We will do the decision tree with all variables first.
Here are some of the details of the tree:
Regression tree:
rpart(formula = ViolentCrimesPerPop ~ ., data = as.data.frame(train))
Variables actually used in tree construction:
[1] MalePctDivorce NumStreet PctIlleg PctPersDenseHous
[5] PctPopUnderPov
Root node error: 82.716/1499 = 0.055181
n= 1499
CP nsplit rel error xerror xstd
1 0.412904 0 1.00000 1.00143 0.050454
2 0.063292 1 0.58710 0.59142 0.029666
3 0.054263 2 0.52380 0.54890 0.028311
4 0.022946 3 0.46954 0.50016 0.026669
5 0.019715 4 0.44660 0.48908 0.026478
6 0.011323 5 0.42688 0.46579 0.025620
7 0.010558 6 0.41556 0.48008 0.026145
8 0.010000 7 0.40500 0.47461 0.026067
To achieve improved accuracy, we need to prune the tree using the cross-validation.
CP nsplit rel error xerror xstd
1 0.41290395 0 1.0000000 1.0014260 0.05045354
2 0.06329165 1 0.5870960 0.5914199 0.02966628
3 0.05426321 2 0.5238044 0.5489023 0.02831096
4 0.02294604 3 0.4695412 0.5001609 0.02666861
5 0.01971462 4 0.4465952 0.4890764 0.02647826
6 0.01132276 5 0.4268805 0.4657860 0.02561957
7 0.01055755 6 0.4155578 0.4800769 0.02614502
8 0.01000000 7 0.4050002 0.4746088 0.02606696
The plot above shows the cross validated errors against the complexity parameters. The curve is at its lowest at 7, so we will prune our tree to a size of 7. At size 7, the error is ~0.46 and cp is 0.0104.
Predicting Pruned Tree on Test Data
Model Accuracy
confMat <- table(test2$ViolentCrimesPerPop,Prune_pred)
accuracy <- sum(diag(confMat))/sum(confMat)
accuracy[1] 0.006072874
The model accuracy seems to be very low, even after pruning the tree. This might be because our output is continuous and non-binary. However, more often than not, trees do not give very good prediction errors. Therefore, we will build out a random forest models which tend to outperform trees in terms of prediction and miss-classification errors.
## Call Variable Importance Plot Function
printVarImportance("Decision Tree", data.frame(prune_dt$variable.importance), 10)Prediction Performance
dtreePredPerf <- postResample(pred = Prune_pred, obs = y_test)
dtreePredPerf['Family'] <- 'Trees & Boosting'
dtreePredPerf['Model'] <- 'Decision Tree'Random forests are one the most popular machine learning algorithms. They can be used for classification or regression, deal with a large number of features, and generally are so successful because they provide a good predictive performance, low incidence of over-fitting, and easy interpret-ability.
In creating the best random forest model, we want to minimize the OOB error rate by finding the optimal number of variables selected at each split, known as the mtry. The below code finds the optimal mtry to use in our random forest model.
train2 <- as.data.frame(train)
oob.err=double(101)
test.err=double(101)
for (mtry in 1:101){
fit=randomForest(ViolentCrimesPerPop~.,data=train2,mtry=mtry,ntree=400, proximity=TRUE)
oob.err[mtry]=fit$mse[400]
pred=predict(fit, train2)
test.err[mtry]=with(train2, mean((ViolentCrimesPerPop-pred)^2))
cat(mtry," ")}1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
matplot(1:mtry,cbind(test.err, oob.err),pch=19, col=c("red", "blue"), type="b",ylab
="Mean Squared Error")
legend("topright",legend=c("00B", "Test"),pch=19,col=c("red", "blue"))Ideally OOB error and test errors should line up - but in this case the test error seems to be higher than the OOB.
Variable Importance
The above graph illustrates the importance of the variables used to predict the Violent Crimes Per Population. The Mean Decrease Accuracy displays how much the model accuracy decreases if we drop the variable. Here, Percentage of kids born to never married is regarded as the most important variable by a wide margin. The Mean Decrease Gini graph displays the variable importance on the Gini impurity index used for splitting trees. Again, Percentage of kids born to never married is the clear leader.
Call:
randomForest(formula = ViolentCrimesPerPop ~ ., data = train2, mtry = mtry, ntree = 400, proximity = TRUE)
Type of random forest: regression
Number of trees: 400
No. of variables tried at each split: 39
Mean of squared residuals: 0.01920442
% Var explained: 65.2
Prediction Performance
rfPredPerf <- postResample(pred = pred, obs = y_test)
rfPredPerf['Family'] <- 'Trees & Boosting'
rfPredPerf['Model'] <- 'Random Forest'Model10: Gradient Boosting Method The boosting method is uses trees in a sequential pattern. The successive tree is developed using information from the previous tree to minimize its error. The boosting method has three tuning parameters, number of trees, shrinkage parameter, and number of splits in a tree.
We will be using the stochastic gradient boosting in our model. This approach resamples the observations and columns in each iteration. We’ll use the caret workflow, which invokes the xgboost package, to automatically adjust the model parameter values, and fit the final best boosted tree that explains the best our data.
Here, we can see our parameters that are used in the model -
modelSummaryDF <- rbind(ridgePredPerf, lassoPredPerf, elnetPredPerf, nnetPredPerf, knnPredPerf, svmRPredPerf, marsPredPerf, dtreePredPerf, rfPredPerf)
SummaryDF <- modelSummaryDF[, c(4, 5, 1, 2, 3)]
rownames(SummaryDF) <- NULL
SummaryDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width="100%",height="500px")| Family | Model | RMSE | Rsquared | MAE |
|---|---|---|---|---|
| Linear Regression | Ridge Regression | 0.134023364559625 | 0.652096104816265 | 0.0931015520851205 |
| Linear Regression | Lasso Regression | 0.13432650454941 | 0.650239839854935 | 0.093460493359668 |
| Linear Regression | Elastic Net Regression | 0.134363506716829 | 0.650048180108082 | 0.093506799765122 |
| Non-Linear Regression | Neural Network | 0.138079979548364 | 0.631849583828549 | 0.0920321820156854 |
| Non-Linear Regression | KNN | 0.137746336193226 | 0.64321987228122 | 0.0935035814388041 |
| Non-Linear Regression | SVM | 0.137410561896066 | 0.64343871694592 | 0.0903724920709437 |
| Non-Linear Regression | MARS | 0.135214031166373 | 0.645391442964902 | 0.0932388471415975 |
| Trees & Boosting | Decision Tree | 0.153357244851902 | 0.545361936338709 | 0.107234983234218 |
| Trees & Boosting | Random Forest | NA | 0.00235260760870183 | NA |
All the linear regression models - Ridge, Lasso and Elastic Net that we built turned out to be very good models with a multiple R2 of 65%+. Non-Linear regression models like MARS, SVM and KNN also had R2 of 64% and above. Amongst the Tree models, XGBoost performed the best with 63.8% R2. All the above-mentioned models have prediction powers in the same range. Decision tree is great for interpretation but significantly under-performs all other models.
The final model, random forest model has a high Percentage of variance explained of 65.2% and a low mean squared residual (0.019).
But purely based on R2, Ridge regression model is the best pick out of all the models that we have tried.
Violent crime stories often dominate the news and shake our collective sense of safety, prompting society to re-examine long-held assumptions about cultural, economic, and racial harmony. As these incidents increase in visibility, it becomes even more important to understand the deeper socio-economic foundations that may contribute to such crimes. With modern computational tools, we are now able to analyze dozens of factors simultaneously, uncovering patterns and relationships that were previously difficult to see. A data-driven perspective is essential if we hope to plan for safer, more resilient communities built on trust and opportunity.
The Communities and Crime normalized dataset from the UCI Machine Learning Repository offers a rich environment for examining how diverse socio-economic variables shape violent crime rates across the United States. Through the application of contemporary machine learning methods, we were able to build predictive models with strong accuracy and identify key drivers of violent crime. A consistent theme that emerged from feature importance rankings was the role of family stability: areas with high rates of children born to never-married parents, limited educational attainment, and dense living conditions tended to experience higher levels of violent crime. While some features related to race surfaced as influential, these signals appear more reflective of long-standing economic disparities between racial groups rather than inherent racial factors themselves.
However, the study is not without limitations. Several models show signs of overfitting due to the use of the full dataset for both training and evaluation. The random forest model—our recommended option for scalability—would benefit from systematic hyperparameter tuning and proper train-test splits to strengthen generalizability. As more data becomes available, ongoing recalibration will be essential. Future work could explore more advanced modeling techniques, such as neural networks or ensemble approaches, to improve predictive performance and deepen our understanding of the socio-economic dynamics behind violent crime.
[1) UCI Machine Learning Repository Link] (https://archive.ics.uci.edu/ml/datasets/communities+and+crime)