library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(class)
library(rpart)
library(rpart.plot)
library(naniar)
library(xgboost)
library(usmap)
library(DiagrammeR)
library(earth)
library(plotly)
library(wordcloud)
library(RColorBrewer)
library(glmnet)
library(Hmisc)
library(car)
For the final project, we will be working with the Communities & Crime dataset, covering communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.
Every major violent crime that makes the news brings new notions and beliefs that alter public opinion about neighborhood safety and the socio-economic factors that influence the normal functioning of society. For example, with the increase in race-related crimes in our society, it has become imperative to understand whether a region's racial profile has anything to do with its crime rate. Other concerns besides race include socio-economic status, religious beliefs, and the percentage of the population composed of immigrants.
Understanding where violent crimes occur in terms of the socio-economic, environmental, and demographic characteristics of the reported regions can be crucial to deciphering why these crimes occur. These characteristics may well help predict in advance where violent crimes are likely to occur, through predictive models that quantify the risk associated with a region. This would greatly help city planning by allowing police forces to be deployed effectively where danger is imminent. Determining the key drivers behind rising violent crime will provide invaluable inputs for urban development and the design of policing practices.
The ‘Communities and Crime’ dataset, made available by the University of California, Irvine’s Machine Learning Repository, provides an excellent opportunity to test some of these preconceived notions prevalent in our society today with regard to race and crime. It could also be helpful in building predictive models that support better city planning and crime reduction. Although the sources in this dataset date back to 1990 and 1995, the directional insights we get from it should still hold true.
Many variables are included in the dataset so that algorithms that select or learn weights for attributes could be tested. However, clearly unrelated attributes were not included; attributes were picked if there was any plausible connection to crime (N=122), plus the attribute to be predicted (Per Capita Violent Crimes). The variables in the dataset describe the community (such as the percent of the population considered urban and the median family income) and law enforcement (such as the per capita number of police officers and the percent of officers assigned to drug units).
The per capita violent crimes variable was calculated using population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. There was apparently some controversy in some states concerning the counting of rapes, which led to missing rape values and, consequently, incorrect per capita violent crime values. The affected communities are not included in the dataset; many of them were from the midwestern USA.
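To make the target's construction concrete, here is a hypothetical back-of-the-envelope calculation (the counts and population below are invented purely for illustration):
# Hypothetical illustration of the target construction (made-up counts)
violent_crimes <- c(murder = 10, rape = 25, robbery = 120, assault = 300)
population <- 85000
sum(violent_crimes) / population * 100000  # violent crimes per 100K population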
Data is described below based on original values. All numeric data was normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. Attributes retain their distribution and skew (hence, for example, the population attribute has a mean value of 0.06 because most communities are small). E.g., an attribute described as ‘mean people per household’ is actually the normalized (0-1) version of that value.
The normalization preserves rough ratios of values WITHIN an attribute (e.g. double the value for double the population, within the available precision), except for extreme values: all values more than 3 SD above the mean are normalized to 1.00, and all values more than 3 SD below the mean are normalized to 0.00.
However, the normalization does not preserve relationships between values BETWEEN attributes (e.g. it would not be meaningful to compare the value for whitePerCap with the value for blackPerCap for a community)
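A rough reconstruction of this normalization scheme might look like the sketch below (our own code, not the dataset authors' preprocessing; normalize_01 is a hypothetical helper):
# Sketch: equal-interval scaling to [0, 1] with clipping at +/- 3 standard deviations
normalize_01 <- function(x) {
  mu <- mean(x, na.rm = TRUE)
  sdev <- sd(x, na.rm = TRUE)
  x <- pmin(pmax(x, mu - 3 * sdev), mu + 3 * sdev)  # values beyond 3 SD pin to 0 or 1
  (x - (mu - 3 * sdev)) / (6 * sdev)                # map the clipped range onto [0, 1]
}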
A limitation was that the LEMAS survey was of the police departments with at least 100 officers, plus a random sample of smaller departments. For our purposes, communities not found in both census and crime datasets were omitted. Many communities are missing LEMAS data.
The target variable is ‘Violent Crimes Per Population’.
The crime dataset has 128 attributes. Below is a brief description of all attributes along with data type details -
datadict <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/data_dictionary.csv')

datadict %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="100%",height="500px")
No. | Column | Description | Data Type |
---|---|---|---|
1 | state | US state (by number) - not counted as predictive above, but if considered, should be considered nominal | nominal |
2 | county | numeric code for county - not predictive, and many missing values | numeric |
3 | community | numeric code for community - not predictive and many missing values | numeric |
4 | communityname | community name - not predictive - for information only | string |
5 | fold | fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive | numeric |
6 | population | population for community | numeric - decimal |
7 | householdsize | mean people per household | numeric - decimal |
8 | racepctblack | percentage of population that is african american | numeric - decimal |
9 | racePctWhite | percentage of population that is caucasian | numeric - decimal |
10 | racePctAsian | percentage of population that is of asian heritage | numeric - decimal |
11 | racePctHisp | percentage of population that is of hispanic heritage | numeric - decimal |
12 | agePct12t21 | percentage of population that is 12-21 in age | numeric - decimal |
13 | agePct12t29 | percentage of population that is 12-29 in age | numeric - decimal |
14 | agePct16t24 | percentage of population that is 16-24 in age | numeric - decimal |
15 | agePct65up | percentage of population that is 65 and over in age | numeric - decimal |
16 | numbUrban | number of people living in areas classified as urban | numeric - decimal |
17 | pctUrban | percentage of people living in areas classified as urban | numeric - decimal |
18 | medIncome | median household income | numeric - decimal |
19 | pctWWage | percentage of households with wage or salary income in 1989 | numeric - decimal |
20 | pctWFarmSelf | percentage of households with farm or self employment income in 1989 | numeric - decimal |
21 | pctWInvInc | percentage of households with investment / rent income in 1989 | numeric - decimal |
22 | pctWSocSec | percentage of households with social security income in 1989 | numeric - decimal |
23 | pctWPubAsst | percentage of households with public assistance income in 1989 | numeric - decimal |
24 | pctWRetire | percentage of households with retirement income in 1989 | numeric - decimal |
25 | medFamInc | median family income (differs from household income for non-family households) | numeric - decimal |
26 | perCapInc | per capita income | numeric - decimal |
27 | whitePerCap | per capita income for caucasians | numeric - decimal |
28 | blackPerCap | per capita income for african americans | numeric - decimal |
29 | indianPerCap | per capita income for native americans | numeric - decimal |
30 | AsianPerCap | per capita income for people with asian heritage | numeric - decimal |
31 | OtherPerCap | per capita income for people with ‘other’ heritage | numeric - decimal |
32 | HispPerCap | per capita income for people with hispanic heritage | numeric - decimal |
33 | NumUnderPov | number of people under the poverty level | numeric - decimal |
34 | PctPopUnderPov | percentage of people under the poverty level | numeric - decimal |
35 | PctLess9thGrade | percentage of people 25 and over with less than a 9th grade education | numeric - decimal |
36 | PctNotHSGrad | percentage of people 25 and over that are not high school graduates | numeric - decimal |
37 | PctBSorMore | percentage of people 25 and over with a bachelors degree or higher education | numeric - decimal |
38 | PctUnemployed | percentage of people 16 and over, in the labor force, and unemployed | numeric - decimal |
39 | PctEmploy | percentage of people 16 and over who are employed | numeric - decimal |
40 | PctEmplManu | percentage of people 16 and over who are employed in manufacturing | numeric - decimal |
41 | PctEmplProfServ | percentage of people 16 and over who are employed in professional services | numeric - decimal |
42 | PctOccupManu | percentage of people 16 and over who are employed in manufacturing | numeric - decimal |
43 | PctOccupMgmtProf | percentage of people 16 and over who are employed in management or professional occupations | numeric - decimal |
44 | MalePctDivorce | percentage of males who are divorced | numeric - decimal |
45 | MalePctNevMarr | percentage of males who have never married | numeric - decimal |
46 | FemalePctDiv | percentage of females who are divorced | numeric - decimal |
47 | TotalPctDiv | percentage of population who are divorced | numeric - decimal |
48 | PersPerFam | mean number of people per family | numeric - decimal |
49 | PctFam2Par | percentage of families (with kids) that are headed by two parents | numeric - decimal |
50 | PctKids2Par | percentage of kids in family housing with two parents | numeric - decimal |
51 | PctYoungKids2Par | percent of kids 4 and under in two parent households | numeric - decimal |
52 | PctTeen2Par | percent of kids age 12-17 in two parent households | numeric - decimal |
53 | PctWorkMomYoungKids | percentage of moms of kids 6 and under in labor force | numeric - decimal |
54 | PctWorkMom | percentage of moms of kids under 18 in labor force | numeric - decimal |
55 | NumIlleg | number of kids born to never married | numeric - decimal |
56 | PctIlleg | percentage of kids born to never married | numeric - decimal |
57 | NumImmig | total number of people known to be foreign born | numeric - decimal |
58 | PctImmigRecent | percentage of immigrants who immigrated within last 3 years | numeric - decimal |
59 | PctImmigRec5 | percentage of immigrants who immigrated within last 5 years | numeric - decimal |
60 | PctImmigRec8 | percentage of immigrants who immigrated within last 8 years | numeric - decimal |
61 | PctImmigRec10 | percentage of immigrants who immigrated within last 10 years | numeric - decimal |
62 | PctRecentImmig | percent of population who have immigrated within the last 3 years | numeric - decimal |
63 | PctRecImmig5 | percent of population who have immigrated within the last 5 years | numeric - decimal |
64 | PctRecImmig8 | percent of population who have immigrated within the last 8 years | numeric - decimal |
65 | PctRecImmig10 | percent of population who have immigrated within the last 10 years | numeric - decimal |
66 | PctSpeakEnglOnly | percent of people who speak only English | numeric - decimal |
67 | PctNotSpeakEnglWell | percent of people who do not speak English well | numeric - decimal |
68 | PctLargHouseFam | percent of family households that are large (6 or more) | numeric - decimal |
69 | PctLargHouseOccup | percent of all occupied households that are large (6 or more people) | numeric - decimal |
70 | PersPerOccupHous | mean persons per household | numeric - decimal |
71 | PersPerOwnOccHous | mean persons per owner occupied household | numeric - decimal |
72 | PersPerRentOccHous | mean persons per rental household | numeric - decimal |
73 | PctPersOwnOccup | percent of people in owner occupied households | numeric - decimal |
74 | PctPersDenseHous | percent of persons in dense housing (more than 1 person per room) | numeric - decimal |
75 | PctHousLess3BR | percent of housing units with less than 3 bedrooms | numeric - decimal |
76 | MedNumBR | median number of bedrooms | numeric - decimal |
77 | HousVacant | number of vacant households | numeric - decimal |
78 | PctHousOccup | percent of housing occupied | numeric - decimal |
79 | PctHousOwnOcc | percent of households owner occupied | numeric - decimal |
80 | PctVacantBoarded | percent of vacant housing that is boarded up | numeric - decimal |
81 | PctVacMore6Mos | percent of vacant housing that has been vacant more than 6 months | numeric - decimal |
82 | MedYrHousBuilt | median year housing units built | numeric - decimal |
83 | PctHousNoPhone | percent of occupied housing units without phone (in 1990, this was rare!) | numeric - decimal |
84 | PctWOFullPlumb | percent of housing without complete plumbing facilities | numeric - decimal |
85 | OwnOccLowQuart | owner occupied housing - lower quartile value | numeric - decimal |
86 | OwnOccMedVal | owner occupied housing - median value | numeric - decimal |
87 | OwnOccHiQuart | owner occupied housing - upper quartile value | numeric - decimal |
88 | RentLowQ | rental housing - lower quartile rent | numeric - decimal |
89 | RentMedian | rental housing - median rent (Census variable H32B from file STF1A) | numeric - decimal |
90 | RentHighQ | rental housing - upper quartile rent | numeric - decimal |
91 | MedRent | median gross rent (Census variable H43A from file STF3A - includes utilities) | numeric - decimal |
92 | MedRentPctHousInc | median gross rent as a percentage of household income | numeric - decimal |
93 | MedOwnCostPctInc | median owners cost as a percentage of household income - for owners with a mortgage | numeric - decimal |
94 | MedOwnCostPctIncNoMtg | median owners cost as a percentage of household income - for owners without a mortgage | numeric - decimal |
95 | NumInShelters | number of people in homeless shelters | numeric - decimal |
96 | NumStreet | number of homeless people counted in the street | numeric - decimal |
97 | PctForeignBorn | percent of people foreign born | numeric - decimal |
98 | PctBornSameState | percent of people born in the same state as currently living | numeric - decimal |
99 | PctSameHouse85 | percent of people living in the same house as in 1985 (5 years before) | numeric - decimal |
100 | PctSameCity85 | percent of people living in the same city as in 1985 (5 years before) | numeric - decimal |
101 | PctSameState85 | percent of people living in the same state as in 1985 (5 years before) | numeric - decimal |
102 | LemasSwornFT | number of sworn full time police officers | numeric - decimal |
103 | LemasSwFTPerPop | sworn full time police officers per 100K population | numeric - decimal |
104 | LemasSwFTFieldOps | number of sworn full time police officers in field operations (on the street as opposed to administrative etc) | numeric - decimal |
105 | LemasSwFTFieldPerPop | sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population | numeric - decimal |
106 | LemasTotalReq | total requests for police | numeric - decimal |
107 | LemasTotReqPerPop | total requests for police per 100K population | numeric - decimal |
108 | PolicReqPerOffic | total requests for police per police officer | numeric - decimal |
109 | PolicPerPop | police officers per 100K population | numeric - decimal |
110 | RacialMatchCommPol | a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar | numeric - decimal |
111 | PctPolicWhite | percent of police that are caucasian | numeric - decimal |
112 | PctPolicBlack | percent of police that are african american | numeric - decimal |
113 | PctPolicHisp | percent of police that are hispanic | numeric - decimal |
114 | PctPolicAsian | percent of police that are asian | numeric - decimal |
115 | PctPolicMinor | percent of police that are minority of any kind | numeric - decimal |
116 | OfficAssgnDrugUnits | number of officers assigned to special drug units | numeric - decimal |
117 | NumKindsDrugsSeiz | number of different kinds of drugs seized | numeric - decimal |
118 | PolicAveOTWorked | police average overtime worked | numeric - decimal |
119 | LandArea | land area in square miles | numeric - decimal |
120 | PopDens | population density in persons per square mile | numeric - decimal |
121 | PctUsePubTrans | percent of people using public transit for commuting | numeric - decimal |
122 | PolicCars | number of police cars | numeric - decimal |
123 | PolicOperBudg | police operating budget | numeric - decimal |
124 | LemasPctPolicOnPatr | percent of sworn full time police officers on patrol | numeric - decimal |
125 | LemasGangUnitDeploy | gang unit deployed | numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time |
126 | LemasPctOfficDrugUn | percent of officers assigned to drug units | numeric - decimal |
127 | PolicBudgPerPop | police operating budget per population | numeric - decimal |
128 | ViolentCrimesPerPop | total number of violent crimes per 100K population (numeric - decimal) GOAL attribute | to be predicted |
We also loaded the supplementary state-level dataset from the UCI Machine Learning Repository to enrich our crime dataset with state-level information -
states <- read.csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/states.csv')
names(states) <- c("state_code","state","stateName","stateENS")

states %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="80%",height="300px")
state_code | state | stateName | stateENS |
---|---|---|---|
1 | AL | Alabama | 1779775 |
2 | AK | Alaska | 1785533 |
4 | AZ | Arizona | 1779777 |
5 | AR | Arkansas | 68085 |
6 | CA | California | 1779778 |
8 | CO | Colorado | 1779779 |
9 | CT | Connecticut | 1779780 |
10 | DE | Delaware | 1779781 |
11 | DC | District of Columbia | 1702382 |
12 | FL | Florida | 294478 |
13 | GA | Georgia | 1705317 |
15 | HI | Hawaii | 1779782 |
16 | ID | Idaho | 1779783 |
17 | IL | Illinois | 1779784 |
18 | IN | Indiana | 448508 |
19 | IA | Iowa | 1779785 |
20 | KS | Kansas | 481813 |
21 | KY | Kentucky | 1779786 |
22 | LA | Louisiana | 1629543 |
23 | ME | Maine | 1779787 |
24 | MD | Maryland | 1714934 |
25 | MA | Massachusetts | 606926 |
26 | MI | Michigan | 1779789 |
27 | MN | Minnesota | 662849 |
28 | MS | Mississippi | 1779790 |
29 | MO | Missouri | 1779791 |
30 | MT | Montana | 767982 |
31 | NE | Nebraska | 1779792 |
32 | NV | Nevada | 1779793 |
33 | NH | New Hampshire | 1779794 |
34 | NJ | New Jersey | 1779795 |
35 | NM | New Mexico | 897535 |
36 | NY | New York | 1779796 |
37 | NC | North Carolina | 1027616 |
38 | ND | North Dakota | 1779797 |
39 | OH | Ohio | 1085497 |
40 | OK | Oklahoma | 1102857 |
41 | OR | Oregon | 1155107 |
42 | PA | Pennsylvania | 1779798 |
44 | RI | Rhode Island | 1219835 |
45 | SC | South Carolina | 1779799 |
46 | SD | South Dakota | 1785534 |
47 | TN | Tennessee | 1325873 |
48 | TX | Texas | 1779801 |
49 | UT | Utah | 1455989 |
50 | VT | Vermont | 1779802 |
51 | VA | Virginia | 1779803 |
53 | WA | Washington | 1779804 |
54 | WV | West Virginia | 1779805 |
55 | WI | Wisconsin | 1779806 |
56 | WY | Wyoming | 1779807 |
60 | AS | American Samoa | 1802701 |
66 | GU | Guam | 1802705 |
69 | MP | Northern Mariana Islands | 1779809 |
72 | PR | Puerto Rico | 1779808 |
74 | UM | U.S. Minor Outlying Islands | 1878752 |
78 | VI | U.S. Virgin Islands | 1802710 |
dataset <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/FinalProject/communities.data.csv')

head(dataset) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="100%",height="300px")
state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | pctUrban | medIncome | pctWWage | pctWFarmSelf | pctWInvInc | pctWSocSec | pctWPubAsst | pctWRetire | medFamInc | perCapInc | whitePerCap | blackPerCap | indianPerCap | AsianPerCap | OtherPerCap | HispPerCap | NumUnderPov | PctPopUnderPov | PctLess9thGrade | PctNotHSGrad | PctBSorMore | PctUnemployed | PctEmploy | PctEmplManu | PctEmplProfServ | PctOccupManu | PctOccupMgmtProf | MalePctDivorce | MalePctNevMarr | FemalePctDiv | TotalPctDiv | PersPerFam | PctFam2Par | PctKids2Par | PctYoungKids2Par | PctTeen2Par | PctWorkMomYoungKids | PctWorkMom | NumIlleg | PctIlleg | NumImmig | PctImmigRecent | PctImmigRec5 | PctImmigRec8 | PctImmigRec10 | PctRecentImmig | PctRecImmig5 | PctRecImmig8 | PctRecImmig10 | PctSpeakEnglOnly | PctNotSpeakEnglWell | PctLargHouseFam | PctLargHouseOccup | PersPerOccupHous | PersPerOwnOccHous | PersPerRentOccHous | PctPersOwnOccup | PctPersDenseHous | PctHousLess3BR | MedNumBR | HousVacant | PctHousOccup | PctHousOwnOcc | PctVacantBoarded | PctVacMore6Mos | MedYrHousBuilt | PctHousNoPhone | PctWOFullPlumb | OwnOccLowQuart | OwnOccMedVal | OwnOccHiQuart | RentLowQ | RentMedian | RentHighQ | MedRent | MedRentPctHousInc | MedOwnCostPctInc | MedOwnCostPctIncNoMtg | NumInShelters | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LemasSwornFT | LemasSwFTPerPop | LemasSwFTFieldOps | LemasSwFTFieldPerPop | LemasTotalReq | LemasTotReqPerPop | PolicReqPerOffic | PolicPerPop | RacialMatchCommPol | PctPolicWhite | PctPolicBlack | PctPolicHisp | PctPolicAsian | PctPolicMinor | OfficAssgnDrugUnits | NumKindsDrugsSeiz | PolicAveOTWorked | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | ? | ? | Lakewoodcity | 1 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | 0.32 | 0.20 | 1.0 | 0.37 | 0.72 | 0.34 | 0.60 | 0.29 | 0.15 | 0.43 | 0.39 | 0.40 | 0.39 | 0.32 | 0.27 | 0.27 | 0.36 | 0.41 | 0.08 | 0.19 | 0.10 | 0.18 | 0.48 | 0.27 | 0.68 | 0.23 | 0.41 | 0.25 | 0.52 | 0.68 | 0.40 | 0.75 | 0.75 | 0.35 | 0.55 | 0.59 | 0.61 | 0.56 | 0.74 | 0.76 | 0.04 | 0.14 | 0.03 | 0.24 | 0.27 | 0.37 | 0.39 | 0.07 | 0.07 | 0.08 | 0.08 | 0.89 | 0.06 | 0.14 | 0.13 | 0.33 | 0.39 | 0.28 | 0.55 | 0.09 | 0.51 | 0.5 | 0.21 | 0.71 | 0.52 | 0.05 | 0.26 | 0.65 | 0.14 | 0.06 | 0.22 | 0.19 | 0.18 | 0.36 | 0.35 | 0.38 | 0.34 | 0.38 | 0.46 | 0.25 | 0.04 | 0 | 0.12 | 0.42 | 0.50 | 0.51 | 0.64 | 0.03 | 0.13 | 0.96 | 0.17 | 0.06 | 0.18 | 0.44 | 0.13 | 0.94 | 0.93 | 0.03 | 0.07 | 0.1 | 0.07 | 0.02 | 0.57 | 0.29 | 0.12 | 0.26 | 0.20 | 0.06 | 0.04 | 0.9 | 0.5 | 0.32 | 0.14 | 0.20 |
53 | ? | ? | Tukwilacity | 1 | 0.00 | 0.16 | 0.12 | 0.74 | 0.45 | 0.07 | 0.26 | 0.59 | 0.35 | 0.27 | 0.02 | 1.0 | 0.31 | 0.72 | 0.11 | 0.45 | 0.25 | 0.29 | 0.39 | 0.29 | 0.37 | 0.38 | 0.33 | 0.16 | 0.30 | 0.22 | 0.35 | 0.01 | 0.24 | 0.14 | 0.24 | 0.30 | 0.27 | 0.73 | 0.57 | 0.15 | 0.42 | 0.36 | 1.00 | 0.63 | 0.91 | 1.00 | 0.29 | 0.43 | 0.47 | 0.60 | 0.39 | 0.46 | 0.53 | 0.00 | 0.24 | 0.01 | 0.52 | 0.62 | 0.64 | 0.63 | 0.25 | 0.27 | 0.25 | 0.23 | 0.84 | 0.10 | 0.16 | 0.10 | 0.17 | 0.29 | 0.17 | 0.26 | 0.20 | 0.82 | 0.0 | 0.02 | 0.79 | 0.24 | 0.02 | 0.25 | 0.65 | 0.16 | 0.00 | 0.21 | 0.20 | 0.21 | 0.42 | 0.38 | 0.40 | 0.37 | 0.29 | 0.32 | 0.18 | 0.00 | 0 | 0.21 | 0.50 | 0.34 | 0.60 | 0.52 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.02 | 0.12 | 0.45 | ? | ? | ? | ? | 0.00 | ? | 0.67 |
24 | ? | ? | Aberdeentown | 1 | 0.00 | 0.42 | 0.49 | 0.56 | 0.17 | 0.04 | 0.39 | 0.47 | 0.28 | 0.32 | 0.00 | 0.0 | 0.30 | 0.58 | 0.19 | 0.39 | 0.38 | 0.40 | 0.84 | 0.28 | 0.27 | 0.29 | 0.27 | 0.07 | 0.29 | 0.28 | 0.39 | 0.01 | 0.27 | 0.27 | 0.43 | 0.19 | 0.36 | 0.58 | 0.32 | 0.29 | 0.49 | 0.32 | 0.63 | 0.41 | 0.71 | 0.70 | 0.45 | 0.42 | 0.44 | 0.43 | 0.43 | 0.71 | 0.67 | 0.01 | 0.46 | 0.00 | 0.07 | 0.06 | 0.15 | 0.19 | 0.02 | 0.02 | 0.04 | 0.05 | 0.88 | 0.04 | 0.20 | 0.20 | 0.46 | 0.52 | 0.43 | 0.42 | 0.15 | 0.51 | 0.5 | 0.01 | 0.86 | 0.41 | 0.29 | 0.30 | 0.52 | 0.47 | 0.45 | 0.18 | 0.17 | 0.16 | 0.27 | 0.29 | 0.27 | 0.31 | 0.48 | 0.39 | 0.28 | 0.00 | 0 | 0.14 | 0.49 | 0.54 | 0.67 | 0.56 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.01 | 0.21 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.43 |
34 | 5 | 81440 | Willingborotownship | 1 | 0.04 | 0.77 | 1.00 | 0.08 | 0.12 | 0.10 | 0.51 | 0.50 | 0.34 | 0.21 | 0.06 | 1.0 | 0.58 | 0.89 | 0.21 | 0.43 | 0.36 | 0.20 | 0.82 | 0.51 | 0.36 | 0.40 | 0.39 | 0.16 | 0.25 | 0.36 | 0.44 | 0.01 | 0.10 | 0.09 | 0.25 | 0.31 | 0.33 | 0.71 | 0.36 | 0.45 | 0.37 | 0.39 | 0.34 | 0.45 | 0.49 | 0.44 | 0.75 | 0.65 | 0.54 | 0.83 | 0.65 | 0.85 | 0.86 | 0.03 | 0.33 | 0.02 | 0.11 | 0.20 | 0.30 | 0.31 | 0.05 | 0.08 | 0.11 | 0.11 | 0.81 | 0.08 | 0.56 | 0.62 | 0.85 | 0.77 | 1.00 | 0.94 | 0.12 | 0.01 | 0.5 | 0.01 | 0.97 | 0.96 | 0.60 | 0.47 | 0.52 | 0.11 | 0.11 | 0.24 | 0.21 | 0.19 | 0.75 | 0.70 | 0.77 | 0.89 | 0.63 | 0.51 | 0.47 | 0.00 | 0 | 0.19 | 0.30 | 0.73 | 0.64 | 0.65 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.02 | 0.39 | 0.28 | ? | ? | ? | ? | 0.00 | ? | 0.12 |
42 | 95 | 6096 | Bethlehemtownship | 1 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | 0.05 | 0.38 | 0.38 | 0.23 | 0.36 | 0.02 | 0.9 | 0.50 | 0.72 | 0.16 | 0.68 | 0.44 | 0.11 | 0.71 | 0.46 | 0.43 | 0.41 | 0.28 | 0.00 | 0.74 | 0.51 | 0.48 | 0.00 | 0.06 | 0.25 | 0.30 | 0.33 | 0.12 | 0.65 | 0.67 | 0.38 | 0.42 | 0.46 | 0.22 | 0.27 | 0.20 | 0.21 | 0.51 | 0.91 | 0.91 | 0.89 | 0.85 | 0.40 | 0.60 | 0.00 | 0.06 | 0.00 | 0.03 | 0.07 | 0.20 | 0.27 | 0.01 | 0.02 | 0.04 | 0.05 | 0.88 | 0.05 | 0.16 | 0.19 | 0.59 | 0.60 | 0.37 | 0.89 | 0.02 | 0.19 | 0.5 | 0.01 | 0.89 | 0.87 | 0.04 | 0.55 | 0.73 | 0.05 | 0.14 | 0.31 | 0.31 | 0.30 | 0.40 | 0.36 | 0.38 | 0.38 | 0.22 | 0.51 | 0.21 | 0.00 | 0 | 0.11 | 0.72 | 0.64 | 0.61 | 0.53 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.04 | 0.09 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.03 |
6 | ? | ? | SouthPasadenacity | 1 | 0.02 | 0.28 | 0.06 | 0.54 | 1.00 | 0.25 | 0.31 | 0.48 | 0.27 | 0.37 | 0.04 | 1.0 | 0.52 | 0.68 | 0.20 | 0.61 | 0.28 | 0.15 | 0.25 | 0.62 | 0.72 | 0.76 | 0.77 | 0.28 | 0.52 | 0.48 | 0.60 | 0.01 | 0.12 | 0.13 | 0.12 | 0.80 | 0.10 | 0.65 | 0.19 | 0.77 | 0.06 | 0.91 | 0.49 | 0.57 | 0.61 | 0.58 | 0.44 | 0.62 | 0.69 | 0.87 | 0.53 | 0.30 | 0.43 | 0.00 | 0.11 | 0.04 | 0.30 | 0.35 | 0.43 | 0.47 | 0.50 | 0.50 | 0.56 | 0.57 | 0.45 | 0.28 | 0.25 | 0.19 | 0.29 | 0.53 | 0.18 | 0.39 | 0.26 | 0.73 | 0.0 | 0.02 | 0.84 | 0.30 | 0.16 | 0.28 | 0.25 | 0.02 | 0.05 | 0.94 | 1.00 | 1.00 | 0.67 | 0.63 | 0.68 | 0.62 | 0.47 | 0.59 | 0.11 | 0.00 | 0 | 0.70 | 0.42 | 0.49 | 0.73 | 0.64 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.01 | 0.58 | 0.10 | ? | ? | ? | ? | 0.00 | ? | 0.14 |
dim(dataset)
## [1] 1994 128
The dataset has 1994 observations and 128 variables.
The dataset contains no duplicate records, but it contains a lot of missing values, represented by “?”. We need to convert these entries into NA values to begin our analysis.
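A quick sanity check of the no-duplicates claim (our own sketch) is:
# Count fully duplicated rows; expected to be 0
sum(duplicated(dataset))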
The first 5 non-predictive variables represent the name of the community, the state in which it is located, the county code, the community code, and the fold number for non-random 10 fold cross validation. A summary of the remaining variables is below:
# Replace '?' placeholders with NA
dataset[dataset == '?'] = NA

summary(dataset) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | pctUrban | medIncome | pctWWage | pctWFarmSelf | pctWInvInc | pctWSocSec | pctWPubAsst | pctWRetire | medFamInc | perCapInc | whitePerCap | blackPerCap | indianPerCap | AsianPerCap | OtherPerCap | HispPerCap | NumUnderPov | PctPopUnderPov | PctLess9thGrade | PctNotHSGrad | PctBSorMore | PctUnemployed | PctEmploy | PctEmplManu | PctEmplProfServ | PctOccupManu | PctOccupMgmtProf | MalePctDivorce | MalePctNevMarr | FemalePctDiv | TotalPctDiv | PersPerFam | PctFam2Par | PctKids2Par | PctYoungKids2Par | PctTeen2Par | PctWorkMomYoungKids | PctWorkMom | NumIlleg | PctIlleg | NumImmig | PctImmigRecent | PctImmigRec5 | PctImmigRec8 | PctImmigRec10 | PctRecentImmig | PctRecImmig5 | PctRecImmig8 | PctRecImmig10 | PctSpeakEnglOnly | PctNotSpeakEnglWell | PctLargHouseFam | PctLargHouseOccup | PersPerOccupHous | PersPerOwnOccHous | PersPerRentOccHous | PctPersOwnOccup | PctPersDenseHous | PctHousLess3BR | MedNumBR | HousVacant | PctHousOccup | PctHousOwnOcc | PctVacantBoarded | PctVacMore6Mos | MedYrHousBuilt | PctHousNoPhone | PctWOFullPlumb | OwnOccLowQuart | OwnOccMedVal | OwnOccHiQuart | RentLowQ | RentMedian | RentHighQ | MedRent | MedRentPctHousInc | MedOwnCostPctInc | MedOwnCostPctIncNoMtg | NumInShelters | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LemasSwornFT | LemasSwFTPerPop | LemasSwFTFieldOps | LemasSwFTFieldPerPop | LemasTotalReq | LemasTotReqPerPop | PolicReqPerOffic | PolicPerPop | RacialMatchCommPol | PctPolicWhite | PctPolicBlack | PctPolicHisp | PctPolicAsian | PctPolicMinor | OfficAssgnDrugUnits | NumKindsDrugsSeiz | PolicAveOTWorked | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 1.00 | Length:1994 | Length:1994 | Length:1994 | Min. : 1.000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Min. :0.0000 | Min. :0.00000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Min. :0.00000 | Length:1994 | Min. :0.000 | |
1st Qu.:12.00 | Class :character | Class :character | Class :character | 1st Qu.: 3.000 | 1st Qu.:0.01000 | 1st Qu.:0.3500 | 1st Qu.:0.0200 | 1st Qu.:0.6300 | 1st Qu.:0.0400 | 1st Qu.:0.010 | 1st Qu.:0.3400 | 1st Qu.:0.4100 | 1st Qu.:0.2500 | 1st Qu.:0.3000 | 1st Qu.:0.00000 | 1st Qu.:0.0000 | 1st Qu.:0.2000 | 1st Qu.:0.4400 | 1st Qu.:0.1600 | 1st Qu.:0.3700 | 1st Qu.:0.3500 | 1st Qu.:0.1425 | 1st Qu.:0.3600 | 1st Qu.:0.2300 | 1st Qu.:0.2200 | 1st Qu.:0.240 | 1st Qu.:0.1725 | 1st Qu.:0.1100 | 1st Qu.:0.1900 | Class :character | 1st Qu.:0.2600 | 1st Qu.:0.01000 | 1st Qu.:0.110 | 1st Qu.:0.1600 | 1st Qu.:0.2300 | 1st Qu.:0.2100 | 1st Qu.:0.2200 | 1st Qu.:0.3800 | 1st Qu.:0.2500 | 1st Qu.:0.3200 | 1st Qu.:0.2400 | 1st Qu.:0.3100 | 1st Qu.:0.3300 | 1st Qu.:0.3100 | 1st Qu.:0.3600 | 1st Qu.:0.3600 | 1st Qu.:0.4000 | 1st Qu.:0.4900 | 1st Qu.:0.4900 | 1st Qu.:0.530 | 1st Qu.:0.4800 | 1st Qu.:0.3900 | 1st Qu.:0.4200 | 1st Qu.:0.00000 | 1st Qu.:0.09 | 1st Qu.:0.00000 | 1st Qu.:0.1600 | 1st Qu.:0.2000 | 1st Qu.:0.2500 | 1st Qu.:0.2800 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.7300 | 1st Qu.:0.0300 | 1st Qu.:0.1500 | 1st Qu.:0.1400 | 1st Qu.:0.3400 | 1st Qu.:0.3900 | 1st Qu.:0.2700 | 1st Qu.:0.4400 | 1st Qu.:0.0600 | 1st Qu.:0.4000 | 1st Qu.:0.0000 | 1st Qu.:0.01000 | 1st Qu.:0.6300 | 1st Qu.:0.4300 | 1st Qu.:0.0600 | 1st Qu.:0.2900 | 1st Qu.:0.3500 | 1st Qu.:0.0600 | 1st Qu.:0.1000 | 1st Qu.:0.0900 | 1st Qu.:0.0900 | 1st Qu.:0.0900 | 1st Qu.:0.1700 | 1st Qu.:0.2000 | 1st Qu.:0.220 | 1st Qu.:0.2100 | 1st Qu.:0.3700 | 1st Qu.:0.3200 | 1st Qu.:0.2500 | 1st Qu.:0.00000 | 1st Qu.:0.00000 | 1st Qu.:0.0600 | 1st Qu.:0.4700 | 1st Qu.:0.4200 | 1st Qu.:0.5200 | 1st Qu.:0.5600 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.:0.02000 | 1st Qu.:0.1000 | 1st Qu.:0.0200 | Class :character | Class :character | Class :character | Class :character | 1st Qu.:0.00000 | Class :character | 1st Qu.:0.070 | |
Median :34.00 | Mode :character | Mode :character | Mode :character | Median : 5.000 | Median :0.02000 | Median :0.4400 | Median :0.0600 | Median :0.8500 | Median :0.0700 | Median :0.040 | Median :0.4000 | Median :0.4800 | Median :0.2900 | Median :0.4200 | Median :0.03000 | Median :1.0000 | Median :0.3200 | Median :0.5600 | Median :0.2300 | Median :0.4800 | Median :0.4750 | Median :0.2600 | Median :0.4700 | Median :0.3300 | Median :0.3000 | Median :0.320 | Median :0.2500 | Median :0.1700 | Median :0.2800 | Mode :character | Median :0.3450 | Median :0.02000 | Median :0.250 | Median :0.2700 | Median :0.3600 | Median :0.3100 | Median :0.3200 | Median :0.5100 | Median :0.3700 | Median :0.4100 | Median :0.3700 | Median :0.4000 | Median :0.4700 | Median :0.4000 | Median :0.5000 | Median :0.5000 | Median :0.4700 | Median :0.6300 | Median :0.6400 | Median :0.700 | Median :0.6100 | Median :0.5100 | Median :0.5400 | Median :0.01000 | Median :0.17 | Median :0.01000 | Median :0.2900 | Median :0.3400 | Median :0.3900 | Median :0.4300 | Median :0.0900 | Median :0.0800 | Median :0.0900 | Median :0.0900 | Median :0.8700 | Median :0.0600 | Median :0.2000 | Median :0.1900 | Median :0.4400 | Median :0.4800 | Median :0.3600 | Median :0.5600 | Median :0.1100 | Median :0.5100 | Median :0.5000 | Median :0.03000 | Median :0.7700 | Median :0.5400 | Median :0.1300 | Median :0.4200 | Median :0.5200 | Median :0.1850 | Median :0.1900 | Median :0.1800 | Median :0.1700 | Median :0.1800 | Median :0.3100 | Median :0.3300 | Median :0.370 | Median :0.3400 | Median :0.4800 | Median :0.4500 | Median :0.3700 | Median :0.00000 | Median :0.00000 | Median :0.1300 | Median :0.6300 | Median :0.5400 | Median :0.6700 | Median :0.7000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median :0.04000 | Median :0.1700 | Median :0.0700 | Mode :character | Mode :character | Mode :character | Mode :character | Median :0.00000 | Mode :character | Median :0.150 | |
Mean :28.68 | NA | NA | NA | Mean : 5.494 | Mean :0.05759 | Mean :0.4634 | Mean :0.1796 | Mean :0.7537 | Mean :0.1537 | Mean :0.144 | Mean :0.4242 | Mean :0.4939 | Mean :0.3363 | Mean :0.4232 | Mean :0.06407 | Mean :0.6963 | Mean :0.3611 | Mean :0.5582 | Mean :0.2916 | Mean :0.4957 | Mean :0.4711 | Mean :0.3178 | Mean :0.4792 | Mean :0.3757 | Mean :0.3503 | Mean :0.368 | Mean :0.2911 | Mean :0.2035 | Mean :0.3224 | NA | Mean :0.3863 | Mean :0.05551 | Mean :0.303 | Mean :0.3158 | Mean :0.3833 | Mean :0.3617 | Mean :0.3635 | Mean :0.5011 | Mean :0.3964 | Mean :0.4406 | Mean :0.3912 | Mean :0.4413 | Mean :0.4612 | Mean :0.4345 | Mean :0.4876 | Mean :0.4943 | Mean :0.4877 | Mean :0.6109 | Mean :0.6207 | Mean :0.664 | Mean :0.5829 | Mean :0.5014 | Mean :0.5267 | Mean :0.03629 | Mean :0.25 | Mean :0.03006 | Mean :0.3202 | Mean :0.3606 | Mean :0.3991 | Mean :0.4279 | Mean :0.1814 | Mean :0.1821 | Mean :0.1848 | Mean :0.1829 | Mean :0.7859 | Mean :0.1506 | Mean :0.2676 | Mean :0.2519 | Mean :0.4621 | Mean :0.4944 | Mean :0.4041 | Mean :0.5626 | Mean :0.1863 | Mean :0.4952 | Mean :0.3147 | Mean :0.07682 | Mean :0.7195 | Mean :0.5487 | Mean :0.2045 | Mean :0.4333 | Mean :0.4942 | Mean :0.2645 | Mean :0.2431 | Mean :0.2647 | Mean :0.2635 | Mean :0.2689 | Mean :0.3464 | Mean :0.3725 | Mean :0.423 | Mean :0.3841 | Mean :0.4901 | Mean :0.4498 | Mean :0.4038 | Mean :0.02944 | Mean :0.02278 | Mean :0.2156 | Mean :0.6089 | Mean :0.5351 | Mean :0.6264 | Mean :0.6515 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean :0.06523 | Mean :0.2329 | Mean :0.1617 | NA | NA | NA | NA | Mean :0.09405 | NA | Mean :0.238 | |
3rd Qu.:42.00 | NA | NA | NA | 3rd Qu.: 8.000 | 3rd Qu.:0.05000 | 3rd Qu.:0.5400 | 3rd Qu.:0.2300 | 3rd Qu.:0.9400 | 3rd Qu.:0.1700 | 3rd Qu.:0.160 | 3rd Qu.:0.4700 | 3rd Qu.:0.5400 | 3rd Qu.:0.3600 | 3rd Qu.:0.5300 | 3rd Qu.:0.07000 | 3rd Qu.:1.0000 | 3rd Qu.:0.4900 | 3rd Qu.:0.6900 | 3rd Qu.:0.3700 | 3rd Qu.:0.6200 | 3rd Qu.:0.5800 | 3rd Qu.:0.4400 | 3rd Qu.:0.5800 | 3rd Qu.:0.4800 | 3rd Qu.:0.4300 | 3rd Qu.:0.440 | 3rd Qu.:0.3800 | 3rd Qu.:0.2500 | 3rd Qu.:0.4000 | NA | 3rd Qu.:0.4800 | 3rd Qu.:0.05000 | 3rd Qu.:0.450 | 3rd Qu.:0.4200 | 3rd Qu.:0.5100 | 3rd Qu.:0.4600 | 3rd Qu.:0.4800 | 3rd Qu.:0.6275 | 3rd Qu.:0.5200 | 3rd Qu.:0.5300 | 3rd Qu.:0.5100 | 3rd Qu.:0.5400 | 3rd Qu.:0.5900 | 3rd Qu.:0.5000 | 3rd Qu.:0.6200 | 3rd Qu.:0.6300 | 3rd Qu.:0.5600 | 3rd Qu.:0.7600 | 3rd Qu.:0.7800 | 3rd Qu.:0.840 | 3rd Qu.:0.7200 | 3rd Qu.:0.6200 | 3rd Qu.:0.6500 | 3rd Qu.:0.02000 | 3rd Qu.:0.32 | 3rd Qu.:0.02000 | 3rd Qu.:0.4300 | 3rd Qu.:0.4800 | 3rd Qu.:0.5300 | 3rd Qu.:0.5600 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.9400 | 3rd Qu.:0.1600 | 3rd Qu.:0.3100 | 3rd Qu.:0.2900 | 3rd Qu.:0.5500 | 3rd Qu.:0.5800 | 3rd Qu.:0.4900 | 3rd Qu.:0.7000 | 3rd Qu.:0.2200 | 3rd Qu.:0.6000 | 3rd Qu.:0.5000 | 3rd Qu.:0.07000 | 3rd Qu.:0.8600 | 3rd Qu.:0.6700 | 3rd Qu.:0.2700 | 3rd Qu.:0.5600 | 3rd Qu.:0.6700 | 3rd Qu.:0.4200 | 3rd Qu.:0.3300 | 3rd Qu.:0.4000 | 3rd Qu.:0.3900 | 3rd Qu.:0.3800 | 3rd Qu.:0.4900 | 3rd Qu.:0.5200 | 3rd Qu.:0.590 | 3rd Qu.:0.5300 | 3rd Qu.:0.5900 | 3rd Qu.:0.5800 | 3rd Qu.:0.5100 | 3rd Qu.:0.01000 | 3rd Qu.:0.00000 | 3rd Qu.:0.2800 | 3rd Qu.:0.7775 | 3rd Qu.:0.6600 | 3rd Qu.:0.7700 | 3rd Qu.:0.7900 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.:0.07000 | 3rd Qu.:0.2800 | 3rd Qu.:0.1900 | NA | NA | NA | NA | 3rd Qu.:0.00000 | NA | 3rd Qu.:0.330 | |
Max. :56.00 | NA | NA | NA | Max. :10.000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | NA | Max. :1.0000 | Max. :1.00000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | NA | NA | NA | NA | Max. :1.00000 | NA | Max. :1.000 |
Based on the above descriptive data summary, quite a few variables have missing values. So we analyzed the missing values across attributes to identify a proper imputation technique.
## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2, function(x) {sum(is.na(x)) / length(x) * 100}))

dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL
dataset_missing_counts <- dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`))

dataset_missing_counts %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature | NA_Count | NA_Percentage |
---|---|---|
LemasSwornFT | 1675 | 84.0020060 |
LemasSwFTPerPop | 1675 | 84.0020060 |
LemasSwFTFieldOps | 1675 | 84.0020060 |
LemasSwFTFieldPerPop | 1675 | 84.0020060 |
LemasTotalReq | 1675 | 84.0020060 |
LemasTotReqPerPop | 1675 | 84.0020060 |
PolicReqPerOffic | 1675 | 84.0020060 |
PolicPerPop | 1675 | 84.0020060 |
RacialMatchCommPol | 1675 | 84.0020060 |
PctPolicWhite | 1675 | 84.0020060 |
PctPolicBlack | 1675 | 84.0020060 |
PctPolicHisp | 1675 | 84.0020060 |
PctPolicAsian | 1675 | 84.0020060 |
PctPolicMinor | 1675 | 84.0020060 |
OfficAssgnDrugUnits | 1675 | 84.0020060 |
NumKindsDrugsSeiz | 1675 | 84.0020060 |
PolicAveOTWorked | 1675 | 84.0020060 |
PolicCars | 1675 | 84.0020060 |
PolicOperBudg | 1675 | 84.0020060 |
LemasPctPolicOnPatr | 1675 | 84.0020060 |
LemasGangUnitDeploy | 1675 | 84.0020060 |
PolicBudgPerPop | 1675 | 84.0020060 |
community | 1177 | 59.0270812 |
county | 1174 | 58.8766299 |
OtherPerCap | 1 | 0.0501505 |
ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) +
geom_bar(stat = 'identity', fill = 'steelblue') +
geom_label(aes(label = NA_Count)) +
labs(title = 'Missing Counts') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())
There are 22 variables missing 84% of their data and 2 variables missing roughly 59%. These variables with a high proportion of NAs (greater than 50%) were removed, as they were deemed practically useless for prediction or data exploration; imputing them is risky when the majority of the data is missing. Most of these variables came from the 1990 Law Enforcement Management and Administrative Statistics (LEMAS) survey.
We also remove the ‘fold’ variable, which was originally added for cross-validation.
colremove <- dataset_missing_counts %>%
  filter(NA_Percentage > 50) %>%
  select(Feature) %>%
  as.list()

dataset <- dataset[,!(colnames(dataset) %in% colremove$Feature)]

# removing the fold variable as it was placed for cross validation
dataset <- dataset[,names(dataset) != 'fold']
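As an optional check (a sketch using the naniar package loaded earlier), we can confirm which columns still contain missing values after these drops:
# Summarize remaining missingness per variable; only OtherPerCap should remain
naniar::miss_var_summary(dataset) %>% head()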
We also checked for the presence of degenerate (near-zero-variance) variables in the dataset and removed them.
# capturing the degenerate variables
degenCols <- nearZeroVar(dataset)

# identifying them
colnames(dataset[,degenCols])
## [1] "LemasPctOfficDrugUn"

# removing from the dataset
dataset <- dataset[,-degenCols]
Next, we converted the ‘OtherPerCap’ column to numeric and removed the single row with an NA value in that column (‘communityname’ is retained for now and dropped just before the correlation analysis) -
# communityname is dropped later, before the correlation analysis
#dataset <- dataset[,!colnames(dataset) == 'communityname']

dataset <- dataset %>%
  mutate(OtherPerCap = as.numeric(OtherPerCap))

# remove the row with a missing OtherPerCap value
dataset <- dataset[!is.na(dataset$OtherPerCap), ]
dim(dataset)
## [1] 1993 102
After removing these variables and the one incomplete row, our dataset is reduced to 1993 observations and 102 variables.
To start our visual exploratory data analysis, we joined the crime dataset with the state-level enrichment dataset -
# Join with states dataframe
dataset1 <- left_join(dataset, states,
                      by = c("state" = "state_code"))
dataset1 <- rename(dataset1, state_abbr = state.y)
The District of Columbia leads in terms of violent crime rates in the United States. Louisiana has the second-highest crime rate after DC, followed by South Carolina, Maryland, and Florida. DC, however, has a much higher crime rate than the other states, which have roughly similar rates; New York and Florida, meanwhile, have very high populations in comparison to the others.
# aggregating the Average ViolentCrimesPerPop by State
crime_rate_by_state <- dataset1 %>%
  group_by(state_abbr) %>%
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>%
  mutate(rank = dense_rank(desc(AvgCrimesRate)))

crime_rate_by_state %>% filter(rank <= 5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="60%",height="250px")
state_abbr | AvgCrimesRate | rank |
---|---|---|
DC | 1.0000000 | 1 |
LA | 0.5045455 | 2 |
SC | 0.4867857 | 3 |
MD | 0.4800000 | 4 |
FL | 0.4583333 | 5 |
#Violent Crime Rates by States
crime_rate_by_state$hover <- with(crime_rate_by_state, paste(state_abbr,'<br>', "Crime Rank:",rank))
l <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

p <- crime_rate_by_state %>% plot_geo(locationmode = 'USA-states') %>%
  add_trace(
    z = ~AvgCrimesRate, text = ~hover, locations = ~state_abbr,
    color = ~AvgCrimesRate, colors = 'Blues'
  ) %>%
  colorbar(title = "Crime Rate/100K Population") %>%
  layout(
    title = 'Violent Crime Rates in the United States',
    geo = g
  )
p
More than 65% of DC's population belongs to the African-American community. Other states with a high percentage of African-American residents are Mississippi, Georgia, Louisiana, and Delaware. These are also among the states with the highest violent crime rates, which helps explain why crime so often evokes perceptions of race. We investigate this further below.
#Percentage of African-American population by state
black_pop_by_state <- dataset1 %>%
  group_by(state_abbr) %>%
  summarise(AvgBlackPopRate = mean(racepctblack)) %>%
  mutate(rank = dense_rank(desc(AvgBlackPopRate)))

black_pop_by_state %>% filter(rank <= 5) %>% arrange(rank) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="60%",height="250px")
state_abbr | AvgBlackPopRate | rank |
---|---|---|
DC | 1.0000000 | 1 |
MS | 0.6900000 | 2 |
GA | 0.6608108 | 3 |
LA | 0.6245455 | 4 |
DE | 0.6000000 | 5 |
black_pop_by_state$hover <- with(black_pop_by_state, paste(state_abbr,'<br>', "Black Population Rank:",rank))
l <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

p <- black_pop_by_state %>% plot_geo(locationmode = 'USA-states') %>%
  add_trace(
    z = ~AvgBlackPopRate, text = ~hover, locations = ~state_abbr,
    color = ~AvgBlackPopRate, colors = 'Greens'
  ) %>%
  colorbar(title = "%Black") %>%
  layout(
    title = 'Percentage of African-American Population',
    geo = g
  )
p
We conclude our initial visual data exploration by looking at the communities with the highest violent crime rates. Interestingly, Camden City in New Jersey tops the list of the most dangerous US communities in terms of violent crime rates. Other communities with very high violent crime rates include Baton Rouge and Alexandria in Louisiana; Baltimore, Maryland; Atlanta, Georgia; Anniston, Bessemer, and Birmingham in Alabama; and Atlantic City, New Jersey.
#Word Cloud of Communities with the highest crime rates
communityCrime <- dataset1 %>% select(communityname, state_abbr, population, ViolentCrimesPerPop) %>%
  mutate(Place = paste(communityname, state_abbr, sep = ", ")) %>%
  group_by(Place) %>%
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop)) %>%
  arrange(desc(AvgCrimesRate)) %>%
  head(15)

par(mfrow = c(1,1))
par(mar = rep(0,4))
set.seed(1)
wordcloud(words = communityCrime$Place, freq = communityCrime$AvgCrimesRate, scale=c(3,0.01),
          random.order=FALSE,
          colors=brewer.pal(8, "Dark2"))
We still have an extremely high number of variables in the dataset. To reduce them further, we measured the correlation of each independent variable against the dependent variable and removed the variables with an absolute correlation of less than 0.25 to the dependent variable. This vastly reduces the number of variables in the dataset to 54. We removed these variables because we felt they would contribute very little to our final model; besides, working with fewer variables helps with interpretation.
dataset <- dataset[,!colnames(dataset) == 'communityname']

#Running a correlation matrix
res2 <- rcorr(as.matrix(dataset))

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor = (cormat)[ut],
    p = pmat[ut]
  )
}

#Finding variables with low correlation to dependent variable
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% filter(column == "ViolentCrimesPerPop") %>% arrange(cor)
remove <- correlation %>% filter(cor >= -0.25 & cor <= 0.25) %>% select(row)

low.corr.dep <- c('HispPerCap','PctSpeakEnglOnly','RentMedian','MedRent','RentHighQ','state',
                  'OwnOccLowQuart','whitePerCap','OwnOccMedVal','OwnOccHiQuart','AsianPerCap','PctSameHouse85',
                  'pctWFarmSelf','PctWorkMom','OtherPerCap','PersPerOwnOccHous','MedYrHousBuilt','pctWRetire',
                  'indianPerCap','PctBornSameState','PctEmplProfServ','PctEmplManu','PersPerOccupHous',
                  'householdsize','PctWorkMomYoungKids','PctSameState85','PctVacMore6Mos','racePctAsian',
                  'MedOwnCostPctIncNoMtg','agePct12t21','MedOwnCostPctInc','agePct65up','PctSameCity85',
                  'pctUrban','agePct16t24','pctWSocSec','PersPerFam','agePct12t29','PctUsePubTrans',
                  'PctImmigRecent','PctForeignBorn','LandArea','PctImmigRec5','PctRecentImmig',
                  'PctRecImmig5','PctImmigRec8','PersPerRentOccHous')

dataset <- dataset %>% select(-low.corr.dep)
dim(dataset)
## [1] 1993 54
To address multicollinearity, we selected pairs of independent variables with absolute correlations greater than 0.9. To decide which variable to remove from each pair, we regressed the remaining independent variables against the dependent variable, computed each variable's variance inflation factor (VIF) using the vif function from the car package in R, and removed the variable in each multicollinear pair with the higher VIF. This reduced the number of variables in the dataset to 39 predictors. The logic is that the two variables in a multicollinear pair contribute largely the same variance information to a model, so only one of them (the one with the lower VIF) is needed for modeling.
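For reference, the VIF of predictor j equals 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. A hypothetical spot check (our own sketch, using TotalPctDiv as the example predictor) would be:
# Reproduce one VIF by hand; should match the vif() output for TotalPctDiv below
predictors <- dataset[, names(dataset) != 'ViolentCrimesPerPop']
r2_j <- summary(lm(TotalPctDiv ~ ., data = predictors))$r.squared
1 / (1 - r2_j)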
#Now let's check for multicollinearity
res2 <- rcorr(as.matrix(dataset))
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% mutate(abs_cor = abs(cor)) %>% arrange(desc(abs_cor)) %>% filter(abs_cor > 0.9) %>% select(row, column)

correlation %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")
row | column |
---|---|
PctRecImmig8 | PctRecImmig10 |
population | numbUrban |
PctFam2Par | PctKids2Par |
PctLargHouseFam | PctLargHouseOccup |
FemalePctDiv | TotalPctDiv |
PctPersOwnOccup | PctHousOwnOcc |
medIncome | medFamInc |
MalePctDivorce | TotalPctDiv |
PctBSorMore | PctOccupMgmtProf |
population | NumUnderPov |
PctLess9thGrade | PctNotHSGrad |
NumUnderPov | NumIlleg |
medFamInc | perCapInc |
numbUrban | NumUnderPov |
PctFam2Par | PctYoungKids2Par |
PctKids2Par | PctYoungKids2Par |
MalePctDivorce | FemalePctDiv |
PctFam2Par | PctTeen2Par |
PctKids2Par | PctTeen2Par |
#Let's run a simple linear regression to see if any more variables can be removed based on VIF
LR_test <- lm(ViolentCrimesPerPop ~ ., dataset)

vif_df <- data.frame(vif(LR_test))
colnames(vif_df) <- c("VIF_Score")

vif_df %>% arrange(desc(`VIF_Score`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")
VIF_Score | |
---|---|
TotalPctDiv | 928.212429 |
FemalePctDiv | 295.188771 |
MalePctDivorce | 203.199697 |
PctRecImmig10 | 165.892005 |
PctRecImmig8 | 147.580998 |
PctLargHouseOccup | 129.488362 |
PctLargHouseFam | 124.023935 |
population | 117.672823 |
PctFam2Par | 103.736936 |
numbUrban | 101.604576 |
medIncome | 98.892368 |
PctKids2Par | 97.813837 |
medFamInc | 95.037614 |
PctPersOwnOccup | 78.204266 |
PctHousOwnOcc | 76.714946 |
PctNotHSGrad | 37.715581 |
NumUnderPov | 33.986232 |
perCapInc | 27.944187 |
PctBSorMore | 25.418161 |
PctPersDenseHous | 23.163968 |
PctOccupMgmtProf | 23.097460 |
PctLess9thGrade | 21.029438 |
PctNotSpeakEnglWell | 19.853248 |
PctPopUnderPov | 18.129689 |
racePctWhite | 15.166439 |
racepctblack | 14.945811 |
NumIlleg | 14.432404 |
pctWWage | 13.271613 |
HousVacant | 12.765602 |
PctYoungKids2Par | 12.151434 |
pctWInvInc | 12.081673 |
PctEmploy | 11.608310 |
PctIlleg | 11.490275 |
racePctHisp | 9.970625 |
pctWPubAsst | 9.197796 |
PctHousLess3BR | 8.210647 |
RentLowQ | 8.141949 |
PctHousNoPhone | 7.424820 |
PctTeen2Par | 7.264545 |
PctOccupManu | 6.099481 |
PctUnemployed | 6.099450 |
MalePctNevMarr | 5.330162 |
NumImmig | 5.007225 |
NumInShelters | 4.710435 |
PctHousOccup | 3.543461 |
PopDens | 2.833126 |
MedNumBR | 2.657976 |
NumStreet | 2.473209 |
MedRentPctHousInc | 2.470435 |
PctImmigRec10 | 2.372429 |
PctVacantBoarded | 2.096964 |
blackPerCap | 2.028969 |
PctWOFullPlumb | 1.847284 |
Next, we scan the above list of multi-collinear variables and remove the ones with higher VIF -
remove.multicollinear <- c('PctRecImmig10', 'population', 'PctFam2Par', 'PctLargHouseOccup', 'FemalePctDiv', 'PctPersOwnOccup', 'medIncome', 'TotalPctDiv', 'PctBSorMore', 'PctNotHSGrad', 'NumUnderPov', 'medFamInc', 'numbUrban', 'PctKids2Par')

dataset <- dataset %>% select(-remove.multicollinear)
dim(dataset)
## [1] 1993 40
After removing the variables with high multicollinearity, we end up with 40 columns: 39 predictors plus the target.
As a final step before modeling, we created correlation plots to assess in advance the independent variables that will be most important in predicting the violent crime rates.
corrMatrix <- round(cor(dataset[,c(1:13,40)]), 4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5, cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

corrMatrix <- round(cor(dataset[,c(14:26,40)]), 4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5, cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

corrMatrix <- round(cor(dataset[,c(27:39,40)]), 4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5, cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))
Using the correlation plots, we can see which socio-economic factors in a community are most strongly correlated with violent crime rates; these are the variables we expect to surface as important predictors in the models below.
We are going to use a 75/25 split for training and test purposes.
sample = sample.split(dataset$ViolentCrimesPerPop, SplitRatio = 0.75)

train = subset(dataset, sample == TRUE) %>% as.matrix()
test = subset(dataset, sample == FALSE) %>% as.matrix()

y_train <- train[,40]
y_test <- test[,40]

X_train <- train[,-40]
X_test <- test[,-40]

#head(train) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Before we start modeling, we define some reusable code components as R functions -
## Function for plotting the most important variables
printVarImportance <- function(model_name, vImpDF, varNo){
  # Move row names into a "Variable" column and rename the importance column
  varImportance <- tibble::rownames_to_column(vImpDF, "Variable")
  colnames(varImportance)[2] <- "Importance"

  # Limit to Top N important variables
  varImportance <- varImportance %>% arrange(desc(`Importance`)) %>% head(varNo)

  # Create and assign Rank
  rankImportance <- varImportance %>% mutate(Rank = paste0('#', dense_rank(desc(Importance))))

  # Plot output
  ggplot(rankImportance, aes(x = reorder(Variable, Importance), y = Importance)) +
    geom_bar(stat = 'identity', fill = 'steelblue') +
    geom_label(aes(label = round(Importance, 2))) +
    geom_text(aes(x = Variable, y = 0.5, label = Rank),
              hjust = 1.2, vjust = 0.55, size = 4, colour = 'red') +
    labs(x = 'Variable') +
    ggtitle(paste("Variable Importance in ", model_name, " Model")) +
    coord_flip()
}
The glmnet package allows us to fit both ridge and lasso regression models. It requires the response variable to be a vector and the predictor variables to be supplied as a matrix (hence the as.matrix and data.matrix conversions below).
Ridge regression is useful “when the number of predictor variables in a set exceeds the number of observations”, or when a dataset suffers from multicollinearity. When predictor variables are highly correlated, the regression coefficient of any one variable can depend on which other predictors are included in the model and which are left out; the ridge penalty stabilizes these estimates.
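Concretely, ridge regression minimizes the residual sum of squares plus an L2 penalty on the coefficient magnitudes:

\[\hat{\beta}^{\,ridge}=\underset{\beta}{\arg\min}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2+\lambda\sum_{j=1}^{p}\beta_j^2\]

Larger values of lambda shrink the coefficients towards zero (though never exactly to zero); cv.glmnet selects lambda by cross-validation.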
Now we’ll use the glmnet() function to fit the ridge regression model and specify alpha=0.
fit.ridge <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), alpha = 0, type.measure = "mse", family = "gaussian")
To identify the value of lambda to use, we could plot the cross-validation curve or inspect fit.ridge$lambda.min directly; below we simply pass s = "lambda.min" when predicting. The cross-validation plots for all three penalized models are shown together after the Elastic Net section.
fitted.ridge.train <- predict(fit.ridge, newx = data.matrix(X_train), s = "lambda.min")
fitted.ridge.test <- predict(fit.ridge, newx = data.matrix(X_test), s = "lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.ridge.train)[1])
## Train coefficient: 0.8141226
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.ridge.test)[1])
##
## Test coefficient: 0.7635008
As we can see above, the train and test correlation coefficients between predicted and observed values are 0.81 and 0.76, respectively.
ridgePredPerf <- postResample(pred = fitted.ridge.test, obs = y_test)
ridgePredPerf['Family'] <- 'Linear Regression'
ridgePredPerf['Model'] <- 'Ridge Regression'
According to statisticshowto.com, lasso regression is a type of linear regression that uses shrinkage: estimates are shrunk towards a central point, such as the mean. Unlike ridge, the lasso penalty can set some coefficients exactly to zero, effectively performing variable selection.
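The lasso swaps the squared (L2) penalty for an absolute-value (L1) penalty, which is what allows some coefficients to be driven exactly to zero:

\[\hat{\beta}^{\,lasso}=\underset{\beta}{\arg\min}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2+\lambda\sum_{j=1}^{p}|\beta_j|\]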
Now we’ll use the glmnet() function to fit the LASSO regression model and specify alpha=1. We’ll specific s=“lambda.min” to find the best lambda.
fit.lasso <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure = "mse", alpha = 1, family = "gaussian")

fitted.lasso.train <- predict(fit.lasso, newx = data.matrix(X_train), s = "lambda.min")
fitted.lasso.test <- predict(fit.lasso, newx = data.matrix(X_test), s = "lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.lasso.train)[1])
## Train coefficient: 0.8157457
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.lasso.test)[1])
##
## Test coefficient: 0.7689392
As we can see above, the train and test correlation coefficients are 0.82 and 0.77, respectively, essentially on par with ridge.
lassoPredPerf <- postResample(pred = fitted.lasso.test, obs = y_test)
lassoPredPerf['Family'] <- 'Linear Regression'
lassoPredPerf['Model'] <- 'Lasso Regression'
According to machinelearningmastery.com, Elastic Net is an extension of linear regression that adds regularization penalties to the loss function during training; it blends the ridge (L2) and lasso (L1) penalties.
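In glmnet's parameterization, the elastic net penalty mixes the two:

\[\lambda\sum_{j=1}^{p}\Big(\alpha\,|\beta_j|+\frac{1-\alpha}{2}\,\beta_j^{2}\Big)\]

Setting alpha = 1 recovers the lasso, alpha = 0 recovers ridge, and our alpha = 0.5 weights the two penalties equally.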
fit.elnet <- glmnet(as.matrix(X_train), as.matrix(y_train), family = "gaussian", alpha = .5)
fit.elnet.cv <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure = "mse", alpha = .5, family = "gaussian")

fitted.elnet.train <- predict(fit.elnet.cv, newx = data.matrix(X_train), s = "lambda.min")
fitted.elnet.test <- predict(fit.elnet.cv, newx = data.matrix(X_test), s = "lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.elnet.train)[1])
## Train coefficient: 0.8158622
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.elnet.test)[1])
##
## Test coefficient: 0.7685533
elnetPredPerf <- postResample(pred = fitted.elnet.test, obs = y_test)
elnetPredPerf['Family'] <- 'Linear Regression'
elnetPredPerf['Model'] <- 'Elastic Net Regression'
par(mfrow=c(3,1))
plot(fit.lasso, xvar="lambda")
plot(fit.ridge, xvar="lambda")
plot(fit.elnet.cv, xvar="lambda")
Non-linear regression models do not assume the straight-line form y = ax + b; they give the model the flexibility to fit non-linear trends in the data.
We will go over four examples of non-linear regression: Neural Networks, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Multivariate Adaptive Regression Splines (MARS).
Neural networks are regression techniques inspired by theories of how the brain works (neurons signalling to one another). The outcome is modeled through an intermediary layer of unobserved values, the hidden units, each formed from a linear combination of the original predictors.
The parameters used in the model include number of hidden units, decay, number of iterations to find parameter estimates, and the number of parameters used by the model.
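As a sanity check on the MaxNWts argument below: a single-hidden-layer network with p predictors and size hidden units has size × (p + 1) input-to-hidden weights (including biases) plus size + 1 hidden-to-output weights. With p = 39 and size = 5 that is 5 × 40 + 6 = 206 weights, which matches the 39-5-1 network with 206 weights reported in the model output.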
# simple average Neural Network with 5 hidden units
nnetAvg <- avNNet(X_train, y_train,
                  size = 5,
                  decay = 0.01,
                  repeats = 5,
                  linout = TRUE,
                  trace = FALSE,
                  maxit = 500,
                  MaxNWts = 5 * (ncol(X_train) + 1) + 5 + 1)
nnetAvg
## Model Averaged Neural Network with 5 Repeats
##
## a 39-5-1 network with 206 weights
## options were - linear output units decay=0.01
varimp <- varImp(nnetAvg)
varimp %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%", height="400px")
Variable | Overall |
---|---|
X292 | 7.6370160 |
X12 | 5.9451966 |
X114 | 5.5187905 |
X26 | 5.2518596 |
X54 | 5.1049880 |
X154 | 5.0924933 |
X42 | 5.0717370 |
X283 | 4.9011977 |
X381 | 4.6842240 |
X203 | 4.6721495 |
X382 | 4.6398054 |
X29 | 4.5765466 |
X291 | 4.5705678 |
X122 | 4.4443809 |
X53 | 4.3255080 |
X323 | 4.2644435 |
X41 | 4.1839150 |
X110 | 4.1553171 |
X131 | 4.1499350 |
X373 | 4.1439441 |
X384 | 4.1345264 |
X63 | 4.1269616 |
X294 | 4.1134946 |
X31 | 4.0805513 |
X3 | 4.0448392 |
X116 | 3.9986073 |
X293 | 3.9965900 |
X121 | 3.9865322 |
X151 | 3.8569537 |
X64 | 3.8219922 |
X261 | 3.7827265 |
X9 | 3.7704606 |
X264 | 3.7133551 |
X164 | 3.6181529 |
X314 | 3.6143162 |
X72 | 3.6122705 |
X51 | 3.5855487 |
X262 | 3.5758820 |
X223 | 3.5657833 |
X19 | 3.5597244 |
X271 | 3.5582675 |
X315 | 3.5535831 |
X171 | 3.5193914 |
X38 | 3.4772065 |
X352 | 3.4701537 |
X214 | 3.4212095 |
X204 | 3.3389173 |
X33 | 3.3169319 |
X2 | 3.2437467 |
X93 | 3.2400145 |
X216 | 3.2084728 |
X1 | 3.1588952 |
X351 | 3.1146563 |
X317 | 3.0791051 |
X62 | 3.0756721 |
X172 | 3.0527272 |
X37 | 3.0281053 |
X272 | 3.0106611 |
X17 | 2.9916386 |
X251 | 2.9644936 |
X212 | 2.9582334 |
X254 | 2.9251315 |
X92 | 2.9100862 |
X27 | 2.8149147 |
X134 | 2.7943746 |
X71 | 2.7664481 |
X52 | 2.7549555 |
X133 | 2.7507472 |
X301 | 2.7229865 |
X243 | 2.7066318 |
X192 | 2.6908310 |
X22 | 2.6889728 |
X374 | 2.6869133 |
X132 | 2.6767235 |
X113 | 2.6722419 |
X181 | 2.6336297 |
X44 | 2.6125308 |
X74 | 2.5816257 |
X13 | 2.5794715 |
X282 | 2.5663870 |
X194 | 2.5553930 |
X364 | 2.5536503 |
X112 | 2.5291440 |
X213 | 2.4953951 |
X152 | 2.4947829 |
X104 | 2.4920203 |
X6 | 2.4895872 |
X161 | 2.4849425 |
X231 | 2.4742504 |
X11 | 2.4692769 |
X117 | 2.4628723 |
X162 | 2.4551926 |
X81 | 2.4425303 |
X393 | 2.4310590 |
X244 | 2.4305055 |
X217 | 2.3860420 |
X211 | 2.3729905 |
X252 | 2.3709876 |
X324 | 2.3613761 |
X18 | 2.3342998 |
X115 | 2.3332208 |
X316 | 2.3271865 |
X321 | 2.3161905 |
X313 | 2.3147162 |
X83 | 2.3084911 |
X182 | 2.3038297 |
X39 | 2.2885316 |
X21 | 2.2865094 |
X36 | 2.2581763 |
X61 | 2.2538031 |
X16 | 2.2151170 |
X233 | 2.2128776 |
X311 | 2.1189115 |
X91 | 2.1182449 |
X253 | 2.1127521 |
X28 | 2.1029672 |
X383 | 2.0997756 |
X334 | 2.0952218 |
X281 | 2.0654783 |
X333 | 2.0517161 |
X111 | 2.0492464 |
X102 | 2.0437337 |
X394 | 2.0411659 |
X215 | 2.0385755 |
X343 | 2.0084313 |
X25 | 2.0051765 |
X353 | 2.0024676 |
X193 | 1.9981012 |
X174 | 1.9904280 |
X24 | 1.9786006 |
X371 | 1.9758678 |
X263 | 1.9735472 |
X362 | 1.9578452 |
X123 | 1.9453243 |
X82 | 1.9128455 |
X141 | 1.9102032 |
X361 | 1.8959054 |
X124 | 1.8487145 |
X142 | 1.8416549 |
X344 | 1.8379848 |
X153 | 1.8313143 |
X202 | 1.8309932 |
X10 | 1.8271260 |
X5 | 1.8119391 |
X23 | 1.7830686 |
X8 | 1.7639648 |
X4 | 1.7378508 |
X354 | 1.7334810 |
X322 | 1.7154179 |
X94 | 1.7143561 |
X101 | 1.7099298 |
X35 | 1.7034332 |
X372 | 1.6664911 |
X304 | 1.6362735 |
X201 | 1.6107170 |
X84 | 1.6087458 |
X310 | 1.5871175 |
X284 | 1.5467524 |
X341 | 1.5303573 |
X14 | 1.4727146 |
X103 | 1.4153643 |
X391 | 1.4150329 |
X312 | 1.4038618 |
X363 | 1.3891787 |
X30 | 1.3779550 |
X242 | 1.3666800 |
X303 | 1.3544250 |
X210 | 1.3464787 |
X221 | 1.3390426 |
X232 | 1.3385168 |
X274 | 1.3155557 |
X73 | 1.3055992 |
X173 | 1.3003825 |
X20 | 1.2894163 |
X43 | 1.2773276 |
X302 | 1.2762359 |
X191 | 1.2515995 |
X15 | 1.2409482 |
X144 | 1.2232473 |
X183 | 1.2118165 |
X32 | 1.1941497 |
X273 | 1.1506261 |
X342 | 1.1440656 |
X184 | 1.0164413 |
X224 | 1.0046459 |
X332 | 0.9973565 |
X234 | 0.9932592 |
X7 | 0.9404488 |
X34 | 0.8996808 |
X392 | 0.8822497 |
X241 | 0.8759609 |
X222 | 0.8382394 |
X163 | 0.7721729 |
X143 | 0.6735831 |
X331 | 0.6196059 |
We identify the top 10 variables with the most importance to the prediction below -
## Call Variable Importance Plot Function
printVarImportance("Neural Network", varImp(nnetAvg), 10)
nnetPred <- predict(nnetAvg, newdata = X_test)
nnetPredPerf <- postResample(pred = nnetPred, obs = y_test)
nnetPredPerf['Family'] <- 'Non-Linear Regression'
nnetPredPerf['Model'] <- 'Neural Network'
The KNN model predicts a new sample for regression by averaging the K closest data points from the training set. Closeness is measured with the Euclidean (straight-line) distance between samples, the q = 2 special case of the Minkowski distance:
\[\Big(\sum^p_{j=1}\big|x_{aj}-x_{bj}\big|^{q}\Big)^{1/q}\]
The train function in the caret package uses the knn method.
knnFit <- train(X_train, y_train,
                method = 'knn',
                preProc = c('center','scale'),
                tuneLength = 14,
                trControl = trainControl(method = 'cv'))
knnFit
## k-Nearest Neighbors
##
## 1499 samples
## 39 predictor
##
## Pre-processing: centered (39), scaled (39)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1349, 1349, 1348, 1347, 1349, 1350, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1470551 0.6085563 0.10018005
## 7 0.1454218 0.6186791 0.09854605
## 9 0.1440191 0.6269384 0.09784016
## 11 0.1430909 0.6322653 0.09764411
## 13 0.1426350 0.6358800 0.09697388
## 15 0.1434319 0.6321638 0.09772261
## 17 0.1431849 0.6351700 0.09726898
## 19 0.1439635 0.6320082 0.09767728
## 21 0.1443801 0.6305995 0.09764054
## 23 0.1439431 0.6342258 0.09739809
## 25 0.1444537 0.6324687 0.09746166
## 27 0.1448037 0.6308266 0.09757062
## 29 0.1451915 0.6295031 0.09796384
## 31 0.1454098 0.6292510 0.09811286
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
plot(knnFit)
varimp <- varImp(knnFit)
varimp$importance %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%", height="400px")
Variable | Overall |
---|---|
PctIlleg | 100.0000000 |
racePctWhite | 82.6566345 |
PctYoungKids2Par | 80.2817251 |
PctTeen2Par | 78.6232333 |
racepctblack | 70.3212797 |
pctWInvInc | 55.5310000 |
pctWPubAsst | 55.1605903 |
MalePctDivorce | 47.8211990 |
PctPopUnderPov | 44.0230884 |
PctUnemployed | 39.7884001 |
PctHousNoPhone | 35.8730201 |
PctVacantBoarded | 33.5763619 |
PctHousOwnOcc | 33.3328822 |
PctHousLess3BR | 32.1642291 |
PctPersDenseHous | 29.2459802 |
NumIlleg | 29.2056695 |
PctLess9thGrade | 21.1231604 |
HousVacant | 20.6816310 |
PctLargHouseFam | 15.3341087 |
NumInShelters | 14.9067179 |
MedNumBR | 13.6315013 |
PctWOFullPlumb | 12.9198016 |
perCapInc | 12.1955112 |
PctOccupMgmtProf | 11.2535144 |
PctEmploy | 10.3417599 |
NumStreet | 10.3378306 |
MedRentPctHousInc | 9.0280115 |
PctHousOccup | 8.5714403 |
pctWWage | 6.9674327 |
MalePctNevMarr | 6.9603290 |
PctNotSpeakEnglWell | 5.4220236 |
PctOccupManu | 5.3190000 |
PctImmigRec10 | 4.9203000 |
PopDens | 4.8423181 |
racePctHisp | 4.2799730 |
NumImmig | 3.9294479 |
blackPerCap | 1.6454549 |
PctRecImmig8 | 0.3601896 |
RentLowQ | 0.0000000 |
We identify the top 10 variables with the most importance to the prediction below -
## Call Variable Importance Plot Function
printVarImportance("KNN", varImp(knnFit)$importance, 10)
knnPred <- predict(knnFit, newdata = X_test)
knnPredPerf <- postResample(pred = knnPred, obs = y_test)
knnPredPerf['Family'] <- 'Non-Linear Regression'
knnPredPerf['Model'] <- 'KNN'
SVM regression models try to find a regression function where each data point lies within a threshold of the prediction. The threshold is a parameter of the model, and subject to it the model minimizes the size of the coefficients; only points outside the threshold influence the fit, which keeps the model compact and interpretable.
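Formally, SVM regression uses the epsilon-insensitive loss: residuals smaller than the threshold epsilon cost nothing, so only the points outside that band (the support vectors) shape the fit:

\[L_{\epsilon}\big(y,f(x)\big)=\max\big(0,\;|y-f(x)|-\epsilon\big)\]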
We will again use the caret package and apply an SVM with a radial basis function kernel, which allows the model to capture non-linear trends in the dataset.
svmRFit <- train(X_train, y_train,
                 method = 'svmRadial',
                 preProc = c('center','scale'),
                 tuneLength = 14,
                 trControl = trainControl(method = 'cv'))
svmRFit
## Support Vector Machines with Radial Basis Function Kernel
##
## 1499 samples
## 39 predictor
##
## Pre-processing: centered (39), scaled (39)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1349, 1348, 1350, 1350, 1349, 1349, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1481972 0.6322960 0.09523909
## 0.50 0.1430390 0.6461203 0.09275885
## 1.00 0.1402594 0.6525146 0.09164402
## 2.00 0.1396685 0.6519714 0.09198224
## 4.00 0.1419226 0.6393452 0.09431469
## 8.00 0.1468268 0.6150199 0.09876876
## 16.00 0.1546668 0.5773157 0.10495410
## 32.00 0.1647599 0.5308168 0.11282599
## 64.00 0.1739939 0.4929692 0.11996609
## 128.00 0.1801527 0.4699293 0.12551726
## 256.00 0.1803473 0.4699574 0.12603844
## 512.00 0.1803473 0.4699574 0.12603844
## 1024.00 0.1803473 0.4699574 0.12603844
## 2048.00 0.1803473 0.4699574 0.12603844
##
## Tuning parameter 'sigma' was held constant at a value of 0.02507349
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02507349 and C = 2.
varimp <- varImp(svmRFit)
varimp$importance %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%", height="400px")
Variable | Overall |
---|---|
PctIlleg | 100.0000000 |
racePctWhite | 82.6566345 |
PctYoungKids2Par | 80.2817251 |
PctTeen2Par | 78.6232333 |
racepctblack | 70.3212797 |
pctWInvInc | 55.5310000 |
pctWPubAsst | 55.1605903 |
MalePctDivorce | 47.8211990 |
PctPopUnderPov | 44.0230884 |
PctUnemployed | 39.7884001 |
PctHousNoPhone | 35.8730201 |
PctVacantBoarded | 33.5763619 |
PctHousOwnOcc | 33.3328822 |
PctHousLess3BR | 32.1642291 |
PctPersDenseHous | 29.2459802 |
NumIlleg | 29.2056695 |
PctLess9thGrade | 21.1231604 |
HousVacant | 20.6816310 |
PctLargHouseFam | 15.3341087 |
NumInShelters | 14.9067179 |
MedNumBR | 13.6315013 |
PctWOFullPlumb | 12.9198016 |
perCapInc | 12.1955112 |
PctOccupMgmtProf | 11.2535144 |
PctEmploy | 10.3417599 |
NumStreet | 10.3378306 |
MedRentPctHousInc | 9.0280115 |
PctHousOccup | 8.5714403 |
pctWWage | 6.9674327 |
MalePctNevMarr | 6.9603290 |
PctNotSpeakEnglWell | 5.4220236 |
PctOccupManu | 5.3190000 |
PctImmigRec10 | 4.9203000 |
PopDens | 4.8423181 |
racePctHisp | 4.2799730 |
NumImmig | 3.9294479 |
blackPerCap | 1.6454549 |
PctRecImmig8 | 0.3601896 |
RentLowQ | 0.0000000 |
We identify the top 10 variables with the most importance to the prediction below -
## Call Variable Importance Plot Function
printVarImportance("SVM", varImp(svmRFit)$importance, 10)
svmRPred <- predict(svmRFit, newdata = X_test)
svmRPredPerf <- postResample(pred = svmRPred, obs = y_test)
svmRPredPerf['Family'] <- 'Non-Linear Regression'
svmRPredPerf['Model'] <- 'SVM'
The MARS model finds knots to build a piecewise linear model in which each new feature models an isolated portion of the data. At each step, the predictor and cut-point combination with the smallest error is added, iterating until a stopping criterion is reached.
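Each MARS feature is a hinge function of a single predictor at a knot c, so terms such as h(0.54-racepctblack) in the fitted model below are instances of one of:

\[h(x-c)=\max(0,\,x-c)\qquad\text{or}\qquad h(c-x)=\max(0,\,c-x)\]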
We will use the earth package to generate this MARS model.
# mars
marsFit <- earth(X_train, y_train)
summary(marsFit)
## Call: earth(x=X_train, y=y_train)
##
## coefficients
## (Intercept) 0.7263794
## h(0.54-racepctblack) -0.3101390
## h(0.73-pctWWage) 0.1249294
## h(pctWWage-0.73) -0.2401202
## h(PctPopUnderPov-0.67) -0.5455969
## h(MalePctDivorce-0.39) 0.2138445
## h(0.51-PctTeen2Par) 0.2990705
## h(0.39-PctIlleg) -0.1481046
## h(PctIlleg-0.39) 0.2448670
## h(0.88-PctPersDenseHous) -0.3320613
## h(0.16-HousVacant) -0.4654248
## h(0.31-NumStreet) -0.4153084
##
## Selected 12 of 21 terms, and 9 of 39 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: racepctblack, PctIlleg, MalePctDivorce, PctPersDenseHous, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 0.01815316 RSS 26.38294 GRSq 0.6714617 RSq 0.6810408
plotmo(marsFit)
## plotmo grid: racepctblack racePctWhite racePctHisp pctWWage pctWInvInc
## 0.06 0.85 0.04 0.57 0.48
## pctWPubAsst perCapInc blackPerCap PctPopUnderPov PctLess9thGrade PctUnemployed
## 0.26 0.31 0.26 0.24 0.27 0.32
## PctEmploy PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr
## 0.51 0.37 0.4 0.47 0.4
## PctYoungKids2Par PctTeen2Par NumIlleg PctIlleg NumImmig PctImmigRec10
## 0.7 0.61 0.01 0.17 0.01 0.43
## PctRecImmig8 PctNotSpeakEnglWell PctLargHouseFam PctPersDenseHous
## 0.09 0.06 0.21 0.11
## PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded
## 0.51 0.5 0.03 0.77 0.55 0.13
## PctHousNoPhone PctWOFullPlumb RentLowQ MedRentPctHousInc NumInShelters
## 0.18 0.19 0.32 0.48 0
## NumStreet PopDens
## 0 0.17
varimp <- varImp(marsFit)
varimp %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%", height="400px")
Variable | Overall |
---|---|
racepctblack | 100.00000 |
PctIlleg | 65.01171 |
MalePctDivorce | 47.29172 |
PctPersDenseHous | 47.29172 |
HousVacant | 29.26562 |
pctWWage | 22.81121 |
PctPopUnderPov | 20.13595 |
NumStreet | 15.49737 |
PctTeen2Par | 11.49760 |
We identify the top 10 variables with the most importance to the prediction below -
## Call Variable Importance Plot Function
printVarImportance("MARS", varImp(marsFit), 10)
marsPred <- predict(marsFit, newdata = X_test)
marsPredPerf <- postResample(pred = marsPred, obs = y_test)
marsPredPerf['Family'] <- 'Non-Linear Regression'
marsPredPerf['Model'] <- 'MARS'
Decision Trees are a type of supervised machine learning method where the data is repeatedly split according to the value of some predictor. The tree can be explained by two entities: decision nodes, where the data is split, and leaves, which hold the final outcomes. Since our target variable is continuous, we will build a regression tree.
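At each decision node, a regression tree picks the predictor j and cut point s that minimize the combined sum of squared errors of the two resulting child nodes:

\[\min_{j,\,s}\Bigg[\sum_{i:\;x_{ij}\le s}(y_i-\bar{y}_L)^2 \;+\; \sum_{i:\;x_{ij}>s}(y_i-\bar{y}_R)^2\Bigg]\]

where the two means are the average responses in the left and right children; the process repeats in each child until a stopping rule is met.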
We will do the decision tree with all variables first.
dt <- rpart(ViolentCrimesPerPop ~ ., data = as.data.frame(train))
rpart.plot(dt, nn = TRUE)
Here are some of the details of the tree:
printcp(dt)
##
## Regression tree:
## rpart(formula = ViolentCrimesPerPop ~ ., data = as.data.frame(train))
##
## Variables actually used in tree construction:
## [1] MalePctDivorce NumStreet PctIlleg PctPersDenseHous
## [5] PctPopUnderPov
##
## Root node error: 82.716/1499 = 0.055181
##
## n= 1499
##
## CP nsplit rel error xerror xstd
## 1 0.412904 0 1.00000 1.00143 0.050454
## 2 0.063292 1 0.58710 0.59142 0.029666
## 3 0.054263 2 0.52380 0.54890 0.028311
## 4 0.022946 3 0.46954 0.50016 0.026669
## 5 0.019715 4 0.44660 0.48908 0.026478
## 6 0.011323 5 0.42688 0.46579 0.025620
## 7 0.010558 6 0.41556 0.48008 0.026145
## 8 0.010000 7 0.40500 0.47461 0.026067
To achieve improved accuracy, we prune the tree using cross-validation.
plotcp(dt)
dt$cptable
## CP nsplit rel error xerror xstd
## 1 0.41290395 0 1.0000000 1.0014260 0.05045354
## 2 0.06329165 1 0.5870960 0.5914199 0.02966628
## 3 0.05426321 2 0.5238044 0.5489023 0.02831096
## 4 0.02294604 3 0.4695412 0.5001609 0.02666861
## 5 0.01971462 4 0.4465952 0.4890764 0.02647826
## 6 0.01132276 5 0.4268805 0.4657860 0.02561957
## 7 0.01055755 6 0.4155578 0.4800769 0.02614502
## 8 0.01000000 7 0.4050002 0.4746088 0.02606696
The plot above shows the cross-validated error against the complexity parameter. The curve bottoms out around a tree size of six to seven, so we prune with cp = 0.0104, keeping seven terminal nodes; the cross-validated error there is roughly 0.47.
prune_dt = prune(dt, cp = 0.0104)
rpart.plot(prune_dt)

test2 <- as.data.frame(test)
test2 <- na.omit(test2)

Prune_pred <- predict(prune_dt, test2)

# Note: tabulating continuous predictions against continuous observations does not
# form a meaningful confusion matrix; this "accuracy" is computed only to make that point
confMat <- table(test2$ViolentCrimesPerPop, Prune_pred)
accuracy <- sum(diag(confMat))/sum(confMat)
accuracy
## [1] 0.006072874
The model "accuracy" is extremely low even after pruning. This is expected: confusion-matrix accuracy is not a meaningful metric when the output is continuous rather than categorical, since predictions rarely equal the observed values exactly. Moreover, single trees often do not give very good prediction errors, so we will build a random forest model, which tends to outperform individual trees in prediction and misclassification error.
## Call Variable Importance Plot Function
printVarImportance("Decision Tree", data.frame(prune_dt$variable.importance), 10)
dtreePredPerf <- postResample(pred = Prune_pred, obs = y_test)
dtreePredPerf['Family'] <- 'Trees & Boosting'
dtreePredPerf['Model'] <- 'Decision Tree'
Random forests are one of the most popular machine learning algorithms. They can be used for classification or regression, handle a large number of features, and are generally successful because they provide good predictive performance, a low incidence of over-fitting, and easy interpretability.
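For regression, a random forest's prediction is simply the average of its B de-correlated trees, each grown on a bootstrap sample with mtry candidate variables considered at each split:

\[\hat{f}_{rf}(x)=\frac{1}{B}\sum_{b=1}^{B}T_b(x)\]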
In creating the best random forest model, we want to minimize the OOB error rate by finding the optimal number of variables considered at each split, known as mtry. The code below searches for the optimal mtry to use in our random forest model.
train2 <- as.data.frame(train)

oob.err = double(101)
test.err = double(101)
# mtry values above the number of predictors (39) are reset by randomForest with a warning
for (mtry in 1:101){
  fit = randomForest(ViolentCrimesPerPop ~ ., data = train2, mtry = mtry, ntree = 400, proximity = TRUE)
  oob.err[mtry] = fit$mse[400]
  pred = predict(fit, train2)  # predictions on the training data (in-sample)
  test.err[mtry] = with(train2, mean((ViolentCrimesPerPop - pred)^2))
  cat(mtry, " ")
}
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
matplot(1:mtry, cbind(test.err, oob.err), pch = 19, col = c("red", "blue"), type = "b", ylab = "Mean Squared Error")
legend("topright", legend = c("Test", "OOB"), pch = 19, col = c("red", "blue"))
Ideally, the OOB error and a true test error would line up; here the "test" curve is computed on the training data itself, so it sits below the OOB estimate, which is the more honest measure of generalization.
varImpPlot(fit)
## Call Variable Importance Plot Function
printVarImportance("Random Forest", varImp(fit), 10)
The above graph illustrates the importance of the variables used to predict violent crimes per population. For a regression forest, the permutation-based measure (%IncMSE) shows how much the model's mean squared error increases when a variable is scrambled, while IncNodePurity shows each variable's total contribution to reducing node impurity across splits. By either measure, the percentage of kids born to never-married parents is the clear leader by a wide margin.
print(fit)
##
## Call:
## randomForest(formula = ViolentCrimesPerPop ~ ., data = train2, mtry = mtry, ntree = 400, proximity = TRUE)
## Type of random forest: regression
## Number of trees: 400
## No. of variables tried at each split: 39
##
## Mean of squared residuals: 0.01920442
## % Var explained: 65.2
# Note: `pred` comes from the tuning loop above (training-set predictions), so the
# random forest row in the summary table below is not directly comparable to the others
rfPredPerf <- postResample(pred = pred, obs = y_test)
rfPredPerf['Family'] <- 'Trees & Boosting'
rfPredPerf['Model'] <- 'Random Forest'
The boosting method uses trees in a sequential pattern: each successive tree is developed using information from the previous trees to minimize the remaining error. Boosting has three main tuning parameters: the number of trees, the shrinkage (learning rate), and the number of splits in each tree.
We will be using stochastic gradient boosting in our model. This approach resamples the observations and columns in each iteration. We'll use the caret workflow, which invokes the xgboost package, to automatically tune the model parameter values and fit the final boosted-tree model that best explains our data.
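The chunk that actually trains gbm_model does not appear above, so here is a minimal sketch of how it could be fit with the caret workflow. The method name xgbTree is caret's standard hook into xgboost, but the resampling setup and tuning strategy below are assumptions, not the authors' exact call.

# Sketch only: assumed training call for gbm_model (the original chunk is not shown)
gbm_model <- train(X_train, y_train,
                   method = "xgbTree",                                   # caret invokes xgboost
                   trControl = trainControl(method = "cv", number = 10), # assumed 10-fold CV
                   tuneLength = 5)                                       # assumed tuning grid size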
Here are the tuning parameter values selected for the final model -
# Best tuning parameter
gbm_model$bestTune %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
nrounds | max_depth | eta | gamma | colsample_bytree | min_child_weight | subsample | |
---|---|---|---|---|---|---|---|
34 | 50 | 2 | 0.3 | 0 | 0.8 | 1 | 1 |
We identify the top 10 variables with the most importance to the prediction below -
## Call Variable Importance Plot Function
printVarImportance("XGBoost", varImp(gbm_model)$importance, 10)
Visualization of the first 3 trees used in the model.
xgb.plot.tree(model = gbm_model$finalModel,trees = 1:3)
xgbmPred <- predict(gbm_model, newdata = X_test)
xgbmPredPerf <- postResample(pred = xgbmPred, obs = y_test)
xgbmPredPerf['Family'] <- 'Trees & Boosting'
xgbmPredPerf['Model'] <- 'XGBoost'
modelSummaryDF <- rbind(ridgePredPerf, lassoPredPerf, elnetPredPerf, nnetPredPerf, knnPredPerf, svmRPredPerf, marsPredPerf, dtreePredPerf, rfPredPerf, xgbmPredPerf)

SummaryDF <- modelSummaryDF[, c(4, 5, 1, 2, 3)]
rownames(SummaryDF) <- NULL

SummaryDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%", height="500px")
Family | Model | RMSE | Rsquared | MAE |
---|---|---|---|---|
Linear Regression | Ridge Regression | 0.1340 | 0.6521 | 0.0931 |
Linear Regression | Lasso Regression | 0.1343 | 0.6502 | 0.0935 |
Linear Regression | Elastic Net Regression | 0.1344 | 0.6500 | 0.0935 |
Non-Linear Regression | Neural Network | 0.1381 | 0.6318 | 0.0920 |
Non-Linear Regression | KNN | 0.1377 | 0.6432 | 0.0935 |
Non-Linear Regression | SVM | 0.1374 | 0.6434 | 0.0904 |
Non-Linear Regression | MARS | 0.1352 | 0.6454 | 0.0932 |
Trees & Boosting | Decision Tree | 0.1534 | 0.5454 | 0.1072 |
Trees & Boosting | Random Forest | NA | 0.0024 | NA |
Trees & Boosting | XGBoost | 0.1365 | 0.6389 | 0.0920 |
All the penalized linear regression models we built (Ridge, Lasso, and Elastic Net) turned out to be strong models, with test-set R2 above 65%. Non-linear regression models such as MARS, SVM, and KNN also reached R2 of 64% and above. Among the tree-based models, XGBoost performed best with an R2 of 63.9%. All of these models have predictive power in the same range; the single decision tree is great for interpretation but significantly under-performs every other model.
The random forest model explains a high percentage of variance on its training data (65.2%) with a low mean squared residual (0.019), though its test-set metrics in the table above are unreliable for the reason noted earlier.
Purely on test-set R2, the Ridge regression model is the best pick of all the models we tried.
News of violent crimes shakes the foundations of our society from time to time and makes us question our beliefs about the prophesied cultural, economic, and racial harmony around us. With the increase in recent incidents of violent crime, it has become imperative to re-evaluate the causes of these crimes in terms of the socio-economic fabric of our society. Increased computational power has enabled us to look at a large number of factors at once to determine how the relationships among them influence crime rates. A data-driven approach is paramount if we are to plan for the future and build our communities on the principles of trust and harmony.
The Communities and Crime normalized dataset from the UCI Machine Learning Repository provides an excellent opportunity to look at how a myriad of socio-economic factors influence violent crime rates in our society. Using a range of machine learning approaches, we built predictive models that pinpointed several factors that influence these crimes. One theme that arises when assessing the most important factors is family structure: stable marriages contribute to the healthy development of children, who in turn become responsible citizens. Areas with a high percentage of children born to never-married parents, dense housing, and little-to-no schooling tend to exhibit high violent crime rates. Although the models point to racial factors, we believe this has more to do with the relatively poor economic conditions of African Americans in comparison to those of Caucasians.
Despite these contributions, the models developed here suffer from a serious case of over-fitting. Prioritizing the generation of insights over building a deployable predictive model, the entire dataset was utilized for both training and validation purposes, except in the case of the random forest model, which has been proposed for scaling purposes. The random forest model could be further optimized through hyper-parameter tuning to improve its predictive power. Post scaling, this model would need to be re-fit periodically as more data becomes available. Further study in this area could explore other approaches, such as deeper neural network architectures, to build models that do a better job of predicting violent crime rates.
1. [UCI Machine Learning Repository: Communities and Crime Data Set](https://archive.ics.uci.edu/ml/datasets/communities+and+crime)