library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(naniar)
library(xgboost)
library(usmap)
library(DiagrammeR)
library(earth)
library(plotly)
library(wordcloud)
library(RColorBrewer)
library(glmnet)
library(Hmisc)
library(car)
library(class)
library(rpart)
library(rpart.plot)

This is a dataset of 2018 US communities, the demographics of each community, and their crime rates. The dataset has 146 variables: the first four columns identify the community/location, the middle features are demographic characteristics of each community such as population, age, race, and income, and the final columns are types of crimes and overall crime rates. The goal of the project is to understand where violent crime occurs in terms of the socioeconomic and demographic characteristics of a region. These features can help predict ahead of time where violent crime is likely to occur, through predictive models that quantify the risk associated with a region.
Approaching the problem of crime across the different states of the United States requires investigating and analyzing the crime rate in each state, as well as the factors that may contribute to it. One factor that has been studied in relation to crime is the socioeconomic level of a community. There is evidence to suggest that communities with low socioeconomic levels have a higher incidence of crime than more prosperous communities. Other socioeconomic factors, such as unemployment, poverty, and a lack of educational and job opportunities, have also been linked to an increased risk of crime. These factors can negatively affect people's quality of life and increase their vulnerability to crime. However, it is important to note that other factors can also influence the crime rate, such as culture, law enforcement policies, the availability of guns, and other environmental and demographic characteristics. In summary, socioeconomic factors can play an important role in the occurrence of crime across the United States, but multiple factors must be considered when addressing this complex problem. The analysis carried out in this work could be useful for building predictive models that support urban planning and crime reduction.
The dataset selected for this analysis is ‘Crimes in US Communities Dataset’ - Michael Bryant (Owner).
We have a very complete dataset. For each state we can see data such as population per community, percentage of the population in four age groups, percentage of the population by race, percentage of people using public transit for commuting, and many other variables that will allow us to carry out a good analysis.
This is a dataset of 2018 US communities. Numeric-decimal variables have been normalized to the 0.00-1.00 range. Our target variable is 'Violent Crimes by Population' (the GOAL attribute). Our crime dataset has 128 attributes. In the following table we can see each variable, its description, and its data type.
| No. | Column | Description | Data Type |
|---|---|---|---|
| 1 | state | US state (by number) - not counted as predictive, but if considered, should be treated as nominal | nominal |
| 2 | county | numeric code for county - not predictive, and many missing values | numeric |
| 3 | community | numeric code for community - not predictive and many missing values | numeric |
| 4 | communityname | community name - not predictive - for information only | string |
| 5 | fold | fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive | numeric |
| 6 | population | population for community | numeric - decimal |
| 7 | householdsize | mean people per household | numeric - decimal |
| 8 | racepctblack | percentage of population that is african american | numeric - decimal |
| 9 | racePctWhite | percentage of population that is caucasian | numeric - decimal |
| 10 | racePctAsian | percentage of population that is of asian heritage | numeric - decimal |
| 11 | racePctHisp | percentage of population that is of hispanic heritage | numeric - decimal |
| 12 | agePct12t21 | percentage of population that is 12-21 in age | numeric - decimal |
| 13 | agePct12t29 | percentage of population that is 12-29 in age | numeric - decimal |
| 14 | agePct16t24 | percentage of population that is 16-24 in age | numeric - decimal |
| 15 | agePct65up | percentage of population that is 65 and over in age | numeric - decimal |
| 16 | numbUrban | number of people living in areas classified as urban | numeric - decimal |
| 17 | pctUrban | percentage of people living in areas classified as urban | numeric - decimal |
| 18 | medIncome | median household income | numeric - decimal |
| 19 | pctWWage | percentage of households with wage or salary income in 1989 | numeric - decimal |
| 20 | pctWFarmSelf | percentage of households with farm or self employment income in 1989 | numeric - decimal |
| 21 | pctWInvInc | percentage of households with investment / rent income in 1989 | numeric - decimal |
| 22 | pctWSocSec | percentage of households with social security income in 1989 | numeric - decimal |
| 23 | pctWPubAsst | percentage of households with public assistance income in 1989 | numeric - decimal |
| 24 | pctWRetire | percentage of households with retirement income in 1989 | numeric - decimal |
| 25 | medFamInc | median family income (differs from household income for non-family households) | numeric - decimal |
| 26 | perCapInc | per capita income | numeric - decimal |
| 27 | whitePerCap | per capita income for caucasians | numeric - decimal |
| 28 | blackPerCap | per capita income for african americans | numeric - decimal |
| 29 | indianPerCap | per capita income for native americans | numeric - decimal |
| 30 | AsianPerCap | per capita income for people with asian heritage | numeric - decimal |
| 31 | OtherPerCap | per capita income for people with ‘other’ heritage | numeric - decimal |
| 32 | HispPerCap | per capita income for people with hispanic heritage | numeric - decimal |
| 33 | NumUnderPov | number of people under the poverty level | numeric - decimal |
| 34 | PctPopUnderPov | percentage of people under the poverty level | numeric - decimal |
| 35 | PctLess9thGrade | percentage of people 25 and over with less than a 9th grade education | numeric - decimal |
| 36 | PctNotHSGrad | percentage of people 25 and over that are not high school graduates | numeric - decimal |
| 37 | PctBSorMore | percentage of people 25 and over with a bachelors degree or higher education | numeric - decimal |
| 38 | PctUnemployed | percentage of people 16 and over, in the labor force, and unemployed | numeric - decimal |
| 39 | PctEmploy | percentage of people 16 and over who are employed | numeric - decimal |
| 40 | PctEmplManu | percentage of people 16 and over who are employed in manufacturing | numeric - decimal |
| 41 | PctEmplProfServ | percentage of people 16 and over who are employed in professional services | numeric - decimal |
| 42 | PctOccupManu | percentage of people 16 and over who are employed in manufacturing | numeric - decimal |
| 43 | PctOccupMgmtProf | percentage of people 16 and over who are employed in management or professional occupations | numeric - decimal |
| 44 | MalePctDivorce | percentage of males who are divorced | numeric - decimal |
| 45 | MalePctNevMarr | percentage of males who have never married | numeric - decimal |
| 46 | FemalePctDiv | percentage of females who are divorced | numeric - decimal |
| 47 | TotalPctDiv | percentage of population who are divorced | numeric - decimal |
| 48 | PersPerFam | mean number of people per family | numeric - decimal |
| 49 | PctFam2Par | percentage of families (with kids) that are headed by two parents | numeric - decimal |
| 50 | PctKids2Par | percentage of kids in family housing with two parents | numeric - decimal |
| 51 | PctYoungKids2Par | percent of kids 4 and under in two parent households | numeric - decimal |
| 52 | PctTeen2Par | percent of kids age 12-17 in two parent households | numeric - decimal |
| 53 | PctWorkMomYoungKids | percentage of moms of kids 6 and under in labor force | numeric - decimal |
| 54 | PctWorkMom | percentage of moms of kids under 18 in labor force | numeric - decimal |
| 55 | NumIlleg | number of kids born to never married | numeric - decimal |
| 56 | PctIlleg | percentage of kids born to never married | numeric - decimal |
| 57 | NumImmig | total number of people known to be foreign born | numeric - decimal |
| 58 | PctImmigRecent | percentage of immigrants who immigrated within the last 3 years | numeric - decimal |
| 59 | PctImmigRec5 | percentage of immigrants who immigrated within the last 5 years | numeric - decimal |
| 60 | PctImmigRec8 | percentage of immigrants who immigrated within the last 8 years | numeric - decimal |
| 61 | PctImmigRec10 | percentage of immigrants who immigrated within the last 10 years | numeric - decimal |
| 62 | PctRecentImmig | percent of population who have immigrated within the last 3 years | numeric - decimal |
| 63 | PctRecImmig5 | percent of population who have immigrated within the last 5 years | numeric - decimal |
| 64 | PctRecImmig8 | percent of population who have immigrated within the last 8 years | numeric - decimal |
| 65 | PctRecImmig10 | percent of population who have immigrated within the last 10 years | numeric - decimal |
| 66 | PctSpeakEnglOnly | percent of people who speak only English | numeric - decimal |
| 67 | PctNotSpeakEnglWell | percent of people who do not speak English well | numeric - decimal |
| 68 | PctLargHouseFam | percent of family households that are large (6 or more) | numeric - decimal |
| 69 | PctLargHouseOccup | percent of all occupied households that are large (6 or more people) | numeric - decimal |
| 70 | PersPerOccupHous | mean persons per household | numeric - decimal |
| 71 | PersPerOwnOccHous | mean persons per owner occupied household | numeric - decimal |
| 72 | PersPerRentOccHous | mean persons per rental household | numeric - decimal |
| 73 | PctPersOwnOccup | percent of people in owner occupied households | numeric - decimal |
| 74 | PctPersDenseHous | percent of persons in dense housing (more than 1 person per room) | numeric - decimal |
| 75 | PctHousLess3BR | percent of housing units with less than 3 bedrooms | numeric - decimal |
| 76 | MedNumBR | median number of bedrooms | numeric - decimal |
| 77 | HousVacant | number of vacant households | numeric - decimal |
| 78 | PctHousOccup | percent of housing occupied | numeric - decimal |
| 79 | PctHousOwnOcc | percent of households owner occupied | numeric - decimal |
| 80 | PctVacantBoarded | percent of vacant housing that is boarded up | numeric - decimal |
| 81 | PctVacMore6Mos | percent of vacant housing that has been vacant more than 6 months | numeric - decimal |
| 82 | MedYrHousBuilt | median year housing units built | numeric - decimal |
| 83 | PctHousNoPhone | percent of occupied housing units without phone (in 1990, this was rare!) | numeric - decimal |
| 84 | PctWOFullPlumb | percent of housing without complete plumbing facilities | numeric - decimal |
| 85 | OwnOccLowQuart | owner occupied housing - lower quartile value | numeric - decimal |
| 86 | OwnOccMedVal | owner occupied housing - median value | numeric - decimal |
| 87 | OwnOccHiQuart | owner occupied housing - upper quartile value | numeric - decimal |
| 88 | RentLowQ | rental housing - lower quartile rent | numeric - decimal |
| 89 | RentMedian | rental housing - median rent (Census variable H32B from file STF1A) | numeric - decimal |
| 90 | RentHighQ | rental housing - upper quartile rent | numeric - decimal |
| 91 | MedRent | median gross rent (Census variable H43A from file STF3A - includes utilities) | numeric - decimal |
| 92 | MedRentPctHousInc | median gross rent as a percentage of household income | numeric - decimal |
| 93 | MedOwnCostPctInc | median owners cost as a percentage of household income - for owners with a mortgage | numeric - decimal |
| 94 | MedOwnCostPctIncNoMtg | median owners cost as a percentage of household income - for owners without a mortgage | numeric - decimal |
| 95 | NumInShelters | number of people in homeless shelters | numeric - decimal |
| 96 | NumStreet | number of homeless people counted in the street | numeric - decimal |
| 97 | PctForeignBorn | percent of people foreign born | numeric - decimal |
| 98 | PctBornSameState | percent of people born in the same state as currently living | numeric - decimal |
| 99 | PctSameHouse85 | percent of people living in the same house as in 1985 (5 years before) | numeric - decimal |
| 100 | PctSameCity85 | percent of people living in the same city as in 1985 (5 years before) | numeric - decimal |
| 101 | PctSameState85 | percent of people living in the same state as in 1985 (5 years before) | numeric - decimal |
| 102 | LemasSwornFT | number of sworn full time police officers | numeric - decimal |
| 103 | LemasSwFTPerPop | sworn full time police officers per 100K population | numeric - decimal |
| 104 | LemasSwFTFieldOps | number of sworn full time police officers in field operations (on the street as opposed to administrative etc) | numeric - decimal |
| 105 | LemasSwFTFieldPerPop | sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population | numeric - decimal |
| 106 | LemasTotalReq | total requests for police | numeric - decimal |
| 107 | LemasTotReqPerPop | total requests for police per 100K population | numeric - decimal |
| 108 | PolicReqPerOffic | total requests for police per police officer | numeric - decimal |
| 109 | PolicPerPop | police officers per 100K population | numeric - decimal |
| 110 | RacialMatchCommPol | a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar | numeric - decimal |
| 111 | PctPolicWhite | percent of police that are caucasian | numeric - decimal |
| 112 | PctPolicBlack | percent of police that are african american | numeric - decimal |
| 113 | PctPolicHisp | percent of police that are hispanic | numeric - decimal |
| 114 | PctPolicAsian | percent of police that are asian | numeric - decimal |
| 115 | PctPolicMinor | percent of police that are minority of any kind | numeric - decimal |
| 116 | OfficAssgnDrugUnits | number of officers assigned to special drug units | numeric - decimal |
| 117 | NumKindsDrugsSeiz | number of different kinds of drugs seized | numeric - decimal |
| 118 | PolicAveOTWorked | police average overtime worked | numeric - decimal |
| 119 | LandArea | land area in square miles | numeric - decimal |
| 120 | PopDens | population density in persons per square mile | numeric - decimal |
| 121 | PctUsePubTrans | percent of people using public transit for commuting | numeric - decimal |
| 122 | PolicCars | number of police cars | numeric - decimal |
| 123 | PolicOperBudg | police operating budget | numeric - decimal |
| 124 | LemasPctPolicOnPatr | percent of sworn full time police officers on patrol | numeric - decimal |
| 125 | LemasGangUnitDeploy | gang unit deployed | numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time |
| 126 | LemasPctOfficDrugUn | percent of officers assigned to drug units | numeric - decimal |
| 127 | PolicBudgPerPop | police operating budget per population | numeric - decimal |
| 128 | ViolentCrimesPerPop | total number of violent crimes per 100K population - GOAL attribute (to be predicted) | numeric - decimal |
We also load a dataset that describes each state code, state abbreviation, and state name.
| state_code | state | stateName | stateENS |
|---|---|---|---|
| 1 | AL | Alabama | 1779775 |
| 2 | AK | Alaska | 1785533 |
| 4 | AZ | Arizona | 1779777 |
| 5 | AR | Arkansas | 68085 |
| 6 | CA | California | 1779778 |
| 8 | CO | Colorado | 1779779 |
| 9 | CT | Connecticut | 1779780 |
| 10 | DE | Delaware | 1779781 |
| 11 | DC | District of Columbia | 1702382 |
| 12 | FL | Florida | 294478 |
| 13 | GA | Georgia | 1705317 |
| 15 | HI | Hawaii | 1779782 |
| 16 | ID | Idaho | 1779783 |
| 17 | IL | Illinois | 1779784 |
| 18 | IN | Indiana | 448508 |
| 19 | IA | Iowa | 1779785 |
| 20 | KS | Kansas | 481813 |
| 21 | KY | Kentucky | 1779786 |
| 22 | LA | Louisiana | 1629543 |
| 23 | ME | Maine | 1779787 |
| 24 | MD | Maryland | 1714934 |
| 25 | MA | Massachusetts | 606926 |
| 26 | MI | Michigan | 1779789 |
| 27 | MN | Minnesota | 662849 |
| 28 | MS | Mississippi | 1779790 |
| 29 | MO | Missouri | 1779791 |
| 30 | MT | Montana | 767982 |
| 31 | NE | Nebraska | 1779792 |
| 32 | NV | Nevada | 1779793 |
| 33 | NH | New Hampshire | 1779794 |
| 34 | NJ | New Jersey | 1779795 |
| 35 | NM | New Mexico | 897535 |
| 36 | NY | New York | 1779796 |
| 37 | NC | North Carolina | 1027616 |
| 38 | ND | North Dakota | 1779797 |
| 39 | OH | Ohio | 1085497 |
| 40 | OK | Oklahoma | 1102857 |
| 41 | OR | Oregon | 1155107 |
| 42 | PA | Pennsylvania | 1779798 |
| 44 | RI | Rhode Island | 1219835 |
| 45 | SC | South Carolina | 1779799 |
| 46 | SD | South Dakota | 1785534 |
| 47 | TN | Tennessee | 1325873 |
| 48 | TX | Texas | 1779801 |
| 49 | UT | Utah | 1455989 |
| 50 | VT | Vermont | 1779802 |
| 51 | VA | Virginia | 1779803 |
| 53 | WA | Washington | 1779804 |
| 54 | WV | West Virginia | 1779805 |
| 55 | WI | Wisconsin | 1779806 |
| 56 | WY | Wyoming | 1779807 |
| 60 | AS | American Samoa | 1802701 |
| 66 | GU | Guam | 1802705 |
| 69 | MP | Northern Mariana Islands | 1779809 |
| 72 | PR | Puerto Rico | 1779808 |
| 74 | UM | U.S. Minor Outlying Islands | 1878752 |
| 78 | VI | U.S. Virgin Islands | 1802710 |
In the following table we can see, for each state and community, the population, the percentage of the population by age and race, the total number of violent crimes per 100K population, and other data that may be useful for our analysis.
| state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | pctUrban | medIncome | pctWWage | pctWFarmSelf | pctWInvInc | pctWSocSec | pctWPubAsst | pctWRetire | medFamInc | perCapInc | whitePerCap | blackPerCap | indianPerCap | AsianPerCap | OtherPerCap | HispPerCap | NumUnderPov | PctPopUnderPov | PctLess9thGrade | PctNotHSGrad | PctBSorMore | PctUnemployed | PctEmploy | PctEmplManu | PctEmplProfServ | PctOccupManu | PctOccupMgmtProf | MalePctDivorce | MalePctNevMarr | FemalePctDiv | TotalPctDiv | PersPerFam | PctFam2Par | PctKids2Par | PctYoungKids2Par | PctTeen2Par | PctWorkMomYoungKids | PctWorkMom | NumIlleg | PctIlleg | NumImmig | PctImmigRecent | PctImmigRec5 | PctImmigRec8 | PctImmigRec10 | PctRecentImmig | PctRecImmig5 | PctRecImmig8 | PctRecImmig10 | PctSpeakEnglOnly | PctNotSpeakEnglWell | PctLargHouseFam | PctLargHouseOccup | PersPerOccupHous | PersPerOwnOccHous | PersPerRentOccHous | PctPersOwnOccup | PctPersDenseHous | PctHousLess3BR | MedNumBR | HousVacant | PctHousOccup | PctHousOwnOcc | PctVacantBoarded | PctVacMore6Mos | MedYrHousBuilt | PctHousNoPhone | PctWOFullPlumb | OwnOccLowQuart | OwnOccMedVal | OwnOccHiQuart | RentLowQ | RentMedian | RentHighQ | MedRent | MedRentPctHousInc | MedOwnCostPctInc | MedOwnCostPctIncNoMtg | NumInShelters | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LemasSwornFT | LemasSwFTPerPop | LemasSwFTFieldOps | LemasSwFTFieldPerPop | LemasTotalReq | LemasTotReqPerPop | PolicReqPerOffic | PolicPerPop | RacialMatchCommPol | PctPolicWhite | PctPolicBlack | PctPolicHisp | PctPolicAsian | PctPolicMinor | OfficAssgnDrugUnits | NumKindsDrugsSeiz | PolicAveOTWorked | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | ? | ? | Lakewoodcity | 1 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | 0.32 | 0.20 | 1.0 | 0.37 | 0.72 | 0.34 | 0.60 | 0.29 | 0.15 | 0.43 | 0.39 | 0.40 | 0.39 | 0.32 | 0.27 | 0.27 | 0.36 | 0.41 | 0.08 | 0.19 | 0.10 | 0.18 | 0.48 | 0.27 | 0.68 | 0.23 | 0.41 | 0.25 | 0.52 | 0.68 | 0.40 | 0.75 | 0.75 | 0.35 | 0.55 | 0.59 | 0.61 | 0.56 | 0.74 | 0.76 | 0.04 | 0.14 | 0.03 | 0.24 | 0.27 | 0.37 | 0.39 | 0.07 | 0.07 | 0.08 | 0.08 | 0.89 | 0.06 | 0.14 | 0.13 | 0.33 | 0.39 | 0.28 | 0.55 | 0.09 | 0.51 | 0.5 | 0.21 | 0.71 | 0.52 | 0.05 | 0.26 | 0.65 | 0.14 | 0.06 | 0.22 | 0.19 | 0.18 | 0.36 | 0.35 | 0.38 | 0.34 | 0.38 | 0.46 | 0.25 | 0.04 | 0 | 0.12 | 0.42 | 0.50 | 0.51 | 0.64 | 0.03 | 0.13 | 0.96 | 0.17 | 0.06 | 0.18 | 0.44 | 0.13 | 0.94 | 0.93 | 0.03 | 0.07 | 0.1 | 0.07 | 0.02 | 0.57 | 0.29 | 0.12 | 0.26 | 0.20 | 0.06 | 0.04 | 0.9 | 0.5 | 0.32 | 0.14 | 0.20 |
| 53 | ? | ? | Tukwilacity | 1 | 0.00 | 0.16 | 0.12 | 0.74 | 0.45 | 0.07 | 0.26 | 0.59 | 0.35 | 0.27 | 0.02 | 1.0 | 0.31 | 0.72 | 0.11 | 0.45 | 0.25 | 0.29 | 0.39 | 0.29 | 0.37 | 0.38 | 0.33 | 0.16 | 0.30 | 0.22 | 0.35 | 0.01 | 0.24 | 0.14 | 0.24 | 0.30 | 0.27 | 0.73 | 0.57 | 0.15 | 0.42 | 0.36 | 1.00 | 0.63 | 0.91 | 1.00 | 0.29 | 0.43 | 0.47 | 0.60 | 0.39 | 0.46 | 0.53 | 0.00 | 0.24 | 0.01 | 0.52 | 0.62 | 0.64 | 0.63 | 0.25 | 0.27 | 0.25 | 0.23 | 0.84 | 0.10 | 0.16 | 0.10 | 0.17 | 0.29 | 0.17 | 0.26 | 0.20 | 0.82 | 0.0 | 0.02 | 0.79 | 0.24 | 0.02 | 0.25 | 0.65 | 0.16 | 0.00 | 0.21 | 0.20 | 0.21 | 0.42 | 0.38 | 0.40 | 0.37 | 0.29 | 0.32 | 0.18 | 0.00 | 0 | 0.21 | 0.50 | 0.34 | 0.60 | 0.52 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.02 | 0.12 | 0.45 | ? | ? | ? | ? | 0.00 | ? | 0.67 |
| 24 | ? | ? | Aberdeentown | 1 | 0.00 | 0.42 | 0.49 | 0.56 | 0.17 | 0.04 | 0.39 | 0.47 | 0.28 | 0.32 | 0.00 | 0.0 | 0.30 | 0.58 | 0.19 | 0.39 | 0.38 | 0.40 | 0.84 | 0.28 | 0.27 | 0.29 | 0.27 | 0.07 | 0.29 | 0.28 | 0.39 | 0.01 | 0.27 | 0.27 | 0.43 | 0.19 | 0.36 | 0.58 | 0.32 | 0.29 | 0.49 | 0.32 | 0.63 | 0.41 | 0.71 | 0.70 | 0.45 | 0.42 | 0.44 | 0.43 | 0.43 | 0.71 | 0.67 | 0.01 | 0.46 | 0.00 | 0.07 | 0.06 | 0.15 | 0.19 | 0.02 | 0.02 | 0.04 | 0.05 | 0.88 | 0.04 | 0.20 | 0.20 | 0.46 | 0.52 | 0.43 | 0.42 | 0.15 | 0.51 | 0.5 | 0.01 | 0.86 | 0.41 | 0.29 | 0.30 | 0.52 | 0.47 | 0.45 | 0.18 | 0.17 | 0.16 | 0.27 | 0.29 | 0.27 | 0.31 | 0.48 | 0.39 | 0.28 | 0.00 | 0 | 0.14 | 0.49 | 0.54 | 0.67 | 0.56 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.01 | 0.21 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.43 |
| 34 | 5 | 81440 | Willingborotownship | 1 | 0.04 | 0.77 | 1.00 | 0.08 | 0.12 | 0.10 | 0.51 | 0.50 | 0.34 | 0.21 | 0.06 | 1.0 | 0.58 | 0.89 | 0.21 | 0.43 | 0.36 | 0.20 | 0.82 | 0.51 | 0.36 | 0.40 | 0.39 | 0.16 | 0.25 | 0.36 | 0.44 | 0.01 | 0.10 | 0.09 | 0.25 | 0.31 | 0.33 | 0.71 | 0.36 | 0.45 | 0.37 | 0.39 | 0.34 | 0.45 | 0.49 | 0.44 | 0.75 | 0.65 | 0.54 | 0.83 | 0.65 | 0.85 | 0.86 | 0.03 | 0.33 | 0.02 | 0.11 | 0.20 | 0.30 | 0.31 | 0.05 | 0.08 | 0.11 | 0.11 | 0.81 | 0.08 | 0.56 | 0.62 | 0.85 | 0.77 | 1.00 | 0.94 | 0.12 | 0.01 | 0.5 | 0.01 | 0.97 | 0.96 | 0.60 | 0.47 | 0.52 | 0.11 | 0.11 | 0.24 | 0.21 | 0.19 | 0.75 | 0.70 | 0.77 | 0.89 | 0.63 | 0.51 | 0.47 | 0.00 | 0 | 0.19 | 0.30 | 0.73 | 0.64 | 0.65 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.02 | 0.39 | 0.28 | ? | ? | ? | ? | 0.00 | ? | 0.12 |
| 42 | 95 | 6096 | Bethlehemtownship | 1 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | 0.05 | 0.38 | 0.38 | 0.23 | 0.36 | 0.02 | 0.9 | 0.50 | 0.72 | 0.16 | 0.68 | 0.44 | 0.11 | 0.71 | 0.46 | 0.43 | 0.41 | 0.28 | 0.00 | 0.74 | 0.51 | 0.48 | 0.00 | 0.06 | 0.25 | 0.30 | 0.33 | 0.12 | 0.65 | 0.67 | 0.38 | 0.42 | 0.46 | 0.22 | 0.27 | 0.20 | 0.21 | 0.51 | 0.91 | 0.91 | 0.89 | 0.85 | 0.40 | 0.60 | 0.00 | 0.06 | 0.00 | 0.03 | 0.07 | 0.20 | 0.27 | 0.01 | 0.02 | 0.04 | 0.05 | 0.88 | 0.05 | 0.16 | 0.19 | 0.59 | 0.60 | 0.37 | 0.89 | 0.02 | 0.19 | 0.5 | 0.01 | 0.89 | 0.87 | 0.04 | 0.55 | 0.73 | 0.05 | 0.14 | 0.31 | 0.31 | 0.30 | 0.40 | 0.36 | 0.38 | 0.38 | 0.22 | 0.51 | 0.21 | 0.00 | 0 | 0.11 | 0.72 | 0.64 | 0.61 | 0.53 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.04 | 0.09 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.03 |
| 6 | ? | ? | SouthPasadenacity | 1 | 0.02 | 0.28 | 0.06 | 0.54 | 1.00 | 0.25 | 0.31 | 0.48 | 0.27 | 0.37 | 0.04 | 1.0 | 0.52 | 0.68 | 0.20 | 0.61 | 0.28 | 0.15 | 0.25 | 0.62 | 0.72 | 0.76 | 0.77 | 0.28 | 0.52 | 0.48 | 0.60 | 0.01 | 0.12 | 0.13 | 0.12 | 0.80 | 0.10 | 0.65 | 0.19 | 0.77 | 0.06 | 0.91 | 0.49 | 0.57 | 0.61 | 0.58 | 0.44 | 0.62 | 0.69 | 0.87 | 0.53 | 0.30 | 0.43 | 0.00 | 0.11 | 0.04 | 0.30 | 0.35 | 0.43 | 0.47 | 0.50 | 0.50 | 0.56 | 0.57 | 0.45 | 0.28 | 0.25 | 0.19 | 0.29 | 0.53 | 0.18 | 0.39 | 0.26 | 0.73 | 0.0 | 0.02 | 0.84 | 0.30 | 0.16 | 0.28 | 0.25 | 0.02 | 0.05 | 0.94 | 1.00 | 1.00 | 0.67 | 0.63 | 0.68 | 0.62 | 0.47 | 0.59 | 0.11 | 0.00 | 0 | 0.70 | 0.42 | 0.49 | 0.73 | 0.64 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.01 | 0.58 | 0.10 | ? | ? | ? | ? | 0.00 | ? | 0.14 |
## [1] 1994 128
The dataset has 1994 observations and 128 variables. We can see that there are missing values in the dataset, coded as '?'; we are going to convert these entries into 'NA' in order to carry out our analysis.
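A minimal sketch of this conversion is shown below; the data frame name dataset matches the code used later in the document, but the exact step here is an assumption.

# Sketch: replace the "?" markers with NA and restore numeric types for every
# column except the community name.
dataset[dataset == "?"] <- NA
num_cols <- setdiff(names(dataset), "communityname")
dataset[num_cols] <- lapply(dataset[num_cols], function(x) as.numeric(as.character(x)))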
A summary of the variables is below:
| state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | pctUrban | medIncome | pctWWage | pctWFarmSelf | pctWInvInc | pctWSocSec | pctWPubAsst | pctWRetire | medFamInc | perCapInc | whitePerCap | blackPerCap | indianPerCap | AsianPerCap | OtherPerCap | HispPerCap | NumUnderPov | PctPopUnderPov | PctLess9thGrade | PctNotHSGrad | PctBSorMore | PctUnemployed | PctEmploy | PctEmplManu | PctEmplProfServ | PctOccupManu | PctOccupMgmtProf | MalePctDivorce | MalePctNevMarr | FemalePctDiv | TotalPctDiv | PersPerFam | PctFam2Par | PctKids2Par | PctYoungKids2Par | PctTeen2Par | PctWorkMomYoungKids | PctWorkMom | NumIlleg | PctIlleg | NumImmig | PctImmigRecent | PctImmigRec5 | PctImmigRec8 | PctImmigRec10 | PctRecentImmig | PctRecImmig5 | PctRecImmig8 | PctRecImmig10 | PctSpeakEnglOnly | PctNotSpeakEnglWell | PctLargHouseFam | PctLargHouseOccup | PersPerOccupHous | PersPerOwnOccHous | PersPerRentOccHous | PctPersOwnOccup | PctPersDenseHous | PctHousLess3BR | MedNumBR | HousVacant | PctHousOccup | PctHousOwnOcc | PctVacantBoarded | PctVacMore6Mos | MedYrHousBuilt | PctHousNoPhone | PctWOFullPlumb | OwnOccLowQuart | OwnOccMedVal | OwnOccHiQuart | RentLowQ | RentMedian | RentHighQ | MedRent | MedRentPctHousInc | MedOwnCostPctInc | MedOwnCostPctIncNoMtg | NumInShelters | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LemasSwornFT | LemasSwFTPerPop | LemasSwFTFieldOps | LemasSwFTFieldPerPop | LemasTotalReq | LemasTotReqPerPop | PolicReqPerOffic | PolicPerPop | RacialMatchCommPol | PctPolicWhite | PctPolicBlack | PctPolicHisp | PctPolicAsian | PctPolicMinor | OfficAssgnDrugUnits | NumKindsDrugsSeiz | PolicAveOTWorked | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.00 | Length:1994 | Length:1994 | Length:1994 | Min. : 1.000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Min. :0.0000 | Min. :0.00000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Length:1994 | Length:1994 | Length:1994 | Length:1994 | Min. :0.00000 | Length:1994 | Min. :0.000 | |
| 1st Qu.:12.00 | Class :character | Class :character | Class :character | 1st Qu.: 3.000 | 1st Qu.:0.01000 | 1st Qu.:0.3500 | 1st Qu.:0.0200 | 1st Qu.:0.6300 | 1st Qu.:0.0400 | 1st Qu.:0.010 | 1st Qu.:0.3400 | 1st Qu.:0.4100 | 1st Qu.:0.2500 | 1st Qu.:0.3000 | 1st Qu.:0.00000 | 1st Qu.:0.0000 | 1st Qu.:0.2000 | 1st Qu.:0.4400 | 1st Qu.:0.1600 | 1st Qu.:0.3700 | 1st Qu.:0.3500 | 1st Qu.:0.1425 | 1st Qu.:0.3600 | 1st Qu.:0.2300 | 1st Qu.:0.2200 | 1st Qu.:0.240 | 1st Qu.:0.1725 | 1st Qu.:0.1100 | 1st Qu.:0.1900 | Class :character | 1st Qu.:0.2600 | 1st Qu.:0.01000 | 1st Qu.:0.110 | 1st Qu.:0.1600 | 1st Qu.:0.2300 | 1st Qu.:0.2100 | 1st Qu.:0.2200 | 1st Qu.:0.3800 | 1st Qu.:0.2500 | 1st Qu.:0.3200 | 1st Qu.:0.2400 | 1st Qu.:0.3100 | 1st Qu.:0.3300 | 1st Qu.:0.3100 | 1st Qu.:0.3600 | 1st Qu.:0.3600 | 1st Qu.:0.4000 | 1st Qu.:0.4900 | 1st Qu.:0.4900 | 1st Qu.:0.530 | 1st Qu.:0.4800 | 1st Qu.:0.3900 | 1st Qu.:0.4200 | 1st Qu.:0.00000 | 1st Qu.:0.09 | 1st Qu.:0.00000 | 1st Qu.:0.1600 | 1st Qu.:0.2000 | 1st Qu.:0.2500 | 1st Qu.:0.2800 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.0300 | 1st Qu.:0.7300 | 1st Qu.:0.0300 | 1st Qu.:0.1500 | 1st Qu.:0.1400 | 1st Qu.:0.3400 | 1st Qu.:0.3900 | 1st Qu.:0.2700 | 1st Qu.:0.4400 | 1st Qu.:0.0600 | 1st Qu.:0.4000 | 1st Qu.:0.0000 | 1st Qu.:0.01000 | 1st Qu.:0.6300 | 1st Qu.:0.4300 | 1st Qu.:0.0600 | 1st Qu.:0.2900 | 1st Qu.:0.3500 | 1st Qu.:0.0600 | 1st Qu.:0.1000 | 1st Qu.:0.0900 | 1st Qu.:0.0900 | 1st Qu.:0.0900 | 1st Qu.:0.1700 | 1st Qu.:0.2000 | 1st Qu.:0.220 | 1st Qu.:0.2100 | 1st Qu.:0.3700 | 1st Qu.:0.3200 | 1st Qu.:0.2500 | 1st Qu.:0.00000 | 1st Qu.:0.00000 | 1st Qu.:0.0600 | 1st Qu.:0.4700 | 1st Qu.:0.4200 | 1st Qu.:0.5200 | 1st Qu.:0.5600 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.:0.02000 | 1st Qu.:0.1000 | 1st Qu.:0.0200 | Class :character | Class :character | Class :character | Class :character | 1st Qu.:0.00000 | Class :character | 1st Qu.:0.070 | |
| Median :34.00 | Mode :character | Mode :character | Mode :character | Median : 5.000 | Median :0.02000 | Median :0.4400 | Median :0.0600 | Median :0.8500 | Median :0.0700 | Median :0.040 | Median :0.4000 | Median :0.4800 | Median :0.2900 | Median :0.4200 | Median :0.03000 | Median :1.0000 | Median :0.3200 | Median :0.5600 | Median :0.2300 | Median :0.4800 | Median :0.4750 | Median :0.2600 | Median :0.4700 | Median :0.3300 | Median :0.3000 | Median :0.320 | Median :0.2500 | Median :0.1700 | Median :0.2800 | Mode :character | Median :0.3450 | Median :0.02000 | Median :0.250 | Median :0.2700 | Median :0.3600 | Median :0.3100 | Median :0.3200 | Median :0.5100 | Median :0.3700 | Median :0.4100 | Median :0.3700 | Median :0.4000 | Median :0.4700 | Median :0.4000 | Median :0.5000 | Median :0.5000 | Median :0.4700 | Median :0.6300 | Median :0.6400 | Median :0.700 | Median :0.6100 | Median :0.5100 | Median :0.5400 | Median :0.01000 | Median :0.17 | Median :0.01000 | Median :0.2900 | Median :0.3400 | Median :0.3900 | Median :0.4300 | Median :0.0900 | Median :0.0800 | Median :0.0900 | Median :0.0900 | Median :0.8700 | Median :0.0600 | Median :0.2000 | Median :0.1900 | Median :0.4400 | Median :0.4800 | Median :0.3600 | Median :0.5600 | Median :0.1100 | Median :0.5100 | Median :0.5000 | Median :0.03000 | Median :0.7700 | Median :0.5400 | Median :0.1300 | Median :0.4200 | Median :0.5200 | Median :0.1850 | Median :0.1900 | Median :0.1800 | Median :0.1700 | Median :0.1800 | Median :0.3100 | Median :0.3300 | Median :0.370 | Median :0.3400 | Median :0.4800 | Median :0.4500 | Median :0.3700 | Median :0.00000 | Median :0.00000 | Median :0.1300 | Median :0.6300 | Median :0.5400 | Median :0.6700 | Median :0.7000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median :0.04000 | Median :0.1700 | Median :0.0700 | Mode :character | Mode :character | Mode :character | Mode :character | Median :0.00000 | Mode :character | Median :0.150 | |
| Mean :28.68 | NA | NA | NA | Mean : 5.494 | Mean :0.05759 | Mean :0.4634 | Mean :0.1796 | Mean :0.7537 | Mean :0.1537 | Mean :0.144 | Mean :0.4242 | Mean :0.4939 | Mean :0.3363 | Mean :0.4232 | Mean :0.06407 | Mean :0.6963 | Mean :0.3611 | Mean :0.5582 | Mean :0.2916 | Mean :0.4957 | Mean :0.4711 | Mean :0.3178 | Mean :0.4792 | Mean :0.3757 | Mean :0.3503 | Mean :0.368 | Mean :0.2911 | Mean :0.2035 | Mean :0.3224 | NA | Mean :0.3863 | Mean :0.05551 | Mean :0.303 | Mean :0.3158 | Mean :0.3833 | Mean :0.3617 | Mean :0.3635 | Mean :0.5011 | Mean :0.3964 | Mean :0.4406 | Mean :0.3912 | Mean :0.4413 | Mean :0.4612 | Mean :0.4345 | Mean :0.4876 | Mean :0.4943 | Mean :0.4877 | Mean :0.6109 | Mean :0.6207 | Mean :0.664 | Mean :0.5829 | Mean :0.5014 | Mean :0.5267 | Mean :0.03629 | Mean :0.25 | Mean :0.03006 | Mean :0.3202 | Mean :0.3606 | Mean :0.3991 | Mean :0.4279 | Mean :0.1814 | Mean :0.1821 | Mean :0.1848 | Mean :0.1829 | Mean :0.7859 | Mean :0.1506 | Mean :0.2676 | Mean :0.2519 | Mean :0.4621 | Mean :0.4944 | Mean :0.4041 | Mean :0.5626 | Mean :0.1863 | Mean :0.4952 | Mean :0.3147 | Mean :0.07682 | Mean :0.7195 | Mean :0.5487 | Mean :0.2045 | Mean :0.4333 | Mean :0.4942 | Mean :0.2645 | Mean :0.2431 | Mean :0.2647 | Mean :0.2635 | Mean :0.2689 | Mean :0.3464 | Mean :0.3725 | Mean :0.423 | Mean :0.3841 | Mean :0.4901 | Mean :0.4498 | Mean :0.4038 | Mean :0.02944 | Mean :0.02278 | Mean :0.2156 | Mean :0.6089 | Mean :0.5351 | Mean :0.6264 | Mean :0.6515 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean :0.06523 | Mean :0.2329 | Mean :0.1617 | NA | NA | NA | NA | Mean :0.09405 | NA | Mean :0.238 | |
| 3rd Qu.:42.00 | NA | NA | NA | 3rd Qu.: 8.000 | 3rd Qu.:0.05000 | 3rd Qu.:0.5400 | 3rd Qu.:0.2300 | 3rd Qu.:0.9400 | 3rd Qu.:0.1700 | 3rd Qu.:0.160 | 3rd Qu.:0.4700 | 3rd Qu.:0.5400 | 3rd Qu.:0.3600 | 3rd Qu.:0.5300 | 3rd Qu.:0.07000 | 3rd Qu.:1.0000 | 3rd Qu.:0.4900 | 3rd Qu.:0.6900 | 3rd Qu.:0.3700 | 3rd Qu.:0.6200 | 3rd Qu.:0.5800 | 3rd Qu.:0.4400 | 3rd Qu.:0.5800 | 3rd Qu.:0.4800 | 3rd Qu.:0.4300 | 3rd Qu.:0.440 | 3rd Qu.:0.3800 | 3rd Qu.:0.2500 | 3rd Qu.:0.4000 | NA | 3rd Qu.:0.4800 | 3rd Qu.:0.05000 | 3rd Qu.:0.450 | 3rd Qu.:0.4200 | 3rd Qu.:0.5100 | 3rd Qu.:0.4600 | 3rd Qu.:0.4800 | 3rd Qu.:0.6275 | 3rd Qu.:0.5200 | 3rd Qu.:0.5300 | 3rd Qu.:0.5100 | 3rd Qu.:0.5400 | 3rd Qu.:0.5900 | 3rd Qu.:0.5000 | 3rd Qu.:0.6200 | 3rd Qu.:0.6300 | 3rd Qu.:0.5600 | 3rd Qu.:0.7600 | 3rd Qu.:0.7800 | 3rd Qu.:0.840 | 3rd Qu.:0.7200 | 3rd Qu.:0.6200 | 3rd Qu.:0.6500 | 3rd Qu.:0.02000 | 3rd Qu.:0.32 | 3rd Qu.:0.02000 | 3rd Qu.:0.4300 | 3rd Qu.:0.4800 | 3rd Qu.:0.5300 | 3rd Qu.:0.5600 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.2300 | 3rd Qu.:0.9400 | 3rd Qu.:0.1600 | 3rd Qu.:0.3100 | 3rd Qu.:0.2900 | 3rd Qu.:0.5500 | 3rd Qu.:0.5800 | 3rd Qu.:0.4900 | 3rd Qu.:0.7000 | 3rd Qu.:0.2200 | 3rd Qu.:0.6000 | 3rd Qu.:0.5000 | 3rd Qu.:0.07000 | 3rd Qu.:0.8600 | 3rd Qu.:0.6700 | 3rd Qu.:0.2700 | 3rd Qu.:0.5600 | 3rd Qu.:0.6700 | 3rd Qu.:0.4200 | 3rd Qu.:0.3300 | 3rd Qu.:0.4000 | 3rd Qu.:0.3900 | 3rd Qu.:0.3800 | 3rd Qu.:0.4900 | 3rd Qu.:0.5200 | 3rd Qu.:0.590 | 3rd Qu.:0.5300 | 3rd Qu.:0.5900 | 3rd Qu.:0.5800 | 3rd Qu.:0.5100 | 3rd Qu.:0.01000 | 3rd Qu.:0.00000 | 3rd Qu.:0.2800 | 3rd Qu.:0.7775 | 3rd Qu.:0.6600 | 3rd Qu.:0.7700 | 3rd Qu.:0.7900 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.:0.07000 | 3rd Qu.:0.2800 | 3rd Qu.:0.1900 | NA | NA | NA | NA | 3rd Qu.:0.00000 | NA | 3rd Qu.:0.330 | |
| Max. :56.00 | NA | NA | NA | Max. :10.000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | NA | Max. :1.0000 | Max. :1.00000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | NA | NA | NA | NA | Max. :1.00000 | NA | Max. :1.000 |
For the variables that have missing values, we will perform the analysis to identify and apply the appropriate imputation technique.
We are going to review how many missing values we have for each attribute.
| Feature | NA_Count | NA_Percentage |
|---|---|---|
| LemasSwornFT | 1675 | 84.0020060 |
| LemasSwFTPerPop | 1675 | 84.0020060 |
| LemasSwFTFieldOps | 1675 | 84.0020060 |
| LemasSwFTFieldPerPop | 1675 | 84.0020060 |
| LemasTotalReq | 1675 | 84.0020060 |
| LemasTotReqPerPop | 1675 | 84.0020060 |
| PolicReqPerOffic | 1675 | 84.0020060 |
| PolicPerPop | 1675 | 84.0020060 |
| RacialMatchCommPol | 1675 | 84.0020060 |
| PctPolicWhite | 1675 | 84.0020060 |
| PctPolicBlack | 1675 | 84.0020060 |
| PctPolicHisp | 1675 | 84.0020060 |
| PctPolicAsian | 1675 | 84.0020060 |
| PctPolicMinor | 1675 | 84.0020060 |
| OfficAssgnDrugUnits | 1675 | 84.0020060 |
| NumKindsDrugsSeiz | 1675 | 84.0020060 |
| PolicAveOTWorked | 1675 | 84.0020060 |
| PolicCars | 1675 | 84.0020060 |
| PolicOperBudg | 1675 | 84.0020060 |
| LemasPctPolicOnPatr | 1675 | 84.0020060 |
| LemasGangUnitDeploy | 1675 | 84.0020060 |
| PolicBudgPerPop | 1675 | 84.0020060 |
| community | 1177 | 59.0270812 |
| county | 1174 | 58.8766299 |
| OtherPerCap | 1 | 0.0501505 |
Let’s graph the amount of missing data:
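One way to produce such a plot is with the naniar package loaded above; this is only a sketch.

# Sketch: bar chart of the percentage of missing values per variable.
gg_miss_var(dataset, show_pct = TRUE)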
Reviewing the variables that have missing data, we find that 2 variables have approximately 59% missing data and 22 variables have 84% missing data. We consider that variables with this amount of missing data are not useful for our analysis, so we will remove them.
We also remove the fold variable, since it was included only for non-random 10-fold cross-validation.
We are going to identify the degenerate (near-zero-variance) variables and remove them from the dataset.
## [1] "LemasPctOfficDrugUn"
We are going to eliminate the variable "LemasPctOfficDrugUn" (percent of officers assigned to drug units). We are also going to remove the variable 'OtherPerCap' (per capita income for people with 'other' heritage) from the dataset.
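For reference, a degenerate (near-zero-variance) predictor like this one can be flagged with caret's nearZeroVar(); the call below is a sketch with the default cutoffs, not necessarily the exact code used.

# Sketch: indices of near-zero-variance predictors, then their names.
nzv <- nearZeroVar(dataset)
names(dataset)[nzv]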
## [1] 1993 102
Our dataset now has 1993 observations and 102 variables after removing the variables mentioned above.
Let's join our crime dataset with the state lookup table by state code.
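A minimal sketch of the join, assuming the state lookup table is stored in a data frame named states (the object and resulting column names are assumptions):

# Sketch: attach the state abbreviation to each community via the numeric state code.
dataset <- dataset %>%
  left_join(states %>% rename(state_abbr = state),
            by = c("state" = "state_code"))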
We are going to start with the analysis of the data for our project:
To obtain a violent crime rate per state, we average the violent crime rate (ViolentCrimesPerPop) over the communities of each state.
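A minimal dplyr sketch of this aggregation (the column names follow the table below; the exact code is an assumption):

# Sketch: average violent crime rate per state, ranked from highest to lowest.
stateCrime <- dataset %>%
  group_by(state_abbr) %>%
  summarise(AvgCrimesRate = mean(ViolentCrimesPerPop, na.rm = TRUE)) %>%
  arrange(desc(AvgCrimesRate)) %>%
  mutate(rank = row_number())
head(stateCrime, 10)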
In the following table we can see the 10 states with the highest rate of violent crimes:
| state_abbr | AvgCrimesRate | rank |
|---|---|---|
| DC | 1.0000000 | 1 |
| LA | 0.5045455 | 2 |
| SC | 0.4867857 | 3 |
| MD | 0.4800000 | 4 |
| FL | 0.4583333 | 5 |
| NC | 0.4019565 | 6 |
| AL | 0.3937209 | 7 |
| GA | 0.3840541 | 8 |
| DE | 0.3700000 | 9 |
| KS | 0.3600000 | 10 |
In first place we have the District of Columbia, in second place we have Louisiana and in third place South Carolina.
Violent Crime Rates by States:
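One possible way to draw this map uses the usmap package loaded above; the color scale and the stateCrime object from the sketch above are assumptions.

# Sketch: choropleth of the average violent crime rate by state.
plot_usmap(data = stateCrime %>% rename(state = state_abbr),
           values = "AvgCrimesRate", color = "white") +
  scale_fill_continuous(low = "lightyellow", high = "darkred",
                        name = "Avg violent crime rate") +
  theme(legend.position = "right")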
In the following table we can see the ten states with the highest average percentage of African-American population.
| state_abbr | AvgBlackPopRate | rank |
|---|---|---|
| DC | 1.0000000 | 1 |
| MS | 0.6900000 | 2 |
| GA | 0.6608108 | 3 |
| LA | 0.6245455 | 4 |
| DE | 0.6000000 | 5 |
| NC | 0.5580435 | 6 |
| SC | 0.5339286 | 7 |
| AL | 0.4702326 | 8 |
| MD | 0.4466667 | 9 |
| VA | 0.3972727 | 10 |
In first place we have the District of Columbia, in second place we have Mississippi and in third place Georgia.
Percentage of African-American population by state:
We can see that the District of Columbia ranks first in both crime rate and the percentage of African-American people. We might think that there is a relationship between these two aspects.
The crime rate in a state is not necessarily related to race simply because of the percentage of the population of a particular race. While the crime rate may be higher in some urban areas than in others, there is no evidence to suggest that the crime rate is caused by the race of the population.
It is important to note that the perception of race in the United States has historically been influenced by a variety of factors, including education, the media, and politics. Although crime can be a factor in influencing racial perceptions, it is only one of many factors that can influence how race is perceived in a given area.
We review which communities have the highest rates of violent crime.
Word Cloud of Communities with the highest crime rates:
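A sketch of how the word cloud might be generated with the wordcloud package; the number of communities shown and the frequency scaling are assumptions.

# Sketch: size each community name by its violent crime rate (top 100 communities).
commCrime <- dataset %>%
  select(communityname, ViolentCrimesPerPop) %>%
  arrange(desc(ViolentCrimesPerPop)) %>%
  head(100)
set.seed(42)
wordcloud(words = commCrime$communityname,
          freq = commCrime$ViolentCrimesPerPop * 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))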
The city of Camden in the state of New Jersey occupies first place with the highest rate of violent crimes. Among the top 10 communities with the highest violent crime rates, three are located in the state of Alabama.
To select the best variables for our analysis, we are going to measure the correlation of each independent variable with the dependent variable and eliminate the variables with low correlation. We will eliminate the variables whose absolute correlation with the dependent variable is less than 0.25.
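A minimal sketch of this filtering step, assuming the remaining predictor columns are numeric:

# Sketch: keep variables whose absolute correlation with the target is at least 0.25.
num_data <- dataset %>% select(where(is.numeric))
cors <- cor(num_data, num_data$ViolentCrimesPerPop, use = "pairwise.complete.obs")
keep <- rownames(cors)[abs(cors[, 1]) >= 0.25]
dataset <- dataset %>% select(all_of(keep))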
Finding variables with low correlation to dependent variable:
## [1] 1993 54
We now have 54 variables.
Let’s check for multicollinearity. We selected pairs of independent variables with absolute correlations greater than 0.9.
We are going to use the variance inflation factor (VIF) to remove multicollinear variables, eliminating those with the highest VIF scores.
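A sketch of this step with the car package is shown below; it assumes the data frame contains only the numeric predictors and the target, and the exact procedure (dropping the member of each highly correlated pair with the larger VIF) is an assumption. The correlated pairs and VIF scores found in the data are listed in the tables that follow.

# Sketch: find pairs with |correlation| > 0.9, compute VIFs on the full model,
# and drop the member of each pair with the larger VIF.
predMat <- dataset %>% select(-ViolentCrimesPerPop)
corMat <- cor(predMat)
highPairs <- which(abs(corMat) > 0.9 & upper.tri(corMat), arr.ind = TRUE)
vifs <- car::vif(lm(ViolentCrimesPerPop ~ ., data = dataset))
drop <- unique(apply(highPairs, 1, function(p)
  names(which.max(vifs[colnames(corMat)[p]]))))
dataset <- dataset %>% select(-all_of(drop))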
| Variable 1 | Variable 2 |
|---|---|
| PctRecImmig8 | PctRecImmig10 |
| population | numbUrban |
| PctFam2Par | PctKids2Par |
| PctLargHouseFam | PctLargHouseOccup |
| FemalePctDiv | TotalPctDiv |
| PctPersOwnOccup | PctHousOwnOcc |
| medIncome | medFamInc |
| MalePctDivorce | TotalPctDiv |
| PctBSorMore | PctOccupMgmtProf |
| population | NumUnderPov |
| PctLess9thGrade | PctNotHSGrad |
| NumUnderPov | NumIlleg |
| medFamInc | perCapInc |
| numbUrban | NumUnderPov |
| PctFam2Par | PctYoungKids2Par |
| PctKids2Par | PctYoungKids2Par |
| MalePctDivorce | FemalePctDiv |
| PctFam2Par | PctTeen2Par |
| PctKids2Par | PctTeen2Par |
| Variable | VIF_Score |
|---|---|
| TotalPctDiv | 928.212429 |
| FemalePctDiv | 295.188771 |
| MalePctDivorce | 203.199697 |
| PctRecImmig10 | 165.892005 |
| PctRecImmig8 | 147.580998 |
| PctLargHouseOccup | 129.488362 |
| PctLargHouseFam | 124.023935 |
| population | 117.672823 |
| PctFam2Par | 103.736936 |
| numbUrban | 101.604576 |
| medIncome | 98.892368 |
| PctKids2Par | 97.813837 |
| medFamInc | 95.037614 |
| PctPersOwnOccup | 78.204266 |
| PctHousOwnOcc | 76.714946 |
| PctNotHSGrad | 37.715581 |
| NumUnderPov | 33.986232 |
| perCapInc | 27.944187 |
| PctBSorMore | 25.418161 |
| PctPersDenseHous | 23.163968 |
| PctOccupMgmtProf | 23.097460 |
| PctLess9thGrade | 21.029438 |
| PctNotSpeakEnglWell | 19.853248 |
| PctPopUnderPov | 18.129689 |
| racePctWhite | 15.166439 |
| racepctblack | 14.945811 |
| NumIlleg | 14.432404 |
| pctWWage | 13.271613 |
| HousVacant | 12.765602 |
| PctYoungKids2Par | 12.151434 |
| pctWInvInc | 12.081673 |
| PctEmploy | 11.608310 |
| PctIlleg | 11.490275 |
| racePctHisp | 9.970625 |
| pctWPubAsst | 9.197796 |
| PctHousLess3BR | 8.210647 |
| RentLowQ | 8.141949 |
| PctHousNoPhone | 7.424820 |
| PctTeen2Par | 7.264545 |
| PctOccupManu | 6.099481 |
| PctUnemployed | 6.099450 |
| MalePctNevMarr | 5.330162 |
| NumImmig | 5.007225 |
| NumInShelters | 4.710435 |
| PctHousOccup | 3.543461 |
| PopDens | 2.833126 |
| MedNumBR | 2.657976 |
| NumStreet | 2.473209 |
| MedRentPctHousInc | 2.470435 |
| PctImmigRec10 | 2.372429 |
| PctVacantBoarded | 2.096964 |
| blackPerCap | 2.028969 |
| PctWOFullPlumb | 1.847284 |
## [1] 1993 40
We now have 40 variables.
We are going to create correlation plots to evaluate the most important variables in predicting violent crime rates.
corrMatrix <- round(cor(dataset[,c(20:39,40)]),4)
corrMatrix %>% corrplot(.,method="color",
type="lower", order="hclust",
addCoef.col = "black",
tl.col="blue", tl.srt=45,
sig.level = 0.01, insig = "blank",
diag=FALSE, number.cex = 0.8)

We can observe that some of the variables that influence violent crime rates are:
- racepctblack: percentage of population that is african american
- MalePctDivorce: percentage of males who are divorced
- PctPopUnderPov: percentage of people under the poverty level
- pctWPubAsst: percentage of households with public assistance income in 1989
- PctUnemployed: percentage of people 16 and over, in the labor force, and unemployed
- PctLess9thGrade: percentage of people 25 and over with less than a 9th grade education
- PctOccupManu: percentage of people 16 and over who are employed in manufacturing
- PctHousNoPhone: percent of occupied housing units without phone (in 1990, this was rare!)
- PctHousLess3BR: percent of housing units with less than 3 bedrooms
Let's partition the dataset: we will use 75% of the data for training and 25% for validation.
sample = sample.split(dataset$ViolentCrimesPerPop, SplitRatio = 0.75)
train = subset(dataset, sample == TRUE) %>% as.matrix()
test = subset(dataset, sample == FALSE) %>% as.matrix()
y_train <- train[,40]
y_test <- test[,40]
X_train <- train[,-40]
X_test <- test[,-40]

Ridge regression is a regularization technique used in statistical modeling to address the problem of multicollinearity in data. It is used when there is a high degree of correlation between the independent variables in a multiple linear regression model, which can lead to unstable coefficient estimates and unreliable predictions.
Ridge regression works by adding a regularization term to the model’s objective function, which penalizes the extreme values of the model’s coefficients and reduces their magnitude. This helps to reduce the variance of the model and improve its generalizability to new data.
Now we'll use the cv.glmnet() function to fit the ridge regression model, specifying alpha = 0.
set.seed(1801)
fit.ridge <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), alpha = 0, type.measure = "mse", family="gaussian")

To identify what value to use for lambda, we'll use s="lambda.min".
set.seed(1802)
fitted.ridge.train <- predict(fit.ridge, newx = data.matrix(X_train), s="lambda.min")
fitted.ridge.test <- predict(fit.ridge, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.ridge.train)[1])## Train coefficient: 0.8163905
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.ridge.test)[1])##
## Test coefficient: 0.7438792
The train coefficient is 0.816 and test coefficient is 0.778
LASSO Regression (Least Absolute Shrinkage and Selection Operator) is another regularization technique used in statistical modeling to address the problem of multicollinearity in data and to perform variable selection.
LASSO regression is used when there are a large number of predictor variables in a multiple linear regression model and you want to reduce the complexity of the model by removing irrelevant or less important variables in predicting the response variable.
LASSO regression shrinks the coefficients of some variables to zero, allowing you to identify the most important variables for prediction. It is used in situations where there are many predictor variables and a simpler, more easily interpretable linear regression model is needed, removing the less important variables.
Now we'll use the cv.glmnet() function to fit the LASSO regression model, specifying alpha = 1.
set.seed(1803)
fit.lasso <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=1, family="gaussian")

To identify what value to use for lambda, we'll use s="lambda.min".
set.seed(1804)
fitted.lasso.train <- predict(fit.lasso, newx = data.matrix(X_train), s="lambda.min")
fitted.lasso.test <- predict(fit.lasso, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.lasso.train)[1])## Train coefficient: 0.8172372
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.lasso.test)[1])##
## Test coefficient: 0.7475433
The train coefficient is 0.818 and test coefficient is 0.782
lassoPredPerf <- postResample(pred = fitted.lasso.test , obs = y_test)
lassoPredPerf['Family'] <- 'Linear Regression'
lassoPredPerf['Model'] <- 'Lasso Regression'

Elastic Net Regression is used in situations where there is a high correlation between the predictor variables and a more stable linear regression model with variable selection is needed, taking advantage of the properties of Ridge regression and LASSO regression.
set.seed(1805)
fit.elnet <- glmnet(as.matrix(X_train), as.matrix(y_train), family="gaussian", alpha=.5)
fit.elnet.cv <- cv.glmnet(as.matrix(X_train), as.matrix(y_train), type.measure="mse", alpha=.5,
family="gaussian")
fitted.elnet.train <- predict(fit.elnet.cv, newx = data.matrix(X_train), s="lambda.min")
fitted.elnet.test <- predict(fit.elnet.cv, newx = data.matrix(X_test), s="lambda.min")
cat("Train coefficient: ", cor(as.matrix(y_train), fitted.elnet.train)[1])## Train coefficient: 0.8171584
cat("\nTest coefficient: ", cor(as.matrix(X_test), fitted.elnet.test)[1])##
## Test coefficient: 0.7475421
The train coefficient is 0.817 and test coefficient is 0.7823
Nonlinear regression is used when the relationship between the response variable and the predictor variables in a regression model cannot be modeled by a linear function. In this case, a nonlinear function is needed to describe the relationship between the variables.
For this analysis we will use: Neural Networks, KNN (K-Nearest Neighbors), SVM (Support Vector Machines), and MARS (Multivariate Adaptive Regression Splines).
Neural Networks consist of layers of artificial neurons that process input information and generate output. Each neuron receives an input, performs a nonlinear transformation, and produces an output that is used as input to the next layer of neurons. The final output of the network is used to predict the output variable.
Let's use a neural network with 4 hidden units:
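The model summarised below could be fitted with caret along the following lines; this is a sketch, not necessarily the exact call, and the seed, bagging option, and iteration limit are assumptions.

# Sketch: model-averaged neural network with 4 hidden units and decay 0.01.
set.seed(1806)
fit.nnet <- train(x = X_train, y = y_train,
                  method = "avNNet",
                  tuneGrid = expand.grid(size = 4, decay = 0.01, bag = FALSE),
                  repeats = 4, linout = TRUE, trace = FALSE, maxit = 500,
                  trControl = trainControl(method = "cv", number = 10))
varImp(fit.nnet)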
## Model Averaged Neural Network with 4 Repeats
##
## a 39-4-1 network with 165 weights
## options were - linear output units decay=0.01
| Variable | Overall |
|---|---|
| X381 | 6.7398694 |
| X51 | 6.3103055 |
| X61 | 6.2222431 |
| X203 | 5.9591480 |
| X6 | 5.8826213 |
| X201 | 5.5498754 |
| X263 | 5.4310372 |
| X5 | 5.1515020 |
| X382 | 5.0220359 |
| X43 | 4.8912471 |
| X20 | 4.8783533 |
| X112 | 4.8140426 |
| X1 | 4.7783263 |
| X110 | 4.7690152 |
| X63 | 4.7159701 |
| X122 | 4.5094830 |
| X162 | 4.3341217 |
| X35 | 4.2289551 |
| X52 | 4.1469174 |
| X292 | 4.1228266 |
| X314 | 4.0869773 |
| X131 | 3.9967216 |
| X24 | 3.9724875 |
| X261 | 3.9636423 |
| X152 | 3.9390688 |
| X42 | 3.9161420 |
| X132 | 3.8229929 |
| X16 | 3.8053785 |
| X352 | 3.8035943 |
| X3 | 3.7820565 |
| X115 | 3.7322764 |
| X262 | 3.6579088 |
| X19 | 3.6429910 |
| X373 | 3.6126446 |
| X53 | 3.6026097 |
| X353 | 3.5938262 |
| X153 | 3.4994534 |
| X363 | 3.4346810 |
| X253 | 3.4101175 |
| X41 | 3.3460698 |
| X29 | 3.3287481 |
| X133 | 3.2731705 |
| X114 | 3.2546826 |
| X192 | 3.2128204 |
| X23 | 3.2027186 |
| X111 | 3.1942049 |
| X273 | 3.1761039 |
| X272 | 3.1058894 |
| X242 | 3.0439990 |
| X291 | 3.0126065 |
| X11 | 3.0082926 |
| X333 | 2.9858225 |
| X4 | 2.9815636 |
| X151 | 2.9397707 |
| X13 | 2.9268070 |
| X231 | 2.9239169 |
| X313 | 2.8951252 |
| X62 | 2.7860604 |
| X351 | 2.7686765 |
| X241 | 2.7621327 |
| X310 | 2.7524022 |
| X283 | 2.6893620 |
| X38 | 2.6521169 |
| X15 | 2.6489619 |
| X372 | 2.6380148 |
| X26 | 2.6321252 |
| X103 | 2.6213772 |
| X73 | 2.6080194 |
| X18 | 2.5629981 |
| X312 | 2.5572842 |
| X331 | 2.5019254 |
| X123 | 2.4748309 |
| X22 | 2.4552364 |
| X252 | 2.4517814 |
| X282 | 2.4235741 |
| X251 | 2.3844613 |
| X302 | 2.3809152 |
| X10 | 2.3107638 |
| X202 | 2.3019756 |
| X72 | 2.2820524 |
| X362 | 2.2604559 |
| X243 | 2.2578241 |
| X82 | 2.2283661 |
| X32 | 2.2205998 |
| X221 | 2.2081898 |
| X271 | 2.1931332 |
| X33 | 2.1907483 |
| X71 | 2.1906483 |
| X12 | 2.1723410 |
| X383 | 2.1610762 |
| X113 | 2.0981033 |
| X392 | 2.0904249 |
| X14 | 2.0831274 |
| X101 | 2.0732813 |
| X163 | 2.0365666 |
| X183 | 2.0361661 |
| X371 | 2.0226587 |
| X343 | 2.0117824 |
| X27 | 1.9930812 |
| X7 | 1.9908562 |
| X181 | 1.9892146 |
| X393 | 1.9765550 |
| X17 | 1.9303000 |
| X322 | 1.8715190 |
| X171 | 1.8714280 |
| X212 | 1.8527905 |
| X36 | 1.8132621 |
| X8 | 1.7472199 |
| X141 | 1.7260972 |
| X222 | 1.7246995 |
| X91 | 1.6764881 |
| X161 | 1.6482376 |
| X172 | 1.6109991 |
| X25 | 1.5993161 |
| X34 | 1.5983855 |
| X293 | 1.5953200 |
| X37 | 1.5940642 |
| X321 | 1.4812944 |
| X121 | 1.4779253 |
| X191 | 1.4501799 |
| X214 | 1.3917492 |
| X311 | 1.3491091 |
| X281 | 1.3386165 |
| X9 | 1.3281336 |
| X233 | 1.3232861 |
| X361 | 1.2996879 |
| X301 | 1.2184606 |
| X323 | 1.2016866 |
| X213 | 1.1748907 |
| X93 | 1.1346396 |
| X215 | 1.1077592 |
| X341 | 1.1071128 |
| X143 | 1.0938873 |
| X211 | 1.0883222 |
| X2 | 1.0867673 |
| X173 | 1.0835380 |
| X315 | 1.0732571 |
| X31 | 1.0189325 |
| X28 | 0.9983832 |
| X21 | 0.9924483 |
| X223 | 0.9916175 |
| X303 | 0.9913695 |
| X332 | 0.9855879 |
| X102 | 0.9755889 |
| X81 | 0.9578084 |
| X210 | 0.8946268 |
| X232 | 0.8433051 |
| X83 | 0.7853355 |
| X193 | 0.6932265 |
| X92 | 0.6482776 |
| X391 | 0.5996399 |
| X142 | 0.5815349 |
| X182 | 0.5648157 |
| X30 | 0.5363145 |
| X342 | 0.3200149 |
| X39 | 0.2727154 |
The KNN (K-Nearest Neighbors) model is a supervised machine learning method for classification and regression. It predicts the class or value of a new data instance from the characteristics of the nearest neighboring instances in the training set.
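A minimal sketch of how this KNN model can be fit with caret follows, again assuming the X_train and y_train objects; with tuneLength = 10, caret evaluates the odd k values 5 through 23 seen in the output.

```r
# Sketch: KNN regression via caret. Predictors are centered and scaled and the
# model is tuned with 10-fold cross-validation, as in the summary below.
library(caret)
set.seed(123)
knn_fit <- train(
  x = X_train, y = y_train,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = trainControl(method = "cv", number = 10),
  tuneLength = 10      # evaluates k = 5, 7, ..., 23
)
knn_fit          # smallest RMSE selects k = 23
varImp(knn_fit)  # variable importance table shown below
```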
## k-Nearest Neighbors
##
## 1499 samples
## 39 predictor
##
## Pre-processing: centered (39), scaled (39)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1349, 1350, 1349, 1350, 1348, 1350, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1470650 0.6164570 0.10158827
## 7 0.1439163 0.6323808 0.09915134
## 9 0.1443167 0.6312507 0.09954053
## 11 0.1431314 0.6398771 0.09805782
## 13 0.1433472 0.6397882 0.09817429
## 15 0.1434775 0.6403601 0.09847477
## 17 0.1433104 0.6436613 0.09814664
## 19 0.1431045 0.6460178 0.09809048
## 21 0.1433757 0.6459280 0.09837959
## 23 0.1427781 0.6499200 0.09771248
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 23.
| Variable | Overall |
|---|---|
| PctIlleg | 100.000000 |
| racePctWhite | 84.463999 |
| PctTeen2Par | 78.070630 |
| PctYoungKids2Par | 76.886371 |
| racepctblack | 67.653183 |
| pctWPubAsst | 58.263375 |
| pctWInvInc | 56.862002 |
| PctPopUnderPov | 43.912891 |
| MalePctDivorce | 42.763202 |
| PctUnemployed | 42.212224 |
| NumIlleg | 41.200287 |
| PctVacantBoarded | 37.308768 |
| PctHousLess3BR | 35.571028 |
| PctHousOwnOcc | 35.007781 |
| PctHousNoPhone | 34.742189 |
| PctPersDenseHous | 33.972668 |
| HousVacant | 29.811069 |
| PctLess9thGrade | 24.391977 |
| PctLargHouseFam | 20.652709 |
| NumInShelters | 20.190012 |
| PctWOFullPlumb | 15.783587 |
| NumStreet | 14.856954 |
| perCapInc | 13.844531 |
| MedNumBR | 13.694854 |
| PctOccupMgmtProf | 11.999077 |
| MedRentPctHousInc | 10.471355 |
| PctEmploy | 10.447415 |
| PctNotSpeakEnglWell | 10.337888 |
| racePctHisp | 8.862438 |
| MalePctNevMarr | 8.618872 |
| PopDens | 7.875297 |
| NumImmig | 7.649018 |
| pctWWage | 6.623132 |
| PctOccupManu | 6.297929 |
| PctHousOccup | 6.259769 |
| PctImmigRec10 | 4.935118 |
| blackPerCap | 3.242936 |
| PctRecImmig8 | 3.227065 |
| RentLowQ | 0.000000 |
The SVM (Support Vector Machine) model is a supervised machine learning method for classification and regression. It is particularly useful when the data is linearly or nearly linearly separable in the feature space, and it is efficient at handling high-dimensional data and small to medium data sets.
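A minimal sketch of the radial-kernel SVM fit via caret follows, under the same X_train / y_train assumption; caret estimates the kernel parameter sigma once and searches the cost C over 0.25, 0.5, ..., 128.

```r
# Sketch: SVM with a radial basis function kernel via caret (kernlab backend).
library(caret)
set.seed(123)
svm_fit <- train(
  x = X_train, y = y_train,
  method     = "svmRadial",
  preProcess = c("center", "scale"),
  trControl  = trainControl(method = "cv", number = 10),
  tuneLength = 10      # C grid from 0.25 to 128, sigma held constant
)
svm_fit          # best model: sigma ~ 0.026, C = 2
varImp(svm_fit)  # variable importance table shown below
```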
## Support Vector Machines with Radial Basis Function Kernel
##
## 1499 samples
## 39 predictor
##
## Pre-processing: centered (39), scaled (39)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1348, 1349, 1350, 1348, 1349, 1350, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1496910 0.6267657 0.09824670
## 0.50 0.1439197 0.6428549 0.09545459
## 1.00 0.1404450 0.6500871 0.09391998
## 2.00 0.1397640 0.6489056 0.09407907
## 4.00 0.1420422 0.6383513 0.09634762
## 8.00 0.1446649 0.6268358 0.09911138
## 16.00 0.1487816 0.6078767 0.10298394
## 32.00 0.1540362 0.5845711 0.10856409
## 64.00 0.1603825 0.5586436 0.11452927
## 128.00 0.1638883 0.5461915 0.11822571
##
## Tuning parameter 'sigma' was held constant at a value of 0.02599365
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02599365 and C = 2.
| Variable | Overall |
|---|---|
| PctIlleg | 100.000000 |
| racePctWhite | 84.463999 |
| PctTeen2Par | 78.070630 |
| PctYoungKids2Par | 76.886371 |
| racepctblack | 67.653183 |
| pctWPubAsst | 58.263375 |
| pctWInvInc | 56.862002 |
| PctPopUnderPov | 43.912891 |
| MalePctDivorce | 42.763202 |
| PctUnemployed | 42.212224 |
| NumIlleg | 41.200287 |
| PctVacantBoarded | 37.308768 |
| PctHousLess3BR | 35.571028 |
| PctHousOwnOcc | 35.007781 |
| PctHousNoPhone | 34.742189 |
| PctPersDenseHous | 33.972668 |
| HousVacant | 29.811069 |
| PctLess9thGrade | 24.391977 |
| PctLargHouseFam | 20.652709 |
| NumInShelters | 20.190012 |
| PctWOFullPlumb | 15.783587 |
| NumStreet | 14.856954 |
| perCapInc | 13.844531 |
| MedNumBR | 13.694854 |
| PctOccupMgmtProf | 11.999077 |
| MedRentPctHousInc | 10.471355 |
| PctEmploy | 10.447415 |
| PctNotSpeakEnglWell | 10.337888 |
| racePctHisp | 8.862438 |
| MalePctNevMarr | 8.618872 |
| PopDens | 7.875297 |
| NumImmig | 7.649018 |
| pctWWage | 6.623132 |
| PctOccupManu | 6.297929 |
| PctHousOccup | 6.259769 |
| PctImmigRec10 | 4.935118 |
| blackPerCap | 3.242936 |
| PctRecImmig8 | 3.227065 |
| RentLowQ | 0.000000 |
The MARS model (Multivariate Adaptive Regression Splines) is a supervised machine learning method for regression and classification. It is a non-parametric technique that approximates a complex function with combinations of simple piecewise-linear (hinge) functions.
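The earth call summarised below uses the X_train / y_train objects directly; a minimal sketch, where the “plotmo grid” printout comes from the diagnostic plots:

```r
library(earth)

# Sketch of the MARS fit summarised below.
mars_fit <- earth(x = X_train, y = y_train)
summary(mars_fit)   # hinge-function coefficients, selected terms, GCV/GRSq/RSq
plotmo(mars_fit)    # response curves; also prints the "plotmo grid" shown below
evimp(mars_fit)     # earth's own variable-importance measure
```

The importance values in the table below appear to be scaled to 0–100 (as caret::varImp reports them); evimp() gives the same ranking in earth’s native units.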
## Call: earth(x=X_train, y=y_train)
##
## coefficients
## (Intercept) 0.4494353
## h(racepctblack-0.86) -0.9662119
## h(0.21-racePctWhite) 0.5923169
## h(racePctWhite-0.21) -0.2197684
## h(pctWWage-0.73) -0.4280440
## h(0.47-pctWInvInc) 0.3971654
## h(0.57-MalePctDivorce) -0.1789002
## h(MalePctDivorce-0.57) 0.1684589
## h(0.55-PctTeen2Par) 0.2757925
## h(0.4-PctIlleg) -0.3027351
## h(PctIlleg-0.4) 0.1214990
## h(0.61-PctHousLess3BR) 0.1500186
## h(PctHousLess3BR-0.61) 0.3421317
## h(0.12-HousVacant) -0.6673085
## h(HousVacant-0.12) 0.1483440
## h(0.02-NumStreet) -2.1035280
## h(NumStreet-0.02) 0.1050918
##
## Selected 17 of 21 terms, and 10 of 39 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: PctIlleg, PctTeen2Par, NumStreet, pctWInvInc, HousVacant, ...
## Number of terms at each degree of interaction: 1 16 (additive model)
## GCV 0.01872543 RSS 26.84714 GRSq 0.6611047 RSq 0.6754289
## plotmo grid: racepctblack racePctWhite racePctHisp pctWWage pctWInvInc
## 0.06 0.85 0.04 0.57 0.48
## pctWPubAsst perCapInc blackPerCap PctPopUnderPov PctLess9thGrade PctUnemployed
## 0.26 0.31 0.25 0.24 0.27 0.33
## PctEmploy PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr
## 0.51 0.37 0.4 0.46 0.4
## PctYoungKids2Par PctTeen2Par NumIlleg PctIlleg NumImmig PctImmigRec10
## 0.7 0.6 0.01 0.17 0.01 0.43
## PctRecImmig8 PctNotSpeakEnglWell PctLargHouseFam PctPersDenseHous
## 0.09 0.06 0.21 0.11
## PctHousLess3BR MedNumBR HousVacant PctHousOccup PctHousOwnOcc PctVacantBoarded
## 0.51 0.5 0.03 0.77 0.54 0.13
## PctHousNoPhone PctWOFullPlumb RentLowQ MedRentPctHousInc NumInShelters
## 0.18 0.2 0.32 0.48 0
## NumStreet PopDens
## 0 0.17
| Variable | Overall |
|---|---|
| PctIlleg | 100.00000 |
| PctTeen2Par | 40.76356 |
| NumStreet | 40.76356 |
| pctWInvInc | 35.12809 |
| HousVacant | 28.69767 |
| racePctWhite | 23.49241 |
| PctHousLess3BR | 19.62324 |
| MalePctDivorce | 12.42799 |
| pctWWage | 10.39611 |
| racepctblack | 6.97419 |
The Decision Tree model is used in supervised machine learning for classification and regression. It builds a tree from the training data to predict the label or value of a new data instance. It is useful when the data has non-linear relationships or complex interactions between the input features, and it is efficient at processing large data sets.
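The regression tree whose summary follows was fit with rpart on the training split; a minimal sketch:

```r
library(rpart)
library(rpart.plot)

# Regression tree on the training split, with ViolentCrimesPerPop as the response.
tree_fit <- rpart(ViolentCrimesPerPop ~ ., data = as.data.frame(train))
printcp(tree_fit)      # complexity-parameter (CP) table shown below
rpart.plot(tree_fit)   # draw the unpruned tree
```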
##
## Regression tree:
## rpart(formula = ViolentCrimesPerPop ~ ., data = as.data.frame(train))
##
## Variables actually used in tree construction:
## [1] MalePctDivorce NumStreet PctIlleg pctWPubAsst racePctHisp
##
## Root node error: 82.716/1499 = 0.055181
##
## n= 1499
##
## CP nsplit rel error xerror xstd
## 1 0.415223 0 1.00000 1.00154 0.050389
## 2 0.067533 1 0.58478 0.62276 0.030649
## 3 0.047360 2 0.51724 0.54645 0.026916
## 4 0.021278 3 0.46988 0.52341 0.027361
## 5 0.018565 4 0.44861 0.53202 0.027842
## 6 0.016753 5 0.43004 0.51126 0.027449
## 7 0.013985 6 0.41329 0.50114 0.026976
## 8 0.010000 7 0.39930 0.48388 0.026672
To find the best point at which to prune the tree, we need to use cross-validation and examine the xerror column of the complexity parameter (CP) table.
## CP nsplit rel error xerror xstd
## 1 0.41522304 0 1.0000000 1.0015384 0.05038930
## 2 0.06753305 1 0.5847770 0.6227624 0.03064924
## 3 0.04735950 2 0.5172439 0.5464481 0.02691551
## 4 0.02127802 3 0.4698844 0.5234057 0.02736072
## 5 0.01856464 4 0.4486064 0.5320167 0.02784219
## 6 0.01675309 5 0.4300417 0.5112581 0.02744900
## 7 0.01398494 6 0.4132887 0.5011393 0.02697578
## 8 0.01000000 7 0.3993037 0.4838784 0.02667223
The cross-validation error curve is at its lowest around a tree size of 6, so we will prune our tree to that size. At size 6, the cp is 0.0118 and the error is about 0.40.
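A minimal pruning sketch, assuming the tree_fit object from the sketch above and the cp value read off the cross-validation curve:

```r
# Prune at the cp suggested by the CV curve and re-plot the smaller tree.
pruned_fit <- prune(tree_fit, cp = 0.0118)
rpart.plot(pruned_fit)
```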
Model Accuracy is a measure of how well the model fits the training and test data. It is used to evaluate the performance of regression models and to compare different models.
## [1] 0.01012146
The accuracy of the model is low, even after pruning the tree.
Random forest is a supervised machine learning model used for classification and regression. It combines multiple decision trees to improve prediction accuracy and reduce overfitting. It is useful when working with large data sets and complex features, and when we want to avoid overfitting and improve model accuracy.
In this case the test error appears to be greater than the OOB error; ideally, the out-of-bag and test errors should roughly line up.
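The random forest summarised below can be reproduced with a call like the following sketch; train2 is the training split and mtry is set to the number of predictors (39), which makes the forest behave like bagging.

```r
library(randomForest)

set.seed(123)
mtry   <- 39   # all predictors considered at each split
rf_fit <- randomForest(ViolentCrimesPerPop ~ ., data = train2,
                       mtry = mtry, ntree = 200, proximity = TRUE)
rf_fit              # OOB mean of squared residuals and % variance explained
varImpPlot(rf_fit)  # importance plots discussed below
```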
##
## Call:
## randomForest(formula = ViolentCrimesPerPop ~ ., data = train2, mtry = mtry, ntree = 200, proximity = TRUE)
## Type of random forest: regression
## Number of trees: 200
## No. of variables tried at each split: 39
##
## Mean of squared residuals: 0.01949456
## % Var explained: 64.67
The graphs show the importance of each variable in predicting violent crime. The mean decrease in accuracy shows how much the model’s accuracy drops when a variable is removed. PctIlleg, the percentage of kids born to never-married parents, is the most important variable, by a very large margin over the second-ranked variable.
Boosting is a supervised machine learning method used for regression and classification problems. It is an ensemble technique that combines multiple weak decision trees to form a stronger and more accurate model.
This method is especially useful in complex data analysis and prediction problems where high accuracy and good performance are required. It is suitable for both structured and unstructured data and can handle a large number of features.
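A minimal sketch of the boosted model via caret’s xgbTree method follows; the grid size is an assumption, but the selected parameters reported in the table below fall within caret’s default grid.

```r
# Sketch: gradient boosting (XGBoost) tuned with caret.
library(caret)
library(xgboost)
set.seed(123)
xgb_fit <- train(
  x = X_train, y = y_train,
  method     = "xgbTree",
  trControl  = trainControl(method = "cv", number = 10),
  tuneLength = 3       # assumption; default grid over nrounds, depth, eta, ...
)
xgb_fit$bestTune   # the winning combination, shown in the table below
varImp(xgb_fit)    # PctIlleg dominates, as discussed below
```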
| | nrounds | max_depth | eta | gamma | colsample_bytree | min_child_weight | subsample |
|---|---|---|---|---|---|---|---|
| 34 | 50 | 2 | 0.3 | 0 | 0.8 | 1 | 1 |
According to the graphs, we can see which variable matters most for predicting violent crime. The most important variable is PctIlleg: the percentage of kids born to never-married parents.
Let’s graph the tree models used.
| Family | Model | RMSE | Rsquared | MAE |
|---|---|---|---|---|
| Linear Regression | Ridge Regression | 0.1376 | 0.6336 | 0.0930 |
| Linear Regression | Lasso Regression | 0.1370 | 0.6374 | 0.0928 |
| Linear Regression | Elastic Net Regression | 0.1370 | 0.6372 | 0.0927 |
| Non-Linear Regression | Neural Network | 0.1379 | 0.6341 | 0.0894 |
| Non-Linear Regression | KNN | 0.1413 | 0.6148 | 0.0935 |
| Non-Linear Regression | SVM | 0.1361 | 0.6448 | 0.0877 |
| Non-Linear Regression | MARS | 0.1369 | 0.6410 | 0.0916 |
| Trees & Boosting | Decision Tree | 0.1690 | 0.4587 | 0.1172 |
| Trees & Boosting | Random Forest | NA | 0.0006 | NA |
| Trees & Boosting | XGBoost | 0.1398 | 0.6260 | 0.0940 |
Reviewing the table that summarizes all the models used in our analysis, we can see the following:
In general, all the models perform well, with an R2 above 60%, except for the simple decision tree model, which has an R2 of approximately 45%, and the random forest model, which reports less than 1%.
We consider the best model to be the SVM (Support Vector Machine), with an RMSE of 0.136 and an R2 of approximately 64.5%.
Based on the data and data exploration conducted, we were able to analyze a large number of factors simultaneously to understand how the relationships between the different factors affect crime rates in our communities and states.
According to the analysis, we observed that the percentage of children born to never-married parents is a factor that considerably affects the rate of violent crime. While it is true that family structure can have some influence on children’s behavior and development, delinquency is a complex problem with multiple causes and cannot be attributed to a single factor. And while there are other factors, such as race, that the data suggests could influence the violent crime rate, crime may also be related to poverty, economic inequality, access to guns, drug and alcohol abuse, discrimination, and racism, among other factors.
Therefore, to address the problem of delinquency, it is necessary to take into account a wide variety of factors, including but not limited to family structure.
Violent crime is a complex problem with multiple causes in the United States. Some of the possible causes of violent crime in communities are the following:
Socioeconomic inequality: Economic and social inequality can create tensions and conflicts among members of a community, which in turn can lead to violence.
Unemployment: Unemployment can increase despair and hopelessness, which can lead some people to turn to crime to survive.
Poverty: Poverty may be linked to increased violence, as people living in poverty may have less access to resources and opportunities, which can lead to crime.
Drugs and alcohol: Drug and alcohol abuse can increase a person’s likelihood of committing a violent crime.
Racism and Discrimination: Racism and discrimination can contribute to violence, especially in communities where there are tensions between different racial or ethnic groups.
It is important to note that these are just a few of the possible causes of violent crime in communities across the United States, and that each case is unique and may have multiple contributing factors. Therefore, it is necessary to address these problems holistically and focus on finding specific solutions for each community. It is important to avoid stigmatizing certain groups in society and instead focus on understanding and addressing the underlying causes of crime so that effective and sustainable action can be taken to reduce it.
Resource: https://www.kaggle.com/datasets/michaelbryantds/crimedata?select=crimedata.csv