Understanding where violent crime happens can be a key to understanding why it happens. Environmental, social and population characteristics may be important predictors of the level of violent crime in a population. Determining which of these are most influential will provide valuable input to neighborhood design, urban development, and policing practices.
Part one of this analysis examines the available characteristics and whether they can be used to predict the level of violent crime in a population, using the USA Communities and Crime Data Set sourced from the UCI Machine Learning Repository.
This assignment investigates the USA Communities and Crime Data Set, sourced from the UCI Machine Learning Repository. Creator: Michael Redmond (redmond ‘@’ lasalle.edu); Computer Science; La Salle University; Philadelphia, PA, 19141, USA. The dataset combines socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Administrative Statistics (LEMAS) survey, and crime data from the 1995 FBI UCR.
The per capita violent crimes variable is calculated using per community population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault in each community.
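The calculation described above can be sketched as follows. The counts are made-up toy values for three hypothetical communities, and the scaling to a rate per 100,000 residents is an assumption based on the magnitude of the values reported later in this analysis.

```r
# Toy counts for three hypothetical communities (not taken from the dataset)
crimes <- data.frame(
  population = c(50000, 120000, 8000),
  murders    = c(2, 10, 0),
  rapes      = c(15, 60, 1),
  robberies  = c(80, 400, 3),
  assaults   = c(150, 700, 12)
)

# Sum the four violent-crime counts for each community
violent_total <- rowSums(crimes[, c("murders", "rapes", "robberies", "assaults")])

# Express the total as a rate per 100,000 residents (scaling is an assumption)
crimes$ViolentCrimesPerPop <- violent_total / crimes$population * 1e5

print(crimes[, c("population", "ViolentCrimesPerPop")])
```

Expressing the total as a rate rather than a raw count makes communities of very different sizes comparable.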
The dataset contains a large amount of information collected from each community, which can be summarized in the broad categories of race, age, employment, marital status, immigration and home ownership.
The dataset is loaded from the provided csv file, which has the header information provided separately.
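# Load the data; the file itself has no header row
df.raw <- read.csv("CommViolPredUnnormalizedData.csv", header = FALSE,
                   stringsAsFactors = FALSE)
# Keep columns 1 to 129 plus column 146, which holds the response variable
df.raw <- df.raw[c(1:129, 146)]
# Column names are stored in a separate file, one name per row
df.headers <- read.csv("CommViolPredUnnormalizedNames.csv", header = FALSE,
                       stringsAsFactors = FALSE)
names(df.raw) <- as.vector(df.headers[, 1])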
The data summary shows that the dataset is a mixture of numerical and categorical data. The target variable is ViolentCrimesPerPop, the number of violent crimes per 100,000 population.
Some variables in the dataset will not contribute to the accurate prediction of violent crime levels. These include the community identifiers (state, county, community and community name), which can be removed from the feature dataset. There is also a non-predictive variable, ‘fold’, which holds a fold number for non-random 10-fold cross-validation and is potentially useful for debugging and paired tests. It is left in the dataset for debugging but is not included in the data analysis.
Since the dataset combines data from different sources, some variables appear to represent similar information, e.g. ‘householdsize’ (mean people per household) and ‘PersPerOccupHous’ (mean persons per household). In this case, ‘householdsize’ is removed from the dataset.
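# Inspect the full dataset, then compare the two near-duplicate variables
summary(df.raw)
summary(df.raw[, c("householdsize", "PersPerOccupHous")])
# Drop the community identifiers and the redundant household-size variable
coltoRemove <- names(df.raw) %in% c("communityname", "state", "countyCode",
                                    "communityCode", "householdsize")
df.working <- df.raw[!coltoRemove]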
Checking the dataset for missing or null data showed that, while there are no NA values, missing values appear in the form of “?”. There are 25 columns with missing values, which mostly describe community identifiers and policing information. Most of these have a large proportion of the data missing (> 50%), which renders them unusable. The exception is the variable “OtherPerCap”, with only one missing value. There are also missing values for the response variable “ViolentCrimesPerPop”.
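A minimal sketch of this check is shown below. It uses a small toy data frame with made-up values, and the names `df.demo`, `missing_share` and `mostly_missing` are hypothetical, so it illustrates the idea rather than reproducing the original code.

```r
# Toy data frame with "?" placeholders (values are made up for illustration)
df.demo <- data.frame(
  OtherPerCap   = c("10400", "?", "8700", "9100"),
  LemasSwornFT  = c("?", "?", "?", "120"),
  PctUnemployed = c(5.2, 6.1, 4.8, 7.0),
  stringsAsFactors = FALSE
)

# is.na() finds nothing, because "?" is an ordinary string
sum(is.na(df.demo))

# Proportion of "?" entries in each column
missing_share <- vapply(df.demo, function(col) mean(col == "?"), numeric(1))

# Columns with any missing values, and those missing in more than 50% of rows
print(missing_share[missing_share > 0])
mostly_missing <- names(missing_share)[missing_share > 0.5]
print(mostly_missing)
```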
The columns with large proportions of missing data are removed from the dataset. These are the variables originating from the 1990 Law Enforcement Management and Admin Stats survey (Lemas). Only one variable from the Lemas dataset remains after they are removed: LemasPctOfficDrugUn, the percentage of officers assigned to drug units.
For the other columns with missing data, the missing value of variable “OtherPerCap” is replaced with the average “OtherPerCap” value. The samples with missing data for the response variable “ViolentCrimesPerPop” are removed from the dataset.
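The two steps can be sketched on toy data as follows. The values are made up, and computing the mean after dropping the rows with a missing response is an assumption, since the order of the two steps is not stated.

```r
# Toy data with "?" placeholders (values are made up for illustration)
df.demo <- data.frame(
  OtherPerCap         = c("10000", "?", "8000", "9000", "7000"),
  ViolentCrimesPerPop = c("410.5", "120.0", "?", "980.2", "55.1"),
  stringsAsFactors = FALSE
)

# Turn the "?" placeholders into NA and convert every column to numeric
df.demo[df.demo == "?"] <- NA
df.demo[] <- lapply(df.demo, as.numeric)

# Remove samples whose response value is missing
df.demo <- df.demo[!is.na(df.demo$ViolentCrimesPerPop), ]

# Replace the missing predictor value with the mean of the observed values
other_mean <- mean(df.demo$OtherPerCap, na.rm = TRUE)
df.demo$OtherPerCap[is.na(df.demo$OtherPerCap)] <- other_mean

print(df.demo)
```

Mean imputation is reasonable here because only a single value is affected. With many missing values it would shrink the variance of the variable.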
Exploring the response variable ViolentCrimesPerPop shows that it has a continuous, right-skewed (positively skewed) distribution. Values range from 0 to 4877, with a large number of outliers in the upper range.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 161.7 374.1 589.1 794.4 4877.0
ViolentCrimesPerPop Distribution
Testing the data for normality with the Shapiro-Wilk test showed that none of the variables in the dataset pass. This is confirmed by plotting the distributions of some of the variables. Normality is an assumption of many statistical tests, and transformations may be required to get the best predictive modelling results.
## [1] 0
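One way to run such a check across all columns is sketched below, using simulated toy variables rather than the crime data. The 0.05 significance level and the names `df.demo`, `p_values` and `n_pass` are assumptions made for the illustration.

```r
set.seed(42)

# Simulated toy variables: one roughly normal, two clearly skewed
df.demo <- data.frame(
  normal_like  = rnorm(500, mean = 50, sd = 10),
  right_skewed = rexp(500, rate = 0.1),
  heavy_tailed = rlnorm(500, meanlog = 3, sdlog = 1)
)

# Shapiro-Wilk p-value for each column (shapiro.test accepts 3 to 5000 values)
p_values <- vapply(df.demo, function(col) shapiro.test(col)$p.value, numeric(1))

# Number of variables for which normality is not rejected at the 5% level
n_pass <- sum(p_values > 0.05)

print(p_values)
print(n_pass)
```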
Example Non-Normal Distributions
To quantify how skewed each variable is, its skewness is computed using the provided skewness function. A variable whose absolute skewness exceeds 1.96 is considered “non-normal”. The variables beyond this threshold, shown below, are investigated for possible transformation.
## population racepctblack racePctAsian
## 26.462913 2.263018 4.746267
## racePctHisp agePct12t21 agePct12t29
## 3.198960 3.394310 2.492754
## agePct16t24 numbUrban pctWFarmSelf
## 3.957453 26.150247 2.585836
## perCapInc whitePerCap blackPerCap
## 2.313688 2.450632 8.064897
## indianPerCap AsianPerCap OtherPerCap
## 15.815086 2.624990 5.523497
## HispPerCap NumUnderPov PctLess9thGrade
## 2.278997 27.074392 1.964911
## NumKidsBornNeverMar PctKidsBornNeverMar NumImmig
## 28.957225 2.087402 30.647294
## PctRecentImmig PctRecImmig5 PctRecImmig8
## 2.827920 2.777962 2.689124
## PctRecImmig10 PctSpeakEnglOnly PctNotSpeakEnglWell
## 2.831551 -2.608552 3.876151
## PctLargHouseFam PctLargHouseOccup PctPersDenseHous
## 3.316681 3.795706 4.099861
## HousVacant PctHousOccup PctVacantBoarded
## 16.346151 -3.444287 3.574615
## PctWOFullPlumb OwnOccQrange NumInShelters
## 3.339074 2.200364 33.346872
## NumStreet PctForeignBorn LandArea
## 35.181044 2.457284 23.077605
## PopDens PctUsePubTrans LemasPctOfficDrugUn
## 4.444080 3.632173 5.015527
## ViolentCrimesPerPop
## 2.063762
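The screening step can be sketched as follows. The skewness function used in the analysis is not shown, so `sample_skewness` below is a stand-in based on the standard moment definition, and the data are simulated toy variables.

```r
# Moment-based sample skewness (a stand-in for the provided function)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)

# Simulated toy variables: symmetric, long right tail, long left tail
df.demo <- data.frame(
  symmetric  = rnorm(1000),
  right_tail = rlnorm(1000, sdlog = 1),
  left_tail  = -rlnorm(1000, sdlog = 1)
)

# Skewness of every column, then the ones beyond the +/- 1.96 threshold
skews <- vapply(df.demo, sample_skewness, numeric(1))
flagged <- skews[abs(skews) > 1.96]

print(flagged)
```

Positive values indicate a long right tail and negative values a long left tail, which is why only a few variables in the list above carry a minus sign.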
One group of variables in the high-skewness list is “racepctblack”, “racePctAsian” and “racePctHisp”. Together with “racePctWhite”, these should sum to 100%, which makes them compositional variables. Compositional variables are not independent of each other and require transformation. The Box-Cox transformation is used here because, for a positive power parameter, it can handle observations that include true zeros.
‘racePctAsian’ QQplots before and after BC Transformations
After the Box-Cox transformations, the absolute skewness of each variable is below 1.96, indicating that the transformations have reduced the skewness as required. This is also visible in the Q-Q plots of the variable before and after the transformation.
## racepctblackBC racePctWhiteBC racePctHispBC racePctAsianBC
## -0.002094278 -0.485778963 0.080314736 0.008046790
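A minimal sketch of a Box-Cox transformation applied to a skewed percentage that contains true zeros is given below. How the power parameter was chosen in the analysis is not stated, so the grid search that minimises absolute skewness is an assumption made for illustration (maximum likelihood, as in `MASS::boxcox`, is the more common choice). The helper names and the simulated data are hypothetical.

```r
# Box-Cox power transform; with lambda > 0 it stays finite at x = 0
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Moment-based sample skewness
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical selection rule: the lambda on a grid that minimises |skewness|
choose_lambda <- function(x, grid = seq(0.05, 2, by = 0.05)) {
  skews <- vapply(grid, function(l) abs(sample_skewness(box_cox(x, l))),
                  numeric(1))
  grid[which.min(skews)]
}

set.seed(11)

# Simulated right-skewed percentage with two true zeros, capped at 100
pct <- pmin(c(0, 0, rexp(498, rate = 0.2)), 100)

lambda <- choose_lambda(pct)
pctBC  <- box_cox(pct, lambda)

print(lambda)
print(c(before = sample_skewness(pct), after = sample_skewness(pctBC)))
```

Restricting the grid to positive values keeps the transformed zeros finite, at the cost of ruling out the log transform (lambda = 0).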
The Box-Cox transformation was used to transform the remainder of the highly right-skewed (positively skewed) variables, using the same method as outlined above and replacing each original variable with its transformed version.
The variables with highly negatively skewed distributions, namely ‘PctSpeakEnglOnly’, the percentage of people who speak only English, and ‘PctHousOccup’, the percentage of housing occupied, also require transformation.
The distributions of ‘PctSpeakEnglOnly’ and the related variable ‘PctNotSpeakEnglWell’, shown in the boxplots, indicate that the two variables are skewed in opposite directions: ‘PctSpeakEnglOnly’ to the left and ‘PctNotSpeakEnglWell’ to the right. To balance these distributions, the variables are converted to categorical variables using their quartile values as cutoff points. In this way the proportions of English-only speakers and poor English speakers in a community are represented by categorical variables. The original variables are not removed, in case the continuous versions are required later.
Comparison of ‘PctSpeakEnglOnly’ and ‘PctNotSpeakEnglWell’ Distributions
##
## NoEngQ1 NoEngQ2 NoEngQ3 NoEngQ4
## EngQ1 19 60 304 114
## EngQ2 35 52 33 0
## EngQ3 243 88 21 0
## EngQ4 4 0 0 0
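The quartile-based conversion can be sketched with `quantile()` and `cut()`. The data below are simulated, and the helper `to_quartiles` is a hypothetical name, so this shows the general technique rather than the exact code behind the table above.

```r
set.seed(7)

# Simulated toy percentages: one left-skewed, one right-skewed and related to it
pct_eng_only <- pmax(0, 100 - rexp(400, rate = 0.08))
pct_no_eng   <- (100 - pct_eng_only) * runif(400, min = 0.1, max = 0.5)

# Bin a numeric vector into four categories using its own quartiles as cutoffs
to_quartiles <- function(x, prefix) {
  breaks <- quantile(x, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
  cut(x, breaks = breaks, include.lowest = TRUE,
      labels = paste0(prefix, "Q", 1:4))
}

eng_q   <- to_quartiles(pct_eng_only, "Eng")
noeng_q <- to_quartiles(pct_no_eng, "NoEng")

# Cross-tabulate the two categorical variables
print(table(eng_q, noeng_q))
```

Setting `include.lowest = TRUE` keeps the minimum value in the first category. Without it, `cut()` would return NA for that observation.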
Similarly, the variable ‘PctHousOccup’ is converted into a categorical variable which is reasonably evenly distributed among the categories. Again, the continuous variable is retained for possible future use.
PctHousOccup Distribution
## HousOccupQ1 HousOccupQ2 HousOccupQ3 HousOccupQ4
## 388 613 517 476
The chi-squared test was used to find the variables most strongly associated with the response variable ViolentCrimesPerPop.
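How the test was applied to continuous variables is not described, so the sketch below is one possible approach and not necessarily the method used here: both variables are binned into quartiles (an assumption) and `chisq.test()` is run on the resulting contingency table. The data are simulated and all names are hypothetical.

```r
set.seed(3)
n <- 500

# Simulated toy data: one predictor related to the response, one unrelated
predictor <- rnorm(n)
unrelated <- rnorm(n)
response  <- 2 * predictor + rnorm(n)

# Bin a numeric vector into quartile categories
bin4 <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.25)),
      include.lowest = TRUE, labels = paste0("Q", 1:4))
}

# Chi-squared test of independence on each 4 x 4 contingency table
test_related   <- chisq.test(table(bin4(predictor), bin4(response)))
test_unrelated <- chisq.test(table(bin4(unrelated), bin4(response)))

print(c(related = test_related$p.value, unrelated = test_unrelated$p.value))
```

A small p-value indicates that the binned predictor and the binned response are not independent. It says nothing about the direction or shape of the relationship, which is why scatter plots are still needed.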
The variables most closely associated with the response variable are visualised in scatter plots. There is a negative linear relationship between the percentage of the population that is white (racePctWhiteBC) and the response variable, and a positive linear relationship between the percentage of children born to parents who never married (PctKidsBornNeverMar) and the response variable. The other variables have less obvious relationships to the response variable.
Scatter Plots of Significant Variables towards ViolentCrimesPerPop
According to this analysis, the main predictors of the level of violent crime in US communities are:
* number of kids born to never married
* percentage of kids born to never married
* percent of people in owner occupied households
* percent of occupied housing units without phone
* percent of vacant housing that is boarded up
* percent of housing without complete plumbing facilities
* number of people in homeless shelters
* number of homeless people counted in the street
* percent of officers assigned to drug units
* percentage of population that is caucasian
While all 10 predictors had a p-value below 0.05, the four social predictors with the strongest influence on the crime rate were the percentage of the population that is Caucasian, the percentage of kids born to never-married parents, the percentage of people in owner-occupied households, and the percentage of vacant housing that is boarded up.
According to the analysis, the percentage of the population that is Caucasian has an inverse relationship with the crime rate: the higher the percentage of Caucasians in an area, the lower the crime rate. The data also shows a positive linear relationship between owner-occupied households and the Caucasian population. This is an interesting point, as owner-occupied households could be interpreted as providing a more stable environment for families, which may contribute to a lower crime rate, as suggested by the inverse relationship between the crime rate and owner-occupied households.
The percentage of vacant housing that is boarded up also has a strong positive linear relationship with the crime rate. A high proportion of boarded-up vacant housing could be interpreted as a sign that an area is going through an economic downturn, with little social activity and stability for its population. It is probably not a coincidence that the percentage of kids born to parents who never married also has a positive linear relationship with the percentage of boarded-up vacant housing, which could be a further indication of a lack of stability in the area.
The analysis of the USA Communities and Crime Data Set showed that after cleanup and transformation the data can be used to predict the rates of Violent Crime in communities. While the data from the 1990 Law Enforcement Management and Admin Stats survey (Lemas) contained too much missing data to be included in the analysis, the remaining data, once transformed appropriately, contained variables which correlated strongly to the response variable.
This is a good indication that with further work towards feature selection and appropriate model choice, the levels of Violent Crime in USA Communities can be accurately predicted with the USA Communities and Crime Data Set.