Introduction

Understanding where violent crime happens can be a key to understanding why it happens. Environmental, social and population characteristics may be important predictors of the level of violent crime in a population. Determining which are most influential on the level of violent crime will provide valuable input to neighborhood design, urban development, and policing practices.

Data from the USA Communities and Crime Data Set, sourced from the UCI Dataset Repository, was used with different machine learning models to predict the level of Violent Crime in USA Communities. Different optimization techniques were tested and compared to find the model with the best predictive performance.

Dataset Description

This assignment utilizes the USA Communities and Crime Data Set, sourced from the UCI Dataset Repository. Creator: Michael Redmond (redmond ‘@’ lasalle.edu); Computer Science; La Salle University; Philadelphia, PA, 19141, USA. This dataset combines socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Administrative Statistics survey, and crime data from the 1995 FBI UCR.

The per capita violent crimes variable is calculated from each community's population and the sum of the crime variables considered violent crimes in the United States: murder, rape, robbery, and assault.
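
For illustration, the calculation has the following form (a minimal sketch; the raw count column names used here are hypothetical, not taken from the dataset):

# Hypothetical raw columns: murders, rapes, robberies, assaults, population
ViolentCrimesPerPop <- (murders + rapes + robberies + assaults)/population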

The dataset contains a large amount of information collected from each community, which can be summarized under the broad categories of race, age, employment, marital status, immigration and home ownership.

In Part 1 of this Assignment the data was cleaned and transformed, and the processed dataset is used as input to Part 2.

Data Load

The dataset is loaded from a CSV file, which was output from Part 1 of the assignment. The dataset contains 104 features which will be used as input variables to the models; most of these variables are numerical, while a few are categorical. There is one target variable, “ViolentCrimesPerPop”, which is numerical. The dataset is unnormalised.

# Load the processed dataset produced in Part 1
df.working <- read.csv("Output_CommViolPredUnnormalizedData.csv", 
    header = TRUE, stringsAsFactors = FALSE)
summary(df.working)
# Drop the row index and fold columns, which are not model inputs
drops <- c("X", "fold")
df.working <- df.working[, !(names(df.working) %in% drops)]
dim(df.working)

Machine Learning

Model Comparison

Predicting the level of Violent Crime in a USA community is a numerical prediction task, and therefore requires a model that can provide a quantitative output. The models trialled were Linear Regression, Decision Trees and SVM. Different methods to optimize model performance were also used, depending on the type of model.

The measure of performance was MSE for the initial model comparison, while both RMSE and R^2 were used for the comparison of the optimized models.
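
For reference, these measures can be computed directly as follows (a minimal sketch; the mse() function used in the code below is assumed to be equivalent to the first definition):

mse <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))
# R^2: proportion of variance in the actual values explained by the predictions
rsq <- function(actual, predicted) 1 - sum((actual - predicted)^2)/sum((actual - mean(actual))^2)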

The process of comparing the different models was:
1. Evaluate model performance with no optimization
2. Run optimization methods applicable to the model using cross-validation
3. Determine the performance of the optimized model
4. Finally, compare the performance of the optimized models of each type

The dataset was prepared for machine learning by first normalising the numeric features. Data subsets were prepared using a seed of 1 and the sample function to select 70% of the dataset for the training set and the remainder for the test set. Non-numeric features are removed for those models which require numeric inputs only, i.e. Linear Regression and SVM. These features are retained for the Decision Tree model.

# Reproducible 70/30 train/test split
set.seed(1)
train_ind = sample(1:nrow(df.working), 0.7 * nrow(df.working))
# Min-max scaling to the [0, 1] range
normalize <- function(x) {
    return((x - min(x))/(max(x) - min(x)))
}
# Keep a copy of the full feature set for the tree-based models
df.working_dt <- df.working
# Categorical features created in Part 1, not needed by the numeric-only models
notneededFeatures <- c("PctSpeakEnglOnlyCat", "PctNotSpeakEnglWellCat", 
    "PctHousOccupCat", "RentQrange")
possible_predictors = colnames(df.working)[!(colnames(df.working) %in% 
    notneededFeatures)]
df.working = df.working[, names(df.working) %in% possible_predictors]
# Normalised data frame used by Linear Regression and SVM
df.norm <- as.data.frame(lapply(df.working, normalize))

Linear Regression

Linear Regression was the first model tested because it is a simple model that can be used as a benchmark for the others. The advantages of Linear Regression are that it often shows good predictive performance and the results are easy to interpret.

# Fit a linear model on all predictors; column 97 is ViolentCrimesPerPop
lm.fit = lm(ViolentCrimesPerPop ~ ., data = df.norm[train_ind, ])
y_hat = predict(lm.fit, df.norm[-train_ind, -97])
# mse() is assumed to come from the Metrics package: mse(actual, predicted)
MSE_LM = mse(df.norm[-train_ind, 97], y_hat)
## [1] "Number of Coefficients used by Linear Model=       101"
## [1] "Linear Regression MSE=    0.00461"

The linear model produced includes all the features, which makes it difficult to interpret. The Mean Squared Error from the initial Linear Model, 0.00461, appears low; however, it is computed on normalised data, and since MSE is scale-dependent it is only meaningful in comparison with the results from the other models.
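
One way to make the full model easier to inspect (a sketch, not part of the original analysis) is to rank the fitted coefficients by absolute size:

# Ten largest coefficients by magnitude (NA coefficients, if any, are dropped by sort)
head(sort(abs(coef(lm.fit)), decreasing = TRUE), 10)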

Linear Regression: Subset Selection

In order to reduce the number of inputs, subset selection was tested to choose the best set of features. The regsubsets function was used to find the combination of features which provided the best MSE. However, using more than 6 features required a large amount of processing and was not feasible, so the function was run with the maximum number of variables set to 6.

# regsubsets() is from the leaps package; nvmax limits the subset size to 6
regfit.full = regsubsets(ViolentCrimesPerPop ~ ., data = df.norm[train_ind, 
    ], really.big = T, nvmax = 6)
training.mat = model.matrix(ViolentCrimesPerPop ~ ., data = df.norm[train_ind, 
    ])
# Training MSE for the best model of each size 1..6
training.errors = rep(NA, 6)
for (ii in 1:6) {
    coefi = coef(regfit.full, id = ii)
    pred = training.mat[, names(coefi)] %*% coefi
    training.errors[ii] = mse(df.norm[train_ind, 97], pred)
}
test.mat = model.matrix(ViolentCrimesPerPop ~ ., data = df.norm[-train_ind, 
    ])
# Test MSE for the best model of each size 1..6
test.errors = rep(NA, 6)
for (ii in 1:6) {
    coefi = coef(regfit.full, id = ii)
    pred = test.mat[, names(coefi)] %*% coefi
    test.errors[ii] = mse(df.norm[-train_ind, 97], pred)
}
# Pick the subset size with the lowest test error
k = which.min(test.errors)
MSE_SLM = test.errors[k]
## [1] "Number of Coefficients in Best Model=         6"
##      (Intercept)       population      agePct12t29       pctWInvInc 
##       0.74124622       0.08751896      -0.18608212      -0.07565421 
##      PctKids2Par PctPersDenseHous   racepctblackBC 
##      -0.27393525       0.16452897       0.16950682 
## [1] "Subset Selection Linear Regression MSE=    0.00476"

The MSE result from the feature subset selection for Linear Regression is 0.00476, which is worse than the result for the simple Linear Regression model. A reduced subset of features therefore does not improve the performance of the linear model.

However, the features chosen as the best predictors are of interest. These include:
- population
- percentage of population between the ages of 12 and 29
- percentage of households with investment / rent income in 1989
- percentage of the population that is African American
- percentage of kids in family housing with two parents
- percent of persons in dense housing (more than 1 person per room)

Figure 1: Effect of the Number of Predictors on Subset Selection MSE

The plot of MSE vs the number of predictors (Fig.1) shows that the error reduces as features are added to the model. It can also be seen that the test error is less than the training error, which is not expected and could be a result of the particular sample selection.
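
One way to check this (a sketch, not run in the original analysis) is to repeat the split under several different seeds and compare the training and test errors of the full linear model:

# Compare train/test MSE of the full linear model across several random splits
for (s in 1:5) {
    set.seed(s)
    idx <- sample(1:nrow(df.norm), 0.7 * nrow(df.norm))
    fit <- lm(ViolentCrimesPerPop ~ ., data = df.norm[idx, ])
    cat(sprintf("seed %d: train MSE %.5f, test MSE %.5f\n", s,
        mean(fit$residuals^2),
        mse(df.norm[-idx, 97], predict(fit, df.norm[-idx, ]))))
}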

Ridge Regression

Ridge Regression introduces a tuning parameter, lambda, to the Linear Regression model to shrink the regression coefficients. Since the Linear Model has good performance, Ridge Regression may be a useful tool to shrink the values of the less important coefficients towards zero.

Cross-validation was used to find the best value of the regularization parameter, lambda, using the default 10-fold selection. The cv.glmnet function returns two values of lambda: the minimizer, lambda.min, and the always larger lambda.1se, a heuristic choice which produces a less complex model whose estimated expected generalization error is within one standard error of the minimum.

# cv.glmnet with alpha = 0 performs ridge regression; 10-fold CV by default
cvRR.out = cv.glmnet(x = training.mat, y = df.norm[train_ind, 97], 
    alpha = 0, type.measure = "mse")
bestRRlam = cvRR.out$lambda.1se
ridge.mod = glmnet(x = training.mat, y = df.norm[train_ind, 97], alpha = 0)
# Predict on the test set at the chosen lambda
y_hat = predict(ridge.mod, s = bestRRlam, newx = model.matrix(ViolentCrimesPerPop ~ 
    ., data = df.norm[-train_ind, ]))
MSE_RR = mse(df.norm[-train_ind, 97], y_hat)
plot(cvRR.out, cex = 0.6)
Figure: Effect of Ridge Regression Lambda on MSE

The plot of mean squared error vs log(lambda) shows that the lowest error is produced with a small value of lambda, indicating that shrinking the feature coefficients does not improve model performance.

## [1] "Ridge regression CV best value of lambda (one standard error)=    0.21779"
## [1] "Ridge regression test MSE=    0.00475"

The MSE result for Ridge Regression is 0.00475, which is slightly better than the subset model but not as good as the simple Linear Regression model.

These results are in line with the results so far, indicating that Linear Regression performs best with a large number of features, and that reducing the number of features worsens the performance of the model.

The Lasso

As with Ridge Regression, the Lasso shrinks the coefficient estimates towards zero. However, in the case of the Lasso, the regularization penalty has the effect of forcing some of the coefficient estimates to be exactly zero when the tuning parameter lambda is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection.
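
For reference, the objective that glmnet minimises in the Gaussian case is (per the glmnet documentation):

(1/(2n)) * sum_i (y_i - b0 - x_i' b)^2 + lambda * [ (1 - alpha) * ||b||_2^2 / 2 + alpha * ||b||_1 ]

where alpha = 1 gives the lasso penalty and alpha = 0 gives Ridge Regression.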

Cross-validation was used to find the optimal value of lambda using the default 10-fold selection; this value was then used in the fitted model to predict on the test data subset.

# alpha = 1 selects the lasso penalty; cv.glmnet defaults to 10-fold CV
cvL.out = cv.glmnet(training.mat, df.norm[train_ind, 97], alpha = 1)
bestLlam = cvL.out$lambda.1se
lasso.mod = glmnet(training.mat, df.norm[train_ind, 97], alpha = 1)
# Predict on the test set at the chosen lambda
y_hat = predict(lasso.mod, s = bestLlam, newx = model.matrix(ViolentCrimesPerPop ~ 
    ., data = df.norm[-train_ind, ]))
MSE_Ls = mse(df.norm[-train_ind, 97], y_hat)
plot(cvL.out, cex = 0.6)
Figure: Effect of Lasso Lambda on MSE

As with Ridge Regression, the plot of mean squared error vs log(lambda) shows that the lowest error is produced with a small value of lambda.

## [1] "lasso CV best value of lambda (one standard error)=    0.00620"
## [1] "Lasso regression test MSE=    0.00467"

The MSE result from the Lasso model is 0.00467, which is an improvement on the subset and Ridge Regression methods, but again not as good as the simple Linear Model.
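
As a quick check of the variable selection effect (a sketch, not part of the original analysis), the number of coefficients the lasso actually retains at the chosen lambda can be counted:

# Coefficients at lambda.1se; the lasso sets many of these exactly to zero
lasso.coef <- coef(lasso.mod, s = bestLlam)
sum(lasso.coef != 0)    # number of retained (non-zero) terms, including the intercept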

It is possible that non-linear models may perform better on this dataset; these are explored next.

SVM

The next model tested was the Support Vector Machine (SVM), which can accommodate non-linear relationships. SVMs can be used with different kernel types; for this dataset a radial kernel was chosen to see whether a non-linear model could produce a better result.

Cross-validation was used to determine the best parameter values of gamma and cost for the model.

# tune.svm() from e1071 grid-searches gamma and cost via cross-validation
model_svmradial.cv <- tune.svm(ViolentCrimesPerPop ~ ., data = df.norm[train_ind, 
    ], kernel = "radial", gamma = c(5e-04, 0.001, 0.002), cost = c(1.75, 
    2, 2.25, 2.5, 2.75))
# Refit with the best parameters found by the grid search
model_svmradial.tuned <- svm(ViolentCrimesPerPop ~ ., data = df.norm[train_ind, 
    ], kernel = "radial", gamma = model_svmradial.cv$best.parameters$gamma, 
    cost = model_svmradial.cv$best.parameters$cost)
y_hat = predict(model_svmradial.tuned, df.norm[-train_ind, -97])
MSE_SVM = mse(df.norm[-train_ind, 97], y_hat)
plot(model_svmradial.cv, cex = 0.6)
Figure: Effect of Gamma and Cost Values on MSE

The plot of the MSE for the different parameter values shows the minimum error is obtained when cost and gamma are 2.75 and 0.001 respectively.

## [1] "Best value of gamma =    0.00100"
## [1] "Best value of cost =    2.75000"
## [1] "SVM radial test MSE =    0.00436"

Using these values in the trained model results in an MSE of 0.00436 on the test set, which is lower than the best MSE result from the Linear Models.

Decision Tree

The last model tested on the UCI Crime and Communities dataset is the Decision Tree. The advantage of the Decision Tree is the interpretability of its results, although its predictive performance is, in most cases, not as good as other models. Combining Decision Trees into a Random Forest can improve the results, so this method is also tested.

# Normalise the predictor set for the tree-based models
df.norm_dt <- as.data.frame(lapply(df.working_dt[, names(df.working_dt) %in% 
    possible_predictors], normalize))
# tree() is from the tree package
tree.df.norm <- tree(ViolentCrimesPerPop ~ ., df.norm_dt[train_ind, 
    ])
tree.pred <- predict(tree.df.norm, df.norm_dt[-train_ind, -97])
MSE_DT = mse(df.norm_dt[-train_ind, 97], tree.pred)
## [1] "Decision Tree MSE=    0.00650"

The Decision Tree built from the training data shows only 4 features are used to predict the level of Violent Crime:
- percentage of kids born to parents who never married
- percentage of kids in family housing with two parents
- percentage of population that is Caucasian
- number of kids born to parents who never married

The MSE result for the Decision Tree is 0.00650 which, as expected, is higher than the results from the other models.

Figure: Decision Tree for Violent Crime Level

## 
## Regression tree:
## tree(formula = ViolentCrimesPerPop ~ ., data = df.norm_dt[train_ind, 
##     ])
## Variables actually used in tree construction:
## [1] "PctKidsBornNeverMar" "PctKids2Par"         "racePctWhiteBC"     
## [4] "NumKidsBornNeverMar"
## Number of terminal nodes:  7 
## Residual mean deviance:  0.00596 = 8.272 / 1388 
## Distribution of residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.5069000 -0.0500800  0.0008231  0.0000000  0.0509300  0.2897000

Using cross-validation to find the optimal tree size shows that the best result is obtained with a larger number of leaves, which is in line with the finding from the linear models that a more complex model produces better results.
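
The cross-validation step is not shown above; a minimal sketch using the tree package's cv.tree function would be:

# Cost-complexity pruning evaluated by cross-validation (default FUN = prune.tree)
cv.result <- cv.tree(tree.df.norm)
plot(cv.result$size, cv.result$dev, type = "b",
    xlab = "Number of leaves", ylab = "Deviance")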

Figure: Effect of Number of Leaves on Deviance

Random Forest

Random Forest builds on the Decision Tree model by combining the results from multiple trees, each built from a random selection of features. Averaging the results has the advantage of reducing variance, while using a random selection of variables decorrelates the individual trees.
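
The number of candidate features considered at each split (mtry) defaults to one third of the predictors for regression; it could also be tuned with the randomForest package's tuneRF function (a sketch, not part of the original analysis):

# Search for a good mtry around the regression default of p/3
# (column 97 is ViolentCrimesPerPop, as above)
x.train <- df.norm_dt[train_ind, -97]
y.train <- df.norm_dt[train_ind, 97]
tuneRF(x.train, y.train, stepFactor = 1.5, improve = 0.01, ntreeTry = 200)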

The Random Forest model was run on the Training dataset and then tested on the test dataset.

# Fit the forest with default settings, tracking variable importance
df.norm.rf <- randomForest(ViolentCrimesPerPop ~ ., df.norm_dt[train_ind, 
    ], importance = TRUE, proximity = TRUE)
# Plot the 15 most important features
varImpPlot(df.norm.rf, cex = 0.7, main = NULL, n.var = min(15, nrow(df.norm.rf$importance)))
Figure: Random Forest Feature Importance

rf.pred <- predict(df.norm.rf, df.norm_dt[-train_ind, -97])
MSE_RF = mse(df.norm_dt[-train_ind, 97], rf.pred)
## [1] "Random Forest MSE=    0.00465"

The MSE result from the Random Forest is 0.00465, which is on par with the best performance of the Linear Models.

The plot of the features ranked by increase in MSE shows similar variables to those which appeared in the simple Decision Tree:
- percentage of kids born to parents who never married
- percentage of kids in family housing with two parents
- percentage of population that is Caucasian
- number of kids born to parents who never married

as well as some which appeared in the Linear Model Subset Selection:
- percentage of the population that is African American
- percent of persons in dense housing (more than 1 person per room)

The plot of the features ranked by node purity also includes:
- number of people under the poverty level
- percentage of females who are divorced

Comparison of Models

To compare the performance of the tuned models, the caret package was utilized. This package uses repeated cross-validation to compare the performance of all the models over multiple resampled subsets, returning the RMSE (MSE is not an option) and R-squared results for each model.

library(caret)
# 10-fold cross-validation, repeated 3 times (30 resamples per model)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# Resetting the seed before each train() call gives every model the same folds
set.seed(7)
modelLm <- train(ViolentCrimesPerPop ~ ., data = df.norm, method = "lm", 
    metric = "RMSE", trControl = control)
set.seed(7)
modelRR <- train(ViolentCrimesPerPop ~ ., data = df.norm, method = "glmnet", 
    metric = "RMSE", trControl = control, tuneGrid = data.frame(.alpha = 0, 
        .lambda = bestRRlam))
set.seed(7)
modelLasso <- train(ViolentCrimesPerPop ~ ., data = df.norm, method = "glmnet", 
    metric = "RMSE", trControl = control, tuneGrid = data.frame(.alpha = 1, 
        .lambda = bestLlam))
set.seed(7)
# Note: "svmLinear2" fits an SVM with a linear kernel via e1071
modelSvm <- train(ViolentCrimesPerPop ~ ., data = df.norm, method = "svmLinear2", 
    metric = "RMSE", trControl = control)
set.seed(7)
modelRf <- train(ViolentCrimesPerPop ~ ., data = df.norm, method = "rf", 
    metric = "RMSE", trControl = control)
# Collect the resampling results for comparison
results <- resamples(list(LM = modelLm, SVM = modelSvm, RR = modelRR, 
    Lasso = modelLasso, Rf = modelRf))
summary(results)
bwplot(results)
Figure: Model Comparison Results

## 
## Call:
## summary.resamples(object = results)
## 
## Models: LM, SVM, RR, Lasso, Rf 
## Number of resamples: 30 
## 
## RMSE 
##          Min. 1st Qu.  Median    Mean 3rd Qu.    Max. NA's
## LM    0.05954 0.06674 0.06915 0.06927 0.07279 0.07813    0
## SVM   0.05978 0.06634 0.06846 0.06917 0.07308 0.07890    0
## RR    0.06292 0.06921 0.07153 0.07139 0.07420 0.08103    0
## Lasso 0.06258 0.06791 0.07061 0.07056 0.07307 0.08079    0
## Rf    0.06178 0.06564 0.06953 0.06941 0.07248 0.07875    0
## 
## Rsquared 
##         Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## LM    0.6162  0.6506 0.6924 0.6843  0.7138 0.7733    0
## SVM   0.6180  0.6431 0.6919 0.6850  0.7132 0.7728    0
## RR    0.6046  0.6365 0.6769 0.6717  0.7048 0.7413    0
## Lasso 0.6088  0.6395 0.6827 0.6763  0.7067 0.7531    0
## Rf    0.6123  0.6479 0.6908 0.6833  0.7106 0.7534    0

The results from the model comparison show that the performance of the optimized Linear Regression, Ridge Regression, Lasso, SVM and Random Forest models is very similar. This indicates the choice of models for the data was appropriate and the preprocessing of the data did not advantage one model type.

The best performing model by both RMSE and R-squared was the SVM (fitted with a linear kernel in this comparison), followed by the Linear Regression model and then the Random Forest.
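
Whether these differences are meaningful could also be tested formally (a sketch using caret's built-in comparison of resampling distributions):

# Pairwise differences between the models' resampled RMSE and R-squared values
diffs <- diff(results)
summary(diffs)    # paired comparisons with a multiplicity adjustment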

Conclusion

In this Assignment, data from the USA Communities and Crime Data Set, sourced from the UCI Dataset Repository, was used with different machine learning models to predict the level of Violent Crime in USA Communities.

In Part 1 of the Assignment, the dataset was analysed and cleaned. Improvement actions were taken, such as transforming features to remove skewness, and converting others to categorical features.

In Part 2 of this Assignment, the performance of different Machine Learning Models was tested on the dataset to predict the level of Violent Crime in a community. It was shown that all the models chosen, once optimized with cross-validation, produced very similar predictive performance. This indicates the choice of models for the data was correct and the preprocessing of the data did not advantage one model type.

The features used for the prediction were similar across the different models. The features most frequently used to predict the level of Violent Crime were:
- percentage of kids in family housing with two parents
- percentage of kids born to parents who never married
- number of kids born to parents who never married
- percent of persons in dense housing (more than 1 person per room)
- percentage of population that is Caucasian
- percentage of the population that is African American

These features are indicative of the societal and demographic changes in 1990s USA. During this time the US was going through the crack epidemic, which mainly affected underprivileged African American communities. In addition, between 1990 and 1995 the number of 15-24 year olds increased by roughly 20%, and the share of the population aged 15 to 24 increased from 13.7% to 14.6%. These demographic shifts may explain the strength of the predictive performance of features relating to race and children.

There is some correlation between these features, which indicates that actions to reduce correlated features should have been taken in the data preparation phase.
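
Such a step could be taken with caret's findCorrelation function (a sketch, assuming the normalised predictor frame used above with the response in column 97):

# Identify predictors recommended for removal at a pairwise correlation cutoff of 0.9
corr.mat <- cor(df.norm[, -97])
high.corr <- findCorrelation(corr.mat, cutoff = 0.9)
length(high.corr)    # number of candidate columns to drop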

The large number of features required to obtain good performance from the Linear Models indicates that it could have been worthwhile exploring polynomial terms or recursive feature selection, had additional processing capability been available.

Using R-squared (on a scale of 0 to 1) as a measure of predictive performance, these models can be considered to have moderately good performance on this dataset. The best performing model was the SVM with a radial kernel. This performance could be improved further by removing correlated features and better feature selection.