Overview

People’s quality of life consists many factors: available housing, economic viability, and geography among many others. One of the most import factors is security, which is directly related to the prevalence of crime. In large cities, crime negatively affects citizens’ livelihood. Criminal activity prevents people from safely raising families, running businesses, owning property and providing tax revenue among other effects. And once crime appears in a locality, it becomes expensive and difficult to reverse its influence; city planners and public officials still do not agree on the best approaches despite many attempts and experiments.

This study focuses on crime detection and prediction. Specifically, using publicly data gathered from various Boston neighborhoods to predict whether each area has a crime rate above or below below the city median. The predictor variables describe an area’s zoning, housing characteristics, and proximity to critical resources (e.g. employment or highways), among others.

We are going to build several models using different techniques and variable selection. In order to best assess our predictive model, we created a test set within our training data, and split it along an 80/20 training/testing proportion, before applying the finalized model to a separate evaluation dataset that did not contain the target.


1. Data Exploration

The crime training dataset contains 466 observations of 13 variables that are collected in attempts to evaluate the crime within the Boston neighborhoods. The evaluation dataset contains 40 observations of 12 variables. The descriptions of each column are below.

1.1 Summary Statistics

The training data can be previewed below. The target column is the binary dependent variable denoting if a neighborhood had a crime rate above (target = 1) or below (target = 0) the city median.

The table below provides the mean for each variable, the standard deviation and some other descriptive statistics. As we see below, zn, indus, age, and lstat are proportions. Chas is a dummy variable. The rest are integers and fortunately, there are no missing values.

Data summary
Name train_df
Number of rows 466
Number of columns 13
_______________________
Column type frequency:
numeric 13
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
zn 0 1 11.58 23.36 0.00 0.00 0.00 16.25 100.00
indus 0 1 11.11 6.85 0.46 5.15 9.69 18.10 27.74
chas 0 1 0.07 0.26 0.00 0.00 0.00 0.00 1.00
nox 0 1 0.55 0.12 0.39 0.45 0.54 0.62 0.87
rm 0 1 6.29 0.70 3.86 5.89 6.21 6.63 8.78
age 0 1 68.37 28.32 2.90 43.88 77.15 94.10 100.00
dis 0 1 3.80 2.11 1.13 2.10 3.19 5.21 12.13
rad 0 1 9.53 8.69 1.00 4.00 5.00 24.00 24.00
tax 0 1 409.50 167.90 187.00 281.00 334.50 666.00 711.00
ptratio 0 1 18.40 2.20 12.60 16.90 18.90 20.20 22.00
lstat 0 1 12.63 7.10 1.73 7.04 11.35 16.93 37.97
medv 0 1 22.59 9.24 5.00 17.02 21.20 25.00 50.00
target 0 1 0.49 0.50 0.00 0.00 0.00 1.00 1.00


1.2 Distributions

Next, we will check the density plots. It will help us to start thinking about which variables can be included in our model and how they are distributed relative to our target variable, so we graph both the total density and density per target category below.

Both types of skewness appears in several variables; age, ptratio are left-skewed, while dis, lstat, zn are right-skewed. We may consider some transformations for these variables to satisfy linear regression assumptions.

The variables tax and rad appear binomial, and chas is binary and takes mostly the value of 0. Likewise, zn also takes mostly the value of 0, and thus may be worth removing.

1.3 Box Plots

Box plots for all predictors are below and some we can see rm, medv and age related outliers are significant and may need further handling. Additionally, nox, dis, tax and rad have non-overlapping quartiles with little to no outliers; so we can expect to see these appear in multiple models.

Lastly, chas clearly can be converted to a categorical variable as it has binary values.

Before building a model, we need to make sure that we have both classes equally presented in out target variable. This appears to be the case, as class 1 takes 49% and class 0 takes 51% of the target variable. As a result, we don’t have unbalanced class distribution for our target variable that we have to deal with. Otherwise, we had to take some additional steps (bootstrapping, etc) before using logistic regression.

##    target 
## 0.4914163

1.3 Correlation Matrix

Plotting the correlations between the target and the variables, we can see that some variables correlate positively with the target, such as indus (proportion of non-retail business acres), nox (nitrogen oxides concentration), age (proportion of units built prior to 1940), rad (accessibility to highways), tax (property tax rate per $10,000), lstat (lower status of population i.e. a socioeconomic measure), and negatively with zn (proportion of residential land zoned for large lots) and dis (mean distance to 5 Boston employment centers). Intuitively and based on well-studied socioeconomic principles, these make sense, barring perhaps the tax rate, which one would expect to be lower in higher crime-prone neighborhoods.

Variables with correlations close to zero are unlikely to offer significant insights into the factors contributing to a crime rate.

To avoid multicollinearity, we should exclude some variables from the model that have high degrees of correlation with each other. We can see potential cases of multicollinearity for the variables rad/tax, indus/tax, age/nox, indus/nox, rm/medv, lstat/medv, nox/dis, indus/dis, and age/dis.

We will carefully examine the VIF for these later on.

Coefficient
zn -0.4316818
indus 0.6048507
chas 0.0800419
nox 0.7261062
rm -0.1525533
age 0.6301062
dis -0.6186731
rad 0.6281049
tax 0.6111133
ptratio 0.2508489
lstat 0.4691270
medv -0.2705507
target 1.0000000

2. Data preparation

To build our model, we need to transform categorical variables chas, target and rad from numeric (default by the data) to factors. rad in particular is an ordinal categorical index. The data will be split for training and testing to train and check our models. As we saw from the summary tables above, there are no missing values to deal with.

As mentioned above, the variable zn has 0 value as the most occurred. It appears 339 times out of total 466 (72%), and may cause overdispersion, so we will remove it.

##       zn   n
## 1    0.0 339
## 2   12.5  10
## 3   17.5   1
## 4   18.0   1
## 5   20.0  21
## 6   21.0   4
## 7   22.0   9
## 8   25.0   8
## 9   28.0   3
## 10  30.0   6
## 11  33.0   3
## 12  34.0   3
## 13  35.0   3
## 14  40.0   7
## 15  45.0   6
## 16  52.5   3
## 17  55.0   3
## 18  60.0   4
## 19  70.0   3
## 20  75.0   3
## 21  80.0  13
## 22  82.5   2
## 23  85.0   2
## 24  90.0   4
## 25  95.0   4
## 26 100.0   1

Next, let’s use BoxCox transformation to normalize our variables - we will look at the distributions for all: age, dis, lstat, nox, ptratio, tax, medv, and indus.

Compare the original distributions with transformed. As we see on the graphs below, distributions of the transformed variables became closer to normal.

The glmnet() function (will be used later) uses a vector for the explanatory variable and a matrix for the predictor variables. The training and validation data frame are transformed into vectors and the corresponding matrix (X, Y). Since the glmnet() function supports the standardization and transformation of the predictor variables, we do not need to create individual variable sets for the different transformations.

However, we will also split the regularly BoxCox transformed training set, to be able to compare those transformations against the glmnet() transforms and results.

3. Build Models

3.1 Model 1 - Logit model, original data

As the first step, we are going to build the generalized linear model based on the original training dataset with some variables transformed to categorical from numeric. Since our dependent variable takes only two values (0 and 1), we will use logistic regression. To do so, the function glm() with family=binomial is used. At the beginning, all variables are included (besides zn column).

Observations 373
Dependent variable target
Type Generalized linear model
Family binomial
Link logit
χ(18) 422.88
Pseudo-R (Cragg-Uhler) 0.90
Pseudo-R (McFadden) 0.82
AIC 132.08
BIC 206.59
Est. S.E. z val. p
(Intercept) -66.25 5655.79 -0.01 0.99
indus -0.09 0.15 -0.57 0.57
chas1 0.52 0.99 0.53 0.60
nox 69.04 14.93 4.62 0.00
rm -1.51 1.19 -1.27 0.20
age 0.03 0.02 1.58 0.11
dis 0.36 0.28 1.28 0.20
rad2 -0.97 7856.72 -0.00 1.00
rad3 1.56 7326.83 0.00 1.00
rad4 23.16 5655.77 0.00 1.00
rad5 21.28 5655.77 0.00 1.00
rad6 18.97 5655.77 0.00 1.00
rad7 25.09 5655.77 0.00 1.00
rad8 25.57 5655.77 0.00 1.00
rad24 44.61 6040.79 0.01 0.99
tax -0.01 0.01 -1.55 0.12
ptratio 0.49 0.24 2.03 0.04
lstat 0.07 0.07 1.01 0.31
medv 0.33 0.13 2.58 0.01
Standard errors: MLE

With stepAIC, we can try to see what variables should be included in our model to increase the performance. Based on the the results of the function, we will include nox, rm, age, rad, tax, ptratio and medv variables.

## 
## Call:
## glm(formula = target ~ nox + rm + age + rad + tax + ptratio + 
##     medv, family = binomial(link = "logit"), data = transformed_training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.92197  -0.07944   0.00000   0.00005   2.62307  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.487e+01  5.526e+03  -0.010 0.992078    
## nox          5.582e+01  1.029e+01   5.425 5.81e-08 ***
## rm          -1.796e+00  1.013e+00  -1.774 0.076145 .  
## age          3.309e-02  1.530e-02   2.162 0.030585 *  
## rad2        -2.071e+00  7.675e+03   0.000 0.999785    
## rad3         1.602e+00  7.132e+03   0.000 0.999821    
## rad4         2.368e+01  5.526e+03   0.004 0.996582    
## rad5         2.183e+01  5.526e+03   0.004 0.996849    
## rad6         2.007e+01  5.526e+03   0.004 0.997103    
## rad7         2.586e+01  5.526e+03   0.005 0.996266    
## rad8         2.637e+01  5.526e+03   0.005 0.996192    
## rad24        4.551e+01  5.952e+03   0.008 0.993898    
## tax         -1.420e-02  3.953e-03  -3.592 0.000328 ***
## ptratio      4.819e-01  2.030e-01   2.374 0.017600 *  
## medv         2.916e-01  1.097e-01   2.659 0.007836 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.956  on 372  degrees of freedom
## Residual deviance:  97.058  on 358  degrees of freedom
## AIC: 127.06
## 
## Number of Fisher Scoring iterations: 20

The p-value associated with this Chi-Square statistic is below. The lower the p-value, the better the model is able to fit the dataset compared to a model with just an intercept term.

## [1] 0

Model Results

Only nox, age, tax, ptratio and medv have regression coefficients in model that are highly significant at the 5% level. The coefficients of nox (nitrogen oxides concentration), age (proportion of owner-occupied units built prior to 1940), ptratio (pupil-teacher ratio by town) and medv (median value of owner-occupied homes in 1000s dollars) increase the chance of crime while tax (full-value property-tax rate per $10,000). These results seem strange as higher median value of owner-occupied homes in 1000s dollars (medv) should lead to lower crime rate. The same would be for age. It may be a results of multicollinearity.

The null deviance of 516.96 defines how well the target variable can be predicted by a model with only an intercept term.

The residual deviance of 97.06 defines how well the target variable can be predicted by our AIC model that we fit with predictor variables mentioned above. The lower the value, the better the model is able to predict the value of the response variable.

The p-value associated with this Chi-Square statistic is 0 which is less than .05, the model can be useful.

The Akaike information criterion (AIC) is 127.06. We will use this metric to compare the fit of different models. The lower the value, the better the regression model is able to fit the data.

Checking Model Assumptions

We evaluate the modeling assumptions using standard diagnostic plots, marginal model plots, and a Variance Inflation Factor (VIF) to assess collinearity.

The resulting plot 1 below (predicted values vs. residuals) can help us to see if there are any signs of over- or under-predicting. It seems that we are experiencing under-fitting at lower predicted values, with the predicted proportions being larger than the observed proportions. On the other hand, we are over-fitting for a couple of predictor patterns at higher predicted values. The predicted values are much larger than the predicted proportions. The labeled points refer to rows in our data. The Standardized Pearson Residuals plot shows an approximate Standard Normal distribution if the model fits. The model seems good in the middle range but there are extremes on the right and left sides. Some normality of the residuals for the binomial logistic regression models is just an evidence of a decent fitting model.

To check the linearity assumptions (linear relationship between the transformed expected response in terms of the link function and the explanatory variables). We select only numeric variables from our data, add the logit results based on the probability from the predictions and build a scatter plot. We are going check the linearity assumptions visually. Only rm, lstat, medv, nox seem like linear relation with the logit results.

##     1     2     3     4     7     8 
## "pos" "pos" "pos" "neg" "pos" "pos"
## `geom_smooth()` using formula = 'y ~ x'

The Standardized Residuals plot seems to have a constant variance though there are some outliers.

The marginal model plot is a very useful graphical method for deciding if a logistic regression model is adequate or not. The marginal model plots below show reasonable agreement across the two sets of fits indicating that model_1_aic is a valid model.

In terms of multicollinearity, all variables have a VIF less than 5 besides for the medv variable with vif=7.2. As a result, multicollinearity may be a problem for the model. To improve the results, we may consider another way of choosing variables in the future.

##             GVIF Df GVIF^(1/(2*Df))
## nox     3.504719  1        1.872089
## rm      4.529782  1        2.128329
## age     1.946545  1        1.395186
## rad     2.867307  8        1.068051
## tax     1.927101  1        1.388201
## ptratio 2.367283  1        1.538598
## medv    7.240190  1        2.690760

3.2 Model 2 - Logit model, BoxCox transformation

Next, we can try to build the generalized linear model based on the dataset with boxcox transformation.

## 
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"), 
##     data = box_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5144  -0.1491   0.0000   0.0001   3.3920  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.810e+01  3.566e+03  -0.016   0.9870    
## indus       -4.579e-01  3.743e-01  -1.223   0.2213    
## chas1        8.688e-01  1.009e+00   0.861   0.3894    
## nox          1.692e+01  3.594e+00   4.708  2.5e-06 ***
## rm          -1.410e+00  9.308e-01  -1.514   0.1299    
## age          9.989e-03  4.733e-03   2.111   0.0348 *  
## dis          3.588e+00  1.455e+00   2.467   0.0136 *  
## rad2         1.764e+00  5.068e+03   0.000   0.9997    
## rad3         1.884e+01  3.566e+03   0.005   0.9958    
## rad4         2.067e+01  3.566e+03   0.006   0.9954    
## rad5         1.817e+01  3.566e+03   0.005   0.9959    
## rad6         1.606e+01  3.566e+03   0.005   0.9964    
## rad7         2.232e+01  3.566e+03   0.006   0.9950    
## rad8         2.266e+01  3.566e+03   0.006   0.9949    
## rad24        3.865e+01  3.854e+03   0.010   0.9920    
## tax          2.497e+01  4.046e+01   0.617   0.5371    
## ptratio      1.178e-02  1.055e-02   1.117   0.2639    
## lstat       -4.585e-01  5.227e-01  -0.877   0.3805    
## medv         2.238e+00  1.275e+00   1.755   0.0793 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.96  on 372  degrees of freedom
## Residual deviance: 109.49  on 354  degrees of freedom
## AIC: 147.49
## 
## Number of Fisher Scoring iterations: 19

The stepAIC function tells us to include once again the variables nox, age, dis, rad and medv.

## 
## Call:
## glm(formula = target ~ nox + age + dis + rad + medv, family = binomial(link = "logit"), 
##     data = box_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5055  -0.2127   0.0000   0.0001   2.9388  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.644e+01  3.550e+03  -0.005   0.9963    
## nox          1.403e+01  2.797e+00   5.015 5.31e-07 ***
## age          6.479e-03  3.753e-03   1.726   0.0843 .  
## dis          2.921e+00  1.328e+00   2.200   0.0278 *  
## rad2         2.616e-01  4.839e+03   0.000   1.0000    
## rad3         1.932e+01  3.550e+03   0.005   0.9957    
## rad4         2.071e+01  3.550e+03   0.006   0.9953    
## rad5         1.826e+01  3.550e+03   0.005   0.9959    
## rad6         1.662e+01  3.550e+03   0.005   0.9963    
## rad7         2.204e+01  3.550e+03   0.006   0.9950    
## rad8         2.272e+01  3.550e+03   0.006   0.9949    
## rad24        3.809e+01  3.872e+03   0.010   0.9922    
## medv         9.375e-01  5.225e-01   1.794   0.0728 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.96  on 372  degrees of freedom
## Residual deviance: 113.97  on 360  degrees of freedom
## AIC: 139.97
## 
## Number of Fisher Scoring iterations: 19

The p-value associated with this Chi-Square statistic is below.

## [1] 0

Model Results

Only nox and dis have regression coefficients in model that are significant. The coefficients of nox (nitrogen oxides concentration), medv (median value of owner-occupied homes in 1000s dollars), age (proportion of owner-occupied units built prior to 1940), dis (weighted mean of distances to five Boston employment centers) increase the chance of crime. The results are strange again as in the model 1. A higher median value of owner-occupied homes in 1000s dollars (medv) should lead to lower crime rate. The same would be for age. It may be a results of multicollinearity.

The residual deviance is 113.97.

The p-value associated with this Chi-Square statistic is 0 which is less than .05, the model can be useful.

The Akaike information criterion (AIC) is 139.97.

Checking Model Assumptions

The resulting plots show us similar to model 1 picture: under-fitting at lower predicted values, with the predicted proportions being larger than the observed proportions; over-fitting for a couple of predictor patterns at higher predicted values with the predicted values are much larger than the predicted proportions. The Standardized Pearson Residuals plot shows an approximate Standard Normal distribution if the model fits. The model seems good in the middle range but there are extremes on the right and left sides. Some normality of the residuals for the binomial logistic regression models is just an evidence of a decent fitting model.

By checking the linearity assumptions, we see that only rm seem like linear relation with the logit results.

##     1     2     3     4     5     6 
## "pos" "pos" "pos" "neg" "neg" "neg"
## `geom_smooth()` using formula = 'y ~ x'

The Standardized Residuals plot seems to have a constant variance though there are some outliers.

The marginal model plots below show reasonable agreement across the two sets of fits indicating that model_2_aic is a valid model.

In terms of multicollinearity, all variables have a VIF less than 5. As a result, multicollinearity shouldn’t be a problem for our model.

##          GVIF Df GVIF^(1/(2*Df))
## nox  4.400149  1        2.097653
## age  1.816811  1        1.347891
## dis  3.122450  1        1.767046
## rad  3.195604  8        1.075312
## medv 1.803566  1        1.342969

3.3 Model 3 - Lasso Cross Validation

The cv.glmnet() function was used to perform k-fold cross-validation with variable selection using lasso regularization. The following attribute settings were selected for the model:

  • type.measure = “class” - The type.measure is set to class to minimize the misclassification errors of the model since the accurate classification of the validation data set is the desired outcome.
  • nfold = 5 - Through our exploration, we did not uncover any hard and fast rules governing the optimal number of folds for n-fold cross-validation. Given the size of the training dataset, we opted for 5-fold cross-validation.
  • family = binomial - For Logistic Regression, the family attribute of the function is set to binomial.
  • link = logit - For this model, we choose the default link function for a logistic model.
  • alpha =1 - The alpha value of 1 sets the variable shrinkage method to lasso.
  • standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this will normalize the prediction variables around a mean of zero and a standard deviation of one before modeling.

The resulting model is explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.

  • The coefficients extracted using lambda.min minimizes the mean cross-validated error. The resulting model drops one predictor variable rm and has an AIC of -352.0.
  • The coefficients extracted using lambda.1se produce the most regularized model (cross-validated error is within one standard error of the minimum). For this model the ‘rm’ and ‘tax’ predictor variables dropout, generating an AIC of -336.8.

Even though the coefficients extracted using the lambda.min have the lower AIC value, selecting the coefficients at lambda.1se generates a more parsimonious model.

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -41.761828746
## zn           -0.069465940
## indus        -0.070602170
## chas          1.494501615
## nox          44.111275356
## rm           -0.029378028
## age           0.028335095
## dis           0.683771334
## rad           0.498388050
## tax          -0.002897184
## ptratio       0.408595628
## lstat         0.070022131
## medv          0.186647791
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) -24.52294587
## zn           -0.02996685
## indus        -0.04128535
## chas          1.15505703
## nox          26.73104404
## rm            .         
## age           0.01815089
## dis           0.24955205
## rad           0.24901659
## tax           .         
## ptratio       0.23311898
## lstat         0.01380822
## medv          0.10119236
## 
## Call:  cv.glmnet(x = X, y = Y, nfolds = 5, family = "binomial", link = "logit",      standardize = TRUE, alpha = 1) 
## 
## Measure: Binomial Deviance 
## 
##       Lambda Index Measure      SE Nonzero
## min 0.001252    62  0.4308 0.05198      12
## 1se 0.005547    46  0.4773 0.04569      10

## $deviance
## lambda.1se 
##  0.4276859 
## attr(,"measure")
## [1] "Binomial Deviance"
## 
## $class
## lambda.1se 
## 0.09651475 
## attr(,"measure")
## [1] "Misclassification Error"
## 
## $auc
## [1] 0.9750072
## attr(,"measure")
## [1] "AUC"
## 
## $mse
## lambda.1se 
##  0.1344123 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.1se 
##  0.3062294 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -352.0349
## 
## $BIC
## [1] -305.8426
## $AICc
## [1] -336.8219
## 
## $BIC
## [1] -298.2138

Looking closer at the remaining coefficients for the selected lambda values, the nox predictor variable has the largest coefficient, and several variables including lstat,age,zn,and indus have coefficients close to zero. As mentioned earlier, the dataset has a high correlation between predictor variables. The lasso regression approaches this issue by selecting the variable with the highest correlation (in this casenox) and shrinking the remaining variables (as can be seen in the plot of coefficients).

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) -24.52294587
## zn           -0.02996685
## indus        -0.04128535
## chas          1.15505703
## nox          26.73104404
## rm            .         
## age           0.01815089
## dis           0.24955205
## rad           0.24901659
## tax           .         
## ptratio       0.23311898
## lstat         0.01380822
## medv          0.10119236

Model Results

The coefficients extracted at the lambda.1se value are used to predict the relative crime rate for the testing data set. The confusion matrix highlights an accuracy of 83.9%.

##          True
## Predicted  0  1 Total
##     0     41  9    50
##     1      6 37    43
##     Total 47 46    93
## 
##  Percent Correct:  0.8387

Checking Model Assumptions

Again we check linear relationship between independent variables and the Logit of the target variable. Visually inspecting the results there is a linear trend in the relationship but there are deviations from the straight line in all variables with the exception of nox and ptratio.

## `geom_smooth()` using formula = 'y ~ x'

The Standardized Residuals plot seems to have a constant variance though there are some outliers.

The lasso regression solve multicollinearity issue by selecting the variable with the largest coefficient while setting the rest to (nearly) zero.

4. Select Model

To choose which of our three models for prediction, we will focus on performance only. I.e. comparing ROC curves, performance statistics and confusion matrices. We did not have models differ in the amount of predictors.

Ideally, all three comparisons favors a single model but they didn’t in our case.

4.1 Confusion Matrices

Comparing model 1 and 2, we see they are very similar except model 2 has a slightly higher miss rate. However, model 3’s false negative rate was much higher than the other models and hurt its overall accuracy.

Model 1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43  1
##          1  4 45
##                                          
##                Accuracy : 0.9462         
##                  95% CI : (0.879, 0.9823)
##     No Information Rate : 0.5054         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8925         
##                                          
##  Mcnemar's Test P-Value : 0.3711         
##                                          
##             Sensitivity : 0.9783         
##             Specificity : 0.9149         
##          Pos Pred Value : 0.9184         
##          Neg Pred Value : 0.9773         
##              Prevalence : 0.4946         
##          Detection Rate : 0.4839         
##    Detection Prevalence : 0.5269         
##       Balanced Accuracy : 0.9466         
##                                          
##        'Positive' Class : 1              
## 

Model 2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43  2
##          1  4 44
##                                          
##                Accuracy : 0.9355         
##                  95% CI : (0.8648, 0.976)
##     No Information Rate : 0.5054         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.871          
##                                          
##  Mcnemar's Test P-Value : 0.6831         
##                                          
##             Sensitivity : 0.9565         
##             Specificity : 0.9149         
##          Pos Pred Value : 0.9167         
##          Neg Pred Value : 0.9556         
##              Prevalence : 0.4946         
##          Detection Rate : 0.4731         
##    Detection Prevalence : 0.5161         
##       Balanced Accuracy : 0.9357         
##                                          
##        'Positive' Class : 1              
## 

Model 3

##          Sensitivity          Specificity       Pos Pred Value 
##            0.8043478            0.8723404            0.8604651 
##       Neg Pred Value            Precision               Recall 
##            0.8200000            0.8604651            0.8043478 
##                   F1           Prevalence       Detection Rate 
##            0.8314607            0.4946237            0.3978495 
## Detection Prevalence    Balanced Accuracy 
##            0.4623656            0.8383441

Even though model 3’s accuracy is the lower than for model 1, its AIC is the lowest (-337 comparing to 127/140 for models 1/2) as well as Deviance, so we’ll choose it for predictions. Based on the results of sensitivity, specificity, AUC we suspect the over-fitting problem with models 1 and 2. Models 1 and 2 are similar as we used stepAIC for both of them when selecting variables but it happened that BoxCox transformations lower AIC result for model 2.

Model Accuracy Classification error rate F1 Deviance R2 Sensitivity Specificity Precision AIC
Model #1 0.9466 0.0534 0.9474 97.0579 0.8123 0.9783 0.9149 0.9184 127.0579
Model #2 0.9357 0.0643 0.9362 113.9708 0.7795 0.9565 0.9149 0.9167 139.9708
Model #3 0.9389 0.0611 0.8315 0.5816 NA 0.8043 0.8723 0.8605 -336.8219

4.2 ROC Curves

As expected from the confusion matrices, we see the ROC curves for models being closer to ideal.

4.3 Predictions

Using the selected model we will predict the eval dataset. The matrix used in the predict() function requires a target variable so we will create a dummy variable for these purposes and set it to 0. The prediction function will generate a vector of probablisites that we will use to calcuate a binomial prediction (using a threshold of 0.5). The results of the prediction will be written to the eval_predictions_lasso.csv and loaded to the GitHub repo.

5. Conclusion

After investigating a few models, the third model, where 5-fold cross-validation with variable selection using lasso regularization, performed the best with an AIC score of -336.8 when incorporating ten predictors out of a possible twelve. The model achieved an accuracy rate of 93.9%. The relative size of the coefficients is noteworthy. The nox predictor variable is the largest coefficient of the ten remaining variables, with 4 of those coefficients being close to zero (less than 0.05 difference from zero). Although there is clearly a relationship between pollution captured by the nox variable and relative crime rate, we recommend further investigation to fully understand the causal pathways.

6. References

  1. Faraway, J. J. (2014). Linear Models with R, Second Edition. CRC Press.
  2. Sheather, S. (2009). A Modern Approach to Regression with R. Springer Science & Business Media.
  3. Detecting Multicollinearity Using Variance Inflation Factors | STAT 462. (n.d.). https://online.stat.psu.edu/stat462/node/180/
  4. Faraway, J. J. (2016). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press.

Appendix: R code

## ----setup, include=FALSE------------------------------------------------------------------------------------------------------------------
library(tidyverse)
library(dplyr)
library(reshape2)
library(ggplot2)
library(glmnet)
library(psych)
library(caret)
library(pROC)
library(skimr)
library(lemon)
library(DT)
library(kableExtra)
library(forecast)
library(corrplot)
library(ggpubr)
library(Hmisc)
library(caTools)
library(performance)
library(jtools)
library(regclass)
library(MASS)
library(PerformanceAnalytics)
library(leaps)
library(broom)
library(vip)

source("logistic_functions.R")

knitr::opts_chunk$set(echo = FALSE)


## ------------------------------------------------------------------------------------------------------------------------------------------
set.seed(1233)


## ----data----------------------------------------------------------------------------------------------------------------------------------
# read train and evaluation datasets
train_df <- read.csv("https://raw.githubusercontent.com/cliftonleesps/data621_group1/main/hw3/crime-training-data_modified.csv")
eval_df <- read.csv("https://raw.githubusercontent.com/cliftonleesps/data621_group1/main/hw3/crime-evaluation-data_modified.csv")



## ----summary1------------------------------------------------------------------------------------------------------------------------------
DT::datatable(
      train_df,
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE) 


## ----summary-------------------------------------------------------------------------------------------------------------------------------
skim_without_charts(train_df)


## ----density_plot--------------------------------------------------------------------------------------------------------------------------

m_df <- train_df %>% pivot_longer(!target, names_to='variable' , values_to = 'value')
m_df %>% ggplot(aes(x=value, group=target, fill=target)) + 
geom_density(color='#023020') + facet_wrap(~variable, scales = 'free',  ncol = 4) + theme_bw()

m_df %>% ggplot(aes(x= value)) + 
geom_density(color='#023020', fill='gray') + facet_wrap(~variable, scales = 'free',  ncol = 4) + theme_bw()



## ----box_plot------------------------------------------------------------------------------------------------------------------------------

m_df %>% ggplot(aes(x=target, y=value, group=target)) + 
geom_boxplot(color='#023020', fill='gray') + facet_wrap(~variable, scales = 'free',  ncol = 4) +
  stat_summary(fun = "mean",  geom = "point", shape = 8, size = 2, color = "steelblue") + 
  stat_summary(fun = "median",  geom = "point", shape = 8, size = 2, color = "red") + theme_bw()



## ----balance-------------------------------------------------------------------------------------------------------------------------------
proportion <- colMeans(train_df['target'])
proportion


## ----corr----------------------------------------------------------------------------------------------------------------------------------

rcore <- rcorr(as.matrix(train_df %>% dplyr::select(where(is.numeric))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .7, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust",
         diag=FALSE)



## ----corr_numbers--------------------------------------------------------------------------------------------------------------------------
tst <- train_df
tst <- tst[,13]
kable(cor(drop_na(train_df))[,13], "html", escape = F, col.names = c('Coefficient')) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T)


## ----factor--------------------------------------------------------------------------------------------------------------------------------
transformed_train_df <- train_df
transformed_train_df$chas <- as.factor(transformed_train_df$chas)
transformed_train_df$target <- as.factor(transformed_train_df$target)
transformed_train_df$rad <- as.factor(transformed_train_df$rad)


## ----count_zn------------------------------------------------------------------------------------------------------------------------------
count(train_df,zn)

transformed_train_df <- subset(transformed_train_df, select = -c(zn))


## ----split_data_1--------------------------------------------------------------------------------------------------------------------------
#split data for training and testing
sample <- sample.split(transformed_train_df$target, SplitRatio = 0.8)
transformed_training  <- subset(transformed_train_df, sample == TRUE)
transformed_testing   <- subset(transformed_train_df, sample == FALSE)


## ----boxcox, fig.keep="none"---------------------------------------------------------------------------------------------------------------
# Create new dataframe to work with
box_train_df <- transformed_train_df


age_boxcox <- boxcox(lm(box_train_df$age ~ 1))
age_lambda <- age_boxcox$x[which.max(age_boxcox$y)]
box_train_df$age <- BoxCox(box_train_df$age, age_lambda)

dis_boxcox <- boxcox(lm(box_train_df$dis ~ 1))
dis_lambda <- dis_boxcox$x[which.max(dis_boxcox$y)]
box_train_df$dis <- BoxCox(box_train_df$dis, dis_lambda)

lstat_boxcox <- boxcox(lm(box_train_df$lstat ~ 1))
lstat_lambda <- lstat_boxcox$x[which.max(lstat_boxcox$y)]
box_train_df$lstat <- BoxCox(box_train_df$lstat, lstat_lambda)

nox_boxcox <- boxcox(lm(box_train_df$nox ~ 1))
nox_lambda <- lstat_boxcox$x[which.max(nox_boxcox$y)]
box_train_df$nox <- BoxCox(box_train_df$nox, nox_lambda)

ptratio_boxcox <- boxcox(lm(box_train_df$ptratio ~ 1))
ptratio_lambda <- lstat_boxcox$x[which.max(ptratio_boxcox$y)]
box_train_df$ptratio <- BoxCox(box_train_df$ptratio, ptratio_lambda)

tax_boxcox <- boxcox(lm(box_train_df$tax ~ 1))
tax_lambda <- tax_boxcox$x[which.max(tax_boxcox$y)]
box_train_df$tax <- BoxCox(box_train_df$tax, tax_lambda)

indus_boxcox <- boxcox(lm(box_train_df$indus ~ 1))
indus_lambda <- indus_boxcox$x[which.max(indus_boxcox$y)]
box_train_df$indus <- BoxCox(box_train_df$indus, indus_lambda)

medv_boxcox <- boxcox(lm(box_train_df$medv ~ 1))
medv_lambda <- medv_boxcox$x[which.max(medv_boxcox$y)]
box_train_df$medv <- BoxCox(box_train_df$medv, medv_lambda)



## ----box_hists, warning=FALSE, message=FALSE-----------------------------------------------------------------------------------------------
m_df_box <- box_train_df %>% dplyr::select(c(age, dis, lstat, nox, ptratio, tax, indus, medv)) %>% melt() 

a <- m_df_box %>% ggplot(aes(x= value)) + 
geom_density(color='#023020', fill='gray') + facet_wrap(~variable, scales = 'free',  ncol = 1) + theme_bw()

m_df <- transformed_train_df %>% dplyr::select(c(age, dis, lstat, nox, ptratio, tax, indus, medv)) %>% melt() 

b <- m_df %>% ggplot(aes(x= value)) + 
geom_density(color='#023020', fill='gray') + facet_wrap(~variable, scales = 'free',  ncol = 1) + theme_bw()

ggarrange(a, b + rremove("x.text"), 
          labels = c("Transformed", "Original"))


## ----split_data_2--------------------------------------------------------------------------------------------------------------------------
# Split data for training and testing
sample <- sample.split(box_train_df$target, SplitRatio = 0.8)
box_training  <- subset(box_train_df, sample == TRUE)
box_testing   <- subset(box_train_df, sample == FALSE)


## ------------------------------------------------------------------------------------------------------------------------------------------
# set seed for consistancy 
set.seed(1233)

# 80/20 split of the training data set
sample <- sample.split(train_df$target, SplitRatio = 0.8)
train_data  <- subset(train_df, sample == TRUE)
test_data   <- subset(train_df, sample == FALSE)


# build X matrix and Y vector
X <- model.matrix(target ~ ., data=train_data)[,-1]
Y <- train_data[,"target"] 



## ----general, warning=FALSE, message=FALSE-------------------------------------------------------------------------------------------------
model_1 <- glm(transformed_training, formula = target ~., family = binomial(link = "logit"))
summ(model_1)


## ----model1_aic, warning=FALSE, message=FALSE----------------------------------------------------------------------------------------------
model_1_aic <- model_1 %>% stepAIC(trace = FALSE)
summary(model_1_aic)


## ----chi_model_1---------------------------------------------------------------------------------------------------------------------------
1-pchisq(model_1_aic$null.deviance-model_1_aic$deviance, model_1_aic$df.null- model_1_aic$df.residual)


## ----check_lm1-----------------------------------------------------------------------------------------------------------------------------
par(mfrow=c(2,2))
plot(model_1_aic)



## ----model_1_linearity---------------------------------------------------------------------------------------------------------------------
probabilities <- predict(model_1_aic, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)

#Only numeric predictors
data <- transformed_training %>%
  dplyr::select_if(is.numeric) 
predictors <- colnames(data)

# Bind the logit and tidying the data for plot
data <- data %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

#Scatter plot
ggplot(data, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")


## ----model_1_residuals---------------------------------------------------------------------------------------------------------------------
model_1_aic.data <- augment(model_1_aic) %>% 
  mutate(index = 1:n())

ggplot(model_1_aic.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw()


## ----warning=FALSE, message=FALSE----------------------------------------------------------------------------------------------------------
car::mmps(model_1_aic, span = 3/4, layout = c(2, 2))


## ----model_1_vif---------------------------------------------------------------------------------------------------------------------------
car::vif(model_1_aic)


## ----general_box, warning=FALSE, message=FALSE---------------------------------------------------------------------------------------------
model_2 <- glm(box_training, formula = target ~ ., family = binomial(link = "logit"))
summary(model_2)


## ----model2_aic, warning=FALSE, message=FALSE----------------------------------------------------------------------------------------------
model_2_aic <- model_2 %>% stepAIC(trace = FALSE)
summary(model_2_aic)


## ----chi_model_2---------------------------------------------------------------------------------------------------------------------------
1-pchisq(model_2_aic$null.deviance-model_2_aic$deviance, model_2_aic$df.null- model_2_aic$df.residual)


## ----check_lm2-----------------------------------------------------------------------------------------------------------------------------
par(mfrow=c(2,2))
plot(model_2_aic)


## ----model_2_linearity---------------------------------------------------------------------------------------------------------------------
probabilities <- predict(model_2_aic, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)

#Only numeric predictors
data <- box_training %>%
  dplyr::select_if(is.numeric) 
predictors <- colnames(data)

# Bind the logit and tidying the data for plot
data <- data %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

#Scatter plot
ggplot(data, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")


## ----model_2_residuals---------------------------------------------------------------------------------------------------------------------
model_2_aic.data <- augment(model_2_aic) %>% 
  mutate(index = 1:n())

ggplot(model_2_aic.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw()


## ----warning=FALSE, message=FALSE----------------------------------------------------------------------------------------------------------
car::mmps(model_2_aic, span = 3/4, layout = c(2, 2))


## ----model_2_vif---------------------------------------------------------------------------------------------------------------------------
car::vif(model_2_aic)


## ------------------------------------------------------------------------------------------------------------------------------------------

lasso.model<- cv.glmnet(x=X,y=Y,
                       family = "binomial", 
                       link = "logit",
                       standardize = TRUE,                       #standardize  
                       nfold = 5,
                       alpha=1)                                  #alpha=1 is lasso

l.min <- lasso.model$lambda.min
l.1se <- lasso.model$lambda.1se
coef(lasso.model, s = "lambda.min" )
coef(lasso.model, s = "lambda.1se" )
lasso.model



## ------------------------------------------------------------------------------------------------------------------------------------------

par(mfrow=c(2,2))

plot(lasso.model)
plot(lasso.model$glmnet.fit, xvar="lambda", label=TRUE)
plot(lasso.model$glmnet.fit, xvar='dev', label=TRUE)

rocs <- roc.glmnet(lasso.model, newx = X, newy = Y )
plot(rocs,type="l")  



## ------------------------------------------------------------------------------------------------------------------------------------------
assess.glmnet(lasso.model,           
              newx = X,              
              newy = Y )    

print(glmnet_cv_aicc(lasso.model, 'lambda.min'))
print(glmnet_cv_aicc(lasso.model, 'lambda.1se'))



## ------------------------------------------------------------------------------------------------------------------------------------------
coef(lasso.model, s = "lambda.1se" )
vip(lasso.model, num_features=20 ,geom = "col", include_type=TRUE, lambda = "lambda.1se")


## ------------------------------------------------------------------------------------------------------------------------------------------
# Create matrix new data
X_test <- model.matrix(target ~ ., data=test_data)[,-1]
Y_test <- test_data[,"target"] 


# predict using coefficients at lambda.min
lassoPred <- predict(lasso.model, newx = X_test, type = "response", s = 'lambda.1se')

pred_df <- test_data
pred_df$target_prob <- lassoPred[,1]
pred_df$target_pred <- ifelse(lassoPred > 0.5, 1, 0)[,1]



## ------------------------------------------------------------------------------------------------------------------------------------------

confusion.glmnet(lasso.model, newx = X_test, newy = Y_test, s = 'lambda.1se')



## ------------------------------------------------------------------------------------------------------------------------------------------

pred_df <- pred_df %>%
  mutate(logit = log(target_prob/(1-target_prob))) 

m_df <- pred_df %>% pivot_longer(!c(target,target_prob,target_pred,logit), 
                                 names_to='variable' , values_to = 'value')

#Scatter plot
ggplot(m_df, aes(logit, value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~variable, scales = "free_y")



## ------------------------------------------------------------------------------------------------------------------------------------------
pred_df$index <- as.numeric(rownames(pred_df))
pred_df$resid <- pred_df$target_prob - pred_df$target


ggplot(pred_df, aes(index, resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw()



## ----model_1_confusionMatrix---------------------------------------------------------------------------------------------------------------

transformed_testing$model_1 <- ifelse(predict.glm(model_1_aic, transformed_testing, "response") >= 0.5, 1, 0)
conf_matrix_1 <- confusionMatrix(factor(transformed_testing$model_1), factor(transformed_testing$target), "1")


results <- tibble(Model = "Model #1", Accuracy=conf_matrix_1$byClass[11], 
                  "Classification error rate" = 1 - conf_matrix_1$byClass[11],
                  F1 = conf_matrix_1$byClass[7],
                  Deviance= model_1_aic$deviance, 
                  R2 = 1 - model_1_aic$deviance / model_1_aic$null.deviance,
                  Sensitivity = conf_matrix_1$byClass["Sensitivity"],
                  Specificity = conf_matrix_1$byClass["Specificity"],
                  Precision = precision(factor(transformed_testing$model_1), factor(transformed_testing$target), "1"),
                  AIC= model_1_aic$aic)

conf_matrix_1


## ----model_2_confusionMatrix---------------------------------------------------------------------------------------------------------------

box_testing$model_2 <- ifelse(predict.glm(model_2_aic, box_testing, "response") >= 0.5, 1, 0)
conf_matrix_2 <- confusionMatrix(factor(box_testing$model_2), factor(box_testing$target), "1")


results2 <- rbind(results,tibble(Model = "Model #2", Accuracy=conf_matrix_2$byClass[11], 
                  "Classification error rate" = 1 - conf_matrix_2$byClass[11],
                  F1 = conf_matrix_2$byClass[7],
                  Deviance=model_2_aic$deviance, 
                  R2 = 1 - model_2_aic$deviance / model_2_aic$null.deviance,
                  Sensitivity = conf_matrix_2$byClass["Sensitivity"],
                  Specificity = conf_matrix_2$byClass["Specificity"],
                  Precision = precision(factor(box_testing$model_2), factor(box_testing$target), "1"),
                  AIC= model_2_aic$aic))

conf_matrix_2


## ----show_confusion_matrix_3---------------------------------------------------------------------------------------------------------------
conf_matrix_3 = confusionMatrix(factor(pred_df$target_pred),factor(pred_df$target), "1")
conf_matrix_3$byClass


lasso.r1 <- assess.glmnet(lasso.model,           
                                newx = X_test,              
                                newy = Y_test )   
lasso.r2 <- glmnet_cv_aicc(lasso.model, 'lambda.1se')


results2 <- rbind(results2,tibble(Model = "Model #3", Accuracy=lasso.r1$auc[1], 
                  "Classification error rate" = 1 - lasso.r1$auc[1],
                  F1 = conf_matrix_3$byClass[7],
                  Deviance=lasso.r1$deviance[[1]], 
                  R2 = NA,
                  Sensitivity = conf_matrix_3$byClass["Sensitivity"],
                  Specificity = conf_matrix_3$byClass["Specificity"],
                  Precision = precision(factor(pred_df$target_pred),factor(pred_df$target), "1"),
                  AIC= lasso.r2$AICc))



## ------------------------------------------------------------------------------------------------------------------------------------------
kable(results2, digits=4) %>% 
  kable_styling(bootstrap_options = "basic", position = "center")


## ----roc_model_2, warning=FALSE, message=FALSE---------------------------------------------------------------------------------------------
par(mfrow=c(2,2))

# Model 1 curve
plot(roc(transformed_testing$target, transformed_testing$model_1, plot = FALSE,  print.auc = TRUE), quiet=TRUE, main="Model 1 - ROC Curve")


# Model 2 curve
plot(roc(box_testing$target, box_testing$model_2, plot = FALSE,  print.auc = TRUE, quiet=TRUE), main="Model 2 - ROC Curve")


# Model 3 curve
plot(roc(pred_df$target, pred_df$target_pred, plot = FALSE,  print.auc = TRUE, quiet=TRUE), main="Model 3 - ROC Curve")


## ------------------------------------------------------------------------------------------------------------------------------------------

# build X matrix and Y vector
eval_df$target <- 0
X_eval <- model.matrix(target ~ ., data=eval_df)[,-1]

# predict using coefficients at lambda.min
lassoEval <- predict(lasso.model, newx = X_eval, type = "response", s = 'lambda.1se')

eval_pred_df <- eval_df
eval_pred_df$target_prob <- lassoEval[,1]
eval_pred_df$target_pred <- ifelse(lassoEval > 0.5, 1, 0)[,1]

eval_pred_df <- subset(eval_pred_df, select = -c(target))
write.csv(eval_pred_df, 'eval_predictions_lasso.csv', row.names=F)


## ----results_table-------------------------------------------------------------------------------------------------------------------------
DT::datatable(
      eval_pred_df,
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE)