Section 1 - Data Exploration

The original data was loaded containing 466 observations, 13 variable predictors, and 1 response variable with about 49% were classified with having a crime rate above the median crime rate. A summary and boxplots of all variables reveal no missing data, but suggest some columns may contain outliers.

library(ggcorrplot)
library(car)
library(MASS)
library(dplyr)
library(ggplot2)
library(caret)
library(pROC)
library(pscl)

# Read the training data
df <- read.csv("https://raw.githubusercontent.com/L-Velasco/DATA621_FA18/master/HW3/crime-training-data.csv", stringsAsFactors = FALSE)

original_tr <- df

1.1 Size, Summary and Boxplots of the Data

dim(original_tr)
## [1] 466  14
summary(original_tr)
##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         black            lstat       
##  Min.   :187.0   Min.   :12.6   Min.   :  0.32   Min.   : 1.730  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.:375.61   1st Qu.: 7.043  
##  Median :334.5   Median :18.9   Median :391.34   Median :11.350  
##  Mean   :409.5   Mean   :18.4   Mean   :357.12   Mean   :12.631  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:396.24   3rd Qu.:16.930  
##  Max.   :711.0   Max.   :22.0   Max.   :396.90   Max.   :37.970  
##       medv           target      
##  Min.   : 5.00   Min.   :0.0000  
##  1st Qu.:17.02   1st Qu.:0.0000  
##  Median :21.20   Median :0.0000  
##  Mean   :22.59   Mean   :0.4914  
##  3rd Qu.:25.00   3rd Qu.:1.0000  
##  Max.   :50.00   Max.   :1.0000
table(original_tr$target)/sum(table(original_tr$target))
## 
##         0         1 
## 0.5085837 0.4914163

1.2 Relationships and Correlations with Response Variable

Based on given predictors, the plots suggest a crime rate being above the median crime rate in neighborhoods with higher values/proportions of non-retail business acres (indus), nitrogen oxides concentration (nox), age of units prior to 1940 (age), accessibility to radial highways (rad), property tax rate (tax) pupil-teacher ratio (ptratio), and lower status of population (lstat).

The values below shows the correlation of response variable with all predictor variables:

##          [,1]
## zn      -0.43
## indus    0.60
## chas     0.08
## nox      0.73
## rm      -0.15
## age      0.63
## dis     -0.62
## rad      0.63
## tax      0.61
## ptratio  0.25
## black   -0.35
## lstat    0.47
## medv    -0.27
## target   1.00

1.3 Correlations Among Predictor Variables

Judging from the correlation plot and considering an 80% and above correlation percentage between predictors, there appear some strong relationships between the following variables:

rad, tax = accessibility to radial highways and full-value property tax rate

indus, nox = proportion of non-retail business acres and nitrogen oxides concentration

dis, nox = mean distances to employment centers and nitrogen oxides concentration

dis, age = mean distances to employment centers and proportion of owner-occupied units built prior to 1940

In contrast, chas, a dummy variable indicating whether the suburbs borders the Charles River, has little to no relationship among the predictors.

corr<- round(cor(original_tr),1)
ggcorrplot(corr, lab = TRUE)

Section 2 - Data Preparation

For transformation, outliers will be treated and one variable will be removed

2.1 Remove variables

# Remove chas  
new_tr <- dplyr::select(original_tr, -chas)
dim(new_tr)
## [1] 466  13

2.2 Cap Outliers

Any outliers outside of lower 1.5IQR would be capped at 5th %ile, and observations above the upper 1.5IQR would be capped at 95th %ile.

# Outlier Capping

id <- c(1:12)
for (val in id) {
  qnt <- quantile(new_tr[,val], probs=c(.25, .75), na.rm = T)
  caps <- quantile(new_tr[,val], probs=c(.05, .95), na.rm = T)
  H <- 1.5 * IQR(new_tr[,val], na.rm = T)
  new_tr[,val][new_tr[,val] < (qnt[1] - H)] <- caps[1]
  new_tr[,val][new_tr[,val] > (qnt[2] + H)] <- caps[2]
}

Section 3 - Build Models

Split the training data set such that 60% of the observations will be used to train the model and 40% will be used to test the model to derive performance metrics.

oTrain <- createDataPartition(original_tr$target, p=0.6, list=FALSE)
otraining <- original_tr[oTrain,]
otesting <- original_tr[-oTrain,]


nTrain <- createDataPartition(new_tr$target, p=0.6, list=FALSE)
ntraining <- new_tr[nTrain,]
ntesting <- new_tr[-nTrain,]

set.seed(123)

3.1 Model with all original variables

# build the model using training set
full.model <- glm(target ~., data = otraining , family = binomial)
summary(full.model)
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = otraining)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7575  -0.1241  -0.0014   0.0041   3.6567  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -43.197948   9.464799  -4.564 5.02e-06 ***
## zn           -0.062587   0.045567  -1.374  0.16959    
## indus        -0.100889   0.061837  -1.632  0.10278    
## chas          1.032045   1.004216   1.028  0.30409    
## nox          52.910638  11.037793   4.794 1.64e-06 ***
## rm           -0.823585   1.017438  -0.809  0.41825    
## age           0.050586   0.019678   2.571  0.01015 *  
## dis           0.884916   0.304671   2.904  0.00368 ** 
## rad           0.651020   0.209885   3.102  0.00192 ** 
## tax          -0.006167   0.004029  -1.531  0.12584    
## ptratio       0.557943   0.174304   3.201  0.00137 ** 
## black        -0.008593   0.005917  -1.452  0.14646    
## lstat         0.009696   0.071740   0.135  0.89249    
## medv          0.234534   0.092172   2.545  0.01094 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.03  on 279  degrees of freedom
## Residual deviance: 112.12  on 266  degrees of freedom
## AIC: 140.12
## 
## Number of Fisher Scoring iterations: 9
round(exp(cbind(Estimate=coef(full.model))),2)
##                 Estimate
## (Intercept) 0.000000e+00
## zn          9.400000e-01
## indus       9.000000e-01
## chas        2.810000e+00
## nox         9.523536e+22
## rm          4.400000e-01
## age         1.050000e+00
## dis         2.420000e+00
## rad         1.920000e+00
## tax         9.900000e-01
## ptratio     1.750000e+00
## black       9.900000e-01
## lstat       1.010000e+00
## medv        1.260000e+00
# evaluate the model by predicting using the testing set
m1_prob <- predict(full.model, otesting, type = "response")
m1_pclass <- ifelse(m1_prob >= 0.5, 1, 0)

# create confusion matrix
pclass <- factor(m1_pclass,levels = c(1,0))
aclass <- factor(otesting$target,levels = c(1,0))
confusionMatrix(pclass, aclass);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  0
##          1 81  7
##          0 11 87
##                                           
##                Accuracy : 0.9032          
##                  95% CI : (0.8514, 0.9416)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8063          
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.8804          
##             Specificity : 0.9255          
##          Pos Pred Value : 0.9205          
##          Neg Pred Value : 0.8878          
##              Prevalence : 0.4946          
##          Detection Rate : 0.4355          
##    Detection Prevalence : 0.4731          
##       Balanced Accuracy : 0.9030          
##                                           
##        'Positive' Class : 1               
## 
# plot and show area under the curve
plot(roc(otesting$target, m1_prob),print.auc=TRUE)

# get McFadden
m1_mcFadden <- pR2(full.model); m1_mcFadden["McFadden"]
##  McFadden 
## 0.7110564

3.2 Model with transformed dataset

# build the model using training set
transformed.model <- glm(target ~., data = ntraining, family = binomial)
summary(transformed.model)
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = ntraining)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6297  -0.2989  -0.0086   0.0038   3.1830  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -36.213380   8.048091  -4.500 6.81e-06 ***
## zn           -0.035050   0.041893  -0.837  0.40279    
## indus        -0.053894   0.055428  -0.972  0.33089    
## nox          45.423683   9.272543   4.899 9.65e-07 ***
## rm           -0.205872   0.893224  -0.230  0.81772    
## age           0.030031   0.015437   1.945  0.05173 .  
## dis           0.523904   0.271135   1.932  0.05333 .  
## rad           0.607347   0.221664   2.740  0.00615 ** 
## tax          -0.009094   0.003591  -2.532  0.01133 *  
## ptratio       0.401609   0.157360   2.552  0.01071 *  
## black        -0.006738   0.003299  -2.043  0.04110 *  
## lstat         0.071248   0.064616   1.103  0.27018    
## medv          0.151790   0.076962   1.972  0.04858 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 387.81  on 279  degrees of freedom
## Residual deviance: 126.58  on 267  degrees of freedom
## AIC: 152.58
## 
## Number of Fisher Scoring iterations: 9
round(exp(cbind(Estimate=coef(transformed.model))),2)
##                 Estimate
## (Intercept) 0.000000e+00
## zn          9.700000e-01
## indus       9.500000e-01
## nox         5.336483e+19
## rm          8.100000e-01
## age         1.030000e+00
## dis         1.690000e+00
## rad         1.840000e+00
## tax         9.900000e-01
## ptratio     1.490000e+00
## black       9.900000e-01
## lstat       1.070000e+00
## medv        1.160000e+00
# evaluate the model by predicting using the testing set
m2_prob <- predict(transformed.model, ntesting, type = "response")
m2_pclass <- ifelse(m2_prob >= 0.5, 1, 0)

# create confusion matrix
pclass <- factor(m2_pclass,levels = c(1,0))
aclass <- factor(ntesting$target,levels = c(1,0))
confusionMatrix(pclass, aclass)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  0
##          1 84  8
##          0 10 84
##                                           
##                Accuracy : 0.9032          
##                  95% CI : (0.8514, 0.9416)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8065          
##  Mcnemar's Test P-Value : 0.8137          
##                                           
##             Sensitivity : 0.8936          
##             Specificity : 0.9130          
##          Pos Pred Value : 0.9130          
##          Neg Pred Value : 0.8936          
##              Prevalence : 0.5054          
##          Detection Rate : 0.4516          
##    Detection Prevalence : 0.4946          
##       Balanced Accuracy : 0.9033          
##                                           
##        'Positive' Class : 1               
## 
# plot and show area under the curve
plot(roc(ntesting$target, m2_prob),print.auc=TRUE)

# get McFadden
m2_mcFadden <- pR2(transformed.model); m2_mcFadden["McFadden"]
##  McFadden 
## 0.6736048

3.3 Stepwise variable selection

# build the model using training set
step.model <- transformed.model %>% stepAIC(trace = FALSE)
summary(step.model)
## 
## Call:
## glm(formula = target ~ nox + age + dis + rad + tax + ptratio + 
##     black + medv, family = binomial, data = ntraining)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.58936  -0.31262  -0.02618   0.00426   2.98275  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -32.828505   7.153286  -4.589 4.45e-06 ***
## nox          40.095011   7.836512   5.116 3.11e-07 ***
## age           0.034088   0.012437   2.741 0.006129 ** 
## dis           0.393870   0.233398   1.688 0.091498 .  
## rad           0.645589   0.193535   3.336 0.000851 ***
## tax          -0.010177   0.003203  -3.177 0.001486 ** 
## ptratio       0.397401   0.137997   2.880 0.003979 ** 
## black        -0.006435   0.003261  -1.974 0.048436 *  
## medv          0.097175   0.043636   2.227 0.025950 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 387.81  on 279  degrees of freedom
## Residual deviance: 129.89  on 271  degrees of freedom
## AIC: 147.89
## 
## Number of Fisher Scoring iterations: 9
round(exp(cbind(Estimate=coef(step.model))),2)
##                 Estimate
## (Intercept) 0.000000e+00
## nox         2.588463e+17
## age         1.030000e+00
## dis         1.480000e+00
## rad         1.910000e+00
## tax         9.900000e-01
## ptratio     1.490000e+00
## black       9.900000e-01
## medv        1.100000e+00
# evaluate the model by predicting using the testing set
m3_prob <- predict(step.model, ntesting, type = "response")
m3_pclass <- ifelse(m3_prob >= 0.5, 1, 0)

# create confusion matrix
pclass <- factor(m3_pclass,levels = c(1,0))
aclass <- factor(ntesting$target,levels = c(1,0))
confusionMatrix(pclass, aclass)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  0
##          1 84  9
##          0 10 83
##                                           
##                Accuracy : 0.8978          
##                  95% CI : (0.8451, 0.9374)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7957          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8936          
##             Specificity : 0.9022          
##          Pos Pred Value : 0.9032          
##          Neg Pred Value : 0.8925          
##              Prevalence : 0.5054          
##          Detection Rate : 0.4516          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8979          
##                                           
##        'Positive' Class : 1               
## 
# plot and show area under the curve
plot(roc(ntesting$target, m3_prob),print.auc=TRUE)

# get McFadden
m3_mcFadden <- pR2(step.model); m3_mcFadden["McFadden"]
##  McFadden 
## 0.6650601

Section 4 - Select Models

Although not big difference in the performance of all 3 models, the transformed model performs slightly better in accuracy, recall, precision, AIC, and roc measure.

The second model will be used to predict the evaluation dataset.

# Read the training data
df <- read.csv("https://raw.githubusercontent.com/L-Velasco/DATA621_FA18/master/HW3/crime-evaluation-data.csv", stringsAsFactors = FALSE)

eval_df <- df

eval_df <- dplyr::select(df, -chas)

id <- c(1:12)
for (val in id) {
  qnt <- quantile(eval_df[,val], probs=c(.25, .75), na.rm = T)
  caps <- quantile(eval_df[,val], probs=c(.05, .95), na.rm = T)
  H <- 1.5 * IQR(eval_df[,val], na.rm = T)
  eval_df[,val][eval_df[,val] < (qnt[1] - H)] <- caps[1]
  eval_df[,val][eval_df[,val] > (qnt[2] + H)] <- caps[2]
}

eval_prob <- predict(transformed.model, eval_df, type = "response")
eval_pclass <- ifelse(eval_prob >= 0.5, 1, 0)


eval_df$target <- eval_pclass
  
# Export
  write.csv(eval_df,file="HW3_Neighborhoods Crime.csv")