The original data was loaded containing 466 observations, 13 predictor variables, and 1 response variable; about 49% of the observations are classified as having a crime rate above the median. A summary and boxplots of all variables reveal no missing data, but suggest some columns may contain outliers.
library(ggcorrplot)  # correlation heatmap
library(car)         # regression diagnostics
library(MASS)        # stepAIC
library(dplyr)       # data manipulation
library(ggplot2)     # plotting
library(caret)       # data partitioning and confusion matrices
library(pROC)        # ROC curves and AUC
library(pscl)        # pseudo-R^2 (pR2)
# Read the training data
df <- read.csv("https://raw.githubusercontent.com/L-Velasco/DATA621_FA18/master/HW3/crime-training-data.csv", stringsAsFactors = FALSE)
original_tr <- df
dim(original_tr)
## [1] 466 14
summary(original_tr)
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio black lstat
## Min. :187.0 Min. :12.6 Min. : 0.32 Min. : 1.730
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.:375.61 1st Qu.: 7.043
## Median :334.5 Median :18.9 Median :391.34 Median :11.350
## Mean :409.5 Mean :18.4 Mean :357.12 Mean :12.631
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:396.24 3rd Qu.:16.930
## Max. :711.0 Max. :22.0 Max. :396.90 Max. :37.970
## medv target
## Min. : 5.00 Min. :0.0000
## 1st Qu.:17.02 1st Qu.:0.0000
## Median :21.20 Median :0.0000
## Mean :22.59 Mean :0.4914
## 3rd Qu.:25.00 3rd Qu.:1.0000
## Max. :50.00 Max. :1.0000
table(original_tr$target)/sum(table(original_tr$target))
##
## 0 1
## 0.5085837 0.4914163
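The boxplots referenced above are not reproduced in code here; a minimal sketch of how they could be generated, one panel per predictor since the scales vary widely, is:
# One boxplot per predictor to eyeball potential outliers
op <- par(mfrow = c(3, 5), mar = c(2, 2, 2, 1))
for (v in setdiff(names(original_tr), "target")) {
  boxplot(original_tr[[v]], main = v)
}
par(op)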
Based on the given predictors, the plots suggest the crime rate tends to be above the median in neighborhoods with higher values of: proportion of non-retail business acres (indus), nitrogen oxide concentration (nox), proportion of units built prior to 1940 (age), accessibility to radial highways (rad), property tax rate (tax), pupil-teacher ratio (ptratio), and percentage of lower-status population (lstat).
The values below show the correlation of the response variable with all predictor variables:
## [,1]
## zn -0.43
## indus 0.60
## chas 0.08
## nox 0.73
## rm -0.15
## age 0.63
## dis -0.62
## rad 0.63
## tax 0.61
## ptratio 0.25
## black -0.35
## lstat 0.47
## medv -0.27
## target 1.00
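A sketch of how these values can be reproduced (the two-decimal rounding matches the output above):
# Correlation of each variable with the binary target
as.matrix(round(cor(original_tr)[, "target"], 2))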
Judging from the correlation plot, and flagging pairs with correlations of 0.8 or higher in absolute value, there appear to be strong relationships between the following predictors:
rad, tax = accessibility to radial highways and full-value property tax rate
indus, nox = proportion of non-retail business acres and nitrogen oxides concentration
dis, nox = mean distances to employment centers and nitrogen oxides concentration
dis, age = mean distances to employment centers and proportion of owner-occupied units built prior to 1940
In contrast, chas, a dummy variable indicating whether the suburb borders the Charles River, has little to no relationship with the other predictors.
# Correlation matrix rounded to one decimal, displayed as a heatmap
corr <- round(cor(original_tr), 1)
ggcorrplot(corr, lab = TRUE)
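The flagged pairs can also be extracted programmatically rather than read off the plot. A minimal sketch, using the same one-decimal rounding as the heatmap (the exact pairs returned depend on the cutoff):
# List predictor pairs with |correlation| >= 0.8 (upper triangle only)
cm <- round(cor(original_tr[, setdiff(names(original_tr), "target")]), 1)
idx <- which(abs(cm) >= 0.8 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           cor  = cm[idx])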
For transformation, outliers will be capped and one variable (chas) will be removed.
# Remove chas
new_tr <- dplyr::select(original_tr, -chas)
dim(new_tr)
## [1] 466 13
Observations below Q1 − 1.5·IQR are capped at the 5th percentile, and observations above Q3 + 1.5·IQR are capped at the 95th percentile.
# Outlier capping: replace values outside the 1.5*IQR fences
# with the 5th/95th percentiles (columns 1:12 are the predictors)
for (val in 1:12) {
  qnt <- quantile(new_tr[, val], probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(new_tr[, val], probs = c(.05, .95), na.rm = TRUE)
  H <- 1.5 * IQR(new_tr[, val], na.rm = TRUE)
  new_tr[, val][new_tr[, val] < (qnt[1] - H)] <- caps[1]
  new_tr[, val][new_tr[, val] > (qnt[2] + H)] <- caps[2]
}
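A quick before-and-after comparison confirms the caps took effect; zn is used here as an example since its maximum (100) lies well above its upper fence:
# Compare a capped column before and after
summary(original_tr$zn)
summary(new_tr$zn)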
Split the data set so that 60% of the observations are used to train the models and 40% are held out to derive performance metrics.
set.seed(123)  # seed before partitioning so the splits are reproducible
oTrain <- createDataPartition(original_tr$target, p = 0.6, list = FALSE)
otraining <- original_tr[oTrain, ]
otesting <- original_tr[-oTrain, ]
nTrain <- createDataPartition(new_tr$target, p = 0.6, list = FALSE)
ntraining <- new_tr[nTrain, ]
ntesting <- new_tr[-nTrain, ]
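createDataPartition samples within groups of the outcome, so both splits should retain the roughly even class balance seen earlier; a quick check (a sketch):
# Proportion of each class in the training and testing splits
round(prop.table(table(otraining$target)), 2)
round(prop.table(table(otesting$target)), 2)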
# build the model using training set
full.model <- glm(target ~., data = otraining , family = binomial)
summary(full.model)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = otraining)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7575 -0.1241 -0.0014 0.0041 3.6567
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -43.197948 9.464799 -4.564 5.02e-06 ***
## zn -0.062587 0.045567 -1.374 0.16959
## indus -0.100889 0.061837 -1.632 0.10278
## chas 1.032045 1.004216 1.028 0.30409
## nox 52.910638 11.037793 4.794 1.64e-06 ***
## rm -0.823585 1.017438 -0.809 0.41825
## age 0.050586 0.019678 2.571 0.01015 *
## dis 0.884916 0.304671 2.904 0.00368 **
## rad 0.651020 0.209885 3.102 0.00192 **
## tax -0.006167 0.004029 -1.531 0.12584
## ptratio 0.557943 0.174304 3.201 0.00137 **
## black -0.008593 0.005917 -1.452 0.14646
## lstat 0.009696 0.071740 0.135 0.89249
## medv 0.234534 0.092172 2.545 0.01094 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.03 on 279 degrees of freedom
## Residual deviance: 112.12 on 266 degrees of freedom
## AIC: 140.12
##
## Number of Fisher Scoring iterations: 9
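Exponentiating the coefficients converts them from the log-odds scale to odds ratios. For example, exp(0.2345) ≈ 1.26 for medv means that each additional $1,000 of median home value multiplies the odds of an above-median crime rate by about 1.26, holding the other predictors fixed.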
# Exponentiate coefficients to get odds ratios
round(exp(cbind(Estimate = coef(full.model))), 2)
## Estimate
## (Intercept) 0.000000e+00
## zn 9.400000e-01
## indus 9.000000e-01
## chas 2.810000e+00
## nox 9.523536e+22
## rm 4.400000e-01
## age 1.050000e+00
## dis 2.420000e+00
## rad 1.920000e+00
## tax 9.900000e-01
## ptratio 1.750000e+00
## black 9.900000e-01
## lstat 1.010000e+00
## medv 1.260000e+00
# evaluate the model by predicting using the testing set
m1_prob <- predict(full.model, otesting, type = "response")
m1_pclass <- ifelse(m1_prob >= 0.5, 1, 0)
# create confusion matrix
pclass <- factor(m1_pclass, levels = c(1, 0))
aclass <- factor(otesting$target, levels = c(1, 0))
confusionMatrix(pclass, aclass)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 81 7
## 0 11 87
##
## Accuracy : 0.9032
## 95% CI : (0.8514, 0.9416)
## No Information Rate : 0.5054
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8063
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.8804
## Specificity : 0.9255
## Pos Pred Value : 0.9205
## Neg Pred Value : 0.8878
## Prevalence : 0.4946
## Detection Rate : 0.4355
## Detection Prevalence : 0.4731
## Balanced Accuracy : 0.9030
##
## 'Positive' Class : 1
##
# plot and show area under the curve
plot(roc(otesting$target, m1_prob),print.auc=TRUE)
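The ROC curve plots sensitivity against 1 − specificity across all classification thresholds, so the area under it (AUC) summarizes discrimination independently of the 0.5 cutoff used above; values near 1 indicate near-perfect separation of the classes.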
# get McFadden
m1_mcFadden <- pR2(full.model); m1_mcFadden["McFadden"]
## McFadden
## 0.7110564
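McFadden's pseudo-R² compares the fitted model's log-likelihood to that of an intercept-only model: R² = 1 − ln L(model) / ln L(null). A sketch of the same quantity computed by hand, which should agree with the pR2 value above:
# Intercept-only model on the same training data
null.model <- glm(target ~ 1, data = otraining, family = binomial)
# McFadden = 1 - logLik(model) / logLik(null)
1 - as.numeric(logLik(full.model)) / as.numeric(logLik(null.model))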
# build the model using training set
transformed.model <- glm(target ~., data = ntraining, family = binomial)
summary(transformed.model)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = ntraining)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6297 -0.2989 -0.0086 0.0038 3.1830
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.213380 8.048091 -4.500 6.81e-06 ***
## zn -0.035050 0.041893 -0.837 0.40279
## indus -0.053894 0.055428 -0.972 0.33089
## nox 45.423683 9.272543 4.899 9.65e-07 ***
## rm -0.205872 0.893224 -0.230 0.81772
## age 0.030031 0.015437 1.945 0.05173 .
## dis 0.523904 0.271135 1.932 0.05333 .
## rad 0.607347 0.221664 2.740 0.00615 **
## tax -0.009094 0.003591 -2.532 0.01133 *
## ptratio 0.401609 0.157360 2.552 0.01071 *
## black -0.006738 0.003299 -2.043 0.04110 *
## lstat 0.071248 0.064616 1.103 0.27018
## medv 0.151790 0.076962 1.972 0.04858 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 387.81 on 279 degrees of freedom
## Residual deviance: 126.58 on 267 degrees of freedom
## AIC: 152.58
##
## Number of Fisher Scoring iterations: 9
# Exponentiate coefficients to get odds ratios
round(exp(cbind(Estimate = coef(transformed.model))), 2)
## Estimate
## (Intercept) 0.000000e+00
## zn 9.700000e-01
## indus 9.500000e-01
## nox 5.336483e+19
## rm 8.100000e-01
## age 1.030000e+00
## dis 1.690000e+00
## rad 1.840000e+00
## tax 9.900000e-01
## ptratio 1.490000e+00
## black 9.900000e-01
## lstat 1.070000e+00
## medv 1.160000e+00
# evaluate the model by predicting using the testing set
m2_prob <- predict(transformed.model, ntesting, type = "response")
m2_pclass <- ifelse(m2_prob >= 0.5, 1, 0)
# create confusion matrix
pclass <- factor(m2_pclass,levels = c(1,0))
aclass <- factor(ntesting$target,levels = c(1,0))
confusionMatrix(pclass, aclass)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 84 8
## 0 10 84
##
## Accuracy : 0.9032
## 95% CI : (0.8514, 0.9416)
## No Information Rate : 0.5054
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8065
## Mcnemar's Test P-Value : 0.8137
##
## Sensitivity : 0.8936
## Specificity : 0.9130
## Pos Pred Value : 0.9130
## Neg Pred Value : 0.8936
## Prevalence : 0.5054
## Detection Rate : 0.4516
## Detection Prevalence : 0.4946
## Balanced Accuracy : 0.9033
##
## 'Positive' Class : 1
##
# plot and show area under the curve
plot(roc(ntesting$target, m2_prob),print.auc=TRUE)
# get McFadden
m2_mcFadden <- pR2(transformed.model); m2_mcFadden["McFadden"]
## McFadden
## 0.6736048
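The third model applies stepwise selection by AIC to the transformed model. When no scope is supplied, stepAIC performs backward elimination: starting from all predictors, it repeatedly drops the variable whose removal most reduces the AIC and stops when no removal helps.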
# build the model using training set
step.model <- transformed.model %>% stepAIC(trace = FALSE)
summary(step.model)
##
## Call:
## glm(formula = target ~ nox + age + dis + rad + tax + ptratio +
## black + medv, family = binomial, data = ntraining)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.58936 -0.31262 -0.02618 0.00426 2.98275
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -32.828505 7.153286 -4.589 4.45e-06 ***
## nox 40.095011 7.836512 5.116 3.11e-07 ***
## age 0.034088 0.012437 2.741 0.006129 **
## dis 0.393870 0.233398 1.688 0.091498 .
## rad 0.645589 0.193535 3.336 0.000851 ***
## tax -0.010177 0.003203 -3.177 0.001486 **
## ptratio 0.397401 0.137997 2.880 0.003979 **
## black -0.006435 0.003261 -1.974 0.048436 *
## medv 0.097175 0.043636 2.227 0.025950 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 387.81 on 279 degrees of freedom
## Residual deviance: 129.89 on 271 degrees of freedom
## AIC: 147.89
##
## Number of Fisher Scoring iterations: 9
# Exponentiate coefficients to get odds ratios
round(exp(cbind(Estimate = coef(step.model))), 2)
## Estimate
## (Intercept) 0.000000e+00
## nox 2.588463e+17
## age 1.030000e+00
## dis 1.480000e+00
## rad 1.910000e+00
## tax 9.900000e-01
## ptratio 1.490000e+00
## black 9.900000e-01
## medv 1.100000e+00
# evaluate the model by predicting using the testing set
m3_prob <- predict(step.model, ntesting, type = "response")
m3_pclass <- ifelse(m3_prob >= 0.5, 1, 0)
# create confusion matrix
pclass <- factor(m3_pclass,levels = c(1,0))
aclass <- factor(ntesting$target,levels = c(1,0))
confusionMatrix(pclass, aclass)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 84 9
## 0 10 83
##
## Accuracy : 0.8978
## 95% CI : (0.8451, 0.9374)
## No Information Rate : 0.5054
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7957
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8936
## Specificity : 0.9022
## Pos Pred Value : 0.9032
## Neg Pred Value : 0.8925
## Prevalence : 0.5054
## Detection Rate : 0.4516
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8979
##
## 'Positive' Class : 1
##
# plot and show area under the curve
plot(roc(ntesting$target, m3_prob),print.auc=TRUE)
# get McFadden
m3_mcFadden <- pR2(step.model); m3_mcFadden["McFadden"]
## McFadden
## 0.6650601
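Collecting the key metrics reported above for a side-by-side comparison:

Model            Accuracy  Sensitivity  Specificity   Kappa     AIC  McFadden
1 (full)           0.9032       0.8804       0.9255  0.8063  140.12    0.7111
2 (transformed)    0.9032       0.8936       0.9130  0.8065  152.58    0.6736
3 (stepwise)       0.8978       0.8936       0.9022  0.7957  147.89    0.6651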
Although there is little difference in the performance of the three models, the transformed model performs slightly better on the held-out set in sensitivity, Kappa, and balanced accuracy while matching the full model's accuracy. The second model will therefore be used to predict on the evaluation data set.
# Read the evaluation data
df <- read.csv("https://raw.githubusercontent.com/L-Velasco/DATA621_FA18/master/HW3/crime-evaluation-data.csv", stringsAsFactors = FALSE)
# Drop chas to match the transformed training data
eval_df <- dplyr::select(df, -chas)
# Apply the same outlier-capping rule to the evaluation predictors
for (val in 1:12) {
  qnt <- quantile(eval_df[, val], probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(eval_df[, val], probs = c(.05, .95), na.rm = TRUE)
  H <- 1.5 * IQR(eval_df[, val], na.rm = TRUE)
  eval_df[, val][eval_df[, val] < (qnt[1] - H)] <- caps[1]
  eval_df[, val][eval_df[, val] > (qnt[2] + H)] <- caps[2]
}
eval_prob <- predict(transformed.model, eval_df, type = "response")
eval_pclass <- ifelse(eval_prob >= 0.5, 1, 0)
eval_df$target <- eval_pclass
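A quick look at how the predicted classes split, as a sanity check before exporting:
# Distribution of predicted classes on the evaluation set
table(eval_df$target)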
# Export
write.csv(eval_df,file="HW3_Neighborhoods Crime.csv")