library(ggplot2)
require(gridExtra)
library(car)
library(factoextra)
library(dplyr)
library(DT)
library(knitr)
df <- read.csv("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Classification%20Project/crime-training-data_modified.csv")
datatable(df)
First, we take a look at a summary of the data. A few items of interest stand out: zn is zero for at least half of the observations, rad and tax span wide ranges, and the target variable is roughly balanced (mean of about 0.49).
summary(df)
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
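The summary shows no NA counts for any column; a quick missing-value check (a minimal sketch) makes that explicit before modeling:
colSums(is.na(df))   # missing values per column; all zeros indicates a complete dataset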
Next, we create boxplots of each feature, color coded by the target variable. These boxplots reveal useful information about the predictor variables; for example, the chas dummy variable has most of its values at 0.
grid.arrange(ggplot(df, aes(x = as.factor(target), y = zn, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = indus, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = chas, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = nox, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = rm, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = age, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = dis, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = rad, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = tax, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = ptratio, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = lstat, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ggplot(df, aes(x = as.factor(target), y = medv, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
ncol=4)
PCA can be used for classification, but for now, it will be used to visualize the clusters. First, the number of components will be selected based on the variances explained by each component.
Taking a look at the plot of percentages explained by each principal component, it seems like most of the variance can be explained by 2 principal components.
df.pca <- prcomp(df[1:12], center = TRUE, scale. = TRUE)
fviz_eig(df.pca)
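To quantify this, the proportion of variance explained by each component can be read from the prcomp summary (a quick sketch):
summary(df.pca)$importance   # standard deviation, proportion, and cumulative proportion of variance per component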
Using these two principal components, a scatterplot of the clusters can be created. Having two principal components makes it easier to distinguish between the two clusters, though there is some overlap.
fviz_pca_ind(df.pca,
             col.ind = as.factor(df$target),  # color points by the target class
             palette = c("#00AFBB", "#FC4E07"),
             addEllipses = TRUE,
             legend.title = "Target",
             label = "none")                  # suppress individual point labels
Since the dataset does not have any missing values and there are no outliers that particularly stand out, data preparation will be limited. However, we will locate and address any influential outliers using Cook's distance. Observations with a Cook's distance above the commonly used threshold of 4 / (N - k - 1), where N is the number of observations and k is the number of predictor variables, will be removed.
mod <- lm(target ~ ., data = df)
cooksd <- cooks.distance(mod)
plot(cooksd, pch = "*", cex = 2, main = "Influential Outliers by Cook's Distance")
abline(h = 4 / (nrow(df) - ncol(df) - 1), col = "red")  # add cutoff line
text(x = 1:length(cooksd) + 1, y = cooksd, labels = ifelse(cooksd > 4 / (nrow(df) - ncol(df) - 1), names(cooksd), ""), col = "red")  # label only the points above the cutoff
We remove the influential outliers. Removing these observations also makes the two principal components (visualized in the previous step) explain more of the variance in the data.
influential <- as.numeric(names(cooksd)[(cooksd > 4 / (nrow(df) - ncol(df) - 1))])
df <- df[-influential, ]
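To see how much data was dropped, and to check the claim about the variance explained, the PCA can be re-run on the cleaned data (a short sketch reusing the earlier calls):
length(influential)   # number of influential observations removed
nrow(df)              # observations remaining
df.pca <- prcomp(df[1:12], center = TRUE, scale. = TRUE)   # re-fit PCA without the outliers
fviz_eig(df.pca)      # scree plot after outlier removal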
We will build a logistic classifier using a generalized linear model with the binomial family.
Let's evaluate the distribution of the target class label and check whether the dataset is imbalanced.
table(df$target)
##
## 0 1
## 224 213
We see that labels 0 and 1 are balanced, with nearly equal numbers of observations.
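Expressed as proportions of the remaining observations (a quick sketch):
prop.table(table(df$target))   # share of each class; values near 0.5 indicate balance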
Now let's split the data set into 80% training data and 20% test data, and build a logistic classifier on the training set.
library(caTools)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(caret)
## Loading required package: lattice
set.seed(123)
split = sample.split(df$target, SplitRatio = 0.8)
training_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)
log_classifer <- glm(target ~ ., data = training_set, family = "binomial")
summary(log_classifer)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4671 -0.0494 -0.0004 0.0004 3.5381
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -44.742177 10.761584 -4.158 3.22e-05 ***
## zn -0.014315 0.045798 -0.313 0.75461
## indus 0.237724 0.159054 1.495 0.13502
## chas -2.872278 3.055233 -0.940 0.34716
## nox 51.881443 12.562054 4.130 3.63e-05 ***
## rm -0.301124 1.120871 -0.269 0.78820
## age 0.072289 0.023377 3.092 0.00199 **
## dis 0.469036 0.320498 1.463 0.14334
## rad 0.971234 0.296170 3.279 0.00104 **
## tax -0.023633 0.008772 -2.694 0.00706 **
## ptratio 0.541119 0.211931 2.553 0.01067 *
## lstat -0.030661 0.094303 -0.325 0.74508
## medv 0.112647 0.100988 1.115 0.26466
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.585 on 348 degrees of freedom
## Residual deviance: 96.733 on 336 degrees of freedom
## AIC: 122.73
##
## Number of Fisher Scoring iterations: 10
The summary above shows that only a few of the independent variables are significant at a p-value below 0.05. The significant predictors are nox, age, rad, tax, and ptratio.
We will build another logistic classifier using only these significant variables and compare its results with the all-variable classifier.
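As a sanity check, the significant predictors can be pulled straight out of the fitted model object (a small sketch using the coefficient table from summary(log_classifer)):
coefs <- summary(log_classifer)$coefficients   # coefficient table from the full model
sig <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
setdiff(sig, "(Intercept)")                    # significant terms, excluding the intercept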
Let's use the original logistic classifier to predict on the test dataset. We will also plot an ROC curve and find the best threshold for the classifier.
result = predict(log_classifer, newdata = test_set, type = "response")
test_set$scored.probability1 <- result
plot.roc(test_set$target, test_set$scored.probability1,
         main = "Optimal threshold for Classifier-1", percent = TRUE,
         of = "thresholds", thresholds = "best",  # focus on the best threshold
         print.auc = TRUE,          # print the AUC on the plot
         print.thres = "best")      # mark the best threshold on the curve
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
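The best threshold marked on the plot can also be computed directly with pROC's coords() (a sketch; the default selection criterion is Youden's index):
roc1 <- roc(test_set$target, test_set$scored.probability1)               # ROC object for classifier 1
coords(roc1, "best", ret = c("threshold", "sensitivity", "specificity")) # best threshold with its sensitivity and specificity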
From the ROC curve above, the optimal threshold is 0.5. Let's use this threshold, print the confusion matrix, and evaluate performance metrics such as accuracy, precision, and sensitivity.
test_set$prediction1 <- ifelse(test_set$scored.probability1 > 0.5, 1, 0)
confusionMatrix(as.factor(test_set$prediction1),as.factor( test_set$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 41 0
## 1 4 43
##
## Accuracy : 0.9545
## 95% CI : (0.8877, 0.9875)
## No Information Rate : 0.5114
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9092
##
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 0.9111
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9149
## Prevalence : 0.5114
## Detection Rate : 0.4659
## Detection Prevalence : 0.4659
## Balanced Accuracy : 0.9556
##
## 'Positive' Class : 0
##
This classifier gives an accuracy of 95.45%, a sensitivity of 0.9111, and a specificity of 1.0000 on the test set (with class 0 treated as the positive class).
Let's build a classifier using only the significant independent variables.
log_classifer1 <- glm(target ~ nox + age + rad + tax + ptratio, data = training_set, family = "binomial")
summary(log_classifer1)
##
## Call:
## glm(formula = target ~ nox + age + rad + tax + ptratio, family = "binomial",
## data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0987 -0.0381 -0.0005 0.0003 3.4890
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.362585 7.845611 -5.145 2.68e-07 ***
## nox 57.952112 11.741424 4.936 7.99e-07 ***
## age 0.064163 0.017117 3.749 0.000178 ***
## rad 0.979399 0.256121 3.824 0.000131 ***
## tax -0.023954 0.006699 -3.576 0.000349 ***
## ptratio 0.381221 0.143148 2.663 0.007742 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.58 on 348 degrees of freedom
## Residual deviance: 104.07 on 343 degrees of freedom
## AIC: 116.07
##
## Number of Fisher Scoring iterations: 9
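Before interpreting the new fit, the two models can be compared side by side; a quick sketch using AIC and a likelihood-ratio test (the models are nested, so the chi-squared test applies):
AIC(log_classifer1, log_classifer)                    # smaller AIC is better
anova(log_classifer1, log_classifer, test = "Chisq")  # likelihood-ratio test of the reduced vs. full model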
The AIC value is reduced, but the residual deviance increased slightly compared to the previous model. Let's predict on the test set and evaluate the performance metrics of the new classifier.
result = predict(log_classifer1, newdata = test_set, type = "response")
test_set$scored.probability2 <- result
plot.roc(test_set$target, test_set$scored.probability2,
         main = "Optimal threshold for Classifier-2", percent = TRUE,
         of = "thresholds", thresholds = "best",  # focus on the best threshold
         print.auc = TRUE,          # print the AUC on the plot
         print.thres = "best")      # mark the best threshold on the curve
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
The optimal threshold of the new model is 0.4. We will predict the target label and print the confusion matrix.
test_set$prediction2 <- ifelse(test_set$scored.probability2 > 0.4, 1, 0)
confusionMatrix(as.factor(test_set$prediction2), as.factor(test_set$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 40 0
## 1 5 43
##
## Accuracy : 0.9432
## 95% CI : (0.8724, 0.9813)
## No Information Rate : 0.5114
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8866
##
## Mcnemar's Test P-Value : 0.07364
##
## Sensitivity : 0.8889
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8958
## Prevalence : 0.5114
## Detection Rate : 0.4545
## Detection Prevalence : 0.4545
## Balanced Accuracy : 0.9444
##
## 'Positive' Class : 0
##
This model gives an accuracy of 94.32%, a sensitivity of 0.8889, and a specificity of 1.0000 on the test set.
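For a direct comparison of the two classifiers' AUC on the test set, pROC can be used (a sketch; roc() prints the same level/direction messages as before):
auc(roc(test_set$target, test_set$scored.probability1))   # AUC of the full-variable classifier
auc(roc(test_set$target, test_set$scored.probability2))   # AUC of the reduced classifier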
Based on performance metrics such as AUC, accuracy, precision, and sensitivity, the first classifier gives better results. We will use the first classifier with a threshold of 0.5 to predict on the evaluation data.
df_eval <- read.csv("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Classification%20Project/crime-evaluation-data_modified.csv")
result = predict(log_classifer, newdata = df_eval, type = "response")
df_eval$scored.probability <- result
df_eval$prediction <- ifelse(result > 0.5, 1, 0)
kable(df_eval)
| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | scored.probability | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 | 0.0109259 | 0 |
| 0 | 8.14 | 0 | 0.538 | 6.096 | 84.5 | 4.4619 | 4 | 307 | 21.0 | 10.26 | 18.2 | 0.7683939 | 1 |
| 0 | 8.14 | 0 | 0.538 | 6.495 | 94.4 | 4.4547 | 4 | 307 | 21.0 | 12.80 | 18.4 | 0.8501849 | 1 |
| 0 | 8.14 | 0 | 0.538 | 5.950 | 82.0 | 3.9900 | 4 | 307 | 21.0 | 27.71 | 13.2 | 0.4360758 | 0 |
| 0 | 5.96 | 0 | 0.499 | 5.850 | 41.5 | 3.9342 | 5 | 279 | 19.2 | 8.77 | 21.0 | 0.0264893 | 0 |
| 25 | 5.13 | 0 | 0.453 | 5.741 | 66.2 | 7.2254 | 8 | 284 | 19.7 | 13.15 | 18.7 | 0.3749295 | 0 |
| 25 | 5.13 | 0 | 0.453 | 5.966 | 93.4 | 6.8185 | 8 | 284 | 19.7 | 14.44 | 16.0 | 0.7011616 | 1 |
| 0 | 4.49 | 0 | 0.449 | 6.630 | 56.1 | 4.4377 | 3 | 247 | 18.5 | 6.53 | 26.6 | 0.0017325 | 0 |
| 0 | 4.49 | 0 | 0.449 | 6.121 | 56.8 | 3.7476 | 3 | 247 | 18.5 | 8.44 | 22.2 | 0.0008837 | 0 |
| 0 | 2.89 | 0 | 0.445 | 6.163 | 69.6 | 3.4952 | 2 | 276 | 18.0 | 11.34 | 21.4 | 0.0001323 | 0 |
| 0 | 25.65 | 0 | 0.581 | 5.856 | 97.0 | 1.9444 | 2 | 188 | 19.1 | 25.41 | 17.3 | 0.9987252 | 1 |
| 0 | 25.65 | 0 | 0.581 | 5.613 | 95.6 | 1.7572 | 2 | 188 | 19.1 | 27.26 | 15.7 | 0.9981869 | 1 |
| 0 | 21.89 | 0 | 0.624 | 5.637 | 94.7 | 1.9799 | 4 | 437 | 21.2 | 18.34 | 14.3 | 0.9932404 | 1 |
| 0 | 19.58 | 0 | 0.605 | 6.101 | 93.0 | 2.2834 | 5 | 403 | 14.7 | 9.81 | 25.0 | 0.9551789 | 1 |
| 0 | 19.58 | 0 | 0.605 | 5.880 | 97.3 | 2.3887 | 5 | 403 | 14.7 | 12.03 | 19.1 | 0.9401003 | 1 |
| 0 | 10.59 | 1 | 0.489 | 5.960 | 92.1 | 3.8771 | 4 | 277 | 18.6 | 17.27 | 21.7 | 0.0234977 | 0 |
| 0 | 6.20 | 0 | 0.504 | 6.552 | 21.4 | 3.3751 | 8 | 307 | 17.4 | 3.76 | 31.5 | 0.0691393 | 0 |
| 0 | 6.20 | 0 | 0.507 | 8.247 | 70.4 | 3.6519 | 8 | 307 | 17.4 | 3.95 | 48.3 | 0.9311122 | 1 |
| 22 | 5.86 | 0 | 0.431 | 6.957 | 6.8 | 8.9067 | 7 | 330 | 19.1 | 3.53 | 29.6 | 0.0020914 | 0 |
| 90 | 2.97 | 0 | 0.400 | 7.088 | 20.8 | 7.3073 | 1 | 285 | 15.3 | 7.85 | 32.2 | 0.0000001 | 0 |
| 80 | 1.76 | 0 | 0.385 | 6.230 | 31.5 | 9.0892 | 1 | 241 | 18.2 | 12.93 | 20.1 | 0.0000010 | 0 |
| 33 | 2.18 | 0 | 0.472 | 6.616 | 58.1 | 3.3700 | 7 | 222 | 18.4 | 8.93 | 28.4 | 0.1206934 | 0 |
| 0 | 9.90 | 0 | 0.544 | 6.122 | 52.8 | 2.6403 | 4 | 304 | 18.4 | 5.98 | 22.1 | 0.1202251 | 0 |
| 0 | 7.38 | 0 | 0.493 | 6.415 | 40.1 | 4.7211 | 5 | 287 | 19.6 | 6.12 | 25.0 | 0.0511290 | 0 |
| 0 | 7.38 | 0 | 0.493 | 6.312 | 28.9 | 5.4159 | 5 | 287 | 19.6 | 6.15 | 23.0 | 0.0266003 | 0 |
| 0 | 5.19 | 0 | 0.515 | 5.895 | 59.6 | 5.6150 | 5 | 224 | 20.2 | 10.56 | 18.5 | 0.6526216 | 1 |
| 80 | 2.01 | 0 | 0.435 | 6.635 | 29.7 | 8.3440 | 4 | 280 | 17.0 | 5.99 | 24.5 | 0.0000591 | 0 |
| 0 | 18.10 | 0 | 0.718 | 3.561 | 87.9 | 1.6132 | 24 | 666 | 20.2 | 7.12 | 27.5 | 1.0000000 | 1 |
| 0 | 18.10 | 1 | 0.631 | 7.016 | 97.5 | 1.2024 | 24 | 666 | 20.2 | 2.96 | 50.0 | 1.0000000 | 1 |
| 0 | 18.10 | 0 | 0.584 | 6.348 | 86.1 | 2.0527 | 24 | 666 | 20.2 | 17.64 | 14.5 | 0.9999996 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.935 | 87.9 | 1.8206 | 24 | 666 | 20.2 | 34.02 | 8.4 | 1.0000000 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.627 | 93.9 | 1.8172 | 24 | 666 | 20.2 | 22.88 | 12.8 | 1.0000000 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.818 | 92.4 | 1.8662 | 24 | 666 | 20.2 | 22.11 | 10.5 | 1.0000000 | 1 |
| 0 | 18.10 | 0 | 0.740 | 6.219 | 100.0 | 2.0048 | 24 | 666 | 20.2 | 16.59 | 18.4 | 1.0000000 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.854 | 96.6 | 1.8956 | 24 | 666 | 20.2 | 23.79 | 10.8 | 1.0000000 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.525 | 86.5 | 2.4358 | 24 | 666 | 20.2 | 18.13 | 14.1 | 1.0000000 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.376 | 88.4 | 2.5671 | 24 | 666 | 20.2 | 14.65 | 17.7 | 1.0000000 | 1 |
| 0 | 18.10 | 0 | 0.655 | 6.209 | 65.4 | 2.9634 | 24 | 666 | 20.2 | 13.22 | 21.4 | 1.0000000 | 1 |
| 0 | 9.69 | 0 | 0.585 | 5.794 | 70.6 | 2.8927 | 6 | 391 | 19.2 | 14.10 | 18.3 | 0.7743514 | 1 |
| 0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 5.64 | 23.9 | 0.8479228 | 1 |
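If the scored evaluation set needs to be saved for later use, a one-line export is enough; the file name here is just an example:
write.csv(df_eval, "crime-evaluation-predictions.csv", row.names = FALSE)   # save scored probabilities and predictions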