library(ggplot2)
require(gridExtra)
library(car)
library(factoextra)
library(dplyr)
library(DT)
library(knitr)

Data Exploration

df <- read.csv("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Classification%20Project/crime-training-data_modified.csv")
datatable(df)

Summary

First, we take a look at a summary of the data. A few items of interest stand out:

  • There are no missing values in the dataset (verified with a quick check after the summary output below)
  • There are no immediately apparent outliers
  • The expected clusters are of similar size (237 and 229 observations), which suits algorithms such as K-Means clustering that tend to favor similarly sized clusters
summary(df)
##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
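
As a quick check of the first observation above, the absence of missing values can be verified directly; every column should return a count of zero.

colSums(is.na(df))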

Boxplots

Next, we create boxplots of each feature, color coded by the target variable. These boxplots reveal useful information about the predictor variables:

  • The chas dummy variable has most of its values at 0
  • indus, zn, nox, age, dis, rad, tax, ptratio, lstat, and medv appear to have strong effects on the target variable
grid.arrange(ggplot(df, aes(x = as.factor(target), y = zn, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = indus, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = chas, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = nox, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = rm, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = age, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = dis, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = rad, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = tax, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = ptratio, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = lstat, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ggplot(df, aes(x = as.factor(target), y = medv, fill = as.factor(target))) + geom_boxplot() + theme(legend.position = "none") ,
             ncol=4)

PCA Component Visualization

PCA can be used for classification, but for now it will be used to visualize the clusters. First, the number of components is selected based on the variance explained by each component.

Looking at the plot of the percentage of variance explained by each principal component, most of the variance can be captured by the first two components.

df.pca <- prcomp(df[1:12], center = TRUE, scale. = TRUE)
fviz_eig(df.pca)
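
As a cross-check of the scree plot, the proportion and cumulative proportion of variance explained by each component can be printed directly from the prcomp object (a minimal sketch; the exact percentages depend on the data):

# Proportion and cumulative proportion of variance explained by each PC
var_explained <- df.pca$sdev^2 / sum(df.pca$sdev^2)
data.frame(PC = seq_along(var_explained),
           proportion = round(var_explained, 3),
           cumulative = round(cumsum(var_explained), 3))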

Using these two principal components, a scatterplot of the clusters can be created. Having two principal components makes it easier to distinguish between the two clusters, though there is some overlap.

fviz_pca_ind(df.pca,
             col.ind = as.factor(df$target), # color points by the target class
             palette = c("#00AFBB",  "#FC4E07"),
             addEllipses = TRUE, 
             legend.title = "Target",
             labels = FALSE
             )

Data Preparation

Since the dataset does not have any missing values and there are no outliers that particularly stand out, data preparation will be limited. However, we will locate and address any influential outliers using Cook's distance. Observations with a Cook's distance above the threshold of 4 / (N - k - 1), where N is the number of observations and k is the number of predictor variables, will be removed.

Cook's Distance

mod <- lm(target ~ ., data=df)
cooksd <- cooks.distance(mod)
plot(cooksd, pch="*", cex=2, main="Influential Outliers by Cook's Distance")
abline(h = 4 / (nrow(df) - ncol(df) - 1), col="red")  # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd > 4 / (nrow(df) - ncol(df) - 1), names(cooksd), ""), col="red")  # label points above the cutoff

We remove the influential outliers. Removing them also lets the first two principal components (visualized in the previous step) explain more of the variance in the data.

influential <- as.numeric(names(cooksd)[(cooksd > 4 / (nrow(df) - ncol(df) - 1))])
df <- df[-influential, ]
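
To verify this, PCA can be re-fit on the filtered data and the variance explained by the first two components compared with the earlier fit (a quick sketch; the exact numbers depend on how many rows were dropped):

# Number of observations flagged and removed
length(influential)

# Re-fit PCA on the filtered data and inspect the first two components
df.pca2 <- prcomp(df[1:12], center = TRUE, scale. = TRUE)
summary(df.pca2)$importance[, 1:2]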

Building a Logistic Regression Classifier

We will build a logistic classifier using a generalized linear model with a binomial family.

Let's evaluate the distribution of the target class labels to check whether the dataset is imbalanced.

table(df$target)
## 
##   0   1 
## 224 213

We see that both label 0 and label 1 are balanced, with nearly equal numbers of data points.

Now let's split the dataset into 80% training data and 20% test data, and build a logistic classifier on the training set.

library(caTools)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(caret)
## Loading required package: lattice
set.seed(123)
split = sample.split(df$target, SplitRatio = 0.8)
training_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)
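
Because sample.split stratifies on the target, the class ratio should be preserved in both partitions; a quick check (assuming the split above):

prop.table(table(training_set$target))
prop.table(table(test_set$target))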

log_classifer <- glm(target ~ ., data = training_set, family = "binomial")
summary(log_classifer)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4671  -0.0494  -0.0004   0.0004   3.5381  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -44.742177  10.761584  -4.158 3.22e-05 ***
## zn           -0.014315   0.045798  -0.313  0.75461    
## indus         0.237724   0.159054   1.495  0.13502    
## chas         -2.872278   3.055233  -0.940  0.34716    
## nox          51.881443  12.562054   4.130 3.63e-05 ***
## rm           -0.301124   1.120871  -0.269  0.78820    
## age           0.072289   0.023377   3.092  0.00199 ** 
## dis           0.469036   0.320498   1.463  0.14334    
## rad           0.971234   0.296170   3.279  0.00104 ** 
## tax          -0.023633   0.008772  -2.694  0.00706 ** 
## ptratio       0.541119   0.211931   2.553  0.01067 *  
## lstat        -0.030661   0.094303  -0.325  0.74508    
## medv          0.112647   0.100988   1.115  0.26466    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 483.585  on 348  degrees of freedom
## Residual deviance:  96.733  on 336  degrees of freedom
## AIC: 122.73
## 
## Number of Fisher Scoring iterations: 10

The summary above shows that only a few independent variables are significant, with p-values below 0.05. The significant independent variables are listed below (and extracted programmatically in the sketch after the list):

  • nox
  • age
  • rad
  • tax
  • ptratio
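
The same set can be pulled directly from the model summary rather than read off by eye (a small sketch; it should reproduce the bullets above):

# Predictors with p-values below 0.05, excluding the intercept
coefs <- summary(log_classifer)$coefficients
rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05 & rownames(coefs) != "(Intercept)"]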

We will build another logistic classifier with only these significant variables and compare its results with the classifier that uses all variables.

Let's use the original logistic classifier to predict on the test dataset. We will also plot an ROC curve and determine the best threshold for the classifier.

result = predict(log_classifer, newdata = test_set, type = "response")
test_set$scored.probability1 <- result

plot.roc(test_set$target, test_set$scored.probability1,
         main = "Optimal threshold for Classifier-1", percent = TRUE,
         of = "thresholds",    # annotate at the thresholds
         thresholds = "best",  # select the best threshold
         print.auc = TRUE,
         print.thres = "best")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
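
The optimal threshold can also be extracted numerically rather than read off the plot, using pROC's coords() (a small sketch using the same probabilities as above):

roc1 <- roc(test_set$target, test_set$scored.probability1)
coords(roc1, "best", ret = c("threshold", "sensitivity", "specificity"))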

From the ROC curve above, the optimal threshold is 0.5. Let's use this threshold, print the confusion matrix, and evaluate performance metrics such as accuracy, precision, and sensitivity.

test_set$prediction1 <- ifelse(test_set$scored.probability1 > 0.5, 1, 0)
confusionMatrix(as.factor(test_set$prediction1),as.factor( test_set$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 41  0
##          1  4 43
##                                           
##                Accuracy : 0.9545          
##                  95% CI : (0.8877, 0.9875)
##     No Information Rate : 0.5114          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9092          
##                                           
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 0.9111          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9149          
##              Prevalence : 0.5114          
##          Detection Rate : 0.4659          
##    Detection Prevalence : 0.4659          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : 0               
## 

This classifier gives the following results (recomputed by hand in the sketch after the list):

  • 95% accuracy
  • 100% precision, with no Type-1 errors (false positives)
  • 91% sensitivity, with a small number of Type-2 errors (false negatives)
  • 98.6% AUC (area under the curve)
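
These values can be recomputed by hand from the confusion matrix; note that caret treats class 0 as the positive class here (a quick sketch):

# Recompute accuracy, precision, and sensitivity from the confusion matrix
cm <- table(Predicted = test_set$prediction1, Actual = test_set$target)
TP <- cm["0", "0"]; FP <- cm["0", "1"]; FN <- cm["1", "0"]; TN <- cm["1", "1"]
c(accuracy    = (TP + TN) / sum(cm),
  precision   = TP / (TP + FP),   # Pos Pred Value
  sensitivity = TP / (TP + FN))   # recall for the positive class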

Let's build a classifier with only the significant independent variables.

log_classifer1 <- glm(target ~ nox + age + rad + tax + ptratio, data = training_set, family = "binomial")
summary(log_classifer1)
## 
## Call:
## glm(formula = target ~ nox + age + rad + tax + ptratio, family = "binomial", 
##     data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0987  -0.0381  -0.0005   0.0003   3.4890  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.362585   7.845611  -5.145 2.68e-07 ***
## nox          57.952112  11.741424   4.936 7.99e-07 ***
## age           0.064163   0.017117   3.749 0.000178 ***
## rad           0.979399   0.256121   3.824 0.000131 ***
## tax          -0.023954   0.006699  -3.576 0.000349 ***
## ptratio       0.381221   0.143148   2.663 0.007742 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 483.58  on 348  degrees of freedom
## Residual deviance: 104.07  on 343  degrees of freedom
## AIC: 116.07
## 
## Number of Fisher Scoring iterations: 9
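
The two fits can also be compared side by side with AIC() and a likelihood-ratio test, since the reduced model is nested within the full model (a small sketch):

AIC(log_classifer, log_classifer1)
anova(log_classifer1, log_classifer, test = "Chisq")  # likelihood-ratio test of the dropped terms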

The AIC is lower, but the residual deviance increased slightly compared to the previous model. Let's predict on the test set and evaluate the performance metrics of the new classifier.

result = predict(log_classifer1, newdata = test_set, type = "response")
test_set$scored.probability2 <- result

plot.roc(test_set$target, test_set$scored.probability2,
         main = "Optimal threshold for Classifier-2", percent = TRUE,
         of = "thresholds",    # annotate at the thresholds
         thresholds = "best",  # select the best threshold
         print.auc = TRUE,
         print.thres = "best")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

The optimal threshold for the new model is 0.4. We will predict the target labels and print the confusion matrix.

test_set$prediction2 <- ifelse(test_set$scored.probability2 > 0.4, 1, 0)
confusionMatrix(as.factor(test_set$prediction2), as.factor(test_set$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40  0
##          1  5 43
##                                           
##                Accuracy : 0.9432          
##                  95% CI : (0.8724, 0.9813)
##     No Information Rate : 0.5114          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8866          
##                                           
##  Mcnemar's Test P-Value : 0.07364         
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.8958          
##              Prevalence : 0.5114          
##          Detection Rate : 0.4545          
##    Detection Prevalence : 0.4545          
##       Balanced Accuracy : 0.9444          
##                                           
##        'Positive' Class : 0               
## 

This model gives:

  • 94% accuracy
  • 100% precision
  • 89% sensitivity
  • 98.1% AUC (area under the curve)

Selection of the Best Model

Based on performance metrics such as AUC, accuracy, precision, and sensitivity, the first classifier gives better results. We will use the first classifier with a threshold of 0.5 to predict on the evaluation data.

df_eval <- read.csv("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Classification%20Project/crime-evaluation-data_modified.csv")

result = predict(log_classifer, newdata = df_eval, type = "response")
df_eval$scored.probability <- result
df_eval$prediction <- ifelse(result > 0.5, 1, 0)

kable(df_eval)
zn indus chas nox rm age dis rad tax ptratio lstat medv scored.probability prediction
0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7 0.0109259 0
0 8.14 0 0.538 6.096 84.5 4.4619 4 307 21.0 10.26 18.2 0.7683939 1
0 8.14 0 0.538 6.495 94.4 4.4547 4 307 21.0 12.80 18.4 0.8501849 1
0 8.14 0 0.538 5.950 82.0 3.9900 4 307 21.0 27.71 13.2 0.4360758 0
0 5.96 0 0.499 5.850 41.5 3.9342 5 279 19.2 8.77 21.0 0.0264893 0
25 5.13 0 0.453 5.741 66.2 7.2254 8 284 19.7 13.15 18.7 0.3749295 0
25 5.13 0 0.453 5.966 93.4 6.8185 8 284 19.7 14.44 16.0 0.7011616 1
0 4.49 0 0.449 6.630 56.1 4.4377 3 247 18.5 6.53 26.6 0.0017325 0
0 4.49 0 0.449 6.121 56.8 3.7476 3 247 18.5 8.44 22.2 0.0008837 0
0 2.89 0 0.445 6.163 69.6 3.4952 2 276 18.0 11.34 21.4 0.0001323 0
0 25.65 0 0.581 5.856 97.0 1.9444 2 188 19.1 25.41 17.3 0.9987252 1
0 25.65 0 0.581 5.613 95.6 1.7572 2 188 19.1 27.26 15.7 0.9981869 1
0 21.89 0 0.624 5.637 94.7 1.9799 4 437 21.2 18.34 14.3 0.9932404 1
0 19.58 0 0.605 6.101 93.0 2.2834 5 403 14.7 9.81 25.0 0.9551789 1
0 19.58 0 0.605 5.880 97.3 2.3887 5 403 14.7 12.03 19.1 0.9401003 1
0 10.59 1 0.489 5.960 92.1 3.8771 4 277 18.6 17.27 21.7 0.0234977 0
0 6.20 0 0.504 6.552 21.4 3.3751 8 307 17.4 3.76 31.5 0.0691393 0
0 6.20 0 0.507 8.247 70.4 3.6519 8 307 17.4 3.95 48.3 0.9311122 1
22 5.86 0 0.431 6.957 6.8 8.9067 7 330 19.1 3.53 29.6 0.0020914 0
90 2.97 0 0.400 7.088 20.8 7.3073 1 285 15.3 7.85 32.2 0.0000001 0
80 1.76 0 0.385 6.230 31.5 9.0892 1 241 18.2 12.93 20.1 0.0000010 0
33 2.18 0 0.472 6.616 58.1 3.3700 7 222 18.4 8.93 28.4 0.1206934 0
0 9.90 0 0.544 6.122 52.8 2.6403 4 304 18.4 5.98 22.1 0.1202251 0
0 7.38 0 0.493 6.415 40.1 4.7211 5 287 19.6 6.12 25.0 0.0511290 0
0 7.38 0 0.493 6.312 28.9 5.4159 5 287 19.6 6.15 23.0 0.0266003 0
0 5.19 0 0.515 5.895 59.6 5.6150 5 224 20.2 10.56 18.5 0.6526216 1
80 2.01 0 0.435 6.635 29.7 8.3440 4 280 17.0 5.99 24.5 0.0000591 0
0 18.10 0 0.718 3.561 87.9 1.6132 24 666 20.2 7.12 27.5 1.0000000 1
0 18.10 1 0.631 7.016 97.5 1.2024 24 666 20.2 2.96 50.0 1.0000000 1
0 18.10 0 0.584 6.348 86.1 2.0527 24 666 20.2 17.64 14.5 0.9999996 1
0 18.10 0 0.740 5.935 87.9 1.8206 24 666 20.2 34.02 8.4 1.0000000 1
0 18.10 0 0.740 5.627 93.9 1.8172 24 666 20.2 22.88 12.8 1.0000000 1
0 18.10 0 0.740 5.818 92.4 1.8662 24 666 20.2 22.11 10.5 1.0000000 1
0 18.10 0 0.740 6.219 100.0 2.0048 24 666 20.2 16.59 18.4 1.0000000 1
0 18.10 0 0.740 5.854 96.6 1.8956 24 666 20.2 23.79 10.8 1.0000000 1
0 18.10 0 0.713 6.525 86.5 2.4358 24 666 20.2 18.13 14.1 1.0000000 1
0 18.10 0 0.713 6.376 88.4 2.5671 24 666 20.2 14.65 17.7 1.0000000 1
0 18.10 0 0.655 6.209 65.4 2.9634 24 666 20.2 13.22 21.4 1.0000000 1
0 9.69 0 0.585 5.794 70.6 2.8927 6 391 19.2 14.10 18.3 0.7743514 1
0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 5.64 23.9 0.8479228 1