Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(MASS)
library(corrplot)
library(RColorBrewer)
library(GGally)
library(ggResidpanel)
library(psych)
library(mice)
library(reshape2)
library(cowplot)
library(car)
library(caTools)
library(VIM)
library(broom)
library(pROC)
library(caret)
library(geoR)
library(moments)
library(glmulti)

Introduction

In this homework assignment, we will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0). Our training data comprises 466 observations and 13 variables.

Objective

Our objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. We will provide classifications and probabilities for the evaluation data set using our binary logistic regression model. We can only use the variables given to us (or variables derived from the variables provided). The sections below give a short overview of the data set.

Data Load

We loaded the training and evaluation data sets into their respective data frames.

train_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_3/DataFiles/crime-training-data_modified.csv")

eval_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_3/DataFiles/crime-evaluation-data_modified.csv")

Training Data

Sample snapshot of training data frame -

head(train_df, 20) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
zn indus chas nox rm age dis rad tax ptratio lstat medv target
0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 3.70 50.0 1
0 19.58 1 0.871 5.403 100.0 1.3216 5 403 14.7 26.82 13.4 1
0 18.10 0 0.740 6.485 100.0 1.9784 24 666 20.2 18.85 15.4 1
30 4.93 0 0.428 6.393 7.8 7.0355 6 300 16.6 5.19 23.7 0
0 2.46 0 0.488 7.155 92.2 2.7006 3 193 17.8 4.82 37.9 0
0 8.56 0 0.520 6.781 71.3 2.8561 5 384 20.9 7.67 26.5 0
0 18.10 0 0.693 5.453 100.0 1.4896 24 666 20.2 30.59 5.0 1
0 18.10 0 0.693 4.519 100.0 1.6582 24 666 20.2 36.98 7.0 1
0 5.19 0 0.515 6.316 38.1 6.4584 5 224 20.2 5.68 22.2 0
80 3.64 0 0.392 5.876 19.1 9.2203 1 315 16.4 9.25 20.9 0
22 5.86 0 0.431 6.438 8.9 7.3967 7 330 19.1 3.59 24.8 0
0 12.83 0 0.437 6.286 45.0 4.5026 5 398 18.7 8.94 21.4 0
0 18.10 0 0.532 7.061 77.0 3.4106 24 666 20.2 7.01 25.0 1
22 5.86 0 0.431 8.259 8.4 8.9067 7 330 19.1 3.54 42.8 1
0 2.46 0 0.488 6.153 68.8 3.2797 3 193 17.8 13.15 29.6 0
0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7 0
100 1.32 0 0.411 6.816 40.5 8.3248 5 256 15.1 3.95 31.6 0
20 3.97 0 0.647 5.560 62.8 1.9865 5 264 13.0 10.45 22.8 1
0 18.10 0 0.679 5.896 95.4 1.9096 24 666 20.2 24.39 8.3 1
0 18.10 0 0.671 6.545 99.1 1.5192 24 666 20.2 21.08 10.9 1
str(train_df)
## 'data.frame':    466 obs. of  13 variables:
##  $ zn     : num  0 0 0 30 0 0 0 0 0 80 ...
##  $ indus  : num  19.58 19.58 18.1 4.93 2.46 ...
##  $ chas   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
##  $ rm     : num  7.93 5.4 6.49 6.39 7.16 ...
##  $ age    : num  96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
##  $ dis    : num  2.05 1.32 1.98 7.04 2.7 ...
##  $ rad    : int  5 5 24 6 3 5 24 24 5 1 ...
##  $ tax    : int  403 403 666 300 193 384 666 666 224 315 ...
##  $ ptratio: num  14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
##  $ lstat  : num  3.7 26.82 18.85 5.19 4.82 ...
##  $ medv   : num  50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
##  $ target : int  1 1 1 0 0 0 1 1 0 0 ...

PART I: Data Exploration

We start the data exploration process with a high-level descriptive statistical summary and an analysis of missing/exceptional values.

Descriptive Statistical Summary

Basic statistical summary of all features and the dependent variable (target).

stat_summary <- function(df){
  df %>%
    summary() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
    scroll_box(width="100%",height="400px")
}
stat_summary(train_df)
zn indus chas nox rm age dis rad tax ptratio lstat medv target
Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890 Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00 Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00 Min. :0.0000
1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02 1st Qu.:0.0000
Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380 Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00 Median :334.5 Median :18.9 Median :11.350 Median :21.20 Median :0.0000
Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543 Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53 Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59 Mean :0.4914
3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00 3rd Qu.:1.0000
Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00 Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00 Max. :1.0000

We also used the describe() function from the ‘psych’ package to summarize additional statistical measures such as standard deviation, skewness, kurtosis and standard error.

stat_desc <- function(df){
df %>% 
    describe() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
}

stat_desc(train_df)
vars n mean sd median trimmed mad min max range skew kurtosis se
zn 1 466 11.5772532 23.3646511 0.00000 5.3542781 0.0000000 0.0000 100.0000 100.0000 2.1768152 3.8135765 1.0823466
indus 2 466 11.1050215 6.8458549 9.69000 10.9082353 9.3403800 0.4600 27.7400 27.2800 0.2885450 -1.2432132 0.3171281
chas 3 466 0.0708155 0.2567920 0.00000 0.0000000 0.0000000 0.0000 1.0000 1.0000 3.3354899 9.1451313 0.0118957
nox 4 466 0.5543105 0.1166667 0.53800 0.5442684 0.1334340 0.3890 0.8710 0.4820 0.7463281 -0.0357736 0.0054045
rm 5 466 6.2906738 0.7048513 6.21000 6.2570615 0.5166861 3.8630 8.7800 4.9170 0.4793202 1.5424378 0.0326516
age 6 466 68.3675966 28.3213784 77.15000 70.9553476 30.0226500 2.9000 100.0000 97.1000 -0.5777075 -1.0098814 1.3119625
dis 7 466 3.7956929 2.1069496 3.19095 3.5443647 1.9144814 1.1296 12.1265 10.9969 0.9988926 0.4719679 0.0976026
rad 8 466 9.5300429 8.6859272 5.00000 8.6978610 1.4826000 1.0000 24.0000 23.0000 1.0102788 -0.8619110 0.4023678
tax 9 466 409.5021459 167.9000887 334.50000 401.5080214 104.5233000 187.0000 711.0000 524.0000 0.6593136 -1.1480456 7.7778214
ptratio 10 466 18.3984979 2.1968447 18.90000 18.5970588 1.9273800 12.6000 22.0000 9.4000 -0.7542681 -0.4003627 0.1017669
lstat 11 466 12.6314592 7.1018907 11.35000 11.8809626 7.0720020 1.7300 37.9700 36.2400 0.9055864 0.5033688 0.3289887
medv 12 466 22.5892704 9.2396814 21.20000 21.6304813 6.0045300 5.0000 50.0000 45.0000 1.0766920 1.3737825 0.4280200
target 13 466 0.4914163 0.5004636 0.00000 0.4893048 0.0000000 0.0000 1.0000 1.0000 0.0342293 -2.0031131 0.0231835

Missing Value Analysis

Below are the features that have NA values, along with their counts and percentages -

## Counts of missing data per feature
train_na_df <- data.frame(apply(train_df, 2, function(x) length(which(is.na(x)))))
train_na_df1 <- data.frame(apply(train_df, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

train_na_df <- cbind(Feature = rownames(train_na_df), train_na_df, train_na_df1)
colnames(train_na_df) <- c('Feature Name','No. of NA Records','Percentage of NA Records')
rownames(train_na_df) <- NULL


train_na_df %>% filter(`No. of NA Records` != 0) %>% arrange(desc(`No. of NA Records`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature Name No. of NA Records Percentage of NA Records

The table above is empty, so there are no missing records and we have a complete data set to work with.
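
Since the VIM package is already loaded, the absence of missing values can also be confirmed with a quick visual check. The sketch below is illustrative only and is not part of our modeling pipeline; the plotting arguments are common defaults rather than tuned choices.

# Optional confirmation: NA counts per column plus a VIM missingness plot
colSums(is.na(train_df))
aggr(train_df, numbers = TRUE, sortVars = TRUE, cex.axis = 0.7)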

Descriptive Statistical Plots

Box Plots

## Box plots:
gb1 <- ggplot(data = train_df, aes(y = target)) + geom_boxplot()
gb2 <- ggplot(data = train_df, aes(y = zn)) + geom_boxplot()
gb3 <- ggplot(data = train_df, aes(y = indus)) + geom_boxplot()
gb4 <- ggplot(data = train_df, aes(y = chas)) + geom_boxplot()
gb5 <- ggplot(data = train_df, aes(y = nox)) + geom_boxplot()
gb6 <- ggplot(data = train_df, aes(y = rm)) + geom_boxplot()
gb7 <- ggplot(data = train_df, aes(y = age)) + geom_boxplot()
gb8 <- ggplot(data = train_df, aes(y = dis)) + geom_boxplot()
gb9 <- ggplot(data = train_df, aes(y = rad)) + geom_boxplot()
gb10 <- ggplot(data = train_df, aes(y = tax)) + geom_boxplot()
gb11 <- ggplot(data = train_df, aes(y = ptratio)) + geom_boxplot()
gb12 <- ggplot(data = train_df, aes(y = lstat)) + geom_boxplot()
gb13 <- ggplot(data = train_df, aes(y = medv)) + geom_boxplot()


plot_grid(gb1, gb2, gb3, gb4, gb5, gb6, gb7, gb8, gb9, gb10,
          gb11, gb12, gb13, labels = "AUTO")

Density Plots

train_df %>%
  gather(variable, value, target:zn) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "dodgerblue4", color="dodgerblue4") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

Observations Summary

  • Our dependent variable target is a binary variable, as expected. There is another binary dummy variable, chas
  • Four of the predictors are expressed as percentages, on a 0 to 100 scale:
  • zn
  • indus
  • age
  • lstat
  • indus, rad and tax show bi-modal distributions
  • Some of the features, like dis, lstat, nox and zn, are right skewed (confirmed numerically in the sketch after this list)
  • age and ptratio are left skewed
  • The variable rad is described as an index of accessibility to radial highways. We assume this variable is an ordinal data type.
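
As a quick numeric confirmation of the skewness observations above, the skewness() function from the ‘moments’ package (already loaded) can be applied to every column; a minimal sketch:

# Skewness of each variable; positive values indicate right skew, negative values left skew
round(sapply(train_df, skewness), 2)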

PART II: Data Preparation

Correlation Plot

corrMatrix <- round(cor(train_df),4)

corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Based on the correlation plot above, there is a high degree of collinearity amongst independent variables such as rad, tax, indus, dis and nox.

Handling multicollinearity

indus and nox: correlation value of 0.76. This result makes sense, as we expect areas with dense industry concentration to have higher environmental pollutants such as NO2.

dis and nox: correlation value of -0.77. This result is consistent with our intuition: we expect areas close to employment centers to have higher concentrations of environmental pollutants, and areas farther away to have lower concentrations.

rad and tax: These two variables are strongly correlated (91%). Access to radial highways and property tax rates appear to move together. We are particularly concerned about the multicollinearity effects of these two variables. The fact that they are nearly perfectly correlated indicates that one of them can legitimately be removed from the data set, and we chose to remove rad from our model for further analysis.
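
As a quick numeric check before dropping rad, the pairwise correlation can be computed directly; a minimal sketch:

# Pairwise correlation between rad and tax (reported as roughly 0.91 above)
cor(train_df$rad, train_df$tax)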

train_df <- train_df %>% dplyr::select(-c(rad))
eval_df <- eval_df %>% dplyr::select(-c(rad))

Pair-wise Relationships

Let’s look at a scatter plot/correlation matrix to get a better feel for the relationships between pairs of variables. In the figure below, we plotted the high crime areas in blue and the low crime areas in red. We also included a LOESS curve in the scatter plots to get a better feel for the relationship between variables.

cols = c('red','blue')
train_df$target_mod <- factor(ifelse(train_df$target == 1 , 'y', 'n'))

panel.cor <- function(x,y, digits = 2, cex.cor, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- cor(x,y)
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  text(0.5,0.5, txt, cex =sqrt(abs(r*6)))
  
}
pairs(train_df[,1:12],  col= cols[train_df$target_mod], gap=0.3, lower.panel = panel.smooth, upper.panel = panel.cor, cex.labels = 3)

Per the above pairs plot, there are some complex, non-linear LOESS curve shapes for some of the predictor variables against the response variable -

  • rm: The LOESS curve initially indicates a negative relationship between the number of rooms and the crime rate category. However, when the number of rooms exceeds approximately 7, the relationship becomes positive. This strange LOESS curve shape is most likely the result of the model fitting to sparse data, as there are only about 30 observations with an average room count in excess of 7. (A standalone version of this type of LOESS plot is sketched after this list.)

  • ptratio: Normally we would expect crime rates to be higher in areas with high pupil-teacher ratios. However, the LOESS curve initially indicates an increasing propensity for high crime rates with increases to this ratio. Because this variable is left skewed, with a high density of ratios clustered around 20, we believe this unusual curve shape is due to the LOESS model fitting to sparse data at low ratio values.

  • medv: We expect high median home values to be associated with lower crime rates. This pattern appears to hold in our LOESS curve for median values below approximately $30k. The pattern then reverses for values above $30k. Once again, we believe this pattern reversal is due to sparse data: the variable medv is right skewed, with relatively few observations where the median value exceeds $30k.
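
For readers who want to inspect one of these relationships on its own, a standalone LOESS plot of the binary target against a single predictor can be drawn with ggplot2. The sketch below uses rm purely as an example and is not part of the modeling pipeline:

# Standalone LOESS curve of target against a single predictor (rm used as an example)
ggplot(train_df, aes(x = rm, y = target)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "LOESS curve: target vs rm")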

Feature Engineering

In binary logistic regression, it is desirable to have predictor variables that are approximately normally distributed, whenever possible. The crime data set presents some features that lead us to perform transformations on the data. These transformations include adding categorical indicator variables, taking logs of variables, and applying power transformations.

age: This variable is left skewed. The level of skewness is as below -

skewness(train_df$age)
## [1] -0.5795721

We perform the box-cox procedure to determine an appropriate transformation:

boxcoxfit(train_df$age)
## Fitted parameters:
##       lambda         beta      sigmasq 
##     1.317655   205.697942 10492.780979 
## 
## Convergence code returned by optim: 0

We apply the suggested power transformation of 1.3 and store in a new variable, age_new.

dis: This variable is right skewed. The level of skewness is as below -

skewness(train_df$dis)
## [1] 1.002117

We perform the box-cox procedure to determine an appropriate transformation:

boxcoxfit(train_df$dis)
## Fitted parameters:
##     lambda       beta    sigmasq 
## -0.1467279  1.0719066  0.2051234 
## 
## Convergence code returned by optim: 0

Given that the value of the lambda parameter is fairly close to 0, we will use the log transformation and save the results to a new variable, dis_new.
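
As a quick sanity check on this choice (a minimal sketch), we can compare the skewness of dis before and after the log transformation; the transformed values reappear later in the data summary as dis_new:

# Skewness before and after the log transform of dis
skewness(train_df$dis)       # original: about 1.00 (right skewed)
skewness(log(train_df$dis))  # log-transformed: roughly 0.14 (close to symmetric)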

indus: This variable has a bimodal distribution, with values clustered around two ranges. A basic power transformation will not result in an approximately normal distribution. Also, based on the pairwise plot, high crime rates are primarily concentrated in high-industry areas. Hence, we decided to add a categorical variable, indus_high_ind, taking values 1 and 0 with a cut-off point of 14, which is approximately midway between the two mode centers.

lstat: This variable is right skewed. The level of skewness is as below -

skewness(train_df$lstat)
## [1] 0.9085092

We perform the box-cox procedure to determine an appropriate transformation:

boxcoxfit(train_df$lstat)
## Fitted parameters:
##    lambda      beta   sigmasq 
## 0.2328042 3.2351269 1.0549878 
## 
## Convergence code returned by optim: 0

Based on this output, we create a new variable lstat_new, that applies a quarter root transformation to the original variable.

medv: This variable is right skewed. The level of skewness is as below -

skewness(train_df$medv)
## [1] 1.080167

We perform the box-cox procedure to determine an appropriate transformation:

boxcoxfit(train_df$medv)
## Fitted parameters:
##    lambda      beta   sigmasq 
## 0.2348612 4.4693904 0.6926584 
## 
## Convergence code returned by optim: 0

Based on this output, we create a new variable medv_new, that applies a quarter root transformation to the original variable.

nox: This variable is moderately right skewed. We perform the box-cox procedure to determine an appropriate transformation:

boxcoxfit(train_df$nox)
## Fitted parameters:
##     lambda       beta    sigmasq 
## -0.9494657 -0.8622020  0.1261795 
## 
## Convergence code returned by optim: 0

Based on the box-cox procedure output, we will create a new variable, nox_inv, that is the reciprocal of the raw nox value. We then multiply the reciprocal by -1 to preserve the direction of the original relationship.

ptratio: This variable is moderately left skewed. The level of skewness is as below -

skewness(train_df$ptratio)
## [1] -0.7567025

We perform the box-cox procedure to determine an appropriate transformation:

boxcoxfit(train_df$ptratio)
## Fitted parameters:
##       lambda         beta      sigmasq 
## 4.144091e+00 4.574381e+04 3.608320e+08 
## 
## Convergence code returned by optim: 0

The suggested power transformation of 4 does not correct the left skew, and its implementation also creates unusually large transformed ptratio values (e.g. 200,000 and higher). Therefore, we will forgo the power transformation for the sake of simplicity.

rm: This variable is moderately right skewed. The level of skewness is as below -

skewness(train_df$rm)
## [1] 0.4808673

We perform the box-cox procedure to determine an appropriate transformation:

boxcoxfit(train_df$rm)
## Fitted parameters:
##     lambda       beta    sigmasq 
## 0.20380031 2.22393623 0.02626356 
## 
## Convergence code returned by optim: 0

Based on this output, we create a new variable, rm_new, that applies a quarter root transformation to the original variable.

tax: The variable tax also has a bi-modal shape, with values densely clustered around 300 and 700 and no values recorded in the training data between 470 and 665. Because power transformations have limited effectiveness in approximating a normal distribution here, we’ll create a new categorical variable, tax_high_ind, that assigns a value of 1 when the tax value is greater than 500, and 0 otherwise. The 500 cutoff reflects an approximate halfway point between the two modal centers.

feature_engineering <- function(df){

  df$age_new  <- df$age^1.3
  df$dis_new <- log(df$dis)
  df$indus_high_ind <- ifelse(df$indus <= 14, 0, 1)
  df$lstat_new <- df$lstat^0.25
  df$medv_new <- df$medv^0.25
  df$nox_inv <- -1/df$nox
  df$rm_new <- df$rm^0.25
  df$tax_high_ind <- ifelse(df$tax <= 500, 0, 1)

  return(df)
}

train_df <- feature_engineering(train_df)
eval_df <- feature_engineering(eval_df)

Data Preparation Results

After the feature removal and feature engineering steps, below is the summary of all available features -

stat_desc(train_df)
vars n mean sd median trimmed mad min max range skew kurtosis se
zn 1 466 11.5772532 23.3646511 0.000000 5.3542781 0.0000000 0.0000000 100.000000 100.000000 2.1768152 3.8135765 1.0823466
indus 2 466 11.1050215 6.8458549 9.690000 10.9082353 9.3403800 0.4600000 27.740000 27.280000 0.2885450 -1.2432132 0.3171281
chas 3 466 0.0708155 0.2567920 0.000000 0.0000000 0.0000000 0.0000000 1.000000 1.000000 3.3354899 9.1451313 0.0118957
nox 4 466 0.5543105 0.1166667 0.538000 0.5442684 0.1334340 0.3890000 0.871000 0.482000 0.7463281 -0.0357736 0.0054045
rm 5 466 6.2906738 0.7048513 6.210000 6.2570615 0.5166861 3.8630000 8.780000 4.917000 0.4793202 1.5424378 0.0326516
age 6 466 68.3675966 28.3213784 77.150000 70.9553476 30.0226500 2.9000000 100.000000 97.100000 -0.5777075 -1.0098814 1.3119625
dis 7 466 3.7956929 2.1069496 3.190950 3.5443647 1.9144814 1.1296000 12.126500 10.996900 0.9988926 0.4719679 0.0976026
tax 8 466 409.5021459 167.9000887 334.500000 401.5080214 104.5233000 187.0000000 711.000000 524.000000 0.6593136 -1.1480456 7.7778214
ptratio 9 466 18.3984979 2.1968447 18.900000 18.5970588 1.9273800 12.6000000 22.000000 9.400000 -0.7542681 -0.4003627 0.1017669
lstat 10 466 12.6314592 7.1018907 11.350000 11.8809626 7.0720020 1.7300000 37.970000 36.240000 0.9055864 0.5033688 0.3289887
medv 11 466 22.5892704 9.2396814 21.200000 21.6304813 6.0045300 5.0000000 50.000000 45.000000 1.0766920 1.3737825 0.4280200
target 12 466 0.4914163 0.5004636 0.000000 0.4893048 0.0000000 0.0000000 1.000000 1.000000 0.0342293 -2.0031131 0.0231835
target_mod* 13 466 1.4914163 0.5004636 1.000000 1.4893048 0.0000000 1.0000000 2.000000 1.000000 0.0342293 -2.0031131 0.0231835
age_new 14 466 251.8570242 124.0342831 284.143045 260.3295138 147.1891999 3.9913278 398.107171 394.115843 -0.4250679 -1.2565776 5.7457772
dis_new 15 466 1.1875848 0.5412299 1.160315 1.1782986 0.6732974 0.1218636 2.495393 2.373530 0.1426949 -1.0086881 0.0250720
indus_high_ind 16 466 0.3776824 0.4853286 0.000000 0.3475936 0.0000000 0.0000000 1.000000 1.000000 0.5029793 -1.7507550 0.0224824
lstat_new 17 466 1.8287387 0.2677688 1.835476 1.8277552 0.2935691 1.1468630 2.482334 1.335471 -0.0056206 -0.5923852 0.0124041
medv_new 18 466 2.1474072 0.2180936 2.145774 2.1452873 0.1620255 1.4953488 2.659148 1.163799 0.0400482 0.4814771 0.0101030
nox_inv 19 466 -1.8792908 0.3667673 -1.858736 -1.8812132 0.4642629 -2.5706941 -1.148106 1.422589 0.0485116 -1.0201547 0.0169902
rm_new 20 466 1.5818618 0.0441441 1.578603 1.5810231 0.0327835 1.4019456 1.721368 0.319422 0.0414313 1.6393896 0.0020449
tax_high_ind 21 466 0.2703863 0.4446367 0.000000 0.2139037 0.0000000 0.0000000 1.000000 1.000000 1.0305987 -0.9398694 0.0205974
str(train_df)
## 'data.frame':    466 obs. of  21 variables:
##  $ zn            : num  0 0 0 30 0 0 0 0 0 80 ...
##  $ indus         : num  19.58 19.58 18.1 4.93 2.46 ...
##  $ chas          : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ nox           : num  0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
##  $ rm            : num  7.93 5.4 6.49 6.39 7.16 ...
##  $ age           : num  96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
##  $ dis           : num  2.05 1.32 1.98 7.04 2.7 ...
##  $ tax           : int  403 403 666 300 193 384 666 666 224 315 ...
##  $ ptratio       : num  14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
##  $ lstat         : num  3.7 26.82 18.85 5.19 4.82 ...
##  $ medv          : num  50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
##  $ target        : int  1 1 1 0 0 0 1 1 0 0 ...
##  $ target_mod    : Factor w/ 2 levels "n","y": 2 2 2 1 1 1 2 2 1 1 ...
##  $ age_new       : num  378.6 398.1 398.1 14.4 358.2 ...
##  $ dis_new       : num  0.716 0.279 0.682 1.951 0.993 ...
##  $ indus_high_ind: num  1 1 1 0 0 0 1 1 0 0 ...
##  $ lstat_new     : num  1.39 2.28 2.08 1.51 1.48 ...
##  $ medv_new      : num  2.66 1.91 1.98 2.21 2.48 ...
##  $ nox_inv       : num  -1.65 -1.15 -1.35 -2.34 -2.05 ...
##  $ rm_new        : num  1.68 1.52 1.6 1.59 1.64 ...
##  $ tax_high_ind  : num  0 0 1 0 0 0 1 1 0 0 ...
cols = c('red','blue')

pairs(train_df[,c(1, 16, 3, 19:20, 14:15, 21, 9, 17:18, 12)],  col= cols[train_df$target_mod], gap=0.3, lower.panel = panel.smooth, upper.panel = panel.cor, cex.labels = 3)

Based on the above plot, the feature pairs below show a high degree of collinearity -

  • nox_inv and age_new: 0.80
  • nox_inv and dis_new: -0.88
  • lstat_new and medv_new: -0.83

Multicollinearity can be assessed by computing a score called the Variance Inflation Factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. So let’s review the VIFs -
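
For intuition, the VIF of a single predictor can be reproduced by hand: regress that predictor on the remaining predictors and compute 1 / (1 - R^2). The sketch below uses nox_inv as an example and is for illustration only; the car::vif() call further down is what we actually rely on.

# Manual VIF for nox_inv: regress it on the other predictors, then take 1 / (1 - R^2)
vif_data <- train_df[, c(1, 16, 3, 19:20, 14:15, 21, 9, 17:18)]  # predictors only
r2_nox <- summary(lm(nox_inv ~ ., data = vif_data))$r.squared
1 / (1 - r2_nox)  # should be close to the nox_inv value reported by vif() below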

lm_model <- lm(target ~ . ,data=train_df[,c(1, 16, 3, 19:20, 14:15, 21, 9, 17:18, 12)])
vif(lm_model)
##             zn indus_high_ind           chas        nox_inv         rm_new 
##       2.154960       4.531029       1.067921       7.022698       2.203387 
##        age_new        dis_new   tax_high_ind        ptratio      lstat_new 
##       3.685974       5.444450       3.325173       1.871051       5.199120 
##       medv_new 
##       4.155697

Because their VIF scores are above 5, nox_inv, dis_new and lstat_new should not all be used together in a model.

train_df <- train_df %>% dplyr::select(-c(target_mod))

PART III: Building Models

Test train approach

We have divided the training data set into training and test sets using an 80/20 split. We will build our models on the training set and evaluate them on the test set.

set.seed(123)
split <- sample.split(train_df$target, SplitRatio = 0.8)
training_set <- subset(train_df, split == TRUE)
test_set <- subset(train_df, split == FALSE)

Below is the complete list of variables before we embark on the model-building process -

str(training_set)
## 'data.frame':    373 obs. of  20 variables:
##  $ zn            : num  0 0 0 0 0 80 22 0 0 22 ...
##  $ indus         : num  19.58 18.1 2.46 8.56 5.19 ...
##  $ chas          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox           : num  0.605 0.74 0.488 0.52 0.515 0.392 0.431 0.437 0.532 0.431 ...
##  $ rm            : num  7.93 6.49 7.16 6.78 6.32 ...
##  $ age           : num  96.2 100 92.2 71.3 38.1 19.1 8.9 45 77 8.4 ...
##  $ dis           : num  2.05 1.98 2.7 2.86 6.46 ...
##  $ tax           : int  403 666 193 384 224 315 330 398 666 330 ...
##  $ ptratio       : num  14.7 20.2 17.8 20.9 20.2 16.4 19.1 18.7 20.2 19.1 ...
##  $ lstat         : num  3.7 18.85 4.82 7.67 5.68 ...
##  $ medv          : num  50 15.4 37.9 26.5 22.2 20.9 24.8 21.4 25 42.8 ...
##  $ target        : int  1 1 0 0 0 0 0 0 1 1 ...
##  $ age_new       : num  379 398 358 256 114 ...
##  $ dis_new       : num  0.716 0.682 0.993 1.049 1.865 ...
##  $ indus_high_ind: num  1 1 0 0 0 0 0 0 1 0 ...
##  $ lstat_new     : num  1.39 2.08 1.48 1.66 1.54 ...
##  $ medv_new      : num  2.66 1.98 2.48 2.27 2.17 ...
##  $ nox_inv       : num  -1.65 -1.35 -2.05 -1.92 -1.94 ...
##  $ rm_new        : num  1.68 1.6 1.64 1.61 1.59 ...
##  $ tax_high_ind  : num  0 1 0 0 0 0 0 0 1 0 ...

We will build four different models to see which one yields the best performance.

Model 1:

We will start off with a model containing all the original variables, excluding any derived features -

baseline_logit_model <- glm(target~. - age_new - dis_new - indus_high_ind - lstat_new -medv_new - nox_inv - rm_new - tax_high_ind,family="binomial",data=training_set)

summary(baseline_logit_model) 
## 
## Call:
## glm(formula = target ~ . - age_new - dis_new - indus_high_ind - 
##     lstat_new - medv_new - nox_inv - rm_new - tax_high_ind, family = "binomial", 
##     data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2759  -0.2826  -0.0073   0.1604   3.4518  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -39.144899   6.514643  -6.009 1.87e-09 ***
## zn           -0.077304   0.035255  -2.193 0.028328 *  
## indus        -0.143998   0.059105  -2.436 0.014838 *  
## chas          2.386050   0.876089   2.724 0.006459 ** 
## nox          46.693161   7.930718   5.888 3.92e-09 ***
## rm           -0.173873   0.623150  -0.279 0.780228    
## age           0.029778   0.013226   2.252 0.024353 *  
## dis           0.874731   0.230132   3.801 0.000144 ***
## tax           0.007596   0.002334   3.255 0.001135 ** 
## ptratio       0.273426   0.119378   2.290 0.021997 *  
## lstat        -0.006013   0.054019  -0.111 0.911365    
## medv          0.154569   0.058928   2.623 0.008715 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.96  on 372  degrees of freedom
## Residual deviance: 186.58  on 361  degrees of freedom
## AIC: 210.58
## 
## Number of Fisher Scoring iterations: 8

Model Statistics

test_set$baseline_logit_model <- ifelse(predict.glm(baseline_logit_model, test_set,"response") >= 0.5,1,0)
cm_model1 <- confusionMatrix(factor(test_set$baseline_logit_model), factor(test_set$target),"1")

cm_model1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 42  5
##          1  5 41
##                                           
##                Accuracy : 0.8925          
##                  95% CI : (0.8111, 0.9472)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : 2.021e-15       
##                                           
##                   Kappa : 0.7849          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8913          
##             Specificity : 0.8936          
##          Pos Pred Value : 0.8913          
##          Neg Pred Value : 0.8936          
##              Prevalence : 0.4946          
##          Detection Rate : 0.4409          
##    Detection Prevalence : 0.4946          
##       Balanced Accuracy : 0.8925          
##                                           
##        'Positive' Class : 1               
## 

VIF Results

Multicollinearity can be assessed by computing a score called the Variance Inflation Factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

vif(baseline_logit_model)
##       zn    indus     chas      nox       rm      age      dis      tax 
## 1.756565 3.719601 1.316733 4.529997 4.593193 2.481361 4.186295 2.707443 
##  ptratio    lstat     medv 
## 1.744104 3.005254 6.165711

AUC Curve

rocCurve <- roc(response = test_set$target, 
                predictor = test_set$baseline_logit_model)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Plot the ROC Curve:
plot(rocCurve, legacy.axes = TRUE, main = "ROC Curve using pROC",print.auc = TRUE)

# Model Stats
results <- tibble(model = "baseline_logit_model",predictors = 11,F1 = cm_model1$byClass[7],
                  deviance=baseline_logit_model$deviance, 
                  r2 = 1 - baseline_logit_model$deviance/baseline_logit_model$null.deviance,
                  aic=baseline_logit_model$aic, auc = auc(rocCurve), AICc = aicc(baseline_logit_model))

model <- as.data.frame(glance(baseline_logit_model))

results <- cbind(results, model)
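
One caveat worth noting: the ROC curve above is computed from the 0/1 class predictions rather than the predicted probabilities, so it contains only a single operating point. A probability-based curve, shown as a minimal sketch below, traces the full range of possible cut-offs; it is not used in the model comparison that follows.

# ROC curve based on predicted probabilities instead of hard 0/1 classifications
pred_probs <- predict.glm(baseline_logit_model, test_set, "response")
rocCurve_probs <- roc(response = test_set$target, predictor = pred_probs)
plot(rocCurve_probs, legacy.axes = TRUE, main = "ROC Curve (predicted probabilities)", print.auc = TRUE)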

Model 2:

In the 2nd model, we have added all the variables including the derived features.

full_logit_model <- glm(target~. ,family="binomial",data=training_set)

summary(full_logit_model) 
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0476  -0.1904  -0.0061   0.0805   3.4507  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     5.453e+02  1.787e+02   3.051 0.002280 ** 
## zn             -4.620e-02  3.587e-02  -1.288 0.197825    
## indus          -5.106e-01  1.109e-01  -4.605 4.13e-06 ***
## chas            2.789e+00  1.003e+00   2.780 0.005439 ** 
## nox             8.432e+01  5.322e+01   1.584 0.113081    
## rm              3.349e+01  9.311e+00   3.597 0.000322 ***
## age            -3.548e-01  1.630e-01  -2.177 0.029512 *  
## dis            -2.752e+00  1.010e+00  -2.724 0.006449 ** 
## tax             9.898e-03  4.783e-03   2.069 0.038518 *  
## ptratio         4.591e-01  1.644e-01   2.792 0.005231 ** 
## lstat          -3.286e-01  2.259e-01  -1.454 0.145818    
## medv           -1.640e-01  2.168e-01  -0.757 0.449258    
## age_new         8.968e-02  3.735e-02   2.401 0.016366 *  
## dis_new         1.567e+01  4.306e+00   3.639 0.000274 ***
## indus_high_ind  7.697e+00  1.775e+00   4.337 1.45e-05 ***
## lstat_new       7.809e+00  5.994e+00   1.303 0.192646    
## medv_new        1.168e+01  9.023e+00   1.295 0.195401    
## nox_inv        -7.018e+00  1.455e+01  -0.482 0.629571    
## rm_new         -5.446e+02  1.485e+02  -3.667 0.000245 ***
## tax_high_ind   -1.793e+00  1.808e+00  -0.992 0.321409    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.96  on 372  degrees of freedom
## Residual deviance: 139.90  on 353  degrees of freedom
## AIC: 179.9
## 
## Number of Fisher Scoring iterations: 8

Model Statistics

test_set$full_logit_model <- ifelse(predict.glm(full_logit_model, test_set,"response") >= 0.5,1,0)
cm_model2 <- confusionMatrix(factor(test_set$full_logit_model), factor(test_set$target),"1")

cm_model2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43  6
##          1  4 40
##                                           
##                Accuracy : 0.8925          
##                  95% CI : (0.8111, 0.9472)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : 2.021e-15       
##                                           
##                   Kappa : 0.7848          
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.8696          
##             Specificity : 0.9149          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.8776          
##              Prevalence : 0.4946          
##          Detection Rate : 0.4301          
##    Detection Prevalence : 0.4731          
##       Balanced Accuracy : 0.8922          
##                                           
##        'Positive' Class : 1               
## 

VIF Results

Multicollinearity can be assessed by computing a score called the Variance Inflation Factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

vif(full_logit_model)
##             zn          indus           chas            nox             rm 
##       1.610938       9.477768       1.504821     152.077801     811.155752 
##            age            dis            tax        ptratio          lstat 
##     324.363346      47.101697       8.075497       2.036233      43.445253 
##           medv        age_new        dis_new indus_high_ind      lstat_new 
##      69.324325     331.738615      64.150678       9.434319      45.841001 
##       medv_new        nox_inv         rm_new   tax_high_ind 
##      63.495066     139.102795     769.268700       6.423736

AUC Curve

rocCurve <- roc(response = test_set$target, 
                predictor = test_set$full_logit_model)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Plot the ROC Curve:
plot(rocCurve, legacy.axes = TRUE, main = "ROC Curve using pROC",print.auc = TRUE)

# Model Stats
results_2 <- tibble(model = "full_logit_model",predictors = 19,F1 = cm_model2$byClass[7],
                  deviance=full_logit_model$deviance, 
                  r2 = 1 - full_logit_model$deviance/full_logit_model$null.deviance,
                  aic = full_logit_model$aic, auc = auc(rocCurve), AICc = aicc(full_logit_model))

model_2 <- as.data.frame(glance(full_logit_model))

results_2 <- cbind(results_2, model_2)

results <- rbind(results,results_2)

Model Summary

In this model, the F1 score and AUC are slightly lower than in the first model, and many variables are insignificant.

Model 3:

In the third model, we keep only the significant variables from the previous model.

significant_logit_model <- glm(target ~ chas + nox + tax + ptratio + dis_new + indus_high_ind + rm_new, family="binomial", data=training_set)

summary(significant_logit_model) 
## 
## Call:
## glm(formula = target ~ chas + nox + tax + ptratio + dis_new + 
##     indus_high_ind + rm_new, family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5398  -0.3616  -0.1202   0.2698   2.6743  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -53.187062  10.652443  -4.993 5.95e-07 ***
## chas             2.077610   0.746562   2.783  0.00539 ** 
## nox             41.082799   6.191022   6.636 3.23e-11 ***
## tax              0.003717   0.001940   1.916  0.05542 .  
## ptratio          0.195953   0.095617   2.049  0.04043 *  
## dis_new          2.286927   0.738955   3.095  0.00197 ** 
## indus_high_ind   0.241489   0.638261   0.378  0.70517    
## rm_new          14.592401   4.896342   2.980  0.00288 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.96  on 372  degrees of freedom
## Residual deviance: 206.96  on 365  degrees of freedom
## AIC: 222.96
## 
## Number of Fisher Scoring iterations: 7

Model Statistics

test_set$significant_logit_model <- ifelse(predict.glm(significant_logit_model, test_set,"response") >= 0.5,1,0)
cm_model3 <- confusionMatrix(factor(test_set$significant_logit_model), factor(test_set$target),"1")

cm_model3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 41  7
##          1  6 39
##                                           
##                Accuracy : 0.8602          
##                  95% CI : (0.7728, 0.9234)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : 6.388e-13       
##                                           
##                   Kappa : 0.7203          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8478          
##             Specificity : 0.8723          
##          Pos Pred Value : 0.8667          
##          Neg Pred Value : 0.8542          
##              Prevalence : 0.4946          
##          Detection Rate : 0.4194          
##    Detection Prevalence : 0.4839          
##       Balanced Accuracy : 0.8601          
##                                           
##        'Positive' Class : 1               
## 

VIF Results

Multicollinearity can be assessed by computing a score called the Variance Inflation Factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

vif(significant_logit_model)
##           chas            nox            tax        ptratio        dis_new 
##       1.157036       3.773906       1.942803       1.282896       3.559120 
## indus_high_ind         rm_new 
##       2.166474       1.483406

AUC Curve

rocCurve <- roc(response = test_set$target, 
                predictor = test_set$significant_logit_model)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Plot the ROC Curve:
plot(rocCurve, legacy.axes = TRUE, main = "ROC Curve using pROC",print.auc = TRUE)

# Model Stats
results_3 <- tibble(model = "significant_logit_model",predictors = 7,F1 = cm_model3$byClass[7],
                  deviance=significant_logit_model$deviance, 
                  r2 = 1 - significant_logit_model$deviance/significant_logit_model$null.deviance,
                  aic = significant_logit_model$aic, auc = auc(rocCurve), AICc = aicc(significant_logit_model))

model_3 <- as.data.frame(glance(significant_logit_model))

results_3 <- cbind(results_3, model_3)

results <- rbind(results,results_3)

Model Summary

In this model, the AIC is higher than in the other two models, and the F1 and AUC scores are lower than in the first and second models.

Model 4: Automatic Selection Stepwise Model

For our last model, we’ll fit a binary logistic regression using a stepwise regression procedure, with variable selection occurring in both forward and backward directions.

For simplicity, we’ll only include first-order terms, but we’ll open up the pool of candidate variables to all variables in our data set, using transformed versions of our variables where applicable. There is one exception: no rad-based feature is included, since rad was removed from the data set earlier due to its extremely high correlation with tax.

model.upper <- glm(target ~ zn + indus_high_ind + chas + nox_inv + rm_new + age_new + dis_new + tax_high_ind + ptratio + lstat_new + medv_new,family="binomial",data=training_set)
model.null = glm(target ~ 1, 
                 data=training_set,
                 family = "binomial")
stepwise_model <- step(model.null, scope = list(upper=model.upper, lower = model.null),
                 trace = 0, direction = 'both')
summary(stepwise_model)
## 
## Call:
## glm(formula = target ~ nox_inv + tax_high_ind + dis_new + medv_new + 
##     age_new + ptratio + chas, family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8158  -0.2621  -0.0256   0.2758   2.8889  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.851370   4.762656  -0.599 0.549377    
## nox_inv      12.670515   1.912865   6.624 3.50e-11 ***
## tax_high_ind  2.475549   0.738061   3.354 0.000796 ***
## dis_new       3.384582   0.838093   4.038 5.38e-05 ***
## medv_new      6.562114   1.624009   4.041 5.33e-05 ***
## age_new       0.007125   0.002482   2.871 0.004096 ** 
## ptratio       0.321442   0.113148   2.841 0.004499 ** 
## chas          1.705403   0.758426   2.249 0.024537 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.96  on 372  degrees of freedom
## Residual deviance: 194.22  on 365  degrees of freedom
## AIC: 210.22
## 
## Number of Fisher Scoring iterations: 7

Model Statistics

test_set$stepwise_model <- ifelse(predict.glm(stepwise_model, test_set,"response") >= 0.5,1,0)
cm_model4 <- confusionMatrix(factor(test_set$stepwise_model), factor(test_set$target),"1")

cm_model4
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 42  5
##          1  5 41
##                                           
##                Accuracy : 0.8925          
##                  95% CI : (0.8111, 0.9472)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : 2.021e-15       
##                                           
##                   Kappa : 0.7849          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8913          
##             Specificity : 0.8936          
##          Pos Pred Value : 0.8913          
##          Neg Pred Value : 0.8936          
##              Prevalence : 0.4946          
##          Detection Rate : 0.4409          
##    Detection Prevalence : 0.4946          
##       Balanced Accuracy : 0.8925          
##                                           
##        'Positive' Class : 1               
## 

VIF Results

Multicollinearity can be assessed by computing a score called the Variance Inflation Factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

vif(stepwise_model)
##      nox_inv tax_high_ind      dis_new     medv_new      age_new 
##     4.067633     1.598767     3.795100     2.938636     1.825200 
##      ptratio         chas 
##     1.688802     1.130040

AUC Curve

rocCurve <- roc(response = test_set$target, 
                predictor = test_set$stepwise_model)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Plot the ROC Curve:
plot(rocCurve, legacy.axes = TRUE, main = "ROC Curve using pROC",print.auc = TRUE)

# Model Stats
results_4 <- tibble(model = "stepwise_model",predictors = 7,F1 = cm_model4$byClass[7],
                  deviance=stepwise_model$deviance, 
                  r2 = 1 - stepwise_model$deviance/stepwise_model$null.deviance,
                  aic = stepwise_model$aic, auc = auc(rocCurve), AICc = aicc(stepwise_model))

model_4 <- as.data.frame(glance(stepwise_model))

results_4 <- cbind(results_4, model_4)

results <- rbind(results,results_4)

Model Summary

In this model, the F1 and AUC scores are identical to those of the first model and higher than those of the other two models.

PART IV: Selecting Models

Compare Key statistics

The table below summarizes the model statistics for all four of our models. The models are listed from top to bottom in the order in which they were described in Part III.

The confusion-matrix accuracies of the candidate models are very close, so accuracy alone does not point to a clear winner. Now let’s compare all four model fits, using AIC, corrected AIC (AICc), BIC, and log-likelihood values.
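
For reference, the corrected AIC (AICc) shown in the table comes from the aicc() helper loaded with the glmulti package at the top; it can also be reproduced by hand from the standard small-sample correction, where k is the number of estimated coefficients and n the number of observations. A minimal sketch:

# Manual AICc: AIC + 2k(k + 1) / (n - k - 1)
manual_aicc <- function(model) {
  k <- length(coef(model))  # number of estimated coefficients (including the intercept)
  n <- nobs(model)          # number of observations used to fit the model
  AIC(model) + (2 * k * (k + 1)) / (n - k - 1)
}
manual_aicc(stepwise_model)  # matches the AICc column for the stepwise model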

results %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
model predictors F1 deviance r2 aic auc AICc null.deviance df.null logLik AIC BIC deviance df.residual
baseline_logit_model 11 0.8913043 186.5777 0.6390844 210.5777 0.8924607 211.4443 516.9564 372 -93.28883 210.5777 257.6366 186.5777 361
full_logit_model 19 0.8888889 139.8985 0.7293804 179.8985 0.8922294 182.2849 516.9564 372 -69.94926 179.8985 258.3301 139.8985 353
significant_logit_model 7 0.8571429 206.9577 0.5996613 222.9577 0.8600833 223.3533 516.9564 372 -103.47883 222.9577 254.3303 206.9577 365
stepwise_model 7 0.8913043 194.2181 0.6243048 210.2181 0.8924607 210.6137 516.9564 372 -97.10903 210.2181 241.5907 194.2181 365

Based on the above table, AIC, AICc and BIC provide a broadly consistent interpretation of model fits:

Model 1 has the second highest AIC, AICc, and BIC values, which is indicative of relatively poor fit. Model 2 has lower AIC and AICc but a slightly higher BIC than Model 1; on balance, Model 2 appears superior to Model 1, and by a similar comparison it is superior to Model 3 as well. Model 3 has the highest AIC and AICc of all four models, although its BIC is lower than those of Models 1 and 2. Model 4 has the smallest BIC of all four models and smaller AIC and AICc than Models 1 and 3, which indicates a superior model fit.

Model Interpretation

Among the four models, we select Model 4 as the best model based on the above analysis.
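
Because the selected model is a logistic regression, its coefficients are most naturally interpreted as odds ratios after exponentiation. The sketch below is illustrative, using base R’s confint.default() for approximate Wald-type intervals, and is not part of the submission pipeline.

# Odds ratios and approximate 95% Wald confidence intervals for the selected model
exp(cbind(OddsRatio = coef(stepwise_model), confint.default(stepwise_model)))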

model_metrics <- cm_model4$byClass

model_metrics %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
x
Sensitivity 0.8913043
Specificity 0.8936170
Pos Pred Value 0.8913043
Neg Pred Value 0.8936170
Precision 0.8913043
Recall 0.8913043
F1 0.8913043
Prevalence 0.4946237
Detection Rate 0.4408602
Detection Prevalence 0.4946237
Balanced Accuracy 0.8924607

Our model has relatively high values of sensitivity and specificity. We conclude that our model is fairly strong. Our last step is to score the evaluation data set with the selected model.

Model Prediction

eval_df$predict_target <- ifelse(predict.glm(stepwise_model, eval_df,"response") >= 0.5,1,0)


eval_df %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
zn indus chas nox rm age dis tax ptratio lstat medv age_new dis_new indus_high_ind lstat_new medv_new nox_inv rm_new tax_high_ind predict_target
0 7.07 0 0.469 7.185 61.1 4.9671 242 17.8 4.03 34.7 209.82323 1.6028362 0 1.416858 2.427070 -2.132196 1.637219 0 0
0 8.14 0 0.538 6.096 84.5 4.4619 307 21.0 10.26 18.2 319.82597 1.4955747 0 1.789727 2.065465 -1.858736 1.571308 0 1
0 8.14 0 0.538 6.495 94.4 4.4547 307 21.0 12.80 18.4 369.37168 1.4939597 0 1.891483 2.071116 -1.858736 1.596411 0 1
0 8.14 0 0.538 5.950 82.0 3.9900 307 21.0 27.71 13.2 307.57994 1.3837912 0 2.294347 1.906090 -1.858736 1.561814 0 0
0 5.96 0 0.499 5.850 41.5 3.9342 279 19.2 8.77 21.0 126.90026 1.3697076 0 1.720877 2.140695 -2.004008 1.555210 0 0
25 5.13 0 0.453 5.741 66.2 7.2254 284 19.7 13.15 18.7 232.87099 1.9776026 0 1.904283 2.079507 -2.207505 1.547914 0 0
25 5.13 0 0.453 5.966 93.4 6.8185 284 19.7 14.44 16.0 364.29309 1.9196395 0 1.949359 2.000000 -2.207505 1.562863 0 0
0 4.49 0 0.449 6.630 56.1 4.4377 247 18.5 6.53 26.6 187.78103 1.4901362 0 1.598558 2.271017 -2.227171 1.604643 0 0
0 4.49 0 0.449 6.121 56.8 3.7476 247 18.5 8.44 22.2 190.83272 1.3211156 0 1.704455 2.170642 -2.227171 1.572916 0 0
0 2.89 0 0.445 6.163 69.6 3.4952 276 18.0 11.34 21.4 248.53755 1.2513906 0 1.835073 2.150817 -2.247191 1.575608 0 0
0 25.65 0 0.581 5.856 97.0 1.9444 188 19.1 25.41 17.3 382.65135 0.6649534 1 2.245180 2.039443 -1.721170 1.555608 0 0
0 25.65 0 0.581 5.613 95.6 1.7572 188 19.1 27.26 15.7 375.48731 0.5637216 1 2.284975 1.990558 -1.721170 1.539213 0 0
0 21.89 0 0.624 5.637 94.7 1.9799 437 21.2 18.34 14.3 370.89841 0.6830463 1 2.069426 1.944617 -1.602564 1.540856 0 1
0 19.58 0 0.605 6.101 93.0 2.2834 403 14.7 9.81 25.0 362.26621 0.8256656 1 1.769772 2.236068 -1.652893 1.571630 0 1
0 19.58 0 0.605 5.880 97.3 2.3887 403 14.7 12.03 19.1 384.19056 0.8707493 1 1.862372 2.090539 -1.652893 1.557200 0 1
0 10.59 1 0.489 5.960 92.1 3.8771 277 18.6 17.27 21.7 357.71530 1.3550875 0 2.038558 2.158316 -2.044990 1.562470 0 1
0 6.20 0 0.504 6.552 21.4 3.3751 307 17.4 3.76 31.5 53.64607 1.2164250 0 1.392506 2.369069 -1.984127 1.599902 0 0
0 6.20 0 0.507 8.247 70.4 3.6519 307 17.4 3.95 48.3 252.25771 1.2952476 0 1.409773 2.636251 -1.972387 1.694627 0 1
22 5.86 0 0.431 6.957 6.8 8.9067 330 19.1 3.53 29.6 12.08542 2.1868038 0 1.370704 2.332507 -2.320186 1.624073 0 0
90 2.97 0 0.400 7.088 20.8 7.3073 285 15.3 7.85 32.2 51.69902 1.9888738 0 1.673853 2.382122 -2.500000 1.631665 0 0
80 1.76 0 0.385 6.230 31.5 9.0892 241 18.2 12.93 20.1 88.67552 2.2070869 0 1.896268 2.117381 -2.597403 1.579872 0 0
33 2.18 0 0.472 6.616 58.1 3.3700 222 18.4 8.93 28.4 196.53005 1.2149127 0 1.728673 2.308498 -2.118644 1.603795 0 0
0 9.90 0 0.544 6.122 52.8 2.6403 304 18.4 5.98 22.1 173.54979 0.9708925 0 1.563779 2.168194 -1.838235 1.572980 0 0
0 7.38 0 0.493 6.415 40.1 4.7211 287 19.6 6.12 25.0 121.36339 1.5520418 0 1.572852 2.236068 -2.028398 1.591473 0 0
0 7.38 0 0.493 6.312 28.9 5.4159 287 19.6 6.15 23.0 79.28065 1.6893391 0 1.574776 2.189939 -2.028398 1.585046 0 0
0 5.19 0 0.515 5.895 59.6 5.6150 224 20.2 10.56 18.5 203.15155 1.7254416 0 1.802669 2.073924 -1.941748 1.558192 0 0
80 2.01 0 0.435 6.635 29.7 8.3440 280 17.0 5.99 24.5 82.14543 2.1215427 0 1.564432 2.224803 -2.298851 1.604945 0 0
0 18.10 0 0.718 3.561 87.9 1.6132 666 20.2 7.12 27.5 336.65536 0.4782198 1 1.633503 2.289988 -1.392758 1.373703 1 1
0 18.10 1 0.631 7.016 97.5 1.2024 666 20.2 2.96 50.0 385.21749 0.1843196 1 1.311665 2.659148 -1.584786 1.627505 1 1
0 18.10 0 0.584 6.348 86.1 2.0527 666 20.2 17.64 14.5 327.72087 0.7191560 1 2.049390 1.951381 -1.712329 1.587301 1 1
0 18.10 0 0.740 5.935 87.9 1.8206 666 20.2 34.02 8.4 336.65536 0.5991661 1 2.415091 1.702432 -1.351351 1.560828 1 1
0 18.10 0 0.740 5.627 93.9 1.8172 666 20.2 22.88 12.8 366.83036 0.5972969 1 2.187077 1.891483 -1.351351 1.540172 1 1
0 18.10 0 0.740 5.818 92.4 1.8662 666 20.2 22.11 10.5 359.23080 0.6239043 1 2.168439 1.800103 -1.351351 1.553079 1 1
0 18.10 0 0.740 6.219 100.0 2.0048 666 20.2 16.59 18.4 398.10717 0.6955443 1 2.018188 2.071116 -1.351351 1.579175 1 1
0 18.10 0 0.740 5.854 96.6 1.8956 666 20.2 23.79 10.8 380.60130 0.6395354 1 2.208506 1.812825 -1.351351 1.555476 1 1
0 18.10 0 0.713 6.525 86.5 2.4358 666 20.2 18.13 14.1 329.70151 0.8902752 1 2.063476 1.937781 -1.402524 1.598251 1 1
0 18.10 0 0.713 6.376 88.4 2.5671 666 20.2 14.65 17.7 339.14697 0.9427769 1 1.956408 2.051131 -1.402524 1.589048 1 1
0 18.10 0 0.655 6.209 65.4 2.9634 666 20.2 13.22 21.4 229.21924 1.0863373 1 1.906812 2.150817 -1.526718 1.578539 1 1
0 9.69 0 0.585 5.794 70.6 2.8927 391 19.2 14.10 18.3 253.18974 1.0621903 0 1.937781 2.068296 -1.709402 1.551474 0 1
0 11.93 0 0.573 6.976 91.0 2.1675 273 21.0 5.64 23.9 352.17118 0.7735744 0 1.541061 2.211055 -1.745201 1.625181 0 1

Model output

# export to .csv for submission
write.csv(eval_df, file = "C:/CUNY/Semester4(Spring)/DATA 621/Assignments/Homework3/Output/crime_logistic_regression_model_submission.csv",row.names = FALSE)

Our model prediction output can be found in the below GitHub location -

Model Output