Data 621 Homework 3

Introduction

For this assignment, we were tasked with building a binary logistic regression model from a dataset containing information on crime in various neighborhoods of a major city. Given a vector of predictors, we seek to predict whether the neighborhood crime rate is above the median.
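Concretely, binary logistic regression models the log-odds of the response as a linear combination of the predictors:

$$\log\left(\frac{P(\text{target}=1 \mid X)}{1-P(\text{target}=1 \mid X)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

An observation is classified as above the median crime rate (target = 1) when the predicted probability exceeds a chosen threshold (0.5 throughout this report).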

data_train <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_3/crime-training-data_modified.csv", header = TRUE)
data_test <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_3/crime-evaluation-data_modified.csv", header = TRUE)

Data Exploration

The dataset is composed of 466 observations and 12 predictor variables. The response variable target is binary (0 or 1). A quick look at the distribution of the training dataset reveals some skewed predictors. All observations in this dataset are complete.

skim(data_train)
Data summary
Name data_train
Number of rows 466
Number of columns 13
_______________________
Column type frequency:
numeric 13
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
zn 0 1 11.58 23.36 0.00 0.00 0.00 16.25 100.00 ▇▁▁▁▁
indus 0 1 11.11 6.85 0.46 5.15 9.69 18.10 27.74 ▇▆▁▇▁
chas 0 1 0.07 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
nox 0 1 0.55 0.12 0.39 0.45 0.54 0.62 0.87 ▇▇▅▃▁
rm 0 1 6.29 0.70 3.86 5.89 6.21 6.63 8.78 ▁▂▇▂▁
age 0 1 68.37 28.32 2.90 43.88 77.15 94.10 100.00 ▂▂▂▃▇
dis 0 1 3.80 2.11 1.13 2.10 3.19 5.21 12.13 ▇▅▂▁▁
rad 0 1 9.53 8.69 1.00 4.00 5.00 24.00 24.00 ▇▂▁▁▃
tax 0 1 409.50 167.90 187.00 281.00 334.50 666.00 711.00 ▇▇▅▁▇
ptratio 0 1 18.40 2.20 12.60 16.90 18.90 20.20 22.00 ▁▃▅▅▇
lstat 0 1 12.63 7.10 1.73 7.04 11.35 16.93 37.97 ▇▇▅▂▁
medv 0 1 22.59 9.24 5.00 17.02 21.20 25.00 50.00 ▂▇▅▁▁
target 0 1 0.49 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇

Visualization

With a closer look at the distributions of the data using density plots, we can identify bimodal distributions for indus, rad and tax, and skew in dis, nox, ptratio and zn.

Looking at the density and box plots of each variable, we can observe that:

  • age: there is a higher concentration of older homes (more than 50 years old). The box plot shows that neighborhoods with older homes are generally associated with higher crime
  • chas: most homes in the dataset do not border the Charles River, so this may not be a good predictor variable
  • dis: a right-skewed distribution, where a lower distance to employment centers is associated with a higher crime indicator
  • indus: a bimodal distribution of industrial activity; the box plots suggest that higher industrial activity is associated with an increased crime indicator
  • lstat: a predictor based on the “status” of the population. It is ambiguous what the scale of this factor reflects, but the observation is that higher values of lstat are associated with a higher crime indicator
  • medv: the median value of homes; as expected, higher-value homes are associated with lower crime
  • nox: nitrogen oxide concentrations are right skewed, with most locations not having a “high” amount; as the concentration increases, so does the crime indicator
  • ptratio: the student-to-teacher ratio; both convention and the plots suggest that a higher ratio is associated with higher crime
  • rad: the index of accessibility to radial highways looks slightly bimodal; greater accessibility appears to be associated with higher crime, although the variability for the positive crime indicator is very large
  • rm: the average number of rooms per home looks normally distributed, and its association with crime appears evenly distributed in the box plot
  • tax: the property tax variable is bimodal; the box plot shows fairly large variability for the positive crime indicator
  • zn: the proportion of large residential lots is mostly 0, and lower proportions appear to be associated with higher crime
data_train %>% 
  select(-target) %>%
  gather() %>% 
  ggplot(aes(x= value)) + 
  geom_density(fill='pink') + 
  facet_wrap(~key, scales = 'free')

data_train_long <- gather(data_train, "Variable", "Value", zn:medv)
data_train_long$target <- as.factor(data_train_long$target)  
ggplot(data_train_long, aes(x=target, y=Value)) + geom_boxplot(varwidth = TRUE, alpha=0.2, fill="orange1") + facet_wrap(~Variable, scales = "free")

Correlations with Response Variable

Looking at the correlation plot of our dataset, we can confirm some of the observations made from the density and box plots:

  • the target variable is positively correlated with nox (.73), age (.63), rad (.63), and tax (.61)
  • the target variable is negatively correlated with dis (-.62), as seen in the density and box plots
  • the chas variable has a very weak correlation with all the other variables, including the target, so we can look to eliminate it from the analysis
  • there is a fair amount of correlation among the predictor variables, which raises concerns about multicollinearity

Highest correlations amongst the predictor variables:

  • tax|rad (.91)
  • nox|indus (.76)
  • age|nox (.74)
  • tax|indus (.73)
  • medv|rm (.71)
  • dis|indus (-.70)
  • medv|lstat (-.74)
  • dis|age (-.75)
  • dis|nox (-.77)
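For reference, a minimal sketch of how these figures can be reproduced from the correlation matrix (assuming the tidyverse is loaded, as in the rest of the report):

# correlations of each variable with the response, sorted by strength
target_cor <- sort(cor(data_train)[, "target"], decreasing = TRUE)
round(target_cor[names(target_cor) != "target"], 2)

# largest pairwise correlations among the predictors (each pair listed once)
pred_cor <- cor(data_train %>% select(-target))
as.data.frame(as.table(pred_cor)) %>%
  filter(as.character(Var1) < as.character(Var2)) %>%   # drop the diagonal and duplicate pairs
  arrange(desc(abs(Freq))) %>%
  head(9)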

q <- cor(data_train)
ggcorrplot(q, type = "upper", outline.color = "white",
           ggtheme = theme_classic,
           colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3) 

#pairs(data_train, col=data_train$target)

Data Preparation

To prepare the data for use in the model, we remove the chas variable, as it does not appear to have any impact on target and is not highly correlated with any other predictor variable.

We also convert the target variable from an integer to a factor.

data_train_m2 <- data_train %>% select(-chas)
#convert target to factor
data_train_m2$target <- as.factor(data_train_m2$target)

Influential Leverage Points

We also check whether the dataset contains any influential points that may have a significant impact on the model.

From the full model below, we identify some outlying observations based on Cook's distance, though not all outliers are influential.

ilp <- glm(target ~ .,data = data_train_m2, family = binomial(link="logit"))
#Top 5 outliers
plot(ilp, which = 4, id.n = 5)

To see whether any of the outlying observations are influential, we plot the standardized residuals and check whether any exceed 3 in absolute value.
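For reference, the standardized residuals used here (broom's .std.resid) are the deviance residuals scaled by their leverage:

$$r_i^{\text{std}} = \frac{d_i}{\sqrt{1 - h_{ii}}}$$

where $d_i$ is the deviance residual and $h_{ii}$ the leverage of observation $i$.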

augment(ilp) %>% mutate(index = 1:n()) %>% ggplot(aes(index, .std.resid)) + geom_point(aes(color=target)) + labs(y= "Standardized Residuals", x="Observation Index")

From the above plot, observation 457 is the only one with a standardized residual greater than 3. We remove it from our dataset, and we also remove observation 338, whose residual is close to (but not above) 3.

(augment(ilp) %>% mutate(index = 1:n()) %>% top_n(2, .std.resid))[,c(1:9,15,19)]
#remove from model
data_train_m2 <- data_train_m2[-c(457,338),]

Model Building

Models in this section are simply subsets of the full model.

# Initialize a df that will store the metrics of models
models.df <- tibble(id=character(), formula=character(), res.deviance=numeric(), null.deviance=numeric(),
                 aic=numeric(), accuracy=numeric(), sensitivity=numeric(), specificity=numeric(),
                 precision=numeric())
# A function to extract the relevant metrics from the summary and confusion matrix
build_model <- function(id, formula, data) {
  glm.fit <- glm(formula, data=data, family=binomial)
  print(summary(glm.fit))
  glm.probs <- predict(glm.fit, type="response")
  # Classify using a 0.5 probability threshold
  glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
  results <- tibble(target=data$target, pred=glm.pred)
  results <- results %>%
    mutate(pred.class = as.factor(pred), target.class = as.factor(target))
  
  # Compute the confusion matrix once and pull out the classification metrics
  cm <- confusionMatrix(results$pred.class, results$target.class, positive = "1")
  #print(cm)
  
  acc <- cm$overall['Accuracy']
  sens <- cm$byClass['Sensitivity']
  spec <- cm$byClass['Specificity']
  prec <- cm$byClass['Precision']
  res.deviance <- glm.fit$deviance
  null.deviance <- glm.fit$null.deviance  
  aic <- glm.fit$aic
  metrics <- list(res.deviance=res.deviance, null.deviance=null.deviance,aic=aic, accuracy=acc, sensitivity=sens, specificity=spec, precision=prec)
  metrics <- lapply(metrics, round, 3)
  
  plot(roc(results$target.class,glm.probs), print.auc = TRUE)
  model.df <- tibble(id=id, formula=formula, res.deviance=metrics$res.deviance, null.deviance=metrics$null.deviance, 
                         aic=metrics$aic, accuracy=metrics$accuracy, sensitivity=metrics$sensitivity, specificity=metrics$specificity, precision=metrics$precision)
  model.list <- list(model=glm.fit, df_info=model.df)
  return(model.list)
}

Model 1: Base Model Subsets

Model 1 starts with a complete model including all predictors and iteratively finds the subset of predictors that are statistically significant.

m1 <- build_model('m1', "target ~ .", data = data_train)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8464  -0.1445  -0.0017   0.0029   3.4665  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -40.822934   6.632913  -6.155 7.53e-10 ***
zn           -0.065946   0.034656  -1.903  0.05706 .  
indus        -0.064614   0.047622  -1.357  0.17485    
chas          0.910765   0.755546   1.205  0.22803    
nox          49.122297   7.931706   6.193 5.90e-10 ***
rm           -0.587488   0.722847  -0.813  0.41637    
age           0.034189   0.013814   2.475  0.01333 *  
dis           0.738660   0.230275   3.208  0.00134 ** 
rad           0.666366   0.163152   4.084 4.42e-05 ***
tax          -0.006171   0.002955  -2.089  0.03674 *  
ptratio       0.402566   0.126627   3.179  0.00148 ** 
lstat         0.045869   0.054049   0.849  0.39608    
medv          0.180824   0.068294   2.648  0.00810 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 192.05  on 453  degrees of freedom
AIC: 218.05

Number of Fisher Scoring iterations: 9

m1$df_info
models.df <- rbind(models.df,m1$df_info)
m1_b <- build_model('m1_b', "target ~ . -rm", data = data_train)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8750  -0.1679  -0.0019   0.0032   3.4391  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -41.649437   6.503781  -6.404 1.51e-10 ***
zn           -0.068448   0.034414  -1.989 0.046707 *  
indus        -0.063406   0.047493  -1.335 0.181854    
chas          0.931110   0.757531   1.229 0.219020    
nox          48.092158   7.728183   6.223 4.88e-10 ***
age           0.028384   0.011619   2.443 0.014568 *  
dis           0.691454   0.218298   3.167 0.001538 ** 
rad           0.643883   0.159204   4.044 5.25e-05 ***
tax          -0.006310   0.002932  -2.152 0.031399 *  
ptratio       0.361822   0.113913   3.176 0.001492 ** 
lstat         0.062518   0.049592   1.261 0.207437    
medv          0.137881   0.041423   3.329 0.000873 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 192.71  on 454  degrees of freedom
AIC: 216.71

Number of Fisher Scoring iterations: 9

m1_b$df_info
models.df <- rbind(models.df,m1_b$df_info)
m1_c <- build_model('m1_c', "target ~ . -rm - chas", data = data_train)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8779  -0.1646  -0.0015   0.0030   3.4366  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -40.271093   6.335539  -6.356 2.07e-10 ***
zn           -0.074693   0.034472  -2.167 0.030252 *  
indus        -0.050727   0.045716  -1.110 0.267166    
nox          46.245279   7.498336   6.167 6.94e-10 ***
age           0.028882   0.011572   2.496 0.012564 *  
dis           0.666663   0.215818   3.089 0.002008 ** 
rad           0.691985   0.156587   4.419 9.91e-06 ***
tax          -0.007032   0.002876  -2.445 0.014490 *  
ptratio       0.335525   0.110981   3.023 0.002501 ** 
lstat         0.070234   0.049073   1.431 0.152367    
medv          0.138750   0.041554   3.339 0.000841 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 194.24  on 455  degrees of freedom
AIC: 216.24

Number of Fisher Scoring iterations: 9

m1_c$df_info
models.df <- rbind(models.df,m1_c$df_info)
m1_d <- build_model('m1_d', "target ~ . -rm -chas -indus", data = data_train)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9279  -0.1640  -0.0016   0.0027   3.3989  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -38.786168   6.161170  -6.295 3.07e-10 ***
zn           -0.072933   0.033022  -2.209 0.027199 *  
nox          43.014846   6.689945   6.430 1.28e-10 ***
age           0.028379   0.011401   2.489 0.012809 *  
dis           0.658786   0.214787   3.067 0.002161 ** 
rad           0.739415   0.152375   4.853 1.22e-06 ***
tax          -0.008045   0.002651  -3.035 0.002408 ** 
ptratio       0.334196   0.111780   2.990 0.002792 ** 
lstat         0.064814   0.048427   1.338 0.180769    
medv          0.138257   0.041630   3.321 0.000897 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 195.51  on 456  degrees of freedom
AIC: 215.51

Number of Fisher Scoring iterations: 9

m1_d$df_info
models.df <- rbind(models.df,m1_d$df_info)
m1_e <- build_model('m1_e', "target ~ . -rm -chas -indus -lstat", data = data_train)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8295  -0.1752  -0.0021   0.0032   3.4191  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -37.415922   6.035013  -6.200 5.65e-10 ***
zn           -0.068648   0.032019  -2.144  0.03203 *  
nox          42.807768   6.678692   6.410 1.46e-10 ***
age           0.032950   0.010951   3.009  0.00262 ** 
dis           0.654896   0.214050   3.060  0.00222 ** 
rad           0.725109   0.149788   4.841 1.29e-06 ***
tax          -0.007756   0.002653  -2.924  0.00346 ** 
ptratio       0.323628   0.111390   2.905  0.00367 ** 
medv          0.110472   0.035445   3.117  0.00183 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 197.32  on 457  degrees of freedom
AIC: 215.32

Number of Fisher Scoring iterations: 9

m1_e$df_info
models.df <- rbind(models.df,m1_e$df_info)

Here is a summary of the models built in this section.

models.df

Model 2: Observation Removal

Model 2 has the following characteristics:

  • The predictor chas has been removed
  • The target variable has been converted to a factor
  • Observations 338 and 457 have been removed as influential outliers

In the sections above we indicated that we were concerned with multicollinearity among the predictor variables. We look further into this by checking the Variance Inflation Factor (VIF), where values above 5 indicate collinearity that may be problematic for the model (a rough sketch of what the VIF measures follows below). Fitting Model 2 and checking the VIF, we see that the predictor medv has a value of about 9.
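The VIF for a predictor is 1/(1 - R²), where R² comes from regressing that predictor on the remaining predictors. A minimal sketch for medv, assuming data_train_m2 as prepared above; the result should be broadly in line with (though not identical to) the car::vif() output shown further below:

# R-squared from regressing medv on the other predictors (excluding the response)
r2_medv <- summary(lm(medv ~ . - target, data = data_train_m2))$r.squared
# variance inflation factor for medv: 1 / (1 - R^2)
1 / (1 - r2_medv)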
m2 <- build_model("m2", "target ~ .", data = data_train_m2)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.91120  -0.07794  -0.00019   0.00068   2.83800  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -48.713108   7.885103  -6.178 6.50e-10 ***
zn           -0.101575   0.041487  -2.448 0.014352 *  
indus        -0.070315   0.050020  -1.406 0.159802    
nox          58.567669   9.306145   6.293 3.11e-10 ***
rm           -1.059071   0.803827  -1.318 0.187659    
age           0.061760   0.016582   3.725 0.000196 ***
dis           1.056492   0.274292   3.852 0.000117 ***
rad           0.865495   0.181865   4.759 1.95e-06 ***
tax          -0.007048   0.003184  -2.214 0.026856 *  
ptratio       0.483564   0.141717   3.412 0.000644 ***
lstat        -0.015129   0.056362  -0.268 0.788376    
medv          0.235034   0.078814   2.982 0.002862 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 643.03  on 463  degrees of freedom
Residual deviance: 169.95  on 452  degrees of freedom
AIC: 193.95

Number of Fisher Scoring iterations: 9

m2$df_info
models.df <- rbind(models.df,m2$df_info)
car::vif(m2$model)
      zn    indus      nox       rm      age      dis      rad      tax 
2.004175 2.496646 4.715557 5.641760 3.043744 4.789220 2.071828 2.075384 
 ptratio    lstat     medv 
2.574732 2.459107 8.961585 

The above model indicates that the predictor medv can be problematic. medv is highly correlated with lstat and rm, neither of which is significant in the model, while medv is. Removing lstat and rm from the model addresses the collinearity issue, leaving all VIF values under 5.

#remove rm and lstat from model
m2_b <- build_model('m2_b', "target ~ . -rm - lstat", data = data_train_m2)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.91283  -0.09883  -0.00026   0.00102   2.90788  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -49.291862   7.725211  -6.381 1.76e-10 ***
zn           -0.102530   0.040852  -2.510 0.012080 *  
indus        -0.066558   0.049644  -1.341 0.180015    
nox          55.998677   8.903666   6.289 3.19e-10 ***
age           0.051060   0.012897   3.959 7.52e-05 ***
dis           0.953861   0.252702   3.775 0.000160 ***
rad           0.818482   0.175225   4.671 3.00e-06 ***
tax          -0.007207   0.003166  -2.276 0.022836 *  
ptratio       0.405181   0.123433   3.283 0.001029 ** 
medv          0.149231   0.041416   3.603 0.000314 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 643.03  on 463  degrees of freedom
Residual deviance: 171.78  on 454  degrees of freedom
AIC: 191.78

Number of Fisher Scoring iterations: 9

m2_b$df_info
models.df <- rbind(models.df,m2_b$df_info)

car::vif(m2_b$model)
      zn    indus      nox      age      dis      rad      tax  ptratio 
1.932216 2.461747 4.305506 1.906550 4.100342 1.928789 2.030718 1.994571 
    medv 
2.466408 

Backward elimination produces the final model:

#continue backwards elimination
m2_c <- build_model('m2_c', "target ~ . -rm -lstat -indus", data = data_train_m2)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.92152  -0.11310  -0.00030   0.00076   2.90631  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -47.265354   7.511377  -6.293 3.12e-10 ***
zn           -0.099818   0.038990  -2.560 0.010465 *  
nox          51.545583   8.013193   6.433 1.25e-10 ***
age           0.049284   0.012664   3.892 9.96e-05 ***
dis           0.938995   0.250466   3.749 0.000178 ***
rad           0.877730   0.171836   5.108 3.26e-07 ***
tax          -0.008424   0.002841  -2.965 0.003029 ** 
ptratio       0.402108   0.124753   3.223 0.001267 ** 
medv          0.151465   0.041492   3.650 0.000262 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 643.03  on 463  degrees of freedom
Residual deviance: 173.70  on 455  degrees of freedom
AIC: 191.7

Number of Fisher Scoring iterations: 9

m2_c$df_info
models.df <- rbind(models.df,m2_c$df_info)

Model 3: Transformations

Here we added transformed predictors with significant skew to the base model and iteratively eliminated features that were not statistically significant.

trans_models.df <- tibble(id=character(), formula=character(), res.deviance=numeric(), null.deviance=numeric(),
                 aic=numeric(), accuracy=numeric(), sensitivity=numeric(), specificity=numeric(),
                 precision=numeric())

We proceed to remove the variables that were identified earlier to have low correlation or significant multicollinearity.

data_train_trans <- data_train %>% select(-chas,-rm,-lstat)
mt1 <- build_model("mt1" ,"target ~ . + log(dis)+log(nox)", data=data_train_trans)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9702  -0.2420  -0.0019   0.0004   3.1989  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.447e+02  9.629e+01  -3.580 0.000343 ***
zn          -6.386e-02  3.319e-02  -1.924 0.054365 .  
indus        2.246e-03  6.007e-02   0.037 0.970181    
nox          3.904e+02  1.109e+02   3.522 0.000428 ***
age          4.388e-02  1.206e-02   3.637 0.000275 ***
dis         -2.965e+00  8.991e-01  -3.298 0.000972 ***
rad          8.151e-01  1.707e-01   4.775 1.80e-06 ***
tax         -6.786e-03  3.136e-03  -2.164 0.030493 *  
ptratio      4.289e-01  1.299e-01   3.302 0.000962 ***
medv         1.260e-01  3.962e-02   3.180 0.001471 ** 
log(dis)     1.543e+01  3.816e+00   4.044 5.25e-05 ***
log(nox)    -1.778e+02  5.633e+01  -3.156 0.001601 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 176.19  on 454  degrees of freedom
AIC: 200.19

Number of Fisher Scoring iterations: 9

mt1$df_info
trans_models.df  <- rbind(trans_models.df, mt1$df_info)

We see that the transformed predictors that were introduced are collinear with the original predictors. To deal with this issue we remove the original predictors and proceed.

car::vif(mt1$model)
        zn      indus        nox        age        dis        rad        tax 
  1.935681   3.682769 793.082457   1.830976  45.039191   1.773397   2.181768 
   ptratio       medv   log(dis)   log(nox) 
  1.958810   2.605745  57.709494 743.376530 
mt2 <- build_model("mt2","target ~ . + log(dis)+log(nox)-dis-nox", data=data_train_trans)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7852  -0.1535  -0.0023   0.0040   3.4072  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.635575   3.473533  -0.471 0.637735    
zn          -0.045662   0.028819  -1.584 0.113098    
indus       -0.007702   0.044940  -0.171 0.863930    
age          0.035947   0.011302   3.181 0.001470 ** 
rad          0.682013   0.158560   4.301 1.70e-05 ***
tax         -0.005718   0.002838  -2.015 0.043929 *  
ptratio      0.354516   0.112526   3.151 0.001630 ** 
medv         0.128618   0.037983   3.386 0.000709 ***
log(dis)     3.201847   0.875133   3.659 0.000254 ***
log(nox)    25.467685   3.985613   6.390 1.66e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 193.11  on 456  degrees of freedom
AIC: 213.11

Number of Fisher Scoring iterations: 9

mt2$df_info
trans_models.df  <- rbind(trans_models.df, mt2$df_info)
mt3 <- build_model("mt3","target ~ . + log(dis)+log(nox)-dis-nox-indus", data=data_train_trans)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8024  -0.1526  -0.0022   0.0037   3.4053  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.850196   3.234215  -0.572 0.567275    
zn          -0.045716   0.028709  -1.592 0.111298    
age          0.035896   0.011292   3.179 0.001478 ** 
rad          0.689837   0.152274   4.530 5.89e-06 ***
tax         -0.005869   0.002686  -2.185 0.028890 *  
ptratio      0.354632   0.112751   3.145 0.001659 ** 
medv         0.129441   0.037771   3.427 0.000610 ***
log(dis)     3.226553   0.864740   3.731 0.000191 ***
log(nox)    25.305601   3.862591   6.551 5.70e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 193.14  on 457  degrees of freedom
AIC: 211.14

Number of Fisher Scoring iterations: 9

mt3$df_info
trans_models.df  <- rbind(trans_models.df, mt3$df_info)
mt4 <- build_model("mt4","target ~ . + log(dis)+log(nox)-dis-nox-indus-zn", data=data_train_trans)

Call:
glm(formula = formula, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7287  -0.1640  -0.0074   0.0044   3.2107  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.873364   3.250243  -0.576 0.564361    
age          0.036544   0.011193   3.265 0.001095 ** 
rad          0.675544   0.144596   4.672 2.98e-06 ***
tax         -0.006190   0.002579  -2.400 0.016407 *  
ptratio      0.400345   0.109666   3.651 0.000262 ***
medv         0.123227   0.037168   3.315 0.000915 ***
log(dis)     2.931249   0.826099   3.548 0.000388 ***
log(nox)    25.875044   3.876728   6.674 2.48e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 196.15  on 458  degrees of freedom
AIC: 212.15

Number of Fisher Scoring iterations: 9

mt4$df_info
trans_models.df  <- rbind(trans_models.df, mt4$df_info)

Model Selection

When comparing the full model to its subsets, we see that residual deviance increases slightly as predictors are dropped, meaning the smaller models explain slightly less of the deviance. However, AIC decreases, meaning that less information is lost when using the smaller models. There is a minor accuracy penalty of 0.04% on the training data when using the smaller model, which is acceptable.

models.df

The models that used additional transformed features behaved differently: this time, both residual deviance and AIC increase as predictors are dropped.

trans_models.df

We see that mt1, which adds the log transformations of the predictors nox and dis, has the highest accuracy (93.1%) and the lowest AIC (200.19) of the models tested so far. In addition, mt1 delivers 1.5% more accuracy than the best model subset without transformations. While similar in accuracy to models mt2 and mt3, mt1 has the highest specificity and precision scores, which outweigh the minor decrease in sensitivity in our criteria for model selection. Even though we are aware that adding a transformation of a predictor to a model that already contains that predictor introduces multicollinearity, mt1 still delivers the best performance.

Our preference for this model is also confirmed by a likelihood ratio test comparing mt1 (the larger model) with mt4 (the smaller model).

When comparing a larger model to a smaller nested model, the null hypothesis is that the extra coefficients in the larger model are 0. In our case, the small p-value suggests that the extra coefficients are not 0, so we prefer the larger model.
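For reference, the statistic computed by anova() with test = "LRT" is the difference in residual deviances between the nested models, which under the null hypothesis follows an approximate chi-squared distribution:

$$\Lambda = D_{\text{mt4}} - D_{\text{mt1}} \;\sim\; \chi^2_{q}$$

where $q$ is the number of extra parameters in the larger model (here $q = 4$: dis, nox, indus and zn).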

anova(mt4$model, mt1$model, test = "LRT")

Therefore, our selected final model is mt1, which yields an accuracy of 93.1% on the training data.

Taking a look at the coefficients and marginal effects, we can now estimate the relative effect of each predictor on the odds of the target variable, holding the other predictors constant. The predictors nox and log(dis) have large exponentiated coefficients and appear to have a significant impact on the model. Interestingly and counterintuitively, the predictor log(nox) seems to have the largest single effect on whether a neighborhood has a crime rate above the median.
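Exponentiating the coefficients moves them from the log-odds scale to an odds-ratio scale: a one-unit increase in predictor $x_j$, holding the others fixed, multiplies the odds of being above the median crime rate by

$$e^{\beta_j} = \frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)}$$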

coefs <- as.data.frame(exp(mt1$model$coefficients))
coefs

We can now interpret the marginal effects as follows. For example, with each unit increase in distance to employment centers, neighborhoods are 17% less likely to be above the median crime rate. Note that the relative likelihood for nox is a very large percentage, but this is due to the scale of the units. The nitrogen oxide concentrations are measured in parts per million, so a unit change of 1 for this predictor is very large and represents an extrapolation beyond the observed range of the nox variable. The effect is also modulated by the log(nox) variable, which decreases its impact to some extent.
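For reference, the quantities computed below are average marginal effects: each coefficient is scaled by the mean of the logistic density evaluated at the fitted linear predictors,

$$\widehat{AME}_j = \hat{\beta}_j \cdot \frac{1}{n}\sum_{i=1}^{n} \lambda\!\left(x_i^{\top}\hat{\beta}\right), \qquad \lambda(z) = \frac{e^{z}}{\left(1+e^{z}\right)^{2}}$$

which is what the LogitScalar term in the code implements via dlogis().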

# Marginal effects
LogitScalar <- mean(dlogis(predict(mt1$model, type="link")))
marg <- LogitScalar * coef(mt1$model)
marg <- as.data.frame(marg)
marg[] <- lapply(marg, function(x) sprintf("%.4f", x))
marg

Looking at the marginal model plots, we see that for all variables the data follows the expected values of the model, with only slight, non-significant deviations for indus and age. We can therefore say that the model is a good fit for the given data.

marginalModelPlots(glm(target ~ . + log(dis)+log(nox), data=data_train_trans, family = binomial))

Based on model mt1, below is the distribution of predictions for the test dataset; we see that the predictions are fairly evenly split between the two classes of target.

data_test_trans <- data_test %>% select(-chas,-rm,-lstat)
test_predict <- predict(mt1$model, newdata=data_test_trans, type="response")  # predicted probabilities
test_predict <- ifelse(test_predict < .5, 0, 1)

data_test$target <- test_predict

ggplot(data_test, aes(x=seq_len(nrow(data_test)), y=target, color=factor(target))) + geom_point() +
  labs(x="Observation", y="target", title = "Model mt1 Prediction", colour = "target")

table(test_predict)
test_predict
 0  1 
22 18 
write.csv(test_predict, "CrimePredictions.csv")