In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0). Below is a short description of the variables of interest in the data set:
- zn: proportion of residential land zoned for large lots (over 25,000 square feet) (predictor variable)
- indus: proportion of non-retail business acres per suburb (predictor variable)
- chas: a dummy variable for whether the suburb borders the Charles River (1) or not (0) (predictor variable)
- nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
- rm: average number of rooms per dwelling (predictor variable)
- age: proportion of owner-occupied units built prior to 1940 (predictor variable)
- dis: weighted mean of distances to five Boston employment centers (predictor variable)
- rad: index of accessibility to radial highways (predictor variable)
- tax: full-value property-tax rate per $10,000 (predictor variable)
- ptratio: pupil-teacher ratio by town (predictor variable)
- black: 1000(Bk − 0.63)² where Bk is the proportion of blacks by town (predictor variable)
- lstat: lower status of the population (percent) (predictor variable)
- medv: median value of owner-occupied homes in $1000s (predictor variable)
- target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)

Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided).
Replicating our work requires the following R packages:
library(ggplot2)
library(cowplot)
library(dplyr)
library(tidyr)
library(corrplot)
library(randomForest)
library(forecast)
library(e1071)
library(MASS)
library(tidyverse)
library(broom)
library(car)
library(caret)
library(pscl)
library(pROC)
First, we read the data from a CSV file, then performed some simple statistical calculations to explore the data. Below is a sample of the data as it was read from the CSV.
| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
| 0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
| 0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
| 30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
| 0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
We can check each column for missing values (NAs) to see if we need to impute any of the variables. The table below shows the count of non-missing entries in each column:
| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 466 | 466 | 466 | 466 | 466 | 466 | 466 | 466 | 466 | 466 | 466 | 466 | 466 |
As we can see, every column has a complete set of 466 entries, so imputation will not be necessary.
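This completeness check can be reproduced with a one-liner (a sketch; the original code was not shown):

colSums(!is.na(training))  # count of non-missing values per column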
We then calculated the mean and standard deviation for each data vector:
|  | means | sds |
|---|---|---|
| zn | 11.5772532 | 23.3646511 |
| indus | 11.1050215 | 6.8458549 |
| chas | 0.0708155 | 0.2567920 |
| nox | 0.5543105 | 0.1166667 |
| rm | 6.2906738 | 0.7048513 |
| age | 68.3675966 | 28.3213784 |
| dis | 3.7956929 | 2.1069496 |
| rad | 9.5300429 | 8.6859272 |
| tax | 409.5021459 | 167.9000887 |
| ptratio | 18.3984979 | 2.1968447 |
| lstat | 12.6314592 | 7.1018907 |
| medv | 22.5892704 | 9.2396814 |
| target | 0.4914163 | 0.5004636 |
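A minimal sketch of how this table can be reproduced (the original code was not shown):

round(data.frame(means = sapply(training, mean),
                 sds   = sapply(training, sd)), 7)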
Below is a bar chart that illustrates the mean and standard deviation for each of our data vectors. As we can see, the tax vector is on a totally different scale of magnitude than the rest. Models involving this variable will benefit from normalization or scaling.
The following histograms help visualize the spread and skewness of the raw data.
ggplot(data = gather(training), mapping = aes(x = value)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="lightgrey")+
facet_wrap(~key, ncol = 2, scales = 'free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see our correlation matrix below. A dark blue circle represents a strong positive relationship and a dark red circle represents a strong negative relationship between two variables. We can see that indus, nox, and dis show the strongest collinearity with target and with one another; these variables are likewise the best single predictors of the target value. Note that this plot only includes rows that have data in each column.
Finally, we can use the randomForest package to verify our assumptions from the correlation plot.
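The exact call is echoed in the warning below; as a runnable sketch (assuming training2 holds the predictors and target is the numeric response):

rf <- randomForest(training2, target, importance = TRUE, ntree = 1000)
varImpPlot(rf)  # visualize variable importance from the forest (our addition)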
## Warning in randomForest.default(training2, target, importance = TRUE, ntree
## = 1000): The response has five or fewer unique values. Are you sure you
## want to do regression?
We verified our assumptions above using a random forest of 1,000 trees. The variables nox, rad, indus, and tax have the most effect. While dis is strongly collinear, it has less effect on the target; this is likely because it encodes information stored redundantly in another variable.
In the following section, we will prepare and transform our variables for our model:
While logistic modeling does not require normalized data, we chose to apply log transformations to age, lstat, and tax to test whether these changes improve our model.
training2 <- training2 %>%
mutate_at(.vars = vars(age, lstat, tax), .funs = log)
training2 %>% dplyr::select(age, lstat,tax) %>% gather() %>% ggplot(mapping = aes(x = value)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="lightgrey")+
facet_wrap(~key, ncol = 2, scales = 'free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This transformation helps center age and brings lstat and tax closer to normality.
We additionally chose to derive several new variables from our initial data set.
rad

Rad is an index variable that represents accessibility to radial highways. We chose to dichotomize this variable at its median value, 5.
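One way this split could be implemented (a sketch; the exact code was not shown, and the behavior at the median is inferred from the transformed sample further below):

training2 <- training2 %>%
  mutate(rad = as.integer(rad >= 5))  # 1 = at or above the median (5), 0 = below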
Our new variable for rad now looks like this:
ggplot(training2, aes(x=rad))+geom_bar()
ptratio

We first changed ptratio, a pupil-teacher ratio measurement, into a categorical variable. In the new variable, 1 represents small, 2 represents medium, and 3 represents large ratios.
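A plausible reconstruction of this binning, assuming tercile cut points (the actual breaks were not shown):

training2 <- training2 %>%
  mutate(ptratio = cut(ptratio,
                       breaks = quantile(ptratio, probs = c(0, 1/3, 2/3, 1)),
                       labels = c(1, 2, 3),  # small, medium, large (assumed breaks)
                       include.lowest = TRUE))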
Our new variable for ptratio now looks like this:
indus

This variable represents the proportion of non-retail business acres per suburb. The plots below show that the indus data is bimodal, skewed right, and centered around 10. The red line shows the median, whereas the blue line depicts the mean value for this variable.
We chose to dichotomize this variable at its median value.
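A minimal sketch of the median split (the exact code was not shown):

training2 <- training2 %>%
  mutate(indus = as.integer(indus > median(indus)))  # 1 = above the median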
As a result of these transformations, our data now looks like this:
| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 1 | 403 | 1 | 3.70 | 50.0 | 1 |
| 0 | 1 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 1 | 403 | 1 | 26.82 | 13.4 | 1 |
| 0 | 1 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 1 | 666 | 3 | 18.85 | 15.4 | 1 |
| 30 | 0 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 1 | 300 | 2 | 5.19 | 23.7 | 0 |
| 0 | 0 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 0 | 193 | 2 | 4.82 | 37.9 | 0 |
| 0 | 0 | 0 | 0.520 | 6.781 | 71.3 | 2.8561 | 1 | 384 | 3 | 7.67 | 26.5 | 0 |
model_1 is a basic model: we use all variables, with no transformations applied, and select predictors by backward elimination.
training$target = as.factor(training$target)
model_1<- step(glm(target~., data = training, family = 'binomial'), direction = "backward")
## Start: AIC=218.05
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
##
## Df Deviance AIC
## - rm 1 192.71 216.71
## - lstat 1 192.77 216.77
## - chas 1 193.53 217.53
## - indus 1 193.99 217.99
## <none> 192.05 218.05
## - tax 1 196.59 220.59
## - zn 1 196.89 220.89
## - age 1 198.73 222.73
## - medv 1 199.95 223.95
## - ptratio 1 203.32 227.32
## - dis 1 203.84 227.84
## - rad 1 233.74 257.74
## - nox 1 265.05 289.05
##
## Step: AIC=216.71
## target ~ zn + indus + chas + nox + age + dis + rad + tax + ptratio +
## lstat + medv
##
## Df Deviance AIC
## - chas 1 194.24 216.24
## - lstat 1 194.32 216.32
## - indus 1 194.58 216.58
## <none> 192.71 216.71
## - tax 1 197.59 219.59
## - zn 1 198.07 220.07
## - age 1 199.11 221.11
## - ptratio 1 203.53 225.53
## - dis 1 203.85 225.85
## - medv 1 205.35 227.35
## - rad 1 233.81 255.81
## - nox 1 265.14 287.14
##
## Step: AIC=216.24
## target ~ zn + indus + nox + age + dis + rad + tax + ptratio +
## lstat + medv
##
## Df Deviance AIC
## - indus 1 195.51 215.51
## <none> 194.24 216.24
## - lstat 1 196.33 216.33
## - zn 1 200.59 220.59
## - tax 1 200.75 220.75
## - age 1 201.00 221.00
## - ptratio 1 203.94 223.94
## - dis 1 204.83 224.83
## - medv 1 207.12 227.12
## - rad 1 241.41 261.41
## - nox 1 265.19 285.19
##
## Step: AIC=215.51
## target ~ zn + nox + age + dis + rad + tax + ptratio + lstat +
## medv
##
## Df Deviance AIC
## - lstat 1 197.32 215.32
## <none> 195.51 215.51
## - zn 1 202.05 220.05
## - age 1 202.23 220.23
## - ptratio 1 205.01 223.01
## - dis 1 205.96 223.96
## - tax 1 206.60 224.60
## - medv 1 208.13 226.13
## - rad 1 249.55 267.55
## - nox 1 270.59 288.59
##
## Step: AIC=215.32
## target ~ zn + nox + age + dis + rad + tax + ptratio + medv
##
## Df Deviance AIC
## <none> 197.32 215.32
## - zn 1 203.45 219.45
## - ptratio 1 206.27 222.27
## - age 1 207.13 223.13
## - tax 1 207.62 223.62
## - dis 1 207.64 223.64
## - medv 1 208.65 224.65
## - rad 1 250.98 266.98
## - nox 1 273.18 289.18
summary(model_1)
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = "binomial", data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8295 -0.1752 -0.0021 0.0032 3.4191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## zn -0.068648 0.032019 -2.144 0.03203 *
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
vif(model_1)
## zn nox age dis rad tax ptratio medv
## 1.789037 3.172660 1.701974 3.595939 1.697110 1.754274 1.865085 2.193689
There is no significant multicollinearity detected in model_1.
We check model_1 against the following logistic regression assumptions.
First, we check for a linear relationship between the logit of the outcome and each predictor variable.
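A sketch of this check, plotting each numeric predictor against the logit of the fitted probabilities (the original code was not shown):

probabilities <- predict(model_1, type = "response")
training %>%
  dplyr::select(-target, -chas) %>%   # numeric predictors only; chas is a dummy
  mutate(logit = log(probabilities / (1 - probabilities))) %>%
  gather(key = "predictor", value = "value", -logit) %>%
  ggplot(aes(logit, value)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") +
  facet_wrap(~predictor, scales = "free_y")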
Not all of the relationships are linear, so the model may benefit from variable transformations.
Next, we check model_1 for the presence of influential values.
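The tibbles below appear to come from broom's augment(); a sketch of the check, listing the five largest Cook's distances and then any standardized residuals exceeding 3 in absolute value:

model_1_data <- augment(model_1) %>% mutate(index = 1:n())
model_1_data %>% top_n(5, .cooksd)            # five most influential points
model_1_data %>% filter(abs(.std.resid) > 3)  # potential outliers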
## # A tibble: 5 x 17
## target zn nox age dis rad tax ptratio medv .fitted .se.fit
## <fct> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 22 0.431 8.4 8.91 7 330 19.1 42.8 -0.941 0.970
## 2 1 0 0.544 37.8 2.52 4 304 18.4 16.1 -2.96 0.706
## 3 1 22 0.431 34.9 8.06 7 330 19.1 24.3 -2.67 0.652
## 4 1 20 0.464 42.1 4.43 3 223 18.6 21.1 -5.84 0.936
## 5 1 0 0.489 9.8 3.59 4 277 18.6 23.7 -4.42 0.832
## # ... with 6 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, .std.resid <dbl>, index <int>
## # A tibble: 1 x 17
## target zn nox age dis rad tax ptratio medv .fitted .se.fit
## <fct> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 20 0.464 42.1 4.43 3 223 18.6 21.1 -5.84 0.936
## # ... with 6 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, .std.resid <dbl>, index <int>
We eliminate the row with the influential value from the training data set.
training_clean <-training %>%
filter(!(nox==0.464 & age==42.1))
We then build model_2 on the data set with the influential observation removed.
model_2<- step(glm(target~., data = training_clean, family = 'binomial'), direction = "backward")
## Start: AIC=204.95
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
##
## Df Deviance AIC
## - lstat 1 179.53 203.53
## - rm 1 179.86 203.86
## - chas 1 180.40 204.40
## <none> 178.95 204.95
## - indus 1 181.26 205.26
## - tax 1 182.93 206.93
## - zn 1 186.28 210.28
## - age 1 187.56 211.56
## - medv 1 188.43 212.43
## - ptratio 1 190.95 214.95
## - dis 1 194.36 218.36
## - rad 1 221.84 245.84
## - nox 1 258.08 282.08
##
## Step: AIC=203.53
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + medv
##
## Df Deviance AIC
## - chas 1 181.28 203.28
## - rm 1 181.38 203.38
## <none> 179.53 203.53
## - indus 1 181.76 203.76
## - tax 1 183.26 205.26
## - zn 1 186.46 208.46
## - medv 1 189.16 211.16
## - ptratio 1 192.50 214.50
## - age 1 192.97 214.97
## - dis 1 195.50 217.50
## - rad 1 222.50 244.50
## - nox 1 259.97 281.97
##
## Step: AIC=203.28
## target ~ zn + indus + nox + rm + age + dis + rad + tax + ptratio +
## medv
##
## Df Deviance AIC
## - indus 1 182.79 202.79
## <none> 181.28 203.28
## - rm 1 183.38 203.38
## - tax 1 186.60 206.60
## - zn 1 189.44 209.44
## - medv 1 191.08 211.08
## - ptratio 1 193.09 213.09
## - age 1 195.88 215.88
## - dis 1 196.73 216.73
## - rad 1 232.03 252.03
## - nox 1 260.00 280.00
##
## Step: AIC=202.79
## target ~ zn + nox + rm + age + dis + rad + tax + ptratio + medv
##
## Df Deviance AIC
## - rm 1 184.66 202.66
## <none> 182.79 202.79
## - zn 1 191.25 209.25
## - medv 1 192.17 210.17
## - tax 1 192.67 210.67
## - ptratio 1 194.12 212.12
## - age 1 196.72 214.72
## - dis 1 197.96 215.96
## - rad 1 240.79 258.79
## - nox 1 266.03 284.03
##
## Step: AIC=202.66
## target ~ zn + nox + age + dis + rad + tax + ptratio + medv
##
## Df Deviance AIC
## <none> 184.66 202.66
## - zn 1 193.60 209.60
## - ptratio 1 194.15 210.15
## - tax 1 194.85 210.85
## - age 1 196.83 212.83
## - dis 1 198.25 214.25
## - medv 1 198.43 214.43
## - rad 1 240.96 256.96
## - nox 1 266.17 282.17
summary(model_2)
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = "binomial", data = training_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8555 -0.1501 -0.0006 0.0014 3.1726
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -42.128250 6.730139 -6.260 3.86e-10 ***
## zn -0.093902 0.036896 -2.545 0.010927 *
## nox 47.388109 7.340291 6.456 1.08e-10 ***
## age 0.038718 0.011711 3.306 0.000946 ***
## dis 0.807838 0.235713 3.427 0.000610 ***
## rad 0.806479 0.161555 4.992 5.98e-07 ***
## tax -0.007945 0.002739 -2.900 0.003728 **
## ptratio 0.350885 0.117755 2.980 0.002884 **
## medv 0.130814 0.038521 3.396 0.000684 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 644.45 on 464 degrees of freedom
## Residual deviance: 184.66 on 456 degrees of freedom
## AIC: 202.66
##
## Number of Fisher Scoring iterations: 9
vif(model_2)
## zn nox age dis rad tax ptratio medv
## 1.960798 3.435929 1.773456 4.001177 1.770018 1.759468 1.985662 2.336037
There is no significant multicollinearity detected in model_2.
model_3 is built on the most important variables, as ranked by the caret package function varImp(); the two least important variables, lstat and rm, are dropped before backward elimination.
control <- trainControl(method="repeatedcv", number=10, repeats=3)
model <- train(target~., data=training, method="glm", trControl=control)
importance <- varImp(model, scale=FALSE)
print(importance)
## glm variable importance
##
## Overall
## nox 6.1932
## rad 4.0843
## dis 3.2077
## ptratio 3.1791
## medv 2.6477
## age 2.4749
## tax 2.0887
## zn 1.9029
## indus 1.3568
## chas 1.2054
## lstat 0.8486
## rm 0.8127
plot(importance)
model_3<- step(glm(target~., data = training %>% dplyr::select(-lstat, -rm), family = 'binomial'), direction = "backward")
## Start: AIC=216.32
## target ~ zn + indus + chas + nox + age + dis + rad + tax + ptratio +
## medv
##
## Df Deviance AIC
## - indus 1 195.97 215.97
## <none> 194.32 216.32
## - chas 1 196.33 216.33
## - tax 1 198.83 218.83
## - zn 1 199.29 219.29
## - age 1 203.62 223.62
## - ptratio 1 204.88 224.88
## - dis 1 205.44 225.44
## - medv 1 205.98 225.98
## - rad 1 235.07 255.07
## - nox 1 267.08 287.08
##
## Step: AIC=215.97
## target ~ zn + chas + nox + age + dis + rad + tax + ptratio +
## medv
##
## Df Deviance AIC
## - chas 1 197.32 215.32
## <none> 195.97 215.97
## - zn 1 201.29 219.29
## - age 1 205.01 223.01
## - tax 1 205.20 223.20
## - ptratio 1 205.90 223.90
## - dis 1 206.82 224.82
## - medv 1 207.65 225.65
## - rad 1 244.73 262.73
## - nox 1 273.06 291.06
##
## Step: AIC=215.32
## target ~ zn + nox + age + dis + rad + tax + ptratio + medv
##
## Df Deviance AIC
## <none> 197.32 215.32
## - zn 1 203.45 219.45
## - ptratio 1 206.27 222.27
## - age 1 207.13 223.13
## - tax 1 207.62 223.62
## - dis 1 207.64 223.64
## - medv 1 208.65 224.65
## - rad 1 250.98 266.98
## - nox 1 273.18 289.18
summary(model_3)
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = "binomial", data = training %>% dplyr::select(-lstat,
## -rm))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8295 -0.1752 -0.0021 0.0032 3.4191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## zn -0.068648 0.032019 -2.144 0.03203 *
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
vif(model_3)
## zn nox age dis rad tax ptratio medv
## 1.789037 3.172660 1.701974 3.595939 1.697110 1.754274 1.865085 2.193689
There is no significant multicollinearity detected in model_3.
model_4 is selected by minimizing the Akaike information criterion (AIC), using stepAIC() from the MASS package.
model_4 <- glm(target ~., data = training, family = binomial) %>%
stepAIC(trace = FALSE)
summary(model_4)
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = binomial, data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8295 -0.1752 -0.0021 0.0032 3.4191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## zn -0.068648 0.032019 -2.144 0.03203 *
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
vif(model_4)
## zn nox age dis rad tax ptratio medv
## 1.789037 3.172660 1.701974 3.595939 1.697110 1.754274 1.865085 2.193689
There is no significant multicollinearity detected in model_4.
model_5 is built on the transformed data produced in the Data Preparation section.
model_5 <- step(glm(target ~., data = transformed.data, family = 'binomial'), direction = "backward")
## Start: AIC=264.55
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
##
## Df Deviance AIC
## - rm 1 238.55 262.55
## - ptratio 1 239.06 263.06
## - lstat 1 239.08 263.08
## - age 1 239.92 263.92
## <none> 238.55 264.55
## - rad 1 240.70 264.70
## - chas 1 245.73 269.73
## - medv 1 247.46 271.46
## - indus 1 248.57 272.57
## - dis 1 249.74 273.74
## - zn 1 251.88 275.88
## - tax 1 253.88 277.88
## - nox 1 319.56 343.56
##
## Step: AIC=262.55
## target ~ zn + indus + chas + nox + age + dis + rad + tax + ptratio +
## lstat + medv
##
## Df Deviance AIC
## - ptratio 1 239.06 261.06
## - lstat 1 239.28 261.28
## - age 1 240.21 262.21
## <none> 238.55 262.55
## - rad 1 240.70 262.70
## - chas 1 245.73 267.73
## - indus 1 248.65 270.65
## - dis 1 250.09 272.09
## - zn 1 251.94 273.94
## - tax 1 254.83 276.83
## - medv 1 256.01 278.01
## - nox 1 319.81 341.81
##
## Step: AIC=261.06
## target ~ zn + indus + chas + nox + age + dis + rad + tax + lstat +
## medv
##
## Df Deviance AIC
## - lstat 1 239.80 259.80
## - age 1 240.37 260.37
## <none> 239.06 261.06
## - rad 1 241.66 261.66
## - chas 1 245.95 265.95
## - dis 1 250.09 270.09
## - indus 1 250.62 270.62
## - zn 1 254.57 274.57
## - medv 1 256.49 276.49
## - tax 1 258.13 278.13
## - nox 1 320.41 340.41
##
## Step: AIC=259.8
## target ~ zn + indus + chas + nox + age + dis + rad + tax + medv
##
## Df Deviance AIC
## <none> 239.80 259.80
## - age 1 242.02 260.02
## - rad 1 242.54 260.54
## - chas 1 247.49 265.49
## - dis 1 250.96 268.96
## - indus 1 251.17 269.17
## - zn 1 254.89 272.89
## - medv 1 259.55 277.55
## - tax 1 259.98 277.98
## - nox 1 324.18 342.18
summary(model_5)
##
## Call:
## glm(formula = target ~ zn + indus + chas + nox + age + dis +
## rad + tax + medv, family = "binomial", data = transformed.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.85033 -0.30530 -0.00586 0.22881 3.12422
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.073893 4.119934 -7.300 2.89e-13 ***
## zn -0.090604 0.032465 -2.791 0.00526 **
## indus -1.553566 0.480927 -3.230 0.00124 **
## chas 1.852943 0.669088 2.769 0.00562 **
## nox 42.032693 6.191242 6.789 1.13e-11 ***
## age 0.013685 0.009269 1.477 0.13980
## dis 0.638574 0.194855 3.277 0.00105 **
## rad -0.632071 0.385629 -1.639 0.10120
## tax 0.007368 0.001761 4.184 2.87e-05 ***
## medv 0.116715 0.028411 4.108 3.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 239.80 on 456 degrees of freedom
## AIC: 259.8
##
## Number of Fisher Scoring iterations: 8
vif(model_5)
## zn indus chas nox age dis rad tax
## 1.440095 2.150958 1.188508 3.437939 1.558044 3.079032 1.392115 1.962191
## medv
## 1.886436
There is no significant multicollinearity detected in model_5.
AIC, BIC, log-likelihood, and McFadden's pseudo-R² were used to select the best model.
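A sketch of how this comparison table can be assembled, using pR2() from the pscl package for McFadden's pseudo-R² (the original code was not shown):

models <- list(model_1 = model_1, model_2 = model_2, model_3 = model_3,
               model_4 = model_4, model_5 = model_5)
round(t(sapply(models, function(m)
  c(AIC = AIC(m), BIC = BIC(m),
    loglik = as.numeric(logLik(m)),
    pseudoR2 = unname(pR2(m)["McFadden"])))), 4)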
## AIC BIC loglik pseudoR2
## model_1 215.3229 252.6205 -98.66143 0.6944879
## model_2 202.6583 239.9366 -92.32914 0.7134650
## model_3 215.3229 252.6205 -98.66143 0.6944879
## model_4 215.3229 252.6205 -98.66143 0.6944879
## model_5 259.8032 301.2451 -119.90161 0.6287162
model_2 is the best model considering AIC, BIC, log-likelihood, and McFadden's pseudo-R²: it has the lowest AIC and BIC and the highest log-likelihood and pseudo-R², which is indicative of a superior fit over all the other models. That said, selecting on in-sample fit alone can favor an overfitted model, so we also assess out-of-sample performance below.
We will choose model_2 as the best model for this assignment.
We split the data set into train and test partitions in order to assess model_2.
set.seed(123)
training.samples <- training$target %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- training[training.samples, ]
test.data <- training[-training.samples, ]
ROC curve of model_2:
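A sketch of how the ROC curve can be produced with pROC, assuming model_2's formula is refit on train.data and scored on test.data (the original code was not shown):

m2 <- glm(target ~ zn + nox + age + dis + rad + tax + ptratio + medv,
          data = train.data, family = binomial)
test.data$prob <- predict(m2, newdata = test.data, type = "response")
roc_2 <- roc(test.data$target, test.data$prob)
plot(roc_2, print.auc = TRUE)  # draw the curve and print the AUC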
Let's choose a cutoff probability for classifying a neighborhood as having a high or low crime rate.
## [1] 0.6335833
This value is reasonably close to 50%, so let's use 50% as the cutoff.
Confusion matrix of model_2
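A sketch of the confusion matrix with caret, reusing the test-set probabilities from the ROC step above:

pred_class <- factor(as.integer(test.data$prob > 0.5), levels = c(0, 1))
confusionMatrix(pred_class, test.data$target, positive = "1")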
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 41 5
## 1 6 40
##
## Accuracy : 0.8804
## 95% CI : (0.7961, 0.9388)
## No Information Rate : 0.5109
## P-Value [Acc > NIR] : 5.644e-14
##
## Kappa : 0.7609
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8889
## Specificity : 0.8723
## Pos Pred Value : 0.8696
## Neg Pred Value : 0.8913
## Prevalence : 0.4891
## Detection Rate : 0.4348
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8806
##
## 'Positive' Class : 1
##
## $accuracy
## [1] 0.8804348
##
## $error_rate
## [1] 0.1195652
##
## $precision
## [1] 0.8695652
##
## $sensitivity
## [1] 0.8888889
##
## $specificity
## [1] 0.8723404
##
## $F1
## [1] 0.8791209
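Finally, we score the evaluation data set with model_2. A sketch, assuming the evaluation file was read into an object named evaluation (hypothetical name) with the same columns as the training data:

evaluation$predict_prob <- predict(model_2, newdata = evaluation, type = "response")
evaluation$predict_target <- as.integer(evaluation$predict_prob > 0.5)
head(evaluation)  # shown below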
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 2 0 8.14 0 0.538 6.096 84.5 4.4619 4 307 21.0 10.26 18.2
## 3 0 8.14 0 0.538 6.495 94.4 4.4547 4 307 21.0 12.80 18.4
## 4 0 8.14 0 0.538 5.950 82.0 3.9900 4 307 21.0 27.71 13.2
## 5 0 5.96 0 0.499 5.850 41.5 3.9342 5 279 19.2 8.77 21.0
## 6 25 5.13 0 0.453 5.741 66.2 7.2254 8 284 19.7 13.15 18.7
## predict_prob predict_target
## 1 0.04523037 0
## 2 0.68543918 1
## 3 0.76540370 1
## 4 0.41260604 0
## 5 0.08341741 0
## 6 0.25830142 0