DATA 621 Homework #3
library(tidyverse)
library(kableExtra)
library(visdat)
library(DT)
library(psych)
library(corrplot)
library(MASS)
library(matrixStats)
library(pacman)
library(bestglm)
library(glmnet)
library(AICcmodavg)
library(RcmdrMisc)
library(caret)
p_load(Hmisc, xlsx, xtable, knitr, scales, magrittr, tidyverse, stringr, e1071, corrplot, knitcitations, bibtex, missForest, abc,
foreach, doParallel, stargazer, forecast, matrixStats, glmulti, leaps, data.table, highlight, car, Amelia, caret)
Overview
In this homework assignment, you will explore, analyze and model a dataset containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training dataset to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation dataset using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the dataset:
| Variable Name | Definition | Variable Type |
|---|---|---|
| zn | proportion of residential land zoned for large lots (over 25000 square feet) | predictor |
| indus | proportion of non-retail business acres per suburb | predictor |
| chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) | predictor |
| nox | nitrogen oxides concentration (parts per 10 million) | predictor |
| rm | average number of rooms per dwelling | predictor |
| age | proportion of owner-occupied units built prior to 1940 | predictor |
| dis | weighted mean of distances to five Boston employment centers | predictor |
| rad | index of accessibility to radial highways | predictor |
| tax | full-value property-tax rate per $10,000 | predictor |
| ptratio | pupil-teacher ratio by town | predictor |
| black | 1000(Bk - 0.63)2 where Bk is the proportion of blacks by town | predictor |
| lstat | lower status of the population (percent) | predictor |
| medv | median value of owner-occupied homes in $1000s | predictor |
| target | whether the crime rate is above the median crime rate (1) or not (0) | response |
Deliverables
A write-up of your solutions submitted in PDF format, plus the assigned predictions (probabilities and classifications) for the evaluation dataset, using a 0.5 classification threshold.
Data Exploration
We have two datasets. The training set includes 12 candidate predictors, 1 response variable, and 466 observations. The evaluation set includes only the 12 candidate predictors and has 40 observations.
We are going to study their missing values, data types and data statistics.
data_t <- read_csv("https://raw.githubusercontent.com/Rajwantmishra/DATA621_CR4/master/HW3/crime-training-data_modified.csv") %>%
dplyr::select(target, everything())
data_e <- read_csv('https://raw.githubusercontent.com/Rajwantmishra/DATA621_CR4/master/HW3/crime-evaluation-data_modified.csv')
Missing Values & Data Type Check
In the training set, there are 12 candidate predictors and 1 response variable with 466 observations. In the evaluation set, there are 12 candidate predictors with 40 observations. Neither dataset has missing values (e.g., NA, NULL, or ''). However, the variable black, which is described in the overview section, is not present in either dataset.
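As a quick programmatic confirmation of the missing-value claim (a minimal sketch; the visual check with visdat follows below):
# count missing values per column in both sets; all zeros confirm no NAs
colSums(is.na(data_t))
colSums(is.na(data_e))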
Among the 12 candidate predictors, 1 is categorical (chas) and the other 11 are continuous numerical. The response variable target is categorical.
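The structure summaries below appear to come from dplyr::glimpse(); the chunk itself is not shown in the rendered output, but it was presumably along the lines of:
glimpse(data_t)
glimpse(data_e)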
## Observations: 466
## Variables: 13
## $ target <dbl> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,...
## $ zn <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20...
## $ indus <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, ...
## $ chas <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.5...
## $ rm <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.3...
## $ age <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19...
## $ dis <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6...
## $ rad <dbl> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 2...
## $ tax <dbl> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398,...
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4,...
## $ lstat <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9...
## $ medv <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 2...
## Observations: 40
## Variables: 12
## $ zn <dbl> 0, 0, 0, 0, 0, 25, 25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22,...
## $ indus <dbl> 7.07, 8.14, 8.14, 8.14, 5.96, 5.13, 5.13, 4.49, 4.49, 2.89,...
## $ chas <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ nox <dbl> 0.469, 0.538, 0.538, 0.538, 0.499, 0.453, 0.453, 0.449, 0.4...
## $ rm <dbl> 7.185, 6.096, 6.495, 5.950, 5.850, 5.741, 5.966, 6.630, 6.1...
## $ age <dbl> 61.1, 84.5, 94.4, 82.0, 41.5, 66.2, 93.4, 56.1, 56.8, 69.6,...
## $ dis <dbl> 4.9671, 4.4619, 4.4547, 3.9900, 3.9342, 7.2254, 6.8185, 4.4...
## $ rad <dbl> 2, 4, 4, 4, 5, 8, 8, 3, 3, 2, 2, 2, 4, 5, 5, 4, 8, 8, 7, 1,...
## $ tax <dbl> 242, 307, 307, 307, 279, 284, 284, 247, 247, 276, 188, 188,...
## $ ptratio <dbl> 17.8, 21.0, 21.0, 21.0, 19.2, 19.7, 19.7, 18.5, 18.5, 18.0,...
## $ lstat <dbl> 4.03, 10.26, 12.80, 27.71, 8.77, 13.15, 14.44, 6.53, 8.44, ...
## $ medv <dbl> 34.7, 18.2, 18.4, 13.2, 21.0, 18.7, 16.0, 26.6, 22.2, 21.4,...
library(gridExtra)
p_t_dt <- vis_dat(data_t)
p_t_m <- vis_miss(data_t)
p_e_dt <- vis_dat(data_e)
p_e_m <- vis_miss(data_e)
grid.arrange(p_t_m,p_e_m, p_t_dt,p_e_dt,ncol = 2,
widths = c(1,1),
heights = c(1.5,1),
top = 'Missing Values & Data Type Check
Training Set Evaluation Set')
Figure 1: Missing Values & Data Type Check
Below is a summary of the datasets and some inferences drawn from them.
- It seems there are no NULL values in the predictor or response variables.
- The variables are on different scales.
- The categorical variables are chas and target.
- There are a total of 466 observations, 12 predictor variables, and 1 response variable.
Data Statistics Summary
A binary logistic regression model will be built using the training set; therefore, the training set is used for the following data exploration.
The data types in the raw dataset are all ‘doubles’, however the candidate predictor chas and the response variable target are categorical, therefore, we update the data types of these two variables to ‘factors’.
data_t_mod <- data_t %>%
mutate(chas = as.factor(chas),
target = as.factor(target)) %>%
dplyr::select(target, everything())
DT::datatable(data_t_mod)
Table 3: training set
The statistics of all variables are listed below:
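The summary below appears to be base summary() applied to the factor-converted training set; the chunk is not shown in the rendered output, but it was presumably:
summary(data_t_mod)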
## target zn indus chas nox
## 0:237 Min. : 0.00 Min. : 0.460 0:433 Min. :0.3890
## 1:229 1st Qu.: 0.00 1st Qu.: 5.145 1: 33 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
The box plot below shows that outliers exist in the variables zn, rm, dis, lstat, and medv. We use the scaled training set to draw the box plot so that the outliers are shown on a comparable scale.
data_t %>%
scale() %>%
as.data.frame() %>%
stack() %>%
ggplot(aes(x = ind, y = values)) +
geom_boxplot(fill = 'deeppink4') +
labs(title = 'Boxplot: Scaled Training Set',
x = 'Variables',
y = 'Normalized_Values')+
theme(panel.background = element_rect(fill = 'grey'))
Figure 2: Boxplot: Scaled Training Set
The scaled histogram and density plot show that the variables zn, nox, dis, lstat, and medv are right skewed; age and ptratio are left skewed; rad and tax are bimodal; target and chas are categorical, with target roughly balanced while chas is highly imbalanced; the rest are close to normal.
data_t %>%
scale() %>%
as.data.frame() %>%
stack() %>%
ggplot(aes(x = values)) +
geom_histogram(fill = 'deeppink4', color = 'black') +
labs(title = 'Histogram: Training Set')+
theme(panel.background = element_rect(fill = 'grey'))+
facet_wrap(~ind, scale='free', ncol = 4)
Figure 3: Histogram: Training Set
data_t %>%
select_if(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(x=value)) + # Plot the values
facet_wrap(~key, scales = "free") + # In separate panels
geom_density()
Figure 4: Density Plot: Training Set
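To put numbers on the skewness described above, a minimal sketch using e1071::skewness (already loaded via p_load; not part of the original chunks):
# skewness of the numeric predictors: positive values are right skewed, negative left skewed
data_t %>%
dplyr::select(-target, -chas) %>%
summarise_all(e1071::skewness) %>%
gather(Variable, Skewness) %>%
arrange(desc(Skewness))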
The correlation matrix below shows that the response variable target has a strong positive relationship (>= 0.6) with the variables rad, tax, age, indus, and nox, and a strong negative relationship (<= -0.6) with the variable dis.
Meanwhile, it is worth noting that some pairs of candidate predictors are strongly correlated with each other, such as rad and tax (0.92), indus and nox (0.76), and nox and dis (-0.77).
data_t %>%
cor() %>%
#corrplot(method = "square", type = "upper", order = 'hclust', tl.col = "black", diag = FALSE, bg= 'white', col = colorRampPalette(c('deeppink4','white','steelblue1'))(100))
corrplot.mixed(upper = 'pie', lower = 'number', order = 'hclust', tl.col = "black")
Figure 5: Correlation Pie Chart: Training Set
We compute a correlation matrix to better understand the relationships between the variables in the dataset. The results are shown below, and we noticed a few interesting correlations.
- nox: A high nitrogen oxides concentration (parts per 10 million) is positively correlated with an above-median crime rate. As defined by the EPA, "NOx pollution is emitted by automobiles, trucks and various non-road vehicles (e.g., construction equipment, boats, etc.) as well as industrial sources such as power plants, industrial boilers, cement kilns, and turbines". NOx is concentrated in areas of high road traffic and possibly heavy industrial use, which tend to be lower-value neighborhoods that may attract crime.
- dis: The weighted mean of distances to employment centers is negatively correlated with an above-median crime rate. This is intuitive in that employment centers tend to be located closer to high-crime areas, since high unemployment is positively correlated with higher crime rates.
- tax: It is somewhat counterintuitive that the crime rate has a positive correlation with the property-tax rate. One might expect that as the property tax increases, the crime rate would decrease because home occupants and owners could spend more on security. However, when the crime rate starts to increase, housing prices tend to decrease because occupants and owners do not want to risk their safety.
Figure 6: Correlation Chart: Training Set
data_t %>%
cor() %>%
as.data.frame() %>%
rownames_to_column('Variable') %>%
dplyr::rename(Correlation_vs_Response = target)
Consolidated Data Dictionary
As a summary of the data exploration process, a data dictionary is created below:
data_stat <- data_t %>%
dplyr::select(-target,-chas) %>%
gather() %>%
group_by(key) %>%
summarise(Mean = mean(value),
Median = median(value),
Max = max(value),
Min = min(value),
SD = sd(value))
data_cor <- data_t %>%
cor() %>%
as.data.frame() %>%
dplyr::select(target) %>%
rownames_to_column('Variable') %>%
dplyr::rename(Correlation_vs_Response = target)
data_t %>%
gather() %>%
dplyr::select(key) %>%
unique() %>%
dplyr::rename(Variable = key) %>%
mutate(Description = c('whether the crime rate is above the median crime rate (1) or not (0)',
'proportion of residential land zoned for large lots (over 25000 square feet)',
'proportion of non-retail business acres per suburb',
'a dummy var. for whether the suburb borders the Charles River (1) or not (0)',
'nitrogen oxides concentration (parts per 10 million)',
'average number of rooms per dwelling',
'proportion of owner-occupied units built prior to 1940',
'weighted mean of distances to five Boston employment centers',
'index of accessibility to radial highways',
'full-value property-tax rate per $10,000',
'pupil-teacher ratio by town',
'lower status of the population (percent)',
'median value of owner-occupied homes in $1000s'),
Var_Type_1 = case_when(Variable %in% c('target','chas') ~ 'categorical',
Variable %in% c('rad','tax') ~ 'discrete numerical',
TRUE ~ 'continuous numerical'),
Var_Type_2 = if_else(Variable == 'target', 'response', 'predictor'),
Missing_Value = 'No') %>%
left_join(data_stat, by = c('Variable'='key')) %>%
left_join(data_cor, by = 'Variable') %>%
mutate_if(is.numeric,round,2) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
| Variable | Description | Var_Type_1 | Var_Type_2 | Missing_Value | Mean | Median | Max | Min | SD | Correlation_vs_Response |
|---|---|---|---|---|---|---|---|---|---|---|
| target | whether the crime rate is above the median crime rate (1) or not (0) | categorical | response | No | NA | NA | NA | NA | NA | 1.00 |
| zn | proportion of residential land zoned for large lots (over 25000 square feet) | continuous numerical | predictor | No | 11.58 | 0.00 | 100.00 | 0.00 | 23.36 | -0.43 |
| indus | proportion of non-retail business acres per suburb | continuous numerical | predictor | No | 11.11 | 9.69 | 27.74 | 0.46 | 6.85 | 0.60 |
| chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) | categorical | predictor | No | NA | NA | NA | NA | NA | 0.08 |
| nox | nitrogen oxides concentration (parts per 10 million) | continuous numerical | predictor | No | 0.55 | 0.54 | 0.87 | 0.39 | 0.12 | 0.73 |
| rm | average number of rooms per dwelling | continuous numerical | predictor | No | 6.29 | 6.21 | 8.78 | 3.86 | 0.70 | -0.15 |
| age | proportion of owner-occupied units built prior to 1940 | continuous numerical | predictor | No | 68.37 | 77.15 | 100.00 | 2.90 | 28.32 | 0.63 |
| dis | weighted mean of distances to five Boston employment centers | continuous numerical | predictor | No | 3.80 | 3.19 | 12.13 | 1.13 | 2.11 | -0.62 |
| rad | index of accessibility to radial highways | discrete numerical | predictor | No | 9.53 | 5.00 | 24.00 | 1.00 | 8.69 | 0.63 |
| tax | full-value property-tax rate per $10,000 | discrete numerical | predictor | No | 409.50 | 334.50 | 711.00 | 187.00 | 167.90 | 0.61 |
| ptratio | pupil-teacher ratio by town | continuous numerical | predictor | No | 18.40 | 18.90 | 22.00 | 12.60 | 2.20 | 0.25 |
| lstat | lower status of the population (percent) | continuous numerical | predictor | No | 12.63 | 11.35 | 37.97 | 1.73 | 7.10 | 0.47 |
| medv | median value of owner-occupied homes in $1000s | continuous numerical | predictor | No | 22.59 | 21.20 | 50.00 | 5.00 | 9.24 | -0.27 |
Data Preparation
Re-scale Data
The dataset contains variables on different measurement scales, such as percentages, distances, and dollar values. To put the predictors on a comparable scale, all numeric predictors are normalized to mean = 0 and SD = 1 (the categorical variables target and chas are left unchanged).
data_rescaled <- scale(data_t_mod[c(2,3,5:13)]) %>%
as.data.frame() %>%
cbind(data_t_mod[c(1,4)]) %>%
dplyr::select(target, zn, indus, chas, everything())
DT::datatable(data_rescaled)
Table 5: Rescaled training set
Build Models
Because we have a small number of observations to train over, we will use k-fold Cross Validation to train, with k = 10. We’ll hold out 15% of the data for validation while doing the initial modeling, but once we select our model, we will retrain over the full training set.
Each of our logistic regression models uses a binomial family with a logit link function.
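For reference, every model below therefore takes the standard logistic regression form (shown here only as a reminder; not part of the original write-up):
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad p = P(\text{target} = 1 \mid x_1, \ldots, x_k)$$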
Model 1: Full Model
The first model includes all the variables. A review of the model's VIF output suggests that some predictors are highly collinear and that a number of variables may not be necessary. Model 1 uses the formula:
target ~ .
set.seed(121)
split <- caret::createDataPartition(data_rescaled$target, p=0.85, list=FALSE)
partial_train <- data_rescaled[split, ]
validation <- data_rescaled[ -split, ]
mod1 <- caret::train(target ~., data = partial_train,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
preProcess = c("center", "scale"))
knitr::kable(vif(mod1$finalModel))
| | x |
|---|---|
| zn | 1.775536 |
| indus | 2.615682 |
| chas1 | 1.289891 |
| nox | 4.090926 |
| rm | 6.680172 |
| age | 2.408913 |
| dis | 3.574289 |
| rad | 2.078134 |
| tax | 2.209580 |
| ptratio | 2.433736 |
| lstat | 2.735861 |
| medv | 9.246747 |
Model 2: Removing Predictors That Seemed Unnecessary
Our second model ignores the collinearity issues but removes the predictors that seemed unnecessary in Model #1. Model 2 uses the formula:
target ~ zn + nox + age + dis + rad + ptratio + medv
# keep only the predictors that had low p-values in Model 1
mod2 <- train(target ~ zn + nox + age + dis + rad + ptratio + medv,
data = partial_train,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
preProcess = c("center", "scale"))
knitr::kable(vif(mod2$finalModel))
| | x |
|---|---|
| zn | 1.801287 |
| nox | 3.049522 |
| age | 1.685178 |
| dis | 3.659469 |
| rad | 1.235992 |
| ptratio | 1.826575 |
| medv | 2.094548 |
Model 3: Removing Highest VIF Values
Model #3 removes the variables with the two highest VIF values from Model 1. The model formula is:
target ~ indus + rm + age + dis + tax + ptratio + lstat + medv
## Reduce Collinearity by removing high VIFs
mod3 <- train(target ~ indus + rm + age + dis + tax + ptratio + lstat + medv, data = partial_train,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
preProcess = c("center", "scale"))
knitr::kable(vif(mod3$finalModel))
| | x |
|---|---|
| indus | 2.190206 |
| rm | 4.462813 |
| age | 2.097140 |
| dis | 1.956005 |
| tax | 1.749705 |
| ptratio | 1.423980 |
| lstat | 2.765737 |
| medv | 5.782926 |
Model 4: Removing Poor Predictors
Model #4 builds on Model #3 and removes the variables shown to be poor predictors. It uses the formula:
target ~ age + dis + tax + medv
## reduce collinearity and remove weak predictors
mod4 <- train(target ~ age + dis + tax + medv,
data = partial_train,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
preProcess = c("center", "scale"))
knitr::kable(vif(mod4$finalModel))
| | x |
|---|---|
| age | 1.733106 |
| dis | 1.715677 |
| tax | 1.386751 |
| medv | 1.413739 |
Model 5: Stepwise Based on AIC
Model #5: We use the stepwise function with the AIC criterion in both directions and obtain Model #5 in 10 steps.
target ~ nox + rad + tax + ptratio + medv + lstat + dis + zn + age
full_model <- glm(target ~ ., data = partial_train, family = "binomial")
model_5 <- stepwise(full_model, criterion = 'AIC', direction = 'forward/backward', trace = TRUE)
##
## Direction: forward/backward
## Criterion: AIC
##
## Start: AIC=552.24
## target ~ 1
##
## Df Deviance AIC
## + nox 1 263.04 267.04
## + rad 1 353.20 357.20
## + dis 1 362.95 366.95
## + age 1 374.91 378.91
## + tax 1 389.61 393.61
## + indus 1 394.18 398.18
## + zn 1 445.94 449.94
## + lstat 1 458.04 462.04
## + ptratio 1 527.19 531.19
## + medv 1 527.24 531.24
## + rm 1 543.46 547.46
## + chas 1 544.87 548.87
## <none> 550.24 552.24
##
## Step: AIC=267.04
## target ~ nox
##
## Df Deviance AIC
## + rad 1 218.02 224.02
## + rm 1 257.59 263.59
## + medv 1 257.84 263.84
## + chas 1 259.12 265.12
## + indus 1 259.98 265.98
## + tax 1 260.04 266.04
## + zn 1 260.29 266.29
## <none> 263.04 267.04
## + ptratio 1 261.57 267.57
## + dis 1 261.93 267.93
## + age 1 262.05 268.05
## + lstat 1 263.04 269.04
## - nox 1 550.24 552.24
##
## Step: AIC=224.02
## target ~ nox + rad
##
## Df Deviance AIC
## + tax 1 205.03 213.03
## + indus 1 213.29 221.29
## + zn 1 214.62 222.62
## + ptratio 1 215.77 223.77
## + medv 1 215.87 223.87
## <none> 218.02 224.02
## + dis 1 216.04 224.04
## + rm 1 216.20 224.20
## + chas 1 216.21 224.21
## + age 1 216.26 224.26
## + lstat 1 217.98 225.98
## - rad 1 263.04 267.04
## - nox 1 353.20 357.20
##
## Step: AIC=213.03
## target ~ nox + rad + tax
##
## Df Deviance AIC
## + ptratio 1 199.20 209.20
## + zn 1 201.48 211.48
## + age 1 202.17 212.17
## <none> 205.03 213.03
## + lstat 1 203.38 213.38
## + dis 1 203.44 213.44
## + chas 1 204.25 214.25
## + indus 1 204.50 214.50
## + rm 1 204.73 214.73
## + medv 1 204.93 214.93
## - tax 1 218.02 224.02
## - rad 1 260.04 266.04
## - nox 1 347.31 353.31
##
## Step: AIC=209.2
## target ~ nox + rad + tax + ptratio
##
## Df Deviance AIC
## + medv 1 196.13 208.13
## + age 1 196.43 208.43
## <none> 199.20 209.20
## + zn 1 197.55 209.55
## + rm 1 197.59 209.59
## + chas 1 197.61 209.61
## + dis 1 197.88 209.88
## + lstat 1 198.62 210.62
## + indus 1 198.85 210.85
## - ptratio 1 205.03 213.03
## - tax 1 215.77 223.77
## - rad 1 259.41 267.41
## - nox 1 346.63 354.63
##
## Step: AIC=208.13
## target ~ nox + rad + tax + ptratio + medv
##
## Df Deviance AIC
## + lstat 1 190.72 204.72
## + age 1 192.40 206.40
## + dis 1 193.46 207.46
## <none> 196.13 208.13
## + zn 1 194.18 208.18
## + chas 1 194.38 208.38
## - medv 1 199.20 209.20
## + indus 1 195.82 209.82
## + rm 1 195.89 209.89
## - ptratio 1 204.93 214.93
## - tax 1 208.44 218.44
## - rad 1 246.62 256.62
## - nox 1 346.52 356.52
##
## Step: AIC=204.72
## target ~ nox + rad + tax + ptratio + medv + lstat
##
## Df Deviance AIC
## + dis 1 187.06 203.06
## <none> 190.72 204.72
## + zn 1 188.81 204.81
## + age 1 189.03 205.03
## + chas 1 189.96 205.96
## + indus 1 190.09 206.09
## + rm 1 190.72 206.72
## - lstat 1 196.13 208.13
## - medv 1 198.62 210.62
## - ptratio 1 201.36 213.36
## - tax 1 205.13 217.13
## - rad 1 243.00 255.00
## - nox 1 310.16 322.16
##
## Step: AIC=203.06
## target ~ nox + rad + tax + ptratio + medv + lstat + dis
##
## Df Deviance AIC
## + zn 1 182.11 200.11
## + age 1 183.27 201.27
## <none> 187.06 203.06
## + chas 1 185.67 203.67
## + indus 1 186.31 204.31
## - dis 1 190.72 204.72
## + rm 1 187.06 205.06
## - lstat 1 193.46 207.46
## - medv 1 197.45 211.45
## - ptratio 1 198.90 212.90
## - tax 1 199.70 213.70
## - rad 1 237.35 251.35
## - nox 1 265.54 279.54
##
## Step: AIC=200.11
## target ~ nox + rad + tax + ptratio + medv + lstat + dis + zn
##
## Df Deviance AIC
## + age 1 178.21 198.21
## <none> 182.11 200.11
## + indus 1 181.38 201.38
## + chas 1 181.44 201.44
## + rm 1 182.06 202.06
## - zn 1 187.06 203.06
## - dis 1 188.81 204.81
## - lstat 1 189.00 205.00
## - ptratio 1 190.53 206.53
## - tax 1 192.52 208.52
## - medv 1 194.47 210.47
## - rad 1 229.73 245.73
## - nox 1 258.55 274.55
##
## Step: AIC=198.21
## target ~ nox + rad + tax + ptratio + medv + lstat + dis + zn +
## age
##
## Df Deviance AIC
## <none> 178.21 198.21
## + rm 1 177.42 199.42
## + indus 1 177.46 199.46
## + chas 1 177.73 199.73
## - lstat 1 181.88 199.88
## - age 1 182.11 200.11
## - zn 1 183.27 201.27
## - dis 1 187.43 205.43
## - ptratio 1 187.51 205.51
## - tax 1 188.85 206.85
## - medv 1 191.07 209.07
## - rad 1 226.81 244.81
## - nox 1 245.63 263.63
##
## Call:
## glm(formula = target ~ nox + rad + tax + ptratio + medv + lstat +
## dis + zn + age, family = "binomial", data = partial_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9810 -0.2345 -0.0020 0.0038 3.2862
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.6310 0.7350 3.580 0.000344 ***
## nox 5.0016 0.8106 6.170 6.82e-10 ***
## rad 6.5437 1.4025 4.666 3.07e-06 ***
## tax -1.4398 0.4858 -2.964 0.003037 **
## ptratio 0.7573 0.2569 2.948 0.003199 **
## medv 1.2908 0.3889 3.319 0.000904 ***
## lstat 0.6877 0.3597 1.912 0.055883 .
## dis 1.3092 0.4517 2.898 0.003752 **
## zn -1.4975 0.7628 -1.963 0.049617 *
## age 0.6307 0.3307 1.907 0.056531 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 550.24 on 396 degrees of freedom
## Residual deviance: 178.21 on 387 degrees of freedom
## AIC: 198.21
##
## Number of Fisher Scoring iterations: 9
Model 6: Stepwise Based on BIC
Model #6: We use the stepwise function with the BIC criterion in both directions and obtain Model #6 in 4 steps.
target ~ nox + rad + tax
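The chunk that produces this model is not shown in the rendered output; presumably it mirrors the Model 5 call with the BIC criterion, along the lines of:
# stepwise selection in both directions using BIC (sketch of the unshown chunk)
model_6 <- stepwise(full_model, criterion = 'BIC', direction = 'forward/backward', trace = TRUE)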
##
## Direction: forward/backward
## Criterion: BIC
##
## Start: AIC=556.22
## target ~ 1
##
## Df Deviance AIC
## + nox 1 263.04 275.00
## + rad 1 353.20 365.17
## + dis 1 362.95 374.91
## + age 1 374.91 386.88
## + tax 1 389.61 401.58
## + indus 1 394.18 406.15
## + zn 1 445.94 457.90
## + lstat 1 458.04 470.01
## + ptratio 1 527.19 539.15
## + medv 1 527.24 539.21
## + rm 1 543.46 555.42
## <none> 550.24 556.22
## + chas 1 544.87 556.84
##
## Step: AIC=275
## target ~ nox
##
## Df Deviance AIC
## + rad 1 218.02 235.98
## <none> 263.04 275.00
## + rm 1 257.59 275.54
## + medv 1 257.84 275.79
## + chas 1 259.12 277.07
## + indus 1 259.98 277.93
## + tax 1 260.04 278.00
## + zn 1 260.29 278.24
## + ptratio 1 261.57 279.52
## + dis 1 261.93 279.88
## + age 1 262.05 280.00
## + lstat 1 263.04 280.99
## - nox 1 550.24 556.22
##
## Step: AIC=235.98
## target ~ nox + rad
##
## Df Deviance AIC
## + tax 1 205.03 228.97
## <none> 218.02 235.98
## + indus 1 213.29 237.23
## + zn 1 214.62 238.56
## + ptratio 1 215.77 239.71
## + medv 1 215.87 239.80
## + dis 1 216.04 239.98
## + rm 1 216.20 240.14
## + chas 1 216.21 240.14
## + age 1 216.26 240.19
## + lstat 1 217.98 241.91
## - rad 1 263.04 275.00
## - nox 1 353.20 365.17
##
## Step: AIC=228.97
## target ~ nox + rad + tax
##
## Df Deviance AIC
## <none> 205.03 228.97
## + ptratio 1 199.20 229.12
## + zn 1 201.48 231.40
## + age 1 202.17 232.09
## + lstat 1 203.38 233.30
## + dis 1 203.44 233.36
## + chas 1 204.25 234.17
## + indus 1 204.50 234.42
## + rm 1 204.73 234.65
## + medv 1 204.93 234.84
## - tax 1 218.02 235.98
## - rad 1 260.04 278.00
## - nox 1 347.31 365.26
##
## Call:
## glm(formula = target ~ nox + rad + tax, family = "binomial",
## data = partial_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.87592 -0.37435 -0.04307 0.00727 2.51162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5664 0.6048 4.243 2.20e-05 ***
## nox 4.0277 0.5539 7.272 3.54e-13 ***
## rad 5.4287 1.0918 4.972 6.62e-07 ***
## tax -1.3703 0.4225 -3.244 0.00118 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 550.24 on 396 degrees of freedom
## Residual deviance: 205.03 on 393 degrees of freedom
## AIC: 213.03
##
## Number of Fisher Scoring iterations: 8
Model 7: Best Subset Based on AIC
Model #7: We use the best subset method with the AIC criterion to find Model #7.
target ~ zn + nox + age + dis + rad + tax + ptratio + lstat + medv (Same as Model 5)
Xy <- partial_train %>% dplyr::select(-target,everything()) # bestglm expects the response (target) as the last column
model_7 <- bestglm(Xy = Xy, family = binomial, IC = 'AIC', method = 'exhaustive')
Top 5 models among all the subsets:
model_7$BestModels %>%
mutate(model_rank = row_number()) %>%
dplyr::select(model_rank, everything()) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
| model_rank | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | Criterion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TRUE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | 196.2099 |
| 2 | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | 197.4200 |
| 3 | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | 197.4567 |
| 4 | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | 197.5250 |
| 5 | TRUE | FALSE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | 197.7333 |
The rank 1 model is selected as model 7.
##
## Call: glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Coefficients:
## (Intercept) zn nox age dis rad
## 2.6310 -1.4975 5.0016 0.6307 1.3092 6.5437
## tax ptratio lstat medv
## -1.4398 0.7573 0.6877 1.2908
##
## Degrees of Freedom: 396 Total (i.e. Null); 387 Residual
## Null Deviance: 550.2
## Residual Deviance: 178.2 AIC: 198.2
Model 8: Best Subset Based on BIC
Model #8: We use the best subset method with the BIC criterion to find Model #8.
target ~ nox + rad + tax (Same as Model 6)
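The chunk that creates model_8 is not shown in the rendered output; presumably it mirrors the Model 7 call with IC = 'BIC', along the lines of:
# exhaustive best-subset search using BIC (sketch of the unshown chunk)
model_8 <- bestglm(Xy = Xy, family = binomial, IC = 'BIC', method = 'exhaustive')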
Top 5 models among all the subsets:
model_8$BestModels %>%
mutate(model_rank = row_number()) %>%
dplyr::select(model_rank, everything()) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
| model_rank | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | Criterion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | 222.9816 |
| 2 | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | 223.1394 |
| 3 | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | 225.4175 |
| 4 | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE | 226.0475 |
| 5 | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | 226.1033 |
The rank 1 model is selected as model 8.
##
## Call: glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Coefficients:
## (Intercept) nox rad tax
## 2.566 4.028 5.429 -1.370
##
## Degrees of Freedom: 396 Total (i.e. Null); 393 Residual
## Null Deviance: 550.2
## Residual Deviance: 205 AIC: 213
# re-train Models 5 and 6 using caret's train() function
model_5 <- train(target ~ nox + rad + tax + ptratio + age + medv + dis + zn,
data = partial_train,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
preProcess = c("center", "scale"))
model_6 <- train(target ~ nox + rad + tax,
data = partial_train,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
preProcess = c("center", "scale"))Select Models
To aid model selection, we review each model's accuracy by making predictions on the holdout validation set and comparing performance with several confusion-matrix-based diagnostics: fourfold plots, summary statistics, and ROC/AUC plots.
Fourfold Plots
preds1 <- predict(mod1, newdata = validation)
preds2 <- predict(mod2, newdata = validation)
preds3 <- predict(mod3, newdata = validation)
preds4 <- predict(mod4, newdata = validation)
preds5 <- predict(model_5, newdata = validation)
preds6 <- predict(model_6, newdata = validation)
m1cM <- confusionMatrix(preds1, validation$target, mode = "everything")
m2cM <- confusionMatrix(preds2, validation$target, mode = "everything")
m3cM <- confusionMatrix(preds3, validation$target, mode = "everything")
m4cM <- confusionMatrix(preds4, validation$target, mode = "everything")
m5cM <- confusionMatrix(preds5, validation$target, mode = "everything")
m6cM <- confusionMatrix(preds6, validation$target, mode = "everything")
par(mfrow=c(3,3))
fourfoldplot(m1cM$table, color = c("#B22222", "#2E8B57"), main="Model 1")
fourfoldplot(m2cM$table, color = c("#B22222", "#2E8B57"), main="Model 2")
fourfoldplot(m3cM$table, color = c("#B22222", "#2E8B57"), main="Model 3")
fourfoldplot(m4cM$table, color = c("#B22222", "#2E8B57"), main="Model 4")
fourfoldplot(m5cM$table, color = c("#B22222", "#2E8B57"), main="Model 5")
fourfoldplot(m6cM$table, color = c("#B22222", "#2E8B57"), main="Model 6")
Summary Statistics
Model 1, Model 2, and Model 5 each achieve the best performance in at least one category.
temp <- data.frame(m1cM$overall,
m2cM$overall,
m3cM$overall,
m4cM$overall,
m5cM$overall,
m6cM$overall) %>%
t() %>%
data.frame() %>%
dplyr::select(Accuracy) %>%
mutate(Classification_Error_Rate = 1-Accuracy)
Summ_Stat <-data.frame(m1cM$byClass,
m2cM$byClass,
m3cM$byClass,
m4cM$byClass,
m5cM$byClass,
m6cM$byClass) %>%
t() %>%
data.frame() %>%
cbind(temp) %>%
# manipulate results DF
mutate(Model = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6")) %>%
dplyr::select(Model, Accuracy, Classification_Error_Rate, Precision, Sensitivity, Specificity, F1) %>%
add_row(Model = 'Model 7 (Same as Model 5)') %>%
add_row(Model = 'Model 8 (Same as Model 6)') %>%
mutate_if(is.numeric, round,3) %>%
mutate_at(c('Accuracy', 'Precision', 'Sensitivity', 'Specificity', 'F1'), function(x) {
cell_spec(x,
bold = if_else(x == max(x, na.rm = TRUE),TRUE, FALSE),
font_size = if_else(x == max(x, na.rm = TRUE),14, 12))}) %>%
mutate(Classification_Error_Rate = cell_spec(Classification_Error_Rate,
bold = if_else(Classification_Error_Rate == min(Classification_Error_Rate, na.rm = TRUE),TRUE,FALSE),
font_size = if_else(Classification_Error_Rate == min(Classification_Error_Rate, na.rm = TRUE),14, 12))) %>%
mutate(Model = cell_spec(Model,
bold = if_else(Model %in% c('Model 1', 'Model 2', 'Model 5'), TRUE, FALSE),
font_size = if_else(Model %in% c('Model 1', 'Model 2', 'Model 5'), 14, 12))) %>%
kable('html', escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
Summ_Stat
| Model | Accuracy | Classification_Error_Rate | Precision | Sensitivity | Specificity | F1 |
|---|---|---|---|---|---|---|
| Model 1 | 0.957 | 0.043 | 0.944 | 0.971 | 0.941 | 0.958 |
| Model 2 | 0.942 | 0.058 | 0.943 | 0.943 | 0.941 | 0.943 |
| Model 3 | 0.884 | 0.116 | 0.865 | 0.914 | 0.853 | 0.889 |
| Model 4 | 0.899 | 0.101 | 0.868 | 0.943 | 0.853 | 0.904 |
| Model 5 | 0.942 | 0.058 | 0.943 | 0.943 | 0.941 | 0.943 |
| Model 6 | 0.928 | 0.072 | 0.917 | 0.943 | 0.912 | 0.93 |
| Model 7 (Same as Model 5) | NA | NA | NA | NA | NA | NA |
| Model 8 (Same as Model 6) | NA | NA | NA | NA | NA | NA |
ROC / AUC
The larger the area under the curve, the better the model.
AUC: model 1 > model 5 > model 2 > model 6 > model 3 > model 4
getROC <- function(model) {
name <- deparse(substitute(model))
pred.prob1 <- predict(model, newdata = data_rescaled, type="prob")
p1 <- data.frame(pred = data_rescaled$target, prob = pred.prob1[[1]])
p1 <- p1[order(p1$prob),]
rocobj <- pROC::roc(p1$pred, p1$prob)
plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
xlab="Specificity", main = name)
}
par(mfrow=c(3,3))
getROC(mod1)
getROC(mod2)
getROC(mod3)
getROC(mod4)
getROC(model_5)
getROC(model_6)
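As a numeric complement to the plots, a minimal sketch (not part of the original chunks) that extracts each model's AUC over the full rescaled training set, the same data used by getROC():
# AUC for each model; higher is better
sapply(list(Model1 = mod1, Model2 = mod2, Model3 = mod3,
Model4 = mod4, Model5 = model_5, Model6 = model_6),
function(m) {
prob <- predict(m, newdata = data_rescaled, type = "prob")[[2]]
as.numeric(pROC::auc(data_rescaled$target, prob))
})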
R^2, AIC, AICc & BIC
Although Model 1 has the largest McFadden's R^2, Model 5 has the smallest AIC and AICc and the second-largest R^2.
null_model <- glm(target ~ 1, data = partial_train, family = 'binomial')
#refit models using glm() function
model_1 <- glm(target~., partial_train, family = 'binomial')
model_2 <- glm(target~zn + nox + age + dis + rad + ptratio + medv, partial_train, family = 'binomial')
model_3 <- glm(target~indus + rm + age + dis + tax + ptratio + lstat + medv, partial_train, family = 'binomial')
model_4 <- glm(target~age + dis + tax + medv, partial_train, family = 'binomial')
model_5 <- glm(target~nox + rad + tax + ptratio + age + medv + dis + zn, partial_train, family = 'binomial')
model_6 <- glm(target~nox + rad + tax, partial_train, family = 'binomial')
models <- list(model_1, model_2, model_3, model_4, model_5, model_6)
Predictor <- models %>%
lapply(function(x) str_c(unlist(row.names(summary(x)$coefficients)), collapse = ',')) %>%
unlist() %>% str_remove('\\(Intercept\\)\\,')
McFaddens_R2 <- list(1-logLik(model_1)/logLik(null_model),
1-logLik(model_2)/logLik(null_model),
1-logLik(model_3)/logLik(null_model),
1-logLik(model_4)/logLik(null_model),
1-logLik(model_5)/logLik(null_model),
1-logLik(model_6)/logLik(null_model)) %>%
unlist()
AIC <- models %>%
lapply(function(x) AIC(x)) %>%
unlist()
AICc <- models %>%
lapply(function(x) AICc(x)) %>%
unlist()
BIC <- models %>%
lapply(function(x) BIC(x)) %>%
unlist()
cbind(Predictor, McFaddens_R2, AIC, AICc, BIC) %>%
as.data.frame(stringsAsFactors = FALSE) %>%
mutate_at(c('McFaddens_R2','AIC','AICc','BIC'), as.numeric) %>%
mutate(Model = c(str_c('Model ', c(1:6)))) %>%
dplyr::select(Model, everything()) %>%
add_row(Model = 'Model 7', Predictor = 'Same as Model 5') %>%
add_row(Model = 'Model 8', Predictor = 'Same as Model 6') %>%
mutate_if(is.numeric, round,3) %>%
mutate(McFaddens_R2 = cell_spec(McFaddens_R2,
bold = if_else(McFaddens_R2 == max(McFaddens_R2, na.rm = TRUE), TRUE, FALSE),
font_size = if_else(McFaddens_R2 == max(McFaddens_R2, na.rm = TRUE), 15, 12)),
AIC = cell_spec(AIC,
bold = if_else(AIC == min(AIC, na.rm = TRUE), TRUE, FALSE),
font_size = if_else(AIC == min(AIC, na.rm = TRUE), 14, 12)),
AICc = cell_spec(AICc,
bold = if_else(AICc == min(AICc, na.rm = TRUE), TRUE, FALSE),
font_size = if_else(AICc == min(AICc, na.rm = TRUE), 14, 12)),
BIC = cell_spec(BIC,
bold = if_else(BIC == min(BIC, na.rm = TRUE), TRUE, FALSE),
font_size = if_else(BIC == min(BIC, na.rm = TRUE), 14, 12)),
Model = cell_spec(Model,
bold = if_else(Model == 'Model 5', TRUE, FALSE),
font_size = if_else(Model == 'Model 5', 14, 12))) %>%
kable('html', escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
| Model | Predictor | McFaddens_R2 | AIC | AICc | BIC |
|---|---|---|---|---|---|
| Model 1 | zn,indus,chas1,nox,rm,age,dis,rad,tax,ptratio,lstat,medv | 0.68 | 201.853 | 202.804 | 253.644 |
| Model 2 | zn,nox,age,dis,rad,ptratio,medv | 0.653 | 207.183 | 207.555 | 239.055 |
| Model 3 | indus,rm,age,dis,tax,ptratio,lstat,medv | 0.463 | 313.336 | 313.801 | 349.191 |
| Model 4 | age,dis,tax,medv | 0.455 | 310.012 | 310.165 | 329.932 |
| Model 5 | nox,rad,tax,ptratio,age,medv,dis,zn | 0.669 | 199.883 | 200.348 | 235.739 |
| Model 6 | nox,rad,tax | 0.627 | 213.03 | 213.132 | 228.966 |
| Model 7 | Same as Model 5 | NA | NA | NA | NA |
| Model 8 | Same as Model 6 | NA | NA | NA | NA |
Model Selection
From the model selection process above, we know that Model 1 suffers from collinearity issues; the remaining models attempted to eliminate those issues while still achieving the best possible prediction performance. Among them, Model 5 has 1) the highest specificity, 2) the second-highest accuracy, precision, sensitivity, F1 score, AUC, and McFadden's R squared (exceeded only by Model 1), and 3) the lowest AIC and AICc. Therefore Model 5 is selected as the final model.
finalmod_FS <- train(target ~ nox + rad + tax + ptratio + age + medv + dis + zn,
data = data_rescaled,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
preProcess = c("center", "scale"))
summary(finalmod_FS)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8295 -0.1752 -0.0021 0.0032 3.4191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4406 0.6769 3.606 0.000311 ***
## nox 4.9942 0.7792 6.410 1.46e-10 ***
## rad 6.2982 1.3010 4.841 1.29e-06 ***
## tax -1.3023 0.4454 -2.924 0.003459 **
## ptratio 0.7110 0.2447 2.905 0.003668 **
## age 0.9332 0.3101 3.009 0.002622 **
## medv 1.0207 0.3275 3.117 0.001829 **
## dis 1.3798 0.4510 3.060 0.002217 **
## zn -1.6039 0.7481 -2.144 0.032033 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
Odds Ratio
We also create a table of the odds ratios for our final model alongside their 95% confidence intervals. The odds ratio (OR) is a measure of association between an exposure and an outcome: it represents the odds that the outcome occurs given a particular exposure, compared with the odds of the outcome occurring in the absence of that exposure.
odds <- round(exp(cbind(OddsRatio = coef(finalmod_FS$finalModel), confint(finalmod_FS$finalModel))), 3)
knitr::kable(odds)
| | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 11.479 | 3.161 | 45.685 |
| nox | 147.561 | 35.436 | 763.373 |
| rad | 543.616 | 49.797 | 8423.905 |
| tax | 0.272 | 0.104 | 0.615 |
| ptratio | 2.036 | 1.274 | 3.343 |
| age | 2.543 | 1.408 | 4.780 |
| medv | 2.775 | 1.503 | 5.441 |
| dis | 3.974 | 1.693 | 10.049 |
| zn | 0.201 | 0.040 | 0.747 |
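For reference, each odds ratio in the table is simply the exponentiated coefficient from the final model (a reminder, not part of the original write-up):
$$OR_j = e^{\beta_j}$$
so a one-unit increase in the scaled predictor $x_j$ multiplies the odds of target = 1 by $e^{\beta_j}$, holding the other predictors constant.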
So we can now say that a one-unit increase in the scaled age variable multiplies the odds of the neighborhood being above the median crime rate by roughly 2.54 (about a 154% increase), holding the other predictors constant.
All that is left is to use our final model to make predictions over the evaluation dataset.
Make Predictions
We make our final predictions and create a data frame containing the predicted class and the predicted probabilities alongside the evaluation set. The evaluation set is rescaled in the same way as the training set. The result shows that, among the 40 observations, 23 are predicted to have a crime rate below the median (0) and 17 above the median (1).
data_e_rescaled <- data_e %>%
dplyr::select(-chas) %>%
scale() %>%
cbind(data_e[3]) %>%
as.data.frame() %>%
mutate(chas = as.factor(chas))
finalpreds <- predict(finalmod_FS, data_e_rescaled)
finalpreds.probs <- predict(finalmod_FS, data_e_rescaled, type="prob") # use the rescaled evaluation set, consistent with the class predictions above
finaldf <- cbind(predicted_Response=finalpreds, Predicted_Prob=finalpreds.probs, data_e) %>%
mutate(Predicted_Prob.0 = percent(Predicted_Prob.0),
Predicted_Prob.1 = percent(Predicted_Prob.1))
DT::datatable(finaldf, caption = 'Predicted Result of Final Model: Model 5')
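The deliverable asks for classifications at a 0.5 threshold along with probabilities; a minimal sketch for exporting them (the output filename is an assumption):
# caret's class predictions already use a 0.5 cutoff; make the threshold explicit and write the file
deliverable <- data.frame(probability = finalpreds.probs$`1`,
classification = ifelse(finalpreds.probs$`1` >= 0.5, 1, 0))
write_csv(deliverable, 'crime-evaluation-predictions.csv') # assumed filename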