In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
zn: the proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)indus: the proportion of non-retail business acres per suburb (predictor variable)chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable)nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)rm: average number of rooms per dwelling (predictor variable)age: the proportion of owner-occupied units built prior to 1940 (predictor variable)dis: weighted mean of distances to five Boston employment centers (predictor variable)rad: index of accessibility to radial highways (predictor variable)tax: full-value property-tax rate per $10,000 (predictor variable)ptratio: pupil-teacher ratio by town (predictor variable)lstat: lower status of the population (percent) (predictor variable)medv: median value of owner-occupied homes in $1000s (predictor variable)target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)Crime is a common concern in large cities, and denizens, policymakers, and law enforcement are interested in identifying locations where crime can or might occur. In this assignment, we are given a dataset for the city of Boston and will build a model to identify regions that might have higher or lower (against the median) crime. Since the goal is a binary (above or below median) response, this assignment will employ binary logistic regression.
Describe the size and the variables in the crime training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
It contains both a categorical target and provides some categorical features. The fact that we are provided with a categorical target turns the task from a regression task to a classification task, perfect for logistic regression. The data contains 12 features and 466 instances, with 40 instances reserved for an evaluation dataset with targets removed.
There are two files provided:
# Load crime dataset
df <- read.csv('https://raw.githubusercontent.com/jayatveluri/Data621/main/crime-training-data_modified.csv')
df_eval <- read.csv('https://raw.githubusercontent.com/jayatveluri/Data621/main/crime-evaluation-data_modified.csv')summary(df)## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
The first observation is that we have no missing data (coded as NA’s).
we have some highly skewed features(where means are far from median, examples variables zn and tax) indicating a skewed distribution. Categorical variablechas` is quite imbalanced, as over 75% of values are 0.
We have two target values, 0 and 1. When building models, we ideally want an equal representation of both classes. As class imbalance deviates, model performance will suffer form effects of differential variance between the classes and bias towards picking the more represented class.
class_proportion = colMeans(df['target'])
class_proportion## target
## 0.4914163
The classes are quite balanced, with approximately 51% 0’s and 49% 1’s
# Prepare data for ggplot
gdata_df <- df %>% dplyr::select(-target) %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(gdata_df) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=3)## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
In addition to creating histogram distributions, we use box-plots to get an idea of the spread of each variable
# Prepare data for ggplot
gdata_df <- df %>% dplyr::select(-target) %>%
gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(gdata_df, aes(variable, value)) +
geom_boxplot() +
facet_wrap(. ~variable, scales='free', ncol=6) ### Variable Plots
Generate scatter plots of each variable versus the target variable to get an idea of the relationship between them.
# Plot scatter plots of each variable versus the target variable
featurePlot(df[,1:ncol(df)-1], df[,ncol(df)], pch = 20)we see no NAs, or missing data in the provided training dataset.
missingdata <- colSums(df %>% sapply(is.na))
missingdata_pct <- round(missingdata / nrow(df) * 100, 2)
stack(sort(missingdata_pct, decreasing = TRUE))## values ind
## 1 0 zn
## 2 0 indus
## 3 0 chas
## 4 0 nox
## 5 0 rm
## 6 0 age
## 7 0 dis
## 8 0 rad
## 9 0 tax
## 10 0 ptratio
## 11 0 lstat
## 12 0 medv
## 13 0 target
clean_df <- dfstack(sort(cor(df[,13], df[,1:ncol(df)-1])[,], decreasing=TRUE))## values ind
## 1 0.72610622 nox
## 2 0.63010625 age
## 3 0.62810492 rad
## 4 0.61111331 tax
## 5 0.60485074 indus
## 6 0.46912702 lstat
## 7 0.25084892 ptratio
## 8 0.08004187 chas
## 9 -0.15255334 rm
## 10 -0.27055071 medv
## 11 -0.43168176 zn
## 12 -0.61867312 dis
nox, age, rad, tax, and indus have the highest correlation (positive) with target
correlation = cor(clean_df, use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
col=brewer.pal(n=8, name="RdYlBu")) We can see that some variables are highly correlated with one another, such as
tax and rad, with a correlation between .75 and 1. When we start considering features for our models, we’ll need to account for the correlations between features and avoid including pairs with strong correlations.
Summarize our data preparation and exploration
All the predictor variables have no missing values and show no indication of incomplete or incorrect data. As such, we have kept all the fields.
The data had no missing values to remove.
No outliers were removed as all values seemed reasonable.
From our histogram plots, we can see that some of our variables are highly skewed. To address this, we decided to perform some transformations to make them more normally distributed. Here are some plots to demonstrate the changes in distributions before and after the transformations.
# created empty data frame to store transformed variables
df_temp <- data.frame(matrix(ncol = 1, nrow = length(clean_df$target)))
# performed boxcox transformation after identifying proper lambda
df_temp$rm <- clean_df$rm
rm_lambda <- BoxCox.lambda(clean_df$rm)
df_temp$rm_transform <- BoxCox(clean_df$rm, rm_lambda)
df_temp$nox <- clean_df$nox
nox_lambda <- BoxCox.lambda(clean_df$nox)
df_temp$nox_transform <- BoxCox(clean_df$nox, nox_lambda)
df_temp$dis <- clean_df$dis
df_temp$dis_transform <- log(clean_df$dis)
df_temp$zn <- clean_df$zn
df_temp$zn_transform <- log(clean_df$zn+1)
df_temp$lstat <- clean_df$lstat
df_temp$lstat_transform <- log(clean_df$lstat)
df_temp$age <- clean_df$age
df_temp$age_transform <- log(max(clean_df$age) + 1 - clean_df$age)
df_temp$ptratio <- clean_df$ptratio
df_temp$ptratio_transform <- log(max(clean_df$ptratio) + 1 - clean_df$ptratio)
df_temp <- df_temp[, 2:15]we can now add these into our cl_df dataframe and continue on to build our models.
cl_df <- data.frame(cbind(clean_df,
rm_transform = df_temp$rm_transform,
nox_transform = df_temp$nox_transform,
dis_transform = df_temp$dis_transform,
zn_transform = df_temp$zn_transform,
lstat_transform = df_temp$lstat_transform,
age_transform = df_temp$age_transform,
ptratio_transform = df_temp$ptratio_transform
))
is.na(cl_df) <- sapply(cl_df, is.infinite)Using the training data, build at least three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done. Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter-intuitive? Why? The boss needs to know.
With Our Dataset cleaned we will build our logistic regression models.
First, we decided to split our cleaned dataset into a training and testing set (80% training, 20% testing). This was necessary as the provided holdout evaluation dataset doesn’t provide target values so we cannot measure our model performance against that dataset.
set.seed(123456)
# utilizing one dataset for all four models
cldfdataTrain <- createDataPartition(cl_df$target, p=0.8, list=FALSE)
cldftraining <- cl_df[cldfdataTrain,]
cldftesting <- cl_df[-cldfdataTrain,]Using our training dataset, we decided to run a binary logistic regression model that included all non-transformed features.
model1 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, data = cldftraining , family = binomial)## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model1)##
## Call:
## glm(formula = target ~ zn + indus + chas + nox + rm + age + dis +
## rad + tax + ptratio + lstat + medv, family = binomial, data = cldftraining)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -61.876805 11.104609 -5.572 2.52e-08 ***
## zn -0.097157 0.054388 -1.786 0.074042 .
## indus -0.092162 0.064947 -1.419 0.155888
## chas 1.442852 1.001954 1.440 0.149856
## nox 68.689102 13.055167 5.261 1.43e-07 ***
## rm -1.111323 1.057127 -1.051 0.293136
## age 0.085584 0.022193 3.856 0.000115 ***
## dis 1.258497 0.374901 3.357 0.000788 ***
## rad 0.900122 0.239910 3.752 0.000175 ***
## tax -0.006122 0.004098 -1.494 0.135217
## ptratio 0.665643 0.198810 3.348 0.000814 ***
## lstat -0.001007 0.068152 -0.015 0.988210
## medv 0.293953 0.103956 2.828 0.004689 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 516.63 on 372 degrees of freedom
## Residual deviance: 110.68 on 360 degrees of freedom
## AIC: 136.68
##
## Number of Fisher Scoring iterations: 9
calculate VIF scores to measure the effects of collinearity as well as variable importance:
print('VIF model1')## [1] "VIF model1"
VIF(model1)## zn indus chas nox rm age dis rad
## 2.055550 2.683933 1.353039 5.573427 7.313187 3.189662 4.971248 2.424854
## tax ptratio lstat medv
## 2.289799 3.203721 2.635193 11.683660
Model #2 using our transformed features.
model2 <- glm(target ~ zn_transform + indus + chas + nox_transform + rm_transform + age_transform + dis_transform + rad + tax + ptratio_transform + lstat_transform + medv, data = cldftraining , family = binomial)
summary(model2)##
## Call:
## glm(formula = target ~ zn_transform + indus + chas + nox_transform +
## rm_transform + age_transform + dis_transform + rad + tax +
## ptratio_transform + lstat_transform + medv, family = binomial,
## data = cldftraining)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 19.648555 11.227065 1.750 0.080100 .
## zn_transform -0.481877 0.337480 -1.428 0.153330
## indus -0.026942 0.068561 -0.393 0.694340
## chas 1.226179 0.994598 1.233 0.217636
## nox_transform 17.439569 3.229450 5.400 6.66e-08 ***
## rm_transform -2.864549 3.657480 -0.783 0.433508
## age_transform -1.533715 0.390108 -3.932 8.44e-05 ***
## dis_transform 3.951566 1.330350 2.970 0.002975 **
## rad 0.986357 0.261550 3.771 0.000162 ***
## tax -0.006789 0.004322 -1.571 0.116223
## ptratio_transform -2.828343 0.802360 -3.525 0.000423 ***
## lstat_transform -0.754703 1.008646 -0.748 0.454319
## medv 0.219507 0.086708 2.532 0.011355 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 516.63 on 372 degrees of freedom
## Residual deviance: 105.74 on 360 degrees of freedom
## AIC: 131.74
##
## Number of Fisher Scoring iterations: 9
calculate VIF scores to measure the effects of collinearity as well as variable importance:
print('VIF model2')## [1] "VIF model2"
VIF(model2)## zn_transform indus chas nox_transform
## 1.783127 2.722927 1.362322 3.633686
## rm_transform age_transform dis_transform rad
## 5.551059 2.350723 3.520485 2.531822
## tax ptratio_transform lstat_transform medv
## 2.506246 2.755768 3.769478 6.851934
We use a Stepwise AIC on Model #1 (model with non-transformed features) to pick which features were most relevant.
model3 <- model1 %>% stepAIC(trace = FALSE)## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model3)##
## Call:
## glm(formula = target ~ zn + indus + chas + nox + age + dis +
## rad + tax + ptratio + medv, family = binomial, data = cldftraining)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -61.579380 10.924258 -5.637 1.73e-08 ***
## zn -0.099274 0.053491 -1.856 0.063470 .
## indus -0.087984 0.064947 -1.355 0.175514
## chas 1.580819 1.004001 1.575 0.115368
## nox 65.295329 12.193437 5.355 8.56e-08 ***
## age 0.074673 0.017516 4.263 2.02e-05 ***
## dis 1.170570 0.353153 3.315 0.000918 ***
## rad 0.820831 0.225401 3.642 0.000271 ***
## tax -0.005837 0.004020 -1.452 0.146532
## ptratio 0.570151 0.168084 3.392 0.000694 ***
## medv 0.196640 0.055583 3.538 0.000404 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 516.63 on 372 degrees of freedom
## Residual deviance: 112.10 on 362 degrees of freedom
## AIC: 134.1
##
## Number of Fisher Scoring iterations: 9
calculate VIF scores to measure the effects of collinearity as well as variable importance:
print('VIF model3')## [1] "VIF model3"
VIF(model3)## zn indus chas nox age dis rad tax
## 1.989664 2.702260 1.256644 4.859144 2.018329 4.508705 2.104971 2.245591
## ptratio medv
## 2.327607 3.253767
we apply Stepwise AIC on Model #2 (our model with transformed features) to pick which features were most relevant.
model4 <- model2 %>% stepAIC(trace = FALSE)## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model4)##
## Call:
## glm(formula = target ~ zn_transform + nox_transform + age_transform +
## dis_transform + rad + tax + ptratio_transform + medv, family = binomial,
## data = cldftraining)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.798986 2.917739 3.358 0.000784 ***
## zn_transform -0.644205 0.319078 -2.019 0.043492 *
## nox_transform 16.352106 3.045619 5.369 7.91e-08 ***
## age_transform -1.453228 0.325122 -4.470 7.83e-06 ***
## dis_transform 4.036422 1.281617 3.149 0.001636 **
## rad 1.021022 0.230145 4.436 9.15e-06 ***
## tax -0.007794 0.003903 -1.997 0.045819 *
## ptratio_transform -2.478915 0.710827 -3.487 0.000488 ***
## medv 0.195979 0.055413 3.537 0.000405 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 516.63 on 372 degrees of freedom
## Residual deviance: 108.13 on 364 degrees of freedom
## AIC: 126.13
##
## Number of Fisher Scoring iterations: 9
calculate VIF scores to measure the effects of collinearity as well as variable importance:
print('VIF model4')## [1] "VIF model4"
VIF(model4)## zn_transform nox_transform age_transform dis_transform
## 1.570677 3.366213 1.725178 3.408731
## rad tax ptratio_transform medv
## 2.000213 2.087589 2.164370 2.679912
Throughout our model-building process, we noticed that many of our model outputs yielded a few coefficient values that seemed to contradict our initial estimates. For instance, in Model #1:
Positive values for coefficients that we’d expect to be negative
This is a trend we saw throughout the four models that we built, although Models #3 and Model #4 were able to adjust for this better than our first two models – we can likely attribute this phenomenon to multicollinearity. Since we noticed in our initial data exploration that many variables in the dataset were highly correlated with one another (i.e. medv and zn, nox and indus, tax and rad), this phenomenon likely is increasing the variance of the coefficient estimates, making them difficult to interpret (and in some cases such as the features listed above, they are switching the signs). This was also supported by our Variance Inflation Factor (VIF) tests, which showed high values for features such as [medv and rn]. In our final models (Models #3 and #4), we made sure to keep this in mind in order to get a better handle on our coefficients and reduce multicollinearity – mainly, we removed certain variables that had high VIF scores through our stepwise selection process.
For the binary logistic regression model, use a metric such as log-likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.
Model #1 Confusion Matrix:
model1_glm_pred = ifelse(predict(model1, type = "link") > 0.5, "Yes", "No")
cldftesting$model1 <- ifelse(predict.glm(model1, cldftesting, "response") >= 0.5, 1, 0)
cm1 <- confusionMatrix(factor(cldftesting$model1), factor(cldftesting$target), "1")
results <- tibble(Model = "Model #1", Accuracy=cm1$byClass[11], F1 = cm1$byClass[7],
Deviance= model1$deviance,
R2 = 1 - model1$deviance / model1$null.deviance,
AIC= model1$aic)
cm1## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 42 11
## 1 2 38
##
## Accuracy : 0.8602
## 95% CI : (0.7728, 0.9234)
## No Information Rate : 0.5269
## P-Value [Acc > NIR] : 1.023e-11
##
## Kappa : 0.7225
##
## Mcnemar's Test P-Value : 0.0265
##
## Sensitivity : 0.7755
## Specificity : 0.9545
## Pos Pred Value : 0.9500
## Neg Pred Value : 0.7925
## Prevalence : 0.5269
## Detection Rate : 0.4086
## Detection Prevalence : 0.4301
## Balanced Accuracy : 0.8650
##
## 'Positive' Class : 1
##
Model #2 Confusion Matrix:
model2_glm_pred = ifelse(predict(model2, type = "link") > 0.5, "Yes", "No")
cldftesting$model2 <- ifelse(predict.glm(model2, cldftesting, "response") >= 0.5, 1, 0)
cm2 <- confusionMatrix(factor(cldftesting$model2), factor(cldftesting$target), "1")
results <- rbind(results, tibble(Model = "Model #2", Accuracy=cm2$byClass[11], F1 = cm2$byClass[7],
Deviance= model2$deviance,
R2 = 1 - model2$deviance / model2$null.deviance,
AIC= model2$aic))
cm2## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 40 11
## 1 4 38
##
## Accuracy : 0.8387
## 95% CI : (0.748, 0.9068)
## No Information Rate : 0.5269
## P-Value [Acc > NIR] : 2.581e-10
##
## Kappa : 0.6791
##
## Mcnemar's Test P-Value : 0.1213
##
## Sensitivity : 0.7755
## Specificity : 0.9091
## Pos Pred Value : 0.9048
## Neg Pred Value : 0.7843
## Prevalence : 0.5269
## Detection Rate : 0.4086
## Detection Prevalence : 0.4516
## Balanced Accuracy : 0.8423
##
## 'Positive' Class : 1
##
Model #3 Confusion Matrix:
model3_glm_pred = ifelse(predict(model3, type = "link") > 0.5, "Yes", "No")
cldftesting$model3 <- ifelse(predict.glm(model3, cldftesting,"response") >= 0.5, 1, 0)
cm3 <- confusionMatrix(factor(cldftesting$model3), factor(cldftesting$target), "1")
results <- rbind(results, tibble(Model = "Model #3", Accuracy=cm3$byClass[11], F1 = cm3$byClass[7],
Deviance=model3$deviance,
R2 = 1 - model3$deviance / model3$null.deviance,
AIC=model3$aic))
cm3## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 42 12
## 1 2 37
##
## Accuracy : 0.8495
## 95% CI : (0.7603, 0.9152)
## No Information Rate : 0.5269
## P-Value [Acc > NIR] : 5.349e-11
##
## Kappa : 0.7015
##
## Mcnemar's Test P-Value : 0.01616
##
## Sensitivity : 0.7551
## Specificity : 0.9545
## Pos Pred Value : 0.9487
## Neg Pred Value : 0.7778
## Prevalence : 0.5269
## Detection Rate : 0.3978
## Detection Prevalence : 0.4194
## Balanced Accuracy : 0.8548
##
## 'Positive' Class : 1
##
Model #4 Confusion Matrix:
model4_glm_pred = ifelse(predict(model4, type = "link") > 0.5, "Yes", "No")
cldftesting$model4 <- ifelse(predict.glm(model4, cldftesting,"response") >= 0.5, 1, 0)
cm4 <- confusionMatrix(factor(cldftesting$model4), factor(cldftesting$target), "1")
results <- rbind(results, tibble(Model = "Model #4", Accuracy=cm4$byClass[11], F1 = cm4$byClass[7],
Deviance=model4$deviance,
R2 = 1 - model4$deviance / model4$null.deviance,
AIC=model4$aic))
cm4## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 40 11
## 1 4 38
##
## Accuracy : 0.8387
## 95% CI : (0.748, 0.9068)
## No Information Rate : 0.5269
## P-Value [Acc > NIR] : 2.581e-10
##
## Kappa : 0.6791
##
## Mcnemar's Test P-Value : 0.1213
##
## Sensitivity : 0.7755
## Specificity : 0.9091
## Pos Pred Value : 0.9048
## Neg Pred Value : 0.7843
## Prevalence : 0.5269
## Detection Rate : 0.4086
## Detection Prevalence : 0.4516
## Balanced Accuracy : 0.8423
##
## 'Positive' Class : 1
##
A Comparison of ROC Curves for each model, Model #1 … Model #4:
print('Model #1 ROC Curve')## [1] "Model #1 ROC Curve"
roc(cldftesting[["target"]], cldftesting[["model1"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = cldftesting[["target"]], predictor = cldftesting[["model1"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
##
## Data: cldftesting[["model1"]] in 44 controls (cldftesting[["target"]] 0) < 49 cases (cldftesting[["target"]] 1).
## Area under the curve: 0.865
print('Model #2 ROC Curve')## [1] "Model #2 ROC Curve"
roc(cldftesting[["target"]], cldftesting[["model2"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = cldftesting[["target"]], predictor = cldftesting[["model2"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
##
## Data: cldftesting[["model2"]] in 44 controls (cldftesting[["target"]] 0) < 49 cases (cldftesting[["target"]] 1).
## Area under the curve: 0.8423
print('Model #3 ROC Curve')## [1] "Model #3 ROC Curve"
roc(cldftesting[["target"]], cldftesting[["model3"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = cldftesting[["target"]], predictor = cldftesting[["model3"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
##
## Data: cldftesting[["model3"]] in 44 controls (cldftesting[["target"]] 0) < 49 cases (cldftesting[["target"]] 1).
## Area under the curve: 0.8548
print('Model #4 ROC Curve')## [1] "Model #4 ROC Curve"
roc(cldftesting[["target"]], cldftesting[["model4"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = cldftesting[["target"]], predictor = cldftesting[["model4"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
##
## Data: cldftesting[["model4"]] in 44 controls (cldftesting[["target"]] 0) < 49 cases (cldftesting[["target"]] 1).
## Area under the curve: 0.8423
df_temp_eval <- data.frame(matrix(ncol = 1, nrow = length(df_eval$medv)))
# performed boxcox transformation after identifying proper lambda
df_temp_eval$rm <- df_eval$rm
rm_lambda <- BoxCox.lambda(df_eval$rm)
df_temp_eval$rm_transform <- BoxCox(df_eval$rm, rm_lambda)
df_temp_eval$nox <- df_eval$nox
nox_lambda <- BoxCox.lambda(df_eval$nox)
df_temp_eval$nox_transform <- BoxCox(df_eval$nox, nox_lambda)
df_temp_eval$dis <- df_eval$dis
df_temp_eval$dis_transform <- log(df_eval$dis)
df_temp_eval$zn <- df_eval$zn
df_temp_eval$zn_transform <- log(df_eval$zn+1)
df_temp_eval$lstat <- df_eval$lstat
df_temp_eval$lstat_transform <- log(df_eval$lstat)
df_temp_eval$age <- df_eval$age
df_temp_eval$age_transform <- log(max(df_eval$age) + 1 - df_eval$age)
df_temp_eval$ptratio <- df_eval$ptratio
df_temp_eval$ptratio_transform <- log(max(df_eval$ptratio) + 1 - df_eval$ptratio)
df_temp_eval <- df_temp_eval[, 2:15]df_eval <- data.frame(cbind(df_eval,
rm_transform = df_temp_eval$rm_transform,
nox_transform = df_temp_eval$nox_transform,
dis_transform = df_temp_eval$dis_transform,
zn_transform = df_temp_eval$zn_transform,
lstat_transform = df_temp_eval$lstat_transform,
age_transform = df_temp_eval$age_transform,
ptratio_transform = df_temp_eval$ptratio_transform
))
is.na(df_eval) <- sapply(df_eval, is.infinite)The following table discusses each of the model performance metrics on the training dataset. These values indicate there is a minor improvement in model performance after applying transformations and selecting for significant parameters.
Based on all of our models we gave the edge to* Model #2* and Model #4, given that they fit the data (Higher \(R^2\), lower AIC, lower deviance) better than their counterparts.
Ultimately we chose Model #4 as all thing equal (which they shared accuracy, F-statistic, and McFaden R-squared), it had a lower AIC, and therefore was a higher quality model.
We apply Model #4 to the holdout evaluation set to predict the targets.
eval_data <- df_eval %>% select(c(zn_transform, indus, chas, nox_transform, rm_transform, age_transform, dis_transform, rad, tax, ptratio_transform, lstat_transform, medv))
predictions <- ifelse(predict(model4, eval_data, type = "link") > 0.5, 1, 0)
df_eval['target'] <- predictions
predictions## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 0 1 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 1
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 0 1 1 1 1 1 1 1 1 1 1 1 1 1