Data 621 Fall2023: Homework 3 (Group1)

Instructions

Overview

In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

zn: the proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)
indus: the proportion of non-retail business acres per suburb (predictor variable)
chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable)
nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
rm: average number of rooms per dwelling (predictor variable)
age: the proportion of owner-occupied units built prior to 1940 (predictor variable)
dis: weighted mean of distances to five Boston employment centers (predictor variable)
rad: index of accessibility to radial highways (predictor variable)
tax: full-value property-tax rate per $10,000 (predictor variable)
ptratio: pupil-teacher ratio by town (predictor variable)
lstat: lower status of the population (percent) (predictor variable)
medv: median value of owner-occupied homes in $1000s (predictor variable)
target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)

Deliverables

A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details.
Assigned prediction (probabilities, classifications) for the evaluation data set. Use a 0.5 threshold.
Include your R statistical programming code in an Appendix.

Introduction

Crime is a common concern in large cities, and denizens, policymakers, and law enforcement are interested in identifying locations where crime can or might occur. In this assignment, we are given a dataset for the city of Boston and will build a model to identify regions that might have higher or lower (against the median) crime. Since the goal is a binary (above or below median) response, this assignment will employ binary logistic regression.

1. Data Exploration

Describe the size and the variables in the crime training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

Dataset

It contains both a categorical target and provides some categorical features. The fact that we are provided with a categorical target turns the task from a regression task to a classification task, perfect for logistic regression. The data contains 12 features and 466 instances, with 40 instances reserved for an evaluation dataset with targets removed.

There are two files provided:

crime-training-data_modified.csv - hold data from training a model
crime-evaluation-data_modified.csv - holdout data used for evaluation.

# Load crime dataset
df <- read.csv('https://raw.githubusercontent.com/jayatveluri/Data621/main/crime-training-data_modified.csv')
df_eval <- read.csv('https://raw.githubusercontent.com/jayatveluri/Data621/main/crime-evaluation-data_modified.csv')

Summary Statistics

summary(df)

##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

The first observation is that we have no missing data (coded as NA’s).

we have some highly skewed features(where means are far from median, examples variables zn and tax) indicating a skewed distribution. Categorical variablechas` is quite imbalanced, as over 75% of values are 0.

Check Class Bias

We have two target values, 0 and 1. When building models, we ideally want an equal representation of both classes. As class imbalance deviates, model performance will suffer form effects of differential variance between the classes and bias towards picking the more represented class.

class_proportion = colMeans(df['target'])
class_proportion

##    target 
## 0.4914163

The classes are quite balanced, with approximately 51% 0’s and 49% 1’s

Distributions

# Prepare data for ggplot
gdata_df <- df %>% dplyr::select(-target) %>%
  gather(key = 'variable', value = 'value')

# Histogram plots of each variable
ggplot(gdata_df) + 
  geom_histogram(aes(x=value, y = ..density..), bins=30) + 
  geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~variable, scales='free', ncol=3)

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Boxplots

In addition to creating histogram distributions, we use box-plots to get an idea of the spread of each variable

# Prepare data for ggplot
gdata_df <- df %>% dplyr::select(-target) %>%
  gather(key = 'variable', value = 'value')

# Boxplots for each variable
ggplot(gdata_df, aes(variable, value)) + 
  geom_boxplot() + 
  facet_wrap(. ~variable, scales='free', ncol=6)

### Variable Plots

Generate scatter plots of each variable versus the target variable to get an idea of the relationship between them.

# Plot scatter plots of each variable versus the target variable
featurePlot(df[,1:ncol(df)-1], df[,ncol(df)], pch = 20)

Missing Data

we see no NAs, or missing data in the provided training dataset.

missingdata <- colSums(df %>% sapply(is.na))
missingdata_pct <- round(missingdata / nrow(df) * 100, 2)
stack(sort(missingdata_pct, decreasing = TRUE))

##    values     ind
## 1       0      zn
## 2       0   indus
## 3       0    chas
## 4       0     nox
## 5       0      rm
## 6       0     age
## 7       0     dis
## 8       0     rad
## 9       0     tax
## 10      0 ptratio
## 11      0   lstat
## 12      0    medv
## 13      0  target

clean_df <- df

Target Correlations

stack(sort(cor(df[,13], df[,1:ncol(df)-1])[,], decreasing=TRUE))

##         values     ind
## 1   0.72610622     nox
## 2   0.63010625     age
## 3   0.62810492     rad
## 4   0.61111331     tax
## 5   0.60485074   indus
## 6   0.46912702   lstat
## 7   0.25084892 ptratio
## 8   0.08004187    chas
## 9  -0.15255334      rm
## 10 -0.27055071    medv
## 11 -0.43168176      zn
## 12 -0.61867312     dis

nox, age, rad, tax, and indus have the highest correlation (positive) with target

correlation = cor(clean_df, use = 'pairwise.complete.obs')

corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
         col=brewer.pal(n=8, name="RdYlBu"))

We can see that some variables are highly correlated with one another, such as tax and rad, with a correlation between .75 and 1. When we start considering features for our models, we’ll need to account for the correlations between features and avoid including pairs with strong correlations.

2. Data Preparation

Summarize our data preparation and exploration

Removed Fields

All the predictor variables have no missing values and show no indication of incomplete or incorrect data. As such, we have kept all the fields.

Missing Values

The data had no missing values to remove.

Outliers

No outliers were removed as all values seemed reasonable.

Transform non-normal variables

From our histogram plots, we can see that some of our variables are highly skewed. To address this, we decided to perform some transformations to make them more normally distributed. Here are some plots to demonstrate the changes in distributions before and after the transformations.

# created empty data frame to store transformed variables
df_temp <- data.frame(matrix(ncol = 1, nrow = length(clean_df$target)))

# performed boxcox transformation after identifying proper lambda
df_temp$rm <- clean_df$rm
rm_lambda <- BoxCox.lambda(clean_df$rm)
df_temp$rm_transform <- BoxCox(clean_df$rm, rm_lambda)

df_temp$nox <- clean_df$nox
nox_lambda <- BoxCox.lambda(clean_df$nox)
df_temp$nox_transform <- BoxCox(clean_df$nox, nox_lambda)

df_temp$dis <- clean_df$dis
df_temp$dis_transform <- log(clean_df$dis)

df_temp$zn <- clean_df$zn
df_temp$zn_transform <- log(clean_df$zn+1)

df_temp$lstat <- clean_df$lstat
df_temp$lstat_transform <- log(clean_df$lstat)

df_temp$age <- clean_df$age
df_temp$age_transform <- log(max(clean_df$age) + 1 - clean_df$age)

df_temp$ptratio <- clean_df$ptratio
df_temp$ptratio_transform <- log(max(clean_df$ptratio) + 1 - clean_df$ptratio)

df_temp <- df_temp[, 2:15]

Finalizing the dataset for model building

we can now add these into our cl_df dataframe and continue on to build our models.

cl_df <- data.frame(cbind(clean_df,
                        rm_transform = df_temp$rm_transform,
                        nox_transform = df_temp$nox_transform,
                        dis_transform = df_temp$dis_transform,
                        zn_transform = df_temp$zn_transform,
                        lstat_transform = df_temp$lstat_transform,
                        age_transform = df_temp$age_transform,
                        ptratio_transform = df_temp$ptratio_transform
                        ))

is.na(cl_df) <- sapply(cl_df, is.infinite)

3. Build Models

Using the training data, build at least three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done. Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter-intuitive? Why? The boss needs to know.

Model-building methodology

With Our Dataset cleaned we will build our logistic regression models.

First, we decided to split our cleaned dataset into a training and testing set (80% training, 20% testing). This was necessary as the provided holdout evaluation dataset doesn’t provide target values so we cannot measure our model performance against that dataset.

set.seed(123456)
# utilizing one dataset for all four models
cldfdataTrain <- createDataPartition(cl_df$target, p=0.8, list=FALSE)
cldftraining <- cl_df[cldfdataTrain,]
cldftesting <- cl_df[-cldfdataTrain,]

Model#1 (non-transformed features)

Using our training dataset, we decided to run a binary logistic regression model that included all non-transformed features.

model1 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, data = cldftraining , family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model1)

## 
## Call:
## glm(formula = target ~ zn + indus + chas + nox + rm + age + dis + 
##     rad + tax + ptratio + lstat + medv, family = binomial, data = cldftraining)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -61.876805  11.104609  -5.572 2.52e-08 ***
## zn           -0.097157   0.054388  -1.786 0.074042 .  
## indus        -0.092162   0.064947  -1.419 0.155888    
## chas          1.442852   1.001954   1.440 0.149856    
## nox          68.689102  13.055167   5.261 1.43e-07 ***
## rm           -1.111323   1.057127  -1.051 0.293136    
## age           0.085584   0.022193   3.856 0.000115 ***
## dis           1.258497   0.374901   3.357 0.000788 ***
## rad           0.900122   0.239910   3.752 0.000175 ***
## tax          -0.006122   0.004098  -1.494 0.135217    
## ptratio       0.665643   0.198810   3.348 0.000814 ***
## lstat        -0.001007   0.068152  -0.015 0.988210    
## medv          0.293953   0.103956   2.828 0.004689 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.63  on 372  degrees of freedom
## Residual deviance: 110.68  on 360  degrees of freedom
## AIC: 136.68
## 
## Number of Fisher Scoring iterations: 9

calculate VIF scores to measure the effects of collinearity as well as variable importance:

print('VIF model1')

## [1] "VIF model1"

VIF(model1)

##        zn     indus      chas       nox        rm       age       dis       rad 
##  2.055550  2.683933  1.353039  5.573427  7.313187  3.189662  4.971248  2.424854 
##       tax   ptratio     lstat      medv 
##  2.289799  3.203721  2.635193 11.683660

Model#2 (Transformed Features)

Model #2 using our transformed features.

model2 <- glm(target ~ zn_transform + indus + chas + nox_transform + rm_transform + age_transform + dis_transform + rad + tax + ptratio_transform + lstat_transform + medv, data = cldftraining , family = binomial)
summary(model2)

## 
## Call:
## glm(formula = target ~ zn_transform + indus + chas + nox_transform + 
##     rm_transform + age_transform + dis_transform + rad + tax + 
##     ptratio_transform + lstat_transform + medv, family = binomial, 
##     data = cldftraining)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       19.648555  11.227065   1.750 0.080100 .  
## zn_transform      -0.481877   0.337480  -1.428 0.153330    
## indus             -0.026942   0.068561  -0.393 0.694340    
## chas               1.226179   0.994598   1.233 0.217636    
## nox_transform     17.439569   3.229450   5.400 6.66e-08 ***
## rm_transform      -2.864549   3.657480  -0.783 0.433508    
## age_transform     -1.533715   0.390108  -3.932 8.44e-05 ***
## dis_transform      3.951566   1.330350   2.970 0.002975 ** 
## rad                0.986357   0.261550   3.771 0.000162 ***
## tax               -0.006789   0.004322  -1.571 0.116223    
## ptratio_transform -2.828343   0.802360  -3.525 0.000423 ***
## lstat_transform   -0.754703   1.008646  -0.748 0.454319    
## medv               0.219507   0.086708   2.532 0.011355 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.63  on 372  degrees of freedom
## Residual deviance: 105.74  on 360  degrees of freedom
## AIC: 131.74
## 
## Number of Fisher Scoring iterations: 9

calculate VIF scores to measure the effects of collinearity as well as variable importance:

print('VIF model2')

## [1] "VIF model2"

VIF(model2)

##      zn_transform             indus              chas     nox_transform 
##          1.783127          2.722927          1.362322          3.633686 
##      rm_transform     age_transform     dis_transform               rad 
##          5.551059          2.350723          3.520485          2.531822 
##               tax ptratio_transform   lstat_transform              medv 
##          2.506246          2.755768          3.769478          6.851934

Model#3 (Stepwise-AIC on Model #1)

We use a Stepwise AIC on Model #1 (model with non-transformed features) to pick which features were most relevant.

model3 <- model1 %>% stepAIC(trace = FALSE)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model3)

## 
## Call:
## glm(formula = target ~ zn + indus + chas + nox + age + dis + 
##     rad + tax + ptratio + medv, family = binomial, data = cldftraining)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -61.579380  10.924258  -5.637 1.73e-08 ***
## zn           -0.099274   0.053491  -1.856 0.063470 .  
## indus        -0.087984   0.064947  -1.355 0.175514    
## chas          1.580819   1.004001   1.575 0.115368    
## nox          65.295329  12.193437   5.355 8.56e-08 ***
## age           0.074673   0.017516   4.263 2.02e-05 ***
## dis           1.170570   0.353153   3.315 0.000918 ***
## rad           0.820831   0.225401   3.642 0.000271 ***
## tax          -0.005837   0.004020  -1.452 0.146532    
## ptratio       0.570151   0.168084   3.392 0.000694 ***
## medv          0.196640   0.055583   3.538 0.000404 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.63  on 372  degrees of freedom
## Residual deviance: 112.10  on 362  degrees of freedom
## AIC: 134.1
## 
## Number of Fisher Scoring iterations: 9

calculate VIF scores to measure the effects of collinearity as well as variable importance:

print('VIF model3')

## [1] "VIF model3"

VIF(model3)

##       zn    indus     chas      nox      age      dis      rad      tax 
## 1.989664 2.702260 1.256644 4.859144 2.018329 4.508705 2.104971 2.245591 
##  ptratio     medv 
## 2.327607 3.253767

Model#4 (Stepwise-AIC on Model #2)

we apply Stepwise AIC on Model #2 (our model with transformed features) to pick which features were most relevant.

model4 <- model2 %>% stepAIC(trace = FALSE)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model4)

## 
## Call:
## glm(formula = target ~ zn_transform + nox_transform + age_transform + 
##     dis_transform + rad + tax + ptratio_transform + medv, family = binomial, 
##     data = cldftraining)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        9.798986   2.917739   3.358 0.000784 ***
## zn_transform      -0.644205   0.319078  -2.019 0.043492 *  
## nox_transform     16.352106   3.045619   5.369 7.91e-08 ***
## age_transform     -1.453228   0.325122  -4.470 7.83e-06 ***
## dis_transform      4.036422   1.281617   3.149 0.001636 ** 
## rad                1.021022   0.230145   4.436 9.15e-06 ***
## tax               -0.007794   0.003903  -1.997 0.045819 *  
## ptratio_transform -2.478915   0.710827  -3.487 0.000488 ***
## medv               0.195979   0.055413   3.537 0.000405 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.63  on 372  degrees of freedom
## Residual deviance: 108.13  on 364  degrees of freedom
## AIC: 126.13
## 
## Number of Fisher Scoring iterations: 9

calculate VIF scores to measure the effects of collinearity as well as variable importance:

print('VIF model4')

## [1] "VIF model4"

VIF(model4)

##      zn_transform     nox_transform     age_transform     dis_transform 
##          1.570677          3.366213          1.725178          3.408731 
##               rad               tax ptratio_transform              medv 
##          2.000213          2.087589          2.164370          2.679912

Examining our model coefficients

Throughout our model-building process, we noticed that many of our model outputs yielded a few coefficient values that seemed to contradict our initial estimates. For instance, in Model #1:

Positive values for coefficients that we’d expect to be negative

age - logically we thought that the higher the proportion of owner-occupied units built prior to 1940 would lead to a lower crime rate (historic homes with more property value)
dis - logically we thought that the higher the weighted mean value of distance to five Boston employment centers, the lower the crime rate (areas farther from employment centers indicated there is less of a need for them – higher rates of employment)
medv - logically we thought the higher the median value of owner-occupied homes in $1000s would lead to a lower crime rate

This is a trend we saw throughout the four models that we built, although Models #3 and Model #4 were able to adjust for this better than our first two models – we can likely attribute this phenomenon to multicollinearity. Since we noticed in our initial data exploration that many variables in the dataset were highly correlated with one another (i.e. medv and zn, nox and indus, tax and rad), this phenomenon likely is increasing the variance of the coefficient estimates, making them difficult to interpret (and in some cases such as the features listed above, they are switching the signs). This was also supported by our Variance Inflation Factor (VIF) tests, which showed high values for features such as [medv and rn]. In our final models (Models #3 and #4), we made sure to keep this in mind in order to get a better handle on our coefficients and reduce multicollinearity – mainly, we removed certain variables that had high VIF scores through our stepwise selection process.

4. Model Selection & Analysis

For the binary logistic regression model, use a metric such as log-likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.

Confusion Matrices

Model #1 Confusion Matrix:

model1_glm_pred = ifelse(predict(model1, type = "link") > 0.5, "Yes", "No")

cldftesting$model1 <- ifelse(predict.glm(model1, cldftesting, "response") >= 0.5, 1, 0)
cm1 <- confusionMatrix(factor(cldftesting$model1), factor(cldftesting$target), "1")
results <- tibble(Model = "Model #1", Accuracy=cm1$byClass[11], F1 = cm1$byClass[7],
                  Deviance= model1$deviance, 
                  R2 = 1 - model1$deviance / model1$null.deviance,
                  AIC= model1$aic)
cm1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 42 11
##          1  2 38
##                                           
##                Accuracy : 0.8602          
##                  95% CI : (0.7728, 0.9234)
##     No Information Rate : 0.5269          
##     P-Value [Acc > NIR] : 1.023e-11       
##                                           
##                   Kappa : 0.7225          
##                                           
##  Mcnemar's Test P-Value : 0.0265          
##                                           
##             Sensitivity : 0.7755          
##             Specificity : 0.9545          
##          Pos Pred Value : 0.9500          
##          Neg Pred Value : 0.7925          
##              Prevalence : 0.5269          
##          Detection Rate : 0.4086          
##    Detection Prevalence : 0.4301          
##       Balanced Accuracy : 0.8650          
##                                           
##        'Positive' Class : 1               
##

Model #2 Confusion Matrix:

model2_glm_pred = ifelse(predict(model2, type = "link") > 0.5, "Yes", "No")

cldftesting$model2 <- ifelse(predict.glm(model2, cldftesting, "response") >= 0.5, 1, 0)
cm2 <- confusionMatrix(factor(cldftesting$model2), factor(cldftesting$target), "1")
results <- rbind(results, tibble(Model = "Model #2", Accuracy=cm2$byClass[11], F1 = cm2$byClass[7],
                  Deviance= model2$deviance, 
                  R2 = 1 - model2$deviance / model2$null.deviance,
                  AIC= model2$aic))
cm2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40 11
##          1  4 38
##                                          
##                Accuracy : 0.8387         
##                  95% CI : (0.748, 0.9068)
##     No Information Rate : 0.5269         
##     P-Value [Acc > NIR] : 2.581e-10      
##                                          
##                   Kappa : 0.6791         
##                                          
##  Mcnemar's Test P-Value : 0.1213         
##                                          
##             Sensitivity : 0.7755         
##             Specificity : 0.9091         
##          Pos Pred Value : 0.9048         
##          Neg Pred Value : 0.7843         
##              Prevalence : 0.5269         
##          Detection Rate : 0.4086         
##    Detection Prevalence : 0.4516         
##       Balanced Accuracy : 0.8423         
##                                          
##        'Positive' Class : 1              
##

Model #3 Confusion Matrix:

model3_glm_pred = ifelse(predict(model3, type = "link") > 0.5, "Yes", "No")

cldftesting$model3 <- ifelse(predict.glm(model3, cldftesting,"response") >= 0.5, 1, 0)
cm3 <- confusionMatrix(factor(cldftesting$model3), factor(cldftesting$target), "1")
results <- rbind(results, tibble(Model = "Model #3", Accuracy=cm3$byClass[11], F1 = cm3$byClass[7],
                  Deviance=model3$deviance, 
                  R2 = 1 - model3$deviance / model3$null.deviance,
                  AIC=model3$aic))
cm3

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 42 12
##          1  2 37
##                                           
##                Accuracy : 0.8495          
##                  95% CI : (0.7603, 0.9152)
##     No Information Rate : 0.5269          
##     P-Value [Acc > NIR] : 5.349e-11       
##                                           
##                   Kappa : 0.7015          
##                                           
##  Mcnemar's Test P-Value : 0.01616         
##                                           
##             Sensitivity : 0.7551          
##             Specificity : 0.9545          
##          Pos Pred Value : 0.9487          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.5269          
##          Detection Rate : 0.3978          
##    Detection Prevalence : 0.4194          
##       Balanced Accuracy : 0.8548          
##                                           
##        'Positive' Class : 1               
##

Model #4 Confusion Matrix:

model4_glm_pred = ifelse(predict(model4, type = "link") > 0.5, "Yes", "No")

cldftesting$model4 <- ifelse(predict.glm(model4, cldftesting,"response") >= 0.5, 1, 0)
cm4 <- confusionMatrix(factor(cldftesting$model4), factor(cldftesting$target), "1")
results <- rbind(results, tibble(Model = "Model #4", Accuracy=cm4$byClass[11], F1 = cm4$byClass[7],
                  Deviance=model4$deviance, 
                  R2 = 1 - model4$deviance / model4$null.deviance,
                  AIC=model4$aic))
cm4

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40 11
##          1  4 38
##                                          
##                Accuracy : 0.8387         
##                  95% CI : (0.748, 0.9068)
##     No Information Rate : 0.5269         
##     P-Value [Acc > NIR] : 2.581e-10      
##                                          
##                   Kappa : 0.6791         
##                                          
##  Mcnemar's Test P-Value : 0.1213         
##                                          
##             Sensitivity : 0.7755         
##             Specificity : 0.9091         
##          Pos Pred Value : 0.9048         
##          Neg Pred Value : 0.7843         
##              Prevalence : 0.5269         
##          Detection Rate : 0.4086         
##    Detection Prevalence : 0.4516         
##       Balanced Accuracy : 0.8423         
##                                          
##        'Positive' Class : 1              
##

ROC Curves

A Comparison of ROC Curves for each model, Model #1 … Model #4:

print('Model #1 ROC Curve')

## [1] "Model #1 ROC Curve"

roc(cldftesting[["target"]], cldftesting[["model1"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## 
## Call:
## roc.default(response = cldftesting[["target"]], predictor = cldftesting[["model1"]],     plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
## 
## Data: cldftesting[["model1"]] in 44 controls (cldftesting[["target"]] 0) < 49 cases (cldftesting[["target"]] 1).
## Area under the curve: 0.865

print('Model #2 ROC Curve')

## [1] "Model #2 ROC Curve"

roc(cldftesting[["target"]], cldftesting[["model2"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = cldftesting[["target"]], predictor = cldftesting[["model2"]],     plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
## 
## Data: cldftesting[["model2"]] in 44 controls (cldftesting[["target"]] 0) < 49 cases (cldftesting[["target"]] 1).
## Area under the curve: 0.8423

print('Model #3 ROC Curve')

## [1] "Model #3 ROC Curve"

roc(cldftesting[["target"]], cldftesting[["model3"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = cldftesting[["target"]], predictor = cldftesting[["model3"]],     plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
## 
## Data: cldftesting[["model3"]] in 44 controls (cldftesting[["target"]] 0) < 49 cases (cldftesting[["target"]] 1).
## Area under the curve: 0.8548

print('Model #4 ROC Curve')

## [1] "Model #4 ROC Curve"

roc(cldftesting[["target"]], cldftesting[["model4"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = cldftesting[["target"]], predictor = cldftesting[["model4"]],     plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
## 
## Data: cldftesting[["model4"]] in 44 controls (cldftesting[["target"]] 0) < 49 cases (cldftesting[["target"]] 1).
## Area under the curve: 0.8423

created empty data frame to store transformed variables

df_temp_eval <- data.frame(matrix(ncol = 1, nrow = length(df_eval$medv)))

# performed boxcox transformation after identifying proper lambda
df_temp_eval$rm <- df_eval$rm
rm_lambda <- BoxCox.lambda(df_eval$rm)
df_temp_eval$rm_transform <- BoxCox(df_eval$rm, rm_lambda)

df_temp_eval$nox <- df_eval$nox
nox_lambda <- BoxCox.lambda(df_eval$nox)
df_temp_eval$nox_transform <- BoxCox(df_eval$nox, nox_lambda)

df_temp_eval$dis <- df_eval$dis
df_temp_eval$dis_transform <- log(df_eval$dis)

df_temp_eval$zn <- df_eval$zn
df_temp_eval$zn_transform <- log(df_eval$zn+1)

df_temp_eval$lstat <- df_eval$lstat
df_temp_eval$lstat_transform <- log(df_eval$lstat)

df_temp_eval$age <- df_eval$age
df_temp_eval$age_transform <- log(max(df_eval$age) + 1 - df_eval$age)

df_temp_eval$ptratio <- df_eval$ptratio
df_temp_eval$ptratio_transform <- log(max(df_eval$ptratio) + 1 - df_eval$ptratio)

df_temp_eval <- df_temp_eval[, 2:15]

Build dataframe with transformation

df_eval <- data.frame(cbind(df_eval, 
                        rm_transform = df_temp_eval$rm_transform,
                        nox_transform = df_temp_eval$nox_transform,
                        dis_transform = df_temp_eval$dis_transform,
                        zn_transform = df_temp_eval$zn_transform,
                        lstat_transform = df_temp_eval$lstat_transform,
                        age_transform = df_temp_eval$age_transform,
                        ptratio_transform = df_temp_eval$ptratio_transform
                        ))

is.na(df_eval) <- sapply(df_eval, is.infinite)

Model Performance Summary

The following table discusses each of the model performance metrics on the training dataset. These values indicate there is a minor improvement in model performance after applying transformations and selecting for significant parameters.

Model of choice

Based on all of our models we gave the edge to* Model #2* and Model #4, given that they fit the data (Higher $R^2$, lower AIC, lower deviance) better than their counterparts.

Ultimately we chose Model #4 as all thing equal (which they shared accuracy, F-statistic, and McFaden R-squared), it had a lower AIC, and therefore was a higher quality model.

Predictions

We apply Model #4 to the holdout evaluation set to predict the targets.

eval_data <- df_eval %>% select(c(zn_transform, indus, chas, nox_transform, rm_transform, age_transform, dis_transform, rad, tax, ptratio_transform, lstat_transform, medv))
predictions <- ifelse(predict(model4, eval_data, type = "link") > 0.5, 1, 0)
df_eval['target'] <- predictions
predictions

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  0  1  1  1  0  0  0  0  0  0  0  0  1  1  1  0  0  1  0  0  0  0  0  0  0  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
##  0  1  1  1  1  1  1  1  1  1  1  1  1  1

Data 621 Fall2023: Homework 3 (Group1)

Crime Logistic Regression

Tage Singh, Jaya Veluri, Khyati NaIk, Mahmud Hasan Al RajI

Instructions

Overview

Deliverables

Introduction

1. Data Exploration

Dataset

Summary Statistics

Check Class Bias

Distributions

Boxplots

Missing Data

Target Correlations

2. Data Preparation

Removed Fields

Missing Values

Outliers

Transform non-normal variables

Finalizing the dataset for model building

3. Build Models

Model-building methodology

Model#1 (non-transformed features)

Model#2 (Transformed Features)

Model#3 (Stepwise-AIC on Model #1)

Model#4 (Stepwise-AIC on Model #2)

Examining our model coefficients

4. Model Selection & Analysis

Confusion Matrices

ROC Curves

created empty data frame to store transformed variables

Build dataframe with transformation

Model Performance Summary

Model of choice

Predictions