Github Link: https://github.com/asmozo24/Data62 Web link:
In this homework assignment, we explore, analyze, and model a data set containing crime information for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0). This assignment is about building a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. We then provide classifications and probabilities for the evaluation data set using the fitted model. Let’s recall the definition of binary logistic regression with R. According to R-bloggers (https://www.r-bloggers.com/2020/05/binary-logistic-regression-with-r/), “Binary Logistic Regression is used to explain the relationship between the categorical dependent variable and one or more independent variables.”
There are two datasets, crime-training-data_modified and crime-evaluation-data_modified, provided by the instructor, Nasrin Khansari. These are CSV files, and we used the R programming language to acquire the two datasets stored in a GitHub repository. The 13 variables of interest are all predictors except the variable called “target”, which is the response variable; they are defined below. The case study: the crime level for various neighborhoods of a major city.
Variable Name | Short Description |
---|---|
zn | proportion of residential land zoned for large lots (over 25000 square feet) |
indus | proportion of non-retail business acres per suburb |
chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) |
nox | nitrogen oxides concentration (parts per 10 million) |
rm | average number of rooms per dwelling |
age | proportion of owner-occupied units built prior to 1940 |
dis | weighted mean of distances to five Boston employment centers |
rad | index of accessibility to radial highways |
tax | full-value property-tax rate per $10,000 |
ptratio | pupil-teacher ratio by town |
lstat | lower status of the population (percent) |
medv | median value of owner-occupied homes in $1000s |
target | whether the crime rate is above the median crime rate (1) or not (0) |
1.1 Data Structure
The training dataset includes 466 observations and 13 variables. All values are numerical of type integer or double except the “chas” and response variables. The “chas” and response variables are categorical because they take only the binary values 0 and 1. We call the data frame “training_df”; it contains 12 predictors and one response variable.
Data overview
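The two CSV files were read into R from the GitHub repository. Below is a minimal sketch of that step; the raw-file URLs are placeholders, not the actual repository paths.

```r
# Read the two datasets from the GitHub repository (placeholder URLs)
training_df   <- read.csv("https://raw.githubusercontent.com/<user>/<repo>/main/crime-training-data_modified.csv")
evaluation_df <- read.csv("https://raw.githubusercontent.com/<user>/<repo>/main/crime-evaluation-data_modified.csv")

# Structure, first rows, and summary (the output shown below comes from calls like these)
str(training_df)
head(training_df)
summary(training_df)
```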
## 'data.frame': 466 obs. of 13 variables:
## $ zn : num 0 0 0 30 0 0 0 0 0 80 ...
## $ indus : num 19.58 19.58 18.1 4.93 2.46 ...
## $ chas : int 0 1 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
## $ rm : num 7.93 5.4 6.49 6.39 7.16 ...
## $ age : num 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
## $ dis : num 2.05 1.32 1.98 7.04 2.7 ...
## $ rad : int 5 5 24 6 3 5 24 24 5 1 ...
## $ tax : int 403 403 666 300 193 384 666 666 224 315 ...
## $ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
## $ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
## $ medv : num 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
## $ target : int 1 1 1 0 0 0 1 1 0 0 ...
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
2. Data Preparation
It is important to check for missing values before applying regression analysis because missing values can increase the error and add bias to the regression model. As we can see below, the dataset shows no missing values, which means it is ready for analysis. In addition, the variable ‘target’ (whether the crime rate is above the median crime rate (1) or not (0)) is a two-level response, so we set it as a factor, and we do the same for chas (even though it is not the response variable).
## The total of missing values is : 0
##
## The total of missing 'NA' is:
## character(0)
##
## Even more, the table below shows the total number of missing values per variable
## zn indus chas nox rm age dis rad tax ptratio
## 0 0 0 0 0 0 0 0 0 0
## lstat medv target
## 0 0 0
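A minimal sketch of the missing-value check and the factor conversion described above, assuming the training data frame is named training_df:

```r
# Total missing values and missing values per variable
sum(is.na(training_df))
colSums(is.na(training_df))

# Treat the two binary variables as factors
training_df$target <- as.factor(training_df$target)
training_df$chas   <- as.factor(training_df$chas)
```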
We want to take a look at how the data are distributed across all variables. The response variable (target) splits into two roughly equal groups, which makes sense because it only takes the two values 0 and 1. Aside from the average number of rooms per dwelling (rm), which is approximately normally distributed, the remaining variables are right- or left-skewed. Based on these density plots, we next look at how each predictor varies between the two levels of the response variable.
##
## Another way of looking at the data distribution for all 13 variables.
Among the boxplots (below), we can see that nox (nitrogen oxides concentration, parts per 10 million) has a noticeably higher median in the neighborhoods at risk for high crime levels. We also see that chas (a dummy variable for whether the suburb borders the Charles River (1) or not (0)) and zn (proportion of residential land zoned for large lots over 25,000 square feet) are of little use for this analysis since their medians are essentially zero.
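For reference, one way such boxplots could be produced is sketched below with ggplot2 and tidyr. This assumes target and chas have already been converted to factors; it is not necessarily the code behind the figures in this report.

```r
library(ggplot2)
library(tidyr)

# Reshape the numeric predictors to long format, keeping target for the grouping
long_df <- pivot_longer(training_df, cols = -c(target, chas),
                        names_to = "variable", values_to = "value")

# One boxplot per predictor, split by the two levels of target
ggplot(long_df, aes(x = target, y = value, fill = target)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")
```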
3. Building the Model
Model 1 accounts for all variables. To build model 1, we could split training_df into train (80%) and test (20%) sets using the createDataPartition() function. However, we have evaluation_df that we can use for prediction, so instead of partitioning or sampling we keep 0.912 of training_df1 for training and use evaluation_df for testing.
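The first model is fitted with glm() using the binomial family; train1 below is the training data named in the Call shown in the output, and logit_1 is the object name used later in the text.

```r
# Model 1: binary logistic regression on all 12 predictors
logit_1 <- glm(target ~ ., family = binomial(), data = train1)
summary(logit_1)
```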
##
## Call:
## glm(formula = target ~ ., family = binomial(), data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8464 -0.1445 -0.0017 0.0029 3.4665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## zn -0.065946 0.034656 -1.903 0.05706 .
## indus -0.064614 0.047622 -1.357 0.17485
## chas1 0.910765 0.755546 1.205 0.22803
## nox 49.122297 7.931706 6.193 5.90e-10 ***
## rm -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## rad 0.666366 0.163152 4.084 4.42e-05 ***
## tax -0.006171 0.002955 -2.089 0.03674 *
## ptratio 0.402566 0.126627 3.179 0.00148 **
## lstat 0.045869 0.054049 0.849 0.39608
## medv 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
Note that the chas variable appears as chas1 in the output: because chas was converted to a factor, R reports the coefficient for its level 1 relative to level 0. Looking at the coefficients, we can say that for every one-unit increase in nox, the log odds of being high risk (crime level) increase by 49.122297, and similarly for the other variables.
Most of the variables appear significant. A few variables (zn, chas, indus, rm and lstat) have higher p-values (p-value > 0.05); these should be removed since they are not significant in the model.
We can see from the logit_1 output that adding the 12 (465 - 453 = 12) independent variables decreased the deviance from 645.88 to 192.05, a reduction of 645.88 - 192.05 = 453.83 at a cost of 12 degrees of freedom.
The Fisher Scoring iterations indicate how many iterations were performed to fit model 1. The model converged after 9 iterations, which suggests a stable fit.
For the second model, we exclude the variables zn, chas, indus, rm and lstat for being non-significant. This is a backward selection: having started with all variables, we drop those that do not contribute to the model, i.e. those with high p-values (p-value > 0.05 at the 95% confidence level).
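A minimal sketch of model 2, assuming train2 is simply train1 with the five non-significant columns dropped (the exact construction of train2 is not shown in the report):

```r
# Model 2: drop zn, chas, indus, rm and lstat, then refit
train2  <- subset(train1, select = -c(zn, chas, indus, rm, lstat))
logit_2 <- glm(target ~ ., family = binomial(), data = train2)
summary(logit_2)
```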
##
## Call:
## glm(formula = target ~ ., family = binomial(), data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.01059 -0.19744 -0.01371 0.00402 3.06424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.824228 5.858405 -6.286 3.26e-10 ***
## nox 42.338378 6.639207 6.377 1.81e-10 ***
## age 0.031882 0.010693 2.982 0.002867 **
## dis 0.429555 0.171849 2.500 0.012433 *
## rad 0.701767 0.139426 5.033 4.82e-07 ***
## tax -0.008237 0.002534 -3.250 0.001153 **
## ptratio 0.376575 0.108912 3.458 0.000545 ***
## medv 0.093653 0.033556 2.791 0.005255 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 203.45 on 458 degrees of freedom
## AIC: 219.45
##
## Number of Fisher Scoring iterations: 9
Model 2 has fewer independent variables than the initial model. All the predictors are highly significant (p-value < 0.05). We also observe that the intercept increased while the nox coefficient decreased. So, for the nox variable, which has the lowest p-value and best explains the response variable, the contributions to the log odds are y2 = 42.338378·x - 36.824228 (model 2) and y1 = 49.122297·x - 40.822934 (model 1); at one unit of nox, model 2 gives 5.51415 while model 1 gives 8.299363 toward determining the neighborhood crime level.
For model 3, we try the stepwise method, since the model 2 output shows that all seven of its variables are significant.
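A sketch of the stepwise selection, assuming it starts from the full model logit_1; the actual call and direction used in the report may differ, but the selected formula matches the output below.

```r
# Model 3: stepwise selection (both directions) starting from the full model
logit_3 <- step(logit_1, direction = "both", trace = FALSE)
summary(logit_3)
```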
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = binomial(), data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8295 -0.1752 -0.0021 0.0032 3.4191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## zn -0.068648 0.032019 -2.144 0.03203 *
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
## Loading required packages: mosaic, Stat2Data, leaps (package startup and masking messages omitted)
Model 3 keeps 8 of the initial variables. All variables are significant and the AIC dropped. We could also have searched for the subset of variables that best explains the response variable, for example with the more exhaustive best-subset approach in the leaps package. We also wonder whether this could have been caught earlier by checking the correlations among these variables.
We will cross-check the Akaike information criterion (AIC) for all models. Recall that the AIC is an estimator of prediction error and thereby of the relative quality of statistical models for a given set of data: it estimates the quality of each model relative to the others. Model 3 is the best model because it has the lowest AIC. We also tried anova(), but encountered some errors in the output.
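The AIC comparison can be done directly with AIC(); the three values printed below correspond to models 1, 2 and 3 (the object names are the ones assumed in the sketches above).

```r
# Compare the three fitted models by AIC (lower is better)
AIC(logit_1)
AIC(logit_2)
AIC(logit_3)
```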
## [1] 218.0469
## [1] 219.4473
## [1] 215.3229
## Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
## response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(weighted.residuals(object), 2): '^' not meaningful for
## factors
## (the warnings above were emitted once for each of the three models)
## Analysis of Variance Table
##
## Model 1: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
## Model 2: target ~ nox + age + dis + rad + tax + ptratio + medv
## Model 3: target ~ zn + nox + age + dis + rad + tax + ptratio + medv
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 453 0
## 2 458 0 -5 0
## 3 457 0 1 0
We can visualize the distribution of the predicted probability of a high crime level by plotting a histogram.
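A minimal sketch, assuming the probabilities come from model 3 fitted above:

```r
# Predicted probabilities of a high crime level on the training data
pred_prob <- predict(logit_3, type = "response")
summary(pred_prob)
hist(pred_prob, breaks = 20,
     main = "Predicted probability of high crime level",
     xlab = "Predicted probability")
```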
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000001 0.010404 0.430284 0.491416 0.999991 1.000000
Cross-validation for model performance: we put the models through confusionMatrix() to find out about model accuracy. We encountered many code errors, which resulted in using evaluation_df. Perhaps the evaluation data needs some cleaning.
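A hedged sketch of how caret::confusionMatrix() can be applied is shown below. The object names are assumptions, and reference_labels is only a placeholder for whatever labelling the report compared against (evaluation_df itself carries no target column), so this is not the exact code behind the tables that follow.

```r
library(caret)

# Classify the evaluation data with a 0.5 cutoff
eval_prob  <- predict(logit_3, newdata = evaluation_df, type = "response")
eval_class <- factor(ifelse(eval_prob > 0.5, 1, 0), levels = c(0, 1))

# reference_labels: placeholder factor of reference classes with the same levels
confusionMatrix(data = eval_class, reference = reference_labels)
```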
## [1] 40
## [1] 40
## Confusion Matrix and Statistics
##
## Trained
## Predicted 0 1
## 0 9 11
## 1 11 9
##
## Accuracy : 0.45
## 95% CI : (0.2926, 0.6151)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.7852
##
## Kappa : -0.1
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.450
## Specificity : 0.450
## Pos Pred Value : 0.450
## Neg Pred Value : 0.450
## Prevalence : 0.500
## Detection Rate : 0.225
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.450
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
## Confusion Matrix and Statistics
##
## Trained
## Predicted 0 1
## 0 11 6
## 1 9 14
##
## Accuracy : 0.625
## 95% CI : (0.458, 0.7727)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.07693
##
## Kappa : 0.25
##
## Mcnemar's Test P-Value : 0.60558
##
## Sensitivity : 0.5500
## Specificity : 0.7000
## Pos Pred Value : 0.6471
## Neg Pred Value : 0.6087
## Prevalence : 0.5000
## Detection Rate : 0.2750
## Detection Prevalence : 0.4250
## Balanced Accuracy : 0.6250
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Confusion Matrix and Statistics
##
## Trained
## Predicted 0 1
## 0 7 13
## 1 13 7
##
## Accuracy : 0.35
## 95% CI : (0.2063, 0.5168)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.9808
##
## Kappa : -0.3
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.350
## Specificity : 0.350
## Pos Pred Value : 0.350
## Neg Pred Value : 0.350
## Prevalence : 0.500
## Detection Rate : 0.175
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.350
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
After running confusionMatrix(), we found results that do not agree with the previous finding. Using the AIC, model 3 (obtained by stepwise regression) was the best among the three models, with a noticeably better AIC than the others. The confusion matrices, however, suggest that model 2 is the best, since its accuracy is greater than that of models 1 and 3. This is more strange than conclusive. The anova() method or leaps could have settled this model selection had the data been prepared more simply.
https://www.r-bloggers.com/2020/05/binary-logistic-regression-with-r/
https://towardsdatascience.com/implementing-binary-logistic-regression-in-r-7d802a9d98fe
https://www.scribbr.com/statistics/akaike-information-criterion/
https://cran.r-project.org/web/packages/caret/vignettes/caret.html
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.567.459&rep=rep1&type=pdf