Github Link: https://github.com/asmozo24/Data62 Web link:
In this homework assignment, we explore, analyze, and model a data set containing crime information for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0). This assignment is about building a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. We then provide classifications and probabilities for the evaluation data set using the fitted model. Let’s recall the definition of binary logistic regression with R. According to R-bloggers (https://www.r-bloggers.com/2020/05/binary-logistic-regression-with-r/), “Binary Logistic Regression is used to explain the relationship between the categorical dependent variable and one or more independent variables.”
There are two datasets, crime-training-data_modified and crime-evaluation-data_modified, provided by the instructor, Nasrin Khansari. These are CSV files, and we used the R programming language to acquire the two datasets stored in a GitHub repository. The 13 variables of interest are all predictors except the variable called “target”, which is the response variable; they are defined below. The case study: the crime level for various neighborhoods of a major city.
Variable Name | Short Description |
---|---|
zn | proportion of residential land zoned for large lots (over 25000 square feet) |
indus | proportion of non-retail business acres per suburb |
chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) |
nox | nitrogen oxides concentration (parts per 10 million) |
rm | average number of rooms per dwelling |
age | proportion of owner-occupied units built prior to 1940 |
dis | weighted mean of distances to five Boston employment centers |
rad | index of accessibility to radial highways |
tax | full-value property-tax rate per $10,000 |
ptratio | pupil-teacher ratio by town |
lstat | lower status of the population (percent) |
medv | median value of owner-occupied homes in $1000s |
target | whether the crime rate is above the median crime rate (1) or not (0) |
1.1 Data Structure
The training dataset includes 466 observations and 13 variables. All values are numerical of type integer or double except the “chas” and response variables. The “chas” and response variables are categorical because they take only the binary values 0 and 1. We call the data frame “training_df”; it contains 12 predictors and one response variable.
Data overview
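The two CSV files were read into R from the GitHub repository. Below is a minimal sketch of that step; the raw-file URLs are placeholders, not the actual repository paths.

```r
# Read the two datasets from the GitHub repository (placeholder URLs)
training_df   <- read.csv("https://raw.githubusercontent.com/<user>/<repo>/main/crime-training-data_modified.csv")
evaluation_df <- read.csv("https://raw.githubusercontent.com/<user>/<repo>/main/crime-evaluation-data_modified.csv")

# Structure, first rows, and summary (the output shown below comes from calls like these)
str(training_df)
head(training_df)
summary(training_df)
```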
## 'data.frame': 466 obs. of 13 variables:
## $ zn : num 0 0 0 30 0 0 0 0 0 80 ...
## $ indus : num 19.58 19.58 18.1 4.93 2.46 ...
## $ chas : int 0 1 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
## $ rm : num 7.93 5.4 6.49 6.39 7.16 ...
## $ age : num 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
## $ dis : num 2.05 1.32 1.98 7.04 2.7 ...
## $ rad : int 5 5 24 6 3 5 24 24 5 1 ...
## $ tax : int 403 403 666 300 193 384 666 666 224 315 ...
## $ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
## $ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
## $ medv : num 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
## $ target : int 1 1 1 0 0 0 1 1 0 0 ...
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
2. Data Preparation
It is important to check for missing values before applying regression analysis because missing values can increase the error and add bias to the regression model. As we can see below, the dataset shows no missing values, which means it is ready for analysis. In addition, the variable ‘target’ (whether the crime rate is above the median crime rate (1) or not (0)) is a two-level response, so we set it as a factor, and we do the same for chas (even though it is not the response variable).
## The total of missing values is : 0
##
## The total of missing 'NA' is:
## character(0)
##
## Even more, the table below shows the total number of missing values per variable
## zn indus chas nox rm age dis rad tax ptratio
## 0 0 0 0 0 0 0 0 0 0
## lstat medv target
## 0 0 0
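A minimal sketch of the missing-value check and the factor conversion described above, assuming the training data frame is named training_df:

```r
# Total missing values and missing values per variable
sum(is.na(training_df))
colSums(is.na(training_df))

# Treat the two binary variables as factors
training_df$target <- as.factor(training_df$target)
training_df$chas   <- as.factor(training_df$chas)
```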
We want to take a look at how the data are distributed across all variables. The response variable (target) splits into two roughly equal groups, which makes sense because it only takes the two values 0 and 1. Aside from the average number of rooms per dwelling (rm), which is approximately normally distributed, the remaining variables are right- or left-skewed. Based on these density plots, we next look at how each predictor varies between the two levels of the response variable.
##
## Another way of looking at the data distribution for all 13 variables.
Among the boxplots (below), we can see that nox (nitrogen oxides concentration, parts per 10 million) has a noticeably higher median in the neighborhoods at risk for high crime levels. We also see that chas (a dummy variable for whether the suburb borders the Charles River (1) or not (0)) and zn (proportion of residential land zoned for large lots over 25,000 square feet) are of little use for this analysis since their medians are essentially zero.
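For reference, one way such boxplots could be produced is sketched below with ggplot2 and tidyr. This assumes target and chas have already been converted to factors; it is not necessarily the code behind the figures in this report.

```r
library(ggplot2)
library(tidyr)

# Reshape the numeric predictors to long format, keeping target for the grouping
long_df <- pivot_longer(training_df, cols = -c(target, chas),
                        names_to = "variable", values_to = "value")

# One boxplot per predictor, split by the two levels of target
ggplot(long_df, aes(x = target, y = value, fill = target)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")
```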
3. Building the Model
Model 1 accounts for all variables. To build model 1, we could split training_df into train (80%) and test (20%) sets using the createDataPartition() function. However, we have evaluation_df that we can use for prediction, so instead of partitioning or sampling we keep 0.912 of training_df1 for training and use evaluation_df for testing.
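The first model is fitted with glm() using the binomial family; train1 below is the training data named in the Call shown in the output, and logit_1 is the object name used later in the text.

```r
# Model 1: binary logistic regression on all 12 predictors
logit_1 <- glm(target ~ ., family = binomial(), data = train1)
summary(logit_1)
```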
##
## Call:
## glm(formula = target ~ ., family = binomial(), data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8464 -0.1445 -0.0017 0.0029 3.4665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## zn -0.065946 0.034656 -1.903 0.05706 .
## indus -0.064614 0.047622 -1.357 0.17485
## chas1 0.910765 0.755546 1.205 0.22803
## nox 49.122297 7.931706 6.193 5.90e-10 ***
## rm -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## rad 0.666366 0.163152 4.084 4.42e-05 ***
## tax -0.006171 0.002955 -2.089 0.03674 *
## ptratio 0.402566 0.126627 3.179 0.00148 **
## lstat 0.045869 0.054049 0.849 0.39608
## medv 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
Note that the chas variable appears as chas1 in the output: because chas was converted to a factor, R reports the coefficient for its level 1 relative to level 0. Looking at the coefficients, we can say that for every one-unit increase in nox, the log odds of being high risk (crime level) increase by 49.122297, and similarly for the other variables.
Most of the variables appear significant. A few variables (zn, chas, indus, rm and lstat) have higher p-values (p-value > 0.05); these should be removed since they are not significant in the model.
We can see from the logit_1 output that adding the 12 (465 - 453 = 12) independent variables decreased the deviance from 645.88 to 192.05, a reduction of 645.88 - 192.05 = 453.83 at a cost of 12 degrees of freedom.
The Fisher Scoring iterations indicate how many iterations were performed to fit model 1. The model converged after 9 iterations, which suggests a stable fit.
For the second model, we exclude the variables zn, chas, indus, rm and lstat for being non-significant. This is a backward selection: having started with all variables, we drop those that do not contribute to the model, i.e. those with high p-values (p-value > 0.05 at the 95% confidence level).
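A minimal sketch of model 2, assuming train2 is simply train1 with the five non-significant columns dropped (the exact construction of train2 is not shown in the report):

```r
# Model 2: drop zn, chas, indus, rm and lstat, then refit
train2  <- subset(train1, select = -c(zn, chas, indus, rm, lstat))
logit_2 <- glm(target ~ ., family = binomial(), data = train2)
summary(logit_2)
```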
##
## Call:
## glm(formula = target ~ ., family = binomial(), data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.01059 -0.19744 -0.01371 0.00402 3.06424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.824228 5.858405 -6.286 3.26e-10 ***
## nox 42.338378 6.639207 6.377 1.81e-10 ***
## age 0.031882 0.010693 2.982 0.002867 **
## dis 0.429555 0.171849 2.500 0.012433 *
## rad 0.701767 0.139426 5.033 4.82e-07 ***
## tax -0.008237 0.002534 -3.250 0.001153 **
## ptratio 0.376575 0.108912 3.458 0.000545 ***
## medv 0.093653 0.033556 2.791 0.005255 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 203.45 on 458 degrees of freedom
## AIC: 219.45
##
## Number of Fisher Scoring iterations: 9
Model 2 has fewer independent variables than the initial model. All the predictors are highly significant (p-value < 0.05). We also observe that the intercept increased while the nox coefficient decreased. So, for the nox variable, which has the lowest p-value and best explains the response variable, the contributions to the log odds are y2 = 42.338378·x - 36.824228 (model 2) and y1 = 49.122297·x - 40.822934 (model 1); at one unit of nox, model 2 gives 5.51415 while model 1 gives 8.299363 toward determining the neighborhood crime level.
For model 3, we try the stepwise method, since the model 2 output shows that all seven of its variables are significant.
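A sketch of the stepwise selection, assuming it starts from the full model logit_1; the actual call and direction used in the report may differ, but the selected formula matches the output below.

```r
# Model 3: stepwise selection (both directions) starting from the full model
logit_3 <- step(logit_1, direction = "both", trace = FALSE)
summary(logit_3)
```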
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = binomial(), data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8295 -0.1752 -0.0021 0.0032 3.4191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## zn -0.068648 0.032019 -2.144 0.03203 *
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
## Loading required packages: mosaic, Stat2Data, leaps (package startup and masking messages omitted)
Model 3 keeps 8 of the initial variables. All variables are significant and the AIC dropped. We could also have searched for the subset of variables that best explains the response variable, for example with the more exhaustive best-subset approach in the leaps package. We also wonder whether this could have been caught earlier by checking the correlations among these variables.
We will cross-check the Akaike information criterion (AIC) for all models. Recall that the AIC is an estimator of prediction error and thereby of the relative quality of statistical models for a given set of data: it estimates the quality of each model relative to the others. Model 3 is the best model because it has the lowest AIC. We also tried anova(), but encountered some errors in the output.
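The AIC comparison can be done directly with AIC(); the three values printed below correspond to models 1, 2 and 3 (the object names are the ones assumed in the sketches above).

```r
# Compare the three fitted models by AIC (lower is better)
AIC(logit_1)
AIC(logit_2)
AIC(logit_3)
```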
## [1] 218.0469
## [1] 219.4473
## [1] 215.3229
## Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
## response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(weighted.residuals(object), 2): '^' not meaningful for
## factors
## (the warnings above were emitted once for each of the three models)
## Analysis of Variance Table
##
## Model 1: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
## Model 2: target ~ nox + age + dis + rad + tax + ptratio + medv
## Model 3: target ~ zn + nox + age + dis + rad + tax + ptratio + medv
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 453 0
## 2 458 0 -5 0
## 3 457 0 1 0
We can visualize the distribution of the predicted probability of a high crime level by plotting a histogram.
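A minimal sketch, assuming the probabilities come from model 3 fitted above:

```r
# Predicted probabilities of a high crime level on the training data
pred_prob <- predict(logit_3, type = "response")
summary(pred_prob)
hist(pred_prob, breaks = 20,
     main = "Predicted probability of high crime level",
     xlab = "Predicted probability")
```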
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000001 0.010404 0.430284 0.491416 0.999991 1.000000
Cross-validation for model performance: we put the models through confusionMatrix() to find out about model accuracy. We encountered many code errors, which resulted in using evaluation_df. Perhaps the evaluation data needs some cleaning.
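A hedged sketch of how caret::confusionMatrix() can be applied is shown below. The object names are assumptions, and reference_labels is only a placeholder for whatever labelling the report compared against (evaluation_df itself carries no target column), so this is not the exact code behind the tables that follow.

```r
library(caret)

# Classify the evaluation data with a 0.5 cutoff
eval_prob  <- predict(logit_3, newdata = evaluation_df, type = "response")
eval_class <- factor(ifelse(eval_prob > 0.5, 1, 0), levels = c(0, 1))

# reference_labels: placeholder factor of reference classes with the same levels
confusionMatrix(data = eval_class, reference = reference_labels)
```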
## [1] 40
## [1] 40
## Confusion Matrix and Statistics
##
## Trained
## Predicted 0 1
## 0 9 11
## 1 11 9
##
## Accuracy : 0.45
## 95% CI : (0.2926, 0.6151)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.7852
##
## Kappa : -0.1
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.450
## Specificity : 0.450
## Pos Pred Value : 0.450
## Neg Pred Value : 0.450
## Prevalence : 0.500
## Detection Rate : 0.225
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.450
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
## Confusion Matrix and Statistics
##
## Trained
## Predicted 0 1
## 0 11 6
## 1 9 14
##
## Accuracy : 0.625
## 95% CI : (0.458, 0.7727)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.07693
##
## Kappa : 0.25
##
## Mcnemar's Test P-Value : 0.60558
##
## Sensitivity : 0.5500
## Specificity : 0.7000
## Pos Pred Value : 0.6471
## Neg Pred Value : 0.6087
## Prevalence : 0.5000
## Detection Rate : 0.2750
## Detection Prevalence : 0.4250
## Balanced Accuracy : 0.6250
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Confusion Matrix and Statistics
##
## Trained
## Predicted 0 1
## 0 7 13
## 1 13 7
##
## Accuracy : 0.35
## 95% CI : (0.2063, 0.5168)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.9808
##
## Kappa : -0.3
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.350
## Specificity : 0.350
## Pos Pred Value : 0.350
## Neg Pred Value : 0.350
## Prevalence : 0.500
## Detection Rate : 0.175
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.350
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
After running confusionMatrix(), we found results that do not agree with the previous finding. Using the AIC, model 3 (obtained by stepwise regression) was the best among the three models, with a noticeably better AIC than the others. The confusion matrices, however, suggest that model 2 is the best, since its accuracy is greater than that of models 1 and 3. This is more strange than conclusive. The anova() method or leaps could have settled this model selection had the data been prepared more simply.
https://www.r-bloggers.com/2020/05/binary-logistic-regression-with-r/
https://towardsdatascience.com/implementing-binary-logistic-regression-in-r-7d802a9d98fe
https://www.scribbr.com/statistics/akaike-information-criterion/
https://cran.r-project.org/web/packages/caret/vignettes/caret.html
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.567.459&rep=rep1&type=pdf