DATA 621: HW 3
David Quarshie - Group 3
In this assignment we build a logistic regression model to predict whether a particular neighborhood in Boston is above or below the median crime level. We're given information on 466 Boston neighborhoods, with 13 predictor variables and 1 response variable, target. Target tells us whether the neighborhood is above the median crime level (1) or below it (0).
1. DATA EXPLORATION
Below we apply a few basic EDA techniques to gain insight into our crime dataset.
Basic Statistics
There are 466 rows and 14 columns. Thankfully, there are no missing values among the 6,524 data points.
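One way to reproduce the descriptive statistics below is with the psych package; a minimal sketch, assuming the training data sits in a dataframe named `crime_train` (a placeholder name):

```r
# Sketch: check for missing values and build the summary table.
# `crime_train` is an assumed name for the training dataframe.
library(psych)

sum(is.na(crime_train))  # 0 -- no missing values among the 6,524 data points

stats <- describe(crime_train)
stats[, c("n", "mean", "sd", "median", "min", "max", "skew", "kurtosis")]
```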
| Variable | n | mean | sd | median | min | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|
| zn | 466 | 11.5772532 | 23.3646511 | 0.00000 | 0.0000 | 100.0000 | 2.1768152 | 3.8135765 |
| indus | 466 | 11.1050215 | 6.8458549 | 9.69000 | 0.4600 | 27.7400 | 0.2885450 | -1.2432132 |
| chas | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000 | 1.0000 | 3.3354899 | 9.1451313 |
| nox | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.3890 | 0.8710 | 0.7463281 | -0.0357736 |
| rm | 466 | 6.2906738 | 0.7048513 | 6.21000 | 3.8630 | 8.7800 | 0.4793202 | 1.5424378 |
| age | 466 | 68.3675966 | 28.3213784 | 77.15000 | 2.9000 | 100.0000 | -0.5777075 | -1.0098814 |
| dis | 466 | 3.7956929 | 2.1069496 | 3.19095 | 1.1296 | 12.1265 | 0.9988926 | 0.4719679 |
| rad | 466 | 9.5300429 | 8.6859272 | 5.00000 | 1.0000 | 24.0000 | 1.0102788 | -0.8619110 |
| tax | 466 | 409.5021459 | 167.9000887 | 334.50000 | 187.0000 | 711.0000 | 0.6593136 | -1.1480456 |
| ptratio | 466 | 18.3984979 | 2.1968447 | 18.90000 | 12.6000 | 22.0000 | -0.7542681 | -0.4003627 |
| lstat | 466 | 12.6314592 | 7.1018907 | 11.35000 | 1.7300 | 37.9700 | 0.9055864 | 0.5033688 |
Distribution of Target Variable
Let's look at the target variable in our training data to make sure it doesn't have a one-sided distribution.
| target | Freq |
|---|---|
| 0 | 237 |
| 1 | 229 |
Histogram of Variables
Boxplot of Variables
2. DATA PREPARATION
We've determined that there are no missing values in our data, but our visualizations reveal a few variables with issues.
1: Chas
The chas variable is a binary indicator that is almost always 0 (only about 7% of neighborhoods border the Charles River), so it adds little information and we can remove it.
2: Indus
We see a lot of outliers in the indus variable, so we'll remove the rows where indus is greater than 20 and target is 0.
3: Dis
Dis also has some outliers, so we'll remove rows where dis is greater than 11 and target is 0, and where dis is greater than 7.5 and target is 1. These steps are sketched in the code below.
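A minimal sketch of the three preparation steps, again assuming the training data lives in `crime_train`:

```r
# 1: drop the chas dummy variable
crime_train$chas <- NULL

# 2: drop indus outliers among the target = 0 rows
crime_train <- subset(crime_train, !(indus > 20 & target == 0))

# 3: drop dis outliers in each class
crime_train <- subset(crime_train, !(dis > 11 & target == 0))
crime_train <- subset(crime_train, !(dis > 7.5 & target == 1))

names(crime_train)  # remaining variables
dim(crime_train)    # should match the summary output below
```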
Data Summary
Let’s take a quick look at what variables we have remaining.
## [1] "zn" "indus" "nox" "rm" "age" "dis" "rad"
## [8] "tax" "ptratio" "lstat" "medv" "target"
## [1] 452 12
3. BUILD MODELS
k-fold Cross Validation is useful when there's only a small amount of data to train on. Since we're dealing with just 466 observations, we'll use k-fold with k = 10. We'll hold out 20% of the data for validation during the initial modeling, and once we select our final model we'll refit it on the full training set.
Each of our logistic regression models uses binomial regression with a logit link function; one possible setup is sketched below.
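A possible setup for the 80/20 split and 10-fold cross validation using the caret package; object names like `crime_train` are placeholders carried over from the preparation step:

```r
# Sketch: hold out 20% for validation and train with 10-fold CV (caret).
library(caret)

set.seed(621)
in_train  <- createDataPartition(crime_train$target, p = 0.8, list = FALSE)
train_set <- crime_train[in_train, ]
valid_set <- crime_train[-in_train, ]

ctrl <- trainControl(method = "cv", number = 10)

# Model 1: all remaining predictors, binomial regression with a logit link
model1 <- train(factor(target) ~ ., data = train_set,
                method = "glm", family = binomial(link = "logit"),
                trControl = ctrl)
```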
Model 1
For the first model we include all the variables. Looking at the output of the model, we see that some variables are highly collinear and that some may not be necessary.
Model 1 uses the formula:
target ~ .
| Variable | VIF |
|---|---|
| zn | 274.32819 |
| indus | 123.40258 |
| nox | 352.53030 |
| rm | 130.66418 |
| age | 63.06141 |
| dis | 106.92211 |
| rad | 1273.41974 |
| tax | 474.20124 |
| ptratio | 52.26712 |
| lstat | 58.36391 |
| medv | 210.06847 |
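The VIF values in these tables can be computed with the car package; a minimal sketch for model 1, using a plain glm fit of the same formula (placeholder names):

```r
# Sketch: variance inflation factors for the Model 1 predictors.
library(car)

model1_glm <- glm(target ~ ., data = train_set,
                  family = binomial(link = "logit"))
vif(model1_glm)
```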
Model 2
For the second model we ignore the collinearity for now but remove the unnecessary variables identified in model 1.
Model 2’s formula:
target ~ zn + nox + age + dis + rad + ptratio + medv
| Variable | VIF |
|---|---|
| zn | 302.47651 |
| nox | 245.57269 |
| age | 47.60585 |
| dis | 88.76467 |
| rad | 518.30063 |
| ptratio | 29.46829 |
| medv | 51.12595 |
Model 3
For model 3 we take out the variables with the highest VIF values from the first model: rad, nox, and zn.
Model 3’s formula:
target ~ indus + rm + age + dis + tax + ptratio + lstat + medv
| Variable | VIF |
|---|---|
| indus | 50.05715 |
| rm | 56.40033 |
| age | 38.33446 |
| dis | 48.85328 |
| tax | 55.72066 |
| ptratio | 18.69283 |
| lstat | 42.03121 |
| medv | 90.49769 |
Model 4
For our final model we remove the unneeded variables from model 3.
Model 4’s formula:
target ~ age + dis + tax + medv
| Variable | VIF |
|---|---|
| age | 32.45534 |
| dis | 38.50808 |
| tax | 35.45638 |
| medv | 17.78640 |
4. SELECT MODELS
It's time for us to pick which model to use. To help do this, we'll review each model's accuracy by making predictions on the 20% validation set we held out and comparing the results. We'll use fourfold plots, summary statistics, and ROC/AUC plots to determine overall accuracy.
Fourfold Plots
Summary Statistics
| Model | Sensitivity | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Model1 | 0.9111111 | 0.9333333 | 0.9318182 | 0.9111111 | 0.9213483 |
| Model2 | 0.9111111 | 0.9555556 | 0.9534884 | 0.9111111 | 0.9318182 |
| Model3 | 0.9333333 | 0.8444444 | 0.8571429 | 0.9333333 | 0.8936170 |
| Model4 | 0.9111111 | 0.8444444 | 0.8541667 | 0.9111111 | 0.8817204 |
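One way to produce these statistics is caret's confusionMatrix; a sketch for model 1, reusing the placeholder objects from the modeling step:

```r
# Sketch: validation-set predictions and classification statistics (Model 1).
pred1 <- predict(model1, newdata = valid_set)
cm1   <- confusionMatrix(pred1, factor(valid_set$target), positive = "1")
cm1$byClass[c("Sensitivity", "Specificity", "Precision", "Recall", "F1")]
```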
ROC / AUC
Model Selection
Models 1 and 2 both contain more variables than models 3 and 4, but they also have some collinearity issues that can be seen in the VIF output. Model 3 takes care of the collinearity but still carries variables that are not needed. In the end, we'll go with model 4, which deals with collinearity and gets rid of the unneeded variables.
Let's refit model 4 on our full dataset and review the summary output and some diagnostic plots; a sketch of the refit follows.
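A minimal sketch of the refit, assuming the prepared full training data is still in `crime_train` (the NULL in the Call line of the output below is an artifact of how the report generated the model):

```r
# Sketch: refit Model 4 on the full prepared training data.
model4_full <- glm(target ~ age + dis + tax + medv,
                   data = crime_train, family = binomial(link = "logit"))
summary(model4_full)
```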
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06595 -0.36692 0.07437 0.31853 2.62961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1329 0.1990 0.668 0.50422
## age 1.1948 0.2542 4.700 2.60e-06 ***
## dis -1.0226 0.2872 -3.561 0.00037 ***
## tax 1.5371 0.2804 5.481 4.22e-08 ***
## medv 0.2566 0.1944 1.320 0.18691
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 626.60 on 451 degrees of freedom
## Residual deviance: 267.56 on 447 degrees of freedom
## AIC: 277.56
##
## Number of Fisher Scoring iterations: 6
Odds Ratio
Here's a table of the odds ratios for model 4 alongside their 95% confidence intervals.
| Term | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 1.142 | 0.779 | 1.711 |
| age | 3.303 | 2.038 | 5.541 |
| dis | 0.360 | 0.198 | 0.614 |
| tax | 4.651 | 2.791 | 8.450 |
| medv | 1.293 | 0.890 | 1.912 |
Our output shows that the odds of a neighborhood being above the median crime rate are multiplied by about 3.3 for each one-unit increase in age, holding the other predictors constant.
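The odds ratio table above can be reproduced by exponentiating the coefficients and their profile confidence intervals; a minimal sketch:

```r
# Sketch: odds ratios with 95% profile confidence intervals for Model 4.
exp(cbind(OddsRatio = coef(model4_full), confint(model4_full)))
```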
5. MAKE PREDICTIONS
At last, we can make our final predictions. From the head of our final dataframe and the frequency table of the predicted classes, the prediction distribution looks very similar to the target distribution we saw in the training data. A scoring sketch follows the tables below.
| P(target = 0) | P(target = 1) | prediction |
|---|---|---|
| 0.8768524 | 0.1231476 | 0 |
| 0.6405608 | 0.3594392 | 0 |
| 0.5375394 | 0.4624606 | 0 |
| 0.6428915 | 0.3571085 | 0 |
| 0.9100915 | 0.0899085 | 0 |
| 0.9489599 | 0.0510401 | 0 |
| prediction | Freq |
|---|---|
| 0 | 21 |
| 1 | 19 |
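A minimal sketch of the scoring step, assuming the evaluation data (with the same preparation applied) sits in a dataframe named `crime_eval` (a placeholder name):

```r
# Sketch: score the evaluation set with the final model and save the results.
probs <- predict(model4_full, newdata = crime_eval, type = "response")
pred  <- ifelse(probs > 0.5, 1, 0)

table(pred)  # compare against the target distribution in the training data

write.csv(data.frame(prob_0 = 1 - probs, prob_1 = probs, prediction = pred),
          "HW3_prediction.csv", row.names = FALSE)
```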
Appendix
Code: https://github.com/dquarshie89/Data-621/blob/master/DQ_HW3.Rmd
Prediction Results: https://github.com/dquarshie89/Data-621/blob/master/HW3_prediction.csv