DATA 621: HW 3

David Quarshie - Group 3

In this assignment we build a logistic regression model to predict whether a particular neighborhood in Boston is above or below the median crime level. We’re given information on 466 Boston neighborhoods, with 13 predictor variables and 1 response variable, target, which tells us whether the neighborhood is above the median crime level (1) or below it (0).

1. DATA EXPLORATION

Below we’ll display a few basic EDA techniques to gain insight into our crime dataset.

Basic Statistics

There are 466 rows and 14 columns. Thankfully, none of the 6,524 data points are missing.

n mean sd median min max skew kurtosis
zn 466 11.5772532 23.3646511 0.00000 0.0000 100.0000 2.1768152 3.8135765
indus 466 11.1050215 6.8458549 9.69000 0.4600 27.7400 0.2885450 -1.2432132
chas 466 0.0708155 0.2567920 0.00000 0.0000 1.0000 3.3354899 9.1451313
nox 466 0.5543105 0.1166667 0.53800 0.3890 0.8710 0.7463281 -0.0357736
rm 466 6.2906738 0.7048513 6.21000 3.8630 8.7800 0.4793202 1.5424378
age 466 68.3675966 28.3213784 77.15000 2.9000 100.0000 -0.5777075 -1.0098814
dis 466 3.7956929 2.1069496 3.19095 1.1296 12.1265 0.9988926 0.4719679
rad 466 9.5300429 8.6859272 5.00000 1.0000 24.0000 1.0102788 -0.8619110
tax 466 409.5021459 167.9000887 334.50000 187.0000 711.0000 0.6593136 -1.1480456
ptratio 466 18.3984979 2.1968447 18.90000 12.6000 22.0000 -0.7542681 -0.4003627
lstat 466 12.6314592 7.1018907 11.35000 1.7300 37.9700 0.9055864 0.5033688
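
The checks and summary table above can be reproduced with a few lines of R. This is a minimal sketch, assuming the training data has been read into a data frame named crime_train (a name chosen here) and that the psych package is available.

```r
# Minimal EDA sketch; crime_train is an assumed name for the training data.
library(psych)

dim(crime_train)          # 466 rows, 14 columns
sum(is.na(crime_train))   # total missing values across all cells (0 here)

# Summary statistics: n, mean, sd, median, min, max, skew, kurtosis
describe(crime_train)
```
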
Distribution of Target Variable

Let’s look at the target variable in our training data to make sure there is no one-sided distribution.

Var1 Freq
0 237
1 229

Histogram of Variables

Boxplot of Variables


2. DATA PREPARATION

We’ve determined that there are no missing values in our data, but looking at our visualizations we see a few variables with some issues.

1: Chas

Looking at the results for the chas variable, it doesn’t appear to add useful information here, so we remove it.

2: Indus

We see a lot of outliers in the indus variable, so we remove the rows where indus is greater than 20 and target is 0.

3: Dis

Dis also has some outliers, so we remove rows where dis is greater than 11 and target is 0, and rows where dis is greater than 7.5 and target is 1. A sketch of all three preparation steps follows below.
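
The three steps above could be carried out roughly as follows (a sketch using dplyr; the crime_train name carries over from the EDA sketch, and the thresholds come from the text above).

```r
# Data preparation sketch; object names are assumed.
library(dplyr)

crime_clean <- crime_train %>%
  select(-chas) %>%                              # 1: drop chas
  filter(!(indus > 20 & target == 0)) %>%        # 2: indus outliers
  filter(!(dis > 11  & target == 0),             # 3: dis outliers
         !(dis > 7.5 & target == 1))

dim(crime_clean)   # check the remaining rows and columns
```
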

Data Summary

Let’s take a quick look at what variables we have remaining.

##  [1] "zn"      "indus"   "nox"     "rm"      "age"     "dis"     "rad"    
##  [8] "tax"     "ptratio" "lstat"   "medv"    "target"
## [1] 452  12

3. BUILD MODELS

k-fold cross-validation is useful when there is only a small amount of data to train on. Since we’re dealing with just 466 observations, we’ll use k-fold cross-validation with k = 10. We’ll hold out 20% of the data for validation during the initial modeling, and once we select our final model we’ll refit it on the full training set.

Each of our logistic regression models uses a binomial family with a logit link function.
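
One way to set this up is sketched below with the caret package (object names are assumed, and crime_clean refers to the prepared data from the previous section).

```r
# Validation setup sketch: 80/20 split plus 10-fold cross-validation.
library(caret)

set.seed(621)

# Hold out 20% of the prepared data for validation
in_train  <- createDataPartition(factor(crime_clean$target), p = 0.8, list = FALSE)
train_set <- crime_clean[in_train, ]
valid_set <- crime_clean[-in_train, ]

# 10-fold cross-validation for the binomial (logit link) models
ctrl <- trainControl(method = "cv", number = 10)

cv_fit <- train(factor(target) ~ ., data = train_set,
                method = "glm", family = binomial(link = "logit"),
                trControl = ctrl)
```
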

Model 1

For the first model we include all the variables. Looking at the model output we see that some predictors are highly collinear and that some variables may not be necessary.

Model 1 uses the formula:

target ~ .

VIF
zn 274.32819
indus 123.40258
nox 352.53030
rm 130.66418
age 63.06141
dis 106.92211
rad 1273.41974
tax 474.20124
ptratio 52.26712
lstat 58.36391
medv 210.06847
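
The fit and the collinearity check could be produced roughly like this (a sketch; model and data names are assumed, and vif() comes from the car package). Models 2 through 4 below follow the same pattern with their reduced formulas.

```r
# Model 1 sketch: full model plus variance inflation factors.
library(car)

model1 <- glm(target ~ ., data = train_set,
              family = binomial(link = "logit"))

summary(model1)   # coefficients and p-values
vif(model1)       # one VIF per predictor, as tabulated above
```
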

Model 2

For the second model we ignore the collinearity but remove the unnecessary variables identified in model 1.

Model 2’s formula:

target ~ zn + nox + age + dis + rad + ptratio + medv

VIF
zn 302.47651
nox 245.57269
age 47.60585
dis 88.76467
rad 518.30063
ptratio 29.46829
medv 51.12595

Model 3

For model 3 we take out the variables with the two highest VIF values from the first model.

Model 3’s formula:

target ~ indus + rm + age + dis + tax + ptratio + lstat + medv

VIF
indus 50.05715
rm 56.40033
age 38.33446
dis 48.85328
tax 55.72066
ptratio 18.69283
lstat 42.03121
medv 90.49769

Model 4

For our final model we remove the unnecessary variables from model 3.

Model 4’s formula:

target ~ age + dis + tax + medv

VIF
age 32.45534
dis 38.50808
tax 35.45638
medv 17.78640

4. SELECT MODELS

It’s time to pick which model we want to use. To do this we’ll review each model’s accuracy by making predictions on the 20% hold-out set and comparing the results. We’ll use fourfold plots, summary statistics, and ROC / AUC plots to judge overall accuracy.
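
A sketch of how each model can be scored against the hold-out set (object names assumed; confusionMatrix() is from caret and roc()/auc() are from pROC), shown here for model 1:

```r
# Hold-out evaluation sketch for one model (repeat for models 2-4).
library(caret)
library(pROC)

probs <- predict(model1, newdata = valid_set, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)

# Sensitivity, specificity, precision, recall, F1
confusionMatrix(factor(preds, levels = c(0, 1)),
                factor(valid_set$target, levels = c(0, 1)),
                positive = "1", mode = "everything")

# ROC curve and area under the curve
roc_obj <- roc(valid_set$target, probs)
plot(roc_obj)
auc(roc_obj)
```
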

Fourfold Plots

Summary Statistics

Sensitivity Specificity Precision Recall F1
Model1 0.9111111 0.9333333 0.9318182 0.9111111 0.9213483
Model2 0.9111111 0.9555556 0.9534884 0.9111111 0.9318182
Model3 0.9333333 0.8444444 0.8571429 0.9333333 0.8936170
Model4 0.9111111 0.8444444 0.8541667 0.9111111 0.8817204

ROC / AUC

Model Selection

Models 1 and 2 both contain more variables than models 3 and 4, but they also have collinearity issues, as seen in the VIF output. Model 3 takes care of the collinearity but still includes variables that are not needed. In the end, we’ll go with model 4, which deals with the collinearity and drops the unneeded variables.

Let’s refit model 4 on our full dataset and review some summary diagnostics and outputs.
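
A sketch of that refit (names assumed; crime_clean is the prepared data from section 2):

```r
# Final model sketch: model 4's formula fit on the full prepared dataset.
model4_final <- glm(target ~ age + dis + tax + medv,
                    data = crime_clean,
                    family = binomial(link = "logit"))

summary(model4_final)
```
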

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.06595  -0.36692   0.07437   0.31853   2.62961  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.1329     0.1990   0.668  0.50422    
## age           1.1948     0.2542   4.700 2.60e-06 ***
## dis          -1.0226     0.2872  -3.561  0.00037 ***
## tax           1.5371     0.2804   5.481 4.22e-08 ***
## medv          0.2566     0.1944   1.320  0.18691    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 626.60  on 451  degrees of freedom
## Residual deviance: 267.56  on 447  degrees of freedom
## AIC: 277.56
## 
## Number of Fisher Scoring iterations: 6

Odds Ratio

Here’s a table of the odds ratios for model 4 alongside their 95% confidence intervals.

OddsRatio 2.5 % 97.5 %
(Intercept) 1.142 0.779 1.711
age 3.303 2.038 5.541
dis 0.360 0.198 0.614
tax 4.651 2.791 8.450
medv 1.293 0.890 1.912
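
These values come from exponentiating the model coefficients and their confidence limits, roughly as follows (object name assumed):

```r
# Odds-ratio sketch: exponentiate coefficients and 95% confidence limits.
odds_tab <- exp(cbind(OddsRatio = coef(model4_final),
                      confint(model4_final)))
round(odds_tab, 3)
```
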

Our output shows that the odds of a neighborhood being above the median crime rate are multiplied by about 3.303 for each one-unit increase in age.

5. Make Predictions

At last, we can make our final predictions on the evaluation data. From the head of the final data frame and the frequency table of the predicted classes, we can see that the prediction distribution looks very similar to the distribution of target in our training data.
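
A sketch of the scoring step (crime_eval is an assumed name for the evaluation dataset, and the 0.5 cutoff is an assumption):

```r
# Prediction sketch: class probabilities and a 0/1 classification.
prob_above <- predict(model4_final, newdata = crime_eval, type = "response")

final_preds <- data.frame(
  `0` = 1 - prob_above,                       # P(below the median crime level)
  `1` = prob_above,                           # P(above the median crime level)
  prediction = ifelse(prob_above > 0.5, 1, 0),
  check.names = FALSE
)

head(final_preds)
table(final_preds$prediction)
```
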

P(target = 0) P(target = 1) prediction
0.8768524 0.1231476 0
0.6405608 0.3594392 0
0.5375394 0.4624606 0
0.6428915 0.3571085 0
0.9100915 0.0899085 0
0.9489599 0.0510401 0
Var1 Freq
0 21
1 19