DATA 621: HW 3
David Quarshie - Group 3
In this assignment we build a logistic regression model to predict whether a particular neighborhood in Boston is above or below the median crime level. We're given information on 466 Boston neighborhoods, with 13 predictor variables and 1 response variable, target. Target tells us whether the neighborhood is above the median crime level (1) or below it (0).
1. DATA EXPLORATION
Below we apply a few basic EDA techniques to gain insight into our crime dataset.
Basic Statistics
There are 466 rows and 14 columns. Thankfully, there are no missing values among the 6,524 data points.
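One way to reproduce the descriptive statistics below is with the psych package; a minimal sketch, assuming the training data sits in a dataframe named `crime_train` (a placeholder name):

```r
# Sketch: check for missing values and build the summary table.
# `crime_train` is an assumed name for the training dataframe.
library(psych)

sum(is.na(crime_train))  # 0 -- no missing values among the 6,524 data points

stats <- describe(crime_train)
stats[, c("n", "mean", "sd", "median", "min", "max", "skew", "kurtosis")]
```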
| Variable | n | mean | sd | median | min | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|
| zn | 466 | 11.5772532 | 23.3646511 | 0.00000 | 0.0000 | 100.0000 | 2.1768152 | 3.8135765 |
| indus | 466 | 11.1050215 | 6.8458549 | 9.69000 | 0.4600 | 27.7400 | 0.2885450 | -1.2432132 |
| chas | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000 | 1.0000 | 3.3354899 | 9.1451313 |
| nox | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.3890 | 0.8710 | 0.7463281 | -0.0357736 |
| rm | 466 | 6.2906738 | 0.7048513 | 6.21000 | 3.8630 | 8.7800 | 0.4793202 | 1.5424378 |
| age | 466 | 68.3675966 | 28.3213784 | 77.15000 | 2.9000 | 100.0000 | -0.5777075 | -1.0098814 |
| dis | 466 | 3.7956929 | 2.1069496 | 3.19095 | 1.1296 | 12.1265 | 0.9988926 | 0.4719679 |
| rad | 466 | 9.5300429 | 8.6859272 | 5.00000 | 1.0000 | 24.0000 | 1.0102788 | -0.8619110 |
| tax | 466 | 409.5021459 | 167.9000887 | 334.50000 | 187.0000 | 711.0000 | 0.6593136 | -1.1480456 |
| ptratio | 466 | 18.3984979 | 2.1968447 | 18.90000 | 12.6000 | 22.0000 | -0.7542681 | -0.4003627 |
| lstat | 466 | 12.6314592 | 7.1018907 | 11.35000 | 1.7300 | 37.9700 | 0.9055864 | 0.5033688 |
Distribution of Target Variable
Let's look at the target variable in our training data to make sure it doesn't have a one-sided distribution.
| target | Freq |
|---|---|
| 0 | 237 |
| 1 | 229 |
Histogram of Variables
Boxplot of Variables
2. DATA PREPARATION
We've determined that there are no missing values in our data, but our visualizations reveal a few variables with issues.
1: Chas
The chas variable is a binary indicator that is almost always 0 (only about 7% of neighborhoods border the Charles River), so it adds little information and we can remove it.
2: Indus
We see a lot of outliers in the indus variable, so we'll remove the rows where indus is greater than 20 and target is 0.
3: Dis
Dis also has some outliers, so we'll remove rows where dis is greater than 11 and target is 0, and where dis is greater than 7.5 and target is 1. These steps are sketched in the code below.
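A minimal sketch of the three preparation steps, again assuming the training data lives in `crime_train`:

```r
# 1: drop the chas dummy variable
crime_train$chas <- NULL

# 2: drop indus outliers among the target = 0 rows
crime_train <- subset(crime_train, !(indus > 20 & target == 0))

# 3: drop dis outliers in each class
crime_train <- subset(crime_train, !(dis > 11 & target == 0))
crime_train <- subset(crime_train, !(dis > 7.5 & target == 1))

names(crime_train)  # remaining variables
dim(crime_train)    # should match the summary output below
```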
Data Summary
Let’s take a quick look at what variables we have remaining.
## [1] "zn" "indus" "nox" "rm" "age" "dis" "rad"
## [8] "tax" "ptratio" "lstat" "medv" "target"
## [1] 452 12
3. BUILD MODELS
k-fold Cross Validation is useful when there's only a small amount of data to train on. Since we're dealing with just 466 observations, we'll use k-fold with k = 10. We'll hold out 20% of the data for validation during the initial modeling, and once we select our final model we'll refit it on the full training set.
Each of our logistic regression models uses binomial regression with a logit link function; one possible setup is sketched below.
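A possible setup for the 80/20 split and 10-fold cross validation using the caret package; object names like `crime_train` are placeholders carried over from the preparation step:

```r
# Sketch: hold out 20% for validation and train with 10-fold CV (caret).
library(caret)

set.seed(621)
in_train  <- createDataPartition(crime_train$target, p = 0.8, list = FALSE)
train_set <- crime_train[in_train, ]
valid_set <- crime_train[-in_train, ]

ctrl <- trainControl(method = "cv", number = 10)

# Model 1: all remaining predictors, binomial regression with a logit link
model1 <- train(factor(target) ~ ., data = train_set,
                method = "glm", family = binomial(link = "logit"),
                trControl = ctrl)
```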
Model 1
For the first model we include all the variables. Looking at the output of the model, we see that some variables are highly collinear and that some may not be necessary.
Model 1 uses the formula:
target ~ .
| Variable | VIF |
|---|---|
| zn | 274.32819 |
| indus | 123.40258 |
| nox | 352.53030 |
| rm | 130.66418 |
| age | 63.06141 |
| dis | 106.92211 |
| rad | 1273.41974 |
| tax | 474.20124 |
| ptratio | 52.26712 |
| lstat | 58.36391 |
| medv | 210.06847 |
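The VIF values in these tables can be computed with the car package; a minimal sketch for model 1, using a plain glm fit of the same formula (placeholder names):

```r
# Sketch: variance inflation factors for the Model 1 predictors.
library(car)

model1_glm <- glm(target ~ ., data = train_set,
                  family = binomial(link = "logit"))
vif(model1_glm)
```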
Model 2
For the second model we ignore the collinearity for now but remove the unnecessary variables identified in model 1.
Model 2’s formula:
target ~ zn + nox + age + dis + rad + ptratio + medv
| Variable | VIF |
|---|---|
| zn | 302.47651 |
| nox | 245.57269 |
| age | 47.60585 |
| dis | 88.76467 |
| rad | 518.30063 |
| ptratio | 29.46829 |
| medv | 51.12595 |
Model 3
For model 3 we take out the variables with the highest VIF values from the first model: rad, nox, and zn.
Model 3’s formula:
target ~ indus + rm + age + dis + tax + ptratio + lstat + medv
| Variable | VIF |
|---|---|
| indus | 50.05715 |
| rm | 56.40033 |
| age | 38.33446 |
| dis | 48.85328 |
| tax | 55.72066 |
| ptratio | 18.69283 |
| lstat | 42.03121 |
| medv | 90.49769 |
Model 4
For our final model we remove the unneeded variables from model 3.
Model 4’s formula:
target ~ age + dis + tax + medv
| Variable | VIF |
|---|---|
| age | 32.45534 |
| dis | 38.50808 |
| tax | 35.45638 |
| medv | 17.78640 |
4. SELECT MODELS
It's time for us to pick which model to use. To help do this, we'll review each model's accuracy by making predictions on the 20% validation set we held out and comparing the results. We'll use fourfold plots, summary statistics, and ROC/AUC plots to determine overall accuracy.
Fourfold Plots
Summary Statistics
| Model | Sensitivity | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Model1 | 0.9111111 | 0.9333333 | 0.9318182 | 0.9111111 | 0.9213483 |
| Model2 | 0.9111111 | 0.9555556 | 0.9534884 | 0.9111111 | 0.9318182 |
| Model3 | 0.9333333 | 0.8444444 | 0.8571429 | 0.9333333 | 0.8936170 |
| Model4 | 0.9111111 | 0.8444444 | 0.8541667 | 0.9111111 | 0.8817204 |
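One way to produce these statistics is caret's confusionMatrix; a sketch for model 1, reusing the placeholder objects from the modeling step:

```r
# Sketch: validation-set predictions and classification statistics (Model 1).
pred1 <- predict(model1, newdata = valid_set)
cm1   <- confusionMatrix(pred1, factor(valid_set$target), positive = "1")
cm1$byClass[c("Sensitivity", "Specificity", "Precision", "Recall", "F1")]
```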
ROC / AUC
Model Selection
Models 1 and 2 both contain more variables than models 3 and 4, but they also have some collinearity issues that can be seen in the VIF output. Model 3 takes care of the collinearity but still carries variables that are not needed. In the end, we'll go with model 4, which deals with collinearity and gets rid of the unneeded variables.
Let's refit model 4 on our full dataset and review the summary output and some diagnostic plots; a sketch of the refit follows.
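A minimal sketch of the refit, assuming the prepared full training data is still in `crime_train` (the NULL in the Call line of the output below is an artifact of how the report generated the model):

```r
# Sketch: refit Model 4 on the full prepared training data.
model4_full <- glm(target ~ age + dis + tax + medv,
                   data = crime_train, family = binomial(link = "logit"))
summary(model4_full)
```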
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06595 -0.36692 0.07437 0.31853 2.62961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1329 0.1990 0.668 0.50422
## age 1.1948 0.2542 4.700 2.60e-06 ***
## dis -1.0226 0.2872 -3.561 0.00037 ***
## tax 1.5371 0.2804 5.481 4.22e-08 ***
## medv 0.2566 0.1944 1.320 0.18691
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 626.60 on 451 degrees of freedom
## Residual deviance: 267.56 on 447 degrees of freedom
## AIC: 277.56
##
## Number of Fisher Scoring iterations: 6
Odds Ratio
Here's a table of the odds ratios for model 4 alongside their 95% confidence intervals.
| Term | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 1.142 | 0.779 | 1.711 |
| age | 3.303 | 2.038 | 5.541 |
| dis | 0.360 | 0.198 | 0.614 |
| tax | 4.651 | 2.791 | 8.450 |
| medv | 1.293 | 0.890 | 1.912 |
Our output shows that the odds of a neighborhood being above the median crime rate are multiplied by about 3.3 for each one-unit increase in age, holding the other predictors constant.
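The odds ratio table above can be reproduced by exponentiating the coefficients and their profile confidence intervals; a minimal sketch:

```r
# Sketch: odds ratios with 95% profile confidence intervals for Model 4.
exp(cbind(OddsRatio = coef(model4_full), confint(model4_full)))
```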
5. MAKE PREDICTIONS
At last, we can make our final predictions. From the head of our final dataframe and the frequency table of the predicted classes, the prediction distribution looks very similar to the target distribution we saw in the training data. A scoring sketch follows the tables below.
| P(target = 0) | P(target = 1) | prediction |
|---|---|---|
| 0.8768524 | 0.1231476 | 0 |
| 0.6405608 | 0.3594392 | 0 |
| 0.5375394 | 0.4624606 | 0 |
| 0.6428915 | 0.3571085 | 0 |
| 0.9100915 | 0.0899085 | 0 |
| 0.9489599 | 0.0510401 | 0 |
| prediction | Freq |
|---|---|
| 0 | 21 |
| 1 | 19 |
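A minimal sketch of the scoring step, assuming the evaluation data (with the same preparation applied) sits in a dataframe named `crime_eval` (a placeholder name):

```r
# Sketch: score the evaluation set with the final model and save the results.
probs <- predict(model4_full, newdata = crime_eval, type = "response")
pred  <- ifelse(probs > 0.5, 1, 0)

table(pred)  # compare against the target distribution in the training data

write.csv(data.frame(prob_0 = 1 - probs, prob_1 = probs, prediction = pred),
          "HW3_prediction.csv", row.names = FALSE)
```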
Appendix
Code: https://github.com/dquarshie89/Data-621/blob/master/DQ_HW3.Rmd
Prediction Results: https://github.com/dquarshie89/Data-621/blob/master/HW3_prediction.csv