1. DATA EXPLORATION

In this section I describe the size and the variables in the crime training data set.

To explore the data, I will first preview the first six rows of the training data set; this shows the column names and the type of data each column contains. The training set has 466 rows (neighborhoods) and 13 columns: 12 predictors plus the binary target.

head(crime_train)
##   zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv target
## 1  0 19.58    0 0.605 7.929  96.2 2.0459   5 403    14.7  3.70 50.0      1
## 2  0 19.58    1 0.871 5.403 100.0 1.3216   5 403    14.7 26.82 13.4      1
## 3  0 18.10    0 0.740 6.485 100.0 1.9784  24 666    20.2 18.85 15.4      1
## 4 30  4.93    0 0.428 6.393   7.8 7.0355   6 300    16.6  5.19 23.7      0
## 5  0  2.46    0 0.488 7.155  92.2 2.7006   3 193    17.8  4.82 37.9      0
## 6  0  8.56    0 0.520 6.781  71.3 2.8561   5 384    20.9  7.67 26.5      0
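
To confirm the overall size and the column types directly, a quick check with base R helps; this is a minimal sketch using the same crime_train data frame shown above.

dim(crime_train)  # number of rows (observations) and columns (variables)
str(crime_train)  # each column's type and a few sample values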

I'll now look at a few basic summary statistics.

mean(crime_train$nox)
## [1] 0.5543105
mean(crime_train$medv)
## [1] 22.58927
#scatterplot matrix of all variable pairs
plot(crime_train)
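
Beyond the means of individual columns, summary() reports the minimum, quartiles, mean, and maximum for every variable at once, and the mean of the binary target gives the share of high-crime neighborhoods; a quick sketch:

summary(crime_train)      # five-number summary plus mean for each column
mean(crime_train$target)  # proportion of neighborhoods with target = 1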

Box plots give a sense of each variable's spread and flag potential outliers.

boxplot(crime_train)

Because tax is on a much larger scale than the other variables, it compresses the remaining box plots, so let's look at them without the tax variable.

#drop the tax column (column 9), whose larger scale compresses the other plots
crimebox <- crime_train[,-9]
boxplot(crimebox)

We can see that zn, rm, and medv have many outliers.
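
To put rough numbers on that visual impression, we can count how many points fall outside the usual 1.5 × IQR boxplot whiskers for each variable. This is only a sketch using the standard boxplot convention, not a formal outlier test.

# count values beyond 1.5 * IQR from the quartiles, per column
outlier_count <- sapply(crimebox, function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
})
outlier_count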

2. DATA PREPARATION

In this section I describe how the data were transformed and why. First, we will replace missing values in each column with the mean of that variable.

#replace any missing value in a column with that column's mean
for(i in 1:ncol(crime_train)){
  crime_train[is.na(crime_train[,i]), i] <- mean(crime_train[,i], na.rm = TRUE)
}
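
Before relying on the imputation, it is worth confirming how many values were actually missing in each column; if there are none, the loop above simply leaves the data unchanged. A one-line check:

colSums(is.na(crime_train))  # count of missing values per column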

We will then apply a log transformation to the skewed variables (zn, rm, and medv), adding 1 first so that zero values can be handled.

#transform data using log for skewed variables
m1_data <- crime_train
m1_data$zn <- log(m1_data$zn+1)
m1_data$rm <- log(m1_data$rm+1)
m1_data$medv <- log(m1_data$medv+1)
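
To see what the log transformation does, we can compare a variable's distribution before and after; the sketch below uses zn as an example with base graphics.

par(mfrow = c(1, 2))   # two plots side by side
hist(crime_train$zn, main = "zn (raw)", xlab = "zn")
hist(m1_data$zn, main = "zn (log scale)", xlab = "log(zn + 1)")
par(mfrow = c(1, 1))   # reset the plotting layout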

3. BUILD MODELS

Using the training data, we will build four binary logistic regression models, using different variables (or the same variables with different transformations).

First, we will build a baseline model using all of the original, untransformed variables.

m0 <- glm(target ~ ., data = crime_train, family = binomial)
summary(m0)
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = crime_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8464  -0.1445  -0.0017   0.0029   3.4665  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.822934   6.632913  -6.155 7.53e-10 ***
## zn           -0.065946   0.034656  -1.903  0.05706 .  
## indus        -0.064614   0.047622  -1.357  0.17485    
## chas          0.910765   0.755546   1.205  0.22803    
## nox          49.122297   7.931706   6.193 5.90e-10 ***
## rm           -0.587488   0.722847  -0.813  0.41637    
## age           0.034189   0.013814   2.475  0.01333 *  
## dis           0.738660   0.230275   3.208  0.00134 ** 
## rad           0.666366   0.163152   4.084 4.42e-05 ***
## tax          -0.006171   0.002955  -2.089  0.03674 *  
## ptratio       0.402566   0.126627   3.179  0.00148 ** 
## lstat         0.045869   0.054049   0.849  0.39608    
## medv          0.180824   0.068294   2.648  0.00810 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 192.05  on 453  degrees of freedom
## AIC: 218.05
## 
## Number of Fisher Scoring iterations: 9
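
Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are usually easier to explain; a minimal sketch using base R (confint.default gives approximate Wald intervals):

exp(coef(m0))             # odds ratios for each predictor
exp(confint.default(m0))  # approximate 95% confidence intervals on the odds-ratio scale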

For the second model, we will keep only the predictors that looked most significant in the first model: zn, nox, age, dis, rad, ptratio, and medv.

#create new df with only the selected predictors
m2_data <- crime_train %>% select(zn, nox, age, dis, rad, ptratio, medv, target)
#create model
m2 <- glm(target ~ ., data = m2_data, family = binomial)
summary(m2)
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = m2_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8015  -0.2383  -0.0042   0.0067   3.4398  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -35.81914    5.71903  -6.263 3.77e-10 ***
## zn           -0.07482    0.03140  -2.383 0.017193 *  
## nox          37.68909    6.15376   6.125 9.09e-10 ***
## age           0.03059    0.01056   2.896 0.003782 ** 
## dis           0.75199    0.20864   3.604 0.000313 ***
## rad           0.53469    0.12455   4.293 1.76e-05 ***
## ptratio       0.25940    0.10615   2.444 0.014542 *  
## medv          0.13587    0.03373   4.029 5.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 207.62  on 458  degrees of freedom
## AIC: 223.62
## 
## Number of Fisher Scoring iterations: 8
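
Since m2 uses a subset of m0's predictors, the two models are nested, and a likelihood-ratio test can check whether dropping the other variables significantly worsens the fit; a sketch using the built-in anova method for glm objects:

anova(m2, m0, test = "Chisq")  # chi-squared test of the reduced model against the full model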

Next, we'll fit the full model again, this time on the log-transformed data.

#fit the full model on the transformed data
m1 <- glm(target ~ ., data = m1_data, family = binomial)
summary(m1)
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = m1_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8997  -0.1881  -0.0086   0.0036   3.4997  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -50.415062   9.745994  -5.173 2.30e-07 ***
## zn           -0.544176   0.255436  -2.130  0.03314 *  
## indus        -0.077573   0.048098  -1.613  0.10678    
## chas          0.791586   0.764041   1.036  0.30018    
## nox          49.655159   8.001651   6.206 5.45e-10 ***
## rm           -0.333642   4.711563  -0.071  0.94355    
## age           0.030487   0.013293   2.293  0.02183 *  
## dis           0.741977   0.231340   3.207  0.00134 ** 
## rad           0.616525   0.156841   3.931 8.46e-05 ***
## tax          -0.005764   0.003035  -1.899  0.05754 .  
## ptratio       0.333012   0.128974   2.582  0.00982 ** 
## lstat         0.075544   0.056379   1.340  0.18027    
## medv          3.806181   1.730386   2.200  0.02783 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 195.68  on 453  degrees of freedom
## AIC: 221.68
## 
## Number of Fisher Scoring iterations: 9

Finally, for the fourth model we'll keep the variables with p-values below .05 from the transformed-data model: zn, nox, age, dis, rad, ptratio, and medv.

#create new df with only the selected predictors
m3_data <- m1_data %>% select(zn, nox, age, dis, rad, ptratio, medv, target)
#create model
m3 <- glm(target ~ ., data = m3_data, family = binomial)
summary(m3)
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = m3_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8574  -0.2569  -0.0211   0.0067   3.4821  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -43.5063     7.2796  -5.976 2.28e-09 ***
## zn           -0.5508     0.2225  -2.476 0.013299 *  
## nox          38.6680     6.1852   6.252 4.06e-10 ***
## age           0.0327     0.0107   3.057 0.002236 ** 
## dis           0.7618     0.2098   3.631 0.000282 ***
## rad           0.5236     0.1216   4.306 1.66e-05 ***
## ptratio       0.2345     0.1128   2.079 0.037632 *  
## medv          3.3915     0.8962   3.784 0.000154 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 210.41  on 458  degrees of freedom
## AIC: 226.41
## 
## Number of Fisher Scoring iterations: 8

4. SELECT MODELS

We now need criteria for selecting the best of the binary logistic regression models above.

I'm selecting model m0 because it has the lowest AIC (218.05) and the lowest residual deviance (192.05) of the four models, which suggests the best trade-off between fit and complexity among these candidates.
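
A small comparison table makes that criterion explicit; this sketch simply collects the AIC and residual deviance already reported in the summaries above.

# side-by-side comparison of the four candidate models
data.frame(
  model    = c("m0", "m1", "m2", "m3"),
  AIC      = c(AIC(m0), AIC(m1), AIC(m2), AIC(m3)),
  deviance = c(deviance(m0), deviance(m1), deviance(m2), deviance(m3))
)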

I will check the ROC curve as well.

# predicted probabilities for the training data
crime_train$predict <- predict(m0, crime_train, type='response')

# ROC curve via roc() from the pROC package
crime_roc <- roc(crime_train$target, crime_train$predict, plot=T, asp=NA,
                legacy.axes=T, main = "ROC Curve", ret="tp", col="blue")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Since the ROC curve plots the true positive rate against the false positive rate, we are looking for the area under the curve (AUC) to be close to 1.

crime_roc["auc"]
## $auc
## Area under the curve: 0.9738

An AUC of roughly 0.97 is very close to 1, so the model discriminates between high- and low-crime neighborhoods very well, at least on the training data.
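
As a complementary check, a confusion matrix at the usual 0.5 probability cutoff shows how the training observations are classified. Note this is in-sample performance, so it is likely optimistic; a sketch:

train_class <- ifelse(crime_train$predict >= 0.5, 1, 0)       # classify at a 0.5 threshold
table(actual = crime_train$target, predicted = train_class)   # rows = actual, columns = predicted
mean(train_class == crime_train$target)                       # in-sample accuracy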

Finally, we will generate predictions for the evaluation data.

final_Pred <- predict(m0, newdata = crime_eval, type = "response")  # predicted probabilities
final_Pred <- ifelse(final_Pred < .5, 0, 1)                         # classify at the 0.5 probability threshold
hist(final_Pred)
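
The predicted classes can then be tallied and, if needed, written out alongside the evaluation data. The file name crime_eval_predictions.csv below is only an illustration, not a required deliverable format.

table(final_Pred)  # count of predicted low- vs high-crime neighborhoods
# hypothetical export of the evaluation data with the predicted class attached
write.csv(data.frame(crime_eval, predicted_target = final_Pred),
          "crime_eval_predictions.csv", row.names = FALSE)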