Assignment 3

Overview

DATA 621 – Business Analytics and Data Mining

Homework #3 Assignment Requirements

In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0). Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)
indus: proportion of non-retail business acres per suburb (predictor variable)
chas: a dummy variable for whether the suburb borders the Charles River (1) or not (0) (predictor variable)
nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
rm: average number of rooms per dwelling (predictor variable)
age: proportion of owner-occupied units built prior to 1940 (predictor variable)
dis: weighted mean of distances to five Boston employment centers (predictor variable)
rad: index of accessibility to radial highways (predictor variable)
tax: full-value property-tax rate per $10,000 (predictor variable)
ptratio: pupil-teacher ratio by town (predictor variable)
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (predictor variable)
lstat: lower status of the population (percent) (predictor variable)
medv: median value of owner-occupied homes in $1000s (predictor variable)
target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)

Deliverables:

A write-up submitted in PDF format. Your write-up should have four sections, each described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details. Also deliver the assigned predictions (probabilities and classifications) for the evaluation data set, using a 0.5 threshold, and include your R statistical programming code in an Appendix.

Write Up:

SUMMARY

As detailed in this report, model 3 (gmod3) is found to be a sound, parsimonious model for predicting the target variable.

1. DATA EXPLORATION (25 Points)

Describe the size and the variables in the crime training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren't doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
a. Mean / Standard Deviation / Median
b. Bar Chart or Box Plot of the data
c. Is the data correlated to the target variable (or to other variables?)
d. Are any of the variables missing and need to be imputed/"fixed"?

DATA EXPLORATION SOLUTION

Per guidance in the problem, I've included below both a numerical summary (mean, median, standard deviation) of the variables in the training dataset and histogram plots, since all the relevant variables are numeric. There are no variables with NAs. Although two variables (zn and chas) have minimum values of 0, these are legitimate values and not ones entered in error. Finally, a correlation plot shows the numeric correlation values (-1 to +1) among the variables in the dataset.

In terms of the size of the training data set, we notice that it has 466 records, 12 predictor variables and one response variable. We start with some numerical summaries. A close look at the minimum and maximum values of each variable is worthwhile. The following variables have minimum values of 0: zn (proportion of residential land zoned for large lots) and chas (dummy variable for whether the suburb borders the Charles River (1) or not (0)). It is not unusual for either of these variables to take the value 0, i.e. the proportion of residential land zoned for large lots may be 0%, and a suburb that does not border the Charles River has a dummy variable value of 0. Similarly, looking at the maximum values doesn't cause any alarm.

A visual inspection of the variables indicates that most have skewed distributions. Only one variable, average number of rooms per dwelling (rm), appears approximately normally distributed, while a few variables (indus, tax and rad) have bimodal distributions.

From the correlation matrix plot, we identify a few variables that are highly or moderately positively correlated - tax & rad (0.91), nox & age (0.74), tax & indus (0.73), target & nox (0.73) and medv & rm (0.71). Similarly, some variables are moderately negatively correlated - dis & nox (-0.77), dis & age (-0.75), medv & lstat (-0.74) and dis & indus (-0.7).

This knowledge helps us narrow the size of the predictive model by including fewer variables. For example, dis & indus have a negative correlation of -0.7, indicating that the further the distance to Boston employment centers, the lower the proportion of non-retail business acres, or conversely the higher the proportion of residential land. Also, dis & nox are inversely related: the shorter the distance to Boston employment centers (i.e. the closer in the properties are), the higher the nitrogen oxides concentration, perhaps because these properties are zoned for industrial use and have higher pollutant levels.

We also look at the target variable, which indicates whether a particular neighborhood has a higher-than-median crime rate, and observe that the target has a moderate positive correlation to indus, nox, age, rad and tax, and a moderate negative correlation to dis.

library(corrplot)
## corrplot 0.92 loaded
#Reading in the training and evaluation data files

training <- read.csv("/Users/tponnada/Downloads/crime-training-data_modified.csv")
eval <- read.csv("/Users/tponnada/Downloads/crime-evaluation-data_modified.csv")

#Checking the first 6 rows of the training data set, the dimensions of the data set and the usual univariate summary information.

head(training)
##   zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv target
## 1  0 19.58    0 0.605 7.929  96.2 2.0459   5 403    14.7  3.70 50.0      1
## 2  0 19.58    1 0.871 5.403 100.0 1.3216   5 403    14.7 26.82 13.4      1
## 3  0 18.10    0 0.740 6.485 100.0 1.9784  24 666    20.2 18.85 15.4      1
## 4 30  4.93    0 0.428 6.393   7.8 7.0355   6 300    16.6  5.19 23.7      0
## 5  0  2.46    0 0.488 7.155  92.2 2.7006   3 193    17.8  4.82 37.9      0
## 6  0  8.56    0 0.520 6.781  71.3 2.8561   5 384    20.9  7.67 26.5      0
dim(training)
## [1] 466  13
summary(training)
##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
#Standard deviations of variables 

sapply(training[,1:12], sd)
##          zn       indus        chas         nox          rm         age 
##  23.3646511   6.8458549   0.2567920   0.1166667   0.7048513  28.3213784 
##         dis         rad         tax     ptratio       lstat        medv 
##   2.1069496   8.6859272 167.9000887   2.1968447   7.1018907   9.2396814
#Univariate plots using histograms, kernel density estimates and sorted data plotted against its index for the 12 predictor variables, plus boxplots by target.

#par(mfrow = c(4, 3))  #optional: arrange the univariate plots in a grid

#Proportion of residential land zoned for large lots

hist(training$zn, xlab = "Proportion of residential land zoned for large lots", main = "")

plot(density(training$zn, na.rm = TRUE), main = "")

plot(sort(training$zn), ylab = "Proportion of residential land zoned for large lots")

boxplot(zn ~ target, training)

#Proportion of non-retail business acres per suburb 

hist(training$indus, xlab = "Proportion of non-retail business acres per suburb", main = "")

plot(density(training$indus, na.rm = TRUE), main = "")

plot(sort(training$indus), ylab = "Proportion of non-retail business acres per suburb")

boxplot(indus ~ target, training)

#A dummy var. for whether the suburb borders the Charles River (1) or not (0)

hist(training$chas, xlab = "Dummy variable for whether the suburb borders the Charles River or not", main = "")

plot(density(training$chas, na.rm = TRUE), main = "")

plot(sort(training$chas), ylab = "Dummy variable for whether the suburb borders the Charles River or not")

boxplot(chas ~ target, training)

#Nitrogen oxides concentration

hist(training$nox, xlab = "Nitrogen oxides concentration", main = "")

plot(density(training$nox, na.rm = TRUE), main = "")

plot(sort(training$nox), ylab = "Nitrogen oxides concentration")

boxplot(nox ~ target, training)

#Average number of rooms per dwelling

hist(training$rm, xlab = "Average number of rooms per dwelling", main = "")

plot(density(training$rm, na.rm = TRUE), main = "")

plot(sort(training$rm), ylab = "Average number of rooms per dwelling")

boxplot(rm ~ target, training)

#Proportion of owner-occupied units built prior to 1940

hist(training$age, xlab = "Proportion of owner-occupied units built prior to 1940", main = "")

plot(density(training$age, na.rm = TRUE), main = "")

plot(sort(training$age), ylab = "Proportion of owner-occupied units built prior to 1940")

boxplot(age ~ target, training)

#Weighted mean of distances to five Boston employment centers 

hist(training$dis, xlab = "Weighted mean of distances to five Boston employment centers", main = "")

plot(density(training$dis, na.rm = TRUE), main = "")

plot(sort(training$dis), ylab = "Weighted mean of distances to five Boston employment centers")

boxplot(dis ~ target, training)

#Index of accessibility to radial highways 

hist(training$rad, xlab = "Index of accessibility to radial highways", main = "")

plot(density(training$rad, na.rm = TRUE), main = "")

plot(sort(training$rad), ylab = "Index of accessibility to radial highways")

boxplot(rad ~ target, training)

#Full-value property-tax rate per $10,000 

hist(training$tax, xlab = "Full-value property-tax rate per $10,000", main = "")

plot(density(training$tax, na.rm = TRUE), main = "")

plot(sort(training$tax), ylab = "Full-value property-tax rate per $10,000")

boxplot(tax ~ target, training)

#Pupil-teacher ratio by town

hist(training$ptratio, xlab = "Pupil-teacher ratio by town", main = "")

plot(density(training$ptratio, na.rm = TRUE), main = "")

plot(sort(training$ptratio), ylab = "Pupil-teacher ratio by town")

boxplot(ptratio ~ target, training)

#Lower status of the population (percent) 

hist(training$lstat, xlab = "Lower status of the population (percent) ", main = "")

plot(density(training$lstat, na.rm = TRUE), main = "")

plot(sort(training$lstat), ylab = "Lower status of the population (percent) ")

boxplot(lstat ~ target, training)

#Median value of owner-occupied homes in $1000s

hist(training$medv, xlab = "Median value of owner-occupied homes in $1000s", main = "")

plot(density(training$medv, na.rm = TRUE), main = "")

plot(sort(training$medv), ylab = "Median value of owner-occupied homes in $1000s")

boxplot(medv ~ target, training)

#Instead of using scatterplots for each of the 12 variables against each other, I used the correlation matrix.

M = cor(training, use = "na.or.complete")
corrplot(M, method = 'number', type = 'lower', diag = FALSE, number.cex = 0.5, tl.cex = 0.5, cl.cex = 0.5)
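
To complement the full correlation matrix, the correlations of the predictors with the target alone can be pulled out directly; the lines below are a minimal sketch using base R on the training data frame already loaded above.

#Correlations of each predictor with the target, sorted from most negative to most positive
ct <- cor(training)[, "target"]
sort(ct[names(ct) != "target"])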

2. DATA PREPARATION (25 Points)

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations. a. Fix missing values (maybe with a Mean or Median value) b. Create flags to suggest if a variable was missing c. Transform data by putting it into buckets d. Mathematical transforms such as log or square root (or use Box-Cox) e. Combine variables (such as ratios or adding or multiplying) to create new variables

DATA PREPARATION SOLUTION

There are no missing values, as observed in the data exploration stage above. However, there is scope to transform some of the variables, which we discuss here. We first calculate the variance inflation factor (VIF) of each predictor, which measures how much the variance of its estimated coefficient is inflated by correlation with the other predictors in a regression model. A value > 5 indicates potentially severe collinearity between a given predictor and the other predictors in the model. Based on this threshold and the calculated VIF scores below, we decide to eliminate two predictors - rad (VIF score: 6.781354) and tax (VIF score: 9.217228) - from the model. We also observe from the boxplots above that the spread differs between the two target classes for zn, nox, age, dis, rad and tax. Since we decided to eliminate rad and tax based on VIF scores, we apply log transformations to the positively skewed variables nox and dis (zn, whose values are mostly 0, is left untransformed) and a modified log transformation to age based on its negative skewness.

The target variable is binary, so an ordinary linear model is inappropriate: a linear model requires that the errors be approximately normally distributed, and the variance of a binomial response is not constant, which violates another crucial assumption of the linear model. Hence, in the next phase, we experiment with various binomial (logistic) models.
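
For reference, the binary logistic regression models fit below with glm() use the binomial family's default logit link, so the model for the probability p of a high-crime neighborhood is log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk, and predicted probabilities are recovered as p = exp(eta) / (1 + exp(eta)), where eta is the linear predictor.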

require(MASS)
## Loading required package: MASS
require(car)
## Loading required package: car
## Loading required package: carData
lmod <- lm(target ~., training)
summary(lmod)
## 
## Call:
## lm(formula = target ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59701 -0.21505 -0.04691  0.14908  0.88702 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.6013725  0.3594901  -4.455 1.06e-05 ***
## zn          -0.0009668  0.0009442  -1.024 0.306432    
## indus        0.0031277  0.0042909   0.729 0.466433    
## chas         0.0059892  0.0588402   0.102 0.918970    
## nox          1.9722476  0.2632648   7.491 3.60e-13 ***
## rm           0.0249823  0.0315042   0.793 0.428202    
## age          0.0031738  0.0009045   3.509 0.000495 ***
## dis          0.0125382  0.0141433   0.887 0.375814    
## rad          0.0207000  0.0043384   4.771 2.47e-06 ***
## tax         -0.0002787  0.0002617  -1.065 0.287396    
## ptratio      0.0115287  0.0093460   1.234 0.218013    
## lstat        0.0045124  0.0038923   1.159 0.246935    
## medv         0.0089246  0.0029992   2.976 0.003080 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.312 on 453 degrees of freedom
## Multiple R-squared:  0.6213, Adjusted R-squared:  0.6112 
## F-statistic: 61.92 on 12 and 453 DF,  p-value: < 2.2e-16
vif(lmod)
##       zn    indus     chas      nox       rm      age      dis      rad 
## 2.324259 4.120699 1.090265 4.505049 2.354788 3.134015 4.240618 6.781354 
##      tax  ptratio    lstat     medv 
## 9.217228 2.013109 3.649059 3.667370
trainingdata <- training

#Log transformation on positive skew
trainingdata$nox <- log(trainingdata$nox)

#Log transformation on positive skew
trainingdata$dis <- log(trainingdata$dis)

#Modified log transformation on negative skew
trainingdata$age <- log(1+max(trainingdata$age)-trainingdata$age)
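
As a quick sanity check on these transformations, one can compare the skewness of each variable before and after transformation. The sketch below uses a simple moment-based skewness helper defined inline (not a package function), assuming training and trainingdata as created above.

#Simple moment-based skewness helper (defined here for illustration)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

#Skewness before (training) and after (trainingdata) the transformations
round(c(nox_raw = skew(training$nox), nox_log = skew(trainingdata$nox),
        dis_raw = skew(training$dis), dis_log = skew(trainingdata$dis),
        age_raw = skew(training$age), age_mod = skew(trainingdata$age)), 2)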

3. BUILD MODELS (25 Points)

Using the training data, build at least three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion from the model, indicate why this was done.

Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

BUILD MODELS SOLUTION

We start with a basic binomial model that includes all predictor variables in their untransformed state. From this initial model, we observe that 7 of the 12 explanatory variables have significant p-values - nox (5.90e-10), age (0.01333), dis (0.00134), rad (4.42e-05), tax (0.03674), ptratio (0.00148) and medv (0.00810).

To test goodness of fit, we take the null hypothesis to be that the model is correctly specified. We pass the residual deviance along with the residual degrees of freedom to pchisq and find no evidence to reject the null hypothesis that the model fits. For the second model, we use the transformed variables from the data preparation stage above and, as with model 1, find no evidence to reject the null hypothesis that the model fits.

For model 3, we use a limited number of variables, mainly the 7 variables from the basic model 1 that were found to have explanatory significance, along with their transformations where applicable. Again, as with models 1 and 2, we find no evidence to reject the null hypothesis that the model fits.
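
The same deviance-based goodness-of-fit check can be written as a small helper so the residual deviance and degrees of freedom do not have to be typed by hand for each model. This is a sketch (gof_pvalue is a helper name introduced here) that uses deviance() and df.residual() on the fitted glm objects created below.

#Deviance goodness-of-fit p-value: P(chi-square with residual df >= residual deviance)
#e.g. gof_pvalue(gmod1) once the models below are fit
gof_pvalue <- function(model) {
  pchisq(deviance(model), df = df.residual(model), lower.tail = FALSE)
}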

## Model 1:

gmod1 <- glm(formula = target ~., family = "binomial", data = training)
summary(gmod1)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8464  -0.1445  -0.0017   0.0029   3.4665  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.822934   6.632913  -6.155 7.53e-10 ***
## zn           -0.065946   0.034656  -1.903  0.05706 .  
## indus        -0.064614   0.047622  -1.357  0.17485    
## chas          0.910765   0.755546   1.205  0.22803    
## nox          49.122297   7.931706   6.193 5.90e-10 ***
## rm           -0.587488   0.722847  -0.813  0.41637    
## age           0.034189   0.013814   2.475  0.01333 *  
## dis           0.738660   0.230275   3.208  0.00134 ** 
## rad           0.666366   0.163152   4.084 4.42e-05 ***
## tax          -0.006171   0.002955  -2.089  0.03674 *  
## ptratio       0.402566   0.126627   3.179  0.00148 ** 
## lstat         0.045869   0.054049   0.849  0.39608    
## medv          0.180824   0.068294   2.648  0.00810 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 192.05  on 453  degrees of freedom
## AIC: 218.05
## 
## Number of Fisher Scoring iterations: 9
df <- 453
deviance <- 192.05
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1
## Model 2:

gmod2 <- glm(formula = target ~., family = "binomial", data = trainingdata)
summary(gmod2)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = trainingdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8734  -0.1292  -0.0021   0.0054   3.3742  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.207055   4.518712   1.374  0.16956    
## zn          -0.045862   0.030595  -1.499  0.13388    
## indus       -0.038233   0.048482  -0.789  0.43035    
## chas         0.780209   0.760376   1.026  0.30485    
## nox         26.587287   4.202057   6.327  2.5e-10 ***
## rm          -0.680979   0.700933  -0.972  0.33128    
## age         -0.894507   0.284048  -3.149  0.00164 ** 
## dis          3.256871   0.898688   3.624  0.00029 ***
## rad          0.678045   0.167744   4.042  5.3e-05 ***
## tax         -0.005486   0.003025  -1.814  0.06974 .  
## ptratio      0.401265   0.125181   3.205  0.00135 ** 
## lstat        0.013602   0.058033   0.234  0.81468    
## medv         0.191387   0.067254   2.846  0.00443 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 184.66  on 453  degrees of freedom
## AIC: 210.66
## 
## Number of Fisher Scoring iterations: 9
df <- 453
deviance <- 184.66
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1
## Model 3:

gmod3 <- glm(formula = target ~ nox + age + dis + rad + tax + ptratio + medv, family = "binomial", data = trainingdata)
summary(gmod3)
## 
## Call:
## glm(formula = target ~ nox + age + dis + rad + tax + ptratio + 
##     medv, family = "binomial", data = trainingdata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.96171  -0.17862  -0.01004   0.00366   3.14199  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.808298   2.637837   1.444 0.148818    
## nox         25.282400   3.859363   6.551 5.72e-11 ***
## age         -0.851823   0.227036  -3.752 0.000175 ***
## dis          2.664808   0.797977   3.339 0.000839 ***
## rad          0.706682   0.144791   4.881 1.06e-06 ***
## tax         -0.007326   0.002620  -2.796 0.005177 ** 
## ptratio      0.385922   0.110503   3.492 0.000479 ***
## medv         0.119348   0.036108   3.305 0.000949 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 191.85  on 458  degrees of freedom
## AIC: 207.85
## 
## Number of Fisher Scoring iterations: 9
df <- 458
deviance <- 191.85
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1

4. SELECT MODELS (25 Points)

Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.

For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.

SELECT MODELS SOLUTION

Model 3 has the lowest AIC (207.85) and is also the most parsimonious of the three models. Its AUC (0.973) is marginally the lowest of the three, but the difference is negligible. Model 3 also has the highest accuracy (0.901), the lowest classification error rate (0.099), the second-best precision (0.869), and is tied with model 1 on sensitivity (0.949). The differences in specificity and F1 between the three models are very small.

Please note that I had to look around for some help on the calculation of the confusion matrix and ROC curves.

Based on these diagnostics, I decided to use model 3 for the final predictions; the confusion-matrix metrics below are computed on the training dataset, and model 3 is then applied to the evaluation dataset.
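
The AIC comparison underlying this choice can be reproduced directly from the fitted models; the line below is a minimal check using base R's AIC().

AIC(gmod1, gmod2, gmod3)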

Confusion matrix:

library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: ggplot2
## Loading required package: lattice
library(kableExtra)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
formula(gmod1)
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + lstat + medv
formula(gmod2)
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + lstat + medv
formula(gmod3)
## target ~ nox + age + dis + rad + tax + ptratio + medv
#Note: predict() without type = "response" returns values on the link (log-odds) scale,
#so the 0.5 cutoff below is applied to the log-odds rather than to predicted probabilities.
preds1 = predict(gmod1, newdata = training)
preds2 = predict(gmod2, newdata = trainingdata)
preds3 = predict(gmod3, newdata = trainingdata)

preds1[preds1 >= 0.5] <- 1
preds1[preds1 < 0.5] <- 0
preds1 = as.factor(preds1)
preds2[preds2 >= 0.5] <- 1
preds2[preds2 < 0.5] <- 0
preds2 = as.factor(preds2)
preds3[preds3 >= 0.5] <- 1
preds3[preds3 < 0.5] <- 0
preds3 = as.factor(preds3)

m1 <- confusionMatrix(preds1, as.factor(training$target), mode = "everything")
m2 <- confusionMatrix(preds2, as.factor(trainingdata$target), mode = "everything")
m3 <- confusionMatrix(preds3, as.factor(trainingdata$target), mode = "everything")

temp <- data.frame(m1$overall, 
                   m2$overall, 
                   m3$overall) %>%
  t() %>%
  data.frame() %>%
  dplyr::select(Accuracy) %>%
  mutate(Classification_Error_Rate = 1-Accuracy)
Summ_Stat <-data.frame(m1$byClass, 
                   m2$byClass, 
                   m3$byClass) %>%
  t() %>%
  data.frame() %>%
  cbind(temp) %>%
  mutate(Model = c("Model 1", "Model 2", "Model 3")) %>%
  dplyr::select(Model, Accuracy, Classification_Error_Rate, Precision, Sensitivity, Specificity, F1) %>%
  mutate_if(is.numeric, round,3) %>%
  kable('html', escape = F) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)

Summ_Stat
Model Accuracy Classification_Error_Rate Precision Sensitivity Specificity F1
Model 1 0.899 0.101 0.865 0.949 0.847 0.905
Model 2 0.899 0.101 0.871 0.941 0.856 0.905
Model 3 0.901 0.099 0.869 0.949 0.852 0.907
getROC <- function(model) {
    name <- deparse(substitute(model))
    pred.prob1 <- predict(model, newdata = training)
    p1 <- data.frame(pred = training$target, prob = pred.prob1)
    p1 <- p1[order(p1$prob),]
    rocobj <- pROC::roc(p1$pred, p1$prob)
    plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
         xlab="Specificity", main = name)
}

getROC1 <- function(model) {
    name <- deparse(substitute(model))
    pred.prob1 <- predict(model, newdata = trainingdata)
    p1 <- data.frame(pred = trainingdata$target, prob = pred.prob1)
    p1 <- p1[order(p1$prob),]
    rocobj <- pROC::roc(p1$pred, p1$prob)
    plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
         xlab="Specificity", main = name)
}

par(mfrow=c(3,3))

getROC(gmod1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC1(gmod2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC1(gmod3)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
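
For completeness, the AUC values can also be extracted numerically rather than read off the plots; this is a quick sketch using the pROC package already referenced inside getROC above, with predictions taken on the fitted probability scale.

#Numeric AUC for each model on its training data
pROC::auc(training$target, predict(gmod1, type = "response"))
pROC::auc(trainingdata$target, predict(gmod2, type = "response"))
pROC::auc(trainingdata$target, predict(gmod3, type = "response"))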

Using model 3 to predict results:

#Deviance comparisons between the models. Note that gmod2 vs gmod3 is the only properly
#nested comparison fit to the same (transformed) data; the comparisons involving gmod1,
#which was fit to the untransformed data, are shown for reference only.
anova(gmod1, gmod2, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + lstat + medv
## Model 2: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + lstat + medv
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       453     192.05                     
## 2       453     184.66  0   7.3839
anova(gmod2, gmod3, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + lstat + medv
## Model 2: target ~ nox + age + dis + rad + tax + ptratio + medv
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       453     184.66                     
## 2       458     191.85 -5  -7.1829   0.2074
anova(gmod1, gmod3, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + lstat + medv
## Model 2: target ~ nox + age + dis + rad + tax + ptratio + medv
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       453     192.05                     
## 2       458     191.85 -5  0.20094
#Apply the same transformations to the evaluation data that were applied to the training data.

evaldata <- eval

#Log transformation on positive skew
evaldata$nox <- log(evaldata$nox)

#Log transformation on positive skew
evaldata$dis <- log(evaldata$dis)

#Modified log transformation on negative skew
evaldata$age <- log(1 + max(evaldata$age) - evaldata$age)

## Model 3 applied to eval dataset:

#Predicted probabilities on the response (probability) scale and classifications at the 0.5 threshold
pred1 <- predict(gmod3, newdata = evaldata, type = "response")
log_predict <- ifelse(pred1 > 0.5, 1, 0)

write.csv(data.frame(probability = pred1, classification = log_predict),
          file = '/Users/tponnada/Downloads/predict1.csv')

#Note: the parentheses place the square outside the mean, so this computes
#sqrt((mean(x - y))^2), i.e. the absolute value of the mean error, rather than the
#usual RMSE sqrt(mean((x - y)^2)). The evaluation data also has no target column,
#which is why the second call returns NaN.
rmse <- (function(x, y) sqrt(mean(x-y)^2))
rmse(fitted(gmod3), training$target)
## [1] 6.132544e-16
rmse(predict(gmod3, evaldata), evaldata$target)
## [1] NaN