Homework #3 Assignment Requirements
In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0). Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)
indus: proportion of non-retail business acres per suburb (predictor variable)
chas: a dummy variable for whether the suburb borders the Charles River (1) or not (0) (predictor variable)
nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
rm: average number of rooms per dwelling (predictor variable)
age: proportion of owner-occupied units built prior to 1940 (predictor variable)
dis: weighted mean of distances to five Boston employment centers (predictor variable)
rad: index of accessibility to radial highways (predictor variable)
tax: full-value property-tax rate per $10,000 (predictor variable)
ptratio: pupil-teacher ratio by town (predictor variable)
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (predictor variable)
lstat: lower status of the population (percent) (predictor variable)
medv: median value of owner-occupied homes in $1000s (predictor variable)
target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)
Deliverables:
A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details. Assigned predictions (probabilities and classifications) for the evaluation data set, using a 0.5 threshold. Include your R statistical programming code in an Appendix.
Write Up:
As outlined in this report, the third model (gmod3) is found to be a reasonable model for predicting the target variable.
Describe the size and the variables in the crime training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas. a. Mean / Standard Deviation / Median b. Bar Chart or Box Plot of the data c. Is the data correlated to the target variable (or to other variables?) d. Are any of the variables missing and need to be imputed “fixed”?
Per the guidance in the problem, I've included below both a numerical summary (mean, median, standard deviation) of the variables in the training dataset and histogram plots of the variables, since all the relevant variables are numeric. Furthermore, no variables contain NA values. Additionally, although two variables (zn and chas) have minimum values of 0, these are legitimate values and not data-entry errors. Finally, a correlation plot shows the pairwise correlation values (-1 to +1) of the variables in the dataset.
In terms of size, the training data set has 466 records, 12 predictor variables and one response variable. We start with some numerical summaries. A close look at the minimum and maximum values of each variable is worthwhile. Two variables have minimum values of 0: zn (proportion of residential land zoned for large lots) and chas (dummy variable for whether the suburb borders the Charles River (1) or not (0)). Zero values are plausible for both, i.e. the proportion of residential land zoned for large lots may be 0%, and a suburb that does not border the Charles River takes the dummy value 0. Similarly, the maximum values do not raise any concerns.
Visual plots of the variables indicate that most have skewed distributions. Only one variable appears approximately normally distributed - average number of rooms per dwelling (rm) - while a few variables have bimodal distributions - indus, tax and rad.
From the correlation matrix plot, we identify a few variable pairs that are highly or moderately positively correlated - tax & rad (0.91), nox & age (0.74), tax & indus (0.73), target & nox (0.73) and medv & rm (0.71). Similarly, some pairs are moderately to strongly negatively correlated - dis & nox (-0.77), dis & age (-0.75), medv & lstat (-0.74) and dis & indus (-0.70).
This knowledge helps us narrow the predictive model by including fewer variables. For example, dis & indus have a negative correlation of -0.7, indicating that the greater the distance to Boston employment centers, the lower the proportion of non-retail business acres (or, conversely, the higher the proportion of residential land). Likewise, dis & nox are inversely related: the shorter the distance to Boston employment centers (i.e., the closer in the properties are), the higher the nitrogen oxides concentration, perhaps because those areas are zoned for industrial use and carry more pollutants.
Finally, we look at the target variable, which indicates whether a particular neighborhood has a higher-than-median crime rate, and observe that it has a moderate positive correlation with indus, nox, age, rad and tax, and a moderate negative correlation with dis.
library(corrplot)
## corrplot 0.92 loaded
#Reading in the training and evaluation data files
training <- read.csv("/Users/tponnada/Downloads/crime-training-data_modified.csv")
eval <- read.csv("/Users/tponnada/Downloads/crime-evaluation-data_modified.csv")
#Checking the first 6 rows of the training data set, the dimensions of the data set and the usual univariate summary information.
head(training)
## zn indus chas nox rm age dis rad tax ptratio lstat medv target
## 1 0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 3.70 50.0 1
## 2 0 19.58 1 0.871 5.403 100.0 1.3216 5 403 14.7 26.82 13.4 1
## 3 0 18.10 0 0.740 6.485 100.0 1.9784 24 666 20.2 18.85 15.4 1
## 4 30 4.93 0 0.428 6.393 7.8 7.0355 6 300 16.6 5.19 23.7 0
## 5 0 2.46 0 0.488 7.155 92.2 2.7006 3 193 17.8 4.82 37.9 0
## 6 0 8.56 0 0.520 6.781 71.3 2.8561 5 384 20.9 7.67 26.5 0
dim(training)
## [1] 466 13
summary(training)
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
#Standard deviations of variables
sapply(training[,1:12], sd)
## zn indus chas nox rm age
## 23.3646511 6.8458549 0.2567920 0.1166667 0.7048513 28.3213784
## dis rad tax ptratio lstat medv
## 2.1069496 8.6859272 167.9000887 2.1968447 7.1018907 9.2396814
#Univariate plots using histograms, kernel density estimates, sorted data plotted against its index, and boxplots by target for the 12 predictor variables.
#par(mfrow = c(2, 2))
#Proportion of residential land zoned for large lots
hist(training$zn, xlab = "Proportion of residential land zoned for large lots", main = "")
plot(density(training$zn, na.rm = TRUE), main = "")
plot(sort(training$zn), ylab = "Proportion of residential land zoned for large lots")
boxplot(zn ~ target, training)
#Proportion of non-retail business acres per suburb
hist(training$indus, xlab = "Proportion of non-retail business acres per suburb", main = "")
plot(density(training$indus, na.rm = TRUE), main = "")
plot(sort(training$indus), ylab = "Proportion of non-retail business acres per suburb")
boxplot(indus ~ target, training)
#A dummy var. for whether the suburb borders the Charles River (1) or not (0)
hist(training$chas, xlab = "Dummy variable for whether the suburb borders the Charles River or not", main = "")
plot(density(training$chas, na.rm = TRUE), main = "")
plot(sort(training$chas), ylab = "Dummy variable for whether the suburb borders the Charles River or not")
boxplot(chas ~ target, training)
#Nitrogen oxides concentration
hist(training$nox, xlab = "Nitrogen oxides concentration", main = "")
plot(density(training$nox, na.rm = TRUE), main = "")
plot(sort(training$nox), ylab = "Nitrogen oxides concentration")
boxplot(nox ~ target, training)
#Average number of rooms per dwelling
hist(training$rm, xlab = "Average number of rooms per dwelling", main = "")
plot(density(training$rm, na.rm = TRUE), main = "")
plot(sort(training$rm), ylab = "Average number of rooms per dwelling")
boxplot(rm ~ target, training)
#Proportion of owner-occupied units built prior to 1940
hist(training$age, xlab = "Proportion of owner-occupied units built prior to 1940", main = "")
plot(density(training$age, na.rm = TRUE), main = "")
plot(sort(training$age), ylab = "Proportion of owner-occupied units built prior to 1940")
boxplot(age ~ target, training)
#Weighted mean of distances to five Boston employment centers
hist(training$dis, xlab = "Weighted mean of distances to five Boston employment centers", main = "")
plot(density(training$dis, na.rm = TRUE), main = "")
plot(sort(training$dis), ylab = "Weighted mean of distances to five Boston employment centers")
boxplot(dis ~ target, training)
#Index of accessibility to radial highways
hist(training$rad, xlab = "Index of accessibility to radial highways", main = "")
plot(density(training$rad, na.rm = TRUE), main = "")
plot(sort(training$rad), ylab = "Index of accessibility to radial highways")
boxplot(rad ~ target, training)
#Full-value property-tax rate per $10,000
hist(training$tax, xlab = "Full-value property-tax rate per $10,000", main = "")
plot(density(training$tax, na.rm = TRUE), main = "")
plot(sort(training$tax), ylab = "Full-value property-tax rate per $10,000")
boxplot(tax ~ target, training)
#Pupil-teacher ratio by town
hist(training$ptratio, xlab = "Pupil-teacher ratio by town", main = "")
plot(density(training$ptratio, na.rm = TRUE), main = "")
plot(sort(training$ptratio), ylab = "Pupil-teacher ratio by town")
boxplot(ptratio ~ target, training)
#Lower status of the population (percent)
hist(training$lstat, xlab = "Lower status of the population (percent) ", main = "")
plot(density(training$lstat, na.rm = TRUE), main = "")
plot(sort(training$lstat), ylab = "Lower status of the population (percent) ")
boxplot(lstat ~ target, training)
#Median value of owner-occupied homes in $1000s
hist(training$medv, xlab = "Median value of owner-occupied homes in $1000s", main = "")
plot(density(training$medv, na.rm = TRUE), main = "")
plot(sort(training$medv), ylab = "Median value of owner-occupied homes in $1000s")
boxplot(medv ~ target, training)
#Instead of using scatterplots for each of the 12 variables against each other, I used the correlation matrix.
M = cor(training, use = "na.or.complete")
corrplot(M, method = 'number', type = 'lower', diag = FALSE, number.cex = 0.5, tl.cex = 0.5, cl.cex = 0.5)
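To complement the correlation plot, the correlations of each variable with the target can also be listed numerically; this is a small addition to the original chunk that reuses the matrix M computed above.
#Correlations of every variable with the target, sorted from most positive to most negative
round(sort(M[, "target"], decreasing = TRUE), 2)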
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations. a. Fix missing values (maybe with a Mean or Median value) b. Create flags to suggest if a variable was missing c. Transform data by putting it into buckets d. Mathematical transforms such as log or square root (or use Box-Cox) e. Combine variables (such as ratios or adding or multiplying) to create new variables
There are no missing values, as observed in the data exploration stage above. However, there is scope to transform some of the variables, which we discuss here. We first calculate the variance inflation factor (VIF) of each predictor, which measures how strongly that predictor is correlated with the other predictors in a regression model. A value above 5 indicates potentially severe multicollinearity between a given predictor and the others. Based on this threshold and the VIF scores below, we eliminate two predictors from the model - rad (VIF 6.78) and tax (VIF 9.22). We also observe from the boxplots above that the spread differs between the two target classes for zn, nox, age, dis, rad and tax. Having already dropped rad and tax on VIF grounds, we apply a log transformation to the positively skewed variables nox and dis (zn is left untransformed because it contains many zero values, for which the log is undefined) and a modified log transformation to age to address its negative skewness.
The target variable is binary, so an ordinary linear model is inappropriate: linear regression assumes approximately normally distributed errors, and the variance of a binomial response is not constant, which violates the constant-variance assumption as well. Hence, in the next phase we fit several binomial (logistic) models.
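To quantify the skewness claims made above, a quick numeric check is sketched below; it assumes the e1071 package is installed, which is not loaded elsewhere in this report.
#Sample skewness of the candidate variables (positive = right-skewed, negative = left-skewed)
#Assumes e1071 is available; any skewness implementation would serve
sapply(training[, c("zn", "nox", "dis", "age")], e1071::skewness)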
require(MASS)
## Loading required package: MASS
require(car)
## Loading required package: car
## Loading required package: carData
lmod <- lm(target ~., training)
summary(lmod)
##
## Call:
## lm(formula = target ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59701 -0.21505 -0.04691 0.14908 0.88702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.6013725 0.3594901 -4.455 1.06e-05 ***
## zn -0.0009668 0.0009442 -1.024 0.306432
## indus 0.0031277 0.0042909 0.729 0.466433
## chas 0.0059892 0.0588402 0.102 0.918970
## nox 1.9722476 0.2632648 7.491 3.60e-13 ***
## rm 0.0249823 0.0315042 0.793 0.428202
## age 0.0031738 0.0009045 3.509 0.000495 ***
## dis 0.0125382 0.0141433 0.887 0.375814
## rad 0.0207000 0.0043384 4.771 2.47e-06 ***
## tax -0.0002787 0.0002617 -1.065 0.287396
## ptratio 0.0115287 0.0093460 1.234 0.218013
## lstat 0.0045124 0.0038923 1.159 0.246935
## medv 0.0089246 0.0029992 2.976 0.003080 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.312 on 453 degrees of freedom
## Multiple R-squared: 0.6213, Adjusted R-squared: 0.6112
## F-statistic: 61.92 on 12 and 453 DF, p-value: < 2.2e-16
vif(lmod)
## zn indus chas nox rm age dis rad
## 2.324259 4.120699 1.090265 4.505049 2.354788 3.134015 4.240618 6.781354
## tax ptratio lstat medv
## 9.217228 2.013109 3.649059 3.667370
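As a convenience, the predictors exceeding the VIF threshold of 5 discussed above can be flagged programmatically; this is a minimal sketch added to the original chunk.
#Predictors whose variance inflation factor exceeds 5
v <- vif(lmod)
v[v > 5]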
trainingdata <- training
#Log transformation on positive skew
trainingdata$nox <- log(trainingdata$nox)
#Log transformation on positive skew
trainingdata$dis <- log(trainingdata$dis)
#Modified log transformation on negative skew
trainingdata$age <- log(1+max(trainingdata$age)-trainingdata$age)
Using the training data, build at least three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion from the model, indicate why this was done.
Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.
We start with a basic binomial model that includes all predictor variables in their untransformed state. From this initial model, we observe that 7 of the 12 explanatory variables have significant p-values - nox (5.90e-10), age (0.01333), dis (0.00134), rad (4.42e-05), tax (0.03674), ptratio (0.00148) and medv (0.00810).
To test goodness of fit, we take as the null hypothesis that the model is correctly specified. We compare the residual deviance to a chi-squared distribution on the residual degrees of freedom (via pchisq) and find no evidence to reject the null hypothesis that the model fits (noting that for ungrouped binary data this deviance-based test is only a rough check). For the second model, we use the transformed variables from the data preparation stage above and, as with model 1, find no evidence to reject the null hypothesis that the model fits.
For model 3, we use a reduced set of variables: the 7 predictors that were found significant in the basic model 1, with their transformations where applicable. Again, as with models 1 and 2, there is no evidence to reject the null hypothesis that the model fits.
## Model 1:
gmod1 <- glm(formula = target ~., family = "binomial", data = training)
summary(gmod1)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8464 -0.1445 -0.0017 0.0029 3.4665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## zn -0.065946 0.034656 -1.903 0.05706 .
## indus -0.064614 0.047622 -1.357 0.17485
## chas 0.910765 0.755546 1.205 0.22803
## nox 49.122297 7.931706 6.193 5.90e-10 ***
## rm -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## rad 0.666366 0.163152 4.084 4.42e-05 ***
## tax -0.006171 0.002955 -2.089 0.03674 *
## ptratio 0.402566 0.126627 3.179 0.00148 **
## lstat 0.045869 0.054049 0.849 0.39608
## medv 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
df <- 453
deviance <- 192.05
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1
## Model 2:
gmod2 <- glm(formula = target ~., family = "binomial", data = trainingdata)
summary(gmod2)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = trainingdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8734 -0.1292 -0.0021 0.0054 3.3742
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.207055 4.518712 1.374 0.16956
## zn -0.045862 0.030595 -1.499 0.13388
## indus -0.038233 0.048482 -0.789 0.43035
## chas 0.780209 0.760376 1.026 0.30485
## nox 26.587287 4.202057 6.327 2.5e-10 ***
## rm -0.680979 0.700933 -0.972 0.33128
## age -0.894507 0.284048 -3.149 0.00164 **
## dis 3.256871 0.898688 3.624 0.00029 ***
## rad 0.678045 0.167744 4.042 5.3e-05 ***
## tax -0.005486 0.003025 -1.814 0.06974 .
## ptratio 0.401265 0.125181 3.205 0.00135 **
## lstat 0.013602 0.058033 0.234 0.81468
## medv 0.191387 0.067254 2.846 0.00443 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 184.66 on 453 degrees of freedom
## AIC: 210.66
##
## Number of Fisher Scoring iterations: 9
df <- 453
deviance <- 184.66
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1
## Model 3:
gmod3 <- glm(formula = target ~ nox + age + dis + rad + tax + ptratio + medv, family = "binomial", data = trainingdata)
summary(gmod3)
##
## Call:
## glm(formula = target ~ nox + age + dis + rad + tax + ptratio +
## medv, family = "binomial", data = trainingdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.96171 -0.17862 -0.01004 0.00366 3.14199
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.808298 2.637837 1.444 0.148818
## nox 25.282400 3.859363 6.551 5.72e-11 ***
## age -0.851823 0.227036 -3.752 0.000175 ***
## dis 2.664808 0.797977 3.339 0.000839 ***
## rad 0.706682 0.144791 4.881 1.06e-06 ***
## tax -0.007326 0.002620 -2.796 0.005177 **
## ptratio 0.385922 0.110503 3.492 0.000479 ***
## medv 0.119348 0.036108 3.305 0.000949 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 191.85 on 458 degrees of freedom
## AIC: 207.85
##
## Number of Fisher Scoring iterations: 9
df <- 458
deviance <- 191.85
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1
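Rather than hard-coding each residual deviance and degrees of freedom, the same goodness-of-fit p-values can be computed directly from the fitted model objects; a sketch using base R accessors is shown below.
#Deviance goodness-of-fit p-value for each model, taken straight from the glm objects
sapply(list(gmod1 = gmod1, gmod2 = gmod2, gmod3 = gmod3),
       function(m) pchisq(deviance(m), df.residual(m), lower.tail = FALSE))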
Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.
For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.
Model 3 has the lowest AIC (207.85) and is also the most parsimonious of the three models. Its AUC (0.973) is marginally the lowest of the three, but the difference is negligible. Model 3 also has the highest accuracy (0.901), the lowest classification error rate (0.099), the second-best precision (0.869), and is tied with model 1 on sensitivity (0.949). The differences in specificity and F1 across the three models are very small.
Please note that I consulted outside references for help with the calculation of the confusion matrix and ROC curves.
Based on these diagnostics, I decided to use model 3 to evaluate performance on the training data and to generate the predictions for the evaluation data set.
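The AIC comparison cited above can be reproduced in a single call; this line is not part of the original chunk.
#Side-by-side AIC comparison of the three candidate models
AIC(gmod1, gmod2, gmod3)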
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: ggplot2
## Loading required package: lattice
library(kableExtra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
formula(gmod1)
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
formula(gmod2)
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
formula(gmod3)
## target ~ nox + age + dis + rad + tax + ptratio + medv
#type = "response" returns predicted probabilities, so the 0.5 cut-off below is a probability threshold
preds1 = predict(gmod1, newdata = training, type = "response")
preds2 = predict(gmod2, newdata = trainingdata, type = "response")
preds3 = predict(gmod3, newdata = trainingdata, type = "response")
preds1[preds1 >= 0.5] <- 1
preds1[preds1 < 0.5] <- 0
preds1 = as.factor(preds1)
preds2[preds2 >= 0.5] <- 1
preds2[preds2 < 0.5] <- 0
preds2 = as.factor(preds2)
preds3[preds3 >= 0.5] <- 1
preds3[preds3 < 0.5] <- 0
preds3 = as.factor(preds3)
m1 <- confusionMatrix(preds1, as.factor(training$target), mode = "everything")
m2 <- confusionMatrix(preds2, as.factor(trainingdata$target), mode = "everything")
m3 <- confusionMatrix(preds3, as.factor(trainingdata$target), mode = "everything")
temp <- data.frame(m1$overall,
m2$overall,
m3$overall) %>%
t() %>%
data.frame() %>%
dplyr::select(Accuracy) %>%
mutate(Classification_Error_Rate = 1-Accuracy)
Summ_Stat <-data.frame(m1$byClass,
m2$byClass,
m3$byClass) %>%
t() %>%
data.frame() %>%
cbind(temp) %>%
mutate(Model = c("Model 1", "Model 2", "Model 3")) %>%
dplyr::select(Model, Accuracy, Classification_Error_Rate, Precision, Sensitivity, Specificity, F1) %>%
mutate_if(is.numeric, round,3) %>%
kable('html', escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
Summ_Stat
| Model | Accuracy | Classification_Error_Rate | Precision | Sensitivity | Specificity | F1 |
|---|---|---|---|---|---|---|
| Model 1 | 0.899 | 0.101 | 0.865 | 0.949 | 0.847 | 0.905 |
| Model 2 | 0.899 | 0.101 | 0.871 | 0.941 | 0.856 | 0.905 |
| Model 3 | 0.901 | 0.099 | 0.869 | 0.949 | 0.852 | 0.907 |
#Helper that plots the ROC curve for a model evaluated on a given data set
#(the ROC is rank-based, so link-scale predictions give the same curve as probabilities)
getROC <- function(model, data) {
name <- deparse(substitute(model))
pred.prob1 <- predict(model, newdata = data)
p1 <- data.frame(pred = data$target, prob = pred.prob1)
p1 <- p1[order(p1$prob),]
rocobj <- pROC::roc(p1$pred, p1$prob)
plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
xlab="Specificity", main = name)
}
par(mfrow=c(3,3))
getROC(gmod1, training)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC(gmod2, trainingdata)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC(gmod3, trainingdata)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
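As a numeric complement to the ROC plots, the AUC of each model can also be extracted directly; this sketch assumes the pROC package is installed, which the plotting helper above already requires.
#AUC for each model using the in-sample fitted probabilities
c(gmod1 = pROC::auc(training$target, fitted(gmod1)),
  gmod2 = pROC::auc(trainingdata$target, fitted(gmod2)),
  gmod3 = pROC::auc(trainingdata$target, fitted(gmod3)))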
anova(gmod1, gmod2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
## Model 2: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 453 192.05
## 2 453 184.66 0 7.3839
anova(gmod2, gmod3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
## Model 2: target ~ nox + age + dis + rad + tax + ptratio + medv
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 453 184.66
## 2 458 191.85 -5 -7.1829 0.2074
anova(gmod1, gmod3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
## Model 2: target ~ nox + age + dis + rad + tax + ptratio + medv
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 453 192.05
## 2 458 191.85 -5 0.20094
#Apply the same transformations used on the training data to the evaluation data before predicting
evaldata <- eval
#Log transformation on positive skew
evaldata$nox <- log(evaldata$nox)
#Log transformation on positive skew
evaldata$dis <- log(evaldata$dis)
#Modified log transformation on negative skew (using the training-data maximum so the transform matches the one the model was fitted on)
evaldata$age <- log(1 + max(training$age) - evaldata$age)
## Model 3 applied to eval dataset:
#type = "response" gives predicted probabilities; classify with the 0.5 threshold
eval_prob <- predict(gmod3, newdata = evaldata, type = "response")
eval_class <- ifelse(eval_prob > 0.5, 1, 0)
write.csv(data.frame(probability = eval_prob, classification = eval_class), file = '/Users/tponnada/Downloads/predict1.csv')
#RMSE is not a natural metric for a binary response, but as a rough sanity check on the fitted probabilities:
rmse <- function(x, y) sqrt(mean((x - y)^2))
rmse(fitted(gmod3), training$target)
#The evaluation data set has no target column, so an out-of-sample error cannot be computed here.