Explore Data
stats
The data set has 13 variables and 466 cases.
# load dataset
data <- read.csv('crime-training-data_modified.csv')
str(data)
## 'data.frame': 466 obs. of 13 variables:
## $ zn : num 0 0 0 30 0 0 0 0 0 80 ...
## $ indus : num 19.58 19.58 18.1 4.93 2.46 ...
## $ chas : int 0 1 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
## $ rm : num 7.93 5.4 6.49 6.39 7.16 ...
## $ age : num 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
## $ dis : num 2.05 1.32 1.98 7.04 2.7 ...
## $ rad : int 5 5 24 6 3 5 24 24 5 1 ...
## $ tax : int 403 403 666 300 193 384 666 666 224 315 ...
## $ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
## $ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
## $ medv : num 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
## $ target : int 1 1 1 0 0 0 1 1 0 0 ...
From the summary below, there do not appear to be any missing values. However, variables such as 'zn', 'indus', and 'age' are skewed. In addition, the dummy variable 'chas' and the response variable 'target' should be factors rather than integers (see the sketch after the summary output).
# brief summary
summary(data)
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
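A minimal sketch of the factor conversion mentioned above. This is illustrative only: the pipeline below keeps the columns numeric for the correlation plot and converts 'target' to a factor only at the melt step.
# illustrative only: recode the dummy variable and the response as factors
data$chas <- factor(data$chas)
data$target <- factor(data$target)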
plots
In the boxplots, each predictor is split by the response variable. Apart from 'nox', 'rad', and the dummy variable 'chas', which show no outliers, the remaining predictors all have outliers to some degree.
# make box plot for all variables
data.m <- melt(data, id.vars = 'target') %>% mutate(target = as.factor(target))
head(data.m)
## target variable value
## 1 1 zn 0
## 2 1 zn 0
## 3 1 zn 0
## 4 0 zn 30
## 5 0 zn 0
## 6 0 zn 0
ggplot(data.m, aes(x = variable, y = value, fill = target)) + geom_boxplot() + facet_wrap(~ variable, scales = 'free') + theme_classic()
As mentioned above, several variables are skewed (not counting the dummy variable), some are bimodal, and only 'rm' looks approximately normal.
# check normality and skewness
data.m %>% ggplot(aes(x = value)) + geom_density() + facet_wrap(~variable, scales = 'free') + theme_classic()
The correlation plot shows that many of the variables are at least somewhat related to one another. However, the pairwise scatter plots make it hard to tell what kind of relationship each variable has with the others, because the points cluster together.
# correlation between predictors and response
par(mar = c(1,1,1,1))
data %>% corPlot()
# shape of correlation
pairs(data)
Data Transformation
transform variables
The caret library is required for the preProcess() function used below; see the caret documentation for reference.
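For reference, here is a sketch of the libraries this analysis appears to rely on, inferred from the functions called throughout; the exact set loaded earlier may differ.
# packages inferred from the function calls in this analysis
library(reshape2) # melt()
library(dplyr) # %>%, mutate(), select()
library(ggplot2) # ggplot()
library(psych) # corPlot()
library(caret) # preProcess(), confusionMatrix()
library(car) # avPlots(), vif(), mmps()
library(AICcmodavg) # aictab()
library(performance) # compare_performance()
# (trainTestPartition() is not from these packages; see the sketch in the model-building section)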
I use a Box-Cox transformation together with centering and scaling, so that all variables are on the same scale. (Box-Cox estimates, for each strictly positive variable, a power transformation that makes its distribution as close to normal as possible.) From the density plots, the transformed variables look more normal than the raw data. A few variables still look problematic, but let's stay with this for now.
# transform the data using Box-Cox; at the same time, center and scale it
preprocess <- preProcess(data %>% dplyr::select(-target), method = c('BoxCox', 'center', 'scale'))
# get transformed data set
trans.df <- predict(preprocess, data %>% dplyr::select(-target)) %>% mutate(target = as.factor(data$target))
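To see which columns actually received a Box-Cox transform (caret estimates a lambda only for strictly positive columns, so zero-containing variables such as 'zn' and 'chas' are only centered and scaled), the estimated lambdas can be pulled from the preProcess object; a sketch, assuming the object structure of current caret versions:
# estimated Box-Cox lambdas, one per strictly positive column
sapply(preprocess$bc, function(x) x$lambda)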
# plot density to check skewness and normality
trans.df.m <- trans.df %>% melt(id.vars = 'target')
trans.df.m %>% ggplot(aes(x = value)) + geom_density() + facet_wrap(~variable, scales = 'free') + theme_classic()
building model
split data into train set and test set
set.seed(100)
split.df <- trainTestPartition(trans.df, 0.7) # 70% is training, the rest will be testing
train <- split.df$XyTr # training set with label
test <- split.df$XyTe # testing set with label
nrow(train)
## [1] 326
nrow(test)
## [1] 140
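trainTestPartition() is not a base-R or caret function, so it presumably comes from a helper script. A hypothetical sketch of an equivalent helper, assuming it returns the labeled train and test splits as XyTr and XyTe:
# hypothetical helper equivalent to trainTestPartition():
# sample floor(p * n) rows for training, keep the rest for testing
trainTestPartition <- function(df, p) {
  idx <- sample(seq_len(nrow(df)), size = floor(p * nrow(df)))
  list(XyTr = df[idx, ], XyTe = df[-idx, ])
}
With p = 0.7 and 466 rows this yields 326 training and 140 test rows, matching the counts above.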
logistic regression
first model (full)
the distribution of the deviance residuals looks fairly normal; however, more than one predictor is not statistically significant
AIC is 171.47
m1 <- glm(target ~ . , data = train, family = binomial)
summary(m1)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8631 -0.1426 -0.0017 0.1505 3.2149
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.25576 0.34099 0.750 0.453223
## zn -0.28758 0.60614 -0.474 0.635178
## indus 0.05478 0.43749 0.125 0.900355
## chas 0.15247 0.22179 0.687 0.491793
## nox 4.35204 0.81524 5.338 9.38e-08 ***
## rm -0.45372 0.52312 -0.867 0.385757
## age 0.95864 0.41517 2.309 0.020943 *
## dis 1.67105 0.53207 3.141 0.001686 **
## rad 2.82899 0.74244 3.810 0.000139 ***
## tax -0.21199 0.46172 -0.459 0.646130
## ptratio 1.03648 0.32546 3.185 0.001449 **
## lstat -0.05232 0.47394 -0.110 0.912100
## medv 1.91395 0.73637 2.599 0.009345 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 451.88 on 325 degrees of freedom
## Residual deviance: 145.47 on 313 degrees of freedom
## AIC: 171.47
##
## Number of Fisher Scoring iterations: 8
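As a quick sanity check on the arithmetic: for a binomial glm with a 0/1 response, AIC = residual deviance + 2k, so here 145.47 + 2 × 13 = 171.47, where k = 13 counts the intercept plus the 12 predictor coefficients.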
- the added-variable plots confirm that predictors such as 'zn' and 'indus', whose fitted slopes are near 0, are not very useful for predicting the response
# added variable plot
par(mar = c(0.3, 0.3, 0.3, 0.3))
avPlots(m1)
From the VIF values below, the variables 'medv' and 'rm' appear to be collinear with the other predictors.
In the marginal model plots, the line for the data and the line for the model are nearly indistinguishable, which suggests the model is valid. (The errors printed below likely come from near-discrete predictors such as 'chas' having fewer unique values than the smoother's default degrees of freedom; the remaining panels still plot.)
# collinearity check
vif(m1)
## zn indus chas nox rm age dis rad
## 1.461676 2.765940 1.271799 4.001009 5.078328 2.646463 3.665460 2.196282
## tax ptratio lstat medv
## 2.444588 2.459527 3.838314 8.228199
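A common rule of thumb flags VIF values above 5 (or, more leniently, 10). Here 'medv' (8.23) and 'rm' (5.08) exceed the stricter cutoff, which is the collinearity noted above.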
# validity check
mmps(m1)
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
second model (reduced: remove medv, rm)
To resolve the collinearity, I remove the variables causing the problem and fit a new model.
The distribution of the deviance residuals still looks fine.
AIC increases to 176.22.
m2 <- update(m1, .~. -medv -rm)
summary(m2)
##
## Call:
## glm(formula = target ~ zn + indus + chas + nox + age + dis +
## rad + tax + ptratio + lstat, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.86861 -0.18407 -0.00288 0.10400 3.11969
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3969 0.3439 1.154 0.24853
## zn -0.3044 0.5520 -0.551 0.58132
## indus -0.0635 0.4220 -0.150 0.88039
## chas 0.2083 0.2210 0.943 0.34587
## nox 3.8119 0.7418 5.139 2.76e-07 ***
## age 0.6672 0.3254 2.051 0.04030 *
## dis 1.0502 0.4598 2.284 0.02237 *
## rad 2.8925 0.7630 3.791 0.00015 ***
## tax -0.5016 0.4438 -1.130 0.25839
## ptratio 0.6250 0.2625 2.381 0.01726 *
## lstat -0.7080 0.3015 -2.348 0.01885 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 451.88 on 325 degrees of freedom
## Residual deviance: 154.22 on 315 degrees of freedom
## AIC: 176.22
##
## Number of Fisher Scoring iterations: 8
there is no collinearity (all VIFs are below 4)
the marginal model plots indicate the model is valid
# collinearity check
vif(m2)
## zn indus chas nox age dis rad tax
## 1.395962 2.961269 1.222143 3.448725 1.746431 2.797059 2.078069 2.368769
## ptratio lstat
## 1.763439 1.752629
# mmps
mmps(m2)
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
third model (stepwise based on m1)
The step() function performs stepwise selection: at each step it drops (and, when a scope is supplied, can add) the single term whose change most reduces the AIC, stopping when no change improves the score. Called without a scope, as here, it defaults to backward elimination, so there is no need to run backward elimination and forward selection by hand; note it is a greedy search, not an exhaustive all-subsets search. An explicit version of the call is sketched below.
- the final model it selects has an AIC of 162.05
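For clarity, the same search can be requested with the direction spelled out; a sketch (trace = 0 merely suppresses the step-by-step log shown below):
# equivalent call with the search direction made explicit;
# step() defaults to backward elimination when no scope is given
m3_explicit <- step(m1, direction = 'backward', trace = 0)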
m3 <- step(update(m1, .~.))
## Start: AIC=171.47
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
##
## Df Deviance AIC
## - lstat 1 145.48 169.48
## - indus 1 145.48 169.48
## - tax 1 145.67 169.67
## - zn 1 145.71 169.71
## - chas 1 145.95 169.95
## - rm 1 146.22 170.22
## <none> 145.47 171.47
## - age 1 151.11 175.11
## - medv 1 152.80 176.80
## - dis 1 156.43 180.43
## - ptratio 1 157.17 181.17
## - rad 1 174.02 198.02
## - nox 1 192.53 216.53
##
## Step: AIC=169.48
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + medv
##
## Df Deviance AIC
## - indus 1 145.49 167.49
## - tax 1 145.68 167.68
## - zn 1 145.74 167.74
## - chas 1 145.95 167.95
## - rm 1 146.28 168.28
## <none> 145.48 169.48
## - age 1 151.72 173.72
## - medv 1 153.34 175.34
## - dis 1 156.52 178.52
## - ptratio 1 157.18 179.18
## - rad 1 174.21 196.21
## - nox 1 192.55 214.55
##
## Step: AIC=167.49
## target ~ zn + chas + nox + rm + age + dis + rad + tax + ptratio +
## medv
##
## Df Deviance AIC
## - tax 1 145.69 165.69
## - zn 1 145.78 165.78
## - chas 1 146.05 166.05
## - rm 1 146.29 166.29
## <none> 145.49 167.49
## - age 1 151.72 171.72
## - medv 1 153.35 173.35
## - dis 1 156.64 176.64
## - ptratio 1 157.34 177.34
## - rad 1 177.12 197.12
## - nox 1 195.20 215.20
##
## Step: AIC=165.69
## target ~ zn + chas + nox + rm + age + dis + rad + ptratio + medv
##
## Df Deviance AIC
## - zn 1 145.96 163.96
## - chas 1 146.35 164.35
## - rm 1 146.51 164.51
## <none> 145.69 165.69
## - age 1 151.84 169.84
## - medv 1 154.89 172.89
## - ptratio 1 157.40 175.40
## - dis 1 158.65 176.65
## - nox 1 195.54 213.54
## - rad 1 198.13 216.13
##
## Step: AIC=163.96
## target ~ chas + nox + rm + age + dis + rad + ptratio + medv
##
## Df Deviance AIC
## - chas 1 146.80 162.80
## - rm 1 146.95 162.95
## <none> 145.96 163.96
## - age 1 152.40 168.40
## - medv 1 155.61 171.61
## - dis 1 158.82 174.82
## - ptratio 1 160.55 176.55
## - rad 1 198.49 214.49
## - nox 1 199.53 215.53
##
## Step: AIC=162.8
## target ~ nox + rm + age + dis + rad + ptratio + medv
##
## Df Deviance AIC
## - rm 1 148.05 162.05
## <none> 146.80 162.80
## - age 1 154.30 168.30
## - medv 1 157.13 171.13
## - dis 1 159.60 173.60
## - ptratio 1 160.67 174.67
## - nox 1 200.06 214.06
## - rad 1 200.60 214.60
##
## Step: AIC=162.05
## target ~ nox + age + dis + rad + ptratio + medv
##
## Df Deviance AIC
## <none> 148.05 162.05
## - age 1 154.31 166.31
## - dis 1 159.86 171.86
## - ptratio 1 160.67 172.67
## - medv 1 166.08 178.08
## - nox 1 201.25 213.25
## - rad 1 201.28 213.28
summary(m3)
##
## Call:
## glm(formula = target ~ nox + age + dis + rad + ptratio + medv,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8120 -0.1808 -0.0036 0.1260 3.1717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4122 0.2627 1.569 0.116632
## nox 4.3138 0.7579 5.692 1.26e-08 ***
## age 0.8234 0.3384 2.433 0.014962 *
## dis 1.5832 0.4832 3.276 0.001051 **
## rad 2.5344 0.5441 4.658 3.19e-06 ***
## ptratio 0.9316 0.2752 3.386 0.000710 ***
## medv 1.4718 0.3935 3.740 0.000184 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 451.88 on 325 degrees of freedom
## Residual deviance: 148.05 on 319 degrees of freedom
## AIC: 162.05
##
## Number of Fisher Scoring iterations: 7
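Because the predictors were centered and scaled, the coefficients can be read on a per-standard-deviation basis: for example, exp(4.3138) ≈ 75 means a one-SD increase in transformed 'nox' multiplies the odds of target = 1 by roughly 75, holding the other predictors fixed.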
- there is no collinearity (all VIFs are below 4) and the model is valid
# collinearity
vif(m3)
## nox age dis rad ptratio medv
## 3.594358 1.841173 3.176852 1.144804 1.730120 2.334173
# validity
mmps(m3)
fourth model (stepwise based on m2)
Recall that m1 is the full model and m2 is the reduced model.
- the best model found by the step() function has an AIC of 171.71, very close to m1's 171.47
m4 <- step(update(m2, .~.))
## Start: AIC=176.22
## target ~ zn + indus + chas + nox + age + dis + rad + tax + ptratio +
## lstat
##
## Df Deviance AIC
## - indus 1 154.24 174.24
## - zn 1 154.55 174.55
## - chas 1 155.13 175.13
## - tax 1 155.43 175.43
## <none> 154.22 176.22
## - age 1 158.53 178.53
## - dis 1 159.61 179.61
## - ptratio 1 160.16 180.16
## - lstat 1 160.41 180.41
## - rad 1 187.38 207.38
## - nox 1 194.50 214.50
##
## Step: AIC=174.24
## target ~ zn + chas + nox + age + dis + rad + tax + ptratio +
## lstat
##
## Df Deviance AIC
## - zn 1 154.55 172.55
## - chas 1 155.15 173.15
## - tax 1 156.09 174.09
## <none> 154.24 174.24
## - age 1 158.56 176.56
## - dis 1 159.75 177.75
## - ptratio 1 160.20 178.20
## - lstat 1 160.46 178.46
## - rad 1 191.69 209.69
## - nox 1 196.61 214.61
##
## Step: AIC=172.55
## target ~ chas + nox + age + dis + rad + tax + ptratio + lstat
##
## Df Deviance AIC
## - chas 1 155.71 171.71
## - tax 1 156.38 172.38
## <none> 154.55 172.55
## - age 1 158.91 174.91
## - dis 1 160.00 176.00
## - lstat 1 160.96 176.96
## - ptratio 1 162.41 178.41
## - rad 1 193.01 209.01
## - nox 1 201.11 217.11
##
## Step: AIC=171.71
## target ~ nox + age + dis + rad + tax + ptratio + lstat
##
## Df Deviance AIC
## <none> 155.71 171.71
## - tax 1 157.99 171.99
## - age 1 160.68 174.68
## - dis 1 160.86 174.86
## - lstat 1 161.84 175.84
## - ptratio 1 162.69 176.69
## - rad 1 196.81 210.81
## - nox 1 201.70 215.70
summary(m4)
##
## Call:
## glm(formula = target ~ nox + age + dis + rad + tax + ptratio +
## lstat, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82397 -0.21077 -0.00344 0.10221 3.13804
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5203 0.2749 1.892 0.0584 .
## nox 3.8183 0.7039 5.425 5.81e-08 ***
## age 0.7095 0.3232 2.195 0.0282 *
## dis 1.0009 0.4474 2.237 0.0253 *
## rad 3.0594 0.6912 4.426 9.59e-06 ***
## tax -0.5723 0.3783 -1.513 0.1303
## ptratio 0.6203 0.2409 2.575 0.0100 *
## lstat -0.6899 0.2943 -2.344 0.0191 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 451.88 on 325 degrees of freedom
## Residual deviance: 155.71 on 318 degrees of freedom
## AIC: 171.71
##
## Number of Fisher Scoring iterations: 8
- no collinearity, and the model is still valid
# collinearity check
vif(m4)
## nox age dis rad tax ptratio lstat
## 3.212228 1.759048 2.776736 1.663719 1.756459 1.453004 1.707905
# validity check
mmps(m4)
model selection
Based on the AICc results and the model performance comparison below, I choose the third model, m3.
model_list <- list(m1, m2, m3, m4)
aictab(model_list)
## Warning in aictab.AICglm.lm(model_list):
## Model names have been supplied automatically in the table
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## Mod3 7 162.40 0.00 0.99 0.99 -74.02
## Mod4 8 172.16 9.76 0.01 0.99 -77.85
## Mod1 13 172.63 10.23 0.01 1.00 -72.73
## Mod2 11 177.06 14.66 0.00 1.00 -77.11
compare_performance(m1, m2, m3, m4, rank = T)
## Warning: Following indices with missing values are not used for ranking:
## Score_log
## # Comparison of Model Performance Indices
##
## Name | Model | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP | AIC weights | BIC weights | Performance-Score
## -----------------------------------------------------------------------------------------------------------------------------------------
## m3 | glm | 0.717 | 0.263 | 0.681 | 0.227 | -Inf | 0.025 | 0.858 | 0.982 | 0.999 | 78.39%
## m1 | glm | 0.720 | 0.262 | 0.682 | 0.223 | -Inf | 0.027 | 0.860 | 0.009 | < 0.001 | 62.28%
## m2 | glm | 0.704 | 0.269 | 0.700 | 0.237 | -Inf | 0.024 | 0.852 | < 0.001 | < 0.001 | 19.23%
## m4 | glm | 0.700 | 0.270 | 0.700 | 0.239 | -Inf | 0.023 | 0.850 | 0.008 | 0.001 | 12.60%
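The AICc weight of 0.99 for Mod3 (m3) can be read as the approximate probability, among these four candidates, that m3 is the best model in the AICc sense, which agrees with the performance ranking above.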
prediction
# make predictions on the probability scale; without type = 'response',
# predict.glm() returns values on the link (log-odds) scale
pred <- predict(m3, test, type = 'response')
# classify with a 0.5 probability cutoff
pred <- ifelse(pred > 0.5, 1, 0) %>% as.factor()
confusionMatrix(pred, test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 67 6
## 1 5 62
##
## Accuracy : 0.9214
## 95% CI : (0.8638, 0.9601)
## No Information Rate : 0.5143
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8427
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9306
## Specificity : 0.9118
## Pos Pred Value : 0.9178
## Neg Pred Value : 0.9254
## Prevalence : 0.5143
## Detection Rate : 0.4786
## Detection Prevalence : 0.5214
## Balanced Accuracy : 0.9212
##
## 'Positive' Class : 0
##
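As a check, the headline numbers follow directly from the table: accuracy = (67 + 62) / 140 ≈ 0.9214, sensitivity = 67 / (67 + 5) ≈ 0.9306, and specificity = 62 / (62 + 6) ≈ 0.9118 (with class 0 treated as positive).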