Explore Data
stats
The data set has 13 variables and 466 cases.
# load dataset
data <- read.csv('crime-training-data_modified.csv')
str(data)
## 'data.frame': 466 obs. of 13 variables:
## $ zn : num 0 0 0 30 0 0 0 0 0 80 ...
## $ indus : num 19.58 19.58 18.1 4.93 2.46 ...
## $ chas : int 0 1 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
## $ rm : num 7.93 5.4 6.49 6.39 7.16 ...
## $ age : num 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
## $ dis : num 2.05 1.32 1.98 7.04 2.7 ...
## $ rad : int 5 5 24 6 3 5 24 24 5 1 ...
## $ tax : int 403 403 666 300 193 384 666 666 224 315 ...
## $ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
## $ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
## $ medv : num 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
## $ target : int 1 1 1 0 0 0 1 1 0 0 ...
From the summary below, there do not appear to be any missing values. However, variables such as 'zn', 'indus', and 'age' are skewed. In addition, the dummy variable 'chas' and the response variable 'target' should be factors rather than integers (see the sketch after the summary output).
# brief summary
summary(data)
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
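A minimal sketch of the factor conversion mentioned above. This is illustrative only: the pipeline below keeps the columns numeric for the correlation plot and converts 'target' to a factor only at the melt step.
# illustrative only: recode the dummy variable and the response as factors
data$chas <- factor(data$chas)
data$target <- factor(data$target)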
plots
In the boxplots, each predictor is split by the response variable. Apart from 'nox', 'rad', and the dummy variable 'chas', which show no outliers, the remaining predictors all have outliers to some degree.
# make box plot for all variables
data.m <- melt(data, id.vars = 'target') %>% mutate(target = as.factor(target))
head(data.m)
## target variable value
## 1 1 zn 0
## 2 1 zn 0
## 3 1 zn 0
## 4 0 zn 30
## 5 0 zn 0
## 6 0 zn 0
ggplot(data.m, aes(x = variable, y = value, fill = target)) + geom_boxplot() + facet_wrap(~ variable, scales = 'free') + theme_classic()
As mentioned above, several variables are skewed (not counting the dummy variable), some are bimodal, and only 'rm' looks approximately normal.
# check normality and skewness
data.m %>% ggplot(aes(x = value)) + geom_density() + facet_wrap(~variable, scales = 'free') + theme_classic()
The correlation plot shows that many of the variables are at least somewhat related to one another. However, the pairwise scatter plots make it hard to tell what kind of relationship each variable has with the others, because the points cluster together.
# correlation between predictors and response
par(mar = c(1,1,1,1))
data %>% corPlot()
# shape of correlation
pairs(data)
Data Transformation
transform variables
The caret library is required for the preProcess() function used below; see the caret documentation for reference.
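For reference, here is a sketch of the libraries this analysis appears to rely on, inferred from the functions called throughout; the exact set loaded earlier may differ.
# packages inferred from the function calls in this analysis
library(reshape2) # melt()
library(dplyr) # %>%, mutate(), select()
library(ggplot2) # ggplot()
library(psych) # corPlot()
library(caret) # preProcess(), confusionMatrix()
library(car) # avPlots(), vif(), mmps()
library(AICcmodavg) # aictab()
library(performance) # compare_performance()
# (trainTestPartition() is not from these packages; see the sketch in the model-building section)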
I use a Box-Cox transformation together with centering and scaling, so that all variables are on the same scale. (Box-Cox estimates, for each strictly positive variable, a power transformation that makes its distribution as close to normal as possible.) From the density plots, the transformed variables look more normal than the raw data. A few variables still look problematic, but let's stay with this for now.
# transform the data using Box-Cox; at the same time, center and scale it
preprocess <- preProcess(data %>% dplyr::select(-target), method = c('BoxCox', 'center', 'scale'))
# get transformed data set
trans.df <- predict(preprocess, data %>% dplyr::select(-target)) %>% mutate(target = as.factor(data$target))
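To see which columns actually received a Box-Cox transform (caret estimates a lambda only for strictly positive columns, so zero-containing variables such as 'zn' and 'chas' are only centered and scaled), the estimated lambdas can be pulled from the preProcess object; a sketch, assuming the object structure of current caret versions:
# estimated Box-Cox lambdas, one per strictly positive column
sapply(preprocess$bc, function(x) x$lambda)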
# plot density to check skewness and normality
trans.df.m <- trans.df %>% melt(id.vars = 'target')
trans.df.m %>% ggplot(aes(x = value)) + geom_density() + facet_wrap(~variable, scales = 'free') + theme_classic()
building model
split data into train set and test set
set.seed(100)
split.df <- trainTestPartition(trans.df, 0.7) # 70% is training, the rest will be testing
train <- split.df$XyTr # training set with label
test <- split.df$XyTe # testing set with label
nrow(train)
## [1] 326
nrow(test)
## [1] 140
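trainTestPartition() is not a base-R or caret function, so it presumably comes from a helper script. A hypothetical sketch of an equivalent helper, assuming it returns the labeled train and test splits as XyTr and XyTe:
# hypothetical helper equivalent to trainTestPartition():
# sample floor(p * n) rows for training, keep the rest for testing
trainTestPartition <- function(df, p) {
  idx <- sample(seq_len(nrow(df)), size = floor(p * nrow(df)))
  list(XyTr = df[idx, ], XyTe = df[-idx, ])
}
With p = 0.7 and 466 rows this yields 326 training and 140 test rows, matching the counts above.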
logistic regression
first model (full)
the distribution of the deviance residuals looks fairly normal; however, more than one predictor is not statistically significant
AIC is 171.47
m1 <- glm(target ~ . , data = train, family = binomial)
summary(m1)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8631 -0.1426 -0.0017 0.1505 3.2149
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.25576 0.34099 0.750 0.453223
## zn -0.28758 0.60614 -0.474 0.635178
## indus 0.05478 0.43749 0.125 0.900355
## chas 0.15247 0.22179 0.687 0.491793
## nox 4.35204 0.81524 5.338 9.38e-08 ***
## rm -0.45372 0.52312 -0.867 0.385757
## age 0.95864 0.41517 2.309 0.020943 *
## dis 1.67105 0.53207 3.141 0.001686 **
## rad 2.82899 0.74244 3.810 0.000139 ***
## tax -0.21199 0.46172 -0.459 0.646130
## ptratio 1.03648 0.32546 3.185 0.001449 **
## lstat -0.05232 0.47394 -0.110 0.912100
## medv 1.91395 0.73637 2.599 0.009345 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 451.88 on 325 degrees of freedom
## Residual deviance: 145.47 on 313 degrees of freedom
## AIC: 171.47
##
## Number of Fisher Scoring iterations: 8
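As a quick sanity check on the arithmetic: for a binomial glm with a 0/1 response, AIC = residual deviance + 2k, so here 145.47 + 2 × 13 = 171.47, where k = 13 counts the intercept plus the 12 predictor coefficients.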
- the added-variable plots confirm that predictors such as 'zn' and 'indus', whose fitted slopes are near 0, are not very useful for predicting the response
# added variable plot
par(mar = c(0.3, 0.3, 0.3, 0.3))
avPlots(m1)
From the VIF values below, the variables 'medv' and 'rm' appear to be collinear with the other predictors.
In the marginal model plots, the line for the data and the line for the model are nearly indistinguishable, which suggests the model is valid. (The errors printed below likely come from near-discrete predictors such as 'chas' having fewer unique values than the smoother's default degrees of freedom; the remaining panels still plot.)
# collinearity check
vif(m1)
## zn indus chas nox rm age dis rad
## 1.461676 2.765940 1.271799 4.001009 5.078328 2.646463 3.665460 2.196282
## tax ptratio lstat medv
## 2.444588 2.459527 3.838314 8.228199
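A common rule of thumb flags VIF values above 5 (or, more leniently, 10). Here 'medv' (8.23) and 'rm' (5.08) exceed the stricter cutoff, which is the collinearity noted above.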
# validity check
mmps(m1)
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
second model (reduced: remove medv, rm)
To resolve the collinearity, I remove the variables causing the problem and fit a new model.
The distribution of the deviance residuals still looks fine.
AIC increases to 176.22.
m2 <- update(m1, .~. -medv -rm)
summary(m2)
##
## Call:
## glm(formula = target ~ zn + indus + chas + nox + age + dis +
## rad + tax + ptratio + lstat, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.86861 -0.18407 -0.00288 0.10400 3.11969
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3969 0.3439 1.154 0.24853
## zn -0.3044 0.5520 -0.551 0.58132
## indus -0.0635 0.4220 -0.150 0.88039
## chas 0.2083 0.2210 0.943 0.34587
## nox 3.8119 0.7418 5.139 2.76e-07 ***
## age 0.6672 0.3254 2.051 0.04030 *
## dis 1.0502 0.4598 2.284 0.02237 *
## rad 2.8925 0.7630 3.791 0.00015 ***
## tax -0.5016 0.4438 -1.130 0.25839
## ptratio 0.6250 0.2625 2.381 0.01726 *
## lstat -0.7080 0.3015 -2.348 0.01885 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 451.88 on 325 degrees of freedom
## Residual deviance: 154.22 on 315 degrees of freedom
## AIC: 176.22
##
## Number of Fisher Scoring iterations: 8
there is no collinearity (all VIFs are below 4)
the marginal model plots indicate the model is valid
# collinearity check
vif(m2)
## zn indus chas nox age dis rad tax
## 1.395962 2.961269 1.222143 3.448725 1.746431 2.797059 2.078069 2.368769
## ptratio lstat
## 1.763439 1.752629
# mmps
mmps(m2)
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
third model (stepwise based on m1)
The step() function performs stepwise selection: at each step it drops (and, when a scope is supplied, can add) the single term whose change most reduces the AIC, stopping when no change improves the score. Called without a scope, as here, it defaults to backward elimination, so there is no need to run backward elimination and forward selection by hand; note it is a greedy search, not an exhaustive all-subsets search. An explicit version of the call is sketched below.
- the final model it selects has an AIC of 162.05
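For clarity, the same search can be requested with the direction spelled out; a sketch (trace = 0 merely suppresses the step-by-step log shown below):
# equivalent call with the search direction made explicit;
# step() defaults to backward elimination when no scope is given
m3_explicit <- step(m1, direction = 'backward', trace = 0)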
m3 <- step(update(m1, .~.))
## Start: AIC=171.47
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
##
## Df Deviance AIC
## - lstat 1 145.48 169.48
## - indus 1 145.48 169.48
## - tax 1 145.67 169.67
## - zn 1 145.71 169.71
## - chas 1 145.95 169.95
## - rm 1 146.22 170.22
## <none> 145.47 171.47
## - age 1 151.11 175.11
## - medv 1 152.80 176.80
## - dis 1 156.43 180.43
## - ptratio 1 157.17 181.17
## - rad 1 174.02 198.02
## - nox 1 192.53 216.53
##
## Step: AIC=169.48
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + medv
##
## Df Deviance AIC
## - indus 1 145.49 167.49
## - tax 1 145.68 167.68
## - zn 1 145.74 167.74
## - chas 1 145.95 167.95
## - rm 1 146.28 168.28
## <none> 145.48 169.48
## - age 1 151.72 173.72
## - medv 1 153.34 175.34
## - dis 1 156.52 178.52
## - ptratio 1 157.18 179.18
## - rad 1 174.21 196.21
## - nox 1 192.55 214.55
##
## Step: AIC=167.49
## target ~ zn + chas + nox + rm + age + dis + rad + tax + ptratio +
## medv
##
## Df Deviance AIC
## - tax 1 145.69 165.69
## - zn 1 145.78 165.78
## - chas 1 146.05 166.05
## - rm 1 146.29 166.29
## <none> 145.49 167.49
## - age 1 151.72 171.72
## - medv 1 153.35 173.35
## - dis 1 156.64 176.64
## - ptratio 1 157.34 177.34
## - rad 1 177.12 197.12
## - nox 1 195.20 215.20
##
## Step: AIC=165.69
## target ~ zn + chas + nox + rm + age + dis + rad + ptratio + medv
##
## Df Deviance AIC
## - zn 1 145.96 163.96
## - chas 1 146.35 164.35
## - rm 1 146.51 164.51
## <none> 145.69 165.69
## - age 1 151.84 169.84
## - medv 1 154.89 172.89
## - ptratio 1 157.40 175.40
## - dis 1 158.65 176.65
## - nox 1 195.54 213.54
## - rad 1 198.13 216.13
##
## Step: AIC=163.96
## target ~ chas + nox + rm + age + dis + rad + ptratio + medv
##
## Df Deviance AIC
## - chas 1 146.80 162.80
## - rm 1 146.95 162.95
## <none> 145.96 163.96
## - age 1 152.40 168.40
## - medv 1 155.61 171.61
## - dis 1 158.82 174.82
## - ptratio 1 160.55 176.55
## - rad 1 198.49 214.49
## - nox 1 199.53 215.53
##
## Step: AIC=162.8
## target ~ nox + rm + age + dis + rad + ptratio + medv
##
## Df Deviance AIC
## - rm 1 148.05 162.05
## <none> 146.80 162.80
## - age 1 154.30 168.30
## - medv 1 157.13 171.13
## - dis 1 159.60 173.60
## - ptratio 1 160.67 174.67
## - nox 1 200.06 214.06
## - rad 1 200.60 214.60
##
## Step: AIC=162.05
## target ~ nox + age + dis + rad + ptratio + medv
##
## Df Deviance AIC
## <none> 148.05 162.05
## - age 1 154.31 166.31
## - dis 1 159.86 171.86
## - ptratio 1 160.67 172.67
## - medv 1 166.08 178.08
## - nox 1 201.25 213.25
## - rad 1 201.28 213.28
summary(m3)
##
## Call:
## glm(formula = target ~ nox + age + dis + rad + ptratio + medv,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8120 -0.1808 -0.0036 0.1260 3.1717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4122 0.2627 1.569 0.116632
## nox 4.3138 0.7579 5.692 1.26e-08 ***
## age 0.8234 0.3384 2.433 0.014962 *
## dis 1.5832 0.4832 3.276 0.001051 **
## rad 2.5344 0.5441 4.658 3.19e-06 ***
## ptratio 0.9316 0.2752 3.386 0.000710 ***
## medv 1.4718 0.3935 3.740 0.000184 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 451.88 on 325 degrees of freedom
## Residual deviance: 148.05 on 319 degrees of freedom
## AIC: 162.05
##
## Number of Fisher Scoring iterations: 7
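Because the predictors were centered and scaled, the coefficients can be read on a per-standard-deviation basis: for example, exp(4.3138) ≈ 75 means a one-SD increase in transformed 'nox' multiplies the odds of target = 1 by roughly 75, holding the other predictors fixed.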
- there is no collinearity (all VIFs are below 4) and the model is valid
# collinearity
vif(m3)
## nox age dis rad ptratio medv
## 3.594358 1.841173 3.176852 1.144804 1.730120 2.334173
# validity
mmps(m3)
fourth model (stepwise based on m2)
Recall that m1 is the full model and m2 is the reduced model.
- the best model found by the step() function has an AIC of 171.71, very close to m1's 171.47
m4 <- step(update(m2, .~.))
## Start: AIC=176.22
## target ~ zn + indus + chas + nox + age + dis + rad + tax + ptratio +
## lstat
##
## Df Deviance AIC
## - indus 1 154.24 174.24
## - zn 1 154.55 174.55
## - chas 1 155.13 175.13
## - tax 1 155.43 175.43
## <none> 154.22 176.22
## - age 1 158.53 178.53
## - dis 1 159.61 179.61
## - ptratio 1 160.16 180.16
## - lstat 1 160.41 180.41
## - rad 1 187.38 207.38
## - nox 1 194.50 214.50
##
## Step: AIC=174.24
## target ~ zn + chas + nox + age + dis + rad + tax + ptratio +
## lstat
##
## Df Deviance AIC
## - zn 1 154.55 172.55
## - chas 1 155.15 173.15
## - tax 1 156.09 174.09
## <none> 154.24 174.24
## - age 1 158.56 176.56
## - dis 1 159.75 177.75
## - ptratio 1 160.20 178.20
## - lstat 1 160.46 178.46
## - rad 1 191.69 209.69
## - nox 1 196.61 214.61
##
## Step: AIC=172.55
## target ~ chas + nox + age + dis + rad + tax + ptratio + lstat
##
## Df Deviance AIC
## - chas 1 155.71 171.71
## - tax 1 156.38 172.38
## <none> 154.55 172.55
## - age 1 158.91 174.91
## - dis 1 160.00 176.00
## - lstat 1 160.96 176.96
## - ptratio 1 162.41 178.41
## - rad 1 193.01 209.01
## - nox 1 201.11 217.11
##
## Step: AIC=171.71
## target ~ nox + age + dis + rad + tax + ptratio + lstat
##
## Df Deviance AIC
## <none> 155.71 171.71
## - tax 1 157.99 171.99
## - age 1 160.68 174.68
## - dis 1 160.86 174.86
## - lstat 1 161.84 175.84
## - ptratio 1 162.69 176.69
## - rad 1 196.81 210.81
## - nox 1 201.70 215.70
summary(m4)
##
## Call:
## glm(formula = target ~ nox + age + dis + rad + tax + ptratio +
## lstat, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82397 -0.21077 -0.00344 0.10221 3.13804
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5203 0.2749 1.892 0.0584 .
## nox 3.8183 0.7039 5.425 5.81e-08 ***
## age 0.7095 0.3232 2.195 0.0282 *
## dis 1.0009 0.4474 2.237 0.0253 *
## rad 3.0594 0.6912 4.426 9.59e-06 ***
## tax -0.5723 0.3783 -1.513 0.1303
## ptratio 0.6203 0.2409 2.575 0.0100 *
## lstat -0.6899 0.2943 -2.344 0.0191 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 451.88 on 325 degrees of freedom
## Residual deviance: 155.71 on 318 degrees of freedom
## AIC: 171.71
##
## Number of Fisher Scoring iterations: 8
- no collinearity, and the model is still valid
# collinearity check
vif(m4)
## nox age dis rad tax ptratio lstat
## 3.212228 1.759048 2.776736 1.663719 1.756459 1.453004 1.707905
# validity check
mmps(m4)
model selection
Based on the AICc results and the model performance comparison below, I choose the third model, m3.
model_list <- list(m1, m2, m3, m4)
aictab(model_list)
## Warning in aictab.AICglm.lm(model_list):
## Model names have been supplied automatically in the table
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## Mod3 7 162.40 0.00 0.99 0.99 -74.02
## Mod4 8 172.16 9.76 0.01 0.99 -77.85
## Mod1 13 172.63 10.23 0.01 1.00 -72.73
## Mod2 11 177.06 14.66 0.00 1.00 -77.11
compare_performance(m1, m2, m3, m4, rank = T)
## Warning: Following indices with missing values are not used for ranking:
## Score_log
## # Comparison of Model Performance Indices
##
## Name | Model | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP | AIC weights | BIC weights | Performance-Score
## -----------------------------------------------------------------------------------------------------------------------------------------
## m3 | glm | 0.717 | 0.263 | 0.681 | 0.227 | -Inf | 0.025 | 0.858 | 0.982 | 0.999 | 78.39%
## m1 | glm | 0.720 | 0.262 | 0.682 | 0.223 | -Inf | 0.027 | 0.860 | 0.009 | < 0.001 | 62.28%
## m2 | glm | 0.704 | 0.269 | 0.700 | 0.237 | -Inf | 0.024 | 0.852 | < 0.001 | < 0.001 | 19.23%
## m4 | glm | 0.700 | 0.270 | 0.700 | 0.239 | -Inf | 0.023 | 0.850 | 0.008 | 0.001 | 12.60%
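The AICc weight of 0.99 for Mod3 (m3) can be read as the approximate probability, among these four candidates, that m3 is the best model in the AICc sense, which agrees with the performance ranking above.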
prediction
# make predictions on the probability scale; without type = 'response',
# predict.glm() returns values on the link (log-odds) scale
pred <- predict(m3, test, type = 'response')
# classify with a 0.5 probability cutoff
pred <- ifelse(pred > 0.5, 1, 0) %>% as.factor()
confusionMatrix(pred, test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 67 6
## 1 5 62
##
## Accuracy : 0.9214
## 95% CI : (0.8638, 0.9601)
## No Information Rate : 0.5143
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8427
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9306
## Specificity : 0.9118
## Pos Pred Value : 0.9178
## Neg Pred Value : 0.9254
## Prevalence : 0.5143
## Detection Rate : 0.4786
## Detection Prevalence : 0.5214
## Balanced Accuracy : 0.9212
##
## 'Positive' Class : 0
##
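As a check, the headline numbers follow directly from the table: accuracy = (67 + 62) / 140 ≈ 0.9214, sensitivity = 67 / (67 + 5) ≈ 0.9306, and specificity = 62 / (62 + 6) ≈ 0.9118 (with class 0 treated as positive).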