Problem 1:
Modeling of Data:
Classification methods include:
• Logistic regression
• LDA
• QDA
• Naive Bayes
• Decision/partition Tree
• Random Forest
• K-nearest Neighbor
• SVM
• Neural Network
Build at least 8 models of the data (you can skip one of the 9 for full credit). In each case, you must describe the model, use some type of cross-validation or complexity scheme (such as AIC) to select parameters, and identify a justifiably good model within the category. When necessary, be sure to fit the model enough times to know you have produced a good solution (i.e., when the optimization may end up in a local optimum). Write a brief description (2-4 sentences) of each model, describing how well it is doing and why, and provide evidence to back up your conclusions.
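As a generic scaffold for the cross-validation called for above, a minimal k-fold helper in R could look like the following sketch (the helper name cv_accuracy, k = 10 folds, and plain accuracy as the metric are all illustrative assumptions; any of the classifiers below can be dropped into the fit/predict arguments):
# Minimal k-fold cross-validation scaffold (hypothetical helper; k = 10 assumed).
# fit_fun(train) must return a fitted model; pred_fun(model, test) must return 0/1 labels.
cv_accuracy <- function(data, fit_fun, pred_fun, k = 10, seed = 1) {
  set.seed(seed)
  folds <- sample(rep(1:k, length.out = nrow(data)))
  acc <- numeric(k)
  for (i in 1:k) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    model <- fit_fun(train)
    acc[i] <- mean(pred_fun(model, test) == test$mask)
  }
  mean(acc)  # average held-out accuracy across folds
}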
data_all <- read.csv("masking.csv")            # training data with the mask label
data_test <- read.csv("testcases-hidden.csv")  # new, unlabeled cases to predict
colnames(data_all) <- c(paste0("A", 1:26), "mask")
colnames(data_test) <- paste0("A", 1:26)
library(MASS)
#library(DAAG)
library(e1071)
library(rpart)
library(randomForest)
library(klaR)
library(class)
library(nnet)
library(randomForestExplainer)
model_linear<- lm(mask~., data = data_all)
summary(model_linear)
##
## Call:
## lm(formula = mask ~ ., data = data_all)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90152 -0.26520 0.01997 0.25119 0.61838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.837e-01 1.366e-01 2.809 0.005576 **
## A1 7.823e-04 3.507e-04 2.231 0.027052 *
## A2 -4.398e-04 5.364e-04 -0.820 0.413407
## A3 2.397e-04 4.570e-04 0.525 0.600593
## A4 1.982e-04 5.128e-04 0.387 0.699610
## A5 -8.134e-04 4.930e-04 -1.650 0.100855
## A6 1.120e-03 4.754e-04 2.356 0.019648 *
## A7 3.892e-04 4.368e-04 0.891 0.374253
## A8 -2.243e-04 4.343e-04 -0.516 0.606244
## A9 -5.580e-04 3.895e-04 -1.433 0.153820
## A10 2.835e-04 4.252e-04 0.667 0.505866
## A11 -2.221e-04 4.871e-04 -0.456 0.649131
## A12 -1.874e-04 4.853e-04 -0.386 0.699897
## A13 -1.747e-04 4.365e-04 -0.400 0.689457
## A14 -3.103e-04 5.196e-04 -0.597 0.551155
## A15 1.530e-03 3.893e-04 3.930 0.000125 ***
## A16 -5.676e-04 4.923e-04 -1.153 0.250616
## A17 -6.608e-04 4.281e-04 -1.544 0.124573
## A18 1.504e-03 4.617e-04 3.258 0.001362 **
## A19 -2.399e-04 4.626e-04 -0.519 0.604725
## A20 -1.175e-04 5.071e-04 -0.232 0.816960
## A21 5.152e-04 4.181e-04 1.232 0.219583
## A22 -2.455e-04 4.696e-04 -0.523 0.601768
## A23 -4.490e-04 4.142e-04 -1.084 0.279850
## A24 4.411e-05 3.849e-04 0.115 0.908892
## A25 -1.429e-03 4.272e-04 -3.345 0.001019 **
## A26 3.997e-04 4.526e-04 0.883 0.378471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3344 on 164 degrees of freedom
## Multiple R-squared: 0.6121, Adjusted R-squared: 0.5506
## F-statistic: 9.954 on 26 and 164 DF, p-value: < 2.2e-16
model_logistic1 <- glm(mask~. , data = data_all, family = binomial(logit))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_logistic1)
##
## Call:
## glm(formula = mask ~ ., family = binomial(logit), data = data_all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.13613 -0.00697 0.00000 0.01278 2.26969
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.1205889 4.7733167 -1.701 0.0889 .
## A1 0.0498351 0.0238012 2.094 0.0363 *
## A2 -0.0149885 0.0186576 -0.803 0.4218
## A3 0.0146249 0.0187251 0.781 0.4348
## A4 0.0175319 0.0264473 0.663 0.5074
## A5 -0.0348478 0.0220469 -1.581 0.1140
## A6 0.0402956 0.0193035 2.087 0.0368 *
## A7 0.0249796 0.0206631 1.209 0.2267
## A8 -0.0173813 0.0127262 -1.366 0.1720
## A9 -0.0176561 0.0155571 -1.135 0.2564
## A10 -0.0146682 0.0187931 -0.781 0.4351
## A11 0.0017110 0.0172388 0.099 0.9209
## A12 0.0132290 0.0154766 0.855 0.3927
## A13 -0.0178142 0.0225629 -0.790 0.4298
## A14 -0.0005521 0.0148855 -0.037 0.9704
## A15 0.0340147 0.0145555 2.337 0.0194 *
## A16 -0.0046572 0.0157801 -0.295 0.7679
## A17 -0.0153517 0.0136645 -1.123 0.2612
## A18 0.0525250 0.0236928 2.217 0.0266 *
## A19 0.0023342 0.0146810 0.159 0.8737
## A20 -0.0121069 0.0212577 -0.570 0.5690
## A21 0.0076452 0.0177990 0.430 0.6675
## A22 -0.0357996 0.0181209 -1.976 0.0482 *
## A23 -0.0206555 0.0117357 -1.760 0.0784 .
## A24 0.0057253 0.0110809 0.517 0.6054
## A25 -0.0349073 0.0163315 -2.137 0.0326 *
## A26 -0.0027791 0.0148775 -0.187 0.8518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 262.889 on 190 degrees of freedom
## Residual deviance: 33.008 on 164 degrees of freedom
## AIC: 87.008
##
## Number of Fisher Scoring iterations: 10
model_logistic2 <- glm(mask~. , data = data_all, family = quasibinomial(logit))
summary(model_logistic2)
##
## Call:
## glm(formula = mask ~ ., family = quasibinomial(logit), data = data_all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.13613 -0.00697 0.00000 0.01278 2.26969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.1205889 2.4072581 -3.373 0.000927 ***
## A1 0.0498351 0.0120033 4.152 5.29e-05 ***
## A2 -0.0149885 0.0094093 -1.593 0.113100
## A3 0.0146249 0.0094433 1.549 0.123382
## A4 0.0175319 0.0133378 1.314 0.190528
## A5 -0.0348478 0.0111186 -3.134 0.002042 **
## A6 0.0402956 0.0097350 4.139 5.56e-05 ***
## A7 0.0249796 0.0104207 2.397 0.017650 *
## A8 -0.0173813 0.0064180 -2.708 0.007483 **
## A9 -0.0176561 0.0078457 -2.250 0.025751 *
## A10 -0.0146682 0.0094776 -1.548 0.123630
## A11 0.0017110 0.0086938 0.197 0.844220
## A12 0.0132290 0.0078051 1.695 0.091987 .
## A13 -0.0178142 0.0113788 -1.566 0.119379
## A14 -0.0005521 0.0075070 -0.074 0.941467
## A15 0.0340147 0.0073405 4.634 7.28e-06 ***
## A16 -0.0046572 0.0079581 -0.585 0.559210
## A17 -0.0153517 0.0068912 -2.228 0.027259 *
## A18 0.0525250 0.0119486 4.396 1.98e-05 ***
## A19 0.0023342 0.0074039 0.315 0.752962
## A20 -0.0121069 0.0107206 -1.129 0.260417
## A21 0.0076452 0.0089763 0.852 0.395617
## A22 -0.0357996 0.0091386 -3.917 0.000131 ***
## A23 -0.0206555 0.0059185 -3.490 0.000620 ***
## A24 0.0057253 0.0055883 1.025 0.307097
## A25 -0.0349073 0.0082362 -4.238 3.75e-05 ***
## A26 -0.0027791 0.0075029 -0.370 0.711563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.2543342)
##
## Null deviance: 262.889 on 190 degrees of freedom
## Residual deviance: 33.008 on 164 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 10
AIC(model_logistic1)
## [1] 87.00816
# Selecting model_logistic1, since AIC is not defined for the quasibinomial family.
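As a cross-validated check on this choice, boot::cv.glm can estimate a 10-fold misclassification rate for model_logistic1 (a sketch; the 0.5-cutoff cost function is an assumption):
library(boot)
# Misclassification cost at an assumed 0.5 cutoff.
cost_01 <- function(y, p) mean(abs(y - p) > 0.5)
cv_logit <- cv.glm(data_all, model_logistic1, cost = cost_01, K = 10)
cv_logit$delta[1]  # estimated 10-fold misclassification rate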
model_lda <- lda(mask ~ ., data = data_all)
summary(model_lda)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 52 -none- numeric
## scaling 26 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 0 -none- list
model_qda <- qda(mask ~ ., data=data_all)
model_qda
## Call:
## qda(mask ~ ., data = data_all)
##
## Prior probabilities of groups:
## 0 1
## 0.4502618 0.5497382
##
## Group means:
## A1 A2 A3 A4 A5 A6 A7 A8
## 0 476.5930 512.6977 468.2907 469.7791 523.8488 515.4419 489.6163 588.5814
## 1 555.8667 500.8381 560.9810 546.8000 499.9810 517.0571 574.8381 501.0667
## A9 A10 A11 A12 A13 A14 A15 A16
## 0 578.6047 520.8837 519.3140 510.6279 523.1744 525.3372 438.8372 482.7791
## 1 497.4762 531.2952 485.3048 492.9905 487.3333 499.0190 568.4000 496.4952
## A17 A18 A19 A20 A21 A22 A23 A24
## 0 510.1163 481.1744 506.3721 542.5465 493.7791 493.5930 555.8140 578.4535
## 1 590.8286 505.1714 496.7238 480.0476 543.4476 470.5429 502.9905 512.9143
## A25 A26
## 0 561.4535 505.5116
## 1 478.8476 522.7524
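Neither lda nor qda has tuning parameters, but MASS supports leave-one-out cross-validation directly via CV = TRUE, which gives a comparable accuracy estimate for the two models (a minimal sketch):
lda_loo <- lda(mask ~ ., data = data_all, CV = TRUE)  # leave-one-out class predictions
qda_loo <- qda(mask ~ ., data = data_all, CV = TRUE)
mean(lda_loo$class == data_all$mask)  # LOOCV accuracy, LDA
mean(qda_loo$class == data_all$mask)  # LOOCV accuracy, QDA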
model_nb<-naiveBayes(mask~.,data=data_all)
summary(model_nb)
## Length Class Mode
## apriori 2 table numeric
## tables 26 -none- list
## levels 0 -none- NULL
## isnumeric 26 -none- logical
## call 4 -none- call
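The summary above shows levels of length 0, which indicates naiveBayes treated mask as a numeric response; a refit with a factor response (a sketch using the default x/y interface) makes predict() return class labels:
# Refit naive Bayes with mask coerced to a factor so predict() yields classes.
model_nb_f <- naiveBayes(x = data_all[, 1:26], y = factor(data_all$mask))
predict(model_nb_f, newdata = data_test)                      # predicted classes
head(predict(model_nb_f, newdata = data_test, type = "raw"))  # posterior probabilities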
model_DTree1<-rpart(mask~.,data=data_all,method="class")
model_DTree1
## n= 191
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 191 86 1 (0.45026178 0.54973822)
## 2) A15< 481.5 88 20 0 (0.77272727 0.22727273)
## 4) A25>=434.5 62 5 0 (0.91935484 0.08064516) *
## 5) A25< 434.5 26 11 1 (0.42307692 0.57692308)
## 10) A7< 421 18 7 0 (0.61111111 0.38888889) *
## 11) A7>=421 8 0 1 (0.00000000 1.00000000) *
## 3) A15>=481.5 103 18 1 (0.17475728 0.82524272)
## 6) A24>=826.5 10 0 0 (1.00000000 0.00000000) *
## 7) A24< 826.5 93 8 1 (0.08602151 0.91397849)
## 14) A24>=615.5 21 6 1 (0.28571429 0.71428571)
## 28) A1< 547 7 2 0 (0.71428571 0.28571429) *
## 29) A1>=547 14 1 1 (0.07142857 0.92857143) *
## 15) A24< 615.5 72 2 1 (0.02777778 0.97222222) *
model_DTree2 <- rpart(mask ~ ., data = data_all,
                      control = rpart.control(minsplit = 1, minbucket = 1, cp = -1, maxdepth = 3, xval = 10))
model_DTree2
## n= 191
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 191 47.2774900 0.54973820
## 2) A15< 481.5 88 15.4545500 0.22727270
## 4) A25>=434.5 62 4.5967740 0.08064516
## 8) A9>=388.5 59 2.8474580 0.05084746 *
## 9) A9< 388.5 3 0.6666667 0.66666670 *
## 5) A25< 434.5 26 6.3461540 0.57692310
## 10) A7< 421 18 4.2777780 0.38888890 *
## 11) A7>=421 8 0.0000000 1.00000000 *
## 3) A15>=481.5 103 14.8543700 0.82524270
## 6) A24>=826.5 10 0.0000000 0.00000000 *
## 7) A24< 826.5 93 7.3118280 0.91397850
## 14) A24>=615.5 21 4.2857140 0.71428570 *
## 15) A24< 615.5 72 1.9444440 0.97222220 *
### Selecting model 2, as it has more decision splits.
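Note that rpart already runs an internal 10-fold cross-validation over the complexity parameter (xval = 10 by default), so either tree can be pruned at the cp value with the lowest cross-validated error; a minimal sketch:
printcp(model_DTree1)  # the xerror column holds the cross-validated error
best_cp <- model_DTree1$cptable[which.min(model_DTree1$cptable[, "xerror"]), "CP"]
model_DTree_pruned <- prune(model_DTree1, cp = best_cp)
model_DTree_pruned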
model_RF1<- randomForest(mask ~ ., data= data_all,proximity = T, ntree = 5)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
summary(model_RF1)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 191 -none- numeric
## mse 5 -none- numeric
## rsq 5 -none- numeric
## oob.times 191 -none- numeric
## importance 26 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 36481 -none- numeric
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 191 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
model_RF1
##
## Call:
## randomForest(formula = mask ~ ., data = data_all, proximity = T, ntree = 5)
## Type of random forest: regression
## Number of trees: 5
## No. of variables tried at each split: 8
##
## Mean of squared residuals: 0.1909148
## % Var explained: 22.87
plot_min_depth_distribution(model_RF1)
model_RF2<- randomForest(mask ~ ., data= data_all,proximity = T, ntree = 5,replace = F, mtry = 3,keep.inbag = T, sampsize = 10)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
summary(model_RF2)
## Length Class Mode
## call 9 -none- call
## type 1 -none- character
## predicted 191 -none- numeric
## mse 5 -none- numeric
## rsq 5 -none- numeric
## oob.times 191 -none- numeric
## importance 26 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 36481 -none- numeric
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 191 -none- numeric
## test 0 -none- NULL
## inbag 955 -none- numeric
## terms 3 terms call
model_RF2
##
## Call:
## randomForest(formula = mask ~ ., data = data_all, proximity = T, ntree = 5, replace = F, mtry = 3, keep.inbag = T, sampsize = 10)
## Type of random forest: regression
## Number of trees: 5
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.1955574
## % Var explained: 21
plot_min_depth_distribution(model_RF2)
# Selecting model 2 as its variance is lower.
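The warning above ("The response has five or fewer unique values") indicates randomForest ran in regression mode; a classification fit on a factor response reports an out-of-bag error rate directly, and tuneRF searches mtry against that error (a sketch; ntree = 500 is an assumption):
set.seed(1)
rf_x <- data_all[, 1:26]
rf_y <- factor(data_all$mask)
model_RF_cls <- randomForest(x = rf_x, y = rf_y, ntree = 500)
model_RF_cls$err.rate[model_RF_cls$ntree, "OOB"]  # out-of-bag misclassification rate
tuneRF(rf_x, rf_y, ntreeTry = 500, stepFactor = 1.5, improve = 0.01, trace = FALSE)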
model_svm1<- svm(y = data_all$mask, x = data_all[, -1], kernel = "linear", scale = T,cross = 10)
## Warning in cret$cresults * scale.factor: Recycling array of length 1 in vector-array arithmetic is deprecated.
## Use c() or as.vector() instead.
model_svm1
##
## Call:
## svm.default(x = data_all[, -1], y = data_all$mask, scale = T, kernel = "linear",
## cross = 10)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 1
## gamma: 0.03846154
## epsilon: 0.1
##
##
## Number of Support Vectors: 42
model_svm2<- svm(y = data_all$mask, x = data_all[, -1], kernel = "radial", scale = T,cross = 10, cost =10, gamma =191)
## Warning in cret$cresults * scale.factor: Recycling array of length 1 in vector-array arithmetic is deprecated.
## Use c() or as.vector() instead.
model_svm2
##
## Call:
## svm.default(x = data_all[, -1], y = data_all$mask, scale = T, kernel = "radial",
## gamma = 191, cost = 10, cross = 10)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 10
## gamma: 191
## epsilon: 0.1
##
##
## Number of Support Vectors: 191
model_svm3<- svm(y = data_all$mask, x = data_all[, -1], kernel = "radial", scale = T,cross = 10, cost =10, gamma =10)
## Warning in cret$cresults * scale.factor: Recycling array of length 1 in vector-array arithmetic is deprecated.
## Use c() or as.vector() instead.
model_svm3
##
## Call:
## svm.default(x = data_all[, -1], y = data_all$mask, scale = T, kernel = "radial",
## gamma = 10, cost = 10, cross = 10)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 10
## gamma: 10
## epsilon: 0.1
##
##
## Number of Support Vectors: 191
## Selecting model 2 as its gamma value is larger.
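Note that data_all[, -1] in the calls above drops A1 and keeps mask (column 27) among the predictors; a corrected setup that also tunes cost and gamma by 10-fold cross-validation via e1071::tune.svm might look like this sketch (the grid values are assumptions):
svm_x <- data_all[, 1:26]
svm_y <- factor(data_all$mask)   # factor response => C-classification rather than eps-regression
svm_tuned <- tune.svm(x = svm_x, y = svm_y, kernel = "radial",
                      cost = 10^(-1:2), gamma = 10^(-3:1))
svm_tuned$best.parameters  # cost/gamma pair with the lowest cross-validated error
svm_tuned$best.model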
model_neural1<- nnet(mask ~ ., data = data_all,size=1)
## # weights: 29
## initial value 48.320948
## final value 47.277487
## converged
model_neural1
## a 26-1-1 network with 29 weights
## inputs: A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26
## output(s): mask
## options were -
model_neural2<- nnet(mask ~ ., data = data_all,size=10)
## # weights: 281
## initial value 48.600797
## iter 10 value 46.485680
## iter 20 value 46.473872
## final value 46.473837
## converged
model_neural2
## a 26-10-1 network with 281 weights
## inputs: A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26
## output(s): mask
## options were -
# Selecting model 1, as the number of hidden units should lie between 1 and the number of input variables.
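Since nnet can stop in a local optimum (the size-1 fit above barely moved from its initial value), a common safeguard is to refit several times from random starting weights and keep the best run; a sketch, where the decay value and the 20 restarts are assumptions:
set.seed(1)
fits <- lapply(1:20, function(i)
  nnet(mask ~ ., data = data_all, size = 1, decay = 0.01, maxit = 500, trace = FALSE))
model_neural_best <- fits[[which.min(sapply(fits, function(f) f$value))]]
model_neural_best$value  # lowest fitting criterion achieved across the restarts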
Problem 2:
Prediction contest:
Make predictions with each of your selected models about the twelve NEW cases in the testcases file. Show a table summarizing the predictions across models. Then, using whatever criteria you want, select your best guess about each of the twelve cases (this could be a consensus of models, a single prediction of your best model, or even just flipping a coin).
predict(model_logistic1, newdata = data_test)
## 1 2 3 4 5 6 7
## 22.127591 -5.182721 12.177563 5.259136 -13.296487 -14.147216 5.282995
## 8 9 10 11
## -10.841631 4.834520 -4.187206 2.572769
predict(model_lda, newdata= data_test)
## $class
## [1] 1 0 1 1 0 0 1 0 1 0 1
## Levels: 0 1
##
## $posterior
## 0 1
## 1 0.002181108 0.997818892
## 2 0.779815515 0.220184485
## 3 0.017717120 0.982282880
## 4 0.131206029 0.868793971
## 5 0.982569262 0.017430738
## 6 0.992547148 0.007452852
## 7 0.167505970 0.832494030
## 8 0.986863262 0.013136738
## 9 0.152315178 0.847684822
## 10 0.623475467 0.376524533
## 11 0.167802363 0.832197637
##
## $x
## LD1
## 1 2.2344877
## 2 -0.7078824
## 3 1.3942613
## 4 0.5482141
## 5 -1.8096668
## 6 -2.1519626
## 7 0.4339764
## 8 -1.9240052
## 9 0.4790256
## 10 -0.4051920
## 11 0.4331307
predict(model_qda, newdata = data_test)
## $class
## [1] 1 0 1 1 0 0 1 0 1 0 1
## Levels: 0 1
##
## $posterior
## 0 1
## 1 2.835085e-17 1.000000e+00
## 2 1.000000e+00 3.731286e-08
## 3 1.113437e-10 1.000000e+00
## 4 1.127312e-02 9.887269e-01
## 5 9.999438e-01 5.619412e-05
## 6 1.000000e+00 2.358865e-17
## 7 1.442832e-03 9.985572e-01
## 8 9.989746e-01 1.025403e-03
## 9 7.196649e-03 9.928034e-01
## 10 9.775244e-01 2.247558e-02
## 11 1.776423e-02 9.822358e-01
predict(model_nb,newdata=data_test)
## factor(0)
## Levels:
predict(model_neural2, newdata=data_test)
## [,1]
## 1 0.5503488
## 2 0.5503488
## 3 0.5503488
## 4 0.5503488
## 5 0.5503488
## 6 0.5503488
## 7 0.5503488
## 8 0.5503488
## 9 0.5503488
## 10 0.5503488
## 11 0.5503488
predict(model_RF2, newdata= data_test)
## 1 2 3 4 5 6 7 8 9 10 11
## 0.92 0.13 0.72 0.33 0.13 0.60 0.80 0.40 0.72 0.13 0.33
predict(model_DTree2, newdata = data_test)
## 1 2 3 4 5 6 7
## 0.97222222 0.05084746 0.97222222 1.00000000 0.05084746 0.05084746 0.97222222
## 8 9 10 11
## 0.05084746 0.97222222 0.05084746 0.97222222
predict(model_svm2,newdata= data_test)
## 1 2 3 4 5 6 7 8
## 0.5447761 0.5447761 0.5447761 0.5447761 0.5447761 0.5447761 0.5447761 0.5447761
## 9 10 11
## 0.5447761 0.5447761 0.5447761
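To build the summary table requested in Problem 2, the usable predictions above can be converted to 0/1 labels at a 0.5 cutoff and combined, with a simple majority vote as the final guess for each case (a sketch; the cutoff and the choice to omit the degenerate naive Bayes and SVM outputs are assumptions):
# Convert each model's output to a 0/1 label at an assumed 0.5 cutoff.
p_logit <- as.numeric(predict(model_logistic1, newdata = data_test, type = "response") > 0.5)
p_lda   <- as.numeric(as.character(predict(model_lda, newdata = data_test)$class))
p_qda   <- as.numeric(as.character(predict(model_qda, newdata = data_test)$class))
p_tree  <- as.numeric(predict(model_DTree2, newdata = data_test) > 0.5)
p_rf    <- as.numeric(predict(model_RF2, newdata = data_test) > 0.5)
p_nnet  <- as.numeric(predict(model_neural2, newdata = data_test) > 0.5)
pred_table <- data.frame(logistic = p_logit, lda = p_lda, qda = p_qda,
                         tree = p_tree, rf = p_rf, nnet = p_nnet)
pred_table$vote <- as.numeric(rowMeans(pred_table) > 0.5)  # majority vote across models
pred_table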
Problem 3:
Discuss Comparing Models:
Write a discussion comparing the 8 different models you have fit. Describe the advantages and disadvantages of each model for this problem. Discuss why you chose the particular prediction that you did, in the context of the different models' strengths and weaknesses.