Problem 1:
Modeling of Data:
Classification methods include:
• Logistic regression
• LDA
• QDA
• Naive Bayes
• Decision/partition Tree
• Random Forest
• K-nearest Neighbor
• SVM
• Neural Network
Build at least 8 models of the data (you can skip one of the 9 for full credit). In each case, you must describe the model, use some type of cross-validation or complexity scheme (such as AIC) to select parameters, and identify a justifiably good model within the category. When necessary, be sure to fit the model enough times to know you have produced a good solution (i.e., when the optimization may end up in a local optimum). Write a brief description (2-4 sentences) of each model, describing how well it is doing and why, and provide evidence to back up your conclusions.
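As a generic scaffold for the cross-validation called for above, a minimal k-fold helper in R could look like the following sketch (the helper name cv_accuracy, k = 10 folds, and plain accuracy as the metric are all illustrative assumptions; any of the classifiers below can be dropped into the fit/predict arguments):
# Minimal k-fold cross-validation scaffold (hypothetical helper; k = 10 assumed).
# fit_fun(train) must return a fitted model; pred_fun(model, test) must return 0/1 labels.
cv_accuracy <- function(data, fit_fun, pred_fun, k = 10, seed = 1) {
  set.seed(seed)
  folds <- sample(rep(1:k, length.out = nrow(data)))
  acc <- numeric(k)
  for (i in 1:k) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    model <- fit_fun(train)
    acc[i] <- mean(pred_fun(model, test) == test$mask)
  }
  mean(acc)  # average held-out accuracy across folds
}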
data_all <- read.csv("masking.csv")            # training data with the mask label
data_test <- read.csv("testcases-hidden.csv")  # new, unlabeled cases to predict
colnames(data_all) <- c(paste0("A", 1:26), "mask")
colnames(data_test) <- paste0("A", 1:26)
library(MASS)
#library(DAAG)
library(e1071)
library(rpart)
library(randomForest)
library(klaR)
library(class)
library(nnet)
library(randomForestExplainer)
model_linear<- lm(mask~., data = data_all)
summary(model_linear)
##
## Call:
## lm(formula = mask ~ ., data = data_all)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90152 -0.26520 0.01997 0.25119 0.61838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.837e-01 1.366e-01 2.809 0.005576 **
## A1 7.823e-04 3.507e-04 2.231 0.027052 *
## A2 -4.398e-04 5.364e-04 -0.820 0.413407
## A3 2.397e-04 4.570e-04 0.525 0.600593
## A4 1.982e-04 5.128e-04 0.387 0.699610
## A5 -8.134e-04 4.930e-04 -1.650 0.100855
## A6 1.120e-03 4.754e-04 2.356 0.019648 *
## A7 3.892e-04 4.368e-04 0.891 0.374253
## A8 -2.243e-04 4.343e-04 -0.516 0.606244
## A9 -5.580e-04 3.895e-04 -1.433 0.153820
## A10 2.835e-04 4.252e-04 0.667 0.505866
## A11 -2.221e-04 4.871e-04 -0.456 0.649131
## A12 -1.874e-04 4.853e-04 -0.386 0.699897
## A13 -1.747e-04 4.365e-04 -0.400 0.689457
## A14 -3.103e-04 5.196e-04 -0.597 0.551155
## A15 1.530e-03 3.893e-04 3.930 0.000125 ***
## A16 -5.676e-04 4.923e-04 -1.153 0.250616
## A17 -6.608e-04 4.281e-04 -1.544 0.124573
## A18 1.504e-03 4.617e-04 3.258 0.001362 **
## A19 -2.399e-04 4.626e-04 -0.519 0.604725
## A20 -1.175e-04 5.071e-04 -0.232 0.816960
## A21 5.152e-04 4.181e-04 1.232 0.219583
## A22 -2.455e-04 4.696e-04 -0.523 0.601768
## A23 -4.490e-04 4.142e-04 -1.084 0.279850
## A24 4.411e-05 3.849e-04 0.115 0.908892
## A25 -1.429e-03 4.272e-04 -3.345 0.001019 **
## A26 3.997e-04 4.526e-04 0.883 0.378471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3344 on 164 degrees of freedom
## Multiple R-squared: 0.6121, Adjusted R-squared: 0.5506
## F-statistic: 9.954 on 26 and 164 DF, p-value: < 2.2e-16
model_logistic1 <- glm(mask~. , data = data_all, family = binomial(logit))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_logistic1)
##
## Call:
## glm(formula = mask ~ ., family = binomial(logit), data = data_all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.13613 -0.00697 0.00000 0.01278 2.26969
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.1205889 4.7733167 -1.701 0.0889 .
## A1 0.0498351 0.0238012 2.094 0.0363 *
## A2 -0.0149885 0.0186576 -0.803 0.4218
## A3 0.0146249 0.0187251 0.781 0.4348
## A4 0.0175319 0.0264473 0.663 0.5074
## A5 -0.0348478 0.0220469 -1.581 0.1140
## A6 0.0402956 0.0193035 2.087 0.0368 *
## A7 0.0249796 0.0206631 1.209 0.2267
## A8 -0.0173813 0.0127262 -1.366 0.1720
## A9 -0.0176561 0.0155571 -1.135 0.2564
## A10 -0.0146682 0.0187931 -0.781 0.4351
## A11 0.0017110 0.0172388 0.099 0.9209
## A12 0.0132290 0.0154766 0.855 0.3927
## A13 -0.0178142 0.0225629 -0.790 0.4298
## A14 -0.0005521 0.0148855 -0.037 0.9704
## A15 0.0340147 0.0145555 2.337 0.0194 *
## A16 -0.0046572 0.0157801 -0.295 0.7679
## A17 -0.0153517 0.0136645 -1.123 0.2612
## A18 0.0525250 0.0236928 2.217 0.0266 *
## A19 0.0023342 0.0146810 0.159 0.8737
## A20 -0.0121069 0.0212577 -0.570 0.5690
## A21 0.0076452 0.0177990 0.430 0.6675
## A22 -0.0357996 0.0181209 -1.976 0.0482 *
## A23 -0.0206555 0.0117357 -1.760 0.0784 .
## A24 0.0057253 0.0110809 0.517 0.6054
## A25 -0.0349073 0.0163315 -2.137 0.0326 *
## A26 -0.0027791 0.0148775 -0.187 0.8518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 262.889 on 190 degrees of freedom
## Residual deviance: 33.008 on 164 degrees of freedom
## AIC: 87.008
##
## Number of Fisher Scoring iterations: 10
model_logistic2 <- glm(mask~. , data = data_all, family = quasibinomial(logit))
summary(model_logistic2)
##
## Call:
## glm(formula = mask ~ ., family = quasibinomial(logit), data = data_all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.13613 -0.00697 0.00000 0.01278 2.26969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.1205889 2.4072581 -3.373 0.000927 ***
## A1 0.0498351 0.0120033 4.152 5.29e-05 ***
## A2 -0.0149885 0.0094093 -1.593 0.113100
## A3 0.0146249 0.0094433 1.549 0.123382
## A4 0.0175319 0.0133378 1.314 0.190528
## A5 -0.0348478 0.0111186 -3.134 0.002042 **
## A6 0.0402956 0.0097350 4.139 5.56e-05 ***
## A7 0.0249796 0.0104207 2.397 0.017650 *
## A8 -0.0173813 0.0064180 -2.708 0.007483 **
## A9 -0.0176561 0.0078457 -2.250 0.025751 *
## A10 -0.0146682 0.0094776 -1.548 0.123630
## A11 0.0017110 0.0086938 0.197 0.844220
## A12 0.0132290 0.0078051 1.695 0.091987 .
## A13 -0.0178142 0.0113788 -1.566 0.119379
## A14 -0.0005521 0.0075070 -0.074 0.941467
## A15 0.0340147 0.0073405 4.634 7.28e-06 ***
## A16 -0.0046572 0.0079581 -0.585 0.559210
## A17 -0.0153517 0.0068912 -2.228 0.027259 *
## A18 0.0525250 0.0119486 4.396 1.98e-05 ***
## A19 0.0023342 0.0074039 0.315 0.752962
## A20 -0.0121069 0.0107206 -1.129 0.260417
## A21 0.0076452 0.0089763 0.852 0.395617
## A22 -0.0357996 0.0091386 -3.917 0.000131 ***
## A23 -0.0206555 0.0059185 -3.490 0.000620 ***
## A24 0.0057253 0.0055883 1.025 0.307097
## A25 -0.0349073 0.0082362 -4.238 3.75e-05 ***
## A26 -0.0027791 0.0075029 -0.370 0.711563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.2543342)
##
## Null deviance: 262.889 on 190 degrees of freedom
## Residual deviance: 33.008 on 164 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 10
AIC(model_logistic1)
## [1] 87.00816
# Selecting model_logistic1, since AIC is not defined for the quasibinomial family.
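As a cross-validated check on this choice, boot::cv.glm can estimate a 10-fold misclassification rate for model_logistic1 (a sketch; the 0.5-cutoff cost function is an assumption):
library(boot)
# Misclassification cost at an assumed 0.5 cutoff.
cost_01 <- function(y, p) mean(abs(y - p) > 0.5)
cv_logit <- cv.glm(data_all, model_logistic1, cost = cost_01, K = 10)
cv_logit$delta[1]  # estimated 10-fold misclassification rate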
model_lda <- lda(mask ~ ., data = data_all)
summary(model_lda)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 52 -none- numeric
## scaling 26 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 0 -none- list
model_qda <- qda(mask ~ ., data=data_all)
model_qda
## Call:
## qda(mask ~ ., data = data_all)
##
## Prior probabilities of groups:
## 0 1
## 0.4502618 0.5497382
##
## Group means:
## A1 A2 A3 A4 A5 A6 A7 A8
## 0 476.5930 512.6977 468.2907 469.7791 523.8488 515.4419 489.6163 588.5814
## 1 555.8667 500.8381 560.9810 546.8000 499.9810 517.0571 574.8381 501.0667
## A9 A10 A11 A12 A13 A14 A15 A16
## 0 578.6047 520.8837 519.3140 510.6279 523.1744 525.3372 438.8372 482.7791
## 1 497.4762 531.2952 485.3048 492.9905 487.3333 499.0190 568.4000 496.4952
## A17 A18 A19 A20 A21 A22 A23 A24
## 0 510.1163 481.1744 506.3721 542.5465 493.7791 493.5930 555.8140 578.4535
## 1 590.8286 505.1714 496.7238 480.0476 543.4476 470.5429 502.9905 512.9143
## A25 A26
## 0 561.4535 505.5116
## 1 478.8476 522.7524
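Neither lda nor qda has tuning parameters, but MASS supports leave-one-out cross-validation directly via CV = TRUE, which gives a comparable accuracy estimate for the two models (a minimal sketch):
lda_loo <- lda(mask ~ ., data = data_all, CV = TRUE)  # leave-one-out class predictions
qda_loo <- qda(mask ~ ., data = data_all, CV = TRUE)
mean(lda_loo$class == data_all$mask)  # LOOCV accuracy, LDA
mean(qda_loo$class == data_all$mask)  # LOOCV accuracy, QDA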
model_nb<-naiveBayes(mask~.,data=data_all)
summary(model_nb)
## Length Class Mode
## apriori 2 table numeric
## tables 26 -none- list
## levels 0 -none- NULL
## isnumeric 26 -none- logical
## call 4 -none- call
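The summary above shows levels of length 0, which indicates naiveBayes treated mask as a numeric response; a refit with a factor response (a sketch using the default x/y interface) makes predict() return class labels:
# Refit naive Bayes with mask coerced to a factor so predict() yields classes.
model_nb_f <- naiveBayes(x = data_all[, 1:26], y = factor(data_all$mask))
predict(model_nb_f, newdata = data_test)                      # predicted classes
head(predict(model_nb_f, newdata = data_test, type = "raw"))  # posterior probabilities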
model_DTree1<-rpart(mask~.,data=data_all,method="class")
model_DTree1
## n= 191
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 191 86 1 (0.45026178 0.54973822)
## 2) A15< 481.5 88 20 0 (0.77272727 0.22727273)
## 4) A25>=434.5 62 5 0 (0.91935484 0.08064516) *
## 5) A25< 434.5 26 11 1 (0.42307692 0.57692308)
## 10) A7< 421 18 7 0 (0.61111111 0.38888889) *
## 11) A7>=421 8 0 1 (0.00000000 1.00000000) *
## 3) A15>=481.5 103 18 1 (0.17475728 0.82524272)
## 6) A24>=826.5 10 0 0 (1.00000000 0.00000000) *
## 7) A24< 826.5 93 8 1 (0.08602151 0.91397849)
## 14) A24>=615.5 21 6 1 (0.28571429 0.71428571)
## 28) A1< 547 7 2 0 (0.71428571 0.28571429) *
## 29) A1>=547 14 1 1 (0.07142857 0.92857143) *
## 15) A24< 615.5 72 2 1 (0.02777778 0.97222222) *
model_DTree2 <- rpart(mask ~ ., data = data_all,
                      control = rpart.control(minsplit = 1, minbucket = 1, cp = -1, maxdepth = 3, xval = 10))
model_DTree2
## n= 191
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 191 47.2774900 0.54973820
## 2) A15< 481.5 88 15.4545500 0.22727270
## 4) A25>=434.5 62 4.5967740 0.08064516
## 8) A9>=388.5 59 2.8474580 0.05084746 *
## 9) A9< 388.5 3 0.6666667 0.66666670 *
## 5) A25< 434.5 26 6.3461540 0.57692310
## 10) A7< 421 18 4.2777780 0.38888890 *
## 11) A7>=421 8 0.0000000 1.00000000 *
## 3) A15>=481.5 103 14.8543700 0.82524270
## 6) A24>=826.5 10 0.0000000 0.00000000 *
## 7) A24< 826.5 93 7.3118280 0.91397850
## 14) A24>=615.5 21 4.2857140 0.71428570 *
## 15) A24< 615.5 72 1.9444440 0.97222220 *
### Selecting model 2, as it has more decision splits.
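Note that rpart already runs an internal 10-fold cross-validation over the complexity parameter (xval = 10 by default), so either tree can be pruned at the cp value with the lowest cross-validated error; a minimal sketch:
printcp(model_DTree1)  # the xerror column holds the cross-validated error
best_cp <- model_DTree1$cptable[which.min(model_DTree1$cptable[, "xerror"]), "CP"]
model_DTree_pruned <- prune(model_DTree1, cp = best_cp)
model_DTree_pruned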
model_RF1<- randomForest(mask ~ ., data= data_all,proximity = T, ntree = 5)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
summary(model_RF1)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 191 -none- numeric
## mse 5 -none- numeric
## rsq 5 -none- numeric
## oob.times 191 -none- numeric
## importance 26 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 36481 -none- numeric
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 191 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
model_RF1
##
## Call:
## randomForest(formula = mask ~ ., data = data_all, proximity = T, ntree = 5)
## Type of random forest: regression
## Number of trees: 5
## No. of variables tried at each split: 8
##
## Mean of squared residuals: 0.1909148
## % Var explained: 22.87
plot_min_depth_distribution(model_RF1)
model_RF2<- randomForest(mask ~ ., data= data_all,proximity = T, ntree = 5,replace = F, mtry = 3,keep.inbag = T, sampsize = 10)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
summary(model_RF2)
## Length Class Mode
## call 9 -none- call
## type 1 -none- character
## predicted 191 -none- numeric
## mse 5 -none- numeric
## rsq 5 -none- numeric
## oob.times 191 -none- numeric
## importance 26 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 36481 -none- numeric
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 191 -none- numeric
## test 0 -none- NULL
## inbag 955 -none- numeric
## terms 3 terms call
model_RF2
##
## Call:
## randomForest(formula = mask ~ ., data = data_all, proximity = T, ntree = 5, replace = F, mtry = 3, keep.inbag = T, sampsize = 10)
## Type of random forest: regression
## Number of trees: 5
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.1955574
## % Var explained: 21
plot_min_depth_distribution(model_RF2)
# Selecting model 2 as its variance is lower.
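The warning above ("The response has five or fewer unique values") indicates randomForest ran in regression mode; a classification fit on a factor response reports an out-of-bag error rate directly, and tuneRF searches mtry against that error (a sketch; ntree = 500 is an assumption):
set.seed(1)
rf_x <- data_all[, 1:26]
rf_y <- factor(data_all$mask)
model_RF_cls <- randomForest(x = rf_x, y = rf_y, ntree = 500)
model_RF_cls$err.rate[model_RF_cls$ntree, "OOB"]  # out-of-bag misclassification rate
tuneRF(rf_x, rf_y, ntreeTry = 500, stepFactor = 1.5, improve = 0.01, trace = FALSE)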
model_svm1<- svm(y = data_all$mask, x = data_all[, -1], kernel = "linear", scale = T,cross = 10)
## Warning in cret$cresults * scale.factor: Recycling array of length 1 in vector-array arithmetic is deprecated.
## Use c() or as.vector() instead.
model_svm1
##
## Call:
## svm.default(x = data_all[, -1], y = data_all$mask, scale = T, kernel = "linear",
## cross = 10)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 1
## gamma: 0.03846154
## epsilon: 0.1
##
##
## Number of Support Vectors: 42
model_svm2<- svm(y = data_all$mask, x = data_all[, -1], kernel = "radial", scale = T,cross = 10, cost =10, gamma =191)
## Warning in cret$cresults * scale.factor: Recycling array of length 1 in vector-array arithmetic is deprecated.
## Use c() or as.vector() instead.
model_svm2
##
## Call:
## svm.default(x = data_all[, -1], y = data_all$mask, scale = T, kernel = "radial",
## gamma = 191, cost = 10, cross = 10)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 10
## gamma: 191
## epsilon: 0.1
##
##
## Number of Support Vectors: 191
model_svm3<- svm(y = data_all$mask, x = data_all[, -1], kernel = "radial", scale = T,cross = 10, cost =10, gamma =10)
## Warning in cret$cresults * scale.factor: Recycling array of length 1 in vector-array arithmetic is deprecated.
## Use c() or as.vector() instead.
model_svm3
##
## Call:
## svm.default(x = data_all[, -1], y = data_all$mask, scale = T, kernel = "radial",
## gamma = 10, cost = 10, cross = 10)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 10
## gamma: 10
## epsilon: 0.1
##
##
## Number of Support Vectors: 191
## Selecting model 2 as its gamma value is larger.
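Note that data_all[, -1] in the calls above drops A1 and keeps mask (column 27) among the predictors; a corrected setup that also tunes cost and gamma by 10-fold cross-validation via e1071::tune.svm might look like this sketch (the grid values are assumptions):
svm_x <- data_all[, 1:26]
svm_y <- factor(data_all$mask)   # factor response => C-classification rather than eps-regression
svm_tuned <- tune.svm(x = svm_x, y = svm_y, kernel = "radial",
                      cost = 10^(-1:2), gamma = 10^(-3:1))
svm_tuned$best.parameters  # cost/gamma pair with the lowest cross-validated error
svm_tuned$best.model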
model_neural1<- nnet(mask ~ ., data = data_all,size=1)
## # weights: 29
## initial value 48.320948
## final value 47.277487
## converged
model_neural1
## a 26-1-1 network with 29 weights
## inputs: A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26
## output(s): mask
## options were -
model_neural2<- nnet(mask ~ ., data = data_all,size=10)
## # weights: 281
## initial value 48.600797
## iter 10 value 46.485680
## iter 20 value 46.473872
## final value 46.473837
## converged
model_neural2
## a 26-10-1 network with 281 weights
## inputs: A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26
## output(s): mask
## options were -
# Selecting model 1, as the number of hidden units should lie between 1 and the number of input variables.
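Since nnet can stop in a local optimum (the size-1 fit above barely moved from its initial value), a common safeguard is to refit several times from random starting weights and keep the best run; a sketch, where the decay value and the 20 restarts are assumptions:
set.seed(1)
fits <- lapply(1:20, function(i)
  nnet(mask ~ ., data = data_all, size = 1, decay = 0.01, maxit = 500, trace = FALSE))
model_neural_best <- fits[[which.min(sapply(fits, function(f) f$value))]]
model_neural_best$value  # lowest fitting criterion achieved across the restarts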
Problem 2:
Prediction contest:
Make predictions with each of your selected models about the twelve NEW cases in the testcases file. Show a table summarizing the predictions across models. Then, using whatever criteria you want, select your best guess about each of the twelve cases (this could be a consensus of models, a single prediction of your best model, or even just flipping a coin).
predict(model_logistic1, newdata = data_test)
## 1 2 3 4 5 6 7
## 22.127591 -5.182721 12.177563 5.259136 -13.296487 -14.147216 5.282995
## 8 9 10 11
## -10.841631 4.834520 -4.187206 2.572769
predict(model_lda, newdata= data_test)
## $class
## [1] 1 0 1 1 0 0 1 0 1 0 1
## Levels: 0 1
##
## $posterior
## 0 1
## 1 0.002181108 0.997818892
## 2 0.779815515 0.220184485
## 3 0.017717120 0.982282880
## 4 0.131206029 0.868793971
## 5 0.982569262 0.017430738
## 6 0.992547148 0.007452852
## 7 0.167505970 0.832494030
## 8 0.986863262 0.013136738
## 9 0.152315178 0.847684822
## 10 0.623475467 0.376524533
## 11 0.167802363 0.832197637
##
## $x
## LD1
## 1 2.2344877
## 2 -0.7078824
## 3 1.3942613
## 4 0.5482141
## 5 -1.8096668
## 6 -2.1519626
## 7 0.4339764
## 8 -1.9240052
## 9 0.4790256
## 10 -0.4051920
## 11 0.4331307
predict(model_qda, newdata = data_test)
## $class
## [1] 1 0 1 1 0 0 1 0 1 0 1
## Levels: 0 1
##
## $posterior
## 0 1
## 1 2.835085e-17 1.000000e+00
## 2 1.000000e+00 3.731286e-08
## 3 1.113437e-10 1.000000e+00
## 4 1.127312e-02 9.887269e-01
## 5 9.999438e-01 5.619412e-05
## 6 1.000000e+00 2.358865e-17
## 7 1.442832e-03 9.985572e-01
## 8 9.989746e-01 1.025403e-03
## 9 7.196649e-03 9.928034e-01
## 10 9.775244e-01 2.247558e-02
## 11 1.776423e-02 9.822358e-01
predict(model_nb,newdata=data_test)
## factor(0)
## Levels:
predict(model_neural2, newdata=data_test)
## [,1]
## 1 0.5503488
## 2 0.5503488
## 3 0.5503488
## 4 0.5503488
## 5 0.5503488
## 6 0.5503488
## 7 0.5503488
## 8 0.5503488
## 9 0.5503488
## 10 0.5503488
## 11 0.5503488
predict(model_RF2, newdata= data_test)
## 1 2 3 4 5 6 7 8 9 10 11
## 0.92 0.13 0.72 0.33 0.13 0.60 0.80 0.40 0.72 0.13 0.33
predict(model_DTree2, newdata = data_test)
## 1 2 3 4 5 6 7
## 0.97222222 0.05084746 0.97222222 1.00000000 0.05084746 0.05084746 0.97222222
## 8 9 10 11
## 0.05084746 0.97222222 0.05084746 0.97222222
predict(model_svm2,newdata= data_test)
## 1 2 3 4 5 6 7 8
## 0.5447761 0.5447761 0.5447761 0.5447761 0.5447761 0.5447761 0.5447761 0.5447761
## 9 10 11
## 0.5447761 0.5447761 0.5447761
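To build the summary table requested in Problem 2, the usable predictions above can be converted to 0/1 labels at a 0.5 cutoff and combined, with a simple majority vote as the final guess for each case (a sketch; the cutoff and the choice to omit the degenerate naive Bayes and SVM outputs are assumptions):
# Convert each model's output to a 0/1 label at an assumed 0.5 cutoff.
p_logit <- as.numeric(predict(model_logistic1, newdata = data_test, type = "response") > 0.5)
p_lda   <- as.numeric(as.character(predict(model_lda, newdata = data_test)$class))
p_qda   <- as.numeric(as.character(predict(model_qda, newdata = data_test)$class))
p_tree  <- as.numeric(predict(model_DTree2, newdata = data_test) > 0.5)
p_rf    <- as.numeric(predict(model_RF2, newdata = data_test) > 0.5)
p_nnet  <- as.numeric(predict(model_neural2, newdata = data_test) > 0.5)
pred_table <- data.frame(logistic = p_logit, lda = p_lda, qda = p_qda,
                         tree = p_tree, rf = p_rf, nnet = p_nnet)
pred_table$vote <- as.numeric(rowMeans(pred_table) > 0.5)  # majority vote across models
pred_table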
Problem 3:
Discuss Comparing Models:
Write a discussion comparing the 8 different models you have fit. Describe the advantages and disadvantages of each model for this problem. Discuss why you chose the particular prediction that you did, in the context of the different models' strengths and weaknesses.