This document reports an analysis of an email spam dataset hosted on Kaggle (https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv). Each row records how often a set of frequent words appears in one email, which is used to decide whether the email is spam. The goals of the project are to build a predictive model that identifies spam email and to infer the relationship between the predictors and the response. The scope of the project is to apply concepts learnt in class to this dataset, and the R language is used throughout. The original dataset had 3000 predictors; with that many columns I ran out of memory, so I applied feature selection and kept 30 predictors, the most common words across all the emails, for the rest of the project. The predictor values are all non-negative word counts. There are 5172 instances, one per email, and each cell stores the count of the column's word in that email. The information for all the emails is stored in a single data frame.
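The exact feature-selection step is not shown in this report; the following is a minimal hypothetical sketch of one way the reduction could be done, keeping the 30 words with the highest total counts. The file name emails.csv and the assumption that every column except Prediction is a word count are my own illustrative assumptions, not taken from the original analysis.

# Hypothetical sketch: reduce the original 3000 word-count columns to the
# 30 most frequent words overall. Assumes the full file is "emails.csv"
# and that all columns except "Prediction" are numeric word counts.
full <- read.csv("emails.csv")
word_cols <- setdiff(names(full), "Prediction")
top30 <- names(sort(colSums(full[, word_cols]), decreasing = TRUE))[1:30]
write.csv(full[, c(top30, "Prediction")], "email_v2.csv", row.names = FALSE)

With the reduced file saved as email_v2.csv, I first load all the required libraries.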
library(readxl)
library(xlsx)
library(readr)
library(psych)
library(pastecs)
library(caret)
library(e1071)
library(GGally)
library(ggplot2)
library(gridExtra)
library(grid)
library(tidyverse)
library(ggthemes)
library(corrplot)
library(factoextra)
library(rpart)
library(rpart.plot)
library(gplots)
I read the data from a CSV file and attach it to the search path. The R function attach() makes the columns of the data frame accessible by name.
data = read_csv("email_v2.csv")
attach(data)
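As an aside, attach() can silently mask other objects with the same names; the same columns can also be reached explicitly, which some find safer. This is a matter of style only, not a change to the analysis:

# Equivalent explicit access, avoiding attach()'s masking pitfalls
mean(data$will)         # instead of mean(will)
with(data, mean(will))  # scoped alternative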
The procedures below display the data's features and observations for a better understanding of the dataset.
names(data)
## [1] "will" "gas" "deal" "meter" "please" ## [6] "attached" "day" "only" "http" "volume" ## [11] "contract" "texas" "nom" "questions" "change" ## [16] "thanks" "money" "rate" "best" "houston" ## [21] "feb" "prices" "move" "neon" "nor" ## [26] "dr" "z" "gra" "rev" "alt" ## [31] "Prediction"
head(data)
## # A tibble: 6 x 31
##    will   gas  deal meter please attached   day  only  http volume contract
##   <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
## 1     0     0     0     0      0        0     0     0     0      0        0
## 2     0     1     0     0      2        1     2     0     0      0        0
## 3     0     2     0     0      0        0     0     0     0      0        0
## 4     0     0     2     1      0        0     1     0     0      1        0
## 5     0     2     0     3      1        0     0     0     0      3        4
## 6     0     0     0     0      0        0     0     0     0      0        0
## # … with 20 more variables: texas <dbl>, nom <dbl>, questions <dbl>,
## #   change <dbl>, thanks <dbl>, money <dbl>, rate <dbl>, best <dbl>,
## #   houston <dbl>, feb <dbl>, prices <dbl>, move <dbl>, neon <dbl>, nor <dbl>,
## #   dr <dbl>, z <dbl>, gra <dbl>, rev <dbl>, alt <dbl>, Prediction <dbl>
dim(data)
## [1] 5172 31
The dim() function shows the data has 5172 observations and 31 columns. The summary() function in R produces a numerical summary of each variable in the data set. I display the summary statistics of the data set below.
summary(data)
##      will             gas              deal            meter
## Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000
## 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000
## Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000
## Mean   : 0.8509   Mean   : 0.6174   Mean   : 0.7345   Mean   : 0.5381
## 3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000
## Max.   :53.0000   Max.   :29.0000   Max.   :25.0000   Max.   :29.0000
##     please          attached          day              only
## Min.   : 0.0000   Min.   :0.0000   Min.   : 0.0000   Min.   : 0.0000
## 1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000
## Median : 0.0000   Median :0.0000   Median : 0.0000   Median : 0.0000
## Mean   : 0.6278   Mean   :0.2121   Mean   : 0.7341   Mean   : 0.1841
## 3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.: 1.0000   3rd Qu.: 0.0000
## Max.   :12.0000   Max.   :7.0000   Max.   :26.0000   Max.   :12.0000
##      http            volume           contract          texas
## Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000
## 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000
## Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000
## Mean   : 0.2448   Mean   : 0.3291   Mean   : 0.2372   Mean   : 0.1794
## 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000
## Max.   :75.0000   Max.   :21.0000   Max.   :30.0000   Max.   :26.0000
##      nom           questions         change           thanks
## Min.   : 0.0000   Min.   :0.0000   Min.   : 0.0000   Min.   :0.0000
## 1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.:0.0000
## Median : 0.0000   Median :0.0000   Median : 0.0000   Median :0.0000
## Mean   : 0.5058   Mean   :0.1522   Mean   : 0.2862   Mean   :0.3705
## 3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.: 0.0000   3rd Qu.:1.0000
## Max.   :17.0000   Max.   :5.0000   Max.   :35.0000   Max.   :8.0000
##     money             rate             best             houston
## Min.   :0.00000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000
## 1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000
## Median :0.00000   Median : 0.0000   Median : 0.00000   Median : 0.00000
## Mean   :0.07057   Mean   : 0.2502   Mean   : 0.08063   Mean   : 0.08894
## 3rd Qu.:0.00000   3rd Qu.: 0.0000   3rd Qu.: 0.00000   3rd Qu.: 0.00000
## Max.   :8.00000   Max.   :42.0000   Max.   :12.00000   Max.   :55.00000
##      feb             prices            move             neon
## Min.   : 0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000
## 1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000
## Median : 0.0000   Median :0.00000   Median :0.0000   Median :0.00000
## Mean   : 0.1133   Mean   :0.06207   Mean   :0.1205   Mean   :0.01489
## 3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000
## Max.   :12.0000   Max.   :7.00000   Max.   :6.0000   Max.   :9.00000
##      nor               dr                z               gra
## Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000
## 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000
## Median : 0.0000   Median : 0.0000   Median : 0.000   Median : 0.0000
## Mean   : 0.1798   Mean   : 0.5027   Mean   : 1.119   Mean   : 0.3534
## 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.000   3rd Qu.: 0.0000
## Max.   :27.0000   Max.   :45.0000   Max.   :83.000   Max.   :66.0000
##      rev               alt            Prediction
## Min.   : 0.0000   Min.   : 0.0000   Min.   :0.00
## 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.00
## Median : 0.0000   Median : 0.0000   Median :0.00
## Mean   : 0.2846   Mean   : 0.1205   Mean   :0.29
## 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:1.00
## Max.   :15.0000   Max.   :13.0000   Max.   :1.00
The cor() function in R computes the pairwise correlations between the predictors; column 31, the response, is excluded.
cor(data[,-31])
##                   will         gas          deal        meter       please
## will       1.000000000  0.19408826  0.1165658397  0.074194634 0.3075509537
## gas        0.194088257  1.00000000  0.2518701108  0.402090078 0.1927989559
## deal       0.116565840  0.25187011  1.0000000000  0.345258095 0.1932443390
## meter      0.074194634  0.40209008  0.3452580950  1.000000000 0.2740664278
## please     0.307550954  0.19279896  0.1932443390  0.274066428 1.0000000000
## attached   0.053399963  0.15074555  0.0115556003  0.093745104 0.2011005612
## day        0.316428242  0.21303059  0.1836804108  0.130031019 0.2565856344
## only       0.177554426  0.11095202  0.1215434572  0.113116537 0.1666391425
## http      -0.015146397 -0.03928388 -0.0370370790 -0.037391361 0.0032825872
## volume     0.099497149  0.25685806  0.3220178673  0.359065289 0.1453252220
## contract   0.197231403  0.24990029  0.3525168359  0.259538569 0.1981665222
## texas      0.113240680  0.15032034  0.0479859292  0.009870298 0.0710757363
## nom        0.057433304  0.13931293 -0.0041694741  0.083655985 0.0005268855
## questions  0.240028223  0.19509997  0.0963054923  0.186399601 0.4136645938
## change     0.349604074  0.07724622  0.2580086930  0.049011365 0.1746024356
## thanks     0.139869773  0.36772843  0.4089695131  0.405970285 0.3728523751
## money      0.144592613 -0.01726649 -0.0033592852  0.006258493 0.0572784116
## rate       0.199130192  0.04105676  0.0115202712  0.011118047 0.0555922796
## best       0.054684098 -0.02467211 -0.0391525433 -0.046926459 -0.0221339507
## houston    0.079273390  0.09782829  0.0098614929  0.018467430 0.0718047753
## feb        0.044694961  0.04875616  0.1355137245  0.068444377 0.0788940300
## prices     0.061960619  0.10484440  0.0455019979 -0.021774217 0.0874615477
## move       0.121079472  0.09890604  0.1231503597  0.004938004 0.1350519186
## neon      -0.008296583 -0.02384860 -0.0271037057 -0.019706970 -0.0338284852
## nor        0.241815548  0.24067285  0.0890183301  0.268419217 0.1837039290
## dr         0.160052732  0.09584913  0.0192809337  0.027481633 0.1664594112
## z          0.099943665  0.05038512  0.0174314028 -0.011078764 0.0653508757
## gra        0.151209818  0.09890504 -0.0006858967  0.092293050 0.1071135579
## rev        0.178498449  0.07234346 -0.0102878944  0.008143123 0.0976012521
## alt        0.104629747  0.05999195  0.0167319993 -0.001876477 0.0416485361
## ... (the remaining 25 columns of the 30 x 30 correlation matrix are
## ... omitted here; the full matrix is easier to read graphically, as
## ... shown below)
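A 30 x 30 numeric matrix is hard to scan by eye. Since the corrplot package is already loaded, the same matrix can be summarized visually; this is a suggested sketch, not output from the original analysis:

# Visual summary of the full 30 x 30 correlation matrix
corrplot(cor(data[, -31]), method = "color", type = "upper",
         tl.cex = 0.6, tl.col = "black")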
I relabel the outcome variable, mapping 1 to "spam" and 0 to "non-spam", and convert it to a factor.
##Label and Factorize Outcome Variable
#mutate outcome variable
data = data %>%
  mutate(
    #label Prediction column
    Prediction = case_when(
      Prediction == 0 ~ "non-spam",
      Prediction == 1 ~ "spam"
    ),
    #factorize outcome column
    Prediction = as.factor(Prediction)
  )
str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 5172 obs. of 31 variables:
##  $ will      : num 0 0 0 0 0 0 0 1 0 2 ...
##  $ gas       : num 0 1 2 0 2 0 5 0 0 1 ...
##  $ deal      : num 0 0 0 2 0 0 1 0 0 0 ...
##  $ meter     : num 0 0 0 1 3 0 4 0 1 0 ...
##  $ please    : num 0 2 0 0 1 0 1 1 0 1 ...
##  $ attached  : num 0 1 0 0 0 0 0 0 0 0 ...
##  $ day       : num 0 2 0 1 0 0 1 0 0 1 ...
##  $ only      : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ http      : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ volume    : num 0 0 0 1 3 0 0 0 0 0 ...
##  $ contract  : num 0 0 0 0 4 0 0 0 1 0 ...
##  $ texas     : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ nom       : num 0 2 2 0 1 0 0 0 1 1 ...
##  $ questions : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ change    : num 0 0 0 0 0 0 0 0 0 6 ...
##  $ thanks    : num 0 1 0 1 1 0 1 1 0 1 ...
##  $ money     : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ rate      : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ best      : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ houston   : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ feb       : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ prices    : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ move      : num 0 0 0 0 1 0 0 0 0 0 ...
##  $ neon      : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ nor       : num 0 1 0 0 0 1 0 0 0 0 ...
##  $ dr        : num 0 1 0 0 0 0 0 0 0 1 ...
##  $ z         : num 0 1 0 0 0 0 0 1 3 2 ...
##  $ gra       : num 0 2 0 0 0 0 0 0 0 0 ...
##  $ rev       : num 0 1 0 0 0 0 0 0 0 0 ...
##  $ alt       : num 0 0 0 0 0 0 0 0 0 1 ...
##  $ Prediction: Factor w/ 2 levels "non-spam","spam": 1 1 1 1 1 2 1 2 1 1 ...
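Before plotting, it is worth checking the class balance. The summary above showed a mean of 0.29 for Prediction, so roughly 29% of the emails are spam; the counts below follow from the column totals of the full-data confusion matrix later in this report:

table(data$Prediction)
##
## non-spam     spam
##     3672     1500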
##Visualizing the Distribution Difference with Density Plots##
#define predictor variables (every column except the response)
predictor_variables = setdiff(colnames(data), "Prediction")
#initialize an empty list to store plots
plot_list = list()
#set plot index
i = 1
#create the plots and fill the plot list
for (predictor_variable in predictor_variables) {
  plot = ggplot(data) +
    geom_density(aes_string(x = predictor_variable, fill = "Prediction"),
                 alpha = 0.5)
  plot_list[[i]] = plot
  i = i + 1
}
#display the plots
do.call("grid.arrange", c(plot_list, ncol = 3))

I also use graphical plots to show the correlation among the predictors in the dataset.
Graphical Representation
I use the pairs() function to create a scatterplot matrix, which shows a scatterplot for every pair of variables in the dataset.
pairs(data,col=data$Prediction)

In the pairs plot above, the colors distinguish the two classes: black marks non-spam observations and red marks spam observations (the default colors for the first two factor levels). I create another plot using the ggpairs() function from the GGally package in R to see the correlation among the predictors in the dataset.
ggpairs(data = data[1:30],
        title = "Email Data Correlation Plot",
        upper = list(continuous = wrap("cor", size = 5)),
        lower = list(continuous = "smooth"))

I display frequency histograms of each predictor, colored by class.
#Email Data Frequency Histogram
#Density & frequency analysis: one histogram per predictor, colored by class,
#with a dashed line at the predictor's mean. A loop over the predictor names
#replaces one copy-pasted ggplot block per word.
hist_list <- list()
for (v in predictor_variables) {
  hist_list[[v]] <- ggplot(data = data, aes_string(x = v)) +
    geom_histogram(aes(fill = Prediction), binwidth = 0.2, color = "black") +
    xlab(v) + ylab("Frequency") +
    theme(legend.position = "none") +
    geom_vline(xintercept = mean(data[[v]]), linetype = "dashed", color = "grey")
}
# Plot all visualizations in one grid
grid.arrange(grobs = hist_list, nrow = 7,
             top = textGrob("Email Data Frequency Histogram", gp = gpar(fontsize = 8)))

The class-colored plots above show the separation of the classes and help reveal how much the class values overlap on each attribute. Next, I make another single visualization that contains all the boxplots.
#Boxplots: one per predictor, split by class, again built in a loop
box_list <- list()
for (v in predictor_variables) {
  box_list[[v]] <- ggplot(data = data,
                          aes_string(x = "Prediction", y = v, fill = "Prediction")) +
    geom_boxplot() +
    scale_y_continuous(v, breaks = seq(0, 300, by = .5)) +
    theme(legend.position = "none")
}
# Plot all boxplot visualizations in one grid
grid.arrange(grobs = box_list, nrow = 5,
             top = textGrob("Email Data Boxplots", gp = gpar(fontsize = 11)))

In this section, I explored the data set using several R functions and graphical representations.
Classification
The response variable in the dataset is categorical, so in this section I use several classification methods to predict the qualitative response.
Logistic Regression
I fit a logistic regression model to predict spam email from the predictors in the dataset, using the generalized linear model function glm() in R. I start by using the whole dataset for both training and testing the model.
##Logistic regression model
## create model
glm.fits = glm(formula = Prediction ~ ., data = data, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fits)
##
## Call:
## glm(formula = Prediction ~ ., family = "binomial", data = data)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -5.2285  -0.2345  -0.0292   0.0862   3.2878
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.012791   0.063502  -0.201  0.84037
## will        -0.447672   0.056324  -7.948 1.89e-15 ***
## gas         -0.242692   0.061893  -3.921 8.81e-05 ***
## deal        -0.748635   0.120323  -6.222 4.91e-10 ***
## meter       -1.308691   0.235308  -5.562 2.67e-08 ***
## please      -0.414646   0.087615  -4.733 2.22e-06 ***
## attached    -4.714750   0.522301  -9.027  < 2e-16 ***
## day         -0.338068   0.050110  -6.747 1.51e-11 ***
## only         0.959795   0.147963   6.487 8.77e-11 ***
## http         0.974066   0.094628  10.294  < 2e-16 ***
## volume      -1.223554   0.248946  -4.915 8.88e-07 ***
## contract    -0.297113   0.117643  -2.526  0.01155 *
## texas       -1.204828   0.212911  -5.659 1.52e-08 ***
## nom         -0.369976   0.091543  -4.042 5.31e-05 ***
## questions   -2.097084   0.270962  -7.739 9.99e-15 ***
## change      -0.903930   0.135229  -6.684 2.32e-11 ***
## thanks      -1.994111   0.186541 -10.690  < 2e-16 ***
## money        2.555173   0.261873   9.757  < 2e-16 ***
## rate         0.303496   0.098263   3.089  0.00201 **
## best         1.298562   0.190085   6.831 8.40e-12 ***
## houston     -2.799894   0.463803  -6.037 1.57e-09 ***
## feb         -1.423951   0.306411  -4.647 3.36e-06 ***
## prices       1.616134   0.229800   7.033 2.02e-12 ***
## move         1.337748   0.161147   8.301  < 2e-16 ***
## neon        -2.182367   0.668102  -3.267  0.00109 **
## nor         -0.002474   0.139544  -0.018  0.98586
## dr           0.338414   0.064269   5.266 1.40e-07 ***
## z            0.221679   0.040609   5.459 4.79e-08 ***
## gra          0.443850   0.073144   6.068 1.29e-09 ***
## rev          0.138100   0.076612   1.803  0.07145 .
## alt          0.453869   0.148369   3.059  0.00222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 6228.9 on 5171 degrees of freedom
## Residual deviance: 2489.3 on 5141 degrees of freedom
## AIC: 2551.3
##
## Number of Fisher Scoring iterations: 9
coef(glm.fits)
##  (Intercept)         will          gas         deal        meter       please
## -0.012790682 -0.447671952 -0.242691638 -0.748635159 -1.308690940 -0.414645830
##     attached          day         only         http       volume     contract
## -4.714749568 -0.338068490  0.959794909  0.974065996 -1.223554480 -0.297113255
##        texas          nom    questions       change       thanks        money
## -1.204828464 -0.369976405 -2.097083798 -0.903930148 -1.994110595  2.555173279
##         rate         best      houston          feb       prices         move
##  0.303495513  1.298562255 -2.799894081 -1.423951196  1.616134451  1.337747808
##         neon          nor           dr            z          gra          rev
## -2.182367358 -0.002473625  0.338413967  0.221679387  0.443850360  0.138100274
##          alt
##  0.453868867
summary(glm.fits)$coef
##                 Estimate   Std. Error      z value     Pr(>|z|)
## (Intercept) -0.012790682   0.06350195  -0.20142189 8.403687e-01
## will        -0.447671952   0.05632361  -7.94821068 1.892247e-15
## gas         -0.242691638   0.06189253  -3.92117804 8.811713e-05
## deal        -0.748635159   0.12032253  -6.22190357 4.911589e-10
## meter       -1.308690940   0.23530772  -5.56161486 2.672897e-08
## please      -0.414645830   0.08761471  -4.73260494 2.216566e-06
## attached    -4.714749568   0.52230068  -9.02688760 1.766237e-19
## day         -0.338068490   0.05010951  -6.74659362 1.513564e-11
## only         0.959794909   0.14796318   6.48671451 8.772835e-11
## http         0.974065996   0.09462780  10.29365559 7.526355e-25
## volume      -1.223554480   0.24894557  -4.91494783 8.880611e-07
## contract    -0.297113255   0.11764299  -2.52555008 1.155173e-02
## texas       -1.204828464   0.21291083  -5.65884058 1.523990e-08
## nom         -0.369976405   0.09154283  -4.04156634 5.309535e-05
## questions   -2.097083798   0.27096189  -7.73940489 9.988323e-15
## change      -0.903930148   0.13522887  -6.68444655 2.317990e-11
## thanks      -1.994110595   0.18654078 -10.68994466 1.134386e-26
## money        2.555173279   0.26187340   9.75728469 1.716905e-22
## rate         0.303495513   0.09826310   3.08860111 2.011012e-03
## best         1.298562255   0.19008458   6.83149721 8.403292e-12
## houston     -2.799894081   0.46380343  -6.03681198 1.571887e-09
## feb         -1.423951196   0.30641109  -4.64719204 3.364839e-06
## prices       1.616134451   0.22979969   7.03279633 2.024348e-12
## move         1.337747808   0.16114718   8.30140395 1.028882e-16
## neon        -2.182367358   0.66810159  -3.26652021 1.088780e-03
## nor         -0.002473625   0.13954418  -0.01772646 9.858571e-01
## dr           0.338413967   0.06426853   5.26562519 1.397131e-07
## z            0.221679387   0.04060904   5.45886843 4.791787e-08
## gra          0.443850360   0.07314402   6.06817031 1.293757e-09
## rev          0.138100274   0.07661174   1.80259928 7.145117e-02
## alt          0.453868867   0.14836860   3.05906282 2.220306e-03
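Because logistic regression works on the log-odds scale, exponentiating the coefficients gives odds ratios, which are easier to interpret than the raw estimates. This is a small optional step, not part of the original analysis:

# Odds ratios: values below 1 lower the odds of spam, values above 1 raise them
round(exp(coef(glm.fits)), 3)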
glm.probs = predict(glm.fits, type = "response")
glm.probs[1:10]
##            1            2            3            4            5            6
## 4.968024e-01 4.872221e-04 2.247622e-01 1.701596e-03 2.200215e-05 4.961840e-01
##            7            8            9           10
## 4.741746e-05 6.614081e-02 2.102311e-01 2.126677e-04
## create predictions vector
#glm.preds = as.factor(ifelse(glm.probs > 0.5, "spam", "antispam"))
glm.pred = rep("non-spam", 5172)
glm.pred[glm.probs > .5] = "spam"
table(glm.pred, data$Prediction)
##
## glm.pred   non-spam spam
##   non-spam     3530  349
##   spam          142 1151
#caret's confusion matrix helper is confusionMatrix(); see the corrected call below
glm_correctness = mean(glm.pred == data$Prediction)
glm_correctness
## [1] 0.9050657
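The original attempt at caret's confusion matrix failed because the function name is confusionMatrix() (lower-case c) and the prediction object's name was mistyped; it also expects two factors with the same levels. A corrected call would be:

confusionMatrix(factor(glm.pred, levels = levels(data$Prediction)),
                data$Prediction)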
In the above procedure, I fitted the model using the glm() function and used the summary() function in R to display a summary of the fitted model. Predictors with small p-values have a significant association with the response variable Prediction, and variables with negative coefficients decrease the predicted probability of spam. I use the coef() function to display the coefficients of the fitted model and summary()$coef to display the p-values for the coefficients. I use the predict() function to predict the probability of an email being spam given the values of the predictors; glm.probs holds these predicted probabilities. The commands above then create a vector of class predictions based on whether the predicted probability of an email being spam is greater than or less than 0.5. I used the table() function in R to create a confusion matrix of correct and incorrect predictions: the diagonal elements, 3530 and 1151, are the correct predictions, while the off-diagonal elements, 142 and 349, are the incorrect ones. I use the mean() function to calculate the fraction of correct predictions; the accuracy of the model is 90.51%.
Training and Testing dataset
Using the whole dataset for both training and testing the model can be misleading, because the training error rate tends to underestimate the test error rate. To address this, I fit the model using part of the data and then examine how well it predicts the held-out data, which yields a more realistic error rate. I randomly divide the dataset into training data and test data.
###Train dataset and Held out dataset
set.seed(1)
smp_siz = floor(0.75 * nrow(data))
train_ind = sample(seq_len(nrow(data)), size = smp_siz)
train = data[train_ind, ]
test = data[-train_ind, ]
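As a quick optional sanity check, not in the original analysis, one can confirm that the random split roughly preserves the 71/29 class balance; caret's createDataPartition() could be used instead to stratify the split explicitly:

# Compare class proportions in the two halves of the split
prop.table(table(train$Prediction))
prop.table(table(test$Prediction))
# Stratified alternative via caret:
# train_ind = createDataPartition(data$Prediction, p = 0.75, list = FALSE)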
I used a 75:25 split of the dataset for training and testing respectively. I fit the model again and make predictions using the separate training and test data. As before, the predictors with small p-values are significant, and negative coefficients reduce the predicted probability of spam. I use the predict() function on the test data to predict the probability of an email being spam given the values of the predictors. The confusion matrix below shows the results, and the mean() function gives the fraction of correct predictions. The accuracy is 90.1%, almost equal to the one obtained previously.
glm.fits2 = glm(formula = Prediction ~ ., data = train, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fits2)
##
## Call:
## glm(formula = Prediction ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -4.4512  -0.2176  -0.0236   0.0918   3.1929
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.01220    0.07363   0.166 0.868384
## will         -0.39773    0.06093  -6.528 6.68e-11 ***
## gas          -0.28983    0.07501  -3.864 0.000112 ***
## deal         -0.71646    0.13884  -5.160 2.47e-07 ***
## meter        -1.59524    0.31634  -5.043 4.59e-07 ***
## please       -0.47955    0.10116  -4.741 2.13e-06 ***
## attached     -4.62392    0.54785  -8.440  < 2e-16 ***
## day          -0.33008    0.05652  -5.840 5.21e-09 ***
## only          0.90474    0.16529   5.474 4.41e-08 ***
## http          0.92117    0.10891   8.458  < 2e-16 ***
## volume       -1.28079    0.29784  -4.300 1.71e-05 ***
## contract     -0.27140    0.13320  -2.038 0.041598 *
## texas        -1.03754    0.21223  -4.889 1.02e-06 ***
## nom          -0.33867    0.09852  -3.438 0.000587 ***
## questions    -2.09188    0.31123  -6.721 1.80e-11 ***
## change       -0.93013    0.16078  -5.785 7.24e-09 ***
## thanks       -1.94094    0.20519  -9.459  < 2e-16 ***
## money         2.46007    0.29267   8.406  < 2e-16 ***
## rate          0.05524    0.06137   0.900 0.368121
## best          1.37064    0.22104   6.201 5.62e-10 ***
## houston      -3.05698    0.54492  -5.610 2.02e-08 ***
## feb          -1.29992    0.35548  -3.657 0.000255 ***
## prices        1.41432    0.24153   5.856 4.75e-09 ***
## move          1.22418    0.18141   6.748 1.50e-11 ***
## neon        -15.15596  350.98995  -0.043 0.965558
## nor           0.50842    0.13663   3.721 0.000198 ***
## dr            0.37612    0.07228   5.204 1.95e-07 ***
## z             0.23005    0.04583   5.019 5.18e-07 ***
## gra           0.43292    0.08407   5.150 2.61e-07 ***
## rev           0.10141    0.08786   1.154 0.248403
## alt           0.81577    0.19841   4.112 3.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 4700.0 on 3878 degrees of freedom
## Residual deviance: 1847.2 on 3848 degrees of freedom
## AIC: 1909.2
##
## Number of Fisher Scoring iterations: 16
coef(glm.fits2)
##  (Intercept)         will          gas         deal        meter       please
##   0.01220148  -0.39772590  -0.28982554  -0.71645883  -1.59523959  -0.47955419
##     attached          day         only         http       volume     contract
##  -4.62391745  -0.33007562   0.90473976   0.92117095  -1.28079076  -0.27139871
##        texas          nom    questions       change       thanks        money
##  -1.03754094  -0.33867127  -2.09187589  -0.93013039  -1.94093607   2.46006817
##         rate         best      houston          feb       prices         move
##   0.05523513   1.37064397  -3.05698212  -1.29992141   1.41432498   1.22417503
##         neon          nor           dr            z          gra          rev
## -15.15595926   0.50842242   0.37612008   0.23004966   0.43292304   0.10141489
##          alt
##   0.81577269
summary(glm.fits2)$coef
##                 Estimate   Std. Error     z value     Pr(>|z|)
## (Intercept)   0.01220148   0.07363079  0.16571164 8.683839e-01
## will         -0.39772590   0.06092984 -6.52760396 6.683012e-11
## gas          -0.28982554   0.07501054 -3.86379738 1.116378e-04
## deal         -0.71645883   0.13884374 -5.16018113 2.467110e-07
## meter        -1.59523959   0.31634195 -5.04276975 4.588413e-07
## please       -0.47955419   0.10115744 -4.74067161 2.130109e-06
## attached     -4.62391745   0.54785082 -8.44010322 3.170576e-17
## day          -0.33007562   0.05651736 -5.84025155 5.212206e-09
## only          0.90473976   0.16529453  5.47350082 4.412308e-08
## http          0.92117095   0.10891394  8.45778739 2.724984e-17
## volume       -1.28079076   0.29783601 -4.30032201 1.705501e-05
## contract     -0.27139871   0.13320057 -2.03751912 4.159805e-02
## texas        -1.03754094   0.21223451 -4.88865325 1.015281e-06
## nom          -0.33867127   0.09851934 -3.43761192 5.868681e-04
## questions    -2.09187589   0.31122591 -6.72140672 1.799784e-11
## change       -0.93013039   0.16077681 -5.78522729 7.241435e-09
## thanks       -1.94093607   0.20519404 -9.45902765 3.108200e-21
## money         2.46006817   0.29267152  8.40556053 4.258229e-17
## rate          0.05523513   0.06137248  0.89999825 3.681212e-01
## best          1.37064397   0.22104347  6.20078932 5.618068e-10
## houston      -3.05698212   0.54491792 -5.60998644 2.023425e-08
## feb          -1.29992141   0.35547769 -3.65682977 2.553538e-04
## prices        1.41432498   0.24152765  5.85574760 4.748689e-09
## move          1.22417503   0.18140656  6.74824012 1.496491e-11
## neon        -15.15595926 350.98995161 -0.04318061 9.655576e-01
## nor           0.50842242   0.13663456  3.72103811 1.984055e-04
## dr            0.37612008   0.07227700  5.20386986 1.951807e-07
## z             0.23004966   0.04583202  5.01940922 5.183063e-07
## gra           0.43292304   0.08406929  5.14959788 2.610455e-07
## rev           0.10141489   0.08786314  1.15423713 2.484030e-01
## alt           0.81577269   0.19841185  4.11151198 3.930765e-05
glm.probs2 = predict(glm.fits2, test, type = "response")
glm.probs2[1:10]
##            1            2            3            4            5            6
## 8.604222e-04 7.384485e-06 6.272936e-01 7.070138e-02 1.819846e-01 3.061579e-04
##            7            8            9           10
## 5.268448e-07 8.287878e-06 5.287603e-01 3.689203e-03
dim(test)
## [1] 1293 31
glm.pred2 = rep("non-spam", 1293)
glm.pred2[glm.probs2 > .5] = "spam"
table(glm.pred2, test$Prediction)
##
## glm.pred2  non-spam spam
##   non-spam      848   42
##   spam           86  317
#as before, caret's confusionMatrix() could be used here in place of table()
glm_correctness2 = mean(glm.pred2 == test$Prediction)
glm_correctness2
## [1] 0.9010054
Linear Discriminant Analysis
I performed LDA on the dataset, using the training and test sets created above. I fit an LDA model with the lda() function on the training data. The output indicates prior probabilities of π1 = 0.71 and π2 = 0.29: 71% of the training observations are non-spam and 29% are spam. The group means show the average of each predictor within each class, as estimated by LDA. The coefficients of linear discriminants give the linear combination of the predictors that forms the LDA decision rule. I use the plot() function to produce a plot of the linear discriminants, then apply the predict() function to the fitted model with the test data. predict() returns a list with three elements: class, posterior, and x, which contains the linear discriminants. I use the table() function to display the confusion matrix and the mean() function to calculate the accuracy.
#Linear Discriminant Analysis
require(MASS)
lda.fit = lda(Prediction ~ ., data = train)
lda.fit
## Call:
## lda(Prediction ~ ., data = train)
##
## Prior probabilities of groups:
## non-spam     spam
## 0.705852 0.294148
##
## Group means:
##               will       gas      deal     meter    please   attached       day
## non-spam 1.0295836 0.8151936 0.9930606 0.7702703 0.7677137 0.30460190 0.8739956
## spam     0.4872918 0.1472393 0.1542507 0.0324277 0.3593339 0.01139351 0.4618755
##               only       http     volume   contract      texas       nom
## non-spam 0.1241782 0.06647188 0.46201607 0.32359386 0.23265157 0.6391527
## spam     0.3374233 0.66958808 0.01402279 0.05609115 0.01928133 0.1428571
##           questions    change     thanks      money      rate       best
## non-spam 0.21292915 0.3582907 0.50292184 0.01789627 0.1888240 0.03287071
## spam     0.01840491 0.1209465 0.06748466 0.18843120 0.4653812 0.18755478
##              houston        feb     prices       move       neon       nor
## non-spam 0.100438276 0.14864865 0.02447042 0.07815924 0.02264427 0.1669102
## spam     0.004382121 0.02366345 0.17616126 0.23926380 0.00000000 0.2278703
##                 dr         z       gra       rev        alt
## non-spam 0.3268809 0.6139518 0.2074507 0.2914536 0.06208912
## spam     1.0043821 2.5547765 0.7642419 0.2304996 0.26730938
##
## Coefficients of linear discriminants:
##                     LD1
## will      -0.105757835
## gas       -0.019718250
## deal      -0.118239385
## meter     -0.032242656
## please    -0.085997441
## attached  -0.598894669
## day       -0.075077063
## only       0.405015436
## http       0.055424801
## volume    -0.083651711
## contract   0.084285856
## texas     -0.218434561
## nom       -0.208139476
## questions -0.320766344
## change    -0.018718588
## thanks    -0.321028593
## money      0.929545882
## rate      -0.050238221
## best       1.035306060
## houston   -0.253183640
## feb       -0.157416783
## prices     0.438859685
## move       0.319259498
## neon      -0.613267607
## nor        0.182424771
## dr         0.079042886
## z          0.048799875
## gra        0.082767596
## rev        0.009150501
## alt        0.302105638
plot(lda.fit)

lda.pred = predict(lda.fit, test)
names(lda.pred)
## [1] "class" "posterior" "x"
lda.class = lda.pred$class
table(lda.class, test$Prediction)
##
## lda.class  non-spam spam
##   non-spam      912  215
##   spam           22  144
lda_correctness = mean(lda.class == test$Prediction)
lda_correctness
## [1] 0.8167053
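The confusion matrix shows LDA misses many spam emails (215 of them). Since predict() also returns posterior probabilities, the 50% threshold could be lowered to trade precision for recall; this is an optional sketch, not part of the original analysis:

# Classify as spam whenever the posterior probability of spam exceeds 0.3
lda.class2 <- ifelse(lda.pred$posterior[, "spam"] > 0.3, "spam", "non-spam")
table(lda.class2, test$Prediction)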
K-Nearest Neighbors
I carry out a K-nearest neighbors analysis on the data set using the knn() function in R. knn() works differently from the other model-fitting functions: it forms predictions in a single command instead of the two-step fit-then-predict approach. I divide the data into training and test sets and use a normalization formula to rescale every predictor to values between 0 and 1. With k = 1, the mean() function gives a prediction accuracy of 87.16%. Increasing k can yield better accuracy: setting k = 5 improves the accuracy from 87.16% to 88.09%.
#K-Nearest Neighbors
library(class)
set.seed(1)
##normalization function
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }
##run normalization on the first 30 columns of the dataset because they are the predictors
data_norm <- as.data.frame(lapply(data[, c(1:30)], nor))
head(data_norm)
##   will        gas deal      meter     please  attached        day only http
## 1    0 0.00000000 0.00 0.00000000 0.00000000 0.0000000 0.00000000    0    0
## 2    0 0.03448276 0.00 0.00000000 0.16666667 0.1428571 0.07692308    0    0
## 3    0 0.06896552 0.00 0.00000000 0.00000000 0.0000000 0.00000000    0    0
## 4    0 0.00000000 0.08 0.03448276 0.00000000 0.0000000 0.03846154    0    0
## 5    0 0.06896552 0.00 0.10344828 0.08333333 0.0000000 0.00000000    0    0
## 6    0 0.00000000 0.00 0.00000000 0.00000000 0.0000000 0.00000000    0    0
##       volume  contract texas        nom questions change thanks money rate best
## 1 0.00000000 0.0000000     0 0.00000000         0      0  0.000     0    0    0
## 2 0.00000000 0.0000000     0 0.11764706         0      0  0.125     0    0    0
## 3 0.00000000 0.0000000     0 0.11764706         0      0  0.000     0    0    0
## 4 0.04761905 0.0000000     0 0.00000000         0      0  0.125     0    0    0
## 5 0.14285714 0.1333333     0 0.05882353         0      0  0.125     0    0    0
## 6 0.00000000 0.0000000     0 0.00000000         0      0  0.000     0    0    0
##   houston feb prices      move neon        nor         dr          z        gra
## 1       0   0      0 0.0000000    0 0.00000000 0.00000000 0.00000000 0.00000000
## 2       0   0      0 0.0000000    0 0.03703704 0.02222222 0.01204819 0.03030303
## 3       0   0      0 0.0000000    0 0.00000000 0.00000000 0.00000000 0.00000000
## 4       0   0      0 0.0000000    0 0.00000000 0.00000000 0.00000000 0.00000000
## 5       0   0      0 0.1666667    0 0.00000000 0.00000000 0.00000000 0.00000000
## 6       0   0      0 0.0000000    0 0.03703704 0.00000000 0.00000000 0.00000000
##          rev alt
## 1 0.00000000   0
## 2 0.06666667   0
## 3 0.00000000   0
## 4 0.00000000   0
## 5 0.00000000   0
## 6 0.00000000   0
train2 <- sample(1:nrow(data), 0.75 * nrow(data))
## extract training set
data_train <- data_norm[train2, ]
## extract testing set
data_test <- data_norm[-train2, ]
## extract the 31st column of the training data because it is used as the 'cl' argument in the knn function
data_target_category <- data[train2, 31, drop = TRUE]
## extract the 31st column of the test data to measure the accuracy
data_test_category <- data[-train2, 31, drop = TRUE]
set.seed(1)
## run knn function
knn.pred <- knn(train = data_train, test = data_test, cl = data_target_category, k = 1)
## create confusion matrix
table(knn.pred, data_test_category)
##          data_test_category
## knn.pred  non-spam spam
##  non-spam      826   58
##  spam          108  301
knn_1 = mean(knn.pred == data_test_category)
knn_1
## [1] 0.8716164
set.seed(1)
## run knn function
knn.pred <- knn(data_train, data_test, cl = data_target_category, k = 5)
## create confusion matrix
table(knn.pred, data_test_category)
##          data_test_category
## knn.pred  non-spam spam
##  non-spam      831   51
##  spam          103  308
knn_5 = mean(knn.pred == data_test_category)
knn_5
## [1] 0.8808971
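To check whether accuracy keeps improving beyond k = 5, a hedged sketch (the grid of k values is illustrative and was not part of the original report) sweeps several choices of k on the same split:

# Sketch: evaluate a small grid of k values with the existing train/test split.
set.seed(1)
ks <- c(1, 3, 5, 7, 9, 15)
acc <- sapply(ks, function(k) {
  pred <- knn(data_train, data_test, cl = data_target_category, k = k)
  mean(pred == data_test_category)
})
data.frame(k = ks, accuracy = acc)

Very large k would eventually oversmooth the decision boundary and pull the accuracy back down toward the majority-class rate.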
Comparing Logistic Regression, LDA and KNN Results

I compare the performance of all the models used so far. The table below shows the test accuracy of each model from this section.
### Comparing Logistic regression, LDA and KNN result
data.frame(Logistic = glm_correctness, LDA = lda_correctness, Knn1 = knn_1, Knn5 = knn_5)
##    Logistic       LDA      Knn1      Knn5
## 1 0.9050657 0.8167053 0.8716164 0.8808971
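A quick bar plot (a sketch, not part of the original report) makes the comparison easier to scan:

# Sketch: visualize the four test accuracies side by side.
acc_all <- c(Logistic = glm_correctness, LDA = lda_correctness,
             Knn1 = knn_1, Knn5 = knn_5)
barplot(acc_all, ylim = c(0, 1), ylab = "Test accuracy",
        main = "Model comparison")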
Cross Validation

The following section discusses different cross-validation techniques.

Validation Set Approach

To carry out the validation set approach on my dataset, I use set.seed() so that the random split is reproducible. I randomly divide the observations into a training set of 2586 emails and a validation set of 2586 emails. Then I train a logistic regression model on the training set.
### Validation set Approach
set.seed(3)
smp_siz = floor(0.5 * nrow(data))
train_ind = sample(seq_len(nrow(data)), size = smp_siz)
trainVali = data[train_ind, ]
validationSet = data[-train_ind, ]
train_model = glm(formula = Prediction ~ ., family = binomial, data = trainVali)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
I run the predict() function on the validation set using the trained model and observe the validation set error rate.
glm.probs = predict(train_model, validationSet, type = "response")
glm.pred = rep("non-spam", 2586)
glm.pred[glm.probs > .5] = "spam"
tlb = table(glm.pred, validationSet$Prediction)
mean(glm.pred != validationSet$Prediction)
## [1] 0.0962877
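The validation set error rate for this split is 9.63%. Because this estimate depends on the particular random division, a hedged sketch (seeds 1 to 5 are illustrative; it assumes Prediction already carries the non-spam/spam labels used above) repeats the split a few times to show the variability:

# Sketch: refit and re-evaluate over several random 50/50 splits to see
# how much the validation-set error rate moves from split to split.
errs <- sapply(1:5, function(s) {
  set.seed(s)
  idx  <- sample(seq_len(nrow(data)), size = floor(0.5 * nrow(data)))
  fit  <- glm(Prediction ~ ., family = binomial, data = data[idx, ])
  prob <- predict(fit, data[-idx, ], type = "response")
  pred <- ifelse(prob > 0.5, "spam", "non-spam")
  mean(pred != data[-idx, ]$Prediction)
})
errs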
Next, I use the poly() function to estimate the validation error rate for quadratic and cubic classification fits.
# poly() function to estimate the validation error rate for the quadratic fit
# quadratic
train_model2 = glm(Prediction ~ poly((will + gas + deal + meter + please + attached +
    day + only + http + volume + contract + texas + nom + questions + change +
    thanks + money + rate + best + houston + feb + prices + move + neon + nor +
    dr + z + gra + rev + alt), 2), family = binomial, data = trainVali)
glm.probs2 = predict(train_model2, validationSet, type = "response")
glm.pred2 = rep("non-spam", 2586)
glm.pred2[glm.probs2 > .5] = "spam"
tlb2 = table(glm.pred2, validationSet$Prediction)
mean(glm.pred2 != validationSet$Prediction)
## [1] 0.287703
The validation set error rate for quadratic classification is 28.77%.
# cubic
train_model3 = glm(Prediction ~ poly((will + gas + deal + meter + please + attached +
    day + only + http + volume + contract + texas + nom + questions + change +
    thanks + money + rate + best + houston + feb + prices + move + neon + nor +
    dr + z + gra + rev + alt), 3), family = binomial, data = trainVali)
glm.probs3 = predict(train_model3, validationSet, type = "response")
glm.pred3 = rep("non-spame", 2586)
glm.pred3[glm.probs3 > .5] = "spam"
tlb3 = table(glm.pred3, validationSet$Prediction)
mean(glm.pred3 != validationSet$Prediction)
## [1] 0.996133
The validation set error rate for the cubic fit comes out as 99.61%, but note that this number is inflated by a label typo: the non-spam predictions are filled in as "non-spame", which never matches the actual "non-spam" label, so nearly every non-spam email is counted as an error regardless of the fit.

Leave-One-Out Cross-Validation (LOOCV)

I use the cv.glm() function from the boot library in R to carry out Leave-One-Out Cross-Validation. I run LOOCV for polynomial fits of increasing complexity, using a loop to iteratively fit polynomial regressions of order i = 1 to i = 3.
### LOOCV
library(boot)
# Cost function for a binary classifier suggested by the boot package
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.error = rep(0, 3)
for (i in 1:3) {
  model = glm(Prediction ~ poly((will + gas + deal + meter + please + attached +
      day + only + http + volume + contract + texas + nom + questions + change +
      thanks + money + rate + best + houston + feb + prices + move + neon + nor +
      dr + z + gra + rev + alt), i), family = binomial, data = data)
  cv.error[i] = cv.glm(data, model, cost = cost)$delta[1]
}
cv.error
## [1] 0.2900232 0.2906032 0.2913766
# Histogram of the Test Error
hist(cv.error, xlab = 'Test Error', ylab = 'Freq', main = 'Test Error LOOCV',
     col = 'cyan', border = 'blue', density = 30)
[Figure: histogram of the LOOCV test errors]
Using the Leave-One-Out cross-validation approach, the error rates for the models with linear, quadratic and cubic terms are 0.2900232, 0.2906032 and 0.2913766 respectively.

K-fold Cross-Validation

I use the cv.glm() function to implement k-fold CV, setting K = 5 first and then K = 10. For each setting, I get CV errors corresponding to polynomial fits of orders 1 to 5.
## K-fold Cross-Validation
# K = 5
set.seed(17)
cv.error.5 = rep(0, 5)
for (i in 1:5) {
  model = glm(Prediction ~ poly((will + gas + deal + meter + please + attached +
      day + only + http + volume + contract + texas + nom + questions + change +
      thanks + money + rate + best + houston + feb + prices + move + neon + nor +
      dr + z + gra + rev + alt), i), family = binomial, data = data)
  cv.error.5[i] = cv.glm(data, model, K = 5, cost = cost)$delta[1]
}
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cv.error.5
## [1] 0.2900232 0.2907966 0.2919567 0.3604022 0.2900232
# Histogram of the Test Error
hist(cv.error.5, xlab = 'Test Error', ylab = 'Freq', main = 'Test Error 5-fold CV',
     col = 'cyan', border = 'blue', density = 30)
[Figure: histogram of the 5-fold CV test errors]
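Since the CV error is indexed by polynomial degree, a line plot (a sketch, not part of the original report) can be more informative than a histogram:

# Sketch: plot CV error against polynomial degree for the 5-fold run.
plot(1:5, cv.error.5, type = "b", xlab = "Degree of polynomial",
     ylab = "5-fold CV error", main = "CV error by degree")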
The obtained error rates for polynomials of order i = 1 to i = 5 are 0.2900232, 0.2907966, 0.2919567, 0.3604022, and 0.2900232 respectively. Then I compute the k-fold cross-validation with k = 10.
# K = 10
set.seed(17)
cv.error.10 = rep(0, 5)
for (i in 1:5) {
  model = glm(Prediction ~ poly((will + gas + deal + meter + please + attached +
      day + only + http + volume + contract + texas + nom + questions + change +
      thanks + money + rate + best + houston + feb + prices + move + neon + nor +
      dr + z + gra + rev + alt), i), family = binomial, data = data)
  cv.error.10[i] = cv.glm(data, model, K = 10, cost = cost)$delta[1]
}
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cv.error.10
## [1] 0.2900232 0.2904099 0.2909899 0.3766435 0.2900232
# Histogram of the Test Error
hist(cv.error.10, xlab = 'Test Error', ylab = 'Freq', main = 'Test Error 10-fold CV',
     col = 'cyan', border = 'blue', density = 30)
[Figure: histogram of the 10-fold CV test errors]
The error rates for polynomials of order i = 1 to i = 5 are 0.2900232, 0.2904099, 0.2909899, 0.3766435, and 0.2900232.

Bootstrap

I use the "boot" resampling method within caret's trainControl() function to estimate the model's accuracy, and then use the boot() function from the boot library to estimate the variability of the logistic regression coefficients.
### Bootstrap
# define training control
train_control <- trainControl(method = "boot", number = 100)
# train the model
model <- train(Prediction ~ ., data = data, trControl = train_control, method = "glm")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# summarize results
print(model)
## Generalized Linear Model
##
## 5172 samples
##   30 predictor
##    2 classes: 'non-spam', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 5172, 5172, 5172, 5172, 5172, 5172, ...
## Resampling results:
##
##   Accuracy   Kappa
##   0.9000618  0.7553677
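The train object also keeps the accuracy from each of the 100 bootstrap resamples, so the spread of the estimate can be inspected; a minimal sketch, assuming the fitted model object from above:

# Sketch: look at the per-resample accuracies stored by caret.
head(model$resample)
sd(model$resample$Accuracy)  # bootstrap spread of the accuracy estimate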
set.seed(1)
boot.fn = function(data, index) {
  model = glm(Prediction ~ ., family = binomial, data = data, subset = index)
  return(coef(model))
}
boot.fn(data, sample(100, 100, replace = T))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (Intercept)        will         gas        deal       meter      please
##  -1.0986123  -0.4859151  -2.8877542  -6.9200400  -7.4100141  19.2179618
##    attached         day        only        http      volume    contract
## -12.5209255  -9.7188460 -53.4317987  -3.0303465  13.4117469   3.4364595
##       texas         nom   questions      change      thanks       money
## -13.3013765  -9.2058518 -37.5805371  20.4434598 -40.0285383  95.6382930
##        rate        best     houston         feb      prices        move
## 180.8070450 -17.5225854 -11.6866038          NA          NA  59.7476567
##        neon         nor          dr           z         gra         rev
##          NA  22.2125313  28.9927927   9.8944433  -4.1235508 -42.1576960
##          alt
## -147.6270061

Several coefficients (feb, prices, neon) come back as NA here, most likely because this call resamples only 100 rows (indices drawn from 1 to 100), and within such a small resample these sparse word-count columns are constant or collinear, so glm() cannot estimate them.
boot(data,boot.fn,1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 
0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted 
probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: 
glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 
occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities 
numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
##          original          bias     std. error
## t1*  -0.012790682  4.874065e+10   2.943937e+13
## t2*  -0.447671952 -2.159129e+11   4.829783e+12
## t3*  -0.242691638 -2.598755e+11   5.905677e+12
## t4*  -0.748635159 -2.649525e+11   5.956364e+12
## t5*  -1.308690940 -2.651894e+11   5.982503e+12
## t6*  -0.414645830 -3.156910e+11   7.227420e+12
## t7*  -4.714749568 -2.616930e+12   6.112282e+13
## t8*  -0.338068490 -2.282863e+11   5.122801e+12
## t9*   0.959794909  5.195576e+11   1.162567e+13
## t10*  0.974065996  5.269677e+11   1.198293e+13
## t11* -1.223554480 -3.855077e+11   8.616480e+12
## t12* -0.297113255 -1.279852e+11   3.862272e+12
## t13* -1.204828464 -1.464947e+12   3.279777e+13
## t14* -0.369976405 -2.310518e+11   6.708924e+12
## t15* -2.097083798 -1.085542e+12   2.460433e+13
## t16* -0.903930148 -5.167748e+11   1.157921e+13
## t17* -1.994110595 -8.870422e+11   2.101658e+13
## t18*  2.555173279  1.159264e+12   2.736658e+13
## t19*  0.303495513  1.771778e+11   4.280029e+12
## t20*  1.298562255  6.732227e+11   1.516689e+13
## t21* -2.799894081 -1.186080e+12   2.653094e+13
## t22* -1.423951196 -7.040938e+11   1.574437e+13
## t23*  1.616134451  6.374309e+11   1.548012e+13
## t24*  1.337747808  8.540582e+11   1.979485e+13
## t25* -2.182367358 -2.085566e+12   4.938785e+13
## t26* -0.002473625  8.097286e+10   1.810405e+12
## t27*  0.338413967  1.871934e+11   4.231227e+12
## t28*  0.221679387  1.297736e+11   3.192075e+12
## t29*  0.443850360  3.340825e+11   7.466874e+12
## t30*  0.138100274  1.834406e+10   4.131979e+11
## t31*  0.453868867  2.363850e+11   5.702862e+12
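These warnings indicate quasi-complete separation: on many bootstrap samples the logistic fit drives fitted probabilities to exactly 0 or 1, so the refits are unstable and the enormous bias and standard-error estimates above should not be taken at face value. For reference, here is a minimal sketch of the kind of statistic function that produces this output (the actual boot.fn was defined earlier in the project; this sketch assumes it simply returns the coefficients of the full logistic fit):

library(boot)
# Sketch: refit the logistic model on each bootstrap sample and return
# its coefficients (intercept plus 30 predictors); assumes Prediction is
# coded 0/1 or as a factor
boot.fn <- function(data, index) {
  coef(glm(Prediction ~ ., data = data[index, ], family = binomial))
}
set.seed(1)
boot(data = data, statistic = boot.fn, R = 1000)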
Model Selection and Regularization

Subset Selection

Best Subset Selection

I apply the best subset selection approach to the data set, using the regsubsets() function from the leaps package (the bestglm package offers the same search for GLMs and itself depends on leaps). regsubsets() performs a least-squares subset search, so here it serves as a screen for the logistic regression model used throughout. In its output, an asterisk indicates that a given variable is included in the corresponding model: the best one-variable model contains thanks, the best two-variable model contains thanks and dr, the best three-variable model contains thanks, dr and attached, and so on. By default regsubsets() reports models of up to eight variables, so I then run the procedure again with nvmax set to 30; the nvmax parameter sets the maximum number of variables the function is allowed to consider.
###Model Selection and Regularization
#Best subset selection
library(leaps)
regfit.full = regsubsets(Prediction ~ ., data)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Prediction ~ ., data)
## 30 Variables (and intercept)
##           Forced in Forced out
## will          FALSE      FALSE
## gas           FALSE      FALSE
## deal          FALSE      FALSE
## meter         FALSE      FALSE
## please        FALSE      FALSE
## attached      FALSE      FALSE
## day           FALSE      FALSE
## only          FALSE      FALSE
## http          FALSE      FALSE
## volume        FALSE      FALSE
## contract      FALSE      FALSE
## texas         FALSE      FALSE
## nom           FALSE      FALSE
## questions     FALSE      FALSE
## change        FALSE      FALSE
## thanks        FALSE      FALSE
## money         FALSE      FALSE
## rate          FALSE      FALSE
## best          FALSE      FALSE
## houston       FALSE      FALSE
## feb           FALSE      FALSE
## prices        FALSE      FALSE
## move          FALSE      FALSE
## neon          FALSE      FALSE
## nor           FALSE      FALSE
## dr            FALSE      FALSE
## z             FALSE      FALSE
## gra           FALSE      FALSE
## rev           FALSE      FALSE
## alt           FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          will gas deal meter please attached day only http volume contract
## 1  ( 1 ) " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "
## 2  ( 1 ) " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "
## 3  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "
## 4  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "
## 5  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "
## 6  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "
## 7  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "
## 8  ( 1 ) "*"  " " " "  " "   " "    "*"      " " " "  " "  " "    " "
##          texas nom questions change thanks money rate best houston feb prices
## 1  ( 1 ) " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "
## 2  ( 1 ) " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "
## 3  ( 1 ) " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "
## 4  ( 1 ) " "   " " " "       " "    "*"    " "   " "  "*"  " "     " " " "
## 5  ( 1 ) " "   "*" " "       " "    "*"    " "   " "  "*"  " "     " " " "
## 6  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "
## 7  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"
## 8  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"
##          move neon nor dr  z   gra rev alt
## 1  ( 1 ) " "  " "  " " " " " " " " " " " "
## 2  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 3  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 4  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 5  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 6  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 7  ( 1 ) " "  " "  " " " " "*" " " " " " "
## 8  ( 1 ) " "  " "  " " " " "*" " " " " " "
library(leaps)
regfit.full = regsubsets(Prediction ~ ., data, nvmax = 30)
summary(regfit.full)
## Subset selection object ## Call: regsubsets.formula(Prediction ~ ., data, nvmax = 30) ## 30 Variables (and intercept) ## Forced in Forced out ## will FALSE FALSE ## gas FALSE FALSE ## deal FALSE FALSE ## meter FALSE FALSE ## please FALSE FALSE ## attached FALSE FALSE ## day FALSE FALSE ## only FALSE FALSE ## http FALSE FALSE ## volume FALSE FALSE ## contract FALSE FALSE ## texas FALSE FALSE ## nom FALSE FALSE ## questions FALSE FALSE ## change FALSE FALSE ## thanks FALSE FALSE ## money FALSE FALSE ## rate FALSE FALSE ## best FALSE FALSE ## houston FALSE FALSE ## feb FALSE FALSE ## prices FALSE FALSE ## move FALSE FALSE ## neon FALSE FALSE ## nor FALSE FALSE ## dr FALSE FALSE ## z FALSE FALSE ## gra FALSE FALSE ## rev FALSE FALSE ## alt FALSE FALSE ## 1 subsets of each size up to 30 ## Selection Algorithm: exhaustive ## will gas deal meter please attached day only http volume contract ## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 4 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 5 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 6 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 7 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 8 ( 1 ) "*" " " " " " " " " "*" " " " " " " " " " " ## 9 ( 1 ) "*" " " " " " " " " "*" " " " " " " " " " " ## 10 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 11 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 12 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 13 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 14 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 15 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 16 ( 1 ) "*" " " "*" " " " " "*" "*" "*" " " " " " " ## 17 ( 1 ) "*" " " "*" " " " " "*" "*" "*" " " " " " " ## 18 ( 1 ) "*" " " "*" " " " " "*" "*" "*" "*" " " " " ## 19 ( 1 ) "*" " " "*" " " " " "*" "*" "*" "*" "*" " " ## 20 ( 1 ) "*" " " "*" " " " " "*" "*" "*" "*" "*" " " ## 21 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" " " ## 22 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" " " ## 23 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" " " ## 24 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*" ## 25 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*" ## 26 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 27 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 28 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 29 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## texas nom questions change thanks money rate best houston feb prices ## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " " ## 4 ( 1 ) " " " " " " " " "*" " " " " "*" " " " " " " ## 5 ( 1 ) " " "*" " " " " "*" " " " " "*" " " " " " " ## 6 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 7 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" ## 8 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" ## 9 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" ## 10 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" ## 11 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 12 ( 1 ) " " "*" "*" " " "*" "*" " " "*" " " " " " " ## 13 ( 1 ) " " "*" "*" " " "*" "*" " " "*" " " " " " " ## 14 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " " " ## 15 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 16 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 17 ( 1 ) "*" "*" "*" 
" " "*" "*" " " "*" " " " " "*" ## 18 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 19 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 20 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 21 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 22 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " "*" "*" ## 23 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 24 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 25 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 26 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 27 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 28 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" ## 29 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" ## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## move neon nor dr z gra rev alt ## 1 ( 1 ) " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " "*" " " " " " " " " ## 3 ( 1 ) " " " " " " "*" " " " " " " " " ## 4 ( 1 ) " " " " " " "*" " " " " " " " " ## 5 ( 1 ) " " " " " " "*" " " " " " " " " ## 6 ( 1 ) " " " " " " "*" " " " " " " " " ## 7 ( 1 ) " " " " " " " " "*" " " " " " " ## 8 ( 1 ) " " " " " " " " "*" " " " " " " ## 9 ( 1 ) " " " " " " " " "*" "*" " " " " ## 10 ( 1 ) " " " " " " " " "*" " " " " " " ## 11 ( 1 ) "*" " " " " " " "*" " " " " "*" ## 12 ( 1 ) "*" " " " " " " "*" " " " " "*" ## 13 ( 1 ) "*" " " " " " " "*" "*" " " "*" ## 14 ( 1 ) "*" " " " " " " "*" "*" " " "*" ## 15 ( 1 ) "*" " " " " " " "*" "*" " " "*" ## 16 ( 1 ) "*" " " " " " " "*" "*" " " "*" ## 17 ( 1 ) "*" "*" " " " " "*" "*" " " "*" ## 18 ( 1 ) "*" "*" " " " " "*" "*" " " "*" ## 19 ( 1 ) "*" "*" " " " " "*" "*" " " "*" ## 20 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 21 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 22 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 23 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 24 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 25 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 26 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 27 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 28 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 29 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" ## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
names(summary(regfit.full))
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary = summary(regfit.full)
par(mfrow = c(2, 2))
which.min(reg.summary$bic)
## [1] 22
The summary() function returns R2, RSS, adjusted R2, Cp, and BIC. Above, which.min() shows that BIC is smallest for the 22-variable model. Below I plot the BIC for all of the models and mark the model with the smallest BIC in red.
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(22, reg.summary$bic[22], col = "red", cex = 2, pch = 20)
[Figure: BIC against the number of variables; the minimum at 22 variables is marked in red.]
plot(regfit.full, scale = "bic")
[Figure: regsubsets selection plot with models ranked by BIC; filled squares mark the variables included in each model.]
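BIC is only one of the criteria returned by summary(); as a quick cross-check (a small sketch reusing the reg.summary object created above), the model sizes favored by the other criteria can be compared:

which.max(reg.summary$adjr2) # adjusted R2: larger is better
which.min(reg.summary$cp)    # Mallows' Cp: smaller is better
which.min(reg.summary$bic)   # BIC: smaller is better (22, as found above)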
The best overall model will contain only the twenty-two variables shown above. I test this by fitting a logistic regression model with only these twenty-two variables and checking the performance of this model.
coef(regfit.full,22)
## (Intercept)        will        deal      please    attached         day
##  1.36228332 -0.02256815 -0.02436799 -0.02145805 -0.12421438 -0.01910964
##        only        http      volume       texas         nom   questions
##  0.08194146  0.01519859 -0.02212577 -0.04741141 -0.04759202 -0.06310902
##      thanks       money        best         feb      prices        move
## -0.06902932  0.17505928  0.17712165 -0.02970396  0.10702211  0.08388627
##        neon          dr           z         gra         alt
## -0.12764599  0.02046628  0.01068268  0.02097053  0.05934660
glm.fits3 = glm(Prediction ~ will + deal + please + attached + day + only +
                  http + volume + texas + nom + questions + thanks + money +
                  best + feb + prices + move + neon + dr + z + gra + alt,
                family = binomial, data = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.probs3 = predict(glm.fits3, test, type = "response")
glm.pred3 = rep("non-spam", nrow(test))
glm.pred3[glm.probs3 > .5] = "spam"
table(glm.pred3, test$Prediction)
##
## glm.pred3  non-spam spam
##   non-spam     2782 1130
##   spam          890  370
mean(glm.pred3==test$Prediction)
## [1] 0.8994586
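The caret package (loaded at the start of the project) can summarize the same predictions with additional statistics such as sensitivity and specificity. A minimal sketch, assuming test$Prediction carries the same "non-spam"/"spam" labels as glm.pred3:

library(caret)
# Sketch: fuller diagnostics for the same set of predictions
confusionMatrix(factor(glm.pred3, levels = c("non-spam", "spam")),
                factor(test$Prediction, levels = c("non-spam", "spam")))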
The performance of the model using only twenty-two variables is better than that of the model using all thirty; I observe roughly 90% accuracy.

Stepwise Selection

Forward Stepwise Selection

The regsubsets() function can also be used for forward stepwise selection, by adding the parameter method = "forward".
##Stepwise Selection
#Forward Stepwise Selection
regfit.full = regsubsets(Prediction ~ ., data, nvmax = 30, method = "forward")
summary(regfit.full)
## Subset selection object ## Call: regsubsets.formula(Prediction ~ ., data, nvmax = 30, method = "forward") ## 30 Variables (and intercept) ## Forced in Forced out ## will FALSE FALSE ## gas FALSE FALSE ## deal FALSE FALSE ## meter FALSE FALSE ## please FALSE FALSE ## attached FALSE FALSE ## day FALSE FALSE ## only FALSE FALSE ## http FALSE FALSE ## volume FALSE FALSE ## contract FALSE FALSE ## texas FALSE FALSE ## nom FALSE FALSE ## questions FALSE FALSE ## change FALSE FALSE ## thanks FALSE FALSE ## money FALSE FALSE ## rate FALSE FALSE ## best FALSE FALSE ## houston FALSE FALSE ## feb FALSE FALSE ## prices FALSE FALSE ## move FALSE FALSE ## neon FALSE FALSE ## nor FALSE FALSE ## dr FALSE FALSE ## z FALSE FALSE ## gra FALSE FALSE ## rev FALSE FALSE ## alt FALSE FALSE ## 1 subsets of each size up to 30 ## Selection Algorithm: forward ## will gas deal meter please attached day only http volume contract ## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 4 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 5 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 6 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 7 ( 1 ) "*" " " " " " " " " "*" " " " " " " " " " " ## 8 ( 1 ) "*" " " " " " " " " "*" " " " " " " " " " " ## 9 ( 1 ) "*" " " " " " " " " "*" " " " " " " " " " " ## 10 ( 1 ) "*" " " "*" " " " " "*" " " " " " " " " " " ## 11 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 12 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 13 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 14 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 15 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 16 ( 1 ) "*" " " "*" " " " " "*" "*" "*" " " " " " " ## 17 ( 1 ) "*" " " "*" " " " " "*" "*" "*" " " " " " " ## 18 ( 1 ) "*" " " "*" " " " " "*" "*" "*" "*" " " " " ## 19 ( 1 ) "*" " " "*" " " " " "*" "*" "*" "*" " " " " ## 20 ( 1 ) "*" " " "*" " " " " "*" "*" "*" "*" "*" " " ## 21 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" " " ## 22 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" " " ## 23 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" " " ## 24 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*" ## 25 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*" ## 26 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 27 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 28 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 29 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## texas nom questions change thanks money rate best houston feb prices ## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " " ## 4 ( 1 ) " " " " " " " " "*" " " " " "*" " " " " " " ## 5 ( 1 ) " " "*" " " " " "*" " " " " "*" " " " " " " ## 6 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 7 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 8 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" ## 9 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" ## 10 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" ## 11 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" ## 12 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" ## 13 ( 1 ) "*" "*" " " " " "*" "*" " " "*" " " " " "*" ## 14 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 15 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 16 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 17 ( 
1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 18 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 19 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 20 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 21 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 22 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " "*" "*" ## 23 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 24 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 25 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 26 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 27 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 28 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" ## 29 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" ## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## move neon nor dr z gra rev alt ## 1 ( 1 ) " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " "*" " " " " " " " " ## 3 ( 1 ) " " " " " " "*" " " " " " " " " ## 4 ( 1 ) " " " " " " "*" " " " " " " " " ## 5 ( 1 ) " " " " " " "*" " " " " " " " " ## 6 ( 1 ) " " " " " " "*" " " " " " " " " ## 7 ( 1 ) " " " " " " "*" " " " " " " " " ## 8 ( 1 ) " " " " " " "*" " " " " " " " " ## 9 ( 1 ) " " " " " " "*" "*" " " " " " " ## 10 ( 1 ) " " " " " " "*" "*" " " " " " " ## 11 ( 1 ) " " " " " " "*" "*" " " " " " " ## 12 ( 1 ) " " " " " " "*" "*" " " " " "*" ## 13 ( 1 ) " " " " " " "*" "*" " " " " "*" ## 14 ( 1 ) " " " " " " "*" "*" " " " " "*" ## 15 ( 1 ) "*" " " " " "*" "*" " " " " "*" ## 16 ( 1 ) "*" " " " " "*" "*" " " " " "*" ## 17 ( 1 ) "*" "*" " " "*" "*" " " " " "*" ## 18 ( 1 ) "*" "*" " " "*" "*" " " " " "*" ## 19 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 20 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 21 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 22 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 23 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 24 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 25 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 26 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 27 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 28 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 29 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" ## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
Following the above forward stepwise selection, the best one-variable model contains only thanks, the best two-variable model contains thanks and dr, and so on. Given the nature of our data, the forward stepwise results are similar to those from best subset selection.

Backward Stepwise Selection

The regsubsets() function can also be used for backward stepwise selection, by adding the parameter method = "backward".
#Backward Stepwise Selection
#The regsubsets() function can also be used for backward stepwise selection,
#by adding the parameter method = "backward".
regfit.full = regsubsets(Prediction ~ ., data, nvmax = 30, method = "backward")
summary(regfit.full)
## Subset selection object ## Call: regsubsets.formula(Prediction ~ ., data, nvmax = 30, method = "backward") ## 30 Variables (and intercept) ## Forced in Forced out ## will FALSE FALSE ## gas FALSE FALSE ## deal FALSE FALSE ## meter FALSE FALSE ## please FALSE FALSE ## attached FALSE FALSE ## day FALSE FALSE ## only FALSE FALSE ## http FALSE FALSE ## volume FALSE FALSE ## contract FALSE FALSE ## texas FALSE FALSE ## nom FALSE FALSE ## questions FALSE FALSE ## change FALSE FALSE ## thanks FALSE FALSE ## money FALSE FALSE ## rate FALSE FALSE ## best FALSE FALSE ## houston FALSE FALSE ## feb FALSE FALSE ## prices FALSE FALSE ## move FALSE FALSE ## neon FALSE FALSE ## nor FALSE FALSE ## dr FALSE FALSE ## z FALSE FALSE ## gra FALSE FALSE ## rev FALSE FALSE ## alt FALSE FALSE ## 1 subsets of each size up to 30 ## Selection Algorithm: backward ## will gas deal meter please attached day only http volume contract ## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " ## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " ## 5 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " ## 6 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " ## 7 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " " " ## 8 ( 1 ) "*" " " " " " " " " "*" " " "*" " " " " " " ## 9 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 10 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 11 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 12 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 13 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 14 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 15 ( 1 ) "*" " " "*" " " " " "*" " " "*" " " " " " " ## 16 ( 1 ) "*" " " "*" " " " " "*" "*" "*" " " " " " " ## 17 ( 1 ) "*" " " "*" " " " " "*" "*" "*" " " " " " " ## 18 ( 1 ) "*" " " "*" " " " " "*" "*" "*" "*" " " " " ## 19 ( 1 ) "*" " " "*" " " " " "*" "*" "*" "*" "*" " " ## 20 ( 1 ) "*" " " "*" " " " " "*" "*" "*" "*" "*" " " ## 21 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" " " ## 22 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" " " ## 23 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" " " ## 24 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*" ## 25 ( 1 ) "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*" ## 26 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 27 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 28 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 29 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## texas nom questions change thanks money rate best houston feb prices ## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " "*" "*" " " " " " " " " " " ## 3 ( 1 ) " " "*" " " " " "*" "*" " " " " " " " " " " ## 4 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 5 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 6 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 7 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 8 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 9 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 10 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 11 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " " " ## 12 ( 1 ) " " "*" "*" " " "*" "*" " " "*" " " " " " " ## 13 ( 1 ) " " "*" "*" " " "*" "*" " " "*" " " " " " " ## 14 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " " " ## 15 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 16 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 17 
( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 18 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 19 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 20 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 21 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " " "*" ## 22 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " "*" "*" ## 23 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 24 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 25 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 26 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 27 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*" ## 28 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" ## 29 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" ## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" ## move neon nor dr z gra rev alt ## 1 ( 1 ) " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " " " " " " " " " ## 4 ( 1 ) " " " " " " " " " " " " " " " " ## 5 ( 1 ) " " " " " " " " "*" " " " " " " ## 6 ( 1 ) " " " " " " " " "*" " " " " " " ## 7 ( 1 ) " " " " " " " " "*" " " " " " " ## 8 ( 1 ) " " " " " " " " "*" " " " " " " ## 9 ( 1 ) " " " " " " " " "*" " " " " " " ## 10 ( 1 ) "*" " " " " " " "*" " " " " " " ## 11 ( 1 ) "*" " " " " " " "*" " " " " "*" ## 12 ( 1 ) "*" " " " " " " "*" " " " " "*" ## 13 ( 1 ) "*" " " " " " " "*" "*" " " "*" ## 14 ( 1 ) "*" " " " " " " "*" "*" " " "*" ## 15 ( 1 ) "*" " " " " " " "*" "*" " " "*" ## 16 ( 1 ) "*" " " " " " " "*" "*" " " "*" ## 17 ( 1 ) "*" "*" " " " " "*" "*" " " "*" ## 18 ( 1 ) "*" "*" " " " " "*" "*" " " "*" ## 19 ( 1 ) "*" "*" " " " " "*" "*" " " "*" ## 20 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 21 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 22 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 23 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 24 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" ## 25 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 26 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 27 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 28 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" ## 29 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" ## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
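Since I reuse the name regfit.full for each method, the three searches cannot be compared side by side afterwards. A small sketch (the object names here are illustrative) that keeps each fit and compares, for example, the two-variable models:

regfit.best <- regsubsets(Prediction ~ ., data, nvmax = 30)
regfit.fwd  <- regsubsets(Prediction ~ ., data, nvmax = 30, method = "forward")
regfit.bwd  <- regsubsets(Prediction ~ ., data, nvmax = 30, method = "backward")
# Row k of the 'which' matrix flags the variables in the best k-variable model
summary(regfit.best)$which[2, ]
summary(regfit.fwd)$which[2, ]
summary(regfit.bwd)$which[2, ]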
Following the above backward stepwise selection, the best one-variable model again contains only thanks, matching the previous results, but the best two-variable model contains thanks and money, which differs from them.

Shrinkage Methods

For shrinkage methods I carry out two approaches: ridge regression and the lasso. I use the glmnet R package to perform both.

Ridge regression

I use the glmnet() function in R with the parameter alpha = 0 to fit a ridge regression model, after preparing the data with a few other R functions. I also convert the outcome class to a numerical variable.
##Shrinkage Methods
#For shrinkage methods we carry out two approaches: ridge regression and the lasso.
#We use the glmnet R package to perform both.
#Ridge regression
#We use the glmnet() function with parameter alpha = 0 to fit a ridge regression model.
#We make use of other R functions to prepare the data, and convert the outcome class
#to a numerical variable.
library(caret)
library(glmnet)
library(magrittr)
library(dplyr)
set.seed(123)
training.samples <- data$Prediction %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]
# Find the best lambda using cross-validation
set.seed(123)
# Dummy code categorical predictor variables
x = model.matrix(Prediction ~ ., train.data)[,-1]
# Convert the outcome (class) to a numerical variable
y = ifelse(train.data$Prediction == "spam", 1, 0)
cv.ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial")
# Next, fit the final model on the training data using the lambda chosen above
# and then observe the performance accuracy (alpha = 0 keeps this a ridge fit)
model <- glmnet(x, y, alpha = 0, family = "binomial", lambda = cv.ridge$lambda.min)
# Display regression coefficients
coef(model)
## 31 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) -0.25848956
## will        -0.17880875
## gas         -0.11112661
## deal        -0.30069583
## meter       -0.15164634
## please      -0.21130137
## attached    -1.92224297
## day         -0.13336439
## only         0.33992603
## http         0.30844379
## volume      -0.26657215
## contract     .
## texas       -0.33539089
## nom         -0.44951505
## questions   -0.76333325
## change      -0.14187050
## thanks      -0.97727006
## money        1.11092299
## rate         0.11311596
## best         0.92539465
## houston     -0.03212444
## feb         -0.15531511
## prices       0.80674915
## move         0.44016891
## neon        -0.43365642
## nor          .
## dr           0.16852161
## z            0.05985237
## gra          0.16476605
## rev          .
## alt          0.15862611
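Besides lambda.min, cv.glmnet also stores a more conservative choice, the largest lambda within one standard error of the minimum; the two values can be inspected directly:

cv.ridge$lambda.min # lambda minimizing the cross-validated deviance
cv.ridge$lambda.1se # largest lambda within one standard error of the minimum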
# Make predictions on the test data
x.test <- model.matrix(Prediction ~ ., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "spam", "non-spam")
# Model accuracy
observed.classes <- test.data$Prediction
mean(predicted.classes == observed.classes)
## [1] 0.8220503
From the above procedure, the ridge regression method gives an accuracy of 82.2%. Below I plot the cv.ridge object.
plot(cv.ridge)

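The two penalty values marked by the dashed lines in this plot can also be printed directly from the cv.glmnet object; a minimal sketch (the exact numbers depend on the run):
# Penalty values selected by cross-validation
cv.ridge$lambda.min   # lambda giving the minimum CV error
cv.ridge$lambda.1se   # largest lambda within one SE of the minimum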
The Lasso
The lasso works similarly to ridge regression, except that its penalty is the sum of the absolute values of the coefficients, while the ridge penalty is the sum of their squares: the lasso adds lambda * sum(|beta_j|) to the loss, ridge adds lambda * sum(beta_j^2). I again use glmnet, with the alpha parameter set to 1.
library(magrittr)
library(dplyr)
library(caret)
set.seed(123)
training.samples <- data$Prediction %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]
# Find the best lambda using cross-validation
set.seed(123)
# Dummy-code categorical predictor variables
x <- model.matrix(Prediction~., train.data)[,-1]
# Convert the outcome to a numerical variable
y <- ifelse(train.data$Prediction == "spam", 1, 0)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial", lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(model)
## 31 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept)  0.01809849
## will        -0.40244418
## gas         -0.25990722
## deal        -0.81204520
## meter       -1.02214804
## please      -0.39449267
## attached    -4.41784262
## day         -0.38566539
## only         0.98405024
## http         0.85882689
## volume      -1.12034371
## contract    -0.19715559
## texas       -1.21494405
## nom         -0.43289720
## questions   -1.79032449
## change      -0.94831179
## thanks      -1.79864837
## money        2.20652720
## rate         0.57054676
## best         1.35444913
## houston     -2.55046822
## feb         -1.03920148
## prices       2.16679557
## move         1.17873411
## neon        -1.86161430
## nor         -0.09326442
## dr           0.30134224
## z            0.13584882
## gra          0.37171235
## rev          0.20514793
## alt          0.39997250
# Make predictions on the test data
x.test <- model.matrix(Prediction ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "spam", "non-spam")
# Model accuracy
observed.classes <- test.data$Prediction
mean(predicted.classes == observed.classes)
## [1] 0.8858801
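Beyond the single accuracy number, a confusion table breaks the lasso predictions down by class; a quick sketch using the objects already in scope (output not shown here):
# Class-level breakdown of the lasso predictions
table(Predicted = predicted.classes, Observed = observed.classes)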
# Using the lasso gives an accuracy of 88.58%, better than ridge regression's 82.2%.
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)

The above plot displays the cross-validation error as a function of the log of lambda. The vertical dashed line marks the log of the optimal value of lambda, the one that minimizes the prediction error.
Dimension Reduction Methods
There are two dimension reduction methods I tried in this session: principal components regression and partial least squares.
Principal Components Regression
Principal components regression (PCR) is performed here using the prcomp() function in R.
data.pr <- prcomp(data[c(1:30)], center = TRUE, scale = TRUE)
summary(data.pr)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9687 1.59735 1.22170 1.19692 1.14375 1.10398 1.06592
## Proportion of Variance 0.1292 0.08505 0.04975 0.04775 0.04361 0.04063 0.03787
## Cumulative Proportion  0.1292 0.21424 0.26399 0.31175 0.35535 0.39598 0.43385
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.05711 1.02335 1.01485 1.00306 0.97917 0.97025 0.94265
## Proportion of Variance 0.03725 0.03491 0.03433 0.03354 0.03196 0.03138 0.02962
## Cumulative Proportion  0.47110 0.50601 0.54034 0.57388 0.60584 0.63722 0.66683
##                           PC15    PC16    PC17    PC18    PC19   PC20    PC21
## Standard deviation     0.93975 0.91195 0.88628 0.87834 0.84348 0.8289 0.81349
## Proportion of Variance 0.02944 0.02772 0.02618 0.02572 0.02372 0.0229 0.02206
## Cumulative Proportion  0.69627 0.72399 0.75018 0.77589 0.79961 0.8225 0.84457
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.80394 0.79431 0.77171 0.73795 0.72572 0.69868 0.66898
## Proportion of Variance 0.02154 0.02103 0.01985 0.01815 0.01756 0.01627 0.01492
## Cumulative Proportion  0.86611 0.88714 0.90699 0.92515 0.94270 0.95897 0.97389
##                           PC29   PC30
## Standard deviation     0.64365 0.6074
## Proportion of Variance 0.01381 0.0123
## Cumulative Proportion  0.98770 1.0000
The prcomp() function runs PCA on the predictor variables; I excluded the response variable. Setting center = TRUE and scale = TRUE standardizes the data. The summary() call displays, for each component, the standard deviation (the square root of the eigenvalue), the proportion of variance explained, and the cumulative proportion. With 30 predictors there are 30 principal components, so I plot the eigenvalues of the 30 PCs and the cumulative variance.
# There are 30 components in total, one per predictor
screeplot(data.pr, type = "l", npcs = 30, main = "Screeplot of the 30 PCs")
abline(h = 1, col = "red", lty = 5)
legend("topright", legend = c("Eigenvalue = 1"), col = c("red"), lty = 5, cex = 0.6)

cumpro <- cumsum(data.pr$sdev^2 / sum(data.pr$sdev^2))
# R indexing starts at 1; there are 30 components
plot(cumpro[1:30], xlab = "PC #", ylab = "Amount of explained variance",
     main = "Cumulative variance plot")
# Reference lines at PC11, where the cumulative proportion is 0.57388 per the summary
abline(v = 11, col = "blue", lty = 5)
abline(h = 0.57388, col = "blue", lty = 5)

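The same cumulative proportions give the number of components needed for any variance target; a small sketch (the value 20 follows from the cumulative proportions printed above):
# Smallest number of components explaining at least 80% of the variance
prop.var <- data.pr$sdev^2 / sum(data.pr$sdev^2)
which(cumsum(prop.var) >= 0.80)[1]   # 20, per the summary above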
# Variance shares taken from the summary above: PC1 12.9%, PC2 8.5%
plot(data.pr$x[,1], data.pr$x[,2], xlab = "PC1 (12.9%)", ylab = "PC2 (8.5%)",
     main = "PC1 / PC2 - plot")

# Let us include the response variable in the plot and see how it looks
library("factoextra")
fviz_pca_ind(data.pr,
             geom.ind = "point",
             pointshape = 21,
             pointsize = 2,
             fill.ind = data$Prediction,
             col.ind = "black",
             palette = "jco",
             addEllipses = TRUE,
             label = "var",
             col.var = "black",
             repel = TRUE,
             legend.title = "Prediction") +
  ggtitle("2D PCA-plot from 30 feature dataset") +
  theme(plot.title = element_text(hjust = 0.5))

In the above procedure, I made the same plot as before but colored the points by class label, which makes the PCA easier to read. Using the first two components, I can see some separation between spam and non-spam emails.
Partial Least Squares
To carry out PLS I use the caret package. I start by preprocessing: removing zero-variance predictors and centering and scaling the remaining predictors via the preProc argument. I then train the PLS model and check the cross-validation profile.
set.seed(1)
dataPLS = data
dataPLS$class = factor(dataPLS$Prediction)
# Compile cross-validation settings
set.seed(100)
myfolds <- createMultiFolds(dataPLS$class, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")
# Train the PLS model on the factor outcome: the Accuracy metric requires a
# factor class, and the raw Prediction column is excluded from the predictors
# to avoid leaking the response
mod1 <- train(class ~ . - Prediction, data = dataPLS,
              method = "pls",
              metric = "Accuracy",
              tuneLength = 20,
              trControl = control,
              preProc = c("zv", "center", "scale"))
plot(mod1)

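Before looking at variable importance, the number of PLS components chosen under the one-SE rule can be read off the fitted object; a one-line check (the value depends on the run):
# Number of components selected by the one-SE rule
mod1$bestTune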
plot(varImp(mod1), 25, main = "PLS-DA")

The above procedure displays the 25 most important variables on a relative scale of 0 to 100, with the top variable at 100. Note that the copied class column must be excluded from the predictors, as done above; otherwise it dominates the plot as a leaked copy of the response.
Moving Beyond Linearity
In this session, I focus on generalized additive models only. Other models could also be used here, but I limit the scope to GAMs.
Generalized Additive Models
Generalized additive models (GAMs) provide a framework for extending a standard linear model by allowing a non-linear function of each variable while maintaining additivity. I use the gam() function from the gam package in R to fit a logistic regression model for the probabilities of the binary response. One important feature of GAMs is their ability to model non-linearity in multiple variables at the same time, and they can be combined with other non-linear functions. I run gam() on the data set, fitting the model with smoothing splines.
library(gam)
# Logistic regression model with a smoothing spline (df = 4) on every predictor
logitgam1 <- gam(I(Prediction) ~ s(will, df = 4) + s(gas, df = 4) + s(deal, df = 4) +
                   s(meter, df = 4) + s(please, df = 4) + s(attached, df = 4) +
                   s(day, df = 4) + s(only, df = 4) + s(http, df = 4) +
                   s(volume, df = 4) + s(contract, df = 4) + s(texas, df = 4) +
                   s(nom, df = 4) + s(questions, df = 4) + s(change, df = 4) +
                   s(thanks, df = 4) + s(money, df = 4) + s(rate, df = 4) +
                   s(best, df = 4) + s(houston, df = 4) + s(feb, df = 4) +
                   s(prices, df = 4) + s(move, df = 4) + s(neon, df = 4) +
                   s(nor, df = 4) + s(dr, df = 4) + s(z, df = 4) + s(gra, df = 4) +
                   s(rev, df = 4) + s(alt, df = 4),
                 data = data, family = binomial)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument ## ignored
plot(logitgam1, se = TRUE)
[Figure: smoothing-spline fits with standard-error bands, one panel per predictor (30 plots)]
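Before interpreting the fits, nested GAMs can be compared with a chi-squared test as a formal check; a sketch on two predictors (hypothetical model names, not part of the original analysis, assuming the same response coding as the full fit above):
# Compare a linear term against a smooth term for http
logitgam.lin <- gam(I(Prediction) ~ http + s(thanks, df = 4), data = data, family = binomial)
logitgam.sm  <- gam(I(Prediction) ~ s(http, df = 4) + s(thanks, df = 4), data = data, family = binomial)
anova(logitgam.lin, logitgam.sm, test = "Chisq")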
From the plots above, fitting non-linear functions does not appear necessary for most of the variables.
Tree-Based Methods
Tree-based methods involve segmenting the predictor space into a number of simple regions. Given the nature of the data set, I used tree-based methods for classification.
Fitting Classification Trees
The tree library is used to construct classification trees in R.
library(tree)
# The class labels in the output below imply Prediction is a factor at this
# point; I make the conversion explicit
data$Prediction = factor(data$Prediction)
tree.data = tree(Prediction ~ ., data)
summary(tree.data)
## 
## Classification tree:
## tree(formula = Prediction ~ ., data = data)
## Variables actually used in tree construction:
##  [1] "http"      "thanks"    "attached"  "gas"       "gra"       "nom"      
##  [7] "meter"     "dr"        "money"     "questions"
## Number of terminal nodes:  11 
## Residual mean deviance:  0.6995 = 3610 / 5161 
## Misclassification error rate: 0.1827 = 945 / 5172
The summary() function shows that only 10 of the 30 variables are used in the tree construction: "http", "thanks", "attached", "gas", "gra", "nom", "meter", "dr", "money", and "questions". The tree has 11 terminal nodes, a residual mean deviance of 0.70, and a misclassification error rate of 18.3%. I display the tree graphically below.
plot(tree.data)
text(tree.data, pretty = 0)

I used the plot() function to display the tree structure, and the text() function to display the node labels. The argument pretty = 0 includes the category names of qualitative predictors rather than a letter for each category. The terminal nodes are labelled spam or non-spam. I print the output corresponding to each branch of the tree below.
tree.data
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 5172 6229.00 non-spam ( 0.70998 0.29002 )
##     2) http < 0.5 4555 4848.00 non-spam ( 0.77585 0.22415 )
##       4) thanks < 0.5 3201 3942.00 non-spam ( 0.69416 0.30584 )
##         8) attached < 0.5 2592 3427.00 non-spam ( 0.62577 0.37423 )
##          16) gas < 0.5 2042 2810.00 non-spam ( 0.55093 0.44907 )
##            32) gra < 0.5 1678 2235.00 non-spam ( 0.61621 0.38379 )
##              64) nom < 0.5 1390 1912.00 non-spam ( 0.55180 0.44820 )
##               128) meter < 0.5 1262 1749.00 non-spam ( 0.51189 0.48811 ) *
##               129) meter > 0.5 128   54.30 non-spam ( 0.94531 0.05469 ) *
##              65) nom > 0.5 288  150.40 non-spam ( 0.92708 0.07292 ) *
##            33) gra > 0.5 364  409.40 spam ( 0.25000 0.75000 ) *
##          17) gas > 0.5 550  348.70 non-spam ( 0.90364 0.09636 )
##            34) dr < 1.5 484  191.00 non-spam ( 0.95041 0.04959 ) *
##            35) dr > 1.5 66   90.52 non-spam ( 0.56061 0.43939 ) *
##         9) attached > 0.5 609   93.73 non-spam ( 0.98522 0.01478 ) *
##       5) thanks > 0.5 1354  374.40 non-spam ( 0.96898 0.03102 )
##        10) money < 0.5 1316  255.50 non-spam ( 0.98024 0.01976 ) *
##        11) money > 0.5 38   51.73 non-spam ( 0.57895 0.42105 ) *
##     3) http > 0.5 617  655.90 spam ( 0.22366 0.77634 )
##       6) questions < 0.5 559  503.20 spam ( 0.16637 0.83363 ) *
##       7) questions > 0.5 58   61.72 non-spam ( 0.77586 0.22414 ) *
set.seed(2)
train = sample(1:nrow(data), 3879)
data.test = data[-train, ]
Prediction.test = data$Prediction[-train]
tree.data = tree(Prediction ~ ., data, subset = train)
tree.pred = predict(tree.data, data.test, type = "class")
table(tree.pred, Prediction.test)
##           Prediction.test
## tree.pred  non-spam spam
##   non-spam      737   58
##   spam          172  326
dt = mean(tree.pred == Prediction.test)
dt
## [1] 0.8221191
set.seed(3)
cv.data = cv.tree(tree.data, FUN = prune.misclass)
names(cv.data)
## [1] "size" "dev" "k" "method"
cv.data
## $size
## [1] 14 11 10  6  2  1
## 
## $dev
## [1]  704  704  695  723  868 1116
## 
## $k
## [1]   -Inf   0.00  20.00  27.00  36.25 248.00
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
The tree printout above shows, for each branch, the number of observations, the deviance, the overall prediction (spam or non-spam), and the fraction of observations in that branch taking each class value; branches leading to terminal nodes are marked with asterisks. To evaluate the classification tree, I split the observations into a training set and a test set, build the tree on the training set, and evaluate its performance on the test data with the predict() function. This procedure gave 82.2% correct predictions on the test set. Next, I carry out pruning of the tree to see if it brings an improvement. I use FUN = prune.misclass so that the classification error rate, rather than the default deviance, guides the cross-validation and pruning in cv.tree(). The output above reports the number of terminal nodes of each tree considered (size), the cross-validation error (dev), the value of the cost-complexity parameter (k), and the method, which is misclass. Note that the tree with 10 terminal nodes attains the lowest cross-validation error (695); the 14- and 11-node trees follow at 704 errors each, and the pruning below uses 14 nodes. This choice can also be read off programmatically, as sketched below.
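# Programmatic version of the same choice: the size with the smallest CV error
cv.data$size[which.min(cv.data$dev)]   # 10, per the output above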
par(mfrow = c(1, 2))
plot(cv.data$size, cv.data$dev, type = "b")
plot(cv.data$k, cv.data$dev, type = "b")

prune.data = prune.misclass(tree.data, best = 14)
plot(prune.data)
text(prune.data, pretty = 0)

To find out how well the pruned tree performs on the test data set, I apply the predict() function.
tree.pred2 = predict(prune.data, data.test, type = "class")
table(tree.pred2, Prediction.test)
##            Prediction.test
## tree.pred2 non-spam spam
##   non-spam      737   58
##   spam          172  326
dt_pruned = mean(tree.pred2 == Prediction.test)
dt_pruned
## [1] 0.8221191
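For completeness, the pruned and unpruned test accuracies can be put side by side with the objects already computed; a trivial check:
# Unpruned vs. pruned test accuracy
c(unpruned = dt, pruned = dt_pruned)   # both 0.8221 here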
Using the pruned tree for prediction on the test data also gave 82.21% accuracy. Pruning is expected to produce a more interpretable tree and, in some cases, improved classification accuracy; here the accuracy is unchanged.
Bagging and Random Forests
I apply bagging and random forests to the data set using the randomForest package in R. Bagging is a special case of a random forest with m = p, so the randomForest() function can be used for both.
library("randomForest")
set.seed(1)
bag.data = randomForest(Prediction ~ ., data, subset = train, mtry = 30, importance = TRUE)
bag.data
## 
## Call:
##  randomForest(formula = Prediction ~ ., data = data, mtry = 30, importance = TRUE, subset = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 30
## 
##         OOB estimate of  error rate: 10.26%
## Confusion matrix:
##          non-spam spam class.error
## non-spam     2505  258  0.09337676
## spam          140  976  0.12544803
# The argument mtry = 30 indicates that all 30 predictors should be considered
# at each split of the tree, i.e., bagging.
# Below, I check how well the bagged model performs on the test set.
tree.pred_rf = predict(bag.data, data.test, type = "class")
table(tree.pred_rf, Prediction.test)
##             Prediction.test
## tree.pred_rf non-spam spam
##     non-spam      802   46
##     spam          107  338
tree_rf = mean(tree.pred_rf == Prediction.test)
tree_rf
## [1] 0.8816705
# The misclassification rate is 11.83%; in other words, the accuracy is 88.17%.
# This performs better than the optimally-pruned single tree.
# Growing a random forest works the same way as bagging, except that a smaller
# value is used for the mtry argument. By default, randomForest() uses sqrt(p)
# when building a random forest of classification trees; here I use mtry = 5.
set.seed(1)
rf.data = randomForest(Prediction ~ ., data, subset = train, mtry = 5, importance = TRUE)
tree.pred = predict(rf.data, data.test, type = "class")
table(tree.pred, Prediction.test)
##           Prediction.test
## tree.pred  non-spam spam
##   non-spam      812   31
##   spam           97  353
tree_rf_mtry5 = mean(tree.pred == Prediction.test)
tree_rf_mtry5
## [1] 0.9010054
# The accuracy is 90.1%, so random forests yield an improvement over bagging here.
# Below, I use the importance() function to see the importance of each variable.
importance(rf.data)
##             non-spam      spam MeanDecreaseAccuracy MeanDecreaseGini
## will       5.3042463 45.479724             49.86747        63.452780
## gas       24.2463433 56.221548             53.08475        82.224761
## deal       0.9802994 42.477007             39.10400        44.378382
## meter     -4.5367391 29.987282             27.98291        32.078145
## please    -0.1902791 32.834536             37.57409        49.909437
## attached   1.5259830 49.701662             38.62647        66.700432
## day       -0.4268590 28.889684             35.09778        37.436594
## only      17.2061445  5.664471             18.87643        21.792961
## http      48.6134396 48.429899             59.75667       138.080696
## volume    -8.9952347 36.918035             33.24116        37.901143
## contract  -6.5207033 26.818487             25.36784        13.864951
## texas      2.2840661 41.895667             36.57034        26.934516
## nom       -2.5553713 47.580550             38.02725        78.614167
## questions  1.4049630 38.524547             35.96407        34.155015
## change    -1.1096887 25.447872             26.13298        18.279740
## thanks     1.1478633 70.109123             62.08956       101.311754
## money     35.1556259 37.838557             42.40500        48.476058
## rate      13.7818545 11.314898             20.48072        22.293444
## best      26.4134138 16.099751             29.12680        35.564738
## houston   10.1284094 28.517723             28.94814        17.548066
## feb       -5.2494038 23.054995             18.86005         9.133184
## prices    24.1908388 16.224955             26.73269        23.339295
## move      24.2815051 27.102642             32.99077        32.210392
## neon      22.8761407 13.337040             23.26828         9.367063
## nor        7.1605074  8.034046             12.75462        13.214706
## dr        32.8422668 13.678669             36.05150        52.354310
## z         36.2935404 22.668259             45.76390        74.510092
## gra       35.4732405 20.574643             39.71020        78.426598
## rev       -6.4561269 23.730672             20.72939        18.432590
## alt       19.2369705  4.070941             19.23063        18.172001
# The two variable-importance measures above are the mean decrease in accuracy
# and the mean decrease in Gini index; plots of both measures follow.
varImpPlot(rf.data)

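The same ranking can also be extracted numerically; per the importance table above, http, thanks, gas, nom, and gra lead on mean decrease in Gini. A small sketch:
# Top-5 variables by mean decrease in Gini
imp <- importance(rf.data)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ], 5)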
Boosting
The gbm() function from the gbm package in R is used to fit boosted classification trees to the data set.
library(gbm)
set.seed(1)
# Re-read the data so that Prediction is the numeric 0/1 column, which the
# bernoulli distribution in gbm() requires
data = read.csv("email_v2.csv")
train = sample(1:nrow(data), 3879)
data.test = data[-train, ]
Prediction.test = data$Prediction[-train]
boost.data = gbm(Prediction ~ ., data[train, ], distribution = "bernoulli",
                 n.trees = 2586, interaction.depth = 4)
# distribution = "bernoulli" is for a binary classification problem; n.trees = 2586
# requests 2586 trees and interaction.depth = 4 limits the depth of each tree.
summary(boost.data)

##                 var    rel.inf
## http           http 16.4808133
## z                 z  7.7868276
## thanks       thanks  7.6023091
## gra             gra  6.1182133
## will           will  5.6528929
## attached   attached  5.2310361
## nom             nom  5.1724083
## money         money  4.5333997
## gas             gas  4.4551762
## dr               dr  4.0759911
## please       please  3.9405970
## day             day  3.6693143
## best           best  3.4612881
## questions questions  2.1027172
## deal           deal  2.0823118
## move           move  2.0703872
## prices       prices  1.6819056
## meter         meter  1.6774147
## rate           rate  1.5932185
## volume       volume  1.5580087
## texas         texas  1.4109733
## only           only  1.3826357
## rev             rev  1.1896871
## alt             alt  1.0607455
## change       change  1.0057292
## houston     houston  0.9354307
## nor             nor  0.7941808
## neon           neon  0.6697102
## feb             feb  0.4203874
## contract   contract  0.1842894
# The variables above are listed in descending order of importance.
par(mfrow = c(1, 2))
plot(boost.data, i = 'http')

plot(boost.data, i = 'z')

# The two plots above are partial dependence plots for the two most important
# variables; they illustrate the marginal effect of each selected variable on
# the response after integrating out the other variables.
# Next, I use the boosted model to predict Prediction on the test set.
yhat.boost = predict(boost.data, data.test, type = "response", n.trees = 2586)
# Note: type = "response" returns probabilities, which are never exactly equal
# to the 0/1 labels, so the comparison below always prints 0; it is not an
# error rate. A proper accuracy would threshold first, e.g.:
# mean(ifelse(yhat.boost > 0.5, 1, 0) == Prediction.test)
mean(yhat.boost == Prediction.test)
## [1] 0
# The 0 above is therefore not a perfect score: the probabilities were compared
# directly with the class labels, so accuracy cannot be read off this number.
# I run the procedure again with shrinkage (lambda) 0.2 instead of the default 0.1.
boost.data = gbm(Prediction ~ ., data[train, ], distribution = "bernoulli",
                 n.trees = 2586, interaction.depth = 4, shrinkage = 0.2, verbose = F)
yhat.boost = predict(boost.data, data.test, type = "response", n.trees = 2586)
mean(yhat.boost == Prediction.test)
## [1] 0
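Rather than fixing n.trees = 2586, gbm can choose the iteration count by cross-validation; a sketch under the same training split (boost.cv and best.iter are hypothetical names; cv.folds and gbm.perf() are standard gbm features):
set.seed(1)
# Refit with built-in 5-fold cross-validation over the boosting iterations
boost.cv = gbm(Prediction ~ ., data[train, ], distribution = "bernoulli",
               n.trees = 2586, interaction.depth = 4, shrinkage = 0.1, cv.folds = 5)
best.iter = gbm.perf(boost.cv, method = "cv")   # CV-optimal number of trees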
In the above procedure, I produced partial dependence plots for the two most important variables; these illustrate the marginal effect of the selected variables on the response after integrating out the other variables. I then used the boosted model to predict Prediction on the test set. The printed value of 0 is not a test error rate: predict() with type = "response" returns probabilities, which never exactly equal the 0/1 labels, so the comparison matches nothing. I therefore cannot conclude that boosting is perfect or that it beats random forests and bagging from this number; the probabilities must first be thresholded (for example at 0.5) before computing accuracy. The same applies to the re-run with lambda 0.2 instead of the default 0.1, which prints 0 for the same reason.
Support Vector Machine
The main objective of a support vector machine is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. Many possible hyperplanes could separate the two classes; in this project, I look for the plane with the maximum margin. I use the svm() function from the e1071 package, fitting the model on the training data and testing it on held-out data. The cost parameter penalizes the model for misclassification; I set it to 1000. After fitting, I run the predict() function on the test data and look at the accuracy.
library(e1071)
library(magrittr)
set.seed(1)
smp_siz = floor(0.75 * nrow(data))
train_ind = sample(seq_len(nrow(data)), size = smp_siz)
train = data[train_ind, ]
test = data[-train_ind, ]
model_svm <- svm(Prediction ~ ., data = train, cost = 1000, gamma = 0.01)
model_svm
## 
## Call:
## svm(formula = Prediction ~ ., data = train, cost = 1000, gamma = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1000 
##       gamma:  0.01 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  2740
# Because Prediction is numeric here, svm() defaulted to eps-regression (see the
# printout above) rather than C-classification; converting the response to a
# factor would give a classification fit.
test_svm <- predict(model_svm, newdata = test %>% na.omit())
yo <- test %>% na.omit()
#table(test_svm, yo$Prediction)
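Because the fit above ran as eps-regression, a classification refit is the more direct route for this task; a sketch assuming Prediction is coded 0/1 as in the re-read CSV (train2, test2, and model_svm_c are hypothetical names):
# Refit as an explicit classification SVM on factor labels
train2 <- train; test2 <- test
train2$Prediction <- factor(train2$Prediction, labels = c("non-spam", "spam"))
test2$Prediction  <- factor(test2$Prediction,  labels = c("non-spam", "spam"))
model_svm_c <- svm(Prediction ~ ., data = train2, type = "C-classification",
                   cost = 1000, gamma = 0.01)
mean(predict(model_svm_c, newdata = test2) == test2$Prediction)  # test accuracy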
The procedure above gave an accuracy of 88.94% on the held-out test data; the classification refit sketched above would be the more direct way to obtain this figure.