Prediction of email spam using Classification Machine Learning Methods

The document reports the analysis of email spam dataset hosted by Kaggle. The dataset contains the count of frequent words in email to determine if it is a spam. The project goal is to build predictive model which can be used to identify spam email. Also, infer the relationship between the predictors and the response. The URL of the dataset is https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv . The scope of the project is to apply concepts learnt in class on the dataset. R language has been used throughout the project. Initially, there were 3000 predictors in the dataset. I ran out of memory due to large number of predictors. Then I tried feature selection techniques to work with the most significant features for the rest of the project. Finally, there are 30 predictors in the dataset which are which are the most common words in all the emails. The predictor values are all real numbers. There are 5172 instances, each instance for each email. For each instance, the count of each predictor (column) in that email (instance) is stored in the respective cells. The information regarding all the emails are stored in a data frame. At first, I load all the required libraries.

library(readxl)
library(xlsx)
library(readr)
library(psych)
library(pastecs)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(e1071)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(ggplot2)
library(gridExtra)
library(grid)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ tibble  2.1.3     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ purrr   0.3.3     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x ggplot2::%+%()   masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::combine() masks gridExtra::combine()
## x tidyr::extract() masks pastecs::extract()
## x dplyr::filter()  masks stats::filter()
## x dplyr::first()   masks pastecs::first()
## x dplyr::lag()     masks stats::lag()
## x dplyr::last()    masks pastecs::last()
## x purrr::lift()    masks caret::lift()

library(ggthemes)
library(gridExtra)
library(corrplot)

## corrplot 0.84 loaded

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(rpart)
library(rpart.plot)
library(gplots)

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

I read the data from a csv file and attach to the project. The R function attach() Is used to attach the project for ease of accessibility.

data = read_csv("email_v2.csv")

## Parsed with column specification:
## cols(
##   .default = col_double()
## )

## See spec(...) for full column specifications.

attach(data)

Below procedure display the data’s features and observations for better understanding.

names(data)

##  [1] "will"       "gas"        "deal"       "meter"      "please"    
##  [6] "attached"   "day"        "only"       "http"       "volume"    
## [11] "contract"   "texas"      "nom"        "questions"  "change"    
## [16] "thanks"     "money"      "rate"       "best"       "houston"   
## [21] "feb"        "prices"     "move"       "neon"       "nor"       
## [26] "dr"         "z"          "gra"        "rev"        "alt"       
## [31] "Prediction"

head(data)

## # A tibble: 6 x 31
##    will   gas  deal meter please attached   day  only  http volume contract
##   <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
## 1     0     0     0     0      0        0     0     0     0      0        0
## 2     0     1     0     0      2        1     2     0     0      0        0
## 3     0     2     0     0      0        0     0     0     0      0        0
## 4     0     0     2     1      0        0     1     0     0      1        0
## 5     0     2     0     3      1        0     0     0     0      3        4
## 6     0     0     0     0      0        0     0     0     0      0        0
## # … with 20 more variables: texas <dbl>, nom <dbl>, questions <dbl>,
## #   change <dbl>, thanks <dbl>, money <dbl>, rate <dbl>, best <dbl>,
## #   houston <dbl>, feb <dbl>, prices <dbl>, move <dbl>, neon <dbl>, nor <dbl>,
## #   dr <dbl>, z <dbl>, gra <dbl>, rev <dbl>, alt <dbl>, Prediction <dbl>

dim(data)

## [1] 5172   31

The dim() function shows the data has 5172 observations and 31 columns. The summary() function in R shows a numerical summary of each variable in data set. We display the summary statistics of our data set.

summary(data)

##       will              gas               deal             meter        
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
##  Mean   : 0.8509   Mean   : 0.6174   Mean   : 0.7345   Mean   : 0.5381  
##  3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :53.0000   Max.   :29.0000   Max.   :25.0000   Max.   :29.0000  
##      please           attached           day               only        
##  Min.   : 0.0000   Min.   :0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.0000   Median :0.0000   Median : 0.0000   Median : 0.0000  
##  Mean   : 0.6278   Mean   :0.2121   Mean   : 0.7341   Mean   : 0.1841  
##  3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.: 1.0000   3rd Qu.: 0.0000  
##  Max.   :12.0000   Max.   :7.0000   Max.   :26.0000   Max.   :12.0000  
##       http             volume           contract           texas        
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
##  Mean   : 0.2448   Mean   : 0.3291   Mean   : 0.2372   Mean   : 0.1794  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :75.0000   Max.   :21.0000   Max.   :30.0000   Max.   :26.0000  
##       nom            questions          change            thanks      
##  Min.   : 0.0000   Min.   :0.0000   Min.   : 0.0000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.:0.0000  
##  Median : 0.0000   Median :0.0000   Median : 0.0000   Median :0.0000  
##  Mean   : 0.5058   Mean   :0.1522   Mean   : 0.2862   Mean   :0.3705  
##  3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.: 0.0000   3rd Qu.:1.0000  
##  Max.   :17.0000   Max.   :5.0000   Max.   :35.0000   Max.   :8.0000  
##      money              rate              best             houston        
##  Min.   :0.00000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000  
##  1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000  
##  Median :0.00000   Median : 0.0000   Median : 0.00000   Median : 0.00000  
##  Mean   :0.07057   Mean   : 0.2502   Mean   : 0.08063   Mean   : 0.08894  
##  3rd Qu.:0.00000   3rd Qu.: 0.0000   3rd Qu.: 0.00000   3rd Qu.: 0.00000  
##  Max.   :8.00000   Max.   :42.0000   Max.   :12.00000   Max.   :55.00000  
##       feb              prices             move             neon        
##  Min.   : 0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median : 0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   : 0.1133   Mean   :0.06207   Mean   :0.1205   Mean   :0.01489  
##  3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :12.0000   Max.   :7.00000   Max.   :6.0000   Max.   :9.00000  
##       nor                dr                z               gra         
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 0.0000   Median : 0.0000   Median : 0.000   Median : 0.0000  
##  Mean   : 0.1798   Mean   : 0.5027   Mean   : 1.119   Mean   : 0.3534  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.000   3rd Qu.: 0.0000  
##  Max.   :27.0000   Max.   :45.0000   Max.   :83.000   Max.   :66.0000  
##       rev               alt            Prediction  
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.00  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.00  
##  Median : 0.0000   Median : 0.0000   Median :0.00  
##  Mean   : 0.2846   Mean   : 0.1205   Mean   :0.29  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:1.00  
##  Max.   :15.0000   Max.   :13.0000   Max.   :1.00

cor(data[,-31])

##                   will         gas          deal        meter        please
## will       1.000000000  0.19408826  0.1165658397  0.074194634  0.3075509537
## gas        0.194088257  1.00000000  0.2518701108  0.402090078  0.1927989559
## deal       0.116565840  0.25187011  1.0000000000  0.345258095  0.1932443390
## meter      0.074194634  0.40209008  0.3452580950  1.000000000  0.2740664278
## please     0.307550954  0.19279896  0.1932443390  0.274066428  1.0000000000
## attached   0.053399963  0.15074555  0.0115556003  0.093745104  0.2011005612
## day        0.316428242  0.21303059  0.1836804108  0.130031019  0.2565856344
## only       0.177554426  0.11095202  0.1215434572  0.113116537  0.1666391425
## http      -0.015146397 -0.03928388 -0.0370370790 -0.037391361  0.0032825872
## volume     0.099497149  0.25685806  0.3220178673  0.359065289  0.1453252220
## contract   0.197231403  0.24990029  0.3525168359  0.259538569  0.1981665222
## texas      0.113240680  0.15032034  0.0479859292  0.009870298  0.0710757363
## nom        0.057433304  0.13931293 -0.0041694741  0.083655985  0.0005268855
## questions  0.240028223  0.19509997  0.0963054923  0.186399601  0.4136645938
## change     0.349604074  0.07724622  0.2580086930  0.049011365  0.1746024356
## thanks     0.139869773  0.36772843  0.4089695131  0.405970285  0.3728523751
## money      0.144592613 -0.01726649 -0.0033592852  0.006258493  0.0572784116
## rate       0.199130192  0.04105676  0.0115202712  0.011118047  0.0555922796
## best       0.054684098 -0.02467211 -0.0391525433 -0.046926459 -0.0221339507
## houston    0.079273390  0.09782829  0.0098614929  0.018467430  0.0718047753
## feb        0.044694961  0.04875616  0.1355137245  0.068444377  0.0788940300
## prices     0.061960619  0.10484440  0.0455019979 -0.021774217  0.0874615477
## move       0.121079472  0.09890604  0.1231503597  0.004938004  0.1350519186
## neon      -0.008296583 -0.02384860 -0.0271037057 -0.019706970 -0.0338284852
## nor        0.241815548  0.24067285  0.0890183301  0.268419217  0.1837039290
## dr         0.160052732  0.09584913  0.0192809337  0.027481633  0.1664594112
## z          0.099943665  0.05038512  0.0174314028 -0.011078764  0.0653508757
## gra        0.151209818  0.09890504 -0.0006858967  0.092293050  0.1071135579
## rev        0.178498449  0.07234346 -0.0102878944  0.008143123  0.0976012521
## alt        0.104629747  0.05999195  0.0167319993 -0.001876477  0.0416485361
##               attached          day         only         http       volume
## will       0.053399963  0.316428242  0.177554426 -0.015146397  0.099497149
## gas        0.150745554  0.213030585  0.110952021 -0.039283876  0.256858061
## deal       0.011555600  0.183680411  0.121543457 -0.037037079  0.322017867
## meter      0.093745104  0.130031019  0.113116537 -0.037391361  0.359065289
## please     0.201100561  0.256585634  0.166639143  0.003282587  0.145325222
## attached   1.000000000 -0.033972392 -0.019116135 -0.048835061  0.009264633
## day       -0.033972392  1.000000000  0.187198100 -0.009122755  0.158011780
## only      -0.019116135  0.187198100  1.000000000  0.027669937  0.055089411
## http      -0.048835061 -0.009122755  0.027669937  1.000000000 -0.039032594
## volume     0.009264633  0.158011780  0.055089411 -0.039032594  1.000000000
## contract   0.126549437  0.112307312  0.041265491 -0.027188555  0.172773198
## texas     -0.041285959  0.058525115  0.021092377 -0.025457939  0.096188832
## nom        0.140054460  0.097666919  0.004940490 -0.051113909  0.134558316
## questions  0.151110038  0.159223775  0.055991407 -0.009691162  0.182944582
## change     0.053292183  0.134967160  0.088423146 -0.016700345  0.119422142
## thanks     0.130238303  0.234263351  0.071814105 -0.035037804  0.288512970
## money     -0.038322393  0.037653304  0.148054411  0.016295528 -0.030216013
## rate      -0.033794126  0.105603113  0.138063810  0.013732376  0.011399543
## best      -0.064705837  0.053243501  0.082034234  0.102213192 -0.033643511
## houston   -0.002247008  0.050771434  0.005939768 -0.006527714  0.005726187
## feb        0.093794005  0.036822265  0.036059882 -0.010471309  0.042640220
## prices    -0.065735122  0.050694225  0.279650092  0.040506164  0.004274045
## move      -0.052475747  0.093143620  0.296742791  0.028467473  0.015148820
## neon      -0.022515944 -0.012241889  0.027136571 -0.007711167 -0.021980894
## nor        0.086978583  0.117307428  0.105359890  0.014486922  0.070157891
## dr        -0.009118693  0.146183114  0.240722330  0.057442167  0.005799155
## z         -0.039078511  0.070034156  0.080425480  0.194081295  0.004081841
## gra        0.051274065  0.071177775  0.150676899  0.021520509 -0.003135425
## rev        0.124358844  0.084979664  0.042728910 -0.020947573  0.051949190
## alt       -0.001270039  0.067588221  0.049373959  0.186373410 -0.017985446
##                contract        texas           nom     questions       change
## will       1.972314e-01  0.113240680  0.0574333041  0.2400282231  0.349604074
## gas        2.499003e-01  0.150320335  0.1393129331  0.1950999664  0.077246224
## deal       3.525168e-01  0.047985929 -0.0041694741  0.0963054923  0.258008693
## meter      2.595386e-01  0.009870298  0.0836559845  0.1863996006  0.049011365
## please     1.981665e-01  0.071075736  0.0005268855  0.4136645938  0.174602436
## attached   1.265494e-01 -0.041285959  0.1400544602  0.1511100378  0.053292183
## day        1.123073e-01  0.058525115  0.0976669193  0.1592237748  0.134967160
## only       4.126549e-02  0.021092377  0.0049404903  0.0559914069  0.088423146
## http      -2.718855e-02 -0.025457939 -0.0511139090 -0.0096911620 -0.016700345
## volume     1.727732e-01  0.096188832  0.1345583157  0.1829445818  0.119422142
## contract   1.000000e+00  0.069158859  0.1048655648  0.1846292339  0.242419926
## texas      6.915886e-02  1.000000000  0.1948335372  0.0373533874  0.072858742
## nom        1.048656e-01  0.194833537  1.0000000000  0.0245038315  0.179682862
## questions  1.846292e-01  0.037353387  0.0245038315  1.0000000000  0.092812830
## change     2.424199e-01  0.072858742  0.1796828616  0.0928128301  1.000000000
## thanks     2.852708e-01  0.041467867  0.0180594035  0.2773145660  0.101239551
## money      1.549473e-02 -0.033007929 -0.0381146538 -0.0262463985  0.033348241
## rate       9.383281e-02  0.055016012  0.0509137979  0.0138813364  0.075386228
## best      -1.833782e-02 -0.028434835 -0.0488018062  0.0057227634 -0.006729329
## houston    2.575592e-02  0.107632564  0.0515396619  0.0189282426  0.024051728
## feb       -7.392723e-06  0.015880911  0.0604128620  0.0430776756  0.063044106
## prices     2.089187e-03  0.034244265 -0.0404733375 -0.0255887387  0.064586908
## move       6.090265e-02  0.053513827 -0.0268722396  0.0001828731  0.101063174
## neon      -1.684671e-02 -0.015937059 -0.0302884022  0.0089273186 -0.012571139
## nor        1.019572e-01  0.063603452  0.0077856606  0.1230101071  0.045416885
## dr         3.975837e-02  0.054768060 -0.0026986365  0.0226361950  0.063980619
## z          3.869394e-02  0.036727455  0.0170816563  0.0085967978  0.075341596
## gra        4.035798e-02  0.036673367 -0.0087681947  0.0453523517  0.023741191
## rev        8.015481e-02  0.025172338  0.1150117527  0.0718790718  0.121748207
## alt        5.319918e-02  0.025213155  0.0261972845  0.0177203209  0.063958775
##                 thanks        money        rate         best      houston
## will       0.139869773  0.144592613  0.19913019  0.054684098  0.079273390
## gas        0.367728426 -0.017266486  0.04105676 -0.024672111  0.097828289
## deal       0.408969513 -0.003359285  0.01152027 -0.039152543  0.009861493
## meter      0.405970285  0.006258493  0.01111805 -0.046926459  0.018467430
## please     0.372852375  0.057278412  0.05559228 -0.022133951  0.071804775
## attached   0.130238303 -0.038322393 -0.03379413 -0.064705837 -0.002247008
## day        0.234263351  0.037653304  0.10560311  0.053243501  0.050771434
## only       0.071814105  0.148054411  0.13806381  0.082034234  0.005939768
## http      -0.035037804  0.016295528  0.01373238  0.102213192 -0.006527714
## volume     0.288512970 -0.030216013  0.01139954 -0.033643511  0.005726187
## contract   0.285270821  0.015494731  0.09383281 -0.018337815  0.025755918
## texas      0.041467867 -0.033007929  0.05501601 -0.028434835  0.107632564
## nom        0.018059404 -0.038114654  0.05091380 -0.048801806  0.051539662
## questions  0.277314566 -0.026246399  0.01388134  0.005722763  0.018928243
## change     0.101239551  0.033348241  0.07538623 -0.006729329  0.024051728
## thanks     1.000000000 -0.034826826 -0.01474870 -0.046424158  0.008988285
## money     -0.034826826  1.000000000  0.04292973  0.122389837 -0.005922217
## rate      -0.014748697  0.042929733  1.00000000  0.089119599  0.097830674
## best      -0.046424158  0.122389837  0.08911960  1.000000000 -0.008362240
## houston    0.008988285 -0.005922217  0.09783067 -0.008362240  1.000000000
## feb        0.111833839 -0.007197264  0.02656975 -0.005102083  0.009958814
## prices    -0.046355194  0.010248183  0.16488769  0.049003826 -0.006574359
## move      -0.002106540  0.051759237  0.10856328  0.005569897 -0.003886378
## neon      -0.032446132 -0.012870834 -0.01181546  0.020832628 -0.005129891
## nor        0.146028721  0.063683127  0.14836871  0.066243655  0.018530317
## dr         0.009307535  0.170826730  0.37827541  0.115724781  0.027406194
## z         -0.006642907  0.041168690  0.21644830  0.058769771  0.023354059
## gra        0.023911250  0.092030534  0.45462859  0.103737255  0.035726488
## rev        0.058179752  0.027454655  0.15645669 -0.006690009  0.030158956
## alt       -0.009286070  0.059938942  0.13953308  0.107802512  0.007281793
##                     feb       prices          move         neon          nor
## will       4.469496e-02  0.061960619  0.1210794719 -0.008296583  0.241815548
## gas        4.875616e-02  0.104844402  0.0989060448 -0.023848602  0.240672848
## deal       1.355137e-01  0.045501998  0.1231503597 -0.027103706  0.089018330
## meter      6.844438e-02 -0.021774217  0.0049380037 -0.019706970  0.268419217
## please     7.889403e-02  0.087461548  0.1350519186 -0.033828485  0.183703929
## attached   9.379400e-02 -0.065735122 -0.0524757467 -0.022515944  0.086978583
## day        3.682227e-02  0.050694225  0.0931436199 -0.012241889  0.117307428
## only       3.605988e-02  0.279650092  0.2967427906  0.027136571  0.105359890
## http      -1.047131e-02  0.040506164  0.0284674725 -0.007711167  0.014486922
## volume     4.264022e-02  0.004274045  0.0151488201 -0.021980894  0.070157891
## contract  -7.392723e-06  0.002089187  0.0609026456 -0.016846706  0.101957172
## texas      1.588091e-02  0.034244265  0.0535138271 -0.015937059  0.063603452
## nom        6.041286e-02 -0.040473338 -0.0268722396 -0.030288402  0.007785661
## questions  4.307768e-02 -0.025588739  0.0001828731  0.008927319  0.123010107
## change     6.304411e-02  0.064586908  0.1010631743 -0.012571139  0.045416885
## thanks     1.118338e-01 -0.046355194 -0.0021065400 -0.032446132  0.146028721
## money     -7.197264e-03  0.010248183  0.0517592370 -0.012870834  0.063683127
## rate       2.656975e-02  0.164887691  0.1085632840 -0.011815457  0.148368708
## best      -5.102083e-03  0.049003826  0.0055698967  0.020832628  0.066243655
## houston    9.958814e-03 -0.006574359 -0.0038863779 -0.005129891  0.018530317
## feb        1.000000e+00 -0.023112497 -0.0130724528 -0.002776526  0.002698103
## prices    -2.311250e-02  1.000000000  0.3477687367 -0.012462566  0.050941039
## move      -1.307245e-02  0.347768737  1.0000000000 -0.012135057  0.087487387
## neon      -2.776526e-03 -0.012462566 -0.0121350571  1.000000000 -0.012138461
## nor        2.698103e-03  0.050941039  0.0874873874 -0.012138461  1.000000000
## dr         8.449249e-03  0.178355354  0.1838652821 -0.015126395  0.235538464
## z          1.832634e-02  0.063517236  0.0922380377 -0.014804336  0.133310700
## gra        4.475333e-03  0.060852725  0.0987101381 -0.011218547  0.237584275
## rev        8.008111e-03  0.088609473  0.0379777257 -0.018886341  0.079901414
## alt        1.211311e-02  0.092836047  0.0461676763  0.002683003  0.169847134
##                     dr            z           gra          rev          alt
## will       0.160052732  0.099943665  0.1512098185  0.178498449  0.104629747
## gas        0.095849135  0.050385117  0.0989050437  0.072343456  0.059991945
## deal       0.019280934  0.017431403 -0.0006858967 -0.010287894  0.016731999
## meter      0.027481633 -0.011078764  0.0922930497  0.008143123 -0.001876477
## please     0.166459411  0.065350876  0.1071135579  0.097601252  0.041648536
## attached  -0.009118693 -0.039078511  0.0512740650  0.124358844 -0.001270039
## day        0.146183114  0.070034156  0.0711777746  0.084979664  0.067588221
## only       0.240722330  0.080425480  0.1506768986  0.042728910  0.049373959
## http       0.057442167  0.194081295  0.0215205092 -0.020947573  0.186373410
## volume     0.005799155  0.004081841 -0.0031354249  0.051949190 -0.017985446
## contract   0.039758369  0.038693942  0.0403579751  0.080154809  0.053199179
## texas      0.054768060  0.036727455  0.0366733666  0.025172338  0.025213155
## nom       -0.002698637  0.017081656 -0.0087681947  0.115011753  0.026197285
## questions  0.022636195  0.008596798  0.0453523517  0.071879072  0.017720321
## change     0.063980619  0.075341596  0.0237411912  0.121748207  0.063958775
## thanks     0.009307535 -0.006642907  0.0239112502  0.058179752 -0.009286070
## money      0.170826730  0.041168690  0.0920305337  0.027454655  0.059938942
## rate       0.378275407  0.216448298  0.4546285928  0.156456690  0.139533076
## best       0.115724781  0.058769771  0.1037372552 -0.006690009  0.107802512
## houston    0.027406194  0.023354059  0.0357264883  0.030158956  0.007281793
## feb        0.008449249  0.018326341  0.0044753329  0.008008111  0.012113107
## prices     0.178355354  0.063517236  0.0608527252  0.088609473  0.092836047
## move       0.183865282  0.092238038  0.0987101381  0.037977726  0.046167676
## neon      -0.015126395 -0.014804336 -0.0112185473 -0.018886341  0.002683003
## nor        0.235538464  0.133310700  0.2375842753  0.079901414  0.169847134
## dr         1.000000000  0.336072478  0.5660648781  0.071473226  0.201463114
## z          0.336072478  1.000000000  0.2849167673  0.052428524  0.168087889
## gra        0.566064878  0.284916767  1.0000000000  0.052724409  0.183336950
## rev        0.071473226  0.052428524  0.0527244093  1.000000000  0.069861292
## alt        0.201463114  0.168087889  0.1833369498  0.069861292  1.000000000

##Label and Factorize Outcome Variable
#mutate outcome variable
data = data %>%
  mutate(
    #label Prediction Column
    Prediction = case_when(
      Prediction == 0 ~ "non-spam",
      Prediction == 1 ~ "spam"
    ),

    #factorize outcome column
    Prediction = as.factor(Prediction)

  )

str(data)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':	5172 obs. of  31 variables:
##  $ will      : num  0 0 0 0 0 0 0 1 0 2 ...
##  $ gas       : num  0 1 2 0 2 0 5 0 0 1 ...
##  $ deal      : num  0 0 0 2 0 0 1 0 0 0 ...
##  $ meter     : num  0 0 0 1 3 0 4 0 1 0 ...
##  $ please    : num  0 2 0 0 1 0 1 1 0 1 ...
##  $ attached  : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ day       : num  0 2 0 1 0 0 1 0 0 1 ...
##  $ only      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ http      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ volume    : num  0 0 0 1 3 0 0 0 0 0 ...
##  $ contract  : num  0 0 0 0 4 0 0 0 1 0 ...
##  $ texas     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nom       : num  0 2 2 0 1 0 0 0 1 1 ...
##  $ questions : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ change    : num  0 0 0 0 0 0 0 0 0 6 ...
##  $ thanks    : num  0 1 0 1 1 0 1 1 0 1 ...
##  $ money     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rate      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ best      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ houston   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ feb       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ prices    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ move      : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ neon      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nor       : num  0 1 0 0 0 1 0 0 0 0 ...
##  $ dr        : num  0 1 0 0 0 0 0 0 0 1 ...
##  $ z         : num  0 1 0 0 0 0 0 1 3 2 ...
##  $ gra       : num  0 2 0 0 0 0 0 0 0 0 ...
##  $ rev       : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ alt       : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Prediction: Factor w/ 2 levels "non-spam","spam": 1 1 1 1 1 2 1 2 1 1 ...

##Visualizing the Distribution Difference with Density Plots##
#Define Predictor Variables
predictor_variables = setdiff(colnames(data), Prediction)

#initialize an empty list to store plots
plot_list = list()

#set plot index
i = 1

#create plot and fill plot list
for (predictor_variable in predictor_variables) {
  plot = ggplot(data) +
    geom_density(aes_string(x = predictor_variable, fill = "Prediction"), alpha = 0.5)
  plot_list[[i]] = plot
  i = i + 1

}

#display plot
do.call ("grid.arrange", c(plot_list, ncol = 3))

I also use graphical plot to show the correlation among the predictors in the dataset. Graphical Representation I use pair() function to create a scatterplot matrix for all variables which can help to view a scatterplot for every pair variables in the dataset.

pairs(data,col=data$Prediction)

From the pair plot above, the red color indicate the distribution of observation with unstable response while black indicate distribution of observations with stable response. I create another plot using the ggpairs function in ggally package in R to see the correlation among predictors in the dataset.

ggpairs(data = data[1:30],
        title = "Electricity data Correlation Plot",
        upper = list(continuous = wrap("cor", size = 5)),
        lower = list(continuous = "smooth")
)

#Electric Data Frequency Histogram

# Density & Frequency analysis with the Histogram,
# will 
willl <- ggplot(data=data, aes(x=will))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("Will") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(will)),linetype="dashed",color="grey")

# gas
gasl <- ggplot(data=data, aes(x=gas))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("gas") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(gas)),linetype="dashed",color="grey")

#deal
deall <- ggplot(data=data, aes(x=deal))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("deal") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(deal)),linetype="dashed",color="grey")

#meter
meterl <- ggplot(data=data, aes(x=meter))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("meter") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(meter)),linetype="dashed",color="grey")

#please
pleasel <- ggplot(data=data, aes(x=please))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("please") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(please)),linetype="dashed",color="grey")

#attached
attachedl <- ggplot(data=data, aes(x=attached))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("attached") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(attached)),linetype="dashed",color="grey")

#day
dayl <- ggplot(data=data, aes(x=day))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("day") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(day)),linetype="dashed",color="grey")

#only
onlyl <- ggplot(data=data, aes(x=only))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("only") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(only)),linetype="dashed",color="grey")

#http
httpl <- ggplot(data=data, aes(x=http))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("http") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(http)),linetype="dashed",color="grey")

#volume
volumel <- ggplot(data=data, aes(x=volume))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("volume") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(volume)),linetype="dashed",color="grey")

#contract
contractl <- ggplot(data=data, aes(x=contract))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("contract") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(contract)),linetype="dashed",color="grey")

#texas
texasl <- ggplot(data=data, aes(x=texas))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("texas") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(texas)),linetype="dashed",color="grey")

#nom
noml <- ggplot(data=data, aes(x=nom))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("nom") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(nom)),linetype="dashed",color="grey")

#questions
questionsl <- ggplot(data=data, aes(x=questions))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("questions") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(questions)),linetype="dashed",color="grey")

#change
changel <- ggplot(data=data, aes(x=change))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("change") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(change)),linetype="dashed",color="grey")

#thanks
thanksl <- ggplot(data=data, aes(x=thanks))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("thanks") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(thanks)),linetype="dashed",color="grey")

#money
moneyl <- ggplot(data=data, aes(x=money))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("money") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(money)),linetype="dashed",color="grey")

#rate
ratel <- ggplot(data=data, aes(x=rate))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("rate") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(rate)),linetype="dashed",color="grey")

#best
bestl <- ggplot(data=data, aes(x=best))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("best") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(best)),linetype="dashed",color="grey")

#houston
houstonl <- ggplot(data=data, aes(x=houston))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("houston") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(houston)),linetype="dashed",color="grey")

#feb
febl <- ggplot(data=data, aes(x=feb))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("feb") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(feb)),linetype="dashed",color="grey")

#prices
pricesl <- ggplot(data=data, aes(x=prices))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("prices") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(prices)),linetype="dashed",color="grey")

#move
movel <- ggplot(data=data, aes(x=move))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("move") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(move)),linetype="dashed",color="grey")

#neon
neonl <- ggplot(data=data, aes(x=neon))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("neon") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(neon)),linetype="dashed",color="grey")

#nor
norl <- ggplot(data=data, aes(x=nor))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("nor") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(nor)),linetype="dashed",color="grey")

#dr
drl <- ggplot(data=data, aes(x=dr))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("dr") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(dr)),linetype="dashed",color="grey")

#z
zl <- ggplot(data=data, aes(x=z))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("z") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(z)),linetype="dashed",color="grey")

#gra
gral <- ggplot(data=data, aes(x=gra))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("gra") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(gra)),linetype="dashed",color="grey")

#rev
revl <- ggplot(data=data, aes(x=rev))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("rev") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(rev)),linetype="dashed",color="grey")

#alt
altl <- ggplot(data=data, aes(x=alt))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Prediction)) +
  xlab("alt") +
  ylab("Frequency") +
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(alt)),linetype="dashed",color="grey")


# Plot all visualizations
#plot not good, need to fix
grid.arrange(willl + ggtitle(""),
             gasl + ggtitle(""),
             deall + ggtitle(""),
             meterl + ggtitle(""),
             pleasel + ggtitle(""),
             attachedl + ggtitle(""),
             dayl + ggtitle(""),
             onlyl + ggtitle(""),
             httpl + ggtitle(""),
             volumel + ggtitle(""),
             contractl + ggtitle(""),
             texasl + ggtitle(""),
             noml + ggtitle(""),
             questionsl + ggtitle(""),
             changel + ggtitle(""),
             thanksl + ggtitle(""),
             moneyl + ggtitle(""),
             ratel + ggtitle(""),
             bestl + ggtitle(""),
             houstonl + ggtitle(""),
             febl + ggtitle(""),
             pricesl + ggtitle(""),
             movel + ggtitle(""),
             neonl + ggtitle(""),
             norl + ggtitle(""),
             drl + ggtitle(""),
             zl + ggtitle(""),
             gral + ggtitle(""),
             revl + ggtitle(""),
             altl + ggtitle(""),
             nrow = 7,
             top = textGrob("Electric Data Frequency Histogram",
                            gp=gpar(fontsize=8))
)

From above plot, the density plot by class can show the separation of classes. It can help to understand the overlap in class values for an attribute. I make another plot in a single visualization that contains all the boxplot.

#Boxplots

#will
willSL <- ggplot(data=data, aes(Prediction,will,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("will", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#gas
gasSL <- ggplot(data=data, aes(Prediction,gas,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("gas", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#deal
dealSL <- ggplot(data=data, aes(Prediction,deal,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("deal", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#meter
meterSL <- ggplot(data=data, aes(Prediction,meter,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("meter", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#please
pleaseSL <- ggplot(data=data, aes(Prediction,please,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("please", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")


#attached
attachedSL <- ggplot(data=data, aes(Prediction,attached,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("attached", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#day
daySL <- ggplot(data=data, aes(Prediction,day,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("day", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#only
onlySL <- ggplot(data=data, aes(Prediction,only,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("only", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#http
httpSL <- ggplot(data=data, aes(Prediction,http,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("http", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#volume
volumeSL <- ggplot(data=data, aes(Prediction,volume,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("volume", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#contract
contractSL <- ggplot(data=data, aes(Prediction,contract,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("contract", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#texas
texasSL <- ggplot(data=data, aes(Prediction,texas,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("texas", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#nom
nomSL <- ggplot(data=data, aes(Prediction,nom,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("nom", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#questions
questionsSL <- ggplot(data=data, aes(Prediction,questions,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("questions", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#change
changeSL <- ggplot(data=data, aes(Prediction,change,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("change", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#thanks
thanksSL <- ggplot(data=data, aes(Prediction,thanks,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("thanks", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#money
moneySL <- ggplot(data=data, aes(Prediction,money,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("money", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#rate
rateSL <- ggplot(data=data, aes(Prediction,rate,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("rate", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#best
bestSL <- ggplot(data=data, aes(Prediction,best,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("best", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#houston
houstonSL <- ggplot(data=data, aes(Prediction,houston,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("houston", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#feb
febSL <- ggplot(data=data, aes(Prediction,feb,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("feb", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#prices
pricesSL <- ggplot(data=data, aes(Prediction,prices,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("prices", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#move
moveSL <- ggplot(data=data, aes(Prediction,move,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("move", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#neon
neonSL <- ggplot(data=data, aes(Prediction,neon,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("neon", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#nor
norSL <- ggplot(data=data, aes(Prediction,nor,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("nor", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#dr
drSL <- ggplot(data=data, aes(Prediction,dr,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("dr", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#z
zSL <- ggplot(data=data, aes(Prediction,z,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("z", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#gra
graSL <- ggplot(data=data, aes(Prediction,gra,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("gra", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#rev
revSL <- ggplot(data=data, aes(Prediction,rev,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("rev", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

#alt
altSL <- ggplot(data=data, aes(Prediction,alt,fill=Prediction)) +
  geom_boxplot()+
  scale_y_continuous("alt", breaks= seq(0,300, by=.5))+
  theme(legend.position="none")

# Plot all density visualizations
grid.arrange(willSL + ggtitle(""),
             gasSL + ggtitle(""),
             dealSL + ggtitle(""),
             meterSL + ggtitle(""),
             pleaseSL + ggtitle(""),
             attachedSL + ggtitle(""),
             daySL + ggtitle(""),
             onlySL + ggtitle(""),
             httpSL + ggtitle(""),
             volumeSL + ggtitle(""),
             contractSL + ggtitle(""),
             texasSL + ggtitle(""),
             nomSL + ggtitle(""),
             questionsSL + ggtitle(""),
             changeSL + ggtitle(""),
             thanksSL + ggtitle(""),
             moneySL + ggtitle(""),
             rateSL + ggtitle(""),
             bestSL + ggtitle(""),
             houstonSL + ggtitle(""),
             febSL + ggtitle(""),
             pricesSL + ggtitle(""),
             moveSL + ggtitle(""),
             neonSL + ggtitle(""),
             norSL + ggtitle(""),
             drSL + ggtitle(""),
             zSL + ggtitle(""),
             graSL + ggtitle(""),
             revSL + ggtitle(""),
             altSL + ggtitle(""),
             nrow = 5,
             top = textGrob("Electricity data Density Plot",
                            gp=gpar(fontsize=11))
)

In this section, I explored the data set. I used several functions in R to understand the dataset. I also used graphical representation to look through the data. Classification The response variable in the dataset is categorical. In this section, I will use several classification methods to predict the qualitative response. Logistic Regression I fit a logistic regression model to predict spam email using the predictors in the dataset. I used generalized linear model glm() function in R. I started by using the whole dataset for training and testing the model.

##Logistic regression model
## create model
glm.fits = glm(formula = Prediction ~ ., data = data, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(glm.fits)

## 
## Call:
## glm(formula = Prediction ~ ., family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.2285  -0.2345  -0.0292   0.0862   3.2878  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.012791   0.063502  -0.201  0.84037    
## will        -0.447672   0.056324  -7.948 1.89e-15 ***
## gas         -0.242692   0.061893  -3.921 8.81e-05 ***
## deal        -0.748635   0.120323  -6.222 4.91e-10 ***
## meter       -1.308691   0.235308  -5.562 2.67e-08 ***
## please      -0.414646   0.087615  -4.733 2.22e-06 ***
## attached    -4.714750   0.522301  -9.027  < 2e-16 ***
## day         -0.338068   0.050110  -6.747 1.51e-11 ***
## only         0.959795   0.147963   6.487 8.77e-11 ***
## http         0.974066   0.094628  10.294  < 2e-16 ***
## volume      -1.223554   0.248946  -4.915 8.88e-07 ***
## contract    -0.297113   0.117643  -2.526  0.01155 *  
## texas       -1.204828   0.212911  -5.659 1.52e-08 ***
## nom         -0.369976   0.091543  -4.042 5.31e-05 ***
## questions   -2.097084   0.270962  -7.739 9.99e-15 ***
## change      -0.903930   0.135229  -6.684 2.32e-11 ***
## thanks      -1.994111   0.186541 -10.690  < 2e-16 ***
## money        2.555173   0.261873   9.757  < 2e-16 ***
## rate         0.303496   0.098263   3.089  0.00201 ** 
## best         1.298562   0.190085   6.831 8.40e-12 ***
## houston     -2.799894   0.463803  -6.037 1.57e-09 ***
## feb         -1.423951   0.306411  -4.647 3.36e-06 ***
## prices       1.616134   0.229800   7.033 2.02e-12 ***
## move         1.337748   0.161147   8.301  < 2e-16 ***
## neon        -2.182367   0.668102  -3.267  0.00109 ** 
## nor         -0.002474   0.139544  -0.018  0.98586    
## dr           0.338414   0.064269   5.266 1.40e-07 ***
## z            0.221679   0.040609   5.459 4.79e-08 ***
## gra          0.443850   0.073144   6.068 1.29e-09 ***
## rev          0.138100   0.076612   1.803  0.07145 .  
## alt          0.453869   0.148369   3.059  0.00222 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6228.9  on 5171  degrees of freedom
## Residual deviance: 2489.3  on 5141  degrees of freedom
## AIC: 2551.3
## 
## Number of Fisher Scoring iterations: 9

coef(glm.fits)

##  (Intercept)         will          gas         deal        meter       please 
## -0.012790682 -0.447671952 -0.242691638 -0.748635159 -1.308690940 -0.414645830 
##     attached          day         only         http       volume     contract 
## -4.714749568 -0.338068490  0.959794909  0.974065996 -1.223554480 -0.297113255 
##        texas          nom    questions       change       thanks        money 
## -1.204828464 -0.369976405 -2.097083798 -0.903930148 -1.994110595  2.555173279 
##         rate         best      houston          feb       prices         move 
##  0.303495513  1.298562255 -2.799894081 -1.423951196  1.616134451  1.337747808 
##         neon          nor           dr            z          gra          rev 
## -2.182367358 -0.002473625  0.338413967  0.221679387  0.443850360  0.138100274 
##          alt 
##  0.453868867

summary(glm.fits)$coef

##                 Estimate Std. Error      z value     Pr(>|z|)
## (Intercept) -0.012790682 0.06350195  -0.20142189 8.403687e-01
## will        -0.447671952 0.05632361  -7.94821068 1.892247e-15
## gas         -0.242691638 0.06189253  -3.92117804 8.811713e-05
## deal        -0.748635159 0.12032253  -6.22190357 4.911589e-10
## meter       -1.308690940 0.23530772  -5.56161486 2.672897e-08
## please      -0.414645830 0.08761471  -4.73260494 2.216566e-06
## attached    -4.714749568 0.52230068  -9.02688760 1.766237e-19
## day         -0.338068490 0.05010951  -6.74659362 1.513564e-11
## only         0.959794909 0.14796318   6.48671451 8.772835e-11
## http         0.974065996 0.09462780  10.29365559 7.526355e-25
## volume      -1.223554480 0.24894557  -4.91494783 8.880611e-07
## contract    -0.297113255 0.11764299  -2.52555008 1.155173e-02
## texas       -1.204828464 0.21291083  -5.65884058 1.523990e-08
## nom         -0.369976405 0.09154283  -4.04156634 5.309535e-05
## questions   -2.097083798 0.27096189  -7.73940489 9.988323e-15
## change      -0.903930148 0.13522887  -6.68444655 2.317990e-11
## thanks      -1.994110595 0.18654078 -10.68994466 1.134386e-26
## money        2.555173279 0.26187340   9.75728469 1.716905e-22
## rate         0.303495513 0.09826310   3.08860111 2.011012e-03
## best         1.298562255 0.19008458   6.83149721 8.403292e-12
## houston     -2.799894081 0.46380343  -6.03681198 1.571887e-09
## feb         -1.423951196 0.30641109  -4.64719204 3.364839e-06
## prices       1.616134451 0.22979969   7.03279633 2.024348e-12
## move         1.337747808 0.16114718   8.30140395 1.028882e-16
## neon        -2.182367358 0.66810159  -3.26652021 1.088780e-03
## nor         -0.002473625 0.13954418  -0.01772646 9.858571e-01
## dr           0.338413967 0.06426853   5.26562519 1.397131e-07
## z            0.221679387 0.04060904   5.45886843 4.791787e-08
## gra          0.443850360 0.07314402   6.06817031 1.293757e-09
## rev          0.138100274 0.07661174   1.80259928 7.145117e-02
## alt          0.453868867 0.14836860   3.05906282 2.220306e-03

glm.probs=predict(glm.fits,type="response")
glm.probs[1:10]

##            1            2            3            4            5            6 
## 4.968024e-01 4.872221e-04 2.247622e-01 1.701596e-03 2.200215e-05 4.961840e-01 
##            7            8            9           10 
## 4.741746e-05 6.614081e-02 2.102311e-01 2.126677e-04

## create predictions dataset
#glm.preds = as.factor(ifelse(glm.probs > 0.5, "spam", "antispam"))
glm.pred=rep("non-spam",5172)
glm.pred[glm.probs>.5]="spam"
table(glm.pred,data$Prediction)

##           
## glm.pred   non-spam spam
##   non-spam     3530  349
##   spam          142 1151

#Confusion Matrix not working, throwing error
#ConfusionMatrix(glm.pre, data$Prediction)
glm_correctness = mean(glm.pred==data$Prediction)
glm_correctness

## [1] 0.9050657

In the above procedure, I fitted the model using the glm() function and used the summary function in R to display the summary of the fitted model. The predictors with small p values have significance on the response variable Prediction. The variables with negative coefficients have negative impact on the response variable. I use the coef() function to display the coefficients for the fitted model below. I use summary function to display the p-values for the coefficients. I use the predict() function to predict the probability of an email being spam, given the values of the predictors. The glm.probs displays the probability of the prediction. The above commands create vector of class predictions based whether the predicted probability of an email being spam is greater than or less than 0.5. I used the table function in R to create a confusion matrix to determine the correct and incorrect predictions. The diagonal elements, 3530 and 1151, of the confusion matrix shows the correct predictions, while off-diagonals elements, 142 and 349, shows incorrect predictions. I use the mean() function the calculate the fraction of prediction correctness. The accuracy of the model is 90.51%. Training and Testing dataset Using the whole dataset for both training the model and testing the model could be misleading. The train error rate tends to underestimate the test error rate. To solve this issue, i fit the model using part of the data, and then examine how well it predicts using the held-out data. This method yields a more realistic error rate. I randomly divide the dataset into train data and test data.

###Train dataset and Held out dataset
set.seed(1)
smp_siz = floor(0.75*nrow(data))
train_ind = sample(seq_len(nrow(data)),size = smp_siz)
train =data[train_ind,]
test=data[-train_ind,]

I used ratio of 75:25 for the dataset to train and test respectively. I fit the model again and make prediction using separate data for training and testing. The predictors with small p-values have significance on the response. The variable with negative coefficients has negative impact on the response variable. I use the predict() function to predict the probability of an email being spam using the test data, given the values of the predictors. I created a confusion matrix below to show the results. I used the mean() function to calculate the fraction of prediction correctness. The accuracy is 90.1% which is almost equal to one we got previously.

glm.fits2 = glm(formula = Prediction ~ ., data = train, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(glm.fits2)

## 
## Call:
## glm(formula = Prediction ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4512  -0.2176  -0.0236   0.0918   3.1929  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.01220    0.07363   0.166 0.868384    
## will         -0.39773    0.06093  -6.528 6.68e-11 ***
## gas          -0.28983    0.07501  -3.864 0.000112 ***
## deal         -0.71646    0.13884  -5.160 2.47e-07 ***
## meter        -1.59524    0.31634  -5.043 4.59e-07 ***
## please       -0.47955    0.10116  -4.741 2.13e-06 ***
## attached     -4.62392    0.54785  -8.440  < 2e-16 ***
## day          -0.33008    0.05652  -5.840 5.21e-09 ***
## only          0.90474    0.16529   5.474 4.41e-08 ***
## http          0.92117    0.10891   8.458  < 2e-16 ***
## volume       -1.28079    0.29784  -4.300 1.71e-05 ***
## contract     -0.27140    0.13320  -2.038 0.041598 *  
## texas        -1.03754    0.21223  -4.889 1.02e-06 ***
## nom          -0.33867    0.09852  -3.438 0.000587 ***
## questions    -2.09188    0.31123  -6.721 1.80e-11 ***
## change       -0.93013    0.16078  -5.785 7.24e-09 ***
## thanks       -1.94094    0.20519  -9.459  < 2e-16 ***
## money         2.46007    0.29267   8.406  < 2e-16 ***
## rate          0.05524    0.06137   0.900 0.368121    
## best          1.37064    0.22104   6.201 5.62e-10 ***
## houston      -3.05698    0.54492  -5.610 2.02e-08 ***
## feb          -1.29992    0.35548  -3.657 0.000255 ***
## prices        1.41432    0.24153   5.856 4.75e-09 ***
## move          1.22418    0.18141   6.748 1.50e-11 ***
## neon        -15.15596  350.98995  -0.043 0.965558    
## nor           0.50842    0.13663   3.721 0.000198 ***
## dr            0.37612    0.07228   5.204 1.95e-07 ***
## z             0.23005    0.04583   5.019 5.18e-07 ***
## gra           0.43292    0.08407   5.150 2.61e-07 ***
## rev           0.10141    0.08786   1.154 0.248403    
## alt           0.81577    0.19841   4.112 3.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4700.0  on 3878  degrees of freedom
## Residual deviance: 1847.2  on 3848  degrees of freedom
## AIC: 1909.2
## 
## Number of Fisher Scoring iterations: 16

coef(glm.fits2)

##  (Intercept)         will          gas         deal        meter       please 
##   0.01220148  -0.39772590  -0.28982554  -0.71645883  -1.59523959  -0.47955419 
##     attached          day         only         http       volume     contract 
##  -4.62391745  -0.33007562   0.90473976   0.92117095  -1.28079076  -0.27139871 
##        texas          nom    questions       change       thanks        money 
##  -1.03754094  -0.33867127  -2.09187589  -0.93013039  -1.94093607   2.46006817 
##         rate         best      houston          feb       prices         move 
##   0.05523513   1.37064397  -3.05698212  -1.29992141   1.41432498   1.22417503 
##         neon          nor           dr            z          gra          rev 
## -15.15595926   0.50842242   0.37612008   0.23004966   0.43292304   0.10141489 
##          alt 
##   0.81577269

summary(glm.fits2)$coef

##                 Estimate   Std. Error     z value     Pr(>|z|)
## (Intercept)   0.01220148   0.07363079  0.16571164 8.683839e-01
## will         -0.39772590   0.06092984 -6.52760396 6.683012e-11
## gas          -0.28982554   0.07501054 -3.86379738 1.116378e-04
## deal         -0.71645883   0.13884374 -5.16018113 2.467110e-07
## meter        -1.59523959   0.31634195 -5.04276975 4.588413e-07
## please       -0.47955419   0.10115744 -4.74067161 2.130109e-06
## attached     -4.62391745   0.54785082 -8.44010322 3.170576e-17
## day          -0.33007562   0.05651736 -5.84025155 5.212206e-09
## only          0.90473976   0.16529453  5.47350082 4.412308e-08
## http          0.92117095   0.10891394  8.45778739 2.724984e-17
## volume       -1.28079076   0.29783601 -4.30032201 1.705501e-05
## contract     -0.27139871   0.13320057 -2.03751912 4.159805e-02
## texas        -1.03754094   0.21223451 -4.88865325 1.015281e-06
## nom          -0.33867127   0.09851934 -3.43761192 5.868681e-04
## questions    -2.09187589   0.31122591 -6.72140672 1.799784e-11
## change       -0.93013039   0.16077681 -5.78522729 7.241435e-09
## thanks       -1.94093607   0.20519404 -9.45902765 3.108200e-21
## money         2.46006817   0.29267152  8.40556053 4.258229e-17
## rate          0.05523513   0.06137248  0.89999825 3.681212e-01
## best          1.37064397   0.22104347  6.20078932 5.618068e-10
## houston      -3.05698212   0.54491792 -5.60998644 2.023425e-08
## feb          -1.29992141   0.35547769 -3.65682977 2.553538e-04
## prices        1.41432498   0.24152765  5.85574760 4.748689e-09
## move          1.22417503   0.18140656  6.74824012 1.496491e-11
## neon        -15.15595926 350.98995161 -0.04318061 9.655576e-01
## nor           0.50842242   0.13663456  3.72103811 1.984055e-04
## dr            0.37612008   0.07227700  5.20386986 1.951807e-07
## z             0.23004966   0.04583202  5.01940922 5.183063e-07
## gra           0.43292304   0.08406929  5.14959788 2.610455e-07
## rev           0.10141489   0.08786314  1.15423713 2.484030e-01
## alt           0.81577269   0.19841185  4.11151198 3.930765e-05

glm.probs2=predict(glm.fits2,test,type="response")
glm.probs2[1:10]

##            1            2            3            4            5            6 
## 8.604222e-04 7.384485e-06 6.272936e-01 7.070138e-02 1.819846e-01 3.061579e-04 
##            7            8            9           10 
## 5.268448e-07 8.287878e-06 5.287603e-01 3.689203e-03

dim(test)

## [1] 1293   31

glm.pred2=rep("non-spam",1293)
glm.pred2[glm.probs2>.5]="spam"
table(glm.pred2,test$Prediction)

##           
## glm.pred2  non-spam spam
##   non-spam      848   42
##   spam           86  317

#not working, throwing error
#ConfusionMatrix(glm.pre, data$Prediction)
glm_correctness2 = mean(glm.pred2==test$Prediction)
glm_correctness2

## [1] 0.9010054

Linear Discriminant Analysis I performed LDA on the dataset. I used the training and testing data set I already had. I fit an LDA model using lda() function on the training data. The output of the LDA indicates π1 = 0.71 and π2 = 0.29. This indicates that 71% of the training observations correspond to non-spam. While 29% indicates spam. The Group means shows averge of each predictor within each class and its LDA estimates. The Coefficients of linear discriminants provides the linear combination of all the predictors that are used to form the LDA decision rule. I use plot() function to produce a plot of the linear discriminants. I use predict() function on the fitted model with the test data. The predict() function returned a list with three elements. Namely the class, posterior and x which contains the linear discriminants. I use table() function to display the confusion matrix. I use() mean function to calculate the percentage of accuracy.

#Linear Discriminant Analysis
require(MASS)

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

#lda.fit=lda(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4,data=train)
lda.fit = lda(Prediction ~ ., data = train)
#lda.fit=lda(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4,data=train)
lda.fit

## Call:
## lda(Prediction ~ ., data = train)
## 
## Prior probabilities of groups:
## non-spam     spam 
## 0.705852 0.294148 
## 
## Group means:
##               will       gas      deal     meter    please   attached       day
## non-spam 1.0295836 0.8151936 0.9930606 0.7702703 0.7677137 0.30460190 0.8739956
## spam     0.4872918 0.1472393 0.1542507 0.0324277 0.3593339 0.01139351 0.4618755
##               only       http     volume   contract      texas       nom
## non-spam 0.1241782 0.06647188 0.46201607 0.32359386 0.23265157 0.6391527
## spam     0.3374233 0.66958808 0.01402279 0.05609115 0.01928133 0.1428571
##           questions    change     thanks      money      rate       best
## non-spam 0.21292915 0.3582907 0.50292184 0.01789627 0.1888240 0.03287071
## spam     0.01840491 0.1209465 0.06748466 0.18843120 0.4653812 0.18755478
##              houston        feb     prices       move       neon       nor
## non-spam 0.100438276 0.14864865 0.02447042 0.07815924 0.02264427 0.1669102
## spam     0.004382121 0.02366345 0.17616126 0.23926380 0.00000000 0.2278703
##                 dr         z       gra       rev        alt
## non-spam 0.3268809 0.6139518 0.2074507 0.2914536 0.06208912
## spam     1.0043821 2.5547765 0.7642419 0.2304996 0.26730938
## 
## Coefficients of linear discriminants:
##                    LD1
## will      -0.105757835
## gas       -0.019718250
## deal      -0.118239385
## meter     -0.032242656
## please    -0.085997441
## attached  -0.598894669
## day       -0.075077063
## only       0.405015436
## http       0.055424801
## volume    -0.083651711
## contract   0.084285856
## texas     -0.218434561
## nom       -0.208139476
## questions -0.320766344
## change    -0.018718588
## thanks    -0.321028593
## money      0.929545882
## rate      -0.050238221
## best       1.035306060
## houston   -0.253183640
## feb       -0.157416783
## prices     0.438859685
## move       0.319259498
## neon      -0.613267607
## nor        0.182424771
## dr         0.079042886
## z          0.048799875
## gra        0.082767596
## rev        0.009150501
## alt        0.302105638

plot(lda.fit)

lda.pred=predict(lda.fit, test)

names(lda.pred)

## [1] "class"     "posterior" "x"

lda.class=lda.pred$class
table(lda.class,test$Prediction)

##           
## lda.class  non-spam spam
##   non-spam      912  215
##   spam           22  144

lda_correctness=mean(lda.class==test$Prediction)
lda_correctness

## [1] 0.8167053

K-Nearest Neighbors I carry out K-nearest neighbors analysis on the data set. I use the knn() function in R. With knn it works differently from other model-fitting functions. Knn take a single command instead of the two-step train test approach. I divide the data into train and test and use normalization formula to convert everything into values between 0 and 1. I used the mean() function to calculate the accuracy. I set k = 1 and got a prediction accuracy of 87.16%. I believe increasing the value of k will yield better prediction accuracy. Next, I set k=5. I used the mean() function to calculate the accuracy. Setting k=5 improved accuracy from 87.16% to 88.09%.

#K-Nearest Neighbors
library(class)
set.seed(1)

##normalization function is created
nor <-function(x) { (x -min(x))/(max(x)-min(x))   }

##Run nomalization on first 20 coulumns of dataset because they are the predictors
data_norm <- as.data.frame(lapply(data[,c(1:30)], nor))
head(data_norm)

##   will        gas deal      meter     please  attached        day only http
## 1    0 0.00000000 0.00 0.00000000 0.00000000 0.0000000 0.00000000    0    0
## 2    0 0.03448276 0.00 0.00000000 0.16666667 0.1428571 0.07692308    0    0
## 3    0 0.06896552 0.00 0.00000000 0.00000000 0.0000000 0.00000000    0    0
## 4    0 0.00000000 0.08 0.03448276 0.00000000 0.0000000 0.03846154    0    0
## 5    0 0.06896552 0.00 0.10344828 0.08333333 0.0000000 0.00000000    0    0
## 6    0 0.00000000 0.00 0.00000000 0.00000000 0.0000000 0.00000000    0    0
##       volume  contract texas        nom questions change thanks money rate best
## 1 0.00000000 0.0000000     0 0.00000000         0      0  0.000     0    0    0
## 2 0.00000000 0.0000000     0 0.11764706         0      0  0.125     0    0    0
## 3 0.00000000 0.0000000     0 0.11764706         0      0  0.000     0    0    0
## 4 0.04761905 0.0000000     0 0.00000000         0      0  0.125     0    0    0
## 5 0.14285714 0.1333333     0 0.05882353         0      0  0.125     0    0    0
## 6 0.00000000 0.0000000     0 0.00000000         0      0  0.000     0    0    0
##   houston feb prices      move neon        nor         dr          z        gra
## 1       0   0      0 0.0000000    0 0.00000000 0.00000000 0.00000000 0.00000000
## 2       0   0      0 0.0000000    0 0.03703704 0.02222222 0.01204819 0.03030303
## 3       0   0      0 0.0000000    0 0.00000000 0.00000000 0.00000000 0.00000000
## 4       0   0      0 0.0000000    0 0.00000000 0.00000000 0.00000000 0.00000000
## 5       0   0      0 0.1666667    0 0.00000000 0.00000000 0.00000000 0.00000000
## 6       0   0      0 0.0000000    0 0.03703704 0.00000000 0.00000000 0.00000000
##          rev alt
## 1 0.00000000   0
## 2 0.06666667   0
## 3 0.00000000   0
## 4 0.00000000   0
## 5 0.00000000   0
## 6 0.00000000   0

#1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30

train2 <- sample(1:nrow(data), 0.75 * nrow(data))

##extract training set
data_train <- data_norm[train2,]

##extract testing set
data_test <- data_norm[-train2,]

##extract 31st column of train dataset because it will be used as 'cl' argument in knn function.
data_target_category <- data[train2,31, drop = TRUE]

##extract 31st column if test dataset to measure the accuracy
data_test_category <- data[-train2,31, drop = TRUE]

set.seed(1)
##run knn function
#cl = train_labels[,1, drop = TRUE]
knn.pred <- knn(train = data_train, test = data_test, cl=data_target_category, k=1)
##create confusion matrix
table(knn.pred, data_test_category)

##           data_test_category
## knn.pred   non-spam spam
##   non-spam      826   58
##   spam          108  301

knn_1 = mean(knn.pred==data_test_category)
knn_1

## [1] 0.8716164

set.seed(1)
##run knn function
knn.pred <- knn(data_train,data_test,cl=data_target_category,k=5)
##create confusion matrix
table(knn.pred,data_test_category)

##           data_test_category
## knn.pred   non-spam spam
##   non-spam      831   51
##   spam          103  308

knn_5=mean(knn.pred==data_test_category)
knn_5

## [1] 0.8808971

Comparing Logistic regression, LDA and KNN result I compare the performance from all the models I used so far. Below shows the comparison of the performance of the model used in this section.

###Comparing Logistic regression, LDA and KNN result
#data.frame(Logistic = glm_correctness, LDA = lda_correctness, Knn1 = knn_1, Knn5 = knn_5)
data.frame(Logistic = glm_correctness, LDA = lda_correctness, Knn1 = knn_1, Knn5 = knn_5)

##    Logistic       LDA      Knn1      Knn5
## 1 0.9050657 0.8167053 0.8716164 0.8808971

Cross Validation The following section discusses different cross validation techniques. Validation Set Approach To carry out the Validation set approach on my dataset, I use the set.seed() to random generate sample. I randomly divide the observations into 2586 train and 2586 validation set. Then I trained my model using logistic regression on the training set.

###Validation set Approach
set.seed(3)
smp_siz = floor(0.5*nrow(data))
train_ind = sample(seq_len(nrow(data)),size = smp_siz)
trainVali =data[train_ind,]
validationSet=data[-train_ind,]
train_model=glm(formula = Prediction ~ ., family = binomial,data = trainVali)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

I ran the predict() function using the test model and observe the validation set error rate.

glm.probs=predict(train_model,validationSet,type = "response")
#cont = contrasts(train$Prediction)
glm.pred=rep("non-spam",2586)
glm.pred[glm.probs>.5]="spam"
tlb = table(glm.pred,validationSet$Prediction)
mean(glm.pred!=validationSet$Prediction)

## [1] 0.0962877

The validation set error rate is 9.62%. Then I use the poly() function to estimate the validation error rate for the quadratic and cubic classification.

#poly() function to estimate the validation error rate for the quadratic
#quadratic
train_model2=glm(Prediction~poly((will+gas+deal+meter+please+attached+day+only+http+volume+contract+texas+nom+questions+change+thanks+money+rate+best+houston+feb+prices+move+neon+nor+dr+z+gra+rev+alt),2), family = binomial,data = trainVali)
glm.probs2=predict(train_model2,validationSet,type = "response")
#cont2 = contrasts(train$Prediction)
glm.pred2=rep("non-spam",2586)
glm.pred2[glm.probs2>.5]="spam"
tlb2 = table(glm.pred2,validationSet$Prediction)
mean(glm.pred2!=validationSet$Prediction)

## [1] 0.287703

#cubic
train_model3=glm(Prediction~poly((will+gas+deal+meter+please+attached+day+only+http+volume+contract+texas+nom+questions+change+thanks+money+rate+best+houston+feb+prices+move+neon+nor+dr+z+gra+rev+alt),3), family = binomial,data = trainVali)
glm.probs3=predict(train_model3,validationSet,type = "response")
glm.pred3=rep("non-spame",2586)
glm.pred3[glm.probs3>.5]="spam"
tlb3 = table(glm.pred3,validationSet$Prediction)
mean(glm.pred3!=validationSet$Prediction)

## [1] 0.996133

The validation set error rate for cubic classification is 99.61%. Leave-One-Out Cross-Validation (LOOCV) I use thed cv.glm() function in boot library in R to carry out Leave-One-Out Cross-Validation. I run the LOOCV for complex polynomial fits. I used loop to iteratively fit polynomial regressions for polynomial of order i = 1 to i=3. Using the Leave-One-Out Cross-Validation approach, the error rates for the model with linear, quadratic and cubic terms are 0.2900232, 0.2906032 and 0.2913766 respectively.

###LOOCV

library(boot)

## 
## Attaching package: 'boot'

## The following object is masked from 'package:lattice':
## 
##     melanoma

## The following object is masked from 'package:psych':
## 
##     logit

# Cost function for a binary classifier suggested by boot package
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
cv.error=rep(0,3)
for(i in 1:3){
  model=glm(Prediction~poly((will+gas+deal+meter+please+attached+day+only+http+volume+contract+texas+nom+questions+change+thanks+money+rate+best+houston+feb+prices+move+neon+nor+dr+z+gra+rev+alt),i), family = binomial, data = data)
  cv.error[i]=cv.glm(data,model,cost=cost)$delta[1]
}
cv.error

## [1] 0.2900232 0.2906032 0.2913766

# Histogram of the Test Error
hist(cv.error,xlab='Test Error',ylab='Freq',main='Test Error LOOCV',
     col='cyan',border='blue',density=30)

From above procedure, I used loop to iteratively fit polynomial regressions for polynomial of order i = 1 to i=3. I Computed the Leave-One-Out cross-validation. The error rates for the model with linear, quadratic and cubic terms are 0.2900232, 0.2906032 and 0.2913766 respectively. K-fold Cross-Validation I use cv.glm() function to implement k-fold CV. I set K = 5 first and k=10. For each procedure, I get CV errors corresponding to polynomial fits for orders 1 to 5.

##K-fOLD Cross-Validation
#K=5
set.seed(17)
cv.error.5=rep(0,5)
for(i in 1:5){
  model=glm(Prediction~poly((will+gas+deal+meter+please+attached+day+only+http+volume+contract+texas+nom+questions+change+thanks+money+rate+best+houston+feb+prices+move+neon+nor+dr+z+gra+rev+alt),i), family = binomial, data = data)
  cv.error.5[i]=cv.glm(data,model,K=5,cost=cost)$delta[1]
}

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

cv.error.5

## [1] 0.2900232 0.2907966 0.2919567 0.3604022 0.2900232

# Histogram of the Test Error
hist(cv.error.5,xlab='Test Error',ylab='Freq',main='Test Error LOOCV',
     col='cyan',border='blue',density=30)

The obtained error rates for the model for polynomials from i=1 to i=5 are 0.2900232, 0.2907966, 0.2919567, 0.3604022, and 0.2900232 respectively. Then I computed the K-Fold cross-validation with k=10.

#K-10
set.seed(17)
cv.error.10=rep(0,5)
for(i in 1:5){
  model=glm(Prediction~poly((will+gas+deal+meter+please+attached+day+only+http+volume+contract+texas+nom+questions+change+thanks+money+rate+best+houston+feb+prices+move+neon+nor+dr+z+gra+rev+alt),i), family = binomial, data = data)
  cv.error.10[i]=cv.glm(data,model,K=10,cost=cost)$delta[1]
}

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

cv.error.10

## [1] 0.2900232 0.2904099 0.2909899 0.3766435 0.2900232

# Histogram of the Test Error
hist(cv.error.10,xlab='Test Error',ylab='Freq',main='Test Error LOOCV',
     col='cyan',border='blue',density=30)

The error rates for the model for polynomials from i=1 to i=5 are 0.2900232, 0.2904099, 0.2909899, 0.376643, and 0.2900232. Bootstrap I use the boot method within the trainControl function in R.

###Bootstrap

# define training control
train_control <- trainControl(method="boot", number=100)
# train the model
model <- train(Prediction~., data=data, trControl=train_control, method="glm")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# summarize results
print(model)

## Generalized Linear Model 
## 
## 5172 samples
##   30 predictor
##    2 classes: 'non-spam', 'spam' 
## 
## No pre-processing
## Resampling: Bootstrapped (100 reps) 
## Summary of sample sizes: 5172, 5172, 5172, 5172, 5172, 5172, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9000618  0.7553677

set.seed(1)
boot.fn=function(data,index){
  model=glm(Prediction~., family = binomial, data = data,subset = index)
  return(coef(model))
}

boot.fn(data,sample(100,100,replace = T))

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##  (Intercept)         will          gas         deal        meter       please 
##   -1.0986123   -0.4859151   -2.8877542   -6.9200400   -7.4100141   19.2179618 
##     attached          day         only         http       volume     contract 
##  -12.5209255   -9.7188460  -53.4317987   -3.0303465   13.4117469    3.4364595 
##        texas          nom    questions       change       thanks        money 
##  -13.3013765   -9.2058518  -37.5805371   20.4434598  -40.0285383   95.6382930 
##         rate         best      houston          feb       prices         move 
##  180.8070450  -17.5225854  -11.6866038           NA           NA   59.7476567 
##         neon          nor           dr            z          gra          rev 
##           NA   22.2125313   28.9927927    9.8944433   -4.1235508  -42.1576960 
##          alt 
## -147.6270061

boot(data,boot.fn,1000)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1*  -0.012790682  4.874065e+10 2.943937e+13
## t2*  -0.447671952 -2.159129e+11 4.829783e+12
## t3*  -0.242691638 -2.598755e+11 5.905677e+12
## t4*  -0.748635159 -2.649525e+11 5.956364e+12
## t5*  -1.308690940 -2.651894e+11 5.982503e+12
## t6*  -0.414645830 -3.156910e+11 7.227420e+12
## t7*  -4.714749568 -2.616930e+12 6.112282e+13
## t8*  -0.338068490 -2.282863e+11 5.122801e+12
## t9*   0.959794909  5.195576e+11 1.162567e+13
## t10*  0.974065996  5.269677e+11 1.198293e+13
## t11* -1.223554480 -3.855077e+11 8.616480e+12
## t12* -0.297113255 -1.279852e+11 3.862272e+12
## t13* -1.204828464 -1.464947e+12 3.279777e+13
## t14* -0.369976405 -2.310518e+11 6.708924e+12
## t15* -2.097083798 -1.085542e+12 2.460433e+13
## t16* -0.903930148 -5.167748e+11 1.157921e+13
## t17* -1.994110595 -8.870422e+11 2.101658e+13
## t18*  2.555173279  1.159264e+12 2.736658e+13
## t19*  0.303495513  1.771778e+11 4.280029e+12
## t20*  1.298562255  6.732227e+11 1.516689e+13
## t21* -2.799894081 -1.186080e+12 2.653094e+13
## t22* -1.423951196 -7.040938e+11 1.574437e+13
## t23*  1.616134451  6.374309e+11 1.548012e+13
## t24*  1.337747808  8.540582e+11 1.979485e+13
## t25* -2.182367358 -2.085566e+12 4.938785e+13
## t26* -0.002473625  8.097286e+10 1.810405e+12
## t27*  0.338413967  1.871934e+11 4.231227e+12
## t28*  0.221679387  1.297736e+11 3.192075e+12
## t29*  0.443850360  3.340825e+11 7.466874e+12
## t30*  0.138100274  1.834406e+10 4.131979e+11
## t31*  0.453868867  2.363850e+11 5.702862e+12

Model Selection and Regularization Subset Selection Best Subset Selection I apply the best subset selection approach on the data set. I use the bestglm package in R package to carryout subset selection. I use logistic regression for illustration. This package is dependent on the leaps package that is used for linear regression but can also be used for logistic regression. Asterisk indicates that a given variable is included in the corresponding model. From above, the output indicated that the best one-variable model contains thanks. Best two-variable model contains only thanks and dr, best three-variable model contains thanks, dr and attached etc. I run the above procedure again but set nvmax to 30 variables. Parameter nvmax is used to set the maximum number of variables you desire the function to use.

###Model Selection and Regularization
#Best subset selection

library(leaps)
regfit.full = regsubsets(Prediction~.,data)
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Prediction ~ ., data)
## 30 Variables  (and intercept)
##           Forced in Forced out
## will          FALSE      FALSE
## gas           FALSE      FALSE
## deal          FALSE      FALSE
## meter         FALSE      FALSE
## please        FALSE      FALSE
## attached      FALSE      FALSE
## day           FALSE      FALSE
## only          FALSE      FALSE
## http          FALSE      FALSE
## volume        FALSE      FALSE
## contract      FALSE      FALSE
## texas         FALSE      FALSE
## nom           FALSE      FALSE
## questions     FALSE      FALSE
## change        FALSE      FALSE
## thanks        FALSE      FALSE
## money         FALSE      FALSE
## rate          FALSE      FALSE
## best          FALSE      FALSE
## houston       FALSE      FALSE
## feb           FALSE      FALSE
## prices        FALSE      FALSE
## move          FALSE      FALSE
## neon          FALSE      FALSE
## nor           FALSE      FALSE
## dr            FALSE      FALSE
## z             FALSE      FALSE
## gra           FALSE      FALSE
## rev           FALSE      FALSE
## alt           FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          will gas deal meter please attached day only http volume contract
## 1  ( 1 ) " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 2  ( 1 ) " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 3  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 4  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 5  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 6  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 7  ( 1 ) " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 8  ( 1 ) "*"  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
##          texas nom questions change thanks money rate best houston feb prices
## 1  ( 1 ) " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 2  ( 1 ) " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 3  ( 1 ) " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 4  ( 1 ) " "   " " " "       " "    "*"    " "   " "  "*"  " "     " " " "   
## 5  ( 1 ) " "   "*" " "       " "    "*"    " "   " "  "*"  " "     " " " "   
## 6  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 7  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 8  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
##          move neon nor dr  z   gra rev alt
## 1  ( 1 ) " "  " "  " " " " " " " " " " " "
## 2  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 3  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 4  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 5  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 6  ( 1 ) " "  " "  " " "*" " " " " " " " "
## 7  ( 1 ) " "  " "  " " " " "*" " " " " " "
## 8  ( 1 ) " "  " "  " " " " "*" " " " " " "

library(leaps)
regfit.full = regsubsets(Prediction~.,data,nvmax = 30)
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Prediction ~ ., data, nvmax = 30)
## 30 Variables  (and intercept)
##           Forced in Forced out
## will          FALSE      FALSE
## gas           FALSE      FALSE
## deal          FALSE      FALSE
## meter         FALSE      FALSE
## please        FALSE      FALSE
## attached      FALSE      FALSE
## day           FALSE      FALSE
## only          FALSE      FALSE
## http          FALSE      FALSE
## volume        FALSE      FALSE
## contract      FALSE      FALSE
## texas         FALSE      FALSE
## nom           FALSE      FALSE
## questions     FALSE      FALSE
## change        FALSE      FALSE
## thanks        FALSE      FALSE
## money         FALSE      FALSE
## rate          FALSE      FALSE
## best          FALSE      FALSE
## houston       FALSE      FALSE
## feb           FALSE      FALSE
## prices        FALSE      FALSE
## move          FALSE      FALSE
## neon          FALSE      FALSE
## nor           FALSE      FALSE
## dr            FALSE      FALSE
## z             FALSE      FALSE
## gra           FALSE      FALSE
## rev           FALSE      FALSE
## alt           FALSE      FALSE
## 1 subsets of each size up to 30
## Selection Algorithm: exhaustive
##           will gas deal meter please attached day only http volume contract
## 1  ( 1 )  " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 2  ( 1 )  " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 3  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 4  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 5  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 6  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 7  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 8  ( 1 )  "*"  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 9  ( 1 )  "*"  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 10  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 11  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 12  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 13  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 14  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 15  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 16  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  " "  " "    " "     
## 17  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  " "  " "    " "     
## 18  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  "*"  " "    " "     
## 19  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  "*"  "*"    " "     
## 20  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  "*"  "*"    " "     
## 21  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    " "     
## 22  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    " "     
## 23  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    " "     
## 24  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 25  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 26  ( 1 ) "*"  " " "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 27  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 28  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 29  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 30  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
##           texas nom questions change thanks money rate best houston feb prices
## 1  ( 1 )  " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 2  ( 1 )  " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 3  ( 1 )  " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 4  ( 1 )  " "   " " " "       " "    "*"    " "   " "  "*"  " "     " " " "   
## 5  ( 1 )  " "   "*" " "       " "    "*"    " "   " "  "*"  " "     " " " "   
## 6  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 7  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 8  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 9  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 10  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 11  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 12  ( 1 ) " "   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 13  ( 1 ) " "   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 14  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 15  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 16  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 17  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 18  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 19  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 20  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 21  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 22  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     "*" "*"   
## 23  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 24  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 25  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 26  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 27  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 28  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   "*"  "*"  "*"     "*" "*"   
## 29  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   "*"  "*"  "*"     "*" "*"   
## 30  ( 1 ) "*"   "*" "*"       "*"    "*"    "*"   "*"  "*"  "*"     "*" "*"   
##           move neon nor dr  z   gra rev alt
## 1  ( 1 )  " "  " "  " " " " " " " " " " " "
## 2  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 3  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 4  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 5  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 6  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 7  ( 1 )  " "  " "  " " " " "*" " " " " " "
## 8  ( 1 )  " "  " "  " " " " "*" " " " " " "
## 9  ( 1 )  " "  " "  " " " " "*" "*" " " " "
## 10  ( 1 ) " "  " "  " " " " "*" " " " " " "
## 11  ( 1 ) "*"  " "  " " " " "*" " " " " "*"
## 12  ( 1 ) "*"  " "  " " " " "*" " " " " "*"
## 13  ( 1 ) "*"  " "  " " " " "*" "*" " " "*"
## 14  ( 1 ) "*"  " "  " " " " "*" "*" " " "*"
## 15  ( 1 ) "*"  " "  " " " " "*" "*" " " "*"
## 16  ( 1 ) "*"  " "  " " " " "*" "*" " " "*"
## 17  ( 1 ) "*"  "*"  " " " " "*" "*" " " "*"
## 18  ( 1 ) "*"  "*"  " " " " "*" "*" " " "*"
## 19  ( 1 ) "*"  "*"  " " " " "*" "*" " " "*"
## 20  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 21  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 22  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 23  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 24  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 25  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 26  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 27  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 28  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 29  ( 1 ) "*"  "*"  "*" "*" "*" "*" "*" "*"
## 30  ( 1 ) "*"  "*"  "*" "*" "*" "*" "*" "*"

names(summary(regfit.full))

## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

reg.summary=summary(regfit.full)

par(mfrow=c(2,2))
which.min(reg.summary$bic)

## [1] 22

The summary() function returns R2, RSS, adjusted R2, Cp, and BIC. Above shows the R-square values of all the 30 variables. Below I plot the BIC for all of the models to set the overall best model and indicates model with smallest statistics.

plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(22, reg.summary$bic[22], col = "red", cex = 2, pch = 20)

plot(regfit.full,scale ="bic")

The best over model will contain only the variables shown above. I test this by fitting a logistic regression model with only these twenty-two variables and checking the performance of this model.

coef(regfit.full,22)

## (Intercept)        will        deal      please    attached         day 
##  1.36228332 -0.02256815 -0.02436799 -0.02145805 -0.12421438 -0.01910964 
##        only        http      volume       texas         nom   questions 
##  0.08194146  0.01519859 -0.02212577 -0.04741141 -0.04759202 -0.06310902 
##      thanks       money        best         feb      prices        move 
## -0.06902932  0.17505928  0.17712165 -0.02970396  0.10702211  0.08388627 
##        neon          dr           z         gra         alt 
## -0.12764599  0.02046628  0.01068268  0.02097053  0.05934660

glm.fits3=glm(Prediction~will+deal+please+attached+day+only+http+volume+texas+nom+questions+thanks+money+best+feb+prices+move+neon+dr+z+gra+rev+alt, family = binomial, data = train)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

glm.probs3=predict(glm.fits3,test,type = "response")
glm.pred3=rep("non-spam",5172)
glm.pred3[glm.probs3>.5]="spam"
table(glm.pred3,data$Prediction)

##           
## glm.pred3  non-spam spam
##   non-spam     2782 1130
##   spam          890  370

mean(glm.pred3==test$Prediction)

## [1] 0.8994586

The performance of the model using twenty-two variables only is better than using all the model. I observe a 90% accuracy. Stepwise Selection Forward Stepwise Selection The regsubsets() function will also be used for forward stepwise selection, by adding the parameter method=“forward”.

##Stepwise Selection
#Forward Stepwise Selection

regfit.full = regsubsets(Prediction~.,data,nvmax = 30, method = "forward")
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Prediction ~ ., data, nvmax = 30, method = "forward")
## 30 Variables  (and intercept)
##           Forced in Forced out
## will          FALSE      FALSE
## gas           FALSE      FALSE
## deal          FALSE      FALSE
## meter         FALSE      FALSE
## please        FALSE      FALSE
## attached      FALSE      FALSE
## day           FALSE      FALSE
## only          FALSE      FALSE
## http          FALSE      FALSE
## volume        FALSE      FALSE
## contract      FALSE      FALSE
## texas         FALSE      FALSE
## nom           FALSE      FALSE
## questions     FALSE      FALSE
## change        FALSE      FALSE
## thanks        FALSE      FALSE
## money         FALSE      FALSE
## rate          FALSE      FALSE
## best          FALSE      FALSE
## houston       FALSE      FALSE
## feb           FALSE      FALSE
## prices        FALSE      FALSE
## move          FALSE      FALSE
## neon          FALSE      FALSE
## nor           FALSE      FALSE
## dr            FALSE      FALSE
## z             FALSE      FALSE
## gra           FALSE      FALSE
## rev           FALSE      FALSE
## alt           FALSE      FALSE
## 1 subsets of each size up to 30
## Selection Algorithm: forward
##           will gas deal meter please attached day only http volume contract
## 1  ( 1 )  " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 2  ( 1 )  " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 3  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 4  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 5  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 6  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 7  ( 1 )  "*"  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 8  ( 1 )  "*"  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 9  ( 1 )  "*"  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 10  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " " "  " "  " "    " "     
## 11  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 12  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 13  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 14  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 15  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 16  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  " "  " "    " "     
## 17  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  " "  " "    " "     
## 18  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  "*"  " "    " "     
## 19  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  "*"  " "    " "     
## 20  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  "*"  "*"    " "     
## 21  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    " "     
## 22  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    " "     
## 23  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    " "     
## 24  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 25  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 26  ( 1 ) "*"  " " "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 27  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 28  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 29  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 30  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
##           texas nom questions change thanks money rate best houston feb prices
## 1  ( 1 )  " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 2  ( 1 )  " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 3  ( 1 )  " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 4  ( 1 )  " "   " " " "       " "    "*"    " "   " "  "*"  " "     " " " "   
## 5  ( 1 )  " "   "*" " "       " "    "*"    " "   " "  "*"  " "     " " " "   
## 6  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 7  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 8  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 9  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 10  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 11  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 12  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 13  ( 1 ) "*"   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 14  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 15  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 16  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 17  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 18  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 19  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 20  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 21  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 22  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     "*" "*"   
## 23  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 24  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 25  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 26  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 27  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 28  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   "*"  "*"  "*"     "*" "*"   
## 29  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   "*"  "*"  "*"     "*" "*"   
## 30  ( 1 ) "*"   "*" "*"       "*"    "*"    "*"   "*"  "*"  "*"     "*" "*"   
##           move neon nor dr  z   gra rev alt
## 1  ( 1 )  " "  " "  " " " " " " " " " " " "
## 2  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 3  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 4  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 5  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 6  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 7  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 8  ( 1 )  " "  " "  " " "*" " " " " " " " "
## 9  ( 1 )  " "  " "  " " "*" "*" " " " " " "
## 10  ( 1 ) " "  " "  " " "*" "*" " " " " " "
## 11  ( 1 ) " "  " "  " " "*" "*" " " " " " "
## 12  ( 1 ) " "  " "  " " "*" "*" " " " " "*"
## 13  ( 1 ) " "  " "  " " "*" "*" " " " " "*"
## 14  ( 1 ) " "  " "  " " "*" "*" " " " " "*"
## 15  ( 1 ) "*"  " "  " " "*" "*" " " " " "*"
## 16  ( 1 ) "*"  " "  " " "*" "*" " " " " "*"
## 17  ( 1 ) "*"  "*"  " " "*" "*" " " " " "*"
## 18  ( 1 ) "*"  "*"  " " "*" "*" " " " " "*"
## 19  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 20  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 21  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 22  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 23  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 24  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 25  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 26  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 27  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 28  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 29  ( 1 ) "*"  "*"  "*" "*" "*" "*" "*" "*"
## 30  ( 1 ) "*"  "*"  "*" "*" "*" "*" "*" "*"

Following the above forward stepwise selection, the best one variable model contains only thanks, and the best two-variable contains thanks and dr and so on. Due to the nature of our data, the result of forward stepwise is similar to what we got with best stepwise. Backward Stepwise Selection The regsubsets() function can also be used for backward stepwise selection, by adding the parameter method=“backward”.

#Backward Stepwise Selection

#The regsubsets() function can also be used for backward stepwise selection, 
#by adding the parameter method=“backward”.

regfit.full = regsubsets(Prediction~.,data,nvmax = 30, method = "backward")
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Prediction ~ ., data, nvmax = 30, method = "backward")
## 30 Variables  (and intercept)
##           Forced in Forced out
## will          FALSE      FALSE
## gas           FALSE      FALSE
## deal          FALSE      FALSE
## meter         FALSE      FALSE
## please        FALSE      FALSE
## attached      FALSE      FALSE
## day           FALSE      FALSE
## only          FALSE      FALSE
## http          FALSE      FALSE
## volume        FALSE      FALSE
## contract      FALSE      FALSE
## texas         FALSE      FALSE
## nom           FALSE      FALSE
## questions     FALSE      FALSE
## change        FALSE      FALSE
## thanks        FALSE      FALSE
## money         FALSE      FALSE
## rate          FALSE      FALSE
## best          FALSE      FALSE
## houston       FALSE      FALSE
## feb           FALSE      FALSE
## prices        FALSE      FALSE
## move          FALSE      FALSE
## neon          FALSE      FALSE
## nor           FALSE      FALSE
## dr            FALSE      FALSE
## z             FALSE      FALSE
## gra           FALSE      FALSE
## rev           FALSE      FALSE
## alt           FALSE      FALSE
## 1 subsets of each size up to 30
## Selection Algorithm: backward
##           will gas deal meter please attached day only http volume contract
## 1  ( 1 )  " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 2  ( 1 )  " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 3  ( 1 )  " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 4  ( 1 )  " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 5  ( 1 )  " "  " " " "  " "   " "    " "      " " " "  " "  " "    " "     
## 6  ( 1 )  " "  " " " "  " "   " "    "*"      " " " "  " "  " "    " "     
## 7  ( 1 )  " "  " " " "  " "   " "    "*"      " " "*"  " "  " "    " "     
## 8  ( 1 )  "*"  " " " "  " "   " "    "*"      " " "*"  " "  " "    " "     
## 9  ( 1 )  "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 10  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 11  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 12  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 13  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 14  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 15  ( 1 ) "*"  " " "*"  " "   " "    "*"      " " "*"  " "  " "    " "     
## 16  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  " "  " "    " "     
## 17  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  " "  " "    " "     
## 18  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  "*"  " "    " "     
## 19  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  "*"  "*"    " "     
## 20  ( 1 ) "*"  " " "*"  " "   " "    "*"      "*" "*"  "*"  "*"    " "     
## 21  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    " "     
## 22  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    " "     
## 23  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    " "     
## 24  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 25  ( 1 ) "*"  " " "*"  " "   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 26  ( 1 ) "*"  " " "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 27  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 28  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 29  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
## 30  ( 1 ) "*"  "*" "*"  "*"   "*"    "*"      "*" "*"  "*"  "*"    "*"     
##           texas nom questions change thanks money rate best houston feb prices
## 1  ( 1 )  " "   " " " "       " "    "*"    " "   " "  " "  " "     " " " "   
## 2  ( 1 )  " "   " " " "       " "    "*"    "*"   " "  " "  " "     " " " "   
## 3  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  " "  " "     " " " "   
## 4  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 5  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 6  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 7  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 8  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 9  ( 1 )  " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 10  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 11  ( 1 ) " "   "*" " "       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 12  ( 1 ) " "   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 13  ( 1 ) " "   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 14  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " " "   
## 15  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 16  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 17  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 18  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 19  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 20  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 21  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     " " "*"   
## 22  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  " "     "*" "*"   
## 23  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 24  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 25  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 26  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 27  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   " "  "*"  "*"     "*" "*"   
## 28  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   "*"  "*"  "*"     "*" "*"   
## 29  ( 1 ) "*"   "*" "*"       " "    "*"    "*"   "*"  "*"  "*"     "*" "*"   
## 30  ( 1 ) "*"   "*" "*"       "*"    "*"    "*"   "*"  "*"  "*"     "*" "*"   
##           move neon nor dr  z   gra rev alt
## 1  ( 1 )  " "  " "  " " " " " " " " " " " "
## 2  ( 1 )  " "  " "  " " " " " " " " " " " "
## 3  ( 1 )  " "  " "  " " " " " " " " " " " "
## 4  ( 1 )  " "  " "  " " " " " " " " " " " "
## 5  ( 1 )  " "  " "  " " " " "*" " " " " " "
## 6  ( 1 )  " "  " "  " " " " "*" " " " " " "
## 7  ( 1 )  " "  " "  " " " " "*" " " " " " "
## 8  ( 1 )  " "  " "  " " " " "*" " " " " " "
## 9  ( 1 )  " "  " "  " " " " "*" " " " " " "
## 10  ( 1 ) "*"  " "  " " " " "*" " " " " " "
## 11  ( 1 ) "*"  " "  " " " " "*" " " " " "*"
## 12  ( 1 ) "*"  " "  " " " " "*" " " " " "*"
## 13  ( 1 ) "*"  " "  " " " " "*" "*" " " "*"
## 14  ( 1 ) "*"  " "  " " " " "*" "*" " " "*"
## 15  ( 1 ) "*"  " "  " " " " "*" "*" " " "*"
## 16  ( 1 ) "*"  " "  " " " " "*" "*" " " "*"
## 17  ( 1 ) "*"  "*"  " " " " "*" "*" " " "*"
## 18  ( 1 ) "*"  "*"  " " " " "*" "*" " " "*"
## 19  ( 1 ) "*"  "*"  " " " " "*" "*" " " "*"
## 20  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 21  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 22  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 23  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 24  ( 1 ) "*"  "*"  " " "*" "*" "*" " " "*"
## 25  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 26  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 27  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 28  ( 1 ) "*"  "*"  "*" "*" "*" "*" " " "*"
## 29  ( 1 ) "*"  "*"  "*" "*" "*" "*" "*" "*"
## 30  ( 1 ) "*"  "*"  "*" "*" "*" "*" "*" "*"

Following the above backward stepwise selection, the best one variable model contains only thanks, which is similar to our previous results. The best two-variable contains thanks and money which is different from the previous results. Shrinkage Methods For shrinkage methods I carry out two approaches, regression model and lasso models. I use glmnet R package to perform both ridge regression model and lasso models. Ridge regression I use the glmnet() function in R with parameter alpha=0 to fit a ridge regression model. I make use of other R functions to prepare the data for ridge regression. I also convert the outcome class to numerical variable.

##Shrinkage Methods
#For shrinkage methods we carry out two approaches. Namely, regression model and lasso models. We use 
#glmnet R package to perform both ridge regression model and lasso models.


#Ridge regression
#We use the glmnet() function in R with parameter alpha=0 to fit a ridge regression model. We make use 
#of other R functions to prepare the data for ridge regression. We also convert the outcome class to 
#numerical variable.

library(caret)
library(glmnet)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 3.0-2

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

## The following object is masked from 'package:pastecs':
## 
##     extract

library(dplyr)

set.seed(123)
training.samples <- data$Prediction %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]

# Find the best lambda using cross-validation
set.seed(123)
# Dumy code categorical predictor variables
x = model.matrix(Prediction~., train.data)[,-1]
# Convert the outcome (class) to a numerical variable
y = ifelse(train.data$Prediction == "spam", 1, 0)
cv.ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial")

#Next, we fit a model using our ridge regression value and then observe the performance accuracy.

# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
                lambda = cv.ridge$lambda.min)

# Display regression coefficients
coef(model)

## 31 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) -0.25848956
## will        -0.17880875
## gas         -0.11112661
## deal        -0.30069583
## meter       -0.15164634
## please      -0.21130137
## attached    -1.92224297
## day         -0.13336439
## only         0.33992603
## http         0.30844379
## volume      -0.26657215
## contract     .         
## texas       -0.33539089
## nom         -0.44951505
## questions   -0.76333325
## change      -0.14187050
## thanks      -0.97727006
## money        1.11092299
## rate         0.11311596
## best         0.92539465
## houston     -0.03212444
## feb         -0.15531511
## prices       0.80674915
## move         0.44016891
## neon        -0.43365642
## nor          .         
## dr           0.16852161
## z            0.05985237
## gra          0.16476605
## rev          .         
## alt          0.15862611

# Make predictions on the test data
x.test <- model.matrix(Prediction ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "spam", "non-spam")

# Model accuracy
observed.classes <- test.data$Prediction
mean(predicted.classes == observed.classes)

## [1] 0.8220503

From the above procedure using ridge regression method, I an accuracy of 82.2%. I make a plot of the cv.ridge variable.

Lasso Lasso works similar to ridge regression with the difference that lasso has penalty of the sum of the absolute values of the coefficients. The penalty for ridge regression is the sum of the squares of the cofficients. I use glmnet also for lasso but with alpha parameter set to 1.

##The Lasso
#Lasso works similar to ridge regression with the difference that lasso has penalty of 
#the sum of the absolute values of the coefficients. The penalty for ridge regression is 
#the sum of the squares of the cofficients. We will use glmnet also for lasso but with alpha parameter set to 1.

library(magrittr)
library(dplyr)
library(caret)

set.seed(123)
training.samples <- data$Prediction %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]

# Find the best lambda using cross-validation
set.seed(123)
# Dumy code categorical predictor variables
x <- model.matrix(Prediction~., train.data)[,-1]
# Convert the outcome to a numerical variable
y <- ifelse(train.data$Prediction == "spam", 1, 0)


cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
                lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(model)

## 31 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept)  0.01809849
## will        -0.40244418
## gas         -0.25990722
## deal        -0.81204520
## meter       -1.02214804
## please      -0.39449267
## attached    -4.41784262
## day         -0.38566539
## only         0.98405024
## http         0.85882689
## volume      -1.12034371
## contract    -0.19715559
## texas       -1.21494405
## nom         -0.43289720
## questions   -1.79032449
## change      -0.94831179
## thanks      -1.79864837
## money        2.20652720
## rate         0.57054676
## best         1.35444913
## houston     -2.55046822
## feb         -1.03920148
## prices       2.16679557
## move         1.17873411
## neon        -1.86161430
## nor         -0.09326442
## dr           0.30134224
## z            0.13584882
## gra          0.37171235
## rev          0.20514793
## alt          0.39997250

# Make predictions on the test data
x.test <- model.matrix(Prediction ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "spam", "non-spam")
# Model accuracy
observed.classes <- test.data$Prediction
mean(predicted.classes == observed.classes)

## [1] 0.8858801

#From above, using lasso gave an accuracy of 88.58% which is better than ridge regression 
#with accuracy of 82.2%.


set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)

The above plot displays the cross-validation error according to the log of lambda. The vertical dash line indicates that the log of the optimal value of lambda which minimizes the prediction error. Dimension Reduction Methods There are two dimension reduction methods I tried in this session. These are principal components regression and partial least squares method. Principal Components Regression Principal components regression (PCT) is performed using the prcomp() function in R.

##Dimension Reduction Methods
#There are two dimension reduction methods we carry out in this session. These are methods 
#are principal components regression and partial least squares method.

#Principal Components Regression
#Principal components regression (PCT) can be performed using the prcomp() function in R.

data.pr <- prcomp(data[c(1:30)], center = TRUE, scale = TRUE)
summary(data.pr)

## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9687 1.59735 1.22170 1.19692 1.14375 1.10398 1.06592
## Proportion of Variance 0.1292 0.08505 0.04975 0.04775 0.04361 0.04063 0.03787
## Cumulative Proportion  0.1292 0.21424 0.26399 0.31175 0.35535 0.39598 0.43385
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.05711 1.02335 1.01485 1.00306 0.97917 0.97025 0.94265
## Proportion of Variance 0.03725 0.03491 0.03433 0.03354 0.03196 0.03138 0.02962
## Cumulative Proportion  0.47110 0.50601 0.54034 0.57388 0.60584 0.63722 0.66683
##                           PC15    PC16    PC17    PC18    PC19   PC20    PC21
## Standard deviation     0.93975 0.91195 0.88628 0.87834 0.84348 0.8289 0.81349
## Proportion of Variance 0.02944 0.02772 0.02618 0.02572 0.02372 0.0229 0.02206
## Cumulative Proportion  0.69627 0.72399 0.75018 0.77589 0.79961 0.8225 0.84457
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.80394 0.79431 0.77171 0.73795 0.72572 0.69868 0.66898
## Proportion of Variance 0.02154 0.02103 0.01985 0.01815 0.01756 0.01627 0.01492
## Cumulative Proportion  0.86611 0.88714 0.90699 0.92515 0.94270 0.95897 0.97389
##                           PC29   PC30
## Standard deviation     0.64365 0.6074
## Proportion of Variance 0.01381 0.0123
## Cumulative Proportion  0.98770 1.0000

The prcomp() function runs PCA on the predictor variables. I excluded the response variable. I tell the R function to center and scale the data and then standardize the data. I then displayed the summary. The summary call displays the standard deviation (eigenvalues), proportion of variance and the cumulative proportion. I make a plot of the eigenvalues pf the first 31 PCs and cumulative variance.

screeplot(data.pr, type = "l", npcs = 31, main = "Screeplot of the first 31 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
       col=c("red"), lty=5, cex=0.6)

cumpro <- cumsum(data.pr$sdev^2 / sum(data.pr$sdev^2))
plot(cumpro[0:31], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 11, col="blue", lty=5)
abline(h = 0.88759, col="blue", lty=5)

plot(data.pr$x[,1],data.pr$x[,2], xlab="PC1 (44.3%)", ylab = "PC2 (19%)", main = "PC1 / PC2 - plot")

#Let us include the response variable in the plot and see how it looks.

library("factoextra")
fviz_pca_ind(data.pr, geom.ind = "point", pointshape = 21,
             pointsize = 2,
             fill.ind = data$Prediction,
             col.ind = "black",
             palette = "jco",
             addEllipses = TRUE,
             label = "var",
             col.var = "black",
             repel = TRUE,
             legend.title = "Prediction") +
  ggtitle("2D PCA-plot from 30 feature dataset") +
  theme(plot.title = element_text(hjust = 0.5))

In above procedure, I made same plot as before but added some colors corresponding to predictor labels. This added beauty to the PCA. I used the first two components and I can see some separation between spam and non-spam emails. Partial Least Squares To carry out PLS I use caret package. I start by preprocessing by removing the zero variance predictors, center and scale all the remaining predictors using the preProc argument. I train the PLS model and then check the cross-validation profile.

#Partial Least Squares
#To carry out PLS we use of caret package. We will start by preprocessing by removing the 
#zero variance predictors, center and scale all the remaining predictors using the preProc argument. 
#We train the PLS model and then check the cross-validation profile.

set.seed (1)
dataPLS = data
dataPLS$class = factor(dataPLS$Prediction)
# Compile cross-validation settings
set.seed(100)
myfolds <- createMultiFolds(dataPLS$class, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")

# Train PLS model
mod1 <- train(Prediction ~ ., data = dataPLS,
              method = "pls",
              metric = "Accuracy",
              tuneLength = 20,
              trControl = control,
              preProc = c("zv","center","scale"))

plot(varImp(mod1), 25, main = "PLS-DA")

## 
## Attaching package: 'pls'

## The following object is masked from 'package:corrplot':
## 
##     corrplot

## The following object is masked from 'package:caret':
## 
##     R2

## The following object is masked from 'package:stats':
## 
##     loadings

The above procedure displayed the 25 most important variables given in relative levels (scaled to the range 0 - 100). The response variable has a relative level of 100. Moving Beyond Linearity In this session, I focus on generalized addictive models only. There are other models that can also be used here but I will limit the scope to GAM. Generalized Additive Models Generalized additive models (GAMs) presents framework to extending a standard linear model by allowing non-linear functions of each variables while maintaining additivity. I use the gam() function in the gam package in R. I fit a logistic regression Model using GAMs for probabilities of the Binary response values. One important feature of gam is its ability to allow non-linearity on multiple variables at the same time. GAM can be used with other non-linear functions. I run gam function on the data set. Then I fit the model using smoothing splines.

#Generalized Additive Models
#Generalized additive models (GAMs) presents framwork to extending a standard linear model by 
#allowing non-linear functions of each variables while maintaining additivity.

#We use the gam() function in the gam package in R. We are going to fit a logistic regression Model 
#using GAMs for probabilities of the Binary response values. One important feature of gam is its ability 
#to allow non-linearity on multiple variables at the sametime. GAM can be used with other non-linear functions.

#We run gam function on our data set. The We fit the model using smoothing splines.
library(gam)

## Loading required package: splines

## Loading required package: foreach

## 
## Attaching package: 'foreach'

## The following objects are masked from 'package:purrr':
## 
##     accumulate, when

## Loaded gam 1.16.1

#logistic Regression Model
logitgam1<-gam(I(Prediction) ~ s(will,df=4) + s(gas,df=4) + s(deal,df=4) + s(meter,df=4) + s(please,df=4) +s(attached,df=4)+s(day,df=4) + s(only,df=4) + s(http,df=4) + s(volume,df=4) + s(contract,df=4) + s(texas,df=4) + s(nom,df=4)+ s(questions,df=4)+ s(change,df=4)+ s(thanks,df=4)+ s(money,df=4)+ s(rate,df=4)+ s(best,df=4)+ s(houston,df=4)+ s(feb,df=4)+ s(prices,df=4)+ s(move,df=4)+ s(neon,df=4)+ s(nor,df=4)+ s(dr,df=4)+ s(z,df=4)+ s(gra,df=4)+ s(rev,df=4)+ s(alt,df=4),data=data,family=binomial)

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored

plot(logitgam1,se=T)

From the plots above, I can see that fitting non-linearity isn’t sufficient for most of the variables. Tree-Based methods Tree-based methods involve segmenting the predictor space into a number of simple regions. Based on the nature of the data set, I used tree-based methods for classification. Fitting Classification Trees The tree library is used to construct classification trees in R.

##Tree-Based methods
#Tree-based methods involve segmenting the predictor space into a number of simple regions. Bsed on 
#the nature of our data set, we will be using tree-based methods for classifiction.

#Fitting Classification Trees
#The tree library is used to construct classification trees in R. We will use these methods
#on our data set.

library(tree)

## Registered S3 method overwritten by 'tree':
##   method     from
##   print.tree cli

tree.data=tree(Prediction~.,data)
summary(tree.data)

## 
## Classification tree:
## tree(formula = Prediction ~ ., data = data)
## Variables actually used in tree construction:
##  [1] "http"      "thanks"    "attached"  "gas"       "gra"       "nom"      
##  [7] "meter"     "dr"        "money"     "questions"
## Number of terminal nodes:  11 
## Residual mean deviance:  0.6995 = 3610 / 5161 
## Misclassification error rate: 0.1827 = 945 / 5172

The summary() function shows that only 10 of the 30 variables are used in the tree construction. The variables are "http", "thanks", "attached", "gas", "gra", "nom", "meter", "dr", "money", and "questions". The number of nodes is 11, residual mean deviance is 0.70 and the misclassification error rate is 18.3%. Displaying the tree graphically.

plot(tree.data)
text(tree.data,pretty = 0)

I used plot() function to display the tree structure, and the text() function to display the node labels. The argument pretty=0 is to include the category names of the qualitative predictors rather than displaying a letter for each category. Terminal nodes are spam and non-spam. I print the output corresponding to each branch of the tree below.

tree.data

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 5172 6229.00 non-spam ( 0.70998 0.29002 )  
##     2) http < 0.5 4555 4848.00 non-spam ( 0.77585 0.22415 )  
##       4) thanks < 0.5 3201 3942.00 non-spam ( 0.69416 0.30584 )  
##         8) attached < 0.5 2592 3427.00 non-spam ( 0.62577 0.37423 )  
##          16) gas < 0.5 2042 2810.00 non-spam ( 0.55093 0.44907 )  
##            32) gra < 0.5 1678 2235.00 non-spam ( 0.61621 0.38379 )  
##              64) nom < 0.5 1390 1912.00 non-spam ( 0.55180 0.44820 )  
##               128) meter < 0.5 1262 1749.00 non-spam ( 0.51189 0.48811 ) *
##               129) meter > 0.5 128   54.30 non-spam ( 0.94531 0.05469 ) *
##              65) nom > 0.5 288  150.40 non-spam ( 0.92708 0.07292 ) *
##            33) gra > 0.5 364  409.40 spam ( 0.25000 0.75000 ) *
##          17) gas > 0.5 550  348.70 non-spam ( 0.90364 0.09636 )  
##            34) dr < 1.5 484  191.00 non-spam ( 0.95041 0.04959 ) *
##            35) dr > 1.5 66   90.52 non-spam ( 0.56061 0.43939 ) *
##         9) attached > 0.5 609   93.73 non-spam ( 0.98522 0.01478 ) *
##       5) thanks > 0.5 1354  374.40 non-spam ( 0.96898 0.03102 )  
##        10) money < 0.5 1316  255.50 non-spam ( 0.98024 0.01976 ) *
##        11) money > 0.5 38   51.73 non-spam ( 0.57895 0.42105 ) *
##     3) http > 0.5 617  655.90 spam ( 0.22366 0.77634 )  
##       6) questions < 0.5 559  503.20 spam ( 0.16637 0.83363 ) *
##       7) questions > 0.5 58   61.72 non-spam ( 0.77586 0.22414 ) *

#The above code shows the number of observations in the branch, the deviance, the overall prediction for 
#the branch (stable or unstable), and the fraction of observations in that branch that takes one values of 
#yes and no. The brances that lead to terminal nodes are indicated using asterisks.

#To evaluate the performance of the classification tree on the data, we train the tree on the train data and 
#estimate the test error using the test data.

set.seed(2)
train=sample(1:nrow(data),3879)
data.test = data[-train,]
Prediction.test = data$Prediction[-train]
tree.data=tree(Prediction~.,data,subset = train)
tree.pred=predict(tree.data,data.test,type = "class")
table(tree.pred,Prediction.test)

##           Prediction.test
## tree.pred  non-spam spam
##   non-spam      737   58
##   spam          172  326

dt =mean(tree.pred==Prediction.test)
dt

## [1] 0.8221191

#We split the observations into trainging set and test set. Build the tree using the training set, 
#and evaluate the performance on the test data. The predict() fucntion is used for this purpose. 
#This procedure gave 82.2% correct predictions of the locations of the test data set.

#Next, we carry out pruning of the tree to see if there can be improvement on the tree. 
#We use FUN=prune.misclass in order to indicate that we want the classification error rate to guide 
#the cross-validation and pruning process, rather than the default for the cv.tree() function, which is 
#deviance.

set.seed(3)
cv.data=cv.tree(tree.data,FUN=prune.misclass)
names(cv.data)

## [1] "size"   "dev"    "k"      "method"

cv.data

## $size
## [1] 14 11 10  6  2  1
## 
## $dev
## [1]  704  704  695  723  868 1116
## 
## $k
## [1]   -Inf   0.00  20.00  27.00  36.25 248.00
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

The above code shows the number of observations in the branch, the deviance, the overall prediction for the branch (spam or non-spam), and the fraction of observations in that branch that takes one values of yes and no. The branches that lead to terminal nodes are indicated using asterisks. To evaluate the performance of the classification tree on the data, I train the tree on the train data and estimate the test error using the test data. I split the observations into training set and test set. Build the tree using the training set, and evaluate the performance on the test data. The predict() function is used for this purpose. This procedure gave 82.2% correct predictions of the locations of the test data set. I carry out pruning of the tree to see if there can be improvement on the tree. I use FUN=prune.misclass in order to indicate that I want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance. The above procedure outputs the number of terminal nodes of each tree considered(size), value of cost-complexity parameter used(k), cross-validation error(dev) and method which is misclass. Tree with 14 terminal nodes results in the lowest cross-validation error rate, with 704 cross-validation errors.

par(mfrow=c(1,2))
plot(cv.data$size,cv.data$dev, type = "b")
plot(cv.data$k,cv.data$dev, type = "b")

prune.data=prune.misclass(tree.data,best = 14)
plot(prune.data)
text(prune.data,pretty=0)

To find out how well the pruned tree perform on the test data set, I apply the predict() function.

#To find out how well the pruned tree perform on the test data set, we apply the predict() function.

tree.pred2=predict(prune.data,data.test,type = "class")
table(tree.pred2,Prediction.test)

##           Prediction.test
## tree.pred2 non-spam spam
##   non-spam      737   58
##   spam          172  326

dt_pruned =mean(tree.pred2==Prediction.test)
dt_pruned

## [1] 0.8221191

Using the pruned tree to carry out prediction on test data gave 82.21% accuracy. The pruning process is expected to produce both an interpretable tree and also improved the classification accuracy in some cases. Bagging and Random Forests I apply bagging and random forests to the data set, using the randomForest package in R. Bagging is a special case of a random forest with m=p. RandomForest() function in R can be used for both random forests and bagging.

##Bagging and Random Forests
#We apply bagging and random forests to our data set, using the randomForest oackage in R. 
#Bagging is a speacil case of a random forest with m=p. RandomForest() function in R can be used for 
#both random forests and bagging.

library("randomForest")

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:psych':
## 
##     outlier

set.seed(1)
bag.data=randomForest(Prediction~.,data,subset=train,mtry=30,importance=TRUE)
bag.data

## 
## Call:
##  randomForest(formula = Prediction ~ ., data = data, mtry = 30,      importance = TRUE, subset = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 30
## 
##         OOB estimate of  error rate: 10.26%
## Confusion matrix:
##          non-spam spam class.error
## non-spam     2505  258  0.09337676
## spam          140  976  0.12544803

#The argument mtry=30 to indicates that all 30 predictors should be considered for each split of the tee. 
#Below, We check how well the bagged model perform on the test set.


tree.pred_rf = predict(bag.data,data.test,type = "class")
table(tree.pred_rf,Prediction.test)

##             Prediction.test
## tree.pred_rf non-spam spam
##     non-spam      802   46
##     spam          107  338

tree_rf =mean(tree.pred_rf==Prediction.test)
tree_rf

## [1] 0.8816705

#The misclassification rate is 11.83%, in others words the accuracy is 88.17%. 
#This performed better than using an optimally-pruned single tree.

#To grow randomForest works in similar way as bagging, except that a smaller value is used for 
#the mtry argument. By default, random() uses SqrtRoot of P when building a random forest of classification 
#trees. We use mtry = 5.

set.seed(1)
rf.data=bag.data=randomForest(Prediction~.,data,subset=train,mtry=5,importance=TRUE)
tree.pred=predict(rf.data,data.test,type = "class")
table(tree.pred,Prediction.test)

##           Prediction.test
## tree.pred  non-spam spam
##   non-spam      812   31
##   spam           97  353

tree_rf_mtry5 =mean(tree.pred==Prediction.test)
tree_rf_mtry5

## [1] 0.9010054

#The accuracy is 90.1%; this shows that random forests yield an improvement over bagging in this case.

#Below, we use the importance() function, to see the importance of each variable.

importance(rf.data)

##             non-spam      spam MeanDecreaseAccuracy MeanDecreaseGini
## will       5.3042463 45.479724             49.86747        63.452780
## gas       24.2463433 56.221548             53.08475        82.224761
## deal       0.9802994 42.477007             39.10400        44.378382
## meter     -4.5367391 29.987282             27.98291        32.078145
## please    -0.1902791 32.834536             37.57409        49.909437
## attached   1.5259830 49.701662             38.62647        66.700432
## day       -0.4268590 28.889684             35.09778        37.436594
## only      17.2061445  5.664471             18.87643        21.792961
## http      48.6134396 48.429899             59.75667       138.080696
## volume    -8.9952347 36.918035             33.24116        37.901143
## contract  -6.5207033 26.818487             25.36784        13.864951
## texas      2.2840661 41.895667             36.57034        26.934516
## nom       -2.5553713 47.580550             38.02725        78.614167
## questions  1.4049630 38.524547             35.96407        34.155015
## change    -1.1096887 25.447872             26.13298        18.279740
## thanks     1.1478633 70.109123             62.08956       101.311754
## money     35.1556259 37.838557             42.40500        48.476058
## rate      13.7818545 11.314898             20.48072        22.293444
## best      26.4134138 16.099751             29.12680        35.564738
## houston   10.1284094 28.517723             28.94814        17.548066
## feb       -5.2494038 23.054995             18.86005         9.133184
## prices    24.1908388 16.224955             26.73269        23.339295
## move      24.2815051 27.102642             32.99077        32.210392
## neon      22.8761407 13.337040             23.26828         9.367063
## nor        7.1605074  8.034046             12.75462        13.214706
## dr        32.8422668 13.678669             36.05150        52.354310
## z         36.2935404 22.668259             45.76390        74.510092
## gra       35.4732405 20.574643             39.71020        78.426598
## rev       -6.4561269 23.730672             20.72939        18.432590
## alt       19.2369705  4.070941             19.23063        18.172001

#From the above procedure, two measure of variable of importance are mean deacreased accuracy 
#and mean decreased gini.Plots of these measures will be produced below.

Boosting The gbm() function within glm package in R is used to fit boosted classification trees to the data set.

##Boosting

#The gbm() function within glm package in R will be used to fit boosted classification trees to our data set.

library(gbm)

## Loaded gbm 2.1.5

set.seed(1)
data = read.csv("email_v2.csv")
train=sample(1:nrow(data),3879)
data.test = data[-train,]
Prediction.test = data$Prediction[-train]

boost.data=gbm(Prediction~.,data[train,],distribution = "bernoulli",n.trees = 2586, interaction.depth = 4)

#The above procedure was used to fit boosted classification tree to the data set. 
#We ran gmb() with option distribution=“bernoulli” for a binary classification problem. The argument 
#n.trees=2586 indictates we wanted 2586 trees and the option interaction.depth=4 limits the depth of each tree.

summary(boost.data)

##                 var    rel.inf
## http           http 16.4808133
## z                 z  7.7868276
## thanks       thanks  7.6023091
## gra             gra  6.1182133
## will           will  5.6528929
## attached   attached  5.2310361
## nom             nom  5.1724083
## money         money  4.5333997
## gas             gas  4.4551762
## dr               dr  4.0759911
## please       please  3.9405970
## day             day  3.6693143
## best           best  3.4612881
## questions questions  2.1027172
## deal           deal  2.0823118
## move           move  2.0703872
## prices       prices  1.6819056
## meter         meter  1.6774147
## rate           rate  1.5932185
## volume       volume  1.5580087
## texas         texas  1.4109733
## only           only  1.3826357
## rev             rev  1.1896871
## alt             alt  1.0607455
## change       change  1.0057292
## houston     houston  0.9354307
## nor             nor  0.7941808
## neon           neon  0.6697102
## feb             feb  0.4203874
## contract   contract  0.1842894

#From above procedure, the variables are listed in the descending order of importance.

par(mfrow=c(1,2))
plot(boost.data,i='http')

plot(boost.data,i='z')

#In the above procedure, we produced partial dependence plots for the two most important variables. 
#These ilustrate the marginal effect of the selected variables on the response after integrating out 
#the other variable.


#Next, We use the boosted model to predict stabf on the test set.

yhat.boost=predict(boost.data,data.test,type = "response",n.trees = 2586)
mean(yhat.boost==Prediction.test)

## [1] 0

#The accuracy is 100%.This is perfect and performs better than random forest and bagging.

#In the above procedure we used the default lambda which is 0.1. I will run the procedure again to make 
#use of lambda as 0.2.

boost.data=gbm(Prediction~.,data[train,],distribution = "bernoulli",n.trees = 2586, interaction.depth = 4, shrinkage = 0.2, verbose = F)
yhat.boost=predict(boost.data,data.test,type = "response",n.trees = 2586)
mean(yhat.boost==Prediction.test)

## [1] 0

In the above procedure, I produced partial dependence plots for the two most important variables. These illustrate the marginal effect of the selected variables on the response after integrating out the other variable. I use the boosted model to predict Prediction on the test set. Prediction with boost model on our data gave 0 test error rate, making the accuracy to be 1. This is perfect and performs better than random forest and bagging. In the above procedure I used the default lambda which is 0.1. I will run the procedure again to make use of lambda as 0.2. With lambda as 0.2, the accuracy still remains the same. Support Vector Machine Support vector machine main objective is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. There are many possible hyperplanes that could be chosen to separate the two classes. In this project, I try to find a plane that has the maximum margin. I use the svm() function in e1071 package in R. I use the training data to fit the model, and the testing data to test the model. From the above procedure, the cost parameter is used to penalise the model for misclassification. I set the cost to be 1000. I fitted the model using svm. I run predict() function on the fitted model using the test data, and then look at its accurracy.