Title: Laboratory 2

# Step 1: Exploring and Preparing the Data
# replace the ? (question marks) with NAs
rawdata <- read.csv(file="F:/homework/530/creditData.csv", na.strings='?')
sum(is.na(rawdata))
## [1] 0
Q1- What is your suggestion if you see any NA values? There are no missing values in this dataset. If there were, the first step would be to explore the affected columns and determine which type of missingness they show. After understanding the data, I could try different methods to deal with the missing values. For example, if the data are missing completely at random (MCAR), I could use listwise (case) deletion. Pairwise deletion removes information only when the particular data point needed for a particular analysis is missing. Mean substitution uses the mean value of a variable in place of each missing value of that variable; the mean is a reasonable estimate for a randomly selected observation from a normal distribution, but otherwise it can introduce bias. Regression imputation replaces the missing data with values estimated from the other variables; however, no new information is added, while the sample size is artificially increased and the standard error is understated.
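As a minimal illustration of these options (this dataset has no NAs, so the lines below would change nothing here; Credit.Amount is used purely as an example column):

# Listwise (case) deletion: keep only rows with no missing values.
complete_rows <- rawdata[complete.cases(rawdata), ]
# Mean substitution for a single numeric column (illustrative only).
rawdata$Credit.Amount[is.na(rawdata$Credit.Amount)] <-
  mean(rawdata$Credit.Amount, na.rm = TRUE)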
str(rawdata)
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
## $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
## $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
## $ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
## $ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
## $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
## $ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
## $ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
summary(rawdata)
## Creditability Account.Balance Duration.of.Credit..month.
## Min. :0.0 Min. :1.000 Min. : 4.0
## 1st Qu.:0.0 1st Qu.:1.000 1st Qu.:12.0
## Median :1.0 Median :2.000 Median :18.0
## Mean :0.7 Mean :2.577 Mean :20.9
## 3rd Qu.:1.0 3rd Qu.:4.000 3rd Qu.:24.0
## Max. :1.0 Max. :4.000 Max. :72.0
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## Min. :0.000 Min. : 0.000 Min. : 250
## 1st Qu.:2.000 1st Qu.: 1.000 1st Qu.: 1366
## Median :2.000 Median : 2.000 Median : 2320
## Mean :2.545 Mean : 2.828 Mean : 3271
## 3rd Qu.:4.000 3rd Qu.: 3.000 3rd Qu.: 3972
## Max. :4.000 Max. :10.000 Max. :18424
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000
## Median :1.000 Median :3.000 Median :3.000
## Mean :2.105 Mean :3.384 Mean :2.973
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :4.000
## Sex...Marital.Status Guarantors Duration.in.Current.address
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :3.000
## Mean :2.682 Mean :1.145 Mean :2.845
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:4.000
## Max. :4.000 Max. :3.000 Max. :4.000
## Most.valuable.available.asset Age..years. Concurrent.Credits
## Min. :1.000 Min. :19.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:27.00 1st Qu.:3.000
## Median :2.000 Median :33.00 Median :3.000
## Mean :2.358 Mean :35.54 Mean :2.675
## 3rd Qu.:3.000 3rd Qu.:42.00 3rd Qu.:3.000
## Max. :4.000 Max. :75.00 Max. :3.000
## Type.of.apartment No.of.Credits.at.this.Bank Occupation No.of.dependents
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :3.000 Median :1.000
## Mean :1.928 Mean :1.407 Mean :2.904 Mean :1.155
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :4.000 Max. :4.000 Max. :2.000
## Telephone Foreign.Worker
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000
## Mean :1.404 Mean :1.037
## 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000
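Several of these integer columns are categorical codes rather than true numeric measurements (for example Purpose and Sex...Marital.Status), and naive_bayes() models numeric columns as Gaussian, as the "(Gaussian)" tables in the model output further below show. A sketch of the optional conversion to factors (not done in this lab; cat_cols and rawdata_fct are illustrative names):

# Optional (not done in this lab): treat selected coded columns as factors so that
# naive_bayes() would build categorical tables for them instead of Gaussian ones.
cat_cols    <- c("Purpose", "Sex...Marital.Status", "Guarantors", "Occupation")
rawdata_fct <- rawdata
rawdata_fct[cat_cols] <- lapply(rawdata_fct[cat_cols], as.factor)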
set.seed(12345)
rawdata$Creditability <- as.factor(rawdata$Creditability)
credit_rand <- rawdata[order(runif(1000)), ]
credit_train <- credit_rand[1:750, ]
credit_test <- credit_rand[751:1000, ]
prop.table(table(credit_train$Creditability))
##
## 0 1
## 0.3146667 0.6853333
prop.table(table(credit_test$Creditability))
##
## 0 1
## 0.256 0.744
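The class proportions differ somewhat between the training set (about 31% / 69%) and the test set (about 26% / 74%). One way to keep them closer is a stratified split; a sketch using caret (which is loaded later in this lab; in_train, credit_train2 and credit_test2 are illustrative names):

# Sketch of a stratified 75/25 split that preserves the Creditability proportions.
library(caret)
set.seed(12345)
in_train      <- createDataPartition(rawdata$Creditability, p = 0.75, list = FALSE)
credit_train2 <- rawdata[in_train, ]
credit_test2  <- rawdata[-in_train, ]
prop.table(table(credit_train2$Creditability))
prop.table(table(credit_test2$Creditability))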
#install.packages("naivebayes")
library(naivebayes)
## naivebayes 0.9.7 loaded
naive_model <- naive_bayes(credit_train$Creditability ~., data = credit_train)
naive_model
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = credit_train$Creditability ~ .,
## data = credit_train)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## 0 1
## 0.3146667 0.6853333
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: Account.Balance (Gaussian)
## ---------------------------------------------------------------------------------
##
## Account.Balance 0 1
## mean 1.923729 2.793774
## sd 1.036826 1.252008
##
## ---------------------------------------------------------------------------------
## ::: Duration.of.Credit..month. (Gaussian)
## ---------------------------------------------------------------------------------
##
## Duration.of.Credit..month. 0 1
## mean 24.46610 19.20039
## sd 13.82208 11.13433
##
## ---------------------------------------------------------------------------------
## ::: Payment.Status.of.Previous.Credit (Gaussian)
## ---------------------------------------------------------------------------------
##
## Payment.Status.of.Previous.Credit 0 1
## mean 2.161017 2.665370
## sd 1.071649 1.045219
##
## ---------------------------------------------------------------------------------
## ::: Purpose (Gaussian)
## ---------------------------------------------------------------------------------
##
## Purpose 0 1
## mean 2.927966 2.803502
## sd 2.944722 2.633253
##
## ---------------------------------------------------------------------------------
## ::: Credit.Amount (Gaussian)
## ---------------------------------------------------------------------------------
##
## Credit.Amount 0 1
## mean 3964.195 2984.177
## sd 3597.093 2379.685
##
## ---------------------------------------------------------------------------------
##
## # ... and 15 more tables
##
## ---------------------------------------------------------------------------------
(conf_nat <- table(predict(naive_model, credit_test), credit_test$Creditability))
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
##
## 0 1
## 0 42 35
## 1 22 151
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 77.2
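The warning above is informational: credit_test still contains the Creditability column, for which the fitted model has no probability table, so predict() ignores it. A sketch of silencing it by dropping that column from newdata, together with caret's fuller summary of the same confusion matrix (caret is loaded just below; pred_nb is an illustrative name):

# Drop the response column from newdata to avoid the "more features" warning.
library(caret)
pred_nb <- predict(naive_model, subset(credit_test, select = -Creditability))
# confusionMatrix() reports accuracy along with sensitivity and specificity.
confusionMatrix(pred_nb, credit_test$Creditability)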
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(robustbase)
creditDataScaled <- scale(credit_rand[,2:ncol(credit_rand)], center=TRUE, scale = TRUE)
m <- cor(creditDataScaled)
(highlycor <- findCorrelation(m, 0.30))
## [1] 5 12 19 15 3
enhanced_Data <- credit_rand[, -(highlycor[5]+1)]
enhancedTraining <- enhanced_Data[1:750, ]
enhancedTest <- enhanced_Data[751:1000, ]
nb_model <- naive_bayes(Creditability ~ ., data=enhancedTraining)
enhancedTestPred <- predict(nb_model, newdata = enhancedTest)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
table(enhancedTestPred, enhancedTest$Creditability)
##
## enhancedTestPred 0 1
## 0 44 35
## 1 20 151
(conf_nat <- table(enhancedTestPred, enhancedTest$Creditability))
##
## enhancedTestPred 0 1
## 0 44 35
## 1 20 151
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 78
Q3- What is the accuracy this time? The accuracy has improved to 78%.
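Note that the code above removes only one of the flagged columns (highlycor[5], shifted by 1 because Creditability is column 1 of credit_rand). Whether removing all flagged columns helps further is an empirical question; a sketch for trying it (object names ending in _all are illustrative):

# Sketch: drop every column flagged by findCorrelation(), not just the last one.
enhanced_all <- credit_rand[, -(highlycor + 1)]
train_all    <- enhanced_all[1:750, ]
test_all     <- enhanced_all[751:1000, ]
nb_all       <- naive_bayes(Creditability ~ ., data = train_all)
pred_all     <- predict(nb_all, subset(test_all, select = -Creditability))
sum(diag(table(pred_all, test_all$Creditability))) / 250 * 100   # accuracy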
letterdata <- read.csv(file="F:/homework/530/letterdata.csv", na.strings='?')
letterdata$letter <- as.factor(letterdata$letter)
str(letterdata)
## 'data.frame': 20000 obs. of 17 variables:
## $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
## $ xbox : int 2 5 4 7 2 4 4 1 2 11 ...
## $ ybox : int 8 12 11 11 1 11 2 1 2 15 ...
## $ width : int 3 3 6 6 3 5 5 3 4 13 ...
## $ height: int 5 7 8 6 1 8 4 2 4 9 ...
## $ onpix : int 1 2 6 3 1 3 4 1 2 7 ...
## $ xbar : int 8 10 10 5 8 8 8 8 10 13 ...
## $ ybar : int 13 5 6 9 6 8 7 2 6 2 ...
## $ x2bar : int 0 5 2 4 6 6 6 2 2 6 ...
## $ y2bar : int 6 4 6 6 6 9 6 2 6 2 ...
## $ xybar : int 6 13 10 4 6 5 7 8 12 12 ...
## $ x2ybar: int 10 3 3 4 5 6 6 2 4 1 ...
## $ xy2bar: int 8 9 7 10 9 6 6 8 8 9 ...
## $ xedge : int 0 2 3 6 1 0 2 1 1 8 ...
## $ xedgey: int 8 8 7 10 7 8 8 6 6 1 ...
## $ yedge : int 0 4 3 2 5 9 7 2 1 1 ...
## $ yedgex: int 8 10 9 8 10 7 10 7 7 8 ...
letters_train <- letterdata[1:18000, ]
letters_test <- letterdata[18001:20000, ]
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
letter_classifier <- ksvm(letter ~ ., data = letters_train, kernel = "polydot")
## Setting default kernel parameters
letter_classifier
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 7887
##
## Objective Function Value : -15.3458 -21.3403 -25.7672 -6.8685 -8.8812 -35.9555 -59.5883 -18.1975 -65.6075 -41.5654 -18.8559 -39.3558 -36.9961 -60.3052 -15.1694 -42.144 -35.0941 -19.4069 -15.8234 -38.6718 -33.3013 -8.5298 -12.4387 -38.2194 -14.3682 -9.5508 -165.7154 -53.2778 -79.2163 -134.5052 -184.481 -58.9285 -46.3252 -81.004 -28.1341 -29.6955 -27.5983 -38.1764 -47.2889 -137.0497 -208.1397 -239.2617 -23.8945 -10.9655 -64.2279 -12.2139 -55.7818 -10.8001 -21.2407 -11.1795 -121.5639 -33.2229 -267.3926 -81.0708 -9.4937 -4.6577 -161.5171 -86.7114 -20.9146 -16.8272 -86.6582 -16.7205 -30.3036 -20.0054 -26.233 -29.9289 -56.1072 -11.6335 -5.2564 -14.8153 -4.983 -4.8171 -8.5044 -43.2267 -55.9 -214.755 -47.0748 -49.6539 -50.2278 -18.3767 -19.1813 -97.6132 -113.6502 -42.4112 -32.5859 -127.4807 -33.7418 -30.7568 -40.0953 -18.6792 -5.4826 -49.3916 -10.6142 -20.0286 -63.8287 -183.8297 -57.0671 -43.3721 -35.2782 -85.4452 -145.9585 -11.8002 -6.1194 -12.5323 -33.5245 -155.2248 -57.2602 -194.0785 -111.0155 -10.8207 -16.7926 -3.7766 -77.3561 -7.9004 -106.5759 -52.523 -107.0402 -78.0148 -74.4773 -24.8166 -13.2372 -7.8706 -27.2788 -13.2342 -280.2868 -32.7288 -25.9531 -149.5447 -153.8495 -10.0146 -40.8917 -6.7333 -65.2053 -72.8179 -35.1252 -246.7046 -38.0738 -16.9126 -158.18 -184.0021 -50.8427 -28.7687 -164.597 -97.8359 -386.1426 -160.3188 -181.8759 -38.3648 -37.2272 -60.116 -28.2074 -53.7383 -7.8729 -12.3159 -37.8942 -72.6434 -211.8342 -58.5022 -105.1605 -176.7259 -685.8994 -142.8147 -159.635 -366.9437 -37.641 -73.1357 -175.1906 -131.2833 -41.1464 -77.8404 -57.8131 -8.6365 -251.3728 -14.0836 -36.5144 -2.2292 -6.1598 -16.8011 -26.5165 -67.19 -21.3366 -221.4815 -22.9219 -4.2616 -4.7901 -0.8263 -134.7538 -8.8843 -83.1109 -23.1019 -14.4251 -5.7337 -17.5244 -29.7925 -23.9243 -88.9083 -28.6719 -106.0564 -16.4981 -10.6486 -7.9315 -1.5742 -91.1706 -7.3819 -118.2628 -117.5543 -48.5606 -26.6093 -71.2968 -30.4913 -63.5712 -279.292 -46.3024 -50.4912 -37.9431 -21.5243 -11.6202 -134.9025 -7.516 -5.8131 -10.1595 -13.6329 -27.0294 -25.7282 -151.851 -39.0524 -105.4861 -34.2434 -15.7051 -10.2304 -3.6687 -98.2094 -7.4666 -15.2668 -75.1283 -116.5383 -16.6429 -14.9215 -55.1062 -3.0636 -8.4262 -93.6829 -38.1162 -123.1859 -4.9078 -9.1612 -1.3077 -102.9021 -23.1138 -8.5262 -57.2623 -3.4297 -20.9579 -78.2019 -50.3741 -62.3531 -6.4908 -21.9308 -2.3736 -84.3835 -126.3997 -114.8725 -26.4109 -21.5589 -61.6405 -34.9162 -66.3243 -25.1148 -6.7203 -4.6695 -65.3518 -39.7924 -67.3505 -36.2154 -10.9031 -62.2195 -14.9491 -24.3238 -65.0847 -4.9657 -64.2797 -278.2874 -14.6902 -13.9198 -18.2059 -9.8972 -78.2645 -17.454 -49.5929 -55.7786 -28.7673 -15.9476 -47.531 -17.4379 -71.0516 -5.6899 -6.2519 -97.5508 -3.8196 -7.0502 -1.1238 -147.6952 -28.2018 -414.2587 -32.3275 -35.1191 -4.9605 -90.2308 -151.3409 -90.0329 -27.9491 -42.4688 -12.5118 -26.4828 -2.0045 -62.195 -9.1662 -178.4617 -1.9406 -1.9871 -11.3982 -0.5214 -29.6136 -35.0449 -6.7569
## Training error : 0.1335
letter_predictions <- predict(letter_classifier, letters_test)
(conf_nat <- table(letter_predictions, letters_test$letter))
##
## letter_predictions A B C D E F G H I J K L M N O P Q R S T
## A 73 0 0 0 0 0 0 0 0 1 0 0 0 0 3 0 4 0 0 1
## B 0 61 0 3 2 0 1 1 0 0 1 1 0 0 0 2 0 1 3 0
## C 0 0 64 0 2 0 4 2 1 0 1 2 0 0 1 0 0 0 0 0
## D 2 1 0 67 0 0 1 3 3 2 1 2 0 3 4 2 1 2 0 0
## E 0 0 1 0 64 1 1 0 0 0 2 2 0 0 0 0 2 0 6 0
## F 0 0 0 0 0 70 1 1 4 0 0 0 0 0 0 5 1 0 2 0
## G 1 1 2 1 3 2 68 1 0 0 0 1 0 0 0 0 4 1 3 2
## H 0 0 0 1 0 1 0 46 0 2 3 1 1 1 9 0 0 5 0 3
## I 0 0 0 0 0 0 0 0 65 2 0 0 0 0 0 0 0 0 2 0
## J 0 1 0 0 0 1 0 0 3 62 0 0 0 0 1 0 0 0 1 0
## K 0 1 4 0 0 0 0 5 0 0 56 0 0 2 0 0 0 4 0 1
## L 0 0 0 0 1 0 0 1 0 0 0 63 0 0 0 0 0 0 0 0
## M 0 0 1 0 0 0 1 0 0 0 0 0 70 2 0 0 0 0 0 0
## N 0 0 0 0 0 0 0 0 0 0 0 0 0 77 0 0 0 1 0 0
## O 0 0 1 1 0 0 0 1 0 1 0 0 0 0 49 1 2 0 0 0
## P 0 0 0 0 0 3 0 0 0 0 0 0 0 0 2 69 0 0 0 0
## Q 0 0 0 0 0 0 3 1 0 0 0 2 0 0 2 1 52 0 1 0
## R 0 4 0 0 1 0 0 3 0 0 3 0 0 0 1 0 0 64 0 1
## S 0 1 0 0 1 1 1 0 1 1 0 0 0 0 0 0 6 0 47 1
## T 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 83
## U 0 0 2 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## V 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
## W 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## X 0 1 0 0 1 0 0 1 0 0 1 4 0 0 0 0 0 1 0 0
## Y 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 1
## Z 1 0 0 0 2 0 0 0 0 2 0 0 0 0 0 0 0 0 5 1
##
## letter_predictions U V W X Y Z
## A 2 0 1 0 0 0
## B 0 0 0 0 0 0
## C 0 0 0 0 0 0
## D 0 0 0 0 1 0
## E 0 0 0 1 0 0
## F 0 1 0 0 2 0
## G 0 0 0 0 0 0
## H 0 2 0 0 1 0
## I 0 0 0 2 1 0
## J 0 0 0 1 0 4
## K 2 0 0 4 0 0
## L 0 0 0 0 0 0
## M 1 0 6 0 0 0
## N 1 0 2 0 0 0
## O 1 0 0 0 0 0
## P 0 0 0 0 1 0
## Q 0 0 0 0 0 0
## R 0 1 0 0 0 0
## S 0 0 0 1 0 6
## T 1 0 0 0 2 2
## U 83 0 0 0 0 0
## V 0 64 1 0 1 0
## W 0 3 59 0 0 0
## X 0 0 0 76 1 0
## Y 0 0 0 1 58 0
## Z 0 0 0 0 0 70
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 84
Q4- We may be able to do better than this by changing the kernels. Try Polynomial and RBF kernels to improve the result. I tried both the polynomial and the RBF kernel. The RBF kernel's accuracy is noticeably higher than the polynomial kernel's, which shows how sensitive the SVM's performance is to the choice of kernel.
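A sketch of the RBF run described above (letter_classifier_rbf and letter_pred_rbf are illustrative names; with kernel = "rbfdot", ksvm() estimates the sigma hyperparameter automatically):

# Refit the letter classifier with a Gaussian (RBF) kernel and score it.
letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train, kernel = "rbfdot")
letter_pred_rbf       <- predict(letter_classifier_rbf, letters_test)
mean(letter_pred_rbf == letters_test$letter) * 100   # accuracy with the RBF kernel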
news <- read.csv(file="F:/homework/530/OnlineNewsPopularity_for_R.csv", na.strings='?')
newsShort <- data.frame(news$n_tokens_title, news$n_tokens_content, news$n_unique_tokens, news$n_non_stop_words, news$num_hrefs, news$num_imgs, news$num_videos, news$average_token_length, news$num_keywords, news$kw_max_max, news$global_sentiment_polarity, news$avg_positive_polarity, news$title_subjectivity, news$title_sentiment_polarity, news$abs_title_subjectivity, news$abs_title_sentiment_polarity, news$shares)
colnames(newsShort) <- c("n_tokens_title", "n_tokens_content", "n_unique_tokens", "n_non_stop_words", "num_hrefs", "num_imgs", "num_videos", "average_token_length", "num_keywords", "kw_max_max", "global_sentiment_polarity", "avg_positive_polarity", "title_subjectivity", "title_sentiment_polarity", "abs_title_subjectivity", "abs_title_sentiment_polarity", "shares")
newsShort$popular = rep('na', nrow(newsShort))
for (i in 1:39644) {
  if (newsShort$shares[i] >= 1400) {
    newsShort$popular[i] = "yes"
  } else {
    newsShort$popular[i] = "no"
  }
}
newsShort$shares = newsShort$popular
newsShort$shares <- as.factor(newsShort$shares)
set.seed(12345)
# shuffle the first 10,000 articles (a subset of the 39,644 rows) into a random order
news_rand <- newsShort[order(runif(10000)), ]
news_train <- news_rand[1:9000, ]
news_test <- news_rand[9001:10000, ]
nb_model <- naive_bayes(shares ~ ., data=news_train)
## Warning: naive_bayes(): Feature popular - zero probabilities are present.
## Consider Laplace smoothing.
nb_model
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = shares ~ ., data = news_train)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## no yes
## 0.4291111 0.5708889
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: n_tokens_title (Gaussian)
## ---------------------------------------------------------------------------------
##
## n_tokens_title no yes
## mean 9.820559 9.695991
## sd 1.929249 1.987754
##
## ---------------------------------------------------------------------------------
## ::: n_tokens_content (Gaussian)
## ---------------------------------------------------------------------------------
##
## n_tokens_content no yes
## mean 452.2315 515.1051
## sd 347.1779 450.0206
##
## ---------------------------------------------------------------------------------
## ::: n_unique_tokens (Gaussian)
## ---------------------------------------------------------------------------------
##
## n_unique_tokens no yes
## mean 0.5702437 0.5542023
## sd 0.1127776 0.1232687
##
## ---------------------------------------------------------------------------------
## ::: n_non_stop_words (Gaussian)
## ---------------------------------------------------------------------------------
##
## n_non_stop_words no yes
## mean 0.99404453 0.99124172
## sd 0.07695147 0.09318398
##
## ---------------------------------------------------------------------------------
## ::: num_hrefs (Gaussian)
## ---------------------------------------------------------------------------------
##
## num_hrefs no yes
## mean 9.147851 10.570650
## sd 8.644083 11.540711
##
## ---------------------------------------------------------------------------------
##
## # ... and 12 more tables
##
## ---------------------------------------------------------------------------------
news_Pred <- predict(nb_model, newdata = news_test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
(conf_nat <- table(news_Pred, news_test$shares))
##
## news_Pred no yes
## no 419 0
## yes 11 570
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 98.9
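The near-perfect accuracy, together with the earlier warning about "Feature popular", suggests target leakage: the helper column popular duplicates the label shares (it was copied into shares above) but is still present as a predictor. A sketch of re-estimating accuracy without it (assumption: the leak should be removed; news_clean and related names are illustrative):

# Drop the helper column that duplicates the label, refit, and re-score.
news_clean <- subset(news_rand, select = -popular)
nb_clean   <- naive_bayes(shares ~ ., data = news_clean[1:9000, ])
pred_clean <- predict(nb_clean, subset(news_clean[9001:10000, ], select = -shares))
mean(pred_clean == news_clean$shares[9001:10000]) * 100   # accuracy without popular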
## SVM
news_classifier <- ksvm(shares ~ ., data = news_train, kernel = "polydot")
## Setting default kernel parameters
news_classifier
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 70
##
## Objective Function Value : -1
## Training error : 0
news_predictions <- predict(news_classifier, news_test)
table(news_predictions, news_test$shares)
##
## news_predictions no yes
## no 430 0
## yes 0 570
(conf_nat <- table(news_predictions, news_test$shares))
##
## news_predictions no yes
## no 430 0
## yes 0 570
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 100
Q5- Do you see any improvement compared to the last three techniques? Please completely explain your results and analysis. I applied Naive Bayes and SVM to the online news dataset. The accuracy of the SVM is slightly higher than that of Naive Bayes. There is no single answer as to which classification method is best for a given dataset. The Naive Bayes classifier (NBC) and the Support Vector Machine (SVM) each have different options, including, for the SVM, the choice of kernel function. Both are sensitive to parameter optimization (i.e., a different parameter selection can significantly change their output), so a result showing that NBC performs better than SVM, or the reverse, is only true for the selected parameters.
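One way to compare the two methods on a more even footing is cross-validation with parameter tuning. A sketch using caret (not part of the original lab; the leaky popular column is dropped first; news_cv_data, nb_cv and svm_cv are illustrative names, and method = "naive_bayes" requires the naivebayes package):

# 5-fold cross-validation comparing Naive Bayes and an RBF-kernel SVM.
library(caret)
news_cv_data <- subset(news_rand, select = -popular)
ctrl <- trainControl(method = "cv", number = 5)
set.seed(12345)
nb_cv  <- train(shares ~ ., data = news_cv_data, method = "naive_bayes", trControl = ctrl)
set.seed(12345)
svm_cv <- train(shares ~ ., data = news_cv_data, method = "svmRadial", trControl = ctrl)
summary(resamples(list(NaiveBayes = nb_cv, SVM = svm_cv)))   # compare resampled accuracy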