Title: Laboratory 2

# Step 1: Exploring and Preparing the Data
# replace the ? (question marks) with NAs
rawdata <- read.csv(file="F:/homework/530/creditData.csv", na.strings='?')
sum(is.na(rawdata))
## [1] 0
Q1- What is your suggestion if you see any NA values? There are no missing values in this dataset. If there were, the first step would be to explore the affected columns and determine which type of missingness they show. After understanding the data, I could try different methods to deal with the missing values. For example, if the data are missing completely at random (MCAR), I could use listwise (case) deletion. Pairwise deletion removes information only when the particular data point needed for a particular analysis is missing. Mean substitution uses the mean value of a variable in place of each missing value of that variable; the mean is a reasonable estimate for a randomly selected observation from a normal distribution, but otherwise it can introduce bias. Regression imputation replaces the missing data with values estimated from the other variables; however, no new information is added, while the sample size is artificially increased and the standard error is understated.
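As a minimal illustration of these options (this dataset has no NAs, so the lines below would change nothing here; Credit.Amount is used purely as an example column):

# Listwise (case) deletion: keep only rows with no missing values.
complete_rows <- rawdata[complete.cases(rawdata), ]
# Mean substitution for a single numeric column (illustrative only).
rawdata$Credit.Amount[is.na(rawdata$Credit.Amount)] <-
  mean(rawdata$Credit.Amount, na.rm = TRUE)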
str(rawdata)
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
## $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
## $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
## $ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
## $ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
## $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
## $ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
## $ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
summary(rawdata)
## Creditability Account.Balance Duration.of.Credit..month.
## Min. :0.0 Min. :1.000 Min. : 4.0
## 1st Qu.:0.0 1st Qu.:1.000 1st Qu.:12.0
## Median :1.0 Median :2.000 Median :18.0
## Mean :0.7 Mean :2.577 Mean :20.9
## 3rd Qu.:1.0 3rd Qu.:4.000 3rd Qu.:24.0
## Max. :1.0 Max. :4.000 Max. :72.0
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## Min. :0.000 Min. : 0.000 Min. : 250
## 1st Qu.:2.000 1st Qu.: 1.000 1st Qu.: 1366
## Median :2.000 Median : 2.000 Median : 2320
## Mean :2.545 Mean : 2.828 Mean : 3271
## 3rd Qu.:4.000 3rd Qu.: 3.000 3rd Qu.: 3972
## Max. :4.000 Max. :10.000 Max. :18424
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000
## Median :1.000 Median :3.000 Median :3.000
## Mean :2.105 Mean :3.384 Mean :2.973
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :4.000
## Sex...Marital.Status Guarantors Duration.in.Current.address
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :3.000
## Mean :2.682 Mean :1.145 Mean :2.845
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:4.000
## Max. :4.000 Max. :3.000 Max. :4.000
## Most.valuable.available.asset Age..years. Concurrent.Credits
## Min. :1.000 Min. :19.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:27.00 1st Qu.:3.000
## Median :2.000 Median :33.00 Median :3.000
## Mean :2.358 Mean :35.54 Mean :2.675
## 3rd Qu.:3.000 3rd Qu.:42.00 3rd Qu.:3.000
## Max. :4.000 Max. :75.00 Max. :3.000
## Type.of.apartment No.of.Credits.at.this.Bank Occupation No.of.dependents
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :3.000 Median :1.000
## Mean :1.928 Mean :1.407 Mean :2.904 Mean :1.155
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :4.000 Max. :4.000 Max. :2.000
## Telephone Foreign.Worker
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000
## Mean :1.404 Mean :1.037
## 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000
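Several of these integer columns are categorical codes rather than true numeric measurements (for example Purpose and Sex...Marital.Status), and naive_bayes() models numeric columns as Gaussian, as the "(Gaussian)" tables in the model output further below show. A sketch of the optional conversion to factors (not done in this lab; cat_cols and rawdata_fct are illustrative names):

# Optional (not done in this lab): treat selected coded columns as factors so that
# naive_bayes() would build categorical tables for them instead of Gaussian ones.
cat_cols    <- c("Purpose", "Sex...Marital.Status", "Guarantors", "Occupation")
rawdata_fct <- rawdata
rawdata_fct[cat_cols] <- lapply(rawdata_fct[cat_cols], as.factor)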
set.seed(12345)
rawdata$Creditability <- as.factor(rawdata$Creditability)
credit_rand <- rawdata[order(runif(1000)), ]
credit_train <- credit_rand[1:750, ]
credit_test <- credit_rand[751:1000, ]
prop.table(table(credit_train$Creditability))
##
## 0 1
## 0.3146667 0.6853333
prop.table(table(credit_test$Creditability))
##
## 0 1
## 0.256 0.744
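The class proportions differ somewhat between the training set (about 31% / 69%) and the test set (about 26% / 74%). One way to keep them closer is a stratified split; a sketch using caret (which is loaded later in this lab; in_train, credit_train2 and credit_test2 are illustrative names):

# Sketch of a stratified 75/25 split that preserves the Creditability proportions.
library(caret)
set.seed(12345)
in_train      <- createDataPartition(rawdata$Creditability, p = 0.75, list = FALSE)
credit_train2 <- rawdata[in_train, ]
credit_test2  <- rawdata[-in_train, ]
prop.table(table(credit_train2$Creditability))
prop.table(table(credit_test2$Creditability))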
#install.packages("naivebayes")
library(naivebayes)
## naivebayes 0.9.7 loaded
naive_model <- naive_bayes(credit_train$Creditability ~., data = credit_train)
naive_model
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = credit_train$Creditability ~ .,
## data = credit_train)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## 0 1
## 0.3146667 0.6853333
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: Account.Balance (Gaussian)
## ---------------------------------------------------------------------------------
##
## Account.Balance 0 1
## mean 1.923729 2.793774
## sd 1.036826 1.252008
##
## ---------------------------------------------------------------------------------
## ::: Duration.of.Credit..month. (Gaussian)
## ---------------------------------------------------------------------------------
##
## Duration.of.Credit..month. 0 1
## mean 24.46610 19.20039
## sd 13.82208 11.13433
##
## ---------------------------------------------------------------------------------
## ::: Payment.Status.of.Previous.Credit (Gaussian)
## ---------------------------------------------------------------------------------
##
## Payment.Status.of.Previous.Credit 0 1
## mean 2.161017 2.665370
## sd 1.071649 1.045219
##
## ---------------------------------------------------------------------------------
## ::: Purpose (Gaussian)
## ---------------------------------------------------------------------------------
##
## Purpose 0 1
## mean 2.927966 2.803502
## sd 2.944722 2.633253
##
## ---------------------------------------------------------------------------------
## ::: Credit.Amount (Gaussian)
## ---------------------------------------------------------------------------------
##
## Credit.Amount 0 1
## mean 3964.195 2984.177
## sd 3597.093 2379.685
##
## ---------------------------------------------------------------------------------
##
## # ... and 15 more tables
##
## ---------------------------------------------------------------------------------
(conf_nat <- table(predict(naive_model, credit_test), credit_test$Creditability))
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
##
## 0 1
## 0 42 35
## 1 22 151
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 77.2
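The warning above is informational: credit_test still contains the Creditability column, for which the fitted model has no probability table, so predict() ignores it. A sketch of silencing it by dropping that column from newdata, together with caret's fuller summary of the same confusion matrix (caret is loaded just below; pred_nb is an illustrative name):

# Drop the response column from newdata to avoid the "more features" warning.
library(caret)
pred_nb <- predict(naive_model, subset(credit_test, select = -Creditability))
# confusionMatrix() reports accuracy along with sensitivity and specificity.
confusionMatrix(pred_nb, credit_test$Creditability)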
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(robustbase)
creditDataScaled <- scale(credit_rand[,2:ncol(credit_rand)], center=TRUE, scale = TRUE)
m <- cor(creditDataScaled)
(highlycor <- findCorrelation(m, 0.30))
## [1] 5 12 19 15 3
enhanced_Data <- credit_rand[, -(highlycor[5]+1)]
enhancedTraining <- enhanced_Data[1:750, ]
enhancedTest <- enhanced_Data[751:1000, ]
nb_model <- naive_bayes(Creditability ~ ., data=enhancedTraining)
enhancedTestPred <- predict(nb_model, newdata = enhancedTest)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
table(enhancedTestPred, enhancedTest$Creditability)
##
## enhancedTestPred 0 1
## 0 44 35
## 1 20 151
(conf_nat <- table(enhancedTestPred, enhancedTest$Creditability))
##
## enhancedTestPred 0 1
## 0 44 35
## 1 20 151
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 78
Q3- What is the accuracy this time? The accuracy has improved to 78%.
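Note that the code above removes only one of the flagged columns (highlycor[5], shifted by 1 because Creditability is column 1 of credit_rand). Whether removing all flagged columns helps further is an empirical question; a sketch for trying it (object names ending in _all are illustrative):

# Sketch: drop every column flagged by findCorrelation(), not just the last one.
enhanced_all <- credit_rand[, -(highlycor + 1)]
train_all    <- enhanced_all[1:750, ]
test_all     <- enhanced_all[751:1000, ]
nb_all       <- naive_bayes(Creditability ~ ., data = train_all)
pred_all     <- predict(nb_all, subset(test_all, select = -Creditability))
sum(diag(table(pred_all, test_all$Creditability))) / 250 * 100   # accuracy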
letterdata <- read.csv(file="F:/homework/530/letterdata.csv", na.strings='?')
letterdata$letter <- as.factor(letterdata$letter)
str(letterdata)
## 'data.frame': 20000 obs. of 17 variables:
## $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
## $ xbox : int 2 5 4 7 2 4 4 1 2 11 ...
## $ ybox : int 8 12 11 11 1 11 2 1 2 15 ...
## $ width : int 3 3 6 6 3 5 5 3 4 13 ...
## $ height: int 5 7 8 6 1 8 4 2 4 9 ...
## $ onpix : int 1 2 6 3 1 3 4 1 2 7 ...
## $ xbar : int 8 10 10 5 8 8 8 8 10 13 ...
## $ ybar : int 13 5 6 9 6 8 7 2 6 2 ...
## $ x2bar : int 0 5 2 4 6 6 6 2 2 6 ...
## $ y2bar : int 6 4 6 6 6 9 6 2 6 2 ...
## $ xybar : int 6 13 10 4 6 5 7 8 12 12 ...
## $ x2ybar: int 10 3 3 4 5 6 6 2 4 1 ...
## $ xy2bar: int 8 9 7 10 9 6 6 8 8 9 ...
## $ xedge : int 0 2 3 6 1 0 2 1 1 8 ...
## $ xedgey: int 8 8 7 10 7 8 8 6 6 1 ...
## $ yedge : int 0 4 3 2 5 9 7 2 1 1 ...
## $ yedgex: int 8 10 9 8 10 7 10 7 7 8 ...
letters_train <- letterdata[1:18000, ]
letters_test <- letterdata[18001:20000, ]
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
letter_classifier <- ksvm(letter ~ ., data = letters_train, kernel = "polydot")
## Setting default kernel parameters
letter_classifier
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 7887
##
## Objective Function Value : -15.3458 -21.3403 -25.7672 -6.8685 -8.8812 -35.9555 -59.5883 -18.1975 -65.6075 -41.5654 -18.8559 -39.3558 -36.9961 -60.3052 -15.1694 -42.144 -35.0941 -19.4069 -15.8234 -38.6718 -33.3013 -8.5298 -12.4387 -38.2194 -14.3682 -9.5508 -165.7154 -53.2778 -79.2163 -134.5052 -184.481 -58.9285 -46.3252 -81.004 -28.1341 -29.6955 -27.5983 -38.1764 -47.2889 -137.0497 -208.1397 -239.2617 -23.8945 -10.9655 -64.2279 -12.2139 -55.7818 -10.8001 -21.2407 -11.1795 -121.5639 -33.2229 -267.3926 -81.0708 -9.4937 -4.6577 -161.5171 -86.7114 -20.9146 -16.8272 -86.6582 -16.7205 -30.3036 -20.0054 -26.233 -29.9289 -56.1072 -11.6335 -5.2564 -14.8153 -4.983 -4.8171 -8.5044 -43.2267 -55.9 -214.755 -47.0748 -49.6539 -50.2278 -18.3767 -19.1813 -97.6132 -113.6502 -42.4112 -32.5859 -127.4807 -33.7418 -30.7568 -40.0953 -18.6792 -5.4826 -49.3916 -10.6142 -20.0286 -63.8287 -183.8297 -57.0671 -43.3721 -35.2782 -85.4452 -145.9585 -11.8002 -6.1194 -12.5323 -33.5245 -155.2248 -57.2602 -194.0785 -111.0155 -10.8207 -16.7926 -3.7766 -77.3561 -7.9004 -106.5759 -52.523 -107.0402 -78.0148 -74.4773 -24.8166 -13.2372 -7.8706 -27.2788 -13.2342 -280.2868 -32.7288 -25.9531 -149.5447 -153.8495 -10.0146 -40.8917 -6.7333 -65.2053 -72.8179 -35.1252 -246.7046 -38.0738 -16.9126 -158.18 -184.0021 -50.8427 -28.7687 -164.597 -97.8359 -386.1426 -160.3188 -181.8759 -38.3648 -37.2272 -60.116 -28.2074 -53.7383 -7.8729 -12.3159 -37.8942 -72.6434 -211.8342 -58.5022 -105.1605 -176.7259 -685.8994 -142.8147 -159.635 -366.9437 -37.641 -73.1357 -175.1906 -131.2833 -41.1464 -77.8404 -57.8131 -8.6365 -251.3728 -14.0836 -36.5144 -2.2292 -6.1598 -16.8011 -26.5165 -67.19 -21.3366 -221.4815 -22.9219 -4.2616 -4.7901 -0.8263 -134.7538 -8.8843 -83.1109 -23.1019 -14.4251 -5.7337 -17.5244 -29.7925 -23.9243 -88.9083 -28.6719 -106.0564 -16.4981 -10.6486 -7.9315 -1.5742 -91.1706 -7.3819 -118.2628 -117.5543 -48.5606 -26.6093 -71.2968 -30.4913 -63.5712 -279.292 -46.3024 -50.4912 -37.9431 -21.5243 -11.6202 -134.9025 -7.516 -5.8131 -10.1595 -13.6329 -27.0294 -25.7282 -151.851 -39.0524 -105.4861 -34.2434 -15.7051 -10.2304 -3.6687 -98.2094 -7.4666 -15.2668 -75.1283 -116.5383 -16.6429 -14.9215 -55.1062 -3.0636 -8.4262 -93.6829 -38.1162 -123.1859 -4.9078 -9.1612 -1.3077 -102.9021 -23.1138 -8.5262 -57.2623 -3.4297 -20.9579 -78.2019 -50.3741 -62.3531 -6.4908 -21.9308 -2.3736 -84.3835 -126.3997 -114.8725 -26.4109 -21.5589 -61.6405 -34.9162 -66.3243 -25.1148 -6.7203 -4.6695 -65.3518 -39.7924 -67.3505 -36.2154 -10.9031 -62.2195 -14.9491 -24.3238 -65.0847 -4.9657 -64.2797 -278.2874 -14.6902 -13.9198 -18.2059 -9.8972 -78.2645 -17.454 -49.5929 -55.7786 -28.7673 -15.9476 -47.531 -17.4379 -71.0516 -5.6899 -6.2519 -97.5508 -3.8196 -7.0502 -1.1238 -147.6952 -28.2018 -414.2587 -32.3275 -35.1191 -4.9605 -90.2308 -151.3409 -90.0329 -27.9491 -42.4688 -12.5118 -26.4828 -2.0045 -62.195 -9.1662 -178.4617 -1.9406 -1.9871 -11.3982 -0.5214 -29.6136 -35.0449 -6.7569
## Training error : 0.1335
letter_predictions <- predict(letter_classifier, letters_test)
(conf_nat <- table(letter_predictions, letters_test$letter))
##
## letter_predictions A B C D E F G H I J K L M N O P Q R S T
## A 73 0 0 0 0 0 0 0 0 1 0 0 0 0 3 0 4 0 0 1
## B 0 61 0 3 2 0 1 1 0 0 1 1 0 0 0 2 0 1 3 0
## C 0 0 64 0 2 0 4 2 1 0 1 2 0 0 1 0 0 0 0 0
## D 2 1 0 67 0 0 1 3 3 2 1 2 0 3 4 2 1 2 0 0
## E 0 0 1 0 64 1 1 0 0 0 2 2 0 0 0 0 2 0 6 0
## F 0 0 0 0 0 70 1 1 4 0 0 0 0 0 0 5 1 0 2 0
## G 1 1 2 1 3 2 68 1 0 0 0 1 0 0 0 0 4 1 3 2
## H 0 0 0 1 0 1 0 46 0 2 3 1 1 1 9 0 0 5 0 3
## I 0 0 0 0 0 0 0 0 65 2 0 0 0 0 0 0 0 0 2 0
## J 0 1 0 0 0 1 0 0 3 62 0 0 0 0 1 0 0 0 1 0
## K 0 1 4 0 0 0 0 5 0 0 56 0 0 2 0 0 0 4 0 1
## L 0 0 0 0 1 0 0 1 0 0 0 63 0 0 0 0 0 0 0 0
## M 0 0 1 0 0 0 1 0 0 0 0 0 70 2 0 0 0 0 0 0
## N 0 0 0 0 0 0 0 0 0 0 0 0 0 77 0 0 0 1 0 0
## O 0 0 1 1 0 0 0 1 0 1 0 0 0 0 49 1 2 0 0 0
## P 0 0 0 0 0 3 0 0 0 0 0 0 0 0 2 69 0 0 0 0
## Q 0 0 0 0 0 0 3 1 0 0 0 2 0 0 2 1 52 0 1 0
## R 0 4 0 0 1 0 0 3 0 0 3 0 0 0 1 0 0 64 0 1
## S 0 1 0 0 1 1 1 0 1 1 0 0 0 0 0 0 6 0 47 1
## T 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 83
## U 0 0 2 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## V 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
## W 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## X 0 1 0 0 1 0 0 1 0 0 1 4 0 0 0 0 0 1 0 0
## Y 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 1
## Z 1 0 0 0 2 0 0 0 0 2 0 0 0 0 0 0 0 0 5 1
##
## letter_predictions U V W X Y Z
## A 2 0 1 0 0 0
## B 0 0 0 0 0 0
## C 0 0 0 0 0 0
## D 0 0 0 0 1 0
## E 0 0 0 1 0 0
## F 0 1 0 0 2 0
## G 0 0 0 0 0 0
## H 0 2 0 0 1 0
## I 0 0 0 2 1 0
## J 0 0 0 1 0 4
## K 2 0 0 4 0 0
## L 0 0 0 0 0 0
## M 1 0 6 0 0 0
## N 1 0 2 0 0 0
## O 1 0 0 0 0 0
## P 0 0 0 0 1 0
## Q 0 0 0 0 0 0
## R 0 1 0 0 0 0
## S 0 0 0 1 0 6
## T 1 0 0 0 2 2
## U 83 0 0 0 0 0
## V 0 64 1 0 1 0
## W 0 3 59 0 0 0
## X 0 0 0 76 1 0
## Y 0 0 0 1 58 0
## Z 0 0 0 0 0 70
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 84
Q4- We may be able to do better than this by changing the kernels. Try Polynomial and RBF kernels to improve the result. I tried both the polynomial and the RBF kernel. The RBF kernel's accuracy is noticeably higher than the polynomial kernel's, which shows how sensitive the SVM's performance is to the choice of kernel.
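A sketch of the RBF run described above (letter_classifier_rbf and letter_pred_rbf are illustrative names; with kernel = "rbfdot", ksvm() estimates the sigma hyperparameter automatically):

# Refit the letter classifier with a Gaussian (RBF) kernel and score it.
letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train, kernel = "rbfdot")
letter_pred_rbf       <- predict(letter_classifier_rbf, letters_test)
mean(letter_pred_rbf == letters_test$letter) * 100   # accuracy with the RBF kernel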
news <- read.csv(file="F:/homework/530/OnlineNewsPopularity_for_R.csv", na.strings='?')
newsShort <- data.frame(news$n_tokens_title, news$n_tokens_content, news$n_unique_tokens, news$n_non_stop_words, news$num_hrefs, news$num_imgs, news$num_videos, news$average_token_length, news$num_keywords, news$kw_max_max, news$global_sentiment_polarity, news$avg_positive_polarity, news$title_subjectivity, news$title_sentiment_polarity, news$abs_title_subjectivity, news$abs_title_sentiment_polarity, news$shares)
colnames(newsShort) <- c("n_tokens_title", "n_tokens_content", "n_unique_tokens", "n_non_stop_words", "num_hrefs", "num_imgs", "num_videos", "average_token_length", "num_keywords", "kw_max_max", "global_sentiment_polarity", "avg_positive_polarity", "title_subjectivity", "title_sentiment_polarity", "abs_title_subjectivity", "abs_title_sentiment_polarity", "shares")
newsShort$popular = rep('na', nrow(newsShort))
for (i in 1:39644) {
  if (newsShort$shares[i] >= 1400) {
    newsShort$popular[i] = "yes"
  } else {
    newsShort$popular[i] = "no"
  }
}
newsShort$shares = newsShort$popular
newsShort$shares <- as.factor(newsShort$shares)
set.seed(12345)
# shuffle the first 10,000 articles (a subset of the 39,644 rows) into a random order
news_rand <- newsShort[order(runif(10000)), ]
news_train <- news_rand[1:9000, ]
news_test <- news_rand[9001:10000, ]
nb_model <- naive_bayes(shares ~ ., data=news_train)
## Warning: naive_bayes(): Feature popular - zero probabilities are present.
## Consider Laplace smoothing.
nb_model
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = shares ~ ., data = news_train)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## no yes
## 0.4291111 0.5708889
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: n_tokens_title (Gaussian)
## ---------------------------------------------------------------------------------
##
## n_tokens_title no yes
## mean 9.820559 9.695991
## sd 1.929249 1.987754
##
## ---------------------------------------------------------------------------------
## ::: n_tokens_content (Gaussian)
## ---------------------------------------------------------------------------------
##
## n_tokens_content no yes
## mean 452.2315 515.1051
## sd 347.1779 450.0206
##
## ---------------------------------------------------------------------------------
## ::: n_unique_tokens (Gaussian)
## ---------------------------------------------------------------------------------
##
## n_unique_tokens no yes
## mean 0.5702437 0.5542023
## sd 0.1127776 0.1232687
##
## ---------------------------------------------------------------------------------
## ::: n_non_stop_words (Gaussian)
## ---------------------------------------------------------------------------------
##
## n_non_stop_words no yes
## mean 0.99404453 0.99124172
## sd 0.07695147 0.09318398
##
## ---------------------------------------------------------------------------------
## ::: num_hrefs (Gaussian)
## ---------------------------------------------------------------------------------
##
## num_hrefs no yes
## mean 9.147851 10.570650
## sd 8.644083 11.540711
##
## ---------------------------------------------------------------------------------
##
## # ... and 12 more tables
##
## ---------------------------------------------------------------------------------
news_Pred <- predict(nb_model, newdata = news_test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
(conf_nat <- table(news_Pred, news_test$shares))
##
## news_Pred no yes
## no 419 0
## yes 11 570
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 98.9
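The near-perfect accuracy, together with the earlier warning about "Feature popular", suggests target leakage: the helper column popular duplicates the label shares (it was copied into shares above) but is still present as a predictor. A sketch of re-estimating accuracy without it (assumption: the leak should be removed; news_clean and related names are illustrative):

# Drop the helper column that duplicates the label, refit, and re-score.
news_clean <- subset(news_rand, select = -popular)
nb_clean   <- naive_bayes(shares ~ ., data = news_clean[1:9000, ])
pred_clean <- predict(nb_clean, subset(news_clean[9001:10000, ], select = -shares))
mean(pred_clean == news_clean$shares[9001:10000]) * 100   # accuracy without popular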
## SVM
news_classifier <- ksvm(shares ~ ., data = news_train, kernel = "polydot")
## Setting default kernel parameters
news_classifier
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 70
##
## Objective Function Value : -1
## Training error : 0
news_predictions <- predict(news_classifier, news_test)
table(news_predictions, news_test$shares)
##
## news_predictions no yes
## no 430 0
## yes 0 570
(conf_nat <- table(news_predictions, news_test$shares))
##
## news_predictions no yes
## no 430 0
## yes 0 570
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 100
Q5- Do you see any improvement compared to the last three techniques? Please completely explain your results and analysis. I applied Naive Bayes and SVM to the online news dataset. The accuracy of the SVM is slightly higher than that of Naive Bayes. There is no single answer as to which classification method is best for a given dataset. The Naive Bayes classifier (NBC) and the Support Vector Machine (SVM) each have different options, including, for the SVM, the choice of kernel function. Both are sensitive to parameter optimization (i.e., a different parameter selection can significantly change their output), so a result showing that NBC performs better than SVM, or the reverse, is only true for the selected parameters.
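One way to compare the two methods on a more even footing is cross-validation with parameter tuning. A sketch using caret (not part of the original lab; the leaky popular column is dropped first; news_cv_data, nb_cv and svm_cv are illustrative names, and method = "naive_bayes" requires the naivebayes package):

# 5-fold cross-validation comparing Naive Bayes and an RBF-kernel SVM.
library(caret)
news_cv_data <- subset(news_rand, select = -popular)
ctrl <- trainControl(method = "cv", number = 5)
set.seed(12345)
nb_cv  <- train(shares ~ ., data = news_cv_data, method = "naive_bayes", trControl = ctrl)
set.seed(12345)
svm_cv <- train(shares ~ ., data = news_cv_data, method = "svmRadial", trControl = ctrl)
summary(resamples(list(NaiveBayes = nb_cv, SVM = svm_cv)))   # compare resampled accuracy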