Lab 2 Classification

We will use the Online News Popularity data set from the UCI Machine Learning Repository to perform a binary classification (popular vs. unpopular) analysis using four classification methods: tree-based classification, the KNN algorithm, support vector machines, and adding regression to trees. We will use features such as digital media content, the earlier popularity of news referenced in the article, the average number of shares of keywords prior to publication, and natural language features to conduct the analysis. Then we will use sms_spam.csv to apply the Naive Bayes algorithm.

Method 1. Tree-based classification

1. Import the news popularity data

news=read.csv('C:/Users/atan/Desktop/news.csv',header = TRUE)
news=news[, 29:61]
str(news)
## 'data.frame':    39644 obs. of  33 variables:
##  $ self_reference_min_shares   : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares   : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess  : num  496 0 918 0 3151 ...
##  $ weekday_is_monday           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LDA_00                      : num  0.5003 0.7998 0.2178 0.0286 0.0286 ...
##  $ LDA_01                      : num  0.3783 0.05 0.0333 0.4193 0.0288 ...
##  $ LDA_02                      : num  0.04 0.0501 0.0334 0.4947 0.0286 ...
##  $ LDA_03                      : num  0.0413 0.0501 0.0333 0.0289 0.0286 ...
##  $ LDA_04                      : num  0.0401 0.05 0.6822 0.0286 0.8854 ...
##  $ global_subjectivity         : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity   : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words  : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words  : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words         : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words         : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity       : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ min_positive_polarity       : num  0.1 0.0333 0.1 0.1364 0.0333 ...
##  $ max_positive_polarity       : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
##  $ avg_negative_polarity       : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ min_negative_polarity       : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
##  $ max_negative_polarity       : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
##  $ title_subjectivity          : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity    : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity      : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity: num  0.188 0 0 0 0.136 ...
##  $ shares                      : int  593 711 1500 1200 505 855 556 891 3600 710 ...

2. Split into training and test datasets

First, we use the variable avg_positive_polarity to define our dependent variable: an article is labeled popular when avg_positive_polarity > 0.5 and unpopular otherwise. Then we randomize the observations and split the data into training and test sets with a ratio of 0.8, producing training_set and test_set.

library(caTools)
## Warning: package 'caTools' was built under R version 3.4.2
set.seed(123)
news$popular= ifelse(news$avg_positive_polarity>0.5,1, 0)
news=news[,-23]
split=sample.split(news$popular , SplitRatio=0.8)
training_set=subset(news, split==TRUE)
test_set=subset(news, split==FALSE)
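
Before training, it is worth checking how imbalanced the new popular flag is. A minimal sketch using the training_set created above (the later summaries suggest only about 5% of articles are labeled popular):

prop.table(table(training_set$popular))   # proportion of unpopular (0) vs. popular (1) articles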

3. Training a model on the data

We will use glm to fit a logistic regression classifier, because our outcome is binary: popular or unpopular. We decided that if avg_positive_polarity > 0.5 the article is popular, and if it is less than or equal to 0.5 it is unpopular.

library(C50)   # loaded here, although a logistic regression (glm) is fit below
## Warning: package 'C50' was built under R version 3.4.2
classifier=glm(formula= popular~ . , 
              family=binomial, data=training_set)
classifier
## 
## Call:  glm(formula = popular ~ ., family = binomial, data = training_set)
## 
## Coefficients:
##                  (Intercept)     self_reference_min_shares  
##                   -1.830e+01                     1.673e-06  
##    self_reference_max_shares    self_reference_avg_sharess  
##                    1.346e-06                    -7.244e-07  
##            weekday_is_monday            weekday_is_tuesday  
##                   -1.572e-01                    -2.502e-01  
##         weekday_is_wednesday           weekday_is_thursday  
##                   -8.288e-02                    -2.687e-01  
##            weekday_is_friday           weekday_is_saturday  
##                   -1.470e-01                    -1.838e-01  
##            weekday_is_sunday                    is_weekend  
##                           NA                            NA  
##                       LDA_00                        LDA_01  
##                   -2.671e-01                    -8.843e-02  
##                       LDA_02                        LDA_03  
##                   -3.256e-01                    -7.457e-02  
##                       LDA_04           global_subjectivity  
##                   -8.800e-01                    -6.027e-01  
##    global_sentiment_polarity    global_rate_positive_words  
##                    2.711e+01                    -5.111e+01  
##   global_rate_negative_words           rate_positive_words  
##                    4.332e+01                    -3.067e+00  
##          rate_negative_words         min_positive_polarity  
##                    6.533e+00                     1.586e+01  
##        max_positive_polarity         avg_negative_polarity  
##                    1.089e+01                    -7.236e+00  
##        min_negative_polarity         max_negative_polarity  
##                    1.261e+00                     1.547e+00  
##           title_subjectivity      title_sentiment_polarity  
##                   -2.547e-01                    -3.948e-01  
##       abs_title_subjectivity  abs_title_sentiment_polarity  
##                    3.005e-02                     1.036e+00  
##                       shares  
##                   -2.013e-06  
## 
## Degrees of Freedom: 31715 Total (i.e. Null);  31685 Residual
## Null Deviance:       12660 
## Residual Deviance: 5585  AIC: 5647
prob_pred = predict(classifier, type= "response", test_set[-33])
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
y_pred= ifelse(prob_pred>0.5,1, 0)
cm=table(test_set[,33] , y_pred)

As the confusion matrix shows, when an article is not popular, we predict it correctly as not popular 7478 times.
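
The confusion matrix cm is computed but not printed above. A minimal sketch, using the cm object from the previous chunk, to display it and derive overall accuracy:

cm                                        # rows: actual class, columns: predicted class
accuracy <- sum(diag(cm)) / sum(cm)       # correct predictions over all test observations
accuracy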

Method 2. Adding regression to trees

We just showed our classification results; now we will fit a regression tree on the same training data.

Tree visualization

  1. We plot the tree using only one digit for the node labels.
library(rpart)
## Warning: package 'rpart' was built under R version 3.4.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.4.2
m.rpart <- rpart(popular ~. , data=training_set)
m.rpart
## n= 31716 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 31716 1517.48500 0.05038466  
##    2) global_sentiment_polarity< 0.3182905 30960 1168.24000 0.03927649  
##      4) min_positive_polarity< 0.2678571 30271  886.40280 0.03019391  
##        8) global_sentiment_polarity< 0.240115 28139  583.37640 0.02118057  
##         16) min_positive_polarity< 0.1431818 26114  378.35340 0.01470476 *
##         17) min_positive_polarity>=0.1431818 2025  189.80540 0.10469140  
##           34) max_positive_polarity< 0.8416667 1463   37.96036 0.02665755 *
##           35) max_positive_polarity>=0.8416667 562  119.74560 0.30782920 *
##        9) global_sentiment_polarity>=0.240115 2132  270.56850 0.14915570  
##         18) rate_positive_words>=0.7639319 1883  155.47110 0.09081253 *
##         19) rate_positive_words< 0.7639319 249   60.21687 0.59036140 *
##      5) min_positive_polarity>=0.2678571 689  169.62840 0.43831640  
##       10) max_positive_polarity< 0.575 292    0.00000 0.00000000 *
##       11) max_positive_polarity>=0.575 397   72.26700 0.76070530 *
##    3) global_sentiment_polarity>=0.3182905 756  188.97880 0.50529100  
##      6) max_positive_polarity< 0.825 185   27.75135 0.18378380 *
##      7) max_positive_polarity>=0.825 571  135.90890 0.60945710 *
rpart.plot(m.rpart, digits=1)

  2. Another tree visualization
rpart.plot(m.rpart, digits=1, fallen.leaves=TRUE, type=3, extra=101)

Evaluating Model Performance

Based on the tree nodes, we suspect global_sentiment_polarity is the variable most correlated with popularity. Therefore, we run two correlation tests to check. Although, intuitively, max_positive_polarity should be correlated with popularity, the results show that global_sentiment_polarity has the stronger correlation.

summary(test_set$max_positive_polarity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.6000  0.8000  0.7532  1.0000  1.0000
summary(test_set$popular)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05033 0.00000 1.00000
cor(test_set$max_positive_polarity, test_set$popular)
## [1] 0.1760677
summary(test_set$global_sentiment_polarity)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.37766  0.05737  0.12044  0.11976  0.17808  0.57374
summary(test_set$popular)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05033 0.00000 1.00000
cor(test_set$global_sentiment_polarity, test_set$popular)
## [1] 0.2417969
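
We can also evaluate the regression tree numerically. A minimal sketch, assuming m.rpart and test_set from above, that predicts on the test set and compares the predictions to the observed popular flag using correlation and mean absolute error (MAE is a helper defined here for illustration):

p.rpart <- predict(m.rpart, test_set)              # predicted popularity scores for the test set
cor(p.rpart, test_set$popular)                     # how closely predictions track the actual flag
MAE <- function(actual, predicted) mean(abs(actual - predicted))
MAE(test_set$popular, p.rpart)                     # average absolute prediction error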

Method 3. KNN Algorithm

Now we will apply the k-Nearest Neighbors algorithm using the pre-processed training and test sets, with k = 21.

library(class)
## Warning: package 'class' was built under R version 3.4.2
train_label= training_set$popular
test_label= test_set$popular
knn_pred<- knn(train=training_set , test=test_set , cl=train_label, k=21)
knn_pred= ifelse(knn_pred>0.5,1, 0)
## Warning in Ops.factor(knn_pred, 0.5): '>' not meaningful for factors
knncm=table(x=test_label, y=knn_pred)
knncm
## < table of extent 2 x 0 >

The resulting table is empty (extent 2 x 0): knn() already returns a factor of predicted class labels, so the ifelse(knn_pred > 0.5, 1, 0) comparison is not meaningful for factors and produces only NAs, and no confusion matrix can be built. A corrected version is sketched below.
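
A corrected sketch, assuming the training_set, test_set, train_label, and test_label objects created above: the popular column (column 33) is dropped from the feature matrices so the label is not used as a predictor, and the factor returned by knn() is tabulated directly against the test labels. Note that the features are not normalized here, which kNN usually benefits from.

library(class)
knn_pred2 <- knn(train = training_set[, -33],       # features only, label column removed
                 test  = test_set[, -33],
                 cl    = train_label,                # class labels of the training rows
                 k     = 21)
table(actual = test_label, predicted = knn_pred2)    # confusion matrix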

Method 4. Support Vector machines

We will now analyze the same pre-processed data set from the support vector machine perspective.

library(kernlab) 
svm_classifier <- ksvm(popular~. , data=training_set, kernel="vanilladot")
##  Setting default kernel parameters
svm_classifier
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 3626 
## 
## Objective Function Value : -6945.975 
## Training error : 0.984
svm_pred <- predict(svm_classifier, test_set)
svm_pred1= ifelse(svm_pred>0.5,1, 0)
svmcm<- table(svm_pred1, test_set$popular)
svmcm
##          
## svm_pred1    0    1
##         0 7529  399

As a result, we can see that we predicted all of the unpopular news correctly (7529 articles), but misclassified every popular article (399) as unpopular. Note that because popular is numeric, ksvm() fit an eps-regression model (eps-svr) rather than a classifier; a classification variant is sketched below.
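
A minimal classification sketch, assuming the same training_set and test_set: converting popular to a factor makes ksvm() fit a C-classification SVM instead of eps-regression, so predictions come back as class labels rather than scores (training_fac, svm_clf, and svm_class_pred are our names for illustration):

library(kernlab)
training_fac <- training_set
training_fac$popular <- factor(training_fac$popular)    # factor response triggers C-svc
svm_clf <- ksvm(popular ~ ., data = training_fac, kernel = "vanilladot")
svm_class_pred <- predict(svm_clf, test_set)             # predicted class labels (0/1)
table(predicted = svm_class_pred, actual = test_set$popular)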

Method 5. Naive Bayes Algorithm

1. Import the dataset

sms <- read.csv('C:/Users/atan/Desktop/sms.csv', stringsAsFactors = FALSE, header = TRUE)
summary(sms)
##      type               text          
##  Length:5559        Length:5559       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
head(sms)
##   type
## 1  ham
## 2  ham
## 3  ham
## 4 spam
## 5 spam
## 6  ham
##                                                                                                                                                                text
## 1                                                                                                                 Hope you are having a good week. Just checking in
## 2                                                                                                                                           K..give back my thanks.
## 3                                                                                                                       Am also doing in cbe only. But have to pay.
## 4            complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+
## 5 okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
## 6                                                                                                                Aiya we discuss later lar... Pick u up at 4 is it?

2. Exploring and preparing the data

library(tm)
## Warning: package 'tm' was built under R version 3.4.2
## Loading required package: NLP
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.2
## Loading required package: RColorBrewer
sms_corpus <- Corpus(VectorSource(sms$text))
print(sms_corpus)
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5559
lapply(sms_corpus[10:15], as.character)
## [[1]]
## [1] "fyi I'm at usf now, swing by the room whenever"
## 
## [[2]]
## [1] "Sure thing big man. i have hockey elections at 6, shouldnâ<U+0082>¬Ë<U+009C>t go on longer than an hour though"
## 
## [[3]]
## [1] "I anything lor..."
## 
## [[4]]
## [1] "By march ending, i should be ready. But will call you for sure. The problem is that my capital never complete. How far with you. How's work and the ladies"
## 
## [[5]]
## [1] "Hmm well, night night "
## 
## [[6]]
## [1] "K I'll be sure to get up before noon and see what's what"
corpus_clean <- tm_map(sms_corpus, tolower)                     # convert all text to lower case
corpus_clean <- tm_map(corpus_clean, removeNumbers)             # drop digits
corpus_clean <- tm_map(corpus_clean, removePunctuation)         # drop punctuation
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())  # remove common English stop words
corpus_clean <- tm_map(corpus_clean, stripWhitespace)           # collapse repeated whitespace
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)         # coerce to plain text documents
corpus <- Corpus(VectorSource(corpus_clean))                    # rebuild a corpus from the cleaned text
sms_dtm <- DocumentTermMatrix(corpus)                           # rows = messages, columns = terms
sms_train <- sms[1:4169, ]
sms_test<-sms[4170: 5559,]
prop.table(table(sms_train$type))
## 
##       ham      spam 
## 0.8647158 0.1352842
prop.table(table(sms_test$type))
## 
##       ham      spam 
## 0.8683453 0.1316547
sms_corpus_train <- corpus_clean[1:4169]
wordcloud(sms_corpus_train, min.freq = 40, random.order = FALSE)
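
The word cloud ends the data preparation, but the Naive Bayes model itself has not been trained yet. A minimal sketch of the remaining steps, assuming sms_dtm, sms_train, and sms_test as created above (convert_counts, sms_train_m, sms_test_m, and sms_classifier are our names for illustration):

library(e1071)
# split the document-term matrix with the same row ranges as the raw data
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test  <- sms_dtm[4170:5559, ]
# keep only terms appearing in at least 5 training messages
sms_freq_words     <- findFreqTerms(sms_dtm_train, 5)
sms_dtm_freq_train <- sms_dtm_train[, sms_freq_words]
sms_dtm_freq_test  <- sms_dtm_test[, sms_freq_words]
# Naive Bayes expects categorical features, so turn word counts into Yes/No indicators
convert_counts <- function(x) ifelse(x > 0, "Yes", "No")
sms_train_m <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test_m  <- apply(sms_dtm_freq_test,  MARGIN = 2, convert_counts)
# train the classifier with Laplace smoothing and evaluate on the held-out messages
sms_classifier <- naiveBayes(sms_train_m, factor(sms_train$type), laplace = 1)
sms_test_pred  <- predict(sms_classifier, sms_test_m)
table(predicted = sms_test_pred, actual = sms_test$type)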