We will use the Online News Popularity data set from the UCI Machine Learning Repository to perform a binary classification (popular vs. unpopular) analysis using four methods: tree-based classification, the k-nearest neighbors (KNN) algorithm, support vector machines, and adding regression to trees. We will use features such as digital media content, earlier popularity of news referenced in the article, the average number of shares of keywords prior to publication, and natural language features to conduct the analysis. Then we will use sms_spam.csv to apply the Naive Bayes algorithm.
news=read.csv('C:/Users/atan/Desktop/news.csv',header = TRUE)
news=news[, 29:61]
str(news)
## 'data.frame': 39644 obs. of 33 variables:
## $ self_reference_min_shares : num 496 0 918 0 545 8500 545 545 0 0 ...
## $ self_reference_max_shares : num 496 0 918 0 16000 8500 16000 16000 0 0 ...
## $ self_reference_avg_sharess : num 496 0 918 0 3151 ...
## $ weekday_is_monday : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_tuesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_wednesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_thursday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_friday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_saturday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_sunday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_weekend : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LDA_00 : num 0.5003 0.7998 0.2178 0.0286 0.0286 ...
## $ LDA_01 : num 0.3783 0.05 0.0333 0.4193 0.0288 ...
## $ LDA_02 : num 0.04 0.0501 0.0334 0.4947 0.0286 ...
## $ LDA_03 : num 0.0413 0.0501 0.0333 0.0289 0.0286 ...
## $ LDA_04 : num 0.0401 0.05 0.6822 0.0286 0.8854 ...
## $ global_subjectivity : num 0.522 0.341 0.702 0.43 0.514 ...
## $ global_sentiment_polarity : num 0.0926 0.1489 0.3233 0.1007 0.281 ...
## $ global_rate_positive_words : num 0.0457 0.0431 0.0569 0.0414 0.0746 ...
## $ global_rate_negative_words : num 0.0137 0.01569 0.00948 0.02072 0.01213 ...
## $ rate_positive_words : num 0.769 0.733 0.857 0.667 0.86 ...
## $ rate_negative_words : num 0.231 0.267 0.143 0.333 0.14 ...
## $ avg_positive_polarity : num 0.379 0.287 0.496 0.386 0.411 ...
## $ min_positive_polarity : num 0.1 0.0333 0.1 0.1364 0.0333 ...
## $ max_positive_polarity : num 0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
## $ avg_negative_polarity : num -0.35 -0.119 -0.467 -0.37 -0.22 ...
## $ min_negative_polarity : num -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
## $ max_negative_polarity : num -0.2 -0.1 -0.133 -0.167 -0.05 ...
## $ title_subjectivity : num 0.5 0 0 0 0.455 ...
## $ title_sentiment_polarity : num -0.188 0 0 0 0.136 ...
## $ abs_title_subjectivity : num 0 0.5 0.5 0.5 0.0455 ...
## $ abs_title_sentiment_polarity: num 0.188 0 0 0 0.136 ...
## $ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
First, we choose the variable avg_positive_polarity to define our dependent variable, popular. Then we randomly split the observations into training and test sets with a split ratio of 0.8, producing training_set and test_set; after the split we check that the class balance is similar in both sets.
library(caTools)
## Warning: package 'caTools' was built under R version 3.4.2
set.seed(123)
news$popular= ifelse(news$avg_positive_polarity>0.5,1, 0)   # label: 1 (popular) if avg_positive_polarity > 0.5, else 0
news=news[,-23]   # drop avg_positive_polarity (column 23) so the label's source is not used as a predictor
split=sample.split(news$popular , SplitRatio=0.8)
training_set=subset(news, split==TRUE)
test_set=subset(news, split==FALSE)
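As a quick sanity check (an addition, not part of the original analysis), we can confirm that the 0/1 distribution of popular is similar in the two sets:
prop.table(table(training_set$popular))   # share of unpopular vs. popular in the training set
prop.table(table(test_set$popular))       # and in the test set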
We will use glm to fit a logistic regression classifier, because our outcome is binary: popular or unpopular. We decided that if the average positive polarity is greater than 0.5 the article is popular, and if it is less than or equal to 0.5 it is unpopular.
library(C50)
## Warning: package 'C50' was built under R version 3.4.2
classifier=glm(formula= popular~ . ,
family=binomial, data=training_set)
classifier
##
## Call: glm(formula = popular ~ ., family = binomial, data = training_set)
##
## Coefficients:
## (Intercept) self_reference_min_shares
## -1.830e+01 1.673e-06
## self_reference_max_shares self_reference_avg_sharess
## 1.346e-06 -7.244e-07
## weekday_is_monday weekday_is_tuesday
## -1.572e-01 -2.502e-01
## weekday_is_wednesday weekday_is_thursday
## -8.288e-02 -2.687e-01
## weekday_is_friday weekday_is_saturday
## -1.470e-01 -1.838e-01
## weekday_is_sunday is_weekend
## NA NA
## LDA_00 LDA_01
## -2.671e-01 -8.843e-02
## LDA_02 LDA_03
## -3.256e-01 -7.457e-02
## LDA_04 global_subjectivity
## -8.800e-01 -6.027e-01
## global_sentiment_polarity global_rate_positive_words
## 2.711e+01 -5.111e+01
## global_rate_negative_words rate_positive_words
## 4.332e+01 -3.067e+00
## rate_negative_words min_positive_polarity
## 6.533e+00 1.586e+01
## max_positive_polarity avg_negative_polarity
## 1.089e+01 -7.236e+00
## min_negative_polarity max_negative_polarity
## 1.261e+00 1.547e+00
## title_subjectivity title_sentiment_polarity
## -2.547e-01 -3.948e-01
## abs_title_subjectivity abs_title_sentiment_polarity
## 3.005e-02 1.036e+00
## shares
## -2.013e-06
##
## Degrees of Freedom: 31715 Total (i.e. Null); 31685 Residual
## Null Deviance: 12660
## Residual Deviance: 5585 AIC: 5647
prob_pred = predict(classifier, type= "response", test_set[-33])
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
y_pred= ifelse(prob_pred>0.5,1, 0)
cm=table(test_set[,33] , y_pred)
As we can see, when an article is not popular, we correctly predict that it is not popular 7478 times.
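A minimal sketch (using the cm table computed above) that prints the confusion matrix and converts it into an overall accuracy:
cm                                    # rows: actual class, columns: predicted class
accuracy <- sum(diag(cm)) / sum(cm)   # correctly classified cases over all test cases
accuracy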
We just showed our prediction results using logistic regression; now we will add regression to trees using rpart.
library(rpart)
## Warning: package 'rpart' was built under R version 3.4.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.4.2
m.rpart <- rpart(popular ~. , data=training_set)
m.rpart
## n= 31716
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 31716 1517.48500 0.05038466
## 2) global_sentiment_polarity< 0.3182905 30960 1168.24000 0.03927649
## 4) min_positive_polarity< 0.2678571 30271 886.40280 0.03019391
## 8) global_sentiment_polarity< 0.240115 28139 583.37640 0.02118057
## 16) min_positive_polarity< 0.1431818 26114 378.35340 0.01470476 *
## 17) min_positive_polarity>=0.1431818 2025 189.80540 0.10469140
## 34) max_positive_polarity< 0.8416667 1463 37.96036 0.02665755 *
## 35) max_positive_polarity>=0.8416667 562 119.74560 0.30782920 *
## 9) global_sentiment_polarity>=0.240115 2132 270.56850 0.14915570
## 18) rate_positive_words>=0.7639319 1883 155.47110 0.09081253 *
## 19) rate_positive_words< 0.7639319 249 60.21687 0.59036140 *
## 5) min_positive_polarity>=0.2678571 689 169.62840 0.43831640
## 10) max_positive_polarity< 0.575 292 0.00000 0.00000000 *
## 11) max_positive_polarity>=0.575 397 72.26700 0.76070530 *
## 3) global_sentiment_polarity>=0.3182905 756 188.97880 0.50529100
## 6) max_positive_polarity< 0.825 185 27.75135 0.18378380 *
## 7) max_positive_polarity>=0.825 571 135.90890 0.60945710 *
rpart.plot(m.rpart, digits=1)
rpart.plot(m.rpart, digits=1, fallen.leaves=TRUE, type=3, extra=101)
Based on the tree nodes, we suspect that global_sentiment_polarity is the variable most correlated with popularity, so we will run two correlation tests to check. As we can see, although max_positive_polarity should intuitively be correlated with popularity, global_sentiment_polarity outdoes it in the results.
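The fitted tree also stores an importance score for each split variable, which offers another quick check alongside the correlations (a minimal sketch using the m.rpart object from above):
m.rpart$variable.importance   # importance from the tree's primary and surrogate splits, largest first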
summary(test_set$max_positive_polarity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.6000 0.8000 0.7532 1.0000 1.0000
summary(test_set$popular)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05033 0.00000 1.00000
cor(test_set$max_positive_polarity, test_set$popular)
## [1] 0.1760677
summary(test_set$global_sentiment_polarity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.37766 0.05737 0.12044 0.11976 0.17808 0.57374
summary(test_set$popular)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05033 0.00000 1.00000
cor(test_set$global_sentiment_polarity, test_set$popular)
## [1] 0.2417969
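To check this more broadly than the two variables above, a minimal sketch (assuming all remaining columns of test_set are numeric, as the str() output indicates) that ranks every predictor by its absolute correlation with popular:
cors <- sapply(test_set[, -33], function(x) cor(x, test_set$popular))   # popular is column 33
head(sort(abs(cors), decreasing = TRUE))                                # strongest absolute correlations first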
Now we will perform the k-nearest neighbors algorithm on the already pre-processed training and test sets, setting k = 21.
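Because k-NN is distance based, features on larger scales can dominate the distance calculation; the knn() call below uses the columns as they are, but a min-max rescaling sketch would look like the following (the normalize helper and the *_norm data frames are introduced here for illustration and are not part of the original analysis):
normalize <- function(x) (x - min(x)) / (max(x) - min(x))                # rescale a numeric column to [0, 1]
training_norm <- as.data.frame(lapply(training_set[, -33], normalize))   # drop the label column (33) before rescaling
test_norm <- as.data.frame(lapply(test_set[, -33], normalize))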
library(class)
## Warning: package 'class' was built under R version 3.4.2
train_label= training_set$popular
test_label= test_set$popular
knn_pred<- knn(train=training_set[,-33] , test=test_set[,-33] , cl=train_label, k=21)   # column 33 is the label, so it is dropped from the feature sets
knncm=table(x=test_label, y=knn_pred)
knncm
knn() already returns the predicted class labels as a factor, so no 0.5 cutoff is needed; the confusion matrix knncm compares these predictions with the true labels of the test set.
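The choice of k = 21 is somewhat arbitrary; a minimal sketch (reusing the objects above) that compares test accuracy for a few candidate values of k:
for (k in c(5, 11, 21, 41)) {
  pred <- knn(train = training_set[, -33], test = test_set[, -33], cl = train_label, k = k)
  acc <- mean(as.integer(as.character(pred)) == test_label)   # convert the factor labels back to 0/1 before comparing
  cat("k =", k, "accuracy =", round(acc, 4), "\n")
}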
Next, we will examine our pre-processed data set from the support vector machine perspective.
library(kernlab)
svm_classifier <- ksvm(popular~. , data=training_set, kernel="vanilladot")
## Setting default kernel parameters
svm_classifier
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 3626
##
## Objective Function Value : -6945.975
## Training error : 0.984
svm_pred <- predict(svm_classifier, test_set)
svm_pred1= ifelse(svm_pred>0.5,1, 0)
svmcm<- table(svm_pred1, test_set$popular)
svmcm
##
## svm_pred1 0 1
## 0 7529 399
As a result, we can see that we correctly predicted all of the unpopular news, but we did not identify any of the popular news correctly (all 399 popular articles were predicted as unpopular).
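Note from the model summary above that ksvm fit an epsilon support vector regression (SV type: eps-svr) because popular is numeric, which is why a 0.5 cutoff was needed on the predictions. A minimal sketch of fitting a classification SVM instead by converting popular to a factor (svm_classifier2 and svm_pred2 are illustrative names, not part of the original analysis):
svm_classifier2 <- ksvm(as.factor(popular) ~ . , data=training_set, kernel="vanilladot")   # a factor response makes ksvm perform C-svc classification
svm_pred2 <- predict(svm_classifier2, test_set)                                            # predicted class labels directly, no cutoff needed
table(svm_pred2, test_set$popular)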
sms <- read.csv('C:/Users/atan/Desktop/sms.csv', stringsAsFactors = FALSE, header = TRUE)
summary(sms)
## type text
## Length:5559 Length:5559
## Class :character Class :character
## Mode :character Mode :character
head(sms)
## type
## 1 ham
## 2 ham
## 3 ham
## 4 spam
## 5 spam
## 6 ham
## text
## 1 Hope you are having a good week. Just checking in
## 2 K..give back my thanks.
## 3 Am also doing in cbe only. But have to pay.
## 4 complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+
## 5 okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
## 6 Aiya we discuss later lar... Pick u up at 4 is it?
library(tm)
## Warning: package 'tm' was built under R version 3.4.2
## Loading required package: NLP
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.2
## Loading required package: RColorBrewer
sms_corpus <- Corpus(VectorSource(sms$text))
print(sms_corpus)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5559
lapply(sms_corpus[10:15], as.character)
## [[1]]
## [1] "fyi I'm at usf now, swing by the room whenever"
##
## [[2]]
## [1] "Sure thing big man. i have hockey elections at 6, shouldnâ<U+0082>¬Ë<U+009C>t go on longer than an hour though"
##
## [[3]]
## [1] "I anything lor..."
##
## [[4]]
## [1] "By march ending, i should be ready. But will call you for sure. The problem is that my capital never complete. How far with you. How's work and the ladies"
##
## [[5]]
## [1] "Hmm well, night night "
##
## [[6]]
## [1] "K I'll be sure to get up before noon and see what's what"
corpus_clean <- tm_map(sms_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
corpus <- Corpus(VectorSource(corpus_clean))
sms_dtm <- DocumentTermMatrix(corpus)
sms_train <- sms[1:4169, ]
sms_test<-sms[4170: 5559,]
prop.table(table(sms_train$type))
##
## ham spam
## 0.8647158 0.1352842
prop.table(table(sms_test$type))
##
## ham spam
## 0.8683453 0.1316547
sms_corpus_train <- corpus_clean[1:4169]
wordcloud(sms_corpus_train, min.freq = 40, random.order = FALSE)