Online Articles Popularity

Workshop 3: Analysis of article classification

This work seeks to determine popularity parameters for articles published online. It is a preliminary analysis that explores alternative ways of analyzing articles commercialized on the web.

The following output shows the structure of the data used for the exercise:

#pander(summary(all_articles))
str(all_articles)

## 'data.frame':    39644 obs. of  41 variables:
##  $ url                          : chr  "amazon-instant-video-browser" "ap-samsung-sponsored-tweets" "apple-40-billion-app-downloads" "astronaut-notre-dame-bcs" ...
##  $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 0 0 0 1 1 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ min_positive_polarity        : num  0.1 0.0333 0.1 0.1364 0.0333 ...
##  $ max_positive_polarity        : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
##  $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ min_negative_polarity        : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
##  $ max_negative_polarity        : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
##  $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...

1. Summary table with the first 5 rows of each field:

Table continues below
url	num_hrefs	num_self_hrefs	num_imgs
amazon-instant-video-browser	4	2	1
ap-samsung-sponsored-tweets	3	1	1
apple-40-billion-app-downloads	3	1	1
astronaut-notre-dame-bcs	9	0	1
att-u-verse-apps	19	19	20

Table continues below
num_videos	num_keywords	data_channel_is_lifestyle
0	5	0
0	4	0
0	6	0
0	7	0
0	7	0

Table continues below
data_channel_is_entertainment	data_channel_is_bus	data_channel_is_socmed
1	0	0
0	1	0
0	1	0
1	0	0
0	0	0

Table continues below
data_channel_is_tech	data_channel_is_world	kw_avg_avg
0	0	0
0	0	0
0	0	0
0	0	0
1	0	0

Table continues below
self_reference_min_shares	self_reference_max_shares
496	496
0	0
918	918
0	0
545	16000

Table continues below
self_reference_avg_sharess	weekday_is_monday	weekday_is_tuesday
496	1	0
0	1	0
918	1	0
0	1	0
3151	1	0

Table continues below
weekday_is_wednesday	weekday_is_thursday	weekday_is_friday
0	0	0
0	0	0
0	0	0
0	0	0
0	0	0

Table continues below
weekday_is_saturday	weekday_is_sunday	is_weekend	global_subjectivity
0	0	0	0.5216
0	0	0	0.3412
0	0	0	0.7022
0	0	0	0.4298
0	0	0	0.5135

Table continues below
global_sentiment_polarity	global_rate_positive_words
0.09256	0.04566
0.1489	0.04314
0.3233	0.05687
0.1007	0.04143
0.281	0.07463

Table continues below
global_rate_negative_words	rate_positive_words	rate_negative_words
0.0137	0.7692	0.2308
0.01569	0.7333	0.2667
0.009479	0.8571	0.1429
0.02072	0.6667	0.3333
0.01213	0.8602	0.1398

Table continues below
avg_positive_polarity	min_positive_polarity	max_positive_polarity
0.3786	0.1	0.7
0.2869	0.03333	0.7
0.4958	0.1	1
0.386	0.1364	0.8
0.4111	0.03333	1

Table continues below
avg_negative_polarity	min_negative_polarity	max_negative_polarity
-0.35	-0.6	-0.2
-0.1187	-0.125	-0.1
-0.4667	-0.8	-0.1333
-0.3697	-0.6	-0.1667
-0.2202	-0.5	-0.05

Table continues below
title_subjectivity	title_sentiment_polarity	abs_title_subjectivity
0.5	-0.1875	0
0	0	0.5
0	0	0.5
0	0	0.5
0.4545	0.1364	0.04545

abs_title_sentiment_polarity	shares
0.1875	593
0	711
0	1500
0	1200
0.1364	505

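2. Split into training and test datasets

First, we choose the variable avg_positive_polarity to be our dependent variable. Then we randomize the observations and split the data into a training set and a test set with a ratio of 0.8, which gives us training_set and test_set. The structure shown below has 7928 observations, about 20% of the 39644 articles, which matches the size expected for test_set. Note that avg_positive_polarity does not appear in this output, while a binary popular column does; popular is the label used in the KNN step below.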

## 'data.frame':    7928 obs. of  41 variables:
##  $ url                          : chr  "astronaut-notre-dame-bcs" "att-u-verse-apps" "canon-poweshot-n" "crayon-creatures" ...
##  $ num_hrefs                    : num  9 19 20 7 26 11 9 20 5 5 ...
##  $ num_self_hrefs               : num  0 19 20 0 18 0 2 20 2 0 ...
##  $ num_imgs                     : num  1 20 20 1 12 1 1 20 1 1 ...
##  $ num_videos                   : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ num_keywords                 : num  7 7 9 7 5 6 7 7 10 6 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 1 1 0 0 0 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  0 545 545 0 555 0 704 545 924 0 ...
##  $ self_reference_max_shares    : num  0 16000 16000 0 14000 0 704 16000 924 0 ...
##  $ self_reference_avg_sharess   : num  0 3151 3151 0 3905 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ global_subjectivity          : num  0.43 0.514 0.543 0.477 0.58 ...
##  $ global_sentiment_polarity    : num  0.1007 0.281 0.2986 0.15 0.0564 ...
##  $ global_rate_positive_words   : num  0.0414 0.0746 0.0839 0.0267 0.0411 ...
##  $ global_rate_negative_words   : num  0.0207 0.0121 0.0152 0.0107 0.0259 ...
##  $ rate_positive_words          : num  0.667 0.86 0.847 0.714 0.613 ...
##  $ rate_negative_words          : num  0.333 0.14 0.153 0.286 0.387 ...
##  $ min_positive_polarity        : num  0.1364 0.0333 0.1 0.2 0.1 ...
##  $ max_positive_polarity        : num  0.8 1 1 0.7 1 1 0.35 1 0.35 1 ...
##  $ avg_negative_polarity        : num  -0.37 -0.22 -0.243 -0.263 -0.401 ...
##  $ min_negative_polarity        : num  -0.6 -0.5 -0.5 -0.4 -1 ...
##  $ max_negative_polarity        : num  -0.167 -0.05 -0.05 -0.125 -0.05 ...
##  $ title_subjectivity           : num  0 0.455 1 0 0.567 ...
##  $ title_sentiment_polarity     : num  0 0.136 0.5 0 -0.1 ...
##  $ abs_title_subjectivity       : num  0.5 0.0455 0.5 0.5 0.0667 ...
##  $ abs_title_sentiment_polarity : num  0 0.136 0.5 0 0.1 ...
##  $ shares                       : int  1200 505 891 1900 13600 3100 598 445 783 573 ...
##  $ popular                      : num  0 0 0 0 1 0 0 0 0 0 ...
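
The randomize-and-split step described above can be sketched as follows. This is a minimal base R illustration and not necessarily the method used to produce the output above: the small articles data frame is a hypothetical stand-in for all_articles, and the seed is an assumption added only for reproducibility.

```r
# Hypothetical stand-in for all_articles (10 rows instead of 39644)
articles <- data.frame(
  id     = 1:10,
  shares = c(593, 711, 1500, 1200, 505, 855, 556, 891, 3600, 710)
)

set.seed(42)  # assumed seed, only so the shuffle is reproducible

# Shuffle the row indices, then keep 80% of them for training
n         <- nrow(articles)
n_train   <- round(0.8 * n)
train_idx <- sample(n)[seq_len(n_train)]

training_set <- articles[train_idx, ]
test_set     <- articles[-train_idx, ]  # the remaining 20%

nrow(training_set)  # 8
nrow(test_set)      # 2
```

Because the indices are shuffled before the cut, each row lands in exactly one of the two sets, and the split is not affected by the original ordering of the articles.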

3. KNN Algorithm

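library(class)  # provides knn()

train_label <- training_set$popular
test_label <- test_set$popular
# knn() needs numeric predictors only, so drop url and the label itself.
# Assumption: popular is derived from shares, so shares is dropped as well.
drop_cols <- c("url", "shares", "popular")
train_x <- training_set[, !(names(training_set) %in% drop_cols)]
test_x <- test_set[, !(names(test_set) %in% drop_cols)]

# knn() returns predicted class labels as a factor, so no 0.5 threshold is applied
knn_pred <- knn(train = train_x, test = test_x, cl = train_label, k = 21)
knncm <- table(x = test_label, y = knn_pred)
knncm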
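
KNN classifies by distance, so features on large scales (such as share counts) can dominate features on small scales (such as polarity scores). The sketch below shows one common way to handle this, min-max scaling with the training range, followed by knn() from the class package and an accuracy computed from the confusion matrix. The data are synthetic and every name other than knn() and k = 21 is hypothetical; it is an illustration under these assumptions, not the report's actual pipeline.

```r
library(class)  # recommended package shipped with R; provides knn()

set.seed(1)  # assumed seed for reproducibility

# Synthetic two-class data with features on very different scales
n <- 100
x <- data.frame(
  f1 = c(rnorm(n, mean = 0),           rnorm(n, mean = 4)),
  f2 = c(rnorm(n, mean = 0, sd = 100), rnorm(n, mean = 400, sd = 100))
)
y <- factor(rep(c(0, 1), each = n))

# 80/20 split on shuffled indices
idx     <- sample(nrow(x), size = round(0.8 * nrow(x)))
train_x <- x[idx, ]
test_x  <- x[-idx, ]
train_y <- y[idx]
test_y  <- y[-idx]

# Min-max scaling using the training range only, applied to both sets
mins   <- sapply(train_x, min)
ranges <- sapply(train_x, max) - mins
scale_mm <- function(df) scale(df, center = mins, scale = ranges)

# Majority vote among the 21 nearest training points
knn_pred <- knn(train = scale_mm(train_x), test = scale_mm(test_x),
                cl = train_y, k = 21)

# Confusion matrix: rows are true labels, columns are predictions
cm <- table(actual = test_y, predicted = knn_pred)
accuracy <- sum(diag(cm)) / sum(cm)
cm
accuracy
```

The scaling parameters are taken from the training set and reused on the test set so that no information from the test rows influences the model.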

Notes and Credits

If you spot a typo, or any error in the report, please let me know so I can fix it.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. In other words, feel free to use it, share it, edit it for non-commercial purposes and please, give credit.
https://github.com/juandes/FuzzyLogic-R Thanks to the original coder for this repo.
https://rpubs.com/fangya/classification_news The original source on which this exercise is based.
Dataset: K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.