Workshop 3: Analysis of article classification
This work seeks to identify the parameters that determine the popularity of articles published online. It is a preliminary analysis that explores alternative ways of analyzing articles commercialized on the web.
The following listing shows the data used for the exercise:
#pander(summary(all_articles))
str(all_articles)
## 'data.frame': 39644 obs. of 41 variables:
## $ url : chr "amazon-instant-video-browser" "ap-samsung-sponsored-tweets" "apple-40-billion-app-downloads" "astronaut-notre-dame-bcs" ...
## $ num_hrefs : num 4 3 3 9 19 2 21 20 2 4 ...
## $ num_self_hrefs : num 2 1 1 0 19 2 20 20 0 1 ...
## $ num_imgs : num 1 1 1 1 20 0 20 20 0 1 ...
## $ num_videos : num 0 0 0 0 0 0 0 0 0 1 ...
## $ num_keywords : num 5 4 6 7 7 9 10 9 7 5 ...
## $ data_channel_is_lifestyle : num 0 0 0 0 0 0 1 0 0 0 ...
## $ data_channel_is_entertainment: num 1 0 0 1 0 0 0 0 0 0 ...
## $ data_channel_is_bus : num 0 1 1 0 0 0 0 0 0 0 ...
## $ data_channel_is_socmed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ data_channel_is_tech : num 0 0 0 0 1 1 0 1 1 0 ...
## $ data_channel_is_world : num 0 0 0 0 0 0 0 0 0 1 ...
## $ kw_avg_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ self_reference_min_shares : num 496 0 918 0 545 8500 545 545 0 0 ...
## $ self_reference_max_shares : num 496 0 918 0 16000 8500 16000 16000 0 0 ...
## $ self_reference_avg_sharess : num 496 0 918 0 3151 ...
## $ weekday_is_monday : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_tuesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_wednesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_thursday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_friday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_saturday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_sunday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_weekend : num 0 0 0 0 0 0 0 0 0 0 ...
## $ global_subjectivity : num 0.522 0.341 0.702 0.43 0.514 ...
## $ global_sentiment_polarity : num 0.0926 0.1489 0.3233 0.1007 0.281 ...
## $ global_rate_positive_words : num 0.0457 0.0431 0.0569 0.0414 0.0746 ...
## $ global_rate_negative_words : num 0.0137 0.01569 0.00948 0.02072 0.01213 ...
## $ rate_positive_words : num 0.769 0.733 0.857 0.667 0.86 ...
## $ rate_negative_words : num 0.231 0.267 0.143 0.333 0.14 ...
## $ avg_positive_polarity : num 0.379 0.287 0.496 0.386 0.411 ...
## $ min_positive_polarity : num 0.1 0.0333 0.1 0.1364 0.0333 ...
## $ max_positive_polarity : num 0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
## $ avg_negative_polarity : num -0.35 -0.119 -0.467 -0.37 -0.22 ...
## $ min_negative_polarity : num -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
## $ max_negative_polarity : num -0.2 -0.1 -0.133 -0.167 -0.05 ...
## $ title_subjectivity : num 0.5 0 0 0 0.455 ...
## $ title_sentiment_polarity : num -0.188 0 0 0 0.136 ...
## $ abs_title_subjectivity : num 0 0.5 0.5 0.5 0.0455 ...
## $ abs_title_sentiment_polarity : num 0.188 0 0 0 0.136 ...
## $ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
1. Summary table with the first 5 records:
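Table continues below
| url | num_hrefs | num_self_hrefs | num_imgs |
|---|---|---|---|
| amazon-instant-video-browser | 4 | 2 | 1 |
| ap-samsung-sponsored-tweets | 3 | 1 | 1 |
| apple-40-billion-app-downloads | 3 | 1 | 1 |
| astronaut-notre-dame-bcs | 9 | 0 | 1 |
| att-u-verse-apps | 19 | 19 | 20 |
Table continues below
| num_videos | num_keywords | data_channel_is_lifestyle |
|---|---|---|
| 0 | 5 | 0 |
| 0 | 4 | 0 |
| 0 | 6 | 0 |
| 0 | 7 | 0 |
| 0 | 7 | 0 |
Table continues below
| data_channel_is_entertainment | data_channel_is_bus | data_channel_is_socmed |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 0 |
Table continues below
| data_channel_is_tech | data_channel_is_world | kw_avg_avg |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 1 | 0 | 0 |
Table continues below
| self_reference_min_shares | self_reference_max_shares |
|---|---|
| 496 | 496 |
| 0 | 0 |
| 918 | 918 |
| 0 | 0 |
| 545 | 16000 |
Table continues below
| self_reference_avg_sharess | weekday_is_monday | weekday_is_tuesday |
|---|---|---|
| 496 | 1 | 0 |
| 0 | 1 | 0 |
| 918 | 1 | 0 |
| 0 | 1 | 0 |
| 3151 | 1 | 0 |
Table continues below
| weekday_is_wednesday | weekday_is_thursday | weekday_is_friday |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
Table continues below
| weekday_is_saturday | weekday_is_sunday | is_weekend | global_subjectivity |
|---|---|---|---|
| 0 | 0 | 0 | 0.5216 |
| 0 | 0 | 0 | 0.3412 |
| 0 | 0 | 0 | 0.7022 |
| 0 | 0 | 0 | 0.4298 |
| 0 | 0 | 0 | 0.5135 |
Table continues below
| global_sentiment_polarity | global_rate_positive_words |
|---|---|
| 0.09256 | 0.04566 |
| 0.1489 | 0.04314 |
| 0.3233 | 0.05687 |
| 0.1007 | 0.04143 |
| 0.281 | 0.07463 |
Table continues below
| global_rate_negative_words | rate_positive_words | rate_negative_words |
|---|---|---|
| 0.0137 | 0.7692 | 0.2308 |
| 0.01569 | 0.7333 | 0.2667 |
| 0.009479 | 0.8571 | 0.1429 |
| 0.02072 | 0.6667 | 0.3333 |
| 0.01213 | 0.8602 | 0.1398 |
Table continues below
| avg_positive_polarity | min_positive_polarity | max_positive_polarity |
|---|---|---|
| 0.3786 | 0.1 | 0.7 |
| 0.2869 | 0.03333 | 0.7 |
| 0.4958 | 0.1 | 1 |
| 0.386 | 0.1364 | 0.8 |
| 0.4111 | 0.03333 | 1 |
Table continues below
| avg_negative_polarity | min_negative_polarity | max_negative_polarity |
|---|---|---|
| -0.35 | -0.6 | -0.2 |
| -0.1187 | -0.125 | -0.1 |
| -0.4667 | -0.8 | -0.1333 |
| -0.3697 | -0.6 | -0.1667 |
| -0.2202 | -0.5 | -0.05 |
Table continues below
| title_subjectivity | title_sentiment_polarity | abs_title_subjectivity |
|---|---|---|
| 0.5 | -0.1875 | 0 |
| 0 | 0 | 0.5 |
| 0 | 0 | 0.5 |
| 0 | 0 | 0.5 |
| 0.4545 | 0.1364 | 0.04545 |

| abs_title_sentiment_polarity | shares |
|---|---|
| 0.1875 | 593 |
| 0 | 711 |
| 0 | 1500 |
| 0 | 1200 |
| 0.1364 | 505 |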
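2. Splitting the data into training and test sets
First, we choose the variable avg_positive_polarity to be our dependent variable. Then we randomize the observations and split the data into a training set and a test set with a ratio of 0.8, that is, 80% of the observations for training and 20% for testing. This yields two data frames, training_set and test_set. The listing below has 7928 observations, about 20% of the 39644 in the full data, so it corresponds to test_set. Note that it no longer contains avg_positive_polarity and that it includes a new binary column, popular, which serves as the class label in the next section.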
## 'data.frame': 7928 obs. of 41 variables:
## $ url : chr "astronaut-notre-dame-bcs" "att-u-verse-apps" "canon-poweshot-n" "crayon-creatures" ...
## $ num_hrefs : num 9 19 20 7 26 11 9 20 5 5 ...
## $ num_self_hrefs : num 0 19 20 0 18 0 2 20 2 0 ...
## $ num_imgs : num 1 20 20 1 12 1 1 20 1 1 ...
## $ num_videos : num 0 0 0 0 1 0 1 0 0 0 ...
## $ num_keywords : num 7 7 9 7 5 6 7 7 10 6 ...
## $ data_channel_is_lifestyle : num 0 0 0 1 0 0 0 0 0 0 ...
## $ data_channel_is_entertainment: num 1 0 0 0 0 0 0 0 0 0 ...
## $ data_channel_is_bus : num 0 0 0 0 0 1 0 0 0 0 ...
## $ data_channel_is_socmed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ data_channel_is_tech : num 0 1 1 0 0 0 0 1 1 0 ...
## $ data_channel_is_world : num 0 0 0 0 0 0 1 0 0 0 ...
## $ kw_avg_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ self_reference_min_shares : num 0 545 545 0 555 0 704 545 924 0 ...
## $ self_reference_max_shares : num 0 16000 16000 0 14000 0 704 16000 924 0 ...
## $ self_reference_avg_sharess : num 0 3151 3151 0 3905 ...
## $ weekday_is_monday : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_tuesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_wednesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_thursday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_friday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_saturday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_sunday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_weekend : num 0 0 0 0 0 0 0 0 0 0 ...
## $ global_subjectivity : num 0.43 0.514 0.543 0.477 0.58 ...
## $ global_sentiment_polarity : num 0.1007 0.281 0.2986 0.15 0.0564 ...
## $ global_rate_positive_words : num 0.0414 0.0746 0.0839 0.0267 0.0411 ...
## $ global_rate_negative_words : num 0.0207 0.0121 0.0152 0.0107 0.0259 ...
## $ rate_positive_words : num 0.667 0.86 0.847 0.714 0.613 ...
## $ rate_negative_words : num 0.333 0.14 0.153 0.286 0.387 ...
## $ min_positive_polarity : num 0.1364 0.0333 0.1 0.2 0.1 ...
## $ max_positive_polarity : num 0.8 1 1 0.7 1 1 0.35 1 0.35 1 ...
## $ avg_negative_polarity : num -0.37 -0.22 -0.243 -0.263 -0.401 ...
## $ min_negative_polarity : num -0.6 -0.5 -0.5 -0.4 -1 ...
## $ max_negative_polarity : num -0.167 -0.05 -0.05 -0.125 -0.05 ...
## $ title_subjectivity : num 0 0.455 1 0 0.567 ...
## $ title_sentiment_polarity : num 0 0.136 0.5 0 -0.1 ...
## $ abs_title_subjectivity : num 0.5 0.0455 0.5 0.5 0.0667 ...
## $ abs_title_sentiment_polarity : num 0 0.136 0.5 0 0.1 ...
## $ shares : int 1200 505 891 1900 13600 3100 598 445 783 573 ...
## $ popular : num 0 0 0 0 1 0 0 0 0 0 ...
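The split described in this section can be sketched in base R as follows. This is a minimal illustration on a small synthetic data frame, not the code used for the report: the toy all_articles, the seed, and in particular the rule that derives popular from shares (here, shares above the median) are assumptions, since the report does not show how popular was built.

```r
# Toy stand-in for all_articles (assumption: the real data has 39644 rows)
set.seed(123)  # arbitrary seed, only for reproducibility
all_articles <- data.frame(
  num_hrefs = rpois(100, 10),
  num_imgs  = rpois(100, 4),
  shares    = as.integer(rexp(100, rate = 1 / 2000))
)

# Hypothetical label: 1 if shares exceed the median, 0 otherwise
all_articles$popular <- as.numeric(all_articles$shares > median(all_articles$shares))

# Randomize the row order, then keep 80% for training and 20% for testing
shuffled     <- all_articles[sample(nrow(all_articles)), ]
n_train      <- floor(0.8 * nrow(shuffled))
training_set <- shuffled[seq_len(n_train), ]
test_set     <- shuffled[(n_train + 1):nrow(shuffled), ]

dim(training_set)  # 80 rows, 4 columns
dim(test_set)      # 20 rows, 4 columns
```

Shuffling before slicing matters here because the original rows appear to be ordered (the first records are all published on a Monday), so taking the last 20% without randomizing could give a test set that differs systematically from the training set.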
3. KNN Algorithm
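library(class)  # provides knn()
train_label <- training_set$popular
test_label <- test_set$popular
# knn() accepts numeric predictors only, so url and the label itself are excluded
# (exclude shares as well if popular was derived from it)
#predictors <- setdiff(names(training_set), c("url", "popular"))
#knn_pred <- knn(train = training_set[, predictors], test = test_set[, predictors], cl = train_label, k = 21)
# knn() already returns the predicted class as a factor, so no 0.5 threshold is needed
#knncm <- table(x = test_label, y = knn_pred)
#knncm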
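For reference, here is a self-contained sketch of the KNN step on synthetic data, using knn() from the class package. The toy data, the two predictors x1 and x2, and the min-max scaling helper rescale01 are illustrative assumptions and do not come from the report; only k = 21 is taken from the code above.

```r
library(class)  # provides knn()

set.seed(42)  # arbitrary seed for reproducibility
n <- 200
# Synthetic predictors and a binary label loosely tied to the first predictor
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$popular <- as.numeric(toy$x1 + rnorm(n, sd = 0.5) > 0)

# 80/20 split
idx   <- sample(n, size = floor(0.8 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Hypothetical helper: min-max scaling based on the training range,
# so that no single predictor dominates the distance computation
rescale01 <- function(x, ref) (x - min(ref)) / (max(ref) - min(ref))
predictors <- c("x1", "x2")
train_x <- as.data.frame(Map(rescale01, train[predictors], train[predictors]))
test_x  <- as.data.frame(Map(rescale01, test[predictors],  train[predictors]))

# knn() returns a factor with the predicted class of each test row
knn_pred <- knn(train = train_x, test = test_x,
                cl = factor(train$popular, levels = c(0, 1)), k = 21)

# Confusion matrix and overall accuracy
actual <- factor(test$popular, levels = c(0, 1))
knncm  <- table(actual = actual, predicted = knn_pred)
knncm
accuracy <- sum(diag(knncm)) / sum(knncm)
accuracy
```

Because KNN relies on distances, the predictors are rescaled before calling knn(). On the real data, columns such as self_reference_max_shares (values up to 16000 in the listing) would otherwise outweigh the rate and polarity columns, which lie between -1 and 1.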
Notes and Credits
- If you spot a typo, or any error in the report, please let me know so I can fix it.
- This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. In other words, feel free to use it, share it, edit it for non-commercial purposes and please, give credit.
- https://github.com/juandes/FuzzyLogic-R Credit to the original coder; thanks for this repo.
- https://rpubs.com/fangya/classification_news Basis of the exercise and original source.
- Dataset: K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.