Workshop 3: Analysis of article classification

This work seeks to determine popularity parameters for articles published online. It is a preliminary analysis that explores alternative ways of analyzing articles commercialized on the web.

The following output shows the structure of the data used for the exercise:

#pander(summary(all_articles))
str(all_articles)
## 'data.frame':    39644 obs. of  41 variables:
##  $ url                          : chr  "amazon-instant-video-browser" "ap-samsung-sponsored-tweets" "apple-40-billion-app-downloads" "astronaut-notre-dame-bcs" ...
##  $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 0 0 0 1 1 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ min_positive_polarity        : num  0.1 0.0333 0.1 0.1364 0.0333 ...
##  $ max_positive_polarity        : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
##  $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ min_negative_polarity        : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
##  $ max_negative_polarity        : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
##  $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...

1. Summary table with 5 example rows for each field:

Table continues below
url num_hrefs num_self_hrefs num_imgs
amazon-instant-video-browser 4 2 1
ap-samsung-sponsored-tweets 3 1 1
apple-40-billion-app-downloads 3 1 1
astronaut-notre-dame-bcs 9 0 1
att-u-verse-apps 19 19 20
Table continues below
num_videos num_keywords data_channel_is_lifestyle
0 5 0
0 4 0
0 6 0
0 7 0
0 7 0
Table continues below
data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
1 0 0
0 1 0
0 1 0
1 0 0
0 0 0
Table continues below
data_channel_is_tech data_channel_is_world kw_avg_avg
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
Table continues below
self_reference_min_shares self_reference_max_shares
496 496
0 0
918 918
0 0
545 16000
Table continues below
self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
496 1 0
0 1 0
918 1 0
0 1 0
3151 1 0
Table continues below
weekday_is_wednesday weekday_is_thursday weekday_is_friday
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
Table continues below
weekday_is_saturday weekday_is_sunday is_weekend global_subjectivity
0 0 0 0.5216
0 0 0 0.3412
0 0 0 0.7022
0 0 0 0.4298
0 0 0 0.5135
Table continues below
global_sentiment_polarity global_rate_positive_words
0.09256 0.04566
0.1489 0.04314
0.3233 0.05687
0.1007 0.04143
0.281 0.07463
Table continues below
global_rate_negative_words rate_positive_words rate_negative_words
0.0137 0.7692 0.2308
0.01569 0.7333 0.2667
0.009479 0.8571 0.1429
0.02072 0.6667 0.3333
0.01213 0.8602 0.1398
Table continues below
avg_positive_polarity min_positive_polarity max_positive_polarity
0.3786 0.1 0.7
0.2869 0.03333 0.7
0.4958 0.1 1
0.386 0.1364 0.8
0.4111 0.03333 1
Table continues below
avg_negative_polarity min_negative_polarity max_negative_polarity
-0.35 -0.6 -0.2
-0.1187 -0.125 -0.1
-0.4667 -0.8 -0.1333
-0.3697 -0.6 -0.1667
-0.2202 -0.5 -0.05
Table continues below
title_subjectivity title_sentiment_polarity abs_title_subjectivity
0.5 -0.1875 0
0 0 0.5
0 0 0.5
0 0 0.5
0.4545 0.1364 0.04545
abs_title_sentiment_polarity shares
0.1875 593
0 711
0 1500
0 1200
0.1364 505

2. Split into training and test datasets

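First, we choose the variable avg_positive_polarity to be our dependent variable. Then we randomize the observations and split the data into a training set and a test set with a ratio of 0.8, which gives training_set and test_set. The structure shown below has 7928 observations, about 20% of the 39644 articles, so it appears to be test_set. It no longer lists avg_positive_polarity and includes a new binary column, popular, which is the class label used in section 3.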
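
The randomized split described above can be sketched as follows. This is a minimal illustration rather than the code used for the report: the seed, the toy data frame toy_articles (built from a few num_hrefs and shares values shown earlier), and the index-based approach are all assumptions.

```r
# Minimal sketch of a randomized 80/20 split.
# toy_articles is a hypothetical stand-in for all_articles.
set.seed(123)  # assumed seed, used only to make the shuffle reproducible

toy_articles <- data.frame(
  num_hrefs = c(4, 3, 3, 9, 19, 2, 21, 20, 2, 4),
  shares    = c(593, 711, 1500, 1200, 505, 855, 556, 891, 3600, 710)
)

split_ratio <- 0.8
n <- nrow(toy_articles)

# Shuffle the row indices, then keep the first 80% for training.
shuffled_idx <- sample(n)
n_train <- floor(split_ratio * n)
train_idx <- shuffled_idx[seq_len(n_train)]

training_set <- toy_articles[train_idx, ]
test_set <- toy_articles[-train_idx, ]

nrow(training_set)
nrow(test_set)
```

Shuffling indices instead of the rows themselves keeps the original row names, which makes it easy to confirm that no article ends up in both sets.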

## 'data.frame':    7928 obs. of  41 variables:
##  $ url                          : chr  "astronaut-notre-dame-bcs" "att-u-verse-apps" "canon-poweshot-n" "crayon-creatures" ...
##  $ num_hrefs                    : num  9 19 20 7 26 11 9 20 5 5 ...
##  $ num_self_hrefs               : num  0 19 20 0 18 0 2 20 2 0 ...
##  $ num_imgs                     : num  1 20 20 1 12 1 1 20 1 1 ...
##  $ num_videos                   : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ num_keywords                 : num  7 7 9 7 5 6 7 7 10 6 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 1 1 0 0 0 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  0 545 545 0 555 0 704 545 924 0 ...
##  $ self_reference_max_shares    : num  0 16000 16000 0 14000 0 704 16000 924 0 ...
##  $ self_reference_avg_sharess   : num  0 3151 3151 0 3905 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ global_subjectivity          : num  0.43 0.514 0.543 0.477 0.58 ...
##  $ global_sentiment_polarity    : num  0.1007 0.281 0.2986 0.15 0.0564 ...
##  $ global_rate_positive_words   : num  0.0414 0.0746 0.0839 0.0267 0.0411 ...
##  $ global_rate_negative_words   : num  0.0207 0.0121 0.0152 0.0107 0.0259 ...
##  $ rate_positive_words          : num  0.667 0.86 0.847 0.714 0.613 ...
##  $ rate_negative_words          : num  0.333 0.14 0.153 0.286 0.387 ...
##  $ min_positive_polarity        : num  0.1364 0.0333 0.1 0.2 0.1 ...
##  $ max_positive_polarity        : num  0.8 1 1 0.7 1 1 0.35 1 0.35 1 ...
##  $ avg_negative_polarity        : num  -0.37 -0.22 -0.243 -0.263 -0.401 ...
##  $ min_negative_polarity        : num  -0.6 -0.5 -0.5 -0.4 -1 ...
##  $ max_negative_polarity        : num  -0.167 -0.05 -0.05 -0.125 -0.05 ...
##  $ title_subjectivity           : num  0 0.455 1 0 0.567 ...
##  $ title_sentiment_polarity     : num  0 0.136 0.5 0 -0.1 ...
##  $ abs_title_subjectivity       : num  0.5 0.0455 0.5 0.5 0.0667 ...
##  $ abs_title_sentiment_polarity : num  0 0.136 0.5 0 0.1 ...
##  $ shares                       : int  1200 505 891 1900 13600 3100 598 445 783 573 ...
##  $ popular                      : num  0 0 0 0 1 0 0 0 0 0 ...

3. KNN Algorithm

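library(class)  # provides knn()

train_label <- training_set$popular
test_label <- test_set$popular
# knn() needs numeric predictors only, so drop url and the label itself.
# shares is also dropped, on the assumption that popular is derived from it.
drop_cols <- c("url", "popular", "shares")
train_x <- training_set[, !(names(training_set) %in% drop_cols)]
test_x <- test_set[, !(names(test_set) %in% drop_cols)]
knn_pred <- knn(train = train_x, test = test_x, cl = train_label, k = 21)
# knn() returns a factor of predicted classes, so no 0.5 threshold is needed.
knncm <- table(actual = test_label, predicted = knn_pred)
knncm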
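
KNN relies on distances between rows, so columns with large ranges (for example self_reference_max_shares) can dominate columns bounded between 0 and 1. The sketch below shows one common remedy, min-max scaling with ranges learned from the training rows only, together with a helper that reads accuracy off a confusion matrix. The helper names, the toy data frames, and the label vectors are hypothetical and not part of the original analysis.

```r
# Learn per-column minimum and maximum from the training predictors.
min_max_fit <- function(x) {
  list(min = apply(x, 2, min), max = apply(x, 2, max))
}

# Rescale columns with the training ranges; test values may fall outside [0, 1].
min_max_apply <- function(x, fit) {
  span <- fit$max - fit$min
  span[span == 0] <- 1  # constant columns: avoid division by zero
  centered <- sweep(as.matrix(x), 2, fit$min, "-")
  as.data.frame(sweep(centered, 2, span, "/"))
}

# Accuracy is the share of observations on the diagonal of the confusion matrix.
accuracy_from_cm <- function(cm) sum(diag(cm)) / sum(cm)

# Hypothetical toy predictors, using two column names from the data set.
toy_train <- data.frame(
  num_hrefs = c(4, 3, 9, 19),
  self_reference_max_shares = c(496, 0, 0, 16000)
)
toy_test <- data.frame(
  num_hrefs = c(3, 2),
  self_reference_max_shares = c(918, 8500)
)

fit <- min_max_fit(toy_train)
train_scaled <- min_max_apply(toy_train, fit)
test_scaled <- min_max_apply(toy_test, fit)

# Hypothetical actual and predicted labels, only to exercise the helper.
actual <- factor(c(0, 0, 1, 0, 1), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 0, 0), levels = c(0, 1))
cm <- table(actual = actual, predicted = predicted)
accuracy_from_cm(cm)
```

Fitting the ranges on the training rows and reusing them for the test rows avoids leaking information from the test set into the model.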

Notes and Credits