In the rapidly evolving digital age, the dissemination of news has largely shifted from traditional print media to online platforms. This transition has created a competitive landscape where the popularity of news articles is crucial for driving web traffic, engagement, and revenue. Understanding the factors that contribute to the popularity of online news articles can provide valuable insights for content creators, publishers, and marketers to optimize their strategies and maximize reach. This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years.
The objective of this project is to conduct data analysis on the Online News Popularity dataset to uncover patterns, trends, and insights that could inform strategies for increasing the popularity of online news articles and to predict the number of shares in social networks (popularity).
The dataset is obtained from the UCI Machine Learning Repository (Online News Popularity).
The dataset consists of 39,644 entries with 61 columns. The main variable in the study is the number of shares, which serves as the indicator of an article's popularity. The 61 variables (60 attributes plus the target) are detailed below:
• url: URL of the article (non-predictive).
• timedelta: Days between the article publication and the dataset acquisition (non-predictive).
• n_tokens_title: Number of words in the title.
• n_tokens_content: Number of words in the content.
• n_unique_tokens: Rate of unique words in the content.
• n_non_stop_words: Rate of non-stop words in the content.
• n_non_stop_unique_tokens: Rate of unique non-stop words in the content.
• num_hrefs: Number of links.
• num_self_hrefs: Number of links to other articles published by Mashable.
• num_imgs: Number of images.
• num_videos: Number of videos.
• average_token_length: Average length of the words in the content.
• num_keywords: Number of keywords in the metadata.
• data_channel_is_lifestyle: Is data channel ‘Lifestyle’?
• data_channel_is_entertainment: Is data channel ‘Entertainment’?
• data_channel_is_bus: Is data channel ‘Business’?
• data_channel_is_socmed: Is data channel ‘Social Media’?
• data_channel_is_tech: Is data channel ‘Tech’?
• data_channel_is_world: Is data channel ‘World’?
• kw_min_min: Worst keyword (min. shares).
• kw_max_min: Worst keyword (max. shares).
• kw_avg_min: Worst keyword (avg. shares).
• kw_min_max: Best keyword (min. shares).
• kw_max_max: Best keyword (max. shares).
• kw_avg_max: Best keyword (avg. shares).
• kw_min_avg: Avg. keyword (min. shares).
• kw_max_avg: Avg. keyword (max. shares).
• kw_avg_avg: Avg. keyword (avg. shares).
• self_reference_min_shares: Min. shares of referenced articles in Mashable.
• self_reference_max_shares: Max. shares of referenced articles in Mashable.
• self_reference_avg_shares: Avg. shares of referenced articles in Mashable.
• weekday_is_monday: Was the article published on a Monday?
• weekday_is_tuesday: Was the article published on a Tuesday?
• weekday_is_wednesday: Was the article published on a Wednesday?
• weekday_is_thursday: Was the article published on a Thursday?
• weekday_is_friday: Was the article published on a Friday?
• weekday_is_saturday: Was the article published on a Saturday?
• weekday_is_sunday: Was the article published on a Sunday?
• is_weekend: Was the article published on the weekend?
• LDA_00: Closeness to LDA topic 0.
• LDA_01: Closeness to LDA topic 1.
• LDA_02: Closeness to LDA topic 2.
• LDA_03: Closeness to LDA topic 3.
• LDA_04: Closeness to LDA topic 4.
• global_subjectivity: Text subjectivity.
• global_sentiment_polarity: Text sentiment polarity.
• global_rate_positive_words: Rate of positive words in the content.
• global_rate_negative_words: Rate of negative words in the content.
• rate_positive_words: Rate of positive words among non-neutral tokens.
• rate_negative_words: Rate of negative words among non-neutral tokens.
• avg_positive_polarity: Avg. polarity of positive words.
• min_positive_polarity: Min. polarity of positive words.
• max_positive_polarity: Max. polarity of positive words.
• avg_negative_polarity: Avg. polarity of negative words.
• min_negative_polarity: Min. polarity of negative words.
• max_negative_polarity: Max. polarity of negative words.
• title_subjectivity: Title subjectivity.
• title_sentiment_polarity: Title polarity.
• abs_title_subjectivity: Absolute subjectivity level.
• abs_title_sentiment_polarity: Absolute polarity level.
• shares: Number of shares (target).
First, we import the data set. We then use str() to inspect its structure: the class, length and content of each column.
'data.frame': 39644 obs. of 61 variables:
$ url : chr "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
$ timedelta : num 731 731 731 731 731 731 731 731 731 731 ...
$ n_tokens_title : num 12 9 9 9 13 10 8 12 11 10 ...
$ n_tokens_content : num 219 255 211 531 1072 ...
$ n_unique_tokens : num 0.664 0.605 0.575 0.504 0.416 ...
$ n_non_stop_words : num 1 1 1 1 1 ...
$ n_non_stop_unique_tokens : num 0.815 0.792 0.664 0.666 0.541 ...
$ num_hrefs : num 4 3 3 9 19 2 21 20 2 4 ...
$ num_self_hrefs : num 2 1 1 0 19 2 20 20 0 1 ...
$ num_imgs : num 1 1 1 1 20 0 20 20 0 1 ...
$ num_videos : num 0 0 0 0 0 0 0 0 0 1 ...
$ average_token_length : num 4.68 4.91 4.39 4.4 4.68 ...
$ num_keywords : num 5 4 6 7 7 9 10 9 7 5 ...
$ data_channel_is_lifestyle : num 0 0 0 0 0 0 1 0 0 0 ...
$ data_channel_is_entertainment: num 1 0 0 1 0 0 0 0 0 0 ...
$ data_channel_is_bus : num 0 1 1 0 0 0 0 0 0 0 ...
$ data_channel_is_socmed : num 0 0 0 0 0 0 0 0 0 0 ...
$ data_channel_is_tech : num 0 0 0 0 1 1 0 1 1 0 ...
$ data_channel_is_world : num 0 0 0 0 0 0 0 0 0 1 ...
$ kw_min_min : num 0 0 0 0 0 0 0 0 0 0 ...
$ kw_max_min : num 0 0 0 0 0 0 0 0 0 0 ...
$ kw_avg_min : num 0 0 0 0 0 0 0 0 0 0 ...
$ kw_min_max : num 0 0 0 0 0 0 0 0 0 0 ...
$ kw_max_max : num 0 0 0 0 0 0 0 0 0 0 ...
$ kw_avg_max : num 0 0 0 0 0 0 0 0 0 0 ...
$ kw_min_avg : num 0 0 0 0 0 0 0 0 0 0 ...
$ kw_max_avg : num 0 0 0 0 0 0 0 0 0 0 ...
$ kw_avg_avg : num 0 0 0 0 0 0 0 0 0 0 ...
$ self_reference_min_shares : num 496 0 918 0 545 8500 545 545 0 0 ...
$ self_reference_max_shares : num 496 0 918 0 16000 8500 16000 16000 0 0 ...
$ self_reference_avg_sharess : num 496 0 918 0 3151 ...
$ weekday_is_monday : num 1 1 1 1 1 1 1 1 1 1 ...
$ weekday_is_tuesday : num 0 0 0 0 0 0 0 0 0 0 ...
$ weekday_is_wednesday : num 0 0 0 0 0 0 0 0 0 0 ...
$ weekday_is_thursday : num 0 0 0 0 0 0 0 0 0 0 ...
$ weekday_is_friday : num 0 0 0 0 0 0 0 0 0 0 ...
$ weekday_is_saturday : num 0 0 0 0 0 0 0 0 0 0 ...
$ weekday_is_sunday : num 0 0 0 0 0 0 0 0 0 0 ...
$ is_weekend : num 0 0 0 0 0 0 0 0 0 0 ...
$ LDA_00 : num 0.5003 0.7998 0.2178 0.0286 0.0286 ...
$ LDA_01 : num 0.3783 0.05 0.0333 0.4193 0.0288 ...
$ LDA_02 : num 0.04 0.0501 0.0334 0.4947 0.0286 ...
$ LDA_03 : num 0.0413 0.0501 0.0333 0.0289 0.0286 ...
$ LDA_04 : num 0.0401 0.05 0.6822 0.0286 0.8854 ...
$ global_subjectivity : num 0.522 0.341 0.702 0.43 0.514 ...
$ global_sentiment_polarity : num 0.0926 0.1489 0.3233 0.1007 0.281 ...
$ global_rate_positive_words : num 0.0457 0.0431 0.0569 0.0414 0.0746 ...
$ global_rate_negative_words : num 0.0137 0.01569 0.00948 0.02072 0.01213 ...
$ rate_positive_words : num 0.769 0.733 0.857 0.667 0.86 ...
$ rate_negative_words : num 0.231 0.267 0.143 0.333 0.14 ...
$ avg_positive_polarity : num 0.379 0.287 0.496 0.386 0.411 ...
$ min_positive_polarity : num 0.1 0.0333 0.1 0.1364 0.0333 ...
$ max_positive_polarity : num 0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
$ avg_negative_polarity : num -0.35 -0.119 -0.467 -0.37 -0.22 ...
$ min_negative_polarity : num -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
$ max_negative_polarity : num -0.2 -0.1 -0.133 -0.167 -0.05 ...
$ title_subjectivity : num 0.5 0 0 0 0.455 ...
$ title_sentiment_polarity : num -0.188 0 0 0 0.136 ...
$ abs_title_subjectivity : num 0 0.5 0.5 0.5 0.0455 ...
$ abs_title_sentiment_polarity : num 0.188 0 0 0 0.136 ...
$ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
All of the variables are numeric (shares is stored as an integer), except the first variable, url, which is of character type.
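The import and inspection step can be sketched as follows. This is a minimal illustration, not the report's actual code: the file written here is a tiny stand-in with made-up rows, and the variable names csv_path, news and col_classes are assumptions.

```r
# Toy stand-in for the real CSV so the sketch runs anywhere.
# The path and the example rows are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "url,timedelta,n_tokens_title,shares",
  "http://example.com/a,731,12,593",
  "http://example.com/b,731,9,711"
), csv_path)

# Read the file; keep text columns as character rather than factor
news <- read.csv(csv_path, stringsAsFactors = FALSE)

str(news)                            # class, length and first values per column
dim(news)                            # number of rows and columns
col_classes <- sapply(news, class)   # one class name per column
print(col_classes)
```

With the real file, the same calls would be pointed at the downloaded CSV instead of the temporary one.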
Now, we take a look at the first few rows of the dataset.
url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | num_videos | average_token_length | num_keywords | data_channel_is_lifestyle | data_channel_is_entertainment | data_channel_is_bus | data_channel_is_socmed | data_channel_is_tech | data_channel_is_world | kw_min_min | kw_max_min | kw_avg_min | kw_min_max | kw_max_max | kw_avg_max | kw_min_avg | kw_max_avg | kw_avg_avg | self_reference_min_shares | self_reference_max_shares | self_reference_avg_sharess | weekday_is_monday | weekday_is_tuesday | weekday_is_wednesday | weekday_is_thursday | weekday_is_friday | weekday_is_saturday | weekday_is_sunday | is_weekend | LDA_00 | LDA_01 | LDA_02 | LDA_03 | LDA_04 | global_subjectivity | global_sentiment_polarity | global_rate_positive_words | global_rate_negative_words | rate_positive_words | rate_negative_words | avg_positive_polarity | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
http://mashable.com/2013/01/07/amazon-instant-video-browser/ | 731 | 12 | 219 | 0.6635945 | 1 | 0.8153846 | 4 | 2 | 1 | 0 | 4.680365 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 496 | 496 | 496.000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5003312 | 0.3782789 | 0.0400047 | 0.0412626 | 0.0401225 | 0.5216171 | 0.0925620 | 0.0456621 | 0.0136986 | 0.7692308 | 0.2307692 | 0.3786364 | 0.1000000 | 0.7 | -0.3500000 | -0.600 | -0.2000000 | 0.5000000 | -0.1875000 | 0.0000000 | 0.1875000 | 593 |
http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ | 731 | 9 | 255 | 0.6047431 | 1 | 0.7919463 | 3 | 1 | 1 | 0 | 4.913725 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.7997557 | 0.0500467 | 0.0500963 | 0.0501007 | 0.0500007 | 0.3412458 | 0.1489478 | 0.0431373 | 0.0156863 | 0.7333333 | 0.2666667 | 0.2869146 | 0.0333333 | 0.7 | -0.1187500 | -0.125 | -0.1000000 | 0.0000000 | 0.0000000 | 0.5000000 | 0.0000000 | 711 |
http://mashable.com/2013/01/07/apple-40-billion-app-downloads/ | 731 | 9 | 211 | 0.5751295 | 1 | 0.6638655 | 3 | 1 | 1 | 0 | 4.393365 | 6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 918 | 918 | 918.000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2177923 | 0.0333345 | 0.0333514 | 0.0333335 | 0.6821883 | 0.7022222 | 0.3233333 | 0.0568720 | 0.0094787 | 0.8571429 | 0.1428571 | 0.4958333 | 0.1000000 | 1.0 | -0.4666667 | -0.800 | -0.1333333 | 0.0000000 | 0.0000000 | 0.5000000 | 0.0000000 | 1500 |
http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/ | 731 | 9 | 531 | 0.5037879 | 1 | 0.6656347 | 9 | 0 | 1 | 0 | 4.404896 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0285732 | 0.4192996 | 0.4946508 | 0.0289047 | 0.0285716 | 0.4298497 | 0.1007047 | 0.0414313 | 0.0207156 | 0.6666667 | 0.3333333 | 0.3859652 | 0.1363636 | 0.8 | -0.3696970 | -0.600 | -0.1666667 | 0.0000000 | 0.0000000 | 0.5000000 | 0.0000000 | 1200 |
http://mashable.com/2013/01/07/att-u-verse-apps/ | 731 | 13 | 1072 | 0.4156456 | 1 | 0.5408895 | 19 | 19 | 20 | 0 | 4.682836 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 545 | 16000 | 3151.158 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0286328 | 0.0287936 | 0.0285752 | 0.0285717 | 0.8854268 | 0.5135021 | 0.2810035 | 0.0746269 | 0.0121269 | 0.8602151 | 0.1397849 | 0.4111274 | 0.0333333 | 1.0 | -0.2201923 | -0.500 | -0.0500000 | 0.4545455 | 0.1363636 | 0.0454545 | 0.1363636 | 505 |
http://mashable.com/2013/01/07/beewi-smart-toys/ | 731 | 10 | 370 | 0.5598886 | 1 | 0.6981982 | 2 | 2 | 0 | 0 | 4.359459 | 9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8500 | 8500 | 8500.000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0222453 | 0.3067176 | 0.0222313 | 0.0222243 | 0.6265816 | 0.4374086 | 0.0711842 | 0.0297297 | 0.0270270 | 0.5238095 | 0.4761905 | 0.3506100 | 0.1363636 | 0.6 | -0.1950000 | -0.400 | -0.1000000 | 0.6428571 | 0.2142857 | 0.1428571 | 0.2142857 | 855 |
http://mashable.com/2013/01/07/bodymedia-armbandgets-update/ | 731 | 8 | 960 | 0.4181626 | 1 | 0.5498339 | 21 | 20 | 20 | 0 | 4.654167 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 545 | 16000 | 3151.158 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0200817 | 0.1147054 | 0.0200244 | 0.0200153 | 0.8251732 | 0.5144803 | 0.2683027 | 0.0802083 | 0.0166667 | 0.8279570 | 0.1720430 | 0.4020386 | 0.1000000 | 1.0 | -0.2244792 | -0.500 | -0.0500000 | 0.0000000 | 0.0000000 | 0.5000000 | 0.0000000 | 556 |
http://mashable.com/2013/01/07/canon-poweshot-n/ | 731 | 12 | 989 | 0.4335736 | 1 | 0.5721078 | 20 | 20 | 20 | 0 | 4.617796 | 9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 545 | 16000 | 3151.158 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0222244 | 0.1507330 | 0.2434355 | 0.0222236 | 0.5613836 | 0.5434742 | 0.2986135 | 0.0839232 | 0.0151668 | 0.8469388 | 0.1530612 | 0.4277205 | 0.1000000 | 1.0 | -0.2427778 | -0.500 | -0.0500000 | 1.0000000 | 0.5000000 | 0.5000000 | 0.5000000 | 891 |
http://mashable.com/2013/01/07/car-of-the-future-infographic/ | 731 | 11 | 97 | 0.6701031 | 1 | 0.8367347 | 2 | 0 | 0 | 0 | 4.855670 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4582504 | 0.0289794 | 0.0286619 | 0.0296959 | 0.4544124 | 0.5388889 | 0.1611111 | 0.0309278 | 0.0206186 | 0.6000000 | 0.4000000 | 0.5666667 | 0.4000000 | 0.8 | -0.1250000 | -0.125 | -0.1250000 | 0.1250000 | 0.0000000 | 0.3750000 | 0.0000000 | 3600 |
http://mashable.com/2013/01/07/chuck-hagel-website/ | 731 | 10 | 231 | 0.6363636 | 1 | 0.7971014 | 4 | 1 | 1 | 1 | 5.090909 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0400001 | 0.0400000 | 0.8399972 | 0.0400006 | 0.0400020 | 0.3138889 | 0.0518519 | 0.0389610 | 0.0303030 | 0.5625000 | 0.4375000 | 0.2984127 | 0.1000000 | 0.5 | -0.2380952 | -0.500 | -0.1000000 | 0.0000000 | 0.0000000 | 0.5000000 | 0.0000000 | 710 |
[1] 39644 61
Here, shares is the response variable.
Our objective is to investigate and identify the decisive factors that lead to the sharing of articles published on the Mashable website.
Before the analysis, we check whether the data set contains any missing values, since missing information may lead to erroneous conclusions. If there are missing observations, they can be handled either by deleting the rows or columns that contain them, or by imputing each missing value with a constant or with a statistic such as the mean, median or mode of the column in which it occurs.
url timedelta
0 0
n_tokens_title n_tokens_content
0 0
n_unique_tokens n_non_stop_words
0 0
n_non_stop_unique_tokens num_hrefs
0 0
num_self_hrefs num_imgs
0 0
num_videos average_token_length
0 0
num_keywords data_channel_is_lifestyle
0 0
data_channel_is_entertainment data_channel_is_bus
0 0
data_channel_is_socmed data_channel_is_tech
0 0
data_channel_is_world kw_min_min
0 0
kw_max_min kw_avg_min
0 0
kw_min_max kw_max_max
0 0
kw_avg_max kw_min_avg
0 0
kw_max_avg kw_avg_avg
0 0
self_reference_min_shares self_reference_max_shares
0 0
self_reference_avg_sharess weekday_is_monday
0 0
weekday_is_tuesday weekday_is_wednesday
0 0
weekday_is_thursday weekday_is_friday
0 0
weekday_is_saturday weekday_is_sunday
0 0
is_weekend LDA_00
0 0
LDA_01 LDA_02
0 0
LDA_03 LDA_04
0 0
global_subjectivity global_sentiment_polarity
0 0
global_rate_positive_words global_rate_negative_words
0 0
rate_positive_words rate_negative_words
0 0
avg_positive_polarity min_positive_polarity
0 0
max_positive_polarity avg_negative_polarity
0 0
min_negative_polarity max_negative_polarity
0 0
title_subjectivity title_sentiment_polarity
0 0
abs_title_subjectivity abs_title_sentiment_polarity
0 0
shares
0
The data set contains no missing values, so the analysis can proceed.
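The missing-value check, and the two remedies mentioned earlier (deletion and imputation), can be sketched on a small made-up data frame. The toy values and the choice of the median for imputation are assumptions for illustration only.

```r
# Toy data with one missing value (hypothetical values)
news <- data.frame(
  n_tokens_title = c(12, 9, NA),
  shares = c(593L, 711L, 1500L)
)

# Count missing values per column, as in the per-column listing
na_per_column <- colSums(is.na(news))
print(na_per_column)

# Option 1: drop every row that contains a missing value
complete_rows <- news[complete.cases(news), ]

# Option 2: replace missing values with the column median
imputed <- news
for (col in names(imputed)) {
  if (is.numeric(imputed[[col]])) {
    missing <- is.na(imputed[[col]])
    imputed[[col]][missing] <- median(imputed[[col]], na.rm = TRUE)
  }
}
```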
Here the two non-predictive attributes (url and timedelta) are dropped from the dataset, since these variables are metadata and cannot be treated as features. n_tokens_content represents the number of words in the content; however, its minimum value is 0, which means that some articles have no content. Such records are dropped, as their related attributes add nothing to the analysis. The is_weekend column is also dropped, since it duplicates the information in the existing weekday_is_saturday and weekday_is_sunday columns.
[1] 38463 58
The dimension of the data reduces from (39644, 61) to (38463, 58).
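The cleaning step described above (dropping url, timedelta and is_weekend, and removing articles with no content) could look roughly like this. The toy rows and the names drop_cols and news_clean are assumptions.

```r
# Toy data with one empty article (hypothetical values)
news <- data.frame(
  url = c("a", "b", "c"),
  timedelta = c(731, 731, 730),
  n_tokens_content = c(219, 0, 255),
  is_weekend = c(0, 0, 1),
  shares = c(593L, 711L, 1500L),
  stringsAsFactors = FALSE
)

drop_cols <- c("url", "timedelta", "is_weekend")

# Keep rows with at least one word of content and all columns not in drop_cols
news_clean <- news[news$n_tokens_content > 0,
                   !(names(news) %in% drop_cols),
                   drop = FALSE]

dim(news_clean)   # rows and columns after cleaning
```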
Next, the columns weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday, weekday_is_friday, weekday_is_saturday and weekday_is_sunday are converted into a single variable, day_of_the_week.
row | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | num_videos | average_token_length | num_keywords | data_channel_is_lifestyle | data_channel_is_entertainment | data_channel_is_bus | data_channel_is_socmed | data_channel_is_tech | data_channel_is_world | kw_min_min | kw_max_min | kw_avg_min | kw_min_max | kw_max_max | kw_avg_max | kw_min_avg | kw_max_avg | kw_avg_avg | self_reference_min_shares | self_reference_max_shares | self_reference_avg_sharess | LDA_00 | LDA_01 | LDA_02 | LDA_03 | LDA_04 | global_subjectivity | global_sentiment_polarity | global_rate_positive_words | global_rate_negative_words | rate_positive_words | rate_negative_words | avg_positive_polarity | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares | day_of_the_week |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 12.00 | 219.00 | 0.66 | 1.00 | 0.82 | 4.00 | 2.00 | 1.00 | 0.00 | 4.68 | 5.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 496.00 | 496.00 | 496.00 | 0.50 | 0.38 | 0.04 | 0.04 | 0.04 | 0.52 | 0.09 | 0.05 | 0.01 | 0.77 | 0.23 | 0.38 | 0.10 | 0.70 | -0.35 | -0.60 | -0.20 | 0.50 | -0.19 | 0.00 | 0.19 | 593 | Monday |
2 | 9.00 | 255.00 | 0.60 | 1.00 | 0.79 | 3.00 | 1.00 | 1.00 | 0.00 | 4.91 | 4.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.80 | 0.05 | 0.05 | 0.05 | 0.05 | 0.34 | 0.15 | 0.04 | 0.02 | 0.73 | 0.27 | 0.29 | 0.03 | 0.70 | -0.12 | -0.12 | -0.10 | 0.00 | 0.00 | 0.50 | 0.00 | 711 | Monday |
3 | 9.00 | 211.00 | 0.58 | 1.00 | 0.66 | 3.00 | 1.00 | 1.00 | 0.00 | 4.39 | 6.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 918.00 | 918.00 | 918.00 | 0.22 | 0.03 | 0.03 | 0.03 | 0.68 | 0.70 | 0.32 | 0.06 | 0.01 | 0.86 | 0.14 | 0.50 | 0.10 | 1.00 | -0.47 | -0.80 | -0.13 | 0.00 | 0.00 | 0.50 | 0.00 | 1500 | Monday |
4 | 9.00 | 531.00 | 0.50 | 1.00 | 0.67 | 9.00 | 0.00 | 1.00 | 0.00 | 4.40 | 7.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.42 | 0.49 | 0.03 | 0.03 | 0.43 | 0.10 | 0.04 | 0.02 | 0.67 | 0.33 | 0.39 | 0.14 | 0.80 | -0.37 | -0.60 | -0.17 | 0.00 | 0.00 | 0.50 | 0.00 | 1200 | Monday |
5 | 13.00 | 1072.00 | 0.42 | 1.00 | 0.54 | 19.00 | 19.00 | 20.00 | 0.00 | 4.68 | 7.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 545.00 | 16000.00 | 3151.16 | 0.03 | 0.03 | 0.03 | 0.03 | 0.89 | 0.51 | 0.28 | 0.07 | 0.01 | 0.86 | 0.14 | 0.41 | 0.03 | 1.00 | -0.22 | -0.50 | -0.05 | 0.45 | 0.14 | 0.05 | 0.14 | 505 | Monday |
6 | 10.00 | 370.00 | 0.56 | 1.00 | 0.70 | 2.00 | 2.00 | 0.00 | 0.00 | 4.36 | 9.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 8500.00 | 8500.00 | 8500.00 | 0.02 | 0.31 | 0.02 | 0.02 | 0.63 | 0.44 | 0.07 | 0.03 | 0.03 | 0.52 | 0.48 | 0.35 | 0.14 | 0.60 | -0.20 | -0.40 | -0.10 | 0.64 | 0.21 | 0.14 | 0.21 | 855 | Monday |
7 | 8.00 | 960.00 | 0.42 | 1.00 | 0.55 | 21.00 | 20.00 | 20.00 | 0.00 | 4.65 | 10.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 545.00 | 16000.00 | 3151.16 | 0.02 | 0.11 | 0.02 | 0.02 | 0.83 | 0.51 | 0.27 | 0.08 | 0.02 | 0.83 | 0.17 | 0.40 | 0.10 | 1.00 | -0.22 | -0.50 | -0.05 | 0.00 | 0.00 | 0.50 | 0.00 | 556 | Monday |
8 | 12.00 | 989.00 | 0.43 | 1.00 | 0.57 | 20.00 | 20.00 | 20.00 | 0.00 | 4.62 | 9.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 545.00 | 16000.00 | 3151.16 | 0.02 | 0.15 | 0.24 | 0.02 | 0.56 | 0.54 | 0.30 | 0.08 | 0.02 | 0.85 | 0.15 | 0.43 | 0.10 | 1.00 | -0.24 | -0.50 | -0.05 | 1.00 | 0.50 | 0.50 | 0.50 | 891 | Monday |
9 | 11.00 | 97.00 | 0.67 | 1.00 | 0.84 | 2.00 | 0.00 | 0.00 | 0.00 | 4.86 | 7.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.46 | 0.03 | 0.03 | 0.03 | 0.45 | 0.54 | 0.16 | 0.03 | 0.02 | 0.60 | 0.40 | 0.57 | 0.40 | 0.80 | -0.12 | -0.12 | -0.12 | 0.12 | 0.00 | 0.38 | 0.00 | 3600 | Monday |
10 | 10.00 | 231.00 | 0.64 | 1.00 | 0.80 | 4.00 | 1.00 | 1.00 | 1.00 | 5.09 | 5.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.04 | 0.84 | 0.04 | 0.04 | 0.31 | 0.05 | 0.04 | 0.03 | 0.56 | 0.44 | 0.30 | 0.10 | 0.50 | -0.24 | -0.50 | -0.10 | 0.00 | 0.00 | 0.50 | 0.00 | 710 | Monday |
The column day_of_the_week represents the day of the week on which each article was published, based on the information from the seven original columns. The seven original columns are now redundant and are removed to simplify the analysis.
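One way to collapse the seven indicator columns into day_of_the_week is sketched below. It assumes exactly one indicator is 1 in each row; the toy rows and helper names (day_labels, weekday_cols, one_hot) are illustrative, not taken from the original code.

```r
day_labels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
weekday_cols <- paste0("weekday_is_", tolower(day_labels))

# Toy indicator matrix: articles published on Monday, Saturday, Wednesday
one_hot <- diag(7)[c(1, 6, 3), ]
colnames(one_hot) <- weekday_cols
news <- data.frame(shares = c(593L, 711L, 1500L), one_hot)

# For each row, find the position of the column holding the 1
idx <- max.col(as.matrix(news[weekday_cols]), ties.method = "first")
news$day_of_the_week <- day_labels[idx]

# The seven indicator columns are now redundant
news[weekday_cols] <- NULL
print(news)
```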
Similarly, the columns data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech and data_channel_is_world are converted into a single variable, data_channel, with articles that belong to none of these channels grouped as others.
row | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | num_videos | average_token_length | kw_min_min | kw_max_min | kw_avg_min | kw_min_max | kw_max_max | kw_avg_max | kw_min_avg | kw_max_avg | kw_avg_avg | self_reference_min_shares | self_reference_max_shares | self_reference_avg_sharess | LDA_00 | LDA_01 | LDA_02 | LDA_03 | LDA_04 | global_subjectivity | global_sentiment_polarity | global_rate_positive_words | global_rate_negative_words | rate_positive_words | rate_negative_words | avg_positive_polarity | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares | day_of_the_week | data_channel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 12.00 | 219.00 | 0.66 | 1.00 | 0.82 | 4.00 | 2.00 | 1.00 | 0.00 | 4.68 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 496.00 | 496.00 | 496.00 | 0.50 | 0.38 | 0.04 | 0.04 | 0.04 | 0.52 | 0.09 | 0.05 | 0.01 | 0.77 | 0.23 | 0.38 | 0.10 | 0.70 | -0.35 | -0.60 | -0.20 | 0.50 | -0.19 | 0.00 | 0.19 | 593 | Monday | Entertainment |
2 | 9.00 | 255.00 | 0.60 | 1.00 | 0.79 | 3.00 | 1.00 | 1.00 | 0.00 | 4.91 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.80 | 0.05 | 0.05 | 0.05 | 0.05 | 0.34 | 0.15 | 0.04 | 0.02 | 0.73 | 0.27 | 0.29 | 0.03 | 0.70 | -0.12 | -0.12 | -0.10 | 0.00 | 0.00 | 0.50 | 0.00 | 711 | Monday | Business |
3 | 9.00 | 211.00 | 0.58 | 1.00 | 0.66 | 3.00 | 1.00 | 1.00 | 0.00 | 4.39 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 918.00 | 918.00 | 918.00 | 0.22 | 0.03 | 0.03 | 0.03 | 0.68 | 0.70 | 0.32 | 0.06 | 0.01 | 0.86 | 0.14 | 0.50 | 0.10 | 1.00 | -0.47 | -0.80 | -0.13 | 0.00 | 0.00 | 0.50 | 0.00 | 1500 | Monday | Business |
4 | 9.00 | 531.00 | 0.50 | 1.00 | 0.67 | 9.00 | 0.00 | 1.00 | 0.00 | 4.40 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.42 | 0.49 | 0.03 | 0.03 | 0.43 | 0.10 | 0.04 | 0.02 | 0.67 | 0.33 | 0.39 | 0.14 | 0.80 | -0.37 | -0.60 | -0.17 | 0.00 | 0.00 | 0.50 | 0.00 | 1200 | Monday | Entertainment |
5 | 13.00 | 1072.00 | 0.42 | 1.00 | 0.54 | 19.00 | 19.00 | 20.00 | 0.00 | 4.68 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 545.00 | 16000.00 | 3151.16 | 0.03 | 0.03 | 0.03 | 0.03 | 0.89 | 0.51 | 0.28 | 0.07 | 0.01 | 0.86 | 0.14 | 0.41 | 0.03 | 1.00 | -0.22 | -0.50 | -0.05 | 0.45 | 0.14 | 0.05 | 0.14 | 505 | Monday | Tech |
6 | 10.00 | 370.00 | 0.56 | 1.00 | 0.70 | 2.00 | 2.00 | 0.00 | 0.00 | 4.36 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 8500.00 | 8500.00 | 8500.00 | 0.02 | 0.31 | 0.02 | 0.02 | 0.63 | 0.44 | 0.07 | 0.03 | 0.03 | 0.52 | 0.48 | 0.35 | 0.14 | 0.60 | -0.20 | -0.40 | -0.10 | 0.64 | 0.21 | 0.14 | 0.21 | 855 | Monday | Tech |
7 | 8.00 | 960.00 | 0.42 | 1.00 | 0.55 | 21.00 | 20.00 | 20.00 | 0.00 | 4.65 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 545.00 | 16000.00 | 3151.16 | 0.02 | 0.11 | 0.02 | 0.02 | 0.83 | 0.51 | 0.27 | 0.08 | 0.02 | 0.83 | 0.17 | 0.40 | 0.10 | 1.00 | -0.22 | -0.50 | -0.05 | 0.00 | 0.00 | 0.50 | 0.00 | 556 | Monday | Lifestyle |
8 | 12.00 | 989.00 | 0.43 | 1.00 | 0.57 | 20.00 | 20.00 | 20.00 | 0.00 | 4.62 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 545.00 | 16000.00 | 3151.16 | 0.02 | 0.15 | 0.24 | 0.02 | 0.56 | 0.54 | 0.30 | 0.08 | 0.02 | 0.85 | 0.15 | 0.43 | 0.10 | 1.00 | -0.24 | -0.50 | -0.05 | 1.00 | 0.50 | 0.50 | 0.50 | 891 | Monday | Tech |
9 | 11.00 | 97.00 | 0.67 | 1.00 | 0.84 | 2.00 | 0.00 | 0.00 | 0.00 | 4.86 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.46 | 0.03 | 0.03 | 0.03 | 0.45 | 0.54 | 0.16 | 0.03 | 0.02 | 0.60 | 0.40 | 0.57 | 0.40 | 0.80 | -0.12 | -0.12 | -0.12 | 0.12 | 0.00 | 0.38 | 0.00 | 3600 | Monday | Tech |
10 | 10.00 | 231.00 | 0.64 | 1.00 | 0.80 | 4.00 | 1.00 | 1.00 | 1.00 | 5.09 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.04 | 0.84 | 0.04 | 0.04 | 0.31 | 0.05 | 0.04 | 0.03 | 0.56 | 0.44 | 0.30 | 0.10 | 0.50 | -0.24 | -0.50 | -0.10 | 0.00 | 0.00 | 0.50 | 0.00 | 710 | Monday | World |
The data_channel column now holds the channel information for every article in a single column, so the original indicator columns are redundant and are removed to simplify the analysis.
[1] 38463 46
The dimension of the data frame further shrinks from 58 to 46 columns.
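The data_channel conversion can be sketched in the same way. Unlike the weekday indicators, a row may have no channel flag set, so the sketch assigns such rows to a fallback level; the label "Others", the display labels and the toy rows are assumptions.

```r
channel_cols <- c("data_channel_is_lifestyle", "data_channel_is_entertainment",
                  "data_channel_is_bus", "data_channel_is_socmed",
                  "data_channel_is_tech", "data_channel_is_world")
channel_labels <- c("Lifestyle", "Entertainment", "Business",
                    "Social Media", "Tech", "World")

# Toy rows: Entertainment, Business, and one article with no channel flag
news <- data.frame(
  data_channel_is_lifestyle     = c(0, 0, 0),
  data_channel_is_entertainment = c(1, 0, 0),
  data_channel_is_bus           = c(0, 1, 0),
  data_channel_is_socmed        = c(0, 0, 0),
  data_channel_is_tech          = c(0, 0, 0),
  data_channel_is_world         = c(0, 0, 0),
  shares                        = c(593L, 711L, 1500L)
)

flags <- as.matrix(news[channel_cols])
idx <- max.col(flags, ties.method = "first")

# Rows with no flag set fall back to the assumed "Others" label
news$data_channel <- ifelse(rowSums(flags) == 0, "Others", channel_labels[idx])

news[channel_cols] <- NULL
print(news)
```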
The variables day_of_the_week and data_channel are then converted to factors, since both are categorical in nature.
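A minimal sketch of the factor conversion is given below. Fixing the level order of day_of_the_week to calendar order is a design choice of this sketch (it keeps tables and plots in a natural order), not something stated in the report.

```r
# Toy data (hypothetical values)
news <- data.frame(
  day_of_the_week = c("Monday", "Sunday", "Monday"),
  data_channel = c("Tech", "World", "Tech"),
  stringsAsFactors = FALSE
)

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Explicit levels keep the days in calendar order instead of alphabetical
news$day_of_the_week <- factor(news$day_of_the_week, levels = day_levels)
news$data_channel <- factor(news$data_channel)

str(news)
table(news$day_of_the_week)   # days with no articles appear with count 0
```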
Histograms of the variables are plotted to understand the distribution of each feature.
Summary of analysis:
The variables n_tokens_title, global_subjectivity, global_sentiment_polarity, min_negative_polarity and avg_positive_polarity appear to be approximately normally distributed.
The variables n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, num_imgs, num_videos, average_token_length, kw_min_min, kw_max_min, kw_avg_min, kw_min_max, kw_avg_max, kw_min_avg, kw_max_avg, kw_avg_avg, self_reference_min_shares, self_reference_max_shares, self_reference_avg_shares, LDA_00, LDA_01, LDA_02, LDA_03, LDA_04, global_rate_positive_words, global_rate_negative_words, rate_negative_words, min_positive_polarity, title_subjectivity, abs_title_sentiment_polarity and shares are heavily right-skewed.
The variables kw_max_max, max_positive_polarity, avg_negative_polarity, max_negative_polarity and abs_title_subjectivity are heavily left-skewed.
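The skewness judgements above are read off the histograms. As a complementary numerical check, one could compute a moment-based sample skewness per variable: positive values indicate a longer right tail, negative values a longer left tail. The helper skewness() and the toy vectors below are illustrative assumptions, not part of the original analysis.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_tailed <- c(1, 1, 2, 2, 3, 50)   # one large value stretches the right tail
left_tailed  <- -right_tailed          # mirror image: long left tail
symmetric    <- c(1, 2, 3, 4, 5)

skew_values <- sapply(
  list(right = right_tailed, left = left_tailed, symmetric = symmetric),
  skewness
)
print(round(skew_values, 3))
```

On the real data the same function could be applied to every numeric column with sapply().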
Classifying articles by popularity:
Here, an article is considered popular if its number of shares is greater than the median number of shares, and unpopular otherwise.
popularity
Popular Unpopular
20464 17999
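The median split can be sketched as follows, using the first ten share counts shown earlier as toy input. One assumption is flagged explicitly: the text says "greater than the median", but the reported counts (more popular than unpopular articles) suggest that articles exactly at the median were counted as popular, so the sketch uses >=. Replace it with > for the strict rule.

```r
# First ten share counts from the data preview, used as toy input
shares <- c(593L, 711L, 1500L, 1200L, 505L, 855L, 556L, 891L, 3600L, 710L)

med <- median(shares)

# Assumption: ties at the median count as popular (>=)
popularity <- factor(
  ifelse(shares >= med, "Popular", "Unpopular"),
  levels = c("Popular", "Unpopular")
)

popularity_counts <- table(popularity)
print(med)
print(popularity_counts)
```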
Popularity of shares:
Popular articles outnumber unpopular ones.
Articles with 8 to 16 words in the title have the highest number of shares.
Articles with fewer than 1,600 words of content receive more shares and are therefore more likely to be popular.
Articles with 0 to 45 links have the highest number of shares.
Articles whose average polarity of positive words lies between 0.2 and 0.6 have the highest number of shares.
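Range-based observations like the ones above can also be checked numerically by binning a feature and totalling the shares per bin. The sketch below does this for title length, using the first ten rows of the data preview as toy input; the break points are arbitrary choices for illustration.

```r
# First ten rows of the preview (title length and shares)
news <- data.frame(
  n_tokens_title = c(12, 9, 9, 9, 13, 10, 8, 12, 11, 10),
  shares = c(593, 711, 1500, 1200, 505, 855, 556, 891, 3600, 710)
)

# Bin title length; the break points are hypothetical
news$title_bin <- cut(news$n_tokens_title,
                      breaks = c(0, 8, 12, 16),
                      include.lowest = TRUE)

# Total shares and number of articles per bin
shares_by_bin <- tapply(news$shares, news$title_bin, sum)
articles_by_bin <- table(news$title_bin)

print(shares_by_bin)
print(articles_by_bin)
```

Comparing totals with article counts per bin helps separate "this range attracts more shares" from "this range simply contains more articles".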
More articles are published in the World, Tech and Entertainment channels than in any of the other channels.
Popularity of different data channels:
Among popular articles, the hottest subjects appear to be the Business, Technology and Social Media data channels.
The majority of articles are published on weekdays rather than at the weekend.
Popularity of Shares based on the days of the week:
Relative to the total number of articles published on each day, articles published at the weekend (i.e. Saturday and Sunday) tend to be shared more than those published on weekdays. Most popular articles are usually posted on Mondays.
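The distinction between absolute counts and the proportion of popular articles per day can be illustrated with a row-wise proportion table. The toy data below is made up to show how a day with more popular articles in total can still have a lower popular share.

```r
# Toy data: six Monday articles (three popular), two Saturday articles (both popular)
news <- data.frame(
  day_of_the_week = c(rep("Monday", 6), rep("Saturday", 2)),
  popularity = c("Popular", "Popular", "Popular",
                 "Unpopular", "Unpopular", "Unpopular",
                 "Popular", "Popular"),
  stringsAsFactors = FALSE
)

# Absolute counts per day and popularity class
counts <- table(news$day_of_the_week, news$popularity)

# Proportion of popular articles within each day (rows sum to 1)
share_popular <- prop.table(counts, margin = 1)[, "Popular"]

print(counts)
print(share_popular)
```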
Relationship between rate of unique non-stop words in the content and the number of shares:
The box plot shows that articles whose rate of unique non-stop words in the content falls between 0.6 and 0.8 have the highest number of shares.
Popularity of shares based on the number of images in the content:
Articles containing between 1 and 40 images tend to be popular.
Popularity of shares based on the number of videos in the content:
Articles containing between 1 and 15 videos have the highest number of shares.
Relationship between average token length and the number of shares:
Articles whose average word length in the content is between 4 and 6 have the highest number of shares.
Relationship between number of keywords and the number of shares:
The number of keywords in the metadata influences the number of shares: articles with more than 5 keywords tend to be more popular.
[Console output: a vector of 4,523 values of shares, presumably the outlying values discussed below, with a maximum of 843,300; full listing omitted.]
The dependent variable shares has a strongly right-skewed distribution: most articles receive relatively few shares, while a small number receive very many. It also contains a large number of outliers, i.e. data points that lie far from the rest. To reduce the influence of these outliers on the model’s predictions, shares is log-transformed, so the model is fitted to the logarithm of each article’s share count.
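The effect of the transformation can be illustrated with a small sketch. It uses synthetic log-normal data as a stand-in for shares (not the real dataset) and assumes outliers are counted with the usual 1.5 × IQR box-plot rule via `boxplot.stats()`; the report does not state which rule was actually applied.

```r
set.seed(1)

# Synthetic, right-skewed stand-in for the shares column (not the real data).
shares <- round(rlnorm(1000, meanlog = 7.5, sdlog = 1))
shares <- pmax(shares, 1)  # guard against log(0)

# Outliers by the 1.5 * IQR box-plot rule, before and after the transform.
n_out_raw  <- length(boxplot.stats(shares)$out)
log_shares <- log(shares)   # natural logarithm
n_out_log  <- length(boxplot.stats(log_shares)$out)

# A mean well above the median is a quick sign of right skew.
print(c(mean = mean(shares), median = median(shares)))
print(c(mean = mean(log_shares), median = median(log_shares)))
print(c(outliers_raw = n_out_raw, outliers_log = n_out_log))
```

On the log scale the mean and median sit much closer together and far fewer points fall outside the whiskers, which is the behaviour the transformation is meant to achieve.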
Fitting a multiple regression model by least squares:
Call:
lm(formula = shares ~ ., data = data)
Residuals:
Min 1Q Median 3Q Max
-8.0764 -0.5433 -0.1628 0.3867 5.9779
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.581e+00 4.448e-01 14.798 < 2e-16 ***
n_tokens_title 6.952e-03 2.186e-03 3.180 0.00147 **
n_tokens_content 4.685e-05 1.683e-05 2.784 0.00538 **
n_unique_tokens 1.548e-01 1.446e-01 1.070 0.28459
n_non_stop_words 5.822e-02 4.887e-02 1.191 0.23355
n_non_stop_unique_tokens -2.575e-01 1.228e-01 -2.096 0.03608 *
num_hrefs 4.240e-03 5.042e-04 8.409 < 2e-16 ***
num_self_hrefs -7.307e-03 1.337e-03 -5.464 4.67e-08 ***
num_imgs 1.450e-03 6.956e-04 2.084 0.03712 *
num_videos 1.526e-03 1.190e-03 1.282 0.19971
average_token_length -8.548e-02 1.829e-02 -4.673 2.98e-06 ***
kw_min_min 9.322e-04 1.225e-04 7.612 2.77e-14 ***
kw_max_min 1.831e-05 3.792e-06 4.827 1.39e-06 ***
kw_avg_min -1.357e-04 2.318e-05 -5.855 4.81e-09 ***
kw_min_max -4.010e-07 9.115e-08 -4.399 1.09e-05 ***
kw_max_max 5.891e-08 4.328e-08 1.361 0.17353
kw_avg_max -3.886e-07 5.928e-08 -6.554 5.68e-11 ***
kw_min_avg -5.703e-05 5.659e-06 -10.078 < 2e-16 ***
kw_max_avg -4.334e-05 1.930e-06 -22.451 < 2e-16 ***
kw_avg_avg 3.457e-04 1.086e-05 31.831 < 2e-16 ***
self_reference_min_shares 6.659e-07 5.655e-07 1.178 0.23896
self_reference_max_shares -2.181e-08 3.069e-07 -0.071 0.94334
self_reference_avg_sharess 1.525e-06 7.845e-07 1.944 0.05189 .
LDA_00 2.090e-01 3.456e-02 6.048 1.48e-09 ***
LDA_01 -1.517e-01 3.846e-02 -3.944 8.03e-05 ***
LDA_02 -2.435e-01 3.464e-02 -7.029 2.11e-12 ***
LDA_03 -1.202e-01 3.653e-02 -3.289 0.00101 **
LDA_04 NA NA NA NA
global_subjectivity 3.933e-01 6.404e-02 6.141 8.26e-10 ***
global_sentiment_polarity -1.096e-01 1.254e-01 -0.874 0.38205
global_rate_positive_words -1.155e+00 5.388e-01 -2.144 0.03202 *
global_rate_negative_words 4.728e-01 1.028e+00 0.460 0.64561
rate_positive_words 2.926e-01 4.342e-01 0.674 0.50035
rate_negative_words 1.327e-01 4.376e-01 0.303 0.76178
avg_positive_polarity 7.668e-03 1.027e-01 0.075 0.94050
min_positive_polarity -2.814e-01 8.602e-02 -3.272 0.00107 **
max_positive_polarity -2.539e-02 3.240e-02 -0.784 0.43327
avg_negative_polarity -1.377e-01 9.463e-02 -1.455 0.14559
min_negative_polarity 5.223e-03 3.450e-02 0.151 0.87966
max_negative_polarity 7.905e-02 7.868e-02 1.005 0.31504
title_subjectivity 6.118e-02 2.096e-02 2.918 0.00352 **
title_sentiment_polarity 7.488e-02 1.923e-02 3.894 9.89e-05 ***
abs_title_subjectivity 1.328e-01 2.787e-02 4.763 1.91e-06 ***
abs_title_sentiment_polarity 2.581e-02 3.031e-02 0.852 0.39446
day_of_the_weekMonday -7.522e-03 1.586e-02 -0.474 0.63533
day_of_the_weekSaturday 2.211e-01 2.132e-02 10.370 < 2e-16 ***
day_of_the_weekSunday 2.110e-01 2.051e-02 10.288 < 2e-16 ***
day_of_the_weekThursday -6.618e-02 1.554e-02 -4.260 2.05e-05 ***
day_of_the_weekTuesday -7.458e-02 1.548e-02 -4.817 1.46e-06 ***
day_of_the_weekWednesday -6.796e-02 1.547e-02 -4.393 1.12e-05 ***
data_channelEntertainment -1.949e-02 2.767e-02 -0.704 0.48126
data_channelLifestyle 5.756e-02 2.808e-02 2.050 0.04038 *
data_channelOthers 1.808e-01 2.925e-02 6.180 6.48e-10 ***
data_channelSocial Media 3.173e-01 2.368e-02 13.398 < 2e-16 ***
data_channelTech 2.667e-01 2.518e-02 10.595 < 2e-16 ***
data_channelWorld 1.131e-01 2.709e-02 4.176 2.98e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8644 on 38408 degrees of freedom
Multiple R-squared: 0.1289, Adjusted R-squared: 0.1277
F-statistic: 105.3 on 54 and 38408 DF, p-value: < 2.2e-16
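Because the response is the logarithm of shares, the estimates in the summary above are on the log scale. The sketch below converts a few of them into approximate percentage changes in shares. It assumes the natural logarithm was used (R's default for `log()`); the three estimates are copied from the coefficient table, and the vector name `est` is only illustrative.

```r
# Log-scale estimates copied from the coefficient table above.
est <- c(Saturday = 0.2211, Sunday = 0.2110, num_hrefs = 0.004240)

# Assuming a natural-log response, exp(b) - 1 is the proportional change in
# shares for a one-unit increase in the predictor, other predictors held fixed.
pct_change <- 100 * (exp(est) - 1)
print(round(pct_change, 1))
```

Read this way, a Saturday article is associated with roughly 25% more shares than an article published on the baseline day (Friday, the level omitted from the table), holding the other predictors fixed.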
The model explains 12.89% of the variance in the log of shares. The “NA” values in the output indicate that a coefficient could not be estimated because of a singularity in the model matrix, which occurs when a predictor is an exact linear combination of other predictors. Here the affected variable is LDA_04, most likely because the five LDA topic weights sum to one for every article. It is therefore removed (column 27 of the data) and the model is refitted without it.
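The singularity can be reproduced on a small scale. The following sketch uses synthetic data in which three hypothetical topic weights (`t0`, `t1`, `t2`) sum to one in every row, mimicking the situation assumed for the LDA columns; it is not the report's data, and it drops the aliased column by name rather than by position.

```r
set.seed(42)
n <- 200

# Synthetic topic weights that sum to 1 in every row.
raw <- matrix(rexp(n * 3), ncol = 3)
w   <- raw / rowSums(raw)
toy <- data.frame(y = rnorm(n), t0 = w[, 1], t1 = w[, 2], t2 = w[, 3])

# With an intercept, t2 = 1 - t0 - t1 is redundant, so lm() reports NA for it.
fit     <- lm(y ~ ., data = toy)
aliased <- names(which(is.na(coef(fit))))
print(aliased)

# Refit without the aliased column; the fitted values do not change.
fit2 <- lm(y ~ ., data = toy[, setdiff(names(toy), aliased)])
print(all.equal(fitted(fit), fitted(fit2)))
```

Dropping the redundant column changes nothing about the fit itself, which is why the two summaries in this report show identical estimates, residuals and R-squared.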
Call:
lm(formula = shares ~ ., data = data[, -27])
Residuals:
Min 1Q Median 3Q Max
-8.0764 -0.5433 -0.1628 0.3867 5.9779
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.581e+00 4.448e-01 14.798 < 2e-16 ***
n_tokens_title 6.952e-03 2.186e-03 3.180 0.00147 **
n_tokens_content 4.685e-05 1.683e-05 2.784 0.00538 **
n_unique_tokens 1.548e-01 1.446e-01 1.070 0.28459
n_non_stop_words 5.822e-02 4.887e-02 1.191 0.23355
n_non_stop_unique_tokens -2.575e-01 1.228e-01 -2.096 0.03608 *
num_hrefs 4.240e-03 5.042e-04 8.409 < 2e-16 ***
num_self_hrefs -7.307e-03 1.337e-03 -5.464 4.67e-08 ***
num_imgs 1.450e-03 6.956e-04 2.084 0.03712 *
num_videos 1.526e-03 1.190e-03 1.282 0.19971
average_token_length -8.548e-02 1.829e-02 -4.673 2.98e-06 ***
kw_min_min 9.322e-04 1.225e-04 7.612 2.77e-14 ***
kw_max_min 1.831e-05 3.792e-06 4.827 1.39e-06 ***
kw_avg_min -1.357e-04 2.318e-05 -5.855 4.81e-09 ***
kw_min_max -4.010e-07 9.115e-08 -4.399 1.09e-05 ***
kw_max_max 5.891e-08 4.328e-08 1.361 0.17353
kw_avg_max -3.886e-07 5.928e-08 -6.554 5.68e-11 ***
kw_min_avg -5.703e-05 5.659e-06 -10.078 < 2e-16 ***
kw_max_avg -4.334e-05 1.930e-06 -22.451 < 2e-16 ***
kw_avg_avg 3.457e-04 1.086e-05 31.831 < 2e-16 ***
self_reference_min_shares 6.659e-07 5.655e-07 1.178 0.23896
self_reference_max_shares -2.181e-08 3.069e-07 -0.071 0.94334
self_reference_avg_sharess 1.525e-06 7.845e-07 1.944 0.05189 .
LDA_00 2.090e-01 3.456e-02 6.048 1.48e-09 ***
LDA_01 -1.517e-01 3.846e-02 -3.944 8.03e-05 ***
LDA_02 -2.435e-01 3.464e-02 -7.029 2.11e-12 ***
LDA_03 -1.202e-01 3.653e-02 -3.289 0.00101 **
global_subjectivity 3.933e-01 6.404e-02 6.141 8.26e-10 ***
global_sentiment_polarity -1.096e-01 1.254e-01 -0.874 0.38205
global_rate_positive_words -1.155e+00 5.388e-01 -2.144 0.03202 *
global_rate_negative_words 4.728e-01 1.028e+00 0.460 0.64561
rate_positive_words 2.926e-01 4.342e-01 0.674 0.50035
rate_negative_words 1.327e-01 4.376e-01 0.303 0.76178
avg_positive_polarity 7.668e-03 1.027e-01 0.075 0.94050
min_positive_polarity -2.814e-01 8.602e-02 -3.272 0.00107 **
max_positive_polarity -2.539e-02 3.240e-02 -0.784 0.43327
avg_negative_polarity -1.377e-01 9.463e-02 -1.455 0.14559
min_negative_polarity 5.223e-03 3.450e-02 0.151 0.87966
max_negative_polarity 7.905e-02 7.868e-02 1.005 0.31504
title_subjectivity 6.118e-02 2.096e-02 2.918 0.00352 **
title_sentiment_polarity 7.488e-02 1.923e-02 3.894 9.89e-05 ***
abs_title_subjectivity 1.328e-01 2.787e-02 4.763 1.91e-06 ***
abs_title_sentiment_polarity 2.581e-02 3.031e-02 0.852 0.39446
day_of_the_weekMonday -7.522e-03 1.586e-02 -0.474 0.63533
day_of_the_weekSaturday 2.211e-01 2.132e-02 10.370 < 2e-16 ***
day_of_the_weekSunday 2.110e-01 2.051e-02 10.288 < 2e-16 ***
day_of_the_weekThursday -6.618e-02 1.554e-02 -4.260 2.05e-05 ***
day_of_the_weekTuesday -7.458e-02 1.548e-02 -4.817 1.46e-06 ***
day_of_the_weekWednesday -6.796e-02 1.547e-02 -4.393 1.12e-05 ***
data_channelEntertainment -1.949e-02 2.767e-02 -0.704 0.48126
data_channelLifestyle 5.756e-02 2.808e-02 2.050 0.04038 *
data_channelOthers 1.808e-01 2.925e-02 6.180 6.48e-10 ***
data_channelSocial Media 3.173e-01 2.368e-02 13.398 < 2e-16 ***
data_channelTech 2.667e-01 2.518e-02 10.595 < 2e-16 ***
data_channelWorld 1.131e-01 2.709e-02 4.176 2.98e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8644 on 38408 degrees of freedom
Multiple R-squared: 0.1289, Adjusted R-squared: 0.1277
F-statistic: 105.3 on 54 and 38408 DF, p-value: < 2.2e-16
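The output indicates that approximately 12.89% of the variance in the log of shares can be explained by the model. The fit is unchanged from the previous output because the removed variable was an exact linear combination of the remaining predictors.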
Next, we check whether there are any anomalies in the dataset.
Cook’s Distance measures how much the fitted values of the regression change when an observation is deleted. It therefore combines the outlier and leverage diagnostics in a single measure. The Cook’s Distance statistic is
\(D_{i}=\frac{\sum_{j=1}^{n}(\hat{y}_{j}-\hat{y}_{j(i)})^{2}}{ps^{2}}\) where \(p\) is the number of estimated regression coefficients, \(s^{2}\) is the Mean Squared Error and \(\hat{y}_{j(i)}\) is the jth fitted response value after deleting the ith observation.
If \(D_{i}>\frac{4}{n}\), where n is the number of observations, then the ith observation is tagged as an influential point.
The points above the red line are the influential points.
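The Cook’s Distance rule above can be illustrated on a small example. The following is a minimal sketch in plain Python, not the code used to produce the report's output: it assumes a one-predictor model (so p = 2), uses made-up toy data, and relies on the standard equivalent form \(D_{i}=\frac{e_{i}^{2}h_{ii}}{ps^{2}(1-h_{ii})^{2}}\), which avoids refitting the model n times. The function names are hypothetical.

```python
def simple_ols(x, y):
    """Fit y = b0 + b1*x by least squares; return residuals and leverages."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return residuals, leverages


def cooks_distance(x, y):
    """Cook's distance for every observation of a one-predictor model."""
    n, p = len(x), 2  # p counts the intercept and the slope
    e, h = simple_ols(x, y)
    s2 = sum(ei ** 2 for ei in e) / (n - p)  # mean squared error
    return [ei ** 2 * hi / (p * s2 * (1 - hi) ** 2) for ei, hi in zip(e, h)]


# Toy data (hypothetical): the last point lies far from the trend of the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 9.2, 5.0]

d = cooks_distance(x, y)
threshold = 4 / len(x)  # the 4/n cut-off used above
influential = [i for i, di in enumerate(d) if di > threshold]
print("Influential observations (0-based index):", influential)
```

Only the cut-off 4/n is taken from the text; the data and the single-predictor setting are assumptions made to keep the example short.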
Studentized residuals are a type of standardized residual used in regression analysis to assess the fit of a model. These help in identifying outliers and influential data points. The studentized residual statistic is \(e_{i}^{s}=\frac{e_{i}}{\hat\sigma_{(i)} {\sqrt{1-h_{ii}}}}\) where
• \(e_{i}=y_{i}-\hat{y}_{i}\) is the value of the ith residual (the difference between the observed value and the predicted value).
• \(\hat\sigma_{(i)}\) is the standard deviation of the residuals calculated without the ith observation.
• \(h_{ii}\) is the leverage of the ith observation, a measure of the influence of the ith data point on the fitted value.
At the 5% level of significance, the ith observation is tagged as an outlier if \(e_{i}^{s}>2\).
The points above the green line indicate the outliers of the data.
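The studentized residual can be computed without refitting the model for each deleted observation, using the identity \((n-p-1)\hat\sigma_{(i)}^{2}=\sum_{j}e_{j}^{2}-\frac{e_{i}^{2}}{1-h_{ii}}\). The sketch below is a plain Python illustration under assumptions: a one-predictor model (p = 2), hypothetical toy data, and the one-sided cut-off of 2 exactly as stated above. Replacing `t > 2` with `abs(t) > 2` would also flag large negative residuals.

```python
from math import sqrt


def studentized_residuals(x, y):
    """Externally studentized residuals for the model y = b0 + b1*x."""
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]      # residuals
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]      # leverages
    sse = sum(ei ** 2 for ei in e)
    result = []
    for ei, hi in zip(e, h):
        # Residual variance estimated without the ith observation.
        sigma2_deleted = (sse - ei ** 2 / (1 - hi)) / (n - p - 1)
        result.append(ei / sqrt(sigma2_deleted * (1 - hi)))
    return result


# Toy data (hypothetical): the fifth response is far above the trend.
x = list(range(1, 11))
y = [1.0, 2.1, 2.9, 4.2, 9.0, 6.1, 6.9, 8.0, 9.1, 9.9]

t_values = studentized_residuals(x, y)
outliers = [i for i, t in enumerate(t_values) if t > 2]
print("Outliers (0-based index):", outliers)
```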
Points that are both influential and outliers are removed from the dataset to clean it and prepare it for further analysis.
Original dimension: 38463 46
New dimension: 37151 46
The dataset shrinks from 38,463 to 37,151 rows, i.e., 1,312 observations are removed.
Fitting the multiple linear regression model after removing the outliers:
Call:
lm(formula = shares ~ ., data = data1[, -27])
Residuals:
Min 1Q Median 3Q Max
-7.8766 -0.4689 -0.1118 0.3902 2.6906
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.614e+00 3.715e-01 17.803 < 2e-16 ***
n_tokens_title 2.228e-03 1.858e-03 1.199 0.230451
n_tokens_content 2.545e-05 1.434e-05 1.774 0.076076 .
n_unique_tokens -1.891e-01 1.232e-01 -1.535 0.124733
n_non_stop_words 2.085e-01 4.165e-02 5.006 5.59e-07 ***
n_non_stop_unique_tokens -1.275e-01 1.046e-01 -1.219 0.222908
num_hrefs 3.463e-03 4.323e-04 8.010 1.18e-15 ***
num_self_hrefs -5.618e-03 1.134e-03 -4.953 7.34e-07 ***
num_imgs 4.591e-04 5.955e-04 0.771 0.440713
num_videos 5.734e-04 1.023e-03 0.561 0.575018
average_token_length -6.040e-02 1.563e-02 -3.866 0.000111 ***
kw_min_min 7.996e-04 1.047e-04 7.639 2.24e-14 ***
kw_max_min 2.446e-05 3.405e-06 7.182 7.00e-13 ***
kw_avg_min -1.752e-04 2.126e-05 -8.240 < 2e-16 ***
kw_min_max -4.423e-07 7.813e-08 -5.661 1.52e-08 ***
kw_max_max 5.253e-08 3.690e-08 1.424 0.154557
kw_avg_max -4.466e-07 5.066e-08 -8.815 < 2e-16 ***
kw_min_avg -4.371e-05 4.846e-06 -9.019 < 2e-16 ***
kw_max_avg -4.278e-05 1.683e-06 -25.413 < 2e-16 ***
kw_avg_avg 3.291e-04 9.411e-06 34.968 < 2e-16 ***
self_reference_min_shares 1.574e-07 4.824e-07 0.326 0.744176
self_reference_max_shares -1.373e-07 2.643e-07 -0.519 0.603419
self_reference_avg_sharess 1.574e-06 6.726e-07 2.339 0.019315 *
LDA_00 2.387e-01 2.932e-02 8.141 4.03e-16 ***
LDA_01 -1.664e-01 3.276e-02 -5.078 3.83e-07 ***
LDA_02 -2.036e-01 2.943e-02 -6.919 4.64e-12 ***
LDA_03 -1.456e-01 3.113e-02 -4.677 2.92e-06 ***
global_subjectivity 2.757e-01 5.458e-02 5.051 4.42e-07 ***
global_sentiment_polarity -1.309e-01 1.075e-01 -1.217 0.223526
global_rate_positive_words -4.451e-01 4.617e-01 -0.964 0.335103
global_rate_negative_words 9.244e-02 8.905e-01 0.104 0.917326
rate_positive_words 2.301e-01 3.623e-01 0.635 0.525412
rate_negative_words 7.112e-02 3.653e-01 0.195 0.845662
avg_positive_polarity 2.119e-02 8.769e-02 0.242 0.809100
min_positive_polarity -3.286e-01 7.371e-02 -4.458 8.31e-06 ***
max_positive_polarity -4.947e-02 2.750e-02 -1.798 0.072107 .
avg_negative_polarity -7.956e-02 8.062e-02 -0.987 0.323674
min_negative_polarity 3.124e-03 2.934e-02 0.106 0.915218
max_negative_polarity 1.111e-01 6.723e-02 1.653 0.098373 .
title_subjectivity 5.866e-02 1.790e-02 3.277 0.001051 **
title_sentiment_polarity 7.379e-02 1.651e-02 4.470 7.86e-06 ***
abs_title_subjectivity 9.923e-02 2.375e-02 4.178 2.95e-05 ***
abs_title_sentiment_polarity -2.065e-02 2.598e-02 -0.795 0.426765
day_of_the_weekMonday -2.696e-02 1.347e-02 -2.001 0.045432 *
day_of_the_weekSaturday 2.249e-01 1.809e-02 12.430 < 2e-16 ***
day_of_the_weekSunday 2.156e-01 1.744e-02 12.366 < 2e-16 ***
day_of_the_weekThursday -7.086e-02 1.318e-02 -5.375 7.72e-08 ***
day_of_the_weekTuesday -7.487e-02 1.313e-02 -5.702 1.19e-08 ***
day_of_the_weekWednesday -7.907e-02 1.313e-02 -6.021 1.75e-09 ***
data_channelEntertainment -3.441e-02 2.356e-02 -1.461 0.144123
data_channelLifestyle 2.330e-02 2.395e-02 0.973 0.330490
data_channelOthers 1.529e-01 2.506e-02 6.100 1.07e-09 ***
data_channelSocial Media 3.266e-01 2.007e-02 16.271 < 2e-16 ***
data_channelTech 2.925e-01 2.132e-02 13.718 < 2e-16 ***
data_channelWorld 9.620e-02 2.301e-02 4.181 2.90e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7211 on 37096 degrees of freedom
Multiple R-squared: 0.1554, Adjusted R-squared: 0.1542
F-statistic: 126.4 on 54 and 37096 DF, p-value: < 2.2e-16
The output indicates that approximately 15.54% of the variance in the log of shares can be explained by the model, and the residual standard error is 0.7211. Compared with the previous fit (R-squared 0.1289, residual standard error 0.8644), the model has improved slightly.
The Mashable dataset may show heteroscedasticity because different types of articles can have varying numbers of shares. Some articles might go viral, while others do not get much attention, leading to differences in variance. Changes in popularity over time and factors like the author’s reputation can also affect shares.
Test of homoscedasticity:
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 494.5087, Df = 1, p = < 2.22e-16
The residual plot indicates heteroscedasticity in the model. The non-constant variance score test also has a p-value < 2.22e-16, which leads to the rejection of the null hypothesis of homoscedasticity.
The pattern of the residuals in the plot violates the assumption of linearity.
News popularity may exhibit autocorrelation if the observations are collected over time and are correlated with their past values. If the dataset includes time-series data, such as the number of shares or views of articles over time, the values at one time point are likely to be influenced by values at previous time points. News articles may follow trending patterns in which the popularity of one article affects the popularity of subsequent articles, leading to positive autocorrelation. Certain topics may also perform better at specific times of the year, which can create autocorrelation.
Autocorrelation is defined as the correlation between the members of a series of observations. We need to test whether \(cov(\epsilon_{i},\epsilon_{j})\ne0\) for some \(i\ne j\).
We use the Durbin-Watson test to detect autocorrelation: \(d=\frac{\sum_{i=2}^{n}(\hat{\epsilon}_{i}-\hat{\epsilon}_{i-1})^{2}}{\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}}\), where \(\hat{\epsilon}_{i}\) is the ith residual.
The following assumptions are made to use the statistic d:
• Model includes an intercept term
• Explanatory variables are non-stochastic
• \(\epsilon_{i}\)'s are generated from an AR(1) model, i.e., \(\epsilon_{i}=\rho\epsilon_{i-1}+u_{i}\ \forall i=1(1)n\)
• \(\epsilon_{i}\sim N(0,\sigma^{2})\ \forall i=1(1)n\)
• No missing observations
Sample correlation estimate: \(\hat{\rho}=\frac{\sum_{i=2}^{n}\hat{\epsilon}_{i}\hat{\epsilon}_{i-1}}{\sqrt{\sum_{i=2}^{n}\hat{\epsilon}_{i-1}^{2}\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}}}\)
Assuming \(\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}\approx\sum_{i=2}^{n}\hat{\epsilon}_{i-1}^{2}\), we have \(d\approx2(1-\hat{\rho})\)
We want to test \(H_{0}:\rho=0\) against \(H_{1}:\rho\ne0\)
If the value of \(d\) turns out to be near \(2\), there is no evidence of autocorrelation.
[1] 1.933195
The value of the Durbin-Watson test statistic d turns out to be close to 2. Thus, the null hypothesis cannot be rejected at the 5% level of significance, and we conclude that there is no evidence of autocorrelation in the error terms.
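The relation between the Durbin-Watson statistic and the lag-one sample correlation can be checked numerically. The following plain Python sketch is an illustration only: the residuals are simulated independent normal draws (an assumption, not the residuals of the fitted model), and the function names are hypothetical.

```python
import random
from math import sqrt


def durbin_watson(res):
    """Sum of squared successive differences over the sum of squared residuals."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    den = sum(e ** 2 for e in res)
    return num / den


def lag1_correlation(res):
    """Sample first-order autocorrelation of the residuals."""
    num = sum(res[i] * res[i - 1] for i in range(1, len(res)))
    # res[:-1] holds the lagged residuals e_1, ..., e_{n-1}.
    den = sqrt(sum(e ** 2 for e in res[:-1]) * sum(e ** 2 for e in res))
    return num / den


# Simulated, uncorrelated residuals (hypothetical data).
rng = random.Random(0)
residuals = [rng.gauss(0, 1) for _ in range(500)]

d = durbin_watson(residuals)
rho_hat = lag1_correlation(residuals)
print("d =", round(d, 3), " 2(1 - rho_hat) =", round(2 * (1 - rho_hat), 3))
```

For uncorrelated residuals both printed quantities should lie near 2. A perfectly alternating series such as 1, -1, 1, -1 pushes d above 2 (negative autocorrelation), and a constant series gives d = 0.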
Multicollinearity may arise in the Mashable online news dataset for several reasons. Many features may measure similar aspects of the articles, such as n_tokens_title and n_unique_tokens, leading to high correlations. Some features could be derived from others, like n_tokens_content being related to n_unique_tokens, which can create redundancy. Articles on similar topics or from the same author may also exhibit correlated characteristics, contributing to multicollinearity.
Multicollinearity means the existence of an exact or near-exact linear relationship among the explanatory variables in a regression model. With \(p\) regressors, an exact relationship is said to exist if \(\lambda_{1}x_{i1}+\lambda_{2}x_{i2}+\dots+\lambda_{p}x_{ip}=0\) for all i, where the constants \(\lambda_{j}\) are not all zero. In terms of linear algebra, an exact linear relationship among the regressors means that at least one column of X is a linear combination of the others, so X is not of full column rank and, as a result, X’X is not invertible.
In order to detect multicollinearity, we use a standard measure known as the Variance Inflation Factor (VIF).
In the model \(Y_{i}=\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\dots+\beta_{p}x_{ip}+\epsilon_{i}\ \forall i=1(1)n\), the VIF of the jth regressor is defined as \(VIF_{j}=\frac{1}{1-R_{(j)}^{2}}\), where \(R_{(j)}^{2}\) is the coefficient of determination from the regression of \(X_{j}\) on \((X_{1},X_{2},...,X_{j-1},X_{j+1},...,X_{p})\). \(VIF_{j}\) measures the dependence of \(X_{j}\) on the other \(p-1\) regressors. A large VIF value indicates multicollinearity in the model. As a rule of thumb, if \(VIF>5\) we conclude that there is multicollinearity in the model.
GVIF Df GVIF^(1/(2*Df))
n_tokens_title 1.098777 1 1.048226
n_tokens_content 3.207900 1 1.791061
n_unique_tokens 14325.010060 1 119.687134
n_non_stop_words 3615.053887 1 60.125318
n_non_stop_unique_tokens 8882.445746 1 94.246728
num_hrefs 1.687374 1 1.298990
num_self_hrefs 1.379314 1 1.174442
num_imgs 1.714136 1 1.309250
num_videos 1.264432 1 1.124469
average_token_length 1.387139 1 1.177769
kw_min_min 3.845149 1 1.960905
kw_max_min 12.095228 1 3.477819
kw_avg_min 11.795877 1 3.434513
kw_min_max 1.381226 1 1.175256
kw_max_max 4.543649 1 2.131584
kw_avg_max 3.153582 1 1.775833
kw_min_avg 2.110867 1 1.452882
kw_max_avg 7.270848 1 2.696451
kw_avg_avg 10.457942 1 3.233874
self_reference_min_shares 6.660905 1 2.580873
self_reference_max_shares 8.659105 1 2.942636
self_reference_avg_sharess 19.560266 1 4.422699
LDA_00 4.368346 1 2.090059
LDA_01 3.688890 1 1.920648
LDA_02 4.992530 1 2.234397
LDA_03 5.673060 1 2.381819
global_subjectivity 1.648839 1 1.284071
global_sentiment_polarity 7.487136 1 2.736263
global_rate_positive_words 3.993715 1 1.998428
global_rate_negative_words 6.221607 1 2.494315
rate_positive_words 209.986854 1 14.490923
rate_negative_words 213.008066 1 14.594796
avg_positive_polarity 3.951993 1 1.987962
min_positive_polarity 1.875229 1 1.369390
max_positive_polarity 2.439797 1 1.561985
avg_negative_polarity 6.729812 1 2.594188
min_negative_polarity 4.797791 1 2.190386
max_negative_polarity 2.849208 1 1.687960
title_subjectivity 2.376710 1 1.541658
title_sentiment_polarity 1.335435 1 1.155610
abs_title_subjectivity 1.434971 1 1.197903
abs_title_sentiment_polarity 2.414298 1 1.553801
day_of_the_week 1.041632 6 1.003405
data_channel 95.360060 6 1.461999
The variables n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, rate_positive_words and rate_negative_words have GVIF^(1/(2*Df)) values greater than 5, indicating that these are the variables that give rise to multicollinearity. For a term with Df = 1, GVIF^(1/(2*Df)) is the square root of the GVIF. Therefore, multicollinearity is present in the model.
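The definition \(VIF_{j}=\frac{1}{1-R_{(j)}^{2}}\) can be illustrated with a small numerical sketch. The plain Python code below is not the routine that produced the table above. It uses the standard identity that \(VIF_{j}\) equals the jth diagonal element of the inverse of the regressors' correlation matrix, and it is run on simulated data in which the third regressor is constructed (by assumption) as nearly the sum of the first two. All names are hypothetical.

```python
import random
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    centered = [[v - sum(c) / n for v in c] for c in columns]
    norms = [sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(k)] for i in range(k)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF of each regressor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Simulated regressors (hypothetical): x3 is almost x1 + x2, x4 is unrelated.
rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [rng.gauss(0, 1) for _ in range(200)]
x3 = [a + b + 0.1 * rng.gauss(0, 1) for a, b in zip(x1, x2)]
x4 = [rng.gauss(0, 1) for _ in range(200)]

vifs = vif([x1, x2, x3, x4])
print([round(v, 2) for v in vifs])
```

In this construction the first three regressors should show VIF values far above 5, while the unrelated fourth regressor stays close to 1.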
Checking for normality in the news dataset by QQ-Plot:
The plot shows slight departures from normality near the tails of the distributions. Based on this Q-Q plot, it is reasonable to conclude that the data is approximately normally distributed, with some minor deviations.
Therefore, the model assumption of homoscedasticity does not hold, and multicollinearity is also detected in the model.
To handle the multicollinearity present in the data, the data is first split into training and test sets, and regularization methods such as ridge and lasso are then applied.
In order to determine the model efficiency, the data is divided into two parts:
• training dataset: subset to train a model.
• testing dataset: subset to test the trained model.
The data is divided into training and testing dataset in the ratio 80:20.
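An 80:20 split of this kind can be sketched as follows. This is a minimal plain Python illustration, not the code used in the analysis: the function name, the seed value and the use of a shuffled index are assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle row indices reproducibly and cut them into two parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Hypothetical stand-in for the cleaned dataset: 100 row identifiers.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters here because the fitted ridge and lasso coefficients depend on which rows fall into the training set.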
Loaded glmnet 4.1-8
The cv.glmnet function is used for cross-validation of a generalized linear model with regularization. The cv.glmnet produces a plot that helps in selecting the best model based on the cross-validation error.
• Lambda \(\lambda\) on the x-axis: This is the tuning parameter for the regularization strength. Smaller values of \(\lambda\) mean less regularization, and larger values mean more regularization.
• Mean Squared Error (MSE) or Deviance on the y-axis: This is the measure of prediction error for the model. The plot shows how the error changes with different values of \(\lambda\).
• Dots and Error Bars: Each dot represents the mean cross-validated error for a given value of \(\lambda\). The error bars show the variability of the error estimate (one standard error on either side of the mean).
• Two Vertical Dashed Lines:
Left Line (lambda.min): This corresponds to the value of \(\lambda\) that gives the minimum mean cross-validated error. This is often referred to as the best \(\lambda\).
Right Line (lambda.1se): This represents the largest value of \(\lambda\) such that the cross-validated error is within one standard error of the minimum error. This \(\lambda\) value is more regularized and can be preferred for simplicity and robustness.
• x-axis Labels: The values of \(\lambda\) are plotted on a logarithmic scale to better visualize the range of \(\lambda\) values.
This plot is crucial for understanding how different levels of regularization affect model performance and for selecting an optimal balance between bias and variance.
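The selection of the two \(\lambda\) values described above can be sketched in a few lines. The plain Python code below is an illustration under simplifying assumptions and does not reproduce glmnet's internals: it uses a single predictor, a ridge slope of the form \(S_{xy}/(S_{xx}+\lambda)\), simulated data, and a hand-picked \(\lambda\) grid. All names are hypothetical.

```python
import random
from math import sqrt


def ridge_fit(x, y, lam):
    """One-predictor ridge fit: the slope is shrunk, the intercept is not."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / (sxx + lam)
    return y_bar - slope * x_bar, slope


def cv_curve(x, y, lambdas, k=5):
    """Mean cross-validated MSE and its standard error for each lambda."""
    folds = [list(range(i, len(x), k)) for i in range(k)]
    curve = []
    for lam in lambdas:
        fold_mse = []
        for held_out in folds:
            held = set(held_out)
            x_tr = [x[i] for i in range(len(x)) if i not in held]
            y_tr = [y[i] for i in range(len(y)) if i not in held]
            b0, b1 = ridge_fit(x_tr, y_tr, lam)
            errors = [(y[i] - b0 - b1 * x[i]) ** 2 for i in held_out]
            fold_mse.append(sum(errors) / len(errors))
        mean = sum(fold_mse) / k
        sd = sqrt(sum((m - mean) ** 2 for m in fold_mse) / (k - 1))
        curve.append((lam, mean, sd / sqrt(k)))
    return curve


def select_lambdas(curve):
    """Return (lambda_min, lambda_1se) from the cross-validation curve."""
    lam_min, mean_min, se_min = min(curve, key=lambda t: t[1])
    lam_1se = max(lam for lam, mean, _ in curve if mean <= mean_min + se_min)
    return lam_min, lam_1se


# Simulated data (hypothetical): y depends linearly on x plus noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]
lambdas = [0.01, 0.1, 1, 10, 100, 1000]

lam_min, lam_1se = select_lambdas(cv_curve(x, y, lambdas))
print("lambda.min =", lam_min, " lambda.1se =", lam_1se)
```

By construction the one-standard-error choice is never smaller than the minimizing value, which is why the right dashed line in the plot sits at or to the right of the left one.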
[1] "Coefficients of the ridge model with best predictive accuracy"
56 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
n_tokens_title 2.031639e-02
n_tokens_content 2.272256e-05
n_unique_tokens 3.069142e-04
n_non_stop_words 2.523671e-04
n_non_stop_unique_tokens 4.466687e-04
num_hrefs 7.615001e-04
num_self_hrefs 2.000594e-03
num_imgs 5.840392e-04
num_videos 6.350604e-04
average_token_length 5.134859e-01
kw_min_min 4.975249e-05
kw_max_min 6.397873e-07
kw_avg_min 6.907710e-06
kw_min_max 3.419849e-08
kw_max_max 1.410612e-07
kw_avg_max 1.289991e-07
kw_min_avg 7.805378e-06
kw_max_avg 1.329564e-06
kw_avg_avg 1.626707e-05
self_reference_min_shares 8.977204e-08
self_reference_max_shares 5.476251e-08
self_reference_avg_sharess 9.728616e-08
LDA_00 2.396922e-02
LDA_01 2.487085e-02
LDA_02 2.273089e-02
LDA_03 2.262960e-02
LDA_04 2.562076e-02
global_subjectivity 5.123913e-01
global_sentiment_polarity 1.180979e-01
global_rate_positive_words 1.365239e+00
global_rate_negative_words 1.337089e+00
rate_positive_words 2.731969e-01
rate_negative_words 1.149571e-01
avg_positive_polarity 4.403814e-01
min_positive_polarity 1.756380e-01
max_positive_polarity 1.511942e-01
avg_negative_polarity -1.603497e-01
min_negative_polarity -6.012569e-02
max_negative_polarity -1.082025e-01
title_subjectivity 2.357003e-02
title_sentiment_polarity 9.134744e-03
abs_title_subjectivity 8.425483e-02
abs_title_sentiment_polarity 2.686242e-02
day_of_the_weekMonday 1.043745e-02
day_of_the_weekSaturday 1.007414e-02
day_of_the_weekSunday 1.003946e-02
day_of_the_weekThursday 1.057160e-02
day_of_the_weekTuesday 1.060534e-02
day_of_the_weekWednesday 1.057031e-02
data_channelEntertainment 1.004860e-02
data_channelLifestyle 9.548962e-03
data_channelOthers 1.052349e-02
data_channelSocial Media 1.028977e-02
data_channelTech 1.158305e-02
data_channelWorld 1.039660e-02
Mean squared error: 12.77726
Standard Error: 0.06743605
The residuals form a roughly constant horizontal band around the mean residual line, suggesting that the variances of the error terms are equal and thereby indicating no heteroscedasticity in the model.
[1] "Coefficients of the lasso model with best predictive accuracy:"
56 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
n_tokens_title .
n_tokens_content .
n_unique_tokens .
n_non_stop_words .
n_non_stop_unique_tokens .
num_hrefs .
num_self_hrefs .
num_imgs .
num_videos .
average_token_length 1.557366
kw_min_min .
kw_max_min .
kw_avg_min .
kw_min_max .
kw_max_max .
kw_avg_max .
kw_min_avg .
kw_max_avg .
kw_avg_avg .
self_reference_min_shares .
self_reference_max_shares .
self_reference_avg_sharess .
LDA_00 .
LDA_01 .
LDA_02 .
LDA_03 .
LDA_04 .
global_subjectivity .
global_sentiment_polarity .
global_rate_positive_words .
global_rate_negative_words .
rate_positive_words .
rate_negative_words .
avg_positive_polarity .
min_positive_polarity .
max_positive_polarity .
avg_negative_polarity .
min_negative_polarity .
max_negative_polarity .
title_subjectivity .
title_sentiment_polarity .
abs_title_subjectivity .
abs_title_sentiment_polarity .
day_of_the_weekMonday .
day_of_the_weekSaturday .
day_of_the_weekSunday .
day_of_the_weekThursday .
day_of_the_weekTuesday .
day_of_the_weekWednesday .
data_channelEntertainment .
data_channelLifestyle .
data_channelOthers .
data_channelSocial Media .
data_channelTech .
data_channelWorld .
Mean squared error: 0.8075717
Standard Error: 0.01515756
Lasso regression is not suitable in this case because it shrinks every coefficient except that of average_token_length to zero and therefore drops important predictors. Hence, the ridge regression model is considered the most suitable, as it handles multicollinearity and its residuals show no sign of heteroscedasticity.
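The different behaviour of the two penalties can be seen in the simplest possible setting. The sketch below is a plain Python illustration, not the glmnet computation: it assumes a single standardized predictor with least-squares estimate b and the textbook objectives \(\frac{1}{2}(b-\beta)^{2}+\frac{\lambda}{2}\beta^{2}\) for ridge and \(\frac{1}{2}(b-\beta)^{2}+\lambda|\beta|\) for lasso. Under these assumptions ridge rescales the estimate, while lasso subtracts \(\lambda\) and sets small estimates exactly to zero.

```python
def ridge_shrink(b, lam):
    """Ridge solution in the one-predictor case: proportional shrinkage."""
    return b / (1 + lam)


def lasso_shrink(b, lam):
    """Lasso solution in the one-predictor case: soft thresholding."""
    magnitude = max(abs(b) - lam, 0.0)
    return magnitude if b >= 0 else -magnitude


# Hypothetical least-squares estimates of different sizes.
for b in [2.0, 0.3, -0.1]:
    print(b, ridge_shrink(b, 0.5), lasso_shrink(b, 0.5))
```

With \(\lambda=0.5\), the estimates 0.3 and -0.1 are set exactly to zero by the lasso rule but only reduced in size by the ridge rule, mirroring the pattern of dots in the lasso output above.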
After analyzing the data, our final model is:
Final Model:
\(\log(\text{shares})_i = 0.0202 \times \text{n\_tokens\_title}_i + 0.000022 \times \text{n\_tokens\_content}_i + 0.00031 \times \text{n\_unique\_tokens}_i\) \(+\ 0.00025 \times \text{n\_non\_stop\_words}_i + 0.00045 \times \text{n\_non\_stop\_unique\_tokens}_i + 0.00077 \times \text{num\_hrefs}_i\) \(+\ 0.0021 \times \text{num\_self\_hrefs}_i + 0.00058 \times \text{num\_imgs}_i + 0.00061 \times \text{num\_videos}_i + 0.516 \times \text{average\_token\_length}_i\) \(+\ 0.00005 \times \text{kw\_min\_min}_i + 0.0000006 \times \text{kw\_max\_min}_i + 0.000006 \times \text{kw\_avg\_min}_i\) \(+\ 0.00000003 \times \text{kw\_min\_max}_i + 0.0000001 \times \text{kw\_max\_max}_i + 0.0000001 \times \text{kw\_avg\_max}_i\) \(+\ 0.000007 \times \text{kw\_min\_avg}_i + 0.000001 \times \text{kw\_max\_avg}_i + 0.00001 \times \text{kw\_avg\_avg}_i\) \(+\ 0.00000009 \times \text{self\_reference\_min\_shares}_i + 0.00000005 \times \text{self\_reference\_max\_shares}_i + 0.00000009 \times \text{self\_reference\_avg\_sharess}_i\) \(+\ 0.024 \times \text{LDA\_00}_i + 0.024 \times \text{LDA\_01}_i + 0.022 \times \text{LDA\_02}_i + 0.022 \times \text{LDA\_03}_i + 0.025 \times \text{LDA\_04}_i\) \(+\ 0.517 \times \text{global\_subjectivity}_i + 0.119 \times \text{global\_sentiment\_polarity}_i\) \(+\ 1.375 \times \text{global\_rate\_positive\_words}_i + 1.362 \times \text{global\_rate\_negative\_words}_i\) \(+\ 0.275 \times \text{rate\_positive\_words}_i + 0.115 \times \text{rate\_negative\_words}_i\) \(+\ 0.438 \times \text{avg\_positive\_polarity}_i + 0.175 \times \text{min\_positive\_polarity}_i + 0.149 \times \text{max\_positive\_polarity}_i\) \(-\ 0.159 \times \text{avg\_negative\_polarity}_i - 0.0597 \times \text{min\_negative\_polarity}_i - 0.111 \times \text{max\_negative\_polarity}_i\) \(+\ 0.023 \times \text{title\_subjectivity}_i + 0.0091 \times \text{title\_sentiment\_polarity}_i\) \(+\ 0.083 \times \text{abs\_title\_subjectivity}_i + 0.026 \times \text{abs\_title\_sentiment\_polarity}_i\) \(+\ 0.0103 \times \text{day\_of\_the\_weekMonday}_i + 0.0099 \times \text{day\_of\_the\_weekSaturday}_i + 0.0099 \times \text{day\_of\_the\_weekSunday}_i\) \(+\ 0.0104 \times \text{day\_of\_the\_weekThursday}_i + 0.0104 \times \text{day\_of\_the\_weekTuesday}_i + 0.0104 \times \text{day\_of\_the\_weekWednesday}_i\) \(+\ 0.0099 \times \text{data\_channelEntertainment}_i + 0.0094 \times \text{data\_channelLifestyle}_i + 0.0103 \times \text{data\_channelOthers}_i\) \(+\ 0.0101 \times \text{data\_channelSocial Media}_i + 0.0114 \times \text{data\_channelTech}_i + 0.0103 \times \text{data\_channelWorld}_i + \epsilon_{i}\ \forall i=1(1)n\)
Since the dependent variable, shares, is log-transformed, each coefficient gives the change in log(shares) for a one-unit increase in the corresponding independent variable \(x_i\), holding the other variables fixed. For a small coefficient, 100 times the coefficient approximates the percentage change in shares. For example, if the number of words in an article or post increases by one, the shares of that post are expected to rise by about 0.0022%. The variable with the largest positive coefficient is the rate of positive words in the content, global_rate_positive_words, with a coefficient of 1.375. Conversely, the average polarity of negative words, avg_negative_polarity, has the largest negative coefficient, -0.159; the coefficient of min_negative_polarity is about -0.06, in line with the ridge coefficient table above.
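The conversion from a coefficient on the log scale to a percentage change in shares can be made explicit. In a model for log(shares), a one-unit increase in a predictor multiplies the expected shares by \(e^{\beta}\), i.e., changes them by \(100\,(e^{\beta}-1)\)%, which is close to \(100\,\beta\)% when \(\beta\) is small. The plain Python sketch below applies this to two coefficients quoted in the final model; the function name is hypothetical.

```python
import math


def percent_change(beta, delta=1.0):
    """Percentage change in shares when a predictor increases by delta."""
    return (math.exp(beta * delta) - 1) * 100


# Coefficients taken from the final model above.
print(percent_change(0.000022))  # n_tokens_content: one extra word
print(percent_change(0.0202))    # n_tokens_title: one extra title word
```

For predictors measured as rates or proportions, such as global_rate_positive_words, a full one-unit change lies outside the observed range, so a smaller `delta` (for example 0.01) gives a more meaningful figure.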
Final Findings:
The analysis finds more popular than unpopular articles: 20,464 compared to 17,999. Key findings indicate that articles with titles containing 8 to 16 words and content under 1,600 words tend to garner the highest shares. Additionally, articles featuring 0 to 45 links and a positive word polarity between 0.2 and 0.6 are more likely to be popular.
Subject-wise, articles in the World, Tech and Entertainment categories are published most frequently, while articles in the Business, Technology and Social Media channels attract the most shares. Interestingly, while most articles are published on weekdays, those released on weekends (Saturdays and Sundays) achieve higher share counts. The analysis also highlights that articles with 1 to 40 images and 1 to 15 videos tend to be more popular. Furthermore, an average word length of 4 to 6 and more than five keywords in the metadata positively influence shareability.
The regression model indicates that various factors, including the number of tokens in titles and content, unique tokens, and sentiment polarity, significantly affect the log of shares. Notably, the coefficients associated with global sentiment and positive word rates suggest that emotional engagement plays a crucial role in enhancing article popularity. In summary, the study underscores the importance of title length, content characteristics, multimedia elements and emotional tone in driving article shares, providing actionable insights for content creators aiming to boost engagement.