INTRODUCTION

Description of the Problem Statement and Objective of the study

In the rapidly evolving digital age, the dissemination of news has largely shifted from traditional print media to online platforms. This transition has created a competitive landscape where the popularity of news articles is crucial for driving web traffic, engagement, and revenue. Understanding the factors that contribute to the popularity of online news articles can help content creators, publishers, and marketers optimize their strategies and maximize reach. The dataset used in this study summarizes a heterogeneous set of features about articles published by Mashable over a period of two years.

The objective of this project is to conduct data analysis on the Online News Popularity dataset to uncover patterns, trends, and insights that could inform strategies for increasing the popularity of online news articles and to predict the number of shares in social networks (popularity).

DATA DESCRIPTION

The dataset is obtained from the UCI Machine Learning Repository (Online News Popularity).

The dataset consists of 39,644 entries with 61 columns. The main variable in the study is the number of shares, which serves as the indicator of an article’s popularity. The other 60 variables, followed by the target, are detailed below:

url: URL of the article (non-predictive).

timedelta: Days between the article publication and the dataset acquisition (non-predictive).

n_tokens_title: Number of words in the title.

n_tokens_content: Number of words in the content.

n_unique_tokens: Rate of unique words in the content.

n_non_stop_words: Rate of non-stop words in the content.

n_non_stop_unique_tokens: Rate of unique non-stop words in the content.

num_hrefs: Number of links.

num_self_hrefs: Number of links to other articles published by Mashable.

num_imgs: Number of images.

num_videos: Number of videos.

average_token_length: Average length of the words in the content.

num_keywords: Number of keywords in the metadata.

data_channel_is_lifestyle: Is data channel ‘Lifestyle’?

data_channel_is_entertainment: Is data channel ‘Entertainment’?

data_channel_is_bus: Is data channel ‘Business’?

data_channel_is_socmed: Is data channel ‘Social Media’?

data_channel_is_tech: Is data channel ‘Tech’?

data_channel_is_world: Is data channel ‘World’?

kw_min_min: Worst keyword (min. shares).

kw_max_min: Worst keyword (max. shares).

kw_avg_min: Worst keyword (avg. shares).

kw_min_max: Best keyword (min. shares).

kw_max_max: Best keyword (max. shares).

kw_avg_max: Best keyword (avg. shares).

kw_min_avg: Avg. keyword (min. shares).

kw_max_avg: Avg. keyword (max. shares).

kw_avg_avg: Avg. keyword (avg. shares).

self_reference_min_shares: Min. shares of referenced articles in Mashable.

self_reference_max_shares: Max. shares of referenced articles in Mashable.

self_reference_avg_sharess: Avg. shares of referenced articles in Mashable (the column name is spelled with a double “s” in the dataset).

weekday_is_monday: Was the article published on a Monday?

weekday_is_tuesday: Was the article published on a Tuesday?

weekday_is_wednesday: Was the article published on a Wednesday?

weekday_is_thursday: Was the article published on a Thursday?

weekday_is_friday: Was the article published on a Friday?

weekday_is_saturday: Was the article published on a Saturday?

weekday_is_sunday: Was the article published on a Sunday?

is_weekend: Was the article published on the weekend?

LDA_00: Closeness to LDA topic 0.

LDA_01: Closeness to LDA topic 1.

LDA_02: Closeness to LDA topic 2.

LDA_03: Closeness to LDA topic 3.

LDA_04: Closeness to LDA topic 4.

global_subjectivity: Text subjectivity.

global_sentiment_polarity: Text sentiment polarity.

global_rate_positive_words: Rate of positive words in the content.

global_rate_negative_words: Rate of negative words in the content.

rate_positive_words: Rate of positive words among non-neutral tokens.

rate_negative_words: Rate of negative words among non-neutral tokens.

avg_positive_polarity: Avg. polarity of positive words.

min_positive_polarity: Min. polarity of positive words.

max_positive_polarity: Max. polarity of positive words.

avg_negative_polarity: Avg. polarity of negative words.

min_negative_polarity: Min. polarity of negative words.

max_negative_polarity: Max. polarity of negative words.

title_subjectivity: Title subjectivity.

title_sentiment_polarity: Title polarity.

abs_title_subjectivity: Absolute subjectivity level.

abs_title_sentiment_polarity: Absolute polarity level.

shares: Number of shares (target).

At first, we import the data set. Then we use str() to find the structure of the data set and information about the class, length and content of each column.
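The import and inspection step can be sketched as follows. The file name OnlineNewsPopularity.csv and the helper load_news() are assumptions made for illustration; the report does not show the original import call.

```r
# Assumption: the CSV has been downloaded to the working directory under this name.
csv_path <- "OnlineNewsPopularity.csv"

# Hypothetical helper: read the file and keep text columns such as url as character.
load_news <- function(path) {
  read.csv(path, stringsAsFactors = FALSE, strip.white = TRUE)
}

# Only run the inspection when the file is actually present.
if (file.exists(csv_path)) {
  news <- load_news(csv_path)
  str(news)  # class, length and first values of each column
  dim(news)  # number of rows and columns
}
```

str() is used here because it reports the storage type of every column in one call, which is what the type check in this section relies on.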

'data.frame':   39644 obs. of  61 variables:
 $ url                          : chr  "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
 $ timedelta                    : num  731 731 731 731 731 731 731 731 731 731 ...
 $ n_tokens_title               : num  12 9 9 9 13 10 8 12 11 10 ...
 $ n_tokens_content             : num  219 255 211 531 1072 ...
 $ n_unique_tokens              : num  0.664 0.605 0.575 0.504 0.416 ...
 $ n_non_stop_words             : num  1 1 1 1 1 ...
 $ n_non_stop_unique_tokens     : num  0.815 0.792 0.664 0.666 0.541 ...
 $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
 $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
 $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
 $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
 $ average_token_length         : num  4.68 4.91 4.39 4.4 4.68 ...
 $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
 $ data_channel_is_lifestyle    : num  0 0 0 0 0 0 1 0 0 0 ...
 $ data_channel_is_entertainment: num  1 0 0 1 0 0 0 0 0 0 ...
 $ data_channel_is_bus          : num  0 1 1 0 0 0 0 0 0 0 ...
 $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ data_channel_is_tech         : num  0 0 0 0 1 1 0 1 1 0 ...
 $ data_channel_is_world        : num  0 0 0 0 0 0 0 0 0 1 ...
 $ kw_min_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ kw_max_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ kw_avg_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ kw_min_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ kw_max_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ kw_avg_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ kw_min_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ kw_max_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
 $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
 $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
 $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
 $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LDA_00                       : num  0.5003 0.7998 0.2178 0.0286 0.0286 ...
 $ LDA_01                       : num  0.3783 0.05 0.0333 0.4193 0.0288 ...
 $ LDA_02                       : num  0.04 0.0501 0.0334 0.4947 0.0286 ...
 $ LDA_03                       : num  0.0413 0.0501 0.0333 0.0289 0.0286 ...
 $ LDA_04                       : num  0.0401 0.05 0.6822 0.0286 0.8854 ...
 $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
 $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
 $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
 $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
 $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
 $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
 $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
 $ min_positive_polarity        : num  0.1 0.0333 0.1 0.1364 0.0333 ...
 $ max_positive_polarity        : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
 $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
 $ min_negative_polarity        : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
 $ max_negative_polarity        : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
 $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
 $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
 $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
 $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
 $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...

All of the variables are numeric (shares is stored as an integer), except the first variable, url, which is of character data type.

Now, we take a look at the first few rows of the dataset.

url timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length num_keywords data_channel_is_lifestyle data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed data_channel_is_tech data_channel_is_world kw_min_min kw_max_min kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares self_reference_max_shares self_reference_avg_sharess weekday_is_monday weekday_is_tuesday weekday_is_wednesday weekday_is_thursday weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend LDA_00 LDA_01 LDA_02 LDA_03 LDA_04 global_subjectivity global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words rate_negative_words avg_positive_polarity min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares
http://mashable.com/2013/01/07/amazon-instant-video-browser/ 731 12 219 0.6635945 1 0.8153846 4 2 1 0 4.680365 5 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 496 496 496.000 1 0 0 0 0 0 0 0 0.5003312 0.3782789 0.0400047 0.0412626 0.0401225 0.5216171 0.0925620 0.0456621 0.0136986 0.7692308 0.2307692 0.3786364 0.1000000 0.7 -0.3500000 -0.600 -0.2000000 0.5000000 -0.1875000 0.0000000 0.1875000 593
http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ 731 9 255 0.6047431 1 0.7919463 3 1 1 0 4.913725 4 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000 1 0 0 0 0 0 0 0 0.7997557 0.0500467 0.0500963 0.0501007 0.0500007 0.3412458 0.1489478 0.0431373 0.0156863 0.7333333 0.2666667 0.2869146 0.0333333 0.7 -0.1187500 -0.125 -0.1000000 0.0000000 0.0000000 0.5000000 0.0000000 711
http://mashable.com/2013/01/07/apple-40-billion-app-downloads/ 731 9 211 0.5751295 1 0.6638655 3 1 1 0 4.393365 6 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 918 918 918.000 1 0 0 0 0 0 0 0 0.2177923 0.0333345 0.0333514 0.0333335 0.6821883 0.7022222 0.3233333 0.0568720 0.0094787 0.8571429 0.1428571 0.4958333 0.1000000 1.0 -0.4666667 -0.800 -0.1333333 0.0000000 0.0000000 0.5000000 0.0000000 1500
http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/ 731 9 531 0.5037879 1 0.6656347 9 0 1 0 4.404896 7 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000 1 0 0 0 0 0 0 0 0.0285732 0.4192996 0.4946508 0.0289047 0.0285716 0.4298497 0.1007047 0.0414313 0.0207156 0.6666667 0.3333333 0.3859652 0.1363636 0.8 -0.3696970 -0.600 -0.1666667 0.0000000 0.0000000 0.5000000 0.0000000 1200
http://mashable.com/2013/01/07/att-u-verse-apps/ 731 13 1072 0.4156456 1 0.5408895 19 19 20 0 4.682836 7 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 545 16000 3151.158 1 0 0 0 0 0 0 0 0.0286328 0.0287936 0.0285752 0.0285717 0.8854268 0.5135021 0.2810035 0.0746269 0.0121269 0.8602151 0.1397849 0.4111274 0.0333333 1.0 -0.2201923 -0.500 -0.0500000 0.4545455 0.1363636 0.0454545 0.1363636 505
http://mashable.com/2013/01/07/beewi-smart-toys/ 731 10 370 0.5598886 1 0.6981982 2 2 0 0 4.359459 9 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 8500 8500 8500.000 1 0 0 0 0 0 0 0 0.0222453 0.3067176 0.0222313 0.0222243 0.6265816 0.4374086 0.0711842 0.0297297 0.0270270 0.5238095 0.4761905 0.3506100 0.1363636 0.6 -0.1950000 -0.400 -0.1000000 0.6428571 0.2142857 0.1428571 0.2142857 855
http://mashable.com/2013/01/07/bodymedia-armbandgets-update/ 731 8 960 0.4181626 1 0.5498339 21 20 20 0 4.654167 10 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 545 16000 3151.158 1 0 0 0 0 0 0 0 0.0200817 0.1147054 0.0200244 0.0200153 0.8251732 0.5144803 0.2683027 0.0802083 0.0166667 0.8279570 0.1720430 0.4020386 0.1000000 1.0 -0.2244792 -0.500 -0.0500000 0.0000000 0.0000000 0.5000000 0.0000000 556
http://mashable.com/2013/01/07/canon-poweshot-n/ 731 12 989 0.4335736 1 0.5721078 20 20 20 0 4.617796 9 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 545 16000 3151.158 1 0 0 0 0 0 0 0 0.0222244 0.1507330 0.2434355 0.0222236 0.5613836 0.5434742 0.2986135 0.0839232 0.0151668 0.8469388 0.1530612 0.4277205 0.1000000 1.0 -0.2427778 -0.500 -0.0500000 1.0000000 0.5000000 0.5000000 0.5000000 891
http://mashable.com/2013/01/07/car-of-the-future-infographic/ 731 11 97 0.6701031 1 0.8367347 2 0 0 0 4.855670 7 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0.000 1 0 0 0 0 0 0 0 0.4582504 0.0289794 0.0286619 0.0296959 0.4544124 0.5388889 0.1611111 0.0309278 0.0206186 0.6000000 0.4000000 0.5666667 0.4000000 0.8 -0.1250000 -0.125 -0.1250000 0.1250000 0.0000000 0.3750000 0.0000000 3600
http://mashable.com/2013/01/07/chuck-hagel-website/ 731 10 231 0.6363636 1 0.7971014 4 1 1 1 5.090909 5 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0.000 1 0 0 0 0 0 0 0 0.0400001 0.0400000 0.8399972 0.0400006 0.0400020 0.3138889 0.0518519 0.0389610 0.0303030 0.5625000 0.4375000 0.2984127 0.1000000 0.5 -0.2380952 -0.500 -0.1000000 0.0000000 0.0000000 0.5000000 0.0000000 710
[1] 39644    61

Here, shares is the response variable.

Our objective is to investigate and identify the decisive factors that drive the sharing of articles published on the Mashable website.

DATA PREPROCESSING

Checking for Missing Values

Before any analysis, the data set must be checked for missing values, since missing information may lead to erroneous conclusions. If there are missing observations, two common remedies are to delete the rows or columns that contain them, or to impute each missing value with a constant or with a statistic such as the mean, median or mode of its column.
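A minimal sketch of the check and of the two remedies is given below. The toy data frame and the helper names count_missing() and impute_median() are made up for illustration and are not part of the original analysis.

```r
# Count missing values in every column of a data frame.
count_missing <- function(df) colSums(is.na(df))

# Replace missing entries of a numeric vector with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data, made up for illustration only.
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), stringsAsFactors = FALSE)

count_missing(toy)              # one missing value in each column
toy_complete <- na.omit(toy)    # remedy 1: delete incomplete rows
toy$a <- impute_median(toy$a)   # remedy 2: impute a numeric column
```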

                          url                     timedelta 
                            0                             0 
               n_tokens_title              n_tokens_content 
                            0                             0 
              n_unique_tokens              n_non_stop_words 
                            0                             0 
     n_non_stop_unique_tokens                     num_hrefs 
                            0                             0 
               num_self_hrefs                      num_imgs 
                            0                             0 
                   num_videos          average_token_length 
                            0                             0 
                 num_keywords     data_channel_is_lifestyle 
                            0                             0 
data_channel_is_entertainment           data_channel_is_bus 
                            0                             0 
       data_channel_is_socmed          data_channel_is_tech 
                            0                             0 
        data_channel_is_world                    kw_min_min 
                            0                             0 
                   kw_max_min                    kw_avg_min 
                            0                             0 
                   kw_min_max                    kw_max_max 
                            0                             0 
                   kw_avg_max                    kw_min_avg 
                            0                             0 
                   kw_max_avg                    kw_avg_avg 
                            0                             0 
    self_reference_min_shares     self_reference_max_shares 
                            0                             0 
   self_reference_avg_sharess             weekday_is_monday 
                            0                             0 
           weekday_is_tuesday          weekday_is_wednesday 
                            0                             0 
          weekday_is_thursday             weekday_is_friday 
                            0                             0 
          weekday_is_saturday             weekday_is_sunday 
                            0                             0 
                   is_weekend                        LDA_00 
                            0                             0 
                       LDA_01                        LDA_02 
                            0                             0 
                       LDA_03                        LDA_04 
                            0                             0 
          global_subjectivity     global_sentiment_polarity 
                            0                             0 
   global_rate_positive_words    global_rate_negative_words 
                            0                             0 
          rate_positive_words           rate_negative_words 
                            0                             0 
        avg_positive_polarity         min_positive_polarity 
                            0                             0 
        max_positive_polarity         avg_negative_polarity 
                            0                             0 
        min_negative_polarity         max_negative_polarity 
                            0                             0 
           title_subjectivity      title_sentiment_polarity 
                            0                             0 
       abs_title_subjectivity  abs_title_sentiment_polarity 
                            0                             0 
                       shares 
                            0 

The data set contains no missing values, so the analysis can proceed.

The two non-predictive attributes, url and timedelta, are dropped from the dataset since they are metadata and cannot be treated as features. n_tokens_content is the number of words in the content, yet its minimum value is 0, which means that some articles have no content. Such records are dropped because their related attributes add no meaning to the analysis. The is_weekend column is also dropped since it can be derived from the existing weekday_is_saturday and weekday_is_sunday columns.
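The filtering described above can be sketched on a toy data frame. The helper clean_news() and the toy values are illustrative assumptions; only the column names come from the dataset.

```r
# Hypothetical helper: drop empty articles, then drop the three unused columns.
clean_news <- function(df) {
  df <- df[df$n_tokens_content > 0, , drop = FALSE]
  drop_cols <- c("url", "timedelta", "is_weekend")
  df[, !(names(df) %in% drop_cols), drop = FALSE]
}

# Toy data, made up for illustration: the second article has no content.
toy <- data.frame(
  url = c("a", "b", "c"),
  timedelta = c(731, 731, 730),
  n_tokens_content = c(219, 0, 255),
  is_weekend = c(0, 0, 1),
  shares = c(593L, 711L, 1500L),
  stringsAsFactors = FALSE
)

cleaned <- clean_news(toy)
dim(cleaned)  # rows and columns left after cleaning
```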

[1] 38463    58

The dimensions of the data reduce from (39644, 61) to (38463, 58).

The columns weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday, weekday_is_friday, weekday_is_saturday and weekday_is_sunday are converted into a single variable, day_of_the_week.
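One way to collapse the seven indicator columns is sketched below, assuming every row has exactly one weekday flag set to 1. The toy rows are made up, and the report does not show which method was actually used.

```r
day_labels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
day_cols <- paste0("weekday_is_", tolower(day_labels))

# Toy one-hot rows for a Monday and a Saturday article (made-up share counts).
toy <- as.data.frame(diag(7)[c(1, 6), ])
names(toy) <- day_cols
toy$shares <- c(593L, 711L)

# Each row holds a single 1, so weighting the flags by 1..7 yields the column position.
day_index <- as.vector(as.matrix(toy[, day_cols]) %*% seq_along(day_cols))
toy$day_of_the_week <- day_labels[day_index]

# The seven indicator columns are now redundant and can be dropped.
toy <- toy[, !(names(toy) %in% day_cols), drop = FALSE]
```

The matrix product only works for strictly one-hot rows; rows with no flag or several flags would need separate handling.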

n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length num_keywords data_channel_is_lifestyle data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed data_channel_is_tech data_channel_is_world kw_min_min kw_max_min kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares self_reference_max_shares self_reference_avg_sharess LDA_00 LDA_01 LDA_02 LDA_03 LDA_04 global_subjectivity global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words rate_negative_words avg_positive_polarity min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares day_of_the_week
1 12.00 219.00 0.66 1.00 0.82 4.00 2.00 1.00 0.00 4.68 5.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 496.00 496.00 496.00 0.50 0.38 0.04 0.04 0.04 0.52 0.09 0.05 0.01 0.77 0.23 0.38 0.10 0.70 -0.35 -0.60 -0.20 0.50 -0.19 0.00 0.19 593 Monday
2 9.00 255.00 0.60 1.00 0.79 3.00 1.00 1.00 0.00 4.91 4.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.80 0.05 0.05 0.05 0.05 0.34 0.15 0.04 0.02 0.73 0.27 0.29 0.03 0.70 -0.12 -0.12 -0.10 0.00 0.00 0.50 0.00 711 Monday
3 9.00 211.00 0.58 1.00 0.66 3.00 1.00 1.00 0.00 4.39 6.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 918.00 918.00 918.00 0.22 0.03 0.03 0.03 0.68 0.70 0.32 0.06 0.01 0.86 0.14 0.50 0.10 1.00 -0.47 -0.80 -0.13 0.00 0.00 0.50 0.00 1500 Monday
4 9.00 531.00 0.50 1.00 0.67 9.00 0.00 1.00 0.00 4.40 7.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.42 0.49 0.03 0.03 0.43 0.10 0.04 0.02 0.67 0.33 0.39 0.14 0.80 -0.37 -0.60 -0.17 0.00 0.00 0.50 0.00 1200 Monday
5 13.00 1072.00 0.42 1.00 0.54 19.00 19.00 20.00 0.00 4.68 7.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 545.00 16000.00 3151.16 0.03 0.03 0.03 0.03 0.89 0.51 0.28 0.07 0.01 0.86 0.14 0.41 0.03 1.00 -0.22 -0.50 -0.05 0.45 0.14 0.05 0.14 505 Monday
6 10.00 370.00 0.56 1.00 0.70 2.00 2.00 0.00 0.00 4.36 9.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 8500.00 8500.00 8500.00 0.02 0.31 0.02 0.02 0.63 0.44 0.07 0.03 0.03 0.52 0.48 0.35 0.14 0.60 -0.20 -0.40 -0.10 0.64 0.21 0.14 0.21 855 Monday
7 8.00 960.00 0.42 1.00 0.55 21.00 20.00 20.00 0.00 4.65 10.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 545.00 16000.00 3151.16 0.02 0.11 0.02 0.02 0.83 0.51 0.27 0.08 0.02 0.83 0.17 0.40 0.10 1.00 -0.22 -0.50 -0.05 0.00 0.00 0.50 0.00 556 Monday
8 12.00 989.00 0.43 1.00 0.57 20.00 20.00 20.00 0.00 4.62 9.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 545.00 16000.00 3151.16 0.02 0.15 0.24 0.02 0.56 0.54 0.30 0.08 0.02 0.85 0.15 0.43 0.10 1.00 -0.24 -0.50 -0.05 1.00 0.50 0.50 0.50 891 Monday
9 11.00 97.00 0.67 1.00 0.84 2.00 0.00 0.00 0.00 4.86 7.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.46 0.03 0.03 0.03 0.45 0.54 0.16 0.03 0.02 0.60 0.40 0.57 0.40 0.80 -0.12 -0.12 -0.12 0.12 0.00 0.38 0.00 3600 Monday
10 10.00 231.00 0.64 1.00 0.80 4.00 1.00 1.00 1.00 5.09 5.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.04 0.84 0.04 0.04 0.31 0.05 0.04 0.03 0.56 0.44 0.30 0.10 0.50 -0.24 -0.50 -0.10 0.00 0.00 0.50 0.00 710 Monday

The column day_of_the_week represents the day of the week on which each article was published, based on the information from the seven original columns. These seven columns are now redundant and are removed to simplify the analysis.

The columns data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech and data_channel_is_world are converted into a single variable, data_channel, with articles outside these six channels grouped as others.
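The same idea applies to the channel indicators, with one extra case for articles that have no channel flag. The sketch below uses made-up rows; the labels "Social Media" and "Others" are assumed spellings, since they do not appear in the printed output.

```r
channel_cols <- c("data_channel_is_lifestyle", "data_channel_is_entertainment",
                  "data_channel_is_bus", "data_channel_is_socmed",
                  "data_channel_is_tech", "data_channel_is_world")
channel_labels <- c("Lifestyle", "Entertainment", "Business",
                    "Social Media", "Tech", "World")

# Toy rows: an Entertainment article, a Tech article, and one with no channel flag.
flags <- rbind(c(0, 1, 0, 0, 0, 0),
               c(0, 0, 0, 0, 1, 0),
               c(0, 0, 0, 0, 0, 0))
colnames(flags) <- channel_cols
toy <- data.frame(flags, shares = c(593L, 505L, 1200L))

# Pick the flagged column per row; rows without any flag fall back to "Others".
m <- as.matrix(toy[, channel_cols])
toy$data_channel <- ifelse(rowSums(m) == 0, "Others",
                           channel_labels[max.col(m, ties.method = "first")])

# Drop the indicator columns once the combined variable exists.
toy <- toy[, !(names(toy) %in% channel_cols), drop = FALSE]
```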

n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length kw_min_min kw_max_min kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares self_reference_max_shares self_reference_avg_sharess LDA_00 LDA_01 LDA_02 LDA_03 LDA_04 global_subjectivity global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words rate_negative_words avg_positive_polarity min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares day_of_the_week data_channel
1 12.00 219.00 0.66 1.00 0.82 4.00 2.00 1.00 0.00 4.68 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 496.00 496.00 496.00 0.50 0.38 0.04 0.04 0.04 0.52 0.09 0.05 0.01 0.77 0.23 0.38 0.10 0.70 -0.35 -0.60 -0.20 0.50 -0.19 0.00 0.19 593 Monday Entertainment
2 9.00 255.00 0.60 1.00 0.79 3.00 1.00 1.00 0.00 4.91 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.80 0.05 0.05 0.05 0.05 0.34 0.15 0.04 0.02 0.73 0.27 0.29 0.03 0.70 -0.12 -0.12 -0.10 0.00 0.00 0.50 0.00 711 Monday Business
3 9.00 211.00 0.58 1.00 0.66 3.00 1.00 1.00 0.00 4.39 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 918.00 918.00 918.00 0.22 0.03 0.03 0.03 0.68 0.70 0.32 0.06 0.01 0.86 0.14 0.50 0.10 1.00 -0.47 -0.80 -0.13 0.00 0.00 0.50 0.00 1500 Monday Business
4 9.00 531.00 0.50 1.00 0.67 9.00 0.00 1.00 0.00 4.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.42 0.49 0.03 0.03 0.43 0.10 0.04 0.02 0.67 0.33 0.39 0.14 0.80 -0.37 -0.60 -0.17 0.00 0.00 0.50 0.00 1200 Monday Entertainment
5 13.00 1072.00 0.42 1.00 0.54 19.00 19.00 20.00 0.00 4.68 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 545.00 16000.00 3151.16 0.03 0.03 0.03 0.03 0.89 0.51 0.28 0.07 0.01 0.86 0.14 0.41 0.03 1.00 -0.22 -0.50 -0.05 0.45 0.14 0.05 0.14 505 Monday Tech
6 10.00 370.00 0.56 1.00 0.70 2.00 2.00 0.00 0.00 4.36 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 8500.00 8500.00 8500.00 0.02 0.31 0.02 0.02 0.63 0.44 0.07 0.03 0.03 0.52 0.48 0.35 0.14 0.60 -0.20 -0.40 -0.10 0.64 0.21 0.14 0.21 855 Monday Tech
7 8.00 960.00 0.42 1.00 0.55 21.00 20.00 20.00 0.00 4.65 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 545.00 16000.00 3151.16 0.02 0.11 0.02 0.02 0.83 0.51 0.27 0.08 0.02 0.83 0.17 0.40 0.10 1.00 -0.22 -0.50 -0.05 0.00 0.00 0.50 0.00 556 Monday Lifestyle
8 12.00 989.00 0.43 1.00 0.57 20.00 20.00 20.00 0.00 4.62 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 545.00 16000.00 3151.16 0.02 0.15 0.24 0.02 0.56 0.54 0.30 0.08 0.02 0.85 0.15 0.43 0.10 1.00 -0.24 -0.50 -0.05 1.00 0.50 0.50 0.50 891 Monday Tech
9 11.00 97.00 0.67 1.00 0.84 2.00 0.00 0.00 0.00 4.86 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.46 0.03 0.03 0.03 0.45 0.54 0.16 0.03 0.02 0.60 0.40 0.57 0.40 0.80 -0.12 -0.12 -0.12 0.12 0.00 0.38 0.00 3600 Monday Tech
10 10.00 231.00 0.64 1.00 0.80 4.00 1.00 1.00 1.00 5.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.04 0.84 0.04 0.04 0.31 0.05 0.04 0.03 0.56 0.44 0.30 0.10 0.50 -0.24 -0.50 -0.10 0.00 0.00 0.50 0.00 710 Monday World

The data_channel column now holds the information for all data channels in a single column, so the redundant indicator columns are removed to simplify the analysis.

[1] 38463    46

The dimension of the data frame further shrinks from 58 to 46 columns.

The variables day_of_the_week and data_channel are converted to factors, since they are categorical in nature.
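A short sketch of the conversion on made-up values is shown below. Fixing the weekday levels in calendar order is an assumption added here so that tables and plots list the days from Monday to Sunday rather than alphabetically.

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  day_of_the_week = c("Monday", "Saturday", "Monday"),
  data_channel = c("Tech", "World", "Business"),
  stringsAsFactors = FALSE
)

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Weekdays get an explicit level order; channels keep the default alphabetical order.
toy$day_of_the_week <- factor(toy$day_of_the_week, levels = day_levels)
toy$data_channel <- factor(toy$data_channel)

str(toy)
```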

EXPLORATORY DATA ANALYSIS

Histograms of the variables are plotted to understand the distribution of each feature.
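The plotting step might look like the sketch below, which draws one base-graphics histogram per numeric column. The helper plot_histograms() and the toy data are assumptions; the original report may have used a different plotting package.

```r
# Hypothetical helper: draw a histogram for every numeric column of a data frame.
plot_histograms <- function(df) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  for (col in numeric_cols) {
    hist(df[[col]], main = paste("Histogram of", col), xlab = col)
  }
  invisible(numeric_cols)  # return the plotted column names without printing them
}

# Toy data, made up for illustration; the character column is skipped.
toy <- data.frame(
  n_tokens_title = c(12, 9, 9, 9, 13),
  shares = c(593L, 711L, 1500L, 1200L, 505L),
  day_of_the_week = c("Monday", "Monday", "Monday", "Monday", "Monday"),
  stringsAsFactors = FALSE
)

plotted <- plot_histograms(toy)
```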

Summary of analysis:

The variables n_tokens_title, global_subjectivity, global_sentiment_polarity, min_negative_polarity and avg_positive_polarity seem to follow normal distributions.

The variables n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, num_imgs, num_videos, average_token_length, kw_min_min, kw_max_min, kw_avg_min, kw_min_max, kw_avg_max, kw_min_avg, kw_max_avg, kw_avg_avg, self_reference_min_shares, self_reference_max_shares, self_reference_avg_sharess, LDA_00, LDA_01, LDA_02, LDA_03, LDA_04, global_rate_positive_words, global_rate_negative_words, rate_negative_words, min_positive_polarity, title_subjectivity, abs_title_sentiment_polarity and shares are heavily right-skewed.

The variables kw_max_max, max_positive_polarity, avg_negative_polarity, max_negative_polarity and abs_title_subjectivity are heavily left-skewed.
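The visual reading of the histograms can be cross-checked with a numeric skewness measure. The sketch below uses the moment-based sample skewness, which is an addition for illustration and not something the report computes: positive values indicate a right-skewed variable, negative values a left-skewed one, and values near zero a roughly symmetric one.

```r
# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up vectors showing the three cases.
right_tail <- c(1, 1, 1, 10)
left_tail <- c(-10, 1, 1, 1)
symmetric <- c(1, 2, 3)

sample_skewness(right_tail)  # positive
sample_skewness(left_tail)   # negative
sample_skewness(symmetric)   # zero
```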

-

Dividing the articles by popularity:

Here, an article is considered popular if its number of shares is greater than or equal to the median number of shares, and unpopular otherwise.
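The labelling rule can be sketched as below. Counting ties at the median as popular is an assumption; it is the reading under which the popular group can come out larger than the unpopular one. The helper label_popularity() and the share counts are made up for illustration.

```r
# Hypothetical helper: label each article relative to the median share count.
# Assumption: articles exactly at the median are counted as popular.
label_popularity <- function(shares) {
  ifelse(shares >= median(shares), "Popular", "Unpopular")
}

# Toy share counts, made up for illustration; the median is 300.
toy_shares <- c(100, 200, 300, 400, 500)
popularity <- label_popularity(toy_shares)

table(popularity)
```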

popularity
  Popular Unpopular 
    20464     17999 

Popularity of shares:

Popular articles outnumber unpopular ones.

-

Articles with 8 to 16 words in the title have the maximum number of shares.

-

Articles with fewer than 1600 words in the content receive more shares and tend to be popular.

-

The articles having 0 to 45 links have the maximum number of shares.

-

Articles whose average polarity of positive words lies between 0.2 and 0.6 have the largest number of shares.

-

More articles are published in the World, Tech and Entertainment channels than in any of the other channels.

-

Popularity of different data channels:

Among popular articles, the Business, Technology and Social Media channels appear to be the hottest subjects compared to the others.

-

The majority of articles are published on weekdays rather than on weekends.

-

Popularity of Shares based on the days of the week:

Relative to the total number of articles published on each day, articles published on weekends (i.e. Saturday and Sunday) tend to be shared more than those published on weekdays. Most popular articles are usually posted on Mondays.
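Comparing days relative to the number of articles published on each day amounts to a row-wise proportion table. A sketch on made-up labels follows; the column names mirror the variables created earlier in the report.

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  day_of_the_week = c("Monday", "Monday", "Monday", "Saturday", "Saturday"),
  popularity = c("Popular", "Unpopular", "Unpopular", "Popular", "Popular"),
  stringsAsFactors = FALSE
)

# Counts of popular and unpopular articles per day.
counts <- table(toy$day_of_the_week, toy$popularity)

# margin = 1 divides each row by its total, giving the share of each label per day.
share_by_day <- prop.table(counts, margin = 1)
share_by_day
```

Raw counts favour the days with the most articles, whereas the row-wise proportions show how likely an article published on a given day is to be popular.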

-

Relationship between rate of unique non-stop words in the content and the number of shares:

The box plot shows that articles whose rate of unique non-stop words in the content falls between 0.6 and 0.8 have the maximum number of shares.

-

Popularity of shares based on the number of images in the content:

Articles containing between 1 and 40 images tend to be popular.

-

Popularity of shares based on the number of videos in the content:

Articles containing between 1 and 15 videos have the maximum number of shares.

-

Relationship between average token length and the number of shares:

Articles whose average word length in the content is between 4 and 6 have the maximum number of shares.

-

Relationship between number of keywords and the number of shares:

The number of keywords in the metadata influences the number of shares: articles with more than 5 keywords tend to be more popular.

-

The outliers:
   [1]  10000  13600   5700  17100   7700   6400   5700   7100   5600   9200
  [11]   6000  19400  10400   5600  11500  28000   7500   5600   8100   5500
  [21]  11400  11800  12000  37400  18000  18200  19800   9000   6300  25200
  [31]  15700   5400   7900   9700  12100  14300  33100   5400   5600   8000
  [41]   5800  39400   6000   6800   5900   5400   8900   5500  17100   6300
  [51]  10400   6400  17300  11400   9100  51900  12300   9100   5700   7200
  [61]  15800  17500   5700   8000   5800   7700   7800   7800   5800   9800
  [71]  17600   5700   6800   6500   6700  11900   6000   6500  22000   5500
  [81]   5600  10200   8200   7400   8300   8300  39200   6000  13800   9500
  [91]   8100   5700  10700   9700  11800   7000   8000  10800   9800  30000
 [101]   6200   5600   6300   5900   8200   6400   5900   6100   6300   8500
 [111]   8100   6700   7100  30200   8700  16700   6700  11400   6300   6000
 [121]  11100  14400   6300  10300   7900  11900  69100  16800   6700  16500
 [131]  16500   7000  40100   5400  17300  12900   7400   5500   9700   8900
 [141]  10900   9200   7400   6900   5800   6500   9400   6000   8700   6000
 [151]   7100   8200   9100   8300  18100   8500   6500   6900  10700   7700
 [161]  14400  18400  11000   7000  41600  30400   6600   6600  13600   7300
 [171]   5900   7300   6600   5400  10800   7600  12300   6800  16600   6100
 [181]   9300   6400  12800  27400  11900   5800   6100  13500   9700  16600
 [191]  18100   8200   8100  66900  11400   9200  67500  15600   8400   7400
 [201]   7200  10300   6100  15400   6500   7600  16800   9900  14500   8600
 [211]   6500  19200  14800   5500  10500  11300  16700  11100  14300   7300
 [221]  13200   5500   6000   6900  11700  16800   5500  25300   6400  26800
 [231]   5800  19800  40400   6400   9800  32700   7500  14700  18600  23700
 [241]   5700   5800  42700  12300   5500   6100  10800  10100  12500  10400
 [251]  12000   8600  20000   7400  12300  16200  27300   6400  18000  14400
 [261]  17800   9800  80400  16700  11400  10600   5600  10800   7200   7700
 [271]   5700  28200   5400  15300  21700   9300   9100  13900  57600   6500
 [281]  21400  15200   7300   8400   5600  27900   8100  31300   5500  35800
 [291]  13600  26900   9600  18400  28400  11700   6100  10400  53100   6100
 [301]  10200 227300  20800   7800  13000  21100   7600   7200   8000   8200
 [311]  35100  13300  10700 144400  26400 617900   7200  14500  11700  10000
 [321]  10000  14200   7100   9600  27300  12500  11100   8300   9200   7200
 [331]  20200   9500  26400  12800  20400   8500  11100  10100   6000  12300
 [341]   5700   5800  36200  13300   6400  35800   8600  11200   9400  10800
 [351]  21900   6400  13300   9400   7400   8900   9600  13000  11700  14300
 [361]   6500   5600  17800   7700  15000   7400  34900   6600   9800  15700
 [371]  16900   9300  44700  18400  21800   5400   8700   9400  24300  10600
 [381]   6100   9700  62300   6200   7500   6000  12400  21500   6300  12000
 [391]  11900  22200  71800   7400  11200  23600   6000   5400   6400  11500
 [401]  11700  17200   6900   5600  53700   8000  20700   9700   7100  17100
 [411]   7300  36700   7600   8800  33900  12500  12200  14500   6300  10600
 [421]   5400  13300  22300   7600   6700   6200  14700  22300   5600   7700
 [431]   7300   5700  14400  28300   7900   5400   6400   5900   7400   6500
 [441]   7300   8800   6200  25200  18300  20200   6300   5700  10600  17100
 [451]  11000   6200   9100  36200  21400   7800   6800   6700   6200  22100
 [461]   6600  32200   6500  23500  12100   7400   6200   8600   8600  22000
 [471]   6900  10300  13400  65300   5500   5700  16300  15200  27700  16600
 [481]   7400  12400  20900   8500   6900  49000   7100   5800  15700   7000
 [491]   9500   9000  16900   6300  10700  14400  14900  16400   6900  24400
 [501]  26600  15900 306100  20600  16100   6700   8800   6400   6000  36200
 [511]   9300  17700   6400  12600   6800  68300  10700   6100   9600  12400
 [521]  25300  10900  10700   5400   9700  11900  15200   6900   6500  10000
 [531]   7400  10100   6200  31400  10600  11400  23500  12300   9000   8000
 [541]   6900   6300   7800  17100   5900   6000   5800   5400   6000   9900
 [551]   6100  11100  14800  14300  24900   7300   8200  12800  16600   6300
 [561]  17500   6900   6400   5400   6500  10300   6200   7400   7900  12400
 [571]   7900   9700   9200   6000  18700   6300   9000   5800   5700   7800
 [581]  32600  11300   7700   7500   6600   5900   7700  10800   7400   6800
 [591]  17900  22300   6200  10600  24800   6100   8700  14100  11600   5600
 [601]  10300   6600  18000   7900   6200  14700  16300   6300   7200  12400
 [611]   9700   5500 690400  12000  11600  10400  18800  20900  13700   7000
 [621]   8900   9400  20800   7700   8500   6700  23500   9700  11900  21800
 [631]   6300  16300   7400   7100   5700 112500   6700   6600  16200   6300
 [641]   7400   9200   5400  50200  13700   6900   8100  11500  15400  13900
 [651]  10100   6600   6400  14000  17100  17600  13800   6200  14300  13700
 [661]   6200  11000  93800  13400   5800   7800   6600   8400   6500   6400
 [671]   5900   7700  17600   5600   5800   5500   7700   8400   8400   7200
 [681]   8800   7000  26400  12300  15500  20500  16100   6200   6300   6300
 [691]  12300  14800   7100   7600  10700  15900   6000   9400  12800  14200
 [701]  41700  11700  12600   6900  16400   6400   7800   7100  15000   5700
 [711]   6400   8700  10600   9900   7600  10700   7500  25000   8700  16200
 [721]   6500  12700   7500  11100  19100  13200   7900   8300   8600   9200
 [731]  18100  10600  87000   5400   7300   7900   8100   6500   8300  25700
 [741]  20300   8100  28200  28900  12200   9500  10700  30200   8000   8400
 [751]  14900  15600   6800   7900  10200   6800  42200  26600  17000  61500
 [761]  18400   6100   8500   9100  11900   6300   9000  15400   5500   6200
 [771]   6500   7500  13500  36800   8400  11400   9100  11000  20700  27000
 [781]   9800   6800   7600   5400  10300   6700   7000  12500   5900   5900
 [791]  10900  25100   6500   6600  15800  11100   8000  45400  21700   7800
 [801]  23700   6700   9500   9100  14500   5900   8300   7600  27000  24000
 [811]   7900   7400   6400   7400   7000  11700   9500  23900   6500  12000
 [821]  41200   6000  12000  97200   6200  16700   7200   8000   8900   7100
 [831]   9000  21000   8000   6800  20200   5500   6800  23100   7500  24400
 [841]  15700   6200  12700   7300  16200  12500  50200   6300   7600  22100
 [851]   6100   8100   7400   5800  10000   6800  18400   5400   6300   5600
 [861]  27700   6900  61600  13900   7200   8200   5800   6100  21500  11100
 [871]  10900  11700   5700  73100  15300   7300  14700  11400   5400   6300
 [881]  15000   5500   5800   9200   9000   5500   6100   5400  28600  23000
 [891]  11700  20600   9800   8400   8800   9800   8000  13000   6300   7500
 [901]   5600  10700   7300   6000   5400   6900  11800   5600  23800   7800
 [911]   6000   5600 118700   5500  31500   9800   9600   7200  48800  18200
 [921]   6500  16300   9000   6000   9700  24200   6500   5700  17900   5700
 [931]   9900  10200  81200   7400   6100   5600   5700  59000  10400   8100
 [941]   6200   8000   5500  15400  10500   9300  17000  10300   6300   5500
 [951]  12900   9200   5800  42600  20900   9400  15200   9100  13400   5500
 [961]  30400   6700   6700  17400   6800  16300   8100  11000   7100   6500
 [971]   7000  12100   9400   6700  23300   8200  18200   8500   8500   5400
 [981]   8800  10300  17900   7300   6900   8200  12300   6100   7600   5800
 [991]   6900   5700   8900   7800   5800   6500   5800   7800  10700   7200
[1001]   7900   6100  16300   5400   5500   9500  40400  26300  13900  13900
[1011]   7200   6800  11300   9600  13100   6500  13700   8000  12200  16100
[1021]   8700   5700  28200  54300  13900   6500  11800   6600   6600 843300
[1031]   9500   9100  13900  24500   7200  24500  14800  10000   9100  12400
[1041]   5800  36500   5900   7800 104100  25200  24500   6600   6900   7200
[1051]   7200   5700   6400   5700   9700   6300 196700  14600   8000  16300
[1061]  35300  10000   7600  10300  10100   5500  18900   6100   9000  26200
[1071]   6800   9300  36100   5700   6600  56000   9400   7100  56400  43100
[1081] 210300  14500  10400  15000  10700   6000   7700   6400   5900  15000
[1091]  12400   6000  31600   6600  22000   5700  12100   9600   5700   6000
[1101]  16700   5400   7700  12300   7900  10500   6200   5600   7400  12800
[1111]  13800  15200  11600   6300  71000   5800   8100   5400   6100   7400
[1121]  11800  17600  32500   5700  12900   6500  16900   5900  29000   5800
[1131]   6800   7900  11200  19700   7300  19300   7300   7500   7500   8000
[1141]   9800  23700  18200  45100  14700   6300   7100   7600   5900  16400
[1151]  14800  20800   9200  21400   6900  17700   6200   6500  14700   8700
[1161]  17600  15600  21900   5700  11600  18900  14200   5600  29500   8300
[1171]   9800   5800   6300   7600   6900  53200   6100   6600  20000   5600
[1181]   8300   9000   6600   6500   7900   7900  14500  15100  11200   9700
[1191]  11900  12600   6700  16300  16100   6500  17800  39500  10800   7400
[1201]  49500   6300   5900   8200  14300  11800   5900  10500  16800   5400
[1211]  11400  14000   5900   8300  10900  13300   7800  29400  11200  18200
[1221]  11400   6200   9900   8700  12900   6500  11400  17800  13200   6200
[1231]  22800  12500   8400   9500  42500   8400  11200   6300   5900   5800
[1241]  22300   5800   8300  13300   8200  47400   6800  38400   7100   7600
[1251]   5600   9100  12400   5500   7900  54100   6300   9700   5500  12400
[1261]  25500  18200  10400  19800   9300  17000   8700   5600  17200  23900
[1271]  25200 113700   6100  34700  10000  18000   6800   7100  10400   5700
[1281]   7200   6800  32600  24900  16600  18600   9300  20600  10400  11500
[1291]  15400   8000   7200  10300   8700   5800   8600  12500  58100  23500
[1301]   9100   6100   8400  10900   6500   8300   9000   6000  11500  17500
[1311]  13100  13900  10600   9700   6300  17400   8600  17000  26000   6000
[1321]   6700   5400   8400   6000  96100  35400   5600   6800   7800  11000
[1331] 143100  21900   6500   7900  10000   5600  10400   6900   7500   5600
[1341]   6300   5800   5900  10800  16300   8200  11600   8600 138700   9300
[1351]   7000  10200   7600  17300  30400  23000   5800  47800  20200  10600
[1361]   5500   6200   7000   7800   7900   9200  19800  18200  17400   8200
[1371]   8200  20100  10000  18600   5800   7800  41000   6500  10200   5700
[1381]   5600   6100   7400  24200  18200  26300  11800   5800  11900   5900
[1391]  11800   5700   8600  18200   9000   6200  15700  16100  29600   6000
[1401]   6900   5800  11200   9500  20300   5900  15700  37500   7100  22100
[1411]  14200   5500  13400   6400   5800  13900  13300  17800   9800   5700
[1421]  14900   7800  16200   7200   8700   5500  13900  10700   9000  10100
[1431]  19300  26200  16900  14400  25300  17400   6200  74100  11100   5500
[1441]  11000  23900   7500  18700  37700  10800   6400   5400  12300   6000
[1451]   5500  48800  15300   6800  71700  16700  86300  13000  24400  10400
[1461]   6500  17000   9600  17000  12400  12400   5400  11900   7200   9900
[1471]   8800   7200  33800  17500  11000  12400  12200  15100   8300   5500
[1481]  18100  29900  20300  12500   5800   6500   7800  16900   7400  16600
[1491]   5500  58800  19200  32200   6600   9800   8900   5800  13900  15500
[1501]  11900   6700   8400   8400  33400  11000   7700  21700   5900   6500
[1511]  19800  19300  17500  21100  28400  11900   6600  37800   6100   8300
[1521]  15000   8600  12500   9800  13300   7400   6300  15900  24300  27100
[1531]  25100   7900   6100   8900  13700   8500  12600   5500   8100  15500
[1541]  19500   6700  14100  36600   6400   9000   8200  20000   6200   6100
[1551]   8100   6000   6500   9800  64599   7800   6900   5500   6700   6000
[1561]  11500   7300   6200  23600   6800  15200  26200   6500   6500  25900
[1571]  53900   6400  15200   6700   9900  20100   6800   6800   6700  18300
[1581]  11800   6500   5700  12900   8100   6500  38000   7400  45900   8300
[1591]   7300   6600  22500   7700   8500  14700   5900  16700   6400   8200
[1601]  24100  18300   9900   9400   6100  39900  12600   7900   7700  23800
[1611]   5600   7100  48100  11600  31300   7600  16200  57400  12200   9800
[1621]  12300  22800   9500   5400   8000   5700  13700  11000   5800   9000
[1631]   7600  20800  53200   5900  27300   6100   6600  14700  16000  26600
[1641]   8400  21500   6600  10200  10400   5700  96500   5700  17000  10400
[1651]  11900  72100  24400   7400   9400  10000  54200  24900   6900  14100
[1661]  10700  25500  12900   6000   7200  34800   7800   5400   6500   6300
[1671]   9100 233400 102200  13400   5700   8700  20700   5600  10900   8500
[1681]   9700   7200   6200   5700  17900   6200   6200   7400  15500  95100
[1691]  49700   5400  15400   6500  22300  13500   5900  25300  19100   7400
[1701]   7500  12900  27500   7800   5600  10000   5900  55200   8800  20700
[1711]   5500   5500   6700   6100   6200  12700  12300   5800   7500  86200
[1721]  19100  11200  10400  64500   5700   5600   6800  11700  11300   5700
[1731]   7800   6400  26800  11100   9700  13600   8700  30900  13100   6800
[1741]   9000   6900  98700   6900  17700   6400   8600  15400   6500  77600
[1751]  39900   8200  10100   6700   6000  10900   6800   6100  19800   8300
[1761]  11900  10000  12700  30000  11600   8000   6800  44300  15600   7200
[1771]  53100  11800  14000   5600   6500   7700  11700   7000   5900  14500
[1781]   7700  28200   8900   8500  11300   6100  10300  46000   5700  10500
[1791]   6200  10500  10300  12000   8500  29600  16200  11000  19500  10600
[1801]   9800   5600  24400  14400   6700   5800   9300  15600   9400   8700
[1811]  15800  14400  15300  25900   7800   7400  23100   6900  48000  27600
[1821]  15500   6200  17600   5700  10500   6500   6600  11500   7000  17900
[1831]  12200 441000   6500   9000   6200  53500   5500  19100   9300   7200
[1841]  15300   6800  33000   6500 298400   5900   7500   8600  27600  10100
[1851]  50900 128500   6500  19300  18600   8000  12300  10300  14800   8100
[1861]   8100   6300  11000   8700   8800  23700  12300  27300  12100  17000
[1871]  16500   6700  14200 652900  11500 128800  17300   6400   6600  14000
[1881]   9100  11000  23200 106400   7000   6100  14400  14000   6900  13400
[1891]  12500   8300   8200  18300  29400   9900  14800   9800  11900   7300
[1901]   5500   6200  56500   5900   5400   6300   6700  10800  14800   6000
[1911]  10600  21700  37000   5800  20300  22800  17000  42300  12800  12700
[1921]  13900 139600  10900  11200  12800   9200  35900   7200 122800   6400
[1931]  16600  12700   7700   6000   5500   7900  94400   7200   6400  15000
[1941]  20300   5400  12300   7600  10500  11200  57000  13000  25900  16900
[1951]  11200  14900   8900   7200   7900  83300   6200   7300  17800  14300
[1961]   5700  11400   7700   6200  10500 104100   9800   8100  30300  10400
[1971]   6500   9900   7000  21000  24900   6700   6500   9100  10900   7900
[1981]  22200   6200  13800 158900  12800  18300  11700   7400  26200 205600
[1991]  34800  10600   5500   6200  14700   9300  11000  23500   6300   6100
[2001]   7200   5400  10100   8600  12800   8800   6600   7800  31300   8000
[2011]   5500   8000  26600  19200  10700  17300   6000   8800   7700   7200
[2021]   5500  16000   6500  19300  14900 180600   6000   5800   5400   5500
[2031]  30900  54900  11800  44700  45800   8000  12000   6500   5400   9600
[2041]   9500   5700  20000   6600  37200  24200   6300   7900   9000   6500
[2051]  11400  19500   7000   9300   6400   8200  15000   5400  72900  15900
[2061]  14700   9200   7400  32000  24200  12100  15600 111300   8200  11700
[2071]  15200  49000   7700   8100   9900  14700   8800  10500   6800 104600
[2081]   9400   5800  38900  15100   9100  11000  13800  22200  29300   7600
[2091]   8800   6700  15200   8200   8200   5400  20200  10600   6500   8300
[2101]  30200  15000  16500   8300   8600   8000  80800   5700   9300   7300
[2111]   5400  40300   6000   7200  12800  16800  14700  40200  22300   6300
[2121]  10700  12400  16600  11500   6500   6100   9600   9300  13000  38900
[2131]   9200  10400   6900  15700  17700   8200   8500  11900  10300  11000
[2141]  29400  33100  10900   9600   6400 197600   6500  10600   9700  10900
[2151]   6700 193400  24700  36600  10700   8600  19500   5900   7600   7900
[2161]   6500  13600  10800  13000  35100   6100  14400   5900   6200   8500
[2171]   7400  12400   5400   8700   7300  16300   7300   7100  18700  24100
[2181]   5500  50600  65900   5900   6900   7000  12200   5700  14900   5500
[2191]  11500   5600   7700   7700  13000   5500  10700  18800   5500  10000
[2201]   8700   8100  47300  10700  11700  16400  17100   5900   7200   6900
[2211]  21300   7800  10600   6300  13400  10000   6900  25300   6400  14900
[2221]  30600   9400  10500   5500  53100   9500  17900   5600  78600   8800
[2231]   8000   6800  10200   7600   9000   6000 208300  69300   8600  26600
[2241]  12900   9800   7800   6400   9700  35100  25700   5500   6200  54600
[2251]  13900  12300   6900   8400   6000   7600   7000  12200   5900  10000
[2261]  13400  11000   7600  10600  18600  21300   9200  11400  11800  18000
[2271]  10200  18300  30500   6700   5600  34800  20600  27300  15000   9000
[2281]  53800  15800  13300   6200  25400  11800  21000  22800  20700  12300
[2291]   8900   8600   6800 310800  24200   5800   6200  98000  89500   5700
[2301]  28200 139500  11800   6400   8300   7700  10400  40000   7900 100300
[2311]   5700  19000   6400   6900   5700   5600  63100  13600   7800   8400
[2321]   8000   6200   7900  15200   7600   6000   9700  12100   5700  12500
[2331]   6400   6300   7700  20800   7500   6500  14900  21700   5600   5700
[2341]  10100  88500   8400   8800  11200  25500  13400   5600  17300  10200
[2351]   9600   9900   5900   6300  12800   6900   5600   8200   9900  74300
[2361]  23100   6000   5700   9900   6400   9200  10300  10300  26300  38600
[2371]   5400   5800  12800   6000  31500   6600   6500  10400   6100   9000
[2381]  28400  12800  29700  11700   6700  18000   6300  30600   7400  12500
[2391]   9900   7700   8000   6400   5400  10100   9100  17200   8900   9600
[2401]  15500   5900   8200   8600  19500   6500   9200   7800 110200   6700
[2411]  10900   6100  18800  18300   5700  11400  11600   9700   5800  22600
[2421]   6700   9000 133700  77800  19200   5900   5900  11500  92100  15200
[2431]  11100  11000   9100   5900   7100   5900   8700  21000  18000  19600
[2441]  11300   8500   7600   8500   6000  10900  16100   5600  11900  16900
[2451]  12400  10100  14400   6700   6200   8300   5600  20200   6500   6200
[2461]   5400   7500  22200  10800  13600   9000   9500  13900  26900  62900
[2471]  10400  22300  12500   5400   6600   6300  25000  21600  19900   7400
[2481]   9500   7300  11100   6200   6200  20900   9000   7700  26900  12000
[2491]   5900  46800  10100   6300  21000  14100  33800  13500  11900   7700
[2501]  18800 112600   9900   9400   6800   5500   7200  17900   9500  11200
[2511]  18800  26100   6200   7000   9400   6300   5900  37300   6300  11000
[2521]   5600   9600   7500   7600   6200   9300  12100   7100  15900   9900
[2531]   7000  21400   5700  20800   8900  23800   9700  16600  15000   7600
[2541]   5600   7300   5400   9100   8400  19400   6800  11200  15700   6700
[2551]   7600   9500   5900  11700   9100  16200   7400   8800 128900   5400
[2561]  18200   8500   9800   6800  45900   7100   7400   5500  11200  17100
[2571]   9200   6300   6500   9600  50000  27100   5400  12900   5400   9800
[2581]  10600  18900  12400   8600   6100  16800  34100   7400  41600   7400
[2591]  60100   5600   6500  34700  10500   7700  41900   8300   6000   5500
[2601]   6200   7100   7900   7500   7900   5800   6400  14400   5900   8600
[2611]  14000  64099   8100   6400  13000   9200   6100   5400   5800  12700
[2621]   7700  47800  21600   7900   7300   9000   8600  11500  25000  21500
[2631]   8700   6100   5600  21000   6900   9000  14400  14000   6900  21900
[2641]  17900  60300  15400   7800   5600   7900  12400   6700   6300   5400
[2651]  14700  18800  10400   5900   5700  18800  10200   5500   7800   9100
[2661]  13000  38200   6100  39000  34300   6000   5700   7600  12800  10600
[2671]  22800   5700   8400  12200  13900  14000  13100  12000  11700   6800
[2681]  26500  13400  24200  19800  16600   6500   6500   7200  10100  16200
[2691]  17200   6600   9100   6300  14600  43600   5400  13700   7400   9800
[2701]  18100   8900  12200   5500  33600  17400  12200   7900   9600  11400
[2711]   8000  10900  45400  15300  14900   8300  30200  20400  18700  10300
[2721]  13800   7300   5500   6200   7600   6100  18500   5400  11400  12900
[2731]  10200   5800   7400   5500  15400   8500   6500  35400   7300  15500
[2741]  12600   9800   8400   7800  11400  11200   5900   9100  13200   5400
[2751]  11700  16000   7700   9400  24900  11700  16500  10100  23200   5700
[2761]  16300  29800  22300  41800  23100   7200   9800  21200  11800  10000
[2771]   9400  11300   7000   6200   9300   5600  19700   7800  30700  15700
[2781]  71300   8500   8600  10300  53300   8600   6900  10700   5400   8600
[2791]   6300  11700   6600   9800  10800  10600   7000  10600   8000  11500
[2801]   7500  26800  46100  19700  12700  57800   6000  28600  15000   9000
[2811]   5500  11000  10000   7000   9100  49800   7800   8300   9200  10100
[2821]  14900   6500  10900  13200  19100   6400   9400   7600   6100  11800
[2831]  12200   5700  10100   5500  15400  13600  41000   8900   6100   6700
[2841]   9100   8200  22300  77300   6100   9200   7600  75600   6300   8900
[2851]  50100   6100  96900   9800  13000  10700  10100  24700   7200  15700
[2861]   6800   6000   6900   5700   8600  12200   7800   5400   7800   6600
[2871]  31500   6700  10000  12500   6500   7100  11600  20500   9000   5800
[2881]   6000  27100  23100   9400  12700  12600  16700   7200  12300  14100
[2891]  21700  25000   7500  20600  35200   5400   5800  57200   7100  17600
[2901]  11300  16900   6800  13100   9500  13000  17800  30000   5500   5600
[2911]   5400  12500   5400  18500  12700   8000  14100   8400   5700   5500
[2921]   7500 663600   9200  20700   8800   7500  13600  12600   6100  67700
[2931]   6400   9000   6100   7700   9300   6300   5400  13700   6200  50700
[2941]  13200   5800  15800   7500   5600   5900   7500  82000   9700   8300
[2951]   6600   5800  13600   9200  81700   6000  10100   8800  17200   6800
[2961]  28800 141400  16500  11100  10500   5600  10400  13900   6100   6800
[2971]  10000   5800   5800 145500   5500   6200   7300  27400  13700   6500
[2981]  11200  16800   8100   5400  12400   6500   8900  17100   7400  12000
[2991]  28200  34100  23700  62900   5600   5400   6300   7400  12500   7300
[3001]   5800  11200   5500   6400  34600  26000   5800   7400   6300   5700
[3011]   6700   5600   6700  32299   7800  20800  18000   8700   7400   9100
[3021]   7100   5800   8200  10300   6400   6100   9500   7700   5400  16600
[3031]   7300   5900  17300  10300  12800   9700  12400  11200   7900   6400
[3041]   7000   5800   6200  10700  15700   7500   8600   6800   8100   5800
[3051]   8700   9200  26700  13500  13100  10000  11900  11300   6300  23200
[3061]  59100   6300   5500   5600   5500   6600  20500  10100   9400   5400
[3071]   6800   8300  33500   6600   9200   6000  20400  12900  27400   9900
[3081]  13700   6000  12900   8800  13700   6500  10400   5800   5400   6000
[3091]  20500  16600   8700  29200  20400   5600  18100   5600   5500  12700
[3101]   9200  21400  31000  96000  23800   5800  13400   6800   5500  11500
[3111]  11400  43700  59000   5900  24200   5500  18200   5900   6500  18600
[3121]   9800   6600  22100   6900   7000  14700   9400  10700   9400  18800
[3131]   9700   6900   7500  10000   5600   6900   6500  10500   5800   5700
[3141]   5800  27900  31300  14300   6700 144900   6600  60300   8200   5500
[3151]  29300  11600  11500   7600   6000  21700   7100   6000   7500   6200
[3161]   7400  16600   5500  53100  12400  19000  11800  18000  36900  15000
[3171]  35500   7900   5600   6600   7300  20000   5600  10500  27200  13300
[3181]   6600  35400  10000  18300   6400   5400  20000  24900  12800  15200
[3191]   7400   9900   5900  18400   5900  24500   8200   5400   6500   5600
[3201]  18300  10300  16300   7200  16900   7600   8500  12400   7800   5700
[3211]   5700   7300   6600   6300   8100   8000   7300   6200   6500  11300
[3221]  22500   8000   5600   6500   5900   5500  14700   9900  20400   6800
[3231]  11900   5500  11800   6600  19300   5600  14000   5900  71900  16600
[3241]   9200   5700   6000  44800   6000  32000   9900  11300  24100   8100
[3251]  11300  15000   5700   8900   7900  10500  15800   7300   6300  31400
[3261]   8800  25000   6900 109100   6600  45100  10200   8200  11700   8600
[3271]   9700  10100   8500  31100   6500   7400  95400   6200  11800   9100
[3281]  11500   8000   9000  10500  69500   5900   6900   8300  17400  10300
[3291]  12900   7600   7000  17300   5500  16800  70600  12100   8400   7400
[3301]  68300  11100   6200   7700  15100   6000  95300   7700   9500   6300
[3311]  11700  37500   6000   7800   8400   5700   5800  10100   8500   6700
[3321]  52100   5800  11100   8700   6400   6900   6300   5800   5500  57500
[3331]   8400   5800   6900  47100  15600   7100  10600  12500  25000   6800
[3341]   8200  27700  11300  10100   7800   9600  11100  27900   6600   9100
[3351]   6600   8200  48800  12200   8500  16700  11100  40000   7800  12900
[3361]   5600   5700  31400   5400  22000   7100   5400   8400   7600  15700
[3371]   6400  47100   7100  13700   6500   8200  21900   7200  11400   5500
[3381]  13100   6500  26900  56900   6000   5700 105400   7100   7800  48500
[3391]   9400  12900  11900  17300   8500  19600   7800  31400   7300  17700
[3401]   6400  47800  15600  12900  70200  23200  11100   7000  11600   6100
[3411]   8000   6500   7500  19500   6000  13400   5800   6300   7600   6000
[3421]   5700   7100   5700  18800  94800   6200  17700   7700  16300  18400
[3431]  29900   5700  13300   5700   6800   6200  20700  11000  12200  16200
[3441]   8900  15800  10500   8900  11700  10600   8900   7300  51500  16900
[3451]   8400  31900   7800   5900   6100   8700   9500  11500   5600  12600
[3461]   8700  11000  21200  16300   5500  13200   6100  12700   7400   6100
[3471]   9100  21200   5600  12100   7500  19200   9500  35700   5400  22300
[3481]   7700  14400   8700  17100   7400   7500   6200   6500   5700  15400
[3491]   5400   7000   6400   7200  13300  11700   8200  13700   5500   7900
[3501]  11700  27500  16900   9500  15000   5600  14800   6800   5500   8800
[3511]   7100  24800   8300  13800   7200   8800  18900  12100   8600  15200
[3521]  18100   7100  18800  67800  14800  32000 109800  64400  26800   7200
[3531]   7700  12700   6100   6900  24800  20400   8200  27700   7500   8300
[3541]   7000  11100   5600   7700   9000   5600   9600  24000   5800  48000
[3551]   6600   6800   6200   6200  10400   5600  11500   5900   8900  11800
[3561]   7800   7800  20700  10700  27700  22800   8500  13300   6600  20200
[3571]   6900  10500   8100  33100   8000  12200  12200   9000  39000   8600
[3581] 105400   9800  10900  26700  25900  10800   5400   6700  39300   5500
[3591]   6300  14100  36600  31900   6700   5700   7700  20000  11100   6500
[3601]   6000   8600  43000   7200   5800   6100   7000   9300  17300   9800
[3611]   6500  18800  11300  16500   6200  12900   6900   7700  16300   6200
[3621]   5600   5600  20000   7400   5700  12100   8100   6000  12000   6100
[3631]  12100  53800   9000   6800  40200   7800  14700  27500  18500   7600
[3641]  22100   6100  10500  32000   6300   6900  12500   7700   7200  15100
[3651]  19700   5800   5500   7600   5500  35300  24000   8800  16700   9700
[3661]  48700   8300  13800   8200   6500   5900   6100  16400   6800   9200
[3671]  27900  52700   5900  47300  31700   5400   6100   7700  14300   6000
[3681]  16300   9200  12800   6000  42200   8800  13000   5900   6300   6200
[3691]   6900  19700   9400   5400   6600   9100   7800   8400  46600   5400
[3701]  14800   5700  22300   6600   6100   6000   5400  19900   8400   6600
[3711]   5600  10400   9500   9600   9300  15200   5400   9900   9700  20700
[3721]   6000   6400   6200   6000   9100   5700   7500  13500  11900   9600
[3731]   6100   9100   6300  18800   5700   8500  21900  16300  13000   5500
[3741]  29700  47700  25300  13100  22300   8000   6300   6700   7600   8700
[3751]  21000   5700  13900   9700   7900   6400   7800  12200  14900   5500
[3761]   8900   9300   8100   6700  14700   9200  38800   5600   8800  11100
[3771]   8100   7400   6200   7500  26200   6000  11500   8200  60200  40600
[3781]   6000   5900   8900   5900   7500   7400  10300   9200  36300  14300
[3791]   6700   6700  12000  17800   6600  17900   8500   5400  18400   8300
[3801]   9200   5500   8500   5400   8000   6600  16800   9000  10000  27800
[3811]   9900   5600  13100   5700   7700  23000   6800  14500  13700  15400
[3821]   6400  32800   6000  22500  56500   5700  24100  23700  28800   6600
[3831]  39100  63300   6800   5400  47100   8800   8800   6500  49000   7900
[3841]  11500  11800  11000  18400  17300  10700  14900  11100   6500  12900
[3851]  55600   6500   8600   7900   8400  21200  29000   7800  15800   5900
[3861]   8800  11200  11300   5900  37700  21800  10400  16900  10500   6200
[3871]   5700   8900   7600   5900   9200   6000  46400   6400   8500  38700
[3881]   5500   7000  11100  16100  40600   5600  19400  23300   6900  17200
[3891]   7000  12000  12800   9500   8100   8100  14300   8100   5900   8200
[3901]  92600   6600  82200  15000  10900   7600  19400   6800   7100   9300
[3911]   6100   6900   8500   6900  22100  11600  77200 108400  29800  15000
[3921]   5500   5900  10000  12700   6000  13300   9100   9400  53200  11800
[3931]   8500  11600  13100   9500   5800  17500  10400  18600   7800  98500
[3941]  57100   6000   5400  29900  67300   6400   7300   7500   7700   6600
[3951]  11300   7900  11000  32299  16400   6100   7900   5500  37900  11100
[3961]   5900   6200  16700   5400  21900   6100   6600  36700  40600  51900
[3971]   7700   6600   6800   8500   9100   6400   8000  24300  11300   8300
[3981]   7200   8800  32500   7600   5800   9000  33400  48000  10900   6600
[3991]  13100  18000   6200  18500  17600  15700   5600   7700  12300  19000
[4001]   9300  17200   7600   6800   5500  18600  43200  56500  12100  25200
[4011]  46300   5500  62000   6200   8000   6200  17600   8200  16200   5400
[4021]   5500   6400   7900   7100   6900   6800  16700  10100  15200   7600
[4031]   5500   5600  17100   6400  10500  10300   8300   9900  31300   7400
[4041]  16100  13400   7600   7900   7300   5700   6400   6400  16500   7600
[4051]  11500   8900  18400  15600   7200   6500   8600  23700   8400  43700
[4061]   6100   8800   7600  10500  42300   5800  19400   9900   7300   8100
[4071]  17500   6700  11400  25500  10200   9100  11900   9300  12300  17500
[4081]   7900   6500  14200  32400   6000   8900  12800  15400   7800  24400
[4091]   7200   8200  11800   6400  13800   7700   9900   6000   9400  10300
[4101]   7700   7200  24900   5400   6300  84800  31600   7500   8500  11800
[4111]  14100  16900  13200   7500   6600  52600  15000  11800  12300   8600
[4121]  16200  32500  56900  11800   7100   6400   6000   7200  39000   7300
[4131]   6700  18700  13900   9000  17200  23600  12500  13600   8500  16100
[4141]   8800  12100  16700  10300   6700   5800  12100   5900  10400  10600
[4151]  23100 109500   7400   7900   6900  52500  17600   8000  14200  21600
[4161]  14800  13400   6300   5900  52600  13900   6300   6700  20700   5400
[4171]   7900  10900  41800  35700   5500   5800  47300  10900   9800   8100
[4181]   5800  15400  79900  12300  13400   7900  78600  41100   7200 284700
[4191]   7000  22900  37100  48000  29900   5600  13700  10500  13900   7800
[4201]   9300   7100  16800  11500   6500   5800  21000   9800   7300  19100
[4211]   8500   5400   6000  51000  16400   6000  20200  15100   5900  13000
[4221]  42000   5400   6200   7200  33300   7000   7800  23200   6300  19400
[4231]  14500  12800   5800   6100   7600   6900   6200   7400  48500   9700
[4241]   7500   5800   7500   5700  16500   6500   5800   6700  23000   9100
[4251]  15100   6300  10800  11900  13000 115700   5800   6500  25500  69000
[4261]  17800   7000   7900   5400  25300  12000   5900  11600  19900  10400
[4271]   5900   5900   6200  13300  10300   5900   5500  26400  10000   7800
[4281]  12500  12700   6600   7200   6400   9100   5500  10200   7200   5500
[4291]   7800  16200   7200  12400   5400  11700  25400   8500   5500   6300
[4301]   6600  38200   7100   7500   8500  17900  55600   6800  12200   9700
[4311]   7600   8000   7600   9300   5700  10100   6700  26600   8500   9700
[4321]  11500   5500  10600  15900   6800  15300   7300  14300   7900   7400
[4331]   5800  19200  32200   5400   5400  12800  10000   5800  10300   6800
[4341]   6300   7400   6700   5700   7300  36100   9700  22200   7600  14000
[4351]   7700   6300   9500  10800  87600  32100  15000   9000   9100   5600
[4361]  18600   9200   6300  27500  11600  42500   6400   8700  13400  14600
[4371]   9500   7200   7600  10800   8400   9700   6200   5800   7300  39300
[4381]   8100  47300   6500   8800  11100  15000   5700   5400   8300  22500
[4391]  12900   7000   6300   6200   8100  20400  30000   5600  20900   7400
[4401]   6100  12700   7700  13300  10600   6100  16800   6100   5800  14400
[4411]   8200   6500  11500   5800  10300   5400  51900   8900   6900   7000
[4421]  14300  19200  32000  18600  52900   5700  12500   8700  53000  20600
[4431]   8700   9500  10300   5600  70800  64300   6300   6200  20300   9700
[4441]   9800  17300   5500  10000  10800   5700   7100   6800  27100   6200
[4451]  16800  19900  10200   9600   5600  10000   5400   8700  13900   5800
[4461]   8400   6500   6700   8900   9300   9600   6000   9400   7300  13800
[4471]   6800   7800   9300  13500   6600   5400  23600   5400  30200   8100
[4481]   6600  11500   6400   5400   7800   5900   8300  12200   6700  12300
[4491]   9100   5800  39800   9600   7800   5700   7700   9100   9800   7900
[4501]  25000   7900   6000   6500  11500   9500   5800   7000  11400   7800
[4511]   6200   9900  11700  34500  20900  11700   5500  45000  16800  10200
[4521]   7600   6000  24300

The dependent variable shares has a strongly right-skewed distribution: most articles receive a modest number of shares, while a few receive very many. It also has a large number of outliers, that is, observations that lie far from the bulk of the data. To reduce the influence of these outliers on the model’s predictions, shares is log-transformed, so the response used in the models below is the logarithm of each article’s number of shares.
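The effect of the transformation can be illustrated on a small scale. The following is a minimal Python sketch, not the R code used in this study; the share counts are hypothetical values chosen to mimic a few viral articles among many ordinary ones, the natural logarithm is assumed, and the helper name skewness is an illustrative choice.

```python
import math


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (positive means right skew)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical share counts: mostly modest values plus a few very large ones.
shares = [600, 900, 1100, 1400, 1800, 2500, 5400, 24300, 109500, 284700]

# Natural-log transform (an assumption; any log base reduces skew the same way).
log_shares = [math.log(s) for s in shares]

raw_skew = skewness(shares)
log_skew = skewness(log_shares)
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

On this toy sample the skewness coefficient drops noticeably after the transformation, although it does not vanish, which is the behaviour the log transformation is meant to achieve.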

Fitting a multiple regression model by least squares:


Call:
lm(formula = shares ~ ., data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0764 -0.5433 -0.1628  0.3867  5.9779 

Coefficients: (1 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   6.581e+00  4.448e-01  14.798  < 2e-16 ***
n_tokens_title                6.952e-03  2.186e-03   3.180  0.00147 ** 
n_tokens_content              4.685e-05  1.683e-05   2.784  0.00538 ** 
n_unique_tokens               1.548e-01  1.446e-01   1.070  0.28459    
n_non_stop_words              5.822e-02  4.887e-02   1.191  0.23355    
n_non_stop_unique_tokens     -2.575e-01  1.228e-01  -2.096  0.03608 *  
num_hrefs                     4.240e-03  5.042e-04   8.409  < 2e-16 ***
num_self_hrefs               -7.307e-03  1.337e-03  -5.464 4.67e-08 ***
num_imgs                      1.450e-03  6.956e-04   2.084  0.03712 *  
num_videos                    1.526e-03  1.190e-03   1.282  0.19971    
average_token_length         -8.548e-02  1.829e-02  -4.673 2.98e-06 ***
kw_min_min                    9.322e-04  1.225e-04   7.612 2.77e-14 ***
kw_max_min                    1.831e-05  3.792e-06   4.827 1.39e-06 ***
kw_avg_min                   -1.357e-04  2.318e-05  -5.855 4.81e-09 ***
kw_min_max                   -4.010e-07  9.115e-08  -4.399 1.09e-05 ***
kw_max_max                    5.891e-08  4.328e-08   1.361  0.17353    
kw_avg_max                   -3.886e-07  5.928e-08  -6.554 5.68e-11 ***
kw_min_avg                   -5.703e-05  5.659e-06 -10.078  < 2e-16 ***
kw_max_avg                   -4.334e-05  1.930e-06 -22.451  < 2e-16 ***
kw_avg_avg                    3.457e-04  1.086e-05  31.831  < 2e-16 ***
self_reference_min_shares     6.659e-07  5.655e-07   1.178  0.23896    
self_reference_max_shares    -2.181e-08  3.069e-07  -0.071  0.94334    
self_reference_avg_sharess    1.525e-06  7.845e-07   1.944  0.05189 .  
LDA_00                        2.090e-01  3.456e-02   6.048 1.48e-09 ***
LDA_01                       -1.517e-01  3.846e-02  -3.944 8.03e-05 ***
LDA_02                       -2.435e-01  3.464e-02  -7.029 2.11e-12 ***
LDA_03                       -1.202e-01  3.653e-02  -3.289  0.00101 ** 
LDA_04                               NA         NA      NA       NA    
global_subjectivity           3.933e-01  6.404e-02   6.141 8.26e-10 ***
global_sentiment_polarity    -1.096e-01  1.254e-01  -0.874  0.38205    
global_rate_positive_words   -1.155e+00  5.388e-01  -2.144  0.03202 *  
global_rate_negative_words    4.728e-01  1.028e+00   0.460  0.64561    
rate_positive_words           2.926e-01  4.342e-01   0.674  0.50035    
rate_negative_words           1.327e-01  4.376e-01   0.303  0.76178    
avg_positive_polarity         7.668e-03  1.027e-01   0.075  0.94050    
min_positive_polarity        -2.814e-01  8.602e-02  -3.272  0.00107 ** 
max_positive_polarity        -2.539e-02  3.240e-02  -0.784  0.43327    
avg_negative_polarity        -1.377e-01  9.463e-02  -1.455  0.14559    
min_negative_polarity         5.223e-03  3.450e-02   0.151  0.87966    
max_negative_polarity         7.905e-02  7.868e-02   1.005  0.31504    
title_subjectivity            6.118e-02  2.096e-02   2.918  0.00352 ** 
title_sentiment_polarity      7.488e-02  1.923e-02   3.894 9.89e-05 ***
abs_title_subjectivity        1.328e-01  2.787e-02   4.763 1.91e-06 ***
abs_title_sentiment_polarity  2.581e-02  3.031e-02   0.852  0.39446    
day_of_the_weekMonday        -7.522e-03  1.586e-02  -0.474  0.63533    
day_of_the_weekSaturday       2.211e-01  2.132e-02  10.370  < 2e-16 ***
day_of_the_weekSunday         2.110e-01  2.051e-02  10.288  < 2e-16 ***
day_of_the_weekThursday      -6.618e-02  1.554e-02  -4.260 2.05e-05 ***
day_of_the_weekTuesday       -7.458e-02  1.548e-02  -4.817 1.46e-06 ***
day_of_the_weekWednesday     -6.796e-02  1.547e-02  -4.393 1.12e-05 ***
data_channelEntertainment    -1.949e-02  2.767e-02  -0.704  0.48126    
data_channelLifestyle         5.756e-02  2.808e-02   2.050  0.04038 *  
data_channelOthers            1.808e-01  2.925e-02   6.180 6.48e-10 ***
data_channelSocial Media      3.173e-01  2.368e-02  13.398  < 2e-16 ***
data_channelTech              2.667e-01  2.518e-02  10.595  < 2e-16 ***
data_channelWorld             1.131e-01  2.709e-02   4.176 2.98e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8644 on 38408 degrees of freedom
Multiple R-squared:  0.1289,    Adjusted R-squared:  0.1277 
F-statistic: 105.3 on 54 and 38408 DF,  p-value: < 2.2e-16

The model explains 12.89% of the variance in the log of shares. The “NA” entries in the output indicate that the coefficient of LDA_04 could not be estimated because of a singularity in the model matrix, which occurs when a predictor is an exact linear combination of other predictors. The variable with the “NA” coefficient is therefore removed and the model is refitted.


Call:
lm(formula = shares ~ ., data = data[, -27])

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0764 -0.5433 -0.1628  0.3867  5.9779 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   6.581e+00  4.448e-01  14.798  < 2e-16 ***
n_tokens_title                6.952e-03  2.186e-03   3.180  0.00147 ** 
n_tokens_content              4.685e-05  1.683e-05   2.784  0.00538 ** 
n_unique_tokens               1.548e-01  1.446e-01   1.070  0.28459    
n_non_stop_words              5.822e-02  4.887e-02   1.191  0.23355    
n_non_stop_unique_tokens     -2.575e-01  1.228e-01  -2.096  0.03608 *  
num_hrefs                     4.240e-03  5.042e-04   8.409  < 2e-16 ***
num_self_hrefs               -7.307e-03  1.337e-03  -5.464 4.67e-08 ***
num_imgs                      1.450e-03  6.956e-04   2.084  0.03712 *  
num_videos                    1.526e-03  1.190e-03   1.282  0.19971    
average_token_length         -8.548e-02  1.829e-02  -4.673 2.98e-06 ***
kw_min_min                    9.322e-04  1.225e-04   7.612 2.77e-14 ***
kw_max_min                    1.831e-05  3.792e-06   4.827 1.39e-06 ***
kw_avg_min                   -1.357e-04  2.318e-05  -5.855 4.81e-09 ***
kw_min_max                   -4.010e-07  9.115e-08  -4.399 1.09e-05 ***
kw_max_max                    5.891e-08  4.328e-08   1.361  0.17353    
kw_avg_max                   -3.886e-07  5.928e-08  -6.554 5.68e-11 ***
kw_min_avg                   -5.703e-05  5.659e-06 -10.078  < 2e-16 ***
kw_max_avg                   -4.334e-05  1.930e-06 -22.451  < 2e-16 ***
kw_avg_avg                    3.457e-04  1.086e-05  31.831  < 2e-16 ***
self_reference_min_shares     6.659e-07  5.655e-07   1.178  0.23896    
self_reference_max_shares    -2.181e-08  3.069e-07  -0.071  0.94334    
self_reference_avg_sharess    1.525e-06  7.845e-07   1.944  0.05189 .  
LDA_00                        2.090e-01  3.456e-02   6.048 1.48e-09 ***
LDA_01                       -1.517e-01  3.846e-02  -3.944 8.03e-05 ***
LDA_02                       -2.435e-01  3.464e-02  -7.029 2.11e-12 ***
LDA_03                       -1.202e-01  3.653e-02  -3.289  0.00101 ** 
global_subjectivity           3.933e-01  6.404e-02   6.141 8.26e-10 ***
global_sentiment_polarity    -1.096e-01  1.254e-01  -0.874  0.38205    
global_rate_positive_words   -1.155e+00  5.388e-01  -2.144  0.03202 *  
global_rate_negative_words    4.728e-01  1.028e+00   0.460  0.64561    
rate_positive_words           2.926e-01  4.342e-01   0.674  0.50035    
rate_negative_words           1.327e-01  4.376e-01   0.303  0.76178    
avg_positive_polarity         7.668e-03  1.027e-01   0.075  0.94050    
min_positive_polarity        -2.814e-01  8.602e-02  -3.272  0.00107 ** 
max_positive_polarity        -2.539e-02  3.240e-02  -0.784  0.43327    
avg_negative_polarity        -1.377e-01  9.463e-02  -1.455  0.14559    
min_negative_polarity         5.223e-03  3.450e-02   0.151  0.87966    
max_negative_polarity         7.905e-02  7.868e-02   1.005  0.31504    
title_subjectivity            6.118e-02  2.096e-02   2.918  0.00352 ** 
title_sentiment_polarity      7.488e-02  1.923e-02   3.894 9.89e-05 ***
abs_title_subjectivity        1.328e-01  2.787e-02   4.763 1.91e-06 ***
abs_title_sentiment_polarity  2.581e-02  3.031e-02   0.852  0.39446    
day_of_the_weekMonday        -7.522e-03  1.586e-02  -0.474  0.63533    
day_of_the_weekSaturday       2.211e-01  2.132e-02  10.370  < 2e-16 ***
day_of_the_weekSunday         2.110e-01  2.051e-02  10.288  < 2e-16 ***
day_of_the_weekThursday      -6.618e-02  1.554e-02  -4.260 2.05e-05 ***
day_of_the_weekTuesday       -7.458e-02  1.548e-02  -4.817 1.46e-06 ***
day_of_the_weekWednesday     -6.796e-02  1.547e-02  -4.393 1.12e-05 ***
data_channelEntertainment    -1.949e-02  2.767e-02  -0.704  0.48126    
data_channelLifestyle         5.756e-02  2.808e-02   2.050  0.04038 *  
data_channelOthers            1.808e-01  2.925e-02   6.180 6.48e-10 ***
data_channelSocial Media      3.173e-01  2.368e-02  13.398  < 2e-16 ***
data_channelTech              2.667e-01  2.518e-02  10.595  < 2e-16 ***
data_channelWorld             1.131e-01  2.709e-02   4.176 2.98e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8644 on 38408 degrees of freedom
Multiple R-squared:  0.1289,    Adjusted R-squared:  0.1277 
F-statistic: 105.3 on 54 and 38408 DF,  p-value: < 2.2e-16

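The output indicates that approximately 12.89% of the variance in log of shares can be explained by the model. The estimates and fit statistics are identical to those of the previous fit, because the removed variable was an exact linear combination of the remaining predictors.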
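As a quick check on the summary above, the adjusted R-squared can be recomputed from the reported R-squared. This is an illustrative Python sketch, not part of the original R analysis. It assumes the standard formula \(1-(1-R^{2})\frac{n-1}{n-p-1}\), with p = 54 predictors and n taken as the residual degrees of freedom (38,408) plus the 55 estimated coefficients.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read from the model summary: residual df = 38408, 54 slopes + 1 intercept.
n_obs = 38408 + 55
adj = adjusted_r_squared(0.1289, n_obs, 54)
print(round(adj, 4))
```

With so many observations relative to the number of predictors, the adjustment is small, which is why the two R-squared values in the summary are nearly equal.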

DETECTION OF OUTLIERS AND INFLUENTIAL POINTS:

We now check whether there are any anomalies in the data set.

Detection of influential points by Cook’s distance:

Cook’s distance measures how much the fitted values of the regression change when an observation is deleted. It therefore combines the outlier and leverage diagnostics in a single measure. The Cook’s distance statistic is

\(D_{i}=\frac{\sum_{j=1}^{n}(\hat{y}_{j}-\hat{y}_{j(i)})^{2}}{ps^{2}}\) where \(p\) is the number of regression coefficients, \(s^{2}\) is the Mean Squared Error and \(\hat{y}_{j(i)}\) is the fitted value of the jth response after deleting the ith observation.

If \(D_{i}>\frac{4}{n}\), where n is the number of observations, then the ith observation is tagged as an influential point.
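The definition can be checked numerically on a toy example. The following is a minimal Python sketch, not the R code used for this report: it computes Cook’s distance for a one-predictor regression by literally refitting the model without each observation. The data and the function names fit_simple_ols and cooks_distance are hypothetical.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def cooks_distance(x, y):
    """Cook's distance from its definition: refit without observation i."""
    n, p = len(x), 2  # p counts the intercept and the slope
    b0, b1 = fit_simple_ols(x, y)
    fitted = [b0 + b1 * xj for xj in x]
    s2 = sum((yj - fj) ** 2 for yj, fj in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        a0, a1 = fit_simple_ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        fitted_del = [a0 + a1 * xj for xj in x]  # predictions for all n points
        shift = sum((f - g) ** 2 for f, g in zip(fitted, fitted_del))
        distances.append(shift / (p * s2))
    return distances


# Hypothetical data: the last point has an extreme x value and lies off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 20.0]

distances = cooks_distance(x, y)
influential = [i for i, d in enumerate(distances) if d > 4 / len(x)]
print(influential)
```

Refitting n times is only practical for small examples. For a data set of this size the equivalent closed form based on residuals and leverages would normally be used.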

The points above the red line are the influential points.

Detection of outliers using studentized residuals:

Studentized residuals are a type of standardized residual used in regression analysis to assess the fit of a model. These help in identifying outliers and influential data points. The studentized residual statistic is \(e_{i}^{s}=\frac{e_{i}}{\hat\sigma_{(i)} {\sqrt{1-h_{ii}}}}\) where

\(e_{i}=y_{i}-\hat{y}_{i}\) is the value of the ith residual (the difference between the observed value and the predicted value).

\(\hat\sigma_{(i)}\) is the standard deviation of the residuals calculated without the ith observation.

\(h_{ii}\) is the leverage of the ith observation, a measure of the influence of the ith data point on the fitted value.

At 5% level of significance, if \(e_{i}^{s}>2\) then the ith observation can be tagged as an outlier.
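To make the statistic concrete, here is a minimal Python sketch (not the R code behind this report) that computes studentized residuals for a one-predictor regression. It estimates \(\hat\sigma_{(i)}\) by refitting without the ith observation and applies the cut-off of 2 stated above. The data and function names are hypothetical, and the leverage formula used is the one specific to simple regression.

```python
import math


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def studentized_residuals(x, y):
    """e_i / (sigma_(i) * sqrt(1 - h_ii)), with sigma_(i) from a leave-one-out fit."""
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_simple_ols(x, y)
    result = []
    for i in range(n):
        e_i = y[i] - (b0 + b1 * x[i])
        h_ii = 1 / n + (x[i] - x_bar) ** 2 / sxx  # leverage in simple regression
        x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1 = fit_simple_ols(x_del, y_del)
        rss_del = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(x_del, y_del))
        sigma_del = math.sqrt(rss_del / (n - 1 - p))
        result.append(e_i / (sigma_del * math.sqrt(1 - h_ii)))
    return result


# Hypothetical data: y is roughly 2x, except for one inflated response at x = 6.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 17.0, 14.1, 15.9, 18.2, 19.8]

t = studentized_residuals(x, y)
outliers = [i for i, ti in enumerate(t) if ti > 2]
print(outliers)
```

Because \(\hat\sigma_{(i)}\) excludes the observation being judged, a single aberrant response does not inflate its own yardstick, so the inflated point stands out clearly.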

The points above the green line indicate the outliers of the data.

The points that are both influential and outlying are removed from the dataset to prepare the data for further analysis.

Original dimension: 38463 46
New dimension: 37151 46

The number of observations drops from 38,463 to 37,151, so 1,312 observations are removed.

Fitting the multiple linear regression model after removing the outliers:


Call:
lm(formula = shares ~ ., data = data1[, -27])

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8766 -0.4689 -0.1118  0.3902  2.6906 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   6.614e+00  3.715e-01  17.803  < 2e-16 ***
n_tokens_title                2.228e-03  1.858e-03   1.199 0.230451    
n_tokens_content              2.545e-05  1.434e-05   1.774 0.076076 .  
n_unique_tokens              -1.891e-01  1.232e-01  -1.535 0.124733    
n_non_stop_words              2.085e-01  4.165e-02   5.006 5.59e-07 ***
n_non_stop_unique_tokens     -1.275e-01  1.046e-01  -1.219 0.222908    
num_hrefs                     3.463e-03  4.323e-04   8.010 1.18e-15 ***
num_self_hrefs               -5.618e-03  1.134e-03  -4.953 7.34e-07 ***
num_imgs                      4.591e-04  5.955e-04   0.771 0.440713    
num_videos                    5.734e-04  1.023e-03   0.561 0.575018    
average_token_length         -6.040e-02  1.563e-02  -3.866 0.000111 ***
kw_min_min                    7.996e-04  1.047e-04   7.639 2.24e-14 ***
kw_max_min                    2.446e-05  3.405e-06   7.182 7.00e-13 ***
kw_avg_min                   -1.752e-04  2.126e-05  -8.240  < 2e-16 ***
kw_min_max                   -4.423e-07  7.813e-08  -5.661 1.52e-08 ***
kw_max_max                    5.253e-08  3.690e-08   1.424 0.154557    
kw_avg_max                   -4.466e-07  5.066e-08  -8.815  < 2e-16 ***
kw_min_avg                   -4.371e-05  4.846e-06  -9.019  < 2e-16 ***
kw_max_avg                   -4.278e-05  1.683e-06 -25.413  < 2e-16 ***
kw_avg_avg                    3.291e-04  9.411e-06  34.968  < 2e-16 ***
self_reference_min_shares     1.574e-07  4.824e-07   0.326 0.744176    
self_reference_max_shares    -1.373e-07  2.643e-07  -0.519 0.603419    
self_reference_avg_sharess    1.574e-06  6.726e-07   2.339 0.019315 *  
LDA_00                        2.387e-01  2.932e-02   8.141 4.03e-16 ***
LDA_01                       -1.664e-01  3.276e-02  -5.078 3.83e-07 ***
LDA_02                       -2.036e-01  2.943e-02  -6.919 4.64e-12 ***
LDA_03                       -1.456e-01  3.113e-02  -4.677 2.92e-06 ***
global_subjectivity           2.757e-01  5.458e-02   5.051 4.42e-07 ***
global_sentiment_polarity    -1.309e-01  1.075e-01  -1.217 0.223526    
global_rate_positive_words   -4.451e-01  4.617e-01  -0.964 0.335103    
global_rate_negative_words    9.244e-02  8.905e-01   0.104 0.917326    
rate_positive_words           2.301e-01  3.623e-01   0.635 0.525412    
rate_negative_words           7.112e-02  3.653e-01   0.195 0.845662    
avg_positive_polarity         2.119e-02  8.769e-02   0.242 0.809100    
min_positive_polarity        -3.286e-01  7.371e-02  -4.458 8.31e-06 ***
max_positive_polarity        -4.947e-02  2.750e-02  -1.798 0.072107 .  
avg_negative_polarity        -7.956e-02  8.062e-02  -0.987 0.323674    
min_negative_polarity         3.124e-03  2.934e-02   0.106 0.915218    
max_negative_polarity         1.111e-01  6.723e-02   1.653 0.098373 .  
title_subjectivity            5.866e-02  1.790e-02   3.277 0.001051 ** 
title_sentiment_polarity      7.379e-02  1.651e-02   4.470 7.86e-06 ***
abs_title_subjectivity        9.923e-02  2.375e-02   4.178 2.95e-05 ***
abs_title_sentiment_polarity -2.065e-02  2.598e-02  -0.795 0.426765    
day_of_the_weekMonday        -2.696e-02  1.347e-02  -2.001 0.045432 *  
day_of_the_weekSaturday       2.249e-01  1.809e-02  12.430  < 2e-16 ***
day_of_the_weekSunday         2.156e-01  1.744e-02  12.366  < 2e-16 ***
day_of_the_weekThursday      -7.086e-02  1.318e-02  -5.375 7.72e-08 ***
day_of_the_weekTuesday       -7.487e-02  1.313e-02  -5.702 1.19e-08 ***
day_of_the_weekWednesday     -7.907e-02  1.313e-02  -6.021 1.75e-09 ***
data_channelEntertainment    -3.441e-02  2.356e-02  -1.461 0.144123    
data_channelLifestyle         2.330e-02  2.395e-02   0.973 0.330490    
data_channelOthers            1.529e-01  2.506e-02   6.100 1.07e-09 ***
data_channelSocial Media      3.266e-01  2.007e-02  16.271  < 2e-16 ***
data_channelTech              2.925e-01  2.132e-02  13.718  < 2e-16 ***
data_channelWorld             9.620e-02  2.301e-02   4.181 2.90e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7211 on 37096 degrees of freedom
Multiple R-squared:  0.1554,    Adjusted R-squared:  0.1542 
F-statistic: 126.4 on 54 and 37096 DF,  p-value: < 2.2e-16

The output indicates that approximately 15.54% of the variance in log of shares can be explained by the model and the residual standard error is 0.7211. Compared with the previous models, R-squared rises from 12.89% to 15.54% and the residual standard error falls from 0.8644 to 0.7211, a modest improvement.

CHECKING MODEL ASSUMPTIONS

Heteroscedasticity:

The Mashable dataset may show heteroscedasticity because different types of articles can have varying numbers of shares. Some articles might go viral, while others do not get much attention, leading to differences in variance. Changes in popularity over time and factors like the author’s reputation can also affect shares.

Test of homoscedasticity:

Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 494.5087, Df = 1, p = < 2.22e-16

The residual plot indicates heteroscedasticity in the model. The non-constant variance score test also has a p-value < 2.22e-16, which leads to rejection of the null hypothesis of homoscedasticity.

-

Linearity:

The pattern of the residuals in the plot violates the assumption of linearity.

-

Autocorrelation:

News popularity may exhibit autocorrelation if the observations are collected over time and are correlated with their past values. If the dataset includes time-series data, such as the number of shares or views of articles over time, the values at one time point are likely to be influenced by values at previous time points. News articles may follow trends in which the popularity of one article affects the popularity of subsequent articles, leading to positive autocorrelation. Certain topics may also perform better at specific times of the year, which can create autocorrelation.

Autocorrelation is defined as the correlation between the members of a series of observations. We need to test whether \(cov(\epsilon_{i},\epsilon_{j})=0\) \(\forall i\ne j\).

We use the Durbin-Watson test for detecting autocorrelation: \(d=\frac{\sum_{i=2}^{n}(\hat{\epsilon}_{i}-\hat{\epsilon}_{i-1})^{2}}{\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}}\), where \(\hat{\epsilon}_{i}\) is the ith residual.

The following assumptions are made to use the statistic d:

• Model includes intercept term

• Explanatory variables are non stochastic

• \(\epsilon_{i}'s\) are generated from an AR(1) model, i.e., \(\epsilon_{i}=\rho\epsilon_{i-1}+u_{i}\forall i=1(1)n\)

• \(\epsilon_{i}\sim N(0,\sigma^{2})\forall i=1(1)n\)

• No missing observations

Sample correlation estimate: \(\hat{\rho}=\frac{\sum_{i=2}^{n}\hat{\epsilon}_{i}\hat{\epsilon}_{i-1}}{\sqrt{\sum_{i=2}^{n}\hat{\epsilon}_{i-1}^{2}\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}}}\)

Assuming \(\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}\approx\sum_{i=2}^{n}\hat{\epsilon}_{i-1}^{2}\), we have \(d\approx2(1-\hat{\rho})\)

We want to test \(H_{0}:\rho=0\) against \(H_{1}:\rho\ne0\)

If \(d\) value turns out to be near \(2\) then there is no autocorrelation.

[1] 1.933195

The value of the Durbin-Watson test statistic d is close to 2. Thus, the null hypothesis cannot be rejected at the 5% level of significance, and we conclude that there is no autocorrelation in the error terms.
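The behaviour of the statistic can be seen on toy residual sequences. The sketch below is illustrative Python, not the R call that produced the value above. The residual lists are hypothetical, and the last lines simply apply the approximation \(d\approx2(1-\hat{\rho})\) to the reported value of d.

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that flip sign every step (negative autocorrelation).
alternating = [1.0, -1.0] * 5
# Hypothetical residuals that stay on one side for long runs (positive autocorrelation).
persistent = [1.0] * 5 + [-1.0] * 5

d_alternating = durbin_watson(alternating)
d_persistent = durbin_watson(persistent)
print(d_alternating, d_persistent)

# Implied first-order autocorrelation for the reported statistic, via d = 2(1 - rho).
reported_d = 1.933195
implied_rho = 1 - reported_d / 2
print(round(implied_rho, 4))
```

Values of d near 0 point to positive autocorrelation and values near 4 to negative autocorrelation. The reported statistic corresponds to an estimated first-order autocorrelation of roughly 0.03.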

-

Multicollinearity

Multicollinearity may arise in the Mashable online news dataset for several reasons. Many features measure similar aspects of the articles, such as n_tokens_title and n_unique_tokens, leading to high correlations. Some features could be derived from others, like n_tokens_content being related to n_unique_tokens, which can create redundancy. Articles on similar topics or from the same author may also exhibit correlated characteristics, contributing to multicollinearity.

Multicollinearity means the existence of an exact or nearly exact linear relationship among the explanatory variables in a regression model. With p regressors, an exact relationship is said to exist if \(c_{1}x_{i1}+c_{2}x_{i2}+...+c_{p}x_{ip}=0\) for all i, where the constants \(c_{j}\) are not all zero. In terms of linear algebra, at least one column of X is then a linear combination of the others, X is not of full column rank and, as a result, X’X is not invertible.

In order to detect multicollinearity, we use a standard measure known as Variance Inflation Factor (VIF).

In the model \(Y_{i}=\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+...+\beta_{p}x_{ip}+\epsilon_{i}\forall i=1(1)n\), the VIF of the jth regressor is defined as \(VIF_{j}=\frac{1}{1-R_{(j)}^{2}}\), where \(R_{(j)}^{2}\) is the coefficient of determination from the regression of \(X_{j}\) on \((X_{1},X_{2},...,X_{j-1},X_{j+1},...,X_{p})\). \(VIF_{j}\) measures the dependence of \(X_{j}\) on all the other regressors. A large VIF value indicates multicollinearity in the model. As a rule of thumb, if \(VIF>5\) we conclude that there is multicollinearity in the model.
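The VIF definition is easiest to see with only two regressors, where \(R_{(j)}^{2}\) reduces to the squared Pearson correlation between them. The following is a minimal Python sketch under that simplifying assumption, not the R function used to produce the table below. The feature values and function names are hypothetical.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5


def vif_two_regressors(a, b):
    """VIF = 1 / (1 - R^2); with two regressors, R^2 is the squared correlation."""
    r = pearson_r(a, b)
    return 1 / (1 - r ** 2)


# Hypothetical regressors: x2 is almost a copy of x1, x3 is nearly unrelated to x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x3 = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]

vif_collinear = vif_two_regressors(x1, x2)
vif_unrelated = vif_two_regressors(x1, x3)
print(vif_collinear > 5, vif_unrelated > 5)
```

With more than two regressors, \(R_{(j)}^{2}\) has to come from a full auxiliary regression of \(X_{j}\) on all the others, but the interpretation is the same: the closer that R-squared is to 1, the larger the VIF.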

                                     GVIF Df GVIF^(1/(2*Df))
n_tokens_title                   1.098777  1        1.048226
n_tokens_content                 3.207900  1        1.791061
n_unique_tokens              14325.010060  1      119.687134
n_non_stop_words              3615.053887  1       60.125318
n_non_stop_unique_tokens      8882.445746  1       94.246728
num_hrefs                        1.687374  1        1.298990
num_self_hrefs                   1.379314  1        1.174442
num_imgs                         1.714136  1        1.309250
num_videos                       1.264432  1        1.124469
average_token_length             1.387139  1        1.177769
kw_min_min                       3.845149  1        1.960905
kw_max_min                      12.095228  1        3.477819
kw_avg_min                      11.795877  1        3.434513
kw_min_max                       1.381226  1        1.175256
kw_max_max                       4.543649  1        2.131584
kw_avg_max                       3.153582  1        1.775833
kw_min_avg                       2.110867  1        1.452882
kw_max_avg                       7.270848  1        2.696451
kw_avg_avg                      10.457942  1        3.233874
self_reference_min_shares        6.660905  1        2.580873
self_reference_max_shares        8.659105  1        2.942636
self_reference_avg_sharess      19.560266  1        4.422699
LDA_00                           4.368346  1        2.090059
LDA_01                           3.688890  1        1.920648
LDA_02                           4.992530  1        2.234397
LDA_03                           5.673060  1        2.381819
global_subjectivity              1.648839  1        1.284071
global_sentiment_polarity        7.487136  1        2.736263
global_rate_positive_words       3.993715  1        1.998428
global_rate_negative_words       6.221607  1        2.494315
rate_positive_words            209.986854  1       14.490923
rate_negative_words            213.008066  1       14.594796
avg_positive_polarity            3.951993  1        1.987962
min_positive_polarity            1.875229  1        1.369390
max_positive_polarity            2.439797  1        1.561985
avg_negative_polarity            6.729812  1        2.594188
min_negative_polarity            4.797791  1        2.190386
max_negative_polarity            2.849208  1        1.687960
title_subjectivity               2.376710  1        1.541658
title_sentiment_polarity         1.335435  1        1.155610
abs_title_subjectivity           1.434971  1        1.197903
abs_title_sentiment_polarity     2.414298  1        1.553801
day_of_the_week                  1.041632  6        1.003405
data_channel                    95.360060  6        1.461999

The variables n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, rate_positive_words and rate_negative_words have VIF values (GVIF^(1/(2 * Df))) greater than 5, indicating that these are the variables that give rise to multicollinearity. Therefore, multicollinearity is present in the model.

-

Normality:

Checking for normality in the news dataset by QQ-Plot:

The plot shows slight departures from normality near the tails of the distributions. Based on this Q-Q plot, it is reasonable to conclude that the data is approximately normally distributed, with some minor deviations.

-

Therefore, the model assumptions of homoscedasticity and linearity do not hold, and multicollinearity is detected in the model.

The data is first split into training and test sets, and regularization methods such as ridge and lasso are then applied to handle the multicollinearity present in the data.

TRAIN AND TEST DATA

In order to determine the model efficiency, the data is divided into two parts:

training dataset: subset to train a model.

testing dataset: subset to test the trained model.

The data is divided into training and testing dataset in the ratio 80:20.
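An 80:20 split of this kind can be sketched as follows. This is illustrative Python rather than the R code used in the study. The function name train_test_split, the seed and the stand-in data of 1,000 row indices are all assumptions made for the example.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle row positions reproducibly and cut them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed so the split can be repeated
    cut = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Hypothetical stand-in for the cleaned data set: 1,000 row identifiers.
rows = list(range(1000))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters when the rows are stored in a meaningful order (for example by publication date), because otherwise the test set would contain only the most recent articles.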

RIDGE REGRESSION

Loaded glmnet 4.1-8

The cv.glmnet function is used for cross-validation of a generalized linear model with regularization. The cv.glmnet produces a plot that helps in selecting the best model based on the cross-validation error.

Lambda \(\lambda\) on the x-axis: This is the tuning parameter for the regularization strength. Smaller values of \(\lambda\) mean less regularization, and larger values mean more regularization.

Mean Squared Error (MSE) or Deviance on the y-axis: This is the measure of prediction error for the model. The plot shows how the error changes with different values of \(\lambda\).

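Dots and Error Bars: Each dot represents the mean cross-validated error for a given value of \(\lambda\). The error bars show the variability of the error estimate (one standard error above and below the mean).

Two Vertical Dashed Lines: In the plot produced by cv.glmnet, one line marks lambda.min, the value of \(\lambda\) with the smallest mean cross-validated error, and the other marks lambda.1se, the largest value of \(\lambda\) whose error is within one standard error of that minimum.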

x-axis Labels: The values of \(\lambda\) are plotted on a logarithmic scale to better visualize the range of \(\lambda\) values.

This plot is crucial for understanding how different levels of regularization affect model performance and for selecting an optimal balance between bias and variance.
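How the regularization strength acts on a coefficient can be shown in the simplest possible setting: one predictor and no intercept. The sketch below is illustrative Python and does not reproduce glmnet or its scaling of \(\lambda\). It assumes the ridge objective \(\frac{1}{2}\sum(y_{i}-\beta x_{i})^{2}+\frac{\lambda}{2}\beta^{2}\) and the lasso objective \(\frac{1}{2}\sum(y_{i}-\beta x_{i})^{2}+\lambda|\beta|\), whose minimizers have the closed forms used in the code. The data are hypothetical.

```python
def ridge_coef(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + lam * |b| (soft thresholding)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * max(abs(sxy) - lam, 0.0) / sxx


# Hypothetical data with a true slope of about 2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

lambdas = [0.0, 1.0, 10.0, 25.0, 100.0]
ridge_path = [ridge_coef(x, y, lam) for lam in lambdas]
lasso_path = [lasso_coef(x, y, lam) for lam in lambdas]
for lam, r, l in zip(lambdas, ridge_path, lasso_path):
    print(f"lambda={lam:6.1f}  ridge={r:.3f}  lasso={l:.3f}")
```

In this toy setting the ridge coefficient shrinks smoothly towards zero without ever reaching it, whereas the lasso coefficient becomes exactly zero once \(\lambda\) is large enough. That difference is what allows lasso to drop predictors entirely.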

[1] "Coefficients of the ridge model with best predictive accuracy"
56 x 1 sparse Matrix of class "dgCMatrix"
                                        s0
(Intercept)                   .           
n_tokens_title                2.031639e-02
n_tokens_content              2.272256e-05
n_unique_tokens               3.069142e-04
n_non_stop_words              2.523671e-04
n_non_stop_unique_tokens      4.466687e-04
num_hrefs                     7.615001e-04
num_self_hrefs                2.000594e-03
num_imgs                      5.840392e-04
num_videos                    6.350604e-04
average_token_length          5.134859e-01
kw_min_min                    4.975249e-05
kw_max_min                    6.397873e-07
kw_avg_min                    6.907710e-06
kw_min_max                    3.419849e-08
kw_max_max                    1.410612e-07
kw_avg_max                    1.289991e-07
kw_min_avg                    7.805378e-06
kw_max_avg                    1.329564e-06
kw_avg_avg                    1.626707e-05
self_reference_min_shares     8.977204e-08
self_reference_max_shares     5.476251e-08
self_reference_avg_sharess    9.728616e-08
LDA_00                        2.396922e-02
LDA_01                        2.487085e-02
LDA_02                        2.273089e-02
LDA_03                        2.262960e-02
LDA_04                        2.562076e-02
global_subjectivity           5.123913e-01
global_sentiment_polarity     1.180979e-01
global_rate_positive_words    1.365239e+00
global_rate_negative_words    1.337089e+00
rate_positive_words           2.731969e-01
rate_negative_words           1.149571e-01
avg_positive_polarity         4.403814e-01
min_positive_polarity         1.756380e-01
max_positive_polarity         1.511942e-01
avg_negative_polarity        -1.603497e-01
min_negative_polarity        -6.012569e-02
max_negative_polarity        -1.082025e-01
title_subjectivity            2.357003e-02
title_sentiment_polarity      9.134744e-03
abs_title_subjectivity        8.425483e-02
abs_title_sentiment_polarity  2.686242e-02
day_of_the_weekMonday         1.043745e-02
day_of_the_weekSaturday       1.007414e-02
day_of_the_weekSunday         1.003946e-02
day_of_the_weekThursday       1.057160e-02
day_of_the_weekTuesday        1.060534e-02
day_of_the_weekWednesday      1.057031e-02
data_channelEntertainment     1.004860e-02
data_channelLifestyle         9.548962e-03
data_channelOthers            1.052349e-02
data_channelSocial Media      1.028977e-02
data_channelTech              1.158305e-02
data_channelWorld             1.039660e-02
Mean squared error: 12.77726
Standard Error: 0.06743605

The residuals roughly show a constant horizontal band around the mean residual errors line suggesting that the variances of the error terms are equal, thereby indicating no heteroscedasticity in the model.

-

LASSO REGRESSION

[1] "Coefficients of the lasso model with best predictive accuracy:"
56 x 1 sparse Matrix of class "dgCMatrix"
                                   s0
(Intercept)                  .       
n_tokens_title               .       
n_tokens_content             .       
n_unique_tokens              .       
n_non_stop_words             .       
n_non_stop_unique_tokens     .       
num_hrefs                    .       
num_self_hrefs               .       
num_imgs                     .       
num_videos                   .       
average_token_length         1.557366
kw_min_min                   .       
kw_max_min                   .       
kw_avg_min                   .       
kw_min_max                   .       
kw_max_max                   .       
kw_avg_max                   .       
kw_min_avg                   .       
kw_max_avg                   .       
kw_avg_avg                   .       
self_reference_min_shares    .       
self_reference_max_shares    .       
self_reference_avg_sharess   .       
LDA_00                       .       
LDA_01                       .       
LDA_02                       .       
LDA_03                       .       
LDA_04                       .       
global_subjectivity          .       
global_sentiment_polarity    .       
global_rate_positive_words   .       
global_rate_negative_words   .       
rate_positive_words          .       
rate_negative_words          .       
avg_positive_polarity        .       
min_positive_polarity        .       
max_positive_polarity        .       
avg_negative_polarity        .       
min_negative_polarity        .       
max_negative_polarity        .       
title_subjectivity           .       
title_sentiment_polarity     .       
abs_title_subjectivity       .       
abs_title_sentiment_polarity .       
day_of_the_weekMonday        .       
day_of_the_weekSaturday      .       
day_of_the_weekSunday        .       
day_of_the_weekThursday      .       
day_of_the_weekTuesday       .       
day_of_the_weekWednesday     .       
data_channelEntertainment    .       
data_channelLifestyle        .       
data_channelOthers           .       
data_channelSocial Media     .       
data_channelTech             .       
data_channelWorld            .       
Mean squared error: 0.8075717
Standard Error: 0.01515756

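Lasso regression is not suitable in this case: it shrinks the coefficient of every predictor except average_token_length to zero, and so drops all of the other predictors, including the important ones. Hence, the ridge regression model is the most suitable, as it retains all predictors and handles multicollinearity. Ridge regression does not by itself remove heteroscedasticity, but the residual plot of the ridge model shows no sign of it.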
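The different behaviour of the two penalties can be seen in the simplest textbook setting, a single coefficient under an orthonormal design. In that case the lasso solution is the soft-thresholded least-squares estimate, while the ridge solution is a proportional shrinkage. The sketch below assumes that simplified setting and hypothetical coefficient values; it is not the glmnet fit reported above, and glmnet scales its penalty differently.

```python
def lasso_shrink(b, lam):
    """Minimiser of 0.5*(b - beta)**2 + lam*abs(beta): soft-thresholding.
    Estimates with abs(b) <= lam are set exactly to zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """Minimiser of 0.5*(b - beta)**2 + 0.5*lam*beta**2: proportional shrinkage.
    A non-zero estimate is made smaller but never exactly zero."""
    return b / (1.0 + lam)


# Hypothetical least-squares estimates, one clearly larger than the rest.
ols = [1.5, 0.4, -0.2, 0.05]
lam = 0.5

lasso = [lasso_shrink(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]
print("lasso:", lasso)
print("ridge:", [round(r, 4) for r in ridge])
```

With a large enough penalty only the strongest estimate survives the lasso, which mirrors the pattern in the output above, whereas ridge keeps every coefficient non-zero.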

-

CONCLUSION

After analyzing the data, our final model is:

Final Model:

\[
\begin{aligned}
\log(\mathrm{shares}_i) ={} & 0.0202 \times \mathrm{n\_tokens\_title}_i + 0.000022 \times \mathrm{n\_tokens\_content}_i + 0.00031 \times \mathrm{n\_unique\_tokens}_i \\
& + 0.00025 \times \mathrm{n\_non\_stop\_words}_i + 0.00045 \times \mathrm{n\_non\_stop\_unique\_tokens}_i \\
& + 0.00077 \times \mathrm{num\_hrefs}_i + 0.0021 \times \mathrm{num\_self\_hrefs}_i + 0.00058 \times \mathrm{num\_imgs}_i + 0.00061 \times \mathrm{num\_videos}_i \\
& + 0.516 \times \mathrm{average\_token\_length}_i \\
& + 0.00005 \times \mathrm{kw\_min\_min}_i + 0.0000006 \times \mathrm{kw\_max\_min}_i + 0.000006 \times \mathrm{kw\_avg\_min}_i \\
& + 0.00000003 \times \mathrm{kw\_min\_max}_i + 0.0000001 \times \mathrm{kw\_max\_max}_i + 0.0000001 \times \mathrm{kw\_avg\_max}_i \\
& + 0.000007 \times \mathrm{kw\_min\_avg}_i + 0.000001 \times \mathrm{kw\_max\_avg}_i + 0.00001 \times \mathrm{kw\_avg\_avg}_i \\
& + 0.00000009 \times \mathrm{self\_reference\_min\_shares}_i + 0.00000005 \times \mathrm{self\_reference\_max\_shares}_i \\
& + 0.00000009 \times \mathrm{self\_reference\_avg\_sharess}_i \\
& + 0.024 \times \mathrm{LDA\_00}_i + 0.024 \times \mathrm{LDA\_01}_i + 0.022 \times \mathrm{LDA\_02}_i + 0.022 \times \mathrm{LDA\_03}_i + 0.025 \times \mathrm{LDA\_04}_i \\
& + 0.517 \times \mathrm{global\_subjectivity}_i + 0.119 \times \mathrm{global\_sentiment\_polarity}_i \\
& + 1.375 \times \mathrm{global\_rate\_positive\_words}_i + 1.362 \times \mathrm{global\_rate\_negative\_words}_i \\
& + 0.275 \times \mathrm{rate\_positive\_words}_i + 0.115 \times \mathrm{rate\_negative\_words}_i \\
& + 0.438 \times \mathrm{avg\_positive\_polarity}_i + 0.175 \times \mathrm{min\_positive\_polarity}_i + 0.149 \times \mathrm{max\_positive\_polarity}_i \\
& - 0.159 \times \mathrm{avg\_negative\_polarity}_i - 0.597 \times \mathrm{min\_negative\_polarity}_i - 0.111 \times \mathrm{max\_negative\_polarity}_i \\
& + 0.023 \times \mathrm{title\_subjectivity}_i + 0.0091 \times \mathrm{title\_sentiment\_polarity}_i \\
& + 0.083 \times \mathrm{abs\_title\_subjectivity}_i + 0.026 \times \mathrm{abs\_title\_sentiment\_polarity}_i \\
& + 0.0103 \times \mathrm{day\_of\_the\_weekMonday}_i + 0.0099 \times \mathrm{day\_of\_the\_weekSaturday}_i + 0.0099 \times \mathrm{day\_of\_the\_weekSunday}_i \\
& + 0.0104 \times \mathrm{day\_of\_the\_weekThursday}_i + 0.0104 \times \mathrm{day\_of\_the\_weekTuesday}_i + 0.0104 \times \mathrm{day\_of\_the\_weekWednesday}_i \\
& + 0.0099 \times \mathrm{data\_channelEntertainment}_i + 0.0094 \times \mathrm{data\_channelLifestyle}_i + 0.0103 \times \mathrm{data\_channelOthers}_i \\
& + 0.0101 \times \mathrm{data\_channelSocial\ Media}_i + 0.0114 \times \mathrm{data\_channelTech}_i + 0.0103 \times \mathrm{data\_channelWorld}_i \\
& + \epsilon_i, \qquad i = 1, 2, \dots, 39644
\end{aligned}
\]

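Since the dependent variable, shares, is log-transformed, each coefficient \(\beta_j\) is the change in \(\log(shares)\) for a one-unit increase in the corresponding independent variable, holding the others fixed. This corresponds to a \(100 \times (e^{\beta_j} - 1)\) percent change in shares, which is approximately \(100 \times \beta_j\) percent when \(\beta_j\) is small. For example, if the number of words in an article or post increases by one, the shares of that post are expected to rise by about 0.0022%. The largest positive coefficient belongs to the rate of positive words in the content, global_rate_positive_words, at 1.375. Conversely, the minimum polarity of negative words, min_negative_polarity, has the most negative coefficient, at -0.597. Both variables are rates or polarity scores rather than counts, so a full one-unit change is not a realistic step: a 0.01 increase corresponds to roughly a 1.4% rise and a 0.6% fall in shares, respectively. Because the predictors are measured on different scales, the size of a raw coefficient alone does not rank their importance.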
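In a model with a log-transformed response, a coefficient \(\beta\) translates into a \(100 \times (e^{\beta \Delta} - 1)\) percent change in shares for a change of \(\Delta\) in the predictor. The short sketch below applies that conversion to three coefficients taken from the final model; the helper name `pct_change` and the choice of a 0.01 step for the rate and polarity variables are assumptions for illustration.

```python
import math


def pct_change(beta, delta=1.0):
    """Percent change in shares implied by a change of `delta` in a predictor
    whose coefficient on log(shares) is `beta`."""
    return 100.0 * (math.exp(beta * delta) - 1.0)


# Coefficients copied from the final model above.
coefs = {
    "n_tokens_content": 0.000022,
    "global_rate_positive_words": 1.375,
    "min_negative_polarity": -0.597,
}

for name, beta in coefs.items():
    per_unit = pct_change(beta)         # one-unit increase
    per_step = pct_change(beta, 0.01)   # 0.01 increase (assumed step size)
    print(f"{name}: {per_unit:+.4f}% per unit, {per_step:+.4f}% per 0.01")
```

For small coefficients the exact conversion and the \(100 \times \beta\) shortcut agree closely; for coefficients near or above 1 in absolute value they diverge, so the exact form is the safer one to report.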

-

Final Findings:

The analysis shows more popular than unpopular articles: 20,464 in the popular group compared to 17,999 in the unpopular group.

Articles with titles of 8 to 16 words and content under 1,600 words tend to garner the highest shares. Articles featuring 0 to 45 links and a positive word polarity between 0.2 and 0.6 are more likely to be popular. Articles with 1 to 40 images and 1 to 15 videos also tend to be more popular. Furthermore, an average word length of 4 to 6 and more than five keywords in the metadata are associated with higher shareability.

Subject-wise, articles in the World, Tech and Entertainment categories are published most frequently, while the Business, Tech and Social Media channels attract the greatest shares. Interestingly, while most articles are published on weekdays, those released on weekends (Saturday and Sunday) achieve higher share counts.

The regression model indicates that various factors, including the number of tokens in titles and content, unique tokens, and sentiment polarity, are associated with the log of shares. Notably, the coefficients on global subjectivity, global sentiment polarity and the positive word rate suggest that emotional tone plays an important role in article popularity.

In summary, the study underscores the importance of title length, content characteristics, multimedia elements and emotional tone in driving article shares, providing actionable insights for content creators aiming to boost engagement.

-