Factors Influencing the Popularity of Online News

Introduction

“What time should I post it to get the most likes and shares?” This question is commonly heard among groups of friends, and it shows that, as a society, we place growing emphasis on seeming “popular” on the Internet.

This study focused on articles found on Mashable and analyzed which factors lead to the most shares of articles published on its website. Mashable is a digital media website, meaning that all the content it creates can only be found on its media platforms. It describes itself as “a global, multi-platform media and entertainment company” (http://mashable.com/about/). The target audience for Mashable is millennials who are on their phones and using technology, constantly ready to share the articles they find interesting on their social media platforms. Mashable allows and encourages its readers to share its articles on their personal social media, and it displays a running tally of shares at the top of each article.

When browsing through the site or through a newspaper, the title is the first and often the only part of the article the reader will read. If the title seems interesting, the individual is more likely to open the article and read it, which also increases the chance of the article being shared. According to an article by Kevan Lee published on Buffer (https://blog.bufferapp.com/the-ideal-length-of-everything-online-according-to-science), people reading a title are most likely to take in the first three and last three words, making the ideal length for a title or headline 6 words. This suggests that, in this study, the articles with the most shares should have titles of around 6 words.

Mashable articles are split into seven categories - social media, technology, business, entertainment, world, lifestyle, and watercooler. Watercooler articles are miscellaneous articles that do not fit in any of the other categories. They are typically amusing, entertaining, and lighthearted pieces about the content that is currently most popular on social media, or quizzes that reveal a quirky aspect of an individual. According to Ragan (https://www.ragan.com/Main/Articles/Infographic_The_3_content_types_that_get_the_most_49836.aspx), 80% of the shared articles on the Internet in 2014 were quizzes. According to an article published by the Huffington Post (http://www.huffingtonpost.com/noah-kagan/why-content-goes-viral-wh_b_5492767.html), 25% of the most popular articles online evoked the emotion of “awe” in readers, 17% evoked laughter, and 15% conveyed amusement. These figures suggest that readers like to be entertained and want their friends to be entertained, which is why they share the articles.

Based on this, it is predicted that the greatest factors influencing the popularity of an online article are the number of words in the article title and the article type. The articles with the most shares should have titles of around 6 words and be watercooler articles.

Methods

The dataset was retrieved from the UCI Repository. The link to retrieve the data is: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity. The CSV file was downloaded from the website and imported using the read.csv command.

#retrieve data
news = read.csv("OnlineNewsPopularity.csv")

The dataset comprises articles published by Mashable over a two-year period beginning in January 2013 (the first rows shown below are dated 2013/01/07). Each Mashable article contains a counter for the number of shares. The other statistics were added to the dataset by its authors - Kelwin Fernandes, Pedro Vinagre, Paulo Cortez, and Pedro Sernadela. The attributes for global subjectivity and polarity were determined by the authors using a Random Forest classifier and rolling windows as an assessment method. However, these variables were not included in this study.

Of the 61 attributes found in the dataset, only the following were kept for this study: the URL, the number of tokens (words) in the article title, the number of tokens in the article, the number of images, the six attributes denoting the article type, the seven attributes denoting the day published, and the number of shares. The remaining attributes, such as the average polarity of negative words found in the article or the average keyword, were deemed too detailed for this study and were removed to keep the focus on the chosen factors.

#select wanted columns and display the data set 
newsCleanedUp <- news[ , c(1, 3, 4, 10, 14, 15, 16, 17, 18, 19, 32, 33, 34, 35, 36, 37, 38, 61)]
head(newsCleanedUp)

##                                                              url
## 1   http://mashable.com/2013/01/07/amazon-instant-video-browser/
## 2    http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/
## 3 http://mashable.com/2013/01/07/apple-40-billion-app-downloads/
## 4       http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/
## 5               http://mashable.com/2013/01/07/att-u-verse-apps/
## 6               http://mashable.com/2013/01/07/beewi-smart-toys/
##   n_tokens_title n_tokens_content num_imgs data_channel_is_lifestyle
## 1             12              219        1                         0
## 2              9              255        1                         0
## 3              9              211        1                         0
## 4              9              531        1                         0
## 5             13             1072       20                         0
## 6             10              370        0                         0
##   data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
## 1                             1                   0                      0
## 2                             0                   1                      0
## 3                             0                   1                      0
## 4                             1                   0                      0
## 5                             0                   0                      0
## 6                             0                   0                      0
##   data_channel_is_tech data_channel_is_world weekday_is_monday
## 1                    0                     0                 1
## 2                    0                     0                 1
## 3                    0                     0                 1
## 4                    0                     0                 1
## 5                    1                     0                 1
## 6                    1                     0                 1
##   weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## 1                  0                    0                   0
## 2                  0                    0                   0
## 3                  0                    0                   0
## 4                  0                    0                   0
## 5                  0                    0                   0
## 6                  0                    0                   0
##   weekday_is_friday weekday_is_saturday weekday_is_sunday shares
## 1                 0                   0                 0    593
## 2                 0                   0                 0    711
## 3                 0                   0                 0   1500
## 4                 0                   0                 0   1200
## 5                 0                   0                 0    505
## 6                 0                   0                 0    855
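Selecting columns by position, as above, depends on the exact column order of the CSV file. A minimal sketch of a name-based selection follows. It runs on a small made-up data frame (`toy`) rather than the real file, and the helper vectors `fixed`, `flagged`, and `wanted` are illustrative names that are not part of the original analysis.

```r
# Toy stand-in for the imported data; all values are made up for illustration.
toy <- data.frame(
  url = c("a", "b"),
  timedelta = c(731, 731),
  n_tokens_title = c(12, 9),
  n_tokens_content = c(219, 255),
  num_imgs = c(1, 1),
  data_channel_is_tech = c(0, 1),
  weekday_is_monday = c(1, 0),
  weekday_is_tuesday = c(0, 1),
  shares = c(593, 711)
)

# Pick columns by exact name or by shared prefix instead of by position.
fixed   <- c("url", "n_tokens_title", "n_tokens_content", "num_imgs", "shares")
flagged <- grep("^(data_channel_is_|weekday_is_)", names(toy), value = TRUE)
wanted  <- c(fixed, flagged)

toySelected <- toy[, wanted]
names(toySelected)
```

Note that this sketch places `shares` before the indicator columns, so any later code that refers to columns by position would need to be adjusted accordingly.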

The data contained separate indicator columns for the day the article was published and for the article type. Instead of one column denoting the day of publication, there were seven columns, each representing a day of the week. If the article was published on that particular day, the column held a “1”; if not, it held a “0”. To clean this up, a new variable was generated with values from 1 through 7, with 1 representing Monday and so on. Similarly, a new column called “article type” was created to condense the six columns that noted the type of the article. Mashable separates its articles into seven types, but “watercooler” articles did not have a column of their own, so any article with a “0” in all six type columns was labeled watercooler. After the new columns were created, the individual columns for the article types and day published were removed to condense the data set. There were no NAs in the dataset.

#engineer a new variable newsCleanedUp$dayPublished that states what day the article was published
newsCleanedUp$dayPublished <- ""
newsCleanedUp$dayPublished[news$weekday_is_monday == "1"] <- "1"
newsCleanedUp$dayPublished[news$weekday_is_tuesday == "1"] <- "2"
newsCleanedUp$dayPublished[news$weekday_is_wednesday == "1"] <- "3"
newsCleanedUp$dayPublished[news$weekday_is_thursday == "1"] <- "4"
newsCleanedUp$dayPublished[news$weekday_is_friday == "1"] <- "5"
newsCleanedUp$dayPublished[news$weekday_is_saturday == "1"] <- "6"
newsCleanedUp$dayPublished[news$weekday_is_sunday == "1"] <- "7"
#changes this attribute from characters to numerics (base R only, no extra package needed)
class(newsCleanedUp$dayPublished)

## [1] "character"

newsCleanedUp$dayPublished <- as.numeric(as.character(newsCleanedUp$dayPublished))
summary(newsCleanedUp$dayPublished)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.00    3.41    5.00    7.00
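As an alternative to the seven assignments above, the one-hot weekday columns can be collapsed in a single step. The sketch below is only an illustration on a made-up data frame (`flags`), and it assumes that every row has exactly one “1” and that the columns are ordered Monday through Sunday.

```r
# Toy one-hot weekday flags; each row has exactly one 1 (made-up values).
flags <- data.frame(
  weekday_is_monday    = c(1, 0, 0),
  weekday_is_tuesday   = c(0, 0, 1),
  weekday_is_wednesday = c(0, 0, 0),
  weekday_is_thursday  = c(0, 0, 0),
  weekday_is_friday    = c(0, 0, 0),
  weekday_is_saturday  = c(0, 0, 0),
  weekday_is_sunday    = c(0, 1, 0)
)

# max.col() returns, for each row, the position of the largest value,
# so with columns ordered Monday..Sunday it yields the day number 1..7.
dayPublished <- max.col(as.matrix(flags), ties.method = "first")
dayPublished
```

A row with no day flagged would silently come back as 1 under this approach, so it is only a safe replacement when the one-flag-per-row assumption has been checked first.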

#engineer a new variable newsCleanedUp$articleType that states what the article type is
newsCleanedUp$articleType <- "watercooler"
newsCleanedUp$articleType[news$data_channel_is_lifestyle == "1"] <- "lifestyle"
newsCleanedUp$articleType[news$data_channel_is_entertainment == "1"] <- "entertainment"
newsCleanedUp$articleType[news$data_channel_is_bus == "1"] <- "business"
newsCleanedUp$articleType[news$data_channel_is_socmed == "1"] <- "social media"
newsCleanedUp$articleType[news$data_channel_is_tech == "1"] <- "technology"
newsCleanedUp$articleType[news$data_channel_is_world == "1"] <- "world"
#changes this attribute from characters to factors 
class(newsCleanedUp$articleType)

## [1] "character"

newsCleanedUp$articleType = factor(newsCleanedUp$articleType)
class(newsCleanedUp$articleType)

## [1] "factor"

summary(newsCleanedUp$articleType)

##      business entertainment     lifestyle  social media    technology 
##          6258          7057          2099          2323          7346 
##   watercooler         world 
##          6134          8427

#create final data set with needed attributes
newsFinal <- newsCleanedUp[ , c(1:4, 18, 19, 20)]
summary(newsFinal)

##                                                              url       
##  http://mashable.com/2013/01/07/amazon-instant-video-browser/  :    1  
##  http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/   :    1  
##  http://mashable.com/2013/01/07/apple-40-billion-app-downloads/:    1  
##  http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/      :    1  
##  http://mashable.com/2013/01/07/att-u-verse-apps/              :    1  
##  http://mashable.com/2013/01/07/beewi-smart-toys/              :    1  
##  (Other)                                                       :39638  
##  n_tokens_title n_tokens_content    num_imgs           shares      
##  Min.   : 2.0   Min.   :   0.0   Min.   :  0.000   Min.   :     1  
##  1st Qu.: 9.0   1st Qu.: 246.0   1st Qu.:  1.000   1st Qu.:   946  
##  Median :10.0   Median : 409.0   Median :  1.000   Median :  1400  
##  Mean   :10.4   Mean   : 546.5   Mean   :  4.544   Mean   :  3395  
##  3rd Qu.:12.0   3rd Qu.: 716.0   3rd Qu.:  4.000   3rd Qu.:  2800  
##  Max.   :23.0   Max.   :8474.0   Max.   :128.000   Max.   :843300  
##                                                                    
##   dayPublished         articleType  
##  Min.   :1.00   business     :6258  
##  1st Qu.:2.00   entertainment:7057  
##  Median :3.00   lifestyle    :2099  
##  Mean   :3.41   social media :2323  
##  3rd Qu.:5.00   technology   :7346  
##  Max.   :7.00   watercooler  :6134  
##                 world        :8427
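The statement that the dataset contains no NAs can be checked directly. The following is a minimal sketch on a made-up data frame (`toy`) with one deliberately missing value; the names `naCounts` and `noMissing` are illustrative and do not appear in the original analysis.

```r
# Toy data frame with one deliberately missing value (made-up values).
toy <- data.frame(
  shares       = c(593, 711, 1500),
  dayPublished = c(1, NA, 3)
)

# Count missing values per column; all zeros would mean no NAs.
naCounts <- colSums(is.na(toy))
naCounts

# TRUE only if the whole data frame is free of NAs.
noMissing <- !any(is.na(toy))
noMissing
```

Running the same two checks on the final data set after the new columns are created would also catch any article for which no weekday flag was set, since the empty string assigned as a default becomes NA when converted to a number.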

Results

#designate colors for each article type
newsFinal$color <- "deepskyblue"
newsFinal$color[newsFinal$articleType == "lifestyle"] <- "red"
newsFinal$color[newsFinal$articleType == "entertainment"] <- "green"
newsFinal$color[newsFinal$articleType == "business"] <- "cyan"
newsFinal$color[newsFinal$articleType == "social media"] <- "darkgoldenrod"
newsFinal$color[newsFinal$articleType == "technology"] <- "darkorchid"
newsFinal$color[newsFinal$articleType == "world"] <- "magenta"

#scatterplot comparing number of words in title vs. shares
plot(newsFinal$n_tokens_title, newsFinal$shares,
     xlab = "Number of Words in Article Title", ylab = "Number of Shares", ylim = c(0, 120000),
     main = "Number of Words in Article Title vs. Shares", 
     col = newsFinal$color)
legend("topright", legend = c("business", "entertainment", "lifestyle", "social media", "technology", "watercooler", "world"), 
       col = c("cyan", "green", "red", "darkgoldenrod", "darkorchid", "deepskyblue", "magenta"), pch = 16)

#determine the 1st and 3rd quartiles and the mean in the data set
test = newsFinal$n_tokens_title
quantile(test)

##   0%  25%  50%  75% 100% 
##    2    9   10   12   23

mean(newsFinal$shares)

## [1] 3395.38

In this scatterplot comparing the number of words in the article title with the number of shares, a clear bell-shaped pattern can be seen. Title lengths ranged from 2 to 23 words; the first quartile was at 9 words and the third quartile was at 12 words, and titles in this range achieved the highest numbers of shares. The colors on the graph indicate the article type of each article. Because of the large number of points, no single article type clearly dominates in this plot.
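Because tens of thousands of points overlap in a scatterplot, and because the share counts are heavily skewed (the summary above shows a mean of 3395 against a median of 1400), a per-length summary can complement the plot. The sketch below is an illustration on made-up numbers, not the study's data; `medianByLength` and `countByLength` are hypothetical names.

```r
# Made-up title lengths and share counts, for illustration only.
toy <- data.frame(
  n_tokens_title = c(6, 6, 9, 9, 10, 10, 12, 12),
  shares         = c(500, 700, 1200, 1600, 1500, 1300, 1100, 900)
)

# Median shares for each title length; the median is less sensitive
# than the mean to a handful of very widely shared articles.
medianByLength <- tapply(toy$shares, toy$n_tokens_title, median)
medianByLength

# Number of articles behind each median, to judge how reliable it is.
countByLength <- table(toy$n_tokens_title)
countByLength
```

Reading the medians alongside the counts helps separate two effects that the scatterplot mixes together: title lengths that tend to earn more shares, and title lengths that simply occur more often and therefore produce more extreme points.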

par(mfrow=c(1, 1))

#subset the data for all the business articles
business <- subset(newsFinal, newsFinal$articleType == "business")
#create a scatterplot comparing the number of words in the title vs. number of shares for business articles
plot(business$n_tokens_title, business$shares, xlab = "Number of Words in Article Title",
     ylab = "Number of Shares", main = "Number of Words in Article Title vs. Number of Shares \nfor Business Articles", 
     col = business$color, ylim = c(0, 120000))

#determine mean of the number of shares for business articles
mean(business$shares)

## [1] 3063.019

#subset the data for all the entertainment articles
entertainment <- subset(newsFinal, newsFinal$articleType == "entertainment")
#create a scatterplot comparing the number of words in the title vs. number of shares for entertainment articles
plot(entertainment$n_tokens_title, entertainment$shares, xlab = "Number of Words in Article Title",
     ylab = "Number of Shares", main = "Number of Words in Article Title vs. Number of Shares \nfor Entertainment Articles", 
     col = entertainment$color, ylim = c(0, 120000))

#determine mean of the number of shares for entertainment articles
mean(entertainment$shares)

## [1] 2970.487

#subset the data for all the lifestyle articles
lifestyle <- subset(newsFinal, newsFinal$articleType == "lifestyle")
#create a scatterplot comparing the number of words in the title vs. number of shares for lifestyle articles
plot(lifestyle$n_tokens_title, lifestyle$shares, xlab = "Number of Words in Article Title",
     ylab = "Number of Shares", main = "Number of Words in Article Title vs. Number of Shares \nfor Lifestyle Articles", 
     col = lifestyle$color, ylim = c(0, 120000))

#determine mean of the number of shares for lifestyle articles
mean(lifestyle$shares)

## [1] 3682.123

#subset the data for all the social media articles
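socialmedianews <- subset(newsFinal, newsFinal$articleType == "social media")
#create a scatterplot comparing the number of words in the title vs. number of shares for social media articles
plot(socialmedianews$n_tokens_title, socialmedianews$shares, xlab = "Number of Words in Article Title",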
     ylab = "Number of Shares", main = "Number of Words in Article Title vs. Number of Shares \nfor Social Media Articles", 
     col = socialmedianews$color, ylim = c(0, 120000))

#determine mean of the number of shares for social media articles
mean(socialmedianews$shares)

## [1] 3629.383

#subset the data for all the technology articles
technology <- subset(newsFinal, newsFinal$articleType == "technology")
#create a scatterplot comparing the number of words in the title vs. number of shares for technology articles
plot(technology$n_tokens_title, technology$shares, xlab = "Number of Words in Article Title",
     ylab = "Number of Shares", main = "Number of Words in Article Title vs. Number of Shares \nfor Technology Articles", 
     col = technology$color, ylim = c(0, 120000))

#determine mean of the number of shares for technology articles
mean(technology$shares)

## [1] 3072.283

#subset the data for all the watercooler articles
watercooler <- subset(newsFinal, newsFinal$articleType == "watercooler")
#create a scatterplot comparing the number of words in the title vs. number of shares for watercooler articles
plot(watercooler$n_tokens_title, watercooler$shares, xlab = "Number of Words in Article Title",
     ylab = "Number of Shares", main = "Number of Words in Article Title vs. Number of Shares \nfor Watercooler Articles", 
     col = watercooler$color, ylim = c(0, 120000))

#determine mean of the number of shares for watercooler articles
mean(watercooler$shares)

## [1] 5945.19

#subset the data for all the world news articles
worldnews <- subset(newsFinal, newsFinal$articleType == "world")
#create a scatterplot comparing the number of words in the title vs. number of shares for world news articles
plot(worldnews$n_tokens_title, worldnews$shares, xlab = "Number of Words in Article Title",
     ylab = "Number of Shares", main = "Number of Words in Article Title vs. Number of Shares \nfor World News", 
     col = worldnews$color, ylim = c(0, 120000))

#determine mean of the number of shares for world news articles
mean(worldnews$shares)

## [1] 2287.734
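The seven near-identical blocks above could also be written as a single loop over the levels of the article type. The following is a sketch under assumptions: it uses a made-up data frame (`toy`) in place of `newsFinal`, and the name `plottedCounts` is illustrative only.

```r
# Made-up data standing in for newsFinal, for illustration only.
toy <- data.frame(
  articleType    = factor(c("business", "business", "world", "world")),
  n_tokens_title = c(9, 11, 10, 12),
  shares         = c(3000, 3200, 2000, 2600),
  color          = c("cyan", "cyan", "magenta", "magenta"),
  stringsAsFactors = FALSE
)

# Draw one scatterplot per article type, using the same axis labels and
# y-axis limits throughout, and record how many articles each plot shows.
plottedCounts <- sapply(levels(toy$articleType), function(type) {
  oneType <- toy[toy$articleType == type, ]
  plot(oneType$n_tokens_title, oneType$shares,
       xlab = "Number of Words in Article Title",
       ylab = "Number of Shares",
       main = paste("Number of Words in Article Title vs. Number of Shares\nfor", type),
       col = oneType$color, ylim = c(0, 120000))
  nrow(oneType)
})
plottedCounts
```

Writing the plot call once keeps the axis labels and limits identical across all types, which makes the panels easier to compare.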

After separating the article types and creating graphs comparing the number of words in the article title with the number of shares, it was still evident that the number of words in the article title affects the number of shares, regardless of the article type. For all seven types, the articles with the most shares tended to have title lengths near the middle of the range.

The per-type graphs also made it easier to see how the article type might affect the number of shares. The article type with the largest number of shares was watercooler. As seen in the graph for watercooler articles, these articles had higher numbers of shares than the other article types. The average number of shares for watercooler articles was 5945.19, which was 2549.81 shares higher than the average across all articles, 3395.38. In contrast, the graph for the social media articles shows lower share counts than the other article types: with the exception of one outlier that achieved more than 120,000 shares, the majority of the social media articles were under 60,000. Compared with the overall average of 3395.38 shares, the average for business was 3063.02, entertainment 2970.49, lifestyle 3682.12, social media 3629.38, technology 3072.28, world news 2287.73, and watercooler 5945.19.
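The per-type averages and their distance from the overall average (for example, 5945.19 - 3395.38 = 2549.81 for watercooler) can be computed in one step rather than one subset at a time. The sketch below shows the calculation on made-up numbers; `meanByType` and `diffFromOverall` are illustrative names.

```r
# Made-up share counts for three article types, for illustration only.
toy <- data.frame(
  articleType = factor(c("business", "business", "watercooler",
                         "watercooler", "world", "world")),
  shares      = c(3000, 3200, 5000, 7000, 2000, 2600)
)

# Mean shares per article type in one call.
meanByType <- tapply(toy$shares, toy$articleType, mean)
meanByType

# Difference between each type's mean and the overall mean.
diffFromOverall <- meanByType - mean(toy$shares)
round(diffFromOverall, 2)
```

A positive difference marks a type that is shared more than the average article, and a negative one marks a type that is shared less.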

Discussion

Overall, the results support the initial hypothesis that the article type and the article title have a strong influence on the number of shares. Watercooler articles had the greatest number of shares, as predicted. However, the prediction that titles of 6 words would receive the most shares was not supported. Instead, this study suggests that the articles with the most shares have titles of 9 to 12 words.

The finding that watercooler articles have the greatest number of shares is in line with the articles published by Ragan and the Huffington Post. Both suggested that the most popular articles online were the quirky, lighthearted ones. This means that when browsing online, people tend to gravitate towards articles that are enjoyable and, as the Huffington Post reports, humorous. The results for title length may suggest that 6 words are not enough for readers to gain a firm understanding of the article, making them less likely to click on it and share it. Titles of 9 to 12 words give readers that understanding without being so long that they feel as though they are already reading the article.

For future studies, it would be interesting to analyze articles from publishers other than Mashable. Because the target audience of Mashable is millennials, the articles that are shared may be the ones that are more unusual and random. Different audiences are likely to have different interests, so the articles with the most shares would change. In addition, because Mashable's intention is to entertain, its audience expects to find articles that are humorous and enjoyable; an analysis of a more traditional news source would likely produce different findings. Beyond other news sources, other factors could be analyzed, such as whether the number of videos in the article or the specific words used in the article influence the number of shares.

Rebecca Lin

May 10, 2017
