The Shareability Formula

Author

Mariia, Lucas, Olivia

Abstract

This technical appendix investigates a dataset containing approximately 40,000 news articles published by Mashable (2013–2015). The data contains variables related to date of posting, textual characteristics, and visual content, among others. We investigated whether certain days of posting, subjects, tones, or perspectives make articles more shareable. Through ANOVA and Kruskal-Wallis rank sum tests, we found that image count, tone, subjectivity, and posting date are significant predictors of article shareability. Notably, weekend publications and articles outside Mashable’s six primary theme categories outperformed others in shares. Textual analysis also revealed that higher positivity and opinionated tones correlated with popularity. These findings are limited by the dataset’s age and potential platform-specific biases. A short summary of the study can be found here.

Introduction and context

Media has emerged as a powerful source for information, influence, and entertainment. Content creators today can shape public opinion, create emotional responses, spread ideas, and affect consumer behavior. With hundreds of millions of posts every day flooding the web, understanding what makes content worth sharing is more critical than ever. We aim to determine content shareability indicators - from post timing and content category to the number of visuals and overall sentiment. These determinations can help newsmakers, digital marketers, and content managers improve their tactics and reach greater online audiences.

Dataset and Wrangling

The dataset contains approximately 40,000 articles published by Mashable between January 2013 and January 2015. It is essentially a census: the data were scraped from all of Mashable’s articles published during this two-year period. Mashable is a digital media and news platform known for covering technology, entertainment, and culture. As of November 2015, the outlet had over 6 million Twitter followers, highlighting its strong presence and influence on social media.

When do news articles get more shares (timing-wise)?

Data Wrangling

To clean up the initial dataset and allow for visualizations and analysis of the timing of shares, the data was heavily culled and simplified. We began by creating two new variables. data_channel collapses the six binary indicator variables that identify an article as being published under a certain category into a single variable, with NA values indicating that an article was not published in any of the six categories. Similarly, dayofweek collapses the seven binary variables indicating whether an article was published on a certain day of the week. Both variables were treated as categorical after their creation, with dayofweek also ordered from Monday to Sunday.

Finally, we selected only the variables relevant to the analysis of articles’ category and timing: dayofweek and data_channel, as well as shares (the number of shares the article received) and url (which served as an identifier variable). Extreme outliers (articles receiving 100 or fewer shares, or 100,000 or more shares) were removed, reducing the total number of cases from 39,644 to 39,499.

# Create single categorical variable for data channel
news_EDA <- data |>
  mutate(data_channel = case_when(
    data_channel_is_lifestyle == 1 ~ "Lifestyle", 
    data_channel_is_entertainment == 1 ~ "Entertainment",    
    data_channel_is_bus == 1 ~ "Business", 
    data_channel_is_socmed == 1 ~ "Social Media", 
    data_channel_is_tech == 1 ~ "Tech", 
    data_channel_is_world == 1 ~ "World", 
    TRUE ~ NA_character_)  # typed NA keeps every branch a character value
  ) |>
  select(!starts_with("data_channel_is"))

# Create single categorical variable for day of week
news_EDA_2 <- news_EDA |>
  mutate(dayofweek = case_when(
     weekday_is_monday == 1 ~ "Monday",
     weekday_is_tuesday == 1 ~ "Tuesday",
     weekday_is_wednesday  == 1 ~ "Wednesday",
     weekday_is_thursday == 1 ~ "Thursday",
     weekday_is_friday == 1 ~ "Friday",
     weekday_is_saturday == 1 ~ "Saturday",
     weekday_is_sunday == 1 ~ "Sunday",
    TRUE ~ NA_character_)  # typed NA keeps every branch a character value
  ) |>
  select(!starts_with("weekday_is")) |>
  mutate(dayofweek = factor(dayofweek, ordered = TRUE, levels = 
                              c("Monday", "Tuesday", "Wednesday", 
                                "Thursday", "Friday", "Saturday", "Sunday")))

# Select only relevant variables for next stage of analysis, exclude outliers
newsanalysis <- news_EDA_2 |>
  select(url, dayofweek, data_channel, shares) |>
  filter(shares > 100, shares < 100000)
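
The collapsing step can be illustrated on a toy data frame. The sketch below uses base R only, and its two indicator columns and three rows are made up for illustration; it is not the wrangling code used above, just a small check of the same idea under the assumption that each row has at most one indicator set.

```r
# Toy indicator columns (hypothetical values, not taken from the dataset)
toy <- data.frame(
  data_channel_is_tech  = c(1, 0, 0),
  data_channel_is_world = c(0, 1, 0)
)
labels <- c("Tech", "World")

ind <- as.matrix(toy)

# Each row should have at most one indicator set
stopifnot(all(rowSums(ind) <= 1))

# Index of the first column holding the row maximum
idx <- max.col(ind, ties.method = "first")

# Rows with no indicator set become NA, mirroring the fallback branch above
channel <- ifelse(rowSums(ind) == 0, NA_character_, labels[idx])
print(channel)
```

The explicit rowSums() check is the part worth keeping in mind: case_when() silently takes the first matching branch, so overlapping indicators would otherwise go unnoticed.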

Exploratory Data Analysis

Before performing any tests, we hoped to get a broader sense of what days of the week articles tended to get increased shares, and whether those patterns differed depending on the type of article.

Counting the number of articles in each data channel (Figure 1) reveals that each article is assigned to at most one data channel, with no overlap. The most common data channels are World and Tech, while Lifestyle and Social Media are the least common; a sizable proportion of articles (~15.5%) are not assigned to any data channel. As shown in the figure below, these uncategorized articles also receive more shares on average than any assigned category.

# Get counts for each data channel
data |>
  count(data_channel_is_lifestyle, data_channel_is_entertainment, 
        data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, 
        data_channel_is_world)

  data_channel_is_lifestyle data_channel_is_entertainment data_channel_is_bus
1                         0                             0                   0
2                         0                             0                   0
3                         0                             0                   0
4                         0                             0                   0
5                         0                             0                   1
6                         0                             1                   0
7                         1                             0                   0
  data_channel_is_socmed data_channel_is_tech data_channel_is_world    n
1                      0                    0                     0 6134
2                      0                    0                     1 8427
3                      0                    1                     0 7346
4                      1                    0                     0 2323
5                      0                    0                     0 6258
6                      0                    0                     0 7057
7                      0                    0                     0 2099

# No articles are categorized in multiple data channels, but a
# large number are not assigned to any data channel
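
The printed counts can be turned into percentages to make the ordering of the channels and the uncategorized share easier to read off. This is a minimal sketch: the counts are transcribed by hand from the output above, and the vector name channel_counts is chosen for illustration.

```r
# Article counts per data channel, copied from the count() output above
channel_counts <- c(
  None = 6134, World = 8427, Tech = 7346, `Social Media` = 2323,
  Business = 6258, Entertainment = 7057, Lifestyle = 2099
)

total <- sum(channel_counts)                      # all articles in the raw data
pct   <- round(100 * channel_counts / total, 1)   # percentage per channel

# Channels ordered from most to least common
print(sort(pct, decreasing = TRUE))
```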

# Plot mean shares by data channel (news category)
newsanalysis |>
  group_by(data_channel) |>
  dplyr::summarize(meanshares = mean(shares)) |>
  ggplot(aes(x = data_channel, y = meanshares)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean shares by data channel",
       x = "Data channel",
       y = "Mean shares") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The average number of shares an article in each data channel receives.

We also counted the number of articles published on each day of the week (Figure 2). Most articles are published throughout the work week, with a slight drop-off on Fridays and fewer articles published on weekends (for example, only 2737 of the articles were published on a Sunday, versus 7435 published on a Wednesday).

However, though fewer in number, articles published on the weekend appear to get more shares on average.

# Get counts for each weekday
data |>
  count(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, 
        weekday_is_thursday, weekday_is_friday, weekday_is_saturday, 
        weekday_is_sunday)

  weekday_is_monday weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
1                 0                  0                    0                   0
2                 0                  0                    0                   0
3                 0                  0                    0                   0
4                 0                  0                    0                   1
5                 0                  0                    1                   0
6                 0                  1                    0                   0
7                 1                  0                    0                   0
  weekday_is_friday weekday_is_saturday weekday_is_sunday    n
1                 0                   0                 1 2737
2                 0                   1                 0 2453
3                 1                   0                 0 5701
4                 0                   0                 0 7267
5                 0                   0                 0 7435
6                 0                   0                 0 7390
7                 0                   0                 0 6661

# Fewer articles seem to be published on Saturdays and Sundays, 
# with a slightly lower number of articles published on Fridays 
# compared to other weekdays.

# Plot mean shares by day of week
newsanalysis |>
  group_by(dayofweek) |>
  dplyr::summarize(meanshares = mean(shares)) |>
  ggplot(aes(x = dayofweek, y = meanshares)) +
  geom_bar(stat = "identity") +
  # Axis labels match the aesthetics: days on x, mean shares on y
  labs(title = "Mean shares by day of the week",
       x = "Day of the week",
       y = "Mean shares") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The average number of article shares per day of the week.

Figure 3 shows how an article’s shares vary with the day of the week on which it is published and the category (“data channel”) to which it is assigned. As the first plot suggests, an article’s average number of shares differs noticeably between weekends and weekdays: an article published on Saturday or Sunday tends to receive more shares than one published on any weekday. However, articles in the two most-shared categories, Social Media and Lifestyle, tend to get higher shares regardless of the day of the week on which they are published. Uncategorized articles also receive more shares on average than articles published in any data channel, which is reflected in the second plot, and they too tend to perform well regardless of the day of the week. Given that about 6,000 articles (roughly 15% of the sample) have no assigned category, factors other than publishing time and category may have a greater influence on an article’s popularity.

# plot mean shares by data channel and day of week
plot1 <- newsanalysis |>
  filter(!is.na(data_channel)) |>
  group_by(dayofweek, data_channel) |>
  dplyr::summarize(meanshares = mean(shares)) |>
  ggplot(aes(x = dayofweek, y = meanshares)) +
  facet_wrap(~data_channel) +
  geom_bar(stat = "identity", width = 0.5) +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  xlab("Day of week") +
  ylab("Mean shares") +
  labs(title = "Mean Shares by Day of Week and Data Channel") +
    theme_minimal()+
  theme(
    strip.text = element_text(size = 10),
    plot.title = element_text(margin = ggplot2::margin(b = 5), 
                              size = 13, hjust = 0.5),
    plot.subtitle = element_text(size = 10),
    axis.title.x = element_text(margin = ggplot2::margin(t = 5)),
    axis.title.y = element_text(margin = ggplot2::margin(r = 4)),
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8),
    legend.text = element_text(size = 10)
  )
# plot mean shares for articles with no assigned data channel
plot2 <- newsanalysis |>
  filter(is.na(data_channel)) |>
  group_by(dayofweek) |>
  dplyr::summarize(meanshares = mean(shares)) |>
  ggplot(aes(x = dayofweek, y = meanshares)) +
  geom_bar(stat = "identity", width = 0.7) +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  xlab("Day of week") +
  ylab("Mean shares") +
  labs(fill = "Mean shares", 
       title = "Mean Shares for Uncategorized Articles") +
   theme_minimal()+
  theme(
    strip.text = element_text(size = 10),
    plot.title = element_text(margin = ggplot2::margin(b = 5), 
                              size = 13, hjust = 0.5),
    plot.subtitle = element_text(size = 10),
    axis.title.x = element_text(margin = ggplot2::margin(t = 5)),
    axis.title.y = element_text(margin = ggplot2::margin(r = 4)),
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8),
    legend.text = element_text(size = 10),
    legend.position = "right"
  )

grid.arrange(plot1, plot2, ncol = 2)

Average shares by day of week and data channel for categorized and uncategorized articles. Results indicate an overall increase in shares on weekends, with uncategorized articles more popular than categorized articles regardless of day of week.

Checking Model Assumptions

Because our question of interest involves analyzing one continuous variable (number of shares) and two categorical variables (day of week and data channel), we decided to conduct a two-way ANOVA to determine where significant differences are present in the number of shares depending on the day of the week, the assigned data channel/category, and the interaction between those two variables.

However, before conducting an ANOVA and any associated tests of significance, we first checked the relevant assumptions: normality, independence, homogeneity of variance, and outliers.

Since we are essentially performing a census of Mashable articles which are each counted only once, we can assume the independence condition has been met.

Additionally, since we’ve seen from our exploratory data analysis that each data channel and day of week contains a large number of observations, we can assume the normality condition based on a large sample size.

This leaves us with the homogeneity of variance and potential outliers.

# Test ANOVA assumptions

# Homogeneity assumption: are variances equal? Are there any outliers?
ggplot(newsanalysis) +
  aes(x = dayofweek, y = shares) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Day of week",
       y = "Shares")

ggplot(newsanalysis) +
  aes(x = data_channel, y = shares) +
  geom_boxplot()+
    theme_minimal()+
  theme(
  axis.text.x = element_text(angle = 45, hjust = 1)) +
   labs (x = "Data channel",
        y = "Shares")

In both cases, there are a number of significant outliers even with the most extreme cases removed during wrangling. However, we believed it was worth running the analysis with these outliers, since extreme shareability is often the goal of these articles, while also running identical tests on a transformed version of the data to see whether effects would persist.
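
The boxplots give a visual check of the equal-variance assumption. A formal test could supplement them; the sketch below uses the Fligner-Killeen test from base R, which is not part of the original analysis, and runs it on the built-in InsectSprays data purely as a stand-in. On the article data, the call would presumably take the form fligner.test(shares ~ dayofweek, data = newsanalysis).

```r
# Fligner-Killeen test of homogeneity of variances (rank-based, so it is
# less sensitive to non-normality and outliers than Bartlett's test).
# InsectSprays is a built-in R dataset used here only as a stand-in.
res <- fligner.test(count ~ spray, data = InsectSprays)

# Degrees of freedom = number of groups - 1
print(res$parameter)

# A small p-value suggests the group variances differ
print(res$p.value)
```

With samples as large as the one analyzed here, such a test will flag even small variance differences, so it is best read alongside the boxplots rather than in place of them.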

Model Results and Interpretation

To determine whether there were significant differences in the number of shares an article receives based on its day of publication and its assigned data channel, we conducted a two-way ANOVA on the dataset (with cases of 100 or fewer shares and 100,000 or more shares removed).

# Conduct two-way ANOVA with only the largest outliers removed
newsanalysis2 <- newsanalysis |>
  mutate(data_channel = ifelse(is.na(data_channel), "None", data_channel))

newsanova <- aov(shares ~ dayofweek * data_channel, data = newsanalysis2)
summary(newsanova)

                          Df    Sum Sq   Mean Sq F value   Pr(>F)    
dayofweek                  6 2.878e+09 4.796e+08  13.693 1.33e-15 ***
data_channel               6 3.329e+10 5.549e+09 158.422  < 2e-16 ***
dayofweek:data_channel    36 2.032e+09 5.643e+07   1.611   0.0115 *  
Residuals              39450 1.382e+12 3.503e+07                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
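
The degrees of freedom in the table can be reproduced by hand, which is a quick check that the model was fit to the filtered sample and that all day-by-channel cells contribute. The variable names below are chosen for illustration; the sample size is the post-filter count reported in the wrangling section.

```r
n_articles <- 39499   # cases remaining after the outlier filter
n_days     <- 7       # Monday through Sunday
n_channels <- 7       # six data channels plus "None"

df_day         <- n_days - 1                          # main effect of day
df_channel     <- n_channels - 1                      # main effect of channel
df_interaction <- df_day * df_channel                 # day-by-channel interaction
df_residual    <- n_articles - n_days * n_channels    # N minus number of cells

print(c(day = df_day, channel = df_channel,
        interaction = df_interaction, residual = df_residual))
```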

As the two-way ANOVA revealed significant effects of both day of week and data channel on number of shares, we conducted Tukey’s HSD post-hoc tests on both variables within the two-way ANOVA model to determine which pairwise differences were significant.

# Conduct Tukey's HSD to assess significant comparisons within the ANOVA
TukeyHSD(newsanova, "dayofweek")

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = shares ~ dayofweek * data_channel, data = newsanalysis2)

$dayofweek
                         diff       lwr        upr     p adj
Tuesday-Monday     -190.67867 -486.0236  104.66630 0.4778961
Wednesday-Monday   -237.50533 -532.4531   57.44241 0.2095346
Thursday-Monday    -226.48218 -523.0334   70.06899 0.2679557
Friday-Monday       -79.43680 -394.8823  236.00869 0.9899057
Saturday-Monday     624.58076  211.5240 1037.63751 0.0001676
Sunday-Monday       588.30100  191.8069  984.79512 0.0002454
Wednesday-Tuesday   -46.82666 -333.9822  240.32883 0.9990906
Thursday-Tuesday    -35.80351 -324.6057  252.99869 0.9998132
Friday-Tuesday      111.24187 -196.9301  419.41385 0.9384601
Saturday-Tuesday    815.25943  407.7303 1222.78856 0.0000001
Sunday-Tuesday      778.97967  388.2474 1169.71193 0.0000001
Thursday-Wednesday   11.02315 -277.3728  299.41911 0.9999998
Friday-Wednesday    158.06853 -149.7228  465.85984 0.7364602
Saturday-Wednesday  862.08610  454.8448 1269.32744 0.0000000
Sunday-Wednesday    825.80633  435.3742 1216.23842 0.0000000
Friday-Thursday     147.04538 -162.2828  456.37356 0.8012845
Saturday-Thursday   851.06295  442.6588 1259.46708 0.0000000
Sunday-Thursday     814.78318  423.1384 1206.42798 0.0000000
Saturday-Friday     704.01756  281.6940 1126.34113 0.0000184
Sunday-Friday       667.73780  261.5988 1073.87682 0.0000258
Sunday-Saturday     -36.27976 -522.1387  449.57919 0.9999906

TukeyHSD(newsanova, "data_channel")

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = shares ~ dayofweek * data_channel, data = newsanalysis2)

$data_channel
                                 diff         lwr        upr     p adj
Entertainment-Business       207.6241   -95.85624   511.1044 0.4036539
Lifestyle-Business           773.8166   332.97166  1214.6616 0.0000047
None-Business               2447.5226  2133.34501  2761.7002 0.0000000
Social Media-Business        967.8233   543.31079  1392.3358 0.0000000
Tech-Business                358.1572    57.67577   658.6386 0.0080309
World-Business              -420.8654  -712.65242  -129.0783 0.0004227
Lifestyle-Entertainment      566.1925   131.63428  1000.7508 0.0023472
None-Entertainment          2239.8985  1934.60496  2545.1921 0.0000000
Social Media-Entertainment   760.1992   342.21899  1178.1795 0.0000017
Tech-Entertainment           150.5331  -140.64671   441.7129 0.7302940
World-Entertainment         -628.4894  -910.68847  -346.2904 0.0000000
None-Lifestyle              1673.7060  1231.61084  2115.8012 0.0000000
Social Media-Lifestyle       194.0067  -332.25506   720.2685 0.9321654
Tech-Lifestyle              -415.6594  -848.12867    16.8098 0.0689563
World-Lifestyle            -1194.6820 -1621.15618  -768.2078 0.0000000
Social Media-None          -1479.6993 -1905.50998 -1053.8886 0.0000000
Tech-None                  -2089.3654 -2391.67807 -1787.0528 0.0000000
World-None                 -2868.3880 -3162.06048 -2574.7155 0.0000000
Tech-Social Media           -609.6661 -1025.47411  -193.8582 0.0003096
World-Social Media         -1388.6887 -1798.25781  -979.1196 0.0000000
World-Tech                  -779.0225 -1057.99397  -500.0511 0.0000000

To see more clearly where the greatest differences between days of the week and between data channels lay, we kept only the comparisons with significant adjusted p-values (p < 0.05):

# Run Tukey's HSD for day of week
tukey_day <- TukeyHSD(newsanova, "dayofweek")

# Convert to data frame and filter for significant comparisons
sig_day <- as.data.frame(tukey_day$dayofweek)
sig_day <- sig_day[sig_day$`p adj` < 0.05, ]
print(sig_day)

                       diff      lwr       upr        p adj
Saturday-Monday    624.5808 211.5240 1037.6375 1.676368e-04
Sunday-Monday      588.3010 191.8069  984.7951 2.454411e-04
Saturday-Tuesday   815.2594 407.7303 1222.7886 7.703548e-08
Sunday-Tuesday     778.9797 388.2474 1169.7119 8.704169e-08
Saturday-Wednesday 862.0861 454.8448 1269.3274 9.103839e-09
Sunday-Wednesday   825.8063 435.3742 1216.2384 9.414906e-09
Saturday-Thursday  851.0629 442.6588 1259.4671 1.688196e-08
Sunday-Thursday    814.7832 423.1384 1206.4280 1.800113e-08
Saturday-Friday    704.0176 281.6940 1126.3411 1.835743e-05
Sunday-Friday      667.7378 261.5988 1073.8768 2.580043e-05

# Complete same process for data channel comparisons
tukey_channel <- TukeyHSD(newsanova, "data_channel")
sig_channel <- as.data.frame(tukey_channel$data_channel)
sig_channel <- sig_channel[sig_channel$`p adj` < 0.05, ]
print(sig_channel)

                                 diff         lwr        upr        p adj
Lifestyle-Business           773.8166   332.97166  1214.6616 4.733279e-06
None-Business               2447.5226  2133.34501  2761.7002 0.000000e+00
Social Media-Business        967.8233   543.31079  1392.3358 3.770525e-10
Tech-Business                358.1572    57.67577   658.6386 8.030875e-03
World-Business              -420.8654  -712.65242  -129.0783 4.227115e-04
Lifestyle-Entertainment      566.1925   131.63428  1000.7508 2.347180e-03
None-Entertainment          2239.8985  1934.60496  2545.1921 0.000000e+00
Social Media-Entertainment   760.1992   342.21899  1178.1795 1.713966e-06
World-Entertainment         -628.4894  -910.68847  -346.2904 1.083039e-09
None-Lifestyle              1673.7060  1231.61084  2115.8012 2.020606e-14
World-Lifestyle            -1194.6820 -1621.15618  -768.2078 8.149037e-14
Social Media-None          -1479.6993 -1905.50998 -1053.8886 5.762057e-14
Tech-None                  -2089.3654 -2391.67807 -1787.0528 0.000000e+00
World-None                 -2868.3880 -3162.06048 -2574.7155 0.000000e+00
Tech-Social Media           -609.6661 -1025.47411  -193.8582 3.096165e-04
World-Social Media         -1388.6887 -1798.25781  -979.1196 5.551115e-14
World-Tech                  -779.0225 -1057.99397  -500.0511 8.404388e-14

Notably, the only day of week comparisons with significant differences were between either Saturday or Sunday and each weekday, with the weekend day always receiving significantly more shares than the weekday.

More nuanced differences are apparent when data channels are compared, with 17 of the 21 possible pairwise comparisons registering significant differences (this is most likely due to the extremely large sample size). However, the six largest differences in mean shares as determined by Tukey’s HSD were between uncategorized articles and each of the six data channels, with uncategorized articles always having higher mean shares. Additionally, articles categorized as “social media” performed significantly better on average than articles in four of the other five data channels; the exception was Lifestyle, where the difference was not significant (adjusted p = 0.93).
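
Since the same convert-and-filter steps are applied to each Tukey result, they could be wrapped in a small helper. The sketch below is one possible way to do that; significant_pairs is a hypothetical name rather than part of the original code, and the demonstration uses the built-in warpbreaks data as a stand-in for the article data.

```r
# Return only the Tukey HSD comparisons for `term` with adjusted p < alpha.
# `significant_pairs` is a hypothetical helper, not from the original analysis.
significant_pairs <- function(fit, term, alpha = 0.05) {
  res <- as.data.frame(TukeyHSD(fit, term)[[term]])
  res[res[["p adj"]] < alpha, , drop = FALSE]
}

# Demonstration on built-in data (stand-in for the article data)
demo_fit <- aov(breaks ~ wool * tension, data = warpbreaks)
demo_sig <- significant_pairs(demo_fit, "tension")
print(demo_sig)

# Number of possible pairwise comparisons among 7 groups
print(choose(7, 2))
```

On the models above, the equivalent calls would be significant_pairs(newsanova, "dayofweek") and significant_pairs(newsanova, "data_channel").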

To confirm that these significant results were not driven by the outliers noted earlier, we applied a log transformation to the number of shares an article receives and re-ran the two-way ANOVA on the transformed data.

# Create log-transformation of shares
newsanalysis3 <- newsanalysis2 |>
  mutate(logshares = log(shares))

# Reassess distributions
ggplot(newsanalysis3) +
  aes(x = dayofweek, y = logshares) +
  geom_boxplot() +
  theme(
  axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(y = "Log shares",
       x = "Day of week")

ggplot(newsanalysis3) +
  aes(x = data_channel, y = logshares) +
  geom_boxplot()+
  theme(
  axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(y = "Log shares",
       x = "Data channel")

# Reconduct ANOVA
newsanova3 <- aov(logshares ~ dayofweek * data_channel, data = newsanalysis3)
summary(newsanova3)

                          Df Sum Sq Mean Sq F value   Pr(>F)    
dayofweek                  6    497   82.75 109.957  < 2e-16 ***
data_channel               6   1723  287.15 381.548  < 2e-16 ***
dayofweek:data_channel    36    104    2.89   3.845 6.67e-14 ***
Residuals              39450  29690    0.75                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not only did the two-way ANOVA on the transformed data indicate that the significant effects of day of week and data channel (as well as their interaction) persist even when the distribution is transformed, but the effects of all three on number of shares actually have larger F values than in the prior analysis.

We also re-ran Tukey’s post-hoc tests to check whether the patterns of significant differences between days and data channels remained similar.

# Reconduct Tukey's HSD post-hoc tests
TukeyHSD(newsanova3, "dayofweek")

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = logshares ~ dayofweek * data_channel, data = newsanalysis3)

$dayofweek
                           diff         lwr          upr     p adj
Tuesday-Monday     -0.036012593 -0.07930466  0.007279471 0.1769689
Wednesday-Monday   -0.049002774 -0.09223661 -0.005768935 0.0146139
Thursday-Monday    -0.029288287 -0.07275716  0.014180586 0.4232210
Friday-Monday       0.035786550 -0.01045188  0.082024977 0.2527036
Saturday-Monday     0.320523605  0.25997719  0.381070021 0.0000000
Sunday-Monday       0.288080394  0.22996175  0.346199038 0.0000000
Wednesday-Tuesday  -0.012990181 -0.05508182  0.029101461 0.9711379
Thursday-Tuesday    0.006724307 -0.03560871  0.049057324 0.9992173
Friday-Tuesday      0.071799143  0.02662688  0.116971409 0.0000571
Saturday-Tuesday    0.356536198  0.29680003  0.416272368 0.0000000
Sunday-Tuesday      0.324092987  0.26681892  0.381367050 0.0000000
Thursday-Wednesday  0.019714487 -0.02255898  0.061987958 0.8153659
Friday-Wednesday    0.084789324  0.03967286  0.129905791 0.0000006
Saturday-Wednesday  0.369526378  0.30983239  0.429220365 0.0000000
Sunday-Wednesday    0.337083168  0.27985310  0.394313232 0.0000000
Friday-Thursday     0.065074837  0.01973309  0.110416579 0.0004637
Saturday-Thursday   0.349811891  0.28994746  0.409676322 0.0000000
Sunday-Thursday     0.317368681  0.25996086  0.374776505 0.0000000
Saturday-Friday     0.284737054  0.22283229  0.346641814 0.0000000
Sunday-Friday       0.252293844  0.19276144  0.311826251 0.0000000
Sunday-Saturday    -0.032443210 -0.10366107  0.038774652 0.8315662

TukeyHSD(newsanova3, "data_channel")

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = logshares ~ dayofweek * data_channel, data = newsanalysis3)

$data_channel
                                   diff         lwr         upr     p adj
Entertainment-Business     -0.105327921 -0.14981248 -0.06084337 0.0000000
Lifestyle-Business          0.168365992  0.10374634  0.23298564 0.0000000
None-Business               0.374082795  0.32803022  0.42013537 0.0000000
Social Media-Business       0.365662704  0.30343708  0.42788832 0.0000000
Tech-Business               0.165831390  0.12178642  0.20987636 0.0000000
World-Business             -0.203242472 -0.24601301 -0.16047193 0.0000000
Lifestyle-Entertainment     0.273693913  0.20999578  0.33739205 0.0000000
None-Entertainment          0.479410716  0.43466037  0.52416106 0.0000000
Social Media-Entertainment  0.470990626  0.40972251  0.53225874 0.0000000
Tech-Entertainment          0.271159311  0.22847778  0.31384084 0.0000000
World-Entertainment        -0.097914551 -0.13927967 -0.05654944 0.0000000
None-Lifestyle              0.205716804  0.14091390  0.27051971 0.0000000
Social Media-Lifestyle      0.197296713  0.12015655  0.27443688 0.0000000
Tech-Lifestyle             -0.002534602 -0.06592653  0.06085732 0.9999998
World-Lifestyle            -0.371608463 -0.43412163 -0.30909530 0.0000000
Social Media-None          -0.008420091 -0.07083600  0.05399582 0.9996947
Tech-None                  -0.208251406 -0.25256480 -0.16393801 0.0000000
World-None                 -0.577325267 -0.62037218 -0.53427835 0.0000000
Tech-Social Media          -0.199831315 -0.26078101 -0.13888162 0.0000000
World-Social Media         -0.568905176 -0.62894037 -0.50886998 0.0000000
World-Tech                 -0.369073861 -0.40996587 -0.32818185 0.0000000

# Rerun Tukey HSD for day of week
tukey_day <- TukeyHSD(newsanova3, "dayofweek")
sig_day <- as.data.frame(tukey_day$dayofweek)
sig_day <- sig_day[sig_day$`p adj` < 0.05, ]
print(sig_day)

                          diff         lwr          upr        p adj
Wednesday-Monday   -0.04900277 -0.09223661 -0.005768935 1.461386e-02
Saturday-Monday     0.32052360  0.25997719  0.381070021 0.000000e+00
Sunday-Monday       0.28808039  0.22996175  0.346199038 0.000000e+00
Friday-Tuesday      0.07179914  0.02662688  0.116971409 5.705933e-05
Saturday-Tuesday    0.35653620  0.29680003  0.416272368 0.000000e+00
Sunday-Tuesday      0.32409299  0.26681892  0.381367050 0.000000e+00
Friday-Wednesday    0.08478932  0.03967286  0.129905791 6.287716e-07
Saturday-Wednesday  0.36952638  0.30983239  0.429220365 0.000000e+00
Sunday-Wednesday    0.33708317  0.27985310  0.394313232 0.000000e+00
Friday-Thursday     0.06507484  0.01973309  0.110416579 4.636769e-04
Saturday-Thursday   0.34981189  0.28994746  0.409676322 0.000000e+00
Sunday-Thursday     0.31736868  0.25996086  0.374776505 0.000000e+00
Saturday-Friday     0.28473705  0.22283229  0.346641814 0.000000e+00
Sunday-Friday       0.25229384  0.19276144  0.311826251 0.000000e+00

# Complete same process for data channel comparisons
tukey_channel <- TukeyHSD(newsanova3, "data_channel")
sig_channel <- as.data.frame(tukey_channel$data_channel)
sig_channel <- sig_channel[sig_channel$`p adj` < 0.05, ]
print(sig_channel)

                                  diff        lwr         upr        p adj
Entertainment-Business     -0.10532792 -0.1498125 -0.06084337 6.164680e-11
Lifestyle-Business          0.16836599  0.1037463  0.23298564 3.712586e-13
None-Business               0.37408280  0.3280302  0.42013537 0.000000e+00
Social Media-Business       0.36566270  0.3034371  0.42788832 0.000000e+00
Tech-Business               0.16583139  0.1217864  0.20987636 2.953193e-14
World-Business             -0.20324247 -0.2460130 -0.16047193 0.000000e+00
Lifestyle-Entertainment     0.27369391  0.2099958  0.33739205 0.000000e+00
None-Entertainment          0.47941072  0.4346604  0.52416106 0.000000e+00
Social Media-Entertainment  0.47099063  0.4097225  0.53225874 0.000000e+00
Tech-Entertainment          0.27115931  0.2284778  0.31384084 0.000000e+00
World-Entertainment        -0.09791455 -0.1392797 -0.05654944 6.250622e-11
None-Lifestyle              0.20571680  0.1409139  0.27051971 6.750156e-14
Social Media-Lifestyle      0.19729671  0.1201566  0.27443688 1.024736e-12
World-Lifestyle            -0.37160846 -0.4341216 -0.30909530 0.000000e+00
Tech-None                  -0.20825141 -0.2525648 -0.16393801 0.000000e+00
World-None                 -0.57732527 -0.6203722 -0.53427835 0.000000e+00
Tech-Social Media          -0.19983131 -0.2607810 -0.13888162 5.029310e-14
World-Social Media         -0.56890518 -0.6289404 -0.50886998 0.000000e+00
World-Tech                 -0.36907386 -0.4099659 -0.32818185 0.000000e+00

While the tests on the transformed data reveal additional significant differences between days of the week and between data channels once the influence of outliers is reduced, differences between weekend days and weekdays still make up the majority (10 of 14) of the significant day-of-week differences in mean log(shares).
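
Because the second model is fit to log(shares), and log() in R is the natural logarithm, the Tukey differences are on the log scale. Exponentiating a difference turns it into a ratio of geometric mean shares, which is easier to interpret. The sketch below does this for two of the differences reported above; the values are transcribed by hand from the output.

```r
# Tukey differences on the natural-log scale, copied from the output above
log_diff <- c(
  "Saturday-Monday" = 0.320523605,
  "Sunday-Monday"   = 0.288080394
)

# exp(difference in mean log shares) = ratio of geometric mean shares
ratio <- exp(log_diff)
print(round(ratio, 3))

# The same quantity expressed as a percentage increase
print(round(100 * (ratio - 1), 1))
```

Read this way, the Saturday-Monday difference corresponds to Saturday articles having a geometric mean share count roughly 38 percent higher than Monday articles.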

With these results in mind, as well as those from our EDA, there appears to be no clear-cut rule for when any particular article should be published, but some days and categories are better than others for maximizing shares. In particular, publishing any article (but particularly one which falls into one of the six data channels Mashable created) on the weekend is likely to result in a higher number of shares on average than publishing it on a weekday. This could be due to Mashable’s readers having more free time to read and share articles on weekends, despite the lower number of articles published then.

Additionally, articles about social media tend to perform better on average than articles in other categories (entertainment, business, lifestyle, world, and tech), but articles that don’t fall into any of those six categories tend to receive more shares than articles that do. This may be because these articles tend to appeal to a broader audience, while categorized articles are geared towards a certain section of Mashable’s readership.

What subject matter do people tend to share?

Next, we were interested in the content of articles that people tend to share more. There were several variables included in the original dataset that were related to content. For our analysis, we were specifically interested in images and videos, which we refer to as media elements.

Data Wrangling

To prepare the data for visualizing which types of subject matter people tend to share, we carried out several key wrangling steps. We began by creating a new categorical variable, data_channel, by combining the binary indicator columns that signify whether an article belongs to one of six main categories: Lifestyle, Entertainment, Business, Social Media, Tech, or World. Articles that did not fall into any of these categories were excluded from this part of the analysis. To simplify comparisons across articles with different image counts, we grouped the num_imgs variable into five bins (img_bin): 0, 1–2, 3–5, 6–10, and 11 or more. We then removed rows with missing data and calculated the mean number of shares and total article count for each combination of content category and image bin. Finally, we converted the image bins to an ordered factor to ensure the bar charts displayed them in a meaningful sequence. These wrangling steps enabled a clearer summary and visualization of how article popularity interacts with content category and visual media use.

Exploratory Data Analysis

To explore how media content relates to article sharing, we began by examining the relationship between the number of images or videos and the number of shares each article received. Figure 4 visualizes this relationship using scatterplots and linear trend lines, with the y-axis log-transformed to account for the heavy right skew in the shares distribution. This analysis reveals that while articles with more media tend to receive slightly more shares, the effect is modest. Most data points cluster at low counts of images or videos, and the near-flat trend lines suggest that simply increasing the number of media elements does not significantly boost an article’s popularity.

To investigate whether this relationship varies by subject matter, we categorized each article by its content channel (e.g., Business, World, Lifestyle) and grouped image counts into five bins: 0, 1–2, 3–5, 6–10, and 11 or more. Figure 5 shows the average number of shares per image bin within each channel. This breakdown reveals more complex patterns: in the Business and World categories, articles with either no images or a large number (11+) tend to receive more shares, while Lifestyle and Entertainment articles show more uniform sharing across bins. These results suggest that the influence of visual media on article popularity depends in part on the content domain, pointing to an interaction between subject matter and media use in shaping user engagement.

Based on our analysis of Figure 4, although both images and videos show a slight positive relationship with article shares, the effect is minimal, especially given how clustered the data is at low media counts. The plot shows that articles with more images or videos tend to receive marginally more shares on average, but the trend lines are almost flat, indicating that media quantity alone doesn’t strongly drive popularity.

news_long <- data %>%
  dplyr::select(shares, num_imgs, num_videos) %>%
  pivot_longer(cols = c(num_imgs, num_videos),
               names_to = "media_type",
               values_to = "count") %>%
  mutate(media_type = recode(media_type,
                             num_imgs = "Images",
                             num_videos = "Videos"))

# Plot
ggplot(news_long, aes(x = count, y = log1p(shares), color = media_type)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.5) +
  facet_wrap(~media_type, scales = "free_x") +
  labs(
    title = "Relationship Between Media Elements and Article Shares",
    x = "Number of Media Elements",
    y = "Shares (log scale)",
    color = "Media Type"
  )  + theme_minimal()+
  theme(
  strip.text = element_text(size = 10, margin = ggplot2::margin(b = 5)),
  plot.title = element_text(margin = ggplot2::margin(b = 5), size = 13, 
                            hjust = 0.5),
  plot.subtitle = element_text(size = 10),
  axis.title.x = element_text(margin = ggplot2::margin(t = 5)),
  axis.title.y = element_text(margin = ggplot2::margin(r = 4)),
  axis.text.x = element_text(size = 8),
  axis.text.y = element_text(size = 8),
  legend.text = element_text(size = 10))

Relationship between media elements and number of shares

However, Figure 5, which breaks down image counts by content category, reveals more nuanced patterns. For example, in the Business and World channels, articles with either no images or a high number of images (11+) tend to receive more shares, suggesting that users may prefer either streamlined content or richly visual stories in those domains. In contrast, other categories like Lifestyle or Entertainment show less variation across image bins. This indicates that subject matter and its interaction with media content can influence sharing behavior: certain audiences may respond differently to visual density depending on the topic.

news_EDA_binned <- data |>
  mutate(data_channel = case_when(
    data_channel_is_lifestyle == 1 ~ "Lifestyle", 
    data_channel_is_entertainment == 1 ~ "Entertainment",    
    data_channel_is_bus == 1 ~ "Business", 
    data_channel_is_socmed == 1 ~ "Social Media", 
    data_channel_is_tech == 1 ~ "Tech", 
    data_channel_is_world == 1 ~ "World", 
    TRUE ~ NA_character_  # typed NA keeps case_when() output character
  )) |>
  filter(!is.na(data_channel)) |>
  mutate(img_bin = case_when(
    num_imgs == 0 ~ "0",
    num_imgs %in% 1:2 ~ "1–2",
    num_imgs %in% 3:5 ~ "3–5",
    num_imgs %in% 6:10 ~ "6–10",
    num_imgs > 10 ~ "11+"
  ))

# Summarize mean shares (raw)
img_bin_summary <- news_EDA_binned |>
  group_by(data_channel, img_bin) |>
  summarise(
    mean_shares = mean(shares, na.rm = TRUE),
    count = n(),
    .groups = "drop"
  )

# Order bins
img_bin_summary$img_bin <- factor(
  img_bin_summary$img_bin,
  levels = c("0", "1–2", "3–5", "6–10", "11+")
)

# Plot (raw mean shares) with fixed y-axis scales
ggplot(img_bin_summary, aes(x = img_bin, y = mean_shares)) +
  geom_col(alpha = 0.8, width = 0.7) +
  facet_wrap(~data_channel, scales = "fixed") +  # same y-axis across panels
  labs(
    title = "Mean Shares by Number of Images (Binned) and Data Channel",
    x = "Number of Images (Binned)",
    y = "Mean Shares"
  ) +  theme_minimal()+
  theme(
  strip.text = element_text(size = 10),
  plot.title = element_text(margin = ggplot2::margin(b = 5), size = 13, 
                            hjust = 0.5),
  plot.subtitle = element_text(size = 10),
  axis.title.x = element_text(margin = ggplot2::margin(t = 5)),
  axis.title.y = element_text(margin = ggplot2::margin(r = 4)),
  axis.text.x = element_text(size = 8),
  axis.text.y = element_text(size = 8),
  legend.text = element_text(size = 8))

The mean number of shares by number of images and data channel

Checking Model Assumptions

# Do mean shares differ across image bins?
anova_model <- aov(log1p(shares) ~ img_bin, data = news_EDA_binned)

# check conditions 
par(mfrow = c(1, 2))
hist(residuals(anova_model), main = "Histogram of Residuals")
qqnorm(residuals(anova_model))
qqline(residuals(anova_model))

To test whether the average number of shares differed significantly across image bins, we conducted a one-way ANOVA using the log-transformed shares variable to address the skewness in the original distribution. Prior to interpreting the ANOVA results, we checked key model assumptions. A histogram of residuals and a Q-Q plot were used to assess the normality of residuals. The histogram appeared roughly bell-shaped, and the Q-Q plot showed points reasonably close to the reference line, suggesting that the normality assumption was not severely violated. Additionally, because ANOVA is relatively robust to mild violations of normality when sample sizes are large and balanced, the sizable number of observations in each image bin provides further confidence in the validity of the test results.
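One way to check the claim about group sizes and spread is to tabulate the number of observations and the standard deviation of log shares in each bin. The sketch below uses a small synthetic data frame (`toy`, a hypothetical stand-in and not the Mashable data) so that it runs on its own; with the real data, `news_EDA_binned` would take its place. Bartlett's test is included as one base R option for comparing variances, with the caveat that it is itself sensitive to non-normality.

```r
# Synthetic stand-in for news_EDA_binned (hypothetical values)
set.seed(42)
toy <- data.frame(
  img_bin = sample(c("0", "1-2", "3-5", "6-10", "11+"), 2000, replace = TRUE,
                   prob = c(0.20, 0.45, 0.10, 0.10, 0.15)),
  shares  = round(rlnorm(2000, meanlog = 7.4, sdlog = 0.9))
)
toy$log_shares <- log1p(toy$shares)

# Observations per bin: are the groups large, and how unbalanced are they?
group_n <- table(toy$img_bin)
print(group_n)

# Spread of log shares per bin and the largest-to-smallest SD ratio
group_sd <- tapply(toy$log_shares, toy$img_bin, sd)
sd_ratio <- max(group_sd) / min(group_sd)
print(round(group_sd, 3))
cat("Largest-to-smallest SD ratio:", round(sd_ratio, 2), "\n")

# Formal comparison of variances across bins
print(bartlett.test(log_shares ~ img_bin, data = toy))
```

A ratio close to 1 and comparable group sizes would support relying on the ANOVA F-test; strongly unequal spreads or very uneven bins would argue for a more cautious reading.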

Model Results Interpretation

anova_result <- aov(log1p(shares) ~ img_bin, data = news_EDA_binned)
summary(anova_result)

               Df Sum Sq Mean Sq F value Pr(>F)    
img_bin         4    342   85.46   112.9 <2e-16 ***
Residuals   33505  25363    0.76                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Conduct Tukey's HSD post-hoc test for image bins
tukey_img_bin <- TukeyHSD(anova_result, "img_bin")
sig_img_bin <- as.data.frame(tukey_img_bin$img_bin)
sig_img_bin <- sig_img_bin[sig_img_bin$`p adj` < 0.05, ]
print(sig_img_bin)

                diff         lwr         upr        p adj
1–2-0    -0.14411104 -0.18248643 -0.10573566 4.130030e-14
11+-0     0.09546684  0.04526312  0.14567056 2.123104e-06
6–10-0    0.11814234  0.05548675  0.18079792 2.679944e-06
11+-1–2   0.23957789  0.19954909  0.27960668 0.000000e+00
3–5-1–2   0.16561634  0.11201612  0.21921656 5.040413e-14
6–10-1–2  0.26225338  0.20741167  0.31709509 0.000000e+00
3–5-11+  -0.07396154 -0.13657725 -0.01134583 1.115105e-02
6–10-3–5  0.09663704  0.02365734  0.16961673 2.809465e-03

The ANOVA results revealed a highly significant effect of image bin on log-transformed shares, with an F-value of 112.9 and a p-value less than 2e-16. This extremely small p-value indicates strong evidence against the null hypothesis of equal means across all image bins. In other words, there are statistically significant differences in average (log) shares among the image count categories. The large F-statistic means that the between-group mean square (85.46) is far larger than the within-group mean square (0.76). It does not by itself imply a large effect: the between-group sum of squares (342) is small relative to the residual sum of squares (25,363), so the number of images in an article is reliably but only weakly associated with its sharing behavior.

To investigate which specific image bins differ, we conducted Tukey’s HSD post-hoc test. Eight of the ten pairwise comparisons were statistically significant. Articles with 1–2 images received significantly fewer (log-transformed) shares than articles in every other bin, including articles with no images. Articles with 11 or more images received significantly more shares than articles with 0, 1–2, or 3–5 images, but did not differ significantly from articles with 6–10 images, which had the highest mean. This suggests that while simply increasing media count has a modest overall effect, articles with a higher density of images (six or more) may benefit from a boost in shareability. However, given the large sample size, even small differences can become statistically significant, so these results should be interpreted with caution regarding their practical importance.
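The caution about practical importance can be made concrete with an effect size. A minimal sketch, using the sums of squares printed in the ANOVA table above: eta-squared is the share of total variation in log1p(shares) attributable to img_bin. The helper `eta_squared()` is a hypothetical convenience function for one-way `aov()` fits, demonstrated here on the built-in PlantGrowth data rather than the Mashable data.

```r
# Sums of squares copied from the ANOVA table above
ss_img_bin   <- 342
ss_residuals <- 25363

# Eta-squared: proportion of total variation explained by img_bin
eta_sq <- ss_img_bin / (ss_img_bin + ss_residuals)
print(round(eta_sq, 4))

# Hypothetical helper: eta-squared for any one-way aov() fit
eta_squared <- function(fit) {
  ss <- summary(fit)[[1]][["Sum Sq"]]  # first row = factor, last row = residuals
  ss[1] / sum(ss)
}

# Demonstration on a built-in dataset (not the Mashable data)
demo_fit <- aov(weight ~ group, data = PlantGrowth)
print(round(eta_squared(demo_fit), 3))
```

For the image bins this works out to roughly 1.3% of the variation, which is small by the commonly cited rule of thumb (about 0.01 small, 0.06 medium, 0.14 large).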

What tone and perspective make articles more shareable?

Audiences engage with content in various ways, and that engagement is often influenced by the structure of an article, its rhetoric, and its sentiment. In this part of the analysis, we use Analysis of Variance (ANOVA) and related tests to identify how variations in tone, subjectivity, and rhetorical strategy correspond to measurable differences in how frequently content is shared.

Data Processing and Methods

In order to identify whether article shareability depends on textual characteristics, we selected variables related to keywords, sentiment scores, polarity ratings, and subjectivity. Specifically:

num_keywords - The number of keywords identified in the text.
global_subjectivity - A score measuring the overall subjectivity of the full text. Closer to 1 = more subjective/opinionated; closer to 0 = more objective/factual.
global_sentiment_polarity - Overall sentiment polarity score of the full text, ranging from -1 (very negative) to +1 (very positive).
global_rate_positive_words - The ratio of positive words to the total number of words in the full text.
global_rate_negative_words - The ratio of negative words to total words in the full text.
avg_positive_polarity - Average polarity of all the positive words in the text. A higher average indicates stronger positivity when it does occur.
avg_negative_polarity - Average polarity of all the negative words in the text. A more negative value indicates stronger negativity when present.
title_subjectivity - Subjectivity score of the post title.
title_sentiment_polarity - Sentiment polarity score of the title. Useful for gauging first impressions or clickbait potential.

Since the dataset contains almost 40,000 observations, we took a 10% random sample to balance computational efficiency with analytical validity. A sample of this size should capture the key trends in the full dataset.

set.seed(321)
sampled_data <- data |> 
  sample_frac(0.10)

Exploratory Data Analysis

Based on the scatterplots below (Figure 6), we noted several patterns:

The number of keywords and shares have a weak positive association. Some highly shared articles are clustered around a keyword count of 10.
Articles with a moderate level of subjectivity receive more shares, with most data clustering around 0.5. Highly objective or subjective articles receive fewer shares, but there are still a few outliers with very high share counts across the spectrum.
Neutral sentiment polarity (values within the range of 0.0 - 0.25) is associated with higher shareability, including some of the most viral articles, suggesting restraint in emotional language may correlate with shareability.
Articles with higher ratios of both positive and negative words are associated with fewer shares.
Average negative and positive polarity scores of around -0.25 and 0.25, respectively, are associated with the highest share counts, which suggests that more neutral articles perform better than very emotional ones.
Title subjectivity showed no strong correlation with number of shares. However, neutral or mildly polar titles tended to dominate among highly shared articles.

# Common theme for plots
gg <- theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 10, face = "bold"),
    axis.title = element_text(size = 8),
    axis.text = element_text(size = 7)
  )

# Scatter plots 
p1 <- ggplot(sampled_data, aes(x = num_keywords, y = shares)) +
  geom_point(alpha = 0.3) + geom_smooth(method = "loess", se = FALSE) +
  gg + xlab("Number of keywords") + ylab("Shares")

p2 <- ggplot(sampled_data, aes(x = global_subjectivity, y = shares)) +
  geom_point(alpha = 0.3) + geom_smooth(method = "loess", se = FALSE) +
  gg + xlab("Global subjectivity") + ylab("Shares")

p3 <- ggplot(sampled_data, aes(x = global_sentiment_polarity, y = shares)) +
  geom_point(alpha = 0.3) + geom_smooth(method = "loess", se = FALSE) +
  gg + xlab("Global sentiment polarity") + ylab("Shares")

p4 <- ggplot(sampled_data, aes(x = global_rate_positive_words, y = shares)) +
  geom_point(alpha = 0.3) + geom_smooth(method = "loess", se = FALSE) +
  gg + xlab("Positive word rate") + ylab("Shares")

p5 <- ggplot(sampled_data, aes(x = global_rate_negative_words, y = shares)) +
  geom_point(alpha = 0.3) + geom_smooth(method = "loess", se = FALSE) +
  gg + xlab("Negative word rate") + ylab("Shares")

p6 <- ggplot(sampled_data, aes(x = avg_positive_polarity, y = shares)) +
  geom_point(alpha = 0.3) + geom_smooth(method = "loess", se = FALSE) +
  gg + xlab("Avg positive polarity") + ylab("Shares")

p7 <- ggplot(sampled_data, aes(x = avg_negative_polarity, y = shares)) +
  geom_point(alpha = 0.3) + geom_smooth(method = "loess", se = FALSE) +
  gg + xlab("Avg negative polarity") + ylab("Shares")

p8 <- ggplot(sampled_data, aes(x = title_subjectivity, y = shares)) +
  geom_point(alpha = 0.3) + geom_smooth(method = "loess", se = FALSE) +
  gg + xlab("Title subjectivity") + ylab("Shares")

p9 <- ggplot(sampled_data, aes(x = title_sentiment_polarity, y = shares)) +
  geom_point(alpha = 0.3) + geom_smooth(method = "loess", se = FALSE) +
  gg + xlab("Title sentiment polarity") + ylab("Shares")

# Arrange in grid
grid.arrange(grobs = list(p1, p2, p3, p4, p5, p6, p7, p8, p9), 
             ncol = 3, nrow = 3, 
             top = "Scatter plots: Text Features vs. Shares")

Correlation coefficients

To assess the strength of the relationships between text-related variables and share counts, we computed Spearman rank correlation coefficients. Based on the output (Table 1), the top 7 correlations are statistically significant (p < 0.05). global_subjectivity has the strongest positive association with article shares, but the relationship is still quite weak (0.12). Negative sentiment-related variables, such as avg_negative_polarity and global_rate_negative_words, along with title_subjectivity, show very weak negative correlations (-0.01) with p-values above the 0.05 significance level.

# Select variables of interest
selected <- c("num_keywords", "global_subjectivity", 
              "global_sentiment_polarity",
              "global_rate_positive_words", 
              "global_rate_negative_words",
              "avg_positive_polarity", 
              "avg_negative_polarity",
              "title_subjectivity", 
              "title_sentiment_polarity", 
              "shares")

# Calculate Spearman correlations 
result <- sampled_data |>
  dplyr::select(all_of(selected)) |>
  as.matrix() |>
  rcorr(type = "spearman") |>
  with({
    features <- setdiff(colnames(r), "shares")
    # Select correlation coefficients related to shares only
    tibble(
      Feature = features,
      Spearman_Coefficient = r["shares", features],
      p_value = P["shares", features]
    )
  }) |>
  arrange(desc(abs(Spearman_Coefficient)))

result |>
  rename(
    "Spearman Coefficient" = Spearman_Coefficient,
    "p-value" = p_value
  ) |>
  mutate(across(where(is.numeric), ~ round(., 3))) |>
  knitr::kable(
    caption = "Spearman Correlations with Article Shares",
    format = "latex", 
    booktabs = TRUE,  
    align = c("l", "c", "c"),
    col.names = c("Predictor", "Correlation", "p-value")
  ) |>
  kableExtra::kable_styling(
    latex_options = "hold_position",
    bootstrap_options = c("striped"),
    full_width = FALSE,
    font_size = 12
  )
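Spearman correlation suits heavily skewed share counts because it depends only on ranks. The sketch below, on synthetic data (the variables `x` and `shares` are hypothetical, not columns from the Mashable set), illustrates two properties: the coefficient is unchanged by a monotone transformation such as log1p, and it equals the Pearson correlation computed on ranks.

```r
set.seed(1)

# Synthetic predictor and a right-skewed, count-like response
x      <- runif(200)
shares <- round(exp(2 + 1.5 * x + rnorm(200)))

# Spearman on the raw scale and after a log1p transformation
rho_raw <- cor(x, shares, method = "spearman")
rho_log <- cor(x, log1p(shares), method = "spearman")

# Pearson correlation of the ranks (average ranks for ties)
rho_ranks <- cor(rank(x), rank(shares))

print(c(raw = rho_raw, log = rho_log, ranks = rho_ranks))

# Pearson on the raw scale, for contrast, is sensitive to the skew
print(cor(x, shares, method = "pearson"))
```

This is why the correlations reported in Table 1 do not depend on whether shares are log-transformed first.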

Because the correlations were low, we chose the 5 variables with the highest correlation values: global_subjectivity, global_sentiment_polarity, global_rate_positive_words, num_keywords, and avg_positive_polarity. Since these variables had many outliers and no apparent linear relationship with shares, we split them into quartiles and treated them as categorical variables. This allowed us to better capture non-linear relationships and differences between groups. The distribution of shares is highly skewed with outliers, so we applied a logarithmic transformation to the y-axis to allow for easier comparisons between groups (Figure 7).

# Split the data into 4 bins to treat as categorical variables
binned_data <- sampled_data |>
  mutate(
    bin_subjectivity = ntile(global_subjectivity, 4),
    bin_sentiment = ntile(global_sentiment_polarity, 4),
    bin_posrate = ntile(global_rate_positive_words, 4),
    bin_keywords = ntile(num_keywords, 4),
    bin_avgpos = ntile(avg_positive_polarity, 4)
  )

make_boxplot <- function(data, bin_var, bin_label) {
  ggplot(data, aes(x = factor(.data[[bin_var]]), y = shares)) +
    geom_boxplot() +
    scale_y_log10() +
    annotation_logticks(sides = "l") +
    labs(
      title = paste(bin_label),
      x = paste(bin_label, "Bin"), y = "Shares"
    ) + gg
}

# Create individual plots
p1 <- make_boxplot(binned_data, "bin_subjectivity", "Global subjectivity")
p2 <- make_boxplot(binned_data, "bin_sentiment", "Global sentiment polarity")
p3 <- make_boxplot(binned_data, "bin_posrate", "Positive word rate")
p4 <- make_boxplot(binned_data, "bin_keywords", "Number of keywords")
p5 <- make_boxplot(binned_data, "bin_avgpos", "Average positive polarity")

grid.arrange(grobs = list(p1, p2, p3, p4, p5), 
             ncol = 3, nrow = 2, 
             top = "Box plots: Text Features vs. Shares")
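One detail of quartile binning is worth illustrating for discrete variables such as num_keywords. `dplyr::ntile()` builds groups of (nearly) equal size from row ranks, so tied values can be split across bins. The sketch below uses a short hypothetical vector of keyword counts, and shows a quantile-based `cut()` as one alternative that keeps ties together at the cost of unequal bin sizes. It is an illustration of the trade-off, not a change to the analysis above.

```r
library(dplyr)

# Hypothetical keyword counts with many ties
keywords <- c(7, 7, 7, 7, 7, 7, 8, 10)

# ntile(): equal-sized bins; identical values may land in different bins
bins_ntile <- ntile(keywords, 4)
print(data.frame(keywords, bins_ntile))

# How many different bins does the value 7 end up in?
n_bins_for_7 <- length(unique(bins_ntile[keywords == 7]))
print(n_bins_for_7)

# Alternative: cut at the unique quantile breakpoints, so ties stay together
breaks   <- unique(quantile(keywords, probs = seq(0, 1, by = 0.25)))
bins_cut <- cut(keywords, breaks = breaks, include.lowest = TRUE)
print(table(bins_cut))
```

For the continuous sentiment scores ties are rare, so the two approaches should give nearly the same bins; the distinction matters mainly for integer-valued predictors.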

Testing

To find whether the number of shares varied significantly between groups, we initially considered a one-way Analysis of Variance (ANOVA), which evaluates whether the mean number of shares varies across the quartile-based bins. For each test, the null hypothesis was that all group means are equal, that is, bin membership has no impact on average shares, whereas the alternative hypothesis was that at least one group mean differs.

Before running ANOVA, we checked its key assumptions. Independence of observations was reasonable under the sampling design, since each observation is a distinct article drawn at random from the full dataset. However, the residuals (the differences between observed and fitted values) were not normally distributed, as the residual plots confirmed. This extreme non-normality made the ANOVA results unreliable for some variables.

Therefore, we used the Kruskal-Wallis test, a non-parametric alternative to ANOVA that does not assume normality of residuals. The test checks whether share distributions differ across the quartile-based bins. The results revealed statistically significant differences in shares for all five variables. These findings suggest that the shareability of articles is associated with variations in textual features.

kruskal.test(shares ~ factor(bin_subjectivity), data = binned_data)


    Kruskal-Wallis rank sum test

data:  shares by factor(bin_subjectivity)
Kruskal-Wallis chi-squared = 59.283, df = 3, p-value = 8.363e-13

kruskal.test(shares ~ factor(bin_sentiment), data = binned_data)


    Kruskal-Wallis rank sum test

data:  shares by factor(bin_sentiment)
Kruskal-Wallis chi-squared = 45.095, df = 3, p-value = 8.831e-10

kruskal.test(shares ~ factor(bin_posrate), data = binned_data)


    Kruskal-Wallis rank sum test

data:  shares by factor(bin_posrate)
Kruskal-Wallis chi-squared = 27.536, df = 3, p-value = 4.546e-06

kruskal.test(shares ~ factor(bin_keywords), data = binned_data)


    Kruskal-Wallis rank sum test

data:  shares by factor(bin_keywords)
Kruskal-Wallis chi-squared = 24.107, df = 3, p-value = 2.373e-05

kruskal.test(shares ~ factor(bin_avgpos), data = binned_data)


    Kruskal-Wallis rank sum test

data:  shares by factor(bin_avgpos)
Kruskal-Wallis chi-squared = 24.17, df = 3, p-value = 2.302e-05
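The test statistics above can be put on a common effect-size scale. A rough sketch, using the eta-squared estimate based on the H statistic, (H - k + 1) / (n - k), where k is the number of groups and n the number of observations. The H values are copied from the outputs above; the sample size `n` is an assumption (about 10% of almost 40,000 articles), since the exact count is not printed in this section.

```r
# H statistics copied from the Kruskal-Wallis outputs above
H <- c(subjectivity = 59.283, sentiment = 45.095, posrate = 27.536,
       keywords = 24.107, avgpos = 24.170)

k <- 4      # four quartile bins per variable (df = 3 in the output)
n <- 3964   # ASSUMPTION: approximate size of the 10% sample

# Eta-squared based on H: share of rank variation explained by bin membership
eta_sq_H <- (H - k + 1) / (n - k)
print(round(sort(eta_sq_H, decreasing = TRUE), 4))
```

Under this assumed sample size every value stays below 0.02, so the differences are statistically clear but small in magnitude.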

Specifically, global_subjectivity produced the largest test statistic (chi-squared = 59.28), followed by global_sentiment_polarity (45.10), which indicates that subjectivity and sentiment are the textual features most strongly associated with share counts. global_rate_positive_words, num_keywords, and avg_positive_polarity also showed significant differences in shares across bins. The Kruskal-Wallis test only tells us that at least one group differs from the others; it does not identify which groups differ or in which direction. We therefore ran Dunn’s post-hoc test with the Benjamini-Hochberg adjustment to compare each pair of bins individually.

dunn.test(binned_data$shares, binned_data$bin_subjectivity, method = "bh")

  Kruskal-Wallis rank sum test

data: x and group
Kruskal-Wallis chi-squared = 59.2831, df = 3, p-value = 0

                           Comparison of x by group                            
                             (Benjamini-Hochberg)                              
Col Mean-|
Row Mean |          1          2          3
---------+---------------------------------
       2 |  -2.513559
         |    0.0072*
         |
       3 |  -5.153218  -2.639659
         |    0.0000*    0.0062*
         |
       4 |  -7.226222  -4.712663  -2.073004
         |    0.0000*    0.0000*    0.0191*

alpha = 0.05
Reject Ho if p <= alpha/2

dunn.test(binned_data$shares, binned_data$bin_sentiment, method = "bh")

  Kruskal-Wallis rank sum test

data: x and group
Kruskal-Wallis chi-squared = 45.0953, df = 3, p-value = 0

                           Comparison of x by group                            
                             (Benjamini-Hochberg)                              
Col Mean-|
Row Mean |          1          2          3
---------+---------------------------------
       2 |  -1.214068
         |     0.1124
         |
       3 |  -4.147784  -2.933715
         |    0.0000*    0.0025*
         |
       4 |  -6.022489  -4.808421  -1.874705
         |    0.0000*    0.0000*     0.0365

alpha = 0.05
Reject Ho if p <= alpha/2

dunn.test(binned_data$shares, binned_data$bin_posrate, method = "bh")

  Kruskal-Wallis rank sum test

data: x and group
Kruskal-Wallis chi-squared = 27.5357, df = 3, p-value = 0

                           Comparison of x by group                            
                             (Benjamini-Hochberg)                              
Col Mean-|
Row Mean |          1          2          3
---------+---------------------------------
       2 |  -2.481783
         |    0.0098*
         |
       3 |  -4.966011  -2.484227
         |    0.0000*    0.0130*
         |
       4 |  -3.866891  -1.385107   1.099120
         |    0.0002*     0.0996     0.1359

alpha = 0.05
Reject Ho if p <= alpha/2

dunn.test(binned_data$shares, binned_data$bin_keywords, method = "bh")

  Kruskal-Wallis rank sum test

data: x and group
Kruskal-Wallis chi-squared = 24.1065, df = 3, p-value = 0

                           Comparison of x by group                            
                             (Benjamini-Hochberg)                              
Col Mean-|
Row Mean |          1          2          3
---------+---------------------------------
       2 |  -3.276009
         |    0.0011*
         |
       3 |  -4.249648  -0.973638
         |    0.0001*     0.2477
         |
       4 |  -4.197739  -0.921729   0.051908
         |    0.0000*     0.2140     0.4793

alpha = 0.05
Reject Ho if p <= alpha/2

dunn.test(binned_data$shares, binned_data$bin_avgpos, method = "bh")

  Kruskal-Wallis rank sum test

data: x and group
Kruskal-Wallis chi-squared = 24.1703, df = 3, p-value = 0

                           Comparison of x by group                            
                             (Benjamini-Hochberg)                              
Col Mean-|
Row Mean |          1          2          3
---------+---------------------------------
       2 |  -1.822943
         |     0.0410
         |
       3 |  -4.035584  -2.212640
         |    0.0001*    0.0202*
         |
       4 |  -4.238114  -2.415170  -0.202529
         |    0.0001*    0.0157*     0.4198

alpha = 0.05
Reject Ho if p <= alpha/2
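To make the tables above easier to read, the adjusted p-values for bin_subjectivity can be reproduced from the printed z statistics with base R. The sketch assumes, consistent with the "Reject Ho if p <= alpha/2" line in the output, that the reported p-values are one-sided, P(Z >= |z|), and are then adjusted with the Benjamini-Hochberg method across the six comparisons.

```r
# z statistics copied from the bin_subjectivity table above
z <- c("1 vs 2" = -2.513559, "1 vs 3" = -5.153218, "2 vs 3" = -2.639659,
       "1 vs 4" = -7.226222, "2 vs 4" = -4.712663, "3 vs 4" = -2.073004)

# One-sided p-values, P(Z >= |z|)
p_raw <- pnorm(-abs(z))

# Benjamini-Hochberg adjustment across the six pairwise comparisons
p_adj <- p.adjust(p_raw, method = "BH")

# Decision rule printed in the output: reject if adjusted p <= alpha / 2
alpha <- 0.05
print(data.frame(z = z, p_adj = round(p_adj, 4), reject = p_adj <= alpha / 2))
```

The rounded values match the table (0.0072, 0.0062, and 0.0191 for the three weakest comparisons), which is why the asterisks correspond to a threshold of 0.025 rather than 0.05.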

Interpretation

The Kruskal-Wallis tests showed statistically significant differences in the number of article shares across the levels of each binned variable: bin_subjectivity, bin_sentiment, bin_posrate, bin_keywords, and bin_avgpos. Using the rejection rule printed in the output (adjusted p <= alpha/2 = 0.025, marked with an asterisk), the Dunn tests revealed that:

For bin_subjectivity, all six pairwise comparisons were significant, indicating that article share counts differed significantly between every pair of subjectivity quartiles.
For bin_sentiment, four of the six comparisons were significant. The two lower bins (1 and 2) each differed from the two upper bins (3 and 4), but bins 1 and 2 (p = 0.1124) and bins 3 and 4 (p = 0.0365) did not differ from each other, suggesting that articles with above-median sentiment polarity perform better.
For bin_posrate, bins 1, 2, and 3 all differed from one another and bin 1 differed from bin 4, but bin 4 did not differ significantly from bins 2 or 3.
For bin_keywords, only comparisons involving bin 1 were significant: articles in the lowest keyword quartile differed from those in bins 2, 3, and 4, while bins 2 through 4 did not differ from one another.
Lastly, for bin_avgpos, bins 1 and 2 each differed from bins 3 and 4, but the adjacent comparisons 1 versus 2 (p = 0.0410) and 3 versus 4 (p = 0.4198) were not significant.

These results suggest that the distribution of article shares is associated with the sentiment-related metrics. The negative test statistics in the significant comparisons indicate a consistent direction: higher bins (more subjectivity, more positive sentiment, higher rate of positive words, more keywords) are associated with higher article shares, although the gains level off in the upper bins for positive word rate and keyword count.

Limitations

Data is limited to Mashable articles between 2013 and 2015, which may not reflect current trends or generalize to other media platforms.
Platform algorithms influence user reactions and sharing behavior, so share counts may not fully capture genuine human responses.
When conducting statistical tests such as ANOVA on a dataset of this size, the test assumptions (normality and homogeneity of variance) were not perfectly satisfied, and even small differences can reach statistical significance.

Implications and Conclusion

Through our statistical analysis, we identified key variables associated with the shareability of articles. Using correlation analysis, ANOVA, and Kruskal-Wallis tests, we tested whether each variable was significantly related to our response variable of interest, share counts.

Our findings suggest that both visually dense articles and articles with no images can go viral, depending on the subject of the article. Articles published on the weekend are likely to get a higher share count on average than articles published on a weekday. In terms of categories, articles about social media are likely to perform better on average than articles in other categories (entertainment, business, lifestyle, world, and tech). However, articles that don’t fall into any of those six categories receive more shares on average than articles that do. Lastly, higher values related to tone and perspective, including more subjectivity, a more positive perspective, and a higher keyword count, are associated with higher article shares.

These findings underscore the importance of not only what is said in news articles or blog posts, but how and when it is said, which is crucial for content-makers, media groups, and marketing agencies, among others. With correct timing and carefully selected content, high shareability can lead to faster growth in platform popularity, number of subscribers, and greater engagement.