Project Title: News Popularity in Social Media: Insights through Statistical Analysis

Potential Key Customers:

  1. Publishers and Media Houses: Companies involved in content creation and publication, aiming to understand the factors driving social media popularity for their articles.
  2. Content Creators and Bloggers: Individuals or groups generating content for online platforms and seeking to maximize engagement and reach.
  3. Social Media Marketing Agencies: Firms specializing in social media marketing, interested in leveraging insights to enhance their clients’ content performance.
  4. Digital Marketing Departments: Within corporations or businesses, teams focused on content marketing and social media strategies.

Potential Business Goals:

  1. Enhanced Content Strategy: Optimizing article titles, headlines, and content topics to boost engagement and shares across multiple social media platforms.
  2. Increased Audience Engagement: Attracting a larger audience by understanding and catering to their preferences in news content.
  3. Improved Content Marketing ROI: By leveraging insights to tailor content, aiming to achieve higher visibility and interaction for every piece published.
  4. Effective Social Media Campaigns: Designing targeted campaigns based on data-driven content performance metrics for better audience resonance.

Key Questions Addressed:

  1. What Factors Drive Social Media Popularity? The analysis identifies influential elements affecting news articles’ popularity across Facebook, Google+, and LinkedIn.
  2. How Does Sentiment Impact Engagement? Understanding the correlation between sentiment in titles/headlines and social media shares provides insights into audience preferences.
  3. Trends and Seasonal Patterns: Recognizing temporal patterns in social media activity unveils potential trends aligned with events or audience behaviors.
  4. Optimization Strategies: Insights derived from the data allow tailoring content creation and marketing strategies for improved social media engagement.

Project’s Goal/Purpose:

To identify and infer the key factors that influence the popularity of news articles on different social media platforms (Facebook, Google+, LinkedIn). Overall, the project aims to provide insights into news content performance in the context of social media, helping publishers and content creators optimize their strategies for better engagement and reach.

Dataset: News Popularity in Multiple Social Media Platforms

A large dataset of news items and their social feedback on three platforms: Facebook, Google+, and LinkedIn.

Column Summaries: The dataset has 11 columns.

  1. IDLink: A unique identifier for each news article.
  2. Title: The title of the news article.
  3. Headline: The headline or main content of the news article.
  4. Source: The source or publication where the news article was published.
  5. Topic: The topic or category of the news article.
  6. PublishDate: The date and time when the news article was published.
  7. SentimentTitle: Sentiment score associated with the title of the news article.
  8. SentimentHeadline: Sentiment score associated with the headline of the news article.
  9. Facebook: The number of shares on Facebook.
  10. GooglePlus: The number of shares on GooglePlus.
  11. LinkedIn: The number of shares on LinkedIn.
library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(pwr)
library(stats)
library(readr)

library(broom)

# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")

# Numeric summary of all 11 columns
summary(data)
##      IDLink          Title             Headline            Source         
##  Min.   :     1   Length:93239       Length:93239       Length:93239      
##  1st Qu.: 24302   Class :character   Class :character   Class :character  
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 51561                                                           
##  3rd Qu.: 76586                                                           
##  Max.   :104802                                                           
##     Topic           PublishDate        SentimentTitle      SentimentHeadline 
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543  
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457  
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606  
##                                        Mean   :-0.005411   Mean   :-0.02749  
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971  
##                                        Max.   : 0.962354   Max.   : 0.96465  
##     Facebook         GooglePlus          LinkedIn       
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :    5.0   Median :   0.000   Median :    0.00  
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55  
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00  
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00
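
The summary reports a minimum of -1 for all three share columns. Share counts cannot be negative, so -1 is most likely a placeholder for a missing count; that reading is an assumption and should be checked against the dataset documentation. Under that assumption, a minimal sketch of recoding the placeholder before any analysis might look like this (the helper name and the toy data are illustrative, not part of the original analysis):

```r
# Hypothetical helper: recode the -1 placeholder as NA in the share columns.
# Assumes -1 means "no count available"; verify this against the dataset notes.
recode_missing_shares <- function(df, cols = c("Facebook", "GooglePlus", "LinkedIn")) {
  for (col in cols) {
    # which() skips existing NAs so the assignment is always well defined
    df[[col]][which(df[[col]] < 0)] <- NA
  }
  df
}

# Toy data standing in for the real file
toy <- data.frame(
  Facebook   = c(-1, 0, 5, 33),
  GooglePlus = c(0, -1, 2, 0),
  LinkedIn   = c(-1, -1, 4, 0)
)
clean <- recode_missing_shares(toy)

# Number of values recoded to NA in each column
colSums(is.na(clean))
```

Leaving the placeholder in would pull means and test statistics downward, so recoding it first keeps later summaries interpretable.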

Visual summary of dataset

library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(ggplot2)

# Create a word cloud for headlines
wordcloud(data$Headline, scale=c(3, 0.5), max.words=100, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

# Create a bar chart for topics
ggplot(data, aes(x = Topic)) +
  geom_bar(fill = "blue") +
  labs(title = "Distribution of Topics") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Description:

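The word cloud shows the 100 most frequent words across article headlines, and the bar chart shows how many articles fall under each topic. Together they summarize the content and topic mix of the dataset; they do not show correlations or interactions between variables.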
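
A word cloud conveys relative frequency only approximately, so a plain frequency table can be a useful companion. The sketch below counts words with base R on a few made-up headlines; the helper name, the toy headlines, and the very short stop-word list are all assumptions for illustration and are not the cleaning that `wordcloud()` performs internally.

```r
# Hypothetical helper: count the most frequent words in a character vector.
# The stop-word list is a tiny illustrative set, not tm::stopwords().
top_words <- function(text, n = 5,
                      stop_words = c("the", "a", "an", "of", "to", "in", "and", "on", "for")) {
  cleaned <- gsub("[^a-z ]", " ", tolower(text))  # keep letters and spaces only
  words <- unlist(strsplit(cleaned, " +"))
  words <- words[nchar(words) > 0 & !(words %in% stop_words)]
  head(sort(table(words), decreasing = TRUE), n)
}

# Toy headlines standing in for data$Headline
toy_headlines <- c(
  "Economy grows as the economy recovers",
  "Obama speaks on the economy",
  "Microsoft and the economy of software"
)
res <- top_words(toy_headlines, n = 3)
res
```

On the real data the same call would be `top_words(data$Headline, n = 20)`, assuming the column has been read as character.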

Hypothesis Testing

Perform hypothesis testing on sentiment scores and social media popularity

Null Hypothesis: Sentiment scores do not significantly affect news articles’ popularity on social media.

Alternative Hypothesis: Sentiment scores have a significant impact on news articles’ popularity on social media.

library(stats)

# Assume SentimentTitle and SentimentHeadline as the variables of interest
sentiment_title <- data$SentimentTitle
sentiment_headline <- data$SentimentHeadline

# Welch two-sample t-test comparing the mean of SentimentTitle with the mean of Facebook shares
t_test_facebook <- t.test(sentiment_title, data$Facebook)

# Check the results of the t-test
print("T-Test Results for SentimentTitle vs. Facebook Shares:")
## [1] "T-Test Results for SentimentTitle vs. Facebook Shares:"
print(t_test_facebook)
## 
##  Welch Two Sample t-test
## 
## data:  sentiment_title and data$Facebook
## t = -55.709, df = 93238, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -117.1275 -109.1660
## sample estimates:
##     mean of x     mean of y 
##  -0.005411366 113.141335707
library(ggplot2)

# Sample data for illustration
set.seed(123)
data_s <- data.frame(Facebook = sample(1:10, 100, replace = TRUE),
                   SentimentTitle = rnorm(100))

# Calculate the mean sentiment score for each share count (1 to 10 in this synthetic sample)
mean_sentiment <- tapply(data_s$SentimentTitle, data_s$Facebook, mean)

# Create a data frame for visualization
viz_data <- data.frame(Facebook_Shares = as.numeric(names(mean_sentiment)), 
                       Mean_Sentiment = unname(mean_sentiment))

# Plotting: label only the first, middle, and last share counts on the x-axis
share_levels <- levels(factor(viz_data$Facebook_Shares))
label_breaks <- share_levels[c(1, round(length(share_levels) / 2), length(share_levels))]

ggplot(viz_data, aes(x = factor(Facebook_Shares), y = Mean_Sentiment)) +
  geom_bar(stat = "identity", fill = "skyblue", width = 0.5) +
  labs(x = "Facebook Shares", y = "Mean Sentiment Score", title = "Mean Sentiment Scores vs. Facebook Shares") +
  theme_minimal() +
  scale_x_discrete(breaks = label_breaks)

Description:

The test strongly rejects the hypothesis of equal means (t = -55.709, df = 93238, p-value < 2.2e-16). Note, however, what is being compared: the Welch t-test sets the mean title sentiment (-0.0054) against the mean number of Facebook shares (113.14), two variables measured on different scales.

The 95% confidence interval for the difference in means (-117.1275, -109.1660) therefore reflects that average share counts are far larger than average sentiment scores. It does not show that articles with more negative titles receive more shares.

In conclusion, this test alone cannot confirm or rule out a relationship between title sentiment and Facebook engagement. Establishing that relationship requires a measure of association between the two variables, such as a correlation or a regression.
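
Whether sentiment is associated with shares is a question about how the two variables move together across articles. One option, swapped in here purely as an illustration and not the test the report ran, is a Spearman rank correlation, which copes with heavily skewed share counts. The sketch uses synthetic data with a negative dependence built in; the column names mirror the dataset, but the values are simulated, so the numbers it prints say nothing about the real articles.

```r
# Synthetic stand-in for the real columns (simulated values, illustrative only)
set.seed(42)
n <- 500
toy <- data.frame(SentimentTitle = rnorm(n, mean = 0, sd = 0.15))
# Skewed share counts with a negative dependence on sentiment built in
toy$Facebook <- rpois(n, lambda = exp(3 - 2 * toy$SentimentTitle))

# Spearman rank correlation: tests for a monotonic association and is
# insensitive to the long right tail of share counts.
# exact = FALSE avoids the exact-p-value warning caused by tied counts.
rho_test <- cor.test(toy$SentimentTitle, toy$Facebook,
                     method = "spearman", exact = FALSE)
rho_test$estimate
rho_test$p.value

# Group comparison: median shares for negative vs. non-negative titles
toy$NegativeTitle <- toy$SentimentTitle < 0
group_medians <- aggregate(Facebook ~ NegativeTitle, data = toy, FUN = median)
group_medians
```

On the real data the call would be `cor.test(data$SentimentTitle, data$Facebook, method = "spearman", exact = FALSE)`, ideally after any placeholder share values have been handled.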

Analysis of Facebook Activity: Volatility, Trend, and Seasonality

library(lubridate)
library(anytime)
## Warning: package 'anytime' was built under R version 4.3.2
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.2
## 
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
## 
##     interval
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
data$PublishDate <- as.POSIXct(data$PublishDate, format = "%Y-%m-%d %H:%M:%S")

# Drop rows that share a publish timestamp, keeping the first occurrence
if (any(duplicated(data$PublishDate))) {
  print("Duplicated rows found. Removing duplicates.")
  data <- data[!duplicated(data$PublishDate), ]
}
## [1] "Duplicated rows found. Removing duplicates."
# Take the response only after de-duplication so it stays aligned with the dates
response_data <- data$Facebook

# Drop rows with a missing response
if (anyNA(response_data)) {
  print("Missing values found in 'response_data'. Handling missing values.")
  data <- data[!is.na(response_data), ]
  response_data <- data$Facebook
}

sample_data <- data.frame(
  Date = as.Date(data$PublishDate),
  Facebook = response_data
)

# Remove rows with missing values in the 'Date' column
sample_data <- sample_data[complete.cases(sample_data$Date), ]

missing_values <- sum(is.na(sample_data$Facebook))
if (missing_values > 0) {
  print("Missing values found in 'Facebook'. Handling missing values.")
  sample_data <- sample_data[complete.cases(sample_data$Facebook), ]
}

str(sample_data)
## 'data.frame':    82636 obs. of  2 variables:
##  $ Date    : Date, format: "2002-04-02" "2008-09-20" ...
##  $ Facebook: int  -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
duplicated_rows <- sample_data[duplicated(sample_data$Date), ]

# Print duplicated rows, if any
if (nrow(duplicated_rows) > 0) {
  print("Duplicated rows found. Removing duplicates.")
  # Remove duplicates based on the 'Date' column
  sample_data <- sample_data[!duplicated(sample_data$Date), ]
}
## [1] "Duplicated rows found. Removing duplicates."
# Create a tsibble object
ts_data <- tsibble::tsibble(Date = sample_data$Date, Facebook = sample_data$Facebook)
## Using `Date` as index variable.
# Keep observations from 1 January 2015 onward
filtered_data <- subset(ts_data, Date >= as.Date("2015-01-01"))

# Plot the filtered data over time
ggplot(filtered_data, aes(x = Date, y = Facebook)) +
  geom_line() +
  labs(title = "Facebook Activity Over Time (From 2015)",
       x = "Date",
       y = "Facebook Activity") +
  theme_minimal() +
  scale_x_date(date_labels = "%Y-%m-%d", date_breaks = "3 months") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Description:

The most striking feature of the plot above is an upward trend in Facebook activity from 2015 onward. One plausible explanation is growing use of Facebook for sharing news over the period, although the data alone cannot confirm the cause.

  • Trends: By plotting the Facebook activity over time, we can see if there is an overall upward or downward trend. This could be due to a number of factors, such as the increasing popularity of Facebook, changes in the way people use Facebook, or major world events.

  • Seasonality: We can also look for seasonal patterns in Facebook activity. For example, we might see that Facebook activity is higher during the holidays or during certain times of the year. This could be due to people having more free time during these times or because there are more events happening that people are interested in.

  • Anomalies: A sharp spike in Facebook activity appears on 2/1/2016. One possible driver is news coverage of President Obama’s visit to Illinois on February 10th, but the dates do not line up, so this attribution remains tentative.

Overall, the time series gives a useful overview of how Facebook sharing of news articles has changed over time. It does not, on its own, show which types of content are most engaging.
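
Trend and seasonality are easier to judge on daily totals than on one article per date. The following is a minimal sketch on synthetic article-level data, assuming a data frame with `Date` and `Facebook` columns: it sums shares per day, smooths them with a centered 7-day moving average, and compares mean totals by day of the week. The simulated values and the choice of a 7-day window are assumptions for illustration.

```r
# Synthetic article-level data: several articles per day over eight weeks
set.seed(1)
dates <- seq(as.Date("2016-01-04"), by = "day", length.out = 56)
articles <- data.frame(
  Date     = sample(dates, 400, replace = TRUE),
  Facebook = rpois(400, lambda = 20)
)

# Daily totals keep every article instead of only the first one per date
daily <- aggregate(Facebook ~ Date, data = articles, FUN = sum)

# Fill in any calendar days without articles so the series has no gaps
daily <- merge(data.frame(Date = dates), daily, all.x = TRUE)
daily$Facebook[is.na(daily$Facebook)] <- 0

# Trend: centered 7-day moving average (stats::filter is named explicitly
# because dplyr::filter masks it when the tidyverse is attached)
daily$Trend7 <- as.numeric(stats::filter(daily$Facebook, rep(1 / 7, 7), sides = 2))

# Weekly seasonality: mean daily total by day of week
daily$Weekday <- format(daily$Date, "%u")  # "1" = Monday ... "7" = Sunday
weekday_means <- tapply(daily$Facebook, daily$Weekday, mean)
weekday_means
```

The first and last three values of `Trend7` are `NA` because a centered window needs three days on either side.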

Conclusion

Insights Uncovered:

  • Sentiment Impact: Mean title sentiment and mean Facebook shares differ significantly, but a link between negative titles and higher shares still needs a direct test of association.
  • Facebook Activity: Clear upward trend indicates increasing popularity over time.

Business Opportunities:

  • Optimize Content: Tailor titles for enhanced engagement on social media platforms.
  • Maximize Audience Reach: Understand preferences to attract larger audiences.
  • Refine Marketing Strategies: Utilize data-driven insights for effective campaigns.

Future Steps:

  • Continuous Analysis: Monitor trends and sentiments for evolving content strategies.
  • Adaptation: Stay responsive to changing audience behaviors and preferences.