R Markdown


Part 1:

Work with the data and summarize it as follows:

  - A numeric summary of at least 10 columns of data.
  - For categorical columns, this should include unique values and counts.
  - For numeric columns, this includes min/max, central tendency, and some notion of distribution (e.g., quantiles).
  - These summaries can be combined.

# Load the necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")

# A numeric summary of data for at least 10 columns
summary(data)
##      IDLink          Title             Headline            Source         
##  Min.   :     1   Length:93239       Length:93239       Length:93239      
##  1st Qu.: 24302   Class :character   Class :character   Class :character  
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 51561                                                           
##  3rd Qu.: 76586                                                           
##  Max.   :104802                                                           
##     Topic           PublishDate        SentimentTitle      SentimentHeadline 
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543  
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457  
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606  
##                                        Mean   :-0.005411   Mean   :-0.02749  
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971  
##                                        Max.   : 0.962354   Max.   : 0.96465  
##     Facebook         GooglePlus          LinkedIn       
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :    5.0   Median :   0.000   Median :    0.00  
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55  
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00  
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00
str(data)
## 'data.frame':    93239 obs. of  11 variables:
##  $ IDLink           : num  99248 10423 18828 27788 27789 ...
##  $ Title            : chr  "Obama Lays Wreath at Arlington National Cemetery" "A Look at the Health of the Chinese Economy" "Nouriel Roubini: Global Economy Not Back to 2008" "Finland GDP Expands In Q4" ...
##  $ Headline         : chr  "Obama Lays Wreath at Arlington National Cemetery. President Barack Obama has laid a wreath at the Tomb of the U"| __truncated__ "Tim Haywood, investment director business-unit head for fixed income at Gam, discusses the China beige book and"| __truncated__ "Nouriel Roubini, NYU professor and chairman at Roubini Global Economics, explains why the global economy isn't "| __truncated__ "Finland's economy expanded marginally in the three months ended December, after contracting in the previous qua"| __truncated__ ...
##  $ Source           : chr  "USA TODAY" "Bloomberg" "Bloomberg" "RTT News" ...
##  $ Topic            : chr  "obama" "economy" "economy" "economy" ...
##  $ PublishDate      : chr  "2002-04-02 00:00:00" "2008-09-20 00:00:00" "2012-01-28 00:00:00" "2015-03-01 00:06:00" ...
##  $ SentimentTitle   : num  0 0.208 -0.425 0 0 ...
##  $ SentimentHeadline: num  -0.0533 -0.1564 0.1398 0.0261 0.1411 ...
##  $ Facebook         : int  -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
##  $ GooglePlus       : int  -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
##  $ LinkedIn         : int  -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
data_types <- sapply(data, class)
print(data_types)
##            IDLink             Title          Headline            Source 
##         "numeric"       "character"       "character"       "character" 
##             Topic       PublishDate    SentimentTitle SentimentHeadline 
##       "character"       "character"         "numeric"         "numeric" 
##          Facebook        GooglePlus          LinkedIn 
##         "integer"         "integer"         "integer"
numeric_columns <- sapply(data, is.numeric)
categorical_columns <- sapply(data, function(x) is.factor(x) || is.character(x))

# Create summaries for numeric columns
numeric_summary <- data %>%
  select_if(numeric_columns) %>%
  summarise_at(vars(everything()),
               list(
                 Min = ~ min(.),
                 Max = ~ max(.),
                 Mean = ~ mean(.),
                 Median = ~ median(.),
                 Q1 = ~ quantile(., 0.25),
                 Q3 = ~ quantile(., 0.75)
               ))

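# Create summaries for categorical columns
# Unique_Values = number of distinct values in each column;
# Counts = n(), the total row count, which is identical for every column
categorical_summary <- data %>%
  select_if(categorical_columns) %>%
  summarise_all(list(Unique_Values = ~ length(unique(.)),
                     Counts = ~ n()))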

# Combine the summaries
combined_summary <- bind_cols(numeric_summary, categorical_summary)

# Print the combined summary
print(combined_summary)
##   IDLink_Min SentimentTitle_Min SentimentHeadline_Min Facebook_Min
## 1          1         -0.9506944            -0.7554334           -1
##   GooglePlus_Min LinkedIn_Min IDLink_Max SentimentTitle_Max
## 1             -1           -1     104802          0.9623536
##   SentimentHeadline_Max Facebook_Max GooglePlus_Max LinkedIn_Max IDLink_Mean
## 1             0.9646462        49211           1267        20341    51560.65
##   SentimentTitle_Mean SentimentHeadline_Mean Facebook_Mean GooglePlus_Mean
## 1        -0.005411366            -0.02749305      113.1413        3.888362
##   LinkedIn_Mean IDLink_Median SentimentTitle_Median SentimentHeadline_Median
## 1      16.54796         52275                     0               -0.0260643
##   Facebook_Median GooglePlus_Median LinkedIn_Median IDLink_Q1 SentimentTitle_Q1
## 1               5                 0               0   24301.5       -0.07905694
##   SentimentHeadline_Q1 Facebook_Q1 GooglePlus_Q1 LinkedIn_Q1 IDLink_Q3
## 1           -0.1145743           0             0           0   76585.5
##   SentimentTitle_Q3 SentimentHeadline_Q3 Facebook_Q3 GooglePlus_Q3 LinkedIn_Q3
## 1        0.06425521            0.0597091          33             2           4
##   Title_Unique_Values Headline_Unique_Values Source_Unique_Values
## 1               81259                  86695                 5757
##   Topic_Unique_Values PublishDate_Unique_Values Title_Counts Headline_Counts
## 1                   4                     82644        93239           93239
##   Source_Counts Topic_Counts PublishDate_Counts
## 1         93239        93239              93239
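The `Counts` columns above report the total number of rows, so they do not show how often each individual value occurs. A minimal sketch of per-value counts follows, using `dplyr::count()`. The `toy` tibble and its values are made up so the block runs on its own; with the real data, `data` would take its place.

```r
library(dplyr)

# Toy rows for illustration only; these values are NOT taken from the dataset.
toy <- tibble(
  Topic  = c("economy", "obama", "economy", "obama", "economy"),
  Source = c("Bloomberg", "USA TODAY", "Bloomberg", "USA TODAY", "RTT News")
)

# Count how often each distinct Topic occurs, most frequent first
topic_counts <- toy %>%
  count(Topic, sort = TRUE, name = "Count")
print(topic_counts)

# For columns with many distinct values (such as Source),
# keep only the most frequent entries
top_sources <- toy %>%
  count(Source, sort = TRUE, name = "Count") %>%
  slice_head(n = 2)
print(top_sources)
```

For a column like `Source`, which has thousands of distinct values, limiting the table to the top entries keeps the summary readable.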

Part 2:

Novel questions:

  1. Which social media platform(s) have the highest average sentiment score (SentimentTitle) for news articles?
  2. What is the distribution of sentiment scores for the “SentimentTitle” column?
  3. Are there any correlations between “Facebook” shares and “GooglePlus” shares?
  4. How does the sentiment in the “SentimentHeadline” column vary by “Topic”?
  5. Does the length of a news article (measured in words) significantly affect its popularity on social media platforms?
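Questions 3 to 5 can be approached with a few lines of R. The sketch below is one possible starting point, not a finished analysis. It assumes a Spearman rank correlation is a reasonable choice for skewed share counts, and it uses the number of words in `Title` as a stand-in for article length, since the dataset has no full article text. The `toy` tibble uses made-up numbers so the block runs on its own.

```r
library(dplyr)

# Toy rows for illustration only; the numeric values are NOT from the dataset.
toy <- tibble(
  Title = c("Finland GDP Expands In Q4",
            "A Look at the Health of the Chinese Economy",
            "Obama Lays Wreath at Arlington National Cemetery"),
  Topic = c("economy", "economy", "obama"),
  SentimentHeadline = c(0.03, -0.16, -0.05),
  Facebook   = c(2L, 10L, 40L),
  GooglePlus = c(0L, 1L, 6L)
)

# Question 3: rank correlation between Facebook and GooglePlus shares
# (ASSUMPTION: Spearman is used because share counts are heavily skewed)
fb_gp_cor <- cor(toy$Facebook, toy$GooglePlus, method = "spearman")

# Question 4: headline sentiment by topic
sentiment_by_topic <- toy %>%
  group_by(Topic) %>%
  summarise(Mean_Sentiment = mean(SentimentHeadline), Articles = n())

# Question 5: title length in words as a proxy for article length
# (ASSUMPTION: words are separated by whitespace)
toy <- toy %>%
  mutate(Title_Words = lengths(strsplit(trimws(Title), "\\s+")))

print(fb_gp_cor)
print(sentiment_by_topic)
print(select(toy, Title_Words, Facebook))
```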

Dataset: News Popularity in Multiple Social Media Platforms

A large data set of news items and their respective social feedback on multiple platforms: Facebook, Google+, and LinkedIn.

Column Summaries: The dataset has 11 columns.

  1. IDLink: A unique identifier for each news article.
  2. Title: The title of the news article.
  3. Headline: The headline or main content of the news article.
  4. Source: The source or publication where the news article was published.
  5. Topic: The topic or category of the news article.
  6. PublishDate: The date and time when the news article was published.
  7. SentimentTitle: Sentiment score associated with the title of the news article.
  8. SentimentHeadline: Sentiment score associated with the headline of the news article.
  9. Facebook: The number of shares on Facebook.
  10. GooglePlus: The number of shares on GooglePlus.
  11. LinkedIn: The number of shares on LinkedIn.
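The summary output in Part 1 shows a minimum of -1 for Facebook, GooglePlus, and LinkedIn, which cannot be a real share count. The sketch below assumes, without confirmation from the dataset documentation, that -1 is a placeholder for "no value recorded" and recodes it to `NA` before summarizing. The `toy` tibble is made up so the block runs on its own.

```r
library(dplyr)

# Toy rows for illustration only; these values are NOT taken from the dataset.
toy <- tibble(
  Facebook   = c(-1L, 0L, 5L, 33L),
  GooglePlus = c(-1L, 0L, 0L, 2L),
  LinkedIn   = c(-1L, 0L, 4L, -1L)
)

# ASSUMPTION: -1 marks a missing value rather than a real share count
cleaned <- toy %>%
  mutate(across(c(Facebook, GooglePlus, LinkedIn), ~ na_if(.x, -1L)))

# Number of recoded values per column
missing_per_column <- colSums(is.na(cleaned))
print(missing_per_column)

# Summaries now need na.rm = TRUE to skip the recoded values
mean_shares <- cleaned %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
print(mean_shares)
```

If -1 turns out to carry a different meaning, the recoding step should be dropped or adjusted; the point is only that the placeholder pulls every mean and quantile downward when left in.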

Project’s Goal/Purpose:

The goal is to identify the key factors that influence the popularity of news articles on different social media platforms (Facebook, Google+, LinkedIn). More broadly, the project aims to provide insight into how news content performs on social media, helping publishers and content creators optimize their strategies for better engagement and reach.

Part 3:

Use aggregation functions (other than the ones used for the first bullet, above) to explore something interesting about the data.

Example: Find the trend in “GooglePlus” shares over time.

# Aggregate mean "GooglePlus" shares by PublishDate to look for a trend over time.
# Note: PublishDate is a character timestamp, so each group is one exact publish time.

trend_googleplus <- data %>%
  group_by(PublishDate) %>%
  summarise(Mean_GooglePlus = mean(GooglePlus))

print(trend_googleplus)
## # A tibble: 82,644 × 2
##    PublishDate         Mean_GooglePlus
##    <chr>                         <dbl>
##  1 2002-04-02 00:00:00              -1
##  2 2008-09-20 00:00:00              -1
##  3 2012-01-28 00:00:00              -1
##  4 2015-03-01 00:06:00              -1
##  5 2015-03-01 00:11:00              -1
##  6 2015-03-01 00:19:00              -1
##  7 2015-03-01 00:45:00              -1
##  8 2015-03-01 01:20:00              -1
##  9 2015-03-01 01:32:00              -1
## 10 2015-03-01 02:14:00              -1
## # ℹ 82,634 more rows
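Grouping by the raw timestamp yields 82,644 groups, nearly one per article, so the result is hard to read as a trend. A possible refinement is to parse `PublishDate` and aggregate by calendar month, sketched below. The time zone (`UTC`) and the treatment of -1 as a missing value are assumptions, and the `toy` tibble uses made-up share counts so the block runs on its own.

```r
library(dplyr)

# Toy rows for illustration only; the share counts are NOT from the dataset.
toy <- tibble(
  PublishDate = c("2015-03-01 00:06:00", "2015-03-01 00:11:00",
                  "2015-03-15 09:30:00", "2015-04-02 12:00:00"),
  GooglePlus  = c(-1L, 3L, 5L, 2L)
)

monthly_googleplus <- toy %>%
  mutate(
    # ASSUMPTION: timestamps are in UTC
    PublishDate = as.POSIXct(PublishDate, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    Month = format(PublishDate, "%Y-%m"),
    # ASSUMPTION: -1 marks a missing value rather than a real share count
    GooglePlus = na_if(GooglePlus, -1L)
  ) %>%
  group_by(Month) %>%
  summarise(
    Mean_GooglePlus = mean(GooglePlus, na.rm = TRUE),
    Articles = n()
  )

print(monthly_googleplus)
```

Reporting the number of articles next to the mean makes it easier to spot months where the average rests on very few rows.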

Part 4:

A visual summary of at least 5 columns of the data that satisfies the following:

  - It should include distributions, at a minimum.
  - In addition, it should consider trends, correlations, and interactions between variables.
  - It should use different channels (e.g., color) to show how categorical variables interact with continuous variables.
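One way to meet these requirements with `ggplot2` is sketched below: a histogram for a distribution, a boxplot for a categorical-continuous interaction, and a scatter plot for a correlation, each using color to encode `Topic`. The plot choices and the `log1p()` scaling are illustrative assumptions, not the plots originally produced for this report, and the `toy` data frame is made up so the block runs on its own.

```r
library(ggplot2)

# Toy rows for illustration only; these values are NOT taken from the dataset.
toy <- data.frame(
  Topic             = c("economy", "economy", "obama", "obama", "economy", "obama"),
  SentimentTitle    = c(0, 0.21, -0.43, 0.06, -0.08, 0.10),
  SentimentHeadline = c(-0.05, -0.16, 0.14, 0.03, 0.14, -0.11),
  Facebook          = c(0, 5, 33, 120, 2, 800),
  GooglePlus        = c(0, 0, 2, 9, 1, 40)
)

# Distribution: title sentiment, with fill color showing the topic
p_dist <- ggplot(toy, aes(x = SentimentTitle, fill = Topic)) +
  geom_histogram(bins = 10, alpha = 0.6, position = "identity") +
  labs(title = "Distribution of SentimentTitle by Topic")

# Interaction: headline sentiment across topics
p_box <- ggplot(toy, aes(x = Topic, y = SentimentHeadline, fill = Topic)) +
  geom_boxplot() +
  labs(title = "SentimentHeadline by Topic")

# Correlation: Facebook vs. GooglePlus shares, colored by topic.
# log1p() compresses the long right tail while keeping zero counts defined.
p_scatter <- ggplot(toy, aes(x = log1p(Facebook), y = log1p(GooglePlus),
                             colour = Topic)) +
  geom_point(alpha = 0.7) +
  labs(title = "Facebook vs. GooglePlus shares (log1p scale)")

print(p_dist)
print(p_box)
print(p_scatter)
```

Together the three plots cover five columns: Topic, SentimentTitle, SentimentHeadline, Facebook, and GooglePlus.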

## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## Loading required package: RColorBrewer
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
