Work with the data and summarize it as follows:

- A numeric summary of at least 10 columns of data
- For categorical columns, this should include unique values and counts
- For numeric columns, this includes min/max, central tendency, and some notion of distribution (e.g., quantiles)
- These summaries can be combined.
# Load the necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")
# A numeric summary of data for at least 10 columns
summary(data)
##      IDLink          Title             Headline            Source
##  Min.   :     1   Length:93239       Length:93239       Length:93239
##  1st Qu.: 24302   Class :character   Class :character   Class :character
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 51561
##  3rd Qu.: 76586
##  Max.   :104802
##     Topic           PublishDate        SentimentTitle      SentimentHeadline
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606
##                                        Mean   :-0.005411   Mean   :-0.02749
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971
##                                        Max.   : 0.962354   Max.   : 0.96465
##     Facebook         GooglePlus          LinkedIn
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00
##  Median :    5.0   Median :   0.000   Median :    0.00
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00
str(data)
## 'data.frame': 93239 obs. of 11 variables:
## $ IDLink : num 99248 10423 18828 27788 27789 ...
## $ Title : chr "Obama Lays Wreath at Arlington National Cemetery" "A Look at the Health of the Chinese Economy" "Nouriel Roubini: Global Economy Not Back to 2008" "Finland GDP Expands In Q4" ...
## $ Headline : chr "Obama Lays Wreath at Arlington National Cemetery. President Barack Obama has laid a wreath at the Tomb of the U"| __truncated__ "Tim Haywood, investment director business-unit head for fixed income at Gam, discusses the China beige book and"| __truncated__ "Nouriel Roubini, NYU professor and chairman at Roubini Global Economics, explains why the global economy isn't "| __truncated__ "Finland's economy expanded marginally in the three months ended December, after contracting in the previous qua"| __truncated__ ...
## $ Source : chr "USA TODAY" "Bloomberg" "Bloomberg" "RTT News" ...
## $ Topic : chr "obama" "economy" "economy" "economy" ...
## $ PublishDate : chr "2002-04-02 00:00:00" "2008-09-20 00:00:00" "2012-01-28 00:00:00" "2015-03-01 00:06:00" ...
## $ SentimentTitle : num 0 0.208 -0.425 0 0 ...
## $ SentimentHeadline: num -0.0533 -0.1564 0.1398 0.0261 0.1411 ...
## $ Facebook : int -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
## $ GooglePlus : int -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
## $ LinkedIn : int -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
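The `summary()` and `str()` output above shows a minimum of -1 in `Facebook`, `GooglePlus`, and `LinkedIn`, which cannot be a real share count. One plausible reading, which is an assumption and not something this report confirms, is that -1 marks articles for which no popularity value was collected. The sketch below runs on a small made-up data frame (`toy` is a hypothetical stand-in for `data`). It counts the -1 values per platform and recodes them to `NA` so that they do not pull down means and quantiles.

```r
library(dplyr)

# Toy stand-in for `data`; the values are invented for illustration only.
toy <- data.frame(
  Facebook   = c(-1L, 0L, 5L, 33L, -1L),
  GooglePlus = c(-1L, 0L, 2L, 7L, 0L),
  LinkedIn   = c(-1L, -1L, 4L, 10L, 0L)
)

platforms <- c("Facebook", "GooglePlus", "LinkedIn")

# How many rows per platform carry the value -1?
sentinel_counts <- toy %>%
  summarise(across(all_of(platforms), ~ sum(.x == -1)))

# Assumption: -1 means "not collected", so treat it as missing.
toy_clean <- toy %>%
  mutate(across(all_of(platforms), ~ na_if(.x, -1L)))

print(sentinel_counts)
print(colMeans(toy_clean[platforms], na.rm = TRUE))
```

If the -1 values turn out to be meaningful in the real data, the recoding step should be skipped and only the counts reported.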
data_types <- sapply(data, class)
print(data_types)
##            IDLink             Title          Headline            Source
##         "numeric"       "character"       "character"       "character"
##             Topic       PublishDate    SentimentTitle SentimentHeadline
##       "character"       "character"         "numeric"         "numeric"
##          Facebook        GooglePlus          LinkedIn
##         "integer"         "integer"         "integer"
numeric_columns <- sapply(data, is.numeric)
categorical_columns <- sapply(data, function(x) is.factor(x) || is.character(x))
# Create summaries for numeric columns
numeric_summary <- data %>%
  select_if(numeric_columns) %>%
  summarise_at(vars(everything()),
               list(
                 Min = ~ min(.),
                 Max = ~ max(.),
                 Mean = ~ mean(.),
                 Median = ~ median(.),
                 Q1 = ~ quantile(., 0.25),
                 Q3 = ~ quantile(., 0.75)
               ))
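# Create summaries for categorical columns
# Unique_Values is the number of distinct values in each column;
# Counts is the total number of rows (n()), not a per-value frequency
categorical_summary <- data %>%
  select_if(categorical_columns) %>%
  summarise_all(list(Unique_Values = ~ length(unique(.)),
                     Counts = ~ n()))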
# Combine the summaries
combined_summary <- bind_cols(numeric_summary, categorical_summary)
# Print the combined summary
print(combined_summary)
##   IDLink_Min SentimentTitle_Min SentimentHeadline_Min Facebook_Min
## 1          1         -0.9506944            -0.7554334           -1
##   GooglePlus_Min LinkedIn_Min IDLink_Max SentimentTitle_Max
## 1             -1           -1     104802          0.9623536
##   SentimentHeadline_Max Facebook_Max GooglePlus_Max LinkedIn_Max IDLink_Mean
## 1             0.9646462        49211           1267        20341    51560.65
##   SentimentTitle_Mean SentimentHeadline_Mean Facebook_Mean GooglePlus_Mean
## 1        -0.005411366            -0.02749305      113.1413        3.888362
##   LinkedIn_Mean IDLink_Median SentimentTitle_Median SentimentHeadline_Median
## 1      16.54796         52275                     0               -0.0260643
##   Facebook_Median GooglePlus_Median LinkedIn_Median IDLink_Q1 SentimentTitle_Q1
## 1               5                 0               0   24301.5       -0.07905694
##   SentimentHeadline_Q1 Facebook_Q1 GooglePlus_Q1 LinkedIn_Q1 IDLink_Q3
## 1           -0.1145743           0             0           0   76585.5
##   SentimentTitle_Q3 SentimentHeadline_Q3 Facebook_Q3 GooglePlus_Q3 LinkedIn_Q3
## 1        0.06425521            0.0597091          33             2           4
##   Title_Unique_Values Headline_Unique_Values Source_Unique_Values
## 1               81259                  86695                 5757
##   Topic_Unique_Values PublishDate_Unique_Values Title_Counts Headline_Counts
## 1                   4                     82644        93239           93239
##   Source_Counts Topic_Counts PublishDate_Counts
## 1         93239        93239              93239
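The combined summary above is a single row with one column per statistic, which is hard to read, and its `*_Counts` columns all repeat the total row count (93239) instead of giving a count per category. The sketch below shows one possible alternative on a small made-up data frame (`toy` is a hypothetical stand-in for `data`). It builds a long-format numeric summary with one row per column, and a frequency table for one categorical column. It uses `across()` and `pivot_longer()` in place of `summarise_at()`, which is a substitution and not the approach used above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `data`; the values are invented for illustration only.
toy <- data.frame(
  Topic          = c("economy", "obama", "economy", "microsoft"),
  SentimentTitle = c(0.0, 0.2, -0.4, 0.1),
  Facebook       = c(5, 33, 0, 12)
)

# One row per numeric column, one column per statistic.
numeric_long <- toy %>%
  summarise(across(
    where(is.numeric),
    list(
      Min    = ~ min(.x),
      Q1     = ~ quantile(.x, 0.25),
      Median = ~ median(.x),
      Mean   = ~ mean(.x),
      Q3     = ~ quantile(.x, 0.75),
      Max    = ~ max(.x)
    ),
    .names = "{.col}__{.fn}"
  )) %>%
  pivot_longer(
    everything(),
    names_to = c("Column", ".value"),
    names_sep = "__"
  )

# Frequency of each value of a categorical column, most common first.
topic_counts <- toy %>%
  count(Topic, sort = TRUE, name = "Count")

print(numeric_long)
print(topic_counts)
```

On the real data, a per-value table like `topic_counts` is most useful for low-cardinality columns such as `Topic` (4 unique values). For `Title` or `Headline`, where almost every value is unique, the number of distinct values is usually enough.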
The goal of this project is to identify the key factors that influence the popularity of news articles on different social media platforms (Facebook, Google+, LinkedIn). Overall, it aims to provide insights into how news content performs on social media, helping publishers and content creators optimize their strategies for better engagement and reach.
Use aggregation functions (other than those used for the first bullet above) to explore something interesting about the data.
Example: Find the trend in “GooglePlus” shares over time.
# Aggregation: mean "GooglePlus" shares for each distinct PublishDate value
trend_googleplus <- data %>%
  group_by(PublishDate) %>%
  summarise(Mean_GooglePlus = mean(GooglePlus))
print(trend_googleplus)
## # A tibble: 82,644 × 2
##    PublishDate         Mean_GooglePlus
##    <chr>                         <dbl>
##  1 2002-04-02 00:00:00              -1
##  2 2008-09-20 00:00:00              -1
##  3 2012-01-28 00:00:00              -1
##  4 2015-03-01 00:06:00              -1
##  5 2015-03-01 00:11:00              -1
##  6 2015-03-01 00:19:00              -1
##  7 2015-03-01 00:45:00              -1
##  8 2015-03-01 01:20:00              -1
##  9 2015-03-01 01:32:00              -1
## 10 2015-03-01 02:14:00              -1
## # ℹ 82,634 more rows
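Grouping by the raw `PublishDate` string produces 82,644 groups for 93,239 rows, so almost every group is a single article and the result does not show a trend. The sketch below is one way to coarsen the grouping. It parses the timestamp, rounds it down to the month, and averages within each month. It runs on a small made-up data frame (`toy` is a hypothetical stand-in for `data`) and drops the -1 values on the assumption that they mark missing measurements.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for `data`; the values are invented for illustration only.
toy <- data.frame(
  PublishDate = c("2015-03-01 00:06:00", "2015-03-01 02:14:00",
                  "2015-03-15 10:00:00", "2015-04-02 08:30:00",
                  "2015-04-20 21:45:00"),
  GooglePlus  = c(-1L, 2L, 4L, 10L, 0L)
)

monthly_googleplus <- toy %>%
  mutate(
    PublishDate = ymd_hms(PublishDate),                # character -> date-time
    Month = floor_date(PublishDate, unit = "month")    # first instant of the month
  ) %>%
  filter(GooglePlus >= 0) %>%                          # assumption: -1 is "not collected"
  group_by(Month) %>%
  summarise(
    Articles          = n(),
    Mean_GooglePlus   = mean(GooglePlus),
    Median_GooglePlus = median(GooglePlus)
  )

print(monthly_googleplus)
```

Reporting the article count next to the mean makes it easier to spot months where the average rests on only a handful of articles.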
A visual summary of at least 5 columns of the data, following these guidelines:

- Include distributions at a minimum
- Also consider trends, correlations, and interactions between variables
- Use different channels (e.g., color) to show how categorical variables interact with continuous variables
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
## Loading required package: RColorBrewer
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
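The plotting code for this section is not shown in the output above. As a minimal sketch of the kind of plots the task describes, and not a reconstruction of the original chunks, the example below draws a distribution and a scatter plot with `ggplot2`, using color to separate topics. The data frame `toy` is randomly generated and only borrows the column names `Topic`, `SentimentTitle`, and `Facebook` from the data.

```r
library(ggplot2)

# Toy stand-in for `data`; all values are randomly generated for illustration.
set.seed(1)
toy <- data.frame(
  Topic          = sample(c("economy", "obama"), 200, replace = TRUE),
  SentimentTitle = runif(200, min = -1, max = 1),
  Facebook       = rpois(200, lambda = 30)
)

# Distribution of shares per topic; adding 1 keeps zero counts on the log axis.
p_dist <- ggplot(toy, aes(x = Facebook + 1, fill = Topic)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  scale_x_log10() +
  labs(x = "Facebook shares + 1 (log scale)", y = "Articles")

# Sentiment against shares, with colour as the channel for the topic.
p_scatter <- ggplot(toy, aes(x = SentimentTitle, y = Facebook + 1, colour = Topic)) +
  geom_point(alpha = 0.5) +
  scale_y_log10() +
  labs(x = "Title sentiment", y = "Facebook shares + 1 (log scale)")

print(p_dist)
print(p_scatter)
```

A log scale is a reasonable choice here because the real share counts are heavily skewed (for `Facebook`, the median is 5 and the maximum is 49211). The same pattern extends to `GooglePlus`, `LinkedIn`, and `SentimentHeadline` to cover at least five columns.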