# Load the necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")
# A numeric summary of data for at least 10 columns
summary(data)
## IDLink Title Headline Source
## Min. : 1 Length:93239 Length:93239 Length:93239
## 1st Qu.: 24302 Class :character Class :character Class :character
## Median : 52275 Mode :character Mode :character Mode :character
## Mean : 51561
## 3rd Qu.: 76586
## Max. :104802
## Topic PublishDate SentimentTitle SentimentHeadline
## Length:93239 Length:93239 Min. :-0.950694 Min. :-0.75543
## Class :character Class :character 1st Qu.:-0.079057 1st Qu.:-0.11457
## Mode :character Mode :character Median : 0.000000 Median :-0.02606
## Mean :-0.005411 Mean :-0.02749
## 3rd Qu.: 0.064255 3rd Qu.: 0.05971
## Max. : 0.962354 Max. : 0.96465
## Facebook GooglePlus LinkedIn
## Min. : -1.0 Min. : -1.000 Min. : -1.00
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 5.0 Median : 0.000 Median : 0.00
## Mean : 113.1 Mean : 3.888 Mean : 16.55
## 3rd Qu.: 33.0 3rd Qu.: 2.000 3rd Qu.: 4.00
## Max. :49211.0 Max. :1267.000 Max. :20341.00
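The minimum of -1 for Facebook, GooglePlus, and LinkedIn in the summary above is not a plausible share count, which suggests it may be a placeholder for unavailable values. That reading is an assumption and should be checked against the dataset documentation. A minimal sketch, on a made-up data frame (`toy` is hypothetical, not the real data), of how such a sentinel could be counted and recoded to `NA` before summarising:

```r
# Hypothetical toy data standing in for the three popularity columns.
toy <- data.frame(
  Facebook   = c(-1, 0, 5, 33, 120),
  GooglePlus = c(-1, -1, 0, 2, 7),
  LinkedIn   = c(-1, 0, 0, 4, 16)
)

# Count how many values in each column equal the assumed sentinel -1.
sentinel_counts <- colSums(toy == -1)
print(sentinel_counts)

# Recode -1 to NA so the placeholder is not treated as a real count.
toy_clean <- toy
toy_clean[toy_clean == -1] <- NA

# Column means with the recoded values excluded.
clean_means <- colMeans(toy_clean, na.rm = TRUE)
print(clean_means)
```

If the placeholder reading holds for the real data, the same recoding would have to be applied before computing sums such as TotalSocial, since three placeholders would otherwise add up to -3.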
# Set the seed for reproducibility
set.seed(123)
# Set 1: response = Facebook, explanatory = CombinedSentiment
set1 <- data %>%
  select(SentimentTitle, SentimentHeadline, Facebook, GooglePlus, LinkedIn) %>%
  mutate(
    # New calculated variable: sum of the two sentiment scores
    CombinedSentiment = SentimentTitle + SentimentHeadline,
    ResponseVariable = Facebook
  )
# Set 2: response = SentimentTitle, explanatory = TotalSocial
set2 <- data %>%
  select(SentimentTitle, SentimentHeadline, Facebook, GooglePlus, LinkedIn) %>%
  mutate(
    # New calculated variable: total count across the three platforms
    TotalSocial = Facebook + GooglePlus + LinkedIn,
    ResponseVariable = SentimentTitle
  )
# Set 3: response = GooglePlus, explanatory = TotalSentiment
set3 <- data %>%
  select(SentimentTitle, SentimentHeadline, Facebook, GooglePlus, LinkedIn) %>%
  mutate(
    # New calculated variable (same sum as CombinedSentiment in Set 1)
    TotalSentiment = SentimentTitle + SentimentHeadline,
    ResponseVariable = GooglePlus
  )
# Print the first few rows of each set
head(set1)
## SentimentTitle SentimentHeadline Facebook GooglePlus LinkedIn
## 1 0.00000000 -0.05330018 -1 -1 -1
## 2 0.20833333 -0.15638581 -1 -1 -1
## 3 -0.42521003 0.13975425 -1 -1 -1
## 4 0.00000000 0.02606430 -1 -1 -1
## 5 0.00000000 0.14108446 -1 -1 -1
## 6 -0.07537784 0.03677279 -1 -1 -1
## CombinedSentiment ResponseVariable
## 1 -0.05330018 -1
## 2 0.05194752 -1
## 3 -0.28545578 -1
## 4 0.02606430 -1
## 5 0.14108446 -1
## 6 -0.03860504 -1
head(set2)
## SentimentTitle SentimentHeadline Facebook GooglePlus LinkedIn TotalSocial
## 1 0.00000000 -0.05330018 -1 -1 -1 -3
## 2 0.20833333 -0.15638581 -1 -1 -1 -3
## 3 -0.42521003 0.13975425 -1 -1 -1 -3
## 4 0.00000000 0.02606430 -1 -1 -1 -3
## 5 0.00000000 0.14108446 -1 -1 -1 -3
## 6 -0.07537784 0.03677279 -1 -1 -1 -3
## ResponseVariable
## 1 0.00000000
## 2 0.20833333
## 3 -0.42521003
## 4 0.00000000
## 5 0.00000000
## 6 -0.07537784
head(set3)
## SentimentTitle SentimentHeadline Facebook GooglePlus LinkedIn TotalSentiment
## 1 0.00000000 -0.05330018 -1 -1 -1 -0.05330018
## 2 0.20833333 -0.15638581 -1 -1 -1 0.05194752
## 3 -0.42521003 0.13975425 -1 -1 -1 -0.28545578
## 4 0.00000000 0.02606430 -1 -1 -1 0.02606430
## 5 0.00000000 0.14108446 -1 -1 -1 0.14108446
## 6 -0.07537784 0.03677279 -1 -1 -1 -0.03860504
## ResponseVariable
## 1 -1
## 2 -1
## 3 -1
## 4 -1
## 5 -1
## 6 -1
# Plot a visualization for each response-explanatory relationship
# Visualization for Set 1 (CombinedSentiment vs. Facebook)
plot(set1$CombinedSentiment, set1$Facebook, main = "CombinedSentiment vs. Facebook",
     xlab = "CombinedSentiment", ylab = "Facebook", col = "blue")
# Visualization for Set 2 (TotalSocial vs. SentimentTitle)
plot(set2$TotalSocial, set2$SentimentTitle, main = "TotalSocial vs. SentimentTitle",
     xlab = "TotalSocial", ylab = "SentimentTitle", col = "green")
# Visualization for Set 3 (TotalSentiment vs. GooglePlus)
plot(set3$TotalSentiment, set3$GooglePlus, main = "TotalSentiment vs. GooglePlus",
     xlab = "TotalSentiment", ylab = "GooglePlus", col = "red")
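Based on my analysis of the above 3 plots, no single point sticks out as an isolated outlier. The engagement counts are not uniformly distributed, however. The numeric summary shows a Facebook median of 5 against a maximum of 49,211, a GooglePlus median of 0 against a maximum of 1,267, and a LinkedIn median of 0 against a maximum of 20,341. The counts are therefore heavily right-skewed, and the largest values deserve a closer look before being treated as ordinary observations.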
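A visual check for outliers can be backed by a numeric rule. The sketch below applies Tukey's 1.5 × IQR rule to simulated right-skewed counts. The vector `counts` and the seed are made up for illustration and are not taken from the dataset, so the flagged number says nothing about the real columns.

```r
set.seed(1)  # arbitrary seed, used only for this illustration

# Simulated right-skewed counts; NOT the real Facebook column.
counts <- round(rlnorm(1000, meanlog = 2, sdlog = 1.5))

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
q          <- quantile(counts, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(counts)
fence_high <- q[[2]] + 1.5 * IQR(counts)
is_outlier <- counts < fence_low | counts > fence_high

cat("Flagged", sum(is_outlier), "of", length(counts), "values\n")
```

For skewed counts this rule tends to flag much of the upper tail, so plotting `log1p(counts)` instead of the raw values is a common way to make the bulk of the data visible alongside the extremes.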
# Calculate the Pearson correlation coefficient (the default method of cor())
correlation_set1 <- cor(set1$CombinedSentiment, set1$Facebook)
correlation_set2 <- cor(set2$TotalSocial, set2$SentimentTitle)
correlation_set3 <- cor(set3$TotalSentiment, set3$GooglePlus)
# Print the correlation coefficients
cat("Correlation Set 1:", correlation_set1, "\n")
## Correlation Set 1: -0.002122828
cat("Correlation Set 2:", correlation_set2, "\n")
## Correlation Set 2: -0.003068925
cat("Correlation Set 3:", correlation_set3, "\n")
## Correlation Set 3: -0.005385344
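`cor()` computes the Pearson coefficient by default, which measures linear association. For heavily skewed counts it can be worth comparing it with the rank-based Spearman coefficient, and `cor.test()` adds a p-value and a confidence interval. The sketch below uses simulated stand-ins (`sentiment` and `shares` are hypothetical names, not columns of the dataset).

```r
set.seed(42)  # arbitrary seed, used only for this illustration

# Simulated stand-ins: a sentiment-like score and a skewed count that
# depends on it monotonically but not linearly. NOT the real data.
sentiment <- rnorm(500, mean = 0, sd = 0.1)
shares    <- round(exp(3 + 10 * sentiment + rnorm(500)))

# Pearson (the default in cor()) measures linear association.
r_pearson <- cor(sentiment, shares, method = "pearson")

# Spearman works on ranks, so it is less sensitive to the long right tail.
r_spearman <- cor(sentiment, shares, method = "spearman")

cat("Pearson: ", r_pearson, "\n")
cat("Spearman:", r_spearman, "\n")

# cor.test() adds a p-value and, for Pearson, a confidence interval.
pearson_test <- cor.test(sentiment, shares)
print(pearson_test$conf.int)
```

If the two coefficients differ noticeably on the real data, that would suggest a monotonic but non-linear relationship. If both are near zero, as the Pearson values above are, the conclusion of little association is more robust.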
# Build 95% confidence intervals (the t.test() default) for the mean of each response variable
# Confidence interval for Set 1 (Facebook)
conf_interval_set1 <- t.test(set1$Facebook)$conf.int
cat("Confidence Interval for Facebook (Set 1):", conf_interval_set1, "\n")
## Confidence Interval for Facebook (Set 1): 109.1606 117.1221
# Confidence interval for Set 2 (SentimentTitle)
conf_interval_set2 <- t.test(set2$SentimentTitle)$conf.int
cat("Confidence Interval for SentimentTitle (Set 2):", conf_interval_set2, "\n")
## Confidence Interval for SentimentTitle (Set 2): -0.006287091 -0.00453564
# Confidence interval for Set 3 (GooglePlus)
conf_interval_set3 <- t.test(set3$GooglePlus)$conf.int
cat("Confidence Interval for GooglePlus (Set 3):", conf_interval_set3, "\n")
## Confidence Interval for GooglePlus (Set 3): 3.769661 4.007063
Confidence intervals provide a range of values within which the population parameter (in this case, the population mean) is likely to fall. The above code builds a 95% confidence interval, the default level of t.test(), for the mean of the response variable in each set:
Confidence Interval for Facebook (Set 1): 109.16 to 117.12. This interval estimates the range within which the true mean of Facebook engagement is likely to lie. Its width gives a sense of how precise the sample mean is as an estimate.
Confidence Interval for SentimentTitle (Set 2): -0.0063 to -0.0045. This interval estimates the range for the true mean of SentimentTitle. It is narrow and lies entirely below zero, so the average title sentiment is slightly negative.
Confidence Interval for GooglePlus (Set 3): 3.77 to 4.01. This interval estimates the true mean of GooglePlus engagement. As with the others, it indicates the likely range of the population mean.
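The interval returned by `t.test()` can be reproduced by hand as the sample mean plus or minus a t critical value times the standard error. The sketch below does this on a simulated sample (`x` is made up and is not one of the response variables) and compares the result with `t.test()`.

```r
set.seed(7)  # arbitrary seed, used only for this illustration

# Simulated sample; NOT one of the real response variables.
x <- rexp(200, rate = 1 / 50)

n       <- length(x)
x_bar   <- mean(x)
std_err <- sd(x) / sqrt(n)           # standard error of the mean
t_crit  <- qt(0.975, df = n - 1)     # two-sided 95% critical value

# Manual interval: mean +/- critical value * standard error.
manual_ci <- c(x_bar - t_crit * std_err, x_bar + t_crit * std_err)

# Interval reported by t.test() at its default 95% level.
ttest_ci <- as.numeric(t.test(x)$conf.int)

print(manual_ci)
print(ttest_ci)
```

Because the standard error shrinks with the square root of the sample size, a sample of 93,239 rows yields narrow intervals even when individual values vary widely.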