Mansi_Assignment_Data_Dive_Confidence

Part 1: Build at least three sets of variable combinations

For each set of variables, include at least one column that you created (i.e., calculated based on others)
All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., [‘small’, ‘medium’, ‘large’] is okay, but [“apples”, “oranges”, “bananas”] is not)
For each set, there should be one response variable with the others as explanatory variables

# Load the necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")

# A numeric summary of data for at least 10 columns
summary(data)

##      IDLink          Title             Headline            Source         
##  Min.   :     1   Length:93239       Length:93239       Length:93239      
##  1st Qu.: 24302   Class :character   Class :character   Class :character  
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 51561                                                           
##  3rd Qu.: 76586                                                           
##  Max.   :104802                                                           
##     Topic           PublishDate        SentimentTitle      SentimentHeadline 
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543  
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457  
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606  
##                                        Mean   :-0.005411   Mean   :-0.02749  
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971  
##                                        Max.   : 0.962354   Max.   : 0.96465  
##     Facebook         GooglePlus          LinkedIn       
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :    5.0   Median :   0.000   Median :    0.00  
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55  
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00  
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00

# Set the seed for reproducibility
set.seed(123)

# Create Set 1: Variable Combination
set1 <- data %>%
  select(SentimentTitle, SentimentHeadline, Facebook, GooglePlus, LinkedIn) %>%
  mutate(
    # Create a new calculated variable
    CombinedSentiment = SentimentTitle + SentimentHeadline,
    ResponseVariable = Facebook
  )

# Create Set 2: Variable Combination
set2 <- data %>%
  select(SentimentTitle, SentimentHeadline, Facebook, GooglePlus, LinkedIn) %>%
  mutate(
    # Create a new calculated variable
    TotalSocial = Facebook + GooglePlus + LinkedIn,
    ResponseVariable = SentimentTitle
  )

# Create Set 3: Variable Combination
set3 <- data %>%
  select(SentimentTitle, SentimentHeadline, Facebook, GooglePlus, LinkedIn) %>%
  mutate(
    # Create a new calculated variable
    TotalSentiment = SentimentTitle + SentimentHeadline,
    ResponseVariable = GooglePlus
  )

# Print the first few rows of each set
head(set1)

##   SentimentTitle SentimentHeadline Facebook GooglePlus LinkedIn
## 1     0.00000000       -0.05330018       -1         -1       -1
## 2     0.20833333       -0.15638581       -1         -1       -1
## 3    -0.42521003        0.13975425       -1         -1       -1
## 4     0.00000000        0.02606430       -1         -1       -1
## 5     0.00000000        0.14108446       -1         -1       -1
## 6    -0.07537784        0.03677279       -1         -1       -1
##   CombinedSentiment ResponseVariable
## 1       -0.05330018               -1
## 2        0.05194752               -1
## 3       -0.28545578               -1
## 4        0.02606430               -1
## 5        0.14108446               -1
## 6       -0.03860504               -1

head(set2)

##   SentimentTitle SentimentHeadline Facebook GooglePlus LinkedIn TotalSocial
## 1     0.00000000       -0.05330018       -1         -1       -1          -3
## 2     0.20833333       -0.15638581       -1         -1       -1          -3
## 3    -0.42521003        0.13975425       -1         -1       -1          -3
## 4     0.00000000        0.02606430       -1         -1       -1          -3
## 5     0.00000000        0.14108446       -1         -1       -1          -3
## 6    -0.07537784        0.03677279       -1         -1       -1          -3
##   ResponseVariable
## 1       0.00000000
## 2       0.20833333
## 3      -0.42521003
## 4       0.00000000
## 5       0.00000000
## 6      -0.07537784

head(set3)

##   SentimentTitle SentimentHeadline Facebook GooglePlus LinkedIn TotalSentiment
## 1     0.00000000       -0.05330018       -1         -1       -1    -0.05330018
## 2     0.20833333       -0.15638581       -1         -1       -1     0.05194752
## 3    -0.42521003        0.13975425       -1         -1       -1    -0.28545578
## 4     0.00000000        0.02606430       -1         -1       -1     0.02606430
## 5     0.00000000        0.14108446       -1         -1       -1     0.14108446
## 6    -0.07537784        0.03677279       -1         -1       -1    -0.03860504
##   ResponseVariable
## 1               -1
## 2               -1
## 3               -1
## 4               -1
## 5               -1
## 6               -1
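
The rows printed above all show -1 for Facebook, GooglePlus, and LinkedIn, which makes TotalSocial equal to -3. A share count cannot be negative, so -1 may be a placeholder for a missing measurement; that is an assumption, since this report does not say what -1 means. A minimal sketch of how such rows could be dropped before building a calculated column, using made-up toy data in place of the real file:

```r
library(dplyr)

# Toy data standing in for the real columns; the values are made up.
toy <- tibble(
  Facebook   = c(-1, 5, 33, 0),
  GooglePlus = c(-1, 0, 2, 1),
  LinkedIn   = c(-1, 0, 4, -1)
)

# Assumption: -1 marks a missing measurement, so keep only rows
# where all three counts are non-negative.
toy_clean <- toy %>%
  filter(if_all(c(Facebook, GooglePlus, LinkedIn), ~ .x >= 0)) %>%
  mutate(TotalSocial = Facebook + GooglePlus + LinkedIn)

print(toy_clean)
```

If -1 turns out to be a legitimate value in this dataset, the filter should be left out.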

Part 2: Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot

Use what we’ve covered so far in class to scrutinize the plot (e.g., are there any outliers?)

# Plot a visualization for each response-explanatory relationship
# Visualization for Set 1 (CombinedSentiment vs. Facebook)
plot(set1$CombinedSentiment, set1$Facebook, main = "CombinedSentiment vs. Facebook", 
     xlab = "CombinedSentiment", ylab = "Facebook", col = "blue")

# Visualization for Set 2 (TotalSocial vs. SentimentTitle)
plot(set2$TotalSocial, set2$SentimentTitle, main = "TotalSocial vs. SentimentTitle", 
     xlab = "TotalSocial", ylab = "SentimentTitle", col = "green")

# Visualization for Set 3 (TotalSentiment vs. GooglePlus)
plot(set3$TotalSentiment, set3$GooglePlus, main = "TotalSentiment vs. GooglePlus", 
     xlab = "TotalSentiment", ylab = "GooglePlus", col = "red")

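Based on the three plots above and the numeric summary, the share counts are strongly right-skewed and not uniformly distributed. Facebook has a median of 5 and a third quartile of 33 but a maximum of 49,211; GooglePlus (maximum 1,267) and LinkedIn (maximum 20,341) follow the same pattern, and TotalSocial in Set 2 inherits it. Values that far above the bulk of the data are outliers on the count axis. The sentiment scores, by contrast, stay between roughly -1 and 1 and show no comparably extreme points.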
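
Because the summary reports a Facebook maximum of 49,211 against a median of 5, a raw-scale scatter plot squeezes most points against the bottom axis. One way to inspect the counts more closely is to plot log(1 + count) instead. The sketch below uses made-up toy values, not the assignment data, and the variable names are illustrative only:

```r
# Toy values with one extreme count; made up for illustration.
sentiment <- c(-0.2, 0.0, 0.1, 0.3, -0.1)
shares    <- c(0, 5, 33, 49211, 2)

# log1p(x) computes log(1 + x), so a count of zero stays defined.
log_shares <- log1p(shares)

plot(sentiment, log_shares,
     main = "Sentiment vs. log(1 + shares)",
     xlab = "sentiment", ylab = "log(1 + shares)", col = "blue")
```

The transform only changes how the points are displayed; any negative placeholder values would need to be handled first, because log1p is undefined at -1.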

Part 3: Calculate the appropriate correlation coefficient for each of these combinations

Explain why the value makes sense (or doesn’t) based on the visualization(s)

# Calculate the appropriate correlation coefficient
correlation_set1 <- cor(set1$CombinedSentiment, set1$Facebook)
correlation_set2 <- cor(set2$TotalSocial, set2$SentimentTitle)
correlation_set3 <- cor(set3$TotalSentiment, set3$GooglePlus)

# Print the correlation coefficients
cat("Correlation Set 1:", correlation_set1, "\n")

## Correlation Set 1: -0.002122828

cat("Correlation Set 2:", correlation_set2, "\n")

## Correlation Set 2: -0.003068925

cat("Correlation Set 3:", correlation_set3, "\n")

## Correlation Set 3: -0.005385344
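
All three coefficients are close to zero, which indicates almost no linear relationship between sentiment and share counts. Note that cor() computes Pearson's coefficient by default, which measures linear association and is sensitive to extreme values. For skewed counts, a rank-based coefficient such as Spearman's is a common alternative; whether it is the more appropriate choice here is a judgment call this report does not settle. A small sketch on made-up data showing how the two can differ:

```r
# Toy data: y increases with x, but far from linearly.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 8, 16, 1000)

pearson  <- cor(x, y)                       # default method is "pearson"
spearman <- cor(x, y, method = "spearman")  # Pearson correlation of the ranks

cat("Pearson: ", pearson, "\n")
cat("Spearman:", spearman, "\n")
```

Spearman's coefficient is exactly 1 here because the ranks of x and y agree, while Pearson's is pulled below 1 by the single large value.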

Part 4: Build a confidence interval for each of the response variables. Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.

# Build confidence intervals for the response variables
# Confidence interval for Set 1 (Facebook)
conf_interval_set1 <- t.test(set1$Facebook)$conf.int
cat("Confidence Interval for Facebook (Set 1):", conf_interval_set1, "\n")

## Confidence Interval for Facebook (Set 1): 109.1606 117.1221

# Confidence interval for Set 2 (SentimentTitle)
conf_interval_set2 <- t.test(set2$SentimentTitle)$conf.int
cat("Confidence Interval for SentimentTitle (Set 2):", conf_interval_set2, "\n")

## Confidence Interval for SentimentTitle (Set 2): -0.006287091 -0.00453564

# Confidence interval for Set 3 (GooglePlus)
conf_interval_set3 <- t.test(set3$GooglePlus)$conf.int
cat("Confidence Interval for GooglePlus (Set 3):", conf_interval_set3, "\n")

## Confidence Interval for GooglePlus (Set 3): 3.769661 4.007063

A confidence interval gives a range of plausible values for a population parameter, here the population mean. t.test uses a 95% confidence level by default, so each interval comes from a procedure that captures the true mean in about 95% of repeated samples. The code above builds one interval for the response variable in each set:

Confidence Interval for Facebook (Set 1): At the 95% level, the true mean of Facebook engagement is estimated to lie between about 109.16 and 117.12. The interval is narrow relative to the sample mean of 113.1 because the sample is large (93,239 rows). It describes the precision of the mean estimate, not the spread of individual articles.

Confidence Interval for SentimentTitle (Set 2): At the same level, the true mean of SentimentTitle is estimated to lie between about -0.0063 and -0.0045. The interval excludes zero, so the mean title sentiment is slightly negative, although the magnitude is very small.

Confidence Interval for GooglePlus (Set 3): The true mean of GooglePlus engagement is estimated to lie between about 3.77 and 4.01. The whole interval sits well above the median of 0, which is consistent with a right-skewed distribution in which a few heavily shared articles pull the mean upward.
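
The interval returned by t.test can be reproduced by hand as the sample mean plus or minus a t critical value times the standard error. The sketch below checks this on a small made-up vector; the variable names are illustrative and not part of the assignment code:

```r
# Toy sample; values are made up for illustration.
x <- c(0, 5, 33, 2, 120, 7, 1, 0, 14, 60)

n      <- length(x)
se     <- sd(x) / sqrt(n)        # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # two-sided 95% critical value

# Mean plus or minus the margin of error.
manual_ci  <- mean(x) + c(-1, 1) * t_crit * se

# as.numeric() drops the conf.level attribute for a plain comparison.
builtin_ci <- as.numeric(t.test(x)$conf.int)

print(manual_ci)
print(builtin_ci)
```

Both lines print the same pair of endpoints. Because the standard error shrinks with the square root of n, the intervals for the full dataset are narrow even though the counts themselves vary widely.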

Mansi_Assignment_Data_Dive_Confidence_Intervals

2023-09-04
