R Markdown

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 and readr are already attached by tidyverse; stats is attached by default
library(pwr)
library(broom)

# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")

# A numeric summary of data for at least 10 columns
summary(data)
##      IDLink          Title             Headline            Source         
##  Min.   :     1   Length:93239       Length:93239       Length:93239      
##  1st Qu.: 24302   Class :character   Class :character   Class :character  
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 51561                                                           
##  3rd Qu.: 76586                                                           
##  Max.   :104802                                                           
##     Topic           PublishDate        SentimentTitle      SentimentHeadline 
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543  
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457  
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606  
##                                        Mean   :-0.005411   Mean   :-0.02749  
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971  
##                                        Max.   : 0.962354   Max.   : 0.96465  
##     Facebook         GooglePlus          LinkedIn       
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :    5.0   Median :   0.000   Median :    0.00  
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55  
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00  
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00
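
The minimum of -1 in the Facebook, GooglePlus and LinkedIn columns cannot be a real share count. The sketch below shows one way to handle it, under the assumption (not confirmed by the summary itself) that -1 is a placeholder for a missing value. The toy data frame and the name `popularity_cols` are hypothetical and only stand in for the real data.

```r
# Toy stand-in for the real data; the values are made up for illustration
toy <- data.frame(
  SentimentTitle = c(-0.08, 0.00, 0.06, 0.10),
  Facebook       = c(-1, 0, 5, 33),
  GooglePlus     = c(0, -1, 2, 0),
  LinkedIn       = c(0, 4, -1, 16)
)

# Hypothetical helper vector naming the share-count columns
popularity_cols <- c("Facebook", "GooglePlus", "LinkedIn")

# ASSUMPTION: -1 marks a missing share count, so recode it to NA
toy_clean <- toy
toy_clean[popularity_cols] <- lapply(
  toy[popularity_cols],
  function(x) replace(x, x %in% -1, NA)
)

summary(toy_clean[popularity_cols])
```

If the assumption holds, `lm()` would then drop the affected rows under its default `na.action` instead of treating -1 as a genuine count.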

Part 1:

Build a linear (or generalized linear) model as you like

# Build a linear regression model
lm_model <- lm(SentimentHeadline ~ SentimentTitle + Facebook + GooglePlus + LinkedIn, data = data)

# Display the summary of the model
summary(lm_model)
## 
## Call:
## lm(formula = SentimentHeadline ~ SentimentTitle + Facebook + 
##     GooglePlus + LinkedIn, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75140 -0.08520  0.00256  0.08649  0.95233 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.660e-02  4.699e-04 -56.603   <2e-16 ***
## SentimentTitle  1.910e-01  3.350e-03  57.016   <2e-16 ***
## Facebook       -1.178e-07  8.590e-07  -0.137    0.891    
## GooglePlus      2.044e-05  2.980e-05   0.686    0.493    
## LinkedIn        4.252e-06  3.078e-06   1.381    0.167    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1396 on 93234 degrees of freedom
## Multiple R-squared:  0.03373,    Adjusted R-squared:  0.03368 
## F-statistic: 813.5 on 4 and 93234 DF,  p-value: < 2.2e-16
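
The broom package is loaded at the top but not used. As a rough sketch, the same information printed by `summary()` can be pulled into data frames with `tidy()` and `glance()`, which makes the coefficients easier to filter or plot. The example uses the built-in `mtcars` data and a hypothetical `toy_fit` model in place of `lm_model`.

```r
library(broom)

# Toy model on the built-in mtcars data, standing in for lm_model
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

# One row per coefficient: term, estimate, std.error, statistic, p.value
coef_table <- tidy(toy_fit)

# One-row summary of the whole fit, including r.squared and adj.r.squared
fit_stats <- glance(toy_fit)

print(coef_table)
print(fit_stats[, c("r.squared", "adj.r.squared", "sigma")])
```

Calling `tidy(lm_model)` and `glance(lm_model)` on the model above should return the same estimates and R-squared values that `summary(lm_model)` prints.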

Part 2:

Use the tools from previous weeks to diagnose the model

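# Diagnostic plots (residuals vs fitted, Q-Q, scale-location, residuals vs leverage)
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1))  # restore a single-panel layout for the plots below

# Residual analysis
model_residuals <- residuals(lm_model)

# Plotting residuals against observation index
plot(model_residuals)

# Q-Q plot
qqnorm(model_residuals)
qqline(model_residuals)

# Scale-Location plot, using standardized residuals that account for leverage
sqrt_abs_std_residuals <- sqrt(abs(rstandard(lm_model)))
plot(fitted(lm_model), sqrt_abs_std_residuals)

# Leverage plot
plot(hatvalues(lm_model))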

We diagnose lm_model with the standard diagnostic plots from plot(lm_model), a plot of the raw residuals, a Q-Q plot, a scale-location plot, and a leverage plot.

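The numeric output above already indicates what to look for in these plots:

  1. Heavy-tailed residuals: the residuals range from -0.75 to 0.95 against a residual standard error of 0.1396, so the extremes lie more than five standard errors from zero. The Q-Q plot should be checked for departures from the reference line in both tails.

  2. High-leverage observations: the share counts are strongly right-skewed (Facebook has a median of 5 but a maximum of 49,211), so a small number of articles can sit far from the rest in predictor space and stand out in the leverage plot.

  3. Low explanatory power: the multiple R-squared of 0.034 means the model explains about 3.4% of the variance in SentimentHeadline.

Issues of this kind do not necessarily bias the coefficient estimates, but heavy tails and non-constant variance make the reported standard errors and p-values less reliable, and a few high-leverage points can dominate the fitted slopes for the share-count variables. With so little variance explained, predictions from the model will also be imprecise.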
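
The leverage plot can be complemented with a numeric check. The following is a minimal sketch on simulated data, not the author's analysis: it flags observations whose hat value exceeds 2p/n, a common rule of thumb (the cutoff is a convention, and the names `toy_fit` and `high_leverage` are hypothetical).

```r
set.seed(1)

# Simulated data: 199 ordinary points plus one extreme predictor value,
# mimicking a heavily right-skewed share count
n <- 200
x <- c(rexp(n - 1), 50)
y <- 0.2 * x + rnorm(n)

toy_fit <- lm(y ~ x)

# Hat values measure how far each observation sits from the bulk of the predictors
h <- hatvalues(toy_fit)
p <- length(coef(toy_fit))  # number of estimated coefficients

# Rule-of-thumb cutoff: leverage above 2p/n is worth inspecting
high_leverage <- which(h > 2 * p / n)

# Cook's distance combines leverage and residual size into one influence measure
influence <- cooks.distance(toy_fit)

print(high_leverage)
print(head(sort(influence, decreasing = TRUE), 3))
```

Applied to `lm_model`, the same calls would list the articles worth inspecting by hand before deciding whether to transform the share counts or refit without them.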

Part 3:

Interpret at least one of the coefficients

Answer:

In the above results, the coefficient estimates for the predictor variables are as follows:

  1. SentimentTitle (Estimate: 0.1910, Std. Error: 0.003350): A one-unit increase in “SentimentTitle” is associated with a 0.1910-unit increase in “SentimentHeadline”, holding the other predictors constant. Because the sentiment scores in this data lie between roughly -1 and 1, a more practical reading is that a 0.1 increase in title sentiment is associated with a 0.0191 increase in headline sentiment. The p-value (< 2e-16) shows that the coefficient is statistically significant, although the model as a whole explains only about 3.4% of the variance in “SentimentHeadline” (multiple R-squared of 0.03373).

  2. Facebook (Estimate: -1.178e-07, Std. Error: 8.59e-07): The coefficient is very close to zero: each additional Facebook share is associated with a change of -1.178e-07 in “SentimentHeadline”, holding the other predictors constant. The estimate is much smaller than its standard error and the p-value is high (0.891), so the coefficient is not statistically distinguishable from zero and “Facebook” contributes little to the model’s predictive ability.

  3. GooglePlus (Estimate: 2.044e-05, Std. Error: 2.98e-05): The coefficient for the variable “GooglePlus” is 2.044e-05. This implies that a one-unit increase in the “GooglePlus” variable corresponds to a 2.044e-05 increase in the “SentimentHeadline” response variable, holding other variables constant. The relatively high p-value (0.493) suggests that the variable “GooglePlus” is not statistically significant in predicting the “SentimentHeadline.”

  4. LinkedIn (Estimate: 4.252e-06, Std. Error: 3.078e-06): The coefficient for the variable “LinkedIn” is 4.252e-06. This indicates that a one-unit increase in the “LinkedIn” variable is associated with a 4.252e-06 increase in the “SentimentHeadline” response variable, holding other variables constant. The p-value (0.167) is above conventional significance levels, so there is no evidence that “LinkedIn” adds to the predictive power of the model.
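
As a worked check on the first interpretation, the estimate and standard error for SentimentTitle can be turned into a t value and an approximate 95% confidence interval. The numbers are copied from the summary output above; the variable names are illustrative.

```r
# Values copied from the coefficient table for SentimentTitle
estimate  <- 0.1910
std_error <- 0.003350
df_resid  <- 93234      # residual degrees of freedom from the summary

# t value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided 95% interval: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(t_value, 2))
print(round(ci, 4))
```

The interval comes out at roughly 0.184 to 0.198, well away from zero, which is consistent with the very small p-value. Running `confint(lm_model)` should give the same interval from the fitted model directly, up to rounding.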