R Markdown

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(pwr)
library(stats)
library(readr)
library(broom)

# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")

# A numeric summary of data for at least 10 columns
summary(data)

##      IDLink          Title             Headline            Source         
##  Min.   :     1   Length:93239       Length:93239       Length:93239      
##  1st Qu.: 24302   Class :character   Class :character   Class :character  
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 51561                                                           
##  3rd Qu.: 76586                                                           
##  Max.   :104802                                                           
##     Topic           PublishDate        SentimentTitle      SentimentHeadline 
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543  
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457  
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606  
##                                        Mean   :-0.005411   Mean   :-0.02749  
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971  
##                                        Max.   : 0.962354   Max.   : 0.96465  
##     Facebook         GooglePlus          LinkedIn       
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :    5.0   Median :   0.000   Median :    0.00  
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55  
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00  
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00
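
The summary shows a minimum of -1 for Facebook, GooglePlus, and LinkedIn, which cannot be a real share count. One plausible reading, assumed here rather than confirmed by anything in this report, is that -1 marks articles whose popularity was not recorded. A minimal sketch of counting those values and recoding them to NA before modeling, using a small made-up data frame (`toy` and its values are hypothetical) in place of the real file:

```r
# Hypothetical stand-in for the real data; only the column names match the summary.
toy <- data.frame(
  Facebook   = c(-1, 0, 5, 33, 120),
  GooglePlus = c(-1, -1, 0, 2, 7),
  LinkedIn   = c(0, -1, 0, 4, 16)
)

share_cols <- c("Facebook", "GooglePlus", "LinkedIn")

# Count the -1 entries per platform.
sentinel_counts <- colSums(toy[share_cols] == -1)
print(sentinel_counts)

# Assumption: -1 means "not recorded", so recode it to NA.
toy_clean <- toy
toy_clean[share_cols] <- lapply(
  toy_clean[share_cols],
  function(x) replace(x, x == -1, NA)
)

# lm() drops rows containing NA by default, so these rows would be
# excluded from a model fitted on toy_clean.
print(colSums(is.na(toy_clean[share_cols])))
```

If the -1 values do turn out to be genuine codes with another meaning, the recoding step should be skipped; the counting step is still a cheap check before fitting.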

Part 1:

Build a linear (or generalized linear) model as you like

Use whatever response variable and explanatory variables you prefer

# Build a linear regression model
lm_model <- lm(SentimentHeadline ~ SentimentTitle + Facebook + GooglePlus + LinkedIn, data = data)

# Display the summary of the model
summary(lm_model)

## 
## Call:
## lm(formula = SentimentHeadline ~ SentimentTitle + Facebook + 
##     GooglePlus + LinkedIn, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75140 -0.08520  0.00256  0.08649  0.95233 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.660e-02  4.699e-04 -56.603   <2e-16 ***
## SentimentTitle  1.910e-01  3.350e-03  57.016   <2e-16 ***
## Facebook       -1.178e-07  8.590e-07  -0.137    0.891    
## GooglePlus      2.044e-05  2.980e-05   0.686    0.493    
## LinkedIn        4.252e-06  3.078e-06   1.381    0.167    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1396 on 93234 degrees of freedom
## Multiple R-squared:  0.03373,    Adjusted R-squared:  0.03368 
## F-statistic: 813.5 on 4 and 93234 DF,  p-value: < 2.2e-16

Part 2:

Use the tools from previous weeks to diagnose the model

Highlight any issues with the model

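# Diagnostic plots: the four standard lm panels in a 2x2 grid
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1))  # restore a single-panel layout for the plots below

# Residual analysis (named model_residuals so it does not mask residuals())
model_residuals <- residuals(lm_model)

# Residuals in observation order
plot(model_residuals)

# Q-Q plot
qqnorm(model_residuals)
qqline(model_residuals)

# Scale-Location plot: rstandard() scales each residual by its own
# estimated standard deviation, which accounts for leverage
sqrt_abs_std_residuals <- sqrt(abs(rstandard(lm_model)))
plot(fitted(lm_model), sqrt_abs_std_residuals)

# Leverage plot
plot(hatvalues(lm_model))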

We diagnose lm_model with the standard diagnostic plots, a residual plot, a Q-Q plot, a scale-location plot, and a leverage plot.

Based on these plots, the following issues with the model can be identified:

Heteroscedasticity: The residuals appear to be heteroscedastic, as their variance increases with the fitted values instead of staying constant. This means that the assumption of homoscedasticity is violated.
Normality: The residuals do not appear to be exactly normally distributed, as the Q-Q plot shows a slight deviation from the straight reference line. With 93,239 observations, a slight departure has little effect on the coefficient tests, but it does matter for prediction intervals.
High leverage: The leverage plot shows a few observations with high hat values. High leverage means unusual predictor values (most likely articles with extreme share counts, such as the Facebook maximum of 49,211 against a median of 5). Such points are not necessarily outliers in the response, but they can have a disproportionate influence on the fitted coefficients.
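
Leverage and outlyingness are separate properties, and both can be checked numerically rather than by eye. The sketch below uses simulated data (all names and values are hypothetical) with one observation placed far from the rest in the predictor but still on the regression line. The cutoffs `2p/n`, `|r| > 3`, and `4/n` are common rules of thumb, not thresholds taken from this report.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
x[1] <- 10                          # unusual predictor value: high leverage
y <- 0.2 * x + rnorm(n, sd = 0.1)   # observation 1 still follows the same line
fit <- lm(y ~ x)

p <- length(coef(fit))              # number of estimated coefficients

# High leverage: hat value above 2p/n.
high_leverage <- which(hatvalues(fit) > 2 * p / n)

# Outlier in the response: large standardized residual.
large_resid <- which(abs(rstandard(fit)) > 3)

# Influence: Cook's distance combines leverage and residual size.
influential <- which(cooks.distance(fit) > 4 / n)

print(high_leverage)
print(large_resid)
print(influential)
```

The same three calls (`hatvalues()`, `rstandard()`, `cooks.distance()`) work on lm_model directly; observations flagged by Cook's distance are the ones worth inspecting first.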

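These issues do not by themselves bias the coefficient estimates, but heteroscedasticity makes the reported standard errors, t values, and p-values unreliable, and high-leverage observations can pull the estimates toward a handful of articles. Separately, the model explains very little: the R-squared of 0.034 means that only about 3% of the variance in SentimentHeadline is accounted for, so predictions from it will be imprecise.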
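
The heteroscedasticity suggested by the plots can also be tested formally. A minimal base-R sketch of the studentized Breusch-Pagan test follows: regress the squared residuals on the same predictors and compare `n * R^2` with a chi-squared distribution. The data are simulated with an error spread that grows with `x`, so every name and value here is hypothetical.

```r
set.seed(42)
n <- 500
x <- runif(n, 0, 1)
# Heteroscedastic by construction: the error sd rises from 0.2 to 2.2.
y <- 1 + 2 * x + rnorm(n, sd = 0.2 + 2 * x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the model's predictors.
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic n * R^2, chi-squared with one df per predictor.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = bp_df, p_value = bp_p))
```

A small p-value is evidence against constant variance. For lm_model the auxiliary regression would use all four predictors. If the lmtest package is installed, `lmtest::bptest(lm_model)` should give the same test without the manual steps.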

Part 3:

Interpret at least one of the coefficients

Answer:

In the above results, the coefficient estimates for the predictor variables are as follows:

SentimentTitle (Estimate: 0.1910, Std. Error: 0.003350): Holding the other variables constant, a one-unit increase in “SentimentTitle” is associated with a 0.1910-unit increase in the expected “SentimentHeadline”. Because both sentiment scores lie roughly between -1 and 1, a more natural reading is that a 0.1-unit increase in title sentiment goes with an increase of about 0.019 in headline sentiment. The p-value (< 2e-16) shows that the coefficient is statistically significant, although the model's R-squared of 0.034 means the relationship explains only a small share of the variation in the response.
Facebook (Estimate: -1.178e-07, Std. Error: 8.59e-07): The coefficient is very close to zero: each additional Facebook share is associated with a change of -1.178e-07 in “SentimentHeadline”, holding other variables constant. The high p-value (0.891) means the coefficient is not statistically distinguishable from zero, so “Facebook” contributes little to the model’s predictive ability.
GooglePlus (Estimate: 2.044e-05, Std. Error: 2.98e-05): A one-unit increase in “GooglePlus” corresponds to a 2.044e-05 increase in “SentimentHeadline”, holding other variables constant. The p-value (0.493) indicates that “GooglePlus” is not statistically significant in predicting “SentimentHeadline”.
LinkedIn (Estimate: 4.252e-06, Std. Error: 3.078e-06): A one-unit increase in “LinkedIn” is associated with a 4.252e-06 increase in “SentimentHeadline”, holding other variables constant. The p-value (0.167) is above conventional significance levels, so “LinkedIn” does not contribute significantly to the predictive power of the model.
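
A confidence interval makes the SentimentTitle interpretation more concrete than the p-value alone. The sketch below rebuilds the 95% interval from the estimate, standard error, and residual degrees of freedom printed in the model summary. It is plain arithmetic on the reported numbers and does not refit the model.

```r
estimate  <- 0.1910     # SentimentTitle coefficient from the summary output
std_error <- 0.003350   # its standard error
df_resid  <- 93234      # residual degrees of freedom

# Critical t value for a two-sided 95% interval.
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus or minus t_crit standard errors.
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 4))

# The t value is the estimate divided by its standard error.
t_value <- estimate / std_error
print(round(t_value, 2))
```

The interval runs from roughly 0.184 to 0.198, so the association is positive and tightly estimated. With the fitted model available, `confint(lm_model)` returns the intervals for all coefficients directly. Both versions rely on the model's standard errors, so they inherit any problems caused by heteroscedasticity.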

Mansi_Data_Dive_GLMs_Part2

2023-11-04
