library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 and readr are already attached by library(tidyverse); stats is attached by default
library(pwr)
library(broom)
# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")
# A numeric summary of data for at least 10 columns
summary(data)
##      IDLink          Title             Headline            Source
##  Min.   :     1   Length:93239       Length:93239       Length:93239
##  1st Qu.: 24302   Class :character   Class :character   Class :character
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 51561
##  3rd Qu.: 76586
##  Max.   :104802
##     Topic           PublishDate        SentimentTitle      SentimentHeadline
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606
##                                        Mean   :-0.005411   Mean   :-0.02749
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971
##                                        Max.   : 0.962354   Max.   : 0.96465
##     Facebook         GooglePlus          LinkedIn
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00
##  Median :    5.0   Median :   0.000   Median :    0.00
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00
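The minimum of -1 in the Facebook, GooglePlus and LinkedIn columns stands out, since share counts cannot be negative. The sketch below shows one way to count such values. It assumes (the output above does not confirm this) that -1 is a placeholder for an unavailable count, and it uses a small made-up data frame, `toy`, in place of the real data.

```r
# Hypothetical stand-in for the real data frame; the values are made up.
toy <- data.frame(
  Facebook   = c(-1, 0, 5, 33, 120),
  GooglePlus = c(-1, -1, 0, 2, 7),
  LinkedIn   = c(0, -1, 0, 4, 16)
)
share_cols <- c("Facebook", "GooglePlus", "LinkedIn")

# Count the -1 entries in each share column.
n_placeholder <- colSums(toy[share_cols] == -1)
print(n_placeholder)

# ASSUMPTION: -1 marks an unavailable count. Under that assumption, recode it
# to NA so that lm() drops those rows instead of treating -1 as a real count.
toy_clean <- toy
toy_clean[share_cols] <- lapply(
  toy_clean[share_cols],
  function(x) replace(x, x == -1, NA)
)
print(colSums(is.na(toy_clean[share_cols])))
```

If the same pattern holds in the real data, replacing `toy` with `data` would show how many rows are affected before the model is fitted.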
Build a linear (or generalized linear) model as you like
# Build a linear regression model
lm_model <- lm(SentimentHeadline ~ SentimentTitle + Facebook + GooglePlus + LinkedIn, data = data)
# Display the summary of the model
summary(lm_model)
##
## Call:
## lm(formula = SentimentHeadline ~ SentimentTitle + Facebook +
##     GooglePlus + LinkedIn, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.75140 -0.08520  0.00256  0.08649  0.95233
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -2.660e-02  4.699e-04 -56.603   <2e-16 ***
## SentimentTitle  1.910e-01  3.350e-03  57.016   <2e-16 ***
## Facebook       -1.178e-07  8.590e-07  -0.137    0.891
## GooglePlus      2.044e-05  2.980e-05   0.686    0.493
## LinkedIn        4.252e-06  3.078e-06   1.381    0.167
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1396 on 93234 degrees of freedom
## Multiple R-squared:  0.03373, Adjusted R-squared:  0.03368
## F-statistic: 813.5 on 4 and 93234 DF,  p-value: < 2.2e-16
Use the tools from previous weeks to diagnose the model
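# Diagnostic plots: the four default lm diagnostics in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1)) # reset the layout so the plots below are drawn one per figure
# Residual analysis
residuals <- residuals(lm_model)
# Plotting residuals in row order
plot(residuals)
# Q-Q plot
qqnorm(residuals)
qqline(residuals)
# Scale-Location plot; rstandard() scales each residual by its own estimated standard deviation
sqrt_abs_std_residuals <- sqrt(abs(rstandard(lm_model)))
plot(fitted(lm_model), sqrt_abs_std_residuals)
# Leverage plot
plot(hatvalues(lm_model))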
We use “Diagnostic Plots”, “Residual Analysis”, “Q-Q Plot”, “Scale-Location Plot” and “Leverage Plot” to diagnose our lm_model.
Based on these plots, the following issues with the model can be identified:
Heteroscedasticity: The spread of the residuals appears to increase with the fitted values rather than staying roughly constant, which suggests that the assumption of homoscedasticity (constant error variance) is violated.
Normality: The Q-Q plot shows the residuals deviating from the reference line, so they do not appear to be exactly normally distributed. The deviation is slight, which points to a mild rather than a severe departure from the normality assumption.
Outliers and leverage: The leverage plot shows a few observations with high hat values. High leverage means unusual predictor values, which is not the same as being an outlier in the response, but such observations can have a disproportionate influence on the fitted coefficients.
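These issues mainly affect inference rather than the coefficient estimates themselves. Under heteroscedasticity alone the least-squares estimates remain unbiased, but the reported standard errors, t-values and p-values may be unreliable. High-leverage observations, on the other hand, can shift the estimates and the predictions if they are also poorly fitted.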
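The visual impressions above can be backed by simple numeric checks. The following is a minimal sketch in base R, run on simulated data rather than on the news data set: it computes a Breusch-Pagan-style statistic by hand (the studentized form, n times the R-squared of a regression of the squared residuals on the predictors) and flags hat values above the common 2p/n rule of thumb. The names `fit`, `aux`, `bp_stat` and `high_leverage` are illustrative and do not appear in the analysis above.

```r
set.seed(1)
n <- 500
x <- runif(n)
# Simulated response whose noise grows with x (heteroscedastic by construction).
y <- 0.2 * x + rnorm(n, sd = 0.05 + 0.3 * x)
fit <- lm(y ~ x)

# Breusch-Pagan-style check (studentized form): regress the squared residuals
# on the predictors; n * R^2 is compared with a chi-squared distribution.
sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
cat("BP statistic:", bp_stat, " p-value:", bp_p, "\n")

# Leverage: flag hat values above the 2 * p / n rule of thumb,
# where p is the number of estimated coefficients.
h <- hatvalues(fit)
p <- length(coef(fit))
high_leverage <- which(h > 2 * p / n)
cat("Observations above the leverage cutoff:", length(high_leverage), "\n")
```

To apply the same checks to the model in this report, `fit` would be replaced by `lm_model` and the auxiliary regression would use the same four predictors. A small p-value would support the visual impression of non-constant variance.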
Interpret at least one of the coefficients
In the above results, the coefficient estimates for the predictor variables are as follows:
SentimentTitle (Estimate: 0.1910, Std. Error: 0.003350): A one-unit increase in “SentimentTitle” is associated with a 0.1910-unit increase in the “SentimentHeadline” response variable, holding all other variables constant. Because “SentimentTitle” only ranges from about -0.95 to 0.96, a one-unit change is large relative to the data. The p-value is very small (< 2e-16), so the coefficient is statistically significant.
Facebook (Estimate: -1.178e-07, Std. Error: 8.59e-07): A one-unit increase in “Facebook” is associated with a 1.178e-07 decrease in the “SentimentHeadline” response variable, holding other variables constant. Even across the full observed range of “Facebook” (up to 49,211), this amounts to a change of less than 0.006. The high p-value (0.891) means the coefficient is not statistically distinguishable from zero, so “Facebook” may not be contributing much to the model’s predictive ability.
GooglePlus (Estimate: 2.044e-05, Std. Error: 2.98e-05): A one-unit increase in “GooglePlus” is associated with a 2.044e-05 increase in the “SentimentHeadline” response variable, holding other variables constant. The relatively high p-value (0.493) indicates that “GooglePlus” is not a statistically significant predictor of “SentimentHeadline.”
LinkedIn (Estimate: 4.252e-06, Std. Error: 3.078e-06): A one-unit increase in “LinkedIn” is associated with a 4.252e-06 increase in the “SentimentHeadline” response variable, holding other variables constant. The p-value (0.167) is above conventional significance levels, so “LinkedIn” is not a statistically significant predictor and may add little to the predictive power of the model.
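Statistical significance says little about practical size, so it can help to look at confidence intervals and at the effect of a typical change in each predictor. The sketch below does this in base R on simulated stand-in data: the column names mirror the model above, but the values, the data frame `sim` and the model `sim_model` are made up for illustration.

```r
set.seed(42)
n <- 1000
# Simulated stand-in data; column names mirror the model above, values are made up.
sim <- data.frame(
  SentimentTitle = rnorm(n, sd = 0.14),
  Facebook       = rpois(n, lambda = 30)
)
sim$SentimentHeadline <- -0.03 + 0.2 * sim$SentimentTitle + rnorm(n, sd = 0.14)
sim_model <- lm(SentimentHeadline ~ SentimentTitle + Facebook, data = sim)

# 95% confidence intervals for each coefficient.
ci <- confint(sim_model, level = 0.95)
print(ci)

# Estimated change in the response when a predictor moves across its
# interquartile range (coefficient times IQR width).
iqr_width <- sapply(sim[c("SentimentTitle", "Facebook")], IQR)
iqr_effect <- coef(sim_model)[names(iqr_width)] * iqr_width
print(iqr_effect)
```

For the model in this report, the summary gives an interquartile range of about 0.143 for “SentimentTitle” (0.064255 minus -0.079057), so moving across it corresponds to a change of roughly 0.191 × 0.143 ≈ 0.027 in “SentimentHeadline”.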