Mansi_Data_Dive_Regression

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# ggplot2 and readr are already attached by library(tidyverse),
# and stats is attached in every R session, so only pwr needs loading here
library(pwr)

# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")

# A numeric summary of data for at least 10 columns
summary(data)

##      IDLink          Title             Headline            Source         
##  Min.   :     1   Length:93239       Length:93239       Length:93239      
##  1st Qu.: 24302   Class :character   Class :character   Class :character  
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 51561                                                           
##  3rd Qu.: 76586                                                           
##  Max.   :104802                                                           
##     Topic           PublishDate        SentimentTitle      SentimentHeadline 
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543  
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457  
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606  
##                                        Mean   :-0.005411   Mean   :-0.02749  
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971  
##                                        Max.   : 0.962354   Max.   : 0.96465  
##     Facebook         GooglePlus          LinkedIn       
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :    5.0   Median :   0.000   Median :    0.00  
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55  
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00  
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00

str(data)

## 'data.frame':    93239 obs. of  11 variables:
##  $ IDLink           : num  99248 10423 18828 27788 27789 ...
##  $ Title            : chr  "Obama Lays Wreath at Arlington National Cemetery" "A Look at the Health of the Chinese Economy" "Nouriel Roubini: Global Economy Not Back to 2008" "Finland GDP Expands In Q4" ...
##  $ Headline         : chr  "Obama Lays Wreath at Arlington National Cemetery. President Barack Obama has laid a wreath at the Tomb of the U"| __truncated__ "Tim Haywood, investment director business-unit head for fixed income at Gam, discusses the China beige book and"| __truncated__ "Nouriel Roubini, NYU professor and chairman at Roubini Global Economics, explains why the global economy isn't "| __truncated__ "Finland's economy expanded marginally in the three months ended December, after contracting in the previous qua"| __truncated__ ...
##  $ Source           : chr  "USA TODAY" "Bloomberg" "Bloomberg" "RTT News" ...
##  $ Topic            : chr  "obama" "economy" "economy" "economy" ...
##  $ PublishDate      : chr  "2002-04-02 00:00:00" "2008-09-20 00:00:00" "2012-01-28 00:00:00" "2015-03-01 00:06:00" ...
##  $ SentimentTitle   : num  0 0.208 -0.425 0 0 ...
##  $ SentimentHeadline: num  -0.0533 -0.1564 0.1398 0.0261 0.1411 ...
##  $ Facebook         : int  -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
##  $ GooglePlus       : int  -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
##  $ LinkedIn         : int  -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...

Part 1:

Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.

For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses.

# Set the "SentimentHeadline" column as the response variable
response_variable <- data$SentimentHeadline

# Check the structure and summary of the response variable
str(response_variable)

##  num [1:93239] -0.0533 -0.1564 0.1398 0.0261 0.1411 ...

summary(response_variable)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.75543 -0.11457 -0.02606 -0.02749  0.05971  0.96465

Part 2:

Select a categorical column of data (explanatory variable) that you expect might influence the response variable.

Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions.
If there are more than 10 categories, consider consolidating them before running the test using the methods we’ve learned in class.
Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.

# Select a categorical column as an explanatory variable
explanatory_variable <- data$Topic

# View the frequency count for each category
category_counts <- table(explanatory_variable)

# Check the number of unique categories
num_categories <- length(category_counts)

# If the number of categories is more than 10, consider consolidating them
if (num_categories > 10) {
  # Sort the categories by frequency
  sorted_categories <- sort(category_counts, decreasing = TRUE)

  # Select the top 10 categories based on frequency
  top_categories <- names(sorted_categories)[1:10]

  # Consolidate the remaining categories as 'Other'
  explanatory_variable <- ifelse(explanatory_variable %in% top_categories, explanatory_variable, "Other")
}

# Perform ANOVA test
anova_result <- aov(response_variable ~ as.factor(explanatory_variable), data = data)

# Summarize the ANOVA results
summary(anova_result)

##                                    Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(explanatory_variable)     3   13.7   4.572   228.5 <2e-16 ***
## Residuals                       93235 1865.4   0.020                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer:

Null Hypothesis (H0): The mean headline sentiment (continuous column ‘SentimentHeadline’, stored in ‘response_variable’) is the same for every category of the ‘Topic’ column (stored in ‘explanatory_variable’). The alternative is that at least one topic has a different mean.

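Based on the ANOVA output above:

The Df of 3 for the Topic term shows that the column has only four categories, so the consolidation step was skipped and no ‘Other’ group was created.
The F value of 228.5 (on 3 and 93235 degrees of freedom) means the variation between topic means is far larger than the variation within topics would produce by chance.
The p-value (<2e-16) is far below 0.05, so we reject the null hypothesis. The ‘***’ marker is only R’s shorthand for this same p-value falling below 0.001; it is not additional evidence.

We therefore conclude that mean headline sentiment differs between at least two topics. With 93,239 observations, however, even small differences are statistically significant: the Topic term accounts for 13.7 of the 1879.1 total sum of squares, or roughly 0.7% of the variance in ‘SentimentHeadline’. For people interested in this data, there is enough evidence to conclude that topic and headline sentiment are associated, but it would not be safe to assume that knowing the topic alone tells you much about the sentiment of an individual headline. The ANOVA also does not say which topics differ, and it shows association rather than causation.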
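Because a p-value says little about the size of an effect at this sample size, it can help to compute eta-squared, the share of the total sum of squares attributed to the grouping factor. The sketch below first uses the rounded sums of squares printed in the ANOVA table, then defines a helper called `eta_squared` (a hypothetical name, not part of the original analysis) and tries it on R's built-in `ToothGrowth` data, which stands in for the news data purely for illustration.

```r
# Sums of squares copied from the ANOVA table above (rounded as printed)
ss_topic <- 13.7
ss_residual <- 1865.4

# Eta-squared: share of total variation attributed to the grouping factor
eta_sq_reported <- ss_topic / (ss_topic + ss_residual)
print(round(eta_sq_reported, 4))

# Hypothetical helper: eta-squared for the first term of a fitted aov object
eta_squared <- function(fit) {
  ss <- summary(fit)[[1]][["Sum Sq"]]
  ss[1] / sum(ss)
}

# Illustration on a built-in dataset (an assumption, standing in for the news data)
demo_fit <- aov(len ~ factor(dose), data = ToothGrowth)
eta_sq_demo <- eta_squared(demo_fit)
print(round(eta_sq_demo, 3))
```

Applied to the fitted object from this analysis, `eta_squared(anova_result)` should give a value close to the one computed from the printed table, up to rounding.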

Part 3:

Find at least one other continuous (or ordered integer) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.

Build a linear regression model of the response using just this column, and evaluate its fit.
Run appropriate hypothesis tests and summarize their results. Use diagnostic plots to identify any issues with your model.
Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

# Choose an appropriate continuous (or ordered integer) column
explanatory_continuous_variable <- data$SentimentTitle

# Build a linear regression model of the response using the selected column
linear_model <- lm(response_variable ~ explanatory_continuous_variable, data = data)

# Evaluate the model's fit
summary(linear_model)

## 
## Call:
## lm(formula = response_variable ~ explanatory_continuous_variable, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75153 -0.08520  0.00255  0.08644  0.95220 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.0264595  0.0004574  -57.85   <2e-16 ***
## explanatory_continuous_variable  0.1909990  0.0033499   57.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1396 on 93237 degrees of freedom
## Multiple R-squared:  0.03369,    Adjusted R-squared:  0.03368 
## F-statistic:  3251 on 1 and 93237 DF,  p-value: < 2.2e-16

Answer:

The model summary reports a t-test on the slope coefficient for ‘SentimentTitle’ (stored in ‘explanatory_continuous_variable’).

Hypothesis Test:

Null Hypothesis (H0): The slope is zero, i.e. there is no linear relationship between ‘SentimentTitle’ and ‘SentimentHeadline’.
Alternative Hypothesis (HA): The slope is not zero, i.e. there is a linear relationship between ‘SentimentTitle’ and ‘SentimentHeadline’.

Summary of Results:

The p-value for the slope is less than 2e-16 (t = 57.02), far below the 0.05 significance level, so we reject the null hypothesis.
There is therefore a statistically significant positive linear relationship between the two sentiment scores.
The fit itself is weak: the R-squared of 0.03369 means title sentiment explains only about 3.4% of the variance in headline sentiment, and the residual standard error of 0.1396 means individual headlines still scatter widely around the fitted line.

The slope estimate of 0.1909990 means that a one-unit increase in ‘SentimentTitle’ is associated with an average increase of about 0.191 in ‘SentimentHeadline’. Because both scores lie roughly between -1 and 1, a more realistic comparison is a 0.1 increase in title sentiment, which corresponds to an average increase of about 0.019 in headline sentiment. This is a simple regression with a single predictor, so no other variables are being held constant.
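A confidence interval for the slope conveys the precision of the estimate more directly than the p-value does. The sketch below rebuilds an approximate 95% interval, and the t value, from the estimate and standard error printed in the coefficient table. The variable names are illustrative and do not come from the original analysis.

```r
# Estimate, standard error and residual degrees of freedom copied from the summary above
slope_est <- 0.1909990
slope_se <- 0.0033499
resid_df <- 93237

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = resid_df)
slope_ci <- slope_est + c(-1, 1) * t_crit * slope_se
print(round(slope_ci, 4))

# The t value in the table is the estimate divided by its standard error
t_value <- slope_est / slope_se
print(round(t_value, 2))
```

The interval runs from roughly 0.184 to 0.198, so the slope is estimated precisely even though the model explains little variance. On the fitted object itself, `confint(linear_model)` should return the same interval up to rounding.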

# Diagnostic plots
par(mfrow=c(2,2))
plot(linear_model)
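By default, `plot()` on an `lm` object draws four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The same quantities can be pulled out numerically, which is useful when 93,239 points make the plots hard to read. The sketch below does this on simulated data. The simulated variables, the `checks` data frame and the cut-offs are assumptions for illustration, not results from the news data.

```r
set.seed(1)
n <- 500

# Simulated stand-ins for SentimentTitle (x) and SentimentHeadline (y)
x <- runif(n, -1, 1)
y <- -0.03 + 0.19 * x + rnorm(n, sd = 0.14)
fit <- lm(y ~ x)

# Quantities behind the four default diagnostic plots
checks <- data.frame(
  fitted    = fitted(fit),          # x-axis of Residuals vs Fitted
  std_resid = rstandard(fit),       # Normal Q-Q and Scale-Location
  leverage  = hatvalues(fit),       # x-axis of Residuals vs Leverage
  cooks_d   = cooks.distance(fit)   # influence contours
)

# Rule-of-thumb flags (common conventions, not hard limits)
large_resid <- which(abs(checks$std_resid) > 3)
high_lev    <- which(checks$leverage > 2 * mean(checks$leverage))
influential <- which(checks$cooks_d > 4 / n)

print(c(large_resid = length(large_resid),
        high_leverage = length(high_lev),
        influential = length(influential)))
```

Replacing `fit` with `linear_model` would give the corresponding checks for the model fitted above.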

Interpreting the coefficients in the context of the data:

The intercept (-0.0264595) is the expected headline sentiment when the title sentiment is exactly zero, i.e. when the title is scored as neutral. It is slightly negative and close to the overall mean headline sentiment of -0.02749.

The slope (0.1909990) is positive but well below 1, so headlines tend to lean in the same direction as their titles while moving much less than one-for-one with them.

Title sentiment on its own is therefore a weak predictor of headline sentiment, and any recommendation based on this model should be cautious. The diagnostic plots should be checked for non-constant variance and non-normal residuals before relying on the standard errors, and factors not included in this model may account for much of the remaining variation.

Part 4:

Include at least one other variable in your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn’t).
Maybe include an interaction term, but explain why you included it.
You can add up to 4 variables if you like.

new_data <- data.frame(response_variable, explanatory_variable, explanatory_continuous_variable)

# Build a linear regression model with an interaction term
linear_model_with_interaction <- lm(response_variable ~ explanatory_variable * explanatory_continuous_variable, data = new_data)

# Evaluate the model
summary(linear_model_with_interaction)

## 
## Call:
## lm(formula = response_variable ~ explanatory_variable * explanatory_continuous_variable, 
##     data = new_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73988 -0.08570  0.00248  0.08648  0.93885 
## 
## Coefficients:
##                                                                 Estimate
## (Intercept)                                                   -0.0375715
## explanatory_variablemicrosoft                                  0.0223521
## explanatory_variableobama                                      0.0200125
## explanatory_variablepalestine                                 -0.0040969
## explanatory_continuous_variable                                0.1864115
## explanatory_variablemicrosoft:explanatory_continuous_variable  0.0149513
## explanatory_variableobama:explanatory_continuous_variable      0.0040171
## explanatory_variablepalestine:explanatory_continuous_variable -0.0461720
##                                                               Std. Error
## (Intercept)                                                    0.0007573
## explanatory_variablemicrosoft                                  0.0012080
## explanatory_variableobama                                      0.0011180
## explanatory_variablepalestine                                  0.0016795
## explanatory_continuous_variable                                0.0052819
## explanatory_variablemicrosoft:explanatory_continuous_variable  0.0091933
## explanatory_variableobama:explanatory_continuous_variable      0.0078963
## explanatory_variablepalestine:explanatory_continuous_variable  0.0131471
##                                                               t value Pr(>|t|)
## (Intercept)                                                   -49.615  < 2e-16
## explanatory_variablemicrosoft                                  18.504  < 2e-16
## explanatory_variableobama                                      17.900  < 2e-16
## explanatory_variablepalestine                                  -2.439 0.014714
## explanatory_continuous_variable                                35.293  < 2e-16
## explanatory_variablemicrosoft:explanatory_continuous_variable   1.626 0.103883
## explanatory_variableobama:explanatory_continuous_variable       0.509 0.610944
## explanatory_variablepalestine:explanatory_continuous_variable  -3.512 0.000445
##                                                                  
## (Intercept)                                                   ***
## explanatory_variablemicrosoft                                 ***
## explanatory_variableobama                                     ***
## explanatory_variablepalestine                                 *  
## explanatory_continuous_variable                               ***
## explanatory_variablemicrosoft:explanatory_continuous_variable    
## explanatory_variableobama:explanatory_continuous_variable        
## explanatory_variablepalestine:explanatory_continuous_variable ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1391 on 93231 degrees of freedom
## Multiple R-squared:  0.03974,    Adjusted R-squared:  0.03967 
## F-statistic: 551.2 on 7 and 93231 DF,  p-value: < 2.2e-16

Answer:

The summary shows which terms in the interaction model are statistically significant. The interaction between topic and title sentiment was included to test whether the slope of headline sentiment on title sentiment differs by topic. The topic ‘economy’ does not appear in the table because it is the reference category, so every topic coefficient is a difference from ‘economy’.

The coefficients for ‘microsoft’ (0.0223521) and ‘obama’ (0.0200125) are highly significant (p < 2e-16): when title sentiment is zero, expected headline sentiment for these topics is about 0.02 higher than for ‘economy’.
The coefficient for ‘palestine’ (-0.0040969) is significant only at the 0.05 level (p = 0.014714), and its difference from ‘economy’ is very small.
The coefficient for ‘explanatory_continuous_variable’ (0.1864115, p < 2e-16) is the slope of headline sentiment on title sentiment for the reference topic ‘economy’.
The interaction term ‘explanatory_variablemicrosoft:explanatory_continuous_variable’ is not significant (p = 0.103883), so there is no evidence that the slope for ‘microsoft’ differs from the slope for ‘economy’.
The interaction term ‘explanatory_variableobama:explanatory_continuous_variable’ is also not significant (p = 0.610944).
The interaction term ‘explanatory_variablepalestine:explanatory_continuous_variable’ (-0.0461720) is significant (p = 0.000445): the slope for ‘palestine’ is about 0.140, compared with 0.186 for ‘economy’.
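In an interaction model, the slope for each non-reference category is the reference slope plus that category's interaction coefficient. The sketch below works this out from the coefficients printed in the summary. It assumes ‘economy’ is the reference level, which follows from R ordering the four topics alphabetically. The variable names are illustrative.

```r
# Coefficients copied from the interaction model summary above
base_slope <- 0.1864115  # slope of the reference topic (assumed to be 'economy')
slope_shift <- c(microsoft = 0.0149513,
                 obama     = 0.0040171,
                 palestine = -0.0461720)

# Slope for each topic = reference slope + that topic's interaction coefficient
topic_slopes <- c(economy = base_slope, base_slope + slope_shift)
print(round(topic_slopes, 4))
```

Only the ‘palestine’ slope differs significantly from the reference slope, since the shifts for ‘microsoft’ and ‘obama’ have p-values above 0.05.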

The model’s R-squared is 0.03974, so it explains only about 4% of the variability in the response variable. The model with ‘SentimentTitle’ alone had an R-squared of 0.03369, so adding topic and the interaction raises the explained variance by about 0.6 percentage points, and the residual standard error barely changes (0.1391 versus 0.1396). The overall F-statistic (551.2 on 7 and 93231 degrees of freedom, p < 2.2e-16) shows that the model as a whole fits better than an intercept-only model.

The added terms are therefore statistically detectable but help very little in practice. The one notable finding is that the relationship between title and headline sentiment is weaker for ‘palestine’ than for ‘economy’ (the reference topic), while the slopes for ‘microsoft’ and ‘obama’ are not distinguishable from the ‘economy’ slope. Most of the variation in headline sentiment remains unexplained, so further work should look at predictors beyond topic and title sentiment.
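Whether the extra terms improve on the simpler model can be tested directly with a nested-model F-test via `anova()`, instead of comparing R-squared values by eye. The sketch below runs that comparison on simulated data. The data frame `sim`, its columns and the effect sizes are all assumptions chosen to resemble the setting, not values from the news data.

```r
set.seed(42)
n <- 2000

# Simulated stand-ins: four topics, with one topic given a flatter slope
topic <- factor(sample(c("economy", "microsoft", "obama", "palestine"),
                       n, replace = TRUE))
title <- runif(n, -1, 1)
slope <- ifelse(topic == "palestine", 0.14, 0.19)
headline <- -0.03 + slope * title + rnorm(n, sd = 0.14)
sim <- data.frame(headline, title, topic)

# Three nested models: title only, title + topic, title * topic
m_simple   <- lm(headline ~ title, data = sim)
m_additive <- lm(headline ~ title + topic, data = sim)
m_interact <- lm(headline ~ title * topic, data = sim)

# Sequential F-tests: does each added block of terms reduce residual error?
comparison <- anova(m_simple, m_additive, m_interact)
print(comparison)

# R-squared for each model, to see the practical size of the gain
r2 <- c(simple   = summary(m_simple)$r.squared,
        additive = summary(m_additive)$r.squared,
        interact = summary(m_interact)$r.squared)
print(round(r2, 4))
```

For the models fitted above, the equivalent call would be `anova(linear_model, linear_model_with_interaction)`, which is valid because both models use the same response and the same rows.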

2023-10-23
