library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(pwr)
library(stats)
library(readr)
library(broom)

# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")

# A numeric summary of data for at least 10 columns
summary(data)
##      IDLink          Title             Headline            Source         
##  Min.   :     1   Length:93239       Length:93239       Length:93239      
##  1st Qu.: 24302   Class :character   Class :character   Class :character  
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 51561                                                           
##  3rd Qu.: 76586                                                           
##  Max.   :104802                                                           
##     Topic           PublishDate        SentimentTitle      SentimentHeadline 
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543  
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457  
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606  
##                                        Mean   :-0.005411   Mean   :-0.02749  
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971  
##                                        Max.   : 0.962354   Max.   : 0.96465  
##     Facebook         GooglePlus          LinkedIn       
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :    5.0   Median :   0.000   Median :    0.00  
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55  
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00  
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00

Q1:

Select a column of your data that encodes time (e.g., “date”, “timestamp”, “year”, etc.). Convert this into a Date in R. - Note, you may need to use some combination of as.Date or lubridate's as_datetime, and you may even need to paste year, month, day, hour, etc. together using paste (even if you need to make up a month, like “__/01/01”).
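As a small illustration of that workflow (with made-up values, not tied to this dataset), components can be pasted together and then converted with as.Date or lubridate's as_datetime:

# A minimal sketch with made-up values: paste components, then convert
library(lubridate)

yr <- "2015"; mo <- "03"; dy <- "01"
as.Date(paste(yr, mo, dy, sep = "-"))          # Date: "2015-03-01"

# If only the year is known, invent a month/day as the prompt suggests
as.Date(paste("2012", "01", "01", sep = "-"))  # "2012-01-01"

# Full timestamps can go through as_datetime() / as.POSIXct()
as_datetime("2015-03-01 00:06:00")             # POSIXct date-time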

Q2:

Choose a column of data to analyze over time. This should be a “response-like” variable that is of particular interest.

Q3:

Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time. - What stands out immediately?

library(lubridate)
library(anytime)
## Warning: package 'anytime' was built under R version 4.3.2
head(data$PublishDate)
## [1] "2002-04-02 00:00:00" "2008-09-20 00:00:00" "2012-01-28 00:00:00"
## [4] "2015-03-01 00:06:00" "2015-03-01 00:11:00" "2015-03-01 00:19:00"
str(data$PublishDate)
##  chr [1:93239] "2002-04-02 00:00:00" "2008-09-20 00:00:00" ...
response_data <- data$Facebook
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.2
## 
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
## 
##     interval
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
data$PublishDate <- as.POSIXct(data$PublishDate, format = "%Y-%m-%d %H:%M:%S")

# Check for duplicates
duplicates <- data[duplicated(data$PublishDate), ]
if (nrow(duplicates) > 0) {
  print("Duplicated rows found. Removing duplicates.")
  data <- data[!duplicated(data$PublishDate), ]
}
## [1] "Duplicated rows found. Removing duplicates."
# Check for missing values in the response variable
missing_values <- sum(is.na(data$Facebook))
if (missing_values > 0) {
  print("Missing values found in 'Facebook'. Handling missing values.")
  data <- data[complete.cases(data$Facebook), ]
}
head(response_data)
## [1] -1 -1 -1 -1 -1 -1
head(data$PublishDate)
## [1] "2002-04-02 00:00:00 EST" "2008-09-20 00:00:00 EDT"
## [3] "2012-01-28 00:00:00 EST" "2015-03-01 00:06:00 EST"
## [5] "2015-03-01 00:11:00 EST" "2015-03-01 00:19:00 EST"
str(response_data)
##  int [1:93239] -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
str(data$PublishDate)
##  POSIXct[1:82637], format: "2002-04-02 00:00:00" "2008-09-20 00:00:00" "2012-01-28 00:00:00" ...
sample_data <- data.frame(
  Date = as.Date(data$PublishDate),
  Facebook = data$Facebook  # take Facebook from the deduplicated data so values stay aligned with the dates
)

# Remove rows with missing values in the 'Date' column
sample_data <- sample_data[complete.cases(sample_data$Date), ]

missing_values <- sum(is.na(sample_data$Facebook))
if (missing_values > 0) {
  print("Missing values found in 'Facebook'. Handling missing values.")
  sample_data <- sample_data[complete.cases(sample_data$Facebook), ]
}

str(sample_data)
## 'data.frame':    82636 obs. of  2 variables:
##  $ Date    : Date, format: "2002-04-02" "2008-09-20" ...
##  $ Facebook: int  -1 -1 -1 -1 -1 -1 0 -1 -1 -1 ...
duplicated_rows <- sample_data[duplicated(sample_data$Date), ]

# Print duplicated rows, if any
if (nrow(duplicated_rows) > 0) {
  print("Duplicated rows found. Removing duplicates.")
  # Keep only the first row per calendar day (a tsibble index must be unique)
  sample_data <- sample_data[!duplicated(sample_data$Date), ]
}
## [1] "Duplicated rows found. Removing duplicates."
# Create a tsibble object
ts_data <- tsibble::tsibble(Date = sample_data$Date, Facebook = sample_data$Facebook)
## Using `Date` as index variable.
# Plot the data over time
ggplot(ts_data, aes(x = Date, y = Facebook)) +
  geom_line() +
  labs(title = "Facebook Activity Over Time",
       x = "Date",
       y = "Facebook Activity") +
  theme_minimal() +
  scale_x_date(date_labels = "%Y-%m-%d", date_breaks = "2 year") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Filter the ts_data for observations after the year 2015
filtered_data <- subset(ts_data, Date >= as.Date("2015-01-01"))

# Plot the filtered data over time
ggplot(filtered_data, aes(x = Date, y = Facebook)) +
  geom_line() +
  labs(title = "Facebook Activity Over Time (From 2015)",
       x = "Date",
       y = "Facebook Activity") +
  theme_minimal() +
  scale_x_date(date_labels = "%Y-%m-%d", date_breaks = "3 months") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Answer:

What stands out immediately in the plot of Facebook activity over time is how spiky and right-skewed the series is: most articles receive very few shares (the median is 5), while a handful spike to tens of thousands. It is also clear that nearly all observations fall in 2015–2016; the few earlier dates (2002, 2008, 2012) are isolated stray records, which is why the zoomed-in plot from 2015 onward is more informative. Because of the large spikes, any long-term trend is hard to judge from the raw daily series alone.

Overall, the series gives a useful first look at how article-level Facebook engagement is distributed over the period covered by the dataset.
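One way to consider a different window of time, sketched below assuming the ts_data object built above is still available, is to aggregate the daily series to monthly totals before plotting; the coarser series is less dominated by single-article spikes.

# A rough sketch (assumes ts_data from above): monthly totals via index_by()
monthly_data <- ts_data %>%
  index_by(Month = yearmonth(Date)) %>%
  summarise(Facebook = sum(Facebook, na.rm = TRUE))

ggplot(monthly_data, aes(x = as.Date(Month), y = Facebook)) +
  geom_line() +
  labs(title = "Monthly Total Facebook Activity (sketch)",
       x = "Month",
       y = "Total Facebook Activity") +
  theme_minimal()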

Q4:

Use linear regression to detect any upwards or downwards trends. - Do you need to subset the data for multiple trends? - How strong are these trends?

# Fit a linear regression model
model <- lm(Facebook ~ Date, data = ts_data)

# Summary of the linear regression model
summary(model)
## 
## Call:
## lm(formula = Facebook ~ Date, data = ts_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -86.9  -84.0  -81.2  -58.0 4047.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -113.58102 1052.64736  -0.108    0.914
## Date           0.01175    0.06259   0.188    0.851
## 
## Residual standard error: 380.8 on 280 degrees of freedom
## Multiple R-squared:  0.0001258,  Adjusted R-squared:  -0.003445 
## F-statistic: 0.03522 on 1 and 280 DF,  p-value: 0.8513
# Plotting the data with the linear regression line
ggplot(ts_data, aes(x = Date, y = Facebook)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Linear Regression Trend",
       x = "Date",
       y = "Facebook Activity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Answer:

Based on the output above, the estimated slope for Date is about 0.012 shares per day, with a p-value of 0.851 and an R-squared of roughly 0.0001, so the fitted trend is extremely weak and not statistically significant. Subsetting the data could still be worthwhile: if different periods behave differently (for example, the sparse pre-2015 records versus the dense 2015–2016 records), separate fits might reveal local trends that a single overall model averages away.

In summary, based on this model there is no statistically significant upward or downward linear trend between Date and Facebook activity over the full period.
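As a sketch of the subsetting idea (assuming ts_data from above, and using the broom package loaded earlier), one could fit a separate linear trend per calendar year and compare the slopes and p-values:

# A rough sketch: one linear trend per year, years with too few points dropped
per_year_trends <- ts_data %>%
  as_tibble() %>%
  mutate(Year = year(Date)) %>%
  group_by(Year) %>%
  filter(n() > 2) %>%   # need at least a few points to fit a line
  group_modify(~ tidy(lm(Facebook ~ Date, data = .x))) %>%
  filter(term == "Date")

per_year_trends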

Q5:

Use smoothing to detect at least one season in your data, and interpret your results. - Can you illustrate the seasonality using ACF or PACF?

# Load necessary libraries
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# Convert the Facebook series to a ts object; frequency = 12 treats every
# 12 consecutive observations as one period (the index here is daily articles,
# so this is a modelling assumption rather than true monthly data)
ts_data_ts <- ts(ts_data$Facebook, frequency = 12)

# Apply a simple moving average to smooth the data
smoothed_data <- ma(ts_data_ts, order = 12)  # 12-point centred moving-average window

# Plot the original and smoothed data
autoplot(ts_data_ts, series = "Original") +
  autolayer(smoothed_data, series = "Smoothed") +
  xlab("Month") + ylab("Facebook Activity") +
  labs(title = "Original vs. Smoothed Facebook Activity") +
  theme_minimal()
## Warning: Removed 12 rows containing missing values (`geom_line()`).

# Plot ACF and PACF to detect seasonality
ggtsdisplay(ts_data_ts)

Answer:

Based on the results, the original Facebook activity series is far more volatile than the smoothed series: raw values spike well above the smoothed curve and then drop back toward zero. The smoothed series is a moving average of the original, which damps these spikes and makes underlying changes in level easier to see.
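To make the effect of the window size concrete, here is a small sketch (assuming ts_data_ts from above) comparing moving averages with different widths; wider windows give smoother curves but lose more points at the ends of the series:

# A quick comparison of window sizes (wider = smoother, more NAs at the ends)
autoplot(ts_data_ts, series = "Original") +
  autolayer(ma(ts_data_ts, order = 7), series = "7-point MA") +
  autolayer(ma(ts_data_ts, order = 30), series = "30-point MA") +
  labs(title = "Moving Averages with Different Window Sizes",
       x = "Time index", y = "Facebook Activity") +
  theme_minimal()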

The smoothed series appears to drift upward over the window shown, which suggests rising Facebook engagement toward the end of the period, although the Q4 regression found no statistically significant linear trend over the full series.

There is also a repeating pattern in the smoothed series that is consistent with seasonality. No formal test of seasonality was run here, however, so this should be read as a visual impression rather than a statistically confirmed effect.

Overall, the plot suggests that engagement rises toward the end of the period and that there may be a recurring, roughly periodic pattern in the series.

Turning to the ACF and PACF: these plots show the correlation between the series and its lagged values, so seasonality, as a repeating pattern, would appear as spikes at regular lag intervals.

However, the plots indicate a non-stationary series with a clear trend. The ACF and PACF are dominated by that trend, so any seasonal spikes are difficult to see. Removing the trend first, for example by differencing, should make them easier to spot, as sketched below.
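Under the assumption that ts_data_ts (with its frequency = 12 encoding) is still available, one possible follow-up is to difference the series once before re-examining the ACF/PACF, or to run an STL decomposition that separates trend, seasonal, and remainder components explicitly:

# First-difference the series to remove the trend, then re-check ACF/PACF
diff_data <- diff(ts_data_ts, differences = 1)
ggtsdisplay(diff_data)

# STL decomposition: explicit trend, seasonal, and remainder components
# (the "seasonal" component here means the frequency-12 period assumed above)
stl_fit <- stl(ts_data_ts, s.window = "periodic")
autoplot(stl_fit)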