library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
# Load the data
data <- read.csv("C://Users//saisr//Downloads//pageview obesity data.csv")
# Preview the data
head(data)
## Date Obesity Overweight Healthy.diet
## 1 10/6/2022 1335 188 778
## 2 10/7/2022 1234 197 837
## 3 10/8/2022 1075 199 694
## 4 10/9/2022 1214 190 829
## 5 10/10/2022 1277 215 822
## 6 10/11/2022 1430 167 849
data <- data %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y"))
str(data)
## 'data.frame': 274 obs. of 4 variables:
## $ Date : Date, format: "2022-10-06" "2022-10-07" ...
## $ Obesity : int 1335 1234 1075 1214 1277 1430 1487 1469 1254 1240 ...
## $ Overweight : int 188 197 199 190 215 167 166 222 227 208 ...
## $ Healthy.diet: int 778 837 694 829 822 849 911 902 694 674 ...
# Create a tsibble for the 'Obesity' column
data_tsibble <- data %>%
as_tsibble(index = Date) %>%
select(Date, Obesity)
# Plot Obesity pageviews over the entire time range
data_tsibble %>%
ggplot(aes(x = Date, y = Obesity)) +
geom_line(color = "blue") +
labs(title = "Obesity Pageviews Over Time",
x = "Date",
y = "Pageviews")
# Filter for the first 6 months
data_tsibble %>%
filter(Date <= as.Date("2023-03-31")) %>%
ggplot(aes(x = Date, y = Obesity)) +
geom_line(color = "blue") +
labs(title = "Obesity Pageviews: First 6 Months",
x = "Date",
y = "Pageviews")
# Filter for the last 6 months
data_tsibble %>%
filter(Date >= as.Date("2023-04-01")) %>%
ggplot(aes(x = Date, y = Obesity)) +
geom_line(color = "blue") +
labs(title = "Obesity Pageviews: Last 6 Months",
x = "Date",
y = "Pageviews")
# Fit a linear regression model for the overall data
linear_model <- lm(Obesity ~ as.numeric(Date), data = data_tsibble)
# Display the summary of the linear model
summary(linear_model)
##
## Call:
## lm(formula = Obesity ~ as.numeric(Date), data = data_tsibble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -549.3 -169.9 -51.2 57.7 9605.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25432.4366 9601.0171 2.649 0.00855 **
## as.numeric(Date) -1.2353 0.4947 -2.497 0.01311 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 647.7 on 272 degrees of freedom
## Multiple R-squared: 0.02241, Adjusted R-squared: 0.01882
## F-statistic: 6.235 on 1 and 272 DF, p-value: 0.01311
# Visualize the trend
data_tsibble %>%
ggplot(aes(x = Date, y = Obesity)) +
geom_line(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Linear Regression: Overall Trend in Obesity Pageviews",
x = "Date",
y = "Pageviews")
## `geom_smooth()` using formula = 'y ~ x'
Equation of the Model: Obesity=25432.44−1.235×Date (numeric) The negative slope (−1.235) suggests a downward trend in the number of pageviews for “Obesity” over time.
Significance of the Slope: The p-value for the slope (𝑝=0.01311) is less than 0.05, meaning the downward trend is statistically significant.
Strength of the Trend: The Multiple R-squared value is 0.0224, indicating that only 2.24% of the variation in Obesity pageviews is explained by time. This is a very weak trend, as most of the variation is not captured by the model.
Residual Analysis: The Residual Standard Error (RSE) is 647.7, which represents the average deviation of observed values from the regression line. The wide residual range (−549.3 to 9605.4) shows that there are large deviations in certain time points.
Is there a need to Subset the Data? Yes, based on the weak R-squared value and the potential for differing patterns in different time periods, subsetting the data into logical periods may help uncover stronger trends.
# Calculate the median date
mid_date <- median(data_tsibble$Date)
# Subset into first half and second half
first_half <- data_tsibble %>% filter(Date <= mid_date)
second_half <- data_tsibble %>% filter(Date > mid_date)
# Linear regression for the first half
linear_model_first <- lm(Obesity ~ as.numeric(Date), data = first_half)
summary(linear_model_first)
##
## Call:
## lm(formula = Obesity ~ as.numeric(Date), data = first_half)
##
## Residuals:
## Min 1Q Median 3Q Max
## -628.5 -214.3 -98.9 66.2 9561.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59303.467 37333.590 1.588 0.115
## as.numeric(Date) -2.989 1.930 -1.548 0.124
##
## Residual standard error: 893.6 on 135 degrees of freedom
## Multiple R-squared: 0.01744, Adjusted R-squared: 0.01017
## F-statistic: 2.397 on 1 and 135 DF, p-value: 0.1239
# Linear regression for the second half
linear_model_second <- lm(Obesity ~ as.numeric(Date), data = second_half)
summary(linear_model_second)
##
## Call:
## lm(formula = Obesity ~ as.numeric(Date), data = second_half)
##
## Residuals:
## Min 1Q Median 3Q Max
## -389.29 -84.58 12.66 90.28 1256.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53600.583 7965.108 6.729 4.44e-10 ***
## as.numeric(Date) -2.680 0.409 -6.552 1.10e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 189.3 on 135 degrees of freedom
## Multiple R-squared: 0.2413, Adjusted R-squared: 0.2357
## F-statistic: 42.93 on 1 and 135 DF, p-value: 1.097e-09
In the first half, there is little evidence of a meaningful trend, as the slope is not significant and the R-squared is negligible.
In the second half, the trend is much clearer and statistically significant, suggesting a consistent and noticeable downward trend in obesity-related pageviews.
Subset Analysis: The weak overall trend in the full dataset masked the stronger trend present in the second half. Subsetting reveals a meaningful downward trend in the second half.
Stronger Trend in the Second Half: A significant decline in pageviews for obesity-related topics occurs in the second half of the timeline.
Possible Explanations: External factors like reduced interest in the topic or changes in public awareness might explain the stronger downward trend in the later period.
# Plot for the first half
ggplot(first_half, aes(x = Date, y = Obesity)) +
geom_line(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Trend in First Half of Data", x = "Date", y = "Pageviews")
## `geom_smooth()` using formula = 'y ~ x'
# Plot for the second half
ggplot(second_half, aes(x = Date, y = Obesity)) +
geom_line(color = "green") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Trend in Second Half of Data", x = "Date", y = "Pageviews")
## `geom_smooth()` using formula = 'y ~ x'
# Load necessary libraries
library(ggplot2)
# Apply smoothing (LOESS smoothing)
ggplot(data_tsibble, aes(x = Date, y = Obesity)) +
geom_line(color = "blue") +
geom_smooth(method = "loess", formula = y ~ x, span = 0.2, color = "red", se = FALSE) +
labs(title = "Smoothing to Detect Seasonality",
x = "Date",
y = "Pageviews") +
theme_minimal()
This test detectes periodicity
acf(data_tsibble$Obesity, main = "ACF: Detecting Seasonality", lag.max = 50)
PACF isolates direct correlations by removing indirect ones.
pacf(data_tsibble$Obesity, main = "PACF: Verifying Seasonality", lag.max = 50)
# Convert data to a time series object
data_ts <- ts(data$Obesity, frequency = 7) # Adjust 'Obesity' and frequency as needed
# Decompose the time series
decomposed <- stl(data_ts, s.window = "periodic")
# Plot the decomposed components
plot(decomposed, main = "Decomposition of Time Series")
Top Panel-data - This is the raw time series data. - we can see an initial spike followed by more stable data with periodic fluctuations. Second Panel-seasonal - This shows repeating patterns or cycles within the data, such as weekly, monthly, or yearly seasonality. - The consistent oscillations suggest a strong seasonal pattern, likely periodic over the specified time frame. - The amplitude of these oscillations appears stable over time. Third Panel-trend - This represents the long-term movement in the data, abstracting away seasonal variations. - Initially, there is a sharp upward trend, corresponding to the spike in the data. - Afterward, the trend stabilizes and shows small fluctuations over time. Bottom Panel-remainder - This shows the random noise or residual variation that cannot be explained by the trend or seasonal components. - The large outlier corresponds to the initial spike in the data, which is not explained by seasonality or trend. - The remainder is relatively stable after the spike, with small variations.
Seasonality: A strong seasonal pattern is evident in the data, indicating regular cycles. Trend: The trend stabilizes after an initial spike, showing a long-term consistent movement. Remainder: Anomalies (e.g., the spike) are captured as residuals, likely due to external events.