week 12

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tsibble)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

# Load the data
data <- read.csv("C://Users//saisr//Downloads//pageview obesity data.csv")

# Preview the data
head(data)

##         Date Obesity Overweight Healthy.diet
## 1  10/6/2022    1335        188          778
## 2  10/7/2022    1234        197          837
## 3  10/8/2022    1075        199          694
## 4  10/9/2022    1214        190          829
## 5 10/10/2022    1277        215          822
## 6 10/11/2022    1430        167          849

Converting date column

data <- data %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))

str(data)

## 'data.frame':    274 obs. of  4 variables:
##  $ Date        : Date, format: "2022-10-06" "2022-10-07" ...
##  $ Obesity     : int  1335 1234 1075 1214 1277 1430 1487 1469 1254 1240 ...
##  $ Overweight  : int  188 197 199 190 215 167 166 222 227 208 ...
##  $ Healthy.diet: int  778 837 694 829 822 849 911 902 694 674 ...

From the dataset structure, the Obesity column is an excellent choice for analysis because it represents a response-like variable of public interest, directly tied to the topic of obesity.

Creating a tsibble Object

# Create a tsibble for the 'Obesity' column
data_tsibble <- data %>%
  as_tsibble(index = Date) %>%
  select(Date, Obesity)

plotting data over time

# Plot Obesity pageviews over the entire time range
data_tsibble %>%
  ggplot(aes(x = Date, y = Obesity)) +
  geom_line(color = "blue") +
  labs(title = "Obesity Pageviews Over Time",
       x = "Date",
       y = "Pageviews")

Different Windows of Time

# Filter for the first 6 months
data_tsibble %>%
  filter(Date <= as.Date("2023-03-31")) %>%
  ggplot(aes(x = Date, y = Obesity)) +
  geom_line(color = "blue") +
  labs(title = "Obesity Pageviews: First 6 Months",
       x = "Date",
       y = "Pageviews")

# Filter for the last 6 months
data_tsibble %>%
  filter(Date >= as.Date("2023-04-01")) %>%
  ggplot(aes(x = Date, y = Obesity)) +
  geom_line(color = "blue") +
  labs(title = "Obesity Pageviews: Last 6 Months",
       x = "Date",
       y = "Pageviews")

key observations

The spike early in the dataset is the most prominent feature and likely corresponds to an anomaly or a significant event. -The last six months show more consistent variability compared to the earlier periods.

Linear Regression to Detect Trends

# Fit a linear regression model for the overall data
linear_model <- lm(Obesity ~ as.numeric(Date), data = data_tsibble)

# Display the summary of the linear model
summary(linear_model)

## 
## Call:
## lm(formula = Obesity ~ as.numeric(Date), data = data_tsibble)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -549.3 -169.9  -51.2   57.7 9605.4 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      25432.4366  9601.0171   2.649  0.00855 **
## as.numeric(Date)    -1.2353     0.4947  -2.497  0.01311 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 647.7 on 272 degrees of freedom
## Multiple R-squared:  0.02241,    Adjusted R-squared:  0.01882 
## F-statistic: 6.235 on 1 and 272 DF,  p-value: 0.01311

# Visualize the trend
data_tsibble %>%
  ggplot(aes(x = Date, y = Obesity)) +
  geom_line(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Linear Regression: Overall Trend in Obesity Pageviews",
       x = "Date",
       y = "Pageviews")

## `geom_smooth()` using formula = 'y ~ x'

Insights

Equation of the Model: Obesity=25432.44−1.235×Date (numeric) The negative slope (−1.235) suggests a downward trend in the number of pageviews for “Obesity” over time.
Significance of the Slope: The p-value for the slope (𝑝=0.01311) is less than 0.05, meaning the downward trend is statistically significant.
Strength of the Trend: The Multiple R-squared value is 0.0224, indicating that only 2.24% of the variation in Obesity pageviews is explained by time. This is a very weak trend, as most of the variation is not captured by the model.
Residual Analysis: The Residual Standard Error (RSE) is 647.7, which represents the average deviation of observed values from the regression line. The wide residual range (−549.3 to 9605.4) shows that there are large deviations in certain time points.
Is there a need to Subset the Data? Yes, based on the weak R-squared value and the potential for differing patterns in different time periods, subsetting the data into logical periods may help uncover stronger trends.

Subset Analysis of first half

# Calculate the median date
mid_date <- median(data_tsibble$Date)

# Subset into first half and second half
first_half <- data_tsibble %>% filter(Date <= mid_date)
second_half <- data_tsibble %>% filter(Date > mid_date)

# Linear regression for the first half
linear_model_first <- lm(Obesity ~ as.numeric(Date), data = first_half)
summary(linear_model_first)

## 
## Call:
## lm(formula = Obesity ~ as.numeric(Date), data = first_half)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -628.5 -214.3  -98.9   66.2 9561.3 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      59303.467  37333.590   1.588    0.115
## as.numeric(Date)    -2.989      1.930  -1.548    0.124
## 
## Residual standard error: 893.6 on 135 degrees of freedom
## Multiple R-squared:  0.01744,    Adjusted R-squared:  0.01017 
## F-statistic: 2.397 on 1 and 135 DF,  p-value: 0.1239

# Linear regression for the second half
linear_model_second <- lm(Obesity ~ as.numeric(Date), data = second_half)
summary(linear_model_second)

## 
## Call:
## lm(formula = Obesity ~ as.numeric(Date), data = second_half)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -389.29  -84.58   12.66   90.28 1256.04 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      53600.583   7965.108   6.729 4.44e-10 ***
## as.numeric(Date)    -2.680      0.409  -6.552 1.10e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 189.3 on 135 degrees of freedom
## Multiple R-squared:  0.2413, Adjusted R-squared:  0.2357 
## F-statistic: 42.93 on 1 and 135 DF,  p-value: 1.097e-09

Insights

In the first half, there is little evidence of a meaningful trend, as the slope is not significant and the R-squared is negligible.
In the second half, the trend is much clearer and statistically significant, suggesting a consistent and noticeable downward trend in obesity-related pageviews.
Subset Analysis: The weak overall trend in the full dataset masked the stronger trend present in the second half. Subsetting reveals a meaningful downward trend in the second half.
Stronger Trend in the Second Half: A significant decline in pageviews for obesity-related topics occurs in the second half of the timeline.
Possible Explanations: External factors like reduced interest in the topic or changes in public awareness might explain the stronger downward trend in the later period.

Visualization of subset trends

# Plot for the first half
ggplot(first_half, aes(x = Date, y = Obesity)) +
  geom_line(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Trend in First Half of Data", x = "Date", y = "Pageviews")

## `geom_smooth()` using formula = 'y ~ x'

# Plot for the second half
ggplot(second_half, aes(x = Date, y = Obesity)) +
  geom_line(color = "green") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Trend in Second Half of Data", x = "Date", y = "Pageviews")

## `geom_smooth()` using formula = 'y ~ x'

Seasonality Analysis

1. Smoothing to Detect Seasonality

# Load necessary libraries
library(ggplot2)

# Apply smoothing (LOESS smoothing)
ggplot(data_tsibble, aes(x = Date, y = Obesity)) +
  geom_line(color = "blue") +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.2, color = "red", se = FALSE) +
  labs(title = "Smoothing to Detect Seasonality",
       x = "Date",
       y = "Pageviews") +
  theme_minimal()

Interpretation

There is a large spike in pageviews early in the timeline, likely indicating an event or anomaly.
After the spike, the pageviews stabilize but exhibit some oscillations.
The red line, created using a LOESS smoothing method, provides a smoothed estimate of the trends in the data.
There appears to be some recurring patterns or seasonality, especially visible in the smaller oscillations post-spike, although the periodicity is unclear without additional analysis.

Illustrating Seasonality Using ACF

This test detectes periodicity

acf(data_tsibble$Obesity, main = "ACF: Detecting Seasonality", lag.max = 50)

Using PACF to Verify Seasonal Effects

PACF isolates direct correlations by removing indirect ones.

pacf(data_tsibble$Obesity, main = "PACF: Verifying Seasonality", lag.max = 50)

Decomposing the Time Series

# Convert data to a time series object
data_ts <- ts(data$Obesity, frequency = 7)  # Adjust 'Obesity' and frequency as needed

# Decompose the time series
decomposed <- stl(data_ts, s.window = "periodic")

# Plot the decomposed components
plot(decomposed, main = "Decomposition of Time Series")

Insights

Top Panel-data - This is the raw time series data. - we can see an initial spike followed by more stable data with periodic fluctuations. Second Panel-seasonal - This shows repeating patterns or cycles within the data, such as weekly, monthly, or yearly seasonality. - The consistent oscillations suggest a strong seasonal pattern, likely periodic over the specified time frame. - The amplitude of these oscillations appears stable over time. Third Panel-trend - This represents the long-term movement in the data, abstracting away seasonal variations. - Initially, there is a sharp upward trend, corresponding to the spike in the data. - Afterward, the trend stabilizes and shows small fluctuations over time. Bottom Panel-remainder - This shows the random noise or residual variation that cannot be explained by the trend or seasonal components. - The large outlier corresponds to the initial spike in the data, which is not explained by seasonality or trend. - The remainder is relatively stable after the spike, with small variations.

Seasonality: A strong seasonal pattern is evident in the data, indicating regular cycles. Trend: The trend stabilizes after an initial spike, showing a long-term consistent movement. Remainder: Anomalies (e.g., the spike) are captured as residuals, likely due to external events.

week 12

Saisree Mucharla

2024-11-16

Converting date column

Creating a tsibble Object

plotting data over time

Different Windows of Time

key observations

Linear Regression to Detect Trends

Insights

Subset Analysis of first half

Insights

Visualization of subset trends

Seasonality Analysis

1. Smoothing to Detect Seasonality

Interpretation

Illustrating Seasonality Using ACF

Using PACF to Verify Seasonal Effects

Decomposing the Time Series

Insights