My dataset does not include a time column, so I am instead using page view data for the Wikipedia page “Obesity”. The dates range from 8/19/2024 to 11/17/2024 (the past 90 days, which was the most data available).
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#read in the page view CSV (file.choose() opens an interactive file picker)
timedata <- read.csv(file.choose())
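For anyone reproducing this without the downloaded CSV, the same daily counts can be pulled straight from the Wikimedia API. This is a minimal sketch, assuming the pageviews package is installed; the rest of the analysis uses the CSV read in above.
library(pageviews)
#fetch daily views for "Obesity" over the same window (timestamps are YYYYMMDDHH)
timedata_api <- article_pageviews(project = "en.wikipedia", article = "Obesity",
                                  start = "2024081900", end = "2024111700",
                                  granularity = "daily")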
The column I will be analyzing over time is the “Obesity” column, which contains the number of page views per day; aside from the date, it is the only column in the Wikipedia time-series data.
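A quick look at the data to confirm that structure (my addition, using dplyr’s glimpse()):
#should show just the Date column and the Obesity view counts
glimpse(timedata)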
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method               from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:zoo':
##
## index
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
#Filter out duplicates
timedata_1 <- timedata |>
  select(Date, Obesity) |>
  distinct()
library(lubridate)
#create a tsibble of daily page visits
timedata_ts <- timedata_1 |>
  mutate(Date = as_date(Date)) |>
  as_tsibble(index = Date) |>
  fill_gaps()
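As a quick sanity check (my addition), any dates that fill_gaps() had to insert will show up here with NA page-view counts:
#list any filled-in dates that have missing view counts
timedata_ts |>
  filter(is.na(Obesity))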
library(ggplot2)
ggplot(timedata_ts, aes(x = Date, y = Obesity)) +
  geom_line() +
  labs(title = "'Obesity' Wikipedia Page Visits from 8/19/2024 to 11/17/2024", x = "Date", y = "# of Page Visits") +
  theme_minimal()
Something that stands out immediately when looking at the line graph of all 90 days of data is the spike in early-to-mid October. I thought October might be a national obesity awareness month or something like that, but Childhood Obesity Awareness Month is in September, which wouldn’t necessarily explain a spike in October. I was able to find that, before 2020, October 11 was World Obesity Day; the observance has since moved to March 4, but the old date might account for the spike.
#subset the data
october <- timedata_ts |>
  filter(Date >= as.Date("2024-10-01") & Date <= as.Date("2024-10-31"))
#graph the month of October
ggplot(october, aes(x = Date, y = Obesity)) +
  geom_line() +
  labs(title = "'Obesity' Wikipedia Page Visits in October", x = "Date", y = "# of Page Visits") +
  theme_minimal()
Looking at this, the spike in page visits does appear to center on October 11, which lends some support to my guess that it was for World Obesity Day, even though the observance is no longer on that date. After that one day, page visits appear to return to “normal” levels.
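To pin the spike to an exact date rather than eyeballing the graph, here is a quick check (my addition) of the highest-traffic days in October:
#top three October days by page visits
october |>
  as_tibble() |>
  slice_max(Obesity, n = 3)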
model <- lm(Obesity ~ Date, data = timedata_ts)
ggplot(timedata_ts, aes(x = Date, y = Obesity)) +
  geom_line() +
  geom_smooth(method = "lm", color = "pink") +
  labs(title = "'Obesity' Wikipedia Page Visits from 8/19/2024 to 11/17/2024", x = "Date", y = "# of Page Visits") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
summary(model)
##
## Call:
## lm(formula = Obesity ~ Date, data = timedata_ts)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -212.79  -84.99    6.13   60.60  634.93
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2597.47449 9735.84654   0.267    0.790
## Date          -0.06875    0.48682  -0.141    0.888
##
## Residual standard error: 122 on 89 degrees of freedom
## Multiple R-squared: 0.0002241, Adjusted R-squared: -0.01101
## F-statistic: 0.01995 on 1 and 89 DF, p-value: 0.888
Looking at this trend line and linear regression model, I determined that the model is not significant, so there is no evidence of a relationship between the date and the number of page views. I figured this out, first, by looking at the summary of the model: the p-value (0.888) is large, almost 1, and the general rule of thumb is that the smaller the p-value, the stronger the evidence. Second, the R-squared value is essentially zero (the closer to 1, the more of the variation the model explains), which says that the date explains basically none of the variation in page visits. The graph agrees: the number of page visits varies widely and follows no trend, so I can reasonably conclude there is no linear relationship here.
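Rather than reading these numbers off the printed summary, they can also be pulled directly from the model object (a small base-R sketch):
#R-squared and the p-value for the Date coefficient
summary(model)$r.squared
summary(model)$coefficients["Date", "Pr(>|t|)"]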
In this case, it might make sense to subset the data by month because of the strong spike during October; otherwise, the months look relatively consistent. Even so, I didn’t run a linear regression model on October alone, because the trend there is driven by a single day and is otherwise pretty weak.
I chose to use a LO(W)ESS smoothing function on my data.
ggplot(timedata_ts, aes(x = Date, y = Obesity)) +
  geom_line() +
  geom_smooth(span = 0.2, color = "pink") +
  labs(title = "Smoothed Graph", x = "Date", y = "# of Page Visits") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The only trend that I see here, even with smoothing, is the spike in October, which I’ve already addressed earlier in this data dive. Otherwise, the spikes and drops seem pretty consistent across the other months.
#ACF: correlation of the page-visit series with lagged copies of itself
acf(timedata_ts$Obesity, main = "ACF of Page Visits")
Looking at this ACF, a good number of lags are correlated with earlier time periods, and the general pattern seems to be a negative correlation for a few weeks, then a positive correlation, repeating. I’m still unsure how to read and interpret these graphs fully, but I think this aligns with the spikes and drops I saw in my initial line graph of this data.
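One way to back up the eyeball reading with a number (my addition, not part of the original analysis) is a Ljung-Box test, which checks whether the autocorrelations out to a given lag are jointly nonzero; a small p-value would confirm the serial correlation visible in the ACF.
#test for autocorrelation over the first two weeks of lags
#(na.omit() guards against any NAs introduced by fill_gaps())
Box.test(na.omit(timedata_ts$Obesity), lag = 14, type = "Ljung-Box")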
Because this isn’t directly correlated with my other dataset I’m using for this class, there’s not much more analysis that I would do. However, if I was going to be working with this dataset more, I would likely explore what’s going on in October more, maybe breaking the graph down so I can see page views by day, or running some kind of regression model to see if there’s any relationship going on there.