Extract Time-series of Page Views for Wikipedia Page

My dataset does not have a time column, so I am using the daily page view data for the Wikipedia page “Obesity” instead. The dates range from 8/19/2024 to 11/17/2024 (the past 90 days, which was the most data available).

library(tidyverse)
# Pick the page view CSV interactively
timedata <- read.csv(file.choose())

Choose a column of data to analyze over time

The column I will be analyzing over time is the “Obesity” column, which contains the number of page views per day. It is the only column in the Wikipedia time-series data other than the date.

library(xts)
library(tsibble)
# Keep only the date and page view columns, and drop duplicate rows
timedata_1 <- timedata |>
  select(Date, Obesity) |>
  distinct()
library(lubridate)

# Create a tsibble of daily page visits;
# fill_gaps() adds an NA row for any day missing from the data
timedata_ts <- timedata_1 |>
  mutate(Date = as_date(Date)) |>
  as_tsibble(index = Date) |>
  fill_gaps()
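
Before relying on fill_gaps(), it can help to check whether a series has any missing days at all. The sketch below uses a small made-up data frame (the `toy` object and its `Views` values are invented for illustration, not taken from the Wikipedia export) together with tsibble's has_gaps() and count_gaps().

```r
library(tsibble)

# Toy daily series with made-up values; 2024-08-21 is deliberately missing
toy <- data.frame(
  Date  = as.Date(c("2024-08-19", "2024-08-20", "2024-08-22", "2024-08-23")),
  Views = c(100, 120, 90, 110)
)

toy_ts <- as_tsibble(toy, index = Date)

# One-row tibble with a logical .gaps column
has_gaps(toy_ts)

# Start (.from), end (.to), and length (.n) of each gap
count_gaps(toy_ts)

# fill_gaps() inserts the missing day, with NA in the measured column
toy_filled <- fill_gaps(toy_ts)
toy_filled
```

If count_gaps() returns no rows for the real data, fill_gaps() leaves the tsibble unchanged.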

Plot the data over time

library(ggplot2)

ggplot(timedata_ts, aes(x = Date, y = Obesity)) +
  geom_line() +
  labs(title = "'Obesity' Wikipedia Page Visits from 8/19/2024 to 11/17/2024", x = "Date", y = "# of Page Visits") +
  theme_minimal()

What stands out immediately in the line graph of all 90 days of data is the spike in early to mid October. I thought October might be National Obesity Awareness Month or something similar, but Childhood Obesity Awareness Month is in September, which wouldn’t explain a spike in October. I did find that World Obesity Day used to fall on October 11 before 2020, when it was moved to March 4. This might account for the spike.

Looking at the month of October (window of time)

# Subset the data to October
october <- timedata_ts |>
  filter(Date >= as.Date("2024-10-01") & Date <= as.Date("2024-10-31"))

# Graph the month of October
ggplot(october, aes(x = Date, y = Obesity)) +
  geom_line() +
  labs(title = "'Obesity' Wikipedia Page Visits in October", x = "Date", y = "# of Page Visits") +
  theme_minimal()

Looking at this, the spike in page visits does appear around October 11, which lends some support to my guess that it is related to World Obesity Day, even though that day is no longer observed on October 11. After that one day, page visits return to “normal” levels.
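
Rather than reading the peak date off the plot, the day with the most views can be pulled out directly. This is a minimal sketch on invented data: the `toy_october` data frame, its flat baseline, and the size of the spike are all assumptions for illustration and do not represent the actual page view counts.

```r
# Toy October series with made-up values: flat baseline plus one artificial spike
dates <- seq(as.Date("2024-10-01"), as.Date("2024-10-31"), by = "day")
views <- rep(1000, length(dates))
views[dates == as.Date("2024-10-11")] <- 5000

toy_october <- data.frame(Date = dates, Views = views)

# Row with the highest number of views
peak <- toy_october[which.max(toy_october$Views), ]
peak

# How far the peak sits above a typical day, using the median as a robust baseline
peak_ratio <- peak$Views / median(toy_october$Views)
peak_ratio
```

The same two steps should work on the real `october` object by swapping in its `Obesity` column, assuming it has no NA values on the peak day.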

Use Smoothing

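I chose to use a LO(W)ESS smoothing function on my data. With fewer than 1,000 observations, geom_smooth() uses LOESS by default, and a span of 0.2 makes the curve follow the data more closely than the default span of 0.75.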

ggplot(timedata_ts, aes(x = Date, y = Obesity)) +
  geom_line() +
  geom_smooth(span = 0.2, color = "pink") +
  labs(title = "Smoothed Graph", x = "Date", y = "# of Page Visits") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The only trend I see here, even with smoothing, is the October spike discussed above. Otherwise, the spikes and drops seem fairly consistent across the other months.
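
To see how much the span setting matters, the sketch below fits base R's loess() with two spans on a synthetic series. The data are simulated (a weekly wave, random noise, and one artificial spike at day 54), so the numbers say nothing about the real page views; the comparison only illustrates that a narrow span reacts more strongly to a single spike than a wide one.

```r
set.seed(1)

# Synthetic daily series: weekly wave + noise + one artificial spike
day <- 1:91
views <- 1000 + 50 * sin(2 * pi * day / 7) + rnorm(91, sd = 20)
views[54] <- views[54] + 800

# Same smoother, two spans: 0.2 as in the plot above, 0.75 is the loess() default
fit_narrow <- loess(views ~ day, span = 0.2)
fit_wide   <- loess(views ~ day, span = 0.75)

smooth_narrow <- unname(predict(fit_narrow))
smooth_wide   <- unname(predict(fit_wide))

# Smoothed value on the spike day under each span
c(narrow = smooth_narrow[54], wide = smooth_wide[54])
```

A narrow span averages over fewer neighboring days, so a one-day spike pulls the curve up more than it does with a wide span.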

Illustrate Seasonality Using ACF

# na.pass lets acf() run even if fill_gaps() introduced NA days
acf(timedata_ts$Obesity, na.action = na.pass, main = "ACF of Page Visits")

Looking at this ACF, a good number of lags show correlation with earlier time periods, and the general pattern seems to be a run of negative correlations followed by a run of positive correlations, repeated. Because the series is daily, each lag on the x-axis represents one day. I’m still unsure of how to read and interpret these graphs, but I think this pattern aligns with the spikes and drops I was seeing in my initial line graph of this data.
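
As a reading aid, the sketch below runs acf() on a simulated daily series with a built-in 7-day cycle, so the expected pattern is known in advance. Everything here is synthetic (the `views` vector, its cycle, and its noise level are assumptions for illustration). The threshold `ci` follows the usual white-noise approximation that plot.acf() uses for its dashed 95% lines.

```r
set.seed(42)

# Synthetic daily series with a 7-day cycle plus noise
n <- 91
day <- seq_len(n)
views <- 1000 + 100 * sin(2 * pi * day / 7) + rnorm(n, sd = 30)

# Compute the ACF without plotting; each lag is one observation, i.e. one day
toy_acf <- acf(views, plot = FALSE)
lags <- toy_acf$lag[, 1, 1]
vals <- toy_acf$acf[, 1, 1]

# Approximate 95% bounds under the white-noise assumption
ci <- qnorm(0.975) / sqrt(n)

# Lags outside the bounds are the ones usually called significant
data.frame(lag = lags, acf = round(vals, 2), significant = abs(vals) > ci)
```

With 91 daily observations, the default acf() call covers lags 0 through 19, so the plot spans a little under three weeks. In this toy series the autocorrelation is strongly positive at lag 7 and negative around lags 3 and 4, which is the signature of a weekly cycle.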

Because this isn’t directly related to the other dataset I’m using for this class, there’s not much more analysis that I would do. However, if I were going to work with this dataset more, I would likely explore what’s going on in October, maybe breaking the graph down so I can look at page views for individual days, or running some kind of regression model to see if there’s any relationship there.