First, I will load the wikipediatrend library to get Wikipedia page view data, the xts and tsibble libraries to work with time-series data, and the tidyverse library for general dataset management and visualization.

library(wikipediatrend)
## Warning: package 'wikipediatrend' was built under R version 4.3.2
## 
##   [wikipediatrend]
##     
##   Note:
##     
##     - Data before 2016-01-01 
##       * is provided by petermeissner.de and
##       * was prepared in a project commissioned by the Hertie School of Governance (Prof. Dr. Simon Munzert)
##       * and supported by the Daimler and Benz Foundation.
##     
##     - Data from 2016-01-01 onwards 
##       * is provided by the Wikipedia Foundation
##       * via its pageviews package and API.
## 
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(xts)
## Warning: package 'xts' was built under R version 4.3.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.2
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.2
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

Getting Time-Series Data

Because the hypothyroidism dataset I normally analyze does not contain any time values, I chose to use page views for the "Hypothyroidism" Wikipedia page so that I had time data to analyze. I chose the 2009-2013 timeframe for three reasons: I couldn't get data after December 31, 2015; I wanted five years of data to analyze for this data dive; and a couple of massive outliers in 2014 and 2015 made the data difficult to visualize when those two years were included.

hypopageviews <- wpd_get_exact(page="Hypothyroidism", from="2009-01-01", to="2013-12-31", lang="en", warn=TRUE)

Next, I will make a tsibble of the pageviews dataset to use going forward.

pageViews <- select(hypopageviews, date, views)
pageViewsTS <- as_tsibble(pageViews, index = date)
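
Before relying on a daily index, it can help to confirm that no dates are missing or duplicated. The following is a minimal base R sketch of that check on a small made-up data frame (`toy`, with hypothetical values, not the real page views); tsibble also offers `has_gaps()` and `fill_gaps()` for the same purpose.

```r
# Hypothetical daily data with one missing day (not the real page views)
dates <- seq(as.Date("2009-01-01"), as.Date("2009-01-10"), by = "day")
toy <- data.frame(
  date  = dates[-5],  # drop 2009-01-05 to simulate a gap
  views = c(120, 130, 125, 140, 150, 145, 135, 128, 132)
)

# A complete daily index covering the observed range
full_dates <- seq(min(toy$date), max(toy$date), by = "day")

# Dates in the complete index that are absent from the data
missing_dates <- full_dates[!(full_dates %in% toy$date)]

# TRUE if any date appears more than once
has_duplicates <- anyDuplicated(toy$date) > 0

print(missing_dates)
print(has_duplicates)
```

If `missing_dates` is not empty, the gaps would need to be filled or acknowledged before computing lag-based statistics.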

Visualizing the Time-Series with Smoothing

We can visualize the change in views for the "Hypothyroidism" Wikipedia page over time below:

pageViewsTS |>
  ggplot(aes(x = date, y = views)) +
  geom_line(color = "gold") +
  # linewidth replaces size, which is deprecated for lines since ggplot2 3.4.0
  geom_smooth(span = 0.2, color = "navy", se = FALSE, linewidth = 1.5) +
  ylim(0, 13000) +
  labs(title = '"Hypothyroidism" Page Views on Wikipedia from 2009-2013') +
  theme_bw() +
  scale_x_date(breaks = "1 year", labels = \(x) year(x))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The number of page views changes greatly from day to day, as seen in the sharp ups and downs of the original line. There were brief periods when page views dipped heavily and others when they spiked, which reflects the natural variation in how often the page is accessed on a given day. For instance, a sharp dip could reflect a holiday when fewer people are using Wikipedia in general, while a sudden peak could reflect many people looking up hypothyroidism after a news story about it.

There are two obvious overarching trends in the number of page views over time. The first is a slow upward trend that lasts until about a quarter of the way through 2013; the second is the sharp decrease that follows through the end of 2013. You could also break it down further into five trends, one per year: 2009 had fairly consistent or perhaps slightly decreasing page views, 2010 had increasing page views, 2011 had decreasing page views, 2012 had slightly increasing page views, and 2013 had sharply decreasing page views.
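
One way to put numbers on a two-trend reading like this is to fit a separate straight line to each segment and compare the slopes. The sketch below does that on simulated data; the series, the noise level, and the `breakpoint` date are all assumptions chosen for illustration, not values estimated from the real page views.

```r
# Hypothetical daily series: slow rise, then a sharp fall (not the real page views)
set.seed(1)
dates <- seq(as.Date("2009-01-01"), as.Date("2013-12-31"), by = "day")
breakpoint <- as.Date("2013-04-01")  # assumed change point, chosen by eye

day_num <- as.numeric(dates - min(dates))      # days since the start
bp_num  <- as.numeric(breakpoint - min(dates)) # day number of the change point
before  <- dates < breakpoint

# Rise by 1 view/day before the break, fall by 10 views/day after it, plus noise
views <- ifelse(before,
                5000 + day_num,
                5000 + bp_num - 10 * (day_num - bp_num)) +
  rnorm(length(dates), sd = 300)

toy <- data.frame(date = dates, day_num = day_num, views = views, before = before)

# Fit one linear trend per segment
fit_before <- lm(views ~ day_num, data = toy[toy$before, ])
fit_after  <- lm(views ~ day_num, data = toy[!toy$before, ])

# Slopes are in views per day
slope_before <- unname(coef(fit_before)["day_num"])
slope_after  <- unname(coef(fit_after)["day_num"])

print(c(before = slope_before, after = slope_after))
```

A positive first slope and a much larger negative second slope would support the description of a slow rise followed by a sharp fall.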

Looking at Seasons

Visualization

First, to identify potential seasons, I will plot the quarterly page view averages for the "Hypothyroidism" Wikipedia page along with a smooth curve:

pageViewsTS |>
  # group by the first day of each quarter
  index_by(quarter_start = floor_date(date, "quarter")) |>
  summarise(avg_views = mean(views, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = quarter_start, y = avg_views)) +
  # linewidth replaces size, which is deprecated for lines since ggplot2 3.4.0
  geom_line(color = "navy", linewidth = 1.5) +
  geom_smooth(span = 0.3, color = "red", se = FALSE, linewidth = 1.5) +
  labs(title = 'Average Number of "Hypothyroidism" Wikipedia Page Views Over Time', subtitle = "(by quarter year)", y = "Average Views", x = "Year") +
  scale_x_date(breaks = "1 year", labels = \(x) year(x)) +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

From 2011 on, the page views seem to peak around the beginning of the year every year and decline over the rest of the year, which would suggest yearly seasonality. Additionally, average views still appear to be increasing overall with each year until 2013, when they plummeted.
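
The same quarterly averaging can be sketched in base R to check which quarter has the highest mean. The data below are made up so that the first quarter is highest by construction; `toy` and its values are hypothetical and only illustrate the mechanics.

```r
# Hypothetical daily data for one year, with higher views in January-March
dates <- seq(as.Date("2012-01-01"), as.Date("2012-12-31"), by = "day")
in_q1 <- format(dates, "%m") %in% c("01", "02", "03")
toy <- data.frame(date = dates, views = ifelse(in_q1, 8000, 6000))

# Label each day with its year and quarter, e.g. "2012 Q1"
toy$quarter <- paste(format(toy$date, "%Y"), quarters(toy$date))

# Mean views per quarter
quarterly <- aggregate(views ~ quarter, data = toy, FUN = mean)

# Quarter with the highest average
peak_quarter <- as.character(quarterly$quarter[which.max(quarterly$views)])

print(quarterly)
print(peak_quarter)
```

Repeating this per year on the real data would show whether the peak quarter is the same from year to year.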

Autocorrelation

We can look at how page views correlate with their own past values to see if there are any smaller-scale seasons, which cannot be visualized easily via a smooth curve. In the graph below, a lag of 1 is equivalent to one week passing:

acf(pageViewsTS, ci = 0.95, na.action = na.exclude,main='ACF for "Hypothyroidism" Wikipedia Page Views',xlab="Lag (Weekly)")

From this, we can see that the pattern repeats every seven lines. Since each line represents one day and a lag of 1 spans one week, this indicates weekly autocorrelation, which suggests that there may also be a weekly seasonality to "Hypothyroidism" Wikipedia page views.
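
To illustrate how a weekly pattern shows up in an ACF, here is a minimal sketch on simulated data. It assumes a made-up seven-day pattern (`weekly_pattern`) plus noise, stored as a `ts` with `frequency = 7`; with that frequency, `acf()` reports lags in units of one period, so a lag of 1 means one week.

```r
# Hypothetical daily series with a repeating seven-day pattern (not the real page views)
set.seed(42)
n_weeks <- 52
weekly_pattern <- c(100, 110, 105, 100, 95, 70, 75)  # assumed Mon..Sun levels
views <- rep(weekly_pattern, n_weeks) + rnorm(7 * n_weeks, sd = 5)

# frequency = 7 marks seven observations (days) per period (week)
views_ts <- ts(views, frequency = 7)

# Compute the ACF for up to 21 days without drawing the plot
acf_out <- acf(views_ts, lag.max = 21, plot = FALSE)

lags_in_weeks <- as.numeric(acf_out$lag)  # 0, 1/7, 2/7, ..., 3
acf_vals      <- as.numeric(acf_out$acf)

# Autocorrelation at a lag of exactly one week (seven days)
acf_at_one_week <- acf_vals[which.min(abs(lags_in_weeks - 1))]

print(round(lags_in_weeks[1:8], 3))
print(acf_at_one_week)
```

A high autocorrelation at lags of 1, 2, and 3 weeks, with lower values in between, is the signature of weekly seasonality.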

Conclusions

From the analysis of the "Hypothyroidism" Wikipedia page views between January 2009 and December 2013, two clear trends emerge. Between January 2009 and April 2013, there was a slight overall increase, and between April 2013 and December 2013, there was a steeper decrease in the number of page views. It is unclear why the slight increase switched to a stronger decrease over the last nine months of 2013. Perhaps the page was undergoing massive changes, or maybe Google or the Wikipedia search tool stopped recommending the page, which caused page views to drop sharply.

Additionally, there appears to be both yearly and weekly seasonality in the page view data. Page views oscillate similarly over the course of each week, but overall, they tend to peak around the beginning of the year, decrease through the first half of the year, and increase once again in the second half of the year. The weekly seasonality may be explained by the fact that people tend to have more time on weekends and less time during the week to look up information online, so some general weekly patterns in page views are to be expected. The yearly seasonality may be explained by the fact that people tend to be more health-conscious at the beginning of the year due to New Year's resolutions, which may translate to more people going to the doctor at the beginning of the year and being diagnosed with hypothyroidism. I do not know for certain whether doctors' visits also increase at the beginning of the year, so I cannot be confident in my yearly seasonality hypothesis, but it could be worth looking into further.