My dataset does not include a time column, so I am instead using page view data for the Wikipedia page “Obesity”. The dates range from 8/19/2024 to 11/17/2024 (the past 90 days, which was the most data available).
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#read in the page view CSV (file.choose() opens an interactive file picker)
timedata <- read.csv(file.choose())
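For anyone reproducing this without the downloaded CSV, the same daily counts can be pulled straight from the Wikimedia API. This is a minimal sketch, assuming the pageviews package is installed; the rest of the analysis uses the CSV read in above.
library(pageviews)
#fetch daily views for "Obesity" over the same window (timestamps are YYYYMMDDHH)
timedata_api <- article_pageviews(project = "en.wikipedia", article = "Obesity",
                                  start = "2024081900", end = "2024111700",
                                  granularity = "daily")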
The column I will be analyzing over time is the “Obesity” column, which contains the number of page views per day; aside from the date, it is the only column in the Wikipedia time-series data.
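A quick look at the data to confirm that structure (my addition, using dplyr’s glimpse()):
#should show just the Date column and the Obesity view counts
glimpse(timedata)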
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method               from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:zoo':
##
## index
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
#Filter out duplicates
timedata_1 <- timedata |>
  select(Date, Obesity) |>
  distinct()
library(lubridate)
#create a tsibble of daily page visits
timedata_ts <- timedata_1 |>
  mutate(Date = as_date(Date)) |>
  as_tsibble(index = Date) |>
  fill_gaps()
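As a quick sanity check (my addition), any dates that fill_gaps() had to insert will show up here with NA page-view counts:
#list any filled-in dates that have missing view counts
timedata_ts |>
  filter(is.na(Obesity))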
library(ggplot2)
ggplot(timedata_ts, aes(x = Date, y = Obesity)) +
  geom_line() +
  labs(title = "'Obesity' Wikipedia Page Visits from 8/19/2024 to 11/17/2024", x = "Date", y = "# of Page Visits") +
  theme_minimal()
Something that stands out immediately when looking at the line graph of all 90 days of data is the spike in early-to-mid October. I thought October might be a national obesity awareness month or something like that, but Childhood Obesity Awareness Month is in September, which wouldn’t necessarily explain a spike in October. I was able to find that, before 2020, October 11 was World Obesity Day; the observance has since moved to March 4, but the old date might account for the spike.
#subset the data
october <- timedata_ts |>
  filter(Date >= as.Date("2024-10-01") & Date <= as.Date("2024-10-31"))
#graph the month of October
ggplot(october, aes(x = Date, y = Obesity)) +
  geom_line() +
  labs(title = "'Obesity' Wikipedia Page Visits in October", x = "Date", y = "# of Page Visits") +
  theme_minimal()
Looking at this, the spike in page visits does appear to center on October 11, which lends some support to my guess that it was for World Obesity Day, even though the observance is no longer on that date. After that one day, page visits appear to return to “normal” levels.
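To pin the spike to an exact date rather than eyeballing the graph, here is a quick check (my addition) of the highest-traffic days in October:
#top three October days by page visits
october |>
  as_tibble() |>
  slice_max(Obesity, n = 3)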
model <- lm(Obesity ~ Date, data = timedata_ts)
ggplot(timedata_ts, aes(x = Date, y = Obesity)) +
  geom_line() +
  geom_smooth(method = "lm", color = "pink") +
  labs(title = "'Obesity' Wikipedia Page Visits from 8/19/2024 to 11/17/2024", x = "Date", y = "# of Page Visits") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
summary(model)
##
## Call:
## lm(formula = Obesity ~ Date, data = timedata_ts)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -212.79  -84.99    6.13   60.60  634.93
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2597.47449 9735.84654   0.267    0.790
## Date          -0.06875    0.48682  -0.141    0.888
##
## Residual standard error: 122 on 89 degrees of freedom
## Multiple R-squared: 0.0002241, Adjusted R-squared: -0.01101
## F-statistic: 0.01995 on 1 and 89 DF, p-value: 0.888
Looking at this trend line and linear regression model, I determined that the model is not significant, so there is no evidence of a relationship between the date and the number of page views. I figured this out, first, by looking at the summary of the model: the p-value (0.888) is large, almost 1, and the general rule of thumb is that the smaller the p-value, the stronger the evidence. Second, the R-squared value is essentially zero (the closer to 1, the more of the variation the model explains), which says that the date explains basically none of the variation in page visits. The graph agrees: the number of page visits varies widely and follows no trend, so I can reasonably conclude there is no linear relationship here.
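Rather than reading these numbers off the printed summary, they can also be pulled directly from the model object (a small base-R sketch):
#R-squared and the p-value for the Date coefficient
summary(model)$r.squared
summary(model)$coefficients["Date", "Pr(>|t|)"]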
In this case, it might make sense to subset the data by month because of the strong spike during October; otherwise, the months look relatively consistent. Even so, I didn’t run a linear regression model on October alone, because the trend there is driven by a single day and is otherwise pretty weak.
I chose to use a LO(W)ESS smoothing function on my data.
ggplot(timedata_ts, aes(x = Date, y = Obesity)) +
  geom_line() +
  geom_smooth(span = 0.2, color = "pink") +
  labs(title = "Smoothed Graph", x = "Date", y = "# of Page Visits") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The only trend that I see here, even with smoothing, is the spike in October, which I’ve already addressed earlier in this data dive. Otherwise, the spikes and drops seem pretty consistent across the other months.
#ACF: correlation of the page-visit series with lagged copies of itself
acf(timedata_ts$Obesity, main = "ACF of Page Visits")
Looking at this ACF, a good number of lags are correlated with earlier time periods, and the general pattern seems to be a negative correlation for a few weeks, then a positive correlation, repeating. I’m still unsure how to read and interpret these graphs fully, but I think this aligns with the spikes and drops I saw in my initial line graph of this data.
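One way to back up the eyeball reading with a number (my addition, not part of the original analysis) is a Ljung-Box test, which checks whether the autocorrelations out to a given lag are jointly nonzero; a small p-value would confirm the serial correlation visible in the ACF.
#test for autocorrelation over the first two weeks of lags
#(na.omit() guards against any NAs introduced by fill_gaps())
Box.test(na.omit(timedata_ts$Obesity), lag = 14, type = "Ljung-Box")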
Because this isn’t directly correlated with my other dataset I’m using for this class, there’s not much more analysis that I would do. However, if I was going to be working with this dataset more, I would likely explore what’s going on in October more, maybe breaking the graph down so I can see page views by day, or running some kind of regression model to see if there’s any relationship going on there.