Data Dive 12

Your RMarkdown notebook for this data dive should contain the following:

Select a column of your data that encodes time (e.g., “date”, “timestamp”, “year”, etc.). Convert this into a Date in R. Note, you may need to use some combination of as.Date or to_datetime. You may even need to paste year, month, day, hour, etc. together using paste (even if you need to make up a month, like “__/01/01”).

If you do not have a time-based column of data: find a Wikipedia page that is related to your dataset. Then, extract a time series of page views for that page using the Wikipedia page views website or the R package used in this week’s lab. If you choose this option, find ways to tie your results from the below analysis into what you’re seeing with your own data!

Choose a column of data to analyze over time. This should be a “response-like” variable that is of particular interest.

Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time. What stands out immediately?

Use linear regression to detect any upwards or downwards trends. Do you need to subset the data for multiple trends? How strong are these trends?

Use smoothing to detect at least one season in your data, and interpret your results. Can you illustrate the seasonality using ACF or PACF?

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'xts' was built under R version 4.3.2
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## Warning: package 'tsibble' was built under R version 4.3.2
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
## Warning: package 'pageviews' was built under R version 4.3.2
## Warning: package 'WikipediR' was built under R version 4.3.2

Since our dataset has no column suitable for time series analysis, we instead use daily page views for the Wikipedia article 2023_Cricket_World_Cup.

## 'data.frame':    316 obs. of  8 variables:
##  $ project    : chr  "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
##  $ language   : chr  "en" "en" "en" "en" ...
##  $ article    : chr  "2023_Cricket_World_Cup" "2023_Cricket_World_Cup" "2023_Cricket_World_Cup" "2023_Cricket_World_Cup" ...
##  $ access     : chr  "all-access" "all-access" "all-access" "all-access" ...
##  $ agent      : chr  "user" "user" "user" "user" ...
##  $ granularity: chr  "daily" "daily" "daily" "daily" ...
##  $ date       : POSIXct, format: "2023-01-01" "2023-01-02" ...
##  $ views      : num  27564 28215 36058 18780 23476 ...
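The step from this data frame to a tsibble can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the three view counts are made up, and `page_views` is a hypothetical stand-in with the same `date` (POSIXct) and `views` columns as the `str()` output above. The name `page_views_tsibble` is taken from the regression call later in the notebook.

```r
library(tsibble)

# Stand-in for the downloaded data: same column names and types as the
# str() output above, but with made-up view counts (illustrative only).
page_views <- data.frame(
  article = "2023_Cricket_World_Cup",
  date    = as.POSIXct(c("2023-01-01", "2023-01-02", "2023-01-03"), tz = "UTC"),
  views   = c(100, 120, 90)
)

# Convert POSIXct to Date so the index counts days rather than seconds.
page_views$date <- as.Date(page_views$date, tz = "UTC")

# Keep only the date and the response variable, with date as the time index.
page_views_tsibble <- as_tsibble(page_views[, c("date", "views")], index = date)

print(page_views_tsibble)
```

Converting to Date before building the tsibble matters for the regression: with a Date index, a fitted slope reads as views per day.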

Baseline Traffic: For most of the year, page views stay at a relatively constant baseline, with occasional small spikes.

Significant Spike: Page views rise sharply starting in October. This likely reflects growing public interest as the event approached and then began (the tournament opened on 5 October 2023).

Trend Analysis

## 
## Call:
## lm(formula = views ~ date, data = page_views_tsibble)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -202757 -132744  -25994   62658  807165 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.594e+07  1.987e+06  -13.06   <2e-16 ***
## date         1.335e+03  1.018e+02   13.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 165100 on 314 degrees of freedom
## Multiple R-squared:  0.3537, Adjusted R-squared:  0.3517 
## F-statistic: 171.9 on 1 and 314 DF,  p-value: < 2.2e-16
## Warning: Removed 6 rows containing missing values (`geom_line()`).

The graph and the regression indicate an increase in page views over time, especially as the Cricket World Cup draws near. The fitted slope is about 1,335 additional views per day, and it is statistically significant (p < 2e-16), so date is a useful predictor of page views. It does not tell us everything, however: R-squared is 0.35, meaning that only about one-third of the variation in views is explained by the passing of time, and other factors drive much of the day-to-day change. The large residuals (up to about 807,000 views) also show that a single straight line fits the October spike poorly.
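To illustrate how subsetting can separate multiple trends, here is a small sketch on simulated data. Nothing below uses the real page views: the series, the break point at day 276, and the ramp size are all assumptions chosen to loosely mimic a flat baseline followed by a late spike.

```r
# Simulated daily series (not the real page views): a wobbling baseline,
# then a steep ramp over the final 40 days.
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = 316)
t <- seq_along(dates)
break_day <- 276  # assumed break point, chosen for illustration
views <- 20000 + 2000 * sin(t) + ifelse(t > break_day, 15000 * (t - break_day), 0)
sim <- data.frame(date = dates, views = views)

# One line through everything. lm() treats a Date as a count of days,
# so the slope is in views per day.
fit_all   <- lm(views ~ date, data = sim)
slope_all <- unname(coef(fit_all)["date"])
r2_all    <- summary(fit_all)$r.squared

# Separate fits before and after the break point.
fit_pre    <- lm(views ~ date, data = sim[t <= break_day, ])
fit_post   <- lm(views ~ date, data = sim[t > break_day, ])
slope_pre  <- unname(coef(fit_pre)["date"])
slope_post <- unname(coef(fit_post)["date"])

print(c(all = slope_all, pre = slope_pre, post = slope_post))
print(r2_all)
```

In this simulation the single fit reports a moderate positive slope, while the split fits show a nearly flat first segment and a much steeper second one. The same comparison on the real series would show whether the overall trend is driven mainly by the October spike.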

## Warning: package 'forecast' was built under R version 4.3.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

The ACF plot shows that page views are correlated across several days: if views are high one day, they are likely to be high the next. The PACF plot suggests that each day’s views are mostly influenced by the day before, with earlier days adding little once that is accounted for. There is no clear repeating pattern, such as a “season” of high views. This may be because less than one year of data cannot show a pattern tied to how often the World Cup is held, or because the general increase in interest is so strong that it dominates everything else.
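For comparison, this is roughly what a weekly season would look like in the ACF. The sketch uses simulated data with a built-in 7-day cycle (an assumption made purely for illustration, not a property of the real page views), removes the trend with a centered 7-day moving average, and then inspects the autocorrelation of what is left.

```r
# Simulated series: linear trend plus a 7-day cycle (illustrative only).
t <- 1:140
views <- 20000 + 50 * t + 3000 * sin(2 * pi * t / 7)

# Centered 7-day moving average as a simple smoother. stats::filter is
# called explicitly because dplyr::filter masks it when tidyverse is loaded.
trend <- stats::filter(views, rep(1 / 7, 7), sides = 2)

# Subtract the smooth trend; drop the NA values the filter leaves at each end.
detrended <- as.numeric(views - trend)
detrended <- detrended[!is.na(detrended)]

# Autocorrelation of the detrended series. Index k + 1 holds lag k.
ac <- acf(detrended, lag.max = 21, plot = FALSE)
lag3 <- ac$acf[4]
lag7 <- ac$acf[8]

print(round(c(lag3 = lag3, lag7 = lag7), 2))
```

A strong positive autocorrelation at lag 7 (and again at 14 and 21), with negative values in between, is the signature of a weekly season. If the real series shows no such peaks after detrending, that supports the reading that there is no clear weekly pattern.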


I couldn’t find any way to relate this to my dataset and its trends.