# Load libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
data("midwest") # data() loads the built-in midwest dataset by name; it does not return the data, so no assignment is needed
# time series toolkits
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(pageviews)
library(wikipediatrend)
##
## [wikipediatrend]
##
## Note:
##
## - Data before 2016-01-01
## * is provided by petermeissner.de and
## * was prepared in a project commissioned by the Hertie School of Governance (Prof. Dr. Simon Munzert)
## * and supported by the Daimler and Benz Foundation.
##
## - Data from 2016-01-01 onwards
## * is provided by the Wikipedia Foundation
## * via its pageviews package and API.
##
IL = wp_trend(
"Illinois", # search term is "Illinois"
from = "2021-01-01",
to = "2023-11-01")
IN = wp_trend(
"Indiana", # search term is "INdiana"
from = "2021-01-01",
to = "2023-11-01")
MI = wp_trend(
"Michigan", # search term is "mi"
from = "2021-01-01",
to = "2023-11-01")
We use daily page views as the response variable.
IL$state <- "Illinois"
IN$state <- "Indiana"
MI$state <- "Michigan"
# Combine the datasets
combined_data <- rbind(IL, IN, MI)
# Plotting the combined data
ggplot(combined_data, aes(date, views, color = state)) +
geom_line() +
labs(title = "Page Views of States (IL, IN, MI) on Wikipedia", x = "Date", y = "Page Views") +
theme_minimal()
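The three lines overlap at this scale, so a faceted version of the same plot can make each state's pattern easier to read. This is an optional sketch reusing the combined_data frame built above:
# One panel per state, with free y-scales so each state's shape is visible
ggplot(combined_data, aes(x = date, y = views)) +
geom_line(color = "steelblue") +
facet_wrap(~ state, ncol = 1, scales = "free_y") +
labs(title = "Page Views by State (Faceted)", x = "Date", y = "Page Views") +
theme_minimal()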
- Michigan has a large spike in hits in late 2021, which could be tied to the football season.
- Illinois has huge spikes in 2022 that may relate to events happening in Chicago.
- Overall, the number of hits can be arranged in ascending order: Indiana < Michigan < Illinois.
- This ordering tracks the states' total populations.
pop_data <- midwest %>%
group_by(state) %>%
summarise(total_population = sum(poptotal))
ggplot(data = pop_data, aes(x = state, y = total_population, fill = state)) +
geom_bar(stat = "identity") +
ggtitle("Total Population of Midwest States") +
xlab("Midwest States") +
ylab("Total Population")
Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.
What stands out immediately?
# The response variable of interest is the page-view series itself.
hits <- wpd_get_exact(page="Midwestern_United_States",
from="2013-01-01",
to="2023-01-01",
lang="en",
warn=TRUE)
hits <- select(hits, date, views)
hits_ts <- as_tsibble(hits, index = date)
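Before plotting, it is worth confirming the daily index has no implicit gaps; this short sketch uses tsibble's gap utilities (fill_gaps() would insert explicit NA rows if any days were missing):
# Check the daily index for missing days
has_gaps(hits_ts)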
hits_ts |>
ggplot() +
geom_line(mapping = aes(x = date, y = views)) +
labs(title = '"Time Series" Page Views on Wikipedia') +
theme_minimal()
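To consider different windows of time, one option is tsibble::filter_index(); the window below (2022 onward) is illustrative, and other ranges are worth trying:
hits_ts |>
filter_index("2022" ~ .) |> # keep only observations from 2022 onward
ggplot(aes(x = date, y = views)) +
geom_line() +
labs(title = '"Midwest" Page Views, 2022 Onward', x = "Date", y = "Page Views") +
theme_minimal()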
hits_ts %>%
ggplot(aes(x=date, y=views)) +
geom_line(color="lightblue") +
geom_smooth(method = "loess", color = 'purple', se = FALSE, linewidth = 1) +
ylim(0, 13000) +
labs(title = 'Trend of "Midwest" Page Views on Wikipedia', x="Date", y="Page Views") +
theme_minimal() +
scale_x_date(date_breaks = "1 year", date_labels = "%Y")
## `geom_smooth()` using formula = 'y ~ x'
Use linear regression to detect any upwards or downwards trends.
Do you need to subset the data for multiple trends?
How strong are these trends?
trend_model <- lm(views ~ date, data = hits_ts)
summary(trend_model)
##
## Call:
## lm(formula = views ~ date, data = hits_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2182.3 -509.0 115.0 519.2 5671.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.087e+04 1.469e+03 21.02 <2e-16 ***
## date -1.793e+00 9.035e-02 -19.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 945.1 on 1093 degrees of freedom
## Multiple R-squared: 0.2648, Adjusted R-squared: 0.2642
## F-statistic: 393.8 on 1 and 1093 DF, p-value: < 2.2e-16
The date coefficient is negative, signifying a downward trend: as time increases, daily views decrease by about 1.79 on average. The R-squared of roughly 0.26 means the linear trend explains only about a quarter of the variance, so the trend is real but not strong. We should also consider changes in the overall behavior of users seeking information; a shift to alternative sources rather than relying solely on Wikipedia could certainly depress page views.
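To answer the subsetting question, one simple approach is to split the series at a candidate breakpoint and fit separate regressions. The cutoff below (the median date) is an illustrative assumption, not an estimated changepoint:
cutoff <- median(hits_ts$date) # illustrative breakpoint, not estimated
early <- lm(views ~ date, data = filter(hits_ts, date < cutoff))
late <- lm(views ~ date, data = filter(hits_ts, date >= cutoff))
coef(early)["date"] # slope (views per day) in the first half
coef(late)["date"] # slope in the second half
If the two slopes differ noticeably, fitting the halves separately (or adding a changepoint term) describes the series better than a single line.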
acf(hits_ts$views, ci = 0.95, na.action = na.exclude)
- The vertical bars in the ACF plot represent correlations between the time series and its lagged versions: the bar at lag k shows the correlation between the series at time t and at time t - k.
- The dashed horizontal lines are confidence bounds for the correlation values.
- Recurring peaks at regular intervals would suggest seasonality or periodic behavior.
- Consecutive values appear to follow one another fairly closely, suggesting an autoregressive model could be appropriate.
- There is a large spike at lag 1 and much lower spikes at subsequent lags, so an AR(1) model would likely be feasible for this data set.
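Since the ACF points toward an AR(1), a quick fit with base R's arima() is a reasonable sanity check; this is a sketch, not a tuned model:
ar1 <- arima(hits_ts$views, order = c(1, 0, 0)) # AR(1): one autoregressive term, no differencing or MA terms
ar1
acf(residuals(ar1), na.action = na.pass) # leftover autocorrelation, if any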
hits_xts <- xts(hits_ts$views,
order.by = hits_ts$date,
frequency = 7) # daily data, so we use a weekly cycle of 7
hits_xts <- setNames(hits_xts, "views")
pacf(hits_xts, na.action = na.exclude,
xlab = "Lag (Days)", main = "PACF for Wikipedia Hits")
- As in ACF plots, the blue dashed horizontal lines represent confidence intervals.
- Points outside these bounds suggest statistically significant partial correlations.
- Significant spikes at specific lags indicate direct relationships between observations at those intervals.
- Non-significant spikes may represent correlations already explained by shorter lags.
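For reference, the dashed bounds sit at approximately ±1.96/sqrt(n), so they can be reproduced by hand; this sketch counts which lags fall outside the approximate 95% band:
n <- sum(!is.na(hits_xts$views)) # number of non-missing observations
bound <- 1.96 / sqrt(n) # approximate 95% significance bound
p <- pacf(hits_xts, plot = FALSE, na.action = na.exclude)
which(abs(p$acf) > bound) # lags with significant partial autocorrelation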