Week-12-Data-Dive---Time-Based-Data.knit

Week 12 Data Dive -Time Based Data

Due: Mon Mar 25, 2024 11:59pm

Task(s)

The purpose of this week’s data dive is for you to explore the time aspect of your data, or the time aspect of some other related dataset.

Your RMarkdown notebook for this data dive should contain the following:

Select a column of your data that encodes time (e.g., “date”, “timestamp”, “year”, etc.). Convert this into a Date in R.
- Note, you may need to use some combination of as.Date, or to_datetime. And, you may even need to paste year, month, day, hour, etc. together using paste (even if you need to make up a month, like "__/01/01").
If you do not have a time-based column of data: find a Wikipedia page that is related to your dataset. Then, extract a time series of page views for that page using the wikipedia page views websiteLinks to an external site. or the R package used in this week’s lab.
- If you choose this option, find ways to tie your results from the below analysis into what you’re seeing with your own data!
Choose a column of data to analyze over time. This should be a “response-like” variable that is of particular interest.
Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.
- What stands out immediately?
Use linear regression to detect any upwards or downwards trends.
- Do you need to subset the data for multiple trends?
- How strong are these trends?
Use smoothing to detect at least one season in your data, and interpret your results.
- Can you illustrate the seasonality using ACF or PACF?

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.

Data

In this data dive I will be using NFL Standings data which comes from Pro Football Reference team standings.

Link to Data

standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')

## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

standings

## # A tibble: 638 × 15
##    team         team_name  year  wins  loss points_for points_against
##    <chr>        <chr>     <dbl> <dbl> <dbl>      <dbl>          <dbl>
##  1 Miami        Dolphins   2000    11     5        323            226
##  2 Indianapolis Colts      2000    10     6        429            326
##  3 New York     Jets       2000     9     7        321            321
##  4 Buffalo      Bills      2000     8     8        315            350
##  5 New England  Patriots   2000     5    11        276            338
##  6 Tennessee    Titans     2000    13     3        346            191
##  7 Baltimore    Ravens     2000    12     4        333            165
##  8 Pittsburgh   Steelers   2000     9     7        321            255
##  9 Jacksonville Jaguars    2000     7     9        367            327
## 10 Cincinnati   Bengals    2000     4    12        185            359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>

Selecting Time Based Variables

This data set contains a variable named year which we will use to encode time.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Aggregate values by year
standings_aggregated <- standings |>
  group_by(year) |>
  summarise(points_for = sum(points_for))

standings_aggregated

## # A tibble: 20 × 2
##     year points_for
##    <dbl>      <dbl>
##  1  2000      10254
##  2  2001      10024
##  3  2002      11097
##  4  2003      10666
##  5  2004      11000
##  6  2005      10556
##  7  2006      10577
##  8  2007      11104
##  9  2008      11279
## 10  2009      10991
## 11  2010      11283
## 12  2011      11356
## 13  2012      11651
## 14  2013      11985
## 15  2014      11565
## 16  2015      11680
## 17  2016      11661
## 18  2017      11120
## 19  2018      11952
## 20  2019      11680

Plotting

library(tsibble)

## 
## Attaching package: 'tsibble'

## The following object is masked from 'package:lubridate':
## 
##     interval

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(ggplot2)

# Convert the year column to a date format
standings_aggregated$date <- as.Date(paste0(standings_aggregated$year, "-01-01"))

# Convert the data frame to a tsibble
standings_tsibble <- as_tsibble(standings_aggregated, index = date)

standings_tsibble <- standings_tsibble |>
  select(date,points_for)

# Plot the tsibble
standings_tsibble |>
  ggplot() +
  geom_line(mapping = aes(x=date, y=points_for))

It appears there is a strong upward trend in the number of points scored by NFL teams. It also appears that every few years the totals oscillate between an increasing to decreasing relative to the previous years.

standings_tsibble |>
  ggplot(mapping = aes(x = date, y = points_for)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Points For Over Time in NFL")

## `geom_smooth()` using formula = 'y ~ x'

model1 <- lm(points_for ~ date, data = standings_tsibble)

print(model1$coefficients)

##  (Intercept)         date 
## 8120.3542888    0.2116617

summary(model1)

## 
## Call:
## lm(formula = points_for ~ date, data = standings_tsibble)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633.95 -196.26   54.52  147.69  540.29 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.120e+03  4.781e+02  16.985 1.58e-12 ***
## date        2.117e-01  3.279e-02   6.455 4.50e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 308.8 on 18 degrees of freedom
## Multiple R-squared:  0.6983, Adjusted R-squared:  0.6816 
## F-statistic: 41.67 on 1 and 18 DF,  p-value: 4.498e-06

Creating a linear model we can see that the date coefficient is 0.21, which indicates that as time increases points_for increases by 0.21. The p-values for this coefficient are very small indicating that this value is significant and that the trend is strong.

Using ACF we can see what is the best value for detecting seasons in the data.

library(xts)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following object is masked from 'package:tsibble':
## 
##     index

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

standings_xts <- xts(standings_tsibble$points_for, 
                order.by = standings_tsibble$date,
                frequency = 1)

acf(standings_xts, ci = 0.95, na.action = na.exclude)

ACF indicates that the best lag values are either 0 or 1.

pacf(standings_xts, na.action = na.exclude,
     xlab = "Lag (Yearly)", main = "PACF for Points For")

PACF indicates a similar value of 1.

These values suggest that the only significant seasons are based on the previous year.

Potentially with more data it would be easier to detect a trend, but with only 20 years to work with it is difficult to say with certainty that there is a seasonal trend within the data.

library(zoo)

# Calculate a 2-year rolling average
rolling_avg <- rollapply(standings_tsibble$points_for, width = 2, FUN = mean, fill = NA)

# Create a data frame for the rolling average
rolling_avg_df <- data.frame(date = standings_tsibble$date, points_for = rolling_avg)

# Plot the rolling average
rolling_avg_df |>
  ggplot(mapping = aes(x = date, y = points_for)) +
  geom_line() +
  labs(title = "Points For over Time",
       subtitle = "2 Year Rolling Average")

## Warning: Removed 1 row containing missing values (`geom_line()`).

Smoothing the data by creating a 2 year rolling average it does appear there are peaks every ~2 years.

library(lubridate)

standings_tsibble |>
  index_by(year = floor_date(date, "2 year")) |>
  summarise(points_for = mean(points_for, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = year, y = points_for)) +
  geom_line() +
  geom_smooth(span = 0.5, color = 'blue', se=FALSE, ) +
  labs(title = "Average Points For Over Time",
       subtitle = "(by 2 years)") +
  scale_x_date(breaks = "2 year", labels = \(x) year(x))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Using LOESS the season appears longer, about 5 years.