Due: Mon Mar 25, 2024 11:59pm
The purpose of this week’s data dive is for you to explore the time aspect of your data, or the time aspect of some other related dataset.
Your RMarkdown notebook for this data dive should contain the following:
Select a column of your data that encodes time (e.g., “date”, “timestamp”, “year”, etc.). Convert this into a Date in R.
as.Date
,
or to_datetime
. And, you may even need to paste year,
month, day, hour, etc. together using paste
(even if you
need to make up a month, like "__/01/01"
).If you do not have a time-based column of data: find a Wikipedia page that is related to your dataset. Then, extract a time series of page views for that page using the wikipedia page views websiteLinks to an external site. or the R package used in this week’s lab.
Choose a column of data to analyze over time. This should be a “response-like” variable that is of particular interest.
Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.
Use linear regression to detect any upwards or downwards trends.
Do you need to subset the data for multiple trends?
How strong are these trends?
Use smoothing to detect at least one season in your data, and interpret your results.
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
In this data dive I will be using NFL Standings data which comes from Pro Football Reference team standings.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## 7 Baltimore Ravens 2000 12 4 333 165
## 8 Pittsburgh Steelers 2000 9 7 321 255
## 9 Jacksonville Jaguars 2000 7 9 367 327
## 10 Cincinnati Bengals 2000 4 12 185 359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
This data set contains a variable named year which we will use to encode time.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Aggregate values by year
standings_aggregated <- standings |>
group_by(year) |>
summarise(points_for = sum(points_for))
standings_aggregated
## # A tibble: 20 × 2
## year points_for
## <dbl> <dbl>
## 1 2000 10254
## 2 2001 10024
## 3 2002 11097
## 4 2003 10666
## 5 2004 11000
## 6 2005 10556
## 7 2006 10577
## 8 2007 11104
## 9 2008 11279
## 10 2009 10991
## 11 2010 11283
## 12 2011 11356
## 13 2012 11651
## 14 2013 11985
## 15 2014 11565
## 16 2015 11680
## 17 2016 11661
## 18 2017 11120
## 19 2018 11952
## 20 2019 11680
library(tsibble)
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(ggplot2)
# Convert the year column to a date format
standings_aggregated$date <- as.Date(paste0(standings_aggregated$year, "-01-01"))
# Convert the data frame to a tsibble
standings_tsibble <- as_tsibble(standings_aggregated, index = date)
standings_tsibble <- standings_tsibble |>
select(date,points_for)
# Plot the tsibble
standings_tsibble |>
ggplot() +
geom_line(mapping = aes(x=date, y=points_for))
It appears there is a strong upward trend in the number of points scored by NFL teams. It also appears that every few years the totals oscillate between an increasing to decreasing relative to the previous years.
standings_tsibble |>
ggplot(mapping = aes(x = date, y = points_for)) +
geom_line() +
geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
labs(title = "Points For Over Time in NFL")
## `geom_smooth()` using formula = 'y ~ x'
model1 <- lm(points_for ~ date, data = standings_tsibble)
print(model1$coefficients)
## (Intercept) date
## 8120.3542888 0.2116617
summary(model1)
##
## Call:
## lm(formula = points_for ~ date, data = standings_tsibble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -633.95 -196.26 54.52 147.69 540.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.120e+03 4.781e+02 16.985 1.58e-12 ***
## date 2.117e-01 3.279e-02 6.455 4.50e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 308.8 on 18 degrees of freedom
## Multiple R-squared: 0.6983, Adjusted R-squared: 0.6816
## F-statistic: 41.67 on 1 and 18 DF, p-value: 4.498e-06
Creating a linear model we can see that the date coefficient is 0.21, which indicates that as time increases points_for increases by 0.21. The p-values for this coefficient are very small indicating that this value is significant and that the trend is strong.
Using ACF we can see what is the best value for detecting seasons in the data.
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
standings_xts <- xts(standings_tsibble$points_for,
order.by = standings_tsibble$date,
frequency = 1)
acf(standings_xts, ci = 0.95, na.action = na.exclude)
ACF indicates that the best lag values are either 0 or 1.
pacf(standings_xts, na.action = na.exclude,
xlab = "Lag (Yearly)", main = "PACF for Points For")
PACF indicates a similar value of 1.
These values suggest that the only significant seasons are based on the previous year.
Potentially with more data it would be easier to detect a trend, but with only 20 years to work with it is difficult to say with certainty that there is a seasonal trend within the data.
library(zoo)
# Calculate a 2-year rolling average
rolling_avg <- rollapply(standings_tsibble$points_for, width = 2, FUN = mean, fill = NA)
# Create a data frame for the rolling average
rolling_avg_df <- data.frame(date = standings_tsibble$date, points_for = rolling_avg)
# Plot the rolling average
rolling_avg_df |>
ggplot(mapping = aes(x = date, y = points_for)) +
geom_line() +
labs(title = "Points For over Time",
subtitle = "2 Year Rolling Average")
## Warning: Removed 1 row containing missing values (`geom_line()`).
Smoothing the data by creating a 2 year rolling average it does appear there are peaks every ~2 years.
library(lubridate)
standings_tsibble |>
index_by(year = floor_date(date, "2 year")) |>
summarise(points_for = mean(points_for, na.rm = TRUE)) |>
ggplot(mapping = aes(x = year, y = points_for)) +
geom_line() +
geom_smooth(span = 0.5, color = 'blue', se=FALSE, ) +
labs(title = "Average Points For Over Time",
subtitle = "(by 2 years)") +
scale_x_date(breaks = "2 year", labels = \(x) year(x))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Using LOESS the season appears longer, about 5 years.