Loading in the tidyverse, data and setting seed

# Loading tidyverse and other packages

library(tidyverse)
library(ggthemes)
library(ggrepel)

#Loading in Data

nhl_draft <- read_csv("nhldraft.csv")

# Setting seed

set.seed(1)

For this data dive we will be going over time based data and how we can use Time Series to help forecast the future.

With this data being based around draft picks and the year they were drafted we will be comparing the points of players over time.

First, I’m setting the year column to a date column.

# Loading in lubridate package
library(lubridate)

nhl_draft$year <- ymd(nhl_draft$year, truncated = 2L)

Then I’m going to load in the packages we need to create a time series plot for our data

library(xts)
## Warning: package 'xts' was built under R version 4.3.2
## Warning: package 'zoo' was built under R version 4.3.2
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.2

Now I’m going to create a ts_tibble object to the plot over time. This uses the date column we just made with as well as the variable we are trying to predict. The average number of points for each draft class.

avg_points_ <- nhl_draft |> 
  group_by(year) |> 
  summarize(avg_points = mean(points, na.rm = TRUE))

Now we’ll convert that tibble as a time series tibble:

avg_points_ts <- as_tsibble(avg_points_, index = year)

Now we’re ready to create a time series model.

avg_points_xts <- xts(x = avg_points_ts$avg_points, order.by = avg_points_ts$year)

avg_points_xts <- setNames(avg_points_xts, "avg_pts")

Let’s plot this time series.

avg_points_xts %>%
  rollapply(width = 30, \(x) mean(x, na.rm = TRUE), fill = FALSE) %>%
  ggplot(mapping = aes(x = Index, y = avg_pts)) +
  geom_line() +
  labs(title = "Average Points over Time",
       subtitle = "Monthly Rolling Average") +
  theme_hc()

Looking at this plot we can see that points dramatically spikes the closer we get to 2000 and that’s because points as a stat was not as heavily used until the 90s and then everyone started using it.

We can see that there has been a downward trend in the average amount of points per class as we’ve gotten closer to the present day.

Using regression we can get an idea of the current trend.

avg_points_ts |>
  filter_index("1995-01" ~ "2022-01") |>
  ggplot(mapping = aes(x = year, y = avg_points)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Average Points from 1995 to 2022") +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

The blue line above defines the trend of the data even though it doesn’t perfectly fit the model. We can see that the trend is pretty strong going negative.

Finally I will be using smoothing on the data.

avg_points_ts |>
  filter_index("1995-01" ~ "2022-01") |>
  drop_na() |>
  ggplot(mapping = aes(x = year, y = avg_points)) +
  geom_point(size=1, shape='O') +
  geom_smooth(span=0.2, color = 'blue', se=FALSE) +
  labs(title = "Average Points from 1995 to 2022") +
  theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

USing the smoothing one of the only season trends I see is from the year 2000. It rises from the year 2000 to 2002 and then drops a little bit. Then it does the same thing from 2008 to 2009 in a very similar fashion.

Here are the acf and pacf of the model.

acf(avg_points_ts, ci = 0.95, na.action = na.exclude)

pacf(avg_points_ts, na.action = na.exclude)

Hope you liked this Data Dive over time series!