Time series analysis and time series decomposition are tools that nearly every data scientist, Six Sigma practitioner, and engineer will find useful. R provides a range of tools for working with time series data; here we’ll explore two of the most basic: ts() and stl().
All of the time series analysis will be done with base R functions, but we’ll load some additional libraries to aid us in working with the data. We’ll load the libraries up-front in order to keep our code clean and make sure that anyone trying to reproduce the work can easily see which packages to install.
# For obtaining GDP data
library(WDI)
# For manipulating data
library(magrittr)
library(rprojroot)
library(forcats)
library(dplyr)
library(tidyr)
library(readr)
library(lubridate)
# For descriptive statistics and graphing
library(skimr)
library(ggplot2)
library(scales)
library(gridExtra)

We’ll also use a home-made helper function to clean the data. Some CSV files have an extra carriage return or newline character after the last data value; read_csv() interprets this as a line of NAs. Our custom function accepts a data frame and strips out rows that contain only NA values. This is different from using na.omit(), which will delete a row if it contains any NA values.
Because the function accepts a data frame as input, and returns the same data frame with modification, it works nicely in a dplyr chain.
strip_na_rows <- function(x) {
  # Keep only rows in which at least one value is not NA
  if ("data.frame" %in% class(x)) {
    x <- x[rowSums(is.na(x)) != ncol(x), ]
  } else {
    stop(paste(deparse(substitute(x)),
               "is not a data frame. Please pass a data frame to 'x'."))
  }
  return(x)
}

When working in .R script files, RStudio sets the project root directory as the current working directory, and files can be referenced relative to that root directory. This makes it convenient to store R script or code files in a subdirectory of the project (e.g. “/R”) and data in a separate subdirectory (e.g. “/data”). In RMarkdown files, however, the working directory is the directory in which the .Rmd file is saved, which makes it difficult to organize your projects nicely while maintaining portable code. The rprojroot library makes it easy to create a path string to the project root directory, which is portable and can be used to reference files in subfolders.
root <- is_rstudio_project
root_file_f <- root$make_fix_file()
root_dir <- paste0(dirname(root_file_f("tidyr_timeseries.Rproj")), "/")

# csv downloaded from
# https://datamarket.com/data/set/22xr/monthly-beer-production-in-australia-megalitres-includes-ale-and-stout-does-not-include-beverages-with-alcohol-percentage-less-than-115-jan-1956-aug-1995
beer_df <- read_csv(paste0(root_dir, "data-raw/monthly-beer-production-in-austr.csv"),
col_types = cols(Month = col_date(format = "%Y-%m"))) %>%
setNames(c("Date", "Beer_ML")) %>%
arrange(Date) %>%
strip_na_rows()

## Warning: 2 parsing failures.
## # A tibble: 2 x 5
##     row col   expected        actual                    file
##   <int> <chr> <chr>           <chr>                     <chr>
## 1   477 Month date like %Y-%m Monthly beer production … '/Users/tomhopper…
## 2   477 <NA>  2 columns       1 columns                 '/Users/tomhopper…
These parsing failures in read_csv() are due to the extra carriage return, which produces a final row with a single column containing NA. read_csv() expected two columns but found only one, hence the “expected” vs. “actual” mismatch in the message. Our strip_na_rows() function removes this row after read_csv() throws the parsing failure message.
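To see the difference from na.omit() on a small made-up tibble:

# Toy example: only the last row is entirely NA
toy <- tibble(a = c(1, NA, NA), b = c("x", "y", NA))
toy %>% strip_na_rows() # drops only row 3, where every value is NA
toy %>% na.omit()       # drops rows 2 and 3, since each contains an NA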
Now we want to have a look at the data, to make sure we know what we have. We’ll start with skimr::skim(), which is a nice replacement for summary().
beer_df %>% skim()

## Skim summary statistics
## n obs: 476
## n variables: 2
##
## Variable type: Date
## variable missing complete n min max median n_unique
## Date 0 476 476 1956-01-01 1995-08-01 1975-10-16 476
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Beer_ML 0 476 476 136.4 33.74 64.8 112.9 139.15 158.82 217.8
## hist
## ▂▅▃▆▇▃▂▁
Numeric summaries are no substitute for visual exploration, so we also plot the data. Given that this is a time series, a line plot with the date along the x-axis is a good place to start.
beer_df %>%
ggplot(aes(x = Date, y = Beer_ML)) +
geom_line()

We can see from the EDA that there is a decadal trend and a clear short-term up-and-down pattern. Counting peaks, we see that there are ten per decade, so this shorter-term pattern is an annual, cyclic trend. This data looks ideal for time series analysis.
First we convert our data to a time series object using the ts() function.
beer_ts <- ts(data = beer_df$Beer_ML,
start = c(year(first(beer_df$Date)), month(first(beer_df$Date))),
end = c(year(last(beer_df$Date)), month(last(beer_df$Date))),
frequency = 12)
beer_ts %>% skim()

## Skim summary statistics
##
## Variable type: ts
## variable missing complete n start end frequency deltat mean sd
## . 0 476 476 1956 1995 12 0.083 136.4 33.74
## min max median line_graph
## 64.8 217.8 139.15 ⣀⡠⡠⠡⠊⠌⠒⠑
The key parameters here are data and frequency. data is the numeric vector that we want to analyze, and frequency indicates how many observations (rows of data) go into each cycle of the repeating pattern. Since we have monthly data with an annual cycle, we set frequency to 12 (months).
The start and end parameters allow ts() to label each observation with the corresponding date or time. ts() expects either a single number or a vector of two integers which, in this case, can be thought of as c(year, month). We make use of the lubridate functions year() and month(), together with the dplyr functions first() and last(), to extract the numbers we need from the Date column.
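As a quick illustration with made-up data, here is how ts() labels two years of monthly observations starting in January 2000:

toy_ts <- ts(rnorm(24), start = c(2000, 1), frequency = 12)
# time() shows the fractional-year labels ts() assigns to each
# observation: 2000.000, 2000.083, ..., up to 2001.917
time(toy_ts)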
We then use stl() to decompose the time series into the constituent parts: the repeating seasonal trend, the long-term decadal trend, and the remainder, which may be thought of as either noise or as variation not accounted for in the model described by
\(data = trend + seasonal + remainder\)
beer_stl <- stl(x = beer_ts, s.window = "periodic")

The seasonal window parameter, s.window, tells stl() how to calculate the repeating seasonal trend. The value “periodic” works with most seasonal data.
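We can verify that the decomposition follows the additive model above: the seasonal, trend, and remainder columns sum back to the original series, to within floating-point error.

# The three components should reconstruct the original data
all.equal(as.numeric(beer_ts),
          as.numeric(rowSums(beer_stl$time.series)))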
The default plot() function for stl objects provides a nice visual of the resulting decomposition.
plot(beer_stl)

The gray bar on the right of each plot spans the same height in y-axis coordinates (about 55 ML, in this case), and provides a convenient means of quickly comparing the y-axis scales of each of the four plots.
That trend line looks a little rough for a long-term trend; there are small trends showing up that span a year or less. To smooth further, we adjust the trend window parameter, t.window. After a little trial and error, 36 looks like a good value. We’re still seeing some variation on the order of two years, but mostly we’re seeing longer-term trends.
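One quick way to run that trial and error is to loop over a few candidate values and compare the resulting plots; a sketch (the candidate values here are arbitrary):

# Draw a full decomposition plot for each candidate t.window value
for (tw in c(12, 24, 36, 48)) {
  plot(stl(beer_ts, s.window = "periodic", t.window = tw),
       main = paste("t.window =", tw))
}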
beer_stl <- stl(x = beer_ts, s.window = "periodic", t.window = 36)
plot(beer_stl)

If we want to use this data for further analysis, we may want to save these trends and the remainder back to our data frame. We’ll start by looking at the structure of the stl object.
str(beer_stl)

## List of 8
## $ time.series: Time-Series [1:476, 1:3] from 1956 to 1996: 3.64 -5.2 6.16 -7.77 -11.27 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:3] "seasonal" "trend" "remainder"
## $ weights : num [1:476] 1 1 1 1 1 1 1 1 1 1 ...
## $ call : language stl(x = beer_ts, s.window = "periodic", t.window = 36)
## $ win : Named num [1:3] 4761 36 13
## ..- attr(*, "names")= chr [1:3] "s" "t" "l"
## $ deg : Named int [1:3] 0 1 1
## ..- attr(*, "names")= chr [1:3] "s" "t" "l"
## $ jump : Named num [1:3] 477 4 2
## ..- attr(*, "names")= chr [1:3] "s" "t" "l"
## $ inner : int 2
## $ outer : int 0
## - attr(*, "class")= chr "stl"
It’s pretty clear that the data we need must be in beer_stl$time.series. Plotting each column of beer_stl$time.series (not shown) confirms that this matrix holds the decomposition, but not the original data. We can extract those columns back to our original data set for further analysis.
For now, we’ll check the data types; it won’t do to assume we can treat these columns like ordinary vectors without verifying what stl() actually returned.
class(beer_stl$time.series)

## [1] "mts" "ts" "matrix"

typeof(beer_stl$time.series)

## [1] "double"
So time.series is an object of class ts (“Time-Series”), which is also a matrix, and contains data of type double. We can see from the str() call that it’s a matrix with 3 columns and 476 rows, so we have the same number of observations as our original data. Appending the columns to our data frame should therefore be easy.
One way to extract those columns is with subsetting, and assigning the subset using mutate():
beer_df <- beer_df %>%
mutate(Seasonal = beer_stl$time.series[,1],
Trend = beer_stl$time.series[,2],
Remainder = beer_stl$time.series[,3])
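Equivalently, we could have converted the whole matrix at once with as.data.frame() and bound it to the data frame; a sketch (note that this keeps the matrix’s lowercase column names seasonal, trend, and remainder):

# Alternative: bind all three decomposition columns in one step
beer_df2 <- bind_cols(beer_df %>% select(Date, Beer_ML),
                      as.data.frame(beer_stl$time.series))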
beer_df %>% skim()

## Skim summary statistics
## n obs: 476
## n variables: 5
##
## Variable type: Date
## variable missing complete n min max median n_unique
## Date 0 476 476 1956-01-01 1995-08-01 1975-10-16 476
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Beer_ML 0 476 476 136.4 33.74 64.8 112.9 139.15 158.82 217.8
## hist
## ▂▅▃▆▇▃▂▁
##
## Variable type: ts
## variable missing complete n start end frequency deltat mean sd
## Remainder 0 476 476 1956 1995 12 0.083 -0.01 9.23
## Seasonal 0 476 476 1956 1995 12 0.083 -0.13 16.09
## Trend 0 476 476 1956 1995 12 0.083 136.54 28.02
## min max median line_graph
## -37.5 35.34 0.99 ⠑⠒⠒⠑⠊⡈⠑⠉
## -24.78 34.38 -6.18 ⠤⠌⡐⢁⠔⡠⠤⠡
## 86.76 165.53 149.47 ⣀⣀⠤⠊⠉⠉⠉⠉
Just to show that we have the data we wanted, we’ll recreate the decomposition plot using ggplot2.
beer_df %>%
gather(Series, Amount, -Date) %>%
mutate(Series = forcats::as_factor(Series)) %>%
mutate(Series = fct_relevel(Series, "Beer_ML", "Trend", "Seasonal", "Remainder")) %>%
mutate(Series = fct_recode(Series, `Data` = "Beer_ML")) %>%
ggplot(aes(x = Date, y = Amount)) +
geom_line() +
facet_grid(facets = Series~., scales = "free_y") +
labs(x = "Date",
y = "Amount, ML") +
theme_bw()

There are some extra steps in this code to get the order of the facets right. ggplot2 plots character data in alphanumeric order, and after the call to gather(), the components of the decomposition are stored as a character vector in the column Series. Without further steps, then, the facets will be ordered Data, Remainder, Seasonal, and finally Trend. To control the order in which ggplot2 plots these series, we need to convert the character column to a factor, and then control the order of the factor levels. Here we use the forcats package’s fct_relevel().
We also use forcats::fct_recode() to change the Beer_ML factor level to Data. We could also have substituted “Data” for “Beer_ML” when the column was still in character form, as below:
beer_df %>%
gather(Series, Amount, -Date) %>%
mutate(Series = sub(pattern = "Beer_ML", replacement = "Data", Series))

Instead of sub(), we might have used replace(), or subsetting, or other methods.
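For example, the subsetting approach might look something like this sketch:

beer_long <- beer_df %>% gather(Series, Amount, -Date)
# Replace "Beer_ML" with "Data" using base R subset-assignment
beer_long$Series[beer_long$Series == "Beer_ML"] <- "Data"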
Perhaps we’re simply interested in planning for future production needs. How much beer do we need next month, or in six months? We now have a good model for prediction to within about 10 million liters. We could use the predict() function, or delve into more powerful methods in the packages forecast, tidyquant, bsts, or prophet.
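As a minimal sketch, the forecast package can forecast directly from an stl object (this assumes the forecast package is installed; by default it fits an ETS model to the seasonally adjusted series and adds the seasonal component back in):

library(forecast)
# Forecast beer production six months ahead from the decomposition
beer_fc <- forecast(beer_stl, h = 6)
plot(beer_fc)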
Having separated out the long-term trend and the seasonal fluctuation, we might look more closely at either Trend or Remainder for further explanatory variables. What’s causing that long-term trend? Maybe Trend correlates to growth in wealth, or advertising dollars, or unemployment, or climate change.
Let’s test one of those hypotheses, using gross domestic product (GDP) as a proxy for wealth. We can download GDP data from the World Bank’s API using the WDI package.
aus_gdp_df <- WDI(country = "AU", indicator = "NY.GDP.MKTP.CD",
start = year(first(beer_df$Date)),
end = year(last(beer_df$Date)))
aus_gdp_df %>% skim()

## Skim summary statistics
## n obs: 36
## n variables: 4
##
## Variable type: character
## variable missing complete n min max empty n_unique
## country 0 36 36 9 9 0 1
## iso2c 0 36 36 2 2 0 1
##
## Variable type: numeric
## variable missing complete n mean sd p0
## NY.GDP.MKTP.CD 0 36 36 1.4e+11 1.1e+11 1.9e+10
## year 0 36 36 1977.5 10.54 1960
## p25 p50 p75 p100 hist
## 3.6e+10 1.1e+11 1.9e+11 3.7e+11 ▇▂▂▃▂▁▂▂
## 1968.75 1977.5 1986.25 1995 ▇▆▇▆▆▇▆▇
Unfortunately, the GDP data from the World Bank only goes back to 1960, so we’ll lose some of our production data in the following analysis.
Having obtained both data sets, we merge the GDP data with the beer production data. Using a left join, we keep all of the GDP data and throw away any production data that doesn’t match. Since GDP is reported annually, we need to summarise the Trend component of our time series decomposition into an annual number.
temp_df <- aus_gdp_df %>%
left_join({
beer_df %>%
mutate(year = year(Date)) %>%
group_by(year) %>%
summarise(Trend = sum(Trend))
},
by = 'year')

And finally we plot the two series side-by-side, taking advantage of gridExtra::grid.arrange().
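Note that the plotting code below uses a label helper, gdp_labeller, that isn’t defined earlier in this post. A plausible definition, assuming we want the World Bank’s current-US$ figures displayed in millions with thousands separators:

# Not defined above: a plausible gdp_labeller for the axis labels,
# converting current US$ to millions of USD via scales::comma()
gdp_labeller <- function(x) comma(x / 1e6)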
gdp_plot <- temp_df %>%
select(year, NY.GDP.MKTP.CD) %>%
ggplot(aes(x = year, y = NY.GDP.MKTP.CD)) +
geom_line() +
scale_y_continuous(labels = gdp_labeller) +
labs(x = "Year",
y = "GDP, millions USD") +
theme_bw()
trend_plot <- temp_df %>%
select(year, Trend) %>%
ggplot(aes(x = year, y = Trend)) +
geom_line() +
labs(x = "Year",
y = "Beer production, ML") +
theme_bw()
grid.arrange(gdp_plot, trend_plot, ncol = 2)

We can see that the two variables appear to covary for over a decade, and then diverge. But how much do they diverge? Is the early covariance real or imagined? Is the later divergence entirely real, or can we make out some remaining covariance?
We’ll check this more directly by plotting the two variables against each other, and we’ll color the plotted data by year to check our intuition that the main variable here is time rather than some relationship between GDP and beer production.
temp_df %>%
ggplot(aes(x = NY.GDP.MKTP.CD, y = Trend, color = year)) +
geom_point() +
scale_y_continuous(labels = comma) +
scale_x_continuous(labels = gdp_labeller) +
labs(x = "GDP, millions USD",
y = "Beer production, ML") +
theme_bw()

The steep, nearly linear correlation below about $100 billion disappears at higher amounts (and later years), as we expected.
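To put a rough number on it, we could compare the correlation in the two regimes; a sketch, where the split year is a judgment call read off the plot:

# Correlation between GDP and the production trend, early vs. late
temp_df %>%
  mutate(Period = if_else(year <= 1975, "early", "late")) %>%
  group_by(Period) %>%
  summarise(r = cor(NY.GDP.MKTP.CD, Trend))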
As we considered the long-term trend, we might also ask “what drives all that short-term variation in Remainder?” Maybe Remainder can be partially modeled by sports games, holidays, weather, or some other factor. If we had attempted to correlate beer production directly to sudden temperature swings or to major sporting events, the very short-term effect of these events would have been swamped by the longer-term Seasonal and Trend components. Having decomposed the time series, though, we can compare these short-term impulses to Remainder alone.
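As a sketch, suppose we had a hypothetical events_df (made up here; it would need to be built from a real events calendar) with a Date column and a 0/1 Event indicator marking months containing a major sporting event. We could then test its association with the short-term component:

# Hypothetical: events_df and its Event column are assumptions
beer_df %>%
  left_join(events_df, by = "Date") %>%
  summarise(r = cor(Remainder, Event, use = "complete.obs"))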