Introduction

This tutorial is for those already familiar with using data frames in R. It assumes familiarity with subsetting and with the dplyr library for data manipulation.

In R there are two primary classes used for time series objects: ts and xts.

For the purpose of this tutorial, we are going to focus on using ts to create time-series objects. R preloads this function through the stats package.

Making a Univariate Time-Series Object

At its core, a time-series object is a vector (univariate) or a matrix (multivariate), so you can view it the same way as a matrix or data frame: columns are variables and rows are observations. Data frames are implicitly converted to matrices before being converted to a time-series object.
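
For instance, a quick toy example (with made-up numbers rather than any of the data used below) shows a numeric vector becoming a univariate time-series object:

# 24 made-up monthly values, starting in January 2020 (purely illustrative)
toy_ts <- ts(rnorm(24), start = c(2020, 1), frequency = 12)
class(toy_ts)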

Consider, for example, loading the co2_ts.csv file:

co2_df <- read.csv("co2_ts.csv")

class(co2_df)
## [1] "data.frame"

When first loaded, note that R reads the file in as a data frame, so we must convert the data frame to a time series object. To make a time series object, we need to know the start, end, and frequency of our data. For this csv file, the data is already sorted for us. If this were not the case, we could use the arrange function from the dplyr package to sort the data instead. Because we already have the month number in one column and the year in another, we can sort the data accordingly:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
co2_df <- co2_df %>%
  arrange(Year, Month)

By visually inspecting the newly arranged data, we see that the dataset runs from 05-1974 to 09-1987, so we will specify the start and end dates accordingly when making the time series object. Because we are dealing with monthly data, we set the frequency argument to 12.

We also see that we are dealing with univariate data. Therefore when making the time series object, we only need to take the CO2 column from the data frame.

co2_ts <- ts(co2_df$CO2, start = c(1974, 5), end = c(1987, 9), frequency = 12)

This code uses only the CO2 column from our original data frame, since it is the only column whose data does not describe the date.

Our new object is of class ts.

class(co2_ts)
## [1] "ts"

Making a Multivariate Time Series Object

We will fill this in once we have a multivariate time series dataset to work with.
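
In the meantime, here is a minimal sketch of what that would look like, using made-up data purely for illustration: instead of a single vector, we pass a matrix (or several numeric columns of a data frame) to ts.

# hypothetical example: two made-up monthly series bound into a matrix
multi_mat <- cbind(series_a = rnorm(24), series_b = rnorm(24))
multi_ts <- ts(multi_mat, start = c(2020, 1), frequency = 12)
class(multi_ts) # multivariate series also carry the "mts" class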

Libraries

The main library used for forecasting in R is (surprise) the forecast package!

library(forecast)
library(sarima)
## Warning: package 'sarima' was built under R version 3.4.4
## Loading required package: stats4
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:forecast':
## 
##     autolayer
library(astsa)
## 
## Attaching package: 'astsa'
## The following object is masked from 'package:forecast':
## 
##     gas

With this package come several useful functions. The ones that we are going to discuss in this tutorial include … The full documentation for the package can be found at https://cran.r-project.org/web/packages/forecast/forecast.pdf.

We can use the sarima library to handle ARMA and ARIMA models for us, allowing us to adjust our parameters more easily.

EDA Tutorial

We can use the autoplot function to visualize the data set, and ggAcf and ggPacf to look at its autocorrelation and partial autocorrelation functions:

autoplot(co2_ts)

ggAcf(co2_ts)

ggPacf(co2_ts)

Seasonality
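
One way we could start this section is by overlaying the years of the CO2 series with ggseasonplot from the forecast package, or by splitting the series into trend, seasonal, and remainder components with decompose (a sketch, not a full treatment of seasonality):

# plot each year on top of the others to expose the seasonal cycle
ggseasonplot(co2_ts)

# classical decomposition into trend, seasonal, and remainder components
plot(decompose(co2_ts))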

Models

This is an example of forecasting in R.

solarpv_df <- read.csv("solarpv.csv")
solarpv_df

When we take a look at the dataset we have just loaded, the first thing we must do is make sure that R recognizes the date column as a date, i.e. that it belongs to the Date class and is encoded properly. We use the lapply function to apply the class function to each variable/column and see what class it belongs to.

# using the lapply function to look at the data type of each variable, and we see that our date is not in 
# a date format 
lapply(solarpv_df, class)
## $EDT
## [1] "factor"
## 
## $kW_Gen
## [1] "numeric"
## 
## $Cloud_Cover
## [1] "numeric"

As we can see, the dates in the dataset are not encoded properly (they were read in as a factor) and need to be converted to the Date class. https://www.statmethods.net/input/dates.html has a great explanation of the different format codes to use with as.Date.

# to transform, we are going to need the as.Date() function 
solarpv_df$EDT <- as.Date(solarpv_df$EDT, "%a, %d %b %Y")
lapply(solarpv_df, class)
## $EDT
## [1] "Date"
## 
## $kW_Gen
## [1] "numeric"
## 
## $Cloud_Cover
## [1] "numeric"
# https://www.statmethods.net/input/dates.html
# now we can see we transformed the date properly
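
To see how the format string maps onto the raw text, here is a toy example on a made-up date written in the same style (the actual strings in solarpv.csv may be abbreviated or capitalized slightly differently):

# "%a" = abbreviated weekday, "%d" = day, "%b" = abbreviated month, "%Y" = four-digit year
as.Date("Sun, 01 Apr 2018", format = "%a, %d %b %Y")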

The first thing we do is plot the univariate time series, as well as its ACF (autocorrelation function) and PACF (partial autocorrelation function). Next, we determine whether the series is white noise using the Ljung-Box test with the parameters set below. We also take a look at the mean, standard deviation, and length of the time series.

# here we plot the series, a univariate series as we can see
plot(y=solarpv_df$kW_Gen, x=solarpv_df$EDT, type='l')

ggAcf(solarpv_df$kW_Gen, lag.max = 24)

ggPacf(solarpv_df$kW_Gen, lag.max = 24)

# autocorrelation white noise test
whiteNoiseTest(autocorrelations(solarpv_df$kW_Gen, maxlag = 24), h0 = "iid", nlags = c(6, 12, 18, 24), x = solarpv_df$kW_Gen, method = "LjungBox")
## $test
##          ChiSq DF       pvalue
## [1,]  81.65128  6 1.628370e-15
## [2,]  95.52573 12 4.168860e-15
## [3,] 121.53443 18 2.153171e-17
## [4,] 168.57518 24 1.087627e-23
## attr(,"method")
## [1] "LjungBox"
## 
## $ci
##             int         
##  [1,] -0.302429 0.302429
##  [2,] -0.302429 0.302429
##  [3,] -0.302429 0.302429
##  [4,] -0.302429 0.302429
##  [5,] -0.302429 0.302429
##  [6,] -0.302429 0.302429
##  [7,] -0.302429 0.302429
##  [8,] -0.302429 0.302429
##  [9,] -0.302429 0.302429
## [10,] -0.302429 0.302429
## [11,] -0.302429 0.302429
## [12,] -0.302429 0.302429
## [13,] -0.302429 0.302429
## [14,] -0.302429 0.302429
## [15,] -0.302429 0.302429
## [16,] -0.302429 0.302429
## [17,] -0.302429 0.302429
## [18,] -0.302429 0.302429
## [19,] -0.302429 0.302429
## [20,] -0.302429 0.302429
## [21,] -0.302429 0.302429
## [22,] -0.302429 0.302429
## [23,] -0.302429 0.302429
## [24,] -0.302429 0.302429
## attr(,"level")
## [1] 0.95
mean(solarpv_df$kW_Gen)
## [1] 0.5110783
sd(solarpv_df$kW_Gen)
## [1] 0.181538
length(solarpv_df$kW_Gen)
## [1] 42

Here we use both arima and sarima to estimate our ARMA model. The two functions treat the constant term differently: stats::arima drops the constant by default once the series is differenced (forecast's Arima can add it back through its include.drift argument, which defaults to FALSE but can be changed manually), while sarima from astsa includes one by default; a sketch of how to control this explicitly follows the model summaries below. Using a constant for a differenced series (which is not this example) implies a linear trend for the original series.

#estimate arima model
mod.1 <- arima(solarpv_df$kW_Gen, order=c(1,0,0), method = "ML")
mod.2 <- sarima(solarpv_df$kW_Gen, p = 1, d = 0, q = 0)
## initial  value -1.706954 
## iter   2 value -2.064177
## iter   3 value -2.064245
## iter   4 value -2.064370
## iter   5 value -2.064389
## iter   6 value -2.064392
## iter   6 value -2.064392
## final  value -2.064392 
## converged
## initial  value -2.067245 
## iter   2 value -2.067499
## iter   3 value -2.067577
## iter   4 value -2.067577
## iter   4 value -2.067577
## iter   4 value -2.067577
## final  value -2.067577 
## converged

summary(mod.1)
## 
## Call:
## arima(x = solarpv_df$kW_Gen, order = c(1, 0, 0), method = "ML")
## 
## Coefficients:
##          ar1  intercept
##       0.7039     0.5202
## s.e.  0.1051     0.0621
## 
## sigma^2 estimated as 0.01574:  log likelihood = 27.24,  aic = -48.49
## 
## Training set error measures:
##                         ME      RMSE      MAE       MPE   MAPE      MASE
## Training set -0.0007846569 0.1254658 0.101899 -9.648602 25.177 0.8939698
##                    ACF1
## Training set -0.1873829
summary(mod.2)
##                    Length Class  Mode   
## fit                14     Arima  list   
## degrees_of_freedom  1     -none- numeric
## ttable              8     -none- numeric
## AIC                 1     -none- numeric
## AICc                1     -none- numeric
## BIC                 1     -none- numeric
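
If you want to control the constant term yourself, both functions expose an argument for it. A minimal sketch on the same series (assuming stats::arima's include.mean argument and astsa's no.constant argument; not something we need for this example):

# refit the AR(1) with the constant/mean suppressed in each function
mod.1b <- arima(solarpv_df$kW_Gen, order = c(1, 0, 0), include.mean = FALSE, method = "ML")
mod.2b <- sarima(solarpv_df$kW_Gen, p = 1, d = 0, q = 0, no.constant = TRUE)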

Here we evaluate the model by looking at the ACF and PACF of the residuals and running a white noise test on them.

#diagnose arima model
ggAcf(mod.1$residuals)

ggPacf(mod.1$residuals)

Box.test(mod.1$residuals)
## 
##  Box-Pierce test
## 
## data:  mod.1$residuals
## X-squared = 1.4747, df = 1, p-value = 0.2246
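
Having estimated and checked the model, we can generate actual forecasts with the forecast function from the forecast package and plot them with autoplot (the 12-step horizon here is arbitrary):

# forecast the next 12 observations from the fitted AR(1)
fc <- forecast(mod.1, h = 12)
autoplot(fc)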

Messy Data Examples

ch1_demodat.csv

Consider the ch1_demodat.csv file from class on March 15. This data has both missing values and a difficult-to-parse dtdate variable. We will first work on the dtdate variable by using the strptime function.

ch1_demodat <- read.csv("ch1_demodat.csv")

ch1_demodat$time <- as.POSIXct(strptime(ch1_demodat$dtdate, format = '%d%b%Y:%T'))

Note the unusual format argument. For more information on how I chose it, look at the strptime documentation by typing ?strptime in the console.

The primary reason for the two-step call is that R stores time variables as either a POSIXlt or a POSIXct object. In this code, strptime first parses the text into a POSIXlt object, which as.POSIXct then converts to a POSIXct object.
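
To make the two-step conversion concrete, here is a toy example on a made-up timestamp written in the same style (the real dtdate values may be capitalized differently):

# strptime parses the text into a POSIXlt object...
raw <- strptime("15Mar2018:13:45:00", format = "%d%b%Y:%T")
class(raw)
# ...which as.POSIXct then converts to POSIXct
class(as.POSIXct(raw))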

There are some differences between the two objects but for the purposes of this tutorial we will not go into them.

We have hourly data, but we want to work with it at the monthly level. To aggregate to the monthly level, we can use the group_by function in conjunction with the floor_date function from lubridate, as demonstrated below.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
monthly_demodat <- ch1_demodat %>%
  group_by(time = floor_date(time, "month")) %>%
  summarize(units = sum(units, na.rm = TRUE))

ts_monthly_demodat <- ts(monthly_demodat$units, start = c(2004, 1), frequency = 12)

autoplot(ts_monthly_demodat)