Data624 Project 1

Group Members

Subhalaxmi Rout
Kenan Sooklall
Devin Teran
Christian Thieme
Leo Yi

Get The Data

The dataset for this project was provided to us in excel format which made it relatively easy to ingest in R. Some of us pointed to the dataset from our local machines and others uploaded a converted csv to github, importing the data from the raw link online.

Once the raw data was imported, the next step was to isolate the specific variables required for each set in the assignment. Again, our methods varied, but the results were verified and accurate.

Initial Exploration of the Data

## Rows: 10,572
## Columns: 8
## $ row_index <int> 40669, 40669, 40669, 40669, 40669, 40669, 40670, 40670, 4067~
## $ group     <chr> "S03", "S02", "S01", "S06", "S05", "S04", "S03", "S02", "S01~
## $ var01     <dbl> 30.64286, 10.28000, 26.61000, 27.48000, 69.26000, 17.20000, ~
## $ var02     <int> 123432400, 60855800, 10369300, 39335700, 27809100, 16587400,~
## $ var03     <dbl> 30.34000, 10.05000, 25.89000, 26.82000, 68.19000, 16.88000, ~
## $ var05     <dbl> 30.49000, 10.17000, 26.20000, 27.02000, 68.72000, 16.94000, ~
## $ var07     <dbl> 30.57286, 10.28000, 26.01000, 27.32000, 69.15000, 17.10000, ~
## $ date      <date> 2011-05-06, 2011-05-06, 2011-05-06, 2011-05-06, 2011-05-06,~

Here’s a glimpse of what the data looks like. There was a discussion and agreement within the group where we believe that we’re working with daily data. The format of the index was familiar to some of us as excel formatted dates, which we sometimes see when exporting and importing excel files. The idea that this is daily data leads us to check for types of seasonality later in our analysis, however we still recognize that it has not been confirmed to be daily data.

This is a chart of all of the time series sharing the same x-axis. We can see that 8 of the 12 variables have max values less than 200 and the other 4 variables have relatively enormous values. We can also see some clear trends, obvious outliers, and potential cycles. The idea of seasonality here is difficult to see on this plot and depends on a deeper analysis at the variable level.

This is a consolidated view, but in practice, we split the variables up among our group and each individual performed a deep dive on each variable.

Imputing Missing Values

This is a crucial step in data preparation. Each of the series provided had gaps within the data and we all seemed to take different methods to tackle the null values based on the series. For seemingly stationary series, an average value could be used. For non-stationary data, we can impute based on values nearest to the missing value, or use other packages that have functions to handle this situation. We used functions like tidyr::fill() and imputeTS::na.interpolation() .

Missing Values example

As an example, here’s a portion of both series from S03. The vertical lines represent the points where there is missing data. In this case, we averaged a window of the two lagging and leading periods.

Outliers

In some of the datasets, there are clear outliers that may affect the models we build. Our approach was to use the forecast::tsclean function to resolve this issue.

After imputing missing values and before outlier resolution:

Same time series after outliers adjusted:

Seasonality

Finding seasonality seemed to be tied to the frequency we set when creating our time series objects. We discussed using 1 vs 365, but there was some hesitation since we couldn’t confirm that the actual frequency of the data was measured by day. From here, we had serveral approaches to determining seasonality. One method was to contruct plots and to visually examine for seasonality. Another method was to plot autocorrelations with a large lag assuming this was daily data and to observe for spikes in order to determine the length of the period. Finally, we also found a resource online which provided a function we could use to determine the frequency based on each series.

Here is an autocorrelation plot for S03 Var05 lagging 720 days, which was chosen to include approximately two years worth of observations. This plot shows a steady decline. If seasonality existed within this data, we would observe peaks, which we could then use to set the frequency for the time series.

Alternatively, we can use the function referenced from the links below to calculate the frequency to use.

find.freq <- function(x)
{
    n <- length(x)
    spec <- spec.ar(c(x),plot=FALSE)
    if(max(spec$spec)>10) # Arbitrary threshold chosen by trial and error.
    {
        period <- round(1/spec$freq[which.max(spec$spec)])
        if(period==Inf) # Find next local maximum
        {
            j <- which(diff(spec$spec)>0)
            if(length(j)>0)
            {
                nextmax <- j[1] + which.max(spec$spec[j[1]:500])
                period <- round(1/spec$freq[nextmax])
            }
            else
                period <- 1
        }
    }
    else
        period <- 1
    return(period)
}

find.freq(s3$var05)

## [1] 1

Differencing

This step will allow us to transform the time series into stationary data. This is an important step in order to construct an ARIMA model. In order to determine whether a series was stationary or not, we plotted the series using ggtsdisplay() and observed the plot, as well as running Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests.

In order to transform each series, some group members chose to do this manually, while others started with the function ndiffs() . The differenced results were plotted again and this method was repeated until we reached stationary results.

Modeling

Once the data is prepared for modeling, our general approach was to split each time series into training and testing sets, using about 80-85% of the first 1,622 rows as the training data.

We created multiple models using the training data and compared the accuracy against the test data. The model with the minimum errors, using RMSE and MAPE, were selected and then the best model would be used to generate the forecast for the next 140 periods.

One thing to note is that some of us chose to difference the data and run Arima() , while others chose to skip the differencing step and used auto.arima() . We also fit using ets() , before comparing accuracy.

auto.arima() accuracy:

##                         ME      RMSE        MAE          MPE      MAPE     MASE
## Training set  -0.000129212  1.248385  0.8778607  -0.03489299  1.314015  0.98769
## Test set     -29.049859982 34.509005 29.0577636 -27.65330257 27.659181 32.69319
##                       ACF1 Theil's U
## Training set -0.0002637416        NA
## Test set      0.9856681602  16.27615

ets() accuracy:

##                        ME      RMSE        MAE         MPE      MAPE       MASE
## Training set   0.08263014  1.255415  0.8847694   0.1032869  1.322841  0.9954631
## Test set     -16.44346086 20.334135 16.5771789 -15.8117042 15.912734 18.6511526
##                     ACF1 Theil's U
## Training set 0.000387123        NA
## Test set     0.976861010  9.672666

Above, we can see the results of the accuracy function on S03 Var05. The training data used the first 80% of the full series, and the last 20% was used as the test set.

It looks like exponential smoothing does a better job of fitting the test data.

Above we can see the plot of the time series associated with the accuracy outputs with the predictions of the two models on the test data, without prediction intervals. This time series did not show seasonality and the ETS model in blue was chosen based on the accuracy. We can also observe here that the ets model is a closer match to the actual test data.

Check Residuals

After fitting the models with the best performer on the test data, the final step before making the predictions was to check the residuals.

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,A,N)
## Q* = 15.087, df = 6, p-value = 0.01959
## 
## Model df: 4.   Total lags used: 10

Forecast

Finally, once we have chosen our models, we need to predict the next 140 periods for submission and evaluation.

fit_v5 %>%
  forecast(h=140) %>%
  data.frame() %>%
  select(Point.Forecast) %>%
  head()

##      Point.Forecast
## 1623       98.74952
## 1624       98.81825
## 1625       98.88698
## 1626       98.95571
## 1627       99.02444
## 1628       99.09317