ARIMA Time Series Analysis

Load in the data set from Quandl.

import quandl
dt = quandl.get("ZILLOW/N52_PRRAH", authtoken="GzFy7sTypFanLb3Tvs1s")

Plot the “Value” by Month from 10-31-2010 through 7-31-2018

Begin ARIMA model-building

Plot 12-month rolling mean, 12-month rolling std, and [actual] Value

Plot Observed, Trend, Seasonal, and Residual plots

The data must be stationary before we can fit an ARIMA model.

Test seasonal difference for stationarity using the Dickey-Fuller test

# Store in a function to summarize Dickey-Fuller test for stationarity
# Import function
from statsmodels.tsa.stattools import adfuller
# Define function
def adf_check(time_series):
    result = adfuller(time_series)
    print('Augmented Dickey-Fuller Test:')
    labels = ['ADF Test Statistic','p-value','#Number of Lags Used','Number of Observations Used']
    for value,label in zip(result,labels):
        print(label+' : '+str(value) )
    
    if result[1] <= 0.05:
        print("Reject the null hypothesis. Data has no unit root and is stationary.")
    else:
        print("Fail to reject the null hypothesis. Time series has a unit root, indicating it is non-stationary.")
# Check 'Value' for stationarity
adf_check(dt['Value'])
# Non-stationary, so it must be transformed
## Augmented Dickey-Fuller Test:
## ADF Test Statistic : -2.3881188732672256
## p-value : 0.1451178867301326
## #Number of Lags Used : 2
## Number of Observations Used : 92
## Fail to reject the null hypothesis. Time series has a unit root, indicating it is non-stationary.

Log-transform ‘Value’ to and test for stationarity

import numpy as np
dt['Log Value'] = np.log(dt['Value'])
# Check it for stationarity
adf_check(dt['Log Value'])
## Augmented Dickey-Fuller Test:
## ADF Test Statistic : -2.8661067736357446
## p-value : 0.049433664257042126
## #Number of Lags Used : 2
## Number of Observations Used : 92
## Reject the null hypothesis. Data has no unit root and is stationary.

Plot [stationary] log-transformed ‘Value’

Plot autocorrelation and partial autocorrelation of first difference to find values for p and q, respectively

The p appears to be about 21 and the q seems to be about 2

Build, fit, and summarize ARIMA model

# Import libraries
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(dt['Log Value'].iloc[1:], order=(21,0,2)) # p = 21, d = 0, q = 2
# fit model
## C:\Users\aengland\AppData\Local\CONTIN~1\ANACON~1\lib\site-packages\statsmodels\tsa\base\tsa_model.py:171: ValueWarning: No frequency information was provided, so inferred frequency M will be used.
##   % freq, ValueWarning)
results = model.fit()
# Get summary of model
## C:\Users\aengland\AppData\Local\CONTIN~1\ANACON~1\lib\site-packages\statsmodels\base\model.py:488: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
##   'available', HessianInversionWarning)
results.summary()

Get predicted values

dt['Predicted Log Value'] = results.predict()
# Back transform predicted log value into Value
dt['Predicted Value'] = np.exp(dt['Predicted Log Value'])
# Get the residuals
dt['Residual'] = dt['Value'].iloc[2:] - dt['Predicted Value'].iloc[2:]

Plot the residuals by time

Plot distribution of the residuals with kernel density estimation (KDE)

Plot the actual and predicted values

Forecast for the next 24 months

import pandas as pd
from pandas.tseries.offsets import DateOffset
# Get future dates for the next 24 months
future_dates = [dt.index[-1] + DateOffset(months=x) for x in range(0,24)] # 24 months
# Put the future dates into a data frame with the column names from dt
future_dates_df = pd.DataFrame(index=future_dates[1:], columns=dt.columns)
# Create new data frame that adds the future dates data frame to the original dt
future_df = pd.concat([dt, future_dates_df])
# Forecast the unseen data points up until month 118
future_df['Forecasted Log Value'] = results.predict(start = 94, end = 118, dynamic=True) 
# back transform this
future_df['Forecasted Actual Value'] = np.exp(future_df['Forecasted Log Value'])

Plot the predicted, actual, and 24 month forecasted values (with 95% confidence interval)

Plot the forecasted values with a 95% Confidence Interval