https://www.statsmodels.org/stable/index.html
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
> import numpy as np
+ import pandas as pd
+ import matplotlib.pyplot as plt
+ import statsmodels.api as sm
LOAD SAMPLE MACRO DATASET
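The macrodata dataset ships with statsmodels, so we can load it directly; a minimal sketch producing the output below:
> df = sm.datasets.macrodata.load_pandas().data
+ print(df)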
year quarter realgdp realcons ... unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 ... 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 ... 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 ... 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 ... 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 ... 5.2 180.007 2.31 1.19
.. ... ... ... ... ... ... ... ... ...
198 2008.0 3.0 13324.600 9267.7 ... 6.0 305.270 -3.16 4.33
199 2008.0 4.0 13141.920 9195.3 ... 6.9 305.952 -8.79 8.91
200 2009.0 1.0 12925.410 9209.2 ... 8.1 306.547 0.94 -0.71
201 2009.0 2.0 12901.504 9189.0 ... 9.2 307.226 3.37 -3.19
202 2009.0 3.0 12990.341 9256.0 ... 9.6 308.013 3.56 -3.44
[203 rows x 14 columns]
VARIABLE DEFINITIONS
Number of Observations - 203
Number of Variables - 14
Variable name definitions:
year - 1959q1 - 2009q3
quarter - 1-4
realgdp - Real gross domestic product (Bil. of chained 2005 US$,
seasonally adjusted annual rate)
realcons - Real personal consumption expenditures (Bil. of chained
2005 US$, seasonally adjusted annual rate)
realinv - Real gross private domestic investment (Bil. of chained
2005 US$, seasonally adjusted annual rate)
realgovt - Real federal consumption expenditures & gross investment
(Bil. of chained 2005 US$, seasonally adjusted annual rate)
realdpi - Real private disposable income (Bil. of chained 2005
US$, seasonally adjusted annual rate)
cpi - End of the quarter consumer price index for all urban
consumers: all items (1982-84 = 100, seasonally adjusted).
m1 - End of the quarter M1 nominal money stock (Seasonally
adjusted)
tbilrate - Quarterly average of the monthly 3-month
treasury bill: secondary market rate
unemp - Seasonally adjusted unemployment rate (%)
pop - End of the quarter total population: all ages incl. armed
forces overseas
infl - Inflation rate (ln(cpi_{t}/cpi_{t-1}) * 400)
realint - Real interest rate (tbilrate - infl)
SET NEW DATE INDEX
> index = pd.Index(sm.tsa.datetools.dates_from_range(
+ '1959Q1', '2009Q3'))
+ df.index = index
+ df.head()
year quarter realgdp realcons ... unemp pop infl realint
1959-03-31 1959.0 1.0 2710.349 1707.4 ... 5.8 177.146 0.00 0.00
1959-06-30 1959.0 2.0 2778.801 1733.7 ... 5.1 177.830 2.34 0.74
1959-09-30 1959.0 3.0 2775.488 1751.8 ... 5.3 178.657 2.74 1.09
1959-12-31 1959.0 4.0 2785.204 1753.7 ... 5.6 179.386 0.27 4.06
1960-03-31 1960.0 1.0 2847.699 1770.5 ... 5.2 180.007 2.31 1.19
[5 rows x 14 columns]
PLOT
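The plot itself is not reproduced here; a minimal sketch, assuming it shows real GDP over the new date index:
> df['realgdp'].plot(figsize=(12, 8));
+ plt.show()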
The Hodrick-Prescott filter separates a time series \(y_t\) into a trend component \(\tau_t\) and a cyclical component \(\zeta_t\).
\(y_t = \tau_t + \zeta_t\)
The components are determined by minimizing the following quadratic loss function
\[\min_{\{\tau_{t}\}}\sum_{t=1}^{T}\zeta_{t}^{2}+\lambda\sum_{t=1}^{T}\left[\left(\tau_{t}-\tau_{t-1}\right)-\left(\tau_{t-1}-\tau_{t-2}\right)\right]^{2}\]
TREND AND CYCLE
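The cycle and trend series below come from sm.tsa.filters.hpfilter (the names gdp_cycle and gdp_trend are chosen here; the default lambda of 1600 suits quarterly data):
> gdp_cycle, gdp_trend = sm.tsa.filters.hpfilter(df['realgdp'])
+ print(gdp_cycle)
+ print(gdp_trend)
+ print(type(gdp_cycle))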
1959-03-31 39.511915
1959-06-30 80.088532
1959-09-30 48.875455
1959-12-31 30.591933
1960-03-31 64.882667
...
2008-09-30 102.018455
2008-12-31 -107.269472
2009-03-31 -349.047706
2009-06-30 -397.557073
2009-09-30 -333.115243
Name: realgdp, Length: 203, dtype: float64
1959-03-31 2670.837085
1959-06-30 2698.712468
1959-09-30 2726.612545
1959-12-31 2754.612067
1960-03-31 2782.816333
...
2008-09-30 13222.581545
2008-12-31 13249.189472
2009-03-31 13274.457706
2009-06-30 13299.061073
2009-09-30 13323.456243
Name: realgdp, Length: 203, dtype: float64
<class 'pandas.core.series.Series'>
PLOT
ZOOM IN
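The two plots are not reproduced here; a minimal sketch using the gdp_trend series estimated above (the zoom simply slices the datetime index):
> df['trend'] = gdp_trend
+ df[['realgdp', 'trend']].plot(figsize=(12, 8));
+ df[['realgdp', 'trend']]['2000-03-31':].plot(figsize=(12, 8));
+ plt.show()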
Exponentially-Weighted Moving Average
IMPORT AIRLINE DATA
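A minimal loading sketch (the CSV file name is illustrative; this is the classic monthly airline passengers dataset):
> airline = pd.read_csv('airline_passengers.csv',
+ index_col='Month')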
> # Drop NA values and make index a datetime
+ airline.dropna(inplace=True)
+ airline.index = pd.to_datetime(airline.index)
+ airline.head()
Thousands of Passengers
Month
1949-01-01 112.0
1949-02-01 118.0
1949-03-01 132.0
1949-04-01 129.0
1949-05-01 121.0
SIMPLE MOVING AVERAGE
> airline['6-month-SMA'] = airline['Thousands of Passengers']. \
+ rolling(window=6).mean()
+ airline['12-month-SMA'] = airline['Thousands of Passengers']. \
+ rolling(window=12).mean()
+ airline.head()
Thousands of Passengers 6-month-SMA 12-month-SMA
Month
1949-01-01 112.0 NaN NaN
1949-02-01 118.0 NaN NaN
1949-03-01 132.0 NaN NaN
1949-04-01 129.0 NaN NaN
1949-05-01 121.0 NaN NaN
EXPONENTIALLY-WEIGHTED MOVING AVERAGE
A simple moving average has some weaknesses: it weights every point in the window equally, it lags behind the data, and it produces no values for the first (window - 1) periods (note the NaN values above).
EWMA fixes some of these issues by reducing the lag effect from SMA and by putting more weight on recent values.
The formula for EWMA is:
\[y_t = \frac{\sum\limits_{i=0}^t w_i x_{t-i}}{\sum\limits_{i=0}^t w_i}\]
where \(x_t\) is the input value, \(w_i\) is the applied weight (note that it can vary from \(i=0\) to \(t\)), and \(y_t\) is the output.
How do we define the weight term \(w_i\)?
When adjust=True (default), weighted averages are calculated using weights:
\[y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^t x_{0}}{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^t}\]
When adjust=False is specified, moving averages are calculated as:
\[\begin{split}y_0 &= x_0 \\ y_t &= (1 - \alpha) y_{t-1} + \alpha x_t,\end{split}\]
which is equivalent to using weights:
\[\begin{split}w_i = \begin{cases} \alpha (1 - \alpha)^i & \text{if } i < t \\ (1 - \alpha)^i & \text{if } i = t. \end{cases}\end{split}\]
\(\alpha\) must be between 0 and 1 (older observations receive smaller weights).
\(\alpha\) can be specified directly or you can set a value for span, com (center of mass), or halflife.
\[\begin{split}\alpha = \begin{cases} \frac{2}{s + 1}, & \text{for span}\ s \geq 1\\ \frac{1}{1 + c}, & \text{for center of mass}\ c \geq 0\\ 1 - \exp\left(\frac{\log 0.5}{h}\right), & \text{for half-life}\ h > 0 \end{cases}\end{split}\]
EXAMPLE
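A minimal sketch using pandas' ewm (span=12 is chosen here to match the 12-month SMA above):
> airline['EWMA-12'] = airline['Thousands of Passengers']. \
+ ewm(span=12).mean()
+ airline[['Thousands of Passengers', 'EWMA-12']].plot();
+ plt.show()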
We can decompose a time series into its components: Error, Trend, and Seasonality (ETS).
An additive model is best when the trend appears linear and the seasonality and trend components seem to be constant over time (e.g. every year we add 10,000 passengers).
A multiplicative model is more appropriate when the trend is increasing (or decreasing) at a non-linear rate (e.g. each year we double the amount of passengers).
> from statsmodels.tsa.seasonal import seasonal_decompose
+ result = seasonal_decompose(airline['Thousands of Passengers'],
+ model='multiplicative')
+ result.plot();
+ plt.show()
We can also plot each component separately:
ARIMA and Seasonal ARIMA
Autoregressive Integrated Moving Average
The general process for ARIMA models is the following:
Visualize the time series
Make the series stationary (difference if necessary)
Plot ACF/PACF to choose the model orders
Fit the model
Evaluate the residuals and forecast
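A minimal loading sketch (the CSV file name is illustrative; the data is the classic monthly milk production series):
> df = pd.read_csv('monthly_milk_production.csv')
+ df.tail()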
Month Monthly milk production: pounds per cow. Jan 62 - Dec 75
164 1975-09 817.0
165 1975-10 827.0
166 1975-11 797.0
167 1975-12 843.0
168 Monthly milk production: pounds per cow. Jan 6... NaN
CLEAN THE DATA
Row 168 is a footer with a missing value, and the Month column is not yet a datetime.
> # Change column name
+ # Drop the missing value
+ # Change Month to datetime
+ # Change Index to Month
+ df.columns = ['Month','Milk in pounds per cow']
+ df.drop(168,axis=0,inplace=True)
+ df['Month'] = pd.to_datetime(df['Month'])
+ df.set_index('Month',inplace=True)
+ df.tail()
Milk in pounds per cow
Month
1975-08-01 858.0
1975-09-01 817.0
1975-10-01 827.0
1975-11-01 797.0
1975-12-01 843.0
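The summary statistics below come from pandas' describe():
> df.describe()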
Milk in pounds per cow
count 168.000000
mean 754.708333
std 102.204524
min 553.000000
25% 677.750000
50% 761.000000
75% 824.500000
max 969.000000
Selecting a single column turns the DataFrame into a Series:
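A minimal sketch (timeseries is the name used by the plotting code below):
> timeseries = df['Milk in pounds per cow']
+ print(type(df))
+ print(type(timeseries))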
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
We can then plot the series along with its moving averages. We can see an increasing mean and relatively stable standard deviation.
> timeseries.rolling(12).mean().plot(
+ label='12 Month Rolling Mean');
+ timeseries.rolling(12).std().plot(
+ label='12 Month Rolling Std');
+ timeseries.plot();
+ plt.legend();
+ plt.show()
We can use ETS to see the components.
> decomposition = seasonal_decompose(
+ df['Milk in pounds per cow'], period=12)  # 'freq' in older statsmodels
+ fig = plt.figure();
+ fig = decomposition.plot();
+ fig.set_size_inches(8, 4);
+ plt.show()
To be stationary, a series must have a constant mean, a constant variance, and an autocovariance that does not depend on time.
We can see from plots that the mean is increasing over time, so it is not a stationary time series. However, we can use the Augmented Dickey-Fuller unit root test to confirm.
In statistics and econometrics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis that a unit root is present in a time series sample. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity.
We must decide whether to reject the null hypothesis H0 (the time series has a unit root, indicating that it is non-stationary) or fail to reject it.
We base our decision on the p-value.
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject.
> from statsmodels.tsa.stattools import adfuller
+ result = adfuller(df['Milk in pounds per cow'])
+ result
(-1.3038115874221294, 0.6274267086030316, 13, 154, {'1%': -3.473542528196209, '5%': -2.880497674144038, '10%': -2.576878053634677}, 1115.1730447395112)
Unfortunately, the output is a tuple without labels and is not very informative.
We can improve it with this function.
> def adf_check(time_series):
+     """
+     Pass in a time series; prints an ADF report
+     """
+     result = adfuller(time_series)
+     print('Augmented Dickey-Fuller Test:')
+     labels = ['ADF Test Statistic', 'p-value',
+               '#Lags Used', 'Number of Observations Used']
+
+     for value, label in zip(result, labels):
+         print(label + ' : ' + str(value))
+
+     if result[1] <= 0.05:
+         print("strong evidence against the null hypothesis")
+         print("reject the null hypothesis")
+         print("Data has no unit root and is stationary")
+     else:
+         print("weak evidence against null hypothesis")
+         print("time series has a unit root")
+         print("indicating that it is non-stationary")
+
+ adf_check(df['Milk in pounds per cow'])
Augmented Dickey-Fuller Test:
ADF Test Statistic : -1.3038115874221294
p-value : 0.6274267086030316
#Lags Used : 13
Number of Observations Used : 154
weak evidence against null hypothesis
time series has a unit root
indicating that it is non-stationary
Since the data is not stationary we can try differencing, which is the Integrated part of ARIMA.
The first difference of a time series is the series of changes from one period to the next. You can do this easily with pandas and can continue to take the second difference, third difference, and so on until your data is stationary.
FIRST DIFFERENCE
> # first difference
+ df['Milk First Difference'] = df['Milk in pounds per cow'] - \
+ df['Milk in pounds per cow'].shift(1)
+
+ # adf check
+ adf_check(df['Milk First Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -3.0549955586530704
p-value : 0.030068004001785647
#Lags Used : 14
Number of Observations Used : 152
strong evidence against the null hypothesis
reject the null hypothesis
Data has no unit root and is stationary
We can reject the Null. The differenced time series is stationary.
The plot shows a stable mean and variance.
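As an aside, pandas' diff() is shorthand for the shift-and-subtract pattern above; this sketch is equivalent:
> df['Milk First Difference'] = \
+ df['Milk in pounds per cow'].diff(1)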
SECOND DIFFERENCE
Although we don’t need a second difference, we can check for an improvement.
> df['Milk Second Difference'] = df['Milk First Difference'] - \
+ df['Milk First Difference'].shift(1)
+ adf_check(df['Milk Second Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -14.327873645603301
p-value : 1.1126989332084581e-26
#Lags Used : 11
Number of Observations Used : 154
strong evidence against the null hypothesis
reject the null hypothesis
Data has no unit root and is stationary
It is also stationary, but there isn’t a meaningful improvement.
SEASONAL DIFFERENCE
We know from the ETS decomposition that there is a seasonal pattern - in this case yearly.
> df['Seasonal Difference'] = df['Milk in pounds per cow'] - \
+ df['Milk in pounds per cow'].shift(12)
+ adf_check(df['Seasonal Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -2.335419314359398
p-value : 0.1607988052771135
#Lags Used : 12
Number of Observations Used : 143
weak evidence against null hypothesis
time series has a unit root
indicating that it is non-stationary
A seasonal difference alone wasn’t enough. We fail to reject the Null.
We can see from the plot that the variance changes over time.
SEASONAL FIRST DIFFERENCE
We can also try a seasonal difference on the first difference.
> df['Seasonal First Difference'] = df['Milk First Difference'] - \
+ df['Milk First Difference'].shift(12)
+ adf_check(df['Seasonal First Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -5.038002274921984
p-value : 1.8654234318788342e-05
#Lags Used : 11
Number of Observations Used : 143
strong evidence against the null hypothesis
reject the null hypothesis
Data has no unit root and is stationary
The p-value is very low (reject Null). The time series is stationary. It also seems to be an improvement over the first difference alone.
ACF
An autocorrelation plot (also known as a correlogram) shows the correlation of the series with itself at increasing lags: the y-axis is the correlation and the x-axis is the number of time units of lag.
ACF - Seasonal First Difference
> from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
+ fig_seasonal_first = plot_acf(
+ df["Seasonal First Difference"].dropna(), lags=50);
+ plt.show()
PACF
In general, a partial correlation is a conditional correlation.
It is the correlation between two variables under the assumption that we know and take into account the values of some other set of variables.
For instance, consider a regression context in which \(y\) = response variable and \(x_1\), \(x_2\), and \(x_3\) are predictor variables. The partial correlation between \(y\) and \(x_3\) is the correlation between the variables determined taking into account how both \(y\) and \(x_3\) are related to \(x_1\) and \(x_2\).
Formally, this relationship is defined as:
\[\frac{\text{Covariance}(y, x_3|x_1, x_2)}{\sqrt{\text{Variance}(y|x_1, x_2)\text{Variance}(x_3| x_1, x_2)}}\]
PACF Seasonal First Difference
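The plot is not reproduced here; a minimal sketch of the PACF plot for this series:
> fig_pacf = plot_pacf(
+ df['Seasonal First Difference'].dropna(), lags=50);
+ plt.show()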
ACF and PACF Plots Combined
> fig = plt.figure(figsize=(12,8))
+ ax1 = fig.add_subplot(211)
+ fig = sm.graphics.tsa.plot_acf(df['Seasonal First Difference'].iloc[13:], lags=40, ax=ax1)
+ ax2 = fig.add_subplot(212)
+ fig = sm.graphics.tsa.plot_pacf(df['Seasonal First Difference'].iloc[13:], lags=40, ax=ax2)
+ plt.show()
Behavior of the ACF and PACF for ARMA models
| Plot | AR(p) | MA(q) | ARMA(p,q) |
|---|---|---|---|
| ACF | Tails off | Cuts off after lag q | Tails off |
| PACF | Cuts off after lag p | Tails off | Tails off |
We need to evaluate two components:
Seasonal
Non-Seasonal
\[SARIMA(p,d,q) \times (P,D,Q)_s\]
p,d,q parameters
p - the number of autoregressive (AR) terms
d - the order of nonseasonal differencing
q - the number of moving-average (MA) terms
P, D, and Q are the seasonal counterparts, and s is the length of the season (12 for monthly data).
Our Model
Based on the differencing required (d=1, D=1, s=12) and the ACF/PACF plots above, we start with:
\[SARIMA(0,1,0) \times (0,1,1)_{12}\]
> import warnings
+ warnings.filterwarnings("ignore")
+ model = sm.tsa.statespace.SARIMAX(df['Milk in pounds per cow'],
+ order=(0,1,0), seasonal_order=(0,1,1,12))
+ results = model.fit()
+ print(results.summary())
Statespace Model Results
==========================================================================================
Dep. Variable: Milk in pounds per cow No. Observations: 168
Model: SARIMAX(0, 1, 0)x(0, 1, 1, 12) Log Likelihood -534.140
Date: Fri, 11 Dec 2020 AIC 1072.280
Time: 16:40:01 BIC 1078.367
Sample: 01-01-1962 HQIC 1074.752
- 12-01-1975
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ma.S.L12 -0.6110 0.066 -9.285 0.000 -0.740 -0.482
sigma2 55.5966 5.322 10.446 0.000 45.166 66.028
===================================================================================
Ljung-Box (Q): 32.42 Jarque-Bera (JB): 31.77
Prob(Q): 0.80 Prob(JB): 0.00
Heteroskedasticity (H): 0.70 Skew: 0.78
Prob(H) (two-sided): 0.19 Kurtosis: 4.59
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
\[SARIMA(1,1,1) \times (0,1,1)_{12}\]
> model = sm.tsa.statespace.SARIMAX(df['Milk in pounds per cow'],
+ order=(1,1,1), seasonal_order=(0,1,1,12))
+ results = model.fit()
+ print(results.summary())
Statespace Model Results
==========================================================================================
Dep. Variable: Milk in pounds per cow No. Observations: 168
Model: SARIMAX(1, 1, 1)x(0, 1, 1, 12) Log Likelihood -530.032
Date: Fri, 11 Dec 2020 AIC 1068.064
Time: 16:40:01 BIC 1080.238
Sample: 01-01-1962 HQIC 1073.009
- 12-01-1975
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.1346 0.334 -0.403 0.687 -0.790 0.520
ma.L1 -0.0969 0.334 -0.290 0.772 -0.752 0.558
ma.S.L12 -0.6204 0.070 -8.827 0.000 -0.758 -0.483
sigma2 52.6316 4.939 10.656 0.000 42.951 62.312
===================================================================================
Ljung-Box (Q): 21.48 Jarque-Bera (JB): 34.44
Prob(Q): 0.99 Prob(JB): 0.00
Heteroskedasticity (H): 0.83 Skew: 0.74
Prob(H) (two-sided): 0.51 Kurtosis: 4.78
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The second model, with non-seasonal ARMA(1,1) terms, has the lower AIC (1068.1 vs. 1072.3), and lower AIC is preferred; note, however, that its added AR and MA coefficients are not individually significant.
We can evaluate the model's performance by examining its residuals.
Most of the residuals are small and centered around zero, so the model seems adequate.
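A minimal residual-diagnostics sketch (results.resid holds the one-step-ahead residuals):
> results.resid.plot(figsize=(12, 4));
+ plt.show()
+ results.resid.plot(kind='kde');
+ plt.show()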
First, we can predict values that we already know:
> df['forecast'] = results.predict(start=150, end=168,
+ dynamic=True)
+ df[['Milk in pounds per cow','forecast']].plot(figsize=(12,8));
+ plt.show()
FORECASTING
We need to create additional time periods.
> from pandas.tseries.offsets import DateOffset
+ future_dates = [df.index[-1] + DateOffset(months=x)
+ for x in range(1, 24)]
+ future_dates
[Timestamp('1976-01-01 00:00:00'), Timestamp('1976-02-01 00:00:00'),
 Timestamp('1976-03-01 00:00:00'), Timestamp('1976-04-01 00:00:00'),
 Timestamp('1976-05-01 00:00:00'), Timestamp('1976-06-01 00:00:00'),
 Timestamp('1976-07-01 00:00:00'), Timestamp('1976-08-01 00:00:00'),
 Timestamp('1976-09-01 00:00:00'), Timestamp('1976-10-01 00:00:00'),
 Timestamp('1976-11-01 00:00:00'), Timestamp('1976-12-01 00:00:00'),
 Timestamp('1977-01-01 00:00:00'), Timestamp('1977-02-01 00:00:00'),
 Timestamp('1977-03-01 00:00:00'), Timestamp('1977-04-01 00:00:00'),
 Timestamp('1977-05-01 00:00:00'), Timestamp('1977-06-01 00:00:00'),
 Timestamp('1977-07-01 00:00:00'), Timestamp('1977-08-01 00:00:00'),
 Timestamp('1977-09-01 00:00:00'), Timestamp('1977-10-01 00:00:00'),
 Timestamp('1977-11-01 00:00:00')]
> # empty data frame with new dates
+ future_dates_df = pd.DataFrame(index=future_dates,columns=df.columns)
+ future_dates_df.head()
Milk in pounds per cow ... forecast
1976-01-01 NaN ... NaN
1976-02-01 NaN ... NaN
1976-03-01 NaN ... NaN
1976-04-01 NaN ... NaN
1976-05-01 NaN ... NaN
[5 rows x 6 columns]
> # combine with the existing data frame
+ future_df = pd.concat([df,future_dates_df])
+ future_df.head()
Milk in pounds per cow ... forecast
1962-01-01 589.0 ... NaN
1962-02-01 561.0 ... NaN
1962-03-01 640.0 ... NaN
1962-04-01 656.0 ... NaN
1962-05-01 727.0 ... NaN
[5 rows x 6 columns]
> future_df['forecast'] = results.predict(start=168,
+ end=190, dynamic=True)
+ future_df[['Milk in pounds per cow', 'forecast']]. \
+ plot(figsize=(12, 8))
+ plt.show()