https://www.statsmodels.org/stable/index.html
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
> import numpy as np
+ import pandas as pd
+ import matplotlib.pyplot as plt
+ import statsmodels.api as sm
LOAD SAMPLE MACRO DATASET
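The macrodata dataset ships with statsmodels, so we can load it directly; a minimal sketch producing the output below:
> df = sm.datasets.macrodata.load_pandas().data
+ print(df)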
year quarter realgdp realcons ... unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 ... 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 ... 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 ... 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 ... 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 ... 5.2 180.007 2.31 1.19
.. ... ... ... ... ... ... ... ... ...
198 2008.0 3.0 13324.600 9267.7 ... 6.0 305.270 -3.16 4.33
199 2008.0 4.0 13141.920 9195.3 ... 6.9 305.952 -8.79 8.91
200 2009.0 1.0 12925.410 9209.2 ... 8.1 306.547 0.94 -0.71
201 2009.0 2.0 12901.504 9189.0 ... 9.2 307.226 3.37 -3.19
202 2009.0 3.0 12990.341 9256.0 ... 9.6 308.013 3.56 -3.44
[203 rows x 14 columns]
VARIABLE DEFINITIONS
Number of Observations - 203
Number of Variables - 14
Variable name definitions:
year - 1959q1 - 2009q3
quarter - 1-4
realgdp - Real gross domestic product (Bil. of chained 2005 US$,
seasonally adjusted annual rate)
realcons - Real personal consumption expenditures (Bil. of chained
2005 US$, seasonally adjusted annual rate)
realinv - Real gross private domestic investment (Bil. of chained
2005 US$, seasonally adjusted annual rate)
realgovt - Real federal consumption expenditures & gross investment
(Bil. of chained 2005 US$, seasonally adjusted annual rate)
realdpi - Real private disposable income (Bil. of chained 2005
US$, seasonally adjusted annual rate)
cpi - End of the quarter consumer price index for all urban
consumers: all items (1982-84 = 100, seasonally adjusted).
m1 - End of the quarter M1 nominal money stock (Seasonally
adjusted)
tbilrate - Quarterly average of the monthly 3-month
treasury bill: secondary market rate
unemp - Seasonally adjusted unemployment rate (%)
pop - End of the quarter total population: all ages incl. armed
forces overseas
infl - Inflation rate (ln(cpi_{t}/cpi_{t-1}) * 400)
realint - Real interest rate (tbilrate - infl)
SET NEW DATE INDEX
> index = pd.Index(sm.tsa.datetools.dates_from_range(
+ '1959Q1', '2009Q3'))
+ df.index = index
+ df.head()
year quarter realgdp realcons ... unemp pop infl realint
1959-03-31 1959.0 1.0 2710.349 1707.4 ... 5.8 177.146 0.00 0.00
1959-06-30 1959.0 2.0 2778.801 1733.7 ... 5.1 177.830 2.34 0.74
1959-09-30 1959.0 3.0 2775.488 1751.8 ... 5.3 178.657 2.74 1.09
1959-12-31 1959.0 4.0 2785.204 1753.7 ... 5.6 179.386 0.27 4.06
1960-03-31 1960.0 1.0 2847.699 1770.5 ... 5.2 180.007 2.31 1.19
[5 rows x 14 columns]
PLOT
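The plot itself is not reproduced here; a minimal sketch, assuming it shows real GDP over the new date index:
> df['realgdp'].plot(figsize=(12, 8));
+ plt.show()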
The Hodrick-Prescott filter separates a time series \(y_t\) into a trend component \(\tau_t\) and a cyclical component \(\zeta_t\).
\(y_t = \tau_t + \zeta_t\)
The components are determined by minimizing the following quadratic loss function
\[\min_{\{\tau_{t}\}}\sum_{t=1}^{T}\zeta_{t}^{2}+\lambda\sum_{t=1}^{T}\left[\left(\tau_{t}-\tau_{t-1}\right)-\left(\tau_{t-1}-\tau_{t-2}\right)\right]^{2}\]
TREND AND CYCLE
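The cycle and trend series below come from sm.tsa.filters.hpfilter (the names gdp_cycle and gdp_trend are chosen here; the default lambda of 1600 suits quarterly data):
> gdp_cycle, gdp_trend = sm.tsa.filters.hpfilter(df['realgdp'])
+ print(gdp_cycle)
+ print(gdp_trend)
+ print(type(gdp_cycle))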
1959-03-31 39.511915
1959-06-30 80.088532
1959-09-30 48.875455
1959-12-31 30.591933
1960-03-31 64.882667
...
2008-09-30 102.018455
2008-12-31 -107.269472
2009-03-31 -349.047706
2009-06-30 -397.557073
2009-09-30 -333.115243
Name: realgdp, Length: 203, dtype: float64
1959-03-31 2670.837085
1959-06-30 2698.712468
1959-09-30 2726.612545
1959-12-31 2754.612067
1960-03-31 2782.816333
...
2008-09-30 13222.581545
2008-12-31 13249.189472
2009-03-31 13274.457706
2009-06-30 13299.061073
2009-09-30 13323.456243
Name: realgdp, Length: 203, dtype: float64
<class 'pandas.core.series.Series'>
PLOT
ZOOM IN
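The two plots are not reproduced here; a minimal sketch using the gdp_trend series estimated above (the zoom simply slices the datetime index):
> df['trend'] = gdp_trend
+ df[['realgdp', 'trend']].plot(figsize=(12, 8));
+ df[['realgdp', 'trend']]['2000-03-31':].plot(figsize=(12, 8));
+ plt.show()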
Exponentially-Weighted Moving Average
IMPORT AIRLINE DATA
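A minimal loading sketch (the CSV file name is illustrative; this is the classic monthly airline passengers dataset):
> airline = pd.read_csv('airline_passengers.csv',
+ index_col='Month')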
> # Drop NA values and make index a datetime
+ airline.dropna(inplace=True)
+ airline.index = pd.to_datetime(airline.index)
+ airline.head()
Thousands of Passengers
Month
1949-01-01 112.0
1949-02-01 118.0
1949-03-01 132.0
1949-04-01 129.0
1949-05-01 121.0
SIMPLE MOVING AVERAGE
> airline['6-month-SMA'] = airline['Thousands of Passengers']. \
+ rolling(window=6).mean()
+ airline['12-month-SMA'] = airline['Thousands of Passengers']. \
+ rolling(window=12).mean()
+ airline.head()
Thousands of Passengers 6-month-SMA 12-month-SMA
Month
1949-01-01 112.0 NaN NaN
1949-02-01 118.0 NaN NaN
1949-03-01 132.0 NaN NaN
1949-04-01 129.0 NaN NaN
1949-05-01 121.0 NaN NaN
EXPONENTIALLY-WEIGHTED MOVING AVERAGE
A simple moving average has some weaknesses: it weights every point in the window equally, it lags behind the data, and it produces no values for the first (window - 1) periods (note the NaN values above).
EWMA fixes some of these issues by reducing the lag effect from SMA and by putting more weight on recent values.
The formula for EWMA is:
\[y_t = \frac{\sum\limits_{i=0}^t w_i x_{t-i}}{\sum\limits_{i=0}^t w_i}\]
where \(x_t\) is the input value, \(w_i\) is the applied weight (note that it can vary from \(i=0\) to \(t\)), and \(y_t\) is the output.
How do we define the weight term \(w_i\)?
When adjust=True (default), weighted averages are calculated using weights:
\[y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^t x_{0}}{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^t}\]
When adjust=False is specified, moving averages are calculated as:
\[\begin{split}y_0 &= x_0 \\ y_t &= (1 - \alpha) y_{t-1} + \alpha x_t,\end{split}\]
which is equivalent to using weights:
\[\begin{split}w_i = \begin{cases} \alpha (1 - \alpha)^i & \text{if } i < t \\ (1 - \alpha)^i & \text{if } i = t. \end{cases}\end{split}\]
\(\alpha\) must be between 0 and 1 (older observations receive smaller weights).
\(\alpha\) can be specified directly or you can set a value for span, com (center of mass), or halflife.
\[\begin{split}\alpha = \begin{cases} \frac{2}{s + 1}, & \text{for span}\ s \geq 1\\ \frac{1}{1 + c}, & \text{for center of mass}\ c \geq 0\\ 1 - \exp\left(\frac{\log 0.5}{h}\right), & \text{for half-life}\ h > 0 \end{cases}\end{split}\]
EXAMPLE
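A minimal sketch using pandas' ewm (span=12 is chosen here to match the 12-month SMA above):
> airline['EWMA-12'] = airline['Thousands of Passengers']. \
+ ewm(span=12).mean()
+ airline[['Thousands of Passengers', 'EWMA-12']].plot();
+ plt.show()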
We can decompose a time series into its components: Error, Trend, and Seasonality (ETS).
An additive model is best when the trend appears linear and the seasonality and trend components seem to be constant over time (e.g. every year we add 10,000 passengers).
A multiplicative model is more appropriate when the trend is increasing (or decreasing) at a non-linear rate (e.g. each year we double the amount of passengers).
> from statsmodels.tsa.seasonal import seasonal_decompose
+ result = seasonal_decompose(airline['Thousands of Passengers'],
+ model='multiplicative')
+ result.plot();
+ plt.show()
We can also plot each component separately:
ARIMA and Seasonal ARIMA
Autoregressive Integrated Moving Average
The general process for ARIMA models is the following:
Visualize the time series
Make the series stationary (difference if necessary)
Plot ACF/PACF to choose the model orders
Fit the model
Evaluate the residuals and forecast
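A minimal loading sketch (the CSV file name is illustrative; the data is the classic monthly milk production series):
> df = pd.read_csv('monthly_milk_production.csv')
+ df.tail()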
Month Monthly milk production: pounds per cow. Jan 62 - Dec 75
164 1975-09 817.0
165 1975-10 827.0
166 1975-11 797.0
167 1975-12 843.0
168 Monthly milk production: pounds per cow. Jan 6... NaN
CLEAN THE DATA
Row 168 is a footer with a missing value, and the Month column is not yet a datetime.
> # Change column name
+ # Drop the missing value
+ # Change Month to datetime
+ # Change Index to Month
+ df.columns = ['Month','Milk in pounds per cow']
+ df.drop(168,axis=0,inplace=True)
+ df['Month'] = pd.to_datetime(df['Month'])
+ df.set_index('Month',inplace=True)
+ df.tail()
Milk in pounds per cow
Month
1975-08-01 858.0
1975-09-01 817.0
1975-10-01 827.0
1975-11-01 797.0
1975-12-01 843.0
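The summary statistics below come from pandas' describe():
> df.describe()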
Milk in pounds per cow
count 168.000000
mean 754.708333
std 102.204524
min 553.000000
25% 677.750000
50% 761.000000
75% 824.500000
max 969.000000
Selecting a single column turns the DataFrame into a Series:
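A minimal sketch (timeseries is the name used by the plotting code below):
> timeseries = df['Milk in pounds per cow']
+ print(type(df))
+ print(type(timeseries))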
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
We can then plot the series along with its moving averages. We can see an increasing mean and relatively stable standard deviation.
> timeseries.rolling(12).mean().plot(
+ label='12 Month Rolling Mean');
+ timeseries.rolling(12).std().plot(
+ label='12 Month Rolling Std');
+ timeseries.plot();
+ plt.legend();
+ plt.show()
We can use ETS to see the components.
> decomposition = seasonal_decompose(
+ df['Milk in pounds per cow'], period=12)  # 'freq' in older statsmodels
+ fig = plt.figure();
+ fig = decomposition.plot();
+ fig.set_size_inches(8, 4);
+ plt.show()
To be stationary, a series must have a constant mean, a constant variance, and an autocovariance that does not depend on time.
We can see from plots that the mean is increasing over time, so it is not a stationary time series. However, we can use the Augmented Dickey-Fuller unit root test to confirm.
In statistics and econometrics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis that a unit root is present in a time series sample. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity.
We must decide whether to reject the null hypothesis H0 (the time series has a unit root, indicating that it is non-stationary) or fail to reject it.
We base our decision on the p-value.
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject.
> from statsmodels.tsa.stattools import adfuller
+ result = adfuller(df['Milk in pounds per cow'])
+ result
(-1.3038115874221294, 0.6274267086030316, 13, 154, {'1%': -3.473542528196209, '5%': -2.880497674144038, '10%': -2.576878053634677}, 1115.1730447395112)
Unfortunately, the output is a tuple without labels and is not very informative.
We can improve it with this function.
> def adf_check(time_series):
+     """
+     Pass in a time series; prints an ADF report
+     """
+     result = adfuller(time_series)
+     print('Augmented Dickey-Fuller Test:')
+     labels = ['ADF Test Statistic', 'p-value',
+               '#Lags Used', 'Number of Observations Used']
+
+     for value, label in zip(result, labels):
+         print(label + ' : ' + str(value))
+
+     if result[1] <= 0.05:
+         print("strong evidence against the null hypothesis")
+         print("reject the null hypothesis")
+         print("Data has no unit root and is stationary")
+     else:
+         print("weak evidence against null hypothesis")
+         print("time series has a unit root")
+         print("indicating that it is non-stationary")
+
+ adf_check(df['Milk in pounds per cow'])
Augmented Dickey-Fuller Test:
ADF Test Statistic : -1.3038115874221294
p-value : 0.6274267086030316
#Lags Used : 13
Number of Observations Used : 154
weak evidence against null hypothesis
time series has a unit root
indicating that it is non-stationary
Since the data is not stationary we can try differencing, which is the Integrated part of ARIMA.
The first difference of a time series is the series of changes from one period to the next. You can do this easily with pandas and can continue to take the second difference, third difference, and so on until your data is stationary.
FIRST DIFFERENCE
> # first difference
+ df['Milk First Difference'] = df['Milk in pounds per cow'] - \
+ df['Milk in pounds per cow'].shift(1)
+
+ # adf check
+ adf_check(df['Milk First Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -3.0549955586530704
p-value : 0.030068004001785647
#Lags Used : 14
Number of Observations Used : 152
strong evidence against the null hypothesis
reject the null hypothesis
Data has no unit root and is stationary
We can reject the Null. The differenced time series is stationary.
The plot shows a stable mean and variance.
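As an aside, pandas' diff() is shorthand for the shift-and-subtract pattern above; this sketch is equivalent:
> df['Milk First Difference'] = \
+ df['Milk in pounds per cow'].diff(1)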
SECOND DIFFERENCE
Although we don’t need a second difference, we can check for an improvement.
> df['Milk Second Difference'] = df['Milk First Difference'] - \
+ df['Milk First Difference'].shift(1)
+ adf_check(df['Milk Second Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -14.327873645603301
p-value : 1.1126989332084581e-26
#Lags Used : 11
Number of Observations Used : 154
strong evidence against the null hypothesis
reject the null hypothesis
Data has no unit root and is stationary
It is also stationary, but there isn’t a meaningful improvement.
SEASONAL DIFFERENCE
We know from the ETS decomposition that there is a seasonal pattern - in this case yearly.
> df['Seasonal Difference'] = df['Milk in pounds per cow'] - \
+ df['Milk in pounds per cow'].shift(12)
+ adf_check(df['Seasonal Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -2.335419314359398
p-value : 0.1607988052771135
#Lags Used : 12
Number of Observations Used : 143
weak evidence against null hypothesis
time series has a unit root
indicating that it is non-stationary
A seasonal difference alone wasn’t enough. We fail to reject the Null.
We can see from the plot that the variance changes over time.
SEASONAL FIRST DIFFERENCE
We can also try a seasonal difference on the first difference.
> df['Seasonal First Difference'] = df['Milk First Difference'] - \
+ df['Milk First Difference'].shift(12)
+ adf_check(df['Seasonal First Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -5.038002274921984
p-value : 1.8654234318788342e-05
#Lags Used : 11
Number of Observations Used : 143
strong evidence against the null hypothesis
reject the null hypothesis
Data has no unit root and is stationary
The p-value is very low (reject Null). The time series is stationary. It also seems to be an improvement over the first difference alone.
ACF
An autocorrelation plot (also known as a correlogram) shows the correlation of the series with itself at increasing lags: the y-axis is the correlation and the x-axis is the number of time units of lag.
ACF - Seasonal First Difference
> from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
+ fig_seasonal_first = plot_acf(
+ df["Seasonal First Difference"].dropna(), lags=50);
+ plt.show()
PACF
In general, a partial correlation is a conditional correlation.
It is the correlation between two variables under the assumption that we know and take into account the values of some other set of variables.
For instance, consider a regression context in which \(y\) = response variable and \(x_1\), \(x_2\), and \(x_3\) are predictor variables. The partial correlation between \(y\) and \(x_3\) is the correlation between the variables determined taking into account how both \(y\) and \(x_3\) are related to \(x_1\) and \(x_2\).
Formally, this relationship is defined as:
\[\frac{\text{Covariance}(y, x_3|x_1, x_2)}{\sqrt{\text{Variance}(y|x_1, x_2)\text{Variance}(x_3| x_1, x_2)}}\]
PACF Seasonal First Difference
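The plot is not reproduced here; a minimal sketch of the PACF plot for this series:
> fig_pacf = plot_pacf(
+ df['Seasonal First Difference'].dropna(), lags=50);
+ plt.show()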
ACF and PACF Plots Combined
> fig = plt.figure(figsize=(12,8))
+ ax1 = fig.add_subplot(211)
+ fig = sm.graphics.tsa.plot_acf(df['Seasonal First Difference'].iloc[13:], lags=40, ax=ax1)
+ ax2 = fig.add_subplot(212)
+ fig = sm.graphics.tsa.plot_pacf(df['Seasonal First Difference'].iloc[13:], lags=40, ax=ax2)
+ plt.show()
Behavior of the ACF and PACF for ARMA models
| Plot | AR(p) | MA(q) | ARMA(p,q) |
|---|---|---|---|
| ACF | Tails off | Cuts off after lag q | Tails off |
| PACF | Cuts off after lag p | Tails off | Tails off |
We need to evaluate two components:
Seasonal
Non-Seasonal
\[SARIMA(p,d,q) \times (P,D,Q)_s\]
p,d,q parameters
p - the number of autoregressive (AR) terms
d - the order of nonseasonal differencing
q - the number of moving-average (MA) terms
P, D, and Q are the seasonal counterparts, and s is the length of the season (12 for monthly data).
Our Model
Based on the differencing required (d=1, D=1, s=12) and the ACF/PACF plots above, we start with:
\[SARIMA(0,1,0) \times (0,1,1)_{12}\]
> import warnings
+ warnings.filterwarnings("ignore")
+ model = sm.tsa.statespace.SARIMAX(df['Milk in pounds per cow'],
+ order=(0,1,0), seasonal_order=(0,1,1,12))
+ results = model.fit()
+ print(results.summary())
Statespace Model Results
==========================================================================================
Dep. Variable: Milk in pounds per cow No. Observations: 168
Model: SARIMAX(0, 1, 0)x(0, 1, 1, 12) Log Likelihood -534.140
Date: Fri, 11 Dec 2020 AIC 1072.280
Time: 16:40:01 BIC 1078.367
Sample: 01-01-1962 HQIC 1074.752
- 12-01-1975
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ma.S.L12 -0.6110 0.066 -9.285 0.000 -0.740 -0.482
sigma2 55.5966 5.322 10.446 0.000 45.166 66.028
===================================================================================
Ljung-Box (Q): 32.42 Jarque-Bera (JB): 31.77
Prob(Q): 0.80 Prob(JB): 0.00
Heteroskedasticity (H): 0.70 Skew: 0.78
Prob(H) (two-sided): 0.19 Kurtosis: 4.59
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
\[SARIMA(1,1,1) \times (0,1,1)_{12}\]
> model = sm.tsa.statespace.SARIMAX(df['Milk in pounds per cow'],
+ order=(1,1,1), seasonal_order=(0,1,1,12))
+ results = model.fit()
+ print(results.summary())
Statespace Model Results
==========================================================================================
Dep. Variable: Milk in pounds per cow No. Observations: 168
Model: SARIMAX(1, 1, 1)x(0, 1, 1, 12) Log Likelihood -530.032
Date: Fri, 11 Dec 2020 AIC 1068.064
Time: 16:40:01 BIC 1080.238
Sample: 01-01-1962 HQIC 1073.009
- 12-01-1975
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.1346 0.334 -0.403 0.687 -0.790 0.520
ma.L1 -0.0969 0.334 -0.290 0.772 -0.752 0.558
ma.S.L12 -0.6204 0.070 -8.827 0.000 -0.758 -0.483
sigma2 52.6316 4.939 10.656 0.000 42.951 62.312
===================================================================================
Ljung-Box (Q): 21.48 Jarque-Bera (JB): 34.44
Prob(Q): 0.99 Prob(JB): 0.00
Heteroskedasticity (H): 0.83 Skew: 0.74
Prob(H) (two-sided): 0.51 Kurtosis: 4.78
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The second model, with non-seasonal ARMA(1,1) terms, has the lower AIC (1068.1 vs. 1072.3), and lower AIC is preferred; note, however, that its added AR and MA coefficients are not individually significant.
We can evaluate the model's performance by examining its residuals.
Most of the residuals are small and centered around zero, so the model seems adequate.
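A minimal residual-diagnostics sketch (results.resid holds the one-step-ahead residuals):
> results.resid.plot(figsize=(12, 4));
+ plt.show()
+ results.resid.plot(kind='kde');
+ plt.show()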
First, we can predict values that we already know:
> df['forecast'] = results.predict(start=150, end=168,
+ dynamic=True)
+ df[['Milk in pounds per cow','forecast']].plot(figsize=(12,8));
+ plt.show()
FORECASTING
We need to create additional time periods.
> from pandas.tseries.offsets import DateOffset
+ future_dates = [df.index[-1] + DateOffset(months=x)
+ for x in range(1, 24)]
+ future_dates
[Timestamp('1976-01-01 00:00:00'), Timestamp('1976-02-01 00:00:00'),
 Timestamp('1976-03-01 00:00:00'), Timestamp('1976-04-01 00:00:00'),
 Timestamp('1976-05-01 00:00:00'), Timestamp('1976-06-01 00:00:00'),
 Timestamp('1976-07-01 00:00:00'), Timestamp('1976-08-01 00:00:00'),
 Timestamp('1976-09-01 00:00:00'), Timestamp('1976-10-01 00:00:00'),
 Timestamp('1976-11-01 00:00:00'), Timestamp('1976-12-01 00:00:00'),
 Timestamp('1977-01-01 00:00:00'), Timestamp('1977-02-01 00:00:00'),
 Timestamp('1977-03-01 00:00:00'), Timestamp('1977-04-01 00:00:00'),
 Timestamp('1977-05-01 00:00:00'), Timestamp('1977-06-01 00:00:00'),
 Timestamp('1977-07-01 00:00:00'), Timestamp('1977-08-01 00:00:00'),
 Timestamp('1977-09-01 00:00:00'), Timestamp('1977-10-01 00:00:00'),
 Timestamp('1977-11-01 00:00:00')]
> # empty data frame with new dates
+ future_dates_df = pd.DataFrame(index=future_dates,columns=df.columns)
+ future_dates_df.head()
Milk in pounds per cow ... forecast
1976-01-01 NaN ... NaN
1976-02-01 NaN ... NaN
1976-03-01 NaN ... NaN
1976-04-01 NaN ... NaN
1976-05-01 NaN ... NaN
[5 rows x 6 columns]
> # combine with the existing data frame
+ future_df = pd.concat([df,future_dates_df])
+ future_df.head()
Milk in pounds per cow ... forecast
1962-01-01 589.0 ... NaN
1962-02-01 561.0 ... NaN
1962-03-01 640.0 ... NaN
1962-04-01 656.0 ... NaN
1962-05-01 727.0 ... NaN
[5 rows x 6 columns]
> future_df['forecast'] = results.predict(start=168,
+ end=190, dynamic=True)
+ future_df[['Milk in pounds per cow', 'forecast']]. \
+ plot(figsize=(12, 8))
+ plt.show()