DATA624 - Project 1

Author

Anthony Josue Roman

Part A – ATM Forecast

The objective of this part is to forecast the daily cash withdrawals for four ATM machines for May 2010. The dataset used for this purpose comprises daily measurements of cash withdrawals in terms of hundreds of dollars. Data cleaning involved converting the dates, sorting the data, and separating the data for individual machines.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from utilsforecast.plotting import plot_series
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.holtwinters import ExponentialSmoothing

atm = pd.read_excel("ATM624Data.xlsx")

atm["DATE"] = pd.to_datetime(atm["DATE"])
atm = atm.dropna(subset=["ATM"])
atm = atm.sort_values(["ATM", "DATE"])

atm_dict = {
    name: group.set_index("DATE")["Cash"].asfreq("D").interpolate()
    for name, group in atm.groupby("ATM")
}

Explore Each ATM

The time series plots show that both ATM1 and ATM2 have clear seasonal patterns, with ATM2 showing larger variability. ATM3 is consistently at zero until the end, suggesting that not enough historical data is available for traditional modeling. ATM4 shows large values with significant variability, including an outlier that will need to be addressed prior to forecasting.

for name, series in atm_dict.items():
    plt.figure(figsize=(10, 4))
    plt.plot(series.index, series.values)
    plt.title(f"{name} Cash Withdrawals")
    plt.xlabel("Date")
    plt.ylabel("Cash (hundreds)")
    plt.tight_layout()
    plt.show()
    
for name, series in atm_dict.items():
    print(f"\n{name}")
    print(series.describe())


ATM1
count    365.000000
mean      84.154795
std       36.636262
min        1.000000
25%       73.000000
50%       91.000000
75%      108.000000
max      180.000000
Name: Cash, dtype: float64

ATM2
count    365.000000
mean      62.591781
std       38.805735
min        0.000000
25%       26.000000
50%       67.000000
75%       93.000000
max      147.000000
Name: Cash, dtype: float64

ATM3
count    365.000000
mean       0.720548
std        7.944778
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max       96.000000
Name: Cash, dtype: float64

ATM4
count      365.000000
mean       474.043345
std        650.935991
min          1.563260
25%        124.334415
50%        403.839336
75%        704.507042
max      10919.761638
Name: Cash, dtype: float64

def forecast_ets(series, steps=31, seasonal_periods=7):
    series = series.asfreq("D")

    model = ExponentialSmoothing(
        series,
        trend=None,
        seasonal="add",
        seasonal_periods=seasonal_periods,
        initialization_method="estimated"
    ).fit(optimized=True, use_brute=True)

    forecast = model.forecast(steps)
    return model, forecast

From the patterns observed, Holt-Winters Exponential Smoothing (ETS) with Additive Seasonality and a Weekly Period of 7 days was chosen. The chosen model did not include a trend component, as the data is primarily seasonal rather than trending.

atm_forecasts = {}
atm_models = {}

for atm_name in ["ATM1", "ATM2"]:
    series = atm_dict[atm_name]
    model, fc = forecast_ets(series, steps=31, seasonal_periods=7)
    atm_models[atm_name] = model
    atm_forecasts[atm_name] = fc

ETS models were used for the ATM1 and ATM2 series, as these time series exhibited regular weekly seasonality. This method enables the model to learn from the recurring patterns in the withdrawal amounts over time.

atm3 = atm_dict["ATM3"]

recent_nonzero = atm3[atm3 > 0].tail(3)

if len(recent_nonzero) > 0:
    atm3_value = recent_nonzero.mean()
else:
    atm3_value = 0

atm3_index = pd.date_range(start="2010-05-01", periods=31, freq="D")
atm3_fc = pd.Series(np.repeat(atm3_value, 31), index=atm3_index)

atm_forecasts["ATM3"] = atm3_fc

For the ATM3 time series, there wasn’t enough history to allow for a regular time series model, as most of the time, the withdrawal amounts were zero. A basic method is to average the last non-zero values and extrapolate them to generate the forecast.

atm4 = atm_dict["ATM4"]

threshold = atm4.mean() + 3 * atm4.std()
atm4_clean = atm4.clip(upper=threshold)

model4, fc4 = forecast_ets(atm4_clean, steps=31, seasonal_periods=7)
atm_models["ATM4"] = model4
atm_forecasts["ATM4"] = fc4

The data in ATM4 had an extreme outlier that could cause instability in the model estimation process. To correct this, the data was censored at a certain level before the ETS model was applied. This ensured the model was more stable and gave a better forecast.

for atm_name in ["ATM1", "ATM2", "ATM3", "ATM4"]:
    plt.figure(figsize=(10, 4))

    if atm_name == "ATM4":
        history = atm4_clean
    else:
        history = atm_dict[atm_name]

    plt.plot(history.index, history.values, label="History")
    plt.plot(atm_forecasts[atm_name].index, atm_forecasts[atm_name].values, label="Forecast")
    plt.title(f"{atm_name} Forecast")
    plt.xlabel("Date")
    plt.ylabel("Cash (hundreds)")
    plt.legend()
    plt.tight_layout()
    plt.show()

The forecasts for each ATM follow the general pattern of the data. ATM1 and ATM2 retain their weekly pattern, ATM3 reflects the current level of activation, and ATM4 stabilizes the data before providing the forecast. All the models reflect the underlying data and provide a good forecast for May 2010.

output_frames = []

for atm_name, fc in atm_forecasts.items():
    temp = pd.DataFrame({
        "Date": fc.index,
        "ATM": atm_name,
        "Forecast": fc.values
    })
    output_frames.append(temp)

atm_output = pd.concat(output_frames, ignore_index=True)

atm_output = atm_output.sort_values(["ATM", "Date"]).reset_index(drop=True)

print(atm_output.head(10))
print(atm_output.tail(10))

        Date   ATM    Forecast
0 2010-05-01  ATM1   86.899964
1 2010-05-02  ATM1  100.609325
2 2010-05-03  ATM1   72.976075
3 2010-05-04  ATM1    5.578975
4 2010-05-05  ATM1  100.014529
5 2010-05-06  ATM1   79.311701
6 2010-05-07  ATM1   85.728771
7 2010-05-08  ATM1   86.899964
8 2010-05-09  ATM1  100.609325
9 2010-05-10  ATM1   72.976075
          Date   ATM    Forecast
114 2010-05-22  ATM4  476.677394
115 2010-05-23  ATM4  530.662898
116 2010-05-24  ATM4  477.210890
117 2010-05-25  ATM4  438.666381
118 2010-05-26  ATM4  429.784305
119 2010-05-27  ATM4  223.285794
120 2010-05-28  ATM4  580.712134
121 2010-05-29  ATM4  476.677394
122 2010-05-30  ATM4  530.662898
123 2010-05-31  ATM4  477.210890

The final forecasts for all four ATMs were combined into a single data set, and this data set was exported as an Excel file. This data set includes daily forecasts for each ATM for May 2010, and it meets the project requirement.

atm_output.to_excel("ATM_May2010_Forecast.xlsx", index=False)

print("\nSaved forecast file: ATM_May2010_Forecast.xlsx")


Saved forecast file: ATM_May2010_Forecast.xlsx

The forecasting approach for this problem is based on the behavior of each ATM. For ATM1, ATM2, and ATM4, a seasonal ETS model has been applied to take into account their weekly behavior. For ATM3, a simple average has been applied, as there is little activity. This approach will ensure realistic forecasts for May 2010 based on the underlying behavior of each ATM.

Part B – Forecasting Power

The aim of this part is to develop a residential electric power demand model and generate a forecast for 2014. The data set includes monthly data on electricity use (measured in kilowatt-hours) for a period of 16 years, starting from January 1998 and ending in December 2013.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

power = pd.read_excel("ResidentialCustomerForecastLoad-624.xlsx")

power.columns = [c.strip() for c in power.columns]

power["Date"] = pd.to_datetime(power["YYYY-MMM"], format="%Y-%b")
power = power.sort_values("Date")
power = power.set_index("Date")

power["KWH"] = pd.to_numeric(power["KWH"], errors="coerce")
power = power.dropna(subset=["KWH"])

series = power["KWH"].asfreq("MS").dropna()

plt.figure(figsize=(10,4))
plt.plot(series)
plt.title("Residential Power Consumption")
plt.xlabel("Date")
plt.ylabel("KWH")
plt.tight_layout()
plt.show()

From the time plot, there is a noticeable trend in addition to strong seasonality. This implies that a good model should include yearly seasonality.

decomp = seasonal_decompose(series, model="additive", period=12)
decomp.plot()
plt.show()

From the data decomposition plot, there is a strong yearly seasonality and a steady upward trend in electricity use over time.

train = series[: "2012-12-01"]
test  = series["2013-01-01":]

seasonal_naive = train[-12:].values
snaive_fc = np.tile(seasonal_naive, int(np.ceil(len(test) / 12)))[:len(test)]

ets_model = ExponentialSmoothing(
    train.astype(float).values,
    trend=None,
    seasonal="add",
    seasonal_periods=12,
    initialization_method="heuristic"
).fit(optimized=True, use_brute=True)

ets_fc = pd.Series(ets_model.forecast(len(test)), index=test.index)

arima_model = ARIMA(train, order=(1,1,1), seasonal_order=(1,1,1,12)).fit()
arima_fc = arima_model.forecast(len(test))

def rmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = ~np.isnan(y_true) & ~np.isnan(y_pred)
    return np.sqrt(mean_squared_error(y_true[mask], y_pred[mask]))

print("Seasonal Naive RMSE:", rmse(test, snaive_fc))
print("ETS RMSE:", rmse(test, ets_fc))
print("ARIMA RMSE:", rmse(test, arima_fc))

Seasonal Naive RMSE: 1035537.6062066747
ETS RMSE: 1056228.0053759678
ARIMA RMSE: 1046465.6731312411

The performance of the models was also evaluated based on their RMSE on a holdout sample. The best-performing model, based on RMSE, was a naive approach to seasonality, and this was therefore selected as the best model for use.

seasonal_pattern = series[-12:].values

power_fc_2014 = pd.Series(
    seasonal_pattern,
    index=pd.date_range(start="2014-01-01", periods=12, freq="MS")
)

plt.figure(figsize=(10,4))
plt.plot(series, label="History")
plt.plot(power_fc_2014, label="Forecast")
plt.title("Power Forecast for 2014")
plt.xlabel("Date")
plt.ylabel("KWH")
plt.legend()
plt.tight_layout()
plt.show()

The forecast for 2014 follows the yearly seasonality pattern observed in the data. This implies that the naive approach to seasonality is a good model for predicting future data, as it reflects recent usage patterns.

output = pd.DataFrame({
    "Date": power_fc_2014.index,
    "Forecast_KWH": power_fc_2014.values
})

output.to_excel("Power_2014_Forecast.xlsx", index=False)

print("Saved Power_2014_Forecast.xlsx")

Saved Power_2014_Forecast.xlsx

The power consumption values showed a high level of yearly seasonality and a steady increasing trend. Although ETS and ARIMA models were also considered, the results showed that the seasonal naive model resulted in the lowest value of RMSE and hence was chosen as the final model. This suggests that recent seasonal patterns are the best predictor of future values.

Part C - Bonus: Waterflow Forecasting

The purpose of this part is to align the two waterflow data sets to a common time base, check if they are stationary, and check if it is appropriate to forecast them for one week. For consistency, both data sets were aggregated to an hourly time base using the average water flow per hour.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.holtwinters import ExponentialSmoothing

pipe1 = pd.read_excel("Waterflow_Pipe1.xlsx")
pipe2 = pd.read_excel("Waterflow_Pipe2.xlsx")

pipe1["Date Time"] = pd.to_datetime(pipe1["Date Time"])
pipe2["Date Time"] = pd.to_datetime(pipe2["Date Time"])

pipe1 = pipe1.sort_values("Date Time")
pipe2 = pipe2.sort_values("Date Time")

pipe1_hourly = (
    pipe1.set_index("Date Time")
         .resample("h")["WaterFlow"]
         .mean()
         .dropna()
)

pipe2_hourly = (
    pipe2.set_index("Date Time")
         .resample("h")["WaterFlow"]
         .mean()
         .dropna()
)

The first data set has irregular time stamps, so it is necessary to aggregate it at an hourly level. The second data set is already close to an hourly level, though the same aggregation method is used for consistency.

import matplotlib.dates as mdates

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(pipe1_hourly)

ax.set_title("Pipe 1 Hourly Water Flow")
ax.set_xlabel("Date Time")
ax.set_ylabel("WaterFlow")

ax.xaxis.set_major_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m-%d'))

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


fig, ax = plt.subplots(figsize=(10,4))
ax.plot(pipe2_hourly)

ax.set_title("Pipe 2 Hourly Water Flow")
ax.set_xlabel("Date Time")
ax.set_ylabel("WaterFlow")

ax.xaxis.set_major_locator(mdates.DayLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m-%d'))

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Stationarity Tests

def stationarity_tests(series, name):
    adf_result = adfuller(series.dropna())
    kpss_result = kpss(series.dropna(), regression="c", nlags="auto")
    
    print(f"{name}")
    print(f"ADF p-value: {adf_result[1]:.6f}")
    print(f"KPSS p-value: {kpss_result[1]:.6f}")
    print("-" * 40)

stationarity_tests(pipe1_hourly, "Pipe 1")
stationarity_tests(pipe2_hourly, "Pipe 2")

Pipe 1
ADF p-value: 0.000000
KPSS p-value: 0.100000
----------------------------------------
Pipe 2
ADF p-value: 0.000000
KPSS p-value: 0.100000
----------------------------------------

From the stationarity tests carried out on the data, it is evident that both waterflow data series are stationary processes. This is because the Augmented Dickey-Fuller test showed that the data series had p-values close to zero, while the KPSS test showed that the data series had high p-values, rejecting the null hypothesis that they were non-stationary processes. This proves that both Pipe 1 and Pipe 2 have stable statistical properties and can be used in forecasting.

Forecasting

This is equivalent to 168 observations at an hourly interval.

pipe1_model = ExponentialSmoothing(
    pipe1_hourly.astype(float).values,
    trend=None,
    seasonal="add",
    seasonal_periods=24,
    initialization_method="heuristic"
).fit(optimized=True, use_brute=True)

pipe1_fc = pd.Series(
    pipe1_model.forecast(168),
    index=pd.date_range(start=pipe1_hourly.index[-1] + pd.Timedelta(hours=1),
                        periods=168,
                        freq="h")
)

pipe2_model = ExponentialSmoothing(
    pipe2_hourly.astype(float).values,
    trend=None,
    seasonal="add",
    seasonal_periods=24,
    initialization_method="heuristic"
).fit(optimized=True, use_brute=True)

pipe2_fc = pd.Series(
    pipe2_model.forecast(168),
    index=pd.date_range(start=pipe2_hourly.index[-1] + pd.Timedelta(hours=1),
                        periods=168,
                        freq="h")
)

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(pipe1_hourly, label="History")
ax.plot(pipe1_fc, label="Forecast")

ax.set_title("Pipe 1 One-Week Forecast")
ax.set_xlabel("Date Time")
ax.set_ylabel("WaterFlow")
ax.legend()

ax.xaxis.set_major_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m-%d'))

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


fig, ax = plt.subplots(figsize=(10,4))
ax.plot(pipe2_hourly, label="History")
ax.plot(pipe2_fc, label="Forecast")

ax.set_title("Pipe 2 One-Week Forecast")
ax.set_xlabel("Date Time")
ax.set_ylabel("WaterFlow")
ax.legend()

ax.xaxis.set_major_locator(mdates.DayLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m-%d'))

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The forecasts extend the hourly waterflow series by one week. A seasonal period of 24 is used to allow for possible repetition in the hourly flow data.

pipe1_output = pd.DataFrame({
    "Date Time": pipe1_fc.index,
    "Pipe": "Pipe1",
    "Forecast_WaterFlow": pipe1_fc.values
})

pipe2_output = pd.DataFrame({
    "Date Time": pipe2_fc.index,
    "Pipe": "Pipe2",
    "Forecast_WaterFlow": pipe2_fc.values
})

waterflow_output = pd.concat([pipe1_output, pipe2_output], ignore_index=True)

waterflow_output.to_excel("Waterflow_OneWeek_Forecast.xlsx", index=False)

print("Saved Waterflow_OneWeek_Forecast.xlsx")

Saved Waterflow_OneWeek_Forecast.xlsx

Both the hourly water flow data for the two pipes are aggregated and analyzed for stationarity. The results from the ADF and KPSS tests confirm that the data are stationary, indicating suitability for time series forecasting. A one-week forecast is performed, and the results confirm stable data, following a similar trend to the recent history, with no apparent trend or change in the data.