Other Projects
Electricity demand forecasting with R: https://rpubs.com/alrickcampbell/energy_demand_forecasting_R
Visualizing electricity demand with Python: https://rpubs.com/alrickcampbell/energy_demand_visuals
To predict electricity demand, traditional econometric approaches and statistical models can be limited in their application when it comes to big data. The electricity industry is complex, fast-changing, and generates data from various sources and in different formats. Smart meters along with other grid and customer devices are now able to collect large volumes of data at high frequency.
By using more complex mathematical tools, we can use big data to create more highly accurate forecasting models. Big data analysis gives us a richer understanding of electricity demand patterns associated with different events and technologies, and allows independent system operators to better manage demand and optimize delivery of electricity.
This evolving energy landscape also brings with it numerous challenges to both utilities and regulators. Utilities will need to predict an economy’s long-term electricity demand needs to determine what capacity is needed for future energy generation and regulators need new tools to better understand energy markets, the energy revolution taking place, and the industry’s long-term prospects.
The goal of this project is to predict residential electricity demand for the province of Ontario in Canada by utilizing hourly time series data and weather patterns from January 1, 2003 to December 31, 2016. To achieve this, I apply a range of supervised machine learning algorithms, including classical linear regression (CLM), generalized linear regression (GLM), Gradient Boosting (GB), and Random Forest (RF).
Ontario is the most heavily populated of all the Canadian provinces and territories, with around 14 million people, or 38% of the country’s population, residing in an area of over 1 million square kilometers. It is located in Eastern/Central Canada and experiences seasonal and regional climate variations, with average temperatures ranging from -26°C in winter to 28°C in summer (Government of Ontario, 2023).
Canada’s primary energy production is sourced from various sources, with natural gas and crude oil accounting for approximately 82% of the total. Of the secondary energy consumed in the country, the largest portion is natural gas (31%), followed by electricity (20%). The industrial sector is the largest consumer of electricity in Canada, accounting for approximately 40% of total use (Canada Energy Regulator, 2021).
Though Quebec accounts for the majority of electricity consumption in Canada (37.5%), Ontario is the second largest user (25.6%) (National Energy Board, 2021). In 2019, Ontario largely discontinued the production of electricity from coal plants and was producing the majority of its electricity from zero-carbon sources, such as nuclear (59%), hydro (24%), wind (8%), and solar (1%) (Canada Energy Regulator, 2021). The proper management and forecasting of energy demand is crucial in ensuring the availability and reliability of sufficient generating capacity in the future. Understanding the energy needs of Ontario, Canada’s most populous province, is critical for understanding the energy needs of the entire country.
The raw data used for this exercise can be found at https://ssc.ca/en/case-study/predicting-hourly-electricity-demand-ontario. The data set provided consists of hourly electricity demand and various weather variables. Some of these variables include:
There are 122,736 observations at hourly intervals running from January 1, 2003 December 31, 2016.
To begin, I first install Reticulate in R which will then allow me to run Python in the R environment through a virtual environment that was created.
# ```{r setup}
library(reticulate)
py_list_packages(
envname = NULL,
type = c("auto", "virtualenv", "conda"),
python = NULL
)
# py_install("cython")
# py_install("seaborn")
# py_install("sktime")
# py_install("pandas")
# py_install("xlwt")
# py_install("statsmodels")
# py_install("matplotlib")
# py_install("pillow")
# py_install("scikit-learn")
# py_install("numpy")
# py_install("openpyxl")
# py_install("skforecast")
# py_install("pytz")The list of Python modules I need for this exercise are provided below.
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from datetime import datetime, timedelta
from pytz import timezone
import openpyxl
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries
from skforecast.model_selection_multiseries import grid_search_forecaster_multiseries
from sklearn import model_selection
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
import statsmodels.api as sm
from sklearn import metrics
from scipy.stats import pearsonr
from scipy import stats
from sktime.forecasting.naive import NaiveForecasterI then read in the energy demand data, take a look at its dimension, check variable type and then view the first five observations.
df = pd.read_excel(r'SSC2020_hourly_demand.xlsx', sheet_name='Hourly Demand')
# Rename the columns date column for for merge
df = df.rename(columns = {'Total Energy Use from Electricity (MW)': 'MW'}, inplace = False)
len(df.index)## 122736
print(df)## Date Hour MW Year Month
## 0 2003-01-01 1 14745 2003 1
## 1 2003-01-01 2 14280 2003 1
## 2 2003-01-01 3 13821 2003 1
## 3 2003-01-01 4 13239 2003 1
## 4 2003-01-01 5 13236 2003 1
## ... ... ... ... ... ...
## 122731 2016-12-31 20 16260 2016 12
## 122732 2016-12-31 21 15658 2016 12
## 122733 2016-12-31 22 15195 2016 12
## 122734 2016-12-31 23 14758 2016 12
## 122735 2016-12-31 24 14153 2016 12
##
## [122736 rows x 5 columns]
n = df.shape[0]
df.dtypes## Date datetime64[ns]
## Hour int64
## MW int64
## Year int64
## Month int64
## dtype: object
There are 122736 observations. Upon first observation, the ‘Date’ variable does not appear to be in the correct date format so we will need to fix that by combining the ‘Date’ and ‘Hour’ variables. We will also shift back the hour by 1 so that the ‘Date is in the correct 24-hour format. Our new variable is ’datetime’ and we can now see that it is in the required ‘%Y-%m-%d %H:%M:%S’ time format. Take a look at our ‘datetime’ variable and we can see that the format is exactly what we want.
df['ymd'] = df.Date
#df['ymd'] = df['Date'].dt.strftime('%Y-%m-%d')
df['Hour'] = df['Hour'] - 1
df['Hour'] = pd.to_datetime(df.Hour, unit='h').dt.strftime('%H:%M:%S')
df.dtypes## Date datetime64[ns]
## Hour object
## MW int64
## Year int64
## Month int64
## ymd datetime64[ns]
## dtype: object
df['datetime'] = df.apply(lambda row: pd.to_datetime(row['ymd']) + pd.to_timedelta(row['Hour']), axis = 1)
df.dtypes## Date datetime64[ns]
## Hour object
## MW int64
## Year int64
## Month int64
## ymd datetime64[ns]
## datetime datetime64[ns]
## dtype: object
Since we do not have a lot of variables we are going to use the ‘datetime’ variable to create some additional variables that we may need at a later point. I created three additional variables to capture the time of day, day of the week, whether the day of the week represents a weekday or a weekend. A variable that represents the different seasons in the year was also created. After doing this we check for missing values or potential outliers. Given how the minimum and the maximum values compare to the first and third quartile of MW, there may be some concerns there, but we will need to look at this graphically a bit later on. Since we have no missing data and we now have more variables, I provide a snapshot of the new dataset.
# return the day of week
df['day_num'] = df['ymd'].dt.weekday
df['day_of_week'] = df['ymd'].dt.strftime('%A')
print (df['day_of_week'].unique())
## ['Wednesday' 'Thursday' 'Friday' 'Saturday' 'Sunday' 'Monday' 'Tuesday']
df.dtypes
## Date datetime64[ns]
## Hour object
## MW int64
## Year int64
## Month int64
## ymd datetime64[ns]
## datetime datetime64[ns]
## day_num int64
## day_of_week object
## dtype: object
df['day'] = pd.cut(df['day_num'],
bins=[-np.inf, 4, np.inf],
labels=['Weekday', 'Weekend'],include_lowest =True)
print (df['day'].unique())
## ['Weekday', 'Weekend']
## Categories (2, object): ['Weekday' < 'Weekend']
df['day_in_month'] = df['ymd'].dt.day
df['month'] = df['ymd'].dt.month
def seasons(ymd):
m = ymd.month
d = ymd.day
season=None
if (m==3 and d>=21) or m==4 or m==5 or (m==6 and d<=20):
season = 'spring'
elif (m==6 and d>=21 ) or m==7 or m==8 or (m==9 and d<=20):
season = 'summer'
elif (m==9 and d>=21 ) or m==10 or m==11 or (m==12 and d<=20):
season = 'autumn'
elif (m==12 and d>=21 ) or m==1 or m==2 or (m==3 and d<=20):
season = 'winter'
return season
df['season'] = df.apply(lambda x: seasons(x['ymd']), axis=1)
print (df['season'].unique())
## ['winter' 'spring' 'summer' 'autumn']
print(df['season'].value_counts()) # to print count of every category
# summary statistics
## spring 30912
## summer 30912
## autumn 30576
## winter 30336
## Name: season, dtype: int64
print (df.describe())
## MW Year ... day_in_month month
## count 122736.000000 122736.000000 ... 122736.000000 122736.000000
## mean 16561.766271 2009.500391 ... 15.730935 6.522487
## std 2610.524068 4.031533 ... 8.800676 3.448839
## min 2270.000000 2003.000000 ... 1.000000 1.000000
## 25% 14610.000000 2006.000000 ... 8.000000 4.000000
## 50% 16488.000000 2009.500000 ... 16.000000 7.000000
## 75% 18358.000000 2013.000000 ... 23.000000 10.000000
## max 27005.000000 2016.000000 ... 31.000000 12.000000
##
## [8 rows x 6 columns]
print(df)
## Date Hour MW Year ... day day_in_month month season
## 0 2003-01-01 00:00:00 14745 2003 ... Weekday 1 1 winter
## 1 2003-01-01 01:00:00 14280 2003 ... Weekday 1 1 winter
## 2 2003-01-01 02:00:00 13821 2003 ... Weekday 1 1 winter
## 3 2003-01-01 03:00:00 13239 2003 ... Weekday 1 1 winter
## 4 2003-01-01 04:00:00 13236 2003 ... Weekday 1 1 winter
## ... ... ... ... ... ... ... ... ... ...
## 122731 2016-12-31 19:00:00 16260 2016 ... Weekend 31 12 winter
## 122732 2016-12-31 20:00:00 15658 2016 ... Weekend 31 12 winter
## 122733 2016-12-31 21:00:00 15195 2016 ... Weekend 31 12 winter
## 122734 2016-12-31 22:00:00 14758 2016 ... Weekend 31 12 winter
## 122735 2016-12-31 23:00:00 14153 2016 ... Weekend 31 12 winter
##
## [122736 rows x 13 columns]Let us now take a look at the temperature dataset. Similar to the electricity data, we take a look at its dimension, examine the variable types, check for missing values and then take a look at a the first few observations.
# Weather Time Series
# xlsx files
wea_data = pd.read_excel(r'SSC2020_hourly_weather.xlsx', sheet_name='Hourly Weather')
# Convert Celsius to Kelvin
wea_data['temperature'] = 273.15 + wea_data['temperature']
# Rename the columns date column for for merge
wea_data = wea_data.rename(columns = {'time': 'datetime'}, inplace = False)
wea_data.dtypes## datetime datetime64[ns]
## precipitation float64
## temperature float64
## irradiance_surface float64
## irradiance_toa float64
## snowfall float64
## snow_depth float64
## cloud_cover float64
## air_density float64
## dtype: object
print(wea_data.isnull().sum())## datetime 0
## precipitation 0
## temperature 0
## irradiance_surface 0
## irradiance_toa 0
## snowfall 0
## snow_depth 0
## cloud_cover 0
## air_density 0
## dtype: int64
print(wea_data)## datetime precipitation ... cloud_cover air_density
## 0 2003-01-01 00:00:00 0.0100 ... 0.3196 1.2612
## 1 2003-01-01 01:00:00 0.0022 ... 0.3167 1.2644
## 2 2003-01-01 02:00:00 0.0014 ... 0.2958 1.2669
## 3 2003-01-01 03:00:00 0.0013 ... 0.3745 1.2695
## 4 2003-01-01 04:00:00 0.0011 ... 0.6073 1.2724
## ... ... ... ... ... ...
## 122731 2016-12-31 19:00:00 0.1642 ... 0.9761 1.2408
## 122732 2016-12-31 20:00:00 0.1857 ... 0.9729 1.2407
## 122733 2016-12-31 21:00:00 0.2322 ... 0.9628 1.2417
## 122734 2016-12-31 22:00:00 0.2462 ... 0.9353 1.2430
## 122735 2016-12-31 23:00:00 0.2163 ... 0.8799 1.2441
##
## [122736 rows x 9 columns]
Everything looks fine so far and both datasets appear to have the sample size but we will need to now merge the data using the `datetime’ variable.
# Merge data
data = pd.merge(df, wea_data, how="right", on="datetime")
print(data.isnull().sum())## Date 0
## Hour 0
## MW 0
## Year 0
## Month 0
## ymd 0
## datetime 0
## day_num 0
## day_of_week 0
## day 0
## day_in_month 0
## month 0
## season 0
## precipitation 0
## temperature 0
## irradiance_surface 0
## irradiance_toa 0
## snowfall 0
## snow_depth 0
## cloud_cover 0
## air_density 0
## dtype: int64
data.loc[data['temperature'].isnull()]## Empty DataFrame
## Columns: [Date, Hour, MW, Year, Month, ymd, datetime, day_num, day_of_week, day, day_in_month, month, season, precipitation, temperature, irradiance_surface, irradiance_toa, snowfall, snow_depth, cloud_cover, air_density]
## Index: []
print(data)## Date Hour MW ... snow_depth cloud_cover air_density
## 0 2003-01-01 00:00:00 14745 ... 17.4309 0.3196 1.2612
## 1 2003-01-01 01:00:00 14280 ... 17.4307 0.3167 1.2644
## 2 2003-01-01 02:00:00 13821 ... 17.4304 0.2958 1.2669
## 3 2003-01-01 03:00:00 13239 ... 17.4302 0.3745 1.2695
## 4 2003-01-01 04:00:00 13236 ... 17.4300 0.6073 1.2724
## ... ... ... ... ... ... ... ...
## 122731 2016-12-31 19:00:00 16260 ... 16.9569 0.9761 1.2408
## 122732 2016-12-31 20:00:00 15658 ... 16.9597 0.9729 1.2407
## 122733 2016-12-31 21:00:00 15195 ... 16.9680 0.9628 1.2417
## 122734 2016-12-31 22:00:00 14758 ... 16.9865 0.9353 1.2430
## 122735 2016-12-31 23:00:00 14153 ... 17.0087 0.8799 1.2441
##
## [122736 rows x 21 columns]
data1 = data[["datetime","MW","temperature","Hour","day_num","season"]]
data1.head(5)## datetime MW temperature Hour day_num season
## 0 2003-01-01 00:00:00 14745 271.430 00:00:00 2 winter
## 1 2003-01-01 01:00:00 14280 271.105 01:00:00 2 winter
## 2 2003-01-01 02:00:00 13821 270.748 02:00:00 2 winter
## 3 2003-01-01 03:00:00 13239 270.418 03:00:00 2 winter
## 4 2003-01-01 04:00:00 13236 270.067 04:00:00 2 winter
start_date = '2004-01-01 00:00:00'
split_date = '2015-12-31 23:00:00'
data2 = data1.loc[data1['datetime'] >= start_date]
data3 = pd.get_dummies(data=data2,columns=['Hour','day_num','season'],drop_first=True)Since everything looks good with our merged dataset, we remind ourselves that the electricity data series may have had some outliers in it. We check for minimum values below which somewhat confirms our suspicion. Unusually small outliers appear in the 2003 data. We will look at this graphically in the next section.
I start the exploratory data analysis by focusing on the target variable, hourly household electricity demand in MW. Figure 3.1 shows the evolution of household electricity demand in Ontario from 2003 to 2016. In panel (a), the entire dataset is included, while panel (b) covers the period 2004 to 2016. The x-axis of both panels shows the time period, while the y-axis represents the hourly electricity demand in megawatts (MW).
In both panels, the graph exhibits a clear seasonal pattern with peaks occurring during the summer and winter months and troughs during the spring and fall months. The demand for electricity is highest in the summer months, specifically in July and August, which may be attributed to increased use of air conditioning.
data1 = data[["datetime","MW","temperature","Hour","day_num","season"]]
data1.head(5)## datetime MW temperature Hour day_num season
## 0 2003-01-01 00:00:00 14745 271.430 00:00:00 2 winter
## 1 2003-01-01 01:00:00 14280 271.105 01:00:00 2 winter
## 2 2003-01-01 02:00:00 13821 270.748 02:00:00 2 winter
## 3 2003-01-01 03:00:00 13239 270.418 03:00:00 2 winter
## 4 2003-01-01 04:00:00 13236 270.067 04:00:00 2 winter
start_date = '2004-01-01 00:00:00'
data2 = data1.loc[data1['datetime'] >= start_date]
data3 = pd.get_dummies(data=data2,columns=['Hour','day_num','season'],drop_first=True)
# Plot time series
fig, axs = plt.subplots(2, sharex=True)
# fig.suptitle('subplots')
axs[0].set_title('(a)')
axs[0].plot(data1['datetime'], data1['MW'], 'tab:orange')
axs[1].set_title('(b)')
axs[1].plot(data2['datetime'], data2['MW'], 'tab:red')
for ax in axs.flat:
ax.set(xlabel='Period', ylabel='MW')
# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
ax.label_outer()
plt.show()Figure 3.1: The evolution of household electricity demand (MW) in Ontario. Panel (a) includes the entire dataset: 2003-2016. Panel (b) covers the period 2004-2016.
To see the seasonal pattern in electricity demand more clearly, we present the hourly variation in household electricity demand in Ontario for each day of 2016 in Figure 3.2. From the graph, we can observe that electricity demand is generally higher in the winter months (December to February) and the summer months (June to August), and lower in the spring (March to May) and fall (September to November). This is likely due to the increased use of heating and cooling systems during the extreme temperature months.
The peak demand is observed during the summer months (June to August), which could be due to the use of air conditioning systems to cope with the high temperatures. In contrast, the lowest demand is observed in the spring (March to May), when the temperatures are mild, and the use of heating and cooling systems is minimal.
# Plot time series for 2016 only
df_2016 = data.loc[data['datetime'] > split_date]
plt.subplots(nrows=1, ncols=1, sharex=False)
## (<Figure size 1400x1200 with 1 Axes>, <AxesSubplot: >)
plt.plot(df_2016['datetime'], df_2016['MW'], 'tab:red')
plt.xlabel("time"), plt.ylabel("MW")
## (Text(0.5, 0, 'time'), Text(0, 0.5, 'MW'))
plt.tight_layout()
plt.show()Figure 3.2: Household electricity demand (MW) in 2016.
Figure 3.3 compares the electricity demand on weekdays (Monday to Friday) and weekends (Saturday and Sunday) throughout the year in 2016. The x-axis shows the days of the year, while the y-axis shows the electricity demand in MW.
The graph shows a clear difference between the electricity demand on weekdays and weekends. The demand is higher on weekdays than on weekends for most of the year, with a few exceptions. During the summer months of June, July, and August, the demand on weekends is higher than on weekdays, which is likely due to increased use of air conditioning units in homes during the hot weather.
fig, ax = plt.subplots(figsize=(8,6))
df_2016.set_index('datetime', inplace=True)
df_2016.groupby('day')['MW'].plot(ax=ax)
## day
## Weekday AxesSubplot(0.125,0.2;0.775x0.68)
## Weekend AxesSubplot(0.125,0.2;0.775x0.68)
## Name: MW, dtype: object
ax.legend(frameon=False)
plt.xlabel("Date")
plt.ylabel("MW")
plt.show()Figure 3.3: Weekday vs weekend electricity demand (MW) in 2016.
The side-by-side boxplot in Figure 3.4 shows a comparison of the distribution of hourly electricity demand in megawatts (MW) between weekdays and weekends in Ontario during the year 2016. The left boxplot represents the distribution of hourly electricity demand on weekdays, while the right boxplot represents the distribution on weekends.
The boxplots show that the median demand on weekdays is higher than on weekends. The interquartile range (IQR) for both weekdays and weekends is similar, with the demand being more spread out on weekdays than on weekends. The upper whisker on weekdays is longer than on weekends, indicating that the demand on weekdays can be much higher than on weekends.
The weekdays boxplot also has a larger number of outliers, with the demand on some hours being significantly higher than on other hours. This is especially evident during the morning and evening peak hours, where the demand is much higher on weekdays than on weekends. On the other hand, the weekends boxplot has fewer outliers, indicating that the demand is relatively consistent throughout the day.
# Create a boxplot by category
ax = df_2016.boxplot(by='day', column='MW', figsize=(10, 6))
plt.suptitle('')
ax.set_title('')
plt.xlabel("Part of week")
plt.ylabel("MW")
# Show the plot
plt.show()Figure 3.4: Boxplot comparison of weekday and weekend electricity demand (MW) in 2016.
The boxplot in Figure 3.5 show the variation in the distribution of electricity demand at different times of the day across different seasons. The x-axis represents the time of day, and the y-axis represents the electricity demand in MW. The boxplots in the figure are grouped by season: winter (Dec-Feb), spring (Mar-May), summer (Jun-Aug), and fall (Sep-Nov).
The boxplots show that the median demand for electricity is generally higher in the morning and evening hours, with a dip in demand during the afternoon hours. Additionally, the demand for electricity is generally higher during the winter and summer seasons compared to spring and fall.
For example, in the winter season, the demand for electricity is highest at 7 pm, and the demand is lowest at around 2 am. The demand for electricity during the summer season is generally higher than during the winter season, with the highest demand occurring at 5 pm.
The boxplots also show that there is more variability in the demand for electricity during the winter and summer seasons, as indicated by the larger interquartile ranges (IQR) and the longer whiskers, compared to the spring and fall seasons.
# Create a boxplot for each group
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True, sharex=True)
df_2016.groupby('season').boxplot(by='Hour', column='MW', ax=axes)
## autumn AxesSubplot(0.1,0.559091;0.363636x0.340909)
## spring AxesSubplot(0.536364,0.559091;0.363636x0.340909)
## summer AxesSubplot(0.1,0.15;0.363636x0.340909)
## winter AxesSubplot(0.536364,0.15;0.363636x0.340909)
## dtype: object
##
## <string>:1: UserWarning: When passing multiple axes, sharex and sharey are ignored. These settings must be specified when creating axes.
axes[1][0].tick_params(axis='x', rotation=90)
axes[1][1].tick_params(axis='x', rotation=90)
# Iterate through each subplot and set the xlabel to an empty string
for row in axes:
for ax in row:
ax.set_xlabel("")
fig.suptitle("")
fig.supxlabel('Hour')
fig.supylabel('MW')
plt.tight_layout()
plt.show()Figure 3.5: Seasonal load demand profiles of electricity by time-of-day, 2016
In addition to electricity demand, we also have several weather variables that can be included in our modeling analysis. Figure 3.6 show the variation in several weather variables over time. The top panels show air density (kg/m^3), cloud cover (fraction [0,1]) and irradiance surface (W/m^2), the middle panel shows irradiance top of the atmosphere ((W/m^2), precipitation (mm/hr), and snow depth (mm) and the bottom panel shows snowfall (mm/hr) and temperature in degrees Celsius. The x-axis in all panels represents time, while the y-axis in each panel represents the corresponding weather variable.
The plots suggest that air density, temperature, and both irradiance measures have some seasonal patterns, while the other variables do not. To avoid over-complexifying our models and for demonstration purposes, we will only use temperature as an external predictor.
The temperature panel shows that the temperature varies over the year with the highest temperature observed in July and August and the lowest temperature in January and February. It also shows occasional dips in temperature, likely due to the occurrence of weather events like cold fronts and storms.
#Plot all weather variables
wea_data = wea_data.loc[wea_data['datetime'] > split_date]
fig = plt.figure(figsize=(10, 8))
# Select the columns to plot
columns = ['temperature', 'precipitation', 'irradiance_surface',
'irradiance_toa', 'snowfall', 'snow_depth', 'cloud_cover', 'air_density']
# Define the number of rows and columns for the subplot grid
nrows = 3
ncols = 3
# Loop over the columns
for i, column in enumerate(columns):
ax = fig.add_subplot(nrows, ncols, i + 1)
ax.plot(wea_data['datetime'],wea_data[column])
ax.set_title(column)
ax.tick_params(axis='x', rotation=90)
# Show the plot
plt.tight_layout()
plt.show()Figure 3.6: Hourly weather data in 2016
Figure 3.7 consists of a set of scatterplots that show the relationship between various weather variables and household electricity demand. Each scatterplot shows the values of the weather variable on the x-axis and the corresponding values of electricity demand on the y-axis.
By examining the scatterplots, we can see that there is generally a non-linear relationship between temperature and electricity demand, with a minimum point around 285K (~12°C). It’s likely that a non-linear model would be more effective in predicting energy demand. The relationship between air density and electricity demand also appears to be non-linear, while the relationship between electricity demand and the other four variables appears to be weak or non-existent.
Overall, the scatterplots in 3.7 suggest that temperature is a more important driver of household electricity demand than air density, at least in the context of the data used in this analysis. However, it is important to note that other factors, such as the prevalence of air conditioning or heating systems, may also play a role in determining electricity demand.
#Scatterplot of all weather variables against MW
fig = plt.figure(figsize=(10, 8))
# Select the columns to plot
columns = ['temperature', 'precipitation', 'irradiance_surface',
'irradiance_toa', 'snowfall', 'snow_depth', 'cloud_cover', 'air_density']
# Define the number of rows and columns for the subplot grid
nrows = 3
ncols = 3
# Loop over the columns
for i, column in enumerate(columns):
ax = fig.add_subplot(nrows, ncols, i + 1)
ax.scatter(wea_data[column],df_2016['MW'])
# Compute a lowess smoothing of the data
smoothed = sm.nonparametric.lowess(df_2016['MW'],wea_data[column], frac=0.2)
ax.plot(smoothed[:, 0], smoothed[:, 1], c="k")
ax.set_title(column)
ax.tick_params(axis='x', rotation=90)
# Show the plot
fig.supylabel('MW')
plt.tight_layout()
plt.show()Figure 3.7: Effect of various weather variables on household energy demand.
With the data now in a format suitable for different algorithms, we can start the model development process. Given that the current version of the pyearth package (version 0.2.1) is not officially supported in Python 3.10, it may not be feasible to utilize MARS (Multivariate Adaptive Regression Splines) for predictive modeling in this version of Python. Due to this limitation, we have opted to concentrate on four alternative algorithms for our predictive modeling needs. Figure 4.1 is a plot showing the split between training and testing data sets used to build and evaluate a model to predict household electricity demand. The plot shows a time series of hourly electricity demand data, with the blue line representing the period from 2004 to 2015 and the red line line representing the subset of data used for testing the model.
The training dataset consists of the first 12 years of data, from 2004 to 2015, and the testing dataset consists of the final year of data, 2016. The split between training and testing data is important for building and evaluating models to ensure that the model is not simply memorizing the data it has seen, but is able to generalize to new data. The goal is to use pre-2016 data to parameterize our models and then evaluate their accuracy by testing the 2016 predictions against actual 2016 data.
In this case, the model will be trained on the training dataset and evaluated on the testing dataset to determine its accuracy in predicting household electricity demand. The plot shows that the split between training and testing data is based on time, which is appropriate for time series data such as electricity demand. The split also allows for the testing of the model on more recent data to ensure that it is still accurate and relevant.
dt_train = data3.loc[data3['datetime'] <= split_date]
dt_test = data3.loc[data3['datetime'] > split_date]
print(f"Train dates : {dt_train.index.min()} --- {dt_train.index.max()} (n={len(dt_train)})")
print(f"Test dates : {dt_test.index.min()} --- {dt_test.index.max()} (n={len(dt_test)})")
#Plot time series
plt.subplots(1, figsize=(10,10))
plt.plot(dt_train['datetime'], dt_train['MW'], '--', color="tab:orange", label="training")
plt.plot(dt_test['datetime'], dt_test['MW'], color="tab:green", label="testing")
plt.xlabel("time"), plt.ylabel("MW"), plt.legend(loc="best",frameon=False)
plt.tight_layout()
plt.axvline(datetime(2016, 1, 1))
plt.show()Figure 4.1: Plot showing training/test data split.
Before evaluating the predictions from the four models, we will establish a baseline. The data has significant seasonality, so we will use a seasonal ARIMA (SARIMA) model, specifically SARIMA(0,0,0)(0,1,0)168, to set the baseline. The “m” of 168 for hourly data indicates a weekly seasonal pattern.
#==============================================================================
#prepare variables
#Independent variables
X_train = dt_train.drop(['datetime','MW'], axis=1)
X_train.shape
X_test = dt_test.drop(['datetime','MW'], axis=1)
#Dependent variable
Y_train = np.array(dt_train['MW'])
Y_train.shape
Y_test = np.array(dt_test['MW'])
#Evaluate initial model performance
#Baseline seasonal naive predictions
forecaster = NaiveForecaster(strategy="last", sp=168)
forecast_fit = forecaster.fit(Y_train)
#forecast horizon: 8784 hours
fh = np.arange(1, 8785)
y_pred = forecaster.predict(fh)Figure 4.2 compares the seasonal naive prediction and observed test data for household electricity demand. The graph shows two lines: a red line representing the seasonal naive prediction and a blue line representing the observed test data.
The seasonal naive prediction is a simple forecasting method that uses the demand from the same season in the previous period as a predictor for the current period. In this case, we use a weekly period to represent the seasonal pattern. This method assumes that the demand in the current week will be similar to demand in the previous week. The observed test data, on the other hand, represents the actual demand values for the testing period, 2016.
The graph shows that the seasonal naive prediction tends to underestimate the actual demand during peak periods, such as in the summer months of July and August, and overestimate the demand during off-peak periods, such as in Spring and Autumn. This is because the seasonal naive prediction is based solely on the previous week’s demand, without taking into account other factors that may affect electricity demand, such as changes in population or weather patterns.
Overall, the graph highlights the limitations of the seasonal naive prediction method and the importance of using more advanced forecasting methods that take into account additional factors that may affect electricity demand.
#Plot time series
plt.subplots(1, figsize=(10,10))
plt.plot(dt_test['datetime'], dt_test['MW'], '--', color="tab:orange", label="observed")
plt.plot(dt_test['datetime'], y_pred, color="tab:green", label="seasonal naive forecast")
plt.xlabel("time"), plt.ylabel("MW"), plt.legend(loc="best",frameon=False)
plt.tight_layout()
plt.show()Figure 4.2: Comparisons of seasonal naive prediction and observed test data.
To determine the effectiveness of the baseline prediction, we need to measure the differences between the predicted values and the observed values. Common performance metrics used for this purpose are r-squared, MAE, and RMSE. However, caution should be exercised when using r-squared as a performance metric, as it is a measure of goodness of fit that is only appropriate for evaluating training data and not for out-of-sample predictions. MAE and RMSE are more suitable measures, and the calculation of both metrics differs between Python and R. In Python, the coefficient of determination is used, while R uses the square of Pearson’s r coefficient. For the purpose of comparison, the square of the correlation coefficient is used. The table below shows that the r-squared of the baseline prediction is 0.318, with MAE of 1808 and RMSE of 2326. If any of the models perform worse than the baseline prediction, it is an indication that those models are likely not appropriate.
# evaluate predictions
model = 's' #Seasonal naive
model_columns = []
model_compare = pd.DataFrame(columns = model_columns)
row_index = 0
for alg in model:
predicted = y_pred
model_name = alg.__class__.__name__
model_compare.loc[row_index,'model'] = model_name
model_compare.loc[row_index, 'rsq'] = [np.corrcoef(dt_test['MW'], predicted.T)[0][1]]
model_compare.loc[row_index, 'rsq'] = model_compare.loc[row_index, 'rsq']**2
model_compare.loc[row_index, 'mae'] = mean_absolute_error(dt_test['MW'], predicted)
model_compare.loc[row_index, 'rmse'] = np.sqrt(mean_squared_error(dt_test['MW'], predicted))
row_index+=1
model_compare## model rsq mae rmse
## 0 str 0.318156 1807.627391 2325.931257
The performance of the four models can be evaluated by comparing the predicted values to the observed values using metrics such as R-squared, MAE, and RMSE. The results show that all models perform better than the seasonal naive baseline, with gradient boosting, and random forest emerging as the best-fitting models. These models significantly outperform the linear regression models. Gradient boosting is consistently the best-fitting model, however, having an R-squared close to 1 and low MAE and RMSE values might lead to overfitting issues.
# All machine learning models
mdl = LinearRegression()
mdl_fit = mdl.fit(X_train, Y_train)
val_predictions = mdl.predict(X_test)
r_sq = r2_score(Y_test, val_predictions)
# print(mdl.intercept_)
# print(mdl.coef_)
# print(f"forecasting fit: {r_sq}")
# Official models
model = [
LinearRegression(),
ElasticNet(l1_ratio=1),
GradientBoostingRegressor(n_estimators=1000),
RandomForestRegressor(n_estimators=500)
]
model_columns = []
model_compare = pd.DataFrame(columns = model_columns)
#Create a new dataframe to store the predictions
pred_df = pd.DataFrame(dt_test[['datetime','MW']])
row_index = 0
for alg in model:
predicted = alg.fit(X_train, Y_train).predict(X_test)
model_name = alg.__class__.__name__
pred_df.loc[:, model_name] = predicted
model_compare.loc[row_index,'model'] = model_name
model_compare.loc[row_index, 'rsq'] = [np.corrcoef(Y_test, predicted)[0][1]]
model_compare.loc[row_index, 'rsq'] = model_compare.loc[row_index, 'rsq']**2
model_compare.loc[row_index, 'mae'] = mean_absolute_error(Y_test, predicted)
model_compare.loc[row_index, 'rmse'] = np.sqrt(mean_squared_error(Y_test, predicted))
row_index+=1
model_compare.sort_values(by = ['rsq'], ascending = False, inplace = True)
model_compare ## model rsq mae rmse
## 2 GradientBoostingRegressor 0.887009 1177.565746 1412.519252
## 3 RandomForestRegressor 0.860888 1225.688772 1491.092730
## 0 LinearRegression 0.602713 1502.953319 1796.438557
## 1 ElasticNet 0.602260 1503.248870 1794.219155
print(pred_df)## datetime ... RandomForestRegressor
## 113952 2016-01-01 00:00:00 ... 15035.068
## 113953 2016-01-01 01:00:00 ... 15806.238
## 113954 2016-01-01 02:00:00 ... 14914.296
## 113955 2016-01-01 03:00:00 ... 14266.356
## 113956 2016-01-01 04:00:00 ... 14567.848
## ... ... ... ...
## 122731 2016-12-31 19:00:00 ... 17853.604
## 122732 2016-12-31 20:00:00 ... 17581.708
## 122733 2016-12-31 21:00:00 ... 17312.674
## 122734 2016-12-31 22:00:00 ... 16185.088
## 122735 2016-12-31 23:00:00 ... 15136.678
##
## [8784 rows x 6 columns]
Figure 4.3 shows a set of scatterplots comparing the observed data and test predictions for each of the four models used in the study. The y-axis represents the observed data, and the x-axis represents the test predictions made by the respective model. Each point on the graph represents a single data point from the test set.
In general, a good model would result in points that lie close to the diagonal line, indicating a strong correlation between the observed data and test predictions. A scatterplot with a high degree of scatter and points far from the diagonal line indicates poor model performance.
Looking at the scatterplots in Figure 4.3, it is clear that some models perform better than others. For example, the gradient boosting and random forest models have a tight cluster of points near the diagonal line, indicating a strong correlation between the observed data and test predictions. In contrast, the linear regression models have much more scatter, with many points far from the diagonal line.
Overall, Figure 4.3 provides a useful visual comparison of the performance of the different models used in the study.
#Scatterplot
fig = plt.figure(figsize=(10, 8))
# Select the columns to plot
columns = ['LinearRegression',
'ElasticNet',
'GradientBoostingRegressor',
'RandomForestRegressor']
# Define the number of rows and columns for the subplot grid
nrows = 2
ncols = 2
# Loop over the columns
for i, column in enumerate(columns):
ax = fig.add_subplot(nrows, ncols, i + 1)
ax.scatter(pred_df[column], pred_df['MW'])
ax.set_title(column)
ax.tick_params(axis='x', rotation=90)
x = np.linspace(*ax.get_xlim())
ax.plot(x, x, color='g', linestyle='-.')
# Show the plot
fig.supxlabel('Predicted MW')
fig.supylabel('Observed MW')
plt.tight_layout()
plt.show()Figure 4.3: Comparison of observed data and test predictions for each model.
Figure 4.4 compares the observed demand and predicted demand for each model through time. The x-axis shows the time period, which ranges from January 1, 2016 to December 31, 2016, while the y-axis shows the electricity demand in MW. Each subplot represents a different model, with the observed demand shown in blue and the predicted demand shown in red.
In general, all the models perform well in predicting the overall shape of the demand curve, but they differ in their ability to accurately predict the level of demand at certain times. For example, the linear regression models tend to under-estimate demand during the summer months. The gradient boosting and random forest models, on the other hand, are able to capture more of the variability in the data and produce more accurate predictions overall.
Looking at the individual subplots, it’s clear that the gradient boosting model provides the best overall fit to the data. The predicted demand closely follows the observed demand, with only minor deviations. The random forest model also does a good job, but it tends to over-predict demand in some periods, particularly during the summer months. The linear regression modesl are the least accurate of the four, with large deviations from the observed demand throughout the year.
Overall, the subplots in Figure 4.4 demonstrate the effectiveness of the different models in predicting electricity demand and highlight the strengths and weaknesses of each approach.
#Plots over time
fig = plt.figure(figsize=(10, 8))
# Loop over the columns
for i, column in enumerate(columns):
ax = fig.add_subplot(nrows, ncols, i + 1)
ax.plot(pred_df['datetime'],pred_df['MW'], color='g',label='Observed')
ax.plot(pred_df['datetime'],pred_df[column],label='Predicted')
ax.legend(frameon=False)
ax.set_title(column)
ax.tick_params(axis='x', rotation=90)
# Show the plot
fig.supylabel('MW')
fig.supxlabel('Date')
plt.tight_layout()
plt.show()Figure 4.4: Comparison of observed demand and predicted demand for each model through time.
Panel (a) of Figure 4.5 displays a histogram of forecast errors, showing that the distribution of errors is not symmetrical. The majority of errors are negative and relatively large, while a small fraction of errors are positive and relatively small. This indicates that the gradient boosting model is generally generating sensible forecasts, but sometimes produces bigger errors, particularly when the actual demand is lower than predicted.
The second panel suggests that the model is overestimating electricity demand since a substantial proportion of forecast errors lie below the 5% lower forecast interval. Our results demonstrate that 90% of test dataset errors fell within the range of -18.7% to -0.2% of the actual MW, indicating that the model is more likely to over-predict demand.
In conclusion, the gradient boosting method is making reasonably accurate predictions, but there is still room for improvement in some situations by experimenting with different transformations, variables, and models.
# Calculate the differences for GB and append
pred_df['GB_diff'] = (pred_df['MW'] - pred_df['GradientBoostingRegressor'])/pred_df['MW']*100
pred_df['GB_diff'] = [(a - p) / a * 100 for a, p in zip(pred_df['MW'], pred_df['GradientBoostingRegressor'])]
# First plot - create a histogram of the differences
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 8))
ax1.hist(pred_df.GB_diff)
# Set the x-axis label
ax1.set_xlabel('Percentage Error')
# Set the y-axis label
ax1.set_ylabel('Count')
ax1.set_title('(a)') # Add a title to the first subplot
# Second plot
# Define the threshold values
u_l = 5
l_l = -5
# Define the colors for values above and below the thresholds
color = 'red'
o_color = 'green'
# Set the colors for each point based on whether it's above or below the thresholds
colors = np.where(pred_df['GB_diff'] > u_l, color, np.where(pred_df['GB_diff'] < l_l, color, o_color))
labels = np.where(pred_df['GB_diff'] > u_l, 'Above 5', np.where(pred_df['GB_diff'] < l_l, 'Below -5', 'Between -5 and 5'))
# Plot the scatter plot with different colors for values above and below the thresholds
ax2.scatter(pred_df['datetime'], pred_df['GB_diff'], c=colors, label=labels)
ax2.axhline(y=5, linestyle='--', color='gray')
ax2.axhline(y=-5, linestyle='--', color='gray')
# Add axis labels and a title
ax2.set_xlabel('Period')
ax2.set_ylabel('Percentage Error')
ax2.tick_params(axis='x', rotation=90)
ax2.set_title('(b)') # Add a title to the first subplot
# Create the legend with only two colors
legend_elements = [plt.Line2D([0], [0], marker='o', color=color, label='outside 5% threshold', markersize=5),
plt.Line2D([0], [0], marker='o', color=o_color, label='inside 5% threshold', markersize=5)]
plt.legend(handles=legend_elements, frameon=False, loc="upper left", fontsize=7)
# Show the plot
plt.show() Figure 4.5: Distribution of forecast errors.
# Calculate the 5% and 95% quantile of the forecast errors
quantiles = np.quantile(pred_df['GB_diff'], [0.05, 0.95])
# Print the results
print("5% quantile: {}, 95% quantile: {}".format(quantiles[0], quantiles[1])) ## 5% quantile: -18.74714976915795, 95% quantile: -0.20187161262279255
In conclusion, the analysis indicates that the gradient boosting method is the most appropriate method for forecasting electricity demand using the given data. However, it is important to note that the model has its limitations, and there is still room for improvement in certain scenarios by experimenting with different transformations, variables, and model types. While our model used temperature as the only predictor variable, the inclusion of other related variables could enhance forecast accuracy, although this requires further investigation.
Future projects will focus on improving the model through adjusting the hyperparameters and feature engineering. It would also be valuable to explore the model’s performance during the COVID-19 period as new data becomes available. In summary, while the current model shows promise, continued efforts to improve it are necessary for more accurate forecasting of electricity demand.
A project by Alrick Campbell
alrickcampbell83@gmail.com