Other Projects

1 Introduction

The objective of this exercise is to analyse the temporal aspect of household electricity demand in Ontario, with a particular focus on the data from 2016. To gain a better understanding of the patterns, I will narrow down my analysis to this specific timeframe. I will be examining the seasonality of electricity demand across various time dimensions, including hourly, daily, weekly, monthly, and seasonal.

The raw data used for this exercise can be found at https://ssc.ca/en/case-study/predicting-hourly-electricity-demand-ontario. The data set provided consists of hourly electricity demand (MW) from January 1, 2003 December 31, 2016.

2 Data cleaning and pre-processing

To begin, I first install Reticulate in R which will then allow me to run Python in the R environment through a virtual environment that was created.

# ```{r setup}
library(reticulate)
py_list_packages(
  envname = NULL,
  type = c("auto", "virtualenv", "conda"),
  python = NULL
)
# py_install("cython")
# py_install("seaborn")
# py_install("sktime")
# py_install("pandas")
# py_install("xlwt")
# py_install("statsmodels")
# py_install("matplotlib")
# py_install("pillow")
# py_install("scikit-learn")
# py_install("numpy")
# py_install("openpyxl")
# py_install("skforecast")
# py_install("pytz")

Click on code to see the list of Python modules used for this exercise.

import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from datetime import datetime, timedelta
from pytz import timezone
import openpyxl

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries
from skforecast.model_selection_multiseries import grid_search_forecaster_multiseries

from sklearn import model_selection
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
import statsmodels.api as sm
from sklearn import metrics
from scipy.stats import pearsonr
from scipy import stats
from sktime.forecasting.naive import NaiveForecaster

Initially, I read in the energy demand data and inspect its dimensions and variable types, before displaying the first five observations. The ‘Date’ variable is observed to be in the incorrect date format, so I combine the ‘Date’ and ‘Hour’ variables and shift the hour back by one to fix the formatting. The resulting ‘datetime’ variable is then verified to be in the correct format.

Using the ‘datetime’ variable, I generate new variables to represent time of day, day of the week, weekday/weekend status, month, and season. Following this, I verify the absence of any missing data or outliers and present an overview of the revised dataset.

df = pd.read_excel(r'SSC2020_hourly_demand.xlsx', sheet_name='Hourly Demand')  
# Rename the columns date column for for merge
df = df.rename(columns = {'Total Energy Use from Electricity (MW)': 'MW'}, inplace = False)
len(df.index)
## 122736
print(df)
##              Date  Hour     MW  Year  Month
## 0      2003-01-01     1  14745  2003      1
## 1      2003-01-01     2  14280  2003      1
## 2      2003-01-01     3  13821  2003      1
## 3      2003-01-01     4  13239  2003      1
## 4      2003-01-01     5  13236  2003      1
## ...           ...   ...    ...   ...    ...
## 122731 2016-12-31    20  16260  2016     12
## 122732 2016-12-31    21  15658  2016     12
## 122733 2016-12-31    22  15195  2016     12
## 122734 2016-12-31    23  14758  2016     12
## 122735 2016-12-31    24  14153  2016     12
## 
## [122736 rows x 5 columns]
n = df.shape[0]
df.dtypes
## Date     datetime64[ns]
## Hour              int64
## MW                int64
## Year              int64
## Month             int64
## dtype: object
df['ymd'] = df.Date
#df['ymd'] = df['Date'].dt.strftime('%Y-%m-%d')

df['Hour'] = df['Hour'] - 1
df['Hour'] = pd.to_datetime(df.Hour, unit='h').dt.strftime('%H:%M:%S')
df.dtypes
## Date     datetime64[ns]
## Hour             object
## MW                int64
## Year              int64
## Month             int64
## ymd      datetime64[ns]
## dtype: object
df['datetime'] = df.apply(lambda row: pd.to_datetime(row['ymd']) + pd.to_timedelta(row['Hour']), axis = 1)
df.dtypes

# return the day of week
## Date        datetime64[ns]
## Hour                object
## MW                   int64
## Year                 int64
## Month                int64
## ymd         datetime64[ns]
## datetime    datetime64[ns]
## dtype: object
df['day_num'] = df['ymd'].dt.weekday
df['day_of_week'] = df['ymd'].dt.strftime('%A')
print (df['day_of_week'].unique())
## ['Wednesday' 'Thursday' 'Friday' 'Saturday' 'Sunday' 'Monday' 'Tuesday']
df.dtypes
## Date           datetime64[ns]
## Hour                   object
## MW                      int64
## Year                    int64
## Month                   int64
## ymd            datetime64[ns]
## datetime       datetime64[ns]
## day_num                 int64
## day_of_week            object
## dtype: object
df['day'] = pd.cut(df['day_num'],
                            bins=[-np.inf, 4, np.inf],
                            labels=['Weekday', 'Weekend'],include_lowest =True)
print (df['day'].unique())
## ['Weekday', 'Weekend']
## Categories (2, object): ['Weekday' < 'Weekend']
df['day_in_month'] = df['ymd'].dt.day
df['month'] = df['ymd'].dt.month

def seasons(ymd):
    m = ymd.month
    d = ymd.day
    season=None
    if (m==3 and d>=21) or m==4 or m==5 or (m==6 and d<=20):
        season = 'spring'
    elif (m==6 and d>=21 ) or m==7 or m==8 or (m==9 and d<=20):
        season = 'summer'
    elif (m==9 and d>=21 ) or m==10 or m==11 or (m==12 and d<=20):
        season = 'autumn'
    elif (m==12 and d>=21 ) or m==1 or m==2 or (m==3 and d<=20):
        season = 'winter'
    return season

df['season'] = df.apply(lambda x: seasons(x['ymd']), axis=1)
print (df['season'].unique())
## ['winter' 'spring' 'summer' 'autumn']
print(df['season'].value_counts()) # to print count of every category
# summary statistics
## spring    30912
## summer    30912
## autumn    30576
## winter    30336
## Name: season, dtype: int64
print (df.describe())
##                   MW           Year  ...   day_in_month          month
## count  122736.000000  122736.000000  ...  122736.000000  122736.000000
## mean    16561.766271    2009.500391  ...      15.730935       6.522487
## std      2610.524068       4.031533  ...       8.800676       3.448839
## min      2270.000000    2003.000000  ...       1.000000       1.000000
## 25%     14610.000000    2006.000000  ...       8.000000       4.000000
## 50%     16488.000000    2009.500000  ...      16.000000       7.000000
## 75%     18358.000000    2013.000000  ...      23.000000      10.000000
## max     27005.000000    2016.000000  ...      31.000000      12.000000
## 
## [8 rows x 6 columns]
print(df)
##              Date      Hour     MW  Year  ...      day day_in_month month  season
## 0      2003-01-01  00:00:00  14745  2003  ...  Weekday            1     1  winter
## 1      2003-01-01  01:00:00  14280  2003  ...  Weekday            1     1  winter
## 2      2003-01-01  02:00:00  13821  2003  ...  Weekday            1     1  winter
## 3      2003-01-01  03:00:00  13239  2003  ...  Weekday            1     1  winter
## 4      2003-01-01  04:00:00  13236  2003  ...  Weekday            1     1  winter
## ...           ...       ...    ...   ...  ...      ...          ...   ...     ...
## 122731 2016-12-31  19:00:00  16260  2016  ...  Weekend           31    12  winter
## 122732 2016-12-31  20:00:00  15658  2016  ...  Weekend           31    12  winter
## 122733 2016-12-31  21:00:00  15195  2016  ...  Weekend           31    12  winter
## 122734 2016-12-31  22:00:00  14758  2016  ...  Weekend           31    12  winter
## 122735 2016-12-31  23:00:00  14153  2016  ...  Weekend           31    12  winter
## 
## [122736 rows x 13 columns]
df.dtypes
## Date            datetime64[ns]
## Hour                    object
## MW                       int64
## Year                     int64
## Month                    int64
## ymd             datetime64[ns]
## datetime        datetime64[ns]
## day_num                  int64
## day_of_week             object
## day                   category
## day_in_month             int64
## month                    int64
## season                  object
## dtype: object
data = df

start_date = '2004-01-01 00:00:00'
split_date = '2015-12-31 23:00:00'
# Time series for 2016 only
df_2016 = data.loc[data['datetime'] > split_date]

3 Exploratory Data Analysis

Figure 3.1 displays the hourly variation in household electricity demand in Ontario for each day of 2016 to highlight the seasonal pattern in electricity demand. The graph shows that electricity demand is generally higher in the winter months (December to February) and the summer months (June to August) and lower in the spring (March to May) and fall (September to November). The high electricity demand during winter and summer may be due to the increased use of heating and cooling systems to cope with extreme temperatures. The peak demand occurs in the summer months (June to August), likely due to air conditioning systems’ usage, while the lowest demand occurs in the spring (March to May) when temperatures are mild and the use of heating and cooling systems is minimal.

# Plot time series for 2016
fig, ax = plt.subplots()
plt.plot(df_2016['datetime'], df_2016['MW'], 'tab:red')
plt.xlabel("date"), plt.ylabel("MW")
## (Text(0.5, 0, 'date'), Text(0, 0.5, 'MW'))
plt.tight_layout()
plt.show()
Household electricity demand (MW) in 2016.

Figure 3.1: Household electricity demand (MW) in 2016.

Upon analysing the data, it can be seen in Figure 3.2 that the demand for electricity is highest during the daytime, with a peak between 3 pm to 7 pm. Additionally, there is a distinct decrease in demand during the early morning hours.

# Create a boxplot by hour of day
fig = plt.figure()
ax = df_2016.boxplot(by='Hour', column='MW', figsize=(10, 6))
plt.suptitle('')
ax.set_title('')
plt.xlabel("Hour of day")
plt.ylabel("MW")
ax.tick_params(axis='x', rotation=90)
# Show the plot
plt.show()
Household electricity demand (MW) by hour of day.

Figure 3.2: Household electricity demand (MW) by hour of day.

According to Figure 3.3, the highest electricity demand usually occurs on Thursdays, and it is generally higher on weekdays than on weekends.

# Create a boxplot by day of week
fig = plt.figure()
# Sort the data by day of the week
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df_2016['day_of_week'] = pd.Categorical(df_2016['day_of_week'], categories=order, ordered=True)
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_2016 = df_2016.sort_values(by='day_of_week')
ax1 = df_2016.boxplot(by='day_of_week', column='MW', figsize=(10, 6))
plt.suptitle('')
ax1.set_title('')
plt.xlabel("Day of week")
plt.ylabel("MW")
# Show the plot
plt.show()
Household electricity demand (MW) by day of the week.

Figure 3.3: Household electricity demand (MW) by day of the week.

The side-by-side boxplot in Figure 3.4 shows a comparison of the distribution of hourly electricity demand in megawatts (MW) between weekdays and weekends in Ontario during the year 2016. The left boxplot represents the distribution of hourly electricity demand on weekdays, while the right boxplot represents the distribution on weekends.

The boxplots show that the median demand on weekdays is higher than on weekends. The interquartile range (IQR) for both weekdays and weekends is similar, with the demand being more spread out on weekdays than on weekends. The upper whisker on weekdays is longer than on weekends, indicating that the demand on weekdays can be much higher than on weekends.

The weekdays boxplot also has a larger number of outliers, with the demand on some hours being significantly higher than on other hours. This is especially evident during the morning and evening peak hours, where the demand is much higher on weekdays than on weekends. On the other hand, the weekends boxplot has fewer outliers, indicating that the demand is relatively more consistent throughout the day.

# Create a boxplot by weekday vs weekend
fig = plt.figure()
ax2 = df_2016.boxplot(by='day', column='MW', figsize=(10, 6))
plt.suptitle('')
ax2.set_title('')
plt.xlabel("Part of week")
plt.ylabel("MW")
# Show the plot
plt.show()
Weekday vs weekend electricity demand (MW) in 2016.

Figure 3.4: Weekday vs weekend electricity demand (MW) in 2016.

Figure 3.5 compares the electricity demand on weekdays (Monday to Friday) and weekends (Saturday and Sunday) throughout the year in 2016. The x-axis shows the days of the year, while the y-axis shows the electricity demand in MW.

The graph shows a clear difference between the electricity demand on weekdays and weekends. The demand is higher on weekdays than on weekends for most of the year.

# Create a time plot of weekday vs weekend
fig, ax = plt.subplots(figsize=(8,6))
df_2016.set_index('datetime', inplace=True)
df_2016.groupby('day')['MW'].plot(ax=ax)
## day
## Weekday    AxesSubplot(0.125,0.2;0.775x0.68)
## Weekend    AxesSubplot(0.125,0.2;0.775x0.68)
## Name: MW, dtype: object
ax.legend(frameon=False)
plt.xlabel("Date")
plt.ylabel("MW")
plt.show()
Boxplot comparison of weekday and weekend electricity demand (MW) in 2016.

Figure 3.5: Boxplot comparison of weekday and weekend electricity demand (MW) in 2016.

After analysing the data, it is observed in Figure 3.6 that electricity demand varies by month, with higher demand during the summer and winter months compared to the spring and fall months. The analysis reveals that the peak demand periods occur during the summer months, especially in the months of July and August, and during the winter months, especially in the months of December and January. This is consistent with the pattern observed in the previous examples, indicating that peak demand periods are influenced by both time of day and month of the year.

# Create a boxplot by month
fig = plt.figure()
ax1 = df_2016.boxplot(by='Month', column='MW', figsize=(10, 6))
plt.suptitle('')
ax1.set_title('')
plt.xlabel("Month")
plt.ylabel("MW")
# Show the plot
plt.show()
Household electricity demand (MW) by month.

Figure 3.6: Household electricity demand (MW) by month.

The boxplot in Figure 3.7 shows how electricity demand varies by time of day and season. The median demand for electricity is generally higher in the morning and evening hours, with a dip in demand during the afternoon hours. The demand for electricity is generally higher during the winter and summer seasons compared to spring and fall. There is also more variability in the demand for electricity during the winter and summer seasons.

It is essential to pay attention to the time of day and season when using electricity to prevent higher electricity costs, particularly with electricity providers that have implemented time-of-use pricing schemes. Hydro Ottawa’s pricing structure, for example, varies electricity rates based on the time of day and season. During the winter, off-peak hours are from 7 p.m. to 7 a.m., and during summer, off-peak hours are from 7 p.m. to 7 a.m. and all day on weekends and holidays.

# Create a boxplot for each season
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True, sharex=True)
df_2016.groupby('season').boxplot(by='Hour', column='MW', ax=axes)
## autumn         AxesSubplot(0.1,0.559091;0.363636x0.340909)
## spring    AxesSubplot(0.536364,0.559091;0.363636x0.340909)
## summer             AxesSubplot(0.1,0.15;0.363636x0.340909)
## winter        AxesSubplot(0.536364,0.15;0.363636x0.340909)
## dtype: object
## 
## <string>:1: UserWarning: When passing multiple axes, sharex and sharey are ignored. These settings must be specified when creating axes.
axes[1][0].tick_params(axis='x', rotation=90)
axes[1][1].tick_params(axis='x', rotation=90)
# Iterate through each subplot and set the xlabel to an empty string
for row in axes:
    for ax in row:
        ax.set_xlabel("")
fig.suptitle("")
fig.supxlabel('Hour')
fig.supylabel('MW')
plt.tight_layout() 
plt.show()
Seasonal load demand profiles of electricity by time-of-day, 2016

Figure 3.7: Seasonal load demand profiles of electricity by time-of-day, 2016

4 Conclusions

In conclusion, the analysis suggests that the grid operator needs to effectively plan and manage the supply of electricity during peak demand periods, particularly in the summer and winter months. The grid operator may consider increasing the grid capacity, implementing time-of-use pricing or demand-response programs to encourage customers to reduce their energy usage during peak demand hours, and factoring in temperature forecasts when planning electricity supply during the summer months. Additionally, implementing public education campaigns may help to raise awareness of the environmental impact of energy usage and encourage energy conservation.

 

A project by Alrick Campbell

alrickcampbell83@gmail.com