Other Projects

1 Introduction

Predicting electricity demand through traditional econometric methods and statistical models can prove to be challenging with the increasing amount of big data in the electricity industry. The industry generates a large volume of data from various sources, which has become possible with the use of smart meters and other grid and customer devices. By utilizing advanced mathematical techniques, big data analysis can lead to more accurate forecasting models. This analysis provides a deeper understanding of electricity demand patterns and enables the effective management of demand and optimization of electricity delivery.

However, this shift in the energy sector also presents challenges for utilities and regulators. Utilities must anticipate long-term electricity demand to plan for future energy generation, while regulators require new tools to grasp the energy revolution and the industry’s prospects.

The goal of this project is to predict residential electricity demand for the province of Ontario in Canada by utilizing hourly time series data and weather patterns from January 1, 2003 to December 31, 2016. To achieve this, I apply a range of supervised machine learning algorithms, including classical linear regression (CLM), generalized linear regression (GLM), Multivariate Adaptive Regression Splines (MARS), Gradient Boosting (GB), and Random Forest (RF).

2 Background

Ontario is the most heavily populated of all the Canadian provinces and territories, with around 14 million people, or 38% of the country’s population, residing in an area of over 1 million square kilometers. It is located in Eastern/Central Canada and experiences seasonal and regional climate variations, with average temperatures ranging from -26°C in winter to 28°C in summer (Government of Ontario, 2023).

Canada’s primary energy production is sourced from various sources, with natural gas and crude oil accounting for approximately 82% of the total. Of the secondary energy consumed in the country, the largest portion is natural gas (31%), followed by electricity (20%). The industrial sector is the largest consumer of electricity in Canada, accounting for approximately 40% of total use (Canada Energy Regulator, 2021).

Though Quebec accounts for the majority of electricity consumption in Canada (37.5%), Ontario is the second largest user (25.6%) (National Energy Board, 2021). In 2019, Ontario largely discontinued the production of electricity from coal plants and was producing the majority of its electricity from zero-carbon sources, such as nuclear (59%), hydro (24%), wind (8%), and solar (1%) (Canada Energy Regulator, 2021). The proper management and forecasting of energy demand is crucial in ensuring the availability and reliability of sufficient generating capacity in the future. Understanding the energy needs of Ontario, Canada’s most populous province, is critical for understanding the energy needs of the entire country.

3 Data

The raw data for this project is available at https://ssc.ca/en/case-study/predicting-hourly-electricity-demand-ontario. The dataset includes hourly electricity demand and various weather variables. Specifically, the variables in the dataset are:

  • electricity demand (MW)
  • precipitation (mm/hr)
  • temperature (°C)
  • irradiance surface (W/m^2)
  • irradiance top of the atmosphere ((W/m^2)
  • snowfall (mm/hr)
  • snow depth (mm)
  • cloud cover (fraction [0,1])
  • air density (kg/m^3)

The dataset comprises 122,736 hourly observations spanning from January 1, 2003 to December 31, 2016.

3.1 Data cleaning and pre-processing

To start, I will install Pacman to manage the packages needed for this project in a more intuitive way. Pacman will verify if a required package is already installed and will install it automatically if it is not.

pacman::p_load(readxl,ggplot2,tidyr,dplyr,lubridate,weathermetrics,recipes,
                 patchwork,gower,ipred,parsnip,earth,xgboost,ranger,
                 workflows,StatMatch,hardhat,glmnet,purrr,tibble,broom,
                 yardstick,gt,openxlsx,forecast,smooth,astsa)
Sys.setenv(TZ = "GMT") #Set system timezone

Next, I will import the energy demand data, examine its dimensions, verify the data type, and view the first five records.

#Electricity Time Series
#library("readxl")
# xlsx files
elec_data <- read_excel("SSC2020_hourly_demand.xlsx",sheet=2)
dim(elec_data)
## [1] 122736      5
str(elec_data)
## tibble [122,736 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Date                                  : POSIXct[1:122736], format: "2003-01-01" "2003-01-01" ...
##  $ Hour                                  : num [1:122736] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Total Energy Use from Electricity (MW): num [1:122736] 14745 14280 13821 13239 13236 ...
##  $ Year                                  : num [1:122736] 2003 2003 2003 2003 2003 ...
##  $ Month                                 : num [1:122736] 1 1 1 1 1 1 1 1 1 1 ...
head(elec_data,5)
## # A tibble: 5 × 5
##   Date                 Hour `Total Energy Use from Electricity (MW)`  Year Month
##   <dttm>              <dbl>                                    <dbl> <dbl> <dbl>
## 1 2003-01-01 00:00:00     1                                    14745  2003     1
## 2 2003-01-01 00:00:00     2                                    14280  2003     1
## 3 2003-01-01 00:00:00     3                                    13821  2003     1
## 4 2003-01-01 00:00:00     4                                    13239  2003     1
## 5 2003-01-01 00:00:00     5                                    13236  2003     1

The dataset contains 122,736 observations. Upon inspection, the ‘Date’ variable is not in the proper date format, so it must be combined with the ‘Hour’ variable using as.POSIXct. To correct the 24-hour format, we will shift the hour back by 1. The result will be the new variable ‘datetime,’ which is now in the correct format of ‘%Y-%m-%d %H:%M:%S.’ After examining ‘datetime,’ we can see that it is exactly as desired.

head(elec_data,5)
## # A tibble: 5 × 7
##   Date                 Hour Total Ener…¹  Year Month Hour_…² datetime           
##   <dttm>              <dbl>        <dbl> <dbl> <dbl> <chr>   <dttm>             
## 1 2003-01-01 00:00:00     1        14745  2003     1 00      2003-01-01 00:00:00
## 2 2003-01-01 00:00:00     2        14280  2003     1 01      2003-01-01 01:00:00
## 3 2003-01-01 00:00:00     3        13821  2003     1 02      2003-01-01 02:00:00
## 4 2003-01-01 00:00:00     4        13239  2003     1 03      2003-01-01 03:00:00
## 5 2003-01-01 00:00:00     5        13236  2003     1 04      2003-01-01 04:00:00
## # … with abbreviated variable names ¹​`Total Energy Use from Electricity (MW)`,
## #   ²​Hour_new

Since the dataset has few variables, I used the ‘datetime’ variable to generate additional variables that may be useful later. I created three variables to represent the time of day, day of the week, and whether the day is a weekday or weekend. I also created a variable to identify the season of the year. After generating these variables, I checked for missing values and potential outliers. There may be some issues with the minimum and maximum values compared to the first and third quartiles of MW, but this will be evaluated visually later. As there are no missing values and the dataset now has more variables, I display a ‘glimpse’ of the new data by listing the variables in rows instead of columns, as seen when using the ‘head’ command.

  elec_data$datetime <- as.POSIXct(elec_data$datetime, '%Y-%m-%d %H:%M:%S', tz = "GMT")
  #Sort by date (not initially in ascending order)
  #elec_data <- elec_data[order(elec_data$datetime),]
  #elec_data <- elec_data[elec_data$datetime >= '2012-10-01' & elec_data$datetime <= '2017-09-30',]
  
  
  elec_data$ymd <- format(as.POSIXct(elec_data$datetime,format='%Y-%m-%d %H:%M:%S'),format='%Y-%m-%d')
  elec_data$ymd <- as.Date(elec_data$ymd)
  
  elec_data$day <- weekdays(elec_data$datetime)                 # Convert dates to weekdays
  factor(levels = c("Weekday", "Weekend"))
  # Add categorical variable to the data frame
  elec_data$daytype <- as.factor(ifelse(elec_data$day =='Saturday', 'Weekend',
                                  ifelse(elec_data$day =='Sunday', 'Weekend',
                                  'Weekday')))
  str(elec_data)
  # #Create seasons
  season_dates <- as.Date(sort(c(outer(
    do.call(seq.int, as.list(1900 + as.POSIXlt(range(elec_data$Date) + c(-365, 365))$year)), 
    c("-03-21", "-06-21", "-09-21", "-12-21"),
    paste0))))
  season_names <- rep(c("Spring", "Summer", "Autumn", "Winter"), length.out = length(season_dates))
  elec_data <- as_tibble(elec_data) %>%
    mutate(season = season_names[ findInterval(ymd, season_dates) ])
  
  elec_data$season <- as.factor(elec_data$season)
  levels(elec_data$season) #List categories
summary(elec_data)
##       Date                          Hour      
##  Min.   :2003-01-01 00:00:00   Min.   : 1.00  
##  1st Qu.:2006-07-02 00:00:00   1st Qu.: 6.75  
##  Median :2009-12-31 12:00:00   Median :12.50  
##  Mean   :2009-12-31 12:00:00   Mean   :12.50  
##  3rd Qu.:2013-07-02 00:00:00   3rd Qu.:18.25  
##  Max.   :2016-12-31 00:00:00   Max.   :24.00  
##  Total Energy Use from Electricity (MW)      Year          Month       
##  Min.   : 2270                          Min.   :2003   Min.   : 1.000  
##  1st Qu.:14610                          1st Qu.:2006   1st Qu.: 4.000  
##  Median :16488                          Median :2010   Median : 7.000  
##  Mean   :16562                          Mean   :2010   Mean   : 6.522  
##  3rd Qu.:18358                          3rd Qu.:2013   3rd Qu.:10.000  
##  Max.   :27005                          Max.   :2016   Max.   :12.000  
##    Hour_new            datetime                        ymd            
##  Length:122736      Min.   :2003-01-01 00:00:00   Min.   :2003-01-01  
##  Class :character   1st Qu.:2006-07-02 11:45:00   1st Qu.:2006-07-02  
##  Mode  :character   Median :2009-12-31 23:30:00   Median :2009-12-31  
##                     Mean   :2009-12-31 23:30:00   Mean   :2009-12-31  
##                     3rd Qu.:2013-07-02 11:15:00   3rd Qu.:2013-07-02  
##                     Max.   :2016-12-31 23:00:00   Max.   :2016-12-31  
##      day               daytype         season     
##  Length:122736      Weekday:87672   Autumn:30576  
##  Class :character   Weekend:35064   Spring:30912  
##  Mode  :character                   Summer:30912  
##                                     Winter:30336  
##                                                   
## 
glimpse(elec_data)
## Rows: 122,736
## Columns: 11
## $ Date                                     <dttm> 2003-01-01, 2003-01-01, 2003…
## $ Hour                                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
## $ `Total Energy Use from Electricity (MW)` <dbl> 14745, 14280, 13821, 13239, 1…
## $ Year                                     <dbl> 2003, 2003, 2003, 2003, 2003,…
## $ Month                                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Hour_new                                 <chr> "00", "01", "02", "03", "04",…
## $ datetime                                 <dttm> 2003-01-01 00:00:00, 2003-01…
## $ ymd                                      <date> 2003-01-01, 2003-01-01, 2003…
## $ day                                      <chr> "Wednesday", "Wednesday", "We…
## $ daytype                                  <fct> Weekday, Weekday, Weekday, We…
## $ season                                   <fct> Winter, Winter, Winter, Winte…

Some of the variables need to be converted to dummy variables for use in the models, but before doing so, let’s examine the weather dataset. Similar to the electricity data, we will examine its dimensions, check the variable types, look for missing values, and view the first few observations.

# Weather Time Series
# library("readxl")
# xlsx files
wea_data <- read_excel("SSC2020_hourly_weather.xlsx",sheet=2)
dim(wea_data)
str(wea_data)
summary(wea_data)
# library(weathermetrics)
wea_data$temperature <- celsius.to.kelvin(wea_data$temperature, round = 2)
# wea_data <- subset(wea_data, select = c('time','temperature'))
# library(dplyr)
wea_data <- rename(wea_data, datetime = time)
dim(wea_data)
## [1] 122736      9
str(wea_data)
## tibble [122,736 × 9] (S3: tbl_df/tbl/data.frame)
##  $ datetime          : POSIXct[1:122736], format: "2003-01-01 00:00:00" "2003-01-01 01:00:00" ...
##  $ precipitation     : num [1:122736] 0.01 0.0022 0.0014 0.0013 0.0011 0.0012 0.0062 0.0082 0.0067 0.0074 ...
##  $ temperature       : num [1:122736] 271 271 271 270 270 ...
##  $ irradiance_surface: num [1:122736] 0 0 0 0 0 0 0 0 0 0 ...
##  $ irradiance_toa    : num [1:122736] 0 0 0 0 0 0 0 0 0 0 ...
##  $ snowfall          : num [1:122736] 0.0011 0.001 0.0009 0.0008 0.0009 0.0012 0.0062 0.0082 0.0067 0.0074 ...
##  $ snow_depth        : num [1:122736] 17.4 17.4 17.4 17.4 17.4 ...
##  $ cloud_cover       : num [1:122736] 0.32 0.317 0.296 0.374 0.607 ...
##  $ air_density       : num [1:122736] 1.26 1.26 1.27 1.27 1.27 ...

So far, everything looks good and both datasets have the expected sample size, but we need to merge the data using the ‘datetime’ variable. However, the weather dataset uses ‘time’ instead of ‘datetime.’ To ensure a successful merge, we will rename ‘time’ to ‘datetime’ in the weather dataset.

glimpse(wea_data)
## Rows: 122,736
## Columns: 9
## $ datetime           <dttm> 2003-01-01 00:00:00, 2003-01-01 01:00:00, 2003-01-…
## $ precipitation      <dbl> 0.0100, 0.0022, 0.0014, 0.0013, 0.0011, 0.0012, 0.0…
## $ temperature        <dbl> 271.43, 271.10, 270.75, 270.42, 270.07, 269.72, 269…
## $ irradiance_surface <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ irradiance_toa     <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ snowfall           <dbl> 0.0011, 0.0010, 0.0009, 0.0008, 0.0009, 0.0012, 0.0…
## $ snow_depth         <dbl> 17.4309, 17.4307, 17.4304, 17.4302, 17.4300, 17.429…
## $ cloud_cover        <dbl> 0.3196, 0.3167, 0.2958, 0.3745, 0.6073, 0.7824, 0.8…
## $ air_density        <dbl> 1.2612, 1.2644, 1.2669, 1.2695, 1.2724, 1.2754, 1.2…

With both datasets now having the same ‘datetime’ variable to merge on, I can proceed with the merge, rename the temperature variable to a shorter name, and check its length, distribution, and missing values. It is important to note that the temperature variable has been converted to the Kelvin scale, which provides a more meaningful interpretation as the Kelvin scale has an absolute zero.

#Merge datasets
#Ensure that both dataset has same variable name
# library(dplyr)
elec_data <- rename(elec_data, MW = "Total Energy Use from Electricity (MW)")#Rename variables
wea_data_temp <- wea_data
data <- merge(elec_data, wea_data, by.x = 'datetime', by.y = 'datetime')

#Rename temperature variable
data <- rename(data, temp = temperature)
dim(data)
str(data)
summary(data)
glimpse(data)
## Rows: 122,736
## Columns: 19
## $ datetime           <dttm> 2003-01-01 00:00:00, 2003-01-01 01:00:00, 2003-01-…
## $ Date               <dttm> 2003-01-01, 2003-01-01, 2003-01-01, 2003-01-01, 20…
## $ Hour               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ MW                 <dbl> 14745, 14280, 13821, 13239, 13236, 13504, 13814, 13…
## $ Year               <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 200…
## $ Month              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Hour_new           <chr> "00", "01", "02", "03", "04", "05", "06", "07", "08…
## $ ymd                <date> 2003-01-01, 2003-01-01, 2003-01-01, 2003-01-01, 20…
## $ day                <chr> "Wednesday", "Wednesday", "Wednesday", "Wednesday",…
## $ daytype            <fct> Weekday, Weekday, Weekday, Weekday, Weekday, Weekda…
## $ season             <fct> Winter, Winter, Winter, Winter, Winter, Winter, Win…
## $ precipitation      <dbl> 0.0100, 0.0022, 0.0014, 0.0013, 0.0011, 0.0012, 0.0…
## $ temp               <dbl> 271.43, 271.10, 270.75, 270.42, 270.07, 269.72, 269…
## $ irradiance_surface <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ irradiance_toa     <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ snowfall           <dbl> 0.0011, 0.0010, 0.0009, 0.0008, 0.0009, 0.0012, 0.0…
## $ snow_depth         <dbl> 17.4309, 17.4307, 17.4304, 17.4302, 17.4300, 17.429…
## $ cloud_cover        <dbl> 0.3196, 0.3167, 0.2958, 0.3745, 0.6073, 0.7824, 0.8…
## $ air_density        <dbl> 1.2612, 1.2644, 1.2669, 1.2695, 1.2724, 1.2754, 1.2…

With the merged dataset appearing to be in good shape, it’s important to remember that the electricity data series may contain outliers. We check the minimum values and find some unusually small outliers in the 2003 data, confirming our suspicion. We will investigate this further in the next section with a graphical analysis.

#Examine electricity data and refine period
#Identify minimum value across each year
aggregate(MW ~ cbind(datetime = format(as.Date(datetime, format = "%Y-%m-%d %H:%M:%S"), "%Y")),
          FUN = min,
          data = data)
##    datetime    MW
## 1      2003  2270
## 2      2004 11983
## 3      2005 11950
## 4      2006 11621
## 5      2007 11699
## 6      2008 11450
## 7      2009 10678
## 8      2010 10618
## 9      2011 10799
## 10     2012 10998
## 11     2013 10765
## 12     2014 10719
## 13     2015 10539
## 14     2016 10461

Next, we will create additional variables required for each of our models, including the time of day, day of week, and time of year. We need to format these variables so that we can convert them into dummy variables using the model.matrix command. The model.matrix command generates a large number of dummy variables, so we will need to make sure that only the necessary variables are included in the final dataset.

#Check variable type
#check data type of every variable in data frame
str(data)
data$daytype <- as.factor(data$daytype)
levels(data$daytype) #List categories
str(data)

#Check for missing data
summary(data)

cbind(
  lapply(
    lapply(data, is.na)
    , sum))

sum(is.na(data$temp))
#Impute missing values
#data1 <- data[complete.cases(data), ] # For choosing cases without missed items
data1 <- data                                              # Duplicate data frame
data1$temp[is.na(data1$temp)] <- mean(data1$temp, na.rm = TRUE)  # Replace NA in one column
data1                                                      # Print updated data frame

#Create additional variables
# library(recipes)
data1$hour_of_day <- strftime(data1$datetime,'%H', tz = "UTC")
data1$day_of_week <- strftime(data1$datetime,'%u', tz = "UTC")
data1$time_of_year <- as.character(unclass(data1$season))
data1$day_in_year = yday(data1$datetime)
data1$year_half = lubridate::semester(data1$datetime) %>% as.factor
data1$week_day = lubridate::wday(data1$datetime)
data1$week_in_year = lubridate::week(data1$datetime)
data1$date_christmas = ifelse(between(data1$day_in_year, 358, 366),1, 0)
holidaysa = timeDate::listHolidays("CA")
holiday_rec <- recipe(~datetime, data1) %>%
  step_holiday(datetime, holidays = holidaysa,keep_original_cols = FALSE)
holiday_rec <- prep(holiday_rec, training = data1)
holiday_values <- bake(holiday_rec, new_data = data1)
str(holiday_values)
#Use Canada Day as reference category
data1$holiday_values <- subset(holiday_values, select=-c(1))
str(data1)
## 'data.frame':    122736 obs. of  28 variables:
##  $ datetime          : POSIXct, format: "2003-01-01 00:00:00" "2003-01-01 01:00:00" ...
##  $ Date              : POSIXct, format: "2003-01-01" "2003-01-01" ...
##  $ Hour              : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ MW                : num  14745 14280 13821 13239 13236 ...
##  $ Year              : num  2003 2003 2003 2003 2003 ...
##  $ Month             : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Hour_new          : chr  "00" "01" "02" "03" ...
##  $ ymd               : Date, format: "2003-01-01" "2003-01-01" ...
##  $ day               : chr  "Wednesday" "Wednesday" "Wednesday" "Wednesday" ...
##  $ daytype           : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
##  $ season            : Factor w/ 4 levels "Autumn","Spring",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ precipitation     : num  0.01 0.0022 0.0014 0.0013 0.0011 0.0012 0.0062 0.0082 0.0067 0.0074 ...
##  $ temp              : num  271 271 271 270 270 ...
##  $ irradiance_surface: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ irradiance_toa    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ snowfall          : num  0.0011 0.001 0.0009 0.0008 0.0009 0.0012 0.0062 0.0082 0.0067 0.0074 ...
##  $ snow_depth        : num  17.4 17.4 17.4 17.4 17.4 ...
##  $ cloud_cover       : num  0.32 0.317 0.296 0.374 0.607 ...
##  $ air_density       : num  1.26 1.26 1.27 1.27 1.27 ...
##  $ hour_of_day       : chr  "00" "01" "02" "03" ...
##  $ day_of_week       : chr  "3" "3" "3" "3" ...
##  $ time_of_year      : chr  "4" "4" "4" "4" ...
##  $ day_in_year       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ year_half         : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ week_day          : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ week_in_year      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ date_christmas    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ holiday_values    : tibble [122,736 × 4] (S3: tbl_df/tbl/data.frame)
##   ..$ datetime_CACivicProvincialHoliday: int  0 0 0 0 0 0 0 0 0 0 ...
##   ..$ datetime_CALabourDay             : int  0 0 0 0 0 0 0 0 0 0 ...
##   ..$ datetime_CAThanksgivingDay       : int  0 0 0 0 0 0 0 0 0 0 ...
##   ..$ datetime_CAVictoriaDay           : int  0 0 0 0 0 0 0 0 0 0 ...
#Check for missing data again
summary(data1)
##     datetime                           Date                       
##  Min.   :2003-01-01 00:00:00.00   Min.   :2003-01-01 00:00:00.00  
##  1st Qu.:2006-07-02 12:45:00.00   1st Qu.:2006-07-02 00:00:00.00  
##  Median :2009-12-31 23:30:00.00   Median :2009-12-31 12:00:00.00  
##  Mean   :2010-01-01 00:05:09.02   Mean   :2009-12-31 12:35:09.02  
##  3rd Qu.:2013-07-02 12:15:00.00   3rd Qu.:2013-07-02 00:00:00.00  
##  Max.   :2016-12-31 23:00:00.00   Max.   :2016-12-31 00:00:00.00  
##       Hour             MW             Year          Month       
##  Min.   : 1.00   Min.   : 2270   Min.   :2003   Min.   : 1.000  
##  1st Qu.: 6.75   1st Qu.:14610   1st Qu.:2006   1st Qu.: 4.000  
##  Median :12.50   Median :16488   Median :2010   Median : 7.000  
##  Mean   :12.50   Mean   :16562   Mean   :2010   Mean   : 6.523  
##  3rd Qu.:18.25   3rd Qu.:18358   3rd Qu.:2013   3rd Qu.:10.000  
##  Max.   :24.00   Max.   :27005   Max.   :2016   Max.   :12.000  
##    Hour_new              ymd                 day               daytype     
##  Length:122736      Min.   :2003-01-01   Length:122736      Weekday:87672  
##  Class :character   1st Qu.:2006-07-02   Class :character   Weekend:35064  
##  Mode  :character   Median :2009-12-31   Mode  :character                  
##                     Mean   :2009-12-31                                     
##                     3rd Qu.:2013-07-02                                     
##                     Max.   :2016-12-31                                     
##     season      precipitation          temp       irradiance_surface
##  Autumn:30590   Min.   :0.00000   Min.   :245.8   Min.   :  0.000   
##  Spring:30898   1st Qu.:0.00510   1st Qu.:272.2   1st Qu.:  0.000   
##  Summer:30912   Median :0.02070   Median :281.1   Median :  8.617   
##  Winter:30336   Mean   :0.08838   Mean   :280.9   Mean   :170.239   
##                 3rd Qu.:0.08050   3rd Qu.:290.3   3rd Qu.:290.086   
##                 Max.   :4.68710   Max.   :307.0   Max.   :994.833   
##  irradiance_toa       snowfall         snow_depth       cloud_cover    
##  Min.   :   0.00   Min.   :0.00000   Min.   : 0.0000   Min.   :0.0001  
##  1st Qu.:   0.00   1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.:0.2653  
##  Median :  37.46   Median :0.00010   Median : 0.4858   Median :0.5729  
##  Mean   : 311.13   Mean   :0.01756   Mean   : 6.7797   Mean   :0.5447  
##  3rd Qu.: 601.74   3rd Qu.:0.00630   3rd Qu.:16.3431   3rd Qu.:0.8339  
##  Max.   :1229.86   Max.   :2.21790   Max.   :27.9880   Max.   :0.9970  
##   air_density    hour_of_day        day_of_week        time_of_year      
##  Min.   :1.103   Length:122736      Length:122736      Length:122736     
##  1st Qu.:1.181   Class :character   Class :character   Class :character  
##  Median :1.221   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1.226                                                           
##  3rd Qu.:1.266                                                           
##  Max.   :1.412                                                           
##   day_in_year    year_half    week_day      week_in_year  date_christmas   
##  Min.   :  1.0   1:60898   Min.   :1.000   Min.   : 1.0   Min.   :0.00000  
##  1st Qu.: 92.0   2:61838   1st Qu.:2.000   1st Qu.:14.0   1st Qu.:0.00000  
##  Median :183.0             Median :4.000   Median :27.0   Median :0.00000  
##  Mean   :183.2             Mean   :4.001   Mean   :26.6   Mean   :0.02268  
##  3rd Qu.:275.0             3rd Qu.:6.000   3rd Qu.:40.0   3rd Qu.:0.00000  
##  Max.   :366.0             Max.   :7.000   Max.   :53.0   Max.   :1.00000  
##  holiday_values.datetime_CACivicProvincialHoliday  holiday_values.datetime_CALabourDay  holiday_values.datetime_CAThanksgivingDay  holiday_values.datetime_CAVictoriaDay
##  Min.   :0.0000000    Min.   :0.0000000    Min.   :0.0000000    Min.   :0.0000000                                                                                       
##  1st Qu.:0.0000000    1st Qu.:0.0000000    1st Qu.:0.0000000    1st Qu.:0.0000000                                                                                       
##  Median :0.0000000    Median :0.0000000    Median :0.0000000    Median :0.0000000                                                                                       
##  Mean   :0.0027376    Mean   :0.0027376    Mean   :0.0027376    Mean   :0.0027376                                                                                       
##  3rd Qu.:0.0000000    3rd Qu.:0.0000000    3rd Qu.:0.0000000    3rd Qu.:0.0000000                                                                                       
##  Max.   :1.0000000    Max.   :1.0000000    Max.   :1.0000000    Max.   :1.0000000

The table above shows that all our variables have been modified to be ready for analysis. The ‘datetime’ variable is in a date format and all other variables, including dummies, are in numerical format.

3.2 Exploratory Data Analysis

I start the exploratory data analysis by focusing on the target variable, hourly household electricity demand in MW. Figure 3.1 shows the evolution of household electricity demand in Ontario from 2003 to 2016. In panel (a), the entire dataset is included, while panel (b) covers the period 2004 to 2016. The x-axis of both panels shows the time period, while the y-axis represents the hourly electricity demand in megawatts (MW).

In both panels, the graph exhibits a clear seasonal pattern with peaks occurring during the summer and winter months and troughs during the spring and fall months. The demand for electricity is highest in the summer months, specifically in July and August, which may be attributed to increased use of air conditioning.

# library(ggplot2)
# library(patchwork)
# library(tidyr)
# library(tidyverse)
# library(lubridate)

# Select specific columns from the data frame
plot1 <- data1[, c("datetime", "MW")]
#Exclude 2003 and use this dataset going forward
data1 <- data1[data1$datetime >= '2004-01-01',]
# Select specific columns from the data frame
plot2 <- data1[, c("datetime", "MW")]
bind_rows(plot1 %>% mutate(plot_c = "a"),
          plot2 %>% mutate(plot_c = "b")) %>% 
  ggplot(aes(x = datetime, y = MW)) +
  geom_line(color="darkred") +
  facet_wrap(~plot_c, ncol = 1, scale = "free_y")
The evolution of hourly household electricity demand (MW) in Ontario. Panel (a) includes the entire dataset: 2003-2016. Panel (b) covers the period 2004-2016.

Figure 3.1: The evolution of hourly household electricity demand (MW) in Ontario. Panel (a) includes the entire dataset: 2003-2016. Panel (b) covers the period 2004-2016.

To see the seasonal pattern in electricity demand more clearly, we present the hourly variation in household electricity demand in Ontario for each day of 2016 in Figure 3.2. From the graph, we can observe that electricity demand is generally higher in the winter months (December to February) and the summer months (June to August), and lower in the spring (March to May) and fall (September to November). This is likely due to the increased use of heating and cooling systems during the extreme temperature months.

The peak demand is observed during the summer months (June to August), which could be due to the use of air conditioning systems to cope with the high temperatures. In contrast, the lowest demand is observed in the spring (March to May), when the temperatures are mild, and the use of heating and cooling systems is minimal.

#Power consumption in 2016 only
ggplot(data = data1[data1$datetime >= '2016-01-01' & data1$datetime <= '2016-12-31',], aes(x = datetime, y = MW))+
  geom_line(color = "#7843a6", linewidth = 0.5) +
  xlab('Date') + ylab('Consumption in MW')
Household electricity demand (MW) in 2016.

Figure 3.2: Household electricity demand (MW) in 2016.

Figure 3.3 compares the electricity demand on weekdays (Monday to Friday) and weekends (Saturday and Sunday) throughout the year in 2016. The x-axis shows the days of the year, while the y-axis shows the electricity demand in MW.

The graph shows a clear difference between the electricity demand on weekdays and weekends. The demand is higher on weekdays than on weekends for most of the year, with a few exceptions. During the summer months of June, July, and August, the demand on weekends is higher than on weekdays, which is likely due to increased use of air conditioning units in homes during the hot weather.

# Power consumption in 2016 only Multiple line plot
ggplot(data = data1[data1$datetime >= '2016-01-01' & data1$datetime <= '2016-12-31',], aes(x = datetime, y = MW, group = daytype, colour = daytype)) +
  geom_line(alpha = 0.4)+facet_wrap(~ daytype)+facet_grid(daytype ~ .)+
  labs(x = "Day Of The Year (1-365)",
       y = "Daily Consumption",
       colour = NULL)
Weekday vs weekend electricity demand (MW) in 2016.

Figure 3.3: Weekday vs weekend electricity demand (MW) in 2016.

The side-by-side boxplot in Figure 3.4 shows a comparison of the distribution of hourly electricity demand in megawatts (MW) between weekdays and weekends in Ontario during the year 2016. The left boxplot represents the distribution of hourly electricity demand on weekdays, while the right boxplot represents the distribution on weekends.

The boxplots show that the median demand on weekdays is higher than on weekends. The interquartile range (IQR) for both weekdays and weekends is similar, with the demand being more spread out on weekdays than on weekends. The upper whisker on weekdays is longer than on weekends, indicating that the demand on weekdays can be much higher than on weekends.

The weekdays boxplot also has a larger number of outliers, with the demand on some hours being significantly higher than on other hours. This is especially evident during the morning and evening peak hours, where the demand is much higher on weekdays than on weekends. On the other hand, the weekends boxplot has fewer outliers, indicating that the demand is relatively consistent throughout the day.

  # Power consumption in 2014 only Multiple line plot
ggplot(data = data1[data1$datetime >= '2016-01-01' & data1$datetime <= '2016-12-31',], aes(x = daytype, y = MW, group = daytype, colour = daytype)) +
geom_boxplot() +
labs(title = "", x = "Part of week") +
scale_color_discrete(name = "")
Boxplot comparison of weekday and weekend electricity demand (MW) in 2016.

Figure 3.4: Boxplot comparison of weekday and weekend electricity demand (MW) in 2016.

The boxplot in Figure 3.5 show the variation in the distribution of electricity demand at different times of the day across different seasons. The x-axis represents the time of day, and the y-axis represents the electricity demand in MW. The boxplots in the figure are grouped by season: winter (Dec-Feb), spring (Mar-May), summer (Jun-Aug), and fall (Sep-Nov).

The boxplots show that the median demand for electricity is generally higher in the morning and evening hours, with a dip in demand during the afternoon hours. Additionally, the demand for electricity is generally higher during the winter and summer seasons compared to spring and fall.

For example, in the winter season, the demand for electricity is highest at 7 pm, and the demand is lowest at around 2 am. The demand for electricity during the summer season is generally higher than during the winter season, with the highest demand occurring at 5 pm.

The boxplots also show that there is more variability in the demand for electricity during the winter and summer seasons, as indicated by the larger interquartile ranges (IQR) and the longer whiskers, compared to the spring and fall seasons.

#Scatterplot
#Energy demand by the hour, 2016
ggplot(data1[data1$datetime >= '2016-01-01' & data1$datetime <= '2016-12-31',], aes(x = as.factor(Hour_new), y = MW, colour = season)) +
  geom_boxplot()  + facet_wrap(~ season)+
  labs(x = "Hour of Day") +
  scale_x_discrete(breaks=c("00","06", "12","18", "23")) +
  # scale_x_discrete(limits=c("1", "24")) +
  theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 8),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid
Seasonal load demand profiles of electricity by time-of-day, 2016

Figure 3.5: Seasonal load demand profiles of electricity by time-of-day, 2016

In addition to electricity demand, we also have several weather variables that can be included in our modeling analysis. Figure 3.6 show the variation in several weather variables over time. The top panels show air density (kg/m^3), cloud cover (fraction [0,1]) and irradiance surface (W/m^2), the middle panel shows irradiance top of the atmosphere ((W/m^2), precipitation (mm/hr), and snow depth (mm) and the bottom panel shows snowfall (mm/hr) and temperature in degrees Celsius. The x-axis in all panels represents time, while the y-axis in each panel represents the corresponding weather variable.

The plots suggest that air density, temperature, and both irradiance measures have some seasonal patterns, while the other variables do not. To avoid over-complexifying our models and for demonstration purposes, we will only use temperature as an external predictor.

The temperature panel shows that the temperature varies over the year with the highest temperature observed in July and August and the lowest temperature in January and February. It also shows occasional dips in temperature, likely due to the occurrence of weather events like cold fronts and storms.

#Hourly weather data in 2016
data1 %>%
  filter(datetime >= '2016-01-01') %>%
  select(datetime, temp, precipitation, irradiance_surface,
         irradiance_toa, snowfall, snow_depth, cloud_cover, air_density) %>%
  pivot_longer(-datetime) %>%
  ggplot(aes(datetime, value, colour = name)) +
  geom_line(linewidth = 0.5, alpha = 0.8) +
  facet_wrap(~ name, scales = "free_y") +
  labs(title = "Hourly weather data in 2016",
       y = "Unit of measurement",
       x = NULL,
       colour = NULL)+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid
Hourly weather data in 2016

Figure 3.6: Hourly weather data in 2016

Figure 3.7 consists of a set of scatterplots that show the relationship between various weather variables and household electricity demand. Each scatterplot shows the values of the weather variable on the x-axis and the corresponding values of electricity demand on the y-axis.

By examining the scatterplots, we can see that there is generally a non-linear relationship between temperature and electricity demand, with a minimum point around 285K (~12°C). It’s likely that a non-linear model would be more effective in predicting energy demand. The relationship between air density and electricity demand also appears to be non-linear, while the relationship between electricity demand and the other four variables appears to be weak or non-existent.

Overall, the scatterplots in 3.7 suggest that temperature is a more important driver of household electricity demand than air density, at least in the context of the data used in this analysis. However, it is important to note that other factors, such as the prevalence of air conditioning or heating systems, may also play a role in determining electricity demand.

#Scatterplot
data1 %>%
  select(datetime, MW, temp, precipitation, irradiance_surface,
         irradiance_toa, snowfall, snow_depth, cloud_cover, air_density) %>%
  pivot_longer(-c(datetime, MW)) %>%
  ggplot(aes(value, MW)) +
  geom_point(color = "blue", alpha = 1/10) +
  geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs", k = 4), col="purple")  + 
  facet_wrap(~ name, scales = "free_x") +
  labs(y = "MW",
       x = "Unit of measurement",
       colour = NULL)
Effect of various weather variables on household energy demand.

Figure 3.7: Effect of various weather variables on household energy demand.

4 Model development

With the data now in a format suitable for different algorithms, we can start the model development process. Figure 4.1 is a plot showing the split between training and testing data sets used to build and evaluate a model to predict household electricity demand. The plot shows a time series of hourly electricity demand data, with the blue line representing the period from 2004 to 2015 and the red line line representing the subset of data used for testing the model.

The training dataset consists of the first 12 years of data, from 2004 to 2015, and the testing dataset consists of the final year of data, 2016. The split between training and testing data is important for building and evaluating models to ensure that the model is not simply memorizing the data it has seen, but is able to generalize to new data. The goal is to use pre-2016 data to parameterize our models and then evaluate their accuracy by testing the 2016 predictions against actual 2016 data.

In this case, the model will be trained on the training dataset and evaluated on the testing dataset to determine its accuracy in predicting household electricity demand. The plot shows that the split between training and testing data is based on time, which is appropriate for time series data such as electricity demand. The split also allows for the testing of the model on more recent data to ensure that it is still accurate and relevant.

# library(ggplot2)
bind_rows(
  dt_train %>% mutate(id = "training"),
  dt_test %>% mutate(id = "testing")
) %>%
  ggplot(aes(datetime, MW, colour = id)) +
  geom_line() +
  scale_colour_manual(values = c("red", "lightblue")) +
  labs(y = "Demand (MW)",
       x = NULL,
       colour = NULL) +
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"),
        plot.caption = element_text(face = "italic", size = 9,
                                    colour = "grey50"),
        legend.position = "right")
Plot showing training/test data split.

Figure 4.1: Plot showing training/test data split.

After the training and testing split, we use the `recipes’ package to specify the dependent (output) variable and the predictors we plan to use.

#Set up variables
# library(recipes)
# library(gower)
# library(ipred)
dt_train <- dt_train[,-c(1)] #(keep only necessary variables)
model_rec <- recipe(MW ~ ., data = dt_train) %>%
  step_zv(all_predictors())
#View list of final variables
model_rec[["var_info"]][["variable"]]
##  [1] "temp"          "hour_of_day01" "hour_of_day02" "hour_of_day03"
##  [5] "hour_of_day04" "hour_of_day05" "hour_of_day06" "hour_of_day07"
##  [9] "hour_of_day08" "hour_of_day09" "hour_of_day10" "hour_of_day11"
## [13] "hour_of_day12" "hour_of_day13" "hour_of_day14" "hour_of_day15"
## [17] "hour_of_day16" "hour_of_day17" "hour_of_day18" "hour_of_day19"
## [21] "hour_of_day20" "hour_of_day21" "hour_of_day22" "hour_of_day23"
## [25] "day_of_week2"  "day_of_week3"  "day_of_week4"  "day_of_week5" 
## [29] "day_of_week6"  "day_of_week7"  "time_of_year2" "time_of_year3"
## [33] "time_of_year4" "MW"
rf_rec <- recipe(MW ~ ., data = dt_train) %>%
  step_zv(all_predictors())

To have a consistent approach in building our model, we use the `parsnip’ package. We first specify the model type: linear regression (LM), Multivariate Adaptive Regression Splines (MARS), Gradient Boosting (GB), and Random Forest (RF). Then we choose the fitting engine, in this case two types of linear regression models: classical linear regression (CLM) and generalized linear regression (GLM). CLM is fitted using ordinary least squares while GLM is fitted through maximum likelihood estimation. Finally, we declare the mode of the model, which is regression as our dependent variable is numeric. If the dependent variable were qualitative, the mode would be classification.

#Specify 5 models: linear, elastic net, mars (segmented), gradient boosting, random forest
# library(parsnip)
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# library(parsnip)
en_spec <- linear_reg(penalty = 0.02, mixture = 0.5) %>%
  set_engine('glmnet') %>%
  set_mode("regression")

# library(earth)
mars_spec <- mars(prod_degree = 1) %>%  #<- No interaction terms/hinge function
  set_engine("earth") %>%
  set_mode("regression")

# library(xgboost,warn.conflicts = FALSE)
xg_spec <- boost_tree(trees = 1000) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# library(parsnip)
# library(ranger)
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

With the model specifications in place, we use the `workflows’ function to streamline and evaluate our models.

#Fit models
# library(workflows)

en_wf_fit <- workflow() %>%
  add_recipe(model_rec) %>%
  add_model(en_spec) %>%
  fit(dt_train)

lm_wf_fit <- workflow() %>%
  add_recipe(model_rec) %>%
  add_model(lm_spec) %>%
  fit(dt_train)

mars_wf_fit <- workflow() %>%
  add_recipe(model_rec) %>%
  add_model(mars_spec) %>%
  fit(dt_train)

xg_wf_fit <- workflow() %>%
  add_recipe(model_rec) %>%
  add_model(xg_spec) %>%
  fit(dt_train)

rf_wf_fit <- workflow() %>%
  add_recipe(rf_rec) %>%
  add_model(rf_spec) %>%
  fit(dt_train)

5 Model forecast evaluation

I have now obtained my fitted models and let’s examine some of the calculated model coefficients. As anticipated, CLM and GLM yield the same results due to the assumption of normality in residuals.

#View model coefficients
lm_wf_fit %>% extract_fit_engine() %>% summary()
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6013.0 -1046.5  -104.0   897.5  8666.1 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   27666.6935   224.4114 123.286  < 2e-16 ***
## temp            -48.3844     0.7958 -60.800  < 2e-16 ***
## hour_of_day01  -429.9688    33.3395 -12.897  < 2e-16 ***
## hour_of_day02  -681.0083    33.3560 -20.416  < 2e-16 ***
## hour_of_day03  -737.1800    33.3720 -22.090  < 2e-16 ***
## hour_of_day04  -494.6656    33.3884 -14.816  < 2e-16 ***
## hour_of_day05   231.4434    33.4040   6.929 4.27e-12 ***
## hour_of_day06  1376.9789    33.4191  41.203  < 2e-16 ***
## hour_of_day07  2373.8403    33.4337  71.001  < 2e-16 ***
## hour_of_day08  2879.8110    33.4477  86.099  < 2e-16 ***
## hour_of_day09  3209.5465    33.4609  95.919  < 2e-16 ***
## hour_of_day10  3418.5828    33.4737 102.127  < 2e-16 ***
## hour_of_day11  3467.3702    33.4720 103.590  < 2e-16 ***
## hour_of_day12  3480.1794    33.4356 104.086  < 2e-16 ***
## hour_of_day13  3457.2372    33.3775 103.580  < 2e-16 ***
## hour_of_day14  3490.6896    33.3366 104.710  < 2e-16 ***
## hour_of_day15  3709.4177    33.3335 111.282  < 2e-16 ***
## hour_of_day16  4072.3201    33.3536 122.095  < 2e-16 ***
## hour_of_day17  4366.6796    33.3798 130.818  < 2e-16 ***
## hour_of_day18  4402.4604    33.4020 131.802  < 2e-16 ***
## hour_of_day19  4319.9038    33.4131 129.288  < 2e-16 ***
## hour_of_day20  3991.4730    33.4088 119.474  < 2e-16 ***
## hour_of_day21  3080.0907    33.3872  92.254  < 2e-16 ***
## hour_of_day22  1845.3427    33.3606  55.315  < 2e-16 ***
## hour_of_day23   736.4453    33.3400  22.089  < 2e-16 ***
## day_of_week2    290.4490    18.0029  16.133  < 2e-16 ***
## day_of_week3    276.3517    18.0031  15.350  < 2e-16 ***
## day_of_week4    185.7379    17.9956  10.321  < 2e-16 ***
## day_of_week5    -85.0636    18.0030  -4.725 2.30e-06 ***
## day_of_week6  -1350.0620    18.0045 -74.985  < 2e-16 ***
## day_of_week7  -1661.5022    18.0042 -92.284  < 2e-16 ***
## time_of_year2  -382.0071    13.9940 -27.298  < 2e-16 ***
## time_of_year3  1461.4892    17.8283  81.976  < 2e-16 ***
## time_of_year4  1028.5423    16.1971  63.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1560 on 105158 degrees of freedom
## Multiple R-squared:  0.6404, Adjusted R-squared:  0.6403 
## F-statistic:  5676 on 33 and 105158 DF,  p-value: < 2.2e-16
tidy(en_wf_fit) %>% print(n = Inf)
## # A tibble: 34 × 3
##    term          estimate penalty
##    <chr>            <dbl>   <dbl>
##  1 (Intercept)    27684.     0.02
##  2 temp             -48.1    0.02
##  3 hour_of_day01   -520.     0.02
##  4 hour_of_day02   -770.     0.02
##  5 hour_of_day03   -826.     0.02
##  6 hour_of_day04   -584.     0.02
##  7 hour_of_day05    135.     0.02
##  8 hour_of_day06   1281.     0.02
##  9 hour_of_day07   2277.     0.02
## 10 hour_of_day08   2783.     0.02
## 11 hour_of_day09   3113.     0.02
## 12 hour_of_day10   3322.     0.02
## 13 hour_of_day11   3371.     0.02
## 14 hour_of_day12   3384.     0.02
## 15 hour_of_day13   3361.     0.02
## 16 hour_of_day14   3394.     0.02
## 17 hour_of_day15   3612.     0.02
## 18 hour_of_day16   3975.     0.02
## 19 hour_of_day17   4269.     0.02
## 20 hour_of_day18   4305.     0.02
## 21 hour_of_day19   4222.     0.02
## 22 hour_of_day20   3894.     0.02
## 23 hour_of_day21   2983.     0.02
## 24 hour_of_day22   1749.     0.02
## 25 hour_of_day23    641.     0.02
## 26 day_of_week2     289.     0.02
## 27 day_of_week3     275.     0.02
## 28 day_of_week4     185.     0.02
## 29 day_of_week5     -82.7    0.02
## 30 day_of_week6   -1347.     0.02
## 31 day_of_week7   -1659.     0.02
## 32 time_of_year2   -384.     0.02
## 33 time_of_year3   1454.     0.02
## 34 time_of_year4   1028.     0.02
mars_wf_fit %>% extract_fit_engine() %>% summary()
## Call: earth(formula=..y~., data=data, keepxy=TRUE, degree=~1)
## 
##                coefficients
## (Intercept)      12740.3780
## hour_of_day01     -276.4688
## hour_of_day02     -447.9077
## hour_of_day03     -460.8605
## hour_of_day04     -187.9333
## hour_of_day05      558.3398
## hour_of_day06     1719.5353
## hour_of_day07     2729.0460
## hour_of_day08     3244.7458
## hour_of_day09     3581.9089
## hour_of_day10     3796.9101
## hour_of_day11     3829.7609
## hour_of_day12     3789.8862
## hour_of_day13     3697.5914
## hour_of_day14     3632.0886
## hour_of_day15     3734.2450
## hour_of_day16     3987.2584
## hour_of_day17     4187.5600
## hour_of_day18     4150.2108
## hour_of_day19     4023.2870
## hour_of_day20     3682.1301
## hour_of_day21     2786.6812
## hour_of_day22     1594.4562
## hour_of_day23      581.3297
## day_of_week6     -1447.2022
## day_of_week7     -1767.5562
## time_of_year2     -568.9550
## h(temp-282.45)     148.4622
## h(286.94-temp)     155.5844
## h(temp-286.94)      55.8565
## h(temp-298.74)     396.2347
## 
## Selected 31 of 31 terms, and 27 of 33 predictors
## Termination condition: RSq changed by less than 0.001 at 31 terms
## Importance: temp, day_of_week7, day_of_week6, hour_of_day02, hour_of_day03, ...
## Number of terms at each degree of interaction: 1 30 (additive model)
## GCV 1777668    RSS 186779614524    GRSq 0.7373786    RSq 0.7376782

Before evaluating the predictions from the five models, we will establish a baseline. The data has significant seasonality, so we will use a seasonal ARIMA (SARIMA) model, specifically SARIMA(0,0,0)(0,1,0)168, to set the baseline. The “m” of 168 for hourly data indicates a weekly seasonal pattern.

#Baseline seasonal naive predictions
evaluation_metrics <- metric_set(rsq, mae, rmse)
#generate seasonal naive forecasts
# snf_model <- sarima(data1$MW, nrow(dt_test),0,0, 0,1,0,168,no.constant = TRUE)
sa_data <- sarima.for(dt_train$MW, nrow(dt_test), 0,0,0,0,1,0,168,no.constant = TRUE)
snf_data <- dt_test[ , c('datetime','MW')]
pred <- sa_data[["pred"]]
pred_baseline <- snf_data %>%
  mutate(pred) #1 week (in hrs)

Figure 5.1 compares the seasonal naive prediction and observed test data for household electricity demand. The graph shows two lines: a red line representing the seasonal naive prediction and a blue line representing the observed test data.

The seasonal naive prediction is a simple forecasting method that uses the demand from the same season in the previous period as a predictor for the current period. In this case, we use a weekly period to represent the seasonal pattern. This method assumes that the demand in the current week will be similar to demand in the previous week. The observed test data, on the other hand, represents the actual demand values for the testing period, 2016.

The graph shows that the seasonal naive prediction tends to underestimate the actual demand during peak periods, such as in the summer months of July and August, and overestimate the demand during off-peak periods, such as in Spring and Autumn. This is because the seasonal naive prediction is based solely on the previous week’s demand, without taking into account other factors that may affect electricity demand, such as changes in population or weather patterns.

Overall, the graph highlights the limitations of the seasonal naive prediction method and the importance of using more advanced forecasting methods that take into account additional factors that may affect electricity demand.

ggplot(pred_baseline, aes(x=datetime)) + 
  geom_line(aes(y = MW, color = "observed")) + 
  geom_line(aes(y = pred, color="seasonal naive forecast")) +  # change to monthly ticks and labels
  labs(title="", 
       caption="Source: Author", 
       y="MW",
       x="Period",
       color=NULL) +  # title and caption
  scale_color_manual(labels = c("observed","seasonal naive forecast"), 
                     values = c("observed"="steelblue","seasonal naive forecast"="darkred")) +  # line color
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid
Comparisons of seasonal naive prediction and observed test data.

Figure 5.1: Comparisons of seasonal naive prediction and observed test data.

To determine the effectiveness of the baseline prediction, we need to measure the differences between the predicted values and the observed values. Common performance metrics used for this purpose are r-squared, MAE, and RMSE. However, caution should be exercised when using r-squared as a performance metric, as it is a measure of goodness of fit that is only appropriate for evaluating training data and not for out-of-sample predictions. MAE and RMSE are more suitable measures, and the calculation of both metrics differs between Python and R. In Python, the coefficient of determination is used, while R uses the square of Pearson’s r coefficient. For the purpose of comparison, the square of the correlation coefficient is used. The table below shows that the r-squared of the baseline prediction is 0.318, with MAE of 1808 and RMSE of 2326. If any of the models perform worse than the baseline prediction, it is an indication that those models are likely not appropriate.

pred_baseline %>%
  evaluation_metrics(truth = MW, estimate = pred) %>% 
  select(-.estimator) %>% 
  pivot_wider(names_from = .metric, values_from = .estimate) %>% 
  arrange(-rsq)
## # A tibble: 1 × 3
##     rsq   mae  rmse
##   <dbl> <dbl> <dbl>
## 1 0.318 1808. 2326.
#Evaluate initial model performance
predictions <- tibble(
  type = c("LM", "EN", "MARS", "GB", "RF"),
  wflow = list(lm_wf_fit, en_wf_fit, mars_wf_fit, xg_wf_fit, rf_wf_fit),
  predictions = map(.x = wflow, .f = ~ augment(.x, dt_test))
) %>%
  select(-wflow) %>%
  unnest(predictions)
predictions %>% select(type, datetime, MW, .pred)
## # A tibble: 43,920 × 4
##    type  datetime               MW  .pred
##    <chr> <dttm>              <dbl>  <dbl>
##  1 LM    2016-01-01 00:00:00 13417 15525.
##  2 LM    2016-01-01 01:00:00 12968 15099.
##  3 LM    2016-01-01 02:00:00 12395 14842.
##  4 LM    2016-01-01 03:00:00 12228 14775.
##  5 LM    2016-01-01 04:00:00 12116 15015.
##  6 LM    2016-01-01 05:00:00 12257 15747.
##  7 LM    2016-01-01 06:00:00 12528 16902.
##  8 LM    2016-01-01 07:00:00 12820 17909.
##  9 LM    2016-01-01 08:00:00 12986 18429.
## 10 LM    2016-01-01 09:00:00 13569 18768.
## # … with 43,910 more rows

The performance of five models can be evaluated by comparing the predicted values to the observed values using metrics such as R-squared, MAE, and RMSE. The results show that all models perform better than the seasonal naive baseline, with gradient boosting, random forest, and MARS emerging as the best-fitting models. These models significantly outperform the linear regression models. Gradient boosting is consistently the best-fitting model, however, having an R-squared close to 1 and low MAE and RMSE values might lead to overfitting issues.

#Compute evaluation metrics
# library(yardstick)
evaluation_metrics <- metric_set(rsq, mae, rmse)
predictions %>%
  group_by(type) %>%
  evaluation_metrics(truth = MW, estimate = .pred) %>%
  select(-.estimator) %>%
  pivot_wider(names_from = ".metric", values_from = ".estimate") %>%
  arrange(rsq)
## # A tibble: 5 × 4
##   type    rsq   mae  rmse
##   <chr> <dbl> <dbl> <dbl>
## 1 LM    0.603 1503. 1796.
## 2 EN    0.603 1503. 1796.
## 3 MARS  0.762 1298. 1594.
## 4 RF    0.801 1236. 1516.
## 5 GB    0.888 1198. 1430.

Figure 5.2 shows a set of scatterplots comparing the observed data and test predictions for each of the five models used in the study. The y-axis represents the observed data, and the x-axis represents the test predictions made by the respective model. Each point on the graph represents a single data point from the test set.

In general, a good model would result in points that lie close to the diagonal line, indicating a strong correlation between the observed data and test predictions. A scatterplot with a high degree of scatter and points far from the diagonal line indicates poor model performance.

Looking at the scatterplots in Figure 5.2, it is clear that some models perform better than others. For example, the gradient boosting and random forest models have a tight cluster of points near the diagonal line, indicating a strong correlation between the observed data and test predictions. In contrast, the linear regression and MARS models have much more scatter, with many points far from the diagonal line.

Overall, Figure 5.2 provides a useful visual comparison of the performance of the different models used in the study.

#Plot predictions and actual values in scatterplot
predictions %>%
  ggplot(aes(.pred, MW)) +
  geom_point(alpha = 0.2, colour = "lightblue", size = 2) +
  facet_wrap(~ type) +
  geom_abline(lty = "dashed", colour = "grey50") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
Comparison of observed data and test predictions for each model.

Figure 5.2: Comparison of observed data and test predictions for each model.

Figure 5.3 compares the observed demand and predicted demand for each model through time. The x-axis shows the time period, which ranges from January 1, 2016 to December 31, 2016, while the y-axis shows the electricity demand in MW. Each subplot represents a different model, with the observed demand shown in blue and the predicted demand shown in red.

In general, all the models perform well in predicting the overall shape of the demand curve, but they differ in their ability to accurately predict the level of demand at certain times. For example, the linear regression models tend to under-estimate demand during the summer months. The gradient boosting, random forest and MARS models, on the other hand, are able to capture more of the variability in the data and produce more accurate predictions overall.

Looking at the individual subplots, it’s clear that the gradient boosting model provides the best overall fit to the data. The predicted demand closely follows the observed demand, with only minor deviations. The random forest model also does a good job, but it tends to over-predict demand in some periods, particularly during the summer months. The linear regression modesl are the least accurate of the five, with large deviations from the observed demand throughout the year.

Overall, the subplots in Figure 5.3 demonstrate the effectiveness of the different models in predicting electricity demand and highlight the strengths and weaknesses of each approach.

#Plot predictions and actual values against time
ggplot(predictions, aes(x=datetime)) +
  geom_line(aes(y = MW, color = "observed")) +
  geom_line(aes(y = .pred, color="predicted")) +
  labs(title = NULL, x="Period") + # title and caption
  scale_color_manual(labels = c("observed", "predicted"),
                     values = c("observed"="steelblue", "predicted"="darkred")) +  # line color
  facet_wrap(~ type) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid
Comparison of observed demand and predicted demand for each model through time.

Figure 5.3: Comparison of observed demand and predicted demand for each model through time.

In order to evaluate the accuracy of the forecast, we can examine the forecast errors which are the differences between the observed values and their corresponding predicted values. In Figure 5.4, we only display the errors for the gradient boosting method, which is the best-fitting model.

p3 <- xg_wf_fit %>%
    augment(dt_test) %>%
    mutate(model = "Gradient Boosting") %>%
    mutate(PE = (MW-.pred)/MW) %>%
    ungroup() %>%
  ggplot(aes(x = PE))+
  geom_histogram(bins = 50, na.rm=TRUE) +
scale_x_continuous(limits = c(-0.5, 0.5), labels = scales::percent)+
  labs(title = "(a)", x = "Percent Error")+
  theme_bw()

p4 <- xg_wf_fit %>%
  augment(dt_test) %>%
  mutate(model = "Gradient Boosting") %>%
  mutate(PE = (MW-.pred)/MW*100,
         interval = ifelse(abs(PE) > 0.05*100, 
                            "outside 5%", "inside 5%") %>% as.factor()) %>%  
  ungroup() %>%
  ggplot(aes(datetime, PE, colour = interval)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0.05*100, colour = "darkred", lty = "dashed") +
  geom_hline(yintercept = -0.05*100, colour = "darkred", lty = "dashed") +
  labs(title = "(b)", y = "Percent Error", x = "Period")+
  theme_bw() +
  theme(legend.position = "bottom")

p3 + p4 + plot_layout(ncol=2,widths=c(1,2))
Distribution of forecast errors.

Figure 5.4: Distribution of forecast errors.

Panel (a) of Figure 5.4 displays a histogram of forecast errors, showing that the distribution of errors is not symmetrical. The majority of errors are negative and relatively large, while a small fraction of errors are positive and relatively small. This indicates that the gradient boosting model is generally generating sensible forecasts, but sometimes produces bigger errors, particularly when the actual demand is lower than predicted.

The second panel suggests that the model is overestimating electricity demand since a substantial proportion of forecast errors lie below the 5% lower forecast interval. Our results demonstrate that 90% of test dataset errors fell within the range of -18% to 0.158% of the actual MW, indicating that the model is more likely to over-predict demand.

In conclusion, the gradient boosting method is making reasonably accurate predictions, but there is still room for improvement in some situations by experimenting with different transformations, variables, and models.

# library(gt)
xg_wf_fit %>%
  augment(dt_test) %>%
  mutate(model = "Gradient Boosting") %>%
  mutate(PE = (MW-.pred)/MW) %>%
  summarise(quant_05 = quantile(PE, 0.05) * 100,
            quant_95 = quantile(PE, 0.95) * 100)
## # A tibble: 1 × 2
##   quant_05 quant_95
##      <dbl>    <dbl>
## 1    -18.0    0.158

6 Conclusions and Limitations

In conclusion, the analysis indicates that the gradient boosting method is the most appropriate method for forecasting electricity demand using the given data. However, it is important to note that the model has its limitations, and there is still room for improvement in certain scenarios by experimenting with different transformations, variables, and model types. While our model used temperature as the only predictor variable, the inclusion of other related variables could enhance forecast accuracy, although this requires further investigation.

Future projects will focus on improving the model through adjusting the hyperparameters and feature engineering. It would also be valuable to explore the model’s performance during the COVID-19 period as new data becomes available. In summary, while the current model shows promise, continued efforts to improve it are necessary for more accurate forecasting of electricity demand.

References

Canada Energy Regulator. (2021). Provincial and territorial energy profiles - canada. https://www.cer-rec.gc.ca/en/data-analysis/energy-markets/provincial-territorial-energy-profiles/provincial-territorial-energy-profiles-canada.html.
Government of Ontario. (2023). About ontario. https://www.ontario.ca/page/about-ontario.
 

A project by Alrick Campbell

alrickcampbell83@gmail.com