Other Projects
Electricity demand forecasting with Python: https://rpubs.com/alrickcampbell/energy_forecasting_Python
Visualizing electricity demand with Python: https://rpubs.com/alrickcampbell/energy_demand_visuals
Predicting electricity demand through traditional econometric methods and statistical models can prove to be challenging with the increasing amount of big data in the electricity industry. The industry generates a large volume of data from various sources, which has become possible with the use of smart meters and other grid and customer devices. By utilizing advanced mathematical techniques, big data analysis can lead to more accurate forecasting models. This analysis provides a deeper understanding of electricity demand patterns and enables the effective management of demand and optimization of electricity delivery.
However, this shift in the energy sector also presents challenges for utilities and regulators. Utilities must anticipate long-term electricity demand to plan for future energy generation, while regulators require new tools to grasp the energy revolution and the industry’s prospects.
The goal of this project is to predict residential electricity demand for the province of Ontario in Canada by utilizing hourly time series data and weather patterns from January 1, 2003 to December 31, 2016. To achieve this, I apply a range of supervised machine learning algorithms, including classical linear regression (CLM), generalized linear regression (GLM), Multivariate Adaptive Regression Splines (MARS), Gradient Boosting (GB), and Random Forest (RF).
Ontario is the most heavily populated of all the Canadian provinces and territories, with around 14 million people, or 38% of the country’s population, residing in an area of over 1 million square kilometers. It is located in Eastern/Central Canada and experiences seasonal and regional climate variations, with average temperatures ranging from -26°C in winter to 28°C in summer (Government of Ontario, 2023).
Canada’s primary energy production is sourced from various sources, with natural gas and crude oil accounting for approximately 82% of the total. Of the secondary energy consumed in the country, the largest portion is natural gas (31%), followed by electricity (20%). The industrial sector is the largest consumer of electricity in Canada, accounting for approximately 40% of total use (Canada Energy Regulator, 2021).
Though Quebec accounts for the majority of electricity consumption in Canada (37.5%), Ontario is the second largest user (25.6%) (National Energy Board, 2021). In 2019, Ontario largely discontinued the production of electricity from coal plants and was producing the majority of its electricity from zero-carbon sources, such as nuclear (59%), hydro (24%), wind (8%), and solar (1%) (Canada Energy Regulator, 2021). The proper management and forecasting of energy demand is crucial in ensuring the availability and reliability of sufficient generating capacity in the future. Understanding the energy needs of Ontario, Canada’s most populous province, is critical for understanding the energy needs of the entire country.
The raw data for this project is available at https://ssc.ca/en/case-study/predicting-hourly-electricity-demand-ontario. The dataset includes hourly electricity demand and various weather variables. Specifically, the variables in the dataset are:
The dataset comprises 122,736 hourly observations spanning from January 1, 2003 to December 31, 2016.
To start, I will install Pacman to manage the packages needed for this project in a more intuitive way. Pacman will verify if a required package is already installed and will install it automatically if it is not.
pacman::p_load(readxl,ggplot2,tidyr,dplyr,lubridate,weathermetrics,recipes,
patchwork,gower,ipred,parsnip,earth,xgboost,ranger,
workflows,StatMatch,hardhat,glmnet,purrr,tibble,broom,
yardstick,gt,openxlsx,forecast,smooth,astsa)
Sys.setenv(TZ = "GMT") #Set system timezoneNext, I will import the energy demand data, examine its dimensions, verify the data type, and view the first five records.
#Electricity Time Series
#library("readxl")
# xlsx files
elec_data <- read_excel("SSC2020_hourly_demand.xlsx",sheet=2)
dim(elec_data)## [1] 122736 5
str(elec_data)## tibble [122,736 × 5] (S3: tbl_df/tbl/data.frame)
## $ Date : POSIXct[1:122736], format: "2003-01-01" "2003-01-01" ...
## $ Hour : num [1:122736] 1 2 3 4 5 6 7 8 9 10 ...
## $ Total Energy Use from Electricity (MW): num [1:122736] 14745 14280 13821 13239 13236 ...
## $ Year : num [1:122736] 2003 2003 2003 2003 2003 ...
## $ Month : num [1:122736] 1 1 1 1 1 1 1 1 1 1 ...
head(elec_data,5)## # A tibble: 5 × 5
## Date Hour `Total Energy Use from Electricity (MW)` Year Month
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2003-01-01 00:00:00 1 14745 2003 1
## 2 2003-01-01 00:00:00 2 14280 2003 1
## 3 2003-01-01 00:00:00 3 13821 2003 1
## 4 2003-01-01 00:00:00 4 13239 2003 1
## 5 2003-01-01 00:00:00 5 13236 2003 1
The dataset contains 122,736 observations. Upon inspection, the ‘Date’ variable is not in the proper date format, so it must be combined with the ‘Hour’ variable using as.POSIXct. To correct the 24-hour format, we will shift the hour back by 1. The result will be the new variable ‘datetime,’ which is now in the correct format of ‘%Y-%m-%d %H:%M:%S.’ After examining ‘datetime,’ we can see that it is exactly as desired.
head(elec_data,5)## # A tibble: 5 × 7
## Date Hour Total Ener…¹ Year Month Hour_…² datetime
## <dttm> <dbl> <dbl> <dbl> <dbl> <chr> <dttm>
## 1 2003-01-01 00:00:00 1 14745 2003 1 00 2003-01-01 00:00:00
## 2 2003-01-01 00:00:00 2 14280 2003 1 01 2003-01-01 01:00:00
## 3 2003-01-01 00:00:00 3 13821 2003 1 02 2003-01-01 02:00:00
## 4 2003-01-01 00:00:00 4 13239 2003 1 03 2003-01-01 03:00:00
## 5 2003-01-01 00:00:00 5 13236 2003 1 04 2003-01-01 04:00:00
## # … with abbreviated variable names ¹`Total Energy Use from Electricity (MW)`,
## # ²Hour_new
Since the dataset has few variables, I used the ‘datetime’ variable to generate additional variables that may be useful later. I created three variables to represent the time of day, day of the week, and whether the day is a weekday or weekend. I also created a variable to identify the season of the year. After generating these variables, I checked for missing values and potential outliers. There may be some issues with the minimum and maximum values compared to the first and third quartiles of MW, but this will be evaluated visually later. As there are no missing values and the dataset now has more variables, I display a ‘glimpse’ of the new data by listing the variables in rows instead of columns, as seen when using the ‘head’ command.
elec_data$datetime <- as.POSIXct(elec_data$datetime, '%Y-%m-%d %H:%M:%S', tz = "GMT")
#Sort by date (not initially in ascending order)
#elec_data <- elec_data[order(elec_data$datetime),]
#elec_data <- elec_data[elec_data$datetime >= '2012-10-01' & elec_data$datetime <= '2017-09-30',]
elec_data$ymd <- format(as.POSIXct(elec_data$datetime,format='%Y-%m-%d %H:%M:%S'),format='%Y-%m-%d')
elec_data$ymd <- as.Date(elec_data$ymd)
elec_data$day <- weekdays(elec_data$datetime) # Convert dates to weekdays
factor(levels = c("Weekday", "Weekend"))
# Add categorical variable to the data frame
elec_data$daytype <- as.factor(ifelse(elec_data$day =='Saturday', 'Weekend',
ifelse(elec_data$day =='Sunday', 'Weekend',
'Weekday')))
str(elec_data)
# #Create seasons
season_dates <- as.Date(sort(c(outer(
do.call(seq.int, as.list(1900 + as.POSIXlt(range(elec_data$Date) + c(-365, 365))$year)),
c("-03-21", "-06-21", "-09-21", "-12-21"),
paste0))))
season_names <- rep(c("Spring", "Summer", "Autumn", "Winter"), length.out = length(season_dates))
elec_data <- as_tibble(elec_data) %>%
mutate(season = season_names[ findInterval(ymd, season_dates) ])
elec_data$season <- as.factor(elec_data$season)
levels(elec_data$season) #List categoriessummary(elec_data)## Date Hour
## Min. :2003-01-01 00:00:00 Min. : 1.00
## 1st Qu.:2006-07-02 00:00:00 1st Qu.: 6.75
## Median :2009-12-31 12:00:00 Median :12.50
## Mean :2009-12-31 12:00:00 Mean :12.50
## 3rd Qu.:2013-07-02 00:00:00 3rd Qu.:18.25
## Max. :2016-12-31 00:00:00 Max. :24.00
## Total Energy Use from Electricity (MW) Year Month
## Min. : 2270 Min. :2003 Min. : 1.000
## 1st Qu.:14610 1st Qu.:2006 1st Qu.: 4.000
## Median :16488 Median :2010 Median : 7.000
## Mean :16562 Mean :2010 Mean : 6.522
## 3rd Qu.:18358 3rd Qu.:2013 3rd Qu.:10.000
## Max. :27005 Max. :2016 Max. :12.000
## Hour_new datetime ymd
## Length:122736 Min. :2003-01-01 00:00:00 Min. :2003-01-01
## Class :character 1st Qu.:2006-07-02 11:45:00 1st Qu.:2006-07-02
## Mode :character Median :2009-12-31 23:30:00 Median :2009-12-31
## Mean :2009-12-31 23:30:00 Mean :2009-12-31
## 3rd Qu.:2013-07-02 11:15:00 3rd Qu.:2013-07-02
## Max. :2016-12-31 23:00:00 Max. :2016-12-31
## day daytype season
## Length:122736 Weekday:87672 Autumn:30576
## Class :character Weekend:35064 Spring:30912
## Mode :character Summer:30912
## Winter:30336
##
##
glimpse(elec_data)## Rows: 122,736
## Columns: 11
## $ Date <dttm> 2003-01-01, 2003-01-01, 2003…
## $ Hour <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
## $ `Total Energy Use from Electricity (MW)` <dbl> 14745, 14280, 13821, 13239, 1…
## $ Year <dbl> 2003, 2003, 2003, 2003, 2003,…
## $ Month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Hour_new <chr> "00", "01", "02", "03", "04",…
## $ datetime <dttm> 2003-01-01 00:00:00, 2003-01…
## $ ymd <date> 2003-01-01, 2003-01-01, 2003…
## $ day <chr> "Wednesday", "Wednesday", "We…
## $ daytype <fct> Weekday, Weekday, Weekday, We…
## $ season <fct> Winter, Winter, Winter, Winte…
Some of the variables need to be converted to dummy variables for use in the models, but before doing so, let’s examine the weather dataset. Similar to the electricity data, we will examine its dimensions, check the variable types, look for missing values, and view the first few observations.
# Weather Time Series
# library("readxl")
# xlsx files
wea_data <- read_excel("SSC2020_hourly_weather.xlsx",sheet=2)
dim(wea_data)
str(wea_data)
summary(wea_data)
# library(weathermetrics)
wea_data$temperature <- celsius.to.kelvin(wea_data$temperature, round = 2)
# wea_data <- subset(wea_data, select = c('time','temperature'))
# library(dplyr)
wea_data <- rename(wea_data, datetime = time)dim(wea_data)## [1] 122736 9
str(wea_data)## tibble [122,736 × 9] (S3: tbl_df/tbl/data.frame)
## $ datetime : POSIXct[1:122736], format: "2003-01-01 00:00:00" "2003-01-01 01:00:00" ...
## $ precipitation : num [1:122736] 0.01 0.0022 0.0014 0.0013 0.0011 0.0012 0.0062 0.0082 0.0067 0.0074 ...
## $ temperature : num [1:122736] 271 271 271 270 270 ...
## $ irradiance_surface: num [1:122736] 0 0 0 0 0 0 0 0 0 0 ...
## $ irradiance_toa : num [1:122736] 0 0 0 0 0 0 0 0 0 0 ...
## $ snowfall : num [1:122736] 0.0011 0.001 0.0009 0.0008 0.0009 0.0012 0.0062 0.0082 0.0067 0.0074 ...
## $ snow_depth : num [1:122736] 17.4 17.4 17.4 17.4 17.4 ...
## $ cloud_cover : num [1:122736] 0.32 0.317 0.296 0.374 0.607 ...
## $ air_density : num [1:122736] 1.26 1.26 1.27 1.27 1.27 ...
So far, everything looks good and both datasets have the expected sample size, but we need to merge the data using the ‘datetime’ variable. However, the weather dataset uses ‘time’ instead of ‘datetime.’ To ensure a successful merge, we will rename ‘time’ to ‘datetime’ in the weather dataset.
glimpse(wea_data)## Rows: 122,736
## Columns: 9
## $ datetime <dttm> 2003-01-01 00:00:00, 2003-01-01 01:00:00, 2003-01-…
## $ precipitation <dbl> 0.0100, 0.0022, 0.0014, 0.0013, 0.0011, 0.0012, 0.0…
## $ temperature <dbl> 271.43, 271.10, 270.75, 270.42, 270.07, 269.72, 269…
## $ irradiance_surface <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ irradiance_toa <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ snowfall <dbl> 0.0011, 0.0010, 0.0009, 0.0008, 0.0009, 0.0012, 0.0…
## $ snow_depth <dbl> 17.4309, 17.4307, 17.4304, 17.4302, 17.4300, 17.429…
## $ cloud_cover <dbl> 0.3196, 0.3167, 0.2958, 0.3745, 0.6073, 0.7824, 0.8…
## $ air_density <dbl> 1.2612, 1.2644, 1.2669, 1.2695, 1.2724, 1.2754, 1.2…
With both datasets now having the same ‘datetime’ variable to merge on, I can proceed with the merge, rename the temperature variable to a shorter name, and check its length, distribution, and missing values. It is important to note that the temperature variable has been converted to the Kelvin scale, which provides a more meaningful interpretation as the Kelvin scale has an absolute zero.
#Merge datasets
#Ensure that both dataset has same variable name
# library(dplyr)
elec_data <- rename(elec_data, MW = "Total Energy Use from Electricity (MW)")#Rename variables
wea_data_temp <- wea_data
data <- merge(elec_data, wea_data, by.x = 'datetime', by.y = 'datetime')
#Rename temperature variable
data <- rename(data, temp = temperature)
dim(data)
str(data)
summary(data)glimpse(data)## Rows: 122,736
## Columns: 19
## $ datetime <dttm> 2003-01-01 00:00:00, 2003-01-01 01:00:00, 2003-01-…
## $ Date <dttm> 2003-01-01, 2003-01-01, 2003-01-01, 2003-01-01, 20…
## $ Hour <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ MW <dbl> 14745, 14280, 13821, 13239, 13236, 13504, 13814, 13…
## $ Year <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 200…
## $ Month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Hour_new <chr> "00", "01", "02", "03", "04", "05", "06", "07", "08…
## $ ymd <date> 2003-01-01, 2003-01-01, 2003-01-01, 2003-01-01, 20…
## $ day <chr> "Wednesday", "Wednesday", "Wednesday", "Wednesday",…
## $ daytype <fct> Weekday, Weekday, Weekday, Weekday, Weekday, Weekda…
## $ season <fct> Winter, Winter, Winter, Winter, Winter, Winter, Win…
## $ precipitation <dbl> 0.0100, 0.0022, 0.0014, 0.0013, 0.0011, 0.0012, 0.0…
## $ temp <dbl> 271.43, 271.10, 270.75, 270.42, 270.07, 269.72, 269…
## $ irradiance_surface <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ irradiance_toa <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ snowfall <dbl> 0.0011, 0.0010, 0.0009, 0.0008, 0.0009, 0.0012, 0.0…
## $ snow_depth <dbl> 17.4309, 17.4307, 17.4304, 17.4302, 17.4300, 17.429…
## $ cloud_cover <dbl> 0.3196, 0.3167, 0.2958, 0.3745, 0.6073, 0.7824, 0.8…
## $ air_density <dbl> 1.2612, 1.2644, 1.2669, 1.2695, 1.2724, 1.2754, 1.2…
With the merged dataset appearing to be in good shape, it’s important to remember that the electricity data series may contain outliers. We check the minimum values and find some unusually small outliers in the 2003 data, confirming our suspicion. We will investigate this further in the next section with a graphical analysis.
#Examine electricity data and refine period
#Identify minimum value across each year
aggregate(MW ~ cbind(datetime = format(as.Date(datetime, format = "%Y-%m-%d %H:%M:%S"), "%Y")),
FUN = min,
data = data)## datetime MW
## 1 2003 2270
## 2 2004 11983
## 3 2005 11950
## 4 2006 11621
## 5 2007 11699
## 6 2008 11450
## 7 2009 10678
## 8 2010 10618
## 9 2011 10799
## 10 2012 10998
## 11 2013 10765
## 12 2014 10719
## 13 2015 10539
## 14 2016 10461
Next, we will create additional variables required for each of our models, including the time of day, day of week, and time of year. We need to format these variables so that we can convert them into dummy variables using the model.matrix command. The model.matrix command generates a large number of dummy variables, so we will need to make sure that only the necessary variables are included in the final dataset.
#Check variable type
#check data type of every variable in data frame
str(data)
data$daytype <- as.factor(data$daytype)
levels(data$daytype) #List categories
str(data)
#Check for missing data
summary(data)
cbind(
lapply(
lapply(data, is.na)
, sum))
sum(is.na(data$temp))
#Impute missing values
#data1 <- data[complete.cases(data), ] # For choosing cases without missed items
data1 <- data # Duplicate data frame
data1$temp[is.na(data1$temp)] <- mean(data1$temp, na.rm = TRUE) # Replace NA in one column
data1 # Print updated data frame
#Create additional variables
# library(recipes)
data1$hour_of_day <- strftime(data1$datetime,'%H', tz = "UTC")
data1$day_of_week <- strftime(data1$datetime,'%u', tz = "UTC")
data1$time_of_year <- as.character(unclass(data1$season))
data1$day_in_year = yday(data1$datetime)
data1$year_half = lubridate::semester(data1$datetime) %>% as.factor
data1$week_day = lubridate::wday(data1$datetime)
data1$week_in_year = lubridate::week(data1$datetime)
data1$date_christmas = ifelse(between(data1$day_in_year, 358, 366),1, 0)
holidaysa = timeDate::listHolidays("CA")
holiday_rec <- recipe(~datetime, data1) %>%
step_holiday(datetime, holidays = holidaysa,keep_original_cols = FALSE)
holiday_rec <- prep(holiday_rec, training = data1)
holiday_values <- bake(holiday_rec, new_data = data1)
str(holiday_values)
#Use Canada Day as reference category
data1$holiday_values <- subset(holiday_values, select=-c(1))str(data1)## 'data.frame': 122736 obs. of 28 variables:
## $ datetime : POSIXct, format: "2003-01-01 00:00:00" "2003-01-01 01:00:00" ...
## $ Date : POSIXct, format: "2003-01-01" "2003-01-01" ...
## $ Hour : num 1 2 3 4 5 6 7 8 9 10 ...
## $ MW : num 14745 14280 13821 13239 13236 ...
## $ Year : num 2003 2003 2003 2003 2003 ...
## $ Month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Hour_new : chr "00" "01" "02" "03" ...
## $ ymd : Date, format: "2003-01-01" "2003-01-01" ...
## $ day : chr "Wednesday" "Wednesday" "Wednesday" "Wednesday" ...
## $ daytype : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
## $ season : Factor w/ 4 levels "Autumn","Spring",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ precipitation : num 0.01 0.0022 0.0014 0.0013 0.0011 0.0012 0.0062 0.0082 0.0067 0.0074 ...
## $ temp : num 271 271 271 270 270 ...
## $ irradiance_surface: num 0 0 0 0 0 0 0 0 0 0 ...
## $ irradiance_toa : num 0 0 0 0 0 0 0 0 0 0 ...
## $ snowfall : num 0.0011 0.001 0.0009 0.0008 0.0009 0.0012 0.0062 0.0082 0.0067 0.0074 ...
## $ snow_depth : num 17.4 17.4 17.4 17.4 17.4 ...
## $ cloud_cover : num 0.32 0.317 0.296 0.374 0.607 ...
## $ air_density : num 1.26 1.26 1.27 1.27 1.27 ...
## $ hour_of_day : chr "00" "01" "02" "03" ...
## $ day_of_week : chr "3" "3" "3" "3" ...
## $ time_of_year : chr "4" "4" "4" "4" ...
## $ day_in_year : num 1 1 1 1 1 1 1 1 1 1 ...
## $ year_half : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ week_day : num 4 4 4 4 4 4 4 4 4 4 ...
## $ week_in_year : num 1 1 1 1 1 1 1 1 1 1 ...
## $ date_christmas : num 0 0 0 0 0 0 0 0 0 0 ...
## $ holiday_values : tibble [122,736 × 4] (S3: tbl_df/tbl/data.frame)
## ..$ datetime_CACivicProvincialHoliday: int 0 0 0 0 0 0 0 0 0 0 ...
## ..$ datetime_CALabourDay : int 0 0 0 0 0 0 0 0 0 0 ...
## ..$ datetime_CAThanksgivingDay : int 0 0 0 0 0 0 0 0 0 0 ...
## ..$ datetime_CAVictoriaDay : int 0 0 0 0 0 0 0 0 0 0 ...
#Check for missing data again
summary(data1)## datetime Date
## Min. :2003-01-01 00:00:00.00 Min. :2003-01-01 00:00:00.00
## 1st Qu.:2006-07-02 12:45:00.00 1st Qu.:2006-07-02 00:00:00.00
## Median :2009-12-31 23:30:00.00 Median :2009-12-31 12:00:00.00
## Mean :2010-01-01 00:05:09.02 Mean :2009-12-31 12:35:09.02
## 3rd Qu.:2013-07-02 12:15:00.00 3rd Qu.:2013-07-02 00:00:00.00
## Max. :2016-12-31 23:00:00.00 Max. :2016-12-31 00:00:00.00
## Hour MW Year Month
## Min. : 1.00 Min. : 2270 Min. :2003 Min. : 1.000
## 1st Qu.: 6.75 1st Qu.:14610 1st Qu.:2006 1st Qu.: 4.000
## Median :12.50 Median :16488 Median :2010 Median : 7.000
## Mean :12.50 Mean :16562 Mean :2010 Mean : 6.523
## 3rd Qu.:18.25 3rd Qu.:18358 3rd Qu.:2013 3rd Qu.:10.000
## Max. :24.00 Max. :27005 Max. :2016 Max. :12.000
## Hour_new ymd day daytype
## Length:122736 Min. :2003-01-01 Length:122736 Weekday:87672
## Class :character 1st Qu.:2006-07-02 Class :character Weekend:35064
## Mode :character Median :2009-12-31 Mode :character
## Mean :2009-12-31
## 3rd Qu.:2013-07-02
## Max. :2016-12-31
## season precipitation temp irradiance_surface
## Autumn:30590 Min. :0.00000 Min. :245.8 Min. : 0.000
## Spring:30898 1st Qu.:0.00510 1st Qu.:272.2 1st Qu.: 0.000
## Summer:30912 Median :0.02070 Median :281.1 Median : 8.617
## Winter:30336 Mean :0.08838 Mean :280.9 Mean :170.239
## 3rd Qu.:0.08050 3rd Qu.:290.3 3rd Qu.:290.086
## Max. :4.68710 Max. :307.0 Max. :994.833
## irradiance_toa snowfall snow_depth cloud_cover
## Min. : 0.00 Min. :0.00000 Min. : 0.0000 Min. :0.0001
## 1st Qu.: 0.00 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.2653
## Median : 37.46 Median :0.00010 Median : 0.4858 Median :0.5729
## Mean : 311.13 Mean :0.01756 Mean : 6.7797 Mean :0.5447
## 3rd Qu.: 601.74 3rd Qu.:0.00630 3rd Qu.:16.3431 3rd Qu.:0.8339
## Max. :1229.86 Max. :2.21790 Max. :27.9880 Max. :0.9970
## air_density hour_of_day day_of_week time_of_year
## Min. :1.103 Length:122736 Length:122736 Length:122736
## 1st Qu.:1.181 Class :character Class :character Class :character
## Median :1.221 Mode :character Mode :character Mode :character
## Mean :1.226
## 3rd Qu.:1.266
## Max. :1.412
## day_in_year year_half week_day week_in_year date_christmas
## Min. : 1.0 1:60898 Min. :1.000 Min. : 1.0 Min. :0.00000
## 1st Qu.: 92.0 2:61838 1st Qu.:2.000 1st Qu.:14.0 1st Qu.:0.00000
## Median :183.0 Median :4.000 Median :27.0 Median :0.00000
## Mean :183.2 Mean :4.001 Mean :26.6 Mean :0.02268
## 3rd Qu.:275.0 3rd Qu.:6.000 3rd Qu.:40.0 3rd Qu.:0.00000
## Max. :366.0 Max. :7.000 Max. :53.0 Max. :1.00000
## holiday_values.datetime_CACivicProvincialHoliday holiday_values.datetime_CALabourDay holiday_values.datetime_CAThanksgivingDay holiday_values.datetime_CAVictoriaDay
## Min. :0.0000000 Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
## 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000
## Median :0.0000000 Median :0.0000000 Median :0.0000000 Median :0.0000000
## Mean :0.0027376 Mean :0.0027376 Mean :0.0027376 Mean :0.0027376
## 3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
## Max. :1.0000000 Max. :1.0000000 Max. :1.0000000 Max. :1.0000000
The table above shows that all our variables have been modified to be ready for analysis. The ‘datetime’ variable is in a date format and all other variables, including dummies, are in numerical format.
I start the exploratory data analysis by focusing on the target variable, hourly household electricity demand in MW. Figure 3.1 shows the evolution of household electricity demand in Ontario from 2003 to 2016. In panel (a), the entire dataset is included, while panel (b) covers the period 2004 to 2016. The x-axis of both panels shows the time period, while the y-axis represents the hourly electricity demand in megawatts (MW).
In both panels, the graph exhibits a clear seasonal pattern with peaks occurring during the summer and winter months and troughs during the spring and fall months. The demand for electricity is highest in the summer months, specifically in July and August, which may be attributed to increased use of air conditioning.
# library(ggplot2)
# library(patchwork)
# library(tidyr)
# library(tidyverse)
# library(lubridate)
# Select specific columns from the data frame
plot1 <- data1[, c("datetime", "MW")]
#Exclude 2003 and use this dataset going forward
data1 <- data1[data1$datetime >= '2004-01-01',]
# Select specific columns from the data frame
plot2 <- data1[, c("datetime", "MW")]
bind_rows(plot1 %>% mutate(plot_c = "a"),
plot2 %>% mutate(plot_c = "b")) %>%
ggplot(aes(x = datetime, y = MW)) +
geom_line(color="darkred") +
facet_wrap(~plot_c, ncol = 1, scale = "free_y")Figure 3.1: The evolution of hourly household electricity demand (MW) in Ontario. Panel (a) includes the entire dataset: 2003-2016. Panel (b) covers the period 2004-2016.
To see the seasonal pattern in electricity demand more clearly, we present the hourly variation in household electricity demand in Ontario for each day of 2016 in Figure 3.2. From the graph, we can observe that electricity demand is generally higher in the winter months (December to February) and the summer months (June to August), and lower in the spring (March to May) and fall (September to November). This is likely due to the increased use of heating and cooling systems during the extreme temperature months.
The peak demand is observed during the summer months (June to August), which could be due to the use of air conditioning systems to cope with the high temperatures. In contrast, the lowest demand is observed in the spring (March to May), when the temperatures are mild, and the use of heating and cooling systems is minimal.
#Power consumption in 2016 only
ggplot(data = data1[data1$datetime >= '2016-01-01' & data1$datetime <= '2016-12-31',], aes(x = datetime, y = MW))+
geom_line(color = "#7843a6", linewidth = 0.5) +
xlab('Date') + ylab('Consumption in MW')Figure 3.2: Household electricity demand (MW) in 2016.
Figure 3.3 compares the electricity demand on weekdays (Monday to Friday) and weekends (Saturday and Sunday) throughout the year in 2016. The x-axis shows the days of the year, while the y-axis shows the electricity demand in MW.
The graph shows a clear difference between the electricity demand on weekdays and weekends. The demand is higher on weekdays than on weekends for most of the year, with a few exceptions. During the summer months of June, July, and August, the demand on weekends is higher than on weekdays, which is likely due to increased use of air conditioning units in homes during the hot weather.
# Power consumption in 2016 only Multiple line plot
ggplot(data = data1[data1$datetime >= '2016-01-01' & data1$datetime <= '2016-12-31',], aes(x = datetime, y = MW, group = daytype, colour = daytype)) +
geom_line(alpha = 0.4)+facet_wrap(~ daytype)+facet_grid(daytype ~ .)+
labs(x = "Day Of The Year (1-365)",
y = "Daily Consumption",
colour = NULL)Figure 3.3: Weekday vs weekend electricity demand (MW) in 2016.
The side-by-side boxplot in Figure 3.4 shows a comparison of the distribution of hourly electricity demand in megawatts (MW) between weekdays and weekends in Ontario during the year 2016. The left boxplot represents the distribution of hourly electricity demand on weekdays, while the right boxplot represents the distribution on weekends.
The boxplots show that the median demand on weekdays is higher than on weekends. The interquartile range (IQR) for both weekdays and weekends is similar, with the demand being more spread out on weekdays than on weekends. The upper whisker on weekdays is longer than on weekends, indicating that the demand on weekdays can be much higher than on weekends.
The weekdays boxplot also has a larger number of outliers, with the demand on some hours being significantly higher than on other hours. This is especially evident during the morning and evening peak hours, where the demand is much higher on weekdays than on weekends. On the other hand, the weekends boxplot has fewer outliers, indicating that the demand is relatively consistent throughout the day.
# Power consumption in 2014 only Multiple line plot
ggplot(data = data1[data1$datetime >= '2016-01-01' & data1$datetime <= '2016-12-31',], aes(x = daytype, y = MW, group = daytype, colour = daytype)) +
geom_boxplot() +
labs(title = "", x = "Part of week") +
scale_color_discrete(name = "")Figure 3.4: Boxplot comparison of weekday and weekend electricity demand (MW) in 2016.
The boxplot in Figure 3.5 show the variation in the distribution of electricity demand at different times of the day across different seasons. The x-axis represents the time of day, and the y-axis represents the electricity demand in MW. The boxplots in the figure are grouped by season: winter (Dec-Feb), spring (Mar-May), summer (Jun-Aug), and fall (Sep-Nov).
The boxplots show that the median demand for electricity is generally higher in the morning and evening hours, with a dip in demand during the afternoon hours. Additionally, the demand for electricity is generally higher during the winter and summer seasons compared to spring and fall.
For example, in the winter season, the demand for electricity is highest at 7 pm, and the demand is lowest at around 2 am. The demand for electricity during the summer season is generally higher than during the winter season, with the highest demand occurring at 5 pm.
The boxplots also show that there is more variability in the demand for electricity during the winter and summer seasons, as indicated by the larger interquartile ranges (IQR) and the longer whiskers, compared to the spring and fall seasons.
#Scatterplot
#Energy demand by the hour, 2016
ggplot(data1[data1$datetime >= '2016-01-01' & data1$datetime <= '2016-12-31',], aes(x = as.factor(Hour_new), y = MW, colour = season)) +
geom_boxplot() + facet_wrap(~ season)+
labs(x = "Hour of Day") +
scale_x_discrete(breaks=c("00","06", "12","18", "23")) +
# scale_x_discrete(limits=c("1", "24")) +
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 8), # rotate x axis text
panel.grid.minor = element_blank()) # turn off minor gridFigure 3.5: Seasonal load demand profiles of electricity by time-of-day, 2016
In addition to electricity demand, we also have several weather variables that can be included in our modeling analysis. Figure 3.6 show the variation in several weather variables over time. The top panels show air density (kg/m^3), cloud cover (fraction [0,1]) and irradiance surface (W/m^2), the middle panel shows irradiance top of the atmosphere ((W/m^2), precipitation (mm/hr), and snow depth (mm) and the bottom panel shows snowfall (mm/hr) and temperature in degrees Celsius. The x-axis in all panels represents time, while the y-axis in each panel represents the corresponding weather variable.
The plots suggest that air density, temperature, and both irradiance measures have some seasonal patterns, while the other variables do not. To avoid over-complexifying our models and for demonstration purposes, we will only use temperature as an external predictor.
The temperature panel shows that the temperature varies over the year with the highest temperature observed in July and August and the lowest temperature in January and February. It also shows occasional dips in temperature, likely due to the occurrence of weather events like cold fronts and storms.
#Hourly weather data in 2016
data1 %>%
filter(datetime >= '2016-01-01') %>%
select(datetime, temp, precipitation, irradiance_surface,
irradiance_toa, snowfall, snow_depth, cloud_cover, air_density) %>%
pivot_longer(-datetime) %>%
ggplot(aes(datetime, value, colour = name)) +
geom_line(linewidth = 0.5, alpha = 0.8) +
facet_wrap(~ name, scales = "free_y") +
labs(title = "Hourly weather data in 2016",
y = "Unit of measurement",
x = NULL,
colour = NULL)+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8), # rotate x axis text
panel.grid.minor = element_blank()) # turn off minor gridFigure 3.6: Hourly weather data in 2016
Figure 3.7 consists of a set of scatterplots that show the relationship between various weather variables and household electricity demand. Each scatterplot shows the values of the weather variable on the x-axis and the corresponding values of electricity demand on the y-axis.
By examining the scatterplots, we can see that there is generally a non-linear relationship between temperature and electricity demand, with a minimum point around 285K (~12°C). It’s likely that a non-linear model would be more effective in predicting energy demand. The relationship between air density and electricity demand also appears to be non-linear, while the relationship between electricity demand and the other four variables appears to be weak or non-existent.
Overall, the scatterplots in 3.7 suggest that temperature is a more important driver of household electricity demand than air density, at least in the context of the data used in this analysis. However, it is important to note that other factors, such as the prevalence of air conditioning or heating systems, may also play a role in determining electricity demand.
#Scatterplot
data1 %>%
select(datetime, MW, temp, precipitation, irradiance_surface,
irradiance_toa, snowfall, snow_depth, cloud_cover, air_density) %>%
pivot_longer(-c(datetime, MW)) %>%
ggplot(aes(value, MW)) +
geom_point(color = "blue", alpha = 1/10) +
geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs", k = 4), col="purple") +
facet_wrap(~ name, scales = "free_x") +
labs(y = "MW",
x = "Unit of measurement",
colour = NULL)Figure 3.7: Effect of various weather variables on household energy demand.
With the data now in a format suitable for different algorithms, we can start the model development process. Figure 4.1 is a plot showing the split between training and testing data sets used to build and evaluate a model to predict household electricity demand. The plot shows a time series of hourly electricity demand data, with the blue line representing the period from 2004 to 2015 and the red line line representing the subset of data used for testing the model.
The training dataset consists of the first 12 years of data, from 2004 to 2015, and the testing dataset consists of the final year of data, 2016. The split between training and testing data is important for building and evaluating models to ensure that the model is not simply memorizing the data it has seen, but is able to generalize to new data. The goal is to use pre-2016 data to parameterize our models and then evaluate their accuracy by testing the 2016 predictions against actual 2016 data.
In this case, the model will be trained on the training dataset and evaluated on the testing dataset to determine its accuracy in predicting household electricity demand. The plot shows that the split between training and testing data is based on time, which is appropriate for time series data such as electricity demand. The split also allows for the testing of the model on more recent data to ensure that it is still accurate and relevant.
# library(ggplot2)
bind_rows(
dt_train %>% mutate(id = "training"),
dt_test %>% mutate(id = "testing")
) %>%
ggplot(aes(datetime, MW, colour = id)) +
geom_line() +
scale_colour_manual(values = c("red", "lightblue")) +
labs(y = "Demand (MW)",
x = NULL,
colour = NULL) +
theme_light() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"),
plot.caption = element_text(face = "italic", size = 9,
colour = "grey50"),
legend.position = "right")Figure 4.1: Plot showing training/test data split.
After the training and testing split, we use the `recipes’ package to specify the dependent (output) variable and the predictors we plan to use.
#Set up variables
# library(recipes)
# library(gower)
# library(ipred)
dt_train <- dt_train[,-c(1)] #(keep only necessary variables)
model_rec <- recipe(MW ~ ., data = dt_train) %>%
step_zv(all_predictors())
#View list of final variables
model_rec[["var_info"]][["variable"]]## [1] "temp" "hour_of_day01" "hour_of_day02" "hour_of_day03"
## [5] "hour_of_day04" "hour_of_day05" "hour_of_day06" "hour_of_day07"
## [9] "hour_of_day08" "hour_of_day09" "hour_of_day10" "hour_of_day11"
## [13] "hour_of_day12" "hour_of_day13" "hour_of_day14" "hour_of_day15"
## [17] "hour_of_day16" "hour_of_day17" "hour_of_day18" "hour_of_day19"
## [21] "hour_of_day20" "hour_of_day21" "hour_of_day22" "hour_of_day23"
## [25] "day_of_week2" "day_of_week3" "day_of_week4" "day_of_week5"
## [29] "day_of_week6" "day_of_week7" "time_of_year2" "time_of_year3"
## [33] "time_of_year4" "MW"
rf_rec <- recipe(MW ~ ., data = dt_train) %>%
step_zv(all_predictors())To have a consistent approach in building our model, we use the `parsnip’ package. We first specify the model type: linear regression (LM), Multivariate Adaptive Regression Splines (MARS), Gradient Boosting (GB), and Random Forest (RF). Then we choose the fitting engine, in this case two types of linear regression models: classical linear regression (CLM) and generalized linear regression (GLM). CLM is fitted using ordinary least squares while GLM is fitted through maximum likelihood estimation. Finally, we declare the mode of the model, which is regression as our dependent variable is numeric. If the dependent variable were qualitative, the mode would be classification.
#Specify 5 models: linear, elastic net, mars (segmented), gradient boosting, random forest
# library(parsnip)
lm_spec <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
# library(parsnip)
en_spec <- linear_reg(penalty = 0.02, mixture = 0.5) %>%
set_engine('glmnet') %>%
set_mode("regression")
# library(earth)
mars_spec <- mars(prod_degree = 1) %>% #<- No interaction terms/hinge function
set_engine("earth") %>%
set_mode("regression")
# library(xgboost,warn.conflicts = FALSE)
xg_spec <- boost_tree(trees = 1000) %>%
set_engine("xgboost") %>%
set_mode("regression")
# library(parsnip)
# library(ranger)
rf_spec <- rand_forest(trees = 500) %>%
set_engine("ranger") %>%
set_mode("regression")With the model specifications in place, we use the `workflows’ function to streamline and evaluate our models.
#Fit models
# library(workflows)
en_wf_fit <- workflow() %>%
add_recipe(model_rec) %>%
add_model(en_spec) %>%
fit(dt_train)
lm_wf_fit <- workflow() %>%
add_recipe(model_rec) %>%
add_model(lm_spec) %>%
fit(dt_train)
mars_wf_fit <- workflow() %>%
add_recipe(model_rec) %>%
add_model(mars_spec) %>%
fit(dt_train)
xg_wf_fit <- workflow() %>%
add_recipe(model_rec) %>%
add_model(xg_spec) %>%
fit(dt_train)
rf_wf_fit <- workflow() %>%
add_recipe(rf_rec) %>%
add_model(rf_spec) %>%
fit(dt_train)I have now obtained my fitted models and let’s examine some of the calculated model coefficients. As anticipated, CLM and GLM yield the same results due to the assumption of normality in residuals.
#View model coefficients
lm_wf_fit %>% extract_fit_engine() %>% summary()##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6013.0 -1046.5 -104.0 897.5 8666.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27666.6935 224.4114 123.286 < 2e-16 ***
## temp -48.3844 0.7958 -60.800 < 2e-16 ***
## hour_of_day01 -429.9688 33.3395 -12.897 < 2e-16 ***
## hour_of_day02 -681.0083 33.3560 -20.416 < 2e-16 ***
## hour_of_day03 -737.1800 33.3720 -22.090 < 2e-16 ***
## hour_of_day04 -494.6656 33.3884 -14.816 < 2e-16 ***
## hour_of_day05 231.4434 33.4040 6.929 4.27e-12 ***
## hour_of_day06 1376.9789 33.4191 41.203 < 2e-16 ***
## hour_of_day07 2373.8403 33.4337 71.001 < 2e-16 ***
## hour_of_day08 2879.8110 33.4477 86.099 < 2e-16 ***
## hour_of_day09 3209.5465 33.4609 95.919 < 2e-16 ***
## hour_of_day10 3418.5828 33.4737 102.127 < 2e-16 ***
## hour_of_day11 3467.3702 33.4720 103.590 < 2e-16 ***
## hour_of_day12 3480.1794 33.4356 104.086 < 2e-16 ***
## hour_of_day13 3457.2372 33.3775 103.580 < 2e-16 ***
## hour_of_day14 3490.6896 33.3366 104.710 < 2e-16 ***
## hour_of_day15 3709.4177 33.3335 111.282 < 2e-16 ***
## hour_of_day16 4072.3201 33.3536 122.095 < 2e-16 ***
## hour_of_day17 4366.6796 33.3798 130.818 < 2e-16 ***
## hour_of_day18 4402.4604 33.4020 131.802 < 2e-16 ***
## hour_of_day19 4319.9038 33.4131 129.288 < 2e-16 ***
## hour_of_day20 3991.4730 33.4088 119.474 < 2e-16 ***
## hour_of_day21 3080.0907 33.3872 92.254 < 2e-16 ***
## hour_of_day22 1845.3427 33.3606 55.315 < 2e-16 ***
## hour_of_day23 736.4453 33.3400 22.089 < 2e-16 ***
## day_of_week2 290.4490 18.0029 16.133 < 2e-16 ***
## day_of_week3 276.3517 18.0031 15.350 < 2e-16 ***
## day_of_week4 185.7379 17.9956 10.321 < 2e-16 ***
## day_of_week5 -85.0636 18.0030 -4.725 2.30e-06 ***
## day_of_week6 -1350.0620 18.0045 -74.985 < 2e-16 ***
## day_of_week7 -1661.5022 18.0042 -92.284 < 2e-16 ***
## time_of_year2 -382.0071 13.9940 -27.298 < 2e-16 ***
## time_of_year3 1461.4892 17.8283 81.976 < 2e-16 ***
## time_of_year4 1028.5423 16.1971 63.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1560 on 105158 degrees of freedom
## Multiple R-squared: 0.6404, Adjusted R-squared: 0.6403
## F-statistic: 5676 on 33 and 105158 DF, p-value: < 2.2e-16
tidy(en_wf_fit) %>% print(n = Inf)## # A tibble: 34 × 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 (Intercept) 27684. 0.02
## 2 temp -48.1 0.02
## 3 hour_of_day01 -520. 0.02
## 4 hour_of_day02 -770. 0.02
## 5 hour_of_day03 -826. 0.02
## 6 hour_of_day04 -584. 0.02
## 7 hour_of_day05 135. 0.02
## 8 hour_of_day06 1281. 0.02
## 9 hour_of_day07 2277. 0.02
## 10 hour_of_day08 2783. 0.02
## 11 hour_of_day09 3113. 0.02
## 12 hour_of_day10 3322. 0.02
## 13 hour_of_day11 3371. 0.02
## 14 hour_of_day12 3384. 0.02
## 15 hour_of_day13 3361. 0.02
## 16 hour_of_day14 3394. 0.02
## 17 hour_of_day15 3612. 0.02
## 18 hour_of_day16 3975. 0.02
## 19 hour_of_day17 4269. 0.02
## 20 hour_of_day18 4305. 0.02
## 21 hour_of_day19 4222. 0.02
## 22 hour_of_day20 3894. 0.02
## 23 hour_of_day21 2983. 0.02
## 24 hour_of_day22 1749. 0.02
## 25 hour_of_day23 641. 0.02
## 26 day_of_week2 289. 0.02
## 27 day_of_week3 275. 0.02
## 28 day_of_week4 185. 0.02
## 29 day_of_week5 -82.7 0.02
## 30 day_of_week6 -1347. 0.02
## 31 day_of_week7 -1659. 0.02
## 32 time_of_year2 -384. 0.02
## 33 time_of_year3 1454. 0.02
## 34 time_of_year4 1028. 0.02
mars_wf_fit %>% extract_fit_engine() %>% summary()## Call: earth(formula=..y~., data=data, keepxy=TRUE, degree=~1)
##
## coefficients
## (Intercept) 12740.3780
## hour_of_day01 -276.4688
## hour_of_day02 -447.9077
## hour_of_day03 -460.8605
## hour_of_day04 -187.9333
## hour_of_day05 558.3398
## hour_of_day06 1719.5353
## hour_of_day07 2729.0460
## hour_of_day08 3244.7458
## hour_of_day09 3581.9089
## hour_of_day10 3796.9101
## hour_of_day11 3829.7609
## hour_of_day12 3789.8862
## hour_of_day13 3697.5914
## hour_of_day14 3632.0886
## hour_of_day15 3734.2450
## hour_of_day16 3987.2584
## hour_of_day17 4187.5600
## hour_of_day18 4150.2108
## hour_of_day19 4023.2870
## hour_of_day20 3682.1301
## hour_of_day21 2786.6812
## hour_of_day22 1594.4562
## hour_of_day23 581.3297
## day_of_week6 -1447.2022
## day_of_week7 -1767.5562
## time_of_year2 -568.9550
## h(temp-282.45) 148.4622
## h(286.94-temp) 155.5844
## h(temp-286.94) 55.8565
## h(temp-298.74) 396.2347
##
## Selected 31 of 31 terms, and 27 of 33 predictors
## Termination condition: RSq changed by less than 0.001 at 31 terms
## Importance: temp, day_of_week7, day_of_week6, hour_of_day02, hour_of_day03, ...
## Number of terms at each degree of interaction: 1 30 (additive model)
## GCV 1777668 RSS 186779614524 GRSq 0.7373786 RSq 0.7376782
Before evaluating the predictions from the five models, we will establish a baseline. The data has significant seasonality, so we will use a seasonal ARIMA (SARIMA) model, specifically SARIMA(0,0,0)(0,1,0)168, to set the baseline. The “m” of 168 for hourly data indicates a weekly seasonal pattern.
#Baseline seasonal naive predictions
evaluation_metrics <- metric_set(rsq, mae, rmse)
#generate seasonal naive forecasts
# snf_model <- sarima(data1$MW, nrow(dt_test),0,0, 0,1,0,168,no.constant = TRUE)
sa_data <- sarima.for(dt_train$MW, nrow(dt_test), 0,0,0,0,1,0,168,no.constant = TRUE)
snf_data <- dt_test[ , c('datetime','MW')]
pred <- sa_data[["pred"]]
pred_baseline <- snf_data %>%
mutate(pred) #1 week (in hrs)Figure 5.1 compares the seasonal naive prediction and observed test data for household electricity demand. The graph shows two lines: a red line representing the seasonal naive prediction and a blue line representing the observed test data.
The seasonal naive prediction is a simple forecasting method that uses the demand from the same season in the previous period as a predictor for the current period. In this case, we use a weekly period to represent the seasonal pattern. This method assumes that the demand in the current week will be similar to demand in the previous week. The observed test data, on the other hand, represents the actual demand values for the testing period, 2016.
The graph shows that the seasonal naive prediction tends to underestimate the actual demand during peak periods, such as in the summer months of July and August, and overestimate the demand during off-peak periods, such as in Spring and Autumn. This is because the seasonal naive prediction is based solely on the previous week’s demand, without taking into account other factors that may affect electricity demand, such as changes in population or weather patterns.
Overall, the graph highlights the limitations of the seasonal naive prediction method and the importance of using more advanced forecasting methods that take into account additional factors that may affect electricity demand.
ggplot(pred_baseline, aes(x=datetime)) +
geom_line(aes(y = MW, color = "observed")) +
geom_line(aes(y = pred, color="seasonal naive forecast")) + # change to monthly ticks and labels
labs(title="",
caption="Source: Author",
y="MW",
x="Period",
color=NULL) + # title and caption
scale_color_manual(labels = c("observed","seasonal naive forecast"),
values = c("observed"="steelblue","seasonal naive forecast"="darkred")) + # line color
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8), # rotate x axis text
panel.grid.minor = element_blank()) # turn off minor gridFigure 5.1: Comparisons of seasonal naive prediction and observed test data.
To determine the effectiveness of the baseline prediction, we need to measure the differences between the predicted values and the observed values. Common performance metrics used for this purpose are r-squared, MAE, and RMSE. However, caution should be exercised when using r-squared as a performance metric, as it is a measure of goodness of fit that is only appropriate for evaluating training data and not for out-of-sample predictions. MAE and RMSE are more suitable measures, and the calculation of both metrics differs between Python and R. In Python, the coefficient of determination is used, while R uses the square of Pearson’s r coefficient. For the purpose of comparison, the square of the correlation coefficient is used. The table below shows that the r-squared of the baseline prediction is 0.318, with MAE of 1808 and RMSE of 2326. If any of the models perform worse than the baseline prediction, it is an indication that those models are likely not appropriate.
pred_baseline %>%
evaluation_metrics(truth = MW, estimate = pred) %>%
select(-.estimator) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
arrange(-rsq)## # A tibble: 1 × 3
## rsq mae rmse
## <dbl> <dbl> <dbl>
## 1 0.318 1808. 2326.
#Evaluate initial model performance
predictions <- tibble(
type = c("LM", "EN", "MARS", "GB", "RF"),
wflow = list(lm_wf_fit, en_wf_fit, mars_wf_fit, xg_wf_fit, rf_wf_fit),
predictions = map(.x = wflow, .f = ~ augment(.x, dt_test))
) %>%
select(-wflow) %>%
unnest(predictions)predictions %>% select(type, datetime, MW, .pred)## # A tibble: 43,920 × 4
## type datetime MW .pred
## <chr> <dttm> <dbl> <dbl>
## 1 LM 2016-01-01 00:00:00 13417 15525.
## 2 LM 2016-01-01 01:00:00 12968 15099.
## 3 LM 2016-01-01 02:00:00 12395 14842.
## 4 LM 2016-01-01 03:00:00 12228 14775.
## 5 LM 2016-01-01 04:00:00 12116 15015.
## 6 LM 2016-01-01 05:00:00 12257 15747.
## 7 LM 2016-01-01 06:00:00 12528 16902.
## 8 LM 2016-01-01 07:00:00 12820 17909.
## 9 LM 2016-01-01 08:00:00 12986 18429.
## 10 LM 2016-01-01 09:00:00 13569 18768.
## # … with 43,910 more rows
The performance of five models can be evaluated by comparing the predicted values to the observed values using metrics such as R-squared, MAE, and RMSE. The results show that all models perform better than the seasonal naive baseline, with gradient boosting, random forest, and MARS emerging as the best-fitting models. These models significantly outperform the linear regression models. Gradient boosting is consistently the best-fitting model, however, having an R-squared close to 1 and low MAE and RMSE values might lead to overfitting issues.
#Compute evaluation metrics
# library(yardstick)
evaluation_metrics <- metric_set(rsq, mae, rmse)predictions %>%
group_by(type) %>%
evaluation_metrics(truth = MW, estimate = .pred) %>%
select(-.estimator) %>%
pivot_wider(names_from = ".metric", values_from = ".estimate") %>%
arrange(rsq)## # A tibble: 5 × 4
## type rsq mae rmse
## <chr> <dbl> <dbl> <dbl>
## 1 LM 0.603 1503. 1796.
## 2 EN 0.603 1503. 1796.
## 3 MARS 0.762 1298. 1594.
## 4 RF 0.801 1236. 1516.
## 5 GB 0.888 1198. 1430.
Figure 5.2 shows a set of scatterplots comparing the observed data and test predictions for each of the five models used in the study. The y-axis represents the observed data, and the x-axis represents the test predictions made by the respective model. Each point on the graph represents a single data point from the test set.
In general, a good model would result in points that lie close to the diagonal line, indicating a strong correlation between the observed data and test predictions. A scatterplot with a high degree of scatter and points far from the diagonal line indicates poor model performance.
Looking at the scatterplots in Figure 5.2, it is clear that some models perform better than others. For example, the gradient boosting and random forest models have a tight cluster of points near the diagonal line, indicating a strong correlation between the observed data and test predictions. In contrast, the linear regression and MARS models have much more scatter, with many points far from the diagonal line.
Overall, Figure 5.2 provides a useful visual comparison of the performance of the different models used in the study.
#Plot predictions and actual values in scatterplot
predictions %>%
ggplot(aes(.pred, MW)) +
geom_point(alpha = 0.2, colour = "lightblue", size = 2) +
facet_wrap(~ type) +
geom_abline(lty = "dashed", colour = "grey50") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format()) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))Figure 5.2: Comparison of observed data and test predictions for each model.
Figure 5.3 compares the observed demand and predicted demand for each model through time. The x-axis shows the time period, which ranges from January 1, 2016 to December 31, 2016, while the y-axis shows the electricity demand in MW. Each subplot represents a different model, with the observed demand shown in blue and the predicted demand shown in red.
In general, all the models perform well in predicting the overall shape of the demand curve, but they differ in their ability to accurately predict the level of demand at certain times. For example, the linear regression models tend to under-estimate demand during the summer months. The gradient boosting, random forest and MARS models, on the other hand, are able to capture more of the variability in the data and produce more accurate predictions overall.
Looking at the individual subplots, it’s clear that the gradient boosting model provides the best overall fit to the data. The predicted demand closely follows the observed demand, with only minor deviations. The random forest model also does a good job, but it tends to over-predict demand in some periods, particularly during the summer months. The linear regression modesl are the least accurate of the five, with large deviations from the observed demand throughout the year.
Overall, the subplots in Figure 5.3 demonstrate the effectiveness of the different models in predicting electricity demand and highlight the strengths and weaknesses of each approach.
#Plot predictions and actual values against time
ggplot(predictions, aes(x=datetime)) +
geom_line(aes(y = MW, color = "observed")) +
geom_line(aes(y = .pred, color="predicted")) +
labs(title = NULL, x="Period") + # title and caption
scale_color_manual(labels = c("observed", "predicted"),
values = c("observed"="steelblue", "predicted"="darkred")) + # line color
facet_wrap(~ type) +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8), # rotate x axis text
panel.grid.minor = element_blank()) # turn off minor gridFigure 5.3: Comparison of observed demand and predicted demand for each model through time.
In order to evaluate the accuracy of the forecast, we can examine the forecast errors which are the differences between the observed values and their corresponding predicted values. In Figure 5.4, we only display the errors for the gradient boosting method, which is the best-fitting model.
p3 <- xg_wf_fit %>%
augment(dt_test) %>%
mutate(model = "Gradient Boosting") %>%
mutate(PE = (MW-.pred)/MW) %>%
ungroup() %>%
ggplot(aes(x = PE))+
geom_histogram(bins = 50, na.rm=TRUE) +
scale_x_continuous(limits = c(-0.5, 0.5), labels = scales::percent)+
labs(title = "(a)", x = "Percent Error")+
theme_bw()
p4 <- xg_wf_fit %>%
augment(dt_test) %>%
mutate(model = "Gradient Boosting") %>%
mutate(PE = (MW-.pred)/MW*100,
interval = ifelse(abs(PE) > 0.05*100,
"outside 5%", "inside 5%") %>% as.factor()) %>%
ungroup() %>%
ggplot(aes(datetime, PE, colour = interval)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0.05*100, colour = "darkred", lty = "dashed") +
geom_hline(yintercept = -0.05*100, colour = "darkred", lty = "dashed") +
labs(title = "(b)", y = "Percent Error", x = "Period")+
theme_bw() +
theme(legend.position = "bottom")
p3 + p4 + plot_layout(ncol=2,widths=c(1,2))Figure 5.4: Distribution of forecast errors.
Panel (a) of Figure 5.4 displays a histogram of forecast errors, showing that the distribution of errors is not symmetrical. The majority of errors are negative and relatively large, while a small fraction of errors are positive and relatively small. This indicates that the gradient boosting model is generally generating sensible forecasts, but sometimes produces bigger errors, particularly when the actual demand is lower than predicted.
The second panel suggests that the model is overestimating electricity demand since a substantial proportion of forecast errors lie below the 5% lower forecast interval. Our results demonstrate that 90% of test dataset errors fell within the range of -18% to 0.158% of the actual MW, indicating that the model is more likely to over-predict demand.
In conclusion, the gradient boosting method is making reasonably accurate predictions, but there is still room for improvement in some situations by experimenting with different transformations, variables, and models.
# library(gt)
xg_wf_fit %>%
augment(dt_test) %>%
mutate(model = "Gradient Boosting") %>%
mutate(PE = (MW-.pred)/MW) %>%
summarise(quant_05 = quantile(PE, 0.05) * 100,
quant_95 = quantile(PE, 0.95) * 100)## # A tibble: 1 × 2
## quant_05 quant_95
## <dbl> <dbl>
## 1 -18.0 0.158
In conclusion, the analysis indicates that the gradient boosting method is the most appropriate method for forecasting electricity demand using the given data. However, it is important to note that the model has its limitations, and there is still room for improvement in certain scenarios by experimenting with different transformations, variables, and model types. While our model used temperature as the only predictor variable, the inclusion of other related variables could enhance forecast accuracy, although this requires further investigation.
Future projects will focus on improving the model through adjusting the hyperparameters and feature engineering. It would also be valuable to explore the model’s performance during the COVID-19 period as new data becomes available. In summary, while the current model shows promise, continued efforts to improve it are necessary for more accurate forecasting of electricity demand.
A project by Alrick Campbell
alrickcampbell83@gmail.com