This document presents the model developed for the JanataHack - Time Series Forecasting hackathon, hosted by Analytics Vidhya on the 2nd and 3rd of May 2020.
The model developed here ranked 114th out of the 192 solutions submitted, with the Root Mean Square Error (RMSE) as the evaluation metric.
The proposed problem is to predict electricity consumption. The time frame of the problem runs from July 2013 to July 2017, four years in total.
Data for the first 23 days of each month within that period were provided, and the task was to forecast the electricity consumption for the remaining days of each month. Consequently, the model has to be built so that no data posterior to the predicted interval is used.
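As a minimal sketch of the setup assumed in the code below (the file names are placeholders, not the actual contest file names), the data can be loaded and the timestamp parsed so that the date helpers used later work on it:
train <- read.csv("train.csv", stringsAsFactors = FALSE)   # placeholder file name
test  <- read.csv("test.csv",  stringsAsFactors = FALSE)   # placeholder file name
# Parse the timestamp so that month(), year() and weekdays() work on it later
train$datetime <- as.POSIXct(train$datetime, tz = "UTC")
test$datetime  <- as.POSIXct(test$datetime,  tz = "UTC")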
The first rows of the training data are:
## ID datetime temperature var1 pressure windspeed var2 electricity_consumption
## 1 0 2013-07-01 00:00:00 -11.4 -17.1 1003 571.910 A 216
## 2 1 2013-07-01 01:00:00 -12.1 -19.3  996 575.040 A 210
## 3 2 2013-07-01 02:00:00 -12.9 -20.0 1000 578.435 A 225
## 4 3 2013-07-01 03:00:00 -11.4 -17.1  995 582.580 A 216
## 5 4 2013-07-01 04:00:00 -11.4 -19.3 1005 586.600 A 222
## 6 5 2013-07-01 05:00:00 -10.7 -19.3 1013   2.790 A 216
where the meaning of the variables var1 and var2 has not been specified.
The correlation of each variable with the outcome variable (electricity consumption) is:
library(GGally)
# Correlation matrix of the variables (ID column excluded; ggcorr drops non-numeric columns)
ggcorr(train[, -1], label = TRUE)
Based on the exploratory analysis, only windspeed and var1 are used in a preliminary model (var1 has a strong correlation with temperature and a slightly higher correlation with electricity consumption, so temperature is dropped).
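As a quick numeric complement to the correlation plot (a small sketch, output not shown here), the same relationships can be checked directly with cor():
# Correlation of each numeric covariate with electricity consumption
num_vars <- c("temperature", "var1", "pressure", "windspeed")
round(cor(train[, num_vars], train$electricity_consumption, use = "complete.obs"), 2)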
The choice of the remaining model variables was influenced by the paper Short-term load forecasting based on a semi-parametric additive model, Shu Fan and Rob J. Hyndman (2012), which recommends multiple seasonalities when predicting electricity consumption. Therefore, the model accounts for daily and weekly seasonality. The seasonal components have been extracted from the data as shown in the following graph:
library(lubridate)   # month(), year()
library(forecast)    # msts(), mstl(), autoplot()
# July 2013 subsets of the training and test sets
t  <- train[month(train$datetime) == 7 & year(train$datetime) == 2013, ]
te <- test[month(test$datetime) == 7 & year(test$datetime) == 2013, ]
# Hourly series (already in chronological order) decomposed with daily (24 h)
# and weekly (168 h) seasonal periods
train.ts <- as.ts(t$electricity_consumption)
autoplot(mstl(msts(train.ts, seasonal.periods = c(24, 168))),
         main = "Electricity consumption patterns from 1st to 23rd July 2013",
         xlab = "Week")
Finally, the last variable included is the day of the week (weekday).
The model used is an AutoRegressive Integrated Moving Average (ARIMA) model, a univariate model widely used in time series analysis. The covariates mentioned above are included as external regressors: the model is a linear regression on those covariates whose errors follow an ARIMA process (a regression with ARIMA errors).
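For illustration, a regression with ARIMA errors can be fitted to the July 2013 subset built above. The sketch below uses only var1 and windspeed as covariates; the object names xreg_jul and fit_jul are introduced here only for the example:
# Minimal regression with ARIMA errors on the July 2013 subset (sketch only)
xreg_jul <- cbind(var = t$var1, wind = t$windspeed)
fit_jul  <- auto.arima(ts(t$electricity_consumption), xreg = xreg_jul)
summary(fit_jul)   # regression coefficients plus the ARIMA structure of the errors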
A model is fitted to each month of the dataset. This simple approach thus meets the contest requirement of not using data posterior to the predicted interval.
The following figure shows the forecast produced by the first model (July 2013):
The order of the ARIMA model is selected automatically by auto.arima(), and the order of the Fourier terms used to model the seasonality (K) was chosen through a broad step-wise analysis.
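The exact step-wise search used during the contest is not reproduced here; the following is a minimal sketch of how the Fourier orders could be compared via AICc on the July 2013 series built above (y_jul, fit_k and best are names introduced only for this sketch):
# Grid comparison of Fourier orders for the daily and weekly seasonal terms
y_jul <- msts(train.ts, seasonal.periods = c(24, 168))
best  <- list(aicc = Inf, K = NULL)
for (k1 in 1:3) {
  for (k2 in 1:3) {
    fit_k <- auto.arima(y_jul, xreg = fourier(y_jul, K = c(k1, k2)), seasonal = FALSE)
    if (fit_k$aicc < best$aicc) best <- list(aicc = fit_k$aicc, K = c(k1, k2))
  }
}
best$K   # Fourier orders retained for the daily and weekly terms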
library(forecast)    # auto.arima(), fourier(), forecast(), msts()
library(lubridate)   # month(), year()

model <- function(x, t) {
  i <- 1
  solution <- list()

  # Day of the week as an additional covariate
  x$weekday <- as.factor(weekdays(x$datetime))
  t$weekday <- as.factor(weekdays(t$datetime))

  years  <- unique(year(x$datetime))
  months <- unique(month(x$datetime))

  for (y in years) {
    for (mo in months) {
      # One model per month: train on days 1-23, predict the remaining days
      train <- x[month(x$datetime) == mo & year(x$datetime) == y, ]
      test  <- t[month(t$datetime) == mo & year(t$datetime) == y, ]

      if (NROW(test) > 0) {
        # Hourly series for the month (rows are already in chronological order)
        train.ts <- as.ts(train$electricity_consumption)

        # Regression with ARIMA errors: Fourier terms for the daily (24 h) and
        # weekly (168 h) seasonality plus var1, windspeed and weekday
        # (note that cbind() coerces the weekday factor to integer codes)
        fit <- auto.arima(train.ts,
                          xreg = cbind(fourier(msts(train.ts, seasonal.periods = c(24, 168)), K = c(3, 3)),
                                       var = train$var1,
                                       wind = train$windspeed,
                                       weekday = train$weekday))

        # Forecast the remaining days of the month using the future values of
        # the covariates provided in the test set
        s <- forecast(fit,
                      xreg = cbind(fourier(msts(train.ts, seasonal.periods = c(24, 168)), K = c(3, 3), h = dim(test)[1]),
                                   var = test$var1,
                                   wind = test$windspeed,
                                   weekday = test$weekday),
                      h = dim(test)[1])

        solution[[i]] <- data.frame(ID = test$ID,
                                    electricity_consumption = as.numeric(s$mean))
        i <- i + 1
      }
    }
  }

  solution <- do.call(rbind, solution)
  return(solution)
}

S <- model(train, test)
write.csv(S, "k3.csv", row.names = FALSE)
The model seems to perform well when predicting the 24th and 25th of each month (the only interval taken into account for the score). Performance worsens considerably when the whole predicted interval of each month is considered.
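Although the true values for day 24 onwards are not available, this behaviour can be probed on the training data itself. The sketch below (covariates omitted for brevity, forecast and lubridate assumed loaded, object names introduced only for the example) trains on days 1 to 21 of July 2013 and computes the RMSE on days 22 and 23:
# Proxy check of short-horizon accuracy: hold out the last two observed days
jul <- train[month(train$datetime) == 7 & year(train$datetime) == 2013, ]
tr  <- jul[day(jul$datetime) <= 21, ]
val <- jul[day(jul$datetime) >= 22, ]
y   <- msts(tr$electricity_consumption, seasonal.periods = c(24, 168))
fit <- auto.arima(y, xreg = fourier(y, K = c(3, 3)))
fc  <- forecast(fit, xreg = fourier(y, K = c(3, 3), h = nrow(val)), h = nrow(val))
sqrt(mean((val$electricity_consumption - as.numeric(fc$mean))^2))   # RMSE on days 22-23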
Due to time constraints, the model has not been developed further. The proposed activities to continue improving it would be, in this order:
1. Account for the variables pointed out in the referenced paper (demand around the same time of day during the previous two days, and the maximum and minimum demand over the last 24 hours).
2. Challenge the one-month-one-model strategy, so that more data can be used to predict each month (except the first one, July 2013) and models can be validated via cross-validation or bootstrapping techniques.
3. Improve the regression part that handles the covariates, either by implementing a Generalized Linear Model (so that categorical variables can be better accounted for) or a different kind of model such as a Support Vector Machine (SVM) or eXtreme Gradient Boosting (XGBoost), and then ensemble the models (a sketch of the categorical-variable encoding follows this list).
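As a small sketch of the last point, the weekday factor could be dummy-encoded with model.matrix() so that each day of the week receives its own regression coefficient, instead of the single integer-coded column produced by cbind() in the submitted model (illustrative only, not part of the submission):
# Dummy-encode the weekday factor for use as external regressors
wd <- as.factor(weekdays(train$datetime))
weekday_dummies <- model.matrix(~ wd)[, -1]   # drop the intercept column
head(weekday_dummies)
# These columns could then be bound into the xreg matrix passed to auto.arima()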