This study aims to predict the wind power production of several turbines from some of their sensor measurements. The XGBoost algorithm yields the best results.
The French energy producer ENGIE organized a challenge in 2018 to predict more accurately the wind power production of its turbines, measured in kW. The data was collected by sensors inside the turbines, although ENGIE acknowledges their poor reliability.
More information can be found at the following two links: https://www.college-de-france.fr/site/stephane-mallat/Challenge-2017-2018-Regression-de-la-production-d-energie-eolienne-par-Engie.htm and https://challengedata.ens.fr/fr/challenges
The data stems from the latter (registration is required).
# required libraries
library(readr)
library(corrplot)
library(ggplot2)
library(caret)
library(RANN)
library(h2o)
library(xgboost)
The dataset provided by ENGIE to train the algorithms is relatively large: 617,386 observations and 79 columns, including the response.
The following code loads the datasets and removes some columns that are either useless (the ID and the timestamp) or contain too many missing values: the Grid_voltage columns have around 16% missing values, which would make their imputation (see below) risky.
For several characteristics, ENGIE provided four measurements: the average (no suffix), the minimum (_min suffix), the maximum (_max suffix) and the standard deviation (_std suffix). Although other choices could be made, including feature engineering, we keep only the average as the measurement of these characteristics and remove all the other columns. We also remove three duplicated columns. There is no duplicated row.
Finally, we convert the MAC_CODE column (the identifier of each wind turbine) to a factor with 4 levels, one per turbine.
At the end of this preparation, the dataset contains 19 variables, including the response.
input_training <- read_delim("input_training.csv", ";", escape_double = FALSE, trim_ws = TRUE)
output_training <- read_delim("challenge_fichier_de_sortie_dentrainement.csv", ";", escape_double = FALSE, trim_ws = TRUE)
input_training$TARGET <- output_training$TARGET
# remove the ID and timestamp columns, which are not used as predictors
input_training$ID <- NULL
input_training$Date_time <- NULL
# check for missing values
colSums(sapply(input_training, is.na))
# remove columns with too many missing values
input_training$Grid_voltage <- NULL
input_training$Grid_voltage_min <- NULL
input_training$Grid_voltage_max <- NULL
input_training$Grid_voltage_std <- NULL
# keep only the average: drop the _min, _max and _std columns of these characteristics
avg_only_prefixes <- c("Pitch_angle", "Hub_temperature", "Generator_converter_speed",
                       "Generator_speed", "Generator_bearing_1_temperature",
                       "Generator_bearing_2_temperature", "Generator_stator_temperature",
                       "Gearbox_bearing_1_temperature", "Gearbox_bearing_2_temperature",
                       "Gearbox_inlet_temperature", "Gearbox_oil_sump_temperature",
                       "Nacelle_angle", "Nacelle_temperature", "Outdoor_temperature",
                       "Grid_frequency", "Rotor_speed", "Rotor_bearing_temperature")
cols_to_drop <- paste0(rep(avg_only_prefixes, each = 3), c("_min", "_max", "_std"))
input_training <- input_training[, !(names(input_training) %in% cols_to_drop)]
input_training <- data.frame(input_training)
# check for duplicated columns (three pairs)
# Absolute_wind_direction
summary(input_training$Absolute_wind_direction)
summary(input_training$Absolute_wind_direction_c)
input_training$Absolute_wind_direction_c <- NULL # same as Absolute_wind_direction but with NAs
# Nacelle_angle
summary(input_training$Nacelle_angle)
summary(input_training$Nacelle_angle_c)
input_training$Nacelle_angle_c <- NULL # same as Nacelle_angle but with NAs
# Generator_speed
summary(input_training$Generator_converter_speed)
summary(input_training$Generator_speed)
input_training$Generator_converter_speed <- NULL # same as Generator_speed but with NAs
# make the turbine column a factor with 4 levels for the 4 turbines
input_training$MAC_CODE <- factor(input_training$MAC_CODE)
The distribution of the response variable, i.e. the wind power production of the turbines, is extremely skewed. The median is 193 kW and the mean is 372 kW, but these two measures poorly reflect the range of the active power. The lowest 25% of values are quite close together, between -19 kW and +18 kW, whereas the highest 25% are spread much further apart, between 540 kW and 2256 kW. Half of the observations lie between 18 kW and 540 kW.
# examine the response variable
summary(input_training$TARGET)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-19.48 18.62 193.99 372.75 540.68 2256.06
ggplot(data = input_training) +
geom_histogram(mapping = aes(x = TARGET), binwidth = 5) +
labs(title = "Histogram of the response variable", x = "Wind power production of the turbines")
The graphs display strong similarities between the four turbines in their relation to the power production, although two of them, WT2 and WT4, show more outliers.
table(input_training$MAC_CODE) # no class imbalance
WT1 WT2 WT3 WT4
154707 154791 154253 153635
ggplot(data = input_training) +
geom_boxplot(mapping = aes(x = MAC_CODE, y = TARGET, fill = MAC_CODE)) +
labs(title = "Wind power production of the four turbines",
x = "Code of the turbine", y = "Wind power production")
ggplot(data = input_training) +
geom_violin(mapping = aes(x = MAC_CODE, y = TARGET, fill = MAC_CODE)) +
labs(title = "Wind power production of the four turbines",
x = "Code of the turbine", y = "Wind power production")
Many numeric predictors are highly correlated, mostly positively. With respect to the response TARGET, notice the positive correlations with the gearbox bearing temperatures, the generator stator temperature, the generator speed and the rotor speed, and the negative correlation with the pitch angle and, to a lesser extent, the nacelle temperature.
numeric_variables_names <- names(input_training)[which(sapply(input_training, is.numeric))]
numeric_variables <- input_training[numeric_variables_names]
correlations <- cor(na.omit(numeric_variables))
corrplot(correlations, method = "pie",
order = "alphabet",
tl.cex = 0.7) # from the corrplot library
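As an aside, caret::findCorrelation can list the columns whose pairwise correlation exceeds a given cutoff, candidates for removal if a more parsimonious model were wanted. A hedged sketch (the 0.9 cutoff is our choice; note that TARGET itself is part of the correlation matrix computed above):
# columns with a pairwise correlation above 0.9, candidates for removal
high_corr <- findCorrelation(correlations, cutoff = 0.9, names = TRUE)
high_corr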
summary(input_training$Pitch_angle)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-156.180 -1.000 -1.000 13.012 5.567 156.820
ggplot(input_training, aes(Pitch_angle, TARGET)) +
geom_hex(bins = 30) +
labs(title = "Pitch angle and the power production", x = "Pitch angle", y = "Power production")
summary(input_training$Nacelle_temperature)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.90 21.44 26.03 25.13 29.91 46.26
ggplot(input_training, aes(Nacelle_temperature, TARGET)) +
geom_hex(bins = 30) +
labs(title = "Nacelle temperature and the power production", x = "nacelle temperature", y = "Power production")
summary(input_training$Gearbox_bearing_1_temperature)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.47 53.64 61.93 57.67 65.91 83.56
summary(input_training$Gearbox_bearing_2_temperature)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.80 53.69 64.06 59.22 69.10 80.35
ggplot(input_training, aes(Gearbox_bearing_1_temperature, TARGET)) +
geom_hex(bins = 30) +
labs(title = "Gearbox bearing temperature 1 and the power production", x = "Gearbox bearing temperature 1", y = "Power production")
ggplot(input_training, aes(Gearbox_bearing_2_temperature, TARGET)) +
geom_hex(bins = 30) +
labs(title = "Gearbox bearing temperature 2 and the power production", x = "Gearbox bearing temperature 2", y = "Power production")
summary(input_training$Generator_speed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.15 926.94 1174.54 1064.35 1562.67 1803.84
ggplot(input_training, aes(Generator_speed, TARGET)) +
geom_hex(bins = 30) +
labs(title = "Generator speed and the power production", x = "generator speed",
y = "Power production")
summary(input_training$Rotor_speed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 8.79 11.16 10.12 14.87 17.18
ggplot(input_training, aes(Rotor_speed, TARGET)) +
geom_hex(bins = 30) +
labs(title = "Rotor speed and the power production",
x = "Rotor speed", y = "Power production")
summary(input_training$Generator_stator_temperature)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.24 57.84 60.59 56.65 63.19 95.00
ggplot(input_training, aes(Generator_stator_temperature, TARGET)) +
geom_hex(bins = 30) +
labs(title = "Generator stator temperature and the power production",
x = "Generator stator temperature", y = "Power production")
Before feeding the data to the algorithms, we split the dataset into a training set and a test set following a 70%/30% partition.
We then impute the missing values with the median, mostly for computational reasons: more sophisticated techniques, such as k-nearest neighbors or bagging, would likely give more accurate imputations (a sketch of the kNN alternative is given after the preprocessing code below). This should not be a problem, because the proportion of missing values is very small.
We one-hot encode the only categorical feature, which creates four dummy variables.
Finally, we center, scale and apply the Yeo-Johnson transformation to the numeric training predictors, before appending the training response variable, which remains untransformed.
We proceed exactly the same way with the test set, applying the preprocessing models fitted on the training set.
# Split the dataset into a train and a test set
set.seed(100) # for a reproducible partition
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(input_training$TARGET, p = 0.70, list = FALSE)
# Step 2: Create the training dataset
trainData <- input_training[trainRowNumbers,]
# Step 3: Create the test dataset
testData <- input_training[-trainRowNumbers,]
# Store the predictors and the response for later use (TARGET is column 19)
x <- trainData[, -19]
y <- trainData$TARGET
x_test <- testData[, -19]
y_test <- testData$TARGET
# Pre-process the training data
# Imputation of remaining missing values
colSums(sapply(trainData, is.na))
# impute NAs
preProcess_missingdata_model <- preProcess(trainData, method = 'medianImpute')
# Use the imputation model to predict the values of missing data points
trainData <- predict(preProcess_missingdata_model, newdata = trainData, verbose = TRUE)
# One hot encoding
dummies_model <- dummyVars(TARGET ~ ., data = trainData)
trainData_mat <- predict(dummies_model, newdata = trainData)
trainData <- data.frame(trainData_mat) # Convert to dataframe. TARGET not in this dataset at this moment
# Center, scale and Yeo-Johnson transform the numeric training predictors
# (columns 1 to 4 are the MAC_CODE dummies, columns 5 to 21 the numeric predictors)
preProcess_yeojohnson_model <- preProcess(trainData[, 5:21], method = c("center", "scale", "YeoJohnson"))
trainData[, 5:21] <- predict(preProcess_yeojohnson_model, newdata = trainData[, 5:21])
# Append the training response variable, that remains untransformed
trainData$TARGET <- y
# Pre process the test set
# impute NAs
testData2 <- predict(preProcess_missingdata_model, newdata = testData)
# one hot encoding
testData2 <- predict(dummies_model, newdata = testData2) # TARGET no longer in this test data, which only contains predictors
testData2 <- data.frame(testData2)
# Transformation on the test predictors
testData2[, 5:21] <- predict(preProcess_yeojohnson_model, newdata = testData2[, 5:21])
# Append the test response variable, which remains untransformed
testData2$TARGET <- y_test
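For completeness, here is a hedged sketch of the kNN alternative mentioned above. It would replace the medianImpute step; note that caret's knnImpute also centers and scales the data, relies on the RANN package loaded earlier, and is much slower on 600,000+ rows, which is why we did not use it.
# sketch: this would replace the medianImpute call above (k = 5 is our choice)
preProcess_knn_model <- preProcess(trainData, method = 'knnImpute', k = 5)
# trainData <- predict(preProcess_knn_model, newdata = trainData)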
This challenge requires supervised regression algorithms. ENGIE picked the Mean Absolute Error (MAE) as the metric to evaluate their performance.
The following models will be trained and evaluated on the test set: a linear regression, neural networks, a gradient boosting machine (GBM), XGBoost and a distributed random forest (DRF).
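For reference, the MAE is simply the average absolute deviation between the predictions and the true values; the same computation is inlined after each model below.
# the challenge metric
mae <- function(actual, predicted) mean(abs(actual - predicted))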
We first try the simplest regression model: a linear regression using all predictors. The results are actually not bad for such a naive model. The test MAE is 92.6, equal to the training MAE, which indicates no overfitting. Almost all coefficients are highly significant. Comparing the predicted distribution on the test set with the true distribution of the response, we see that the model predicts the lowest 25% of values poorly, with a range between -11124 kW and +73 kW (recall that the true response for the lowest 25% of values lies between -19 kW and +18 kW).
set.seed(11)
lm.fit1 <- caret::train(TARGET ~ .,
data = trainData,
method = "lm")
summary(lm.fit1)
lm.fit1
varImp(lm.fit1)
# most important variables in a linear setting :
# Rotor_speed, Generator_speed, Generator_bearing_1_temperature, Generator_stator_temperature
# Gearbox_bearing_1_temperature, Pitch_angle, Gearbox_inlet_temperature, Nacelle_temperature
lm.predicted.train <- predict(lm.fit1, newdata = trainData)
lm.err.MAE.train <- mean(abs(trainData$TARGET - lm.predicted.train))
lm.err.MAE.train # training MAE: 92.6
# Evaluate the performance on the test set
lm.predicted.test <- predict(lm.fit1, newdata = testData2)
lm.err.MAE.test <- mean(abs(testData2$TARGET - lm.predicted.test))
lm.err.MAE.test # test MAE: 92.6
summary(lm.predicted.test) # the main problem lies in the lowest 25 % values
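To visualize where the linear model fails, a quick sketch in the same hexbin style as above plots the predicted against the observed values on the test set; points far from the red identity line are poorly predicted, which happens mostly at low power values.
lm.diag <- data.frame(observed = testData2$TARGET, predicted = lm.predicted.test)
ggplot(lm.diag, aes(observed, predicted)) +
  geom_hex(bins = 30) +
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  labs(title = "Linear model: predicted vs observed (test set)",
       x = "Observed power production", y = "Predicted power production")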
We use the h2o interface to train and evaluate several neural networks with different architectures (number of hidden layers and neurons per layer) as well as two activation functions, Maxout and Tanh.
The best test MAE we get is 21.38, without overfitting. A more accurate model could therefore be trained, probably a larger one and perhaps with more epochs; we do not do so here for computational reasons, but a sketch of such a model is given after the code below.
Interestingly, the most important variables are the four dummies, the pitch angle, the rotor speed and, to a lesser extent, the generator speed, the generator stator temperature and the gearbox temperatures. The other predictors matter less to the neural network.
# connect with the h2o cluster
h2o.init(nthreads = -1, max_mem_size = '4g')
# prepare the datasets
training_data_yeojohnson_h2o <- as.h2o(trainData)
testing_data_yeojohnson_h2o <- as.h2o(testData2)
Y <- "TARGET"
X <- setdiff(names(training_data_yeojohnson_h2o), Y)
# train the neural network
set.seed(2)
DL.h2o.2 <- h2o.deeplearning(x = X,
y = Y,
training_frame = training_data_yeojohnson_h2o,
nfolds = 5,
distribution = "gaussian",
activation = "Maxout",
hidden = c(32,32,32),
sparse = FALSE,
l1 = 0,
epochs = 100,
variable_importances = TRUE,
seed = 2)
DL.h2o.2
# on training : MAE 31.29958
# on cross validation data : MAE 23.0176
h2o.varimp(DL.h2o.2)
h2o.varimp_plot(DL.h2o.2, 21)
# evaluate the performance
DL.predicted2 <- h2o.predict(DL.h2o.2, newdata = testing_data_yeojohnson_h2o)
DL.predicted.err2 <- mean(abs(testing_data_yeojohnson_h2o$TARGET - DL.predicted2))
DL.predicted.err2 # test MAE : 30.92201
# change the parameters
set.seed(2)
DL.h2o.3 <- h2o.deeplearning(x = X,
y = Y,
training_frame = training_data_yeojohnson_h2o,
nfolds = 5,
distribution = "gaussian",
activation = "Maxout",
hidden = c(200,200),
sparse = FALSE,
l1 = 0,
epochs = 100,
variable_importances = TRUE,
seed = 3)
DL.h2o.3
# on training : MAE 23.61457
# on cross validation data : MAE 23.86
h2o.varimp_plot(DL.h2o.3, 21)
# evaluate the performance
DL.predicted3 <- h2o.predict(DL.h2o.3, newdata = testing_data_yeojohnson_h2o)
DL.predicted.err3 <- mean(abs(testing_data_yeojohnson_h2o$TARGET - DL.predicted3))
DL.predicted.err3 # test MAE : 23.86
# change the parameters
set.seed(3)
DL.h2o.4 <- h2o.deeplearning(x = X,
y = Y,
training_frame = training_data_yeojohnson_h2o,
nfolds = 5,
distribution = "gaussian",
activation = "Maxout",
hidden = c(32,32,32,32,32,32),
sparse = FALSE,
l1 = 0,
epochs = 300,
variable_importances = TRUE,
seed = 4)
DL.h2o.4
# on training : MAE 21.53118
# on cross validation data : 23.62188
h2o.varimp_plot(DL.h2o.4, 21)
# evaluate the performance
DL.predicted4 <- h2o.predict(DL.h2o.4, newdata = testing_data_yeojohnson_h2o)
DL.predicted.err4 <- mean(abs(testing_data_yeojohnson_h2o$TARGET - DL.predicted4))
DL.predicted.err4 # test MAE : 21.38377
# change the activation function : use the Tanh
set.seed(3)
DL.h2o.5 <- h2o.deeplearning(x = X,
y = Y,
training_frame = training_data_yeojohnson_h2o,
nfolds = 3,
distribution = "gaussian",
activation = "Tanh",
hidden = c(32,32,32,32,32,32),
sparse = FALSE,
l1 = 0,
epochs = 100,
variable_importances = TRUE,
seed = 4)
DL.h2o.5
# on training : MAE 27.10298
# on cross validation : MAE 22.77508
h2o.varimp_plot(DL.h2o.5, 21)
DL.predicted5 <- h2o.predict(DL.h2o.5, newdata = testing_data_yeojohnson_h2o)
DL.predicted.err5 <- mean(abs(testing_data_yeojohnson_h2o$TARGET - DL.predicted5))
DL.predicted.err5 # test MAE : 26.36685
summary(DL.predicted5)
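As alluded to above, a larger network could probably do better. Here is a hedged sketch of such a model, not run here for computational reasons; it relies on h2o's built-in early stopping (the stopping_* arguments are standard h2o.deeplearning parameters), and the architecture is a guess rather than a tuned choice.
DL.h2o.larger <- h2o.deeplearning(x = X,
                                  y = Y,
                                  training_frame = training_data_yeojohnson_h2o,
                                  nfolds = 5,
                                  distribution = "gaussian",
                                  activation = "Maxout",
                                  hidden = c(64, 64, 64, 64, 64, 64),
                                  epochs = 500,
                                  stopping_rounds = 5,         # stop when MAE stalls
                                  stopping_metric = "MAE",
                                  stopping_tolerance = 1e-3,
                                  variable_importances = TRUE,
                                  seed = 5)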
The Gradient Boosting Machine (GBM) is an ensemble technique that combines the predictions of many weak models, in this case decision trees, into a final prediction: each new tree is fitted to the residuals of the current ensemble.
The best test MAE we get is 15.29708, with some overfitting despite the use of regularization.
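To make the mechanism concrete, here is a toy sketch of the boosting idea on a subsample of the training data, assuming the rpart package is installed. It only illustrates the principle; it is not the h2o implementation.
library(rpart)
set.seed(5)
toy <- trainData[sample(nrow(trainData), 5000), ]
pred <- rep(mean(toy$TARGET), nrow(toy)) # start from a constant model
shrinkage <- 0.2
for (m in 1:20) {
  toy$residual <- toy$TARGET - pred # residuals of the current ensemble
  tree <- rpart(residual ~ . - TARGET, data = toy, maxdepth = 4) # weak learner
  pred <- pred + shrinkage * predict(tree, toy) # shrunken additive update
}
mean(abs(toy$TARGET - pred)) # training MAE of the toy ensemble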
h2o.init(nthreads = -1, max_mem_size = '4g')
# prepare the datasets
training_data_yeojohnson_h2o <- as.h2o(trainData)
testing_data_yeojohnson_h2o <- as.h2o(testData2)
Y <- "TARGET"
X <- setdiff(names(training_data_yeojohnson_h2o), Y)
# train the model
set.seed(5)
gbm.h2o.1 <- h2o.gbm(training_frame = training_data_yeojohnson_h2o,
x = X,
y = Y,
distribution = "gaussian",
ntrees = 20,
max_depth = 10,
learn_rate = 0.2,
min_rows = 2,
nfolds = 5)
gbm.h2o.1
# on training data : MAE 15.99802
# on cross validation data : MAE 17.82392
gbm.h2o.1.predicted <- h2o.predict(gbm.h2o.1, newdata = testing_data_yeojohnson_h2o)
gbm.h2o.1.err <- mean(abs(testing_data_yeojohnson_h2o$TARGET - gbm.h2o.1.predicted))
gbm.h2o.1.err # test MAE : 17.69908
# change the parameters
set.seed(8)
gbm.h2o.2 <- h2o.gbm(training_frame = training_data_yeojohnson_h2o,
x = X,
y = Y,
distribution = "gaussian",
ntrees = 50,
max_depth = 12,
learn_rate = 0.2,
min_rows = 2,
nfolds = 5)
gbm.h2o.2
# on training data : MAE 8.488114
# on cross validation data : MAE 15.61311
gbm.h2o.2.predicted <- h2o.predict(gbm.h2o.2, newdata = testing_data_yeojohnson_h2o)
gbm.h2o.2.err <- mean(abs(testing_data_yeojohnson_h2o$TARGET - gbm.h2o.2.predicted))
gbm.h2o.2.err # test MAE : 15.34467, better result but overfitting is obvious
# change the parameters to prevent overfitting
set.seed(1)
gbm.h2o.3 <- h2o.gbm(training_frame = training_data_yeojohnson_h2o,
x = X,
y = Y,
distribution = "gaussian",
ntrees = 100,
max_depth = 12,
learn_rate = 0.2,
min_rows = 5,
sample_rate = 0.9,
nfolds = 5)
gbm.h2o.3
# on training : MAE 8.076759
# on cross validation : 15.55339
gbm.h2o.3.predicted <- h2o.predict(gbm.h2o.3, newdata = testing_data_yeojohnson_h2o)
gbm.h2o.3.err <- mean(abs(testing_data_yeojohnson_h2o$TARGET - gbm.h2o.3.predicted))
gbm.h2o.3.err # test MAE : 15.29708
We implement here the XGBoost algorithm. It stems from the gradient boosting framework, which builds the prediction by summing many weak models, in this case decision trees.
The best test MAE we get is 14.75768. Notice that the model clearly overfits despite some regularization tools, as the training MAE is around 8.
set.seed(9)
xgboost1 <- xgboost::xgboost(data = as.matrix(trainData[, -22]),
label = trainData[, 22],
max.depth = 12,
eta = 0.15,
nrounds = 100,
objective = 'reg:linear',
eval_metric = 'mae')
xgboost1.predicted.train <- predict(xgboost1, newdata = as.matrix(trainData[, -22]))
xgboost1.err.train <- mean(abs(trainData[, 22] - xgboost1.predicted.train))
xgboost1.err.train # train MAE : 7.651979
# Evaluate the performance on the test set
xgboost1.predicted <- predict(xgboost1, newdata = as.matrix(testData2[, -22]))
xgboost1.err <- mean(abs(testData2[, 22] - xgboost1.predicted))
xgboost1.err # test MAE : 14.92736, good result but overfitting is obvious
# change the parameters to prevent overfitting
set.seed(10)
xgboost2 <- xgboost::xgboost(data = as.matrix(trainData[, -22]),
label = trainData[, 22],
max.depth = 12,
eta = 0.08,
nrounds = 170,
gamma = 20,
objective = 'reg:linear',
eval_metric = 'mae')
xgboost2.predicted <- predict(xgboost2, newdata = as.matrix(testData2[, -22]))
xgboost2.err <- mean(abs(testData2[, 22] - xgboost2.predicted))
xgboost2.err # test MAE : 14.78058
# change the parameters to prevent overfitting
set.seed(10)
xgboost3 <- xgboost::xgboost(data = as.matrix(trainData[, -22]),
label = trainData[, 22],
max.depth = 12,
eta = 0.08,
nrounds = 200,
gamma = 25,
colsample_bytree = 0.9,
subsample = 0.9,
objective = 'reg:linear',
eval_metric = 'mae')
xgboost3.predicted <- predict(xgboost3, newdata = as.matrix(testData2[, -22]))
xgboost3.err <- mean(abs(testData2[, 22] - xgboost3.predicted))
xgboost3.err # test MAE : 14.75768
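One counter-measure we did not try is to pick nrounds by cross-validated early stopping instead of by hand. A hedged sketch, using xgboost's standard xgb.cv function and early_stopping_rounds argument (the parameter values are ours):
set.seed(10)
cv <- xgboost::xgb.cv(data = as.matrix(trainData[, -22]),
                      label = trainData[, 22],
                      max.depth = 12,
                      eta = 0.08,
                      nrounds = 500,
                      nfold = 5,
                      gamma = 25,
                      objective = 'reg:linear',
                      eval_metric = 'mae',
                      early_stopping_rounds = 20) # stop when the CV MAE stalls
cv$best_iteration # suggested number of boosting rounds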
We finally use a Distributed Random Forest (DRF) to predict the power production of the turbines. According to the h2o documentation, the DRF algorithm builds a forest of regression trees, each of these being a weak learner generated on a subset of rows and columns. More trees should reduce the variance. The final prediction is made by taking the average of all the trees’ predictions.
The best MAE we get on the test set after trying different combinations is 16.87922, with no overfitting. A more precise model could therefore probably be fit with different parameters (more trees? more depth?), although for computational reasons we do not pursue that direction; a grid-search sketch is given after the code below.
# connect with the h2o cluster
h2o.init(nthreads = -1, max_mem_size = '4g')
# prepare the datasets
training_data_yeojohnson_h2o <- as.h2o(trainData)
testing_data_yeojohnson_h2o <- as.h2o(testData2)
Y <- "TARGET"
X <- setdiff(names(training_data_yeojohnson_h2o), Y)
# train the model
set.seed(1)
drf1 <- h2o.randomForest(training_frame = training_data_yeojohnson_h2o,
nfolds = 3,
y = Y,
x = X,
ntrees = 50,
max_depth = 20,
min_rows = 2,
mtries = -1,
seed = 1)
drf1
# on training : MAE 17.41881
# on cross validation data : MAE 17.29307
# evaluate on test data
drf1.predicted <- predict(drf1, newdata = testing_data_yeojohnson_h2o)
drf1.err <- mean(abs(testing_data_yeojohnson_h2o$TARGET - drf1.predicted))
drf1.err # test MAE : 17.0397
# change the parameters
set.seed(2)
drf2 <- h2o.randomForest(training_frame = training_data_yeojohnson_h2o,
nfolds = 3,
y = Y,
x = X,
ntrees = 100,
max_depth = 20,
min_rows = 2,
mtries = -1,
seed = 1)
drf2
# on training data : MAE 17.16537
# on cross validation data : 17.22089
# evaluate on test data
drf2.predicted <- predict(drf2, newdata = testing_data_yeojohnson_h2o)
drf2.err <- mean(abs(testing_data_yeojohnson_h2o$TARGET - drf2.predicted))
drf2.err # test MAE : 16.9644
# change the parameters
set.seed(2)
drf3 <- h2o.randomForest(training_frame = training_data_yeojohnson_h2o,
nfolds = 3,
y = Y,
x = X,
ntrees = 50,
max_depth = 30,
min_rows = 2,
mtries = -1,
seed = 1)
drf3
drf3.predicted <- predict(drf3, newdata = testing_data_yeojohnson_h2o)
drf3.err <- mean(abs(testing_data_yeojohnson_h2o$TARGET - drf3.predicted))
drf3.err # test MAE : 16.83267
# change the parameters
set.seed(2)
drf4 <- h2o.randomForest(training_frame = training_data_yeojohnson_h2o,
nfolds = 3,
y = Y,
x = X,
ntrees = 50,
max_depth = 30,
min_rows = 2,
mtries = -1,
histogram_type = "Random",
seed = 1)
drf4
drf4.predicted <- predict(drf4, newdata = testing_data_yeojohnson_h2o)
drf4.err <- mean(abs(testing_data_yeojohnson_h2o$TARGET - drf4.predicted))
drf4.err # test MAE : 16.87922
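A possible next step, sketched below but not run for computational reasons: let h2o.grid explore the ntrees / max_depth combinations suggested above instead of trying them one by one (h2o.grid and h2o.getGrid are standard h2o functions; the grid values are our guesses).
drf_grid <- h2o.grid("randomForest",
                     x = X, y = Y,
                     training_frame = training_data_yeojohnson_h2o,
                     nfolds = 3,
                     hyper_params = list(ntrees = c(50, 100, 200),
                                         max_depth = c(20, 30, 40)),
                     seed = 1)
h2o.getGrid(drf_grid@grid_id, sort_by = "mae", decreasing = FALSE) # best model first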
This study uses the data from the 2018 ENGIE challenge to predict the wind power production of turbines. The best performance was obtained with the XGBoost algorithm, which led to a mean absolute error of about 14.7 kW. This mean error is to be compared with the range of the response variable, the wind power production, which takes values between -19 kW and +2256 kW.
In other words, for new observations with the same predictors, if the model predicts 500 kW, the true response should on average lie within about 15 kW of that value, roughly between 485 kW and 515 kW. This can be considered satisfactory in terms of algorithm performance. A better strategy for new observations could be to blend the models, for example XGBoost and the Distributed Random Forest, and take the average prediction, hoping that the models' errors point in different directions; that should reduce the error a little further.
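A minimal sketch of this blending idea, assuming the xgboost3 and drf4 models trained above are still in memory (the result was not computed in this study):
drf4.vec <- as.vector(drf4.predicted$predict) # pull the h2o predictions back into R
blend.predicted <- (xgboost3.predicted + drf4.vec) / 2 # simple average of the two models
mean(abs(testData2$TARGET - blend.predicted)) # blended test MAE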