Intro
Case Study
As countries seek clean, renewable, low-cost, domestic alternatives to fossil fuels, wind energy is an increasingly important source of power generation worldwide [1]. This has attracted significant attention and investment in wind turbine development and has led to great success over the last decades. Wind turbines are commonly designed using engineering models, and for aerodynamic calculations Blade Element Momentum (BEM) theory [2] has proven to be the market leader. Many codes have been developed to evaluate the performance of wind turbines, and the data obtained from the Harvard Dataverse [3] documents one such set of calculation results, produced with the B-GO code [4].
The data used here reports the calculated power of a typical wind turbine rotor, with the DanAero wind turbine [5] taken as the specific basis. This turbine has a rated power of 2.3 MW, the design power generated by the rotor. In the calculations, the rotor power is scaled with the power coefficient over several ranges of wind speed and air density. The results include Gaussian noise applied to the computed power, added in the code to roughly incorporate the uncertainty occurring in the field. Typically, wind turbine power scales with the cube of the wind speed and linearly with the power coefficient and air density. Although in reality the yaw misalignment angle, shear exponent and its reference height affect wind turbine performance considerably, these effects are not captured in the documented results since only the steady version of the B-GO code was used [4].
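To make that scaling concrete, here is a minimal sketch of the standard rotor power relation, P = 0.5 ρ A Cp V³. The rotor radius of 40.02 m comes from the dataset; the other numbers in the example call are only illustrative values within the dataset's ranges, not results from the paper.

# illustrative sketch: rotor power P = 0.5 * rho * A * Cp * V^3
rotor_power_kW <- function(v, cp, rho, radius = 40.02) {
  area <- pi * radius^2               # swept rotor area [m^2]
  0.5 * rho * area * cp * v^3 / 1000  # convert W -> kW
}

# e.g. at 11 m/s, Cp = 0.45, rho = 1.225 kg/m^3
rotor_power_kW(v = 11, cp = 0.45, rho = 1.225)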
The main objective of this study is to plan further treatment when the resulting power falls into the off_design criterion.
What Will We Do
Using the 86,000 observations in our dataset, with 14 parameters, we will build a model to classify observation points based on an off-design criterion. This is defined as the power (in kW) exceeding the rated power (2300 kW) plus a 10% margin, i.e. 2530 kW.
For this classification case, we first split the dataset into an 80:20 proportion of training and test data. We then build a logistic regression model on the training data, perform step-wise feature selection to improve its performance, and finally use the KNN algorithm to predict the test data, to see which approach performs best.
Source data from Harvard Dataverse.
Data Preparation
Load Packages
Before we move further, let’s load the required packages.
library(tidyverse)
library(plotly)
library(scales)
library(lmtest)
library(dplyr)
library(car)
library(caret)
library(gtools)
library(class) # provides the `knn()` function
Read Data
Start with reading our csv file.
wind <- read.csv('data_input/SteadyBEMData_BGO_CpConst.csv')
| X..RunNr… | CodeBase… | TurbineModel… | RotorNumber… | RotorRadius.m. | TurbWakeState… | TipLossModel… | PowerCoeff… | WindSpeed.m.s. | Yaw.deg. | Shear.deg. | RefHeight.m. | AirDensity.kg.m3. | Power.W. | Power_kW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | B-GO_Steady | Generic | 3 | 40.02 | Spera | Prandtl | 0.3 | 3.0 | 0 | 0 | 0 | 1.1 | 22065.18 | 22.06518 |
| 2 | B-GO_Steady | Generic | 3 | 40.02 | Spera | Prandtl | 0.3 | 3.4 | 0 | 0 | 0 | 1.1 | 33356.47 | 33.35647 |
| 3 | B-GO_Steady | Generic | 3 | 40.02 | Spera | Prandtl | 0.3 | 3.8 | 0 | 0 | 0 | 1.1 | 45526.90 | 45.52690 |
| 4 | B-GO_Steady | Generic | 3 | 40.02 | Spera | Prandtl | 0.3 | 4.2 | 0 | 0 | 0 | 1.1 | 61362.37 | 61.36237 |
| 5 | B-GO_Steady | Generic | 3 | 40.02 | Spera | Prandtl | 0.3 | 4.6 | 0 | 0 | 0 | 1.1 | 73399.07 | 73.39907 |
| 6 | B-GO_Steady | Generic | 3 | 40.02 | Spera | Prandtl | 0.3 | 5.0 | 0 | 0 | 0 | 1.1 | 123781.61 | 123.78161 |
str(wind)
## 'data.frame': 86000 obs. of 15 variables:
## $ X..RunNr... : int 1 2 3 4 5 6 7 8 9 10 ...
## $ CodeBase... : chr "B-GO_Steady" "B-GO_Steady" "B-GO_Steady" "B-GO_Steady" ...
## $ TurbineModel... : chr "Generic" "Generic" "Generic" "Generic" ...
## $ RotorNumber... : int 3 3 3 3 3 3 3 3 3 3 ...
## $ RotorRadius.m. : num 40 40 40 40 40 ...
## $ TurbWakeState... : chr "Spera" "Spera" "Spera" "Spera" ...
## $ TipLossModel... : chr "Prandtl" "Prandtl" "Prandtl" "Prandtl" ...
## $ PowerCoeff... : num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ WindSpeed.m.s. : num 3 3.4 3.8 4.2 4.6 5 5.4 5.8 6.2 6.6 ...
## $ Yaw.deg. : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Shear.deg. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ RefHeight.m. : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AirDensity.kg.m3.: num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ Power.W. : num 22065 33356 45527 61362 73399 ...
## $ Power_kW : num 22.1 33.4 45.5 61.4 73.4 ...
| id | parameter | definition |
|---|---|---|
| 1 | RunNr | ID of the simulation run [-] |
| 2 | CodeBase | Name of code being used to run the simulation [-] |
| 3 | TurbineModel | Rotor model being simulated [-] |
| 4 | RotorNumber | Number of blades to construct the rotor [-] |
| 5 | RotorRadius | Radius of the rotor [m] |
| 6 | TurbWakeState | Turbulent wake state model being used in the simulation [-] |
| 7 | TipLossModel | Tip loss model being used in the simulation [-] |
| 8 | PowerCoeff | Value of the power coefficient of the rotor [-] |
| 9 | WindSpeed | Operating wind speed of the rotor used in the simulation [m/s] |
| 10 | Yaw | Operating yaw angle used in the simulation [degrees] |
| 11 | Shear | Operating wind shear exponent (power law) used in the simulation [-] |
| 12 | RefHeight | Reference height to generate the wind shear used in the simulation [m] |
| 13 | AirDensity | Density of air [kg/m3] |
| 14 | Power.W. | Calculated wind power, with Gaussian noise applied to the predicted results [W] |
Data Wrangling
Then, we inspect the statistical summary of our wind
data.
summary(wind)
## X..RunNr... CodeBase... TurbineModel... RotorNumber...
## Min. : 1 Length:86000 Length:86000 Min. :3
## 1st Qu.:21501 Class :character Class :character 1st Qu.:3
## Median :43001 Mode :character Mode :character Median :3
## Mean :43001 Mean :3
## 3rd Qu.:64500 3rd Qu.:3
## Max. :86000 Max. :3
## RotorRadius.m. TurbWakeState... TipLossModel... PowerCoeff...
## Min. :40.02 Length:86000 Length:86000 Min. :0.3000
## 1st Qu.:40.02 Class :character Class :character 1st Qu.:0.3375
## Median :40.02 Mode :character Mode :character Median :0.3750
## Mean :40.02 Mean :0.3750
## 3rd Qu.:40.02 3rd Qu.:0.4125
## Max. :40.02 Max. :0.4500
## WindSpeed.m.s. Yaw.deg. Shear.deg. RefHeight.m. AirDensity.kg.m3.
## Min. : 3.0 Min. :0 Min. :0.00 Min. : 0 Min. :1.100
## 1st Qu.: 7.0 1st Qu.:0 1st Qu.:0.04 1st Qu.:20 1st Qu.:1.147
## Median :11.4 Median :0 Median :0.08 Median :40 Median :1.195
## Mean :11.4 Mean :0 Mean :0.08 Mean :40 Mean :1.195
## 3rd Qu.:15.8 3rd Qu.:0 3rd Qu.:0.12 3rd Qu.:60 3rd Qu.:1.242
## Max. :19.8 Max. :0 Max. :0.16 Max. :80 Max. :1.290
## Power.W. Power_kW
## Min. : 13149 Min. : 13.15
## 1st Qu.: 394055 1st Qu.: 394.06
## Median : 1640385 Median : 1640.38
## Mean : 2620260 Mean : 2620.26
## 3rd Qu.: 4263215 3rd Qu.: 4263.22
## Max. :14138431 Max. :14138.43
Based on the summary above, we decided to:
- add a new column, off_design, where:
- Power_kW >= (1.1 * 2300), YES (1)
- Power_kW < (1.1 * 2300), NO (0)
- transform off_design into a factor
- remove the columns X..RunNr…, CodeBase…, TurbineModel…, RotorNumber…, RotorRadius.m., TurbWakeState…, TipLossModel…, PowerCoeff…, Yaw.deg., Power.W. and Power_kW
wind_cln <- wind %>%
mutate(off_design = case_when(
Power_kW >= (1.1*2300) ~ 1,
Power_kW < (1.1*2300) ~ 0)) %>%
mutate(off_design = as.factor(off_design)) %>%
select(-c('X..RunNr...','CodeBase...','TurbineModel...', 'RotorNumber...', 'RotorRadius.m.', 'TurbWakeState...', 'TipLossModel...', 'Yaw.deg.', 'Power.W.', 'Power_kW', 'PowerCoeff...'))
Perform further inspection on missing values, if any.
#check missing value
wind_cln %>% is.na() %>%
colSums()
## WindSpeed.m.s. Shear.deg. RefHeight.m. AirDensity.kg.m3.
## 0 0 0 0
## off_design
## 0
EDA
Moving to the next stage, the statistical summary always provides a good starting point for exploring our data.
summary(wind_cln)
## WindSpeed.m.s. Shear.deg. RefHeight.m. AirDensity.kg.m3. off_design
## Min. : 3.0 Min. :0.00 Min. : 0 Min. :1.100 0:51949
## 1st Qu.: 7.0 1st Qu.:0.04 1st Qu.:20 1st Qu.:1.147 1:34051
## Median :11.4 Median :0.08 Median :40 Median :1.195
## Mean :11.4 Mean :0.08 Mean :40 Mean :1.195
## 3rd Qu.:15.8 3rd Qu.:0.12 3rd Qu.:60 3rd Qu.:1.242
## Max. :19.8 Max. :0.16 Max. :80 Max. :1.290
- Check outliers
boxplot(wind_cln)
We see that there is a scaling issue in our data. The gap is not necessarily huge, but we can normalize the data later to avoid bias or the domination of one variable over another.
- Check multicollinearity:
We can use domain knowledge to identify multicollinear variables, or use the VIF (Variance Inflation Factor) value from the vif() function. A VIF value below 10 indicates that a variable is not multicollinear.
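As background, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. A rough illustrative sketch follows; the value need not match the vif() output below exactly, since for a GLM car computes a generalized VIF from the fitted model's coefficient covariance.

# illustrative sketch: VIF of WindSpeed.m.s. from an auxiliary linear regression
aux <- lm(WindSpeed.m.s. ~ Shear.deg. + RefHeight.m. + AirDensity.kg.m3.,
          data = wind_cln)
1 / (1 - summary(aux)$r.squared)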
model_dummy <- glm(formula = off_design ~ ., data = wind_cln, family = "binomial")
vif(model_dummy)
## WindSpeed.m.s. Shear.deg. RefHeight.m. AirDensity.kg.m3.
## 1.078163 1.000039 1.000037 1.078087
There is no multicollinearity between the columns.
- Check class proportion in the target variable
#class proportion of target variable
prop.table(table(wind_cln$off_design))
##
## 0 1
## 0.6040581 0.3959419
In our case, the 60:40 proportion is considered balanced data.
Model Fitting
Cross Validation
Split our wind_cln data into 80% training data and 20% test data.
RNGkind(sample.kind = "Rounding")
set.seed(111)
# index sampling
index <- sample(x = nrow(wind_cln), size = nrow(wind_cln)*0.8)
# splitting
wind_train <- wind_cln[index,]
wind_test <- wind_cln[-index,]
- Check class proportion of target variable in the data train
prop.table(table(wind_train$off_design))
##
## 0 1
## 0.6049128 0.3950872
Build Model
Model Base
First, we build our base logistic regression model using all columns,
except off_design, as predictors.
# model base
model_wind_all <- glm(formula = off_design ~ ., data = wind_train, family = "binomial")
Model After Step-Wise
Apply backward step-wise regression for feature selection.
# stepwise
model_wind_step <- step(object = model_wind_all,
direction = "backward", trace = F)
Evaluation
- Deviance
The deviance cannot be interpreted directly; it must be compared with a null model or with other models.
#model null creation
model_wind_null <- glm(formula = off_design ~ 1, data = wind_train, family = "binomial")
Hence we compare the deviance of our base model (model_wind_all), the null model (model_wind_null), and the model after step-wise regression (model_wind_step) as follows:
# compare deviance
model_wind_null$deviance
## [1] 92325.38
model_wind_all$deviance
## [1] 11423.8
model_wind_step$deviance
## [1] 11424.17
We choose the model with the lowest deviance.
- AIC
AIC describes the amount of information lost by a model. The smaller the AIC, the less information is lost.
#compare AIC
model_wind_null$aic
## [1] 92327.38
model_wind_all$aic
## [1] 11433.8
model_wind_step$aic
## [1] 11430.17
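As a quick sanity check on these numbers (assuming the usual relation AIC = deviance + 2k for a binomial GLM, with k the number of estimated coefficients), we can recompute the AIC from each model's deviance:

# AIC = deviance + 2 * number of estimated coefficients
model_wind_all$deviance + 2 * length(coef(model_wind_all))   # 4 predictors + intercept
model_wind_step$deviance + 2 * length(coef(model_wind_step)) # 2 predictors + intercept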
Some criteria in selecting a good model are:
- Models with low Residual Deviance / AIC
- Model without Perfect Separation condition
In our case:
- model_wind_all has the smallest deviance of the three, only about 0.4 points below model_wind_step
- model_wind_step has the smallest AIC of the three models above

We can go ahead with model_wind_step for now and further check its performance using a Confusion Matrix later.
Model Interpretation
A logistic regression model returns its coefficients as log-of-odds, which are hard to interpret directly, so we must transform these values into odds or probabilities. At a glance, though:

- If a coefficient is positive (+), it increases the chance of our target (in this case being off_design)
- If a coefficient is negative (-), it lowers the chance of our target

Based on the coefficients of model_wind_all and model_wind_step below, we can interpret the odds as follows:
#transforms to odds
exp(model_wind_all$coefficients)
## (Intercept) WindSpeed.m.s.
## 0.000000000000000001854007 10.114427739994940935730483
## Shear.deg. RefHeight.m.
## 0.776149085444676645195727 1.000085965538641596950242
## AirDensity.kg.m3.
## 5619.031688186513747496064752
Odds interpretation:

- With each 1 unit increase in WindSpeed.m.s., the odds of a data point being off_design are multiplied by 10.11
- With each 1 unit increase in Shear.deg., the odds of a data point being off_design are multiplied by 0.78 (i.e. they decrease)
- With each 1 unit increase in RefHeight.m., the odds of a data point being off_design are multiplied by 1.000086
- With each 1 unit increase in AirDensity.kg.m3., the odds of a data point being off_design are multiplied by 5619.03
#transforms to odds
exp(model_wind_step$coefficients)
## (Intercept) WindSpeed.m.s.
## 0.000000000000000001829171 10.113757844904530003304899
## AirDensity.kg.m3.
## 5607.863234881284370203502476
Odds interpretation:

- With each 1 unit increase in WindSpeed.m.s., the odds of a data point being off_design are multiplied by 10.11
- With each 1 unit increase in AirDensity.kg.m3., the odds of a data point being off_design are multiplied by 5607.86
Model Performance
Predict
We use the predict(object, newdata, type) function:

- object: the model used for prediction
- newdata: the test data / unseen data / new data
- type: the type of the predicted value

For the type parameter, the options are:

- link: returns the log of odds
- response: returns the probability
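For instance, type = "response" returns the probabilities directly, which should match the inv.logit() conversion of the log-odds used below; a minimal sketch:

# probabilities straight from predict(); equivalent to inv.logit(log-odds)
p_wind_direct <- predict(object = model_wind_step,
                         newdata = wind_test,
                         type = "response")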
# log of odds
log_wind <- predict(object = model_wind_step,
newdata = wind_test,
type = "link")
odd_wind <- exp(log_wind)
p_wind <- inv.logit(log_wind)
wind_log_odd <- data.frame(log_wind, odd_wind, p_wind)
Turn into YES/1 & NO/0 label
pred_label <- ifelse(test = p_wind > 0.5, yes = 1, no = 0)
| | log_wind | odd_wind | p_wind | pred_label |
|---|---|---|---|---|
| 2 | -23.480303 | 0.0000000 | 0.0000000 | 0 |
| 7 | -18.852509 | 0.0000000 | 0.0000000 | 0 |
| 8 | -17.926951 | 0.0000000 | 0.0000000 | 0 |
| 15 | -11.448040 | 0.0000107 | 0.0000107 | 0 |
| 18 | -8.671364 | 0.0001714 | 0.0001714 | 0 |
| 21 | -5.894688 | 0.0027540 | 0.0027465 | 0 |
Evaluation with Confusion Matrix
Confusion Matrix
Function: confusionMatrix(data, reference, positive)
- data: predicted result labels (factor)
- reference: actual labels (factor)
- positive: positive class name
confusionMatrix(data = as.factor(pred_label),
reference = wind_test$off_design,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 10003 331
## 1 328 6538
##
## Accuracy : 0.9617
## 95% CI : (0.9587, 0.9645)
## No Information Rate : 0.6006
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.9201
##
## Mcnemar's Test P-Value : 0.9379
##
## Sensitivity : 0.9518
## Specificity : 0.9683
## Pos Pred Value : 0.9522
## Neg Pred Value : 0.9680
## Prevalence : 0.3994
## Detection Rate : 0.3801
## Detection Prevalence : 0.3992
## Balanced Accuracy : 0.9600
##
## 'Positive' Class : 1
##
According to our business case, we want to avoid the situation where the actual power is off_design (1) but is predicted as NOT (0), i.e. the False Negative case. Hence, our metric of concern is the Sensitivity / Recall, which is currently 95.2%.
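If the recall were ever deemed too low, one common lever is to lower the classification cutoff from 0.5, trading some precision for fewer false negatives. A hedged sketch follows; the 0.3 cutoff is only an illustrative value, not part of the original analysis.

# illustrative: a lower cutoff flags more points as off_design, raising recall
pred_label_low <- ifelse(test = p_wind > 0.3, yes = 1, no = 0)
confusionMatrix(data = as.factor(pred_label_low),
                reference = wind_test$off_design,
                positive = "1")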
The model_wind_step has shown us an extremely promising
performance, but we could try to classify our data using KNN and compare
the performance.
Model Improvement with KNN
k-NN or K-nearest neighbor classifies new data by comparing the characteristics of the new data (test data) with existing data (train data).
Characteristic proximity is measured by Euclidean distance until the k data points (neighbors) with the closest distance are obtained. The class most common among these neighbors becomes the class of the new data (majority voting).
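For intuition, here is a minimal sketch of that idea on toy numbers (purely illustrative, not the dataset used in this study): compute the Euclidean distances from a query point to the training points, take the k nearest, and vote.

# toy illustration of the KNN idea with k = 3
train_pts <- matrix(c(1, 1,  1, 2,  5, 5,  6, 5), ncol = 2, byrow = TRUE)
train_lab <- factor(c(0, 0, 1, 1))
query     <- c(1.5, 1.5)

d <- sqrt(rowSums((train_pts - matrix(query, nrow(train_pts), 2, byrow = TRUE))^2))
nearest <- order(d)[1:3]                    # indices of the 3 closest neighbors
names(which.max(table(train_lab[nearest]))) # majority vote -> "0"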
Initiation
- Choose K
A common rule of thumb is to take the optimum k as the square root of the number of training observations:
sqrt(nrow(data))
sqrt(nrow(wind_train))
## [1] 262.2975
Since we have 2 classes in our target variable, we pick an odd number to avoid tied votes, so we choose 263 as the optimum k.
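That choice can also be scripted; a small sketch, rounding the square root to the nearest integer and bumping it to the next odd number:

# square-root heuristic, bumped to an odd number to avoid tied votes
k_opt <- round(sqrt(nrow(wind_train)))
if (k_opt %% 2 == 0) k_opt <- k_opt + 1
k_opt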
- Define Predictors and Target Variable
In KNN, we have to define the predictors and target variable prior to further implementation in the algorithm.
# predictors
wind_train_x <- wind_train %>% select_if(is.numeric)
wind_test_x <- wind_test %>% select_if(is.numeric)
# target
wind_train_y <- wind_train[,"off_design"]
wind_test_y <- wind_test[,"off_design"]
prop.table(table(wind_train_y))
## wind_train_y
## 0 1
## 0.6049128 0.3950872
prop.table(table(wind_test_y))
## wind_test_y
## 0 1
## 0.6006395 0.3993605
We have balanced data for both classes in the target variable, hence no upsampling / downsampling is needed.
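Had the classes been noticeably imbalanced, the caret package provides helpers for rebalancing the training set; a hedged sketch of how that might look (not needed here):

# only if rebalancing were required; upSample() duplicates minority-class rows
wind_train_bal <- upSample(x = wind_train %>% select(-off_design),
                           y = wind_train$off_design,
                           yname = "off_design")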
Scaling Numerical Value
KNN is sensitive to large differences in data scale. To avoid one variable dominating the others, we have to scale our data using the scale() function.

The predictor data will be scaled using z-score standardization. The test data must also be scaled using parameters from the training data (because the test data is treated as unseen data).

- The scale() function takes several parameters:
  - x = the object you want to scale
  - center = the mean value (taken from the centering values of the scaled training data)
  - scale = the standard deviation value (taken from the sd values of the scaled training data)
# scaling data
# train
wind_train_xs <- scale(wind_train_x)
# test
wind_test_xs <- scale(x = wind_test_x,
center = attr(wind_train_xs, "scaled:center"),
scale = attr(wind_train_xs, "scaled:scale"))
We should now have the uniformly scaled predictors.
boxplot(wind_train_xs)
boxplot(wind_test_xs)
Predict
Unlike the logistic and linear regression cases, the knn() function does not build a model but directly predicts the test data.
Parameters to the knn() function:
- train: training data, predictors only, scaled, numeric
- test: test data, predictors only, scaled, numeric
- cl: actual labels (target) of the training data (categorical)
- k: the specified value of k
wind_pred <- knn(train = wind_train_xs,
test = wind_test_xs,
cl = wind_train_y,
k = 263)
head(wind_pred)
## [1] 0 0 0 0 0 0
## Levels: 0 1
Evaluation
Using Confusion Matrix
confusionMatrix(data = wind_pred,
reference = wind_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9997 344
## 1 334 6525
##
## Accuracy : 0.9606
## 95% CI : (0.9576, 0.9634)
## No Information Rate : 0.6006
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.9178
##
## Mcnemar's Test P-Value : 0.7296
##
## Sensitivity : 0.9499
## Specificity : 0.9677
## Pos Pred Value : 0.9513
## Neg Pred Value : 0.9667
## Prevalence : 0.3994
## Detection Rate : 0.3794
## Detection Prevalence : 0.3988
## Balanced Accuracy : 0.9588
##
## 'Positive' Class : 1
##
Sensitivity using KNN is 94.99%
Conclusion
- The performance of model_wind_step is slightly better than the KNN predictor (accuracy of 96.17% vs 96.06%, sensitivity of 95.18% vs 94.99%); this reflects the step-wise feature selection, which uses the AIC to keep only the most relevant predictors
- With KNN, we do not have to build a model; we use the knn() function directly to predict the data
- KNN predicts by learning patterns from a given set of training data, hence we have to supply the training data again for each new prediction
Reference
- G. Bangga. Wind Turbine Aerodynamics Modeling Using CFD Approaches. AIP Publishing LLC, Melville, New York, 2022. https://doi.org/10.1063/9780735424111
- H. Glauert. The elements of aerofoil and airscrew theory. Cambridge University Press, 1983.
- G. Bangga. "Steady BEM Data using B-GO code for DanAero Turbine with 3D Hybrid CFD Polar." Harvard Dataverse, V2, 2023. https://doi.org/10.7910/DVN/3RRNYO
- G. Bangga. “Comparison of blade element method and CFD simulations of a 10 MW wind turbine.” Fluids 3 (4), 73. 2018. https://doi.org/10.3390/fluids3040073
- H.A. Madsen, C. Bak, U. Paulsen, M. Gaunaa, P. Fuglsang, J. Romblad, N. Olesen, P. Enevoldsen, J. Laursen, L. Jensen. "The DAN-AERO MW experiments: final report." Danmarks Tekniske Universitet, Risø Nationallaboratoriet for Bæredygtig Energi, 2010. orbit.dtu.dk. Accessed 13 November 2017.