Classification Case of Wind Turbine Power with Logistic Regression & KNN

Theresia Londong

2023-03-18

Photo by Karsten Würth on Unsplash

Intro

Study Case

As countries seek clean, renewable, low-cost, domestic alternatives to fossil fuels, wind energy is an increasingly important source of power generation worldwide [1]. This has drawn considerable attention and investment to wind turbine development and has led to great success over the last decades. Wind turbines are commonly designed using engineering models, and for this purpose Blade Element Momentum (BEM) theory [2] has proven to be the workhorse for aerodynamic calculations. Many codes have been developed to evaluate wind turbine performance, and the dataset obtained from the Harvard Dataverse [3] contains one such set of calculation results, produced with the B-GO code [4].

The data used here reports the calculated power of a typical wind turbine rotor; the specific example taken as the basis is the DanAero wind turbine [5]. This turbine has a rated power of 2.3 MW, the design power generated by the rotor. The calculations scale the rotor power with the power coefficient over several ranges of wind speed and air density. The results include Gaussian noise applied to the obtained power, added in the code to roughly incorporate the uncertainty that occurs in the field. Wind turbine power typically scales with the cube of wind speed and linearly with the power coefficient and air density. Although in reality the yaw misalignment angle, the shear exponent and its reference height affect wind turbine performance considerably, these effects are notably not captured in the documented results, since only a steady version of the B-GO code was used [4].
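
As a quick illustration of that scaling, here is a minimal R sketch of the standard rotor-power relation P = 0.5 * rho * pi * R^2 * Cp * V^3; this is not the B-GO implementation itself, and the 40.02 m radius is simply taken from the dataset:

# standard rotor-power relation, in Watts (illustrative only, not the B-GO code)
rotor_power_w <- function(v, cp, rho, radius = 40.02) {
  0.5 * rho * pi * radius^2 * cp * v^3
}

rotor_power_w(v = 11.4, cp = 0.45, rho = 1.225) / 1000   # approximate power in kW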

The main objective of this study is to identify observations whose resulting power falls into the off_design criterion, so that further treatment can be planned.

What Will We Do

Using the 86,000 observations and 14 parameters in our dataset, we will build a model that classifies observation points based on an off-design criterion, defined as the power (in kW) exceeding the rated power (2300 kW) plus a 10% margin.
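
As a small numeric illustration of this criterion (the actual labelling is done later during data wrangling):

rated_kw <- 2300
threshold_kw <- 1.1 * rated_kw        # 2530 kW
c(2400, 2600) >= threshold_kw         # FALSE  TRUE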

In this classification case, we will first split our dataset into an 80:20 proportion of training and test data. We then build a logistic regression model on the training data, perform step-wise feature selection to improve its performance, and finally use the KNN algorithm to predict the test data, to see which approach performs best.

Source data from Harvard Dataverse.

Data Preparation

Load Packages

Before we move further, let’s load the required packages.

library(tidyverse)
library(plotly)
library(scales)
library(lmtest)
library(dplyr)
library(car)
library(caret)
library(gtools)
library(class) # package for the `knn()` function

Read Data

Start with reading our csv file.

wind <- read.csv('data_input/SteadyBEMData_BGO_CpConst.csv')
The first six rows of the dataset (columns wrapped for readability):

  RunNr  CodeBase     TurbineModel  RotorNumber  RotorRadius.m.  TurbWakeState  TipLossModel  PowerCoeff
  1      B-GO_Steady  Generic       3            40.02           Spera          Prandtl       0.3
  2      B-GO_Steady  Generic       3            40.02           Spera          Prandtl       0.3
  3      B-GO_Steady  Generic       3            40.02           Spera          Prandtl       0.3
  4      B-GO_Steady  Generic       3            40.02           Spera          Prandtl       0.3
  5      B-GO_Steady  Generic       3            40.02           Spera          Prandtl       0.3
  6      B-GO_Steady  Generic       3            40.02           Spera          Prandtl       0.3

  RunNr  WindSpeed.m.s.  Yaw.deg.  Shear.deg.  RefHeight.m.  AirDensity.kg.m3.  Power.W.   Power_kW
  1      3.0             0         0           0             1.1                22065.18   22.06518
  2      3.4             0         0           0             1.1                33356.47   33.35647
  3      3.8             0         0           0             1.1                45526.90   45.52690
  4      4.2             0         0           0             1.1                61362.37   61.36237
  5      4.6             0         0           0             1.1                73399.07   73.39907
  6      5.0             0         0           0             1.1                123781.61  123.78161
str(wind)
## 'data.frame':    86000 obs. of  15 variables:
##  $ X..RunNr...      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CodeBase...      : chr  "B-GO_Steady" "B-GO_Steady" "B-GO_Steady" "B-GO_Steady" ...
##  $ TurbineModel...  : chr  "Generic" "Generic" "Generic" "Generic" ...
##  $ RotorNumber...   : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ RotorRadius.m.   : num  40 40 40 40 40 ...
##  $ TurbWakeState... : chr  "Spera" "Spera" "Spera" "Spera" ...
##  $ TipLossModel...  : chr  "Prandtl" "Prandtl" "Prandtl" "Prandtl" ...
##  $ PowerCoeff...    : num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ WindSpeed.m.s.   : num  3 3.4 3.8 4.2 4.6 5 5.4 5.8 6.2 6.6 ...
##  $ Yaw.deg.         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Shear.deg.       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ RefHeight.m.     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AirDensity.kg.m3.: num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ Power.W.         : num  22065 33356 45527 61362 73399 ...
##  $ Power_kW         : num  22.1 33.4 45.5 61.4 73.4 ...
id parameter definition
1 RunNr ID of the simulation run [-]
2 CodeBase Name of code being used to run the simulation [-]
3 TurbineModel Rotor model being simulated [-]
4 RotorNumber Number of blades to construct the rotor [-]
5 RotorRadius Radius of the rotor [m]
6 TurbWakeState Turbulent wake state model being used in the simulation [-]
7 TipLossModel Tip loss model being used in the simulation [-]
8 PowerCoeff Value of the power coefficient of the rotor [-]
9 WindSpeed Operating wind speed of the rotor used in the simulation [m/s]
10 Yaw Operating yaw angle used in the simulation [degrees]
11 Shear Operating wind shear exponent (power law) used in the simulation [-]
12 RefHeight Reference height to generate the wind shear used in the simulation [m]
13 AirDensity Density of air [kg/m3]
14 Power.W. Calculated wind power, with Gaussian noise applied to the predicted results [W]

Data Wrangling

Then, we inspect the statistical summary of our wind data.

summary(wind)
##   X..RunNr...    CodeBase...        TurbineModel...    RotorNumber...
##  Min.   :    1   Length:86000       Length:86000       Min.   :3     
##  1st Qu.:21501   Class :character   Class :character   1st Qu.:3     
##  Median :43001   Mode  :character   Mode  :character   Median :3     
##  Mean   :43001                                         Mean   :3     
##  3rd Qu.:64500                                         3rd Qu.:3     
##  Max.   :86000                                         Max.   :3     
##  RotorRadius.m.  TurbWakeState...   TipLossModel...    PowerCoeff...   
##  Min.   :40.02   Length:86000       Length:86000       Min.   :0.3000  
##  1st Qu.:40.02   Class :character   Class :character   1st Qu.:0.3375  
##  Median :40.02   Mode  :character   Mode  :character   Median :0.3750  
##  Mean   :40.02                                         Mean   :0.3750  
##  3rd Qu.:40.02                                         3rd Qu.:0.4125  
##  Max.   :40.02                                         Max.   :0.4500  
##  WindSpeed.m.s.    Yaw.deg.   Shear.deg.    RefHeight.m. AirDensity.kg.m3.
##  Min.   : 3.0   Min.   :0   Min.   :0.00   Min.   : 0    Min.   :1.100    
##  1st Qu.: 7.0   1st Qu.:0   1st Qu.:0.04   1st Qu.:20    1st Qu.:1.147    
##  Median :11.4   Median :0   Median :0.08   Median :40    Median :1.195    
##  Mean   :11.4   Mean   :0   Mean   :0.08   Mean   :40    Mean   :1.195    
##  3rd Qu.:15.8   3rd Qu.:0   3rd Qu.:0.12   3rd Qu.:60    3rd Qu.:1.242    
##  Max.   :19.8   Max.   :0   Max.   :0.16   Max.   :80    Max.   :1.290    
##     Power.W.           Power_kW       
##  Min.   :   13149   Min.   :   13.15  
##  1st Qu.:  394055   1st Qu.:  394.06  
##  Median : 1640385   Median : 1640.38  
##  Mean   : 2620260   Mean   : 2620.26  
##  3rd Qu.: 4263215   3rd Qu.: 4263.22  
##  Max.   :14138431   Max.   :14138.43

Based on the above summary, we decided to:

  • add a new column, off_design, where:
    • Power_kW >= (1.1 * 2300) -> YES (1)
    • Power_kW < (1.1 * 2300) -> NO (0)
  • transform off_design into a factor
  • remove the columns: X..RunNr…, CodeBase…, TurbineModel…, RotorNumber…, RotorRadius.m., TurbWakeState…, TipLossModel…, PowerCoeff…, Yaw.deg., Power.W. and Power_kW
wind_cln <- wind %>%   
  mutate(off_design = case_when( 
  Power_kW >= (1.1*2300) ~ 1,
  Power_kW < (1.1*2300) ~ 0)) %>% 
  mutate(off_design = as.factor(off_design)) %>% 
  select(-c('X..RunNr...','CodeBase...','TurbineModel...', 'RotorNumber...', 'RotorRadius.m.', 'TurbWakeState...', 'TipLossModel...', 'Yaw.deg.', 'Power.W.', 'Power_kW', 'PowerCoeff...'))

Perform further inspection on missing values, if any.

#check missing value
wind_cln %>% is.na() %>% 
  colSums()
##    WindSpeed.m.s.        Shear.deg.      RefHeight.m. AirDensity.kg.m3. 
##                 0                 0                 0                 0 
##        off_design 
##                 0

EDA

Moving to the next stage, the statistical summary always provides a good starting point for exploring our data.

summary(wind_cln)
##  WindSpeed.m.s.   Shear.deg.    RefHeight.m. AirDensity.kg.m3. off_design
##  Min.   : 3.0   Min.   :0.00   Min.   : 0    Min.   :1.100     0:51949   
##  1st Qu.: 7.0   1st Qu.:0.04   1st Qu.:20    1st Qu.:1.147     1:34051   
##  Median :11.4   Median :0.08   Median :40    Median :1.195               
##  Mean   :11.4   Mean   :0.08   Mean   :40    Mean   :1.195               
##  3rd Qu.:15.8   3rd Qu.:0.12   3rd Qu.:60    3rd Qu.:1.242               
##  Max.   :19.8   Max.   :0.16   Max.   :80    Max.   :1.290
  • Check outliers
boxplot(wind_cln)

We see a scaling issue in our data. The gap is not necessarily huge, but we can normalize the data later to avoid bias, i.e. one variable dominating the others.

  • Check Multicollinearity :

We can use domain knowledge to identify multicollinear variables, or use the VIF (Variance Inflation Factor) computed with the vif() function. As a rule of thumb, a variable's VIF should be below 10 for it to be considered free of multicollinearity.

model_dummy <- glm(formula = off_design ~ ., data = wind_cln, family = "binomial")

vif(model_dummy)
##    WindSpeed.m.s.        Shear.deg.      RefHeight.m. AirDensity.kg.m3. 
##          1.078163          1.000039          1.000037          1.078087

There is NO multicollinearity between the predictor columns (all VIF values are well below 10).

  • Check class proportion in the target variable
#class proportion of target variable

prop.table(table(wind_cln$off_design))
## 
##         0         1 
## 0.6040581 0.3959419

In our case, the roughly 60:40 proportion is considered balanced data.

Model Fitting

Cross Validation

Split the wind_cln data into 80% training data and 20% test data.

RNGkind(sample.kind = "Rounding") 
set.seed(111)

# index sampling
index <- sample(x = nrow(wind_cln), size = nrow(wind_cln)*0.8)

# splitting
wind_train <- wind_cln[index,]
wind_test <- wind_cln[-index,]
  • Check the class proportion of the target variable in the training data
prop.table(table(wind_train$off_design))
## 
##         0         1 
## 0.6049128 0.3950872

Build Model

Model Base

First, we build our base logistic regression model using all columns, except off_design, as predictors.

# model base
model_wind_all <- glm(formula = off_design ~ ., data = wind_train, family = "binomial")

Model After Step-Wise

Apply backward step-wise regression for feature selection.

# stepwise
model_wind_step <- step(object = model_wind_all,
                   direction = "backward", trace = F)

Evaluation

  • Deviance

The deviance value cannot be interpreted directly; it must be compared against the null model or other models.

#model null creation
model_wind_null <- glm(formula = off_design ~ 1, data = wind_train, family = "binomial")

Hence we compare the deviance of our base model (model_wind_all), the null model (model_wind_null), and the model after step-wise regression (model_wind_step):

# compare deviance
model_wind_null$deviance
## [1] 92325.38
model_wind_all$deviance
## [1] 11423.8
model_wind_step$deviance
## [1] 11424.17

We choose the model with the lowest deviance value.
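
For reference, with Bernoulli (0/1) responses the saturated model has zero log-likelihood, so the residual deviance reported above is just -2 times the model's log-likelihood; we can verify this directly:

# deviance = -2 * log-likelihood of the fitted binomial GLM
-2 * as.numeric(logLik(model_wind_all))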

  • AIC

AIC describes the amount of missing information from a model. The smaller the AIC, the less information is lost.

#compare AIC
model_wind_null$aic
## [1] 92327.38
model_wind_all$aic
## [1] 11433.8
model_wind_step$aic
## [1] 11430.17
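
As a sanity check, for this binomial GLM the AIC is simply the residual deviance plus twice the number of estimated coefficients, which matches the value reported above for model_wind_step:

# AIC = deviance + 2 * number of coefficients
model_wind_step$deviance + 2 * length(coef(model_wind_step))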

Some criteria for selecting a good model are:

  • a low residual deviance / AIC
  • no Perfect Separation condition

In our case:

  • model_wind_all has the smallest deviance of the three, with only a ~0.4 point difference to model_wind_step
  • model_wind_step has the smallest AIC of the three models above.

We can go ahead with model_wind_step for now and further check the performance using Confusion Matrix later.

Model Interpretation

A logistic regression model produces coefficients as log-of-odds, which are hard to interpret directly, so we transform these values into odds or probabilities (a short numeric sketch of the conversion follows the list below). At a glance, though:

  • If a coefficient is positive (+) -> it increases the chance of our target (in this case, being off_design)
  • If a coefficient is negative (-) -> it lowers the chance of our target
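
As a minimal numeric sketch of the log-of-odds to odds/probability conversion (using a made-up value, not one of our coefficients):

log_odds <- 1.5                  # hypothetical log-of-odds value
odds <- exp(log_odds)            # ~4.48
prob <- odds / (1 + odds)        # ~0.82, same as gtools::inv.logit(log_odds)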

Based on the coefficients of model_wind_all and model_wind_step below, we can interpret the odds as follows:

#transforms to odds
exp(model_wind_all$coefficients)
##                   (Intercept)                WindSpeed.m.s. 
##    0.000000000000000001854007   10.114427739994940935730483 
##                    Shear.deg.                  RefHeight.m. 
##    0.776149085444676645195727    1.000085965538641596950242 
##             AirDensity.kg.m3. 
## 5619.031688186513747496064752

Odds interpretation:

  • With each 1 unit increase in WindSpeed.m.s., the odds of a data point being off_design are multiplied by 10.11
  • With each 1 unit increase in Shear.deg., the odds of a data point being off_design are multiplied by 0.77 (i.e. they decrease)
  • With each 1 unit increase in RefHeight.m., the odds of a data point being off_design are multiplied by 1.000086 (essentially unchanged)
  • With each 1 unit increase in AirDensity.kg.m3., the odds of a data point being off_design are multiplied by 5619.03
#transforms to odds
exp(model_wind_step$coefficients)
##                   (Intercept)                WindSpeed.m.s. 
##    0.000000000000000001829171   10.113757844904530003304899 
##             AirDensity.kg.m3. 
## 5607.863234881284370203502476

Odds interpretation:

  • With each 1 unit increase in WindSpeed.m.s., the odds of a data point being off_design are multiplied by 10.11
  • With each 1 unit increase in AirDensity.kg.m3., the odds of a data point being off_design are multiplied by 5607.86

Model Performance

Predict

Using the predict(object, newdata, type) function:

  • object: the model used for prediction
  • newdata: the test data / unseen data / new data
  • type: the type of output for the predicted values

The type parameter accepts the following options:

  • link: returns the log of odds
  • response: returns the probability
# log of odds
log_wind <- predict(object = model_wind_step, 
                     newdata = wind_test, 
                     type = "link")

odd_wind <- exp(log_wind)

p_wind <- inv.logit(log_wind)

wind_log_odd <- data.frame(log_wind, odd_wind, p_wind)

Convert the probabilities into YES (1) / NO (0) labels using a 0.5 threshold.

pred_label <- ifelse(test = p_wind > 0.5, yes = 1, no = 0)
The first few rows of the resulting log-odds, odds, probabilities and predicted labels:

      log_wind   odd_wind     p_wind  pred_label
2   -23.480303  0.0000000  0.0000000           0
7   -18.852509  0.0000000  0.0000000           0
8   -17.926951  0.0000000  0.0000000           0
15  -11.448040  0.0000107  0.0000107           0
18   -8.671364  0.0001714  0.0001714           0
21   -5.894688  0.0027540  0.0027465           0

Evaluation with Confusion Matrix

Confusion Matrix

Function: confusionMatrix(data, reference, positive)

  • data: predicted result label (factor)
  • reference: actual label (factor)
  • positive: positive class name
confusionMatrix(data = as.factor(pred_label), 
                reference = wind_test$off_design, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10003   331
##          1   328  6538
##                                              
##                Accuracy : 0.9617             
##                  95% CI : (0.9587, 0.9645)   
##     No Information Rate : 0.6006             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9201             
##                                              
##  Mcnemar's Test P-Value : 0.9379             
##                                              
##             Sensitivity : 0.9518             
##             Specificity : 0.9683             
##          Pos Pred Value : 0.9522             
##          Neg Pred Value : 0.9680             
##              Prevalence : 0.3994             
##          Detection Rate : 0.3801             
##    Detection Prevalence : 0.3992             
##       Balanced Accuracy : 0.9600             
##                                              
##        'Positive' Class : 1                  
## 

According to our business case, we want to avoid the case where the actual power is off_design (1) but it is predicted as NOT off_design (0), i.e. a False Negative. Hence, our metric of concern is the Sensitivity / Recall, which is currently 95.18%.
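
For clarity, this sensitivity is simply TP / (TP + FN) read off the confusion matrix above:

# sensitivity / recall = true positives / (true positives + false negatives)
6538 / (6538 + 331)   # ~0.9518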

model_wind_step shows very promising performance, but we can also try to classify our data using KNN and compare the results.

Model Improvement with KNN

k-NN, or k-nearest neighbours, classifies new data by comparing the characteristics of the new data (test data) with those of the existing data (train data).

Characteristic proximity is measured with the Euclidean distance, and the k data points (neighbours) with the closest distances are taken. The class most common among these neighbours becomes the class of the new data (majority voting).
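
As a toy illustration of these two steps (made-up points, not our wind data):

# Euclidean distance between two points
a <- c(1, 2); b <- c(4, 6)
sqrt(sum((a - b)^2))                      # 5

# majority vote among the classes of the k = 5 nearest neighbours
neighbours <- factor(c(0, 1, 1, 0, 1))
names(which.max(table(neighbours)))       # "1"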

Initiation

  • Choose K

A common rule of thumb for the optimum k is the square root of the number of training observations: sqrt(nrow(data))

sqrt(nrow(wind_train))
## [1] 262.2975

Since we have 2 classes in the target variable, we choose an odd value, k = 263, to avoid ties in the majority vote.

  • Define Predictors and Target Variable

In KNN, we have to separate the predictors and the target variable before feeding them to the algorithm.

# predictors
wind_train_x <- wind_train %>% select_if(is.numeric)

wind_test_x <- wind_test %>% select_if(is.numeric)

# target
wind_train_y <- wind_train[,"off_design"]

wind_test_y <- wind_test[,"off_design"]
prop.table(table(wind_train_y))
## wind_train_y
##         0         1 
## 0.6049128 0.3950872
prop.table(table(wind_test_y))
## wind_test_y
##         0         1 
## 0.6006395 0.3993605

We have balanced data for both classes in the target variable, hence no upsampling / downsampling is needed.
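
Had the classes been imbalanced, one option (not needed here) would be to rebalance the training data, for example with caret's upSample() — a hypothetical sketch:

# hypothetical rebalancing of an imbalanced training set (not run)
# balanced <- upSample(x = wind_train_x, y = wind_train_y, yname = "off_design")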

Scaling Numerical Value

KNN is sensitive to large differences in data scale. To avoid one variable dominating the others, we have to scale our data using the scale() function.

Predictor data will be scaled using z-score standardization. The test data must also be scaled using the parameters (mean and standard deviation) from the train data, because the test data is treated as unseen data.

  • the scale() function takes several parameters:
    • x = the object to be scaled
    • center = the mean values (taken from the "scaled:center" attribute of the scaled train data)
    • scale = the standard deviation values (taken from the "scaled:scale" attribute of the scaled train data)
# scaling data

# train
wind_train_xs <- scale(wind_train_x)

# test
wind_test_xs <- scale(x = wind_test_x,
                      center = attr(wind_train_xs, "scaled:center"),
                      scale = attr(wind_train_xs, "scaled:scale"))

We should now have the uniformly scaled predictors.

boxplot(wind_train_xs)

boxplot(wind_test_xs)

Predict

Unlike the logistic and linear regression cases, the knn() function does not build a model; it directly predicts the labels of the test data from the training data.

Parameters to the knn() function:

  • train : training data, predictors only, scaled, numeric
  • test : test data, predictors only, scaled, numeric
  • cl : actual labels (target) of the training data (categorical)
  • k : the chosen value of k
wind_pred <- knn(train = wind_train_xs, 
                 test = wind_test_xs, 
                 cl = wind_train_y, 
                 k = 263)

head(wind_pred)
## [1] 0 0 0 0 0 0
## Levels: 0 1

Evaluation

Using Confusion Matrix

confusionMatrix(data = wind_pred, 
                reference = wind_test_y, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9997  344
##          1  334 6525
##                                              
##                Accuracy : 0.9606             
##                  95% CI : (0.9576, 0.9634)   
##     No Information Rate : 0.6006             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9178             
##                                              
##  Mcnemar's Test P-Value : 0.7296             
##                                              
##             Sensitivity : 0.9499             
##             Specificity : 0.9677             
##          Pos Pred Value : 0.9513             
##          Neg Pred Value : 0.9667             
##              Prevalence : 0.3994             
##          Detection Rate : 0.3794             
##    Detection Prevalence : 0.3988             
##       Balanced Accuracy : 0.9588             
##                                              
##        'Positive' Class : 1                  
## 

Sensitivity using KNN is 94.99%

Conclusion

  • model_wind_step performs slightly better than the KNN predictor (accuracy 96.17% vs 96.06%, sensitivity 95.18% vs 94.99%); this is helped by step-wise feature selection, which uses the AIC to retain only the most relevant predictors
  • With KNN, we do not build a model; we use the knn() function directly to predict the data
  • KNN predicts by comparing each new point with the training data (it does not learn a compact model), hence the full training data must be supplied for every prediction

Reference

  1. G. Bangga. Wind Turbine Aerodynamics Modeling Using CFD Approaches. AIP Publishing LLC, Melville, New York, 2022. https://doi.org/10.1063/9780735424111
  2. H. Glauert. The elements of aerofoil and airscrew theory. Cambridge University Press, 1983.
  3. G. Bangga. "Steady BEM Data using B-GO code for DanAero Turbine with 3D Hybrid CFD Polar." Harvard Dataverse, V2, 2023. https://doi.org/10.7910/DVN/3RRNYO
  4. G. Bangga. “Comparison of blade element method and CFD simulations of a 10 MW wind turbine.” Fluids 3 (4), 73. 2018. https://doi.org/10.3390/fluids3040073
  5. H.A. Madsen, C. Bak, U. Paulsen, M. Gaunaa, P. Fuglsang, J. Romblad, N. Olesen, P. Enevoldsen, J. Laursen, L. Jensen. "The DAN-AERO MW experiments: final report." Danmarks Tekniske Universitet, Risø Nationallaboratoriet for Bæredygtig Energi, 2010. orbit.dtu.dk. Accessed 13 November 2017.