Goal

The purpose of this project is to analyze and predict the power output of wind turbines, a very important task for the renewable energy industry. The forecasting plays a big role in electricity prices and sloppy predictions can lead to all sorts of supply/demand disparities. Although energy companies have highly advanced tools and highly skilled analysts, I hope to reenact a fraction of their work in a much smaller magnitude. I hope to answer the following questions:

How can we best organize and format the data for our own comprehension? (Understanding the Data)
Can we observe any patterns that would be useful information in the future? (Data Exploration)
Can we use a model to predict the active power? (Model Development)

1. Understanding the Data

The dataset I will be using is taken from wind turbine operating in Yalova, Turkey, and was recorded through supervisory control and data acquisition (SCADA) systems. The dataset contains the following information:

Date/Time (for 10 minutes intervals)
LV ActivePower (kW): The power generated by the turbine for that moment
Wind Speed (m/s): The wind speed at the hub height of the turbine (the wind speed that turbine use for electricity generation)
TheoreticalPowerCurve (kW): The theoretical power values that the turbine generates with that wind speed which is given by the turbine manufacturer
Wind Direction (°): The wind direction at the hub height of the turbine (wind turbines turn to this direction automatically)

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(rpart)
library(rpart.plot)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

regr.error <- function(predicted,actual){
  #MAE
  mae <- mean(abs(actual-predicted))
  #MSE
  mse <- mean((actual-predicted)^2)
  #RMSE
  rmse <- sqrt(mean((actual-predicted)^2))
  #MAPE
  mape <- mean(abs((actual-predicted)/actual))
  errors <- c(mae,mse,rmse,mape)
  names(errors) <- c("mae","mse","rmse","mape")
  return(errors)
}

WindPower <- read.csv("~/Downloads/T1.csv")

#Renaming the columns for convenience
WindPower = rename(WindPower, WindSpeed = Wind.Speed..m.s.)
WindPower = rename(WindPower, ActivePower = LV.ActivePower..kW.)
WindPower = rename(WindPower, TheoreticalPower = Theoretical_Power_Curve..KWh.)
WindPower = rename(WindPower, WindDirection = Wind.Direction....)

summary(WindPower)

##   Date.Time          ActivePower         WindSpeed      TheoreticalPower
##  Length:50530       Min.   :  -2.471   Min.   : 0.000   Min.   :   0.0  
##  Class :character   1st Qu.:  50.678   1st Qu.: 4.201   1st Qu.: 161.3  
##  Mode  :character   Median : 825.838   Median : 7.105   Median :1063.8  
##                     Mean   :1307.684   Mean   : 7.558   Mean   :1492.2  
##                     3rd Qu.:2482.508   3rd Qu.:10.300   3rd Qu.:2965.0  
##                     Max.   :3618.733   Max.   :25.206   Max.   :3600.0  
##  WindDirection   
##  Min.   :  0.00  
##  1st Qu.: 49.32  
##  Median : 73.71  
##  Mean   :123.69  
##  3rd Qu.:201.70  
##  Max.   :360.00

Reformatting the Date and Time

In order to properly represent the time that each measurement was taken, I will be converting the “Date.Time” column from a character to a Date object.

WindPower$Year = as.numeric(substr(WindPower$Date.Time, 7, 10))
WindPower$Day = as.numeric(substr(WindPower$Date.Time, 1, 2))
WindPower$Month = as.numeric(substr(WindPower$Date.Time, 4, 5))
WindPower$Hour = as.numeric(substr(WindPower$Date.Time, 12, 13))
WindPower$Minute = as.numeric(substr(WindPower$Date.Time, 15, 16))
table(WindPower$Day, WindPower$Month, dnn = c("day: ", "month: "))

##      month: 
## day:    1   2   3   4   5   6   7   8   9  10  11  12
##    1  144 144 144 144 144 144 144 144 144   0 144 144
##    2  144 144 144 144 144 144 144 143 144  11 144 144
##    3  144 144 144 144 144 144 144 142 144  61 144 144
##    4  127 144 144 143 132 106 144 144 144 144 144 137
##    5  144 144 144 144 144 143 144 144 144 144 144 144
##    6  140 144 144 144 144 144 144 144 143 144 144 144
##    7  144 144 144 144 144 144 144 144 144 144 144 144
##    8  144 144 144 144 144 144 144 144 144 144 144 144
##    9  144 144 144 144 144 144 144 144 144 144 144 144
##    10 144 144 143 144 144 144 144 144 144 144 128 144
##    11 144 144 144 144 144 144 144 144 144 144   0 144
##    12 143 144 144 144 144 144 144 144 144 144   0 144
##    13 144 144 144 144 144 144 144 144 143 144   0 144
##    14 144 144 144 144 144 144 144 144 129 144  72 144
##    15 144 144 144 144 144 144 144 144 144 144 144 144
##    16 144 144 144 144 144 143 144 130 144 144 144 144
##    17 144 144 144 130 144 144 144 122 144 144 144 134
##    18 144 144 144 144 144 144 144 144 144 144 144 144
##    19 144 144 144 144 144 144 144 144 144 144 144 144
##    20 144 144 144 144 144 144 144 144 144 144 144 144
##    21 144 144 144 144 144 144 144 144 144 144 144 144
##    22 144 144 144 144 144 143 144 144 144 144 144 144
##    23 144 144 144 144 144 144 144 144 144 144 144 144
##    24 144 144 144 144 144 144 144 144 144 144 144 144
##    25 144 144 144 144 144 144 144 144 144 144 144 144
##    26  39 144 144 144 144 141 144 144 144 144 144 144
##    27   0 144 144 144 141 113 144 144 144 144 144 144
##    28   0 144 144 144 144 144 144 144 129 144 144 144
##    29   0   0 144 144 144 144 144 144   0 144 144 144
##    30  56   0 144 144 144 144 144 144   0 124 144 144
##    31 144   0 144   0 144   0 144 144   0 143   0 144

Note: there are missing values in the dataset. Each day of the year should hold 144 unique measurements for each 10 minute interval, but certain days seem to have fewer. This could be due to a failure in the SCADA system or specific maintenance. This is merely speculation on my end.

WindPower$Date.Time = as.POSIXlt(as.Date(paste(WindPower$Year, WindPower$Month, WindPower$Day, sep = "-")))
WindPower$Date.Time = WindPower$Date.Time + 3600*WindPower$Hour + 60*WindPower$Minute

Now, the Date.Time column holds Date objects, such that we can graph time along the x-axis.

Cardinal Directions

I will now divide the range of the Wind Direction (°) into intervals of 30 so that we can easily represent the wind direction as cardinal directions, a categorical variable.

cardinalDirections = c("N", "NNE", "ENE", "E", "ESE", "SSE", "S", "SSW", "WSW", "W", "WNW", "NNW", "N")
WindPower$CardinalDirection = cut(WindPower$WindDirection, (1:14)*30 - 45, cardinalDirections)

2. Data Exploration

The Theoretical Power curve that was given by the manufacturers seem to be following a variation of logistic curve (commonly known as a carrying capacity curve in the study of ecology). For any wind speed below 3 mph, there seems to be almost no power generated, and seems to be capped at a little over 3600 kW (The maximum power recorded in this dataset is 3618.733 kW)

plot(WindPower$WindSpeed, WindPower$TheoreticalPower, main = "Theoretical Power by Wind Speed", xlab = "Wind Speed (mph)", ylab = "Theoretical Power (kW)", pch=20)

plot(WindPower$WindSpeed, WindPower$ActivePower, main = "Active Power by Wind Speed", xlab = "Wind Speed (mph)", ylab = "Active Power (kW)", pch=20)

There seems to be a big difference between the theoretical power and active power. I would specifically like to point out the cluster of values close to 0 kW, ignorant of the wind speed. Despite the wind being as high as 15 mph, why might the active power still be at 0 kW?

I speculate that this is due to the economic perspective of energy companies. Like all industries, the prices in the energy market are determined by supply and demand. Energy generators supply the market with electricity and wholesale consumers buy the electricity. However, the power grid is limited and cannot keep up with high amounts of electricity– this can lead to damaged equipment and even blackouts. The amount of electricity and the demand of power must be kept equal. So when the grid contains too much power, suppliers struggle to rid themselves as fast possible by dropping prices, sometimes into the negatives. Certain energy sources like solar energy are unable to efficiently stop and start production in small periods of time can be burdening, but with energy sources that come from wind turbines are faster and more easily responsive. Despite the wind being as high as 15 mph, the active power might still be at 0 kW in order to keep the power grid at a safe and operable level.

WindPower$Residual = WindPower$ActivePower - WindPower$TheoreticalPower

plot(WindPower$Date.Time, WindPower$Residual, main = "Residual Power (Active - Theoretical) over Time", xlab = "Time (1/1/2018 to 12/31/2018)", ylab = "Active Power (kW)", pch=20)

This plot graphs the difference between the active power and theoretical power over the entire year of measurements (residuals). While most of the values hover between 1000 kW on either side of 0, there seem to be periods in time where a large vertical strip of values displays variance much larger than the rest of the values. These could be periods in time where maintenance takes place, or where the power grid is at full capacity, and so energy production is momentarily limited.

Daily Power Flow

In the following lines of code, I will take the mean of the entire year’s active power measurements for every time of the day recorded to create a daily average power output from the wind turbine. This will allow us to better understand the power flow throughout the day.

timeList = c()
meanActivePowerList = c()
meanTheoreticalPowerList = c()

for(i in 0:23) {
  for(j in 0:5*10) {
    TemporaryWindPowerSubset = WindPower[WindPower$Hour == i & WindPower$Minute == j,]
    meanActivePower = mean(TemporaryWindPowerSubset$ActivePower)
    meanTheoreticalPower = mean(TemporaryWindPowerSubset$TheoreticalPower)
    currentTime = TemporaryWindPowerSubset$Hour[1] + round(TemporaryWindPowerSubset$Minute[1] / 60, 4)
      
    meanActivePowerList = append(meanActivePowerList, meanActivePower)
    meanTheoreticalPowerList = append(meanTheoreticalPowerList, meanTheoreticalPower)
    timeList = append(timeList, currentTime)
  }  
}

dailyAveragePower = data.frame(timeList, meanActivePowerList, meanTheoreticalPowerList)
plot(dailyAveragePower$timeList, dailyAveragePower$meanActivePowerList, type="l", ylim = c(1000,1700), main = "Power Throughout the Day", xlab = "Time (hours since 12:00)", ylab = "Power (kW)", pch=20, col="red")
lines(dailyAveragePower$timeList, dailyAveragePower$meanTheoreticalPowerList, col="green")
legend("topleft", c("Theoretical Power","Active Power"), cex = 0.8, fill = c("green", "red"))

The power output seems to peak during the hours of the night and is lowest in the morning (between 10 and 11 AM). This information is quite useful in determining the best time of day to do maintenance that would result in the least amount of power output disrupted.

Wind Direction

pie(table(WindPower$CardinalDirection), main = "Frequency of Each Wind Direction")

From a quick glance, it might seem like the difference in the records of wind direction is quite drastic, however, we have only take account the wind direction within every 10 minute interval and have ignored the effect of the wind speed. The wind might blow more often in some directions but might blow faster in other directions. Naturally, the speed must also be taken into account.

In the following lines of code, I multiply the frequency of each wind direction by the mean wind speed in each direction to give weight to the wind speed.

weightedMeanWindList = c()
for (cd in cardinalDirections){
  WindPowerSubset = WindPower[WindPower$CardinalDirection == cd,]
  weightedMeanWind = mean(WindPowerSubset$WindSpeed) * length(WindPowerSubset)
  weightedMeanWindList = append(weightedMeanWindList, weightedMeanWind)
}
pie(weightedMeanWindList, labels = cardinalDirections, main = "Frequency of Each Wind Direction (Weighted by Wind Speed)")

3. Model Development

The theoretical power column of this dataset is already a model developed and given to us by the manufacturers, and is probably most accurately in predicting the active power, but this power curve comes from an engineering perspective. We wish to analyze and create our own models from a statistical perspective, using only the data given to us from the year of 2018. Nevertheless, I will calculate the error of the theoretical power as a basis for the models I create below.

regr.error(WindPower$TheoreticalPower, WindPower$ActivePower)

##         mae         mse        rmse        mape 
##    195.4794 216961.9777    465.7918         NaN

The following model uses only a decision tree to predict the values of the active power. While limited in its abilities, this will still give us insight

model1 = rpart(ActivePower ~ WindSpeed + WindDirection, data = WindPower)
summary(model1)

## Call:
## rpart(formula = ActivePower ~ WindSpeed + WindDirection, data = WindPower)
##   n= 50530 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.73652518      0 1.0000000 1.0000468 0.004067597
## 2 0.07297932      1 0.2634748 0.2650045 0.002625604
## 3 0.06757356      2 0.1904955 0.1888589 0.002561639
## 4 0.01000000      3 0.1229219 0.1242328 0.002357004
## 
## Variable importance
##     WindSpeed WindDirection 
##            99             1 
## 
## Node number 1: 50530 observations,    complexity param=0.7365252
##   mean=1307.684, MSE=1722515 
##   left son=2 (32553 obs) right son=3 (17977 obs)
##   Primary splits:
##       WindSpeed     < 8.745992 to the left,  improve=0.7365252, (0 missing)
##       WindDirection < 231.4409 to the right, improve=0.0686938, (0 missing)
## 
## Node number 2: 32553 observations,    complexity param=0.07297932
##   mean=470.6593, MSE=288435.7 
##   left son=4 (22145 obs) right son=5 (10408 obs)
##   Primary splits:
##       WindSpeed     < 6.467511 to the left,  improve=0.67650670, (0 missing)
##       WindDirection < 78.3166  to the right, improve=0.05204578, (0 missing)
## 
## Node number 3: 17977 observations,    complexity param=0.06757356
##   mean=2823.381, MSE=753354.6 
##   left son=6 (7325 obs) right son=7 (10652 obs)
##   Primary splits:
##       WindSpeed     < 10.91866 to the left,  improve=0.4342827, (0 missing)
##       WindDirection < 47.59192 to the left,  improve=0.0644950, (0 missing)
##   Surrogate splits:
##       WindDirection < 58.44873 to the left,  agree=0.64, adj=0.115, (0 split)
## 
## Node number 4: 22145 observations
##   mean=167.8241, MSE=48447.48 
## 
## Node number 5: 10408 observations
##   mean=1114.999, MSE=188754.3 
## 
## Node number 6: 7325 observations
##   mean=2133.621, MSE=500561.2 
## 
## Node number 7: 10652 observations
##   mean=3297.704, MSE=375040.4

rpart.plot(model1)

From the decision tree, we can confirm that the wind direction plays an insignificant role in predicting the power output. This makes sense, given that we know the wind turbines automatically turn to the direction of the wind.

The next model is a quartic regression model dependent on the wind speed.

model2 = lm(formula = ActivePower ~ WindSpeed + I(WindSpeed^2) + I(WindSpeed^3) + I(WindSpeed^4), data = WindPower)
summary(model2)

## 
## Call:
## lm(formula = ActivePower ~ WindSpeed + I(WindSpeed^2) + I(WindSpeed^3) + 
##     I(WindSpeed^4), data = WindPower)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3707.5   -53.9    57.6   128.1   847.6 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.281e+02  1.245e+01   58.46   <2e-16 ***
## WindSpeed      -6.671e+02  7.360e+00  -90.64   <2e-16 ***
## I(WindSpeed^2)  1.535e+02  1.361e+00  112.80   <2e-16 ***
## I(WindSpeed^3) -8.663e+00  9.651e-02  -89.76   <2e-16 ***
## I(WindSpeed^4)  1.499e-01  2.288e-03   65.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 410 on 50525 degrees of freedom
## Multiple R-squared:  0.9024, Adjusted R-squared:  0.9024 
## F-statistic: 1.168e+05 on 4 and 50525 DF,  p-value: < 2.2e-16

The next few lines of code show us how the model fits to the active power.

WindPower$Prediction = predict(model2, WindPower)
regr.error(WindPower$Prediction, WindPower$ActivePower)

##         mae         mse        rmse        mape 
##    200.1519 168109.0218    410.0110         Inf

prediction = predict(model2, data.frame(WindSpeed=c(0:25)))
plot(WindPower$WindSpeed, WindPower$ActivePower, main = "Active Power by Wind Speed", xlab = "Wind Speed (mph)", ylab = "Active Power (kW)", pch=20)
lines(c(0:25), prediction, col="green", lwd = 2)

The model has limitations, as the predictions only seem to be effective for wind speeds between about 4 mph and 15 mph, however, this curve is quite suitable for the majority of points lying between these intervals.

The mean squared error of the quartic regression is 168109.0218, while the mean squared error of the the theoretical power curve given to us is 216961.9777. While the theoretical power curve’s mean absolute error slightly beats out our quartic regression, a test using the mean squared error (which penalizes larger differences in values) suggests that our model has a better fit on the active power output.

Conclusion

This project is far from over. I would like to investigate weather patterns and learn more about wind turbines in order to make speculations and conclusions that are more accurate. However, using the dataset obtained from wind power measurements of 2018, I was able to provide insight to myself, at the least.

Analyzing Wind Power

Parth Patel

12/13/2021