The purpose of this project is to analyze and predict the power output of wind turbines, a very important task for the renewable energy industry. The forecasting plays a big role in electricity prices and sloppy predictions can lead to all sorts of supply/demand disparities. Although energy companies have highly advanced tools and highly skilled analysts, I hope to reenact a fraction of their work in a much smaller magnitude. I hope to answer the following questions:
The dataset I will be using is taken from wind turbine operating in Yalova, Turkey, and was recorded through supervisory control and data acquisition (SCADA) systems. The dataset contains the following information:
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(rpart)
library(rpart.plot)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
regr.error <- function(predicted,actual){
#MAE
mae <- mean(abs(actual-predicted))
#MSE
mse <- mean((actual-predicted)^2)
#RMSE
rmse <- sqrt(mean((actual-predicted)^2))
#MAPE
mape <- mean(abs((actual-predicted)/actual))
errors <- c(mae,mse,rmse,mape)
names(errors) <- c("mae","mse","rmse","mape")
return(errors)
}
WindPower <- read.csv("~/Downloads/T1.csv")
#Renaming the columns for convenience
WindPower = rename(WindPower, WindSpeed = Wind.Speed..m.s.)
WindPower = rename(WindPower, ActivePower = LV.ActivePower..kW.)
WindPower = rename(WindPower, TheoreticalPower = Theoretical_Power_Curve..KWh.)
WindPower = rename(WindPower, WindDirection = Wind.Direction....)
summary(WindPower)
## Date.Time ActivePower WindSpeed TheoreticalPower
## Length:50530 Min. : -2.471 Min. : 0.000 Min. : 0.0
## Class :character 1st Qu.: 50.678 1st Qu.: 4.201 1st Qu.: 161.3
## Mode :character Median : 825.838 Median : 7.105 Median :1063.8
## Mean :1307.684 Mean : 7.558 Mean :1492.2
## 3rd Qu.:2482.508 3rd Qu.:10.300 3rd Qu.:2965.0
## Max. :3618.733 Max. :25.206 Max. :3600.0
## WindDirection
## Min. : 0.00
## 1st Qu.: 49.32
## Median : 73.71
## Mean :123.69
## 3rd Qu.:201.70
## Max. :360.00
In order to properly represent the time that each measurement was taken, I will be converting the “Date.Time” column from a character to a Date object.
WindPower$Year = as.numeric(substr(WindPower$Date.Time, 7, 10))
WindPower$Day = as.numeric(substr(WindPower$Date.Time, 1, 2))
WindPower$Month = as.numeric(substr(WindPower$Date.Time, 4, 5))
WindPower$Hour = as.numeric(substr(WindPower$Date.Time, 12, 13))
WindPower$Minute = as.numeric(substr(WindPower$Date.Time, 15, 16))
table(WindPower$Day, WindPower$Month, dnn = c("day: ", "month: "))
## month:
## day: 1 2 3 4 5 6 7 8 9 10 11 12
## 1 144 144 144 144 144 144 144 144 144 0 144 144
## 2 144 144 144 144 144 144 144 143 144 11 144 144
## 3 144 144 144 144 144 144 144 142 144 61 144 144
## 4 127 144 144 143 132 106 144 144 144 144 144 137
## 5 144 144 144 144 144 143 144 144 144 144 144 144
## 6 140 144 144 144 144 144 144 144 143 144 144 144
## 7 144 144 144 144 144 144 144 144 144 144 144 144
## 8 144 144 144 144 144 144 144 144 144 144 144 144
## 9 144 144 144 144 144 144 144 144 144 144 144 144
## 10 144 144 143 144 144 144 144 144 144 144 128 144
## 11 144 144 144 144 144 144 144 144 144 144 0 144
## 12 143 144 144 144 144 144 144 144 144 144 0 144
## 13 144 144 144 144 144 144 144 144 143 144 0 144
## 14 144 144 144 144 144 144 144 144 129 144 72 144
## 15 144 144 144 144 144 144 144 144 144 144 144 144
## 16 144 144 144 144 144 143 144 130 144 144 144 144
## 17 144 144 144 130 144 144 144 122 144 144 144 134
## 18 144 144 144 144 144 144 144 144 144 144 144 144
## 19 144 144 144 144 144 144 144 144 144 144 144 144
## 20 144 144 144 144 144 144 144 144 144 144 144 144
## 21 144 144 144 144 144 144 144 144 144 144 144 144
## 22 144 144 144 144 144 143 144 144 144 144 144 144
## 23 144 144 144 144 144 144 144 144 144 144 144 144
## 24 144 144 144 144 144 144 144 144 144 144 144 144
## 25 144 144 144 144 144 144 144 144 144 144 144 144
## 26 39 144 144 144 144 141 144 144 144 144 144 144
## 27 0 144 144 144 141 113 144 144 144 144 144 144
## 28 0 144 144 144 144 144 144 144 129 144 144 144
## 29 0 0 144 144 144 144 144 144 0 144 144 144
## 30 56 0 144 144 144 144 144 144 0 124 144 144
## 31 144 0 144 0 144 0 144 144 0 143 0 144
Note: there are missing values in the dataset. Each day of the year should hold 144 unique measurements for each 10 minute interval, but certain days seem to have fewer. This could be due to a failure in the SCADA system or specific maintenance. This is merely speculation on my end.
WindPower$Date.Time = as.POSIXlt(as.Date(paste(WindPower$Year, WindPower$Month, WindPower$Day, sep = "-")))
WindPower$Date.Time = WindPower$Date.Time + 3600*WindPower$Hour + 60*WindPower$Minute
Now, the Date.Time column holds Date objects, such that we can graph time along the x-axis.
I will now divide the range of the Wind Direction (°) into intervals of 30 so that we can easily represent the wind direction as cardinal directions, a categorical variable.
cardinalDirections = c("N", "NNE", "ENE", "E", "ESE", "SSE", "S", "SSW", "WSW", "W", "WNW", "NNW", "N")
WindPower$CardinalDirection = cut(WindPower$WindDirection, (1:14)*30 - 45, cardinalDirections)
The Theoretical Power curve that was given by the manufacturers seem to be following a variation of logistic curve (commonly known as a carrying capacity curve in the study of ecology). For any wind speed below 3 mph, there seems to be almost no power generated, and seems to be capped at a little over 3600 kW (The maximum power recorded in this dataset is 3618.733 kW)
plot(WindPower$WindSpeed, WindPower$TheoreticalPower, main = "Theoretical Power by Wind Speed", xlab = "Wind Speed (mph)", ylab = "Theoretical Power (kW)", pch=20)
plot(WindPower$WindSpeed, WindPower$ActivePower, main = "Active Power by Wind Speed", xlab = "Wind Speed (mph)", ylab = "Active Power (kW)", pch=20)
There seems to be a big difference between the theoretical power and active power. I would specifically like to point out the cluster of values close to 0 kW, ignorant of the wind speed. Despite the wind being as high as 15 mph, why might the active power still be at 0 kW?
I speculate that this is due to the economic perspective of energy companies. Like all industries, the prices in the energy market are determined by supply and demand. Energy generators supply the market with electricity and wholesale consumers buy the electricity. However, the power grid is limited and cannot keep up with high amounts of electricity– this can lead to damaged equipment and even blackouts. The amount of electricity and the demand of power must be kept equal. So when the grid contains too much power, suppliers struggle to rid themselves as fast possible by dropping prices, sometimes into the negatives. Certain energy sources like solar energy are unable to efficiently stop and start production in small periods of time can be burdening, but with energy sources that come from wind turbines are faster and more easily responsive. Despite the wind being as high as 15 mph, the active power might still be at 0 kW in order to keep the power grid at a safe and operable level.
WindPower$Residual = WindPower$ActivePower - WindPower$TheoreticalPower
plot(WindPower$Date.Time, WindPower$Residual, main = "Residual Power (Active - Theoretical) over Time", xlab = "Time (1/1/2018 to 12/31/2018)", ylab = "Active Power (kW)", pch=20)
This plot graphs the difference between the active power and theoretical power over the entire year of measurements (residuals). While most of the values hover between 1000 kW on either side of 0, there seem to be periods in time where a large vertical strip of values displays variance much larger than the rest of the values. These could be periods in time where maintenance takes place, or where the power grid is at full capacity, and so energy production is momentarily limited.
In the following lines of code, I will take the mean of the entire year’s active power measurements for every time of the day recorded to create a daily average power output from the wind turbine. This will allow us to better understand the power flow throughout the day.
timeList = c()
meanActivePowerList = c()
meanTheoreticalPowerList = c()
for(i in 0:23) {
for(j in 0:5*10) {
TemporaryWindPowerSubset = WindPower[WindPower$Hour == i & WindPower$Minute == j,]
meanActivePower = mean(TemporaryWindPowerSubset$ActivePower)
meanTheoreticalPower = mean(TemporaryWindPowerSubset$TheoreticalPower)
currentTime = TemporaryWindPowerSubset$Hour[1] + round(TemporaryWindPowerSubset$Minute[1] / 60, 4)
meanActivePowerList = append(meanActivePowerList, meanActivePower)
meanTheoreticalPowerList = append(meanTheoreticalPowerList, meanTheoreticalPower)
timeList = append(timeList, currentTime)
}
}
dailyAveragePower = data.frame(timeList, meanActivePowerList, meanTheoreticalPowerList)
plot(dailyAveragePower$timeList, dailyAveragePower$meanActivePowerList, type="l", ylim = c(1000,1700), main = "Power Throughout the Day", xlab = "Time (hours since 12:00)", ylab = "Power (kW)", pch=20, col="red")
lines(dailyAveragePower$timeList, dailyAveragePower$meanTheoreticalPowerList, col="green")
legend("topleft", c("Theoretical Power","Active Power"), cex = 0.8, fill = c("green", "red"))
The power output seems to peak during the hours of the night and is lowest in the morning (between 10 and 11 AM). This information is quite useful in determining the best time of day to do maintenance that would result in the least amount of power output disrupted.
pie(table(WindPower$CardinalDirection), main = "Frequency of Each Wind Direction")
From a quick glance, it might seem like the difference in the records of wind direction is quite drastic, however, we have only take account the wind direction within every 10 minute interval and have ignored the effect of the wind speed. The wind might blow more often in some directions but might blow faster in other directions. Naturally, the speed must also be taken into account.
In the following lines of code, I multiply the frequency of each wind direction by the mean wind speed in each direction to give weight to the wind speed.
weightedMeanWindList = c()
for (cd in cardinalDirections){
WindPowerSubset = WindPower[WindPower$CardinalDirection == cd,]
weightedMeanWind = mean(WindPowerSubset$WindSpeed) * length(WindPowerSubset)
weightedMeanWindList = append(weightedMeanWindList, weightedMeanWind)
}
pie(weightedMeanWindList, labels = cardinalDirections, main = "Frequency of Each Wind Direction (Weighted by Wind Speed)")
The theoretical power column of this dataset is already a model developed and given to us by the manufacturers, and is probably most accurately in predicting the active power, but this power curve comes from an engineering perspective. We wish to analyze and create our own models from a statistical perspective, using only the data given to us from the year of 2018. Nevertheless, I will calculate the error of the theoretical power as a basis for the models I create below.
regr.error(WindPower$TheoreticalPower, WindPower$ActivePower)
## mae mse rmse mape
## 195.4794 216961.9777 465.7918 NaN
The following model uses only a decision tree to predict the values of the active power. While limited in its abilities, this will still give us insight
model1 = rpart(ActivePower ~ WindSpeed + WindDirection, data = WindPower)
summary(model1)
## Call:
## rpart(formula = ActivePower ~ WindSpeed + WindDirection, data = WindPower)
## n= 50530
##
## CP nsplit rel error xerror xstd
## 1 0.73652518 0 1.0000000 1.0000468 0.004067597
## 2 0.07297932 1 0.2634748 0.2650045 0.002625604
## 3 0.06757356 2 0.1904955 0.1888589 0.002561639
## 4 0.01000000 3 0.1229219 0.1242328 0.002357004
##
## Variable importance
## WindSpeed WindDirection
## 99 1
##
## Node number 1: 50530 observations, complexity param=0.7365252
## mean=1307.684, MSE=1722515
## left son=2 (32553 obs) right son=3 (17977 obs)
## Primary splits:
## WindSpeed < 8.745992 to the left, improve=0.7365252, (0 missing)
## WindDirection < 231.4409 to the right, improve=0.0686938, (0 missing)
##
## Node number 2: 32553 observations, complexity param=0.07297932
## mean=470.6593, MSE=288435.7
## left son=4 (22145 obs) right son=5 (10408 obs)
## Primary splits:
## WindSpeed < 6.467511 to the left, improve=0.67650670, (0 missing)
## WindDirection < 78.3166 to the right, improve=0.05204578, (0 missing)
##
## Node number 3: 17977 observations, complexity param=0.06757356
## mean=2823.381, MSE=753354.6
## left son=6 (7325 obs) right son=7 (10652 obs)
## Primary splits:
## WindSpeed < 10.91866 to the left, improve=0.4342827, (0 missing)
## WindDirection < 47.59192 to the left, improve=0.0644950, (0 missing)
## Surrogate splits:
## WindDirection < 58.44873 to the left, agree=0.64, adj=0.115, (0 split)
##
## Node number 4: 22145 observations
## mean=167.8241, MSE=48447.48
##
## Node number 5: 10408 observations
## mean=1114.999, MSE=188754.3
##
## Node number 6: 7325 observations
## mean=2133.621, MSE=500561.2
##
## Node number 7: 10652 observations
## mean=3297.704, MSE=375040.4
rpart.plot(model1)
From the decision tree, we can confirm that the wind direction plays an insignificant role in predicting the power output. This makes sense, given that we know the wind turbines automatically turn to the direction of the wind.
The next model is a quartic regression model dependent on the wind speed.
model2 = lm(formula = ActivePower ~ WindSpeed + I(WindSpeed^2) + I(WindSpeed^3) + I(WindSpeed^4), data = WindPower)
summary(model2)
##
## Call:
## lm(formula = ActivePower ~ WindSpeed + I(WindSpeed^2) + I(WindSpeed^3) +
## I(WindSpeed^4), data = WindPower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3707.5 -53.9 57.6 128.1 847.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.281e+02 1.245e+01 58.46 <2e-16 ***
## WindSpeed -6.671e+02 7.360e+00 -90.64 <2e-16 ***
## I(WindSpeed^2) 1.535e+02 1.361e+00 112.80 <2e-16 ***
## I(WindSpeed^3) -8.663e+00 9.651e-02 -89.76 <2e-16 ***
## I(WindSpeed^4) 1.499e-01 2.288e-03 65.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 410 on 50525 degrees of freedom
## Multiple R-squared: 0.9024, Adjusted R-squared: 0.9024
## F-statistic: 1.168e+05 on 4 and 50525 DF, p-value: < 2.2e-16
The next few lines of code show us how the model fits to the active power.
WindPower$Prediction = predict(model2, WindPower)
regr.error(WindPower$Prediction, WindPower$ActivePower)
## mae mse rmse mape
## 200.1519 168109.0218 410.0110 Inf
prediction = predict(model2, data.frame(WindSpeed=c(0:25)))
plot(WindPower$WindSpeed, WindPower$ActivePower, main = "Active Power by Wind Speed", xlab = "Wind Speed (mph)", ylab = "Active Power (kW)", pch=20)
lines(c(0:25), prediction, col="green", lwd = 2)
The model has limitations, as the predictions only seem to be effective for wind speeds between about 4 mph and 15 mph, however, this curve is quite suitable for the majority of points lying between these intervals.
The mean squared error of the quartic regression is 168109.0218, while the mean squared error of the the theoretical power curve given to us is 216961.9777. While the theoretical power curve’s mean absolute error slightly beats out our quartic regression, a test using the mean squared error (which penalizes larger differences in values) suggests that our model has a better fit on the active power output.
This project is far from over. I would like to investigate weather patterns and learn more about wind turbines in order to make speculations and conclusions that are more accurate. However, using the dataset obtained from wind power measurements of 2018, I was able to provide insight to myself, at the least.