Use this R script to analyze wind speed and wind power data:
* import raw data from .csv
* check for missing values
* remove missing and spurious values
* perform engineering computations
* compute energy benchmark comparisons (e.g. kinetic energy contained in wind and Betz limit)
* visualize (plot) the raw data with benchmarks
* apply data filters based on physical knowledge of the system
* plot the filtered (“clean”) data
* fit a Weibull distribution to “clean” windspeed data to characterize regional wind resource
* estimate power curve of installed wind turbines (observed power output vs observed windspeed)
* compare with the manufacturer's power curve
* calculate capacity factor (with curtailment)
* calculate capacity factor (without curtailment)
* calculate theoretical maximum capacity factor given Weibull distribution and Power Curve
* predict next year's energy production
data<-read.csv(file="WIND_VXE_2013.csv", header=TRUE)
# assign new column names to be more succinct (avoid spaces; use _ or . to separate words)
new.names<-c("date_time","T1_Possible_Power","T2_Possible_Power","T3_Possible_Power","T4_Possible_Power","T5_Possible_Power","T6_Possible_Power","T7_Possible_Power","T1_Total_Active_Power","T2_Total_Active_Power","T3_Total_Active_Power","T4_Total_Active_Power","T5_Total_Active_Power","T6_Total_Active_Power","T7_Total_Active_Power","mean_wind_mps", "min_wind_mps", "max_wind_mps", "cum_energy_delivered_kwh")
# make sure the new names line up properly with the ones they're replacing...
cbind(names(data), new.names)
# if yes, replace column names with new names
names(data)<-new.names
For most of this analysis we only need the cumulative energy delivered and the windspeed; the rest can be computed from first principles. Create a new data.frame called dat that contains the data we want to work with (here we keep everything, but convenient subsets are defined below):
# subset data by type
Cumulative<-subset(data, select=c(1,19))
Possible<-subset(data, select=1:8)
Active<-subset(data, select=c(1,9:15))
Wind<-subset(data, select=c(1,16:18))
# select data to use
dat<-data # keep all the data
# similarly, you could select all the data besides Active Power
# dat<-merge(merge(Possible, Cumulative), Wind) # keep all but Active Power (merge joins two data frames at a time)
dim returns the dimensions of the data (rows x columns), str returns the structure of the data (column classes), and summary returns a statistical summary (quantiles) of each column of the data.frame named “dat”.
dim(dat)
str(dat)
summary(dat)
I like to write a custom function that easily conveys information regarding missing values. It wraps several useful checks into a single function:
check<-function(df){
# count NA (missing values)
NAs<-sum(is.na(df))
print(paste("Missing Values:", NAs))
# count incomplete records (rows containing missing values)
ok<-complete.cases(df)
print(paste("Incomplete Records:", sum(! ok)))
# Show incomplete records (if less than 100 NAs).
if(NAs > 0 & NAs <= 100) print( df[which(! complete.cases(df)), ] )
# If more than 100, show column-wise distribution of NAs.
if (NAs > 100) hist(which(is.na(df), arr.ind=TRUE)[,2], xlab="Column", freq=TRUE, breaks=1:dim(df)[2], main="Column-wise distribution of missing values")
}
Similarly, we can write another quick function to show how many records are removed in any subset operation:
removed<-function(nrow, nrow1){
print(paste("number of records REMOVED:", nrow-nrow1, sep=" "))
print(paste("number of records REMAINING:", nrow1, sep=" "))
}
Now let’s take our check function for a test drive!
check(dat) # NAs present. Column-wise distribution is relatively uniform (besides wind with relatively few)
## [1] "Missing Values: 10194"
## [1] "Incomplete Records: 1377"
Here we apply the na.omit function (built-in) and the removed function we just wrote (user-defined):
nrow<-dim(dat)[1] # record the dimensions of the data (before removing anything!)
dat<-na.omit(dat) # omit rows containing missing values
nrow1<-dim(dat)[1] # record the new dimensions of the data (after removing NAs)
removed(nrow, nrow1) # check how many records have been removed
## [1] "number of records REMOVED: 1377"
## [1] "number of records REMAINING: 51183"
# compute energy sentout in each timeblock (10min dt) using the cumulative energy counter
n<-length(dat$cum_energy_delivered_kwh)
a<-dat$cum_energy_delivered_kwh[1:(n-1)]
b<-dat$cum_energy_delivered_kwh[2:n]
diff<-b-a
dat$energy_sentout_10min_kwh<-c(diff,0) # cannot compute difference for last timestep, set to zero.
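# (Note) base R's diff() gives the same successive differences in one call:
# dat$energy_sentout_10min_kwh<-c(diff(dat$cum_energy_delivered_kwh), 0)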
# compute kinetic energy in the wind at each windspeed
# Wind Power = (1/2)*rho*area*(velocity)^3 = [kg/m^3]*[m^2]*[m/s]^3 = [kg*m^2/s^3] = [kg*m^2/s^2][1/s] = [Newton-meter]/[second] = [Joules/second] = [Watts]
rho=1.225 # density of air (kg/m^3)
area=2174 # swept rotor area per turbine (m^2)
turbines=7 # number of turbines
c<-(1/2)*rho*area
dat$wind_power_kw<-c*(dat$mean_wind_mps)^3*turbines/1000 # kW avg power
dat$wind_energy_10min_kwh<-c*(dat$mean_wind_mps)^3*turbines/(1000*6) # kWh in 10 min
# compute betz limit
betz.coef<- 16/27
dat$betz_limit_10min_kwh<-dat$wind_energy_10min_kwh*betz.coef
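# Sanity check (illustrative, not part of the original analysis): at 10 m/s one rotor
# intercepts (1/2)*1.225*2174*10^3 W ~ 1332 kW of kinetic power, of which at most
# 16/27 ~ 789 kW is extractable (Betz limit); the 850 kW rated output is only reached
# at higher windspeeds.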
# compute turbine efficiency
dat$turbine_eff<-dat$energy_sentout_10min_kwh/dat$wind_energy_10min_kwh
# compute total Possible Power
uncurtailed_power<-apply(X=dat[,2:8], MARGIN=1, FUN=sum)
dat$uncurtailed_10min_kwh<-(uncurtailed_power)/6
# compute curtailment
dat$curtailment_10min_kwh<-dat$uncurtailed_10min_kwh-dat$energy_sentout_10min_kwh
Check if we introduced NA, NaN or Outliers in our calculations:
# Check for NA
check(dat)
## [1] "Missing Values: 179"
## [1] "Incomplete Records: 179"
# NaNs arise when we compute turbine efficiency with zero in both the numerator and denominator (windspeed = 0 & energy_sentout = 0). Set these to zero.
nan<-which(is.nan(dat$turbine_eff))
# length(nan) # same as number of NAs
# head(dat[nan, ]) # look at the rows containing "NaN"
dat$turbine_eff[nan]<-0
# Likewise, Inf arises when we compute turbine efficiency with zero in the denominator only (windspeed = 0). Set these to zero.
inf<-which(dat$turbine_eff == Inf)
# length(inf) # not counted in the NA count, beware!
# head(dat[inf, ]) # look at the rows containing "Inf"
dat$turbine_eff[inf]<-0
# check again
check(dat) # NAs have been removed.
## [1] "Missing Values: 0"
## [1] "Incomplete Records: 0"
First, we need to assign a timestamp to each observation. The raw data contains character representations of date-time, but these are not sufficient for robust time-series analysis. Therefore, we must convert date_time into a POSIXlt or POSIXct date-time class. POSIX date-time objects represent calendar dates and times to the nearest second. At their core, POSIXlt objects are lists containing the date-time components. We will have to convert POSIX objects back to character representations for certain operations, such as summarizing the data with the ddply function.
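For example (a minimal illustration, separate from the main workflow; the timestamp shown is hypothetical), a POSIXlt object exposes its individual components, and format() converts it back to a character string:
x<-as.POSIXlt("6/1/13 14:30", format="%m/%d/%y %H:%M") # hypothetical timestamp in the raw-data format
unclass(x)$hour # POSIXlt is a list: components such as hour, min and mday are directly accessible
format(x, "%Y-%m-%d %H:%M") # back to a character representation, e.g. for grouping with ddply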
# To convert a character representation of date-time into a POSIX object, each entry must conform to a standard format.
# In the raw data, midnight values are missing the hour and minute. Fix these:
get<-which(is.na(as.POSIXlt(dat$date_time, format="%m/%d/%y %H:%M"))) # return row index of date_time that cannot be coerced to POSIX object
dat$date_time[get]<-paste(dat$date_time[get], "00:00", sep=" ") # amend those with the missing hour:minute info.
# check if modified date_time character string can be coerced to POSIX without generating NA values.
sum(is.na(as.POSIXlt(dat$date_time, format="%m/%d/%y %H:%M"))) # zero NAs
## [1] 0
# convert modified date_time character string to POSIX
dat$date_time<-as.POSIXlt(dat$date_time, format="%m/%d/%y %H:%M")
Success!
Using the POSIX object, we can easily group the data with respect to time (e.g. by month, day, hour). This will come in handy for temporal aggregation, and for plotting:
# load the packages used below for grouping, reshaping and plotting
library(plyr) # laply(), ddply(), numcolwise()
library(reshape2) # melt()
library(ggplot2)
dat$month <- cut(dat$date_time, breaks = "month")
week <- cut(dat$date_time, breaks = "week")
day <- cut(dat$date_time, breaks = "day")
hour <- cut(dat$date_time, breaks = "hour")
dummy<-strsplit(as.character(week), split=" ")
week<-laply(dummy, '[[', 1) # keep the date, drop the time
dat$week<-as.factor(week)
dummy<-strsplit(as.character(day), split=" ")
day<-laply(dummy, '[[', 1) # keep the date, drop the time
dat$day<-as.factor(day)
dummy<-strsplit(as.character(hour), split=" ")
hour<-laply(dummy, '[[', 2) # keep the time, drop the date
dat$hour<-as.factor(hour)
Save the augmented data.frame dat. This reflects the full dataset after the NA checks and benchmark calculations, but prior to applying any filters.
save(dat, file="Wind_VXE_2013_Augmented_No_Filters.rsav")
Now we can create time-series visualizations of the data:
# energy data
energy<-subset(dat, select=c("day", "energy_sentout_10min_kwh", "wind_energy_10min_kwh", "betz_limit_10min_kwh", "uncurtailed_10min_kwh", "curtailment_10min_kwh"))
energy<-ddply(energy, .(day), numcolwise(sum))
# plot energy vs time
test<-melt(energy, id.vars=("day"))
ggplot(test, aes(x=day, y=value/10^3, group=variable, colour=variable, linetype=variable)) +
geom_line() +
scale_y_continuous(name="MWh per day") +
labs(title="Energy Timeseries") +
theme_classic() +
scale_x_discrete(breaks=test$day[seq(1, 360, by=60)], labels=abbreviate)
# facet wrap by month
energy<-subset(dat, select=c("day","week", "month", "energy_sentout_10min_kwh", "wind_energy_10min_kwh", "betz_limit_10min_kwh", "uncurtailed_10min_kwh", "curtailment_10min_kwh"))
energy<-ddply(energy, .(day, week, month), numcolwise(sum))
# plot energy vs time
test<-melt(energy, id.vars=c("day", "week", "month"))
levels(test$month) <- month.name[1:12]
ggplot(test, aes(x=day, y=value/10^3, group=variable, colour=variable, linetype=variable)) +
geom_line() +
facet_wrap(~month, scales="free") +
scale_y_continuous(name="MWh per day") +
labs(title="Monthwise Energy Timeseries") +
theme_classic() +
scale_x_discrete(breaks=NULL)
Clearly we have some problem data…
We can apply varying levels of stringency in filtering the data based on energy sentout:
# check for outliers in the energy data
summary(dat$energy_sentout_10min_kwh) # quantiles
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10000000 235 417 24 585 47100
hist(dat$energy_sentout_10min_kwh) # histogram
# looks funny...
# number of records (rows) before filtering
nrow<-dim(dat)[1]
# FILTER 1: remove negative energy values
dat<-subset(dat, dat$energy_sentout_10min_kwh >= 0) #
# number of records removed
nrow2<-dim(dat)[1]
removed(nrow, nrow2)
## [1] "number of records REMOVED: 2"
## [1] "number of records REMAINING: 51181"
# Filter positive energy values based on physical knowledge of the system.
# Filters enumerated in order of increasing stringency:
# FILTER 2: remove energy values beyond what's possible given installed capacity
capacity<-850*7 # 850 kW rated capacity x 7 turbines
capacity_10min_kwh<-capacity*(10/60) # max energy sentout in 10 minutes
dat<-subset(dat, dat$energy_sentout_10min_kwh <= capacity_10min_kwh)
# # FILTER 3: remove statistical outliers: set quantiles to keep
# range<-quantile(dat$energy_sentout_10min_kwh, probs=c(0.01,0.99))
# dat<-subset(dat, dat$energy_sentout_10min_kwh > range[1])
# dat<-subset(dat, dat$energy_sentout_10min_kwh < range[2])
# # FILTER 4: remove energy values beyond what's possible given windspeed (e.g. kinetic energy contained in wind)
# dat<-subset(dat, dat$energy_sentout_10min_kwh <= dat$wind_energy_10min_kwh)
# # FILTER 5: remove energy values beyond what's possible given windspeed AND Betz limit (kinetic energy *16/27)
# dat<-subset(dat, dat$energy_sentout_10min_kwh <= dat$betz_limit_10min_kwh)
# number of records removed
nrow3<-dim(dat)[1]
removed(nrow2, nrow3)
## [1] "number of records REMOVED: 39"
## [1] "number of records REMAINING: 51142"
# check the quantiles and histogram again
summary(dat$energy_sentout_10min_kwh)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 235 417 411 585 964
hist(dat$energy_sentout_10min_kwh)
Now the summary statistics look more reasonable.
Let’s see what the data looks like now with respect to time.
# Total energy, per week
energy<-subset(dat, select=c("week", "energy_sentout_10min_kwh", "uncurtailed_10min_kwh", "curtailment_10min_kwh", "betz_limit_10min_kwh"))
energy<-ddply(energy, .(week), numcolwise(sum))
# plot energy vs time
test<-melt(energy, id.vars=("week"))
ggplot(test, aes(x=week, y=value/10^3, group=variable, colour=variable, linetype=variable)) +
geom_line() +
scale_y_continuous(name="MWh per week") +
labs(title="Energy Timeseries") +
theme_classic() +
scale_x_discrete(breaks=levels(test$week)[seq(1,52, by=8)], labels=abbreviate)
Now we are able to see temporal trends in the energy sentout (curtailed), the energy possible (uncurtailed), and the difference between them (curtailment). We juxtapose the Betz limit (theoretical maximum) as a benchmark comparison.
Now apply data filters with respect to wind speed:
# check for outliers in wind data
# quantiles
summary(dat$mean_wind_mps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.60 8.50 8.11 10.90 18.80
summary(dat$min_wind_mps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 3.70 6.00 5.62 7.80 15.80
summary(dat$max_wind_mps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 7.7 11.1 10.6 13.9 31.5
# histograms
hist(dat$mean_wind_mps, main="Histogram of Mean Windspeeds")
hist(dat$min_wind_mps, main="Histogram of Min Windspeeds")
hist(dat$max_wind_mps, main="Histogram of Max Windspeeds")
dat<-subset(dat, mean_wind_mps > 0.5)
The wind data look OK except for the first bin, which is far too large (an excess of near-zero readings). We therefore removed the records (rows) with mean windspeeds below 0.5 mps (the first bin).
Check wind measurements with respect to time:
wind<-subset(dat, select=c("week", "mean_wind_mps", "max_wind_mps", "min_wind_mps"))
wind<-ddply(wind, .(week), numcolwise(mean))
# plot wind vs time
test<-melt(wind, id.vars=("week"))
ggplot(test, aes(x=week, y=value, group=variable, colour=variable, linetype=variable)) +
geom_line() +
scale_y_continuous(name="Windspeed [m/s]") +
labs(title="Wind Timeseries") +
theme_classic() +
scale_x_discrete(breaks=levels(test$week)[seq(1, 52, by=8)], labels=abbreviate)
Now let’s check whether the windspeed and energy measurements make sense together (i.e. as pairwise observations).
Recall this is timeseries data, so at every timestamp there is a windspeed and an energy measurement.
energy<-subset(dat, select=c("mean_wind_mps", "energy_sentout_10min_kwh", "wind_energy_10min_kwh", "betz_limit_10min_kwh"))
energy<-melt(energy, id.vars=("mean_wind_mps"))
ggplot(energy, aes(x=mean_wind_mps, y=value, group=variable, colour=variable)) +
geom_point() +
scale_y_continuous(name="KWh in 10min", limit=c(0, max(dat$energy_sentout_10min_kwh))) +
scale_x_continuous(name="Windspeed (mps)") +
labs(title="Empirical Power Curve with Betz Limit and Theoretical Wind Energy") +
theme_classic()
Windspeed vs power output looks good if we filter the data using the Betz limit. Otherwise we see observations above the Betz limit, which we know is not possible. How could so much power have been produced at such low windspeeds? The only likely explanation is that the wind measurements are off. If the wind gauge (anemometer) were reading too low, it would artificially place points to the left of the true power curve, thus creating points above the Betz limit. Here we are able to apply knowledge of the physical system to improve data filtering.
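As a quick diagnostic (a sketch using the benchmark columns computed earlier), we can count how many observations still exceed the Betz limit:
suspect<-which(dat$energy_sentout_10min_kwh > dat$betz_limit_10min_kwh) # physically impossible records
length(suspect) # how many observations violate the Betz limit
# head(dat[suspect, c("date_time","mean_wind_mps","energy_sentout_10min_kwh","betz_limit_10min_kwh")])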
We can add more filters, as necessary.
We know that the turbines shut down at windspeeds below 3 mps and above 25 mps. We have two options for applying this information. Option 1: remove all observations with windspeed < 3 mps or > 25 mps.
* Pro: Turbines are off, so there is no information to be gained regarding the relationship between windspeed and power generation.
* Con: The Weibull distribution and capacity factor will be skewed upwards because observations at low windspeeds have been removed.
The following code is not run, but shown as an example of how to filter the data with respect to turbine setpoints.
dat<-subset(dat, dat$min_wind_mps > 3)
dat<-subset(dat, dat$max_wind_mps < 25)
energy<-subset(dat, select=c("mean_wind_mps", "energy_sentout_10min_kwh", "wind_energy_10min_kwh", "betz_limit_10min_kwh"))
energy<-melt(energy, id.vars=("mean_wind_mps"))
ggplot(energy, aes(x=mean_wind_mps, y=value, group=variable, colour=variable)) +
geom_point() +
scale_y_continuous(name="KWh in 10min", limit=c(0, max(dat$energy_sentout_10min_kwh))) +
scale_x_continuous(name="Windspeed (mps)") +
labs(title="Empirical Power Curve with Betz Limit and Theoretical Wind Energy") +
theme_classic()
Save the “clean” wind data (post-filter):
write.csv(dat, file="clean_VXE_wind_speed.csv")
save(dat, file="clean_VXE_wind_speed.rsav")
Alternatively, we could apply this filter instead: Option 2: When windspeed < 3 mps or > 25 mps, set power to zero (rather than removing the records altogether as in Option 1).
* Pro: Capacity factor will be more accurate because we have retained observations at low windspeeds.
* Con: Fitting a Weibull distribution is more difficult because a large number of observations are forced to zero, creating an artificially heavy lower tail.
filter<-which(dat$mean_wind_mps < 3 | dat$mean_wind_mps > 25)
dat$energy_sentout_10min_kwh[filter]<-0
(We will use this later when we compute capacity factor.)
# fit a distribution to the wind speed data
library(fitdistrplus)
descdist(dat$mean_wind_mps) # heuristic
## summary statistics
## ------
## min: 0.6 max: 18.8
## median: 8.8
## mean: 8.553
## estimated sd: 3.388
## estimated skewness: -0.1581
## estimated kurtosis: 2.492
# based on heuristic, and knowledge of the system, fit a weibull distribution
weibull.fit<-fitdist(dat$mean_wind_mps, distr="weibull")
summary(weibull.fit)
## Fitting of the distribution ' weibull ' by maximum likelihood
## Parameters :
## estimate Std. Error
## shape 2.754 0.01021
## scale 9.590 0.01655
## Loglikelihood: -128209 AIC: 256422 BIC: 256440
## Correlation matrix:
## shape scale
## shape 1.0000 0.2949
## scale 0.2949 1.0000
plot(weibull.fit, demp=TRUE)
# summation of energy sentout, divided by the number of hours, yields average power sentout.
avg.power<-sum(dat$energy_sentout_10min_kwh)/(dim(dat)[1]/6)
# summation of energy sentout, divided by the number of hours*capacity, yields capacity factor.
actual.cap.factor<-sum(dat$energy_sentout_10min_kwh)/(850*7*dim(dat)[1]/6)
uncurtailed.cap.factor<-sum(dat$uncurtailed_10min_kwh)/(850*7*dim(dat)[1]/6)
betz.cap.factor<-sum(dat$betz_limit_10min_kwh)/(850*7*dim(dat)[1]/6)
# display the results
data.frame(actual.cap.factor, uncurtailed.cap.factor, betz.cap.factor)
## actual.cap.factor uncurtailed.cap.factor betz.cap.factor
## 1 0.4168 0.5833 0.8485
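As an arithmetic cross-check (using avg.power computed above), the capacity factor is simply the average power sentout divided by the rated capacity:
avg.power/(850*7) # average power / rated capacity; identical to actual.cap.factor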
Above we demonstrated several ways to filter the data based on the physics of the system and known system constraints. This type of data filtering is excellent for removing random missing, erroneous or outlier data. However, what if we have reason to believe the wind measurements are systematically too low? How can we address a systematic bias in the data? One way is to apply a correction algorithm.
Assume the energy sentout measurements are correct, and that all the error is contained in the windspeed measurements. This may be a good assumption, since the energy data can be confirmed at other points in the electricity system (e.g. at the bus bar, by the utility or dispatch center, etc.). Assuming the energy measurements are correct, we can back out the minimum windspeed needed to produce that much power. We then re-assign the calculated minimum windspeed to observations with infeasible windspeed measurements. Conceptually, we are correcting inaccurate windspeed measurements by shifting them to the right (onto the power curve).
To demonstrate data correction as an alternative to removing suspect data, we start from the beginning with the raw data:
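(A condensed sketch of the restart; the full import, renaming, NA removal, benchmark calculations and date-time handling are identical to the code shown above.)
data<-read.csv(file="WIND_VXE_2013.csv", header=TRUE)
names(data)<-new.names
dat<-na.omit(data)
# ...then recompute the derived columns (energy_sentout_10min_kwh, wind_power_kw,
# wind_energy_10min_kwh, betz_limit_10min_kwh, turbine_eff, uncurtailed_10min_kwh,
# curtailment_10min_kwh) and the date-time grouping columns, exactly as shown above.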
Now we remove statistical outliers, but do not apply the physics-based filters (e.g. Betz limit, turbine setpoints, etc.).
nrow<-dim(dat)[1] # number of complete records BEFORE filtering
# select the numeric or integer class columns
c<-sapply(dat, class) # class of each column
colIndex<-which(c=="numeric" | c=="integer") # column index of numeric or integer vectors
# compute column-wise quantiles
# apply(dat[,-1], 2, stats::quantile)
range<-apply(dat[,colIndex], 2, function(x) {quantile(x, probs=c(.01, .99)) } )
# get names of the columns that we want to filter on (all but the first column)
# names<-colnames(dat[,colIndex])
names<-names(colIndex) # identical to above
# iterate through each column, subsetting the ENTIRE DATA.FRAME by removing records (rows) associated with an outlier value in that column
for(i in 1:dim(range)[2]){
n<-dim(dat)[1] # number of rows at start
dat<-subset(dat, eval(parse(text=names[i])) > range[1,i]) # keep records above 1st percentile
dat<-subset(dat, eval(parse(text=names[i])) < range[2,i]) # keep records less than 99 percentile
p<-dim(dat)[1] # number of rows at end
print(paste(n-p, "records dropped when filtering:", toupper(names[i]), sep=" "))
}
## [1] "4171 records dropped when filtering: T1_POSSIBLE_POWER"
## [1] "751 records dropped when filtering: T2_POSSIBLE_POWER"
## [1] "761 records dropped when filtering: T3_POSSIBLE_POWER"
## [1] "1586 records dropped when filtering: T4_POSSIBLE_POWER"
## [1] "731 records dropped when filtering: T5_POSSIBLE_POWER"
## [1] "229 records dropped when filtering: T6_POSSIBLE_POWER"
## [1] "502 records dropped when filtering: T7_POSSIBLE_POWER"
## [1] "829 records dropped when filtering: T1_TOTAL_ACTIVE_POWER"
## [1] "0 records dropped when filtering: T2_TOTAL_ACTIVE_POWER"
## [1] "0 records dropped when filtering: T3_TOTAL_ACTIVE_POWER"
## [1] "0 records dropped when filtering: T4_TOTAL_ACTIVE_POWER"
## [1] "0 records dropped when filtering: T5_TOTAL_ACTIVE_POWER"
## [1] "0 records dropped when filtering: T6_TOTAL_ACTIVE_POWER"
## [1] "0 records dropped when filtering: T7_TOTAL_ACTIVE_POWER"
## [1] "1959 records dropped when filtering: MEAN_WIND_MPS"
## [1] "1366 records dropped when filtering: MIN_WIND_MPS"
## [1] "50 records dropped when filtering: MAX_WIND_MPS"
## [1] "943 records dropped when filtering: CUM_ENERGY_DELIVERED_KWH"
## [1] "202 records dropped when filtering: ENERGY_SENTOUT_10MIN_KWH"
## [1] "0 records dropped when filtering: WIND_POWER_KW"
## [1] "0 records dropped when filtering: WIND_ENERGY_10MIN_KWH"
## [1] "0 records dropped when filtering: BETZ_LIMIT_10MIN_KWH"
## [1] "0 records dropped when filtering: TURBINE_EFF"
## [1] "0 records dropped when filtering: UNCURTAILED_10MIN_KWH"
## [1] "690 records dropped when filtering: CURTAILMENT_10MIN_KWH"
# number of records removed
nrow2<-dim(dat)[1]
removed(nrow, nrow2)
## [1] "number of records REMOVED: 14770"
## [1] "number of records REMAINING: 36413"
# fit a distribution to the wind speed data
library(fitdistrplus)
descdist(dat$mean_wind_mps) # heuristic
## summary statistics
## ------
## min: 1.4 max: 15.3
## median: 8.7
## mean: 8.576
## estimated sd: 2.53
## estimated skewness: -0.1009
## estimated kurtosis: 2.268
# based on heuristic, and knowledge of the system, fit a weibull distribution
weibull.fit.2<-fitdist(dat$mean_wind_mps, distr="weibull")
summary(weibull.fit.2)
## Fitting of the distribution ' weibull ' by maximum likelihood
## Parameters :
## estimate Std. Error
## shape 3.855 0.01608
## scale 9.498 0.01359
## Loglikelihood: -85057 AIC: 170118 BIC: 170135
## Correlation matrix:
## shape scale
## shape 1.0000 0.3129
## scale 0.3129 1.0000
plot(weibull.fit.2, demp=TRUE)
plot(dat$mean_wind_mps, dat$uncurtailed_10min_kwh, col="red", cex=0.2, pch=20, main="Curtailed and Uncurtailed Wind Energy Production\nSao Vicente, Cape Verde (2013)", xlab="Windspeed (mps)", ylab="Energy (kWh/10min)")
points(dat$mean_wind_mps, dat$energy_sentout_10min_kwh, col="green", cex=0.2, pch=20)
legend("topleft", legend=c("Energy Possible (Uncurtailed)", "Energy Sentout (Curtailed)"), col=c("red", "green"), pch=20)
Here is our algorithm in words: for each observation, take the energy sentout (kWh), find the closest matching value on the power curve (kW), and read off the corresponding windspeed from the power curve (mps). This becomes the corrected windspeed. Apply this whenever the measured windspeed is less than the windspeed dictated by the power curve. This shifts all points that were to the left of the power curve, to the right.
Here is the code:
# load the manufacturers power curve
MPC<-read.table(file="v52-850KW-power-curve.csv", header=TRUE, strip.white=TRUE, sep=",")
MPC$design_sentout_10min_kwh<-MPC$power_kW*7/6 # kW x (7 turbines) x (1/6 hour)
MPC<-subset(MPC, windspeed_mps<25)
# plot the manufacturers power curve
plot(x=MPC$windspeed_mps, y=MPC$design_sentout_10min_kwh, type="p", pch=4, main="Manufacturers Power Curve\n with Empirical Overlay", xlab="Windspeed (mps)", ylab="Energy (kWh/10min)", xlim=range(dat$mean_wind_mps))
points(y=dat$energy_sentout_10min_kwh, x=dat$mean_wind_mps, col="green", cex=0.1, pch=20)
points(y=dat$uncurtailed_10min_kwh, x=dat$mean_wind_mps, col="red", cex=0.1, pch=20)
legend("bottomright", col=c("black", "green", "red"), pch=c(4, 20, 20), legend=c("Manufacturers Power Curve", "Energy Sentout (Curtailed)","Energy Possible (Uncurtailed)") )
# For each observation, take the energy sentout (kWh), find the closest matching value on the power curve (kW), and read off the corresponding windspeed from the power curve (mps). This becomes the corrected windspeed. Apply this whenever the measured windspeed is less than the windspeed dictated by the power curve. This shifts all points that were to the left of the power curve, onto the power curve.
# matching vector must be in ascending order
MPC<-MPC[ order(MPC$power_kW, decreasing=FALSE), ]
# pass energy sentout (kWh); findInterval returns, for each value, the row index of the power-curve interval (power_kW) it falls in
power.match<-findInterval(x=dat$energy_sentout_10min_kwh, vec=MPC$power_kW, all.inside=TRUE, rightmost.closed=TRUE)
# read-off corresponding wind speed from the power curve (mps)
corrected.windspeed<-MPC[power.match, c("windspeed_mps")]
# # visually inspect matched values with values matched against --> Looks good!
# plot(x=MPC[power.match, c("windspeed_mps")], y=MPC[power.match, c("power_kW")], col="red", cex=3, xlab="windspeed_mps", ylab="power_kW")
# points(x=MPC$windspeed_mps, y=MPC$power_kW, col="black", cex=0.1)
# create a new column called "corrected_wind_mps"
dat$corrected_wind_mps<-corrected.windspeed
# if the observed mean windspeed is greater than the windspeed dictated by the power curve, that's OK --> keep the recorded mean windspeed.
keep<-which(dat$mean_wind_mps >= dat$corrected_wind_mps)
dat$corrected_wind_mps[keep]<-dat$mean_wind_mps[keep]
# We do this because it is possible that less power was produced at a given windspeed than dictated by the power curve, due to curtailment.
# however it is *not* possible to produce more power at a given windspeed than dictated by the power curve. Hence we correct those values, only.
fitw<-fitdist(dat$mean_wind_mps, "weibull")
summary(fitw)
## Fitting of the distribution ' weibull ' by maximum likelihood
## Parameters :
## estimate Std. Error
## shape 3.855 0.01608
## scale 9.498 0.01359
## Loglikelihood: -85057 AIC: 170118 BIC: 170135
## Correlation matrix:
## shape scale
## shape 1.0000 0.3129
## scale 0.3129 1.0000
denscomp(fitw, addlegend=FALSE, main="Histogram and Weibull Fit for Uncorrected Windspeeds")
fitw<-fitdist(dat$corrected_wind_mps, "weibull")
summary(fitw)
## Fitting of the distribution ' weibull ' by maximum likelihood
## Parameters :
## estimate Std. Error
## shape 3.604 0.01257
## scale 10.034 0.01541
## Loglikelihood: -86614 AIC: 173232 BIC: 173249
## Correlation matrix:
## shape scale
## shape 1.0000 0.3212
## scale 0.3212 1.0000
denscomp(fitw, addlegend=FALSE, main="Histogram and Weibull Fit for Corrected Windspeeds")
plot(x=MPC$windspeed_mps, y=MPC$design_sentout_10min_kwh, type="p", pch=4, main="Manufacturers Power Curve\n with Corrected Empirical Overlay", xlab="Windspeed (mps)", ylab="Energy (kWh/10min)", xlim=c(0,20))
points(y=dat$energy_sentout_10min_kwh, x=dat$corrected_wind_mps, col="green", cex=0.1, pch=20)
points(y=dat$uncurtailed_10min_kwh, x=dat$corrected_wind_mps, col="red", cex=0.1, pch=20)
legend("bottomright", col=c("black", "green", "red"), pch=c(4, 20, 20), legend=c("Manufacturers Power Curve", "Curtailed Power Curve (with windspeed correction)","Uncurtailed Power Curve (with windspeed correction)") )
Our correction algorithm works!
However, let’s reflect on this process for a moment. Instead of “correcting” the data, would we have been better off simply removing spurious values altogether? That is what we did originally, because we had LOTS of data (more than enough to fit a Weibull distribution and a power curve). The rationale goes that if we have 60,000+ observations, we can afford to lose some.
Removing suspect data may be preferable to “correcting” data, depending on the context, because it reduces the number of assumptions and manipulations required. However, beware that removing data can also introduce biases, depending on whether the suspect data are randomly distributed. For example, are we skewing the Weibull distribution upwards by removing a large number of observations at low windspeeds? And if we still have lots of “good” data at low windspeeds, are we really losing any information by removing the suspect data? There is no definitive answer, and you will have to use your best judgement. Try several combinations of filters and corrections to see what makes the most sense!
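To make this concrete, one quick sensitivity check (a sketch, assuming both fit objects from above are still in the workspace) is to compare the Weibull parameters estimated under the two filtering strategies:
# compare shape/scale estimates: physics-based filters vs. statistical outlier filters
rbind(physics_filters=weibull.fit$estimate, statistical_filters=weibull.fit.2$estimate)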
Find the maximum power output at each windspeed.
Since we have continuous data, we need to bin the data into discrete segments first:
* choose bin range based on data range.
* choose bin size based on the number of observations.
+ since we have lots of data, we can use small bins.
(The following code chunk is not run, but we provide it for future use.)
range<-range(dat$corrected_wind_mps)
bins<-seq(range[1], range[2], by=0.5)
# summarize the data by wind bin
# NOTE: the ddply function doesn't work with POSIXlt objects, because POSIXlt objects are really lists.
# use factor or character string representations of date_time when summarizing with ddply()
dat$date_time<-as.character(dat$date_time)
dat$wind.bin<-cut(x=dat$corrected_wind_mps, breaks=bins)
# # compute energy quantiles for each windspeed bin
# ddply(dat, .(wind.bin), function(x) quantile(x$energy_sentout_10min_kwh))
# choose the 95th percentile to define the power curve
power.curve<-ddply(dat, .(wind.bin), summarize, power.curve=quantile(energy_sentout_10min_kwh, probs=0.95))
# assign a power curve ordinate to each record based on the observed windspeed.
# this represents the **possible** energy delivered to the grid at a given windspeed, assuming no grid constraints.
dat<-merge(dat, power.curve, by="wind.bin") # DO NOT RUN TWICE!
# plot
plot(x=dat$corrected_wind_mps,
y=dat$power.curve,
xlab="mean wind speed (mps)",
ylab="KWh per 10min",
main="Empirically Estimated vs Manufacturer's Power Curve, Overlayed with Corrected Data",
col="blue",
cex=1,
pch=1)
# add layers: manufactures power curve and corrected observations.
# lines(y=MPC$design_sentout_10min_kwh, x=MPC$windspeed_mps, type="p", col="black", cex=0.5, pch=4)
points(y=dat$energy_sentout_10min_kwh, x=dat$corrected_wind_mps, col="blue", cex=0.1, pch=20)
legend("bottomright", legend=c("Estimated Power Curve (w. windspeed correction)", "Manufacturer's Power Curve", "Observations (w. windspeed correction"), col=c("blue","black","blue"), pch=c(1,4,20))
# summation of energy sentout, divided by the number of hours, yields average power sentout.
avg.power<-sum(dat$energy_sentout_10min_kwh)/(dim(dat)[1]/6)
# summation of energy sentout, divided by the number of hours*capacity, yields capacity factor.
curtailed.cap.factor<-sum(dat$energy_sentout_10min_kwh)/(850*7*dim(dat)[1]/6)
uncurtailed.cap.factor<-sum(dat$uncurtailed_10min_kwh)/(850*7*dim(dat)[1]/6)
betz.cap.factor<-sum(dat$betz_limit_10min_kwh)/(850*7*dim(dat)[1]/6)
data.frame(curtailed.cap.factor, uncurtailed.cap.factor, betz.cap.factor)
## curtailed.cap.factor uncurtailed.cap.factor betz.cap.factor
## 1 0.4132 0.5695 0.7368
Finally, we visually inspect the data to find the time of the year when curtailment is highest. We aggregate the data with respect to time and apply temporal smoothing to make the figures legible.
dat$date_time<-as.character(dat$date_time)
hourly<-ddply(dat, .(hour, day), numcolwise(mean))
### visualize curtailment data as time series objects
hourly <- ts(hourly$curtailment_10min_kwh, start = c(2013, 01), frequency = 365*24) #create time series object
daily<-aggregate(hourly, nfrequency=365, ts.eps=1, FUN=mean)
weekly<-aggregate(hourly, nfrequency=52, ts.eps=1, FUN=mean)
monthly<-aggregate(hourly, nfrequency=12, ts.eps=1, FUN=mean)
par(mfrow=c(4,1))
par(oma=c(0,0,2,0)) # set outer margins
par(mar=c(2,4,2,2) + 0.1) # set plot margins
plot(hourly, lwd=0.4, cex.lab=1.1, cex.axis=1.1)
plot(daily, cex.lab=1.1, cex.axis=1.1)
plot(weekly, cex.lab=1.1, cex.axis=1.1)
plot(monthly, cex.lab=1.1, cex.axis=1.1)
title(main="Energy Curtailment [kWh/10min] with Temporal Smoothing", outer=TRUE, cex.main=1.5)
par(mar=c(5,4,4,2) + 0.1) # reset default plot margins (bottom, left, top, right)
par(mfrow=c(1,1))
I believe we are seeing a sharp decrease in curtailment on the weekends. Notice the periodicity in the daily figure: the curtailment fluctuates between 200 and 300 kWh/10min (averaged throughout the day) for five days per week, and then plummets to less than 100 kWh/10min on the 6th and 7th day. Is this a result of increased demand on the weekends, or decreased scheduling of baseload? Either could explain a sharp decrease in curtailment (i.e. less “excess” power).
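One way to probe the weekday/weekend hypothesis (a sketch; the wday column is new here, and it assumes plyr is still loaded) is to average curtailment by day of week:
dat$wday<-weekdays(as.POSIXct(dat$date_time)) # date_time was converted back to character above
ddply(dat, .(wday), summarize, mean_curtailment_10min_kwh=mean(curtailment_10min_kwh))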
Now that we have characterized windspeeds for this region (Weibull distribution) and estimated the power curve (power output given a windspeed), we can sample from the Weibull distribution to generate new wind data and compute the expected power output. To make the power curve a continuous function (rather than discrete), we fit a polynomial function to it. Local polynomial estimation techniques (e.g. k-nearest neighbor and kernel density methods) offer advantages over global fits: fitting is performed within small neighborhoods, resulting in improved goodness of fit with a lower polynomial order. The concept is similar to moving averages and smoothing splines. Here we fit two local polynomial models, one to the observations and a second to the manufacturer's power curve. Choosing one or the other is a matter of preference and philosophy (do you trust the turbines to work precisely to the manufacturer's specifications, or do you trust your own measurements?)
library(locfit)
# Option 1: Fit a local-polynomial model to the observed data
empirical.mod<-locfit(energy_sentout_10min_kwh ~ mean_wind_mps, data=dat)
summary(empirical.mod)
## Estimation type: Local Regression
##
## Call:
## locfit(formula = energy_sentout_10min_kwh ~ mean_wind_mps, data = dat)
##
## Number of data points: 36413
## Independent variables: mean_wind_mps
## Evaluation structure: Rectangular Tree
## Number of evaluation points: 7
## Degree of fit: 2
## Fitted Degrees of Freedom: 5.29
plot(empirical.mod, main="Model Fit to Empirical Data")
points(y=dat$energy_sentout_10min_kwh, x=dat$mean_wind_mps, cex=0.1, pch=20, col="red")
# Option 2: Fit a local-polynomial model to the manufacturer power curve
manufacturer.mod<-locfit(design_sentout_10min_kwh ~ windspeed_mps, data=MPC)
summary(manufacturer.mod)
## Estimation type: Local Regression
##
## Call:
## locfit(formula = design_sentout_10min_kwh ~ windspeed_mps, data = MPC)
##
## Number of data points: 560
## Independent variables: windspeed_mps
## Evaluation structure: Rectangular Tree
## Number of evaluation points: 5
## Degree of fit: 2
## Fitted Degrees of Freedom: 4.822
plot(manufacturer.mod, lty=2, col="red", main="Model Fit to Manufacturers Power Curve", lwd=1)
lines(y=MPC$design_sentout_10min_kwh, x=MPC$windspeed_mps, col="black", lty=1, lwd=1)
legend("bottomright", legend=c("Manufacturer's Power Curve", "Local Polynomial Model Fit"), col=c("black", "red"), lty=c(1,2), lwd=c(1, 1))
Now we can predict energy generation given a probabilistic (Weibull) distribution of windspeeds and the fitted power curve model (local polynomial).
# Use a Weibull distribution to randomly generate one year of 10-minute average windspeeds, using the shape and scale estimated from the observed windspeeds
expected.wind=rweibull(8760*6, shape=weibull.fit$estimate[1], scale=weibull.fit$estimate[2])
# estimate generation given expected windspeeds
expected.gen.empirical<-predict(empirical.mod, newdata=expected.wind)
expected.gen.manufacturer<-predict(manufacturer.mod, newdata=expected.wind)
# calculate the expected (uncurtailed) capacity factor for next year:
sum(expected.gen.empirical)/(850*7*8760)
## [1] 0.3902
sum(expected.gen.manufacturer)/(850*7*8760)
## [1] 0.4949
# plot the new (expected) power curve
plot(expected.gen.empirical ~ expected.wind, col="blue", ylim=c(0, 850*7/6), xlim=c(0,20), pch=20, main="Comparing Two Prediction Techniques", ylab="Expected Gen. (kWh/10min)", xlab="Expected Wind (mps)")
points(y=expected.gen.manufacturer, x=expected.wind, col="red", pch=20)
legend("topleft", legend=c("Fit to Empirical Data", "Fit to Manufacturer's Power Curve"), col=c("blue", "red"), pch=c(20,20))