Abstract

Introduction

Energy isn’t used consistently. Demand changes with the time of day, the time of year, and other factors. Predicting energy demand is of great interest to grid operators and others involved in electrical infrastructure, so there are models that forecast power requirements from several different variables. It’s well known that time of day and season factor into energy demand, but it is less clear what impact precipitation has.

In this project, we will be making models to forecast load over a grid. The questions that will be addressed here are:

1. How does energy load on a grid behave over time?
2. Does precipitation affect load?

We will answer these questions by constructing two models: one that uses time to predict energy demand, and one that uses both time and precipitation to predict energy demand. If the latter model is significantly better than the first, we can conclude that precipitation is a useful predictor for energy demand.

Loading and Tidying the Data

Before we do anything, we need to load our data and the libraries we’ll be using.

Energy data was gathered from PJM’s dataminer2 tool, under ‘Hourly Load: Metered’.

On a side note: PJM has a ton of great public data pertaining to energy within their region, everything from types of energy generation to energy pricing: https://dataminer2.pjm.com/

For power demand, we’ll be looking at net energy load by hour, for every hour between the beginning of 2010 and the end of 2019. This corresponds to average energy usage for any given hour, in megawatts (MW). The vector of date-times, originally stored as strings, was converted to POSIXct format.
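
As a minimal sketch of that conversion: the timestamps and the format string below are made up for illustration, and the raw PJM export may use a different layout.

```r
# Hypothetical raw timestamps; the format string is an assumption,
# not necessarily the layout of the PJM export
raw <- c("2010-01-01 00:00:00", "2010-01-01 01:00:00", "2010-01-01 02:00:00")

# Parse to POSIXct; an explicit time zone keeps the hourly spacing unambiguous
datetime <- as.POSIXct(raw, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

class(datetime)  # "POSIXct" "POSIXt"
```

Strings that don’t match the format come back as NA, so a quick `anyNA(datetime)` after parsing is a cheap sanity check.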

##      mw            datetime
## 1 10273 2010-01-01 00:00:00
## 2  9960 2010-01-01 01:00:00
## 3  9797 2010-01-01 02:00:00
## 4  9715 2010-01-01 03:00:00
## 5  9851 2010-01-01 04:00:00
## 6 10178 2010-01-01 05:00:00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4724    9459   10702   11147   12620   21651

The distribution looks slightly right-skewed (the mean sits above the median), which makes sense (we’ll get into why later).

Now, let’s load in our meteorological data from NOAA, which provides extensive climate records https://www.ncdc.noaa.gov/cdo-web/. We’ll focus on weather stations in eastern Virginia, which will align both data-sets geographically.

The data originates in a text file, so we’ll need to do some cleaning before we can analyze.

As a side note: this data is also available in csv format, but that wouldn’t be nearly as much of a challenge as the text version!
The text file is spaced strangely, so we’ll handle it with some loops and an else-if.
library(anchors)  # provides replace.value(), used further down

# Importing the raw file, and initializing an empty data frame
path <- "https://raw.githubusercontent.com/davidblumenstiel/data/master/weatherNOAA/weather.txt"

x <- read.delim(path, stringsAsFactors = FALSE)

weather <- data.frame(matrix(ncol = 19))



# There are more elegant ways to do this, but nothing quite as entertaining
i = 2
while (i < nrow(x)) {
  
  split = strsplit(x[i,], "    ")[[1]]
  
  j = 1
  while (j < length(split)) {
    
    buff = split[j]
    
    if (buff == "") {
      
      j = j + 1
      
    }
    
    else {
      
      weather[i - 1,j] <- buff
      
      j = j + 1
      
    }
    
  }
  
  i = i + 1
  
}

# Further straightening

i = 0
while (i < nrow(weather)) {
  i = i + 1
  
  if (is.na(weather[i,7]) == FALSE) {

    weather[i,8:21] <- weather[i,7:20]
    
  }
  
  else if (is.na(weather[i,8]) == TRUE) {
    
    weather[i,8:21] <- weather[i,9:22]
    
  }
  
}

# Now we need to separate out the dates from other information we don't need

weather$date <- stringr::str_extract(weather[,8], '\\d{8}')


# Getting rid of everything we won't need (keeping station name, date, precipitation and snow)
weather <- weather[,c(1,10,12,26)]

colnames(weather) <- c("Station", "Rain_mm", "Snow_mm", "Date")

# The text file represents missing data as -9999.  This changes those to NA.
# One note: the data cleaning thus far has introduced some NAs into the data frame, but those occur where -9999s would have occurred
weather <- replace.value(weather, names = colnames(weather), from = "-9999", to = NA) # Handy function from 'anchors'
weather <- replace.value(weather, names = colnames(weather), from = "   -9999", to = NA)
weather <- replace.value(weather, names = colnames(weather), from = "  -9999", to = NA)
weather <- replace.value(weather, names = colnames(weather), from = " -9999", to = NA)

# Setting the data types

weather$Rain_mm <- as.numeric(weather$Rain_mm)
weather$Snow_mm <- as.numeric(weather$Snow_mm)
weather$Date <- as.POSIXct(weather$Date, format = "%Y%m%d")
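
One of the “more elegant ways” hinted at above is to split each line on runs of whitespace instead of walking through it character by character. The sketch below is not the code used for this analysis. It runs on two made-up records and assumes no field contains internal spaces, which may not hold for the real station names.

```r
# Two made-up records (station, date, rain, snow) with irregular spacing
lines <- c("STATION_A    20100101     0.04    -9999",
           "STATION_B  20100101   -9999      0.0")

# trimws() drops leading/trailing blanks; "\\s+" splits on any run of whitespace
fields <- strsplit(trimws(lines), "\\s+")

# Stack the equal-length records into a data frame
parsed <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
colnames(parsed) <- c("Station", "Date", "Rain_mm", "Snow_mm")

# Convert the measurements to numeric, then recode the -9999 sentinel as NA
for (col in c("Rain_mm", "Snow_mm")) {
  parsed[[col]] <- as.numeric(parsed[[col]])
  parsed[[col]][parsed[[col]] %in% -9999] <- NA
}
```

Converting to numeric before recoding means the padded variants of the sentinel ("  -9999", " -9999", and so on) all collapse to the same value.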

Because we’re looking at the eastern part of the state as a whole, we should average the precipitation data across stations for each date. Luckily, there’s at most one recording per station per day, so we can simply group by day and average.
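
A minimal sketch of the group-and-average step, using base R’s `aggregate()` on made-up station readings. The tibble output below suggests the original used a different tool, so treat this as a stand-in rather than the code that produced it.

```r
# Hypothetical readings: up to one row per station per day
obs <- data.frame(
  Station = c("A", "B", "C", "A", "B"),
  Date    = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01",
                      "2010-01-02", "2010-01-02")),
  Rain_mm = c(0.00, 0.04, NA, 0.10, 0.30),
  Snow_mm = c(0, 0, 0, NA, 1)
)

# Mean across stations for each day; na.rm = TRUE skips missing readings
daily_weather <- aggregate(obs[c("Rain_mm", "Snow_mm")],
                           by = list(Date = obs$Date),
                           FUN = mean, na.rm = TRUE)
```

If every station is missing a value on a given day, `mean(..., na.rm = TRUE)` returns NaN for that day, which is worth checking for before modeling.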

## # A tibble: 6 x 3
##   Date                Rain_mm Snow_mm
##   <dttm>                <dbl>   <dbl>
## 1 2010-01-01 00:00:00  0.0133       0
## 2 2010-01-02 00:00:00  0            0
## 3 2010-01-03 00:00:00  0            0
## 4 2010-01-04 00:00:00  0            0
## 5 2010-01-05 00:00:00  0            0
## 6 2010-01-06 00:00:00  0            0
##       Date                        Rain_mm            Snow_mm        
##  Min.   :2010-01-01 00:00:00   Min.   :0.000000   Min.   : 0.00000  
##  1st Qu.:2012-07-01 18:00:00   1st Qu.:0.000000   1st Qu.: 0.00000  
##  Median :2014-12-31 12:00:00   Median :0.003333   Median : 0.00000  
##  Mean   :2014-12-31 11:20:53   Mean   :0.125340   Mean   : 0.04848  
##  3rd Qu.:2017-07-01 06:00:00   3rd Qu.:0.110000   3rd Qu.: 0.00000  
##  Max.   :2019-12-31 00:00:00   Max.   :3.840000   Max.   :10.00000  
##                                                   NA's   :9

Now that the weather-data preparation is done, what we’re left with is the mean precipitation (snow and rain separately) for each day between the beginning of 2010 and the end of 2019 for the eastern 2/3 of Virginia (the same area for which we have energy data).
As one might expect, rain and snow are not normally distributed; it’s usually not precipitating, so most days have no readings.

Initial Investigation: Energy Usage

For starters, let’s see what power usage over the last 10 years looks like.

It seems to have followed the same pattern for the most part, and risen slightly overall. Power usage tends to start off high in the winter (heating), drop in the spring, rise to its highest points in the summer (energy-hungry AC), and drop again in autumn. This phenomenon is pretty straightforward and widely known. If anything, it helps validate our data and explains why the data is slightly right-skewed (two low-use seasons, one medium-use season, one high-use season).
There’s a lot going on in the above graph, though; let’s see if we can break it down a bit further.

The first apparent trend is how energy usage dips around 1:00-2:00 in the morning. Nowhere is that more evident than in the July figure (the coldest time of day is early morning, which requires less cooling). The trend holds in winter too, even through the coldest part of the day, as people tend to be too busy sleeping to use as much energy at night (aside from heating).

I suspect heating has less of an effect on energy load than cooling, due to the prevalence of non-electric heating systems, whereas cooling is almost universally electric. This could explain the difference in the number of peaks between January and July: miscellaneous daily indoor usage (running dishwashers, heating water, etc.) could be the main driving factor in winter, while air conditioning is the main factor in summer.

Another interesting trend in the January and April plots (and, less so, October) is that energy usage peaks in the morning and again in the evening. Maybe that’s people cooking, showering, or simply being at home?

Looking at the Y axis scale alone, we can further validate the observations about seasonal energy usage made before: high in winter, higher in summer, lowest in spring and autumn.

Initial Investigation: Weather

Rain seems to occur fairly consistently, while snow tends to occur in bursts (presumably in winter). That’s about what we would expect; rain in that climate tends to be fairly consistent through the year, and high temperatures are not conducive to snow.

Because snow tends to occur at the same time each year, it is collinear with time, and thus not suitable for our analysis.

One detail we can observe here is: if it rains, it tends to do so over multiple days.

Now that we have a clearer idea of how energy and weather behave over time, we’re ready to start modeling.

Modeling Preparation

Let’s come up with a model that fits energy load based on date and time. First, we’ll do some transformations so we can do a linear regression. For simplicity, and to match up with the weather data, we’ll start by averaging energy data for each day.

Much less busy. What’s left now is only seasonal variation for the most part.

One last thing: let’s join both data frames (weather and energy), and get rid of the intermediary variables.
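
The two preparation steps, daily averaging and joining, can be sketched as follows. The numbers are made up and the object names (`hourly`, `daily_load`, `weather_daily`) are illustrative, not taken from the original code; only `MW`, `Rain_mm`, and `df` mirror names that appear in the model output.

```r
# Hypothetical hourly load: two days, 24 readings each
hourly <- data.frame(
  datetime = seq(as.POSIXct("2010-01-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 48),
  mw = c(rep(10000, 24), rep(12000, 24))
)

# Collapse to one mean value per calendar day
hourly$Date <- as.Date(hourly$datetime, tz = "UTC")
daily_load <- aggregate(mw ~ Date, data = hourly, FUN = mean)
names(daily_load)[2] <- "MW"

# Hypothetical daily weather, keyed on the same Date column
weather_daily <- data.frame(
  Date    = as.Date(c("2010-01-01", "2010-01-02")),
  Rain_mm = c(0.0133, 0)
)

# Inner join: keeps only days present in both data frames
df <- merge(daily_load, weather_daily, by = "Date")
```

Comparing `nrow(df)` against the number of days in each input is a quick way to spot dates that failed to match.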

Modeling

Using weather data alone to model energy is not likely to be terribly accurate. But we can add weather data to an existing model and see if it improves the model. If it does, then we can use weather to help predict energy usage.

Let’s start by making a model that only uses date to predict energy. We should be able to do so because, as seen before, time has a fairly large impact on energy usage.

## 
## Call:
## lm(formula = MW ~ date + term1 + term2, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3877.0  -845.6  -132.1   800.1  6764.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.598e+03  3.421e+02   22.21   <2e-16 ***
## date        2.160e-01  2.077e-02   10.39   <2e-16 ***
## term1       1.052e+03  3.097e+01   33.96   <2e-16 ***
## term2       1.398e+03  3.095e+01   45.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1322 on 3648 degrees of freedom
## Multiple R-squared:  0.473,  Adjusted R-squared:  0.4726 
## F-statistic:  1092 on 3 and 3648 DF,  p-value: < 2.2e-16

As seen in the above plot, the regression takes the shape of a wave. Term1 and term2 in the regression summary are angular (sine and cosine) components (see the code). While the model is highly significant (the p-value is nearly 0), it only accounts for about 47.3% of the variability observed (R-squared).
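
For readers who want to see how a wave can come out of a linear regression, here is a minimal sketch on simulated data. The twice-a-year period, the coefficients, and the noise level are all assumptions chosen for illustration; the exact definitions of term1 and term2 in the original code are not shown here.

```r
set.seed(1)

# Three years of simulated daily data
date    <- seq(as.Date("2010-01-01"), as.Date("2012-12-31"), by = "day")
day_num <- as.numeric(date)  # days since 1970-01-01

# Assumed period: half a year, giving one winter and one summer peak
angle <- 2 * pi * day_num / (365.25 / 2)

sim <- data.frame(
  date  = day_num,
  term1 = sin(angle),
  term2 = cos(angle)
)

# Made-up trend, wave amplitudes, and noise
sim$MW <- 8000 + 0.2 * sim$date + 1000 * sim$term1 + 1400 * sim$term2 +
  rnorm(nrow(sim), sd = 500)

# Linear in the coefficients, even though the fitted curve is a wave
fit <- lm(MW ~ date + term1 + term2, data = sim)

# The sine and cosine coefficients together encode amplitude and phase
amplitude <- unname(sqrt(coef(fit)["term1"]^2 + coef(fit)["term2"]^2))
```

Using a sine and a cosine together lets the regression choose where the peaks fall, rather than fixing them in advance.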

This is probably not accurate enough to get a sense of whether or not weather plays a role in energy load, but let’s see what happens anyway.

Below is the model which takes weather into account.

It’s pretty much the same thing as the last model, but with an additional term for rain.
We’re going to leave snow out, because it is collinear with time.
## 
## Call:
## lm(formula = MW ~ date + term1 + term2 + Rain_mm, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3879.3  -854.0  -127.6   801.0  6722.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7607.12940  341.23528   22.29   <2e-16 ***
## date           0.21802    0.02072   10.52   <2e-16 ***
## term1       1051.22078   30.88890   34.03   <2e-16 ***
## term2       1398.95796   30.87050   45.32   <2e-16 ***
## Rain_mm     -345.89713   76.86241   -4.50    7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1319 on 3647 degrees of freedom
## Multiple R-squared:  0.4759, Adjusted R-squared:  0.4754 
## F-statistic:   828 on 4 and 3647 DF,  p-value: < 2.2e-16

The model with rain performs only slightly better, with 47.6% of the variability accounted for.

Discussion/Conclusions

The difference in R-squared between the two models (date only: 0.473, with rain: 0.476) was 0.003; the model that took rain into account explained only an additional 0.3 percentage points of the variability. Given how little of the variability the date-only model explains (47.3%), I’m going to conclude that if rain does have an effect on grid load, this analysis did not uncover it. The base model (no rain) was not accurate enough to pick up differences as small as those rain might cause.
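
One way to formalize “significantly better” for two nested models is an F-test via `anova()`. The sketch below runs on simulated data with made-up coefficients, so its numbers say nothing about the real grid. It only shows the mechanics. For a single added term, the F statistic equals the square of that term’s t value, so the test asks whether the improvement is distinguishable from noise, not whether it is large.

```r
set.seed(7)
n     <- 1000
day   <- seq_len(n)
angle <- 4 * pi * day / 365.25   # assumed twice-a-year wave
rain  <- rexp(n, rate = 8)       # mostly near zero, occasionally larger

# Simulated load with a small, made-up rain effect and plenty of noise
sim <- data.frame(
  term1   = sin(angle),
  term2   = cos(angle),
  Rain_mm = rain
)
sim$MW <- 11000 + 1000 * sim$term1 + 1400 * sim$term2 - 350 * sim$Rain_mm +
  rnorm(n, sd = 1300)

base_fit <- lm(MW ~ term1 + term2, data = sim)
rain_fit <- lm(MW ~ term1 + term2 + Rain_mm, data = sim)

# F-test of the nested models: does adding Rain_mm reduce residual error
# by more than chance alone would?
cmp     <- anova(base_fit, rain_fit)
p_value <- cmp[["Pr(>F)"]][2]

# Change in R-squared, the effect-size side of the comparison
delta_r2 <- summary(rain_fit)$r.squared - summary(base_fit)$r.squared
```

Reporting both `p_value` and `delta_r2` keeps statistical significance and practical improvement separate.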

I was happy with time being as much of a predictor of energy load as it was. However, one thing to keep in mind is that this model only pertains to eastern Virginia. As date is pretty much a proxy for temperature, a change in climate likely necessitates a new model.

Out of curiosity, I also tried a model which included snow; it had a higher R-squared of 0.493. I suspect, though, that the difference in R-squared here is mostly due to the fact that snow only occurs in colder months, which is why I didn’t include it as a variable (it’s collinear with date). I also suspect that the difference between the rain and no-rain models may have had more to do with whatever slight association rain has with season, even though rain in Virginia is supposed to be fairly evenly distributed.

Further improvements to the model could be made by accounting for the differences between seasons better. My models technically considered winter and summer to be equal, but a more complicated model that accounts for the lower energy use in winter than in summer would further increase accuracy. One could also overhaul the model in an attempt to predict energy load at any given hour, by accounting for daily variation in addition to seasonal variation. And although I used date as the primary predictor of energy load, I know temperature would be a better predictor (at least for seasonal differences), as much power usage goes towards heating and cooling.
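
As a rough illustration of the first suggestion, adding a once-a-year harmonic on top of the twice-a-year wave lets the summer peak differ from the winter peak. Everything below is simulated with made-up amplitudes; it is a sketch of the idea, not a result for the PJM data.

```r
set.seed(42)
day      <- 0:(3 * 365 - 1)          # day index, starting January 1
semi     <- 4 * pi * day / 365.25    # twice-a-year cycle
annual   <- 2 * pi * day / 365.25    # once-a-year cycle

sim <- data.frame(
  s1 = sin(semi),   c1 = cos(semi),
  s2 = sin(annual), c2 = cos(annual)
)

# Simulated load: peaks in winter and summer, with summer made higher
# by the (negative) annual cosine term
sim$MW <- 11000 + 1400 * sim$c1 - 800 * sim$c2 +
  rnorm(nrow(sim), sd = 500)

# Symmetric model: winter and summer peaks forced to be equal
symmetric  <- lm(MW ~ s1 + c1, data = sim)

# Asymmetric model: the annual harmonic lets the two peaks differ
asymmetric <- lm(MW ~ s1 + c1 + s2 + c2, data = sim)

r2_sym  <- summary(symmetric)$r.squared
r2_asym <- summary(asymmetric)$r.squared
```

On the simulated data the second model explains noticeably more variability, because the data were built with unequal peaks; whether the same holds for real load would need to be checked.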