Abstract
The objectives of this project were to discover trends in power load on an electrical grid over time, and to determine whether rain had any noticeable effect on load. The region of focus was eastern Virginia. Power data was sourced from PJM in .csv format, and weather data was sourced from NOAA as a text file. Data preparation was conducted by loading all relevant data into data-frame objects; there were 10 years’ worth of observations (2010-2019). Power demand was observed to behave cyclically over time, with high load occurring in summer and winter, and low load occurring in spring and autumn. Daily cycles were also observed in each season: power usage tended to be higher during the day, with season-dependent peaks, and lows consistently occurred around 2:00 AM. Sinusoidal models were constructed to predict load over time, with and without rain as a variable. The time-only model had an R^2 of 0.473, and the time-and-rain model had an R^2 of 0.476; both models had p-values approaching 0. The model that considered rain was slightly better, but not by enough to determine whether rain has any impact on load.
Introduction
Energy isn’t used consistently. Demand changes depending on the time of day, time of year, and other factors. Predicting energy demand is of great interest to grid operators and others involved in electrical infrastructure. As such, there are models to help forecast power requirements, which take several different variables into account. It’s well known that time of day and season factor into energy demand, but it is less clear what impact precipitation has.
In this project, we will be making models to forecast load over a grid. The questions that will be addressed here are:
1. How does energy load on a grid behave over time?
2. Does precipitation affect load?
We will answer these questions by constructing two models: one that uses time to predict energy demand, and one that uses both time and precipitation to predict energy demand. If the latter model is significantly better than the first, we can conclude that precipitation is a useful predictor for energy demand.
Loading and Tidying the Data
Before we do anything, we need to load our data and the libraries we’ll be using.
Climate and power usage are likely to be strongly related to each other, as much of the power generated tends to go towards heating and cooling. Thus, to do the analysis properly, it is best to narrow the area in question down to one region. For this project, we will only be looking for relationships within the eastern two-thirds of Virginia, which has the advantage of being PJM’s only southern market region, allowing for easy separation from the rest of their data. Meteorological data is easy to access, and can be chosen from weather stations within that region.
Now, let’s load in our meteorological data from NOAA, which provides extensive climate records https://www.ncdc.noaa.gov/cdo-web/. We’ll focus on weather stations in eastern Virginia, which will align both data-sets geographically.
The data originates in a text file, so we’ll need to do some cleaning before we can analyze.
As a side note: this data is also available in csv format, but that wouldn’t be as much of a challenge as the text file!
The text file is spaced strangely, so we’ll handle it with some loops and an else-if.
# Importing the raw file, and initializing an empty data frame
path <- "https://raw.githubusercontent.com/davidblumenstiel/data/master/weatherNOAA/weather.txt"
x <- read.delim(path, stringsAsFactors = FALSE)
weather <- data.frame(matrix(ncol = 19))
# There are more elegant ways to do this, but nothing quite as entertaining
# Walk the file line by line, split each line on single spaces, and copy every
# non-empty token into the column matching its position in the split
i = 2
while (i < nrow(x)) {
  split = strsplit(x[i,], " ")[[1]]
  j = 1
  while (j < length(split)) {
    buff = split[j]
    if (buff == "") {
      # Empty token produced by consecutive spaces: skip it
      j = j + 1
    }
    else {
      weather[i - 1,j] <- buff
      j = j + 1
    }
  }
  i = i + 1
}
# Further straightening
i = 0
while (i < nrow(weather)) {
  i = i + 1
  if (is.na(weather[i,7]) == FALSE) {
    weather[i,8:21] <- weather[i,7:20]
  }
  else if (is.na(weather[i,8]) == TRUE) {
    weather[i,8:21] <- weather[i,9:22]
  }
}
# Now we need to separate out the dates from other information we don't need
weather$date <- stringr::str_extract(weather[,8], '\\d{8}')
# Getting rid of everything we won't need (keeping station name, date, precipitation and snow)
weather <- weather[,c(1,10,12,26)]
colnames(weather) <- c("Station", "Rain_mm", "Snow_mm", "Date")
# The text file represents missing data as -9999. This changes those to NA.
# One note: the data cleaning thus far has introduced some NAs into the data frame, but those occur where -9999s would have occurred
weather <- replace.value(weather, names = colnames(weather), from = "-9999", to = NA) # Handy function from 'anchors'
weather <- replace.value(weather, names = colnames(weather), from = " -9999", to = NA)
weather <- replace.value(weather, names = colnames(weather), from = " -9999", to = NA)
weather <- replace.value(weather, names = colnames(weather), from = " -9999", to = NA)
# Setting the data types
weather$Rain_mm <- as.numeric(weather$Rain_mm)
weather$Snow_mm <- as.numeric(weather$Snow_mm)
weather$Date <- as.POSIXct(weather$Date, format = "%Y%m%d")
Because we’re looking at the eastern part of the state as a whole, we should average the precipitation data across stations for each date. Luckily, there’s at most one recording per station per day, so we can simply group by day and average.
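The averaging code itself is not printed in this write-up. A minimal sketch of the idea, using base R’s aggregate() on a made-up stand-in for the weather data frame (the name weather_demo and all of its values are hypothetical), might look like this:

```r
# Toy stand-in for the cleaned weather data frame (values are made up)
weather_demo <- data.frame(
  Station = c("A", "B", "C", "A", "B", "C"),
  Date    = as.POSIXct(rep(c("2010-01-01", "2010-01-02"), each = 3), tz = "UTC"),
  Rain_mm = c(0.00, 0.04, NA, 0.10, 0.20, 0.30),
  Snow_mm = c(0, 0, 0, NA, 1, 2)
)

# Average across stations for each day; na.rm = TRUE is passed on to mean()
daily_weather <- aggregate(
  weather_demo[, c("Rain_mm", "Snow_mm")],
  by = list(Date = weather_demo$Date),
  FUN = mean,
  na.rm = TRUE
)
print(daily_weather)
```

With na.rm = TRUE, a station with a missing reading is simply left out of that day’s mean rather than turning the whole day into NA.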
## # A tibble: 6 x 3
## Date Rain_mm Snow_mm
## <dttm> <dbl> <dbl>
## 1 2010-01-01 00:00:00 0.0133 0
## 2 2010-01-02 00:00:00 0 0
## 3 2010-01-03 00:00:00 0 0
## 4 2010-01-04 00:00:00 0 0
## 5 2010-01-05 00:00:00 0 0
## 6 2010-01-06 00:00:00 0 0
## Date Rain_mm Snow_mm
## Min. :2010-01-01 00:00:00 Min. :0.000000 Min. : 0.00000
## 1st Qu.:2012-07-01 18:00:00 1st Qu.:0.000000 1st Qu.: 0.00000
## Median :2014-12-31 12:00:00 Median :0.003333 Median : 0.00000
## Mean :2014-12-31 11:20:53 Mean :0.125340 Mean : 0.04848
## 3rd Qu.:2017-07-01 06:00:00 3rd Qu.:0.110000 3rd Qu.: 0.00000
## Max. :2019-12-31 00:00:00 Max. :3.840000 Max. :10.00000
## NA's :9

Now that the weather-data preparation is done, what we’re left with is the mean precipitation (snow and rain separately) for each day between the beginning of 2010 and the end of 2019 for the eastern two-thirds of Virginia (the same area for which we have energy data).
As one might expect, rain and snow are not normally distributed; it’s usually not precipitating, and thus most days have little or no precipitation recorded.
Initial Investigation: Energy Usage
For starters, let’s see what power usage over the last 10 years looks like.

It seems to have followed the same pattern for the most part, and risen slightly overall. Power usage tends to start off high in the winter (heating), drop low in the spring, rise to its highest points in the summer (energy-hungry AC), and drop again in autumn. This phenomenon is pretty straightforward and widely known. If anything, it helps validate our data, and explains why the data is slightly right-skewed (two low-use seasons, one medium-use season, one high-use season).
There’s a lot going on in the above graph though; let’s see if we can break it down a bit further.
Below, we look at just a few days’ worth of power usage from selected months, to better understand daily trends. We’ll focus on the first few days of every third month of 2015 (January, April, July, and October), which should represent the seasons in a quick and general sense.
selectdates <- energy[as.Date(energy$datetime) >= "2015-01-01" & as.Date(energy$datetime) <= "2015-01-05",]
January <- ggplot(selectdates, aes(x = datetime, y = mw)) +
  geom_line(color = "blue", size = 1.0) +
  scale_x_datetime(date_breaks = "12 hours") +
  labs(title = "January", x = "Date Time", y = "MW")
selectdates <- energy[as.Date(energy$datetime) >= "2015-04-01" & as.Date(energy$datetime) <= "2015-04-05",]
April <- ggplot(selectdates, aes(x = datetime, y = mw)) +
  geom_line(color = "green", size = 1.0) +
  scale_x_datetime(date_breaks = "12 hours") +
  labs(title = "April", x = "Date Time", y = "MW")
selectdates <- energy[as.Date(energy$datetime) >= "2015-07-01" & as.Date(energy$datetime) <= "2015-07-05",]
July <- ggplot(selectdates, aes(x = datetime, y = mw)) +
  geom_line(color = "red", size = 1.0) +
  scale_x_datetime(date_breaks = "12 hours") +
  labs(title = "July", x = "Date Time", y = "MW")
selectdates <- energy[as.Date(energy$datetime) >= "2015-10-01" & as.Date(energy$datetime) <= "2015-10-05",]
October <- ggplot(selectdates, aes(x = datetime, y = mw)) +
  geom_line(color = "brown", size = 1.0) +
  scale_x_datetime(date_breaks = "12 hours") +
  labs(title = "October", x = "Date Time", y = "MW")
gridExtra::grid.arrange(January, April, July, October, nrow = 4)

Another interesting trend visible in the January and April plots (and, to a lesser extent, October) is energy usage peaking in both the morning and the evening. Perhaps this reflects people being at home, cooking and showering?
Looking at the Y axis scale alone, we can further validate the observations about seasonal energy usage made before: high in winter, higher in summer, lowest in spring and autumn.
Initial Investigation: Weather
Let’s see what precipitation looked like in Virginia over the past 10 years.

Rain seems to occur fairly consistently, while snow tends to occur in bursts (presumably in winter). That’s about what we would expect; rain in that climate tends to be fairly consistent through the year, and high temperatures are not conducive to snow.
Because snow tends to occur at the same time each year, it is collinear with time, and thus not suitable for our analysis.
Let’s zoom in on 2015 to see if there’s anything else going on.

One detail we can observe here is: if it rains, it tends to do so over multiple days.
Now that we have a clearer idea of how energy and weather behave over time, we’re ready to start modeling.
Modeling Preparation
Much less busy. What’s left now is only seasonal variation for the most part.
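The step that produces this less busy series is not shown. One plausible reading, and it is only an assumption, is that the hourly load was averaged to one value per day and then joined to the daily weather table. A sketch on made-up data (energy_demo, daily_rain, and df_demo are hypothetical names; datetime and mw follow the column names used in the plotting code earlier):

```r
# Toy hourly load standing in for the energy data frame (values are made up)
set.seed(1)
hours <- seq(as.POSIXct("2015-01-01 00:00", tz = "UTC"), by = "hour", length.out = 72)
energy_demo <- data.frame(
  datetime = hours,
  mw = 10000 + 1500 * sin(2 * pi * (0:71) / 24) + rnorm(72, sd = 50)
)

# Collapse the hourly readings to one mean value per calendar day
energy_demo$date <- as.Date(energy_demo$datetime, tz = "UTC")
daily_load <- aggregate(mw ~ date, data = energy_demo, FUN = mean)
names(daily_load)[2] <- "MW"

# Toy daily rain table, joined to the daily load on the shared date column
daily_rain <- data.frame(date = as.Date("2015-01-01") + 0:2,
                         Rain_mm = c(0, 0.11, 0.02))
df_demo <- merge(daily_load, daily_rain, by = "date")
print(df_demo)
```

Averaging over each full day cancels the within-day cycle, which is why only the slower seasonal movement would remain in a series built this way.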
Modeling
Using weather data alone to model energy is not likely to be terribly accurate. But we can add weather data to an existing model to see if it improves the model. If it does, then we can use weather to help predict energy usage.
Let’s start by making a model that only uses date to predict energy. We should be able to do so because, as seen before, time has a fairly large impact on energy usage.

##
## Call:
## lm(formula = MW ~ date + term1 + term2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3877.0 -845.6 -132.1 800.1 6764.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.598e+03 3.421e+02 22.21 <2e-16 ***
## date 2.160e-01 2.077e-02 10.39 <2e-16 ***
## term1 1.052e+03 3.097e+01 33.96 <2e-16 ***
## term2 1.398e+03 3.095e+01 45.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1322 on 3648 degrees of freedom
## Multiple R-squared: 0.473, Adjusted R-squared: 0.4726
## F-statistic: 1092 on 3 and 3648 DF, p-value: < 2.2e-16
As seen in the above plot, the regression takes the shape of a wave. The term1 and term2 entries in the regression summary are angular components (see the code). While the model is highly significant (the p-value is nearly 0), it only accounts for about 47.3% of the variability observed (R-squared).
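The code that builds term1 and term2 is not shown, so what follows is a sketch under assumptions rather than the model above. It assumes the two terms are a sine and a cosine of the day index with a half-year period (a guess based on load peaking twice a year). Because a*sin(wt) + b*cos(wt) equals A*sin(wt + phi) with A = sqrt(a^2 + b^2) and phi = atan2(b, a), a wave of unknown amplitude and phase can be fitted with an ordinary lm(). All data here are simulated, and sim is a hypothetical name.

```r
set.seed(42)
day   <- 0:1095                      # three years of daily observations
omega <- 2 * pi / (365.25 / 2)       # assumed half-year period

sim <- data.frame(
  date  = day,
  term1 = sin(omega * day),          # assumed form of term1
  term2 = cos(omega * day)           # assumed form of term2
)
# Simulated load: slow upward trend plus a wave plus noise
sim$MW <- 10000 + 0.2 * day + 1000 * sim$term1 + 1400 * sim$term2 +
  rnorm(nrow(sim), sd = 300)

fit <- lm(MW ~ date + term1 + term2, data = sim)

# Recover the amplitude and phase of the single fitted wave
a <- coef(fit)[["term1"]]
b <- coef(fit)[["term2"]]
amplitude <- sqrt(a^2 + b^2)
phase     <- atan2(b, a)
print(c(amplitude = amplitude, phase = phase))
```

Splitting the wave into a sine and a cosine term is what keeps the problem linear in its coefficients; fitting the phase directly would need a nonlinear fit.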
This is probably not accurate enough to get a sense of whether or not weather plays a role in energy load, but let’s see what happens anyway.
Below is the model which takes weather into account.
It is pretty much the same as the last model, but with an additional term for rain.
We’re going to leave snow out, because it is collinear with time.
##
## Call:
## lm(formula = MW ~ date + term1 + term2 + Rain_mm, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3879.3 -854.0 -127.6 801.0 6722.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7607.12940 341.23528 22.29 <2e-16 ***
## date 0.21802 0.02072 10.52 <2e-16 ***
## term1 1051.22078 30.88890 34.03 <2e-16 ***
## term2 1398.95796 30.87050 45.32 <2e-16 ***
## Rain_mm -345.89713 76.86241 -4.50 7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1319 on 3647 degrees of freedom
## Multiple R-squared: 0.4759, Adjusted R-squared: 0.4754
## F-statistic: 828 on 4 and 3647 DF, p-value: < 2.2e-16
This model performs only slightly better, with 47.6% of variability accounted for.
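One common way to compare two nested linear models like these is to look at the change in R-squared alongside an F-test from anova(). This is not something the analysis above reports; the sketch below only illustrates the mechanics on simulated data (toy, base_fit, and rain_fit are hypothetical names, and the half-year period is an assumption).

```r
set.seed(7)
n     <- 730
day   <- seq_len(n)
omega <- 2 * pi / (365.25 / 2)       # assumed half-year period

toy <- data.frame(
  date    = day,
  term1   = sin(omega * day),
  term2   = cos(omega * day),
  Rain_mm = rexp(n, rate = 8)        # made-up, mostly small rain values
)
toy$MW <- 10000 + 1000 * toy$term1 + 1400 * toy$term2 -
  350 * toy$Rain_mm + rnorm(n, sd = 1300)

base_fit <- lm(MW ~ date + term1 + term2, data = toy)
rain_fit <- lm(MW ~ date + term1 + term2 + Rain_mm, data = toy)

# F-test for the single added term, and the size of the R-squared change
comparison <- anova(base_fit, rain_fit)
delta_r2   <- summary(rain_fit)$r.squared - summary(base_fit)$r.squared
print(comparison)
print(delta_r2)
```

The two numbers answer different questions: the F-test asks whether the added term is distinguishable from noise, while the change in R-squared shows how much predictive value it adds in practice.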
Discussion/Conclusions
The difference in R-squared values between the two models (Date Only: 0.473, With Rain: 0.476) was 0.003; the model which took rain into account explained an additional 0.3 percentage points of the variability. Given the low accuracy of the model predicting energy load by date alone (47.3% of variability explained), I’m going to conclude that if rain does have an effect on grid load, this analysis did not detect it. The base model (no rain) was not accurate enough to observe small differences such as those that would be due to rain.
I was happy with time being as much of a predictor of energy load as it was. However, one thing to keep in mind is that this model only pertains to eastern Virginia. As date is pretty much a proxy for temperature, a change in climate likely necessitates a new model.
Out of curiosity, I also tried a model which included snow; it had a higher R-squared of 0.493. I suspect, though, that the difference in R-squared here is mostly due to the fact that snow is only going to occur in colder months, which is why I didn’t include it as a variable (it’s collinear with date). I also suspect that the difference between the rain and no-rain models may have had more to do with whatever slight association rain may have with season, even though rain in Virginia is supposed to be fairly evenly distributed.
Further improvements to the model could be made by better accounting for the differences between seasons. My models technically considered winter and summer to be equal, but a more complicated model that accounts for the lower energy use in winter than in summer would further increase accuracy. One could also overhaul the model in an attempt to have it predict energy load at any given hour, by implementing methods to account for daily variation in addition to seasonal variation. And although I used date as the primary predictor of energy load, I know temperature would be a better predictor (at least for seasonal differences), as much power usage goes towards heating and cooling.
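One possible way to let the summer peak differ from the winter peak, offered only as a sketch and not as something tried in this project, is to add a yearly sine/cosine pair on top of a half-yearly pair. The data are simulated so that summer is higher than winter, and every name and number below is hypothetical.

```r
set.seed(3)
day   <- 0:1460                      # four years, day 0 taken as 1 January
omega <- 2 * pi / 365.25             # yearly angular frequency

harm <- data.frame(
  sin_semi   = sin(2 * omega * day), # two cycles per year
  cos_semi   = cos(2 * omega * day),
  sin_annual = sin(omega * day),     # one cycle per year
  cos_annual = cos(omega * day)
)
# Two peaks per year (January and July), with the July peak pushed higher
harm$MW <- 10000 + 1400 * harm$cos_semi - 500 * harm$cos_annual +
  rnorm(nrow(harm), sd = 300)

fit_semi <- lm(MW ~ sin_semi + cos_semi, data = harm)
fit_both <- lm(MW ~ sin_semi + cos_semi + sin_annual + cos_annual, data = harm)

r2 <- c(semi_only = summary(fit_semi)$r.squared,
        both      = summary(fit_both)$r.squared)
print(r2)
```

A half-yearly wave alone repeats identically every six months, so it cannot tell winter from summer; the yearly pair is what carries that difference.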