source('util.R')
library('dplyr')
elec <- readRDS('eia-elec-by-pca-2006-2017.rds')
elec$day <- doy(elec$year, elec$month, elec$day) # overwrite day with day of year
elec$season <- season(elec$month)
elec <- select(elec, -month) # month no longer useful
elec <- rename(elec, load=`Load (MW)`)
temperature <- readRDS('temp-by-hour-pca.rds') %>% select(-pca) # pca column in this dataset is bad.
pca_mapping <- read.csv('control_area_to_eia_id.csv', stringsAsFactors = FALSE) %>%
rename(pca=Abbreviated.PCA.Name) %>%
select(c('NAME', 'pca'))
temperature <- left_join(temperature, pca_mapping, by='NAME')
## Check completeness
spmap <- read.csv('spatially-mapped.csv', stringsAsFactors = FALSE) %>% rename(pca=Abbreviated.PCA.Name)
dropmask <- is.na(spmap$X2016) | spmap$pca == 'YAD' | spmap$pca == 'EEI'
spmap <- spmap[!dropmask,]
pcas_elec <- unique(elec$pca)
pcas_temp <- unique(temperature$pca)
spmap$pca[!(spmap$pca %in% pcas_temp)]
character(0)
spmap$pca[!(spmap$pca %in% pcas_elec)]
character(0)
## Drop the pcas for which we don't have valid mapping
elecmap <- semi_join(elec, spmap, by='pca')
## Check that they all have temperature data
anti_join(elecmap, temperature, by=c('year','day','hour','pca'))
Join the electricity and temperature data, and standardize the temperature and load.
electemp <- left_join(elecmap, select(temperature, -ID, -NAME), by=c('year', 'day','hour','pca')) %>%
mutate(pca=as.factor(pca))
standardize <- function(x) {
mu <- mean(x)
x <- x-mu
x / sd(x)
}
We are almost ready to do some modeling. Year 2017 is incomplete, so drop that year. The WACM PCA is missing for 2007 and 2008, so let’s start by dropping those years and trying to model a relationship between temperature and load. We’ll group the data by PCA and do a separate linear model in each.
modata2 <-
group_by(electemp, pca) %>%
mutate(load=standardize(load), temp=standardize(temp)) %>%
filter(!(year %in% c(2017, 2007, 2008)))
mods2 <- do(modata2, mod=lm(load~temp, data=.))
|=================================================================== | 47% ~2 s remaining
|====================================================================== | 49% ~2 s remaining
|========================================================================= | 51% ~2 s remaining
|============================================================================ | 53% ~2 s remaining
|============================================================================== | 55% ~2 s remaining
|================================================================================= | 57% ~2 s remaining
|==================================================================================== | 58% ~2 s remaining
|====================================================================================== | 60% ~2 s remaining
|========================================================================================= | 62% ~2 s remaining
|============================================================================================ | 64% ~2 s remaining
|=============================================================================================== | 66% ~1 s remaining
|================================================================================================= | 68% ~1 s remaining
|==================================================================================================== | 70% ~1 s remaining
|======================================================================================================= | 72% ~1 s remaining
|========================================================================================================= | 74% ~1 s remaining
|============================================================================================================ | 75% ~1 s remaining
|=============================================================================================================== | 77% ~1 s remaining
|==================================================================================================================== | 81% ~1 s remaining
|======================================================================================================================= | 83% ~1 s remaining
|========================================================================================================================== | 85% ~1 s remaining
|============================================================================================================================ | 87% ~1 s remaining
|=============================================================================================================================== | 89% ~0 s remaining
|================================================================================================================================== | 91% ~0 s remaining
|===================================================================================================================================== | 92% ~0 s remaining
|======================================================================================================================================= | 94% ~0 s remaining
|========================================================================================================================================== | 96% ~0 s remaining
|============================================================================================================================================= | 98% ~0 s remaining
|================================================================================================================================================|100% ~0 s remaining
summarise(mods2, pca=pca, rsq=summary(mod)$r.squared)
The \(R^2\) values here are not very impressive, so we can try adding a term for the season.
## Include seasonal effect. Use same data as mod2
modata3 <- modata2
mods3 <- do(modata3, mod=lm(load~season*temp, data=.))
|================================================ | 34% ~4 s remaining
|=================================================== | 36% ~4 s remaining
|====================================================== | 38% ~4 s remaining
|========================================================= | 40% ~4 s remaining
|=========================================================== | 42% ~3 s remaining
|============================================================== | 43% ~3 s remaining
|================================================================= | 45% ~3 s remaining
|=================================================================== | 47% ~3 s remaining
|====================================================================== | 49% ~3 s remaining
|========================================================================= | 51% ~3 s remaining
|============================================================================ | 53% ~3 s remaining
|============================================================================== | 55% ~3 s remaining
|================================================================================= | 57% ~3 s remaining
|==================================================================================== | 58% ~2 s remaining
|====================================================================================== | 60% ~2 s remaining
|========================================================================================= | 62% ~2 s remaining
|============================================================================================ | 64% ~2 s remaining
|=============================================================================================== | 66% ~2 s remaining
|================================================================================================= | 68% ~2 s remaining
|==================================================================================================== | 70% ~2 s remaining
|======================================================================================================= | 72% ~2 s remaining
|========================================================================================================= | 74% ~2 s remaining
|============================================================================================================ | 75% ~1 s remaining
|=============================================================================================================== | 77% ~1 s remaining
|==================================================================================================================== | 81% ~1 s remaining
|======================================================================================================================= | 83% ~1 s remaining
|========================================================================================================================== | 85% ~1 s remaining
|============================================================================================================================ | 87% ~1 s remaining
|=============================================================================================================================== | 89% ~1 s remaining
|================================================================================================================================== | 91% ~1 s remaining
|===================================================================================================================================== | 92% ~0 s remaining
|======================================================================================================================================= | 94% ~0 s remaining
|========================================================================================================================================== | 96% ~0 s remaining
|============================================================================================================================================= | 98% ~0 s remaining
|================================================================================================================================================|100% ~0 s remaining
r2 <- summarise(mods3, pca=pca, rsq=summary(mod)$r.squared)
r2
summary(r2)
pca rsq
ACEA : 1 Min. :0.1073
AEC : 1 1st Qu.:0.1891
AECI : 1 Median :0.2178
AVA : 1 Mean :0.2596
AZPS : 1 3rd Qu.:0.3149
BPAT : 1 Max. :0.4847
(Other):47
Better, though there is still a lot of variance left unexplained. These linear models only capture the linear association between the variables. We can use a nonparametric measure of association based on entropy to estimate how much of the information the temperature gives us about the load. We will calculate the uncertainty coefficient as (NR S. 14-4) \[ U(y|x) = \frac{H(y) - H(y|x)}{H(y)}, \] where \(H(y) = -\sum_{i=1} p_i\ln(p_i)\) is the entropy of the load distribution and \(H(y|x) = -\sum_{i,j} p_{ij}\ln(\frac{p_{ij}}{p_i})\). To do that we need to discretize the data, and while we’re at it, we should run a \(\chi^2\) test to make sure we haven’t sliced the data too finely.
modata4 <- modata2
ct <- function(d) {
t <- cut(d$temp, breaks=5000)
l <- cut(d$load, breaks=5000)
suppressWarnings(chisq.test(t, l))
}
chis4 <- do(modata4, chisq=ct(.))
|===== | 4% ~2 m remaining
|======== | 6% ~2 m remaining
|========== | 8% ~1 m remaining
|============= | 9% ~1 m remaining
|================ | 11% ~1 m remaining
|=================== | 13% ~1 m remaining
|===================== | 15% ~1 m remaining
|======================== | 17% ~1 m remaining
|=========================== | 19% ~1 m remaining
|============================= | 21% ~1 m remaining
|================================ | 23% ~1 m remaining
|=================================== | 25% ~1 m remaining
|====================================== | 26% ~1 m remaining
|======================================== | 28% ~1 m remaining
|=========================================== | 30% ~1 m remaining
|============================================== | 32% ~1 m remaining
|================================================ | 34% ~1 m remaining
|=================================================== | 36% ~1 m remaining
|====================================================== | 38% ~1 m remaining
|========================================================= | 40% ~1 m remaining
|=========================================================== | 42% ~60 s remaining
|============================================================== | 43% ~58 s remaining
|================================================================= | 45% ~58 s remaining
|=================================================================== | 47% ~57 s remaining
|====================================================================== | 49% ~57 s remaining
|========================================================================= | 51% ~54 s remaining
|============================================================================ | 53% ~52 s remaining
|============================================================================== | 55% ~49 s remaining
|================================================================================= | 57% ~46 s remaining
|==================================================================================== | 58% ~44 s remaining
|====================================================================================== | 60% ~42 s remaining
|========================================================================================= | 62% ~41 s remaining
|============================================================================================ | 64% ~38 s remaining
|=============================================================================================== | 66% ~37 s remaining
|================================================================================================= | 68% ~35 s remaining
|==================================================================================================== | 70% ~33 s remaining
|======================================================================================================= | 72% ~31 s remaining
|========================================================================================================= | 74% ~29 s remaining
|============================================================================================================ | 75% ~27 s remaining
|=============================================================================================================== | 77% ~24 s remaining
|================================================================================================================== | 79% ~23 s remaining
|==================================================================================================================== | 81% ~21 s remaining
|======================================================================================================================= | 83% ~19 s remaining
|========================================================================================================================== | 85% ~17 s remaining
|============================================================================================================================ | 87% ~15 s remaining
|=============================================================================================================================== | 89% ~13 s remaining
|================================================================================================================================== | 91% ~11 s remaining
|===================================================================================================================================== | 92% ~8 s remaining
|======================================================================================================================================= | 94% ~6 s remaining
|========================================================================================================================================== | 96% ~4 s remaining
|============================================================================================================================================= | 98% ~2 s remaining
|================================================================================================================================================|100% ~0 s remaining
summarise(chis4, pca=pca, pval=chisq$p.value)
The p-values are comically small, but what we really want to know is, how strong is the association. For this we turn to the uncertainty coefficient
uc0 <- summarise(modata4, ucoef_noseason=ucoef(cut(temp,5000), cut(load, 5000)))
uc0
summary(uc0)
pca ucoef_noseason
ACEA : 1 Min. :0.2771
AEC : 1 1st Qu.:0.4862
AECI : 1 Median :0.5424
AVA : 1 Mean :0.5162
AZPS : 1 3rd Qu.:0.5613
BPAT : 1 Max. :0.6513
(Other):47
uc1 <- summarise(modata4, ucoef_season=ucoef(interaction(cut(temp,5000),season), cut(load,5000)))
uc1
summary(uc1)
pca ucoef_season
ACEA : 1 Min. :0.5313
AEC : 1 1st Qu.:0.6880
AECI : 1 Median :0.7242
AVA : 1 Mean :0.7056
AZPS : 1 3rd Qu.:0.7379
BPAT : 1 Max. :0.8219
(Other):47
It seems that the combination of temperature and season tells us quite a lot about the load. We would have to do some simulations to figure out how these numbers compare to the \(R^2\) values from the linear fits, but my best guess right now is that the simple linear model isn’t capturing all of the mutual information in the temperature and load distributions. That is, there is some substantial nonlinear dependence between load and temperature+season.