Last published on: November 13, 2021
At Consumers Energy, a utility provider in the state of Michigan, the meters the company uses to measure consumption (kWh) produce a reading periodically versus continuously. Some of the meters produce data every 60 minutes.
A reading every 60 minutes makes for quite a coarse understanding of consumption behavior. At Consumers Energy there is a growing list of business needs for a more granular understanding of the customer's consumption (kWh) behavior.
One such use case relates to load profiles. A load profile is a descriptive measure or summary of a customer’s consumption (kWh). The company uses this information to better understand customers, which in turn helps customers. For example, a better understanding of consumption behavior and patterns helps ensure marketing material is relevant and targeted.
This research will focus on a branch of statistics called Functional Data Analysis (FDA).
From Wikipedia:
Functional data analysis (FDA) is a branch of statistics that analyzes data providing information about curves, surfaces or anything else varying over a continuum
Translation: Despite periodic meter reads (e.g., every 60 minutes), Consumers Energy knows that customers consume electricity continuously throughout the day, not just at the meter-read intervals. I will illustrate that, using the FDA framework, we can fill in the gaps by creating a smooth estimate of consumption (kWh) that better mirrors actual consumption.
The goal of this research is to explore how to leverage FDA to construct a smooth estimate of consumption. Specifically, for this body of work, I will show how to use FDA to model average consumption behavior. The data was summarized because the raw data is too big for this proof-of-concept type of research.
To provide some technical context, FDA will be used to estimate a continuous function, \(f(x)\) (read "f of x"), from the discrete periodic meter reads. The estimated function can then be used to predict what consumption would be at any point in time (the x in \(f(x)\)). For example, we could plug 11:05 AM into the estimated function as \(f(x = 11:05)\) and the function would output its best guess for a customer's average consumption at that time.
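As a minimal sketch of that idea in R (assuming a fitted functional data object, here called fdObj, of the kind built later in this report with the fda package):

# Minimal sketch: evaluate an estimated consumption curve at 11:05 AM.
# fdObj is assumed to be a fitted fda::fd object defined over the 24-hour day,
# as constructed later in this report.
library(fda)

time_1105 <- 11 + 5 / 60                           # 11:05 AM in fractional hours
fda::eval.fd(evalarg = time_1105, fdobj = fdObj)   # estimated consumption (kWh)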
The data for this research was obtained from an internal database. To protect customers' privacy and to adhere to information-safety best practices, noise was added to the actual meter reads so that there is no way to associate the data used throughout this research with any actual customer. What's more, no identifying markers will be used throughout the report. Also, the chunk of code that anonymizes the data is hidden so that the process cannot be reverse engineered.
An important nuance to note is that this research analyzes locations, not customers. One customer can have multiple meters; this analysis is at the meter level, and for this report a meter = a location. One could easily aggregate the meter data by customer and reproduce a similar analysis, as sketched below.
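A hedged sketch of what that customer-level roll-up might look like, assuming the cniData column names used later in this report (contract_account for the customer, readtime_date for the read date, and the hourly cons_int_* columns):

# Sketch: roll meter-level (usage_point) reads up to the customer level.
# Column names are taken from later chunks in this report; the grouping
# choice here is an illustrative assumption, not production code.
library(dplyr)

customerData <- cniData %>%
  dplyr::group_by(contract_account, readtime_date) %>%
  dplyr::summarise(
    dplyr::across(dplyr::starts_with("cons_int"), sum),
    .groups = "drop"
  )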
Overview:
One characteristic of the data is how greatly consumption can vary between locations. Below is a table of the summary statistics by hour for the data used in this report. As can be seen, consumption varies wildly, so the final algorithm needs to be customized to each location's particular usage behavior.
In the table below, notice how far apart the mean and the median are at each hour. This is the tell-tale sign of a skewed distribution. Another artifact of the data to notice is the spread between the min and the max (i.e., the range, in stats terminology).
tmpFuns <- c(max, mean, median, sd, mad)
tmpFunction <- function(data, FUN) {
data %>%
dplyr::select(dplyr::starts_with("cons_int")) %>%
dplyr::summarise(dplyr::across(.cols = dplyr::everything(), .fns = FUN)) %>%
dplyr::collect() %>%
round(., 2) %>%
as.numeric()
}
tmpOut <- vapply(tmpFuns, FUN.VALUE = numeric(24), function(x) tmpFunction(cniData, x))
tmpOut <- cbind(rep(0, 24), tmpOut)
colnames(tmpOut) <- c("Min", "Max", "Mean", "Median", "Stdv", "MAD")
row.names(tmpOut) <- 0:23
tmpOut %>%
  dplyr::as_tibble(rownames = "Hour") %>%
  gt::gt() %>%
  gt::tab_header(
    title = gt::md("**Summary Statistics**"),
    subtitle = "SRS, n = 100 locations"
  ) %>%
  gt::fmt_number(
    columns = Max,
    sep_mark = ",",
    decimals = 2
  )
**Summary Statistics**: hourly consumption (kWh); SRS, n = 100 locations

| Hour | Min | Max | Mean | Median | Stdv | MAD |
|---|---|---|---|---|---|---|
| 0 | 0 | 24,709.42 | 16.15 | 3.55 | 95.12 | 3.74 |
| 1 | 0 | 7,977.61 | 15.95 | 3.30 | 94.34 | 3.26 |
| 2 | 0 | 8,132.75 | 15.71 | 3.24 | 93.41 | 3.20 |
| 3 | 0 | 7,816.13 | 15.76 | 3.24 | 93.32 | 3.17 |
| 4 | 0 | 6,495.13 | 16.11 | 3.25 | 93.89 | 3.19 |
| 5 | 0 | 8,212.77 | 16.88 | 3.36 | 95.54 | 3.34 |
| 6 | 0 | 7,540.88 | 18.17 | 3.55 | 98.66 | 3.62 |
| 7 | 0 | 10,523.57 | 19.53 | 4.09 | 103.83 | 4.45 |
| 8 | 0 | 13,233.00 | 20.71 | 4.55 | 107.99 | 5.06 |
| 9 | 0 | 15,913.09 | 21.61 | 4.98 | 113.27 | 5.56 |
| 10 | 0 | 18,331.53 | 22.29 | 5.34 | 116.79 | 5.97 |
| 11 | 0 | 22,250.87 | 22.63 | 5.52 | 120.49 | 6.18 |
| 12 | 0 | 27,082.36 | 22.76 | 5.53 | 124.65 | 6.21 |
| 13 | 0 | 26,197.86 | 22.95 | 5.53 | 127.70 | 6.23 |
| 14 | 0 | 26,349.87 | 22.88 | 5.51 | 129.30 | 6.22 |
| 15 | 0 | 27,391.87 | 22.55 | 5.43 | 129.32 | 6.15 |
| 16 | 0 | 30,571.50 | 21.92 | 5.24 | 124.20 | 5.94 |
| 17 | 0 | 25,198.64 | 21.08 | 4.79 | 118.06 | 5.44 |
| 18 | 0 | 19,475.05 | 20.36 | 4.44 | 112.75 | 5.08 |
| 19 | 0 | 16,130.69 | 19.79 | 4.26 | 108.71 | 4.85 |
| 20 | 0 | 14,796.09 | 19.12 | 4.12 | 104.59 | 4.62 |
| 21 | 0 | 12,007.55 | 18.31 | 3.99 | 100.86 | 4.38 |
| 22 | 0 | 11,106.90 | 17.41 | 3.80 | 98.32 | 4.08 |
| 23 | 0 | 9,109.45 | 16.66 | 3.46 | 96.16 | 3.48 |
Below, a particular location was randomly sampled and all of that location's data was plotted. A great deal of information can be extracted from this one visual:
# randomly sample a single usage point
set.seed(1)
UP <- cniData[["usage_point"]] %>%
unique(.) %>%
sample(x = ., size = 1)
cniData %>%
dplyr::as_tibble() %>%
dplyr::filter(usage_point == UP) %>%
dplyr::select(-c(contract_account, usage_point, account_class, readtime_date)) %>%
tidyr::pivot_longer(cols = !weekpart, names_to = "hour", values_to = "consumption") %>%
ggplot(aes(x = hour, y = consumption, fill = weekpart, color = weekpart)) +
geom_point(alpha = 0.5) +
scale_x_discrete(name = "Hour", labels = 0:23) +
ggtitle("A randomly sampled location")While I don’t investigate the nuance of weekday vs weekend in this research I do think it would be valuable to consider that perhaps the algorithm should be applied to both weekday and weekends separately to best capture the consumption behavior. The difference in consumption between weekday and weekend is likely a phenomenon among residential customers as well.
Another thing to consider is the size of the data. For this particular randomly sampled location there are 1,400 data points for each hour (i.e., 1,400 rows by 24 columns). That's a lot of data! In the next section I'll discuss how to shrink the size of the data.
The previous section illustrated that there is a lot of data and that consumption can vary greatly by hour. One way to reduce the variability is to summarize the data. For this reason, and for practicality, the data was summarized.
The method used to summarize the data depends on the nature of the data. Each hour contains a history of past meter reads. Each hour is a distribution of consumption (kWh). If the hourly distributions are bell-shaped then the mean is an efficient and accurate representation of centrality of the distribution. If, however, the hourly distributions are skewed, then the median is a better way to summarize the data.
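As a quick, self-contained illustration (simulated data, not meter reads): in a right-skewed sample the mean is pulled toward the long tail while the median stays near the bulk of the observations.

# Toy example with simulated right-skewed data (not meter data)
set.seed(42)
skewed_sample <- rlnorm(n = 1000, meanlog = 1, sdlog = 1)

mean(skewed_sample)    # pulled upward by the long right tail
median(skewed_sample)  # stays near the center of the bulk of the data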
The graphic below plots the distributions of certain hours. We will use this example to gauge whether the median or the mean is more appropriate.
density_data <-
cniData %>%
dplyr::filter(usage_point == UP) %>%
#dplyr::select(cons_int_0, cons_int_13) %>%
dplyr::collect()
hr0_density <- density(density_data$cons_int_0, from = 0)
hr5_density <- density(density_data$cons_int_5, from = 0)
hr13_density <- density(density_data$cons_int_13, from = 0)
hr20_density <- density(density_data$cons_int_20, from = 0)
plot(
x = hr0_density$x,
y = hr0_density$y,
main = "KDE: Consumption (kWh) for a single Location",
ylab = "Density",
xlab = "Consumption",
ylim = c(0, max(hr0_density$y, hr13_density$y, hr5_density$y, hr20_density$y)),
xlim = c(0, max(hr0_density$x, hr13_density$x, hr5_density$x, hr20_density$x)),
col = "darkorange",
lwd = 2,
type = "l"
)
lines(
x = hr13_density$x,
y = hr13_density$y,
col = "navy",
lwd = 2
)
lines(
x = hr20_density$x,
y = hr20_density$y,
col = "firebrick",
lwd = 2
)
lines(
x = hr5_density$x,
y = hr5_density$y,
col = "gray20",
lwd = 2
)
legend("topright",
legend = c("0 Hour", "05 Hour", "13 Hour", "20 Hour"),
col = c("darkorange", "navy"),
lty = c(1, 1, 1, 1),
lwd = c(2, 2, 2, 2),
bty = "n"
)While some of the hourly distributions in the graphic above appear bell-shaped enough, I still prefer the median just as a precaution.
As a final sanity check, for the same location as above, the visual below shows how the median and mean differ. Also plotted is a measure of spread for each: the median absolute deviation (MAD) for the median and the standard deviation for the mean.
summaryVizData <-
cniData %>%
dplyr::as_tibble() %>%
dplyr::filter(usage_point == UP) %>%
dplyr::select(-c(contract_account, usage_point, account_class, readtime_date)) %>%
tidyr::pivot_longer(cols = !weekpart, names_to = "hour", values_to = "consumption") %>%
dplyr::group_by(hour) %>%
summarise(
median = median(consumption), MAD = mad(consumption),
mean = mean(consumption), StDv = sd(consumption)
)
vizMean <-
summaryVizData %>%
ggplot(aes(x = hour, y = mean)) +
geom_point(size = 4, color = "navy") +
scale_x_discrete(name = "Hour", labels = 0:23)
vizMedian <-
summaryVizData %>%
ggplot(aes(x = hour, y = median)) +
geom_point(size = 4, color = "darkred") +
scale_x_discrete(name = "Hour", labels = 0:23)
vizSd <-
summaryVizData %>%
ggplot(aes(x = hour, y = StDv)) +
geom_point(size = 4, color = "navy") +
scale_x_discrete(name = "Hour", labels = 0:23)
vizMad <-
summaryVizData %>%
ggplot(aes(x = hour, y = MAD)) +
geom_point(size = 4, color = "darkred") +
scale_x_discrete(name = "Hour", labels = 0:23)
ggarrange(vizMedian, vizMean,
  vizMad, vizSd,
  labels = c("Median", "Mean", "MAD", "Standard Deviation"),
  ncol = 2, nrow = 2
)

Both groups of summary statistics tell a similar story.
With respect to the algorithm, it does not matter whether the data is summarized via the mean or the median; either way, the algorithm will do its job. Which summary statistic to choose is more a matter of statistical philosophy. For this research, I chose the median.
FDA will help us create a smooth curve from discrete data, such as hourly consumption (kWh) data. I will test two approaches to estimate the smooth curve: the Fourier basis and cubic splines.
The Fourier basis is a series of cosine and sine waves. This is an old technique that has been found to be particularly useful in the natural world, from engineering to physics, because it fits periodic data well.
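Concretely, with period \(T\) (here \(T = 24\) hours) and \(\omega = \frac{2\pi}{T}\), a curve built from \(K\) Fourier basis functions (with \(K\) odd) takes the form:

\[
f(t) = c_0 + \sum_{k=1}^{(K-1)/2}\left[c_{2k-1}\sin(k\omega t) + c_{2k}\cos(k\omega t)\right]
\]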
From this article online:
A cubic spline is a piecewise cubic function that interpolates a set of data points and guarantees smoothness at the data points
Essentially, a cubic spline is a squiggly line that connects the dots.
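In symbols, between consecutive knots \(t_i\) and \(t_{i+1}\) the spline is just a cubic polynomial, and the pieces are stitched together so that the function and its first two derivatives are continuous at every knot:

\[
S_i(t) = a_i + b_i(t - t_i) + c_i(t - t_i)^2 + d_i(t - t_i)^3, \qquad t_i \le t \le t_{i+1},
\]

with \(S_i(t_{i+1}) = S_{i+1}(t_{i+1})\), \(S_i'(t_{i+1}) = S_{i+1}'(t_{i+1})\), and \(S_i''(t_{i+1}) = S_{i+1}''(t_{i+1})\).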
Every recipe I know of starts by prepping the ingredients. And so it is too in statistics. Before getting started the data needs to be collected and prepped.
What's more, the Fourier basis and cubic splines also need to be prepped beforehand. Each has a series of parameters that need to be initialized. To begin, I have arbitrarily set the number of basis functions to 9, and every location will use the same fixed set of parameters. Later on, I will tune these settings so that each location has its own optimal values.
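The chunks that follow rely on a handful of objects prepared in an earlier, hidden chunk (range_eval, eval_arg, num_ofSplines, fdNames, and the summarized matrix fdaData_cni). Their exact values are not shown in this report; the sketch below uses plausible placeholder values so the later chunks are easier to follow, and these specific values are assumptions rather than the production settings.

# Sketch of the hidden setup assumed by the chunks below (values are assumptions)
range_eval    <- c(0, 24)       # the curve is defined over a 24-hour day
eval_arg      <- 0:23 + 0.5     # evaluate at the midpoint of each hourly interval
num_ofSplines <- 9              # arbitrary starting number of basis functions
fdNames       <- list("Hour", "Location", "Consumption (kWh)")

# fdaData_cni is assumed to be a 24 x 100 matrix of summarized (median) hourly
# consumption, one column per location, built from cniData in a hidden chunk.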
A neat property of the Fourier basis is how it handles periodic data. For our use, the reader will notice that the smooth curve estimated by the Fourier basis repeats; that is, the curve loops, so the 23:59 estimate transitions smoothly into the 00:00 estimate. This property has uses beyond the work in this report. For example, this method can be used to estimate the hourly seasonality in a time series model, which can then be used to forecast future consumption.
Finally, the Fourier basis has easily computable derivatives because it is constructed from cosine and sine functions. In a following section I will show how we could use a derivative to smooth the estimated curve; a quick sketch of evaluating derivatives is shown below.
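A brief sketch of what that looks like with the fda package (using the testFD object created in the next chunk): passing an integer to the Lfdobj argument of fda::eval.fd() returns the corresponding derivative of the fitted curve.

# Sketch: evaluate the fitted curve and its derivatives (assumes testFD and
# eval_arg, which are created in the surrounding chunks)
curve_hat <- fda::eval.fd(evalarg = eval_arg, fdobj = testFD)              # f(t)
velocity  <- fda::eval.fd(evalarg = eval_arg, fdobj = testFD, Lfdobj = 1)  # f'(t)
accel     <- fda::eval.fd(evalarg = eval_arg, fdobj = testFD, Lfdobj = 2)  # f''(t)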
fdaFourier_basis <- fda::create.fourier.basis(rangeval = range_eval, nbasis = num_ofSplines)
fdaFourier_matrix <- fda::eval.basis(eval_arg, fdaFourier_basis)
basiscoef <- solve(
crossprod(fdaFourier_matrix),
crossprod(fdaFourier_matrix, fdaData_cni)
)
testFD <-
fda::fd(
coef = basiscoef,
basisobj = fdaFourier_basis,
fdnames = fdNames
)

Cubic splines are a great way to produce curves. However, as will be shown below, there are caveats when using cubic splines. The biggest drawback of using cubic splines with consumption (kWh) data is that there are natural bounds in the data; for example, a location can't have negative usage. Cubic splines could, in theory, produce a curve that extends beyond the possible range of values of the data.
fdaBsplies_basis <- fda::create.bspline.basis(range_eval, norder = 4, nbasis = num_ofSplines)
fdaBsplies_matrix <- fda::eval.basis(eval_arg, fdaBsplies_basis)
basiscoef_b <- solve(
crossprod(fdaBsplies_matrix),
crossprod(fdaBsplies_matrix, fdaData_cni)
)
testFD_b <-
fda::fd(
coef = basiscoef_b,
basisobj = fdaBsplies_basis,
fdnames = fdNames
)

The plot below illustrates the 100 estimated smooth curves via the Fourier basis and cubic splines. It's difficult to see all 100 curves because consumption varies greatly among locations; certain locations have nearly 500 times the consumption (kWh) of other locations.
par(mfrow = c(1, 2))
plot(testFD,
lty = 1,
lwd = 2,
col = 1,
xlim = c(0, 23)
)
mtext("Fourier Basis")
plot(testFD_b,
lty = 1,
lwd = 2,
col = 1,
xlim = c(0, 23)
)
mtext("Cubic Splines")Since the above is a jumbled mess, to continue to investigate further I have randomly sampled 10 locations to zoom into since we can’t see all the curves at once.
As the following visual will illustrate, the Fourier basis fits the data better than the cubic splines using the same hard-coded starting parameters. The Fourier basis curve is smoother and provides a more reasonable fit at the ends, that is, the 0 and 23 hours. Thus, from here onward only the Fourier basis will be considered.
par(mfrow = c(k2, 2))
for (i in seq_along(usage_points_toPlot)) {
# Fourier basis
fda::plotfit.fd(fdaData_cni[, usage_points_toPlot[i]],
eval_arg,
testFD[usage_points_toPlot[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis")
# Cubic Splines
fda::plotfit.fd(fdaData_cni[, usage_points_toPlot[i]],
eval_arg,
testFD_b[usage_points_toPlot[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Cubic Splines")
}

The parameter nbasis defines how many basis functions are used to create the Fourier basis model. Too few and the model will underfit the data, while too many and the model will overfit. In the beginning I arbitrarily set this parameter to 9.
In any optimization procedure, what is to be optimized over needs to be defined. For example, if your goal were to bake the best cupcake, how would you define best: flavor, size, appearance, etc.? Luckily, our objective function has already been defined. From the book by Ramsay and Silverman (Functional Data Analysis), we will use the following equation to obtain an unbiased estimate of the sampling variance: \(s^2 = \frac{1}{n-K}\sum^n_{j=1}(y_j-\hat{y}_j)^2\)
The following example will help to illustrate what the optimization procedure will accomplish:

The procedure will loop through the possible numbers of basis functions (the parameter K) until the slope is less than delta (\(\delta\)). The algorithm then stops and chooses the first K whose slope falls within +/- \(\delta\) of zero. Since the chosen value of K depends on the value of \(\delta\), \(\delta\) is set to a small value, \(\delta = 0.1\), to help ensure that the chosen value of K does not overfit the data.

The goal behind selecting K as the first value below the delta threshold is to avoid overfitting. The function will tend to overfit as K increases, while too small a value of K will underfit the data. In the next section a smoothing technique will be introduced that will help better fit the function to the data.

After some experimentation, I encountered an issue: the slopes varied wildly across locations, and I struggled to come up with a general solution that could apply to all of them. To rectify this, I standardized the estimates of the sampling variance (\(s^2\)) so that every location would be on a similar scale. I refer to this standardized statistic as \(\tilde{s}^2\). The slopes of \(\tilde{s}^2\) were much more stable, and I was able to create a general solution that could apply to all locations. Further research is required to determine whether this is a good solution.
# sources:
# Functional Data Analysis
# Functional Data Analysis with R and Matlab
# object to house visuals
viz_list <- vector(mode = "list", length = k2)
viz_list_slope <- viz_list
# object of optimal K for k2 usage points
optimal_K <- rep(NA, k2)
# Create max(K_i) basis functions
fbasis <- fda::create.fourier.basis(rangeval = range_eval, nbasis = max(nbasis_grid), period = diff(range_eval))
bvals <- fda::eval.basis(evalarg = eval_arg, basisobj = fbasis)
# LOOP: Usage Point (i)
for (i in seq_len(k2)) {
UP_data <- fdaData_cni[, usage_points_toPlot[i]]
# Create blank matrix to house results
## Column 1: K: Number of basis functions
## Column 2: Sigma: Estimate of sampling variance
UP_optim_stats_matrix <- matrixStats::allocMatrix(nrow = length(nbasis_grid), ncol = 2)
colnames(UP_optim_stats_matrix) <- c("K", "Sigma")
# LOOP: values of K
for (j in seq_along(nbasis_grid)) {
# Parameter K, the number of basis functions
k <- nbasis_grid[j]
# Fourier basis with K basis functions
tmpBasis <- bvals[, 1:k]
# Fit linear model: multiple linear regression
tmpLM <- lm(UP_data ~ tmpBasis - 1)
# predicted values
yhat <- tmpLM$fitted.values
# squared errors
squared_errors <- (UP_data - yhat)**2
# unbiased estimate of the sampling error
## assuming that the standard model for error is accepted
sigma <- (1 / (24 - k)) * sum(squared_errors)
# populate the matrix of results
## K
UP_optim_stats_matrix[j, 1] <- k
## Sigma
UP_optim_stats_matrix[j, 2] <- sigma
}
# Create summary matrix
summary_matrix_UP <-
UP_optim_stats_matrix %>%
# convert to tibble for dplyr functions
dplyr::as_tibble() %>%
dplyr::mutate(Sigma_stdz = scale(Sigma)) %>%
dplyr::mutate(Slope = (Sigma_stdz - dplyr::lag(Sigma_stdz)) / 2)
# Choosing K
optimal_K[i] <-
summary_matrix_UP %>%
dplyr::filter(Slope <= machine_precision_zero + delta & Slope >= machine_precision_zero - delta) %>%
dplyr::filter(abs(Sigma_stdz) <= 1) %>%
dplyr::slice_head() %>%
dplyr::select(K) %>%
dplyr::mutate(K = base::ifelse(K - 2 == 3, 5, K)) %>%
as.numeric()
# Plot: K2 UP
## Sigma hat
viz_list[[i]] <-
summary_matrix_UP %>%
ggplot(aes(x = K, y = Sigma_stdz)) +
geom_point(color = "darkorange", size = 2, alpha = 0.5) +
geom_line(color = "darkorange") +
ylab(NULL) +
ggtitle(paste("Randomly Sampled Usage Point", i, sep = " "),
subtitle = "Estimate of Standardized Sampling Variance"
) +
scale_y_continuous(limits = c(-3, 3))
## slope of sigma hat
viz_list_slope[[i]] <-
summary_matrix_UP %>%
dplyr::filter(Slope <= machine_precision_zero + delta & Slope >= machine_precision_zero - delta) %>%
dplyr::filter(abs(Sigma_stdz) <= 1) %>%
ggplot(aes(x = K, y = Slope)) +
geom_point(color = "navy", size = 2, alpha = 0.5) +
geom_abline(aes(slope = 0, intercept = machine_precision_zero + delta)) +
geom_abline(aes(slope = 0, intercept = machine_precision_zero - delta)) +
ylab(NULL) +
ggtitle(paste("Randomly Sampled Usage Point", i, sep = " "),
subtitle = paste("Slope:", paste("Optimal K value is", optimal_K[i], sep = " "), sep = " ")
) +
scale_y_continuous(limits = c(-delta, delta))
}
# plot all k2 visuals
ggarrange(viz_list[[1]], viz_list_slope[[1]],
viz_list[[2]], viz_list_slope[[2]],
viz_list[[3]], viz_list_slope[[3]],
viz_list[[4]], viz_list_slope[[4]],
viz_list[[5]], viz_list_slope[[5]],
viz_list[[6]], viz_list_slope[[6]],
viz_list[[7]], viz_list_slope[[7]],
viz_list[[8]], viz_list_slope[[8]],
viz_list[[9]], viz_list_slope[[9]],
viz_list[[10]], viz_list_slope[[10]],
ncol = 2,
nrow = 10
)

Now that the optimal value of K has been selected for each of the randomly sampled usage points, in this section the function will be re-created using the optimal value of K. The following visual compares the original fit, with the static and arbitrarily chosen K, to the fit with the optimal K for each usage point.
optimalK_fd_list <- vector("list", length = k2)
for (i in seq_len(k2)) {
fbasis_optimal <- fda::create.fourier.basis(rangeval = range_eval, nbasis = optimal_K[i])
fbasis_matrix_optimal <- fda::eval.basis(eval_arg, fbasis_optimal)
fbasis_coef_optimal <- solve(
crossprod(fbasis_matrix_optimal),
crossprod(fbasis_matrix_optimal, fdaData_cni[, usage_points_toPlot[i]])
)
fbasis_FD_optimal <- fda::fd(
coef = fbasis_coef_optimal,
basisobj = fbasis_optimal,
fdnames = fdNames
)
# out
optimalK_fd_list[[i]] <- fbasis_FD_optimal
}
par(mfrow = c(k2, 2))
for (i in seq_len(k2)) {
# Fourier Basis: original
fda::plotfit.fd(
y = fdaData_cni[, usage_points_toPlot[i]],
argvals = eval_arg,
fdobj = testFD[usage_points_toPlot[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis: Original")
# Fourier Basis: optimal
fda::plotfit.fd(
y = fdaData_cni[, usage_points_toPlot[i]],
argvals = eval_arg,
fdobj = optimalK_fd_list[[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis: Optimal")
}

Initially, the Fourier basis was created with a fixed parameter. Then, an optimization procedure was used to re-build the Fourier basis with an optimal K parameter. In this section another parameter, lambda (\(\lambda\)), will be selected via another optimization procedure.
The previously created curve will be smoothed to better fit the data via a roughness penalty based on the harmonic acceleration operator, with the penalty weight \(\lambda\) chosen by generalized cross-validation (GCV).
The harmonic acceleration operator is defined as \(L = w^2D + D^3\), where \(w = \frac{2\pi}{T}\) and \(T\) defines the period. In this case \(T\) = 24 hours.
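Following Ramsay and Silverman, the smoothed curve minimizes a penalized sum of squares: the first term rewards fidelity to the meter reads, the second term penalizes roughness as measured by the harmonic acceleration operator \(L\) (which is zero for a constant plus a single sinusoid of period \(T\)), and \(\lambda\) controls the trade-off.

\[
\text{PENSSE}_\lambda(x) = \sum_{j}\left[y_j - x(t_j)\right]^2 + \lambda \int \left[Lx(t)\right]^2 \, dt
\]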
# define w
W <- (2 * pi) / diff(range_eval)
# specify the constants
# p.55 in Functional Data Analysis with R and Matlab
harmonic_acceleration <- c(0, W**2, 0)
# create the linear differential operator
harmacc_ldf <- fda::vec2Lfd(harmonic_acceleration, rangeval = range_eval)

In the procedure below, the goal is to estimate the optimal value of the roughness penalty weight, \(\lambda\). The optimal value is the one that minimizes the GCV criterion, where GCV stands for generalized cross-validation. Something to note is the danger of a boundary solution, which occurs when the minimum GCV sits at either end of the range of lambda values the procedure searches over. In that case, I have hard-coded \(\log_{10}(\lambda)\) to be 0. I chose 0 because \(10^0 = 1\), and \(\lambda = 1\) introduces almost no penalty.
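For reference, the GCV criterion returned by fda::smooth.basis() takes the familiar form below, where \(n\) is the number of observations (here 24 hourly values), \(\text{df}(\lambda)\) is the effective degrees of freedom of the smoother, and \(\text{SSE}(\lambda)\) is the residual sum of squares; this follows Ramsay and Silverman's presentation.

\[
\text{GCV}(\lambda) = \left(\frac{n}{n - \text{df}(\lambda)}\right)\left(\frac{\text{SSE}(\lambda)}{n - \text{df}(\lambda)}\right)
\]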
# parallel: export
lambdas_seq <- seq(-10, 10, 0.1)
# define function
parallel_fda <- function(lambdas, data, ID, fourier_basis, ldf) {
# declare output
# out = rep(NA, length(lambdas))
out <- matrixStats::allocMatrix(nrow = length(lambdas), ncol = 1, value = NA_real_)
colnames(out) <- ID
# fda per lambda
for (i in seq_along(lambdas)) {
lambda <- 10**(lambdas[i])
fdParObj <- fda::fdPar(
fourier_basis,
ldf,
lambda
)
smoothed <- fda::smooth.basis(
argvals = 0:23 + 0.5,
y = data,
fdParobj = fdParObj
)
out[i, 1] <- sum(smoothed$gcv)
}
return(out)
}
# establish parallel architecture
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
roughness_penalty_GCV_matrix <-
foreach::foreach(
i = seq_along(usage_points_toPlot),
.export = c("fdaData_cni", "optimalK_fd_list", "harmacc_ldf"),
.packages = c("fda"),
.combine = cbind
) %dopar% {
# GCV values for lambda
parallel_fda(
lambdas = lambdas_seq,
data = fdaData_cni[, usage_points_toPlot[i]],
ID = colnames(fdaData_cni)[usage_points_toPlot][i],
fourier_basis = optimalK_fd_list[[i]],
ldf = harmacc_ldf
)
}
# stop parallel
parallel::stopCluster(cl)
# plot
tmpSD <- apply(roughness_penalty_GCV_matrix, 2, sd)
par(mfrow = c(5, 2))
for (i in seq_len(k2)) {
plot(
x = lambdas_seq,
y = roughness_penalty_GCV_matrix[, i],
type = "l",
ylab = "GCV",
xlab = expression("log"[10] * (lambda)),
main = paste("Randomly Sampled Location", i, sep = " ")
)
points(
x = lambdas_seq[which.min(roughness_penalty_GCV_matrix[, i])],
y = min(roughness_penalty_GCV_matrix[, i]),
col = "navy",
pch = 19
)
}

# choose optimal lambda
optimal_lambda_vector <- apply(
roughness_penalty_GCV_matrix,
2,
function(x) {
tmp <- lambdas_seq[which.min(x)]
if (tmp == max(lambdas_seq) || tmp == min(lambdas_seq)) {
0
} else {
tmp
}
}
)

Now that the optimal values for the K parameter and the roughness penalty \(\lambda\) have been computed, I will re-create the curve with the optimal values and, as before, compare it to the original estimate with arbitrarily chosen parameter values.
list_optimal_k_lambda <-
lapply(
seq_along(usage_points_toPlot),
function(i) {
fdParObj_tmp <- fda::fdPar(optimalK_fd_list[[i]], harmacc_ldf, 10**optimal_lambda_vector[[i]])
smoothed <- fda::smooth.basis(
argvals = eval_arg,
y = fdaData_cni[, usage_points_toPlot[i]],
fdParobj = fdParObj_tmp
)
fd_tmp <- smoothed$fd
fd_tmp$names <- fdNames
fd_tmp
}
)

par(mfrow = c(k2, 2))
for (i in seq_len(k2)) {
# Fourier Basis: original
fda::plotfit.fd(
y = fdaData_cni[, usage_points_toPlot[i]],
argvals = eval_arg,
fdobj = testFD[usage_points_toPlot[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis: Original")
# Fourier Basis: optimal
fda::plotfit.fd(
y = fdaData_cni[, usage_points_toPlot[i]],
argvals = eval_arg,
fdobj = list_optimal_k_lambda[[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis: Optimal K and Lambda")
}

In the plots below, the line represents the estimated one-minute-interval consumption while the dots represent the actual hourly meter reads. This illustrates that Functional Data Analysis, using the Fourier basis, is capable of producing estimates of consumption for any interval of time.
tmpMin <- 1
tmpSeq <- seq(0, 23, tmpMin / 60)
par(mfrow = c(5, 2))
for (i in seq_len(k2)) {
tmpPred <- predict(list_optimal_k_lambda[[i]], newdata = tmpSeq)
tmpY <- fdaData_cni[, usage_points_toPlot[i]]
plot(
x = tmpSeq,
y = tmpPred,
type = "l",
col = "navy",
main = paste("Randomly Sampled Location", i, sep = " "),
ylab = "Consumption",
xlab = "Time",
lwd = 2,
ylim = c(min(tmpPred, tmpY), max(tmpPred, tmpY))
)
points(
x = 0:23,
y = tmpY,
col = "darkorange",
pch = 16,
cex = 1.5
)
lines(
x = 0:23,
y = tmpY,
col = "darkorange"
)
}

In regression analysis it is customary to check the residuals to see that they meet the assumptions of a regression model. Checking residuals for constant variance and normality is common practice to validate a model. Below are plots to visualize the normality of the residuals and to check whether the residuals appear to be a random scatter of points.
The constant variance assumption may be questionable since some of the residual plots appear to have a pattern. However, according to the QQ Plots most of the randomly sampled locations meet the normality assumption.
To assess normality I used the Shapiro-Wilk test. The test should provide a reasonable assessment of normality because the sample size for each location is n = 24; the Shapiro-Wilk test can be sensitive to large sample sizes. The null hypothesis \(H_0\) for the Shapiro-Wilk test is that the data is normally distributed. So, assuming the usual default level of significance, \(\alpha = 0.05\), if the p-value is greater than the stated level of significance we fail to reject the null and can treat the data as approximately normally distributed.
par(mfrow=c(10, 2))
for(i in seq_len(ncol(residuals))){
plot(residuals[,i],
type = "p",
main = paste("Randomly Sampled Location", i, sep = " "),
ylab = "Consumption",
xlab = "Time",
pch = 16,
col = "darkorange",
ylim = c(-abs(max(residuals[,i])), abs(max(residuals[,i])))
)
abline(h = 0, lty = 2, col = "gray50")
qqnorm(residuals[,i])
qqline(residuals[,i])
shapiro.test(residuals[,i])[["p.value"]] %>%
round(x = ., digits = 4) %>%
paste("Shapiro Wilks P.value:", ., sep = " ") %>%
mtext(., cex = 0.8, line = -1)
}

In this assessment visual the estimated consumption is plotted on the x-axis while the actual consumption is on the y-axis. If the model is a good fit, we should see a linear scatter of points from the bottom left corner to the top right corner.
par(mfrow=c(5, 2))
for(i in seq_len(ncol(residuals))){
tmpAxisMax = max(matrix_yhat_hourly[,i], fdaData_cni[, usage_points_toPlot[i]])
tmpAxisMin = min(matrix_yhat_hourly[,i], fdaData_cni[, usage_points_toPlot[i]])
plot(x = matrix_yhat_hourly[,i],
y = fdaData_cni[, usage_points_toPlot[i]],
type = "p",
main = paste("Randomly Sampled Location", i, sep = " "),
ylab = "Consumption",
xlab = "Estimated Consumption",
pch = 16,
col = "darkorange"
,ylim = c(tmpAxisMin, tmpAxisMax)
,xlim = c(tmpAxisMin, tmpAxisMax)
)
abline(a = 0, b = 1, lty = 2, col = "gray50")
}

The MSE (\(\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2\)) will be used to assess how well the estimated curves fit the hourly consumption data. A perfect fit would be MSE = 0, and as MSE increases the fit of the model worsens. Also, scale matters, so locations with higher consumption will have higher MSE values relative to locations with lower consumption.
The MSE values can't be used to compare the fit of one location to another. Instead, I propose saving the MSE values so that, later on, if another method to estimate consumption (kWh) is devised, the fit from the Fourier basis can be compared to the other method by location.
data.frame(
Location = paste("Randomly Sampled Location", seq_len(k2), sep = " "),
MSE = colMeans(SSE) %>% round(., digits = 4),
row.names = NULL
) %>%
knitr::kable(., format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)

| Location | MSE |
|---|---|
| Randomly Sampled Location 1 | 0.0146 |
| Randomly Sampled Location 2 | 2.4491 |
| Randomly Sampled Location 3 | 0.0785 |
| Randomly Sampled Location 4 | 0.0697 |
| Randomly Sampled Location 5 | 0.0082 |
| Randomly Sampled Location 6 | 0.6002 |
| Randomly Sampled Location 7 | 1.1139 |
| Randomly Sampled Location 8 | 0.0200 |
| Randomly Sampled Location 9 | 0.0508 |
| Randomly Sampled Location 10 | 13.9027 |
In this view we can see whether the median is an appropriate measure of centrality. If the estimated curve is somewhere near the middle of all the points for each hour, then the median will be close to the mean, which suggests that the hourly consumption distributions are symmetric. If, however, the estimated curve is not in the middle, then the mean and the median will differ, which suggests that the hourly distributions are skewed.
Based on this research I believe that the median better captures the “average” behavior of each location. This also implies that the MAD would be a better representation of dispersion than the standard deviation.
tmpUPs <- (colnames(fdaData_cni) %>% unique())[usage_points_toPlot]
par(mfrow = c(5, 2))
for (i in seq_along(usage_points_toPlot)) {
tmpPlotData <-
cniData %>%
dplyr::filter(usage_point == tmpUPs[i]) %>%
dplyr::select(dplyr::starts_with("cons_int")) %>%
dplyr::collect()
for (j in 0:23) {
if (j == 0) {
plot(
x = rep(j, nrow(tmpPlotData)),
y = tmpPlotData[[j + 1]],
type = "p",
col = "darkorange",
xlim = c(0, 23),
main = paste("Randomly Sampled Location", i, sep = " "),
ylab = "consumption",
xlab = "Time",
ylim = c(0, max(tmpPlotData))
)
} else {
points(
x = rep(j, nrow(tmpPlotData)),
y = tmpPlotData[[j + 1]],
col = "darkorange"
)
}
}
lines(x = 0:23, y = matrix_yhat_hourly[, i], col = "navy", lwd = 2)
}

In conclusion, using FDA I have shown that it is possible to create a smooth estimate of consumption (kWh), which will allow Consumers Energy to interpolate the average consumption for any given time.
With this research I have shown that it is possible to create a continuous estimate of average consumption (kWh) behavior. The results of this work will allow us to make educated guesses at the average consumption behavior in between the actual meter reads.
As a follow-up to this initial proof-of-concept (POC) research, I want to address the question: how do we know whether the algorithm is accurate? Luckily, there exist meters that provide consumption data every 15 minutes. To test the model, I will select a sample of these locations and, to mirror the data from the remaining meters, aggregate their reads by hour. Then I will process the hourly data through the algorithm and produce 15-minute interval estimates. Finally, I will compare the actual 15-minute consumption to the estimated consumption. If all goes well, the difference between the actual values and the predicted values will be minimal and acceptable. A rough sketch of that validation loop is shown below.
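Every object and function name in the sketch below (interval_15min, fit_location_curve, the column names) is a placeholder for illustration, not something that exists in this report.

# Sketch of the planned validation against 15-minute interval meters.
# interval_15min, fit_location_curve(), and the column names are placeholders.
library(dplyr)

# 1. Aggregate the actual 15-minute reads up to hourly totals (one location)
hourly_actuals <- interval_15min %>%
  dplyr::mutate(hour = floor(read_time_hours)) %>%
  dplyr::group_by(hour) %>%
  dplyr::summarise(consumption = sum(consumption), .groups = "drop")

# 2. Fit the Fourier-basis curve to the hourly aggregates (the algorithm above)
fd_fit <- fit_location_curve(hourly_actuals)

# 3. Predict consumption back on the 15-minute grid
grid_15 <- seq(0, 23.75, by = 0.25)
predicted_15 <- fda::eval.fd(evalarg = grid_15, fdobj = fd_fit)

# 4. Compare predictions to the held-out 15-minute actuals (ordered by time)
mse_15 <- mean((interval_15min$consumption - as.numeric(predicted_15))^2)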
The culmination of this project would be to produce a business-ready algorithm that can:
Sources, Citations
Technology