Last published on: November 13, 2021
At Consumers Energy, a utility provider in the state of Michigan, the meters the company uses to measure consumption (kWh) produce a reading periodically versus continuously. Some of the meters produce data every 60 minutes.
A reading every 60 minutes makes for quite a coarse understanding of consumption behavior. At Consumers Energy there is a growing list of business needs for a more granular understanding of the customer's consumption (kWh) behavior.
One such use case relates to load profiles. A load profile is a descriptive measure or summary of a customer’s consumption (kWh). The company uses this information to better understand customers, which in turn helps customers. For example, a better understanding of consumption behavior and patterns helps ensure marketing material is relevant and targeted.
This research will focus on a branch of statistics called Functional Data Analysis (FDA).
From Wikipedia:
Functional data analysis (FDA) is a branch of statistics that analyzes data providing information about curves, surfaces or anything else varying over a continuum
Translation: Despite periodic meter reads (e.g., every 60 minutes), Consumers Energy knows that customers consume electricity continuously throughout the day, not just at the meter-read intervals. I will illustrate that, using the FDA framework, we can fill in the gaps by creating a smooth estimate of consumption (kWh) that better mirrors actual consumption.
The goal of this research is to explore how to leverage FDA to construct a smooth estimate of consumption. Specifically, for this body of work, I will show how to use FDA to model average consumption behavior. The data was summarized because the raw data is too big for this proof-of-concept type of research.
To provide some technical context, FDA will be used to estimate a continuous function, \(f(x)\) (read "f of x"), from the discrete periodic meter reads. The estimated function can then be used to predict what consumption would be at any point in time (the x in \(f(x)\)). For example, we could plug 11:05 AM into the estimated function as \(f(x = 11:05)\) and the function would output its best guess for a customer's average consumption at that time.
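As a minimal sketch of that idea in R (assuming a fitted functional data object, here called fdObj, of the kind built later in this report with the fda package):

# Minimal sketch: evaluate an estimated consumption curve at 11:05 AM.
# fdObj is assumed to be a fitted fda::fd object defined over the 24-hour day,
# as constructed later in this report.
library(fda)

time_1105 <- 11 + 5 / 60                           # 11:05 AM in fractional hours
fda::eval.fd(evalarg = time_1105, fdobj = fdObj)   # estimated consumption (kWh)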
The data for this research was obtained from an internal database. To protect customers' privacy and to adhere to information-safety best practices, noise was added to the actual meter reads so that there is no way to associate the data used throughout this research with any actual customer. What's more, no identifying markers will be used throughout the report. Also, the chunk of code that anonymizes the data is hidden so that the process cannot be reverse engineered.
An important nuance to note is that this research analyzes locations, not customers. One customer can have multiple meters; this analysis is at the meter level, and for this report a meter = a location. One could easily aggregate the meter data by customer and reproduce a similar analysis, as sketched below.
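A hedged sketch of what that customer-level roll-up might look like, assuming the cniData column names used later in this report (contract_account for the customer, readtime_date for the read date, and the hourly cons_int_* columns):

# Sketch: roll meter-level (usage_point) reads up to the customer level.
# Column names are taken from later chunks in this report; the grouping
# choice here is an illustrative assumption, not production code.
library(dplyr)

customerData <- cniData %>%
  dplyr::group_by(contract_account, readtime_date) %>%
  dplyr::summarise(
    dplyr::across(dplyr::starts_with("cons_int"), sum),
    .groups = "drop"
  )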
Overview:
One characteristic of the data is how greatly consumption can vary between locations. Below is a table of the summary statistics by hour for the data used in this report. As can be seen, consumption varies wildly, so the final algorithm needs to be customized to each location's particular usage behavior.
In the table below, notice how far apart the mean and the median are at each hour. This is the tell-tale sign of a skewed distribution. Another artifact of the data to notice is the spread between the min and the max (i.e., the range, in stats terminology).
tmpFuns <- c(max, mean, median, sd, mad)
tmpFunction <- function(data, FUN) {
data %>%
dplyr::select(dplyr::starts_with("cons_int")) %>%
dplyr::summarise(dplyr::across(.cols = dplyr::everything(), .fns = FUN)) %>%
dplyr::collect() %>%
round(., 2) %>%
as.numeric()
}
tmpOut <- vapply(tmpFuns, FUN.VALUE = numeric(24), function(x) tmpFunction(cniData, x))
tmpOut <- cbind(rep(0, 24), tmpOut)
colnames(tmpOut) <- c("Min", "Max", "Mean", "Median", "Stdv", "MAD")
row.names(tmpOut) <- 0:23
tmpOut %>%
  dplyr::as_tibble(rownames = "Hour") %>%
  gt::gt() %>%
  gt::tab_header(
    title = gt::md("**Summary Statistics**"),
    subtitle = "SRS, n = 100 locations"
  ) %>%
  gt::fmt_number(
    columns = Max,
    sep_mark = ",",
    decimals = 2
  )
**Summary Statistics**: hourly consumption (kWh); SRS, n = 100 locations

| Hour | Min | Max | Mean | Median | Stdv | MAD |
|---|---|---|---|---|---|---|
| 0 | 0 | 24,709.42 | 16.15 | 3.55 | 95.12 | 3.74 |
| 1 | 0 | 7,977.61 | 15.95 | 3.30 | 94.34 | 3.26 |
| 2 | 0 | 8,132.75 | 15.71 | 3.24 | 93.41 | 3.20 |
| 3 | 0 | 7,816.13 | 15.76 | 3.24 | 93.32 | 3.17 |
| 4 | 0 | 6,495.13 | 16.11 | 3.25 | 93.89 | 3.19 |
| 5 | 0 | 8,212.77 | 16.88 | 3.36 | 95.54 | 3.34 |
| 6 | 0 | 7,540.88 | 18.17 | 3.55 | 98.66 | 3.62 |
| 7 | 0 | 10,523.57 | 19.53 | 4.09 | 103.83 | 4.45 |
| 8 | 0 | 13,233.00 | 20.71 | 4.55 | 107.99 | 5.06 |
| 9 | 0 | 15,913.09 | 21.61 | 4.98 | 113.27 | 5.56 |
| 10 | 0 | 18,331.53 | 22.29 | 5.34 | 116.79 | 5.97 |
| 11 | 0 | 22,250.87 | 22.63 | 5.52 | 120.49 | 6.18 |
| 12 | 0 | 27,082.36 | 22.76 | 5.53 | 124.65 | 6.21 |
| 13 | 0 | 26,197.86 | 22.95 | 5.53 | 127.70 | 6.23 |
| 14 | 0 | 26,349.87 | 22.88 | 5.51 | 129.30 | 6.22 |
| 15 | 0 | 27,391.87 | 22.55 | 5.43 | 129.32 | 6.15 |
| 16 | 0 | 30,571.50 | 21.92 | 5.24 | 124.20 | 5.94 |
| 17 | 0 | 25,198.64 | 21.08 | 4.79 | 118.06 | 5.44 |
| 18 | 0 | 19,475.05 | 20.36 | 4.44 | 112.75 | 5.08 |
| 19 | 0 | 16,130.69 | 19.79 | 4.26 | 108.71 | 4.85 |
| 20 | 0 | 14,796.09 | 19.12 | 4.12 | 104.59 | 4.62 |
| 21 | 0 | 12,007.55 | 18.31 | 3.99 | 100.86 | 4.38 |
| 22 | 0 | 11,106.90 | 17.41 | 3.80 | 98.32 | 4.08 |
| 23 | 0 | 9,109.45 | 16.66 | 3.46 | 96.16 | 3.48 |
Below, a particular location was randomly sampled and all of that location's data was plotted. A great deal of information can be extracted from this one visual:
# randomly sample a single usage point
set.seed(1)
UP <- cniData[["usage_point"]] %>%
unique(.) %>%
sample(x = ., size = 1)
cniData %>%
dplyr::as_tibble() %>%
dplyr::filter(usage_point == UP) %>%
dplyr::select(-c(contract_account, usage_point, account_class, readtime_date)) %>%
tidyr::pivot_longer(cols = !weekpart, names_to = "hour", values_to = "consumption") %>%
ggplot(aes(x = hour, y = consumption, fill = weekpart, color = weekpart)) +
geom_point(alpha = 0.5) +
scale_x_discrete(name = "Hour", labels = 0:23) +
ggtitle("A randomly sampled location")While I don’t investigate the nuance of weekday vs weekend in this research I do think it would be valuable to consider that perhaps the algorithm should be applied to both weekday and weekends separately to best capture the consumption behavior. The difference in consumption between weekday and weekend is likely a phenomenon among residential customers as well.
Another thing to consider is the size of the data. For this particular randomly sampled location there are 1,400 data points for each hour (i.e., 1,400 rows by 24 columns). That's a lot of data! In the next section I'll discuss how to shrink the size of the data.
The previous section illustrated that there is a lot of data and that consumption can vary greatly by hour. One way to reduce the variability is to summarize the data. For this reason, and for practicality, the data was summarized.
The method used to summarize the data depends on the nature of the data. Each hour contains a history of past meter reads. Each hour is a distribution of consumption (kWh). If the hourly distributions are bell-shaped then the mean is an efficient and accurate representation of centrality of the distribution. If, however, the hourly distributions are skewed, then the median is a better way to summarize the data.
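As a quick, self-contained illustration (simulated data, not meter reads): in a right-skewed sample the mean is pulled toward the long tail while the median stays near the bulk of the observations.

# Toy example with simulated right-skewed data (not meter data)
set.seed(42)
skewed_sample <- rlnorm(n = 1000, meanlog = 1, sdlog = 1)

mean(skewed_sample)    # pulled upward by the long right tail
median(skewed_sample)  # stays near the center of the bulk of the data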
The graphic below plots the distributions of certain hours. We will use this example to gauge whether the median or the mean is more appropriate.
density_data <-
cniData %>%
dplyr::filter(usage_point == UP) %>%
#dplyr::select(cons_int_0, cons_int_13) %>%
dplyr::collect()
hr0_density <- density(density_data$cons_int_0, from = 0)
hr5_density <- density(density_data$cons_int_5, from = 0)
hr13_density <- density(density_data$cons_int_13, from = 0)
hr20_density <- density(density_data$cons_int_20, from = 0)
plot(
x = hr0_density$x,
y = hr0_density$y,
main = "KDE: Consumption (kWh) for a single Location",
ylab = "Density",
xlab = "Consumption",
ylim = c(0, max(hr0_density$y, hr13_density$y, hr5_density$y, hr20_density$y)),
xlim = c(0, max(hr0_density$x, hr13_density$x, hr5_density$x, hr20_density$x)),
col = "darkorange",
lwd = 2,
type = "l"
)
lines(
x = hr13_density$x,
y = hr13_density$y,
col = "navy",
lwd = 2
)
lines(
x = hr20_density$x,
y = hr20_density$y,
col = "firebrick",
lwd = 2
)
lines(
x = hr5_density$x,
y = hr5_density$y,
col = "gray20",
lwd = 2
)
legend("topright",
legend = c("0 Hour", "05 Hour", "13 Hour", "20 Hour"),
col = c("darkorange", "navy"),
lty = c(1, 1, 1, 1),
lwd = c(2, 2, 2, 2),
bty = "n"
)While some of the hourly distributions in the graphic above appear bell-shaped enough, I still prefer the median just as a precaution.
As a final sanity check, for the same location as above, the visual below shows how the median and mean differ. Also plotted is a measure of spread for each: the median absolute deviation (MAD) for the median and the standard deviation for the mean.
summaryVizData <-
cniData %>%
dplyr::as_tibble() %>%
dplyr::filter(usage_point == UP) %>%
dplyr::select(-c(contract_account, usage_point, account_class, readtime_date)) %>%
tidyr::pivot_longer(cols = !weekpart, names_to = "hour", values_to = "consumption") %>%
dplyr::group_by(hour) %>%
summarise(
median = median(consumption), MAD = mad(consumption),
mean = mean(consumption), StDv = sd(consumption)
)
vizMean <-
summaryVizData %>%
ggplot(aes(x = hour, y = mean)) +
geom_point(size = 4, color = "navy") +
scale_x_discrete(name = "Hour", labels = 0:23)
vizMedian <-
summaryVizData %>%
ggplot(aes(x = hour, y = median)) +
geom_point(size = 4, color = "darkred") +
scale_x_discrete(name = "Hour", labels = 0:23)
vizSd <-
summaryVizData %>%
ggplot(aes(x = hour, y = StDv)) +
geom_point(size = 4, color = "navy") +
scale_x_discrete(name = "Hour", labels = 0:23)
vizMad <-
summaryVizData %>%
ggplot(aes(x = hour, y = MAD)) +
geom_point(size = 4, color = "darkred") +
scale_x_discrete(name = "Hour", labels = 0:23)
ggarrange(vizMedian, vizMean,
  vizMad, vizSd,
  labels = c("Median", "Mean", "MAD", "Standard Deviation"),
  ncol = 2, nrow = 2
)

Both groups of summary statistics tell a similar story.
With respect to the algorithm, it does not matter whether the data is summarized via the mean or the median; either way, the algorithm will do its job. Which summary statistic to choose is more a matter of statistical philosophy. For this research, I chose the median.
FDA will help us create a smooth curve from discrete data, such as hourly consumption (kWh) data. I will test two approaches to estimate the smooth curve: the Fourier basis and cubic splines.
The Fourier basis is a series of cosine and sine waves. This is an old technique that has been found to be particularly useful in the natural world, from engineering to physics, because it fits periodic data well.
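Concretely, with period \(T\) (here \(T = 24\) hours) and \(\omega = \frac{2\pi}{T}\), a curve built from \(K\) Fourier basis functions (with \(K\) odd) takes the form:

\[
f(t) = c_0 + \sum_{k=1}^{(K-1)/2}\left[c_{2k-1}\sin(k\omega t) + c_{2k}\cos(k\omega t)\right]
\]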
From this article online:
A cubic spline is a piecewise cubic function that interpolates a set of data points and guarantees smoothness at the data points
Essentially, a cubic spline is a squiggly line that connects the dots.
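In symbols, between consecutive knots \(t_i\) and \(t_{i+1}\) the spline is just a cubic polynomial, and the pieces are stitched together so that the function and its first two derivatives are continuous at every knot:

\[
S_i(t) = a_i + b_i(t - t_i) + c_i(t - t_i)^2 + d_i(t - t_i)^3, \qquad t_i \le t \le t_{i+1},
\]

with \(S_i(t_{i+1}) = S_{i+1}(t_{i+1})\), \(S_i'(t_{i+1}) = S_{i+1}'(t_{i+1})\), and \(S_i''(t_{i+1}) = S_{i+1}''(t_{i+1})\).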
Every recipe I know of starts by prepping the ingredients. And so it is too in statistics. Before getting started the data needs to be collected and prepped.
What's more, the Fourier basis and cubic splines also need to be prepped beforehand. Each has a series of parameters that need to be initialized. To begin, I have arbitrarily set the number of basis functions to 9, and every location will use the same fixed set of parameters. Later on, I will tune these settings so that each location has its own optimal values.
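The chunks that follow rely on a handful of objects prepared in an earlier, hidden chunk (range_eval, eval_arg, num_ofSplines, fdNames, and the summarized matrix fdaData_cni). Their exact values are not shown in this report; the sketch below uses plausible placeholder values so the later chunks are easier to follow, and these specific values are assumptions rather than the production settings.

# Sketch of the hidden setup assumed by the chunks below (values are assumptions)
range_eval    <- c(0, 24)       # the curve is defined over a 24-hour day
eval_arg      <- 0:23 + 0.5     # evaluate at the midpoint of each hourly interval
num_ofSplines <- 9              # arbitrary starting number of basis functions
fdNames       <- list("Hour", "Location", "Consumption (kWh)")

# fdaData_cni is assumed to be a 24 x 100 matrix of summarized (median) hourly
# consumption, one column per location, built from cniData in a hidden chunk.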
A neat property of the Fourier basis is how it handles periodic data. For our use, the reader will notice that the smooth curve estimated by the Fourier basis repeats; that is, the curve loops, so the 23:59 estimate transitions smoothly into the 00:00 estimate. This property has uses beyond the work in this report. For example, this method can be used to estimate the hourly seasonality in a time series model, which can then be used to forecast future consumption.
Finally, the Fourier basis has easily computable derivatives because it is constructed from cosine and sine functions. In a following section I will show how we could use a derivative to smooth the estimated curve; a quick sketch of evaluating derivatives is shown below.
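A brief sketch of what that looks like with the fda package (using the testFD object created in the next chunk): passing an integer to the Lfdobj argument of fda::eval.fd() returns the corresponding derivative of the fitted curve.

# Sketch: evaluate the fitted curve and its derivatives (assumes testFD and
# eval_arg, which are created in the surrounding chunks)
curve_hat <- fda::eval.fd(evalarg = eval_arg, fdobj = testFD)              # f(t)
velocity  <- fda::eval.fd(evalarg = eval_arg, fdobj = testFD, Lfdobj = 1)  # f'(t)
accel     <- fda::eval.fd(evalarg = eval_arg, fdobj = testFD, Lfdobj = 2)  # f''(t)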
fdaFourier_basis <- fda::create.fourier.basis(rangeval = range_eval, nbasis = num_ofSplines)
fdaFourier_matrix <- fda::eval.basis(eval_arg, fdaFourier_basis)
basiscoef <- solve(
crossprod(fdaFourier_matrix),
crossprod(fdaFourier_matrix, fdaData_cni)
)
testFD <-
fda::fd(
coef = basiscoef,
basisobj = fdaFourier_basis,
fdnames = fdNames
)

Cubic splines are a great way to produce curves. However, as will be shown below, there are caveats when using cubic splines. The biggest drawback of using cubic splines with consumption (kWh) data is that there are natural bounds in the data; for example, a location can't have negative usage. Cubic splines could, in theory, produce a curve that extends beyond the possible range of values of the data.
fdaBsplies_basis <- fda::create.bspline.basis(range_eval, norder = 4, nbasis = num_ofSplines)
fdaBsplies_matrix <- fda::eval.basis(eval_arg, fdaBsplies_basis)
basiscoef_b <- solve(
crossprod(fdaBsplies_matrix),
crossprod(fdaBsplies_matrix, fdaData_cni)
)
testFD_b <-
fda::fd(
coef = basiscoef_b,
basisobj = fdaBsplies_basis,
fdnames = fdNames
)

The plot below illustrates the 100 estimated smooth curves via the Fourier basis and cubic splines. It's difficult to see all 100 curves because consumption varies greatly among locations; certain locations have nearly 500 times the consumption (kWh) of other locations.
par(mfrow = c(1, 2))
plot(testFD,
lty = 1,
lwd = 2,
col = 1,
xlim = c(0, 23)
)
mtext("Fourier Basis")
plot(testFD_b,
lty = 1,
lwd = 2,
col = 1,
xlim = c(0, 23)
)
mtext("Cubic Splines")Since the above is a jumbled mess, to continue to investigate further I have randomly sampled 10 locations to zoom into since we can’t see all the curves at once.
As the following visual will illustrate, the Fourier basis fits the data better than the cubic splines using the same hard-coded starting parameters. The Fourier basis curve is smoother and provides a more reasonable fit at the ends, that is, the 0 and 23 hours. Thus, from here onward only the Fourier basis will be considered.
par(mfrow = c(k2, 2))
for (i in seq_along(usage_points_toPlot)) {
# Fourier basis
fda::plotfit.fd(fdaData_cni[, usage_points_toPlot[i]],
eval_arg,
testFD[usage_points_toPlot[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis")
# Cubic Splines
fda::plotfit.fd(fdaData_cni[, usage_points_toPlot[i]],
eval_arg,
testFD_b[usage_points_toPlot[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Cubic Splines")
}

The parameter nbasis defines how many basis functions are used to create the Fourier basis model. Too few and the model will underfit the data, while too many and the model will overfit. In the beginning I arbitrarily set this parameter to 9.
In any optimization procedure, what is to be optimized over needs to be defined. For example, if your goal were to bake the best cupcake, how would you define best: flavor, size, appearance, etc.? Luckily, our objective function has already been defined. From the book by Ramsay and Silverman (Functional Data Analysis), we will use the following equation to obtain an unbiased estimate of the sampling variance: \(s^2 = \frac{1}{n-K}\sum^n_{j=1}(y_j-\hat{y}_j)^2\)
The following example will help to illustrate what the optimization procedure will accomplish:

The procedure will loop through the possible numbers of basis functions (the parameter K) until the slope is less than delta (\(\delta\)). The algorithm then stops and chooses the first K whose slope falls within +/- \(\delta\) of zero. Since the chosen value of K depends on the value of \(\delta\), \(\delta\) is set to a small value, \(\delta = 0.1\), to help ensure that the chosen value of K does not overfit the data.

The goal behind selecting K as the first value below the delta threshold is to avoid overfitting. The function will tend to overfit as K increases, while too small a value of K will underfit the data. In the next section a smoothing technique will be introduced that will help better fit the function to the data.

After some experimentation, I encountered an issue: the slopes varied wildly across locations, and I struggled to come up with a general solution that could apply to all of them. To rectify this, I standardized the estimates of the sampling variance (\(s^2\)) so that every location would be on a similar scale. I refer to this standardized statistic as \(\tilde{s}^2\). The slopes of \(\tilde{s}^2\) were much more stable, and I was able to create a general solution that could apply to all locations. Further research is required to determine whether this is a good solution.
# sources:
# Functional Data Analysis
# Functional Data Analysis with R and Matlab
# object to house visuals
viz_list <- vector(mode = "list", length = k2)
viz_list_slope <- viz_list
# object of optimal K for k2 usage points
optimal_K <- rep(NA, k2)
# Create max(K_i) basis functions
fbasis <- fda::create.fourier.basis(rangeval = range_eval, nbasis = max(nbasis_grid), period = diff(range_eval))
bvals <- fda::eval.basis(evalarg = eval_arg, basisobj = fbasis)
# LOOP: Usage Point (i)
for (i in seq_len(k2)) {
UP_data <- fdaData_cni[, usage_points_toPlot[i]]
# Create blank matrix to house results
## Column 1: K: Number of basis functions
## Column 2: Sigma: Estimate of sampling variance
UP_optim_stats_matrix <- matrixStats::allocMatrix(nrow = length(nbasis_grid), ncol = 2)
colnames(UP_optim_stats_matrix) <- c("K", "Sigma")
# LOOP: values of K
for (j in seq_along(nbasis_grid)) {
# Parameter K, the number of basis functions
k <- nbasis_grid[j]
# Fourier basis with K basis functions
tmpBasis <- bvals[, 1:k]
# Fit linear model: multiple linear regression
tmpLM <- lm(UP_data ~ tmpBasis - 1)
# predicted values
yhat <- tmpLM$fitted.values
# squared errors
squared_errors <- (UP_data - yhat)**2
# unbiased estimate of the sampling error
## assuming that the standard model for error is accepted
sigma <- (1 / (24 - k)) * sum(squared_errors)
# populate the matrix of results
## K
UP_optim_stats_matrix[j, 1] <- k
## Sigma
UP_optim_stats_matrix[j, 2] <- sigma
}
# Create summary matrix
summary_matrix_UP <-
UP_optim_stats_matrix %>%
# convert to tibble for dplyr functions
dplyr::as_tibble() %>%
dplyr::mutate(Sigma_stdz = scale(Sigma)) %>%
dplyr::mutate(Slope = (Sigma_stdz - dplyr::lag(Sigma_stdz)) / 2)
# Choosing K
optimal_K[i] <-
summary_matrix_UP %>%
dplyr::filter(Slope <= machine_precision_zero + delta & Slope >= machine_precision_zero - delta) %>%
dplyr::filter(abs(Sigma_stdz) <= 1) %>%
dplyr::slice_head() %>%
dplyr::select(K) %>%
dplyr::mutate(K = base::ifelse(K - 2 == 3, 5, K)) %>%
as.numeric()
# Plot: K2 UP
## Sigma hat
viz_list[[i]] <-
summary_matrix_UP %>%
ggplot(aes(x = K, y = Sigma_stdz)) +
geom_point(color = "darkorange", size = 2, alpha = 0.5) +
geom_line(color = "darkorange") +
ylab(NULL) +
ggtitle(paste("Randomly Sampled Usage Point", i, sep = " "),
subtitle = "Estimate of Standardized Sampling Variance"
) +
scale_y_continuous(limits = c(-3, 3))
## slope of sigma hat
viz_list_slope[[i]] <-
summary_matrix_UP %>%
dplyr::filter(Slope <= machine_precision_zero + delta & Slope >= machine_precision_zero - delta) %>%
dplyr::filter(abs(Sigma_stdz) <= 1) %>%
ggplot(aes(x = K, y = Slope)) +
geom_point(color = "navy", size = 2, alpha = 0.5) +
geom_abline(aes(slope = 0, intercept = machine_precision_zero + delta)) +
geom_abline(aes(slope = 0, intercept = machine_precision_zero - delta)) +
ylab(NULL) +
ggtitle(paste("Randomly Sampled Usage Point", i, sep = " "),
subtitle = paste("Slope:", paste("Optimal K value is", optimal_K[i], sep = " "), sep = " ")
) +
scale_y_continuous(limits = c(-delta, delta))
}
# plot all k2 visuals
ggarrange(viz_list[[1]], viz_list_slope[[1]],
viz_list[[2]], viz_list_slope[[2]],
viz_list[[3]], viz_list_slope[[3]],
viz_list[[4]], viz_list_slope[[4]],
viz_list[[5]], viz_list_slope[[5]],
viz_list[[6]], viz_list_slope[[6]],
viz_list[[7]], viz_list_slope[[7]],
viz_list[[8]], viz_list_slope[[8]],
viz_list[[9]], viz_list_slope[[9]],
viz_list[[10]], viz_list_slope[[10]],
ncol = 2,
nrow = 10
)

Now that the optimal value of K has been selected for each of the randomly sampled usage points, in this section the function will be re-created using the optimal value of K. The following visual compares the original fit, with the static and arbitrarily chosen K, to the fit with the optimal K for each usage point.
optimalK_fd_list <- vector("list", length = k2)
for (i in seq_len(k2)) {
fbasis_optimal <- fda::create.fourier.basis(rangeval = range_eval, nbasis = optimal_K[i])
fbasis_matrix_optimal <- fda::eval.basis(eval_arg, fbasis_optimal)
fbasis_coef_optimal <- solve(
crossprod(fbasis_matrix_optimal),
crossprod(fbasis_matrix_optimal, fdaData_cni[, usage_points_toPlot[i]])
)
fbasis_FD_optimal <- fda::fd(
coef = fbasis_coef_optimal,
basisobj = fbasis_optimal,
fdnames = fdNames
)
# out
optimalK_fd_list[[i]] <- fbasis_FD_optimal
}
par(mfrow = c(k2, 2))
for (i in seq_len(k2)) {
# Fourier Basis: original
fda::plotfit.fd(
y = fdaData_cni[, usage_points_toPlot[i]],
argvals = eval_arg,
fdobj = testFD[usage_points_toPlot[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis: Original")
# Fourier Basis: optimal
fda::plotfit.fd(
y = fdaData_cni[, usage_points_toPlot[i]],
argvals = eval_arg,
fdobj = optimalK_fd_list[[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis: Optimal")
}

Initially, the Fourier basis was created with a fixed parameter. Then, an optimization procedure was used to re-build the Fourier basis with an optimal K parameter. In this section another parameter, lambda (\(\lambda\)), will be selected via another optimization procedure.
The previously created curve will be smoothed to better fit the data via a roughness penalty based on the harmonic acceleration operator, with the penalty weight \(\lambda\) chosen by generalized cross-validation (GCV).
The harmonic acceleration operator is defined as \(L = w^2D + D^3\), where \(w = \frac{2\pi}{T}\) and \(T\) defines the period. In this case \(T\) = 24 hours.
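Following Ramsay and Silverman, the smoothed curve minimizes a penalized sum of squares: the first term rewards fidelity to the meter reads, the second term penalizes roughness as measured by the harmonic acceleration operator \(L\) (which is zero for a constant plus a single sinusoid of period \(T\)), and \(\lambda\) controls the trade-off.

\[
\text{PENSSE}_\lambda(x) = \sum_{j}\left[y_j - x(t_j)\right]^2 + \lambda \int \left[Lx(t)\right]^2 \, dt
\]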
# define w
W <- (2 * pi) / diff(range_eval)
# specify the constants
# p.55 in Functional Data Analysis with R and Matlab
harmonic_acceleration <- c(0, W**2, 0)
# create the linear differential operator
harmacc_ldf <- fda::vec2Lfd(harmonic_acceleration, rangeval = range_eval)

In the procedure below, the goal is to estimate the optimal value of the roughness penalty weight, \(\lambda\). The optimal value is the one that minimizes the GCV criterion, where GCV stands for generalized cross-validation. Something to note is the danger of a boundary solution, which occurs when the minimum GCV sits at either end of the range of lambda values the procedure searches over. In that case, I have hard-coded \(\log_{10}(\lambda)\) to be 0. I chose 0 because \(10^0 = 1\), and \(\lambda = 1\) introduces almost no penalty.
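For reference, the GCV criterion returned by fda::smooth.basis() takes the familiar form below, where \(n\) is the number of observations (here 24 hourly values), \(\text{df}(\lambda)\) is the effective degrees of freedom of the smoother, and \(\text{SSE}(\lambda)\) is the residual sum of squares; this follows Ramsay and Silverman's presentation.

\[
\text{GCV}(\lambda) = \left(\frac{n}{n - \text{df}(\lambda)}\right)\left(\frac{\text{SSE}(\lambda)}{n - \text{df}(\lambda)}\right)
\]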
# parallel: export
lambdas_seq <- seq(-10, 10, 0.1)
# define function
parallel_fda <- function(lambdas, data, ID, fourier_basis, ldf) {
# declare output
# out = rep(NA, length(lambdas))
out <- matrixStats::allocMatrix(nrow = length(lambdas), ncol = 1, value = NA_real_)
colnames(out) <- ID
# fda per lambda
for (i in seq_along(lambdas)) {
lambda <- 10**(lambdas[i])
fdParObj <- fda::fdPar(
fourier_basis,
ldf,
lambda
)
smoothed <- fda::smooth.basis(
argvals = 0:23 + 0.5,
y = data,
fdParobj = fdParObj
)
out[i, 1] <- sum(smoothed$gcv)
}
return(out)
}
# establish parallel architecture
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
roughness_penalty_GCV_matrix <-
foreach::foreach(
i = seq_along(usage_points_toPlot),
.export = c("fdaData_cni", "optimalK_fd_list", "harmacc_ldf"),
.packages = c("fda"),
.combine = cbind
) %dopar% {
# GCV values for lambda
parallel_fda(
lambdas = lambdas_seq,
data = fdaData_cni[, usage_points_toPlot[i]],
ID = colnames(fdaData_cni)[usage_points_toPlot][i],
fourier_basis = optimalK_fd_list[[i]],
ldf = harmacc_ldf
)
}
# stop parallel
parallel::stopCluster(cl)
# plot
tmpSD <- apply(roughness_penalty_GCV_matrix, 2, sd)
par(mfrow = c(5, 2))
for (i in seq_len(k2)) {
plot(
x = lambdas_seq,
y = roughness_penalty_GCV_matrix[, i],
type = "l",
ylab = "GCV",
xlab = expression("log"[10] * (lambda)),
main = paste("Randomly Sampled Location", i, sep = " ")
)
points(
x = lambdas_seq[which.min(roughness_penalty_GCV_matrix[, i])],
y = min(roughness_penalty_GCV_matrix[, i]),
col = "navy",
pch = 19
)
}

# choose optimal lambda
optimal_lambda_vector <- apply(
roughness_penalty_GCV_matrix,
2,
function(x) {
tmp <- lambdas_seq[which.min(x)]
if (tmp == max(lambdas_seq) || tmp == min(lambdas_seq)) {
0
} else {
tmp
}
}
)

Now that the optimal values for the K parameter and the roughness penalty \(\lambda\) have been computed, I will re-create the curve with the optimal values and, as before, compare it to the original estimate with arbitrarily chosen parameter values.
list_optimal_k_lambda <-
lapply(
seq_along(usage_points_toPlot),
function(i) {
fdParObj_tmp <- fda::fdPar(optimalK_fd_list[[i]], harmacc_ldf, 10**optimal_lambda_vector[[i]])
smoothed <- fda::smooth.basis(
argvals = eval_arg,
y = fdaData_cni[, usage_points_toPlot[i]],
fdParobj = fdParObj_tmp
)
fd_tmp <- smoothed$fd
fd_tmp$names <- fdNames
fd_tmp
}
)

par(mfrow = c(k2, 2))
for (i in seq_len(k2)) {
# Fourier Basis: original
fda::plotfit.fd(
y = fdaData_cni[, usage_points_toPlot[i]],
argvals = eval_arg,
fdobj = testFD[usage_points_toPlot[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis: Original")
# Fourier Basis: optimal
fda::plotfit.fd(
y = fdaData_cni[, usage_points_toPlot[i]],
argvals = eval_arg,
fdobj = list_optimal_k_lambda[[i]],
lty = 1,
lwd = 2,
main = paste("Randomly Sampled Location", i, sep = " ")
)
mtext("Fourier Basis: Optimal K and Lambda")
}

In the plots below, the line represents the estimated one-minute-interval consumption while the dots represent the actual hourly meter reads. This illustrates that Functional Data Analysis, using the Fourier basis, is capable of producing estimates of consumption for any interval of time.
tmpMin <- 1
tmpSeq <- seq(0, 23, tmpMin / 60)
par(mfrow = c(5, 2))
for (i in seq_len(k2)) {
tmpPred <- predict(list_optimal_k_lambda[[i]], newdata = tmpSeq)
tmpY <- fdaData_cni[, usage_points_toPlot[i]]
plot(
x = tmpSeq,
y = tmpPred,
type = "l",
col = "navy",
main = paste("Randomly Sampled Location", i, sep = " "),
ylab = "Consumption",
xlab = "Time",
lwd = 2,
ylim = c(min(tmpPred, tmpY), max(tmpPred, tmpY))
)
points(
x = 0:23,
y = tmpY,
col = "darkorange",
pch = 16,
cex = 1.5
)
lines(
x = 0:23,
y = tmpY,
col = "darkorange"
)
}

In regression analysis it is customary to check the residuals to see that they meet the assumptions of a regression model. Checking residuals for constant variance and normality is common practice to validate a model. Below are plots to visualize the normality of the residuals and to check whether the residuals appear to be a random scatter of points.
The constant variance assumption may be questionable since some of the residual plots appear to have a pattern. However, according to the QQ Plots most of the randomly sampled locations meet the normality assumption.
To assess normality I used the Shapiro-Wilk test. The test should provide a reasonable assessment of normality because the sample size for each location is n = 24; the Shapiro-Wilk test can be sensitive to large sample sizes. The null hypothesis \(H_0\) for the Shapiro-Wilk test is that the data is normally distributed. So, assuming the usual default level of significance, \(\alpha = 0.05\), if the p-value is greater than the stated level of significance we fail to reject the null and can treat the data as approximately normally distributed.
par(mfrow=c(10, 2))
for(i in seq_len(ncol(residuals))){
plot(residuals[,i],
type = "p",
main = paste("Randomly Sampled Location", i, sep = " "),
ylab = "Consumption",
xlab = "Time",
pch = 16,
col = "darkorange",
ylim = c(-abs(max(residuals[,i])), abs(max(residuals[,i])))
)
abline(h = 0, lty = 2, col = "gray50")
qqnorm(residuals[,i])
qqline(residuals[,i])
shapiro.test(residuals[,i])[["p.value"]] %>%
round(x = ., digits = 4) %>%
paste("Shapiro Wilks P.value:", ., sep = " ") %>%
mtext(., cex = 0.8, line = -1)
}

In this assessment visual the estimated consumption is plotted on the x-axis while the actual consumption is on the y-axis. If the model is a good fit, we should see a linear scatter of points from the bottom left corner to the top right corner.
par(mfrow=c(5, 2))
for(i in seq_len(ncol(residuals))){
tmpAxisMax = max(matrix_yhat_hourly[,i], fdaData_cni[, usage_points_toPlot[i]])
tmpAxisMin = min(matrix_yhat_hourly[,i], fdaData_cni[, usage_points_toPlot[i]])
plot(x = matrix_yhat_hourly[,i],
y = fdaData_cni[, usage_points_toPlot[i]],
type = "p",
main = paste("Randomly Sampled Location", i, sep = " "),
ylab = "Consumption",
xlab = "Estimated Consumption",
pch = 16,
col = "darkorange"
,ylim = c(tmpAxisMin, tmpAxisMax)
,xlim = c(tmpAxisMin, tmpAxisMax)
)
abline(a = 0, b = 1, lty = 2, col = "gray50")
}

The MSE (\(\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2\)) will be used to assess how well the estimated curves fit the hourly consumption data. A perfect fit would be MSE = 0, and as MSE increases the fit of the model worsens. Also, scale matters, so locations with higher consumption will have higher MSE values relative to locations with lower consumption.
The MSE values can't be used to compare the fit of one location to another. Instead, I propose saving the MSE values so that, later on, if another method to estimate consumption (kWh) is devised, the fit from the Fourier basis can be compared to the other method by location.
data.frame(
Location = paste("Randomly Sampled Location", seq_len(k2), sep = " "),
MSE = colMeans(SSE) %>% round(., digits = 4),
row.names = NULL
) %>%
knitr::kable(., format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)

| Location | MSE |
|---|---|
| Randomly Sampled Location 1 | 0.0146 |
| Randomly Sampled Location 2 | 2.4491 |
| Randomly Sampled Location 3 | 0.0785 |
| Randomly Sampled Location 4 | 0.0697 |
| Randomly Sampled Location 5 | 0.0082 |
| Randomly Sampled Location 6 | 0.6002 |
| Randomly Sampled Location 7 | 1.1139 |
| Randomly Sampled Location 8 | 0.0200 |
| Randomly Sampled Location 9 | 0.0508 |
| Randomly Sampled Location 10 | 13.9027 |
In this view we can see whether the median is an appropriate measure of centrality. If the estimated curve is somewhere near the middle of all the points for each hour, then the median will be close to the mean, which suggests that the hourly consumption distributions are symmetric. If, however, the estimated curve is not in the middle, then the mean and the median will differ, which suggests that the hourly distributions are skewed.
Based on this research I believe that the median better captures the “average” behavior of each location. This also implies that the MAD would be a better representation of dispersion than the standard deviation.
tmpUPs <- (colnames(fdaData_cni) %>% unique())[usage_points_toPlot]
par(mfrow = c(5, 2))
for (i in seq_along(usage_points_toPlot)) {
tmpPlotData <-
cniData %>%
dplyr::filter(usage_point == tmpUPs[i]) %>%
dplyr::select(dplyr::starts_with("cons_int")) %>%
dplyr::collect()
for (j in 0:23) {
if (j == 0) {
plot(
x = rep(j, nrow(tmpPlotData)),
y = tmpPlotData[[j + 1]],
type = "p",
col = "darkorange",
xlim = c(0, 23),
main = paste("Randomly Sampled Location", i, sep = " "),
ylab = "consumption",
xlab = "Time",
ylim = c(0, max(tmpPlotData))
)
} else {
points(
x = rep(j, nrow(tmpPlotData)),
y = tmpPlotData[[j + 1]],
col = "darkorange"
)
}
}
lines(x = 0:23, y = matrix_yhat_hourly[, i], col = "navy", lwd = 2)
}

In conclusion, using FDA I have shown that it is possible to create a smooth estimate of consumption (kWh), which will allow Consumers Energy to interpolate the average consumption for any given time.
With this research I have shown that it is possible to create a continuous estimate of average consumption (kWh) behavior. The results of this work will allow us to make educated guesses at the average consumption behavior in between the actual meter reads.
As a follow-up to this initial proof-of-concept (POC) research, I want to address the question: how do we know whether the algorithm is accurate? Luckily, there exist meters that provide consumption data every 15 minutes. To test the model, I will select a sample of these locations and, to mirror the data from the remaining meters, aggregate their reads by hour. Then I will process the hourly data through the algorithm and produce 15-minute interval estimates. Finally, I will compare the actual 15-minute consumption to the estimated consumption. If all goes well, the difference between the actual values and the predicted values will be minimal and acceptable. A rough sketch of that validation loop is shown below.
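Every object and function name in the sketch below (interval_15min, fit_location_curve, the column names) is a placeholder for illustration, not something that exists in this report.

# Sketch of the planned validation against 15-minute interval meters.
# interval_15min, fit_location_curve(), and the column names are placeholders.
library(dplyr)

# 1. Aggregate the actual 15-minute reads up to hourly totals (one location)
hourly_actuals <- interval_15min %>%
  dplyr::mutate(hour = floor(read_time_hours)) %>%
  dplyr::group_by(hour) %>%
  dplyr::summarise(consumption = sum(consumption), .groups = "drop")

# 2. Fit the Fourier-basis curve to the hourly aggregates (the algorithm above)
fd_fit <- fit_location_curve(hourly_actuals)

# 3. Predict consumption back on the 15-minute grid
grid_15 <- seq(0, 23.75, by = 0.25)
predicted_15 <- fda::eval.fd(evalarg = grid_15, fdobj = fd_fit)

# 4. Compare predictions to the held-out 15-minute actuals (ordered by time)
mse_15 <- mean((interval_15min$consumption - as.numeric(predicted_15))^2)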
The culmination of this project would be to produce a business-ready algorithm that can:
Sources, Citations
Technology