The Power of Functional Data Analysis

Unlocking Patterns in Energy Data

Author

Santiago Rodriguez

Code
# R libraries for FDA, clustering, and plotting
library(fda)
library(fda.usc)
library(funFEM)
library(ggplot2)
library(ggtext)
library(ggrepel)

# clustergram function sourced from GitHub (Galili 2010)
source("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")

Introduction

As utilities and energy companies collect more and more customer data, the challenge of analyzing and understanding that data becomes increasingly complex. One powerful approach to this challenge is functional data analysis. In this article, we will explore how functional data analysis can be used to describe customer behavior, how that information can be used for feature engineering, and how it can ultimately drive better business outcomes.

The objectives of this article are threefold: to provide a gentle, non-technical introduction to functional data analysis, demonstrate an application using functional data analysis to make inferences about customer behavior, and showcase how such inferences can be leveraged as features in other processes, such as predictive models.

What Is Functional Data Analysis?

I was introduced to functional data analysis in graduate school by a professor. One day while studying GAMs (Generalized Additive Models) and splines, the professor suggested that anyone interested in learning more should explore the works of professors Ramsay and Silverman. I was curious and decided to purchase the authors’ books. In doing so I discovered a new world of statistics and methods of analysis that were very applicable to modeling energy data.

But, what is functional data analysis anyway?

According to Wikipedia, functional data analysis, which is often abbreviated as FDA, “is a branch of statistics that analyses data providing information about curves, surfaces or anything else varying over a continuum.”

According to the ever-popular ChatGPT, “FDA is a statistical approach that analyzes data represented as functions over a continuous domain, aiming to extract information by modeling the functional relationships between variables. FDA is useful for analyzing complex data structures subject to noise and measurement error, with applications in fields such as time series analysis, image analysis, and bioinformatics.”

Finally, I define functional data analysis as a non-parametric regression technique used to approximate a curve or function via a linear combination of basis functions. FDA provides methods to analyze information in the form of curves or functions.

Example of FDA. Image by author

The figure above is a helpful illustration taken from a presentation I gave at an energy-professionals meetup. The 24 red data points represent the mean consumption by hour for a period of time, say the last two years. The black curve was fitted to the points using functional data analysis.

A neat feature of the fitted curve is that it allows us to make inferences about points in time for which there is no data. For instance, there are only 24 data points, one observation per hour. But what if we’re interested in consumption between hours? Fret not, because the fitted curve is continuous and allows us to estimate what consumption was at 05:30, 11:15, 21:55, or any point in the range the curve was fitted to, which in this example is hours 0 through 23.
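
To make that concrete, here is a minimal sketch that is not the article’s data or settings: it fits a Fourier basis to 24 made-up hourly means and then evaluates the resulting curve at times between the original observations. The kwh_hourly values and the number of basis functions are placeholders.

Code
# minimal sketch with made-up data: fit a curve to 24 hourly means,
# then evaluate it at times between the observed hours
set.seed(1)
kwh_hourly <- runif(24, min = 0.1, max = 0.9) # placeholder hourly means

tmp_basis <- fda::create.fourier.basis(rangeval = c(0, 23), nbasis = 11, period = 24)
tmp_fit <- fda::smooth.basis(argvals = 0:23, y = kwh_hourly, fdParobj = tmp_basis)

# the fitted curve is continuous, so any time in [0, 23] can be evaluated
fda::eval.fd(c(5.5, 11.25, 21 + 55 / 60), purrr::pluck(tmp_fit, "fd"))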

An additional feature that sets FDA apart from other statistical techniques is that the fitted curves are differentiable. This means it’s possible to calculate the first and second derivatives of the curves, a capability with significant implications in applications from engineering to marketing. Initially, the importance of derivatives might not be immediately clear, but after reading this article you will understand why they matter and appreciate their potential.

The curve-fitting machinery of FDA is flexible, meaning there are many ways to fit curves to the data. For example, there are different basis functions - Fourier, B-splines, wavelets, etc. - and different optimization techniques, such as least squares or penalized methods. We won’t discuss the technical details of how functional data analysis works. However, I’d be remiss not to share some details: throughout this article I used a fully parameterized Fourier basis expansion, optimized with a penalized technique, to fit curves to the data. The Fourier basis is well suited for periodic data and it is easily differentiated.

Another fun tidbit about functional data analysis is that many traditional methods have functional counterparts. For instance, it’s possible to take the mean or median of fitted curves and to construct confidence intervals on the curves. There is also functional principal component analysis, and there are even curve clustering techniques to group FDA-fitted curves. In summary, FDA is feature rich and offers many avenues of exploration.
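
To give a flavor of those counterparts, here is a small sketch using simulated daily curves (none of this is the article’s data); the fda functions shown - mean.fd, sd.fd, and pca.fd - are the functional analogues of the mean, standard deviation, and PCA.

Code
# sketch with simulated data: functional analogues of familiar summaries
set.seed(1)
# 30 noisy daily profiles, one column per day
tmp_mat <- replicate(30, sin(2 * pi * (0:23) / 24) + rnorm(24, sd = 0.1))

tmp_basis <- fda::create.fourier.basis(rangeval = c(0, 23), nbasis = 11, period = 24)
tmp_fit <- fda::smooth.basis(argvals = 0:23, y = tmp_mat, fdParobj = tmp_basis)

mean_curve <- fda::mean.fd(purrr::pluck(tmp_fit, "fd")) # functional mean
sd_curve <- fda::sd.fd(purrr::pluck(tmp_fit, "fd")) # functional standard deviation
fpca <- fda::pca.fd(purrr::pluck(tmp_fit, "fd"), nharm = 2) # functional PCA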

I’m considering writing a series of posts that explore other topics related to FDA, such as curve clustering to analyze patterns across months, phase-plane plots to showcase how a classical engineering technique can be leveraged to analyze consumption behavior, a deeper dive into the mechanics of FDA, or even more advanced techniques such as regression analysis and predictive modeling using functional data analysis.

Lastly, for intrigued individuals, the aforementioned presentation can be viewed below or at this link.

The presentation from the video above can be accessed at this link.

Data Discussion

Code
# hourly summary
# - convert to int
df_energy_hourly <- dplyr::mutate(
  df_energy_raw_clean,
  hour_start = as.integer(hour_start)
)
# - group by
df_energy_hourly <- dplyr::group_by(
  df_energy_hourly,
  date,
  hour_start
)
# - summarize: mean
df_energy_hourly <- dplyr::summarise(
  df_energy_hourly,
  kwh = mean(kwh, na.rm = TRUE)
)
# - ungroup
df_energy_hourly <- dplyr::ungroup(df_energy_hourly)

# create date variables
df_energy_hourly <- create_date_variables(df_energy_hourly, date)

# create df for fda by hour
df_fda_hourly <- dplyr::select(
  df_energy_hourly,
  date, hour_start, kwh
)
# transform
df_fda_hourly <- tidyr::pivot_wider(
  df_fda_hourly,
  id_cols = hour_start,
  names_from = date,
  values_from = kwh
)
# drop the hour_start id column
df_fda_hourly <- dplyr::select(df_fda_hourly, -hour_start)

# create a nested df
# - by year, month
df_fda_nested_hourly <- purrr::map_df(
  .x = energy_years_factor,
  .f = function(year) {
    tmp <- dplyr::filter(df_energy_hourly, year_factor == year)
    tmp <- dplyr::select(tmp, year_factor, month_int, date, hour_start, kwh)
    tmp <- tidyr::nest(tmp, .by = c(year_factor, month_int))
    return(tmp)
  }
)

# data prep
df_fda_nested_hourly <- dplyr::mutate(
  df_fda_nested_hourly,
  # create date
  date = lubridate::make_date(
    year = as.character(year_factor),
    month = month_int,
    day = 1L
  ),
  # matrix (for fda)
  data_matrix = purrr::map(
    .x = data,
    .f = function(df) {
      out <- tidyr::pivot_wider(
        df,
        id_cols = hour_start,
        names_from = date,
        values_from = kwh
      )
      out <- dplyr::select(out, -hour_start)
      out <- as.matrix(out)
    }
  ),
  # matrix
  data_matrix_summarized = purrr::map(
    .x = data_matrix,
    .f = function(x) {
      out <- rowMeans(x, na.rm = TRUE)
      out <- matrix(out, ncol = 1)
    }
  ),
  # how many days in month
  days_in_month = purrr::map_dbl(
    .x = data_matrix,
    .f = function(x) ncol(x)
  )
)
Code
# keep date, time, kwh
df_fda_15min <- dplyr::select(
  df_energy_raw_clean,
  date, time, kwh
)

# transform
df_fda_15min <- tidyr::pivot_wider(
  df_fda_15min,
  id_cols = time,
  names_from = date,
  values_from = kwh
)

# drop the time column
df_fda_15min <- dplyr::select(df_fda_15min, -time)

# create a nested df
# - by year, month
df_fda_nested_15min <- tidyr::nest(
  .data = df_energy_raw_clean,
  .by = c(year_factor, month_int)
)

# data prep
df_fda_nested_15min <- dplyr::mutate(
  df_fda_nested_15min,
  # matrix
  data_matrix = purrr::map2(
    .x = year_factor,
    .y = month_int,
    .f = function(year_factor, month_int) {
      tmpMask <- which(
        lubridate::year(names(df_fda_15min)) == year_factor &
          lubridate::month(names(df_fda_15min)) == month_int
      )
      tmp <- df_fda_15min[tmpMask]
      as.matrix(tmp)
    }
  ),
  # matrix
  data_matrix_summarized = purrr::map(
    .x = data_matrix,
    .f = function(x) {
      out <- rowMeans(x, na.rm = TRUE)
      out <- matrix(out, ncol = 1)
    }
  ),
  days_in_month = purrr::map_dbl(
    .x = data_matrix,
    .f = function(x) ncol(x)
  ),
  nrow = purrr::map_dbl(
    .x = data_matrix,
    .f = function(x) nrow(x)
  ),
  date = lubridate::make_date(
    year = as.character(year_factor),
    month = month_int,
    day = 1L
  ),
)
Code
num_of_days <- purrr::map_dbl(
  .x = list(purrr::pluck(df_energy_raw, "date")),
  .f = function(x) as.numeric(difftime(max(x), min(x), units = "days"))
)
num_of_days <- format(num_of_days, big.mark = ",")

num_of_months <- format(nrow(df_fda_nested_hourly), big.mark = ",")

The data for this article represents my household’s energy consumption and consists of 15-minute interval meter readings. The period runs from 2021-01-16 to 2023-09-21, for a total of 978 days spanning 33 months.

Code
ggplot(df_energy_raw, aes(x = date, y = kwh)) +
  geom_line(color = "blue", linewidth = 1.5, alpha = 0.70) +
  xlab(NULL) +
  ylab(NULL) +
  ggtitle("Energy Consumption by 15-Minute Intervals") +
  theme(
    plot.title = element_text(size = 24, hjust = 0.5),
    panel.background = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  scale_x_date(date_breaks = "6 month", date_labels = "%b %Y") +
  scale_y_continuous(labels = scales::unit_format(unit = "kWh"))

Image by author

Data Transformations

During my time at a public utility, we employed a common technique to transform the raw meter readings into a more manageable format called the daily load profile. The daily load profile shows how electricity usage varies throughout the day, and it’s a useful way to summarize consumption.

Source Wikipedia

The daily load profile is versatile, as it can be summarized many different ways. For instance, it’s common to exclude weekends for commercial clients because most commercial entities consume the bulk of their energy Monday through Friday.

For our purposes, consider each day as a separate, distinct load profile. The 15-minute meter readings will be summarized by hour because doing so smooths the data and makes it easier to work with. We’ll see below that smoothing the data doesn’t have much of a negative effect. Then, days will be collected by month and year so that each month has at most as many observations as there are days in that month. Sometimes there are outages and no energy is consumed, or the meter can’t upload data to the energy provider. These show up as missing values in the data. For ease of analysis, any day with a missing value has been excluded.
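
That exclusion step isn’t shown in the code above, so here is a sketch of one way it could be done; it reuses the column names from this article’s data prep (df_energy_raw_clean, date, kwh) and is illustrative rather than the exact code used.

Code
# sketch: drop any day with at least one missing 15-minute reading
# (reuses this article's column names; illustrative only)
tmp_na_days <- dplyr::group_by(df_energy_raw_clean, date)
tmp_na_days <- dplyr::summarise(tmp_na_days, has_na = any(is.na(kwh)), .groups = "drop")
tmp_na_days <- dplyr::filter(tmp_na_days, has_na)
tmp_na_days <- purrr::pluck(tmp_na_days, "date")

df_energy_raw_clean <- dplyr::filter(df_energy_raw_clean, !date %in% tmp_na_days)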

The hierarchy of nesting days within months allows us to analyze each day separately or to analyze many days at once via a summary of the data, such as the mean of various days (e.g., the mean of all the days in January 2021). Below is an example of a daily load profile from a randomly chosen month.

Code
set.seed(1234)
tmpDate <- sample(unique(purrr::pluck(df_energy_raw, "date")), 1)

# data prep
tmp <- dplyr::filter(
  df_fda_nested_hourly,
  month_int == lubridate::month(tmpDate),
  as.integer(as.character(year_factor)) == lubridate::year(tmpDate)
)
tmp <- purrr::pluck(tmp, "data", 1)
tmp <- dplyr::filter(tmp, date == tmpDate)

# plot
ggplot(tmp, aes(x = hour_start, y = kwh)) +
  geom_point(color = "blue", size = 2, alpha = 0.70) +
  geom_line(color = "blue", linewidth = 1, alpha = 0.5) +
  xlab("Time") +
  ylab(NULL) +
  ggtitle(
    "Daily Load Profile",
    subtitle = tmpDate
  ) +
  theme(
    plot.title = element_text(size = 24, hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    panel.background = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  scale_x_continuous(breaks = (0:23)[which(0:23 %% 2 == 0)]) +
  scale_y_continuous(labels = scales::unit_format(unit = "kWh"))

Image by author
Code
rm(tmp, tmpDate)

Fitting the Curves

Code
# fda options
# - global
fda_T <- 24
fdaNames <- list("Time", "Date", "kWh")

# - 24 hours
num_ofBasis_functions <- 24
range_eval <- c(0, 23)
eval_arg <- 0:23
fourier_omega <- (2 * pi) / fda_T
harmonic_accerlation <- fda::vec2Lfd(c(0, fourier_omega**2, 0), range_eval)

# - 15 min
num_ofBasis_functions_15min <- purrr::pluck(df_fda_nested_15min, "nrow")
num_ofBasis_functions_15min <- max(num_ofBasis_functions_15min, na.rm = TRUE)
range_eval_15min <- c(0, 23.75)
eval_arg_15min <- seq(0, 23.75, 15 / 60)
fourier_omega_15min <- (2 * pi) / fda_T
harmonic_accerlation_15min <- fda::vec2Lfd(c(0, fourier_omega_15min**2, 0), range_eval_15min)
Code
# optimization: choose the best lambda (roughness penalty) by minimizing GCV
# - define search grid
# - exclude 0
# - lambdas1 is finer grid, takes longer to run
# - lambdas2 is coarse grid, good enough for the purpose of this article

# tmp_lambdas1 <- c(seq(1e-5, 1, 1e-3), seq(1, 1001, 1))
tmp_lambdas2 <- c(seq(1e-5, 1, 1e-2), seq(1, 1001, 10))
Code
# create fourier basis functions using fda
# - use fourier basis because fitted curves are continuous and differentiable
# - also, there is a certain cyclicality to the data

# - hourly
tmpBasis <- fda::create.fourier.basis(
  rangeval = range_eval,
  nbasis = num_ofBasis_functions,
  period = fda_T
)

# - 15 min
tmpBasis_15min <- fda::create.fourier.basis(
  rangeval = range_eval_15min,
  nbasis = num_ofBasis_functions_15min,
  period = fda_T
)
Code
# parallelization: start

# check if any server is online
if (!is.null(available_workers)) {
  # then use it/them
  future::plan(
    strategy = "cluster",
    workers = c(rep("localhost", num_cores), available_workers)
  )
} else {
  # else use this machine only
  future::plan(strategy = "multisession", workers = num_cores)
}

# hourly
df_fda_nested_hourly <- dplyr::mutate(
  df_fda_nested_hourly,
  lambdas = furrr::future_map(
    .x = data_matrix,
    .f = function(df) {
      vapply(
        X = tmp_lambdas2,
        FUN.VALUE = numeric(1),
        FUN = function(x) {
          step01_fdPar <- fda::fdPar(
            fdobj = tmpBasis,
            Lfdobj = harmonic_accerlation,
            lambda = x
          )
          step02_smooth <- fda::smooth.basis(
            0:23,
            df,
            step01_fdPar
          )
          out <- purrr::pluck(step02_smooth, "gcv")
          out <- sum(out, na.rm = TRUE)
        }
      )
    }
  )
)

# 15 min
df_fda_nested_15min <- dplyr::mutate(
  df_fda_nested_15min,
  lambdas = furrr::future_map(
    .x = data_matrix_summarized,
    .f = function(df) {
      vapply(
        X = tmp_lambdas2,
        FUN.VALUE = numeric(1),
        FUN = function(x) {
          step01_fdPar <- fda::fdPar(
            fdobj = tmpBasis_15min,
            Lfdobj = harmonic_accerlation_15min,
            lambda = x
          )
          step02_smooth <- fda::smooth.basis(
            seq(0, 23.75, 15 / 60),
            df,
            step01_fdPar
          )
          out <- purrr::pluck(step02_smooth, "gcv")
          out <- sum(out, na.rm = TRUE)
        }
      )
    }
  )
)

# house cleaning: parallelization: end
future::plan(strategy = "sequential")
Code
df_fda_nested_hourly <- dplyr::mutate(
  df_fda_nested_hourly,
  min_lambda = purrr::map_dbl(
    .x = lambdas,
    # select the candidate lambda whose GCV is smallest (not the GCV value itself)
    .f = function(x) tmp_lambdas2[which.min(x)]
  ),
  fd_smooth = purrr::map2(
    .x = min_lambda,
    .y = data_matrix,
    .f = function(x, df) {
      # apply optimal lambda
      fda_fdPar <- fda::fdPar(
        fdobj = tmpBasis,
        Lfdobj = harmonic_accerlation,
        lambda = x
      )

      fda_fd_smooth <- fda::smooth.basis(
        0:23,
        df,
        fda_fdPar
      )

      purrr::pluck(fda_fd_smooth, "fd", "fdnames") <- fdaNames

      return(fda_fd_smooth)
    }
  ),
  fd_smooth_summarized = purrr::map(
    .x = fd_smooth,
    .f = function(x) fda::mean.fd(purrr::pluck(x, "fd"))
  )
)
Code
df_fda_nested_15min <- dplyr::mutate(
  df_fda_nested_15min,
  min_lambda = purrr::map_dbl(
    .x = lambdas,
    # select the candidate lambda whose GCV is smallest (not the GCV value itself)
    .f = function(x) tmp_lambdas2[which.min(x)]
  ),
  fd_smooth = purrr::map2(
    .x = min_lambda,
    .y = data_matrix,
    .f = function(x, df) {
      # apply optimal lambda
      fda_fdPar <- fda::fdPar(
        fdobj = tmpBasis_15min,
        Lfdobj = harmonic_accerlation_15min,
        lambda = x
      )

      fda_fd_smooth <- fda::smooth.basis(
        seq(0, 23.75, 15 / 60),
        df,
        fda_fdPar
      )

      purrr::pluck(fda_fd_smooth, "fd", "fdnames") <- fdaNames

      return(fda_fd_smooth)
    }
  ),
  fd_smooth_summarized = purrr::map(
    .x = fd_smooth,
    .f = function(x) fda::mean.fd(purrr::pluck(x, "fd"))
  )
)
Code
# combine the n fd objects into a single fd object
tmpList <- purrr::map(
  .x = purrr::pluck(df_fda_nested_hourly, "fd_smooth_summarized"),
  .f = function(x) purrr::pluck(x, "coefs")
)

tmpCoefAll <- do.call(cbind, tmpList)
colnames(tmpCoefAll) <- as.character(purrr::pluck(df_fda_nested_hourly, "date"))

fdaAllMonthly_hourly <- fda::fd(
  coef = tmpCoefAll,
  basisobj = tmpBasis,
  fdnames = fdaNames
)

# house cleaning
rm(tmpList, tmpCoefAll)
Code
# combine the n fd objects into a single fd object
tmpList <- purrr::map(
  .x = purrr::pluck(df_fda_nested_15min, "fd_smooth_summarized"),
  .f = function(x) purrr::pluck(x, "coefs")
)

tmpCoefAll <- do.call(cbind, tmpList)
colnames(tmpCoefAll) <- as.character(purrr::pluck(df_fda_nested_15min, "date"))

fdaAllMonthly_15min <- fda::fd(
  coef = tmpCoefAll,
  basisobj = tmpBasis_15min,
  fdnames = fdaNames
)

# house cleaning
rm(tmpList, tmpCoefAll)
Code
tmpList <- purrr::map(
  .x = purrr::pluck(df_fda_nested_hourly, "fd_smooth"),
  .f = function(x) purrr::pluck(x, "fd")
)

fdaAll_hourly <- do.call(fda::mean.fd, tmpList)
purrr::pluck(fdaAll_hourly, "fdnames") <- fdaNames

# house cleaning
rm(tmpList)
Code
tmpList <- purrr::map(
  .x = purrr::pluck(df_fda_nested_15min, "fd_smooth"),
  .f = function(x) purrr::pluck(x, "fd")
)

fdaAll_15min <- do.call(fda::mean.fd, tmpList)
purrr::pluck(fdaAll_15min, "fdnames") <- fdaNames

# house cleaning
rm(tmpList)
Code
# choose one of:
# - 15-min
# - hourly

# derivative: 1 velocity
fda_all_monthly_velocity <- fda::deriv.fd(fdaAllMonthly_hourly, 1)

# derivative: 2 acceleration
fda_all_monthly_acceleration <- fda::deriv.fd(fdaAllMonthly_hourly, 2)

In this section we’ll discuss different aspects of fitting curves to the data. However, we won’t cover the underlying mechanics of how FDA works or the technical aspects of fitting the curves; instead, we’ll use illustrations to walk through the various steps and processes.

Using FDA we’ll first derive smooth approximations of the data, one curve per day. The figures below illustrate the curve-fitting process. The figure on the left shows the daily load profile for two days, while the one on the right shows the FDA-fitted curves. Notice that the fitted curves are smooth and capture the essence of the data but don’t necessarily capture its peakedness - you might say the curves aren’t over-fitted.

Code
# 1 row, 2 columns
par(mfrow = c(1, 2))

# plot
tmpPlotDf <- df_fda_hourly
tmpXAxis <- if (nrow(tmpPlotDf) == 24) 0:23 else seq(0, 23.75, 15 / 60)

## plotting params
tmp_ylim_margins <- c(
  min(tmpPlotDf[, 1:2]),
  max(tmpPlotDf[, 1:2])
)

## plot
par(mai = c(0.85, 0.85, 0.5, 0.15))
plot(
  x = tmpXAxis,
  y = purrr::pluck(tmpPlotDf, 1),
  type = "b",
  main = "Daily Load Profile",
  ylab = "kWh",
  xlab = "Time",
  ylim = tmp_ylim_margins,
  cex.main = 2
)
lines(
  x = tmpXAxis,
  y = purrr::pluck(tmpPlotDf, 2),
  col = "firebrick",
  type = "b"
)
legend(
  "topleft",
  legend = names(tmpPlotDf)[1:2],
  col = c("black", "firebrick"),
  lwd = c(2, 2),
  bty = "n"
)

rm(tmpPlotDf, tmpXAxis)

# sr: i'm not sure this is needed given the below plots
# plot
par(mai = c(0.85, 0.85, 0.5, 0.15))
## invisible helps hide the plotting "done" message
invisible(
  plot(
    purrr::pluck(df_fda_nested_hourly, "fd_smooth", 1, "fd")[1:2],
    ylim = tmp_ylim_margins,
    main = "FDA Smoothed Curves",
    cex.main = 2
  )
)
legend(
  "topleft",
  legend = names(df_fda_hourly)[1:2],
  col = c("black", "firebrick"),
  lwd = c(2, 2),
  bty = "n"
)

Image by author
Code
# house cleaning
rm(tmp_ylim_margins)

# reset
par(mfrow = c(1, 1))

The next step is to summarize the fitted curves. We will accomplish this using methods from functional data analysis. The idea is similar to summarizing non-functional data. There are two components: a) the group-by clause and b) the summarizing function. In this case, we will summarize the data via the mean and group-by month. The result is one curve per month and we’ll call these summarized curves mean daily load profiles. Below is a plot of the summarized monthly curves.

Code
par(mai = c(0.85, 0.85, 0.5, 0.15))
invisible(
  plot(
    fdaAllMonthly_hourly,
    main = "Mean Daily Load Profiles",
    cex.main = 1.5
  )
)

Image by author

Next, the figure below shows the summarized curve for a randomly chosen month. The summarized hourly curves are the focus of this article. However, I was curious to see what effect summarizing the 15-minute interval data by hour would have, so I also fit curves to each day using the 15-minute data and, as before, summarized the fitted curves by month. As expected, smoothing the 15-minute data results in some loss of information. For example, the magnitude of the tallest peak is lessened and shifted. Overall, though, I think the smoothed data fits well enough.

Code
# plotting
set.seed(123)
tmpRandomMonth <- sample(seq_len(num_of_months), 1)

for (i in seq_len(num_of_months)[tmpRandomMonth]) {
  tmpFDA <- purrr::pluck(fdaAllMonthly_hourly, "coefs")

  tmpPlotName <- paste(
    months(as.Date(colnames(tmpFDA))[[i]]),
    lubridate::year(colnames(tmpFDA)[[i]])
  )

  # actuals
  # dfTmp = dplyr::filter(df_energy_hourly, date == as.Date(colnames(tmpFDA))[[i]])
  # tmpY = purrr::pluck(dfTmp, "kwh")
  # tmpX = purrr::pluck(dfTmp, "hour_start")

  tmpYAxisLims <- c(
    0,
    max(
      predict(fdaAllMonthly_hourly, 0:23)[, i],
      predict(fdaAllMonthly_hourly, seq(0, 23, 15 / 60))[, i],
      predict(fdaAllMonthly_15min, seq(0, 23, 15 / 60))[, i],
      na.rm = TRUE
    )
  )

  # consumption
  par(mai = c(0.85, 0.85, 0.5, 0.15))
  plot(
    fdaAllMonthly_hourly[i],
    col = "dodgerblue",
    cex.lab = 1.3,
    ylim = tmpYAxisLims * 1.3
  )
  lines(
    fdaAllMonthly_15min[i],
    col = "darkorange",
    cex.lab = 1.3,
    lty = 2
  )
  # points(
  #   x = tmpX,
  #   y = tmpY,
  #   pch = 16,
  #   col = "gray50"
  # )
  legend(
    "topleft",
    legend = c("Hourly", "15-Min"),
    lty = c(1, 2),
    # pch = c(16, NA, NA),
    bty = "n",
    col = c("dodgerblue", "darkorange")
  )
  title(tmpPlotName, cex.main = 1.5)
}

Image by author

We’ll wrap up this section with a discussion about the procedure outlined above. Instead of fitting curves to each day first, another approach would be to summarize the data first and then fit the curves to the summarized data.

In context, we’d group by month and hour and take the mean of the data before fitting the curve. The resulting curve would be nearly the same as the summarized curves above. The difference is twofold: 1) summarizing first is more efficient, and 2) if only the mean is used to summarize the data, we lose insight into the variability of the data. One could summarize the data by mean and standard deviation, but then the variability would also need to be translated to its functional counterpart, which adds steps. By summarizing the daily fitted curves (as was done above), the variability is implicitly captured - no extra steps needed. Additionally, since the curves contain information about both the mean and the variance, it’s straightforward to compute confidence intervals for the fitted curves.
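
As a rough sketch of that last point, a pointwise approximate band can be built from the functional mean and standard deviation. Below, fd_month stands in for one month’s fitted daily curves (the fd object created earlier), and the normal-approximation band is illustrative rather than a formal functional confidence interval.

Code
# sketch: pointwise ~95% band for one month's mean daily load profile
fd_month <- purrr::pluck(df_fda_nested_hourly, "fd_smooth", 1, "fd")
n_days <- ncol(purrr::pluck(fd_month, "coefs"))

tmp_grid <- seq(0, 23, by = 0.25)
mean_hat <- fda::eval.fd(tmp_grid, fda::mean.fd(fd_month))
sd_hat <- fda::eval.fd(tmp_grid, fda::sd.fd(fd_month))

upper <- mean_hat + 1.96 * sd_hat / sqrt(n_days)
lower <- mean_hat - 1.96 * sd_hat / sqrt(n_days)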

Regarding computational efficiency: FDA is fast. I’m able to fit curves to the 978 days in this analysis and summarize those curves in minutes using modest resources. To fit the curves I use distributed resources - machine 1 has 16 GB of RAM and 4 cores, and machine 2 has 8 GB of RAM and 8 cores. This works well for the size of my data. However, if I planned to analyze years’ worth of daily/hourly/15-minute energy consumption data for millions of customers, then fitting a curve to each day might not be the optimal strategy.

Customer Profiles

On-peak hours are the times of day when demand for electricity is highest. This is usually when people are using the most electricity for tasks such as cooking, cleaning, using electronic devices, and running air conditioning or heating units. The purpose of defining on-peak hours is to help utilities manage the electricity grid more efficiently and ensure a stable supply of electricity during times of high demand.

In this section I’ll illustrate an alternative method to define on-peak hours using functional data analysis. The approach is based on the velocity of consumption: the idea is to define the peak hours as the times of day when the rate, or velocity, of consumption is highest or lowest.

The velocity-based definition of on-peak hours may at times align with the usual definition. A key difference between the two approaches is that the usual definition only provides a time of day, whereas the velocity-based definition also describes how quickly demand is increasing or decreasing. That is, we know the rate at which energy is being consumed, not just the times of day. This additional information could be useful to utilities and energy companies for demand planning.

Next, let’s discuss the intuition behind the velocity-based definition. Below are two figures: on the left is a mean daily load profile (i.e., a mean fitted curve) and on the right are plots of the first and second derivatives, that is, the velocity and acceleration, respectively. You may recall from calculus that the first derivative describes a rate - in this case the rate of energy consumption - and that the points where the first derivative reaches a maximum or minimum (equivalently, where the second derivative crosses zero) are the inflection points of the original curve.

Code
# plotting
# set.seed(123)
# tmpRandomMonth <- sample(seq_len(num_of_months), 1)
tmpRandomMonth <- 31 # hard coded to jul 2023

par(mfrow = c(1, 2))
for (i in seq_len(num_of_months)[tmpRandomMonth]) {
  tmpPlotName <- paste(
    months(as.Date(colnames(purrr::pluck(fdaAllMonthly_hourly, "coefs")))[[i]]),
    lubridate::year(colnames(purrr::pluck(fdaAllMonthly_hourly, "coefs"))[[i]])
  )

  tmpYAxisLims_v1 <- c(
    0,
    max(
      predict(fdaAllMonthly_hourly, 0:23)[, i],
      predict(fdaAllMonthly_hourly, seq(0, 23, 15 / 60))[, i],
      na.rm = TRUE
    )
  )

  tmpYAxisLims_v2 <- c(
    min(
      predict(fda_all_monthly_velocity, 0:23)[, i],
      predict(fda_all_monthly_acceleration, 0:23)[, i],
      na.rm = TRUE
    ),
    max(
      predict(fda_all_monthly_velocity, 0:23)[, i],
      predict(fda_all_monthly_acceleration, 0:23)[, i],
      na.rm = TRUE
    )
  )

  # consumption
  par(mai = c(0.85, 0.85, 0.5, 0.15))
  plot(
    fdaAllMonthly_hourly[i],
    col = "dodgerblue",
    cex.lab = 1.3,
    xlab = "Hour",
    ylim = tmpYAxisLims_v1 * 1.2
  )
  legend(
    "topright",
    legend = c("kWh"),
    lty = c(1),
    bty = "n",
    col = c("dodgerblue")
  )
  title(tmpPlotName, cex = 2)

  par(mai = c(0.85, 0.85, 0.5, 0.15))
  # derivatives
  plot(
    fda_all_monthly_velocity[i],
    ylab = "Rate of Change",
    xlab = "Hour",
    cex.lab = 1.3,
    ylim = tmpYAxisLims_v2 * 1.2,
    col = "darkorange"
  )
  title("Derivatives of Consumption", cex = 2)
  lines(
    fda_all_monthly_acceleration[i],
    lty = 2,
    col = "tomato"
  )
  legend(
    "topright",
    legend = c("Velocity", "Acceleration"),
    lty = c(1, 2), # match the dashed acceleration line
    bty = "n",
    col = c("darkorange", "tomato")
  )

  rm(tmpPlotName)
}

Image by author
Code
# add title to page
# mtext("Derivatives", outer = TRUE, cex = 2, line = -2.5)

rm(tmpRandomMonth, tmpYAxisLims_v1, tmpYAxisLims_v2)

A visual inspection of the plots suggests that the highest point (max velocity) occurs around 4pm and the lowest point (min velocity) around 2am. This implies that for the given month, on average, my household started consuming electricity at the fastest pace around 4pm and reduced consumption at the fastest pace around 2am.

I think a little story-telling would be helpful, so let me add context to the plots. Summers in Texas are HOT, but we don’t run the A/C all day. We’re conscious of the environment and our wallets, so the A/C is on a schedule: the system starts cooling the house early, before the scheduled time, to reach a certain temperature by the requested time. Then, as the night cools, the system relaxes and stops running so aggressively. Marrying the derivative discussion with the parent kWh plot shows that my household used the most energy in the evening during this month.

Often there are other stories to be told. 2am and 4pm are the global inflection points, but there are also local inflection points, and sometimes these are of interest. For example, there might only be a small difference between the global max and a local max, which might imply nuances in the data. Imagine a multi-generational household, a household with children of different ages who go to school at different times, or a household where at least one adult works the night shift. These households would likely have local inflection points that are of interest. Part of my interest in FDA for energy consumption data is that it paints a more complete picture than the kWh alone.

Let’s wrap this section up with a discussion about derivatives. A visual inspection of the plots is one thing, but we can also use math to find these points and confirm whether they are truly inflection points. Using FDA we have an estimate of the velocity of consumption, so we can use the fitted curve to find its maximum and minimum values. The argmin/argmax are also of interest, as they tell us the time of day. Lastly, we can use the second derivative to confirm the findings: where the acceleration is 0, the extrema found from the velocity curve correspond to inflection points of consumption. The second derivative can also be used to find local inflection points - anywhere the acceleration crosses 0.
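
Here is a short sketch of that check applied to one of the mean monthly curves fitted above (the July 2023 curve, index 31); the evaluation grid is arbitrary.

Code
# sketch: locate max/min velocity for one month and inspect the acceleration there
tmp_grid <- seq(0, 23, by = 0.25)
tmp_vel <- fda::eval.fd(tmp_grid, fdaAllMonthly_hourly[31], Lfdobj = 1) # velocity
tmp_accel <- fda::eval.fd(tmp_grid, fdaAllMonthly_hourly[31], Lfdobj = 2) # acceleration

tmp_grid[which.max(tmp_vel)] # time of fastest increase in consumption
tmp_grid[which.min(tmp_vel)] # time of fastest decrease in consumption

# at an extremum of velocity the acceleration should be near zero,
# consistent with an inflection point of the consumption curve
tmp_accel[which.max(tmp_vel)]
tmp_accel[which.min(tmp_vel)]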

Feature Engineering

In this section I’ll illustrate how to create features from the consumption data and the derivatives discussed in the previous section. In essence, I will create profiles of the months and perform an analysis on the data.

I worked on a variety of projects during my tenure at Consumers Energy. For some projects the energy data had to be condensed - often down to a single row/observation per customer. Clustering projects usually require such a structure, so that’s what I’ll focus on.

Feature engineering will focus on two areas: time, and measurements based on consumption. Using my monthly household energy data I will extract the min/max kWh as well as the times of day associated with those values. Additionally, using the derivative curves from the previous section, it’s possible to extract the min/max velocity and the times of day associated with those values. Each variable is numeric. The time fields encode military time in fractional units at 30-minute intervals; for example, 15.5 is 3:30 pm. Keeping the fields numeric makes them easier to use.
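
As a quick illustration of that encoding, a fractional hour can be converted back to a clock time with a small helper; to_clock below is a hypothetical function for illustration, not the standardizeMilitaryTime helper referenced in the commented-out code further down.

Code
# sketch: convert a fractional hour (e.g., 15.5) back to clock time
to_clock <- function(x) {
  sprintf("%02d:%02d", as.integer(x), as.integer(round((x %% 1) * 60)))
}
to_clock(15.5) # "15:30"
to_clock(6.5)  # "06:30"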

Code
# choose one of:
# - 15-min
# - hourly

# time grid to predict
# adjust as needed
tmpPredGrid <- seq(0, 23, 30 / 60)

# predict
fdaAllMonths_predicted_velocity <- fda::predict.fd(
  fdaAllMonthly_hourly,
  tmpPredGrid,
  Lfdobj = 1
)

# get monthly min/max kwh
min_monthly_demand <- purrr::map_dbl(
  .x = purrr::pluck(df_fda_nested_hourly, "data_matrix_summarized"),
  .f = function(x) min(x, na.rm = TRUE)
)
max_monthly_demand <- purrr::map_dbl(
  .x = purrr::pluck(df_fda_nested_hourly, "data_matrix_summarized"),
  .f = function(x) max(x, na.rm = TRUE)
)

# get monthly argmin/argmax kwh
argmin_monthly_demand <- purrr::map_dbl(
  .x = purrr::pluck(df_fda_nested_hourly, "data_matrix_summarized"),
  .f = function(x) seq_along(x)[which.min(x)]
)
argmax_monthly_demand <- purrr::map_dbl(
  .x = purrr::pluck(df_fda_nested_hourly, "data_matrix_summarized"),
  .f = function(x) seq_along(x)[which.max(x)]
)

# get monthly min/max velocity
df_onpeak_min_velocity <- apply(
  X = fdaAllMonths_predicted_velocity,
  MARGIN = 2,
  FUN = function(x) min(x, na.rm = TRUE)
)
df_onpeak_max_velocity <- apply(
  X = fdaAllMonths_predicted_velocity,
  MARGIN = 2,
  FUN = function(x) max(x, na.rm = TRUE)
)

# extract monthly argmin/argmax of velocity
# which.min/max auto excludes NA/NULLs
df_onpeak_argmin_velocity <- apply(
  X = fdaAllMonths_predicted_velocity,
  MARGIN = 2,
  FUN = function(x) which.min(x)
)
df_onpeak_argmax_velocity <- apply(
  X = fdaAllMonths_predicted_velocity,
  MARGIN = 2,
  FUN = function(x) which.max(x)
)

# create dataframe
df_onpeak <- dplyr::tibble(
  date = colnames(fdaAllMonths_predicted_velocity),
  min_consumption = min_monthly_demand,
  max_consumption = max_monthly_demand,
  argmin_consumption = argmin_monthly_demand,
  argmax_consumption = argmax_monthly_demand,
  min_velocity = df_onpeak_min_velocity,
  max_velocity = df_onpeak_max_velocity,
  argmax_velocity = tmpPredGrid[df_onpeak_argmax_velocity],
  argmin_velocity = tmpPredGrid[df_onpeak_argmin_velocity],
)

# convert date to date type
df_onpeak <- dplyr::mutate(df_onpeak, date = as.Date(date))

# use custom function to add date variables
df_onpeak <- create_date_variables(df_onpeak, date)

# feature engineering
# - convert raw time to pretty, string time
# df_onpeak <- dplyr::mutate(
#   df_onpeak,
#   onpeak_start_time = paste(as.integer(onpeak_start_numeric), ":", (onpeak_start_numeric %% 1) * 60, sep = ""),
#   onpeak_end_time = paste(as.integer(onpeak_end_numeric), ":", (onpeak_end_numeric %% 1) * 60, sep = ""),
#   onpeak_start_time = purrr::map_chr(.x = onpeak_start_time, .f = standardizeMilitaryTime),
#   onpeak_end_time = purrr::map_chr(.x = onpeak_end_time, .f = standardizeMilitaryTime),
# )
Code
dfTmp <- dplyr::transmute(
  df_onpeak,
  Date = paste(month_name, year_factor),
  minConsumption = min_consumption,
  maxConsumption = max_consumption,
  argmin_consumption = argmin_consumption,
  argmax_consumption = argmax_consumption,
  minVelocity = min_velocity,
  maxVelocity = max_velocity,
  argmin_velocity = argmin_velocity,
  argmax_velocity = argmax_velocity,
  year_factor = year_factor,
  month_int = month_int,
)
Code
# logistics
tmpN <- 5
set.seed(123)

# data prep
dfTmp2 <- dplyr::slice_sample(dfTmp, n = tmpN)
dfTmp2 <- dplyr::arrange(dfTmp2, year_factor, month_int)
dfTmp2 <- dplyr::select(dfTmp2, -month_int, -year_factor)

# table: start
tmpgt <- gt::gt(dfTmp2)
# table: create header
tmpgt <- gt::tab_header(
  tmpgt,
  title = gt::md("**Engineered Features**"),
  subtitle = "n Randomly Chosen Months"
)
# table: format labels
tmpgt <- gt::cols_label(
  tmpgt,
  argmin_velocity = "Min Velocity Time",
  argmax_velocity = "Max Velocity Time",
  argmin_consumption = "Min kWh Time",
  argmax_consumption = "Max kWh Time",
  minVelocity = "Min Velocity",
  maxVelocity = "Max Velocity",
  minConsumption = "Min kWh",
  maxConsumption = "Max kWh",
)
# table: format digits
tmpgt <- gt::fmt_number(
  tmpgt,
  columns = c(dplyr::contains("min"), dplyr::contains("max")),
  decimals = 2
)

# display
tmpgt
Engineered Features
n Randomly Chosen Months
Date Min kWh Max kWh Min kWh Time Max kWh Time Min Velocity Max Velocity Min Velocity Time Max Velocity Time
Mar 2021 0.08 0.34 4.00 12.00 −0.05 0.06 12.00 7.00
Feb 2022 0.14 0.64 1.00 9.00 −0.07 0.10 12.50 6.50
Mar 2022 0.07 0.44 2.00 9.00 −0.06 0.09 10.00 6.50
Jul 2022 0.18 0.90 10.00 22.00 −0.15 0.12 0.00 14.50
Jul 2023 0.20 0.86 13.00 19.00 −0.08 0.16 2.00 15.50

If it were available, I could add other information as well, such as the maximum number of people in the house by month, average temperature by month, etc. There are many possibilities - it all depends on the goal of the analysis. To keep it simple, no additional data was added. This would be most similar to a behavioral clustering project.

The goal of the clustering analysis might be to assess which months are similar to each other (i.e., members of the same cluster). Intuitively we’d expect months in the same season to be similar; for example, June and July might fall in a summer cluster.

Two methods will be used, Multidimensional Scaling (MDS) and Partitioning Around Medoids (PAM). MDS is a dimension reduction technique that compresses multivariate data into a specified number of dimensions - usually 2, since MDS is often used as a visualization tool. As a side note, there are several other ways to compress multivariate data into \(p\) dimensions, such as UMAP or t-SNE. I used MDS because it’s included in base R and I didn’t want to install another library.

As for the clustering algorithm, I chose PAM because it’s a more robust version of K-means. I used K = 4 because there are four seasons (although it doesn’t always feel that way in Texas). The data was centered and scaled prior to performing MDS and PAM. I’m still unsure if it makes sense to standardize the time variables, but I didn’t want them to have more influence than the other variables.

Code
# used for reproducibility
set.seed(123)

# initialize new dataframe
dfTmpCluster <- dfTmp

# extract date field as IDs
clusterDates <- purrr::pluck(dfTmpCluster, "Date")

# drop unnecessary fields
dfTmpCluster <- dplyr::select(dfTmpCluster, -month_int, -year_factor, -Date)

# center/scale the [all] data
dfTmpCluster <- scale(dfTmpCluster)

# center/scale the only min/max variables
# dfTmpCluster <- dplyr::mutate(
#   dfTmpCluster,
#   dplyr::across(
#     .cols = c(dplyr::contains("min"), dplyr::contains("max")),
#     .fns = function(x) scale(x)
#   )
# )

# calculate dist
tmpDist <- dist(dfTmpCluster, method = "euclidean")

# Multidimensional scaling
MDS <- cmdscale(d = tmpDist, k = 2)

# PAM
tmpClusters <- cluster::pam(x = dfTmpCluster, k = 4, cluster.only = TRUE)

# collect results into a dataframe
tmpClusterPlotData <- dplyr::tibble(
  date = clusterDates,
  x = MDS[, 1],
  y = MDS[, 2],
  cluster = factor(tmpClusters)
)

# plot
ggplot(tmpClusterPlotData, aes(x = x, y = y, color = cluster, label = date)) +
  geom_text_repel(size = 4, box.padding = 0.1, max.overlaps = 100) +
  labs(x = "X Value", y = "Y Value", color = "Cluster") +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.background = element_rect(fill = "white"),
    legend.position = "none",
    panel.border = element_blank(),
    panel.grid = element_blank(),
  )

What do you think? Does the plot make sense? The coldest months appear to be in the top left corner, summer months in the top right corner, spring and fall in the middle, and two outliers perhaps towards the bottom of the graph.

February 2021 was known as Snowmageddon in Texas (Texas Parks & Wildlife Magazine 2022); a snow storm crippled North Texas for a few days, perhaps a week, and there were even rolling blackouts. In December 2022 my household hosted family for the holidays, so consumption was greater than usual. Also, among the summer months (top right), July 2023 stands out a bit, which makes sense because it was one of the warmest months on record.

Assessing K = 4 (Seasons)

Before wrapping up this section I wanted to assess how reasonable K = 4 is. First, I looked at an elbow plot. I read recently that the elbow plot is no longer recommended; I’ll explore that another time. Also, I didn’t use the K-means algorithm to cluster the data - I used PAM - so the results may not translate exactly. Regardless, a K between 4 and 6 seems to make sense.

Code
set.seed(123)
k.values <- 2:15

# extract wss for 2-15 clusters
wss_values <- purrr::map_dbl(
  .x = k.values,
  .f = function(k) purrr::pluck(kmeans(dfTmpCluster, k, nstart = 10), "tot.withinss")
)

# plotting
plot(
  k.values,
  wss_values,
  type = "b",
  pch = 19,
  frame = FALSE,
  xlab = "Number of clusters K",
  ylab = "Total within-clusters sum of squares",
  main = "Elblow Plot (K-Means)"
)

Code
# house cleaning
rm(k.values, wss_values)

Have you ever heard of a clustergram? See this link for details. In summary, it’s another way to visualize what value K should take. There isn’t a library on CRAN for this that I know of, but someone created a function accessible via GitHub (Galili 2010). Similar to the elbow plot, a value between 3 and 5 makes sense. The function uses K-means, but it could be extended to other clustering algorithms. It’s recommended to repeat the process several times to account for randomness; I’ve only called the clustergram function once for illustrative purposes, with a sketch of how to repeat it after the call below.

Code
clustergram(Data = dfTmpCluster, k.range = 2:10)
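
A sketch of what repeating the process might look like, reusing the same call as above with different seeds (the number of repetitions is arbitrary):

Code
# sketch: repeat the clustergram a few times to gauge stability across random starts
for (i in 1:4) {
  set.seed(i)
  clustergram(Data = dfTmpCluster, k.range = 2:10)
}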

Curve Clustering

Another application of FDA that’s worthwhile to discuss is curve clustering. Like their non-functional clustering cousins, curve clustering techniques have a wide breadth of applications, such as customer segmentation. For our purposes, it might be of interest to see which of the 33 months are similar [across time].

In a traditional unsupervised paradigm we might treat the months as observations and the 24 hours as variables, resulting in a 33-row by 24-column matrix/data frame. Under this structure we’re essentially treating the 24 hourly values as unordered, independent features, but we know they aren’t independent - the data are temporally (serially) dependent. One solution to this dilemma might be to apply a dimension reduction technique such as PCA and then cluster the data, as sketched below.
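
Here is a sketch of that PCA-then-cluster alternative; it assumes the scaled months-by-hours matrix (matrix_curve_clustering_scaled) built in the next section and is illustrative only.

Code
# sketch: PCA for dimension reduction, then K-means on the component scores
# assumes matrix_curve_clustering_scaled, the scaled months-by-hours matrix built below
pca_fit <- prcomp(matrix_curve_clustering_scaled, center = FALSE, scale. = FALSE)
pca_scores <- pca_fit$x[, 1:2] # keep the first two principal components

set.seed(123)
pca_clusters <- kmeans(pca_scores, centers = 4, nstart = 15)$cluster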

FDA provides an alternative approach: curve clustering. In this approach FDA first estimates a continuous curve from the 24 hourly values, then the curves themselves are the objects of the clustering algorithm. There are different curve clustering algorithms; the results of some may mirror those of a traditional clustering technique, while the results of others will differ.

Below are results of a traditional clustering process using K-means followed by a curve clustering algorithm.

The Data

Code
# for each 1-column matrix of summarized data,
# column combine each element in the list
# output is a 24 by p matrix w/o dimnames
# rows = hours
# columns = summarized monthly data
matrix_curve_clustering = do.call(
  what = cbind,
  args = purrr::pluck(df_fda_nested_hourly, "data_matrix_summarized")
)

# transpose ^ so that rows = months, cols = hours
matrix_curve_clustering = t(matrix_curve_clustering)

# add row names (index)
rownames(matrix_curve_clustering) = as.character(purrr::pluck(df_fda_nested_hourly, "date"))

# add col names
colnames(matrix_curve_clustering) = paste("hour", seq(from = 0, to = ncol(matrix_curve_clustering)-1), sep = "_")

# scaled
matrix_curve_clustering_scaled = scale(matrix_curve_clustering)
Code
dfTmp = matrix_curve_clustering[1:5,1:8]
dfTmp = as.data.frame(dfTmp)

# init table
outTmp = gt::gt(dfTmp, rownames_to_stub = TRUE)
# add title
outTmp = gt::tab_header(outTmp, title = gt::md("**Data**"), subtitle = "n by 24 matrix")
# format numbers
outTmp = gt::fmt_number(outTmp, columns = dplyr::everything(), decimals = 2)
# display
outTmp
Data
n by 24 matrix
hour_0 hour_1 hour_2 hour_3 hour_4 hour_5 hour_6 hour_7
2021-01-01 0.18 0.21 0.18 0.22 0.19 0.21 0.22 0.26
2021-02-01 0.36 0.39 0.36 0.38 0.37 0.37 0.36 0.44
2021-03-01 0.13 0.10 0.10 0.08 0.10 0.10 0.16 0.19
2021-04-01 0.10 0.10 0.11 0.09 0.08 0.10 0.17 0.17
2021-05-01 0.18 0.13 0.12 0.09 0.09 0.10 0.20 0.18
Code
# source: https://www.datanovia.com/en/blog/ggplot-colors-best-tricks-you-will-love/

# The palette with grey:
cbp1 <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

K-Means

Code
set.seed(123)
traditional_clusters = kmeans(
  x = matrix_curve_clustering_scaled,
  centers = 4,
  nstart = 15
)

traditional_clusters = purrr::pluck(traditional_clusters, "cluster")
traditional_clusters = rep(traditional_clusters, each = 24)
Code
curve_clustering_base_plot+
  ggtitle("Traditional Clustering", subtitle = "K-Means")+
  geom_line(color = cbp1[traditional_clusters])

Image by author

Curve Clustering via Functional Data Analysis

I used the funFEM function (Bouveyron 2021) to cluster the fitted curves.

Code
set.seed(123)
curve_clustering_funfem = suppressMessages(
  suppressWarnings(
    funFEM::funFEM(
      fd = fdaAllMonthly_hourly,
      K = 4,
      init = "kmeans",
      model = "all",
      disp = FALSE,
      graph = FALSE
    )
  )
)

funFEM_clusters = purrr::pluck(curve_clustering_funfem, "cls")
funFEM_clusters = rep(funFEM_clusters, each = 24)
Code
curve_clustering_base_plot+
  ggtitle("Curve Clustering", subtitle = "funFEM (with K-Means)")+
  geom_line(color = cbp1[funFEM_clusters])

Image by author
Code
set.seed(123)
curve_clustering_funfem = suppressMessages(
  suppressWarnings(
    funFEM::funFEM(
      fd = fdaAllMonthly_hourly,
      K = 4,
      init = "random",
      model = "all",
      disp = FALSE,
      graph = FALSE
    )
  )
)
Error in if (min(colSums(T)) <= 1) stop("One cluster is almost empty!") : 
  missing value where TRUE/FALSE needed
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Code
funFEM_clusters = purrr::pluck(curve_clustering_funfem, "cls")
funFEM_clusters = rep(funFEM_clusters, each = 24)
Code
curve_clustering_base_plot+
  ggtitle("Curve Clustering", subtitle = "funFEM (with random start)")+
  geom_line(color = cbp1[funFEM_clusters])

Image by author
Code
set.seed(123)
curve_clustering_funfem = suppressMessages(
  suppressWarnings(
    funFEM::funFEM(
      fd = fda_all_monthly_velocity,
      K = 4,
      init = "random",
      model = "all",
      disp = FALSE,
      graph = FALSE
    )
  )
)
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Error in .fstep(fd, T, lambda) : One cluster is almost empty!
Code
funFEM_clusters = purrr::pluck(curve_clustering_funfem, "cls")
funFEM_clusters = rep(funFEM_clusters, each = 24)
Code
curve_clustering_base_plot+
  ggtitle("Curve Clustering", subtitle = "funFEM (with random start) using velocity")+
  geom_line(color = cbp1[funFEM_clusters])

Image by author

References

Bouveyron, Charles. 2021. funFEM: Clustering in the Discriminative Functional Subspace. https://CRAN.R-project.org/package=funFEM.
Galili, Tal. 2010. “Clustergram Visualization and Diagnostics for Cluster Analysis.” 2010. https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/.
Texas Parks & Wildlife Magazine. 2022. “Snowmageddon.” 2022. https://tpwmagazine.com/archive/2022/jan/ed_3_snowmageddon/index.phtml.