Guy has been measuring his refrigerator’s electricity consumption with a power meter. He wants to know if it’s using too much electricity and should be replaced, or if it’s running as expected.

The nameplate states that the unit will consume 296 kilowatt-hour (kWh) per year.

As we work through this problem, we should think carefully about variation, the measurement process, and the relation of the data to the specification. In real-world problems, getting these right is often the difference between success and failure.

For starters, we should ask what “296 kWh” means. Is it the design target or a projection based on some test? If it’s a target, is it an upper specification (“not to exceed”), a mean, or the median? Does it mean that some percentage of the time, this model of refrigerator will consume at most 296 kWh of electricity under given test conditions (a statistical upper limit), or that the refrigerator is flat-out defective if it ever exceeds 296 kWh in a year (a go/no-go condition such as used in manufacturing)?

In this case, we could probably find out with enough digging into European Union energy efficiency standards. Short of that kind of research, we can suppose that it’s either a mean or an upper limit. If the predicted electricity consumption is greater than this value, it’s time for Guy to consider replacing his refrigerator. Our statistical tests, then, should focus on the range of possible values and whether or not that range includes the nameplate rating.

We’ll use the following libraries in this analysis.

This is a fairly small dataset, so we can print it in its entirity.

Date | kWh | Comment |
---|---|---|

16/06/18 | 0.00 | NA |

29/06/18 | 6.32 | PC |

01/07/18 | 7.97 | NA |

06/07/18 | 12.07 | NA |

08/07/18 | 13.90 | NA |

13/07/18 | 0.40 | PC |

15/07/18 | 2.09 | NA |

21/07/18 | 6.42 | NA |

22/07/18 | 7.77 | NA |

27/07/18 | 1.29 | PC |

03/08/18 | 6.73 | NA |

05/08/18 | 8.57 | NA |

10/08/18 | 12.42 | NA |

The power meter reports in cumulative kWh of electricity consumed. However, it resets every time there’s a power outage, and there have been several of those. The time of the power outages is not known, but the first measurement after a power outage is identified with a “PC” in the \(Comment\) variable.

As with the specification, we need to be careful in thinking about this data. Guy’s measument process may be different than that intended by the manufacturer, and produce somewhat different results. That might not be a difference—a bias—of a factor of 2, but it could certainly be a bias of 20%.

Likewise, it’s likely that the refrigerator will consume electricity at different rates over time. Warmer or cooler room temperatures will certainly alter electricity consumption, as will more or less frequent access of the refrigerator and longer or shorter durations of access. A big party would probably bump that usage, too.

Likewise, the measurement process can introduce variaton. We have time-of-measurement in full day increments, but Guy may have recorded in the morning one day, the evening the next day, and the morning on the third day, so that even if the rate of electricity were constant, his observations would show an uneven day-to-day rate.

What this all means is that we need to use this small dataset for all that it’s worth, making sure that we data from a single population and taking full advantage of any variation we find in the data.

The first step in any analysis is exploratory data analysis (EDA), where we examine the data’s structure, familiarize ourselves with its components, and use visualization to identify potential patterns.

Let us start with a first look at how the numeric data look. Since this is cumulative data over time, it makes sense to plot it as a simple time series against the row index.

```
# Preliminary EDA ##
rf_df %>%
mutate(index = row_number()) %>%
ggplot(aes(x = index, y = kWh)) +
geom_point() +
geom_label_repel(aes(label = Comment))
```

`## Warning: Removed 10 rows containing missing values (geom_label_repel).`

The cumulative nature of the reported energy consumption, and the occasional resets, are both obvious.

It’s also appears that the consumption is not entirely linear—there is some variation in the rate of use from point to point. However, we’re looking at the data by row number rather than by date; maybe the apparent variation is just due to the irregularity of observations. After wrangling the data, we’ll look at this again.

Of note: the first data point precedes the first power outage. It seems unlikely that this will be of any use to us.

To predict annual electricity usage, we need to calculate a rate that we can then annualize. We can use the given cumulative numbers to derive an average daily rate between each observation. As we don’t know when each reset point was, we have to be careful about how we use these values. For instance, we want to eliminate negative rates (“drops” in cumulative kWh due to a reset), and we can’t make assumptions about when the power outages occur (i.e. we don’t know where the ’0’s are).

```
# Wrangle the data ##
# Fix the dates so we can do something useful
# Calculate the kWh used between each observation
# Calculate the days between observations
rf_df <- rf_df %>%
mutate(Date = dmy(Date)) %>%
mutate(diff = kWh - lag(kWh),
Date_Start = lag(Date)) %>%
mutate(days = Date - Date_Start)
```

Now that we’ve fixed the \(Date\) column, we can re-plot Figure 1 by date rather than by index.

The apparent variation in daily energy consumption may be a little less, plotted this way, but it’s still there. With a little more work, we might have made this a little more obvious by creating a grouping column and plotting a linear fit with `geom_smooth(method = 'lm')`

, but for EDA that we’re not sharing, eyeballing it is good enough. Either the refrigerator is not constant in its electricity consumption (as we expected), or the power meter is inconsistent in measurement. Possibly both are sources of variation here.

We should also look at when measurements are collected, and this may tell us something.

```
daynames <- c('1' = "Monday",
'2' = "Tuesday",
'3' = "Wednesday",
'4' = "Thursday",
'5' = "Friday",
'6' = "Saturday",
'7' = "Sunday")
rf_df %>%
mutate(DOW = lubridate::wday(Date, week_start = 1)) %>%
group_by(DOW, Comment) %>%
summarise(n = n()) %>%
ungroup() %>%
mutate(DOW = as.factor(daynames[as.character(DOW)])) %>%
ggplot(aes(x = DOW, y = n, color = Comment)) +
geom_point() +
scale_x_discrete(breaks = daynames, label = daynames) +
labs(y = "Count of measurements",
x = "Day of measurement") +
coord_flip() +
facet_grid(.~Comment)
```

We can see that all of the measurements have been taken on the weekend, and all of the power outages happened between Sunday and Friday.

If this is the case, we might expect Fridays to show higher consumption rates than Saturday or Sunday. We should calculate a rate for kWh consumption. We have to be careful, though, since those power outages throw off our totals.

```
# Due to power outages, lines with "PC" have invalid $diff values,
# so convert to NA
# Had some trouble doing this in dplyr due to the NAs...
rf_df$diff[rf_df$Comment == "PC" & !is.na(rf_df$Comment)] <- NA
# Now drop the Comment column, and then drop any rows
# with NA values.
# This leaves us with useful observations of
# kWh used between dates.
# Now we calculate the average daily
# kWh consumption
rf_df <- rf_df %>%
select(-Comment) %>%
na.omit() %>%
mutate(daily_kWh = diff / as.integer(days))
```

Table 2 contains the updated, wrangled dataset.

Date | kWh | diff | Date_Start | days | daily_kWh |
---|---|---|---|---|---|

2018-07-01 | 7.97 | 1.65 | 2018-06-29 | NA | 0.82 |

2018-07-06 | 12.07 | 4.10 | 2018-07-01 | NA | 0.82 |

2018-07-08 | 13.90 | 1.83 | 2018-07-06 | NA | 0.92 |

2018-07-15 | 2.09 | 1.69 | 2018-07-13 | NA | 0.84 |

2018-07-21 | 6.42 | 4.33 | 2018-07-15 | NA | 0.72 |

2018-07-22 | 7.77 | 1.35 | 2018-07-21 | NA | 1.35 |

2018-08-03 | 6.73 | 5.44 | 2018-07-27 | NA | 0.78 |

2018-08-05 | 8.57 | 1.84 | 2018-08-03 | NA | 0.92 |

2018-08-10 | 12.42 | 3.85 | 2018-08-05 | NA | 0.77 |

The key variable that we’re interested in is the average daily electricity use, \(daily\_kWh\), so let’s plot it gainst the date of observation. Since this is an average over some number of days, it would be nice to also indicate over how many days each calculated value is averaged.

```
# Analysis ##
# Basic time series plot of the data
rf_df %>%
ggplot() +
geom_point(aes(x = Date, y = daily_kWh)) +
geom_segment(aes(x = Date_Start, y = daily_kWh, xend = Date, yend = daily_kWh)) +
labs(x = "Date",
y = "Average Daily Consumption, kWh")
```

There only obvious pattern that I’m picking up here is that cumulative readings over more days may be showing lower rates of daily consumption.

We might have expected lower average daily consumption over longer observation intervals. First, longer observation intervals will tend toward the distribution mean, rather than the high rates of consumption that we might expect over shorter periods (see below for more on the long tail). Additionally, longer intervals (larger values of \(days\)) may indicate periods where the refrigerator was not opened, and therefore operating at minimal power draw, while shorter periods could indicate heavy use.

Beyond that, there is not a general upward trend, or trend toward longer or shorter observation intervals.

Now that we have values that are directly comparable in the \(daily\_kWh\) variable, we should also look at its distribution, as this might tell us something useful about the assumptions that we can make during our analysis.

We appear to have a skewed distribution with a long tail toward higher values, or possibly two separate distributions (i.e. this may be a bimodal distribution) with one distribution centered near 0.8 and the other somewhere above 1.2 kWh/day.

We might expect to see two distributions if there are really two processes at play.

At the same time, with only common-cause variation in usage and measurement, we would expect to see high days and low days. While the low days must be bound by 0 kWh/day, the upper bound is determined only by the maximum power use of the compressor and lights. The distribution of daily kWh should be skewed with a long tail toward higher numbers, and we could be seeing that here.

We should also have a look at the days when measurements were taken (thanks to Guy for this suggestion), to see if there’s anything interesting or suggestive in the data.

```
rf_df %>%
mutate(DOW = lubridate::wday(Date, week_start = 1)) %>%
group_by(DOW) %>%
summarise(mean_daily_kWh = mean(daily_kWh),
n = n(),
sd = sd(daily_kWh)) %>%
mutate(se = sd / sqrt(n),
ci_lo = mean_daily_kWh - qt(1 - (0.05 / 2), n - 1) * se,
ci_hi = mean_daily_kWh + qt(1 - (0.05 / 2), n - 1) * se) %>%
# mutate(conf_lo = gmodels::ci(daily_kWh)[2],
# conf_hi = ci(daily_kWh)[3]) %>%
ungroup() %>%
mutate(DOW = as.factor(daynames[as.character(DOW)])) %>%
kable(format = "html", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
full_width = FALSE)
```

DOW | mean_daily_kWh | n | sd | se | ci_lo | ci_hi |
---|---|---|---|---|---|---|

Friday | 0.79 | 3 | 0.03 | 0.02 | 0.72 | 0.86 |

Saturday | 0.72 | 1 | NA | NA | NaN | NaN |

Sunday | 0.97 | 5 | 0.22 | 0.10 | 0.70 | 1.24 |

With only 9 datapoints, it’s hard to gain much information from this. Friday and Saturday seem to be similar, while Sunday might be a little higher. If this house were unused during the week, we might have expected that Friday would be much lower than either Saturday or Sunday. However, with confidence intervals spanning all three day’s mean readings, there is no significant difference between them, and we cannot learn anything further.

Our analysis will first check that the data is suitable for making predictions. We will then predict a range of likely annual values, and finally assess whether the specification of 296 kWh per annum falls with those predictions.

In Figure 4 we were left wondering if we have skewed data from one process or symmetric data from two processes.

This question is critical to our ability to reliably predict future data. We have sampled some process, and in order to make predictions about the future, we need those samples to all come from a singular, or *homogeneous*, process. If we’re seeing samples from different, or *heterogeneous*, processes then we either have to segregate the data or give up making valid predictions.(Wheeler 2009a)

We can use an individuals and moving range (“ImR”) chart to quickly assess whether this data exhibits homogeneous properties—are we sampling from a single population of data and can make predictions about the future?

```
# Use qcc to plot the individual values and the moving range
rf_imr <- qcc(data = rf_df$daily_kWh, type = "xbar.one")
```

```
# Create the matrix used by qcc for the moving range chart.
# Column 1 is the individuals data; column 2 is the same data with
# a 1-row lag.
rf_daily_m <- matrix(c(rf_df$daily_kWh, lag(rf_df$daily_kWh)),
byrow = FALSE, ncol = 2)
# Use qcc to plot the moving range chart
rf_r <- qcc(data = rf_daily_m, type = "R", sizes = 2)
```

With no out-of-control points on either the individuals or the moving range chart, it appears that we do have a homogeneous sample, and can continue with a prediction. We should be aware that nine sample points are not really enough to make this assessment with a high degree of confidence; we should continue collecting data until we have between 20 and 30 observations in our cleaned data set (McNeese 2016). If we find that some points are outliers, we need to investigate *why* the outliers exist in order to improve our predictions.(Wheeler 2009b)

One approach to estimate annual electricity consumption is to choose a distribution and then work from the population parameter estimates (e.g. mean and standard deviation) from the sample that we have so far. Another approach is bootstrap, where we build a large bootstrap sample—that we hope is a representative of the population—from the existing samples and then estimate population parameters from the bootstrap distribution.

The bootstrap method is attractive in this case because we don’t know quite what shape the underlying distribution should have, and bootstrapping takes care of this for us. When bootstrapping, we randomly sample from the data that we have over and over to create a larger data set with the same distribution as the original set of observations. We can then use this larger data set to estimate population parameters. For instance, if the population parameters that we’re interested in relate to the distribution of the possible annual consumption, we would bootstrap 365 values from our 9 observations, add them up, repeat many times, and then calculate annual mean and other parameters.

However, since our samples do not all represent quite the same thing, and in fact are a bunch of averages over varying lengths of time, bootstrapping, while it probably can be done, may pose some special challenges.

Alternatively, we might solve this by recognizing that we’re only interested in annual numbers. By adding up many individual daily values, we can rely on the central limit theorem (CLT): the resulting annual usage estimates will be normally distributed.

Not knowing the starting distribution of daily values, though, we would have to make an assumption about the correct shape, fit the distribution parameters to the limited data, and then predict annual values, and finally rely on the CLT and the variation in the measurements to make the errors in our assumptions about the shape of the distribution irrelevant. Here, again, we will need to be careful about weighting.

Since the distribution of annual consumption predictions is likely to be normal, it may be that the specific choice of distribution of underlying *daily* usage may not have a large impact on our final analysis of annual predicted usage, so we might be safe in using a normal distribution for daily values without much loss of fidelity in the model, but that’s not immediately certain.

Let’s compare the two approaches.

Given the *proviso* above about not having enough observations to be highly confident that our data is homogeneous, we now have estimates for key population parameters.

While the ImR chart treated each data point as being of equal weight, what we have with \(daily\_kWh\) are means of observations over variable numbers of days. The grand mean should therefore be weighted by how many days’ worth of data was collected. From Figure 4, we might expect that this will slightly reduce the mean of \(daily\_kWh\) from that calculated in the individuals chart (the “center” line).

Since our estimate for the standard deviation is based on the sample-to-sample variance, it makes no sense to weight the calculation of standard deviation, and we’ll just use the estimate provided by the moving range chart.

```
# Calculate a mean and standard deviation
rf_daily_mean <- sum(rf_df$daily_kWh * as.integer(rf_df$days)) / sum(as.integer(rf_df$days))
rf_daily_sd <- rf_r$center
```

We have an average daily consumption of 0.82 kWh, with a standard deviation of 0.20 kWh.

With these population parameter estimates, we use Monte Carlo simulation to estimate annual electricity use for 1000 identical cases. We do this by simulating daily consumption for 365 days and adding to obtain an annual figure. We then repeat this process 1000 times.

```
# Use Monte Carlo simulation with these values to simulate
# 1000 identical refrigerators operating for a year
num_replicates <- 10000
rf_annual <- replicate(expr = rnorm(n = 365,
mean = rf_daily_mean,
sd = rf_daily_sd) %>%
sum(),
n = num_replicates)
head(rf_annual)
```

`## [1] 294.5972 296.1955 292.1577 294.8222 292.8706 301.2250`

```
# Population mean and standard deviation, based on the moving range
rf_annual_mean <- mean(rf_annual)
rf_annual_sd <- sd(rf_annual)
```

We can compare the simulated annual performance to the nameplate performance with a histogram.

```
rf_annual_df <- tibble(annual_total_kWh = rf_annual)
rf_annual_df %>%
ggplot(aes(x = annual_total_kWh)) +
geom_histogram(bins = 17, aes(y = ..density..)) +
geom_vline(xintercept = 296, color = "red") +
labs(x = "Annual electricity use, kWh") +
theme_linedraw() +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
```

This looks like the refrigerator is at least operating in the range of the nameplate electricity consumption. It appears to be normally distributed, as we expected from the CLT. We can test that with a test for non-normality like the Anderson-Darling test.

```
##
## Anderson-Darling normality test
##
## data: rf_annual_df$annual_total_kWh
## A = 0.23438, p-value = 0.794
```

And it does, as expected, fail to reject the null hypothesis in the lack of fit test.

To bootstrap, we use the samples that we have to create larger data sets. This is accomplished by randomly *resampling with replacement* from the existing samples. Unlike parametric methods (e.g. assuming a normal distribution), bootstrapping preserves characteristics of the sample distribution—which presumably mirrors the population distribution—while still allowing us to estimate population parameters such as mean, standard deviation, and confidence intervals, and to perform other statistical analyses such as regression.

In this case, we’ll bootstrap a year’s worth of daily consumption from our observed samples. We can then add up the daily values to obtain an estimate for the year, or estimate population statististics for the daily values.

One bootstrap provides us with a single estimate. As with the previous method, we use Monte Carlo simulation to repeat the bootstrap many times in order to estimate population parameters like mean and standard deviation.

```
# Calculate a weight for sampling
rf_df <- rf_df %>%
mutate(weight = as.integer(days) / sum(as.integer(days)))
# Set up a data frame to store results
bs_annual_df <- tibble(annual_total_kWh = rep(NA, times = num_replicates),
daily_avg_kWh = rep(NA, times = num_replicates),
daily_sd_kWh = rep(NA, times = num_replicates))
# Monte Carlo simulation num_replicates times
for(i in 1:num_replicates) {
# Bootstrap for a year's worth of daily values
bs_annual <- sample(x = rf_df$daily_kWh,
size = 365,
replace = TRUE,
prob = rf_df$weight)
# Calculate and store the cumulative total, daily mean, and daily std dev
bs_annual_df[i, "annual_total_kWh"] <- sum(bs_annual)
bs_annual_df[i, "daily_avg_kWh"] <- mean(bs_annual)
bs_annual_df[i, "daily_sd_kWh"] <- sd(bs_annual)
}
```

Having obtained our 10 000 annual estimates, we can now see how they are distributed and compare them to the refrigerator’s nameplate consumption, as we did for the normal distribution-based analysis in Figure 7: Range of likely annual consumption predicted using a normal distribution and population parameters from the ImR charts. Red line indicates nameplate annual consumption..

```
# Plot annual estimates and test for normality
bs_annual_df %>%
ggplot(aes(x = annual_total_kWh)) +
geom_histogram(bins = 25, aes(y = ..density..)) +
geom_vline(xintercept = 296, color = "red") +
labs(x = "Annual electricity use, kWh") +
scale_x_continuous(limits = range(rf_annual_df$annual_total_kWh)) +
theme_linedraw() +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
```