The Problem

Guy has been measuring his refrigerator’s electricity consumption with a power meter. He wants to know if it’s using too much electricity and should be replaced, or if it’s running as expected.

The nameplate states that the unit will consume 296 kilowatt-hours (kWh) per year.

As we work through this problem, we should think carefully about variation, the measurement process, and the relation of the data to the specification. In real-world problems, getting these right is often the difference between success and failure.

For starters, we should ask what “296 kWh” means. Is it the design target or a projection based on some test? If it’s a target, is it an upper specification (“not to exceed”), a mean, or the median? Does it mean that some percentage of the time, this model of refrigerator will consume at most 296 kWh of electricity under given test conditions (a statistical upper limit), or that the refrigerator is flat-out defective if it ever exceeds 296 kWh in a year (a go/no-go condition like those used in manufacturing)?

In this case, we could probably find out with enough digging into European Union energy efficiency standards. Short of that kind of research, we can suppose that it’s either a mean or an upper limit. If the predicted electricity consumption is greater than this value, it’s time for Guy to consider replacing his refrigerator. Our statistical tests, then, should focus on the range of possible values and whether or not that range includes the nameplate rating.

Setup

We’ll use the following libraries in this analysis.

library(tidyverse)
library(lubridate)
library(qcc)
library(ggrepel)
library(knitr)
library(kableExtra)
library(captioner)
library(RColorBrewer)
library(minpack.lm)
#library(gmodels)

The Data

This is a fairly small dataset, so we can print it in its entirety.

Table 1: The data provided by Guy.
Date kWh Comment
16/06/18 0.00 NA
29/06/18 6.32 PC
01/07/18 7.97 NA
06/07/18 12.07 NA
08/07/18 13.90 NA
13/07/18 0.40 PC
15/07/18 2.09 NA
21/07/18 6.42 NA
22/07/18 7.77 NA
27/07/18 1.29 PC
03/08/18 6.73 NA
05/08/18 8.57 NA
10/08/18 12.42 NA

The power meter reports cumulative kWh of electricity consumed. However, it resets every time there’s a power outage, and there have been several of those. The times of the power outages are not known, but the first measurement after each outage is identified with a “PC” in the \(Comment\) variable.
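
The import step isn’t shown here; a minimal sketch that recreates rf_df from Table 1 (an assumed reconstruction, with the column names the later code relies on) would look like this:

# Recreate the raw data frame shown in Table 1 (assumed reconstruction;
# the original analysis presumably read this from a file)
rf_df <- tibble::tribble(
  ~Date,      ~kWh,  ~Comment,
  "16/06/18",  0.00, NA,
  "29/06/18",  6.32, "PC",
  "01/07/18",  7.97, NA,
  "06/07/18", 12.07, NA,
  "08/07/18", 13.90, NA,
  "13/07/18",  0.40, "PC",
  "15/07/18",  2.09, NA,
  "21/07/18",  6.42, NA,
  "22/07/18",  7.77, NA,
  "27/07/18",  1.29, "PC",
  "03/08/18",  6.73, NA,
  "05/08/18",  8.57, NA,
  "10/08/18", 12.42, NA
)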

As with the specification, we need to be careful in thinking about this data. Guy’s measurement process may differ from the one intended by the manufacturer and produce somewhat different results. That difference, or bias, is unlikely to be a factor of 2, but it could certainly be 20%.

Likewise, it’s likely that the refrigerator will consume electricity at different rates over time. Warmer or cooler room temperatures will certainly alter electricity consumption, as will more or less frequent access of the refrigerator and longer or shorter durations of access. A big party would probably bump that usage, too.

The measurement process itself can also introduce variation. We have time-of-measurement only in full-day increments, but Guy may have recorded in the morning one day and in the evening the next, so that even if the rate of electricity consumption were constant, his observations would show an uneven day-to-day rate.

What this all means is that we need to use this small dataset for all that it’s worth, making sure that we have data from a single population and taking full advantage of any variation we find in the data.

EDA

The first step in any analysis is exploratory data analysis (EDA), where we examine the data’s structure, familiarize ourselves with its components, and use visualization to identify potential patterns.

Let’s start with a first look at the numeric data. Since this is cumulative data over time, it makes sense to plot it as a simple time series against the row index.

# Preliminary EDA ##
rf_df %>%
  mutate(index = row_number()) %>% 
  ggplot(aes(x = index, y = kWh)) +
  geom_point() +
  geom_label_repel(aes(label = Comment))
## Warning: Removed 10 rows containing missing values (`geom_label_repel()`).
Figure 1: Time series plot of the data.

The cumulative nature of the reported energy consumption, and the occasional resets, are both obvious.

It also appears that the consumption is not entirely linear: there is some variation in the rate of use from point to point. However, we’re looking at the data by row number rather than by date; maybe the apparent variation is just due to the irregularity of observations. After wrangling the data, we’ll look at this again.

Of note: the first data point precedes the first power outage. It seems unlikely that this will be of any use to us.

Data wrangling

To predict annual electricity usage, we need to calculate a rate that we can then annualize. We can use the given cumulative numbers to derive an average daily rate between each observation. As we don’t know when each reset point was, we have to be careful about how we use these values. For instance, we want to eliminate negative rates (“drops” in cumulative kWh due to a reset), and we can’t make assumptions about when the power outages occur (i.e. we don’t know where the ’0’s are).

# Wrangle the data ##
# Fix the dates so we can do something useful
# Calculate the kWh used between each observation
# Calculate the days between observations
rf_df <- rf_df %>%
  mutate(Date = dmy(Date)) %>% 
  mutate(diff = kWh - lag(kWh),
         Date_Start = lag(Date)) %>% 
  mutate(days = Date - Date_Start) 

Now that we’ve fixed the \(Date\) column, we can re-plot Figure 1 by date rather than by index.

rf_df %>%
  ggplot(aes(x = Date, y = kWh)) +
  geom_point()
Figure 3: Plot of cumulative electricity consumption in kWh versus date.

The apparent variation in daily energy consumption may be a little less pronounced, plotted this way, but it’s still there. With a little more work, we could have made it more obvious by creating a grouping column and plotting a linear fit with geom_smooth(method = 'lm'), but for EDA that we’re not sharing, eyeballing it is good enough. Either the refrigerator is not constant in its electricity consumption (as we expected), or the power meter is inconsistent in measurement. Possibly both are sources of variation here.
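
For reference, here is a rough sketch of that grouped view; the segment counter that restarts the fit after each reset is an assumption about how one might define the groups, not part of the original analysis.

# Sketch: group the cumulative readings into segments that restart at each
# "PC" (post-outage) row, then fit a line per segment
rf_df %>%
  mutate(segment = cumsum(!is.na(Comment) & Comment == "PC")) %>%
  ggplot(aes(x = Date, y = kWh, group = segment)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)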

We should also look at when measurements are collected, and this may tell us something.

daynames <- c('1' = "Monday",
              '2' = "Tuesday",
              '3' = "Wednesday",
              '4' = "Thursday",
              '5' = "Friday",
              '6' = "Saturday",
              '7' = "Sunday")
rf_df %>% 
  mutate(DOW = lubridate::wday(Date, week_start = 1)) %>% 
  group_by(DOW, Comment) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  mutate(DOW = as.factor(daynames[as.character(DOW)])) %>% 
  ggplot(aes(x = DOW, y = n, color = Comment)) +
  geom_point() +
  scale_x_discrete(breaks = daynames, labels = daynames) +
  labs(y = "Count of measurements",
       x = "Day of measurement") +
  coord_flip() +
  facet_grid(.~Comment)
## `summarise()` has grouped output by 'DOW'. You can override using the `.groups`
## argument.

We can see that all of the measurements were taken between Friday and Sunday, and that all of the power outages happened during the week, between a weekend measurement and the following Friday.

If this is the case, we might expect Fridays to show higher consumption rates than Saturday or Sunday. We should calculate a rate for kWh consumption. We have to be careful, though, since those power outages throw off our totals.

# Due to power outages, lines with "PC" have invalid $diff values, 
# so convert to NA
# Had some trouble doing this in dplyr due to the NAs...
rf_df$diff[rf_df$Comment == "PC" & !is.na(rf_df$Comment)] <- NA

# Now drop the Comment column, and then drop any rows 
# with NA values.
# This leaves us with useful observations of 
# kWh used between dates. 
# Now we calculate the average daily
# kWh consumption
rf_df <- rf_df %>% 
  select(-Comment) %>% 
  na.omit() %>% 
  mutate(daily_kWh = diff / as.integer(days))

Table 2 contains the updated, wrangled dataset.

Table 2: The data after wrangling.
Date kWh diff Date_Start days daily_kWh
2018-07-01 7.97 1.65 2018-06-29 2 days 0.82
2018-07-06 12.07 4.10 2018-07-01 5 days 0.82
2018-07-08 13.90 1.83 2018-07-06 2 days 0.92
2018-07-15 2.09 1.69 2018-07-13 2 days 0.84
2018-07-21 6.42 4.33 2018-07-15 6 days 0.72
2018-07-22 7.77 1.35 2018-07-21 1 days 1.35
2018-08-03 6.73 5.44 2018-07-27 7 days 0.78
2018-08-05 8.57 1.84 2018-08-03 2 days 0.92
2018-08-10 12.42 3.85 2018-08-05 5 days 0.77

The key variable that we’re interested in is the average daily electricity use, \(daily\_kWh\), so let’s plot it against the date of observation. Since each value is an average over some number of days, it would be nice to also indicate how many days each calculated value covers.

# Analysis ##
# Basic time series plot of the data
rf_df %>% 
  ggplot() +
  geom_point(aes(x = Date, y = daily_kWh)) +
  geom_segment(aes(x = Date_Start, y = daily_kWh, xend = Date, yend = daily_kWh)) +
  labs(x = "Date",
       y = "Average Daily Consumption, kWh")
Figure 4: The average daily kWh used by date. Lines represent the span of days over which the average is taken.

The only obvious pattern that I’m picking up here is that cumulative readings taken over more days may be showing lower rates of daily consumption.

rf_df %>% 
  mutate(days = as.integer(days)) %>%
  ggplot(aes(x = days, y = daily_kWh)) +
  geom_point()

We might have expected lower average daily consumption over longer observation intervals. First, longer observation intervals will tend toward the distribution mean, rather than the high rates of consumption that we might expect over shorter periods (see below for more on the long tail). Additionally, longer intervals (larger values of \(days\)) may indicate periods where the refrigerator was not opened, and therefore operating at minimal power draw, while shorter periods could indicate heavy use.

Beyond that, there is no general upward trend, nor any trend toward longer or shorter observation intervals over time.
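
As a rough, purely descriptive check of that apparent association (with only nine points this is suggestive at best, and the tied interval lengths will trigger a warning about the exact p-value):

# Rank correlation between observation-interval length and average daily consumption
cor.test(as.integer(rf_df$days), rf_df$daily_kWh, method = "spearman")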

Now that we have values that are directly comparable in the \(daily\_kWh\) variable, we should also look at its distribution, as this might tell us something useful about the assumptions that we can make during our analysis.

rf_df %>% 
  ggplot(aes(x = daily_kWh)) +
  geom_histogram(bins = 10)
Figure 2: Distribution of the calculated daily electricity use.

We appear to have a skewed distribution with a long tail toward higher values, or possibly two separate distributions (i.e. this may be a bimodal distribution) with one distribution centered near 0.8 and the other somewhere above 1.2 kWh/day.

We might expect to see two distributions if there are really two processes at play.

At the same time, with only common-cause variation in usage and measurement, we would expect to see high days and low days. While the low days must be bound by 0 kWh/day, the upper bound is determined only by the maximum power use of the compressor and lights. The distribution of daily kWh should be skewed with a long tail toward higher numbers, and we could be seeing that here.

We should also have a look at the days when measurements were taken (thanks to Guy for this suggestion), to see if there’s anything interesting or suggestive in the data.

rf_df %>% 
  mutate(DOW = lubridate::wday(Date, week_start = 1)) %>% 
  group_by(DOW) %>% 
  summarise(mean_daily_kWh = mean(daily_kWh), 
            n = n(),
            sd = sd(daily_kWh)) %>% 
  mutate(se = sd / sqrt(n),
         ci_lo = mean_daily_kWh - qt(1 - (0.05 / 2), n - 1) * se,
         ci_hi = mean_daily_kWh + qt(1 - (0.05 / 2), n - 1) * se) %>% 
  # mutate(conf_lo = gmodels::ci(daily_kWh)[2],
  #        conf_hi = ci(daily_kWh)[3]) %>% 
  ungroup() %>% 
  mutate(DOW = as.factor(daynames[as.character(DOW)])) %>% 
  kable(format = "html", digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"),
                full_width = FALSE)
DOW mean_daily_kWh n sd se ci_lo ci_hi
Friday 0.79 3 0.03 0.02 0.72 0.86
Saturday 0.72 1 NA NA NaN NaN
Sunday 0.97 5 0.22 0.10 0.70 1.24

With only 9 data points, it’s hard to gain much information from this. Friday and Saturday seem to be similar, while Sunday might be a little higher. If this house were unused during the week, we might have expected Friday to be much lower than either Saturday or Sunday. However, with confidence intervals spanning all three days’ mean readings, there is no significant difference between them, and we cannot learn anything further.

Analysis and Results

Our analysis will first check that the data is suitable for making predictions. We will then predict a range of likely annual values, and finally assess whether the specification of 296 kWh per annum falls within that range.

Data homogeneity

In Figure 2 we were left wondering whether we have skewed data from one process or symmetric data from two processes.

This question is critical to our ability to reliably predict future data. We have sampled some process, and in order to make predictions about the future, we need those samples to all come from a single, or homogeneous, process. If we’re seeing samples from different, or heterogeneous, processes, then we either have to segregate the data or give up making valid predictions (Wheeler 2009a).

We can use an individuals and moving range (“ImR”) chart to quickly assess whether this data is homogeneous, that is, whether we are sampling from a single population and can therefore make predictions about the future.

# Use qcc to plot the individual values and the moving range
rf_imr <- qcc(data = rf_df$daily_kWh, type = "xbar.one")
Figure 5: Individuals chart for the daily consumption data.

# Create the matrix used by qcc for the moving range chart.
# Column 1 is the individuals data; column 2 is the same data with
# a 1-row lag.
rf_daily_m <- matrix(c(rf_df$daily_kWh, lag(rf_df$daily_kWh)), 
                     byrow = FALSE, ncol = 2)
# Use qcc to plot the moving range chart
rf_r <- qcc(data = rf_daily_m, type = "R", sizes = 2)
Figure 6: Moving range chart for the daily consumption data.

With no out-of-control points on either the individuals or the moving range chart, it appears that we do have a homogeneous sample, and we can continue with a prediction. We should be aware that nine sample points are not really enough to make this assessment with a high degree of confidence; we should continue collecting data until we have between 20 and 30 observations in our cleaned data set (McNeese 2016). If we find that some points are outliers, we need to investigate why the outliers exist in order to improve our predictions (Wheeler 2009b).

Approach to model-building

One approach to estimating annual electricity consumption is to choose a distribution and then work from population parameter estimates (e.g. mean and standard deviation) derived from the sample that we have so far. Another approach is bootstrapping, where we build a large bootstrap sample (one that we hope is representative of the population) from the existing observations and then estimate population parameters from the bootstrap distribution.

The bootstrap method is attractive in this case because we don’t know quite what shape the underlying distribution should have, and bootstrapping takes care of this for us. When bootstrapping, we randomly sample from the data that we have over and over to create a larger data set with the same distribution as the original set of observations. We can then use this larger data set to estimate population parameters. For instance, if the population parameters that we’re interested in relate to the distribution of the possible annual consumption, we would bootstrap 365 values from our 9 observations, add them up, repeat many times, and then calculate annual mean and other parameters.
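
As a toy illustration, and ignoring for the moment the weighting issue discussed next, a single bootstrap replicate of a year’s consumption might look like this:

# One unweighted bootstrap replicate: resample 365 daily values from the
# nine observations, with replacement, and sum them into one annual total
set.seed(42)  # arbitrary seed, just to make this example reproducible
sum(sample(rf_df$daily_kWh, size = 365, replace = TRUE))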

However, since our samples do not all represent quite the same thing, and in fact are a bunch of averages over varying lengths of time, bootstrapping, while it probably can be done, may pose some special challenges.

Alternatively, we might solve this by recognizing that we’re only interested in annual numbers. By adding up many individual daily values, we can rely on the central limit theorem (CLT): the resulting annual usage estimates will be normally distributed.
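
A quick sanity check of that reasoning, using an exponential distribution purely because it is strongly skewed (the rate is chosen only to put the mean near the observed daily usage):

# Toy CLT demonstration: annual sums of 365 skewed daily draws look roughly normal
annual_sums <- replicate(1000, sum(rexp(365, rate = 1 / 0.8)))
hist(annual_sums, xlab = "Simulated annual total, kWh (illustrative only)", main = "")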

Not knowing the starting distribution of daily values, though, we would have to assume a shape, fit the distribution’s parameters to our limited data, and then predict annual values, relying on the CLT and the variation in the measurements to make any errors in our assumed shape irrelevant. Here, again, we will need to be careful about weighting.

Since the distribution of annual consumption predictions is likely to be normal, the specific choice of distribution for the underlying daily usage may not have a large impact on our final analysis of predicted annual usage. We might therefore be safe using a normal distribution for daily values without much loss of fidelity in the model, but that’s not immediately certain.

Let’s compare the two approaches.

Modeling from a distribution

Population parameter estimates

With the proviso above that we don’t yet have enough observations to be highly confident our data is homogeneous, we can now estimate the key population parameters.

While the ImR chart treated each data point as being of equal weight, what we have with \(daily\_kWh\) are means of observations over variable numbers of days. The grand mean should therefore be weighted by how many days’ worth of data was collected. From Figure 4, we might expect that this will slightly reduce the mean of \(daily\_kWh\) from that calculated in the individuals chart (the “center” line).

Since our estimate of the standard deviation is based on the sample-to-sample variation (the moving range), it makes no sense to weight that calculation, and we’ll just use the sigma estimate provided by the moving range chart.

# Calculate the days-weighted mean and use the moving-range-based sigma
# estimate from the qcc R chart (consistent with the 0.20 kWh quoted below)
rf_daily_mean <- sum(rf_df$daily_kWh * as.integer(rf_df$days)) / sum(as.integer(rf_df$days))
rf_daily_sd <- rf_r$std.dev

We have an average daily consumption of 0.81 kWh, with a standard deviation of 0.20 kWh.

Simulating a year, over and over

With these population parameter estimates, we use Monte Carlo simulation to estimate annual electricity use for 10,000 identical cases. We do this by simulating daily consumption for 365 days and adding it up to obtain an annual figure. We then repeat this process 10,000 times.

# Use Monte Carlo simulation with these values to simulate
# 10,000 identical refrigerators operating for a year
num_replicates <- 10000

rf_annual <- replicate(expr = rnorm(n = 365, 
                                    mean = rf_daily_mean, 
                                    sd = rf_daily_sd) %>% 
                         sum(), 
                       n = num_replicates)

head(rf_annual)
## [1] 304.4316 298.4617 304.9896 298.7005 297.3876 299.0656
# Mean and standard deviation of the simulated annual totals
rf_annual_mean <- mean(rf_annual)
rf_annual_sd <- sd(rf_annual)

We can compare the simulated annual performance to the nameplate performance with a histogram.

rf_annual_df <- tibble(annual_total_kWh = rf_annual)
rf_annual_df %>% 
  ggplot(aes(x = annual_total_kWh)) +
  geom_histogram(bins = 17, aes(y = ..density..)) +
  geom_vline(xintercept = 296, color = "red") +
  labs(x = "Annual electricity use, kWh") +
  theme_linedraw() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
Figure 7: Range of likely annual consumption predicted using a normal distribution and population parameters from the ImR charts. Red line indicates nameplate annual consumption.

This looks like the refrigerator is at least operating in the range of the nameplate electricity consumption. The simulated annual totals appear to be normally distributed, as we expected from the CLT. We can check this with a test for non-normality, such as the Anderson-Darling test.

nortest::ad.test(rf_annual_df$annual_total_kWh)
## 
##  Anderson-Darling normality test
## 
## data:  rf_annual_df$annual_total_kWh
## A = 0.80202, p-value = 0.03788

With a p-value of about 0.04, the result is marginal: strictly speaking, we would reject the null hypothesis of normality at the 5% level. Since these annual totals are sums of normal draws, and are therefore normal by construction, a p-value this small is most likely just the roughly 1-in-25 chance of a false alarm; visually, the distribution is as close to normal as the CLT leads us to expect.

Bootstrap

Bootstrapping with Monte Carlo simulation

To bootstrap, we use the samples that we have to create larger data sets. This is accomplished by randomly resampling with replacement from the existing samples. Unlike parametric methods (e.g. assuming a normal distribution), bootstrapping preserves characteristics of the sample distribution—which presumably mirrors the population distribution—while still allowing us to estimate population parameters such as mean, standard deviation, and confidence intervals, and to perform other statistical analyses such as regression.

In this case, we’ll bootstrap a year’s worth of daily consumption from our observed samples. We can then add up the daily values to obtain an estimate for the year, or estimate population statistics for the daily values.

One bootstrap provides us with a single estimate. As with the previous method, we use Monte Carlo simulation to repeat the bootstrap many times in order to estimate population parameters like mean and standard deviation.

# Calculate a weight for sampling
rf_df <- rf_df %>% 
  mutate(weight = as.integer(days) / sum(as.integer(days)))

# Set up a data frame to store results
bs_annual_df <- tibble(annual_total_kWh = rep(NA, times = num_replicates),
                       daily_avg_kWh = rep(NA, times = num_replicates),
                       daily_sd_kWh = rep(NA, times = num_replicates))

# Monte Carlo simulation num_replicates times
for(i in 1:num_replicates) {
  # Bootstrap for a year's worth of daily values
  bs_annual <- sample(x = rf_df$daily_kWh, 
                      size = 365, 
                      replace = TRUE, 
                      prob = rf_df$weight)
  # Calculate and store the cumulative total, daily mean, and daily std dev
  bs_annual_df[i, "annual_total_kWh"] <- sum(bs_annual)
  bs_annual_df[i, "daily_avg_kWh"] <- mean(bs_annual)
  bs_annual_df[i, "daily_sd_kWh"] <- sd(bs_annual)
}

Examining the bootstrap

Having obtained our 10,000 annual estimates, we can now see how they are distributed and compare them to the refrigerator’s nameplate consumption, as we did for the normal distribution-based analysis in Figure 7.

# Plot annual estimates and test for normality
bs_annual_df %>% 
  ggplot(aes(x = annual_total_kWh)) +
  geom_histogram(bins = 25, aes(y = ..density..)) +
  geom_vline(xintercept = 296, color = "red") +
  labs(x = "Annual electricity use, kWh") +
  scale_x_continuous(limits = range(rf_annual_df$annual_total_kWh)) +
  theme_linedraw() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
Figure 8: Range of likely annual consumption predicted from bootstrapping. Red line indicates nameplate annual consumption.

nortest::ad.test(bs_annual_df$annual_total_kWh)
## 
##  Anderson-Darling normality test
## 
## data:  bs_annual_df$annual_total_kWh
## A = 2.3113, p-value = 7.48e-06

Visually, this also appears close to a normal distribution, but this time the Anderson-Darling test clearly rejects normality. Perhaps this is because the bootstrap resamples from only nine distinct daily values; the resulting annual totals are sums from a discrete distribution, and with 10,000 replicates the test is sensitive enough to detect even that small departure from normality.

The bootstrap distribution also has a narrower spread (smaller standard deviation) than the previous model.

Comparison

Table 4: Mean and standard deviation of the predicted annual consumption, in kWh, for each method.
method mean_annual_kWh stdv_annual_kWh
Bootstrap 297 2.2
Distribution 297 3.8
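
The plotting code below refers to a combined data frame, comp_df, whose construction is not shown; a plausible reconstruction, assuming the column names used below, is:

# Combine both sets of simulated annual totals into one long data frame
# (assumed reconstruction; comp_df is not defined elsewhere in this post)
comp_df <- bind_rows(
  tibble(method = "Distribution", annual_total_kWh = rf_annual),
  tibble(method = "Bootstrap", annual_total_kWh = bs_annual_df$annual_total_kWh)
)
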
# Select colors that are colorblind-safe and print-friendly
# from the ColorBrewer palettes
colors <- RColorBrewer::brewer.pal(n = 3, name = "Dark2")[c(1,3)]

# Plot histogram of each distribution overlapping, in the
# selected colors
comp_df %>% 
  ggplot(aes(x = annual_total_kWh, fill = method)) +
  geom_histogram(bins = 18, position = "identity", alpha = 0.5) +
  scale_fill_manual(values = colors) +
  geom_vline(xintercept = 296, color = "red") +
  labs(x = "Annual electricity use, kWh") +
  theme_linedraw() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
Figure 9: Range of likely annual consumption predicted by each method. Red line indicates nameplate annual consumption.

This difference in distributions is entirely due to our choice to use the moving range to estimate the population standard deviation for the normal distribution-based Monte Carlo simulation. That choice was entirely valid and correct, as were our decisions in bootstrapping. We therefore have two slightly different predictions and no statistical or algorithmic means of selecting between them. In both cases, our estimates should improve with more data, and we should see the two sets of estimates converge.

Which method we choose to base our recommendations on depends on personal or organizational preferences. Having made this choice, we should be consistent in future modeling and stick with it, so that we make decisions based on the same methods and criteria.

Tolerance intervals

Most newly-minted data scientists are more familiar with the concept of confidence intervals than tolerance intervals.

Confidence intervals are a means of indicating ranges of uncertainty in our estimates of certain population parameters, such as the population mean. When we want a point estimate for a population parameter, it’s good practice to also report a confidence interval to indicate how good (or poor) our estimate is.
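
For example, a conventional (unweighted) 95% confidence interval for the mean daily consumption can be computed directly from the nine observations:

# 95% confidence interval for the mean of daily_kWh (unweighted, for illustration)
t.test(rf_df$daily_kWh)$conf.int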

Tolerance intervals are estimates of the range of possible future values that we can expect from a process, and come with a confidence level. A very typical tolerance interval is one of 99.7% with a 50% confidence level: with a confidence of 50%, we predict that 99.7% of future values will fall within the tolerance range.

Since we don’t know what our 296 kWh specification means, we can at least say that it should fall either within or above the range of future observed consumption, and this points us toward using tolerance intervals.

Using the above, we can create tolerance intervals to bracket the consumption values that we will likely see over a year’s time. We will construct a 2-sigma, or 95%, tolerance interval with a 50% confidence level (Wheeler 2016). These can be used to assess whether the range of predicted performance is different than the nameplate performance.
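
One note on the code below: it uses a multiplier named sigmas that is not defined anywhere else in the post; for the 2-sigma interval just described, it would simply be:

# Tolerance-interval multiplier: 2-sigma, roughly 95% coverage
sigmas <- 2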

# Prediction interval lower and upper
comp_agg_df <- comp_df %>% 
  group_by(method) %>% 
  summarise(annual_mean = mean(annual_total_kWh),
            annual_sd = sd(annual_total_kWh)) %>% 
  mutate(ltl = annual_mean - sigmas * annual_sd,
         utl = annual_mean + sigmas * annual_sd) %>% 
  select(method, ltl, utl) 
comp_agg_df %>% 
  kable(format = "html", digits = c(NA, 0, 0)) %>% 
  kable_styling(full_width = FALSE)
method ltl utl
Bootstrap 293 302
Distribution 290 305

Electricity being relatively cheap, and refrigerators being relatively expensive, we will take the more conservative (wider) estimate from the “Distribution” model.

Conclusion

Based on the provided data, we can expect annual electricity use to fall between 290 kWh and 305 kWh. With a nameplate consumption of 296 kWh per annum, we can say that the refrigerator very likely consumes about the same as the nameplate. Guy should not consider replacing his refrigerator.

We have noted several assumptions in the above analysis and need more data to test them, so we also recommend continuing to collect data and repeating the analysis when we have 20 to 30 observations, not including the first points after power outages. This will allow us to determine whether we have homogeneous data and, if we do, to obtain greater fidelity in our estimates of the population tolerance intervals, regardless of the method used to determine them.

Reflection

We have performed a complete data analysis here. While we worked with something of a “toy” dataset, we went through many of the steps that data scientists and analysts perform every day on real problems. The data as it came to us was of limited direct use: dates stored as text rather than as date data, and energy use reported cumulatively with occasional resets. We wrangled the data into a more usable form without adding assumptions that altered it.

We used EDA throughout, even before wrangling, to develop an understanding of the patterns and trends in the data and to guide our analysis.

Finally, we confirmed the apparent trends and put numbers to our analysis through model-building and statistical comparison. We tried two different approaches to model-building, first building a model by assuming a normal distribution, and then, to avoid the problems inherent with a parametric approach, tried bootstrapping. Both produced similar results, and with so little data to start with there was no clear way to choose among them.

In the interest of erring on the side of spending less, we chose the wider distribution and concluded that there was no justification for replacing the refrigerator.

We could have taken this a step further and developed a cost model, calculating the relative cost of operation per kWh against the cost of replacement with a newer, more efficient model, and made our decision wholly on an economic basis. We might also have added in a subjective social cost factor associated with CO2 emissions or other factors. This seems like overkill in this case, but in a business environment all decisions are fundamentally economic ones, and in most cases we would be justified in taking this extra step.

References

Alathea, Letaw. 2015. Captioner: Numbers Figures and Creates Simple Captions. https://CRAN.R-project.org/package=captioner.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.
McNeese, Bill. 2016. “How Much Data Do I Need to Calculate Control Limits?” BPI Consulting. https://www.spcforexcel.com/knowledge/control-chart-basics/how-much-data-do-i-need-calculate-control-limits.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Scrucca, Luca. 2004. “Qcc: An r Package for Quality Control Charting and Statistical Process Control.” R News 4/1: 11–17. https://cran.r-project.org/doc/Rnews/.
Slowikowski, Kamil. 2018. Ggrepel: Automatically Position Non-Overlapping Text Labels with ’Ggplot2’. https://CRAN.R-project.org/package=ggrepel.
Wheeler, Donald J. 2009a. “The Four Questions of Data Analysis.” Quality Digest. https://www.qualitydigest.com/inside/twitter-ed/four-questions-data-analysis.html.
———. 2009b. “All Outliers Are Evidence.” Quality Digest, May. https://www.qualitydigest.com/magazine/2009/may/department/all-outliers-are-evidence.html.
———. 2016. “Statistical Tolerance Intervals.” Quality Digest. https://www.qualitydigest.com/inside/statistics-column/010416-statistical-tolerance-intervals.html.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
———. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.
Zhu, Hao. 2018. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.