Introduction

The data set under consideration is the usmelec data from the fpp package. It comprises 486 monthly values of net electricity generation, in billions of kilowatt hours, measured from January 1973 through June 2013. Because electricity cannot be stored long-term, generation is a very good proxy for demand or consumption, so we may refer to demand even though supply is what was actually measured.

The simple question we will attempt to answer is whether electricity usage differs meaningfully by season.

summary(Electric)
##        X              time          value      
##  Min.   :  1.0   Min.   :1973   Min.   :139.6  
##  1st Qu.:122.2   1st Qu.:1983   1st Qu.:195.8  
##  Median :243.5   Median :1993   Median :261.2  
##  Mean   :243.5   Mean   :1993   Mean   :259.6  
##  3rd Qu.:364.8   3rd Qu.:2003   3rd Qu.:312.0  
##  Max.   :486.0   Max.   :2013   Max.   :421.8

As can be expected, the long-term demand for electricity has grown over time:

Electric[, .(MeanUsage = mean(value)), keyby = floor(time)]
##     floor MeanUsage
##  1:  1973  155.3380
##  2:  1974  155.8599
##  3:  1975  160.0629
##  4:  1976  170.0761
##  5:  1977  177.2872
##  6:  1978  184.1147
##  7:  1979  187.5553
##  8:  1980  190.7999
##  9:  1981  191.4978
## 10:  1982  187.0311
## 11:  1983  192.7872
## 12:  1984  201.6220
## 13:  1985  206.0833
## 14:  1986  207.5392
## 15:  1987  214.6072
## 16:  1988  225.6177
## 17:  1989  247.2622
## 18:  1990  253.1523
## 19:  1991  256.1498
## 20:  1992  256.9903
## 21:  1993  266.4327
## 22:  1994  270.6268
## 23:  1995  279.4572
## 24:  1996  287.0155
## 25:  1997  291.0144
## 26:  1998  301.6912
## 27:  1999  307.9008
## 28:  2000  316.8422
## 29:  2001  311.3871
## 30:  2002  321.5377
## 31:  2003  323.5988
## 32:  2004  330.8796
## 33:  2005  337.9518
## 34:  2006  338.7252
## 35:  2007  346.3953
## 36:  2008  343.2822
## 37:  2009  329.1942
## 38:  2010  343.7549
## 39:  2011  341.7213
## 40:  2012  337.8737
## 41:  2013  326.7323
##     floor MeanUsage
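
One caveat when reading the yearly means: the series ends in June 2013, so the 2013 figure averages only six months and is not directly comparable with the full years. The sketch below checks this by counting observations per year. It assumes the `time` column follows the usual monthly decimal-year convention (year + (month - 1) / 12), and the vector is rebuilt from that assumption rather than read from `Electric`.

```r
# Rebuild the 486 monthly time stamps (assumed decimal-year convention).
time <- 1973 + (0:485) / 12

# Count how many monthly observations fall in each calendar year.
obs_per_year <- table(floor(time))

# Years with fewer than 12 observations are not directly comparable.
partial_years <- names(obs_per_year)[as.vector(obs_per_year) < 12]
print(partial_years)
```

With the real data, the same check could be run on `Electric$time` in place of the rebuilt vector.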

However, the raw data is clearly not normally distributed (it isn't even unimodal), as its histogram shows:

# Using the Freedman-Diaconis rule for the histogram bin width: 2 * IQR / n^(1/3)
bw <- 2 * IQR(Electric$value) / (length(Electric$value) ^ (1 / 3))
GGPE + geom_histogram(aes(x = value), binwidth = bw)
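
The Freedman-Diaconis rule yields a bin width, from which a bin count follows by dividing the data range by that width. A minimal sketch, using an evenly spaced toy vector as a hypothetical stand-in for `Electric$value`, compares the hand-computed result with base R's `nclass.FD()`:

```r
# Hypothetical stand-in for Electric$value: evenly spaced toy values.
x <- seq(140, 420, by = 2)

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3).
bw <- 2 * IQR(x) / length(x)^(1 / 3)

# Number of bins implied by that width over the data range.
n_bins <- ceiling(diff(range(x)) / bw)

# Base R's nclass.FD() applies the same rule and returns a bin count.
c(bin_width = bw, bins = n_bins, nclass_FD = nclass.FD(x))
```

Passing the width to `binwidth`, as the histogram call does, avoids the rounding step that converting to a bin count introduces.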

Moreover, seasonality clearly exists, as a simple plot of the values against time shows:

GGPE + geom_path(aes(x = time, y = value))

To show and address the issues with the data more clearly, some wrangling is needed. Adding the months and seasons is a good start.

Data Wrangling

The raw time values are measured in decimal fractions of the year. It’s more intuitive to see the month and season as well.

Electric[, `:=`(Month = month.name[(X - 1) %% 12 + 1],
                 Year = floor(time))]
Seasons <- data.table(Month = month.name,
                      Season = factor(c(rep('Winter', 2L), rep('Spring', 3L),
                                 rep('Summer', 3L), rep('Fall', 3L), 'Winter')))
Electric <- Seasons[Electric, on = 'Month']
Electric[, Month := factor(Month, levels = month.name)]
setcolorder(Electric, c('X', 'time', 'Season', 'Month', 'Year', 'value'))
GGPE <- ggplot(Electric)
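
The month lookup relies on the row index `X` starting at 1 for January 1973, so `(X - 1) %% 12 + 1` cycles through 1 to 12. The sketch below isolates that arithmetic and, as an alternative, derives the month from a decimal year. Both helper names are hypothetical, and the decimal-year convention (year + (month - 1) / 12) is an assumption about the `time` column.

```r
# Row index X runs 1..486, with X = 1 corresponding to January 1973.
month_from_index <- function(X) month.name[(X - 1) %% 12 + 1]

# Alternative (assumption): recover the month from a decimal year such as
# 1973 + (month - 1) / 12. round() guards against floating-point error.
month_from_time <- function(time) month.name[round((time %% 1) * 12) + 1]

X <- c(1, 12, 13, 486)
time <- 1973 + (X - 1) / 12

month_from_index(X)
month_from_time(time)
```

The index-based version depends on the rows being complete and in order. The time-based version does not, which may matter if rows are ever filtered or reordered.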

With this, we can look at the average supply by season or month, which should help adjust for seasonality.

Graphics

Now that we have months and seasons for the data, we can recreate the plots and tables grouped by months and seasons.

Electric[, .(MeanSeason = mean(value)), keyby = Season]
##    Season MeanSeason
## 1:   Fall   247.0618
## 2: Spring   242.8094
## 3: Summer   290.2464
## 4: Winter   258.2968
GGPE + geom_path(aes(x = time, y = value, group = Season, color = Season))
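
The season table above is sorted Fall, Spring, Summer, Winter because `factor()` orders its levels alphabetically by default and `keyby` sorts by the grouping column. If calendar order is preferred, the levels can be set explicitly. A minimal base R sketch with toy values (not taken from the usmelec data):

```r
# Toy values only; these are not taken from the usmelec data.
season <- c("Winter", "Spring", "Summer", "Fall", "Winter", "Summer")
value  <- c(10, 20, 40, 25, 14, 44)

# Default factor levels are sorted alphabetically.
alphabetical <- factor(season)
levels(alphabetical)

# Explicit levels keep the order in which seasons occur through the year.
calendar <- factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))

# tapply() returns one mean per level, in level order.
season_means <- tapply(value, calendar, mean)
season_means
```

The same `levels` argument could be passed to the `factor()` call that builds the `Seasons` table, which would also control the legend order in the plots.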

Summer is the standout season for electricity usage, followed, interestingly, by Winter, and splitting by season shows the upward trend in supply more clearly. Moreover, the sharp spikes in the plot have disappeared. However, as some smaller spikes remain, we can look at the same graph using a monthly breakdown:

Electric[, .(MeanMonth = mean(value)), keyby = Month][order(-MeanMonth)]
##         Month MeanMonth
##  1:      July  299.3572
##  2:    August  298.8606
##  3:      June  272.9538
##  4:   January  271.1319
##  5:  December  263.5595
##  6: September  258.3069
##  7:       May  249.3817
##  8:     March  247.8368
##  9:   October  244.1158
## 10:  February  240.3275
## 11:  November  238.7626
## 12:     April  231.2098
ggplot(Electric, aes(x = time, y = value, group = Month, color = Month)) + geom_path()

Now we can see a clear linear trend in growth for each month, with July and August the months of heaviest usage, followed by June. Interestingly, the months with the next-highest average generation (and presumably demand, since electricity cannot be stored long-term) are January and December. This is probably because some areas of the country use electricity for heat, as opposed to oil or natural gas, whereas almost all of the country uses electricity for air-conditioning.

This data set does not lend itself that well to an x-y scatter plot, as it is a time series, but for the purposes of demonstrating competency, below is a jittered scatterplot of values against seasons.

GGPE + geom_jitter(aes(x = Season, y = value, color = Month))

Conclusion

Simple data exploration demonstrates both the increasing usage of electricity in the US over time and the presence of distinct seasonality. Further exploration could include investigating the relationship between usage and population, as well as between usage and the proliferation of electronic devices such as computers and smartphones.

SessionInfo

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.12.2 ggplot2_3.2.0     curl_3.3         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1       knitr_1.23       magrittr_1.5     tidyselect_0.2.5
##  [5] munsell_0.5.0    colorspace_1.4-1 R6_2.4.0         rlang_0.4.0     
##  [9] stringr_1.4.0    dplyr_0.8.3      tools_3.6.0      grid_3.6.0      
## [13] gtable_0.3.0     xfun_0.8         withr_2.1.2      htmltools_0.3.6 
## [17] assertthat_0.2.1 yaml_2.2.0       lazyeval_0.2.2   digest_0.6.20   
## [21] tibble_2.1.3     crayon_1.3.4     purrr_0.3.2      glue_1.3.1      
## [25] evaluate_0.14    rmarkdown_1.13   labeling_0.3     stringi_1.4.3   
## [29] compiler_3.6.0   pillar_1.4.2     scales_1.0.0     pkgconfig_2.0.2