The data set under consideration is the usmelec data from the fpp package. It comprises 486 monthly values of net electricity generation, in billions of kilowatt hours, measured from January 1973 through June 2013. Because electricity cannot be stored long-term, generation is a very good proxy for demand or consumption, so we may refer to demand rather than the supply that was actually measured.
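As a quick sanity check on the stated span, the sketch below builds a placeholder monthly index and confirms that 486 months starting in January 1973 end in June 2013. The step that created `Electric` is not shown in this write-up, so the three-column layout here (`X`, `time`, `value`) is only an assumption inferred from the summary output, and the values are deliberately left as `NA`.

```r
# Stand-in monthly series: the values are placeholders, only the index matters.
idx <- ts(rep(NA_real_, 486), start = c(1973, 1), frequency = 12)

first_period <- start(idx)  # c(year, month) of the first observation
last_period  <- end(idx)    # c(year, month) of the last observation

# Hypothetical reconstruction of the columns seen in summary(Electric):
# a row counter, the decimal year, and the measured value.
layout <- data.frame(X     = seq_along(idx),
                     time  = as.numeric(time(idx)),
                     value = as.numeric(idx))

print(first_period)
print(last_period)
print(head(layout, 3))
```

The decimal `time` column advances by 1/12 per row, which is why the later steps can recover the year with `floor(time)`.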
The simple question we will be attempting to answer is whether or not electrical usage meaningfully differs by season.
summary(Electric)
## X time value
## Min. : 1.0 Min. :1973 Min. :139.6
## 1st Qu.:122.2 1st Qu.:1983 1st Qu.:195.8
## Median :243.5 Median :1993 Median :261.2
## Mean :243.5 Mean :1993 Mean :259.6
## 3rd Qu.:364.8 3rd Qu.:2003 3rd Qu.:312.0
## Max. :486.0 Max. :2013 Max. :421.8
As can be expected, the long-term demand for electricity has grown over time (note that the 2013 figure covers only January through June):
Electric[, .(MeanUsage = mean(value)), keyby = floor(time)]
## floor MeanUsage
## 1: 1973 155.3380
## 2: 1974 155.8599
## 3: 1975 160.0629
## 4: 1976 170.0761
## 5: 1977 177.2872
## 6: 1978 184.1147
## 7: 1979 187.5553
## 8: 1980 190.7999
## 9: 1981 191.4978
## 10: 1982 187.0311
## 11: 1983 192.7872
## 12: 1984 201.6220
## 13: 1985 206.0833
## 14: 1986 207.5392
## 15: 1987 214.6072
## 16: 1988 225.6177
## 17: 1989 247.2622
## 18: 1990 253.1523
## 19: 1991 256.1498
## 20: 1992 256.9903
## 21: 1993 266.4327
## 22: 1994 270.6268
## 23: 1995 279.4572
## 24: 1996 287.0155
## 25: 1997 291.0144
## 26: 1998 301.6912
## 27: 1999 307.9008
## 28: 2000 316.8422
## 29: 2001 311.3871
## 30: 2002 321.5377
## 31: 2003 323.5988
## 32: 2004 330.8796
## 33: 2005 337.9518
## 34: 2006 338.7252
## 35: 2007 346.3953
## 36: 2008 343.2822
## 37: 2009 329.1942
## 38: 2010 343.7549
## 39: 2011 341.7213
## 40: 2012 337.8737
## 41: 2013 326.7323
## floor MeanUsage
However, the raw data is clearly not normally distributed (it isn’t even unimodal), as its histogram shows:
# Using the Freedman-Diaconis rule for the histogram bin width
bw <- 2 * IQR(Electric$value) / (length(Electric$value) ^ (1 / 3))
ggplot(Electric) + geom_histogram(aes(x = value), binwidth = bw)
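The Freedman-Diaconis rule sets the bin width to 2 × IQR / n^(1/3). As a rough worked example, the sketch below plugs in the rounded quartiles and extremes printed by `summary(Electric)` earlier; because those figures are rounded, the result only approximates the `bw` computed from the full data.

```r
# Quartiles, extremes and sample size as printed by summary(Electric) (rounded).
q1 <- 195.8
q3 <- 312.0
lo <- 139.6
hi <- 421.8
n  <- 486

# Freedman-Diaconis bin width: twice the IQR divided by the cube root of n.
bw_approx <- 2 * (q3 - q1) / n^(1 / 3)

# Number of bins needed to cover the observed range at that width.
n_bins <- ceiling((hi - lo) / bw_approx)

print(round(bw_approx, 2))  # roughly 29.56
print(n_bins)               # 10
```

A width of about 30 billion kWh therefore gives around ten bins across the observed range.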
Moreover, seasonality clearly exists, as a simple plot of the values against time shows:
ggplot(Electric) + geom_path(aes(x = time, y = value))
To show and address these issues with the data more clearly, some wrangling is needed, starting with adding months and seasons.
The raw time values are decimal fractions of the year; an explicit month and season are more intuitive.
Electric[, `:=`(Month = month.name[(X - 1) %% 12 + 1],
Year = floor(time))]
Seasons <- data.table(Month = month.name,
Season = factor(c(rep('Winter', 2L), rep('Spring', 3L),
rep('Summer', 3L), rep('Fall', 3L), 'Winter')))
Electric <- Seasons[Electric, on = 'Month']
Electric[, Month := factor(Month, levels = month.name)]
setcolorder(Electric, c('X', 'time', 'Season', 'Month', 'Year', 'value'))
GGPE <- ggplot(Electric)
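The month lookup relies on the row counter `X` starting at 1 for January 1973 with no missing months, so `(X - 1) %% 12 + 1` cycles through 1 to 12. The base-R sketch below checks that arithmetic on a few row numbers and restates the season assignment as a named vector; `season_of` is a hypothetical helper name used only for illustration.

```r
# A few row numbers: first row, end of first year, start of second year, last row.
X <- c(1, 12, 13, 486)

# Position within the year (1 = January, ..., 12 = December).
month_idx <- (X - 1) %% 12 + 1
months    <- month.name[month_idx]

# Same assignment as the Seasons table: Dec-Feb Winter, Mar-May Spring,
# Jun-Aug Summer, Sep-Nov Fall.
season_of <- setNames(
  c(rep("Winter", 2), rep("Spring", 3), rep("Summer", 3), rep("Fall", 3), "Winter"),
  month.name
)
seasons <- unname(season_of[months])

print(months)   # "January" "December" "January" "June"
print(seasons)  # "Winter" "Winter" "Winter" "Summer"
```

Row 486 mapping to June agrees with the stated end of the series in June 2013.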
Now that months and seasons are attached to the data, we can recreate the tables and plots grouped by season or month, which should help account for seasonality.
Electric[, .(MeanSeason = mean(value)), keyby = Season]
## Season MeanSeason
## 1: Fall 247.0618
## 2: Spring 242.8094
## 3: Summer 290.2464
## 4: Winter 258.2968
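One caveat when reading these grouped means: because the series stops in June 2013, the groups do not all contain the same number of observations. The sketch below counts rows per month and per season, assuming (as stated at the outset) 486 consecutive monthly values beginning in January 1973. The differences are small, so this is more of a bookkeeping check than a correction.

```r
# Month and season labels for 486 consecutive months starting in January.
rows   <- seq_len(486)
month  <- factor(month.name[(rows - 1) %% 12 + 1], levels = month.name)
season <- c(rep("Winter", 2), rep("Spring", 3), rep("Summer", 3),
            rep("Fall", 3), "Winter")[as.integer(month)]

per_month  <- table(month)   # January-June: 41 each, July-December: 40 each
per_season <- table(season)  # Fall 120, Spring 123, Summer 121, Winter 122

print(per_month)
print(per_season)
```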
GGPE + geom_path(aes(x = time, y = value, group = Season, color = Season))
Summer is the standout season for electrical usage, followed, interestingly, by Winter. Splitting by season also shows the upward trend in supply more clearly, and the sharp spikes in the plot have disappeared. Some smaller spikes remain, however, so we can look at the same graph broken down by month:
Electric[, .(MeanMonth = mean(value)), keyby = Month][order(-MeanMonth)]
## Month MeanMonth
## 1: July 299.3572
## 2: August 298.8606
## 3: June 272.9538
## 4: January 271.1319
## 5: December 263.5595
## 6: September 258.3069
## 7: May 249.3817
## 8: March 247.8368
## 9: October 244.1158
## 10: February 240.3275
## 11: November 238.7626
## 12: April 231.2098
ggplot(Electric, aes(x = time, y = value, group = Month, color = Month)) + geom_path()
Now we can see a clear, roughly linear growth trend within each month, with July and August the months of heaviest usage, followed by June. Interestingly, the next highest average generation (and presumably demand, since electricity cannot be stored long-term) occurs in January and December. This is probably because parts of the country heat with electricity rather than oil or natural gas, whereas almost all of the country uses electricity for air-conditioning.
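One rough way to probe how linear the growth is would be to regress the annual means tabulated earlier on the year. This is only a sketch: it uses the yearly averages rather than the per-month series in the plot, it drops 2013 because that year is incomplete, and the names `annual` and `fit` are introduced here purely for illustration.

```r
# Annual means for the complete years 1973-2012, copied from the earlier table.
# 2013 is left out because it only covers January through June.
annual <- data.frame(
  Year = 1973:2012,
  MeanUsage = c(155.3380, 155.8599, 160.0629, 170.0761, 177.2872, 184.1147,
                187.5553, 190.7999, 191.4978, 187.0311, 192.7872, 201.6220,
                206.0833, 207.5392, 214.6072, 225.6177, 247.2622, 253.1523,
                256.1498, 256.9903, 266.4327, 270.6268, 279.4572, 287.0155,
                291.0144, 301.6912, 307.9008, 316.8422, 311.3871, 321.5377,
                323.5988, 330.8796, 337.9518, 338.7252, 346.3953, 343.2822,
                329.1942, 343.7549, 341.7213, 337.8737)
)

# Ordinary least squares fit of mean monthly generation on calendar year.
fit       <- lm(MeanUsage ~ Year, data = annual)
slope     <- unname(coef(fit)["Year"])  # billions of kWh per month, per year
r_squared <- summary(fit)$r.squared

print(slope)
print(r_squared)
```

Plotting `residuals(fit)` against `annual$Year` would show whether growth flattens in the later years, which the annual table hints at after 2007.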
As a time series, this data set does not lend itself well to an x-y scatter plot, but for the purposes of demonstrating competency, below is a jittered scatter plot of values against season.
GGPE + geom_jitter(aes(x = Season, y = value, color = Month))
# Conclusion
Simple data exploration demonstrates both the increasing usage of electricity in the US over time as well as the presence of distinct seasonality. Further exploration could include investigating the relationship between usage and population, as well as between usage and the proliferation of electronic devices such as computers and smartphones.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.12.2 ggplot2_3.2.0 curl_3.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 knitr_1.23 magrittr_1.5 tidyselect_0.2.5
## [5] munsell_0.5.0 colorspace_1.4-1 R6_2.4.0 rlang_0.4.0
## [9] stringr_1.4.0 dplyr_0.8.3 tools_3.6.0 grid_3.6.0
## [13] gtable_0.3.0 xfun_0.8 withr_2.1.2 htmltools_0.3.6
## [17] assertthat_0.2.1 yaml_2.2.0 lazyeval_0.2.2 digest_0.6.20
## [21] tibble_2.1.3 crayon_1.3.4 purrr_0.3.2 glue_1.3.1
## [25] evaluate_0.14 rmarkdown_1.13 labeling_0.3 stringi_1.4.3
## [29] compiler_3.6.0 pillar_1.4.2 scales_1.0.0 pkgconfig_2.0.2