Describe a situation or problem from your job, everyday life, current events, etc., for which exponential smoothing would be appropriate. What data would you need? Would you expect the value of alpha (the first smoothing parameter) to be closer to 0 or 1, and why?
We are forecasting Operational Expenses (time series data is needed) for some of our workforce and we are using exponential smoothing functions. For expenses that are seasonal and do not have sparsity issues, the TES (Holts-Winter) does really well. For expenses that are not seasonal and do not contain trends, the SES does quite well sometimes and we let the algorithm pick the optimal alpha values based on the data. Our alpha values are between 0.5 to 0.8 in general; As alpha gets closer to one, this means more weight is given to the more recent observations
require(ggthemes)
library(tidyverse)
library(magrittr)
library(TTR)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(fpp2)
library(forecast)
library(caTools)
library(reshape2)
library(psych)
require(graphics)
library(qcc)
library(ggQC)
df <- read.csv(file="temps.csv",stringsAsFactors = F, header=T)
#Replacing Unwanted headers
df1 <- df %>% rename_at(vars(starts_with("X")),funs(str_replace(.,"X","")))
head(df1,2)
## DAY 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## 1 1-Jul 98 86 91 84 89 84 90 73 82 91 93 95 85 95
## 2 2-Jul 97 90 88 82 91 87 90 81 81 89 93 85 87 90
## 2010 2011 2012 2013 2014 2015
## 1 87 92 105 82 90 85
## 2 84 94 93 85 93 87
#make wide to long format
df2 <- melt(df1, id.vars = c("DAY"),variable.name = "YEAR",value.name="TEMP")
#head(df2,20)
df3 = df2 %>% unite(date, DAY, YEAR, sep = "-", remove = FALSE)
#df3
New_Date <- as.Date(df3$date, format="%d-%b-%Y")
#class(New_Date)
df4 <- df3 %>% mutate(date=New_Date) %>% select(c("date","TEMP"))
head(df4,5)
## date TEMP
## 1 1996-07-01 98
## 2 1996-07-02 97
## 3 1996-07-03 97
## 4 1996-07-04 90
## 5 1996-07-05 89
summary(df4$TEMP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.00 79.00 85.00 83.34 90.00 105.00
glimpse(describe(df4$TEMP))
## Observations: 1
## Variables: 13
## $ vars <dbl> 1
## $ n <dbl> 2460
## $ mean <dbl> 83.33902
## $ sd <dbl> 8.620253
## $ median <dbl> 85
## $ trimmed <dbl> 84.0747
## $ mad <dbl> 7.413
## $ min <dbl> 50
## $ max <dbl> 105
## $ range <dbl> 55
## $ skew <dbl> -0.8208976
## $ kurtosis <dbl> 0.5796116
## $ se <dbl> 0.1738011
# Histogram
hist(df4$TEMP, xlab="Temperature",col="yellow",,main="Fig1a: Histogram of Temperatures in Atlanta",probability=TRUE)
s = sd(df4$TEMP)
m = mean(df4$TEMP)
curve(dnorm(x, mean=m, sd=s), add=TRUE,col = "red", lwd = 3)
#making interactive Boxplot
fig1 <- plot_ly(df2, y = ~TEMP, x = ~YEAR, type = "box")
fig1 <- fig1 %>% layout(title = "Fig1b: Boxplots of Temperature over 20 yrs in Atlanta",
xaxis = list(title = "Year"),
yaxis = list (title = "Degree Fahrenheit")) %>% layout(plot_bgcolor='rgb(555, 666, 188)')
fig1
Observation1:
A left-skewed histogram and boxplot with low temperature outliers indicated the data set contained more lower temperatures than higher temperatures over the 20 year span
The question is whether there exist enough statistical evidence to support such a hypothesis? We shall find out….
Original plot of data
Timeseries conversion
Making Temperatures into Timeseries Object to feed into Holts-Winter’s function call
#Plot original dataset via Holts-Winters
dat_ts <- ts(df4[, 2], start = c(1996,1), frequency = 123)
plot(dat_ts)
#Fit Holt-Winters exponential smoothing model to data and show summary
fit_hw <- HoltWinters(dat_ts,seasonal="multiplicative", optim.start = c(alpha = 0.3, beta = 0.1, gamma = 0.1))
summary(fit_hw)
## Length Class Mode
## fitted 9348 mts numeric
## x 2460 ts numeric
## alpha 1 -none- numeric
## beta 1 -none- numeric
## gamma 1 -none- numeric
## coefficients 125 -none- numeric
## seasonal 1 -none- character
## SSE 1 -none- numeric
## call 4 -none- call
#Plot the fitted values overlaying original data
plot(fit_hw )
plot(fitted(fit_hw))
fit_hw$alpha
## alpha
## 0.615003
fit_hw$beta
## beta
## 0
fit_hw$gamma
## gamma
## 0.5495256
#binding dates and seasonality factors together to make dataframe
df4.shorten<-df4 %>% slice(124:2460)
test<-as.vector(fit_hw$fitted[,4])
df.XmR<- df4.shorten %>% mutate(seaonality.factor=test)%>% select(c("date","seaonality.factor"))
head(df.XmR,5)
## date seaonality.factor
## 1 1997-07-01 1.052653
## 2 1997-07-02 1.100742
## 3 1997-07-03 1.135413
## 4 1997-07-04 1.110338
## 5 1997-07-05 1.025231
19 years of seasonality factors (excluding first year)
Year 1996 was “lost” in the computation because Holts-Winters needed a full cycle (123 days in our case) to determine seasonality factors
#Grouping avg SI by year
df3.shorten<-df3 %>% slice(124:2460) %>% select(c("date","YEAR"))%>% mutate(seaonality.factor=test)
df.sf<-df3.shorten %>% group_by(YEAR) %>% summarise(Seasonality.Index=1*mean(seaonality.factor))
head(df.sf,5)
## # A tibble: 5 x 2
## YEAR Seasonality.Index
## <fct> <dbl>
## 1 1997 1
## 2 1998 0.998
## 3 1999 0.998
## 4 2000 0.998
## 5 2001 0.997
#line charts of seasonality factors
ggplot(data=df.sf, aes(x=YEAR, y=Seasonality.Index, group=1)) +geom_line(color="red")+geom_point()+ggtitle("Fig1C:Line Chart of Avg Seasonality Index over 20 Yrs in Atlanta")+theme_excel()
Observation2:
At this point, Holts-Winter decomposition and figures 1a to 1c are telling 2 different stories:
However, the decomposed trend graph from Holts-Winters indicated NO trending at all!
Holts-Winter indicated that there were some randomness (alpha=0.615003) and seasonalities (gamma=0.5495256) but No Trend; zero beta
Holts-Winter decomposed plots clearly showed flat trend and level, with strong influence of seasonalities
Plotting seasonality factors vs. control limits
95% confidence level (alpha=0.05); control limits = +/- 2 Std Errors from the mean
#mean Temp over entire 20 years
m <- matrix(fit_hw$fitted[,4],ncol=123)
temp.mean<-mean(m)#mean ~ 1.0
#median Temp over entire 20 years
temp.median<-median(m)#median ~ 1.01
#Temp Standard Deviation over entire 20 years
temp.sd<-sd(m)#std dev. ~ 0.09 => 2SD = 0.18
# Using 95% confidence level equates to control limits of +/-2 Std Dev from mean
sd.2<-2*temp.sd
mean.1.2<-sd.2+temp.mean
mean.0.8<-temp.mean-sd.2
#control charts with upper and lower limits
ggplot(data=df.XmR, aes(x=date, y=seaonality.factor, group=1))+geom_line(color='blue')+ggtitle("Fig2:Line Chart of Avg seasonality factors over 20 Yrs in Atlanta")+theme_excel()+ geom_hline(yintercept=mean.1.2, linetype="dashed", color = "red",size=1.25)+ geom_hline(yintercept=mean.0.8, linetype="dashed", color = "red",size=1.5)
Summary
- All Visual indicators and even the ploted average seasonal indices were pointing to lower temperatures
- Fig1a thru Fig1c displayed YoY downward temperature trends
- Decomposition of Time Series indicated there were indeed no trend at all; beta was computed to be zero
Conclusion
Like the saying goes, looks can be deceiving. The histogram, boxplot and line chart all were telling us the data were trending lower while in fact, Holts-Winter’s decomposition told us a different story:
- No trends either upwards or downwards were statistically evident
- The average Seasonal Index were lower YoY but the decreases were too small and not statistical strong enough
- Since NO normality assumption is needed in fitting an exponential smoothing model, see reference 1, we can apply the standard confidence intervals to check if temperature spikes were statistically significant
- Fig 2:Plotting the seaonsal factors with a 95% confidence level in the control chart, we witnessed the data were quite stationary peppered with seasonalities. Any violations against upper or lower bounds were purely due to seasonality effects
Bottomline, it is really hard to tell if unofficial end of summer has gotten later over the 20 years based on this data. i.e, temperatures getting warmer YoY. However, we discovered no statistical evidence to support downward temperature trend over 20 years. And higher temperature spikes especially in the later years of 2011 thru 2015 were driven by seasonalities
Reference 1: