Describe a situation or problem from your job, everyday life, current events, etc., for which exponential smoothing would be appropriate. What data would you need? Would you expect the value of α (the first smoothing parameter) to be closer to 0 or 1, and why?
I’m a consultant for hospitals, so we look at a lot of volume-trending data. Exponential smoothing could be used to analyze month-to-month trends in hospital discharges. I could use hospital discharge data or insurance claims data to track the month of discharge and the volume of discharges in that month, and I could also look at particular diagnosis-related groups to see whether there were real increases in any particular services. I would expect alpha (the first smoothing parameter) to be closer to 1 than to 0, because hospital discharge trends are generally quite consistent month over month; with relatively little randomness in the series, the model would be more likely to treat increases in volume as actual increases rather than noise. The opposite might be true in the outpatient space, where there is much more randomness in the services patients receive, so a value closer to 0 might be better there.
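As a rough sketch of what that might look like in R (the vector name and every number below are made up purely for illustration, not real hospital data):

# Hypothetical monthly discharge counts for one hospital (fabricated for illustration)
discharges <- c(412, 405, 431, 420, 418, 440, 452, 447, 439, 455, 460, 471)
discharges_ts <- ts(discharges, frequency = 12, start = c(2014, 1))
# beta = FALSE and gamma = FALSE reduce Holt-Winters to single exponential smoothing;
# alpha = NULL lets HoltWinters() choose alpha by minimizing the squared error
discharges_hw <- HoltWinters(discharges_ts, alpha = NULL, beta = FALSE, gamma = FALSE)
discharges_hw$alpha # a value near 1 means the model leans heavily on the most recent months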
Using the 20 years of daily high temperature data for Atlanta (July through October) from Question 6.2 (file temps.txt), build and use an exponential smoothing model to help make a judgment of whether the unofficial end of summer has gotten later over the 20 years. (Part of the point of this assignment is for you to think about how you might use exponential smoothing to answer this question. Feel free to combine it with other models if you’d like to. There’s certainly more than one reasonable approach.)
rm(list = ls())
# Load temps data as time series object
temps <- read.table('temps.txt', stringsAsFactors = FALSE, header = TRUE)
head(temps)
## DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006
## 1 1-Jul 98 86 91 84 89 84 90 73 82 91 93
## 2 2-Jul 97 90 88 82 91 87 90 81 81 89 93
## 3 3-Jul 97 93 91 87 93 87 87 87 86 86 93
## 4 4-Jul 90 91 91 88 95 84 89 86 88 86 91
## 5 5-Jul 89 84 91 90 96 86 93 80 90 89 90
## 6 6-Jul 93 84 89 91 96 87 93 84 90 82 81
## X2007 X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1 95 85 95 87 92 105 82 90 85
## 2 85 87 90 84 94 93 85 93 87
## 3 82 91 89 83 95 99 76 87 79
## 4 86 90 91 85 92 98 77 84 85
## 5 88 88 80 88 90 100 83 86 84
## 6 87 82 87 89 90 98 83 87 84
# Flatten the 20 yearly columns (X1996 through X2015) into one long vector, then build a
# time series with frequency 123, the number of days in each July-October season
temps_vec <- as.vector(unlist(temps[,2:21]))
temps_ts <- ts(data = temps_vec, frequency=123, start=1996)
summary(temps_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.00 79.00 85.00 83.34 90.00 105.00
# Plot the temps time series
ts.plot(temps_ts)
# Use Holt-Winters to smooth the data
# alpha, beta, and gamma are set to NULL so the function will find the optimal values
# the data show clear seasonal variation, so set seasonal to "multiplicative"
temps_hw <- HoltWinters(temps_ts, alpha=NULL, beta=NULL, gamma=NULL, seasonal = "multiplicative")
summary(temps_hw)
## Length Class Mode
## fitted 9348 mts numeric
## x 2460 ts numeric
## alpha 1 -none- numeric
## beta 1 -none- numeric
## gamma 1 -none- numeric
## coefficients 125 -none- numeric
## seasonal 1 -none- character
## SSE 1 -none- numeric
## call 6 -none- call
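# Before plotting, it's worth printing the fitted smoothing parameters themselves
# (output omitted here; the exact values depend on the optimizer)
temps_hw$alpha
temps_hw$beta
temps_hw$gamma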
plot(temps_hw)
# It looks like some smoothing occurred. Let's look at the temps_hw$fitted object more closely, since "fitted" contains
# the smoothed series (xhat) as well as the level, trend, and seasonal components.
head(temps_hw$fitted)
## xhat level trend season
## [1,] 87.23653 82.87739 -0.004362918 1.052653
## [2,] 90.42182 82.15059 -0.004362918 1.100742
## [3,] 92.99734 81.91055 -0.004362918 1.135413
## [4,] 90.94030 81.90763 -0.004362918 1.110338
## [5,] 83.99917 81.93634 -0.004362918 1.025231
## [6,] 84.04496 81.93247 -0.004362918 1.025838
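For reference, these columns correspond to the standard multiplicative Holt-Winters updates (textbook notation with seasonal period $L = 123$; this is a sketch of the usual form, not R's exact internals):

$$
\begin{aligned}
\ell_t &= \alpha \frac{x_t}{s_{t-L}} + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta(\ell_t - \ell_{t-1}) + (1-\beta) b_{t-1} \\
s_t &= \gamma \frac{x_t}{\ell_t} + (1-\gamma) s_{t-L}
\end{aligned}
$$

Consistent with this, each row printed above satisfies xhat = (level + trend) × season.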
# Let's pull the seasonal factors out into a separate matrix, one column per fitted year; we'll add row/column names shortly
temps_hw_sf <- matrix(temps_hw$fitted[,4], nrow=123)
head(temps_hw_sf)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.052653 1.049468 1.120607 1.103336 1.118390 1.108172 1.140906
## [2,] 1.100742 1.099653 1.108025 1.098323 1.110184 1.116213 1.126827
## [3,] 1.135413 1.135420 1.139096 1.142831 1.143201 1.138495 1.129678
## [4,] 1.110338 1.110492 1.117079 1.125774 1.134539 1.126117 1.130758
## [5,] 1.025231 1.025233 1.044684 1.067291 1.084725 1.097239 1.115055
## [6,] 1.025838 1.025722 1.028169 1.042340 1.053954 1.067494 1.080203
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 1.140574 1.125438 1.122063 1.161415 1.198102 1.198910 1.243012
## [2,] 1.154074 1.142187 1.131889 1.144549 1.134661 1.153433 1.165431
## [3,] 1.156092 1.165657 1.147982 1.149459 1.135756 1.153310 1.155197
## [4,] 1.137722 1.150639 1.146992 1.142497 1.150162 1.151169 1.157751
## [5,] 1.103877 1.120818 1.133733 1.132167 1.142714 1.139244 1.112909
## [6,] 1.094312 1.102680 1.092178 1.075766 1.088547 1.082185 1.103092
## [,15] [,16] [,17] [,18] [,19]
## [1,] 1.243781 1.238435 1.300204 1.290647 1.254521
## [2,] 1.172935 1.190735 1.191956 1.219190 1.228826
## [3,] 1.157286 1.169773 1.189915 1.172309 1.169045
## [4,] 1.163844 1.159343 1.166605 1.167993 1.158956
## [5,] 1.132435 1.132045 1.145230 1.168161 1.170449
## [6,] 1.115071 1.118575 1.121598 1.134962 1.145475
tail(temps_hw_sf)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [118,] 0.9202878 0.9167266 0.9316249 0.9483276 0.9356822 0.8960745
## [119,] 0.9203791 0.8752448 0.8890989 0.9189986 0.9279178 0.8676299
## [120,] 0.9934182 0.9493421 0.9386713 0.9343247 0.9395149 0.9269720
## [121,] 1.0052839 1.0170234 0.9947869 0.9777463 0.9568681 0.9754782
## [122,] 1.0049851 1.0160082 1.0182545 1.0152111 1.0042247 1.0293585
## [123,] 0.9920411 0.9788295 0.9796735 0.9865343 0.9910570 1.0001962
## [,7] [,8] [,9] [,10] [,11] [,12]
## [118,] 0.9115292 0.8880224 0.8919086 0.9016390 0.9096762 0.9076538
## [119,] 0.8817807 0.8675880 0.8778461 0.8903499 0.9073530 0.9042270
## [120,] 0.9360466 0.8945719 0.8856237 0.8857793 0.8844382 0.8961658
## [121,] 0.9685817 0.9862772 0.9533857 0.9449564 0.9580790 0.9214616
## [122,] 0.9928965 1.0143191 1.0032329 1.0049513 1.0001026 0.9921573
## [123,] 0.9651261 0.9733806 0.9879126 0.9947735 0.9800279 0.9931484
## [,13] [,14] [,15] [,16] [,17] [,18]
## [118,] 0.9056794 0.8805521 0.8824280 0.8730733 0.8639161 0.8614441
## [119,] 0.8709238 0.8536790 0.8396531 0.8495173 0.8266791 0.8544568
## [120,] 0.8523899 0.8794887 0.8631928 0.8499712 0.7994624 0.8024431
## [121,] 0.9301216 0.9347985 0.8985886 0.8409815 0.8223158 0.8457842
## [122,] 1.0005049 0.9591606 0.9478091 0.9185031 0.8930257 0.8944247
## [123,] 1.0024638 0.9948792 0.9941488 0.9852592 0.9909004 0.9581546
## [,19]
## [118,] 0.8717307
## [119,] 0.8598048
## [120,] 0.8002607
## [121,] 0.8257107
## [122,] 0.8615051
## [123,] 0.9137532
# Pull the smoothed xhat values out into a matching matrix as well
temps_hw_smoothed <- matrix(temps_hw$fitted[,1], nrow=123)
head(temps_hw_smoothed)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 87.23653 65.04516 90.29613 83.39938 87.68863 78.07509 73.10059
## [2,] 90.42182 84.87634 85.44878 86.44444 84.78855 86.02384 72.13247
## [3,] 92.99734 89.61560 85.65942 92.85774 88.70570 90.23022 77.77739
## [4,] 90.94030 88.47600 84.80741 91.55309 86.98750 87.27931 83.52416
## [5,] 83.99917 83.11178 81.14293 88.80208 81.40681 86.06745 83.86090
## [6,] 84.04496 88.00054 85.21673 91.04477 81.83758 87.87757 78.93483
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 87.27074 92.29714 78.50826 81.58696 84.72917 79.51855 86.74604
## [2,] 85.01878 92.85614 88.18138 88.52648 80.39548 85.65722 81.47324
## [3,] 82.68648 92.33884 92.43570 86.72311 84.53380 88.31357 82.29310
## [4,] 83.37312 87.29596 92.69774 83.30574 89.62822 88.56597 82.90566
## [5,] 83.64904 84.25223 90.58916 84.18954 89.27001 89.12501 80.92784
## [6,] 86.79140 85.75665 86.91496 82.21750 84.28967 79.32562 84.52016
## [,15] [,16] [,17] [,18] [,19]
## [1,] 93.88371 82.30605 84.88750 102.54643 90.07756
## [2,] 87.43846 92.55001 76.18707 89.57468 85.16854
## [3,] 90.24836 91.18746 81.46207 88.15080 82.09161
## [4,] 93.69353 95.13130 76.56780 87.11605 79.49314
## [5,] 90.14667 94.60910 75.42085 85.20682 83.69666
## [6,] 88.67069 96.75445 78.42462 83.25423 82.08838
# Let's add the row and column names back in for ease of use. Note that the fitted values begin in 1997,
# since HoltWinters uses the first full year to initialize the model, so the columns are X1997 through X2015.
colnames(temps_hw_sf) <- colnames(temps[,3:21])
rownames(temps_hw_sf) <- temps[,1]
colnames(temps_hw_smoothed) <- colnames(temps[,3:21])
rownames(temps_hw_smoothed) <- temps[,1]
# One interesting thing we could look at is the average seasonal factor for each year and for the dataset as a whole.
meanSFAllYears <- mean(temps_hw_sf)
meanSFAllYears
## [1] 0.9954727
meanSFEachYear <- vector()
for (i in 1:ncol(temps_hw_sf)){
  meanSFEachYear[i] = mean(temps_hw_sf[,i])
}
meanSFEachYear
## [1] 1.0000000 0.9981882 0.9980547 0.9975095 0.9971271 0.9961954 0.9955108
## [8] 0.9957393 0.9956360 0.9949667 0.9948162 0.9946495 0.9943300 0.9941211
## [15] 0.9940949 0.9933907 0.9932476 0.9935215 0.9928824
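# Equivalently, base R's colMeans() collapses that loop into a single call (same numbers as above):
colMeans(temps_hw_sf)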
# We could look at the average SF in October, which I think everyone would agree is a fall month.
which(temps$DAY=="1-Oct")
## [1] 93
mean(temps_hw_sf[93:123,])
## [1] 0.8751098
# Let's also look at the average seasonal factor for the first fitted year (1997, the first column):
meanSF_firstYear <- mean(temps_hw_sf[,1])
meanSF_firstYear
## [1] 1
# Interesting. The mean seasonal factor for the first fitted year is exactly 1, which makes sense: that first year
# supplies the baseline seasonal estimates against which later years are measured.
# So, one way to define the end of summer is the day when the seasonal factor falls below the baseline SF of 1, which
# would indicate that the weather is cooling and fall has arrived. We can use CUSUM to find the point in each year at
# which that threshold is breached; the detection statistic is sketched below, followed by an implementation.
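For a decrease, the CUSUM statistic accumulates shortfall below the mean and signals once it crosses a threshold (here $\mu$ is the baseline mean, $C$ the slack term, and $T$ the detection threshold, matching the function arguments below):

$$S_t = \max\{0,\; S_{t-1} + (\mu - x_t - C)\}, \qquad \text{signal a change when } S_t \ge T$$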
cusum_decrease = function(data, mean, T, C){
  # Returns the first row index at which the decrease CUSUM crosses T, or NA if it never does
  result = NA
  cusum = 0
  for (rowCounter in 1:nrow(data)){
    current = data[rowCounter,]
    cusum = max(0, cusum + (mean - current - C))
    if (cusum >= T) {
      result = rowCounter
      break
    }
  }
  return(result)
}
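# Quick sanity check on a toy series (hypothetical numbers): ten values at the mean of 1,
# then a drop to 0.8, so with this T and C the detector should fire a few rows after row 10
toy <- matrix(c(rep(1, 10), rep(0.8, 10)), ncol = 1)
cusum_decrease(data = toy, mean = 1, T = 0.5, C = 0.05)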
# Let's see what value to use for C by defining it as 0.5x the standard deviation of the first year's seasonal factors, and let's use 3x that standard deviation for T.
C_var = sd(temps_hw_sf[,1])*0.5
T_var = sd(temps_hw_sf[,1])*3
# OK. Now let's run CUSUM on each year of data to see on which day the accumulated drop in the SF breached the threshold we set.
# I'll use the first year's mean seasonal factor (the baseline of 1) as the mean when applying CUSUM.
result_vector = vector()
for (col in 1:ncol(temps_hw_sf)){
  result_vector[col] = cusum_decrease(data = as.matrix(temps_hw_sf[,col]), mean = 1, T = T_var, C = C_var)
}
result_df = data.frame(Year = colnames(temps_hw_sf),Day = temps[result_vector,1])
result_df
## Year Day
## 1 X1997 30-Sep
## 2 X1998 1-Oct
## 3 X1999 1-Oct
## 4 X2000 1-Oct
## 5 X2001 2-Oct
## 6 X2002 2-Oct
## 7 X2003 3-Oct
## 8 X2004 3-Oct
## 9 X2005 4-Oct
## 10 X2006 4-Oct
## 11 X2007 5-Oct
## 12 X2008 5-Oct
## 13 X2009 5-Oct
## 14 X2010 5-Oct
## 15 X2011 3-Oct
## 16 X2012 3-Oct
## 17 X2013 3-Oct
## 18 X2014 4-Oct
## 19 X2015 4-Oct
From this analysis, it looks like the day fall arrives got gradually later over the 20 years, though the detected dates (September 30 through October 5) fall in a much tighter range than in the previous homework, where we ran CUSUM on the unsmoothed temperatures. That tighter range is plausibly the effect of the exponential smoothing, which removes much of the day-to-day noise before the change detection is applied.
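As a quick follow-up sketch (reusing the day indices already stored in result_vector; the years vector below is new), we could regress the detected end-of-summer day on the year, where a positive slope would support the "gotten later" reading:

years <- 1997:2015
summary(lm(result_vector ~ years))$coefficients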