Exponential Smoothing on Time Series Data

Use Case: Real Estate Market Analysis

In my daily job as an urban development advisor, we need to keep a close eye on real estate market price trends and understand the financial feasibility (and profitability) of a potential real estate investment in certain locations in a city, whether buying an existing property or developing a new one. Exponential Smoothing could be useful here in helping us understand the changing patterns of a neighborhood's property values, for several reasons:

  • Firstly, real estate market analysis focuses on relatively short-term results, for example from a couple of months to a year, because potential developers or buyers are looking for a good enough near-term price estimate to execute the deal. With the many data sources now available to track property prices, Exponential Smoothing is a good choice for short-term predictions.

  • Secondly, the real estate market has strong cyclical seasonality. For example, the end of summer and start of fall is usually a relatively “low-price” season, when families planning a move try to sell their properties before their kids’ school year starts. By contrast, prices in the winter and holiday season are more likely to be on the higher end, because people are less likely to sell their homes during these times, so there is less supply. Another source of seasonality is economic cycles, each of which can span one or two decades.

  • Thirdly, there are random effects in market prices that we should take into account. Looking at a year-long, week-to-week property value curve, you would find many small ups and downs, a lot of which cannot be easily explained by a predictable formula. This randomness is part of the price variation and needs to be considered in the prediction model. Exponential Smoothing handles it through the choice of the alpha parameter: generally, an alpha closer to 0 means there is a lot of randomness, so the estimate should rely more on the previous smoothed value, while an alpha closer to 1 means there is not much randomness, so the estimate stays very close to the latest observed value (a minimal sketch of this update is shown right below).
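
To make the role of alpha concrete, here is a minimal sketch of the basic exponential smoothing update, S_t = alpha * x_t + (1 - alpha) * S_(t-1). This helper is purely illustrative and is not part of the original analysis.

#A minimal illustrative helper, assuming x is a numeric vector of observations
ses <- function(x, alpha){
  s <- numeric(length(x))
  #initialize the smoothed series with the first observation
  s[1] <- x[1]
  for (t in 2:length(x)){
    #alpha near 1 follows the latest observation closely;
    #alpha near 0 leans on the previous smoothed value
    s[t] <- alpha * x[t] + (1 - alpha) * s[t-1]
  }
  s
}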

For this analysis, we could use monthly market price data for the target property over the past 3 - 5 years and train an Exponential Smoothing model on part of this dataset. The majority of the data can be used to train and validate the model - identifying the most accurate and effective alpha, beta and gamma parameters to account for randomness, trend and seasonality. The remaining data can then be held out to check the actual predictions.

As for the value of alpha, it is generally expected to be closer to 1 than to 0, because a real estate price represents the underlying value of the property, which is determined by many long-term or fixed factors, such as the environment, geographic location, access to amenities, and local economic vitality. Its fluctuation is not driven mainly by randomness. Therefore, except in the extreme situation of a sudden economic crisis, we should rely mainly on the observed historical data to make our predictions.
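
As a rough sketch of how this could look in R (hypothetical: `price_data` and its `price` column are assumed to hold 60 monthly prices for the target neighborhood and are not part of this write-up), we could fit a Holt-Winters model on the first four years and forecast the near term:

#Hypothetical monthly price data for the target neighborhood (60 months assumed)
price_ts <- ts(price_data$price, frequency = 12)
#use the first 4 years to fit the model, keeping the last year to check the forecasts
train_ts <- window(price_ts, end = c(4, 12))
price_hw <- HoltWinters(train_ts, alpha = NULL, beta = NULL, gamma = NULL)
#inspect the fitted smoothing parameter - expected to be closer to 1 than to 0
price_hw$alpha
#near-term (12-month) forecast to compare against the held-out year
predict(price_hw, n.ahead = 12)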

Use Case: Revisiting Temperature Data of Atlanta

The main idea here is to use an Exponential Smoothing model to preprocess the raw temperature data, specifically to reduce the random noise in the raw data, and then apply a change detection method (it doesn’t need to be CUSUM) to the smoothed data to understand whether summer has been ending later over the 20-year span.

Conceptually, this would be a three-step process:

  • Step 1: Prepare the data to build an Exponential Smoothing model with.
  • Step 2: Smooth the data using a good Exponential Smoothing model.
  • Step 3: Perform a change detection analysis on the smoothed data to answer the question.

Step 1: Data Preparation

set.seed(100)
#load the data
temp_data <- read.table("temps.txt", header = TRUE)
head(temp_data)
##     DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 1-Jul    98    86    91    84    89    84    90    73    82    91    93    95
## 2 2-Jul    97    90    88    82    91    87    90    81    81    89    93    85
## 3 3-Jul    97    93    91    87    93    87    87    87    86    86    93    82
## 4 4-Jul    90    91    91    88    95    84    89    86    88    86    91    86
## 5 5-Jul    89    84    91    90    96    86    93    80    90    89    90    88
## 6 6-Jul    93    84    89    91    96    87    93    84    90    82    81    87
##   X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1    85    95    87    92   105    82    90    85
## 2    87    90    84    94    93    85    93    87
## 3    91    89    83    95    99    76    87    79
## 4    90    91    85    92    98    77    84    85
## 5    88    80    88    90   100    83    86    84
## 6    82    87    89    90    98    83    87    84
#Note that for each year, the dataset has 123 days' temperature data, or 123 rows of data. 
nrow(temp_data)
## [1] 123

To analyze this data with a time series approach, it first needs to be converted into a time series object. Note that we don’t need the first column, which holds the dates.

#convert data to a vector
temp_data_vec <- as.vector(unlist(temp_data[,2:21]))

#Build a time series object from this vector 
temp_ts <- ts(temp_data_vec, start = 1996, frequency = 123)

head(temp_ts)
## [1] 98 97 97 90 89 93

Step 2: Building and Testing the Exponential Smoothing Model

Here we could apply the exponential smoothing method using a Holt-Winters model. Before building the model, let’s take a quick look at the time series data itself.

plot(temp_ts, col="light blue")

The regular ups and downs in the above graph reflect the fact that we are dealing with temperature data for the same 4-month period across 20 years, so some seasonality across years is expected and the gamma parameter should be included. Also, since we are trying to understand whether there is a trend somewhere in the data across the years, the beta parameter should be included as well. In addition, for the seasonal factor, let’s choose “multiplicative” rather than “additive”, because we are not fully confident that the seasonal effect is a simple constant addition or subtraction.
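
Before committing to the multiplicative form, a quick sanity check (a sketch added here, not part of the original write-up) is to fit both seasonal types and compare their in-sample sum of squared errors; a lower SSE would loosely favor one specification over the other.

#Fit both seasonal specifications and compare their SSE (sketch only)
hw_mult <- HoltWinters(temp_ts, seasonal = "multiplicative")
hw_add <- HoltWinters(temp_ts, seasonal = "additive")
c(multiplicative = hw_mult$SSE, additive = hw_add$SSE)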

#Check the documentation of the HoltWinters() function
?HoltWinters()
## starting httpd help server ... done
#A triple Exponential Smoothing Model
temp_HW <- HoltWinters(temp_ts, alpha = NULL, beta = NULL, gamma = NULL, seasonal = "multiplicative")
#Take a look at the model summary
summary(temp_HW)
##              Length Class  Mode     
## fitted       9348   mts    numeric  
## x            2460   ts     numeric  
## alpha           1   -none- numeric  
## beta            1   -none- numeric  
## gamma           1   -none- numeric  
## coefficients  125   -none- numeric  
## seasonal        1   -none- character
## SSE             1   -none- numeric  
## call            6   -none- call
#Let's look at the fitted values, which include the fitted series or model estimates, level values, trend and seasonal components as well. Due to page limit, the top 20 rows are presented below.
temp_HW$fitted[1:20,]
##           xhat    level        trend    season
##  [1,] 87.23653 82.87739 -0.004362918 1.0526529
##  [2,] 90.42182 82.15059 -0.004362918 1.1007422
##  [3,] 92.99734 81.91055 -0.004362918 1.1354128
##  [4,] 90.94030 81.90763 -0.004362918 1.1103378
##  [5,] 83.99917 81.93634 -0.004362918 1.0252306
##  [6,] 84.04496 81.93247 -0.004362918 1.0258379
##  [7,] 75.06333 81.90115 -0.004362918 0.9165601
##  [8,] 87.04945 81.85429 -0.004362918 1.0635250
##  [9,] 84.02220 81.82134 -0.004362918 1.0269532
## [10,] 87.06445 81.80368 -0.004362918 1.0643665
## [11,] 84.05272 81.76208 -0.004362918 1.0280709
## [12,] 88.05141 81.72617 -0.004362918 1.0774531
## [13,] 86.03178 81.69246 -0.004362918 1.0531740
## [14,] 89.92787 81.66954 -0.004362918 1.1011777
## [15,] 90.89514 81.70546 -0.004362918 1.1125326
## [16,] 90.93680 81.75907 -0.004362918 1.1123128
## [17,] 88.91874 81.78965 -0.004362918 1.0872217
## [18,] 88.89374 81.83125 -0.004362918 1.0863634
## [19,] 88.86666 81.88704 -0.004362918 1.0852925
## [20,] 89.83703 81.95824 -0.004362918 1.0961900

From the rows above, note that the trend component not only has a much smaller magnitude than the seasonal component, it also stays relatively stable and does not show much variation compared to the seasonal component.
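
One quick way to quantify this (a sketch; these values are not shown in the original output) is to look at the spread of each fitted component:

#Standard deviation of each fitted component - the trend column should be far more stable than the season column
apply(temp_HW$fitted, 2, sd)
#Range of the trend component, for reference
range(temp_HW$fitted[,"trend"])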

Let’s use the decompose() function to visualize them individually.

plot(decompose(temp_ts,type = "multiplicative"))

The above graph shows the separated influences of trend, seasonal, and random factors. Note the regular fluctuations of the seasonal factor, the gradually increasing level in the trend factor, and random patterns in the random factor.

#Plot and compare the original time series and the fitted HW model
plot(temp_HW)

The above plot shows the original data in black and the fitted results from the Holt-Winters model in red. It can be observed that in the earlier years, from 1997 to 2002, the fitted curve does not fit the raw data as well as in more recent years. One interpretation is that in the earlier years the model is constrained by the limited number of data points it has seen so far; in later years it has more historical data points to adjust its parameters against, so it generally fits the actual data better.

#Just for informational purposes, let's see what a single or double exponential smoothing model's fitted results would look like.
temp_HW1 <- HoltWinters(temp_ts, alpha = NULL, beta = FALSE, gamma = FALSE, seasonal = "multiplicative")

temp_HW2a <- HoltWinters(temp_ts, alpha = NULL, beta = NULL, gamma = FALSE, seasonal = "multiplicative")

temp_HW2b <- HoltWinters(temp_ts, alpha = NULL, beta = FALSE, gamma = NULL, seasonal = "multiplicative")

par(mfrow=c(2,2))
plot(temp_HW, main="Triple")
plot(temp_HW1, main="Single")
plot(temp_HW2a, main="Double without Seasonality")
plot(temp_HW2b, main="Double without Trend")

Note in the above graphs that the single and the double-without-seasonality models have fitted values from the very beginning - this is expected because no “seasonal” difference, here meaning year-to-year difference, was considered in these two models, so they do not need a full first cycle before producing estimates. They also seem to fluctuate within a smaller band compared with the triple and double-without-trend models. One possible explanation is that seasonality is the main driver of the fluctuations in the data.
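
For a rough numeric comparison of the four fits (a sketch; note the SSE values are not strictly comparable because the models produce fitted values over slightly different numbers of points), we could look at their in-sample SSE:

#Compare in-sample SSE across the four models fitted above (sketch only)
c(triple = temp_HW$SSE, single = temp_HW1$SSE,
  double_no_season = temp_HW2a$SSE, double_no_trend = temp_HW2b$SSE)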

For now, we’ll stick with the triple exponential smoothing model. Next, let’s obtain the smoothed data for further analysis.

#Obtain the estimated value from the Exponential Smoothing Model for every time period.
temp_smoothed_mx <- matrix(temp_HW$fitted[,1], nrow = 123)

temp_smoothed <- as.data.frame(temp_smoothed_mx)
#Assign the column headers (1997 - 2015) from the original temp data to this smoothed data
colnames(temp_smoothed) <- colnames(temp_data)[c(3:21)]

#Attach the date column
temp_smoothed$DAY <- temp_data$DAY
head(temp_smoothed)
##      X1997    X1998    X1999    X2000    X2001    X2002    X2003    X2004
## 1 87.23653 65.04516 90.29613 83.39938 87.68863 78.07509 73.10059 87.27074
## 2 90.42182 84.87634 85.44878 86.44444 84.78855 86.02384 72.13247 85.01878
## 3 92.99734 89.61560 85.65942 92.85774 88.70570 90.23022 77.77739 82.68648
## 4 90.94030 88.47600 84.80741 91.55309 86.98750 87.27931 83.52416 83.37312
## 5 83.99917 83.11178 81.14293 88.80208 81.40681 86.06745 83.86090 83.64904
## 6 84.04496 88.00054 85.21673 91.04477 81.83758 87.87757 78.93483 86.79140
##      X2005    X2006    X2007    X2008    X2009    X2010    X2011    X2012
## 1 92.29714 78.50826 81.58696 84.72917 79.51855 86.74604 93.88371 82.30605
## 2 92.85614 88.18138 88.52648 80.39548 85.65722 81.47324 87.43846 92.55001
## 3 92.33884 92.43570 86.72311 84.53380 88.31357 82.29310 90.24836 91.18746
## 4 87.29596 92.69774 83.30574 89.62822 88.56597 82.90566 93.69353 95.13130
## 5 84.25223 90.58916 84.18954 89.27001 89.12501 80.92784 90.14667 94.60910
## 6 85.75665 86.91496 82.21750 84.28967 79.32562 84.52016 88.67069 96.75445
##      X2013     X2014    X2015   DAY
## 1 84.88750 102.54643 90.07756 1-Jul
## 2 76.18707  89.57468 85.16854 2-Jul
## 3 81.46207  88.15080 82.09161 3-Jul
## 4 76.56780  87.11605 79.49314 4-Jul
## 5 75.42085  85.20682 83.69666 5-Jul
## 6 78.42462  83.25423 82.08838 6-Jul
#Take a look at the smoothed 1997 data and compare with the original raw data.
#Show the smoothed data as a green curve, the original raw data as points.
plot(temp_data$X1997)
lines(temp_smoothed$X1997,col="green")

From the above graph it is interesting to note that the smoothed data tracks the earlier part of the series (roughly the first 60 days after July 1st) more closely than the later part. This may imply that, when given only a limited number of historical data points, the exponential smoothing model tends to (or finds it easier to) fit the exact raw values, while as the number of historical data points grows, the model focuses more on the general trend and no longer tries to fit every single point.

Also take a look at the 2015 fitted and original data.

plot(temp_data$X2015)
lines(temp_smoothed$X2015,col="blue")

Here, as more historical data has been included, the model seems to have picked up the general trend correctly but follows the individual data points less closely - this is exactly what we are looking for in “smoothing”, which helps remove the noisy part of the data and reveal the baseline trend.
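
As a rough check of the noise reduction (a sketch; these values are not shown in the original write-up), we could compare the spread of the raw and smoothed 2015 series - the smoothed series should have the smaller standard deviation:

#Compare the spread of the raw and smoothed 2015 data (sketch only)
c(raw = sd(temp_data$X2015), smoothed = sd(temp_smoothed$X2015))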

Step 3: The CUSUM Analysis

With the smoothed data ready, let’s apply the CUSUM method to determine whether summer in Atlanta is ending later over the years. To do so, there are two steps:

  • Step 1: Use CUSUM to identify the unofficial end-of-summer date for each year using the smoothed data.
  • Step 2: Use CUSUM again to identify whether there is a change in these end-of-summer dates over the years.
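
For reference, the recurrence implemented in the code below for detecting a decrease (daily highs dropping below the summer mean mu) is S_t = max(0, S_(t-1) + (mu - x_t - C)), with S_1 = 0; a change is flagged on the first day t at which S_t reaches the threshold T.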

#First build a CUSUM model for the first fitted year, which is 1997, or the first column of the smoothed temperature data

#Create a new vector to store the s values
s <- rep(0,123)
#initialize the first value as 0
s[[1]] <- 0
#start with a C value of 1/2 of the standard deviation of the year 1997
sd(temp_smoothed$X1997)
## [1] 9.698415
c0 <- 0.5*sd(temp_smoothed$X1997)

#calculate the mean value of the observed proper "summer" high temperature, or the first 75 days of daily high temperatures till mid September.
mean_temp0 <- mean(temp_smoothed$X1997[1:75])

#calculate the s values by day.
#Note that we are detecting the change in a decreasing time series trend.
for (i in 2:123){
  s[[i]] <- max(0, s[[i-1]] + (mean_temp0 - temp_smoothed$X1997[i] - c0))
}
s
##   [1]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##   [7]   6.3569122   0.7277125   0.0000000   0.0000000   0.0000000   0.0000000
##  [13]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##  [19]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##  [25]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##  [31]   9.5794382  11.1429493   8.6338517   2.0091768   0.0000000   0.0000000
##  [37]   0.0000000   0.0000000   1.5534967  10.0958373  11.6500171   7.2393701
##  [43]   0.8166848   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##  [49]   0.0000000   0.0000000   0.0000000   0.0000000   2.2552507   2.5667872
##  [55]   1.8600863   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##  [61]   0.0000000   0.0000000   0.9029190   4.6028355   0.0000000   0.0000000
##  [67]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   2.0896444
##  [73]   1.8255843   0.0000000   5.8952088   3.6541254   0.0000000   0.0000000
##  [79]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##  [85]   0.0000000   0.5886107   0.0000000  10.8894314  26.3466950  46.3663100
##  [91]  64.9433113  82.7484974  84.2545548  81.7470046  73.0435032  87.1417993
##  [97]  98.8850876 105.9539415 113.2338339  91.9211392  92.5600852  91.2504245
## [103]  93.7515819  94.3456749  91.4629503  84.5541915  88.2401684 101.8331409
## [109] 120.5662720 149.8946680 174.5923693 186.6351219 187.0064757 192.2371797
## [115] 216.1244110 233.0593262 251.4704299 260.5828192 270.4964667 284.3694599
## [121] 305.2487992 324.0237981 341.5733703
#As a rule of thumb, let's first use a threshold value t of 5*sd
t0 <- 5*sd(temp_smoothed[,1])

#plot out the s values over time and the t threshold value
plot(s)
abline(h=t0, col="blue")

#identify the first day s reaches t
min(which(s>=t0))
## [1] 91

Therefore, the 91st day since July 1st, or September 29th, is identified by the CUSUM model as the unofficial end-of-summer day in 1997.
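
As a quick cross-check (a small sketch, not shown in the original output), the DAY column of the raw data can be used to look up the calendar date for the detected index directly:

#Look up the calendar date of the first day s crosses the threshold (sketch only)
temp_data$DAY[min(which(s >= t0))]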

Now let’s turn this process into a callable function that takes any specific year’s smoothed data during 1997 - 2015. Note that the smoothed data only covers 1997 - 2015: the Holt-Winters model needs at least the 1996 data as input to estimate the 1997 values, so it only starts producing fitted temperature values from 1997 onward.

#store all the s values by year in a matrix. It should have 20 columns, including the first column recording the date from July 1st.
s_all <- data.frame(matrix(ncol = 20, nrow = 123))
colnames(s_all) <- colnames(temp_data)[c(1,3:21)]

#The function
summer_end <- function(i,c,t){
  
  s_all[,i][1] <- 0
  
  temp_mean <- mean(temp_smoothed[,i][1:75])

  for (j in 2:123){
    s_all[,i][j] <- max(0, s_all[,i][j-1] + temp_mean - temp_smoothed[,i][j] - c)
  }
  
  return (min(which(s_all[,i]>=t)))
  
}

Now run the function across the years. Note that for each different year, the c value and the threshold value t will be recalculated each time based on the standard deviation of that specific year’s smoothed temperature data.

#Store the detected end of summer day for each year from 1997 to 2015.
end_summer <- rep(0,19)

for (i in 1:19){
    end_summer[[i]] <- summer_end(i,0.5*sd(temp_smoothed[,i]),5*sd(temp_smoothed[,i]))
}

end_summer
##  [1]  91  93  86  71  91  88  92  80 100  84  82  87  93  93  72  86  90  88  81

The loop above produced an “end_summer” vector storing the detected end-of-summer day index for each year. Now let’s convert it into actual dates.

#store the actual dates in a new vector
end_summer_dates <- vector(length(end_summer), mode = "list")
#assign the date
for (i in 1:19){
  end_summer_dates[[i]] <- temp_smoothed$DAY[end_summer[[i]]]
}

head(end_summer_dates)
## [[1]]
## [1] "29-Sep"
## 
## [[2]]
## [1] "1-Oct"
## 
## [[3]]
## [1] "24-Sep"
## 
## [[4]]
## [1] "9-Sep"
## 
## [[5]]
## [1] "29-Sep"
## 
## [[6]]
## [1] "26-Sep"

Plot out the detected end of summer days by year.

library(ggplot2)
#get a clean year column
years <- sub('.', '', colnames(temp_smoothed)[1:19])
years
##  [1] "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006"
## [11] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015"
#create a data frame for plotting
end_summer_df <- data.frame(cbind(year = years, days = as.numeric(end_summer)))
end_summer_df
##    year days
## 1  1997   91
## 2  1998   93
## 3  1999   86
## 4  2000   71
## 5  2001   91
## 6  2002   88
## 7  2003   92
## 8  2004   80
## 9  2005  100
## 10 2006   84
## 11 2007   82
## 12 2008   87
## 13 2009   93
## 14 2010   93
## 15 2011   72
## 16 2012   86
## 17 2013   90
## 18 2014   88
## 19 2015   81
#plot the detected end of summer day by year
ggplot(data=end_summer_df, aes(x=years, y=as.numeric(days), group=1)) +
  geom_line(color="orange",lwd=1.5)+
  geom_point()+
  geom_text(
    label=end_summer_dates, 
    nudge_x = 0.25, nudge_y = 0.25, 
    check_overlap = T
  )

The above graph shows the identified unofficial end-of-summer dates from 1997 to 2015. Let’s compare this result with the one obtained from the unsmoothed data in the previous exercise.

#Same process but using the unsmoothed data
s_all0 <- data.frame(matrix(ncol = 21, nrow = 123))
colnames(s_all0) <- colnames(temp_data)[1:21]

summer_end0 <- function(i,c,t){
  
  s_all0[,i][1] <- 0
  
  temp_mean <- mean(temp_data[,i][1:75])

  for (j in 2:123){
    s_all0[,i][j] <- max(0, s_all0[,i][j-1] + temp_mean - temp_data[,i][j] - c)
  }
  
  return (min(which(s_all0[,i]>=t)))
  
}

end_summer0 <- rep(0,20)

for (i in 2:21){
    end_summer0[[i-1]] <- summer_end0(i,0.5*sd(temp_data[,i]),5*sd(temp_data[,i]))
}

end_summer_dates0 <- vector(length(end_summer0), mode = "list")
#assign the date
for (i in 1:20){
  end_summer_dates0[[i]] <- temp_data$DAY[end_summer0[[i]]]
}

years <- sub('.', '', colnames(temp_data)[3:21])
years
##  [1] "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006"
## [11] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015"
#create a data frame for plotting
end_summer_df0 <- data.frame(cbind(year = years, 
                                   days = as.numeric(end_summer0[2:20])))

Now plot them together and compare.

library(gridExtra)
#Raw data
p1= ggplot(data=end_summer_df0, aes(x=years, y=as.numeric(days), group=1)) +
  geom_line(color="blue",lwd=1.5)+
  geom_point()+
  ylim(45, 105) +
  geom_text(
    label=end_summer_dates0[2:20], 
    nudge_x = 0.25, nudge_y = 0.25, 
    check_overlap = T
  )

p2 = ggplot(data=end_summer_df, aes(x=years, y=as.numeric(days), group=1)) +
  geom_line(color="orange",lwd=1.5)+
  geom_point()+
  ylim(45, 105)+
  geom_text(
    label=end_summer_dates, 
    nudge_x = 0.25, nudge_y = 0.25, 
    check_overlap = T
  )

grid.arrange(p1, p2, nrow = 2)

The blue line is the result from the raw data, and the orange line is the result from the smoothed data. Note that the smoothed line is quite similar to the raw line in its general pattern, especially in the earlier years, but it smooths out the drastic swings from 2012 - 2015 in the raw line. For example, August 17th is no longer identified as the 2013 summer-ending date, which is closer to the usual perception of when summer ends.

Now, to understand whether summer is ending later over the years, we might start by simply fitting a linear regression line over the above curve and checking whether the coefficient we get on year is significantly positive. Let’s try this first.

#remember we have the index of the identified summer ending day stored in a vector
end_summer
##  [1]  91  93  86  71  91  88  92  80 100  84  82  87  93  93  72  86  90  88  81
test_df <- data.frame(year = c(1:19), days = as.numeric(end_summer))
#A quick graphical analysis
scatter.smooth(x=test_df$year, y=test_df$days, main="Summer Ending Day ~ Year")

Interestingly, the fitted line in this graphical analysis seems to have a downward tendency, which would suggest summer is ending earlier rather than later. Let’s try a basic linear model.

#Basic linear model
test_linear <- lm(days ~ year, data = test_df)
summary(test_linear)
## 
## Call:
## lm(formula = days ~ year, data = test_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6632  -3.5421   0.6456   4.5719  13.1088 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  88.2807     3.5362  24.965 7.77e-15 ***
## year         -0.1544     0.3101  -0.498    0.625    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.405 on 17 degrees of freedom
## Multiple R-squared:  0.01437,    Adjusted R-squared:  -0.04361 
## F-statistic: 0.2478 on 1 and 17 DF,  p-value: 0.625

The coefficient for year is negative, but its p-value is much larger than 0.05 and the R-squared is very small, so we cannot confidently say any trend has been identified in this data.
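
As an additional, distribution-free check on the direction of the trend (a sketch, not part of the original analysis), we could run a Kendall rank correlation test between year and the detected end-of-summer day:

#Kendall rank correlation between year and detected end-of-summer day (sketch only)
cor.test(test_df$year, test_df$days, method = "kendall")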

Now, going back to the Holt-Winters model’s fitted results, we can take a closer look at the seasonal component and the gamma value.

season <- temp_HW$fitted[,4]
length(season)
## [1] 2337
plot(season)

It looks like the seasonal factor shows wider and wider fluctuations, and also somewhat higher values - this might be evidence that summer is ending later, since a multiplicative seasonal factor translates into a larger absolute effect as the baseline temperature level rises across the years.
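
As a tiny worked example of that amplification (illustrative numbers only), the same multiplicative factor of 1.10 adds more degrees on top of a higher baseline level:

#The same 1.10 seasonal factor corresponds to a larger absolute bump at a higher level (illustrative numbers)
c(level_80 = 1.10 * 80 - 80, level_90 = 1.10 * 90 - 90)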

#check the seasonality parameter gamma
temp_HW$gamma
##     gamma 
## 0.5495256

As noted in the lecture, a gamma value closer to 1 means there is not much randomness in the seasonal component, while a value closer to 0 means there is a lot of randomness. In our case, gamma at about 0.55 is slightly closer to 1 than to 0, so we could say there is some predictable seasonal factor at play in the temperature data.

scatter.smooth(x=c(1:length(season)), y=season, main="Seasonality ~ Time")

#Also run a linear regression of the seasonal factor against time.
test_season <- lm(season ~ c(1:length(season)))
summary(test_season)  
## 
## Call:
## lm(formula = season ~ c(1:length(season)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27358 -0.05990  0.01825  0.06369  0.31138 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.005e+00  3.727e-03 269.679  < 2e-16 ***
## c(1:length(season)) -8.316e-06  2.762e-06  -3.011  0.00263 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09007 on 2335 degrees of freedom
## Multiple R-squared:  0.003868,   Adjusted R-squared:  0.003441 
## F-statistic: 9.067 on 1 and 2335 DF,  p-value: 0.002631

The regression gives a slightly negative and statistically significant coefficient on time, but with a near-zero R-squared, so it does not support the idea of an increasing seasonal effect either. Finally, let’s try the CUSUM model again on the detected end-of-summer days. Note that this time we are detecting an increasing change (the ending day becoming later).

s <- rep(0,19)
#initialize the first value as 0
s[[1]] <- 0

sd(end_summer)
## [1] 7.248311
c <- 0.1*sd(end_summer)

mean_days <- mean(end_summer)

#calculate the s values by year.
#Note that this time we are detecting an increase (the end-of-summer day becoming later).
for (i in 2:19){
  s[[i]] <- max(0, s[[i-1]] + (end_summer[i] - mean_days - c))
}
s
##  [1]  0.000000  5.538327  4.076654  0.000000  3.538327  4.076654  8.614980
##  [8]  1.153307 13.691634 10.229961  4.768288  4.306614  9.844941 15.383268
## [15]  0.000000  0.000000  2.538327  3.076654  0.000000
#Let's use a smaller threshold value this time (1 standard deviation)
t <- sd(end_summer)

#plot out the s values over time and the t threshold value
plot(s)
abline(h=t, col="blue")

From the above analysis, although a few S values rise above the threshold in the middle years, they fall back toward zero by the end rather than continuing to grow, so CUSUM does not indicate a sustained increasing change either, and we cannot confidently say that summer is ending later.

Finally, it seems that our different analyses give us somewhat different conclusions. But note also that, given the rather limited number of years of data we have, even with the smoothed data removing some of the random noise in the original series, we are not in a very confident position to say that summer is actually ending later (or the trend has not yet become large enough to detect).