R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

#Build 1 ETS model, 1 ARIMA model, 1 dynamic regression models, and 1 model where the previous forecasts are averaged. Use only 4 years of data to build your forecasts, and then estimate the models' performance on the 5th year.  Provide measures that assess the performance of your three models.  Which performed the best and why?



# Start with monthly time-serires for the past 5 years. 






rm(list = ls()) # Clear environment
gc()            # Clear unused memory
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 504516 27.0    1119618 59.8   643711 34.4
## Vcells 896726  6.9    8388608 64.0  1649229 12.6
cat("\f")       # Clear the console
require(WaveletArima)
## Loading required package: WaveletArima
## Warning: package 'WaveletArima' was built under R version 4.1.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library("feasts")
## Warning: package 'feasts' was built under R version 4.1.3
## Loading required package: fabletools
## Warning: package 'fabletools' was built under R version 4.1.3
library("seasonal")
## Warning: package 'seasonal' was built under R version 4.1.3
library("tsibble")
## Warning: package 'tsibble' was built under R version 4.1.3
## 
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library("tsibbledata")
## Warning: package 'tsibbledata' was built under R version 4.1.3
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 4.1.3
library("forecast")
## Warning: package 'forecast' was built under R version 4.1.3
## 
## Attaching package: 'forecast'
## The following objects are masked from 'package:fabletools':
## 
##     accuracy, forecast
library("fable")
## Warning: package 'fable' was built under R version 4.1.3
library("weathermetrics")
## Warning: package 'weathermetrics' was built under R version 4.1.3
library("fpp3")
## Warning: package 'fpp3' was built under R version 4.1.3
## -- Attaching packages ---------------------------------------------- fpp3 0.5 --
## v tibble    3.1.6     v lubridate 1.8.0
## v tidyr     1.2.0
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date()     masks base::date()
## x dplyr::filter()       masks stats::filter()
## x tsibble::intersect()  masks base::intersect()
## x lubridate::interval() masks tsibble::interval()
## x dplyr::lag()          masks stats::lag()
## x tsibble::setdiff()    masks base::setdiff()
## x tsibble::union()      masks base::union()
## x tibble::view()        masks seasonal::view()
library("lubridate")
library("zoo")
## Warning: package 'zoo' was built under R version 4.1.3
## 
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
## 
##     index
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library("magrittr")
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
require(kableExtra)
## Loading required package: kableExtra
## Warning: package 'kableExtra' was built under R version 4.1.3
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library("fpp3")
require(latex2exp)
## Loading required package: latex2exp
## Warning: package 'latex2exp' was built under R version 4.1.3
require(readxl)
## Loading required package: readxl
require(tseries)
## Loading required package: tseries
## Warning: package 'tseries' was built under R version 4.1.3
require(urca)
## Loading required package: urca
## Warning: package 'urca' was built under R version 4.1.3
#fpp3 wants a tsibble
#load data and then split it into the train and test on 80% 20%
#My data set only has data from 2022-2017

#US_Retail <- read.table('https://www.census.gov/retail/marts/www/adv72200.txt', 
      #              sep = '\t', header = T, strip.white = T, skip = 3)

#view(US_Retail)

#set wd
setwd("C:/Users/polhe/Downloads/")


# load data from US census website

us_retail <- read.csv("us_retail.csv")

summary(us_retail)
##    ï..Amount     
##  Min.   :380957  
##  1st Qu.:453160  
##  Median :477344  
##  Mean   :505495  
##  3rd Qu.:579633  
##  Max.   :628376
#showing a lot of NA, let me investigate more
#view(us_retail)

#name is reading in wrong so I will rename it

#check class

names(us_retail)
## [1] "ï..Amount"
#want to re-name the first column of the data set to "Year"

names(us_retail) <- c("Amount")
#Now convert the data into a ts object

us_retail$Year=seq(as.Date("2017/1/1"), as.Date("2022/12/1"), "months")
myts=us_retail%>%
  mutate(YearMonth = yearmonth(as.character(Year))) %>%
  as_tsibble(index = YearMonth)
myts%>%autoplot(Amount)

#data loaded in ,  now split it into test and training data sets

#add external regressor
p1=c(rep(0,2), rep(1, 15))
p1=c(rep(p1, 4), rep(0, 4))
myts$Pres=p1

train=myts[1:60,]
test=myts[61:nrow(myts),]

#view(test)
#view(train)

#having a weird problem where R is giving me daily data and then complaining that the daily data is missing

#convert to ts and then plot



autoplot(myts)
## Plot variable not specified, automatically selected `.vars = Amount`

myts%>%head()
## # A tsibble: 6 x 4 [1M]
##   Amount Year       YearMonth  Pres
##    <int> <date>         <mth> <dbl>
## 1 433454 2017-01-01  2017 Jan     0
## 2 433188 2017-02-01  2017 Feb     0
## 3 432443 2017-03-01  2017 Mar     1
## 4 434497 2017-04-01  2017 Apr     1
## 5 433727 2017-05-01  2017 May     1
## 6 436243 2017-06-01  2017 Jun     1
myts%>%gg_season(Amount)

myts%>%gg_lag(Amount)

#ACF plot looks like it is stationary .

myts%>%ACF(Amount)%>%autoplot()

#test for gaps, looks good. If the below code returns any data then there may be an issue in how R is reading the date logic.

tsibble::scan_gaps(myts)
## # A tsibble: 0 x 1 [?]
## # ... with 1 variable: YearMonth <mth>
#apply transformation and then test if ACF plot changed

lambda=myts%>%features(Amount, features = guerrero) |> pull(lambda_guerrero)
myts%>%autoplot(box_cox(Amount, lambda[1])) +
  labs(y = "",title = latex2exp::TeX(paste0(
         "Transformed NT Consumption with $\\lambda$ = ", round(lambda[1],2))))

myts%>%ACF(Amount)%>%autoplot()

comp=myts%>%model(stl=STL(Amount))%>%components()
comp%>%as_tsibble() %>%autoplot(Amount)+  geom_line(aes(y=trend), colour = "red")+
  geom_line(aes(y=season_adjust), colour = "blue")

comp%>%autoplot()

#models
m1=train |>model(ETS_Model=ETS(Amount))
m2=train |>model(ARIMA_Model=ARIMA(Amount))
m3=train|>model(Ensemble_Model_2=(ETS(Amount)+TSLM(Amount~trend()+fourier(K=3))+ARIMA(Amount~Pres))/3)

#m3=train|>model(Dynamic_Model=ARIMA(us_retail~Pres))
#ensemble model, that averages all models
m4=train|>model(Ensemble_Model=(ETS(Amount)+ARIMA(Amount))/2)




#view(train)


#plot
ets_plot=function(model){
  model|>forecast(test)|>autoplot(myts)+geom_line(aes(y=.fitted, col=.model),data=augment(model))+ggtitle(names(model))}

#results
report(eval(m1))
## Series: Amount 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.9644138 
## 
##   Initial states:
##      l[0]
##  433441.6
## 
##   sigma^2:  306679057
## 
##      AIC     AICc      BIC 
## 1422.105 1422.534 1428.388
report(eval(m2))
## Series: Amount 
## Model: ARIMA(0,1,0) 
## 
## sigma^2 estimated as 301727070:  log likelihood=-659.71
## AIC=1321.41   AICc=1321.48   BIC=1323.49
report(eval(m3))
## Series: Amount 
## Model: COMBINATION 
## Combination: Amount * 0.333333333333333
## 
## =======================================
## 
## Series: Amount 
## Model: COMBINATION 
## Combination: Amount + Amount
## 
## ============================
## 
## Series: Amount 
## Model: COMBINATION 
## Combination: Amount + Amount
## 
## ============================
## 
## Series: Amount 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.9644138 
## 
##   Initial states:
##      l[0]
##  433441.6
## 
##   sigma^2:  306679057
## 
##      AIC     AICc      BIC 
## 1422.105 1422.534 1428.388 
## 
## Series: Amount 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -115889  -13994   -2219   14267   53839 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         409630.637   7465.706  54.868  < 2e-16 ***
## trend()               2431.534    213.769  11.375 1.01e-15 ***
## fourier(K = 3)C1_12  -2261.445   5147.809  -0.439    0.662    
## fourier(K = 3)S1_12  -4376.535   5204.874  -0.841    0.404    
## fourier(K = 3)C2_12   2383.700   5147.809   0.463    0.645    
## fourier(K = 3)S2_12      8.661   5156.678   0.002    0.999    
## fourier(K = 3)C3_12   -592.566   5147.809  -0.115    0.909    
## fourier(K = 3)S3_12   3285.600   5147.809   0.638    0.526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28170 on 52 degrees of freedom
## Multiple R-squared: 0.7265,  Adjusted R-squared: 0.6896
## F-statistic: 19.73 on 7 and 52 DF, p-value: 1.3102e-12
## 
## 
## Series: Amount 
## Model: LM w/ ARIMA(0,1,0) errors 
## 
## Coefficients:
##            Pres
##        285.4286
## s.e.  6565.2046
## 
## sigma^2 estimated as 306919429:  log likelihood=-659.7
## AIC=1323.41   AICc=1323.62   BIC=1327.56
report(eval(m4))
## Series: Amount 
## Model: COMBINATION 
## Combination: Amount * 0.5
## 
## =========================
## 
## Series: Amount 
## Model: COMBINATION 
## Combination: Amount + Amount
## 
## ============================
## 
## Series: Amount 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.9644138 
## 
##   Initial states:
##      l[0]
##  433441.6
## 
##   sigma^2:  306679057
## 
##      AIC     AICc      BIC 
## 1422.105 1422.534 1428.388 
## 
## Series: Amount 
## Model: ARIMA(0,1,0) 
## 
## sigma^2 estimated as 301727070:  log likelihood=-659.71
## AIC=1321.41   AICc=1321.48   BIC=1323.49
#plots

myplot=function(m1){
  model|>forecast(test)|>autoplot(myts)+geom_line(aes(y=.fitted, col=.model),data=augment(model))+ggtitle(names(model))}


myplot=function(m2){
  model|>forecast(test)|>autoplot(myts)+geom_line(aes(y=.fitted, col=.model),data=augment(model))+ggtitle(names(model))}

myplot=function(m3){
  model|>forecast(test)|>autoplot(myts)+geom_line(aes(y=.fitted, col=.model),data=augment(model))+ggtitle(names(model))}


myplot=function(m4){
  model|>forecast(test)|>autoplot(myts)+geom_line(aes(y=.fitted, col=.model),data=augment(model))+ggtitle(names(model))}

#Results and compare model performance. #M1-BIC=1428.388 #M2-BIC=1323.49 #M3-BIC=1327.56 #M4-BIC=1323.49

#It looks to me like model2 has the lowest BIC score which would generally mean that it’s the model we would select. Model2 was just an ARIMA model on the time series, after it had the box-cox transform applied to it and made into a normal distrubtion. An ARIMA model is known for fitting the training data very well but not being as good as an ETS model on the testing data set. Im still trying to understand the dynamic regression model more but from looking at my intercepts, what stands out most is that only the trend and intercept are statistically significant. This suggests that my linear regression term does not play a strong role in forecasting “US Retail stores”. It stands to reason that US retail stores are probably constnatly opening and closing, with very few stores being open for years at a time and so when the economy turns down, retail stores are going to follow the overall trend in the economy.The other interesting thing to note is that my smoothing parameter was “0.9644138”, very close to 1 which suggests that there is a nearly random stochastic part of this time series when looking at overall numbers of US retail stores. Finally, It was interesting to do an ACF plot and see that my data was already stationary, I did not expect this to be stationary given it’s a real world data set but I will look into that later.