# Libraries ---------------------------------------------------------------

library('tidyverse')
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library('lubridate')
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library('forecast')
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library('feasts')
## Loading required package: fabletools
## 
## Attaching package: 'fabletools'
## The following objects are masked from 'package:forecast':
## 
##     accuracy, forecast
library('fpp3') 
## -- Attaching packages -------------------------------------------- fpp3 0.4.0 --
## v tsibble     1.1.1     v fable       0.3.1
## v tsibbledata 0.4.0
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date()      masks base::date()
## x dplyr::filter()        masks stats::filter()
## x fabletools::forecast() masks forecast::forecast()
## x tsibble::intersect()   masks base::intersect()
## x tsibble::interval()    masks lubridate::interval()
## x dplyr::lag()           masks stats::lag()
## x tsibble::setdiff()     masks base::setdiff()
## x tsibble::union()       masks base::union()
library('gridExtra') 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

The data used in this time series analysis is the Boston Marathon data that comes with R's fpp3 packages. This data was chosen because my cousin recently ran the race for his seventh time, and it interested me to see how the winning times have changed for the oldest annual marathon in the world.

#Data is Boston Marathon data
boston_marathon
## # A tsibble: 265 x 5 [1Y]
## # Key:       Event [5]
##    Event                Year Champion            Country       Time    
##    <fct>               <int> <chr>               <chr>         <time>  
##  1 Men's open division  1897 John J. McDermott   United States 02:55:10
##  2 Men's open division  1898 Ronald J. MacDonald Canada        02:42:00
##  3 Men's open division  1899 Lawrence Brignolia  United States 02:54:38
##  4 Men's open division  1900 John P. Caffery     Canada        02:39:44
##  5 Men's open division  1901 John P. Caffery     Canada        02:29:23
##  6 Men's open division  1902 Sammy A. Mellor     United States 02:43:12
##  7 Men's open division  1903 John C. Lorden      United States 02:41:29
##  8 Men's open division  1904 Michael Spring      United States 02:38:04
##  9 Men's open division  1905 Frederick Lorz      United States 02:38:25
## 10 Men's open division  1906 Timothy Ford        United States 02:45:45
## # ... with 255 more rows
#Analyzing the data
summary(boston_marathon)
##                    Event          Year        Champion        
##  Men's open division  :123   Min.   :1897   Length:265        
##  Men's wheelchair     : 45   1st Qu.:1963   Class :character  
##  Women's open champion: 48   Median :1986   Mode  :character  
##  Women's pioneer era  :  6   Mean   :1978                     
##  Women's wheelchair   : 43   3rd Qu.:2003                     
##                              Max.   :2019                     
##    Country              Time         
##  Length:265         Length:265       
##  Class :character   Class1:hms       
##  Mode  :character   Class2:difftime  
##                     Mode  :numeric   
##                                      
## 
colSums(is.na(boston_marathon))
##    Event     Year Champion  Country     Time 
##        0        0        0        0        1
colSums(boston_marathon == "")
##    Event     Year Champion  Country     Time 
##        0        0        0        1       NA

The data had one NA and one blank value in the same row. Since it was just one observation, it was omitted from the data set. Another alternative would be to replace the missing time with the median time for that event, but because the hole in the data was so small, ease of coding was prioritized.
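
For reference, here is a minimal sketch of that alternative, assuming the imputation is done on a numeric seconds column; the boston_imputed name and the conversion step are illustrative only and were not part of the analysis.

#alternative (not used): impute the missing time with the median for its event
boston_imputed <- boston_marathon %>% 
  mutate(Seconds = as.numeric(Time)) %>%   #hms times are stored as seconds
  group_by(Event) %>% 
  mutate(Seconds = if_else(is.na(Seconds),
                           median(Seconds, na.rm = TRUE),
                           Seconds)) %>%   #fill the single NA with the event median
  ungroup()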

#data manipulation
boston <- na.omit(boston_marathon) %>% 
  filter(Event == "Men's open division") 

boston<- boston %>% 
  mutate(Seconds = as.numeric(seconds(Time))) 


boston <- as_tsibble(boston, index = Year) 

#creating train/test data

train <- boston %>% 
  filter(Year <= 2010) 
train 
## # A tsibble: 114 x 6 [1Y]
## # Key:       Event [1]
##    Event                Year Champion            Country       Time     Seconds
##    <fct>               <int> <chr>               <chr>         <time>     <dbl>
##  1 Men's open division  1897 John J. McDermott   United States 02:55:10   10510
##  2 Men's open division  1898 Ronald J. MacDonald Canada        02:42:00    9720
##  3 Men's open division  1899 Lawrence Brignolia  United States 02:54:38   10478
##  4 Men's open division  1900 John P. Caffery     Canada        02:39:44    9584
##  5 Men's open division  1901 John P. Caffery     Canada        02:29:23    8963
##  6 Men's open division  1902 Sammy A. Mellor     United States 02:43:12    9792
##  7 Men's open division  1903 John C. Lorden      United States 02:41:29    9689
##  8 Men's open division  1904 Michael Spring      United States 02:38:04    9484
##  9 Men's open division  1905 Frederick Lorz      United States 02:38:25    9505
## 10 Men's open division  1906 Timothy Ford        United States 02:45:45    9945
## # ... with 104 more rows
test <- boston %>% 
  filter(Year > 2010)

After the missing observation was omitted, the data was filtered to the Men's open division, which has the most observations, and the winning time was converted from the time format into a numeric number of seconds for ease of forecasting; the predictions can be converted back to the same time format afterwards. More observations were needed to split the data into training and test sets, and a split data set was preferred here because running cross validation on the types of models used took too long to produce a result. Finally, the data was kept as a tsibble indexed by Year and split into a training set (through 2010) and a test set (2011 onward) for forecasting.
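
As a quick illustration of converting a prediction back, lubridate's seconds_to_period() turns a number of seconds into a readable time; the 10510 below is just the 1897 winning time from the table above.

#example: converting a number of seconds back into a readable time
seconds_to_period(10510) #2H 55M 10S, i.e. 02:55:10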

boston_marathon %>% 
  ggplot(aes(Year))+
  geom_line(aes(y=Time, color = Event)) + 
  labs(title = "Boston Marathon Times")

train %>% 
  ggplot(aes(x=Year, y=Seconds)) +
  geom_line() +
  labs(title = "Boston Marathon Times: Men's Open Division", 
       y = "Time in Seconds") 

#Trying a box cox to see how it affects the variation of the data
#getting the lambda
lambda <- train %>% 
  features(Seconds, features = guerrero) %>% 
  pull(lambda_guerrero) 
#graphing 
train %>% 
  autoplot(box_cox(Seconds, lambda))+
  labs(y = "",
       title = paste0(
         "Transformed Marathon times with lambda =  ",
         round(lambda,2))) 

#the variation in the data didn't seem to change much

#Decomposing for better view 
train %>% 
  model(STL(Seconds)) %>% 
  components() %>% 
  autoplot()

#no seasonality 

The exploratory graphing reveals some interesting information about the data. There seems to be a lot of variation in the series, so a Box-Cox transformation was tried, using the Guerrero method to find the lambda. Since the transformation did not change the variation much, it was not incorporated. The STL decomposition shows no seasonal pattern, which helps explain why the transformation didn't provide enough additional value to be used.

trainmodels <- train %>% 
  model("ETS" = ETS(Seconds),
        "Arima" = ARIMA(Seconds),
        "Neural" = NNETAR(Seconds))

trainmodels %>% 
  forecast(h=1) %>% 
  accuracy(test)
## Warning: 1 error encountered
## [1] subscript out of bounds
## 
## 1 error encountered
## [1] subscript out of bounds
## 
## 1 error encountered
## [1] subscript out of bounds
## # A tibble: 3 x 11
##   .model Event             .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <fct>             <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima  Men's open divis~ Test  -256.  256.  256. -3.46  3.46   NaN   NaN    NA
## 2 ETS    Men's open divis~ Test  -312.  312.  312. -4.23  4.23   NaN   NaN    NA
## 3 Neural Men's open divis~ Test  -355.  355.  355. -4.81  4.81   NaN   NaN    NA

Three models were fit to find the most accurate one: ETS, ARIMA, and a neural network (NNETAR). All models used their automatic specification, so the best form of each could be chosen automatically (the lowest AICc for the ETS and ARIMA models). The ARIMA model proved to be the most accurate on the one-step forecast, with an RMSE of 256 compared to 312 for the ETS and 355 for the neural network. Cross validation would have been preferred to help assess the accuracy of the models, but it took too long to process, so the split data set approach was used.
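
For completeness, here is a sketch of what the cross validation could have looked like using stretch_tsibble(); the .init and .step values are arbitrary choices for illustration, and this is the step that took too long to run.

#cross validation sketch (not run): expanding training windows, 1-step forecasts
train %>% 
  stretch_tsibble(.init = 60, .step = 5) %>%   #windows of 60, 65, 70, ... years
  model("ETS"    = ETS(Seconds),
        "Arima"  = ARIMA(Seconds),
        "Neural" = NNETAR(Seconds)) %>% 
  forecast(h = 1) %>% 
  accuracy(boston)   #compare the 1-step forecasts against the full series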

arima <- train %>% 
  model("Arima" = ARIMA(Seconds))

arima %>% report() 
## Series: Seconds 
## Model: ARIMA(0,1,1) w/ drift 
## 
## Coefficients:
##           ma1  constant
##       -0.6916  -21.0050
## s.e.   0.0683    9.7486
## 
## sigma^2 estimated as 109816:  log likelihood=-815.43
## AIC=1636.85   AICc=1637.07   BIC=1645.04

The ARIMA model selected was an ARIMA(0,1,1) with drift. It is interesting that the ETS wasn't closer in accuracy, because an ARIMA(0,1,1) without drift is equivalent to an ETS(A,N,N) model. This just shows how much the drift term impacted the accuracy. The drift was negative, which isn't surprising as the winning times have been decreasing.
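
To make that comparison concrete, here is an illustrative sketch (not part of the original analysis) fitting an ARIMA(0,1,1) with no drift alongside an ETS(A,N,N); without the drift term the two are equivalent, with the ETS smoothing parameter alpha equal to 1 + theta1, roughly 1 - 0.69 = 0.31 here.

#equivalence sketch: ARIMA(0,1,1) without drift vs ETS(A,N,N)
train %>% 
  model("ARIMA011" = ARIMA(Seconds ~ 0 + pdq(0, 1, 1)),   #0 removes the constant/drift
        "ETS_ANN"  = ETS(Seconds ~ error("A") + trend("N") + season("N"))) %>% 
  forecast(h = 1)   #the two point forecasts should be essentially the same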

arima %>% gg_tsresiduals()

Both residual plots show clear signs of heteroscedasticity. The innovation residual plot shows a pattern over time and the histogram of the residuals is skewed, which violates the assumption that the residuals are random and normally distributed. The ACF plot also shows some lags surpassing the significance bounds, indicating autocorrelation.
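
One possible remedy, not pursued further here, would be to refit the ARIMA on the Box-Cox transformed series using the lambda found earlier; fable back-transforms the forecasts automatically. A minimal sketch:

#possible remedy (not pursued): model the Box-Cox transformed series
train %>% 
  model("Arima_bc" = ARIMA(box_cox(Seconds, lambda))) %>% 
  gg_tsresiduals()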

augment(arima) %>%
  filter(.model == "Arima") %>%
  features(.innov, ljung_box, lag=12, dof=3)
## # A tibble: 1 x 4
##   Event               .model lb_stat lb_pvalue
##   <fct>               <chr>    <dbl>     <dbl>
## 1 Men's open division Arima     21.9   0.00905

A Ljung-Box test was used to assess the autocorrelation. With a p-value of less than 0.05, it can be concluded that the residuals are distinguishable from white noise and the model has remaining autocorrelation, as was indicated by the ACF plot.

arima %>% 
  forecast(h = 10) %>% 
  autoplot(train)+
  autolayer(test) +
  labs(title = "Boston Marathon Times: Predicted v. Actual",
       y = "Time in Seconds")
## Plot variable not specified, automatically selected `.vars = Seconds`

Though the model has autocorrelation and heteroscedasticity, it was still used because it was the most accurate at predicting the data. The graph compares the predicted values to the actual ones. The model isn't as accurate as hoped, but all of the test data is within the 95% prediction interval and most of it is within the 80% interval.
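
A rough sketch of how that coverage claim could be checked numerically is below; the unpack_hilo() column names and the coverage_80/coverage_95 summaries are illustrative.

#coverage sketch: share of test years inside the 80% and 95% intervals
arima %>% 
  forecast(h = 10) %>% 
  hilo(level = c(80, 95)) %>% 
  unpack_hilo(c("80%", "95%")) %>% 
  as_tibble() %>% 
  left_join(as_tibble(test) %>% select(Year, Actual = Seconds), by = "Year") %>% 
  summarise(coverage_80 = mean(Actual >= `80%_lower` & Actual <= `80%_upper`, na.rm = TRUE),
            coverage_95 = mean(Actual >= `95%_lower` & Actual <= `95%_upper`, na.rm = TRUE))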

arima %>% 
  forecast(h = 10) %>% 
  hilo(level = 95)
## # A tsibble: 10 x 6 [1Y]
## # Key:       Event, .model [1]
##    Event               .model  Year         Seconds .mean                  `95%`
##    <fct>               <chr>  <dbl>          <dist> <dbl>                 <hilo>
##  1 Men's open division Arima   2011 N(7638, 109816) 7638. [6988.180, 8287.187]95
##  2 Men's open division Arima   2012 N(7617, 120258) 7617. [6936.997, 8296.360]95
##  3 Men's open division Arima   2013 N(7596, 130700) 7596. [6887.099, 8304.249]95
##  4 Men's open division Arima   2014 N(7575, 141142) 7575. [6838.333, 8311.004]95
##  5 Men's open division Arima   2015 N(7554, 151584) 7554. [6790.576, 8316.751]95
##  6 Men's open division Arima   2016 N(7533, 162026) 7533. [6743.726, 8321.591]95
##  7 Men's open division Arima   2017 N(7512, 172468) 7512. [6697.696, 8325.611]95
##  8 Men's open division Arima   2018 N(7491, 182909) 7491. [6652.413, 8328.884]95
##  9 Men's open division Arima   2019 N(7470, 193351) 7470. [6607.813, 8331.474]95
## 10 Men's open division Arima   2020  N(7449, 2e+05) 7449. [6563.843, 8333.434]95

The 95% prediction intervals are listed.

arima %>% 
  forecast(
    new_data(train,1)) %>%
  autoplot(train) + 
  autolayer(test)+
  labs(title = "Boston Marathon: 2010 Predicted Winning Time", 
       y = "Time in Seonds", 
       X = "Year")
## Plot variable not specified, automatically selected `.vars = Seconds`

The final graph gives a better look at the first forecasted year (2011), which should be the most accurate. Though the point forecast wasn't very close, the actual winning time was within the 95% prediction interval.

The Boston Marathon is the oldest annual marathon in the world. It is an inclusive race that hosts competitors from all over the world and athletes of various backgrounds. The data was interesting to work with because there was no seasonality, which led to a good comparison of how ARIMA and ETS models differ in their approach. The ARIMA model ultimately was the most accurate: the selected ARIMA(0,1,1) is otherwise equivalent to the ETS model, but its drift term allowed it to surpass the ETS in accuracy. Overall the ARIMA model was fairly accurate, capturing all of the test data within its 95% prediction interval and most of it within the 80% interval. Future work with this data would have to include cross validation, or techniques such as bagging and boosting, to artificially provide more data to work with for more accurate forecasting.