# Libraries ---------------------------------------------------------------
library('tidyverse')
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library('lubridate')
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library('forecast')
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library('feasts')
## Loading required package: fabletools
##
## Attaching package: 'fabletools'
## The following objects are masked from 'package:forecast':
##
## accuracy, forecast
library('fpp3')
## -- Attaching packages -------------------------------------------- fpp3 0.4.0 --
## v tsibble 1.1.1 v fable 0.3.1
## v tsibbledata 0.4.0
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x fabletools::forecast() masks forecast::forecast()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag() masks stats::lag()
## x tsibble::setdiff() masks base::setdiff()
## x tsibble::union() masks base::union()
library('gridExtra')
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
The data used in this time series is the boston_marathon dataset that ships with the fpp3 package in R. This data was chosen because my cousin recently ran the race for the seventh time, and it interested me to see how winning times have changed for the world's oldest annual marathon.
#Data is Boston Marathon data
boston_marathon
## # A tsibble: 265 x 5 [1Y]
## # Key: Event [5]
## Event Year Champion Country Time
## <fct> <int> <chr> <chr> <time>
## 1 Men's open division 1897 John J. McDermott United States 02:55:10
## 2 Men's open division 1898 Ronald J. MacDonald Canada 02:42:00
## 3 Men's open division 1899 Lawrence Brignolia United States 02:54:38
## 4 Men's open division 1900 John P. Caffery Canada 02:39:44
## 5 Men's open division 1901 John P. Caffery Canada 02:29:23
## 6 Men's open division 1902 Sammy A. Mellor United States 02:43:12
## 7 Men's open division 1903 John C. Lorden United States 02:41:29
## 8 Men's open division 1904 Michael Spring United States 02:38:04
## 9 Men's open division 1905 Frederick Lorz United States 02:38:25
## 10 Men's open division 1906 Timothy Ford United States 02:45:45
## # ... with 255 more rows
#Analyzing the data
summary(boston_marathon)
## Event Year Champion
## Men's open division :123 Min. :1897 Length:265
## Men's wheelchair : 45 1st Qu.:1963 Class :character
## Women's open champion: 48 Median :1986 Mode :character
## Women's pioneer era : 6 Mean :1978
## Women's wheelchair : 43 3rd Qu.:2003
## Max. :2019
## Country Time
## Length:265 Length:265
## Class :character Class1:hms
## Mode :character Class2:difftime
## Mode :numeric
##
##
colSums(is.na(boston_marathon))
## Event Year Champion Country Time
## 0 0 0 0 1
colSums(boston_marathon == "")
## Event Year Champion Country Time
## 0 0 0 1 NA
The data had one NA (in Time) and one empty value (in Country) in the same row. Since it was just one observation, it was omitted from the data set. An alternative would have been to replace the missing time with the median time for that event, but because the hole in the data was so small, ease of coding was prioritized.
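For reference, the imputation alternative might have looked like the following sketch. This is hypothetical code that was not used in the analysis; it assumes dplyr's if_else() and median() handle the hms-classed Time column.
#hypothetical alternative (not used): impute the missing time with the
#median winning time for that event
boston_imputed <- boston_marathon %>%
  group_by(Event) %>%
  mutate(Time = if_else(is.na(Time), median(Time, na.rm = TRUE), Time)) %>%
  ungroup()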
#data manipulation
boston <- na.omit(boston_marathon) %>%
  filter(Event == "Men's open division")
boston <- boston %>%
  mutate(Seconds = as.numeric(seconds(Time)))
boston <- as_tsibble(boston, index = Year)
#creating train/test data
train <- boston %>%
  filter(Year <= 2010)
train
## # A tsibble: 114 x 6 [1Y]
## # Key: Event [1]
## Event Year Champion Country Time Seconds
## <fct> <int> <chr> <chr> <time> <dbl>
## 1 Men's open division 1897 John J. McDermott United States 02:55:10 10510
## 2 Men's open division 1898 Ronald J. MacDonald Canada 02:42:00 9720
## 3 Men's open division 1899 Lawrence Brignolia United States 02:54:38 10478
## 4 Men's open division 1900 John P. Caffery Canada 02:39:44 9584
## 5 Men's open division 1901 John P. Caffery Canada 02:29:23 8963
## 6 Men's open division 1902 Sammy A. Mellor United States 02:43:12 9792
## 7 Men's open division 1903 John C. Lorden United States 02:41:29 9689
## 8 Men's open division 1904 Michael Spring United States 02:38:04 9484
## 9 Men's open division 1905 Frederick Lorz United States 02:38:25 9505
## 10 Men's open division 1906 Timothy Ford United States 02:45:45 9945
## # ... with 104 more rows
test <- boston %>%
  filter(Year > 2010)
After the variable was omitted, time was turned into seconds and set as a numeric variable. This was for ease of use for forecasting. The predictions could be converted back to the same time format. Next, the data was filtered by Men’s open division for more observations. More observations were needs to split the data up into training and test data. having a split data set here was preferred because running cross validation on the types of models used took too long to produce a result. Finally, the training and test data was turned into a tsibble for forecasting.
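Converting a forecast in seconds back to a readable time is straightforward; a small sketch using hms (installed with the tidyverse) and the already-attached lubridate is shown below, with the 7638-second value taken from the mean forecast that appears later.
#converting seconds back to the original time format
hms::as_hms(7638)
## 02:07:18
seconds_to_period(7638)
## [1] "2H 7M 18S"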
boston_marathon %>%
  ggplot(aes(Year)) +
  geom_line(aes(y = Time, color = Event)) +
  labs(title = "Boston Marathon Times")
train %>%
  ggplot(aes(x = Year, y = Seconds)) +
  geom_line() +
  labs(title = "Boston Marathon Times: Men's Open Division",
       y = "Time in Seconds")
#Trying a box cox to see how it affects the variation of the data
#getting the lambda
lambda <- train %>%
  features(Seconds, features = guerrero) %>%
  pull(lambda_guerrero)
#graphing
train %>%
  autoplot(box_cox(Seconds, lambda)) +
  labs(y = "",
       title = paste0(
         "Transformed Marathon times with lambda = ",
         round(lambda, 2)))
#the variation in the data didn't seem to change much
#Decomposing for better view
train %>%
  model(STL(Seconds)) %>%
  components() %>%
  autoplot()
#no seasonality
The exploratory graphing reveals some interesting information about the data. There seems to be a lot of variation, so a Box-Cox transformation was tried, using the Guerrero method to choose the lambda. Since the transformation changed the series very little, it was not incorporated. The STL decomposition shows no seasonal pattern, which helps explain why the transformation didn't provide enough additional value to be used.
trainmodels <- train %>%
  model("ETS" = ETS(Seconds),
        "Arima" = ARIMA(Seconds),
        "Neural" = NNETAR(Seconds))
trainmodels %>%
  forecast(h = 1) %>%
  accuracy(test)
## Warning: 1 error encountered
## [1] subscript out of bounds
##
## 1 error encountered
## [1] subscript out of bounds
##
## 1 error encountered
## [1] subscript out of bounds
## # A tibble: 3 x 11
## .model Event .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima Men's open divis~ Test -256. 256. 256. -3.46 3.46 NaN NaN NA
## 2 ETS Men's open divis~ Test -312. 312. 312. -4.23 4.23 NaN NaN NA
## 3 Neural Men's open divis~ Test -355. 355. 355. -4.81 4.81 NaN NaN NA
Three models were fit to find the most accurate one: ETS, ARIMA, and a neural network (NNETAR). The models used their automatic specifications, which select the best structure by the lowest AICc, the default criterion. The ARIMA model proved the most accurate with an RMSE of 256, compared to 312 for the ETS and 355 for the neural network. Cross validation would have been preferred for judging the accuracy of the models, but it took too long to process, so the split data set approach was used.
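For completeness, the cross validation approach that proved too slow would look something like the sketch below, using stretch_tsibble() to build expanding training windows. The .init and .step values are illustrative choices, not tuned ones, and the sketch fits only the ARIMA model.
#time series cross validation sketch (not run here; too slow with NNETAR)
train %>%
  stretch_tsibble(.init = 50, .step = 5) %>%
  model("Arima" = ARIMA(Seconds)) %>%
  forecast(h = 1) %>%
  accuracy(boston)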
arima <- train %>%
  model("Arima" = ARIMA(Seconds))
arima %>% report()
## Series: Seconds
## Model: ARIMA(0,1,1) w/ drift
##
## Coefficients:
## ma1 constant
## -0.6916 -21.0050
## s.e. 0.0683 9.7486
##
## sigma^2 estimated as 109816: log likelihood=-815.43
## AIC=1636.85 AICc=1637.07 BIC=1645.04
The model selected was an ARIMA(0,1,1) with drift. It is interesting that the ETS wasn't closer in accuracy, because an ARIMA(0,1,1) model without drift is equivalent to an ETS(A,N,N) model. This shows how much the drift impacted the accuracy. The drift was negative, which isn't surprising since the winning times were decreasing.
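Reading the coefficients off the report above, and assuming fable's usual parameterization in which the reported constant is the mean of the differenced series, the fitted model can be written as

$$y_t = y_{t-1} - 21.005 + \varepsilon_t - 0.6916\,\varepsilon_{t-1}$$

so each year's winning time drifts downward by about 21 seconds on average.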
arima %>% gg_tsresiduals()
Both residual plots show clear signs of heteroscedasticity. The innovation residual plot shows a pattern and the residual histogram is skewed, violating the assumption that the residuals are normally distributed and random. The ACF plot also shows some lags surpassing the bounds, indicating autocorrelation.
augment(arima) %>%
  filter(.model == "Arima") %>%
  features(.innov, ljung_box, lag = 12, dof = 3)
## # A tibble: 1 x 4
## Event .model lb_stat lb_pvalue
## <fct> <chr> <dbl> <dbl>
## 1 Men's open division Arima 21.9 0.00905
A Ljung-Box test was used to assess the autocorrelation. With a p-value of less than 0.05, it can be concluded that the residuals are distinguishable from white noise and the model has autocorrelation, as the ACF plot indicated.
arima %>%
  forecast(h = 10) %>%
  autoplot(train) +
  autolayer(test) +
  labs(title = "Boston Marathon Times: Predicted v. Actual",
       y = "Time in Seconds")
## Plot variable not specified, automatically selected `.vars = Seconds`
Though the model shows autocorrelation and heteroscedasticity, it was still used because it was the most accurate at predicting the data. The graph compares the predicted values to the actual ones. The model isn't as accurate as hoped, but most of the test data is within the 80% prediction interval and all of it is within the 95% prediction interval.
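That interval-coverage claim can be checked programmatically; a sketch is below. It assumes fabletools' unpack_hilo() produces 80%_lower/80%_upper style column names, and the inner join drops the 2020 horizon, for which there is no test observation.
#sketch: share of test years inside the 80% and 95% prediction intervals
arima %>%
  forecast(h = 10) %>%
  hilo(level = c(80, 95)) %>%
  unpack_hilo(c(`80%`, `95%`)) %>%
  as_tibble() %>%
  inner_join(as_tibble(test) %>% select(Year, Actual = Seconds),
             by = "Year") %>%
  summarise(in80 = mean(Actual >= `80%_lower` & Actual <= `80%_upper`),
            in95 = mean(Actual >= `95%_lower` & Actual <= `95%_upper`))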
arima %>%
  forecast(h = 10) %>%
  hilo(level = 95)
## # A tsibble: 10 x 6 [1Y]
## # Key: Event, .model [1]
## Event .model Year Seconds .mean `95%`
## <fct> <chr> <dbl> <dist> <dbl> <hilo>
## 1 Men's open division Arima 2011 N(7638, 109816) 7638. [6988.180, 8287.187]95
## 2 Men's open division Arima 2012 N(7617, 120258) 7617. [6936.997, 8296.360]95
## 3 Men's open division Arima 2013 N(7596, 130700) 7596. [6887.099, 8304.249]95
## 4 Men's open division Arima 2014 N(7575, 141142) 7575. [6838.333, 8311.004]95
## 5 Men's open division Arima 2015 N(7554, 151584) 7554. [6790.576, 8316.751]95
## 6 Men's open division Arima 2016 N(7533, 162026) 7533. [6743.726, 8321.591]95
## 7 Men's open division Arima 2017 N(7512, 172468) 7512. [6697.696, 8325.611]95
## 8 Men's open division Arima 2018 N(7491, 182909) 7491. [6652.413, 8328.884]95
## 9 Men's open division Arima 2019 N(7470, 193351) 7470. [6607.813, 8331.474]95
## 10 Men's open division Arima 2020 N(7449, 2e+05) 7449. [6563.843, 8333.434]95
The 95% prediction intervals for each of the ten forecast years are listed above.
arima %>%
  forecast(new_data(train, 1)) %>%
  autoplot(train) +
  autolayer(test) +
  labs(title = "Boston Marathon: 2011 Predicted Winning Time",
       y = "Time in Seconds",
       x = "Year")
## Plot variable not specified, automatically selected `.vars = Seconds`
The final graph takes a closer look at the first forecasted year, 2011, which should be the most accurate. Though the point forecast wasn't very accurate, the actual time was within the 95% prediction interval.
The Boston Marathon is the oldest annual marathon in the world. It is an inclusive race that hosts competitors from all over the world and athletes of various backgrounds. The data was interesting to work with because it has no seasonality, which led to a great comparison of how ARIMA and ETS models differ in their approach. The ARIMA model was ultimately the most accurate. Its structure was equivalent to the ETS model, but the drift term allowed it to surpass the ETS in accuracy. Overall, the ARIMA model was fairly accurate, capturing most of the test data within its 80% prediction interval and all of it within the 95%. Future work with this data would have to include cross validation, or bagging to artificially provide more data, for more accurate forecasting.
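As a starting point for that future work, fpp3 supports simulating bootstrapped series from an STL decomposition, which is the first step of bagging. The sketch below follows that pattern; the block size of 8 is an illustrative guess, not a tuned value.
#sketch: bootstrapped series from the STL remainder (first step of bagging)
train %>%
  model(STL(Seconds)) %>%
  generate(new_data = train, times = 10,
           bootstrap_block_size = 8) %>%
  autoplot(.sim) +
  autolayer(train, Seconds) +
  labs(title = "Bootstrapped Boston Marathon series",
       y = "Time in Seconds")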