Data Description
The dataset selected for this assignment is the "World Development Indicators and Other World Bank Data", loaded as an R package. The data are drawn from over 40 databases hosted by the World Bank, including the World Development Indicators (WDI), International Debt Statistics, Doing Business, the Human Capital Index, and sub-national poverty indicators. The complete dataset contains 16,532 observations of 13 variables.
I further filtered the dataset to the United States in order to examine trends over time and, where possible, the impact of world events on different indicators. One basic driver of the trends in these data is the socio-economic condition of the country at any point in time. The variable of interest is the infant mortality rate. Its variation is usually driven by a combination of GDP per capita and years of schooling, which affect both the standard of healthcare available and the basic health knowledge of individuals (a quick correlation check appears after the filtering step below).
library(WDI)
library(ggplot2)
library(tidyverse) # dplyr/tidyr pipelines used below
library(tseries) # adf.test(), kpss.test()
library(forecast) # auto.arima(), forecast(), checkresiduals()
library(lubridate) # ymd()
wdi_data = WDI(indicator = c('inf_mort' = "SP.DYN.IMRT.IN",
'gdpPercap'="NY.GDP.PCAP.KD",
'yrs_schooling'='BAR.SCHL.15UP'), # Barro-Lee average years of schooling, age 15+
start = 1960, end = 2020,
extra=TRUE) %>%
as_tibble()
united_states_data = wdi_data %>%
filter(country == 'United States')
dim(united_states_data)
## [1] 61 13
Filtering the data to just the United States gives a total of 61 observations of the 13 variables.
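As a quick, informal check of the claim that GDP per capita and schooling track infant mortality, we can look at the pairwise correlations in the filtered data (a minimal sketch; yrs_schooling has missing years, so pairwise-complete observations are used):
united_states_data %>%
  summarise(cor_gdp = cor(inf_mort, gdpPercap, use = "pairwise.complete.obs"),
            cor_schooling = cor(inf_mort, yrs_schooling, use = "pairwise.complete.obs"))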
Plot of infant mortality rate vs. time (in years)
ggplot(united_states_data, aes(x = year, y = inf_mort)) +
geom_line() +
labs(y = 'Infant Mortality Rate', x = 'Year') +
ggtitle("Infant mortality rate across Years")
There is a significant downward trend of the infant mortality rate from 1960 to 2019.
Fitting a linear time trend to infant mortality rate
united_states_data %>%
ggplot()+
geom_line(aes(year,inf_mort))+
geom_smooth(aes(year,inf_mort),method='lm',color='red') +
theme_bw()+
xlab("Year")+
ylab("Infant Mortality")+
labs(title = "Infant Mortality in United States, 1960 - 2019")
## `geom_smooth()` using formula 'y ~ x'
The graph above indicates that a linear time trend provides a relatively good fit to the data. This can be further substantiated by regressing the infant mortality rate on time.
Checking for outliers in infant mortality rate
boxplot(united_states_data$inf_mort, main = "Checking for outliers in infant mortality rate in United States Data", ylab = "Infant Mortality Rate")
The boxplot above shows no outliers in the infant mortality rate (no points beyond 1.5 times the interquartile range from the quartiles), indicating that there are no unusual values in the series; note that the absence of outliers does not by itself imply the variable is normally distributed.
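For reference, the rule the boxplot uses can be applied directly: a point is flagged as an outlier if it lies more than 1.5 times the interquartile range beyond the first or third quartile. A minimal sketch of that check:
q = quantile(united_states_data$inf_mort, c(0.25, 0.75), na.rm = TRUE)
iqr = q[2] - q[1]
# count of observations outside the 1.5 * IQR fences (expected to be 0 here)
sum(united_states_data$inf_mort < q[1] - 1.5 * iqr |
      united_states_data$inf_mort > q[2] + 1.5 * iqr, na.rm = TRUE)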
Summary of Infant Mortality Rate for United States
summary(united_states_data$inf_mort)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.40 6.70 9.40 11.82 15.90 25.90
Summary statistics of the infant mortality rate for the United States are provided above. Over 1960-2019, the United States experienced a maximum infant mortality rate of 25.9 and a minimum of 5.4 (deaths per 1,000 live births). The median (9.4) is lower than the mean (11.82), indicating a right-skewed distribution: most years have relatively low mortality rates, while a smaller number of early, high-mortality years pull the mean upward.
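To make the skewness claim concrete, we can compare the mean and median directly and compute a simple moment-based skewness coefficient (a minimal sketch using only base R; a positive value indicates a right-skewed distribution):
x = na.omit(united_states_data$inf_mort)
c(mean = mean(x),
  median = median(x),
  skewness = mean((x - mean(x))^3) / sd(x)^3)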
Regression Model
mod = lm(inf_mort~year,data=united_states_data)
summary(mod)
##
## Call:
## lm(formula = inf_mort ~ year, data = united_states_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9331 -2.2774 -0.3216 2.0977 4.0131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 679.61148 33.35111 20.38 <2e-16 ***
## year -0.33557 0.01676 -20.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.305 on 59 degrees of freedom
## Multiple R-squared: 0.8717, Adjusted R-squared: 0.8696
## F-statistic: 401 on 1 and 59 DF, p-value: < 2.2e-16
From the regression summary, a one-year increase in time is associated with a 0.34-unit decline in the infant mortality rate; the p-value is far below 0.05, so this trend is statistically significant. An R-squared of 0.87 indicates that the linear time trend provides a relatively good fit.
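To attach an uncertainty range to the estimated trend, a 95% confidence interval for the slope can be obtained from the fitted model (a minimal sketch; given the estimate of -0.336 and standard error of 0.017 above, the interval is roughly -0.37 to -0.30):
confint(mod, parm = "year", level = 0.95)  # 95% CI for the annual change in infant mortality rate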
Section-1
Verifying the stationarity
From the plot of the infant mortality rate of the United States, we can see that the series is not mean stationary: the level is constantly decreasing over time instead of fluctuating around a fixed mean. The variability also changes over the sample, with year-to-year movements shrinking as the rate falls, so the series does not look variance stationary either (a rolling-window check is sketched below).
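One informal way to see this is to track the mean and standard deviation over a moving window; for a stationary series both would stay roughly flat. A minimal sketch with an (arbitrary) 10-year window, sorting by year first since the WDI rows are not guaranteed to be in chronological order:
us_sorted = united_states_data %>% arrange(year)
win = 10  # window width in years (arbitrary choice)
roll = sapply(1:(nrow(us_sorted) - win + 1), function(i) {
  w = us_sorted$inf_mort[i:(i + win - 1)]
  c(mean = mean(w, na.rm = TRUE), sd = sd(w, na.rm = TRUE))
})
plot(roll["mean", ], type = "l", ylab = "Rolling mean of infant mortality (10-yr window)")
plot(roll["sd", ], type = "l", ylab = "Rolling SD of infant mortality (10-yr window)")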
Seasonality
From the mortality rate plots above, we can safely say there is no seasonality in the series: the data are annual, so no sub-annual seasonal pattern can appear, and the series is essentially a smooth decreasing trend over time.
Log Transformation
We apply a log transformation to help stabilise the variance, and then take first differences to remove the trend and address the mean non-stationarity.
US_log_diff <- united_states_data %>%
mutate(
infMort_log = log1p(inf_mort), # Log for Variance Stationarity
infMort_diff = inf_mort - lag(inf_mort),
infMort_log_diff = infMort_log - lag(infMort_log) # Difference for Mean Stationarity
)%>%
drop_na()
US_log_diff %>%
ggplot() +
geom_line(aes(year, infMort_log_diff)) +
theme_bw() +
ggtitle("Difference in Log Values, US Infant Mortality Rate over Time") +
ylab("Mortality Rate (Difference))") +
xlab("Year")
Augmented Dickey-Fuller (ADF) and KPSS tests
We conduct ADF and KPSS tests to provide statistical evidence about stationarity. The two tests have opposite null hypotheses: the ADF null is a unit root (non-stationarity), so p < 0.05 points towards stationarity, whereas the KPSS null is stationarity, so p > 0.05 is consistent with stationarity.
adf.test(US_log_diff$inf_mort) # raw infant mortality rate
##
## Augmented Dickey-Fuller Test
##
## data: US_log_diff$inf_mort
## Dickey-Fuller = -2.962, Lag order = 2, p-value = 0.2059
## alternative hypothesis: stationary
adf.test(US_log_diff$infMort_diff) #first difference
##
## Augmented Dickey-Fuller Test
##
## data: US_log_diff$infMort_diff
## Dickey-Fuller = -31.873, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
adf.test(US_log_diff$infMort_log) # log infant mortality rate
##
## Augmented Dickey-Fuller Test
##
## data: US_log_diff$infMort_log
## Dickey-Fuller = -0.43272, Lag order = 2, p-value = 0.9779
## alternative hypothesis: stationary
adf.test(US_log_diff$infMort_log_diff) # differenced log infant mortality rate
##
## Augmented Dickey-Fuller Test
##
## data: US_log_diff$infMort_log_diff
## Dickey-Fuller = -10.258, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(US_log_diff$inf_mort) # raw infant mortality rate
##
## KPSS Test for Level Stationarity
##
## data: US_log_diff$inf_mort
## KPSS Level = 0.4142, Truncation lag parameter = 2, p-value = 0.07103
kpss.test(US_log_diff$infMort_diff) #first difference
##
## KPSS Test for Level Stationarity
##
## data: US_log_diff$infMort_diff
## KPSS Level = 0.22978, Truncation lag parameter = 2, p-value = 0.1
kpss.test(US_log_diff$infMort_log) # log infant mortality rate
##
## KPSS Test for Level Stationarity
##
## data: US_log_diff$infMort_log
## KPSS Level = 0.42179, Truncation lag parameter = 2, p-value = 0.06776
kpss.test(US_log_diff$infMort_log_diff) # differenced log infant mortality rate
##
## KPSS Test for Level Stationarity
##
## data: US_log_diff$infMort_log_diff
## KPSS Level = 0.22044, Truncation lag parameter = 2, p-value = 0.1
From the KPSS tests, the p-values for the first difference and the differenced log series are greater than 0.05, which is consistent with stationarity under the KPSS null. The ADF tests point the same way: both differenced series reject the unit-root null (p = 0.01), while the raw and log levels do not (p = 0.21 and 0.98). We therefore treat the differenced log series as stationary.
ACF and PACF Plots
acf(US_log_diff$infMort_log_diff,lag.max=20)
pacf(US_log_diff$infMort_log_diff,lag.max=20)
From the ACF and PACF plots, the differenced log series does not appear to be either a clear autoregressive or a clear moving-average process. We can run auto.arima() to further confirm this.
Section-2
ARIMA Models
BIC(
arima(US_log_diff$infMort_log,order=c(0,0,0)),
arima(US_log_diff$infMort_log,order=c(0,1,0)),
arima(US_log_diff$infMort_log,order=c(1,0,0)),
arima(US_log_diff$infMort_log,order=c(0,1,1)),
arima(US_log_diff$infMort_log,order=c(0,1,2)),
arima(US_log_diff$infMort_log,order=c(0,1,3)),
arima(US_log_diff$infMort_log,order=c(1,1,0)),
arima(US_log_diff$infMort_log,order=c(1,1,1)),
arima(US_log_diff$infMort_log,order=c(2,1,0)),
arima(US_log_diff$infMort_log,order=c(3,1,0))
)
## df BIC
## arima(US_log_diff$infMort_log, order = c(0, 0, 0)) 2 10.958139
## arima(US_log_diff$infMort_log, order = c(0, 1, 0)) 1 -6.040362
## arima(US_log_diff$infMort_log, order = c(1, 0, 0)) 3 -0.209925
## arima(US_log_diff$infMort_log, order = c(0, 1, 1)) 2 -11.113999
## arima(US_log_diff$infMort_log, order = c(0, 1, 2)) 3 -13.241080
## arima(US_log_diff$infMort_log, order = c(0, 1, 3)) 4 -12.256931
## arima(US_log_diff$infMort_log, order = c(1, 1, 0)) 2 -20.493178
## arima(US_log_diff$infMort_log, order = c(1, 1, 1)) 3 -18.416595
## arima(US_log_diff$infMort_log, order = c(2, 1, 0)) 3 -18.415510
## arima(US_log_diff$infMort_log, order = c(3, 1, 0)) 4 -16.693569
AIC(
arima(US_log_diff$infMort_log,order=c(0,0,0)),
arima(US_log_diff$infMort_log,order=c(0,1,0)),
arima(US_log_diff$infMort_log,order=c(1,0,0)),
arima(US_log_diff$infMort_log,order=c(0,1,1)),
arima(US_log_diff$infMort_log,order=c(0,1,2)),
arima(US_log_diff$infMort_log,order=c(0,1,3)),
arima(US_log_diff$infMort_log,order=c(1,1,0)),
arima(US_log_diff$infMort_log,order=c(1,1,1)),
arima(US_log_diff$infMort_log,order=c(2,1,0)),
arima(US_log_diff$infMort_log,order=c(3,1,0))
)
## df AIC
## arima(US_log_diff$infMort_log, order = c(0, 0, 0)) 2 10.5636896
## arima(US_log_diff$infMort_log, order = c(0, 1, 0)) 1 -6.1198035
## arima(US_log_diff$infMort_log, order = c(1, 0, 0)) 3 -0.8015988
## arima(US_log_diff$infMort_log, order = c(0, 1, 1)) 2 -11.2728821
## arima(US_log_diff$infMort_log, order = c(0, 1, 2)) 3 -13.4794046
## arima(US_log_diff$infMort_log, order = c(0, 1, 3)) 4 -12.5746973
## arima(US_log_diff$infMort_log, order = c(1, 1, 0)) 2 -20.6520609
## arima(US_log_diff$infMort_log, order = c(1, 1, 1)) 3 -18.6549194
## arima(US_log_diff$infMort_log, order = c(2, 1, 0)) 3 -18.6538348
## arima(US_log_diff$infMort_log, order = c(3, 1, 0)) 4 -17.0113352
auto.arima(US_log_diff$infMort_log,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: US_log_diff$infMort_log
## ARIMA(0,2,0)
##
## sigma^2 = 0.002006: log likelihood = 11.81
## AIC=-21.63 AICc=-20.83 BIC=-21.68
Based on the AIC and BIC values, ARIMA(1,1,0) achieves the lowest values among the manually fitted candidates, while auto.arima() (which also considers higher orders of differencing) selects ARIMA(0,2,0). For the residual diagnostics and in-sample comparison below we use the simpler ARIMA(0,1,0) on the log series, i.e. a random walk on the log scale; its residuals show no significant autocorrelation, although the information criteria favour the ARIMA(1,1,0) specification.
Checking for Residuals
best_mod = arima(US_log_diff$infMort_log,order=c(0,1,0))
forecast::checkresiduals(best_mod)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0)
## Q* = 0.87878, df = 3, p-value = 0.8305
##
## Model df: 0. Total lags used: 3
We conduct the Ljung-Box test to check for residual autocorrelation; if residual autocorrelation is present, the model has not captured all of the structure in the series. To confirm that the model is adequate, we also plot the ACF and PACF of the residuals and verify that the Ljung-Box p-values are not below 0.05 at the relevant lags.
Ljung-Box test
par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)
There are no significant lags in the ACF and PACF plots of the residuals. As an additional check, we run the Ljung-Box test at the first five lags, and also at lags 10 and 20.
Box.test(resid,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.18677, df = 1, p-value = 0.6656
Box.test(resid,type='Ljung-Box',lag=2)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.71858, df = 2, p-value = 0.6982
Box.test(resid,type='Ljung-Box',lag=3)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.87878, df = 3, p-value = 0.8305
Box.test(resid,type='Ljung-Box',lag=4)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.8788, df = 4, p-value = 0.9276
Box.test(resid,type='Ljung-Box',lag=5)
##
## Box-Ljung test
##
## data: resid
## X-squared = 3.9453, df = 5, p-value = 0.5573
Box.test(resid,type='Ljung-Box',lag=10)
##
## Box-Ljung test
##
## data: resid
## X-squared = NA, df = 10, p-value = NA
Box.test(resid,type='Ljung-Box',lag=20)
##
## Box-Ljung test
##
## data: resid
## X-squared = NA, df = 20, p-value = NA
The p-values for lags 1 through 5 are all well above 0.05, indicating no evidence of residual autocorrelation (the test statistics at lags 10 and 20 are returned as NA here).
Predicted vs. Actual Values
best_mod = arima(US_log_diff$infMort_log,order=c(0,1,0))
resid = best_mod$residuals
pred = US_log_diff$infMort_log - resid # fitted values = observed - residuals
ggplot()+
geom_line(aes(US_log_diff$year,US_log_diff$infMort_log))+
geom_line(aes(US_log_diff$year,pred),color='blue',alpha=0.4)+
theme_bw()+
xlab("Year")+
ylab("Log Infant Mortality Rate")
RMSE
RMSE = sqrt(mean((expm1(pred) - expm1(US_log_diff$infMort_log))^2,na.rm=T))
RMSE
## [1] 1.637295
We compute the root mean squared error on the original scale and find it to be approximately 1.64, suggesting that the model reproduces the in-sample infant mortality rate to within roughly 1.6 points on average.
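For completeness, the same in-sample fit can be summarised with the mean absolute error and mean absolute percentage error on the back-transformed scale (a minimal sketch using the pred object defined above):
in_mae  = mean(abs(expm1(pred) - expm1(US_log_diff$infMort_log)), na.rm = TRUE)
in_mape = mean(abs(expm1(pred) - expm1(US_log_diff$infMort_log)) /
                 expm1(US_log_diff$infMort_log), na.rm = TRUE)
c(MAE = in_mae, MAPE = in_mape)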
Forecast
best_mod %>%
forecast(h=5) %>%
autoplot()
for_prophet_data <- subset(united_states_data, select = c(year, inf_mort)) %>%
mutate(year = as.Date(ISOdate(year, 1, 1))) # convert the integer year to a Date (Jan 1 of each year)
for_prophet_data
## # A tibble: 61 × 2
## year inf_mort
## <date> <dbl>
## 1 1960-01-01 25.9
## 2 1963-01-01 24.4
## 3 1964-01-01 23.8
## 4 1961-01-01 25.4
## 5 1962-01-01 24.9
## 6 1967-01-01 22
## 7 1968-01-01 21.3
## 8 1969-01-01 20.6
## 9 1970-01-01 19.9
## 10 1971-01-01 19.1
## # … with 51 more rows
library(prophet)
prophet_data = for_prophet_data %>%
rename(ds = year, # Have to name our date variable "ds"
y = inf_mort) # Have to name our time series "y"
train = prophet_data %>%
filter(ds<ymd("2018-01-01"))
test = prophet_data %>%
filter(ds>=ymd("2018-01-01"))
model = prophet(train, yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
future = make_future_dataframe(model,periods = 7*365)
forecast = predict(model,future)
plot(model,forecast)+
ylab("Infant Mortality Rate in US")+xlab("Year")+theme_bw()
Just looking at the forecast, we can see a problem with the trend: the predicted infant mortality rate eventually drops below zero, which is impossible. We will set a floor later, when we work out the saturation points.
dyplot.prophet(model,forecast)
prophet_plot_components(model,forecast)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Year")+ylab("Infant Mortality Rate of US")
We can manually specify a larger number of changepoints to try to improve performance.
# Number of Changepoints
model = prophet(train,n.changepoints=10,yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
forecast = predict(model,future)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Year")+ylab("Infant Mortality Rate of US")
prophet_plot_components(model,forecast)
forecast_plot_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2020-01-01"))
ggplot()+
geom_line(aes(test$ds,test$y))+
geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat), color='red')+
xlab("Year") + ylab("Infant Mortality Rate of US")
We need to set a floor for this series, since the infant mortality rate cannot go below 0. Let us first make a two-year forecast to see why saturation points are needed.
two_yr_future = make_future_dataframe(model,periods = 730)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Infant Mortality Rate of US")
train$floor = 0
train$cap = 30
future$floor = 0
future$cap = 30
# Set floor and cap in the forecast (future) data
two_yr_future$floor = 0
two_yr_future$cap = 30
sat_model = prophet(train,growth='logistic', yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
sat_two_yr_forecast = predict(sat_model,two_yr_future)
plot(sat_model,sat_two_yr_forecast)+ylim(0,30)+
theme_bw()+xlab("Date")+ylab("Page Views")
prophet_plot_components(sat_model,sat_two_yr_forecast)
additive = prophet(train,yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
add_fcst = predict(additive,future)
plot(additive,add_fcst)+
ylim(0,30)
prophet_plot_components(additive,add_fcst)
multi = prophet(train,seasonality.mode = 'multiplicative', yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)+ylim(0,30)
prophet_plot_components(multi, multi_fcst)
From the above plots, we can see that neither the additive nor the multiplicative seasonality mode plays any significant role, as expected for annual data.
Since the data are annual, holiday effects are not applicable to the infant mortality rate series.
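For reference only, if the series were daily rather than annual, country-level holiday effects could be added before fitting; a minimal sketch of how that would look with this training frame (it has no practical effect on yearly observations):
holiday_model = prophet(yearly.seasonality = FALSE, weekly.seasonality = FALSE,
                        daily.seasonality = FALSE, fit = FALSE)
holiday_model = add_country_holidays(holiday_model, country_name = 'US')  # built-in US holiday list
holiday_model = fit.prophet(holiday_model, train)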
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2020-01-01"))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 0.36"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 0.34"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.06"
The observed performance metrics on the test window are RMSE = 0.36, MAE = 0.34, and MAPE = 0.06 (about 6%).
df.cv <- cross_validation(model, horizon = 5*365,initial = 6*365, units = 'days')
head(df.cv)
## # A tibble: 6 × 6
## y ds yhat yhat_lower yhat_upper cutoff
## <dbl> <dttm> <dbl> <dbl> <dbl> <dttm>
## 1 21.3 1968-01-01 00:00:00 21.4 21.3 21.4 1967-01-14 00:00:00
## 2 20.6 1969-01-01 00:00:00 20.7 20.6 20.9 1967-01-14 00:00:00
## 3 19.9 1970-01-01 00:00:00 20.1 19.8 20.3 1967-01-14 00:00:00
## 4 19.1 1971-01-01 00:00:00 19.4 19.0 19.9 1967-01-14 00:00:00
## 5 18.3 1972-01-01 00:00:00 18.8 18.1 19.4 1967-01-14 00:00:00
## 6 19.9 1970-01-01 00:00:00 19.9 19.9 19.9 1969-07-14 12:00:00
unique(df.cv$cutoff)
## [1] "1967-01-14 00:00:00 GMT" "1969-07-14 12:00:00 GMT"
## [3] "1972-01-13 00:00:00 GMT" "1974-07-13 12:00:00 GMT"
## [5] "1977-01-11 00:00:00 GMT" "1979-07-12 12:00:00 GMT"
## [7] "1982-01-10 00:00:00 GMT" "1984-07-10 12:00:00 GMT"
## [9] "1987-01-09 00:00:00 GMT" "1989-07-09 12:00:00 GMT"
## [11] "1992-01-08 00:00:00 GMT" "1994-07-08 12:00:00 GMT"
## [13] "1997-01-06 00:00:00 GMT" "1999-07-07 12:00:00 GMT"
## [15] "2002-01-05 00:00:00 GMT" "2004-07-05 12:00:00 GMT"
## [17] "2007-01-04 00:00:00 GMT" "2009-07-04 12:00:00 GMT"
## [19] "2012-01-03 00:00:00 GMT"
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color = factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Adjusted Closing Price")+
scale_color_discrete(name = 'Cutoff')
Forecasts at nearer horizons stay close to the actual values, while forecasts farther out show more deviation from the actual values.
We can generate a table for various performance metrics for different time horizons.
performance_metrics(df.cv)
## horizon mse rmse mae mape mdape smape
## 1 180.5 days 0.02203328 0.1484361 0.1115551 0.01405537 0.009832773 0.01417687
## 2 352.0 days 0.02256754 0.1502250 0.1191892 0.01441354 0.009832773 0.01453446
## 3 354.0 days 0.02285979 0.1511945 0.1247254 0.01472895 0.009832773 0.01484939
## 4 355.0 days 0.02224203 0.1491376 0.1219532 0.01443306 0.007169743 0.01455097
## 5 356.0 days 0.02691323 0.1640525 0.1309530 0.01508729 0.007169743 0.01522092
## 6 357.0 days 0.02994155 0.1730363 0.1480905 0.01681164 0.016714379 0.01696085
## 7 359.0 days 0.03080494 0.1755134 0.1534586 0.01738732 0.016714379 0.01753098
## 8 360.0 days 0.02913382 0.1706863 0.1505182 0.01680129 0.016714379 0.01692125
## 9 361.0 days 0.04043131 0.2010754 0.1732670 0.02002807 0.016714379 0.02026862
## 10 362.0 days 0.03813686 0.1952866 0.1608028 0.01798798 0.012302975 0.01826157
## 11 364.0 days 0.03763065 0.1939862 0.1548644 0.01792068 0.012302975 0.01819525
## 12 535.5 days 0.03848501 0.1961760 0.1604521 0.01818594 0.012302975 0.01845941
## 13 536.5 days 0.03845380 0.1960964 0.1602932 0.01810211 0.012302975 0.01837502
## 14 538.5 days 0.03894727 0.1973506 0.1611057 0.01797422 0.012302975 0.01824374
## 15 539.5 days 0.06061901 0.2462093 0.1951223 0.02115576 0.012302975 0.02152652
## 16 540.5 days 0.05949179 0.2439094 0.1866278 0.02013216 0.006415317 0.02051076
## 17 541.5 days 0.05139937 0.2267143 0.1591871 0.01640994 0.005531182 0.01671321
## 18 543.5 days 0.05996035 0.2448680 0.1699162 0.01786200 0.005531182 0.01825230
## 19 544.5 days 0.06806006 0.2608832 0.1961858 0.02176448 0.006415317 0.02224680
## 20 545.5 days 0.07134868 0.2671117 0.2136161 0.02461709 0.025398983 0.02505507
## 21 718.0 days 0.07178299 0.2679235 0.2155823 0.02466950 0.025398983 0.02510719
## 22 719.0 days 0.07149424 0.2673841 0.2139838 0.02450975 0.025398983 0.02494378
## 23 720.0 days 0.06807901 0.2609195 0.2078823 0.02381235 0.019122337 0.02423058
## 24 721.0 days 0.06762250 0.2600433 0.2073961 0.02353661 0.019122337 0.02394226
## 25 723.0 days 0.07686973 0.2772539 0.2364773 0.02651342 0.028323383 0.02696997
## 26 724.0 days 0.07925017 0.2815141 0.2502938 0.02817728 0.028323383 0.02861527
## 27 725.0 days 0.07074983 0.2659884 0.2396511 0.02650192 0.028323383 0.02684133
## 28 726.0 days 0.08342040 0.2888259 0.2575923 0.02907391 0.028323383 0.02955319
## 29 728.0 days 0.08015785 0.2831216 0.2408482 0.02631013 0.019122337 0.02683402
## 30 729.0 days 0.07857925 0.2803199 0.2281283 0.02575733 0.019122337 0.02628312
## 31 900.5 days 0.08237012 0.2870020 0.2413980 0.02643410 0.019122337 0.02695449
## 32 902.5 days 0.07967752 0.2822721 0.2354291 0.02583196 0.018087686 0.02634235
## 33 903.5 days 0.09395300 0.3065175 0.2489458 0.02666156 0.018087686 0.02721200
## 34 904.5 days 0.14289606 0.3780160 0.2972092 0.03123051 0.018087686 0.03202098
## 35 905.5 days 0.14095780 0.3754435 0.2882276 0.03007285 0.013703124 0.03087808
## 36 907.5 days 0.12845748 0.3584097 0.2608403 0.02620425 0.013703124 0.02688475
## 37 908.5 days 0.14706504 0.3834906 0.2789015 0.02871908 0.013703124 0.02959979
## 38 909.5 days 0.15693219 0.3961467 0.3096505 0.03336641 0.013996049 0.03436297
## 39 910.5 days 0.16125991 0.4015718 0.3310508 0.03693161 0.032902493 0.03786896
## 40 1083.0 days 0.15995081 0.3999385 0.3275465 0.03665663 0.032902493 0.03759667
## 41 1084.0 days 0.15703221 0.3962729 0.3182409 0.03593731 0.032902493 0.03686395
## 42 1085.0 days 0.13680409 0.3698704 0.2979067 0.03392402 0.032213273 0.03476487
## 43 1087.0 days 0.13871916 0.3724502 0.2992181 0.03375004 0.032213273 0.03457800
## 44 1088.0 days 0.14912307 0.3861646 0.3265367 0.03660188 0.032902493 0.03749588
## 45 1089.0 days 0.15199744 0.3898685 0.3362136 0.03774807 0.032902493 0.03859867
## 46 1090.0 days 0.14392035 0.3793684 0.3290555 0.03646935 0.032902493 0.03721017
## 47 1092.0 days 0.16375775 0.4046699 0.3533560 0.04002120 0.032902493 0.04099028
## 48 1093.0 days 0.16036864 0.4004605 0.3416472 0.03801476 0.032213273 0.03903086
## 49 1094.0 days 0.15712347 0.3963880 0.3233428 0.03718010 0.032213273 0.03820022
## 50 1266.5 days 0.16576206 0.4071389 0.3440817 0.03829174 0.032213273 0.03929839
## 51 1267.5 days 0.15730363 0.3966152 0.3320596 0.03710934 0.024311742 0.03808353
## 52 1268.5 days 0.18129984 0.4257932 0.3469966 0.03795327 0.024311742 0.03899284
## 53 1269.5 days 0.25581576 0.5057823 0.4096054 0.04409032 0.024311742 0.04552310
## 54 1271.5 days 0.25296860 0.5029598 0.4000574 0.04277634 0.021571648 0.04423296
## 55 1272.5 days 0.23056208 0.4801688 0.3726381 0.03873352 0.021571648 0.03995026
## 56 1273.5 days 0.28024457 0.5293813 0.4090868 0.04376717 0.021571648 0.04552788
## 57 1274.5 days 0.29105796 0.5394979 0.4350000 0.04767770 0.038540943 0.04959324
## 58 1276.5 days 0.29270332 0.5410206 0.4478419 0.04981401 0.038540943 0.05170692
## 59 1448.0 days 0.29416991 0.5423743 0.4501909 0.04977658 0.038540943 0.05167013
## 60 1449.0 days 0.28456639 0.5334476 0.4223104 0.04772158 0.038540943 0.04958846
## 61 1451.0 days 0.24898599 0.4989850 0.3990839 0.04523614 0.038540943 0.04693082
## 62 1452.0 days 0.29880252 0.5466283 0.4242426 0.04718212 0.038540943 0.04908103
## 63 1453.0 days 0.31015842 0.5569187 0.4498779 0.04990772 0.038540943 0.05189279
## 64 1454.0 days 0.30316983 0.5506086 0.4332783 0.04758508 0.037016303 0.04946888
## 65 1456.0 days 0.28025624 0.5293923 0.4185791 0.04529067 0.037016303 0.04689350
## 66 1457.0 days 0.32305371 0.5683781 0.4602941 0.05135083 0.037016303 0.05345213
## 67 1458.0 days 0.32265591 0.5680281 0.4585494 0.05102782 0.037016303 0.05313512
## 68 1459.0 days 0.31111618 0.5577779 0.4246188 0.04948878 0.037016303 0.05161129
## 69 1631.5 days 0.32887860 0.5734794 0.4641508 0.05182526 0.037016303 0.05391640
## 70 1632.5 days 0.30767989 0.5546890 0.4458462 0.04993240 0.037016303 0.05193305
## 71 1633.5 days 0.31990121 0.5655981 0.4512652 0.04988887 0.037016303 0.05188451
## 72 1635.5 days 0.44934132 0.6703293 0.5394682 0.05872756 0.037627439 0.06144721
## 73 1636.5 days 0.44964399 0.6705550 0.5405442 0.05873928 0.037627439 0.06145873
## 74 1637.5 days 0.40795452 0.6387132 0.5018964 0.05305704 0.037627439 0.05531555
## 75 1638.5 days 0.46334302 0.6806930 0.5345661 0.05769052 0.037627439 0.06059050
## 76 1640.5 days 0.47586136 0.6898271 0.5618874 0.06186885 0.050754942 0.06495835
## 77 1641.5 days 0.47821234 0.6915290 0.5762354 0.06428925 0.050754942 0.06734543
## 78 1813.0 days 0.48521933 0.6965769 0.5842250 0.06449167 0.050754942 0.06754288
## 79 1815.0 days 0.45787045 0.6766613 0.5316516 0.06049108 0.050754942 0.06346227
## 80 1816.0 days 0.42485281 0.6518073 0.5164145 0.05838579 0.050754942 0.06113875
## 81 1817.0 days 0.52306519 0.7232325 0.5541716 0.06145258 0.050754942 0.06464833
## 82 1818.0 days 0.53514087 0.7315332 0.5780443 0.06403098 0.050754942 0.06733916
## 83 1820.0 days 0.52099352 0.7217988 0.5462013 0.05955966 0.040948288 0.06271489
## 84 1821.0 days 0.49981456 0.7069756 0.5349866 0.05769879 0.040948288 0.06056813
## 85 1822.0 days 0.55431526 0.7445235 0.5829667 0.06478330 0.040948288 0.06831291
## 86 1823.0 days 0.55353774 0.7440012 0.5800519 0.06425152 0.040948288 0.06779287
## 87 1825.0 days 0.52909807 0.7273913 0.5353115 0.06277030 0.040948288 0.06635738
## coverage
## 1 0.3333333
## 2 0.2222222
## 3 0.1111111
## 4 0.1111111
## 5 0.1111111
## 6 0.0000000
## 7 0.0000000
## 8 0.0000000
## 9 0.0000000
## 10 0.1111111
## 11 0.2222222
## 12 0.3333333
## 13 0.4444444
## 14 0.4444444
## 15 0.4444444
## 16 0.5555556
## 17 0.6666667
## 18 0.6666667
## 19 0.5555556
## 20 0.4444444
## 21 0.4444444
## 22 0.4444444
## 23 0.4444444
## 24 0.4444444
## 25 0.3333333
## 26 0.2222222
## 27 0.2222222
## 28 0.2222222
## 29 0.3333333
## 30 0.3333333
## 31 0.3333333
## 32 0.4444444
## 33 0.4444444
## 34 0.4444444
## 35 0.5555556
## 36 0.6666667
## 37 0.6666667
## 38 0.5555556
## 39 0.4444444
## 40 0.4444444
## 41 0.4444444
## 42 0.4444444
## 43 0.4444444
## 44 0.3333333
## 45 0.2222222
## 46 0.2222222
## 47 0.2222222
## 48 0.3333333
## 49 0.3333333
## 50 0.3333333
## 51 0.4444444
## 52 0.4444444
## 53 0.4444444
## 54 0.5555556
## 55 0.6666667
## 56 0.6666667
## 57 0.6666667
## 58 0.6666667
## 59 0.6666667
## 60 0.6666667
## 61 0.6666667
## 62 0.6666667
## 63 0.5555556
## 64 0.5555556
## 65 0.5555556
## 66 0.4444444
## 67 0.4444444
## 68 0.4444444
## 69 0.4444444
## 70 0.4444444
## 71 0.4444444
## 72 0.4444444
## 73 0.4444444
## 74 0.5555556
## 75 0.5555556
## 76 0.5555556
## 77 0.5555556
## 78 0.5555556
## 79 0.6666667
## 80 0.6666667
## 81 0.6666667
## 82 0.6666667
## 83 0.6666667
## 84 0.6666667
## 85 0.5555556
## 86 0.5555556
## 87 0.5555556
Now, we can plot visualizations for metrics like RMSE, MAE, MAPE, MDAPE, SMAPE.
plot_cross_validation_metric(df.cv, metric = 'rmse')
plot_cross_validation_metric(df.cv, metric = 'mae')
plot_cross_validation_metric(df.cv, metric = 'mape')
plot_cross_validation_metric(df.cv, metric = 'mdape')
plot_cross_validation_metric(df.cv, metric = 'smape')
ARIMA Model for RMSE comparison:
best_mod_ar <- auto.arima(train$y,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
checkresiduals(best_mod_ar)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)
## Q* = 4.1213, df = 7, p-value = 0.7657
##
## Model df: 3. Total lags used: 10
auto.arima() on the training dataset selects an ARIMA(2,1,1) model.
Calculating Out of Sample RMSE for ARIMA:
test_pred = predict(best_mod_ar, 39) # 39-step-ahead forecast; note the test window itself only covers the last few years
test_pred_ar = test_pred$pred
error_ar = test$y - test_pred_ar # test$y is recycled against the longer forecast vector
out_rmse_ar=sqrt(mean(error_ar^2,na.rm=T))
Prophet Model for RMSE comparison:
best_mod_pr = prophet(train, weekly.seasonality=FALSE, daily.seasonality = FALSE,
yearly.seasonality=FALSE)
future = make_future_dataframe(best_mod_pr,periods = 5, freq = "year")
forecast = predict(best_mod_pr,future)
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2020-01-31"))
out_rmse_pr = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
Out-of-sample RMSE comparison between the best ARIMA and Prophet models:
tibble(
`best_ARIMA` = round(out_rmse_ar,2),
`best_Prophet` = round(out_rmse_pr,2)
)
## # A tibble: 1 × 2
## best_ARIMA best_Prophet
## <dbl> <dbl>
## 1 2.38 0.3
The out-of-sample RMSE values are quite different: the Prophet model (RMSE ≈ 0.30) is the clear winner, reducing the RMSE by roughly 2.1 points (about 87%) relative to the best ARIMA model (RMSE ≈ 2.38).
While it is valuable to understand the time series through ARIMA, manually decomposing it and examining its autoregressive characteristics, it is also advisable to fit Prophet and compare forecasting performance. Prophet often performs better out of the box, but building the ARIMA model still helps in understanding the data-generating process.