Data Description
The dataset selected for this assignment is the "World Development Indicators and Other World Bank Data", loaded as an R package. The data are drawn from over 40 databases hosted by the World Bank, including the World Development Indicators (WDI), International Debt Statistics, Doing Business, the Human Capital Index, and sub-national poverty indicators. The complete dataset contains 16,532 observations of 13 variables.
I further filtered the dataset to the United States in order to examine trends over time and, where possible, the impact of world events on different indicators. One basic driver of the trends in these data is the socio-economic condition of the country at any point in time. The variable of interest is the infant mortality rate. Its variation is usually driven by a combination of GDP per capita and years of schooling, which affect both the standard of healthcare available and the basic health knowledge of individuals (a quick correlation check appears after the filtering step below).
library(WDI)
library(ggplot2)
library(tidyverse) # dplyr/tidyr pipelines used below
library(tseries) # adf.test(), kpss.test()
library(forecast) # auto.arima(), forecast(), checkresiduals()
library(lubridate) # ymd()
wdi_data = WDI(indicator = c('inf_mort' = "SP.DYN.IMRT.IN",
'gdpPercap'="NY.GDP.PCAP.KD",
'yrs_schooling'='BAR.SCHL.15UP'), # Barro-Lee average years of schooling, age 15+
start = 1960, end = 2020,
extra=TRUE) %>%
as_tibble()
united_states_data = wdi_data %>%
filter(country == 'United States')
dim(united_states_data)
## [1] 61 13
Filtering the data to just the United States gives a total of 61 observations of the 13 variables.
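As a quick, informal check of the claim that GDP per capita and schooling track infant mortality, we can look at the pairwise correlations in the filtered data (a minimal sketch; yrs_schooling has missing years, so pairwise-complete observations are used):
united_states_data %>%
  summarise(cor_gdp = cor(inf_mort, gdpPercap, use = "pairwise.complete.obs"),
            cor_schooling = cor(inf_mort, yrs_schooling, use = "pairwise.complete.obs"))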
Plot of infant mortality rate vs. time (in years)
ggplot(united_states_data, aes(x = year, y = inf_mort)) +
geom_line() +
labs(y = 'Infant Mortality Rate', x = 'Year') +
ggtitle("Infant mortality rate across Years")
There is a significant downward trend of the infant mortality rate from 1960 to 2019.
Fitting a linear time trend to infant mortality rate
united_states_data %>%
ggplot()+
geom_line(aes(year,inf_mort))+
geom_smooth(aes(year,inf_mort),method='lm',color='red') +
theme_bw()+
xlab("Year")+
ylab("Infant Mortality")+
labs(title = "Infant Mortality in United States, 1960 - 2019")
## `geom_smooth()` using formula 'y ~ x'
The graph above indicates that a linear time trend provides a relatively good fit to the data. This can be further substantiated by regressing the infant mortality rate on time.
Checking for outliers in infant mortality rate
boxplot(united_states_data$inf_mort, main = "Checking for outliers in infant mortality rate in United States Data", ylab = "Infant Mortality Rate")
The boxplot above shows no outliers in the infant mortality rate (no points beyond 1.5 times the interquartile range from the quartiles), indicating that there are no unusual values in the series; note that the absence of outliers does not by itself imply the variable is normally distributed.
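For reference, the rule the boxplot uses can be applied directly: a point is flagged as an outlier if it lies more than 1.5 times the interquartile range beyond the first or third quartile. A minimal sketch of that check:
q = quantile(united_states_data$inf_mort, c(0.25, 0.75), na.rm = TRUE)
iqr = q[2] - q[1]
# count of observations outside the 1.5 * IQR fences (expected to be 0 here)
sum(united_states_data$inf_mort < q[1] - 1.5 * iqr |
      united_states_data$inf_mort > q[2] + 1.5 * iqr, na.rm = TRUE)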
Summary of Infant Mortality Rate for United States
summary(united_states_data$inf_mort)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.40 6.70 9.40 11.82 15.90 25.90
Summary statistics of the infant mortality rate for the United States are provided above. Over 1960-2019, the United States experienced a maximum infant mortality rate of 25.9 and a minimum of 5.4 (deaths per 1,000 live births). The median (9.4) is lower than the mean (11.82), indicating a right-skewed distribution: most years have relatively low mortality rates, while a smaller number of early, high-mortality years pull the mean upward.
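To make the skewness claim concrete, we can compare the mean and median directly and compute a simple moment-based skewness coefficient (a minimal sketch using only base R; a positive value indicates a right-skewed distribution):
x = na.omit(united_states_data$inf_mort)
c(mean = mean(x),
  median = median(x),
  skewness = mean((x - mean(x))^3) / sd(x)^3)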
Regression Model
mod = lm(inf_mort~year,data=united_states_data)
summary(mod)
##
## Call:
## lm(formula = inf_mort ~ year, data = united_states_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9331 -2.2774 -0.3216 2.0977 4.0131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 679.61148 33.35111 20.38 <2e-16 ***
## year -0.33557 0.01676 -20.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.305 on 59 degrees of freedom
## Multiple R-squared: 0.8717, Adjusted R-squared: 0.8696
## F-statistic: 401 on 1 and 59 DF, p-value: < 2.2e-16
From the regression summary, a one-year increase in time is associated with a 0.34-unit decline in the infant mortality rate; the p-value is far below 0.05, so this trend is statistically significant. An R-squared of 0.87 indicates that the linear time trend provides a relatively good fit.
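To attach an uncertainty range to the estimated trend, a 95% confidence interval for the slope can be obtained from the fitted model (a minimal sketch; given the estimate of -0.336 and standard error of 0.017 above, the interval is roughly -0.37 to -0.30):
confint(mod, parm = "year", level = 0.95)  # 95% CI for the annual change in infant mortality rate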
Section-1
Verifying the stationarity
From the plot of the infant mortality rate of the United States, we can see that the series is not mean stationary: the level is constantly decreasing over time instead of fluctuating around a fixed mean. The variability also changes over the sample, with year-to-year movements shrinking as the rate falls, so the series does not look variance stationary either (a rolling-window check is sketched below).
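One informal way to see this is to track the mean and standard deviation over a moving window; for a stationary series both would stay roughly flat. A minimal sketch with an (arbitrary) 10-year window, sorting by year first since the WDI rows are not guaranteed to be in chronological order:
us_sorted = united_states_data %>% arrange(year)
win = 10  # window width in years (arbitrary choice)
roll = sapply(1:(nrow(us_sorted) - win + 1), function(i) {
  w = us_sorted$inf_mort[i:(i + win - 1)]
  c(mean = mean(w, na.rm = TRUE), sd = sd(w, na.rm = TRUE))
})
plot(roll["mean", ], type = "l", ylab = "Rolling mean of infant mortality (10-yr window)")
plot(roll["sd", ], type = "l", ylab = "Rolling SD of infant mortality (10-yr window)")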
Seasonality
From the mortality rate plots above, we can safely say there is no seasonality in the series: the data are annual, so no sub-annual seasonal pattern can appear, and the series is essentially a smooth decreasing trend over time.
Log Transformation
We apply a log transformation to help stabilise the variance, and then take first differences to remove the trend and address the mean non-stationarity.
US_log_diff <- united_states_data %>%
mutate(
infMort_log = log1p(inf_mort), # Log for Variance Stationarity
infMort_diff = inf_mort - lag(inf_mort),
infMort_log_diff = infMort_log - lag(infMort_log) # Difference for Mean Stationarity
)%>%
drop_na()
US_log_diff %>%
ggplot() +
geom_line(aes(year, infMort_log_diff)) +
theme_bw() +
ggtitle("Difference in Log Values, US Infant Mortality Rate over Time") +
ylab("Mortality Rate (Difference))") +
xlab("Year")
Augmented Dickey-Fuller (ADF) and KPSS tests
We conduct ADF and KPSS tests to provide statistical evidence about stationarity. The two tests have opposite null hypotheses: the ADF null is a unit root (non-stationarity), so p < 0.05 points towards stationarity, whereas the KPSS null is stationarity, so p > 0.05 is consistent with stationarity.
adf.test(US_log_diff$inf_mort) # raw infant mortality rate
##
## Augmented Dickey-Fuller Test
##
## data: US_log_diff$inf_mort
## Dickey-Fuller = -2.962, Lag order = 2, p-value = 0.2059
## alternative hypothesis: stationary
adf.test(US_log_diff$infMort_diff) #first difference
##
## Augmented Dickey-Fuller Test
##
## data: US_log_diff$infMort_diff
## Dickey-Fuller = -31.873, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
adf.test(US_log_diff$infMort_log) # log infant mortality rate
##
## Augmented Dickey-Fuller Test
##
## data: US_log_diff$infMort_log
## Dickey-Fuller = -0.43272, Lag order = 2, p-value = 0.9779
## alternative hypothesis: stationary
adf.test(US_log_diff$infMort_log_diff) # differenced log infant mortality rate
##
## Augmented Dickey-Fuller Test
##
## data: US_log_diff$infMort_log_diff
## Dickey-Fuller = -10.258, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(US_log_diff$inf_mort) # raw infant mortality rate
##
## KPSS Test for Level Stationarity
##
## data: US_log_diff$inf_mort
## KPSS Level = 0.4142, Truncation lag parameter = 2, p-value = 0.07103
kpss.test(US_log_diff$infMort_diff) #first difference
##
## KPSS Test for Level Stationarity
##
## data: US_log_diff$infMort_diff
## KPSS Level = 0.22978, Truncation lag parameter = 2, p-value = 0.1
kpss.test(US_log_diff$infMort_log) # log infant mortality rate
##
## KPSS Test for Level Stationarity
##
## data: US_log_diff$infMort_log
## KPSS Level = 0.42179, Truncation lag parameter = 2, p-value = 0.06776
kpss.test(US_log_diff$infMort_log_diff) # differenced log infant mortality rate
##
## KPSS Test for Level Stationarity
##
## data: US_log_diff$infMort_log_diff
## KPSS Level = 0.22044, Truncation lag parameter = 2, p-value = 0.1
From the KPSS tests, the p-values for the first difference and the differenced log series are greater than 0.05, which is consistent with stationarity under the KPSS null. The ADF tests point the same way: both differenced series reject the unit-root null (p = 0.01), while the raw and log levels do not (p = 0.21 and 0.98). We therefore treat the differenced log series as stationary.
ACF and PACF Plots
acf(US_log_diff$infMort_log_diff,lag.max=20)
pacf(US_log_diff$infMort_log_diff,lag.max=20)
From the ACF and PACF plots, the differenced log series does not appear to be either a clear autoregressive or a clear moving-average process. We can run auto.arima() to further confirm this.
Section-2
ARIMA Models
BIC(
arima(US_log_diff$infMort_log,order=c(0,0,0)),
arima(US_log_diff$infMort_log,order=c(0,1,0)),
arima(US_log_diff$infMort_log,order=c(1,0,0)),
arima(US_log_diff$infMort_log,order=c(0,1,1)),
arima(US_log_diff$infMort_log,order=c(0,1,2)),
arima(US_log_diff$infMort_log,order=c(0,1,3)),
arima(US_log_diff$infMort_log,order=c(1,1,0)),
arima(US_log_diff$infMort_log,order=c(1,1,1)),
arima(US_log_diff$infMort_log,order=c(2,1,0)),
arima(US_log_diff$infMort_log,order=c(3,1,0))
)
## df BIC
## arima(US_log_diff$infMort_log, order = c(0, 0, 0)) 2 10.958139
## arima(US_log_diff$infMort_log, order = c(0, 1, 0)) 1 -6.040362
## arima(US_log_diff$infMort_log, order = c(1, 0, 0)) 3 -0.209925
## arima(US_log_diff$infMort_log, order = c(0, 1, 1)) 2 -11.113999
## arima(US_log_diff$infMort_log, order = c(0, 1, 2)) 3 -13.241080
## arima(US_log_diff$infMort_log, order = c(0, 1, 3)) 4 -12.256931
## arima(US_log_diff$infMort_log, order = c(1, 1, 0)) 2 -20.493178
## arima(US_log_diff$infMort_log, order = c(1, 1, 1)) 3 -18.416595
## arima(US_log_diff$infMort_log, order = c(2, 1, 0)) 3 -18.415510
## arima(US_log_diff$infMort_log, order = c(3, 1, 0)) 4 -16.693569
AIC(
arima(US_log_diff$infMort_log,order=c(0,0,0)),
arima(US_log_diff$infMort_log,order=c(0,1,0)),
arima(US_log_diff$infMort_log,order=c(1,0,0)),
arima(US_log_diff$infMort_log,order=c(0,1,1)),
arima(US_log_diff$infMort_log,order=c(0,1,2)),
arima(US_log_diff$infMort_log,order=c(0,1,3)),
arima(US_log_diff$infMort_log,order=c(1,1,0)),
arima(US_log_diff$infMort_log,order=c(1,1,1)),
arima(US_log_diff$infMort_log,order=c(2,1,0)),
arima(US_log_diff$infMort_log,order=c(3,1,0))
)
## df AIC
## arima(US_log_diff$infMort_log, order = c(0, 0, 0)) 2 10.5636896
## arima(US_log_diff$infMort_log, order = c(0, 1, 0)) 1 -6.1198035
## arima(US_log_diff$infMort_log, order = c(1, 0, 0)) 3 -0.8015988
## arima(US_log_diff$infMort_log, order = c(0, 1, 1)) 2 -11.2728821
## arima(US_log_diff$infMort_log, order = c(0, 1, 2)) 3 -13.4794046
## arima(US_log_diff$infMort_log, order = c(0, 1, 3)) 4 -12.5746973
## arima(US_log_diff$infMort_log, order = c(1, 1, 0)) 2 -20.6520609
## arima(US_log_diff$infMort_log, order = c(1, 1, 1)) 3 -18.6549194
## arima(US_log_diff$infMort_log, order = c(2, 1, 0)) 3 -18.6538348
## arima(US_log_diff$infMort_log, order = c(3, 1, 0)) 4 -17.0113352
auto.arima(US_log_diff$infMort_log,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: US_log_diff$infMort_log
## ARIMA(0,2,0)
##
## sigma^2 = 0.002006: log likelihood = 11.81
## AIC=-21.63 AICc=-20.83 BIC=-21.68
Based on the AIC and BIC values, ARIMA(1,1,0) achieves the lowest values among the manually fitted candidates, while auto.arima() (which also considers higher orders of differencing) selects ARIMA(0,2,0). For the residual diagnostics and in-sample comparison below we use the simpler ARIMA(0,1,0) on the log series, i.e. a random walk on the log scale; its residuals show no significant autocorrelation, although the information criteria favour the ARIMA(1,1,0) specification.
Checking for Residuals
best_mod = arima(US_log_diff$infMort_log,order=c(0,1,0))
forecast::checkresiduals(best_mod)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0)
## Q* = 0.87878, df = 3, p-value = 0.8305
##
## Model df: 0. Total lags used: 3
We conduct the Ljung-Box test to check for residual autocorrelation; if residual autocorrelation is present, the model has not captured all of the structure in the series. To confirm that the model is adequate, we also plot the ACF and PACF of the residuals and verify that the Ljung-Box p-values are not below 0.05 at the relevant lags.
Ljung-Box test
par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)
There are no significant lags in the ACF and PACF plots of the residuals. As an additional check, we run the Ljung-Box test at the first five lags, and also at lags 10 and 20.
Box.test(resid,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.18677, df = 1, p-value = 0.6656
Box.test(resid,type='Ljung-Box',lag=2)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.71858, df = 2, p-value = 0.6982
Box.test(resid,type='Ljung-Box',lag=3)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.87878, df = 3, p-value = 0.8305
Box.test(resid,type='Ljung-Box',lag=4)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.8788, df = 4, p-value = 0.9276
Box.test(resid,type='Ljung-Box',lag=5)
##
## Box-Ljung test
##
## data: resid
## X-squared = 3.9453, df = 5, p-value = 0.5573
Box.test(resid,type='Ljung-Box',lag=10)
##
## Box-Ljung test
##
## data: resid
## X-squared = NA, df = 10, p-value = NA
Box.test(resid,type='Ljung-Box',lag=20)
##
## Box-Ljung test
##
## data: resid
## X-squared = NA, df = 20, p-value = NA
The p-values for lags 1 through 5 are all well above 0.05, indicating no evidence of residual autocorrelation (the test statistics at lags 10 and 20 are returned as NA here).
Predicted vs. Actual Values
best_mod = arima(US_log_diff$infMort_log,order=c(0,1,0))
resid = best_mod$residuals
pred = US_log_diff$infMort_log - resid # fitted values = observed - residuals
ggplot()+
geom_line(aes(US_log_diff$year,US_log_diff$infMort_log))+
geom_line(aes(US_log_diff$year,pred),color='blue',alpha=0.4)+
theme_bw()+
xlab("Year")+
ylab("Log Infant Mortality Rate")
RMSE
RMSE = sqrt(mean((expm1(pred) - expm1(US_log_diff$infMort_log))^2,na.rm=T))
RMSE
## [1] 1.637295
We compute the root mean squared error on the original scale and find it to be approximately 1.64, suggesting that the model reproduces the in-sample infant mortality rate to within roughly 1.6 points on average.
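For completeness, the same in-sample fit can be summarised with the mean absolute error and mean absolute percentage error on the back-transformed scale (a minimal sketch using the pred object defined above):
in_mae  = mean(abs(expm1(pred) - expm1(US_log_diff$infMort_log)), na.rm = TRUE)
in_mape = mean(abs(expm1(pred) - expm1(US_log_diff$infMort_log)) /
                 expm1(US_log_diff$infMort_log), na.rm = TRUE)
c(MAE = in_mae, MAPE = in_mape)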
Forecast
best_mod %>%
forecast(h=5) %>%
autoplot()
for_prophet_data <- subset(united_states_data, select = c(year, inf_mort)) %>%
mutate(year = as.Date(ISOdate(year, 1, 1))) # convert the integer year to a Date (Jan 1 of each year)
for_prophet_data
## # A tibble: 61 × 2
## year inf_mort
## <date> <dbl>
## 1 1960-01-01 25.9
## 2 1963-01-01 24.4
## 3 1964-01-01 23.8
## 4 1961-01-01 25.4
## 5 1962-01-01 24.9
## 6 1967-01-01 22
## 7 1968-01-01 21.3
## 8 1969-01-01 20.6
## 9 1970-01-01 19.9
## 10 1971-01-01 19.1
## # … with 51 more rows
library(prophet)
prophet_data = for_prophet_data %>%
rename(ds = year, # Have to name our date variable "ds"
y = inf_mort) # Have to name our time series "y"
train = prophet_data %>%
filter(ds<ymd("2018-01-01"))
test = prophet_data %>%
filter(ds>=ymd("2018-01-01"))
model = prophet(train, yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
future = make_future_dataframe(model,periods = 7*365)
forecast = predict(model,future)
plot(model,forecast)+
ylab("Infant Mortality Rate in US")+xlab("Year")+theme_bw()
Just looking at the forecast, we can see a problem with the trend: the predicted infant mortality rate eventually drops below zero, which is impossible. We will set a floor later, when we work out the saturation points.
dyplot.prophet(model,forecast)
prophet_plot_components(model,forecast)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Year")+ylab("Infant Mortality Rate of US")
We can manually specify a larger number of changepoints to try to improve performance.
# Number of Changepoints
model = prophet(train,n.changepoints=10,yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
forecast = predict(model,future)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Year")+ylab("Infant Mortality Rate of US")
prophet_plot_components(model,forecast)
forecast_plot_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2020-01-01"))
ggplot()+
geom_line(aes(test$ds,test$y))+
geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat), color='red')+
xlab("Year") + ylab("Infant Mortality Rate of US")
We need to set a floor for this series, since the infant mortality rate cannot go below 0. Let us first make a two-year forecast to see why saturation points are needed.
two_yr_future = make_future_dataframe(model,periods = 730)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Infant Mortality Rate of US")
train$floor = 0
train$cap = 30
future$floor = 0
future$cap = 30
# Set floor and cap in the forecast (future) data
two_yr_future$floor = 0
two_yr_future$cap = 30
sat_model = prophet(train,growth='logistic', yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
sat_two_yr_forecast = predict(sat_model,two_yr_future)
plot(sat_model,sat_two_yr_forecast)+ylim(0,30)+
theme_bw()+xlab("Date")+ylab("Page Views")
prophet_plot_components(sat_model,sat_two_yr_forecast)
additive = prophet(train,yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
add_fcst = predict(additive,future)
plot(additive,add_fcst)+
ylim(0,30)
prophet_plot_components(additive,add_fcst)
multi = prophet(train,seasonality.mode = 'multiplicative', yearly.seasonality = FALSE, weekly.seasonality = FALSE, daily.seasonality = FALSE)
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)+ylim(0,30)
prophet_plot_components(multi, multi_fcst)
From the above plots, we can see that neither the additive nor the multiplicative seasonality mode plays any significant role, as expected for annual data.
Since the data are annual, holiday effects are not applicable to the infant mortality rate series.
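For reference only, if the series were daily rather than annual, country-level holiday effects could be added before fitting; a minimal sketch of how that would look with this training frame (it has no practical effect on yearly observations):
holiday_model = prophet(yearly.seasonality = FALSE, weekly.seasonality = FALSE,
                        daily.seasonality = FALSE, fit = FALSE)
holiday_model = add_country_holidays(holiday_model, country_name = 'US')  # built-in US holiday list
holiday_model = fit.prophet(holiday_model, train)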
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2020-01-01"))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 0.36"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 0.34"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.06"
The observed performance metrics on the test window are RMSE = 0.36, MAE = 0.34, and MAPE = 0.06 (about 6%).
df.cv <- cross_validation(model, horizon = 5*365,initial = 6*365, units = 'days')
head(df.cv)
## # A tibble: 6 × 6
## y ds yhat yhat_lower yhat_upper cutoff
## <dbl> <dttm> <dbl> <dbl> <dbl> <dttm>
## 1 21.3 1968-01-01 00:00:00 21.4 21.3 21.4 1967-01-14 00:00:00
## 2 20.6 1969-01-01 00:00:00 20.7 20.6 20.9 1967-01-14 00:00:00
## 3 19.9 1970-01-01 00:00:00 20.1 19.8 20.3 1967-01-14 00:00:00
## 4 19.1 1971-01-01 00:00:00 19.4 19.0 19.9 1967-01-14 00:00:00
## 5 18.3 1972-01-01 00:00:00 18.8 18.1 19.4 1967-01-14 00:00:00
## 6 19.9 1970-01-01 00:00:00 19.9 19.9 19.9 1969-07-14 12:00:00
unique(df.cv$cutoff)
## [1] "1967-01-14 00:00:00 GMT" "1969-07-14 12:00:00 GMT"
## [3] "1972-01-13 00:00:00 GMT" "1974-07-13 12:00:00 GMT"
## [5] "1977-01-11 00:00:00 GMT" "1979-07-12 12:00:00 GMT"
## [7] "1982-01-10 00:00:00 GMT" "1984-07-10 12:00:00 GMT"
## [9] "1987-01-09 00:00:00 GMT" "1989-07-09 12:00:00 GMT"
## [11] "1992-01-08 00:00:00 GMT" "1994-07-08 12:00:00 GMT"
## [13] "1997-01-06 00:00:00 GMT" "1999-07-07 12:00:00 GMT"
## [15] "2002-01-05 00:00:00 GMT" "2004-07-05 12:00:00 GMT"
## [17] "2007-01-04 00:00:00 GMT" "2009-07-04 12:00:00 GMT"
## [19] "2012-01-03 00:00:00 GMT"
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color = factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Adjusted Closing Price")+
scale_color_discrete(name = 'Cutoff')
Forecasts at nearer horizons stay close to the actual values, while forecasts farther out show more deviation from the actual values.
We can generate a table for various performance metrics for different time horizons.
performance_metrics(df.cv)
## horizon mse rmse mae mape mdape smape
## 1 180.5 days 0.02203328 0.1484361 0.1115551 0.01405537 0.009832773 0.01417687
## 2 352.0 days 0.02256754 0.1502250 0.1191892 0.01441354 0.009832773 0.01453446
## 3 354.0 days 0.02285979 0.1511945 0.1247254 0.01472895 0.009832773 0.01484939
## 4 355.0 days 0.02224203 0.1491376 0.1219532 0.01443306 0.007169743 0.01455097
## 5 356.0 days 0.02691323 0.1640525 0.1309530 0.01508729 0.007169743 0.01522092
## 6 357.0 days 0.02994155 0.1730363 0.1480905 0.01681164 0.016714379 0.01696085
## 7 359.0 days 0.03080494 0.1755134 0.1534586 0.01738732 0.016714379 0.01753098
## 8 360.0 days 0.02913382 0.1706863 0.1505182 0.01680129 0.016714379 0.01692125
## 9 361.0 days 0.04043131 0.2010754 0.1732670 0.02002807 0.016714379 0.02026862
## 10 362.0 days 0.03813686 0.1952866 0.1608028 0.01798798 0.012302975 0.01826157
## 11 364.0 days 0.03763065 0.1939862 0.1548644 0.01792068 0.012302975 0.01819525
## 12 535.5 days 0.03848501 0.1961760 0.1604521 0.01818594 0.012302975 0.01845941
## 13 536.5 days 0.03845380 0.1960964 0.1602932 0.01810211 0.012302975 0.01837502
## 14 538.5 days 0.03894727 0.1973506 0.1611057 0.01797422 0.012302975 0.01824374
## 15 539.5 days 0.06061901 0.2462093 0.1951223 0.02115576 0.012302975 0.02152652
## 16 540.5 days 0.05949179 0.2439094 0.1866278 0.02013216 0.006415317 0.02051076
## 17 541.5 days 0.05139937 0.2267143 0.1591871 0.01640994 0.005531182 0.01671321
## 18 543.5 days 0.05996035 0.2448680 0.1699162 0.01786200 0.005531182 0.01825230
## 19 544.5 days 0.06806006 0.2608832 0.1961858 0.02176448 0.006415317 0.02224680
## 20 545.5 days 0.07134868 0.2671117 0.2136161 0.02461709 0.025398983 0.02505507
## 21 718.0 days 0.07178299 0.2679235 0.2155823 0.02466950 0.025398983 0.02510719
## 22 719.0 days 0.07149424 0.2673841 0.2139838 0.02450975 0.025398983 0.02494378
## 23 720.0 days 0.06807901 0.2609195 0.2078823 0.02381235 0.019122337 0.02423058
## 24 721.0 days 0.06762250 0.2600433 0.2073961 0.02353661 0.019122337 0.02394226
## 25 723.0 days 0.07686973 0.2772539 0.2364773 0.02651342 0.028323383 0.02696997
## 26 724.0 days 0.07925017 0.2815141 0.2502938 0.02817728 0.028323383 0.02861527
## 27 725.0 days 0.07074983 0.2659884 0.2396511 0.02650192 0.028323383 0.02684133
## 28 726.0 days 0.08342040 0.2888259 0.2575923 0.02907391 0.028323383 0.02955319
## 29 728.0 days 0.08015785 0.2831216 0.2408482 0.02631013 0.019122337 0.02683402
## 30 729.0 days 0.07857925 0.2803199 0.2281283 0.02575733 0.019122337 0.02628312
## 31 900.5 days 0.08237012 0.2870020 0.2413980 0.02643410 0.019122337 0.02695449
## 32 902.5 days 0.07967752 0.2822721 0.2354291 0.02583196 0.018087686 0.02634235
## 33 903.5 days 0.09395300 0.3065175 0.2489458 0.02666156 0.018087686 0.02721200
## 34 904.5 days 0.14289606 0.3780160 0.2972092 0.03123051 0.018087686 0.03202098
## 35 905.5 days 0.14095780 0.3754435 0.2882276 0.03007285 0.013703124 0.03087808
## 36 907.5 days 0.12845748 0.3584097 0.2608403 0.02620425 0.013703124 0.02688475
## 37 908.5 days 0.14706504 0.3834906 0.2789015 0.02871908 0.013703124 0.02959979
## 38 909.5 days 0.15693219 0.3961467 0.3096505 0.03336641 0.013996049 0.03436297
## 39 910.5 days 0.16125991 0.4015718 0.3310508 0.03693161 0.032902493 0.03786896
## 40 1083.0 days 0.15995081 0.3999385 0.3275465 0.03665663 0.032902493 0.03759667
## 41 1084.0 days 0.15703221 0.3962729 0.3182409 0.03593731 0.032902493 0.03686395
## 42 1085.0 days 0.13680409 0.3698704 0.2979067 0.03392402 0.032213273 0.03476487
## 43 1087.0 days 0.13871916 0.3724502 0.2992181 0.03375004 0.032213273 0.03457800
## 44 1088.0 days 0.14912307 0.3861646 0.3265367 0.03660188 0.032902493 0.03749588
## 45 1089.0 days 0.15199744 0.3898685 0.3362136 0.03774807 0.032902493 0.03859867
## 46 1090.0 days 0.14392035 0.3793684 0.3290555 0.03646935 0.032902493 0.03721017
## 47 1092.0 days 0.16375775 0.4046699 0.3533560 0.04002120 0.032902493 0.04099028
## 48 1093.0 days 0.16036864 0.4004605 0.3416472 0.03801476 0.032213273 0.03903086
## 49 1094.0 days 0.15712347 0.3963880 0.3233428 0.03718010 0.032213273 0.03820022
## 50 1266.5 days 0.16576206 0.4071389 0.3440817 0.03829174 0.032213273 0.03929839
## 51 1267.5 days 0.15730363 0.3966152 0.3320596 0.03710934 0.024311742 0.03808353
## 52 1268.5 days 0.18129984 0.4257932 0.3469966 0.03795327 0.024311742 0.03899284
## 53 1269.5 days 0.25581576 0.5057823 0.4096054 0.04409032 0.024311742 0.04552310
## 54 1271.5 days 0.25296860 0.5029598 0.4000574 0.04277634 0.021571648 0.04423296
## 55 1272.5 days 0.23056208 0.4801688 0.3726381 0.03873352 0.021571648 0.03995026
## 56 1273.5 days 0.28024457 0.5293813 0.4090868 0.04376717 0.021571648 0.04552788
## 57 1274.5 days 0.29105796 0.5394979 0.4350000 0.04767770 0.038540943 0.04959324
## 58 1276.5 days 0.29270332 0.5410206 0.4478419 0.04981401 0.038540943 0.05170692
## 59 1448.0 days 0.29416991 0.5423743 0.4501909 0.04977658 0.038540943 0.05167013
## 60 1449.0 days 0.28456639 0.5334476 0.4223104 0.04772158 0.038540943 0.04958846
## 61 1451.0 days 0.24898599 0.4989850 0.3990839 0.04523614 0.038540943 0.04693082
## 62 1452.0 days 0.29880252 0.5466283 0.4242426 0.04718212 0.038540943 0.04908103
## 63 1453.0 days 0.31015842 0.5569187 0.4498779 0.04990772 0.038540943 0.05189279
## 64 1454.0 days 0.30316983 0.5506086 0.4332783 0.04758508 0.037016303 0.04946888
## 65 1456.0 days 0.28025624 0.5293923 0.4185791 0.04529067 0.037016303 0.04689350
## 66 1457.0 days 0.32305371 0.5683781 0.4602941 0.05135083 0.037016303 0.05345213
## 67 1458.0 days 0.32265591 0.5680281 0.4585494 0.05102782 0.037016303 0.05313512
## 68 1459.0 days 0.31111618 0.5577779 0.4246188 0.04948878 0.037016303 0.05161129
## 69 1631.5 days 0.32887860 0.5734794 0.4641508 0.05182526 0.037016303 0.05391640
## 70 1632.5 days 0.30767989 0.5546890 0.4458462 0.04993240 0.037016303 0.05193305
## 71 1633.5 days 0.31990121 0.5655981 0.4512652 0.04988887 0.037016303 0.05188451
## 72 1635.5 days 0.44934132 0.6703293 0.5394682 0.05872756 0.037627439 0.06144721
## 73 1636.5 days 0.44964399 0.6705550 0.5405442 0.05873928 0.037627439 0.06145873
## 74 1637.5 days 0.40795452 0.6387132 0.5018964 0.05305704 0.037627439 0.05531555
## 75 1638.5 days 0.46334302 0.6806930 0.5345661 0.05769052 0.037627439 0.06059050
## 76 1640.5 days 0.47586136 0.6898271 0.5618874 0.06186885 0.050754942 0.06495835
## 77 1641.5 days 0.47821234 0.6915290 0.5762354 0.06428925 0.050754942 0.06734543
## 78 1813.0 days 0.48521933 0.6965769 0.5842250 0.06449167 0.050754942 0.06754288
## 79 1815.0 days 0.45787045 0.6766613 0.5316516 0.06049108 0.050754942 0.06346227
## 80 1816.0 days 0.42485281 0.6518073 0.5164145 0.05838579 0.050754942 0.06113875
## 81 1817.0 days 0.52306519 0.7232325 0.5541716 0.06145258 0.050754942 0.06464833
## 82 1818.0 days 0.53514087 0.7315332 0.5780443 0.06403098 0.050754942 0.06733916
## 83 1820.0 days 0.52099352 0.7217988 0.5462013 0.05955966 0.040948288 0.06271489
## 84 1821.0 days 0.49981456 0.7069756 0.5349866 0.05769879 0.040948288 0.06056813
## 85 1822.0 days 0.55431526 0.7445235 0.5829667 0.06478330 0.040948288 0.06831291
## 86 1823.0 days 0.55353774 0.7440012 0.5800519 0.06425152 0.040948288 0.06779287
## 87 1825.0 days 0.52909807 0.7273913 0.5353115 0.06277030 0.040948288 0.06635738
## coverage
## 1 0.3333333
## 2 0.2222222
## 3 0.1111111
## 4 0.1111111
## 5 0.1111111
## 6 0.0000000
## 7 0.0000000
## 8 0.0000000
## 9 0.0000000
## 10 0.1111111
## 11 0.2222222
## 12 0.3333333
## 13 0.4444444
## 14 0.4444444
## 15 0.4444444
## 16 0.5555556
## 17 0.6666667
## 18 0.6666667
## 19 0.5555556
## 20 0.4444444
## 21 0.4444444
## 22 0.4444444
## 23 0.4444444
## 24 0.4444444
## 25 0.3333333
## 26 0.2222222
## 27 0.2222222
## 28 0.2222222
## 29 0.3333333
## 30 0.3333333
## 31 0.3333333
## 32 0.4444444
## 33 0.4444444
## 34 0.4444444
## 35 0.5555556
## 36 0.6666667
## 37 0.6666667
## 38 0.5555556
## 39 0.4444444
## 40 0.4444444
## 41 0.4444444
## 42 0.4444444
## 43 0.4444444
## 44 0.3333333
## 45 0.2222222
## 46 0.2222222
## 47 0.2222222
## 48 0.3333333
## 49 0.3333333
## 50 0.3333333
## 51 0.4444444
## 52 0.4444444
## 53 0.4444444
## 54 0.5555556
## 55 0.6666667
## 56 0.6666667
## 57 0.6666667
## 58 0.6666667
## 59 0.6666667
## 60 0.6666667
## 61 0.6666667
## 62 0.6666667
## 63 0.5555556
## 64 0.5555556
## 65 0.5555556
## 66 0.4444444
## 67 0.4444444
## 68 0.4444444
## 69 0.4444444
## 70 0.4444444
## 71 0.4444444
## 72 0.4444444
## 73 0.4444444
## 74 0.5555556
## 75 0.5555556
## 76 0.5555556
## 77 0.5555556
## 78 0.5555556
## 79 0.6666667
## 80 0.6666667
## 81 0.6666667
## 82 0.6666667
## 83 0.6666667
## 84 0.6666667
## 85 0.5555556
## 86 0.5555556
## 87 0.5555556
Now, we can plot visualizations for metrics like RMSE, MAE, MAPE, MDAPE, SMAPE.
plot_cross_validation_metric(df.cv, metric = 'rmse')
plot_cross_validation_metric(df.cv, metric = 'mae')
plot_cross_validation_metric(df.cv, metric = 'mape')
plot_cross_validation_metric(df.cv, metric = 'mdape')
plot_cross_validation_metric(df.cv, metric = 'smape')
ARIMA Model for RMSE comparison:
best_mod_ar <- auto.arima(train$y,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
checkresiduals(best_mod_ar)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)
## Q* = 4.1213, df = 7, p-value = 0.7657
##
## Model df: 3. Total lags used: 10
auto.arima() on the training dataset selects an ARIMA(2,1,1) model.
Calculating Out of Sample RMSE for ARIMA:
test_pred = predict(best_mod_ar, 39) # 39-step-ahead forecast; note the test window itself only covers the last few years
test_pred_ar = test_pred$pred
error_ar = test$y - test_pred_ar # test$y is recycled against the longer forecast vector
out_rmse_ar=sqrt(mean(error_ar^2,na.rm=T))
Prophet Model for RMSE comparison:
best_mod_pr = prophet(train, weekly.seasonality=FALSE, daily.seasonality = FALSE,
yearly.seasonality=FALSE)
future = make_future_dataframe(best_mod_pr,periods = 5, freq = "year")
forecast = predict(best_mod_pr,future)
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2020-01-31"))
out_rmse_pr = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
Out-of-sample RMSE comparison between the best ARIMA and Prophet models:
tibble(
`best_ARIMA` = round(out_rmse_ar,2),
`best_Prophet` = round(out_rmse_pr,2)
)
## # A tibble: 1 × 2
## best_ARIMA best_Prophet
## <dbl> <dbl>
## 1 2.38 0.3
The out-of-sample RMSE values are quite different: the Prophet model (RMSE ≈ 0.30) is the clear winner, reducing the RMSE by roughly 2.1 points (about 87%) relative to the best ARIMA model (RMSE ≈ 2.38).
While it is valuable to understand the time series through ARIMA, manually decomposing it and examining its autoregressive characteristics, it is also advisable to fit Prophet and compare forecasting performance. Prophet often performs better out of the box, but building the ARIMA model still helps in understanding the data-generating process.