Summary

We will be analysing how the trend of monthly interstate traffic volume changes across the years between 2012 and 2018 and then do a forecast for the next 2 years. According to the findings, the monthly traffic volume stays the same even though the data shows slightly upward trend before applying time series methods.

Introduction

The Metro Interstate Traffic Volume Data Set is taken from the UCI Machine Learning Repository (UCI, 2019). The initial data set contains 48,204 instances of hourly traffic data with a total of 9 attributes such as holiday type, temperature, rain fall, snow, clouds, weather (short description), weather (long description), date time hour of the data collected and traffic volume collected between October 2012 to Janrauary 2018 and the location is roughly between Minneapolis and St Paul, MN. It is a concern for the Minneapolis especially during holiday seasons that the traffic accidents on highways have been increasing drastically. Some argue that this may be a result of increased traffic volume despite the fact that average daily traffic has been decreased by 5% within the city (Lee, 2019). In this report, we will look up whether the traffic volume is actually going up. Even though the data contain detailed conditions on weather and season, they will not be considered in this report as we are only intersted in the volume. The data is also combined to monthly traffic volume instead of hourly traffic volume as we are interested in the overall increase in traffic volume.

The data is count data and the original dataset is a very big dataset with 48,204 instances. However, after combining the traffic volume to montly volume, the dataset is left with only 63 instances. This is a very small dataset to be using the Time Series methods taught in this unit and there are more appropriate methods outside of the scope of this unit. Nonetheless, we will still try to apply thes methods learned in the class. It will be interesting to see the difference on the difference in methods between the count data and continuous data for predictions but this is not the aim of this project.

Hypothesis

Our hypothesis is that the monthly traffic volume between the two cities will have an upward increasing trend.

Methodology and Analysis

As we are only interested in the monthly traffic volume, the date_time column of the dataset is separated and the monthly volume is calculated. Then the data is processed and fit into different models to see which model will give the optimum prediction for the next two years.

The original dataset looked like the following.

##   holiday   temp rain_1h snow_1h clouds_all weather_main weather_description
## 1    None 288.28       0       0         40       Clouds    scattered clouds
## 2    None 289.36       0       0         75       Clouds       broken clouds
## 3    None 289.58       0       0         90       Clouds     overcast clouds
## 4    None 290.13       0       0         90       Clouds     overcast clouds
## 5    None 291.14       0       0         75       Clouds       broken clouds
##             date_time traffic_volume
## 1 2012-10-02 09:00:00           5545
## 2 2012-10-02 10:00:00           4516
## 3 2012-10-02 11:00:00           4767
## 4 2012-10-02 12:00:00           5026
## 5 2012-10-02 13:00:00           4918

Data Preprocessing

-splitted date_time column to date and time

-splitted date into year, month and day

-grouped traffic flow by month

Descriptive Analysis

In Fig 1., the data shows some seasonality trends and there is a hugh discrease in the number of traffic at the end of 2014 despite the fact that it is holiday season. The general trend does not show clear indication of whether there is an increasing trend in the traffic volume. As next step, we checked whether the previous month’s traffic affects the traffic volume of the next month.

Previous year traffic-volume

## [1] 0.5413655

The analysis shows that there is a weak positive correlation between the previous month’s traffic volume to the next month as it can be seen in Fig 2. and correlation value = 0.541. Next we will check whether we can draw a line through the model to see if we can predict the traffic volume data.

Model Selection

Linear Trend Model

## 
## Call:
## lm(formula = df ~ time(df))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1880049  -254907    93368   321097   969323 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -232059487   70435470  -3.295  0.00154 **
## time(df)        116414      34956   3.330  0.00138 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 524400 on 71 degrees of freedom
## Multiple R-squared:  0.1351, Adjusted R-squared:  0.1229 
## F-statistic: 11.09 on 1 and 71 DF,  p-value: 0.001379

According to the F-statistics, linear model is significant at the p-value < 0.01 with adjusted R-squared value at 0.1229 which means it is not the best fit model as the correlation is very weak. Fig 3 shows the fitted linear model to the traffic volume data and shows that it does not fit well.

Quadric Trend Model

## 
## Call:
## lm(formula = df ~ t + t2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1836997  -247269   107776   284941   900405 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.013e+11  9.022e+10   1.122    0.266
## t           -1.006e+08  8.955e+07  -1.124    0.265
## t2           2.500e+04  2.222e+04   1.125    0.264
## 
## Residual standard error: 523500 on 70 degrees of freedom
## Multiple R-squared:  0.1505, Adjusted R-squared:  0.1262 
## F-statistic: 6.199 on 2 and 70 DF,  p-value: 0.003321

Quadratic model in Fig 4. shows a slightly better fit to the data compared to the linear model with better adjusted R-squared value at 0.1262 and is statistacally significant at p-value < 0.01. However, the model is still not a strong fitting one. Before transforming the data, we check the datashape first so we know what will be the best method to normalise the data.

ARIMA Model

Box-Cox Transformation

Box-cox transformation is applied to the series to help make the series stationary.

df.transform$ci
## [1] 1.6 2.0

The 95% confidence interval for lambda contains the values of lambda between 1.6 and 2.0. Here, we take the center of the two numbers 1.8 to apply it to the Box-Cox transformation.

The data looks more normally distributed after the Box-Cox transformation as seen in Fig 8. However, further Shapiro-Wilk normality test is appled to see if it passes the normality test.

## 
##  Shapiro-Wilk normality test
## 
## data:  df.box
## W = 0.96789, p-value = 0.05943

The Shapiro-Wilk normality test’s p-value is greater than 0.05 and therefore, we can assume the new transformed data is normally distributed. After normalizing, the data is differenced to make the series stationary.

Differencing

First Differencing

## Warning in adf.test(df.diff, alternative = "stationary"): p-value smaller than
## printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df.diff
## Dickey-Fuller = -5.2679, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

It is seen from the results that the p-value of the Dickey-Fuller test is < 0.01 and therefore the series is now stationary. Therefore, first order differencing will be used to try to fit in the models. The following shows how the data looks like after applying Box-Cox and first order differencing to the data. It can be seen from Fig 9. that transformation also reduced the varience of the data as well.

Candidate Models Selection

As the series is stationary at this stage, EACF method will be tried on to help determine which models are to be tested out. According to the eacf table below, it is not clear which models are the best but we take ARMA(1,1), ARMA(1,2) and ARMA (1,3) as best possible candidates.

## AR/MA
##    0 1 2 3 4 5 6 7 8 9 10
## 0  x o o o o x o o o o o 
## 1  x o o o o o o o o o o 
## 2  x x o o o o o o o o o 
## 3  x x x o o o o o o o o 
## 4  o o x o o x o o o o o 
## 5  o x x o o x o o o o o 
## 6  o o o o x x o o o o o 
## 7  x o o o o o o o o o o 
## 8  x x o o o o o o o o o 
## 9  x o o o o o o o o o o 
## 10 x x x x o o o o o o o
## Reordering variables and trying again:

According to the BIC table, the smallest BIC are at lag 1,and 6 and at error lag 5,6 and 7. Therefore, we take ARIMA (1,1,1), ARIMA (1,1,2), ARIMA (1,1,3) , ARIMA (1,1,5) , ARIMA (1,1,6), ARIMA (6,1,5), and ARIMA(6,1,6) as possible candidates. We tested out all the possible models to find the best possible model.

ARIMA(1,1,1)

CSS

## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value  Pr(>|z|)    
## ar1 -0.313467   0.117853  -2.6598  0.007819 ** 
## ma1 -0.942817   0.032353 -29.1419 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ML

## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value Pr(>|z|)    
## ar1 -0.249428   0.114477  -2.1789  0.02934 *  
## ma1 -0.999839   0.037196 -26.8799  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ARIMA(1,1,2)

CSS

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.71406    0.20136 -3.5462 0.0003909 ***
## ma1 -0.39570    0.30284 -1.3066 0.1913345    
## ma2 -0.57647    0.31954 -1.8041 0.0712229 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ML

## 
## z test of coefficients:
## 
##     Estimate Std. Error  z value  Pr(>|z|)    
## ar1  0.35584    0.14231   2.5005    0.0124 *  
## ma1 -1.84460    0.12242 -15.0680 < 2.2e-16 ***
## ma2  0.84469    0.11231   7.5208 5.446e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ARIMA(1,1,3)

CSS

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.796538   0.095930 -8.3033 < 2.2e-16 ***
## ma1 -0.689169   0.148162 -4.6515 3.296e-06 ***
## ma2 -0.845854   0.094338 -8.9662 < 2.2e-16 ***
## ma3  0.569150   0.149578  3.8050 0.0001418 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ML

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.24769    0.29672  0.8348    0.4038    
## ma1 -1.72293    0.30430 -5.6620 1.496e-08 ***
## ma2  0.62886    0.50066  1.2561    0.2091    
## ma3  0.09415    0.21283  0.4424    0.6582    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ARIMA(1,1,5)

CSS

## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value  Pr(>|z|)    
## ar1 -0.837644   0.082107 -10.2019 < 2.2e-16 ***
## ma1 -0.499557   0.133756  -3.7348 0.0001878 ***
## ma2 -0.942279   0.134278  -7.0174 2.261e-12 ***
## ma3  0.163813   0.219980   0.7447 0.4564705    
## ma4  0.165429   0.113036   1.4635 0.1433293    
## ma5  0.295749   0.135937   2.1756 0.0295829 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ML

## Warning in sqrt(diag(se)): NaNs produced
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.502120         NA      NA        NA    
## ma1 -0.972711         NA      NA        NA    
## ma2 -0.402976         NA      NA        NA    
## ma3  0.132890   0.125354  1.0601    0.2891    
## ma4  0.248724   0.031197  7.9727 1.552e-15 ***
## ma5  0.127748         NA      NA        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ARIMA(1,1,6)

CSS

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.799643   0.098704 -8.1014 5.433e-16 ***
## ma1 -0.578813   0.132346 -4.3735 1.223e-05 ***
## ma2 -0.919706   0.151417 -6.0740 1.248e-09 ***
## ma3  0.097904   0.162946  0.6008 0.5479503    
## ma4  0.351763   0.178907  1.9662 0.0492779 *  
## ma5  0.418392   0.123251  3.3946 0.0006872 ***
## ma6 -0.230157   0.145168 -1.5854 0.1128643    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ML

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.68829    0.22693 -3.0331 0.0024207 ** 
## ma1 -0.78029    0.26623 -2.9309 0.0033794 ** 
## ma2 -0.83579    0.30495 -2.7408 0.0061296 ** 
## ma3  0.19383    0.16044  1.2081 0.2270269    
## ma4  0.43017    0.20088  2.1414 0.0322385 *  
## ma5  0.44646    0.12448  3.5867 0.0003349 ***
## ma6 -0.36443    0.16608 -2.1943 0.0282145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ARIMA(6,1,5)

CSS

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.024998   0.148025  0.1689  0.865893    
## ar2 -0.419345   0.331060 -1.2667  0.205272    
## ar3 -0.161827   0.240376 -0.6732  0.500804    
## ar4 -0.322377   0.128570 -2.5074  0.012162 *  
## ar5 -0.186190   0.127340 -1.4621  0.143701    
## ar6 -0.415038   0.133803 -3.1018  0.001923 ** 
## ma1 -1.547129   0.198487 -7.7946 6.461e-15 ***
## ma2  0.793810   0.571467  1.3891  0.164811    
## ma3 -0.499588   0.810174 -0.6166  0.537470    
## ma4  0.489995   0.499252  0.9815  0.326367    
## ma5 -0.141443   0.175167 -0.8075  0.419391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ML

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.422583   0.225948 -1.8703 0.0614469 .  
## ar2 -0.640987   0.152055 -4.2155 2.492e-05 ***
## ar3 -0.614374   0.253368 -2.4248 0.0153156 *  
## ar4 -0.074332   0.179832 -0.4133 0.6793593    
## ar5 -0.258063   0.121789 -2.1189 0.0340955 *  
## ar6 -0.437457   0.113761 -3.8454 0.0001204 ***
## ma1 -1.051212   0.244592 -4.2978 1.725e-05 ***
## ma2  0.443554   0.473514  0.9367 0.3488980    
## ma3 -0.372233   0.488397 -0.7622 0.4459689    
## ma4 -0.521721   0.461519 -1.1304 0.2582900    
## ma5  0.701698   0.266439  2.6336 0.0084482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ARIMA(6,1,6)

CSS

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.040570   0.173650  0.2336  0.815270    
## ar2 -0.446794   0.340848 -1.3108  0.189916    
## ar3 -0.144301   0.243601 -0.5924  0.553604    
## ar4 -0.305432   0.157800 -1.9356  0.052922 .  
## ar5 -0.205137   0.160227 -1.2803  0.200442    
## ar6 -0.415938   0.135994 -3.0585  0.002225 ** 
## ma1 -1.561342   0.216059 -7.2265 4.957e-13 ***
## ma2  0.846913   0.627604  1.3494  0.177196    
## ma3 -0.563147   0.825828 -0.6819  0.495291    
## ma4  0.484256   0.474250  1.0211  0.307208    
## ma5 -0.074140   0.412137 -0.1799  0.857237    
## ma6 -0.038703   0.212104 -0.1825  0.855211    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ML

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.194416   0.265903 -0.7312 0.4646845    
## ar2 -0.218108   0.292306 -0.7462 0.4555683    
## ar3 -0.280367   0.298949 -0.9378 0.3483255    
## ar4 -0.078762   0.282120 -0.2792 0.7801090    
## ar5 -0.275468   0.176899 -1.5572 0.1194206    
## ar6 -0.435064   0.131186 -3.3164 0.0009119 ***
## ma1 -1.313799   0.293929 -4.4698  7.83e-06 ***
## ma2  0.226986   0.552780  0.4106 0.6813461    
## ma3  0.019344   0.494371  0.0391 0.9687873    
## ma4 -0.012141   0.690554 -0.0176 0.9859722    
## ma5  0.421534   0.622070  0.6776 0.4980055    
## ma6 -0.247485   0.259381 -0.9541 0.3400128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Selecting the Best Model

Best possible model is then selected using the sort score function.

Sorting models using AIC scores

##              df      AIC
## model_615_ml 12 3725.522
## model_116_ml  8 3725.941
## model_112_ml  4 3728.780
## model_616_ml 13 3728.941
## model_115_ml  7 3730.317
## model_113_ml  5 3730.584
## model_111_ml  3 3737.229

Sorting models using BIC scores

##              df      BIC
## model_112_ml  4 3737.830
## model_113_ml  5 3741.897
## model_111_ml  3 3744.017
## model_116_ml  8 3744.043
## model_115_ml  7 3746.155
## model_615_ml 12 3752.674
## model_616_ml 13 3758.356

The two tests show contradictory results therefore, we will take ARIMA (6,1,5), ARIMA (1,1,2) ARIMA (1,1,6) and ARIMA (1,1,3) to test for residuals to see which model will be the best.

Residual analysis for ARIMA(1,1,2)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.97718, p-value = 0.2147

Residual analysis for ARIMA(1,1,6)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.98043, p-value = 0.3243

Residual analysis for ARIMA(1,1,3)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.99186, p-value = 0.9274

Residual analysis for ARIMA(6,1,5)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.98358, p-value = 0.471

Only at ARIMA (1,1,6), and (6,1,5), the models’ standardized residual plots shows no trend nor changing variance meaning that the both of the models are supported. The histograms are also normally distributed and the QQ plot also shows that the data is normally distributed. Both the PACF and ACF plots do not show significant lags as well. Therefore, we chose the better performing model ARIMA(6,1,5) and this will be used to calculate the forecasting for the next 5 years.

Forecasting

Forecast for Interstate Traffic Volume Forecastign for the Next 2 Years

Forecast for Interstate Traffic Volume Forecastign for the Next 2 Years
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Feb 2018 -15636610212 -78394251930 47121031507 -111616140673 80342920249
Mar 2018 -6882980831 -77680194044 63914232382 -115157974856 101392013195
Apr 2018 -25903190992 -97653366260 45846984276 -135635614712 83829232728
May 2018 6233118648 -66611223417 79077460713 -105172688784 117638926081
Jun 2018 46301748346 -26682888094 119286384786 -65318620817 157922117509
Jul 2018 674448524 -73315402761 74664299809 -112483264079 113832161127
Aug 2018 -14011877255 -88640514168 60616759659 -128146528172 100122773663
Sep 2018 -12181365929 -89380304170 65017572311 -130246953506 105884221647
Oct 2018 -9599811879 -88118607927 68918984170 -129683947782 110484324025
Nov 2018 -14015185106 -92534062918 64503692706 -134099446057 106069075844
Dec 2018 -18904456128 -97535729929 59726817674 -139160611910 101351699655
Jan 2019 3488490507 -75376664005 82353645018 -117125354955 124102335968
Feb 2019 11735407035 -67200393245 90671207314 -108986481806 132457295875
Mar 2019 3525444037 -75426441410 82477329484 -117221044945 124271933018
Apr 2019 -2435029340 -81392371732 76522313052 -123189863998 118319805318
May 2019 -4951900802 -84403182567 74499380962 -126462150542 116558348938
Jun 2019 -5985457236 -85827545650 73856631179 -128093394169 116122479698
Jul 2019 -12173990557 -92157311960 67809330846 -134497924707 110149943593
Aug 2019 -11460624306 -91443946039 68522697427 -133784558961 110863310349
Sep 2019 -3351800960 -83335978404 76632376483 -125677044311 118973442391
Oct 2019 828861237 -79158122351 80815844826 -121500673743 123158396217
Nov 2019 649612673 -79341439747 80640665094 -121686145048 122985370395
Dec 2019 -1069000503 -81073825349 78935824342 -123425821330 121287820324
Jan 2020 -1891788944 -82005713153 78222135265 -124415462849 120631884961

Discussion and Limitations

It is found that the traffic volume in the long run is neither decreasing nor decreasing in the long run but somewhat stays the same despite fluctuations in the volume according to the season. There is also a possibility that the prediction did not work well due to the data size being too small to be using with the current methods and therfore, giving a different result than expected. It is hard to translate the prediction of Box-Cox transformed values backed to the original data as well. This could be avoided by using a huge set of data and assume normal distribution. This data has a huge dip in the traffic volume around the end of 2014 and thus affecting the trend as well. If we remove the year 2014, we might be able to get a better fitting model with more precise predictions.

References

UCI Machine Learning Repository 2019. The Metro Interstate Traffic Volume Data Set Retrieved from https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume on 20th May 2020.

Lee, J., 2019. Five Takeaways from Minneaplis’ New Plan to Boost Street Safety. Retrieved from https://www.minnpost.com/metro/2019/09/five-takeaways-from-minneapolis-new-plan-to-boost-street-safety/ on 29th May 2020.

Schmitt, A., 2019. How Two Cites Reduced Driving. Retrieved from https://usa.streetsblog.org/2019/02/08/minneapolis-and-seattle-have-achieved-the-holy-grail-for-sustainable-transportation/ on 5th June 2020.

Code Appendix

library(purrr) #checking dataframe
library(tidyr)
library(dplyr)
library(TSA)
library(lmtest)
library(knitr)
library(FSAdata)
library(tseries)
library(fUnitRoots)
library(bestglm)
library(FitAR)
library(forecast)
library(rmarkdown)
library(readr)

traffic <- read.csv("Metro_Interstate_Traffic_Volume.csv")
head(traffic,5)

monthly_traffic<- tidyr::separate(traffic, date_time, c("date", "time"), sep = " ")

monthly_traffic <- tidyr::separate(monthly_traffic, date, c("year", "month", "day"), sep = "-")

monthly_traffic_data <- monthly_traffic %>% group_by(year,month) %>% summarise(vol_by_month = sum(traffic_volume))

df = ts(as.vector(monthly_traffic_data$vol_by_month), start=2012, end=2018, frequency=12)
plot(df, ylab="Traffic Volume", type="o", main="Time Series Plot for Monthly Interstate Traffic Volume")
title(main="Fig 1.", line=-20.5)

plot(y=df,x=zlag(df), ylab="Traffic Volume", xlab="Previous month traffic volume", main="Scatter Plot of Previous Month Traffic Volume")

title(main="Fig 2.", line=-20.5)

y=df
x=zlag(df)
index=2:length(x)
cor(y[index],x[index])

linear = lm(df~time(df))
summary(linear)

plot(df, ylab="Time", type="o", main=" Fitted Linear Model to Traffic Volume Data")
abline(linear)
title(main="Fig 3", line=-20.5)

t = time(df)
t2 = t^2
quadratic = lm(df~t+t2)
summary(quadratic)

plot(ts(fitted(quadratic)), ylim = c(min(c(fitted(quadratic), as.vector(df))), max(c(fitted(quadratic), as.vector(df)))), ylab="y",
     main = "Fitted Quadratic Model to Traffic Volume Data")
lines(as.vector(df), type="o")
title(main="Fig 4.", line=-20.5)

har. = harmonic(df, 1/12)
cosine = lm(df~har.)
summary(cosine)

hist(df, breaks=12, col="red", main ="Histogram of Traffic Volume Data")
title(main="Fig 5", line=-20.5)

df.transform<- BoxCox.ar(df, method="yule-walker")
title(main="Log-likelihood vs values of lambda for traffic volume data", line=1)
title(main="Fig 6.", line=-20.5)

df.transform$ci

par(mfrow=c(1,2))
lambda=1.8
df.box= (df^lambda-1)/lambda
plot(df, type='l', ylab ='count for Original series')
plot(df.box, type='l', ylab ="count for transformed series")
mtext("Before and After applying Box-Cox to the Daily Interstate Traffic Volume Data", font=2, line=1, 
      at=par("usr")[1]+0.05*diff(par("usr")[1:2]),
      cex=1.2)
title(main="Fig 7.", line=-24.5, outer = TRUE)

hist(df.box, breaks=12, col="red", main = "Histogram of Box-Cox Transformed Data")
title(main="Fig 8.", line=-20.5)

shapiro.test(df.box)

df.diff = diff(df.box, differences =1)
testAdf1 = adf.test(df.diff, alternative = "stationary")
testAdf1

par(mfrow=c(1,2))
plot(df, type='l', ylab ='count for Original series')
plot(df.diff, type='l', ylab ="count for transformed series")
mtext("Applying 1st Order Difference to Traffic Volume", font=2, line=1, 
      at=par("usr")[1]+0.05*diff(par("usr")[1:2]),
      cex=1.2)
title(main="Fig 9.", line=-24.5, outer = TRUE)

eacf(df.diff, ar.max=10, ma.max=10)

armaSubset = armasubsets(y=df.diff,
                         nar=10,
                         nma=10,
                         y.name='test',
                         ar.method='ols')
plot(armaSubset)

title('BIC Table for Model Selection', line = 5)
title('Fig 10.', line = -17.5)

model_111_css = arima(df.diff,order=c(1,1,1),method='CSS')
coeftest(model_111_css)

model_111_ml = arima(df.diff,order=c(1,1,1),method='ML')
coeftest(model_111_ml)

model_112_css = arima(df.diff,order=c(1,1,2),method='CSS')
coeftest(model_112_css)

model_112_ml = arima(df.diff,order=c(1,1,2),method='ML')
coeftest(model_112_ml)

model_113_css = arima(df.diff,order=c(1,1,3),method='CSS')
coeftest(model_113_css)

model_113_ml = arima(df.diff,order=c(1,1,3),method='ML')
coeftest(model_113_ml)

model_115_css = arima(df.diff,order=c(1,1,5),method='CSS')
coeftest(model_115_css)

model_115_ml = arima(df.diff,order=c(1,1,5),method='ML')
coeftest(model_115_ml)

model_116_css = arima(df.diff,order=c(1,1,6),method='CSS')
coeftest(model_116_css)

model_116_ml = arima(df.diff,order=c(1,1,6),method='ML')
coeftest(model_116_ml)

model_615_css = arima(df.diff,order=c(6,1,5),method='CSS')
coeftest(model_615_css)

model_615_ml = arima(df.diff,order=c(6,1,5),method='ML')
coeftest(model_615_ml)

model_616_css = arima(df.diff,order=c(6,1,6),method='CSS')
coeftest(model_616_css)

model_616_ml = arima(df.diff,order=c(6,1,6),method='ML')
coeftest(model_616_ml)

#sort score function
sort.score <- function(x, score = c("bic", "aic")){
  if (score == "aic"){
    x[with(x, order(AIC)),]
  } else if (score == "bic") {
    x[with(x, order(BIC)),]
  } else {
    warning('score = "x" only accepts valid arguments ("aic","bic")')
  }
}


sort.score(AIC(model_111_ml,model_112_ml,model_113_ml,model_115_ml,model_116_ml,model_615_ml,model_616_ml), score = "aic")

sort.score(BIC(model_111_ml,model_112_ml,model_113_ml,model_115_ml,model_116_ml,model_615_ml,model_616_ml), score = "bic")

#residual analysis formula

residual.analysis <- function(model, std = TRUE,start = 2, class = c("ARIMA","GARCH","ARMA-GARCH")[1]){
  # If you have an output from arima() function use class = "ARIMA"
  # If you have an output from garch() function use class = "GARCH". 
  # Please note that you should use tseries package to be able to run this function for GARCH models.
  # If you have an output from ugarchfit() function use class = "ARMA-GARCH"
  
  if (class == "ARIMA"){
    if (std == TRUE){
      res.model = rstandard(model)
    }else{
      res.model = residuals(model)
    }
  }else if (class == "GARCH"){
    res.model = model$residuals[start:model$n.used]
  }else if (class == "ARMA-GARCH"){
    res.model = model@fit$residuals
  }else {
    stop("The argument 'class' must be either 'ARIMA' or 'GARCH' ")
  }
  par(mfrow=c(3,2))
  plot(res.model,type='o',ylab='Standardised residuals', main="Time series plot of standardised residuals")
  abline(h=0)
  hist(res.model,main="Histogram of standardised residuals")
  acf(res.model,main="ACF of standardised residuals")
  pacf(res.model,main="PACF of standardised residuals")
  qqnorm(res.model,main="QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  print(shapiro.test(res.model))
  k=0
  LBQPlot(res.model, lag.max =length(model$residuals)-1, StartLag = k + 1, k = 0, SquaredQ = FALSE)
}

residual.analysis(model=model_112_css)
residual.analysis(model=model_116_css)
residual.analysis(model=model_113_css)
residual.analysis(model=model_615_css)

fit = Arima(df.diff, model= model_615_css) 
df.forecast = plot(forecast(fit,h=24), ylim=c(-250000000000,250000000000), main ='Interstate Traffic Volume Forecasting for the Next 2 Years') 
title('Fig.11', line=-20)

fit = Arima(df.diff, model = model_615_css)
df_forecast = forecast(fit,h=24)
knitr::kable(df_forecast, 
             caption = "Forecast for Interstate Traffic Volume Forecastign for the Next 2 Years")