This report is produced by R, an elegant programming language for getting data, tidy data, statistics, constructing models and data visualization.
Since I write the reports all by myself without external editor, please be generous if any mistake happened.
Welcome to view my code on github_ma-Haoran to check everything including loading data, data visualization, building time series models and related mathematical formulas, etc.
My E-mail: mhrdyx@126.com
China is the country with the largest traffic passenger scale around the world. It could affect national economy through tourism industry, transportation, communications and so on. Operators in industries like trains, airlines, hotels are especially interested in this data. Thus forecast on the national passenger traffic is meaningful. However, it is tricky to build effective forecast models because of complex external factors such as COVID-19, which, obviously, has a strong impact on traffic passenger. Nonetheless, since the forecast is important, it is worthwhile to give a try.
The historical Monthly data on passenger traffic of China is downloaded from National Bureau of Statistics. View details at https://data.stats.gov.cn/english/easyquery.htm?cn=A01.
Preview the 5 first lines of raw data:
## Month Passenger Traffic Current Period(10000 persons)
## 1: Feb 2021 NA
## 2: Jan 2021 67112
## 3: Dec 2020 86756
## 4: Nov 2020 91786
## 5: Oct 2020 107355
## Passenger Traffic Accumulated (10000 persons)
## 1: NA
## 2: 67112
## 3: 966512
## 4: 879743
## 5: 787958
## Passenger Traffic Growth Rate (The same period last year=100)(%)
## 1: NA
## 2: -47.3
## 3: -35.5
## 4: -32.7
## 5: -31.0
## Passenger Traffic Accumulated Growth Rate (%)
## 1: NA
## 2: -47.3
## 3: -45.1
## 4: -45.9
## 5: -47.1
## Passenger Traffic of Railways Current Period(10000 persons)
## 1: NA
## 2: 15715
## 3: 20743
## 4: 21504
## 5: 27169
## Passenger Traffic of Railways Accumulated (10000 persons)
## 1: NA
## 2: 15715
## 3: 220323
## 4: 199580
## 5: 178075
## Passenger Traffic of Railways Growth Rate (The same period last year=100)(%)
## 1: NA
## 2: -42.1
## 3: -21.1
## 4: -20.6
## 5: -14.8
## Passenger Traffic of Railways Accumulated Growth Rate (%)
## 1: NA
## 2: -42.1
## 3: -39.8
## 4: -41.2
## 5: -43.0
## Passenger Traffic of Highways Current Period(10000 persons)
## 1: NA
## 2: 47527
## 3: 60632
## 4: 64378
## 5: 73173
## Passenger Traffic of Highways Accumulated (10000 persons)
## 1: NA
## 2: 47527
## 3: 689425
## 4: 628793
## 5: 564415
## Passenger Traffic of Highways Growth Rate (The same period last year=100)(%)
## 1: NA
## 2: -49.4
## 3: -40.1
## 4: -36.9
## 5: -36.6
## Passenger Traffic of Highways Accumulated Growth Rate (%)
## 1: NA
## 2: -49.4
## 3: -47.0
## 4: -47.6
## 5: -48.6
## Passenger Traffic of Waterways Current Period(10000 persons)
## 1: NA
## 2: 852
## 3: 1149
## 4: 1462
## 5: 1980
## Passenger Traffic of Waterways Accumulated (10000 persons)
## 1: NA
## 2: 852
## 3: 14987
## 4: 13838
## 5: 12376
## Passenger Traffic of Waterways Growth Rate (The same period last year=100)(%)
## 1: NA
## 2: -38.7
## 3: -32.2
## 4: -27.4
## 5: -26.9
## Passenger Traffic of Waterways Accumulated Growth Rate (%)
## 1: NA
## 2: -38.7
## 3: -45.2
## 4: -46.1
## 5: -47.7
## Passenger Traffic of Civil Aviation Current Period(10000 persons)
## 1: NA
## 2: 3018
## 3: 4232
## 4: 4441
## 5: 5032
## Passenger Traffic of Civil Aviation Accumulated (10000 persons)
## 1: NA
## 2: 3018
## 3: 41777
## 4: 37532
## 5: 33091
## Passenger Traffic of Civil Aviation Growth Rate (The same period last year=100)(%)
## 1: NA
## 2: -40.4
## 3: -19.8
## 4: -16.3
## 5: -11.7
## Passenger Traffic of Civil Aviation Accumulated Growth Rate (%)
## 1: NA
## 2: -40.4
## 3: -36.7
## 4: -38.2
## 5: -40.3
Extract time and passenger traffic information regardless transportation type, tidy data to fit the time series table format.
Supplement missing values existing in “current period” with the difference between current “accumulated” and previous “accumulated”.
Below is the first 10 lines of tidy data:
## # A tsibble: 6 x 5 [1M]
## Month Current_Period ` Accumulated ` Growth_Rate Accumulated_Growth_Rate
## <mth> <int> <int> <dbl> <dbl>
## 1 2021 1月 67112 67112 -47.3 -47.3
## 2 2020 12月 86756 966512 -35.5 -45.1
## 3 2020 11月 91786 879743 -32.7 -45.9
## 4 2020 10月 107355 787958 -31 -47.1
## 5 2020 9月 97345 680602 -33.7 -49
## 6 2020 8月 98214 583258 -37.8 -50.9
Plot the monthly traffic to observe the patterns in data structure.
Here apply a \(\log(y)\) transformation to decrease the fluctuation.
Below is an interactive plot for data visualization.
A significant annual seasonality shows in plot, which is the Spring Festival travel rush.
Regardless seasonality, the trend increased before 2013 but tended to decrease after 2014.
It is worthwhile to emphasize 2 particular points:
In Jan 2018, an expected Spring Festival transportation peak does not show in data. Since the data is endorsed by National Bureau of Statistics of China, we do not doubt the accuracy of the point. Nonetheless, the disappeared seasonality in 2018 makes the forecast more difficult.
Covid-19 firstly appeared in Wuhan on Dec 2019 and began to spread across the country. To protect citizens, Government issued a quarantine policy which led to a sharp decline in passenger traffic. As a result, the emergency disturbed the regular trend and seasonality that historical traffic data could on play a limit row in forecast. To deal with the special issue, it is necessary to put more factors, such as corona virus data of China, into regression.
Getting and tidy Covid-19 data:
Daily Covid-19 data downloaded from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
Preview raw Covid-19 data
## SNo ObservationDate Province/State Country/Region Last Update Confirmed
## 1: 1 01/22/2020 Anhui Mainland China 1/22/2020 17:00 1
## 2: 2 01/22/2020 Beijing Mainland China 1/22/2020 17:00 14
## 3: 3 01/22/2020 Chongqing Mainland China 1/22/2020 17:00 6
## 4: 4 01/22/2020 Fujian Mainland China 1/22/2020 17:00 1
## 5: 5 01/22/2020 Gansu Mainland China 1/22/2020 17:00 0
## 6: 6 01/22/2020 Guangdong Mainland China 1/22/2020 17:00 26
## Deaths Recovered
## 1: 0 0
## 2: 0 0
## 3: 0 0
## 4: 0 0
## 5: 0 0
## 6: 0 0
Preview Processed covid-19 data
## # A tibble: 6 x 4
## Confirm Deaths Recover Month
## <dbl> <dbl> <dbl> <mth>
## 1 9783 213 214 2020 1月
## 2 69468 2622 39065 2020 2月
## 3 2273 470 36783 2020 3月
## 4 1350 1328 1580 2020 4月
## 5 143 1 665 2020 5月
## 6 517 0 173 2020 6月
Until now traffic and epidemic data are available. However, building variables containing information which has a strong correlation with dependent variable (numbers of passengers in this case) could contribute to the accuracy of model. Thus it is reasonable to build two dummy variables here:
“break” : indicate whether there is a confirmed infection cases, “1” means confirmed, 0 means no one confirm.
“knot” : according to the exploratory of passenger traffic data above, both the trend and seasonality changed on Dec 2017 ,Dec 2019 as well as Dec 2020. Thus add “knot” at the three inflection points as an variable.
Preview the first 5 rows of train data.
## # A tsibble: 5 x 7 [1M]
## Month Passenger Confirm Deaths Recover Break knot
## <mth> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2010 1月 254630 0 0 0 0 0
## 2 2010 2月 283614 0 0 0 0 0
## 3 2010 3月 266058 0 0 0 0 0
## 4 2010 4月 263024 0 0 0 0 0
## 5 2010 5月 266319 0 0 0 0 0
Now 6 variables (Month, Confirm, Deaths, Recover, Break, Knot) could be used to build models for forecasting “Passenger”. Rather than put all these variables into models, it is wise to select ones which affect Passenger significantly, or the model would become complex and lead to over fitting. Besides, redundant variables would distort the correlations within the model.
Analysis of variance (ANOVA) enables to determine which factors have a statistical influence on the given data set. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means. Factors with \(Pr < 0.05\) are statistical significant.
ANOVA:
First test all the factors:
## Analysis of Variance Table
##
## Response: log(Passenger)
## Df Sum Sq Mean Sq F value Pr(>F)
## Month 1 24.686 24.6858 63.9765 7.58e-13 ***
## Confirm 1 3.663 3.6634 9.4943 0.00254 **
## Recover 1 1.352 1.3517 3.5030 0.06361 .
## Deaths 1 0.425 0.4250 1.1015 0.29598
## lag(Confirm, 1) 1 0.001 0.0007 0.0017 0.96729
## Break 1 1.029 1.0286 2.6657 0.10507
## knot 1 0.001 0.0011 0.0028 0.95791
## Residuals 124 47.846 0.3859
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Rule out Deaths and lag(Confirm,1), then perform ANOVA, \(Pr\) get better.
## Analysis of Variance Table
##
## Response: log(Passenger)
## Df Sum Sq Mean Sq F value Pr(>F)
## Month 1 24.369 24.3692 63.9648 6.78e-13 ***
## Confirm 1 3.696 3.6958 9.7007 0.002277 **
## Recover 1 1.366 1.3659 3.5852 0.060572 .
## Break 1 1.212 1.2116 3.1803 0.076917 .
## knot 1 0.001 0.0014 0.0037 0.951517
## Residuals 127 48.384 0.3810
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The purpose is to build a model for forecasting rather than for explaining the relationship between variables. So I would put the accuracy as first priority at the cost of losing some interpretability. In this case, I prefer to make a \(log()\) transformation on \(Passenger\) and take \(knot\) into regression to improve forecast performance. When test residuals and forecast, \(log(Passenger)\) would be reversed to the initial format, \(Passenger\).
STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess,” while Loess is a method for estimating nonlinear relationships.
According to STL decomposition:
*Trend declined from Jul 2013, but got smoother since Aug 2014. Also, a sharp declination took place from Oct 2019.
*The seasonality period is 12 months.
*Notably, the remainder was led by the seasonality disappeared in 2018, and the occurrence of Covid-19 at the end of 2019.
Apply \(log()\) to each variable for plot.
A negative correlation between \(Log(Passenger)\) and \(log(Confirm)\) showed in the plot above.
An interactive 3D scatter plot for visualization:
In this case, a good model should:
Capture the trend, seasonality of the data.
Model the relationship between Passenger and Covid-19 epidemic.
The models I choose:
where, \(Y_t\) is the dependent factor, \(Passenger\) in this case; \(f_i\) is the frequency of seasonality, \(1/12\) in this case; \(j\) is the amount of pairs of Fourier items; \(A_0\), \(A_j\), \(B_j\) are constants which could be obtained by fitting observing data.
\[y_{t}= c + \phi_{1}y_{t - 1} + \cdots + \phi_{p}y_{t - p} + \theta_{1}\varepsilon_{t - 1} + \cdots + \theta_{q}\varepsilon_{t - q} + \varepsilon_{t}\]
where \(\varepsilon_t\) is white noise, \(c\) is the mean of observed series, \(\phi\) and \(\theta\) are constant coefficients.
Comparing to ARMA(p,q), ARIMA(p,d,q) has an extra parameter - \(d\) degree of first differencing involved. (See detail at ARIMA)
Try \(j=1,2,3,4,5,6\) with \(ARIMA\) error on \(log(Passenger)\). Set \(ARIMA(p,d,q)(P,D,Q)_s\) \(P=D=Q=0\) to force Fourier describe all the seasonality. Typically, \(j\) with least \(AICc\) indicates a better fit.
We get the least \(AICc=146.53\) when \(K=6\), so \(6\) pairs of Fourier terms should be fitted in the model.
\[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} +\sum_{i = 1}^{n} [{A_j}\cos(2{\pi}{f_i}t)+{B_j}\sin(2{\pi}{f_i}t)] +\varepsilon_t\]
The blue curve is the fit curve, 95% confidence interval is also plotted above. For the peak point for each annual, however, forecast is much higher than real data.
See the details of the model:
## Series: Passenger
## Model: TSLM
## Transformation: log(Passenger)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4172 -0.2826 -0.1039 0.3176 0.5186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.245e+01 3.002e-02 414.775 < 2e-16 ***
## Confirm -9.181e-06 7.297e-06 -1.258 0.2108
## Recover -2.537e-05 9.910e-06 -2.560 0.0117 *
## Break -7.956e-01 1.074e-01 -7.409 2.15e-11 ***
## knot -2.516e+00 2.321e-01 -10.842 < 2e-16 ***
## fourier(K = 6)C1_12 3.600e-01 4.127e-02 8.723 2.17e-14 ***
## fourier(K = 6)S1_12 -2.340e-01 4.084e-02 -5.728 8.02e-08 ***
## fourier(K = 6)C2_12 2.175e-01 4.048e-02 5.373 4.00e-07 ***
## fourier(K = 6)S2_12 -3.256e-01 4.162e-02 -7.822 2.54e-12 ***
## fourier(K = 6)C3_12 7.402e-04 4.041e-02 0.018 0.9854
## fourier(K = 6)S3_12 -3.985e-01 4.171e-02 -9.553 2.47e-16 ***
## fourier(K = 6)C4_12 -2.050e-01 4.055e-02 -5.056 1.60e-06 ***
## fourier(K = 6)S4_12 -3.509e-01 4.156e-02 -8.444 9.59e-14 ***
## fourier(K = 6)C5_12 -3.750e-01 4.118e-02 -9.106 2.76e-15 ***
## fourier(K = 6)S5_12 -2.302e-01 4.082e-02 -5.639 1.21e-07 ***
## fourier(K = 6)C6_12 -2.170e-01 2.896e-02 -7.492 1.40e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3263 on 117 degrees of freedom
## Multiple R-squared: 0.8423, Adjusted R-squared: 0.8221
## F-statistic: 41.67 on 15 and 117 DF, p-value: < 2.22e-16
\(Pr(>|t|)\) gives the p-value for that t-test (the proportion of the t distribution at that df which is greater than the absolute value of t statistic). It is rational to accept the coefficients since most of \(P-value<0.05\) except two.
The Adjusted R-squared=0.8221, indicates the model is capable of explaining 82.21% variance of the historical data.
Test the residuals:
However, ACF significant above 0, indicates the residual autocorrelation is statistically significant. Thus should construct a stochastic model to fit the residuals. Applying \(ARIMA\) model for residuals in this case.
Set \(ARIMA(p,d,q)(P,D,Q)_s\) \(P=D=Q=0\) to force Fourier describe all the seasonality. Set d=0 so the differencing will be dealed by linear regression. In this case the model could be written as: \[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} +\sum_{i = 1}^{n} [{A_j}\cos(2{\pi}{f_i}t)+{B_j}\sin(2{\pi}{f_i}t)] +\eta_t\]
\[\eta_{t}= c + \phi_{1}\eta_{t - 1} + \cdots + \phi_{p}\eta_{t - p} + \theta_{1}\varepsilon_{t - 1} + \cdots + \theta_{q}\varepsilon_{t - q} + \varepsilon_{t}\]
The model fit well except the peak on Jan 2016.
Test the residuals :
ACF is not significant different from 0, so cannot find evidence to prove the residual autocorrelation is statistically significant. Thus model 2 is better on its residuals.
Further implement Ljung_Box test on the residuals to buttress the idea.
Ljung_Box test:
\(H_0\): The series is not significant from a white noise.
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 "ARIMA(log(Passenger) ~ PDQ(0, 0, 0) + pdq(d = 0) + Confirm~ 0.0938 0.759
\(P-value=0.76>0.09\), so cannot reject Null hypothesis. Model 2 indeed fit better.
Finally we could make a forecast, however, whether the forecast is accuracy depend on the future situaion of Covid-19. Here I make a forecast under 3 different scenarios:
A scenario without Covid-19 breaks in mainland China.
A scenario where Covid-19 breaks, with 1 patient confirm and 1 patient recover everyday.
A scenario where Covid-19 breaks, with 1 patient confirm and 1 patient recover everyday. However, public emergencies, such as wide mandatory isolation, closing city, etc, will arise in the future. The emergency would be reflected as a knot in the model.
View the forecast:
## # A fable: 36 x 9 [1M]
## # Key: Scenario, .model [3]
## Scenario .model Month Passenger .mean Confirm Break knot Recover
## <chr> <chr> <mth> <dist> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Without_~ TSLM(l~ 2021 2月 t(N(12, 0.12)) 2.33e5 0 0 0 0
## 2 Without_~ TSLM(l~ 2021 3月 t(N(12, 0.12)) 2.19e5 0 0 0 0
## 3 Without_~ TSLM(l~ 2021 4月 t(N(12, 0.12)) 2.06e5 0 0 0 0
## 4 Without_~ TSLM(l~ 2021 5月 t(N(12, 0.12)) 2.13e5 0 0 0 0
## 5 Without_~ TSLM(l~ 2021 6月 t(N(12, 0.12)) 2.10e5 0 0 0 0
## 6 Without_~ TSLM(l~ 2021 7月 t(N(12, 0.12)) 2.24e5 0 0 0 0
## 7 Without_~ TSLM(l~ 2021 8月 t(N(12, 0.12)) 2.30e5 0 0 0 0
## 8 Without_~ TSLM(l~ 2021 9月 t(N(12, 0.12)) 2.24e5 0 0 0 0
## 9 Without_~ TSLM(l~ 2021 10月 t(N(12, 0.12)) 2.35e5 0 0 0 0
## 10 Without_~ TSLM(l~ 2021 11月 t(N(12, 0.12)) 2.11e5 0 0 0 0
## # ... with 26 more rows