About

This report is produced by R, an elegant programming language for getting data, tidy data, statistics, constructing models and data visualization.

Since I write the reports all by myself without external editor, please be generous if any mistake happened.

Welcome to view my code on github_ma-Haoran to check everything including loading data, data visualization, building time series models and related mathematical formulas, etc.

My E-mail:

Forecast the National Passenger Traffic in China

China is the country with the largest traffic passenger scale around the world. It could affect national economy through tourism industry, transportation, communications and so on. Operators in industries like trains, airlines, hotels are especially interested in this data. Thus forecast on the national passenger traffic is meaningful. However, it is tricky to build effective forecast models because of complex external factors such as COVID-19, which, obviously, has a strong impact on traffic passenger. Nonetheless, since the forecast is important, it is worthwhile to give a try.

Getting Data

The historical Monthly data on passenger traffic of China is downloaded from National Bureau of Statistics. View details at https://data.stats.gov.cn/english/easyquery.htm?cn=A01.

Preview the 5 first lines of raw data:

##       Month Passenger Traffic Current Period(10000 persons)
## 1: Feb 2021                                              NA
## 2: Jan 2021                                           67112
## 3: Dec 2020                                           86756
## 4: Nov 2020                                           91786
## 5: Oct 2020                                          107355
##    Passenger Traffic Accumulated (10000 persons)
## 1:                                            NA
## 2:                                         67112
## 3:                                        966512
## 4:                                        879743
## 5:                                        787958
##    Passenger Traffic Growth Rate (The same period last year=100)(%)
## 1:                                                               NA
## 2:                                                            -47.3
## 3:                                                            -35.5
## 4:                                                            -32.7
## 5:                                                            -31.0
##    Passenger Traffic Accumulated Growth Rate (%)
## 1:                                            NA
## 2:                                         -47.3
## 3:                                         -45.1
## 4:                                         -45.9
## 5:                                         -47.1
##    Passenger Traffic of Railways Current Period(10000 persons)
## 1:                                                          NA
## 2:                                                       15715
## 3:                                                       20743
## 4:                                                       21504
## 5:                                                       27169
##    Passenger Traffic of Railways Accumulated (10000 persons)
## 1:                                                        NA
## 2:                                                     15715
## 3:                                                    220323
## 4:                                                    199580
## 5:                                                    178075
##    Passenger Traffic of Railways Growth Rate (The same period last year=100)(%)
## 1:                                                                           NA
## 2:                                                                        -42.1
## 3:                                                                        -21.1
## 4:                                                                        -20.6
## 5:                                                                        -14.8
##    Passenger Traffic of Railways Accumulated Growth Rate (%)
## 1:                                                        NA
## 2:                                                     -42.1
## 3:                                                     -39.8
## 4:                                                     -41.2
## 5:                                                     -43.0
##    Passenger Traffic of Highways Current Period(10000 persons)
## 1:                                                          NA
## 2:                                                       47527
## 3:                                                       60632
## 4:                                                       64378
## 5:                                                       73173
##    Passenger Traffic of Highways Accumulated (10000 persons)
## 1:                                                        NA
## 2:                                                     47527
## 3:                                                    689425
## 4:                                                    628793
## 5:                                                    564415
##    Passenger Traffic of Highways Growth Rate (The same period last year=100)(%)
## 1:                                                                           NA
## 2:                                                                        -49.4
## 3:                                                                        -40.1
## 4:                                                                        -36.9
## 5:                                                                        -36.6
##    Passenger Traffic of Highways Accumulated Growth Rate (%)
## 1:                                                        NA
## 2:                                                     -49.4
## 3:                                                     -47.0
## 4:                                                     -47.6
## 5:                                                     -48.6
##    Passenger Traffic of Waterways Current Period(10000 persons)
## 1:                                                           NA
## 2:                                                          852
## 3:                                                         1149
## 4:                                                         1462
## 5:                                                         1980
##    Passenger Traffic of Waterways Accumulated (10000 persons)
## 1:                                                         NA
## 2:                                                        852
## 3:                                                      14987
## 4:                                                      13838
## 5:                                                      12376
##    Passenger Traffic of Waterways Growth Rate (The same period last year=100)(%)
## 1:                                                                            NA
## 2:                                                                         -38.7
## 3:                                                                         -32.2
## 4:                                                                         -27.4
## 5:                                                                         -26.9
##    Passenger Traffic of Waterways Accumulated Growth Rate (%)
## 1:                                                         NA
## 2:                                                      -38.7
## 3:                                                      -45.2
## 4:                                                      -46.1
## 5:                                                      -47.7
##    Passenger Traffic of Civil Aviation Current Period(10000 persons)
## 1:                                                                NA
## 2:                                                              3018
## 3:                                                              4232
## 4:                                                              4441
## 5:                                                              5032
##    Passenger Traffic of Civil Aviation Accumulated (10000 persons)
## 1:                                                              NA
## 2:                                                            3018
## 3:                                                           41777
## 4:                                                           37532
## 5:                                                           33091
##    Passenger Traffic of Civil Aviation Growth Rate (The same period last year=100)(%)
## 1:                                                                                 NA
## 2:                                                                              -40.4
## 3:                                                                              -19.8
## 4:                                                                              -16.3
## 5:                                                                              -11.7
##    Passenger Traffic of Civil Aviation Accumulated Growth Rate (%)
## 1:                                                              NA
## 2:                                                           -40.4
## 3:                                                           -36.7
## 4:                                                           -38.2
## 5:                                                           -40.3

Processing Data and Exploratory Analysis

Extract time and passenger traffic information regardless transportation type, tidy data to fit the time series table format.

Supplement missing values existing in “current period” with the difference between current “accumulated” and previous “accumulated”.

Below is the first 10 lines of tidy data:

## # A tsibble: 6 x 5 [1M]
##       Month Current_Period ` Accumulated ` Growth_Rate Accumulated_Growth_Rate
##       <mth>          <int>           <int>       <dbl>                   <dbl>
## 1  2021 1月          67112           67112       -47.3                   -47.3
## 2 2020 12月          86756          966512       -35.5                   -45.1
## 3 2020 11月          91786          879743       -32.7                   -45.9
## 4 2020 10月         107355          787958       -31                     -47.1
## 5  2020 9月          97345          680602       -33.7                   -49  
## 6  2020 8月          98214          583258       -37.8                   -50.9

Plot the monthly traffic to observe the patterns in data structure.

Here apply a \(\log(y)\) transformation to decrease the fluctuation.

Below is an interactive plot for data visualization.

A significant annual seasonality shows in plot, which is the Spring Festival travel rush.

Regardless seasonality, the trend increased before 2013 but tended to decrease after 2014.

It is worthwhile to emphasize 2 particular points:

  1. In Jan 2018, an expected Spring Festival transportation peak does not show in data. Since the data is endorsed by National Bureau of Statistics of China, we do not doubt the accuracy of the point. Nonetheless, the disappeared seasonality in 2018 makes the forecast more difficult.

  2. Covid-19 firstly appeared in Wuhan on Dec 2019 and began to spread across the country. To protect citizens, Government issued a quarantine policy which led to a sharp decline in passenger traffic. As a result, the emergency disturbed the regular trend and seasonality that historical traffic data could on play a limit row in forecast. To deal with the special issue, it is necessary to put more factors, such as corona virus data of China, into regression.

Getting and tidy Covid-19 data:

Daily Covid-19 data downloaded from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

Preview raw Covid-19 data

##    SNo ObservationDate Province/State Country/Region     Last Update Confirmed
## 1:   1      01/22/2020          Anhui Mainland China 1/22/2020 17:00         1
## 2:   2      01/22/2020        Beijing Mainland China 1/22/2020 17:00        14
## 3:   3      01/22/2020      Chongqing Mainland China 1/22/2020 17:00         6
## 4:   4      01/22/2020         Fujian Mainland China 1/22/2020 17:00         1
## 5:   5      01/22/2020          Gansu Mainland China 1/22/2020 17:00         0
## 6:   6      01/22/2020      Guangdong Mainland China 1/22/2020 17:00        26
##    Deaths Recovered
## 1:      0         0
## 2:      0         0
## 3:      0         0
## 4:      0         0
## 5:      0         0
## 6:      0         0

Preview Processed covid-19 data

## # A tibble: 6 x 4
##   Confirm Deaths Recover    Month
##     <dbl>  <dbl>   <dbl>    <mth>
## 1    9783    213     214 2020 1月
## 2   69468   2622   39065 2020 2月
## 3    2273    470   36783 2020 3月
## 4    1350   1328    1580 2020 4月
## 5     143      1     665 2020 5月
## 6     517      0     173 2020 6月

Models

Constuct and filter independent variables

Until now traffic and epidemic data are available. However, building variables containing information which has a strong correlation with dependent variable (numbers of passengers in this case) could contribute to the accuracy of model. Thus it is reasonable to build two dummy variables here:

  1. “break” : indicate whether there is a confirmed infection cases, “1” means confirmed, 0 means no one confirm.

  2. “knot” : according to the exploratory of passenger traffic data above, both the trend and seasonality changed on Dec 2017 ,Dec 2019 as well as Dec 2020. Thus add “knot” at the three inflection points as an variable.

Preview the first 5 rows of train data.

## # A tsibble: 5 x 7 [1M]
##      Month Passenger Confirm Deaths Recover Break  knot
##      <mth>     <int>   <dbl>  <dbl>   <dbl> <dbl> <dbl>
## 1 2010 1月    254630       0      0       0     0     0
## 2 2010 2月    283614       0      0       0     0     0
## 3 2010 3月    266058       0      0       0     0     0
## 4 2010 4月    263024       0      0       0     0     0
## 5 2010 5月    266319       0      0       0     0     0

Now 6 variables (Month, Confirm, Deaths, Recover, Break, Knot) could be used to build models for forecasting “Passenger”. Rather than put all these variables into models, it is wise to select ones which affect Passenger significantly, or the model would become complex and lead to over fitting. Besides, redundant variables would distort the correlations within the model.

Analysis of variance (ANOVA) enables to determine which factors have a statistical influence on the given data set. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means. Factors with \(Pr < 0.05\) are statistical significant.

ANOVA:

First test all the factors:

## Analysis of Variance Table
## 
## Response: log(Passenger)
##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## Month             1 24.686 24.6858 63.9765 7.58e-13 ***
## Confirm           1  3.663  3.6634  9.4943  0.00254 ** 
## Recover           1  1.352  1.3517  3.5030  0.06361 .  
## Deaths            1  0.425  0.4250  1.1015  0.29598    
## lag(Confirm, 1)   1  0.001  0.0007  0.0017  0.96729    
## Break             1  1.029  1.0286  2.6657  0.10507    
## knot              1  0.001  0.0011  0.0028  0.95791    
## Residuals       124 47.846  0.3859                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Rule out Deaths and lag(Confirm,1), then perform ANOVA, \(Pr\) get better.

## Analysis of Variance Table
## 
## Response: log(Passenger)
##            Df Sum Sq Mean Sq F value   Pr(>F)    
## Month       1 24.369 24.3692 63.9648 6.78e-13 ***
## Confirm     1  3.696  3.6958  9.7007 0.002277 ** 
## Recover     1  1.366  1.3659  3.5852 0.060572 .  
## Break       1  1.212  1.2116  3.1803 0.076917 .  
## knot        1  0.001  0.0014  0.0037 0.951517    
## Residuals 127 48.384  0.3810                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The purpose is to build a model for forecasting rather than for explaining the relationship between variables. So I would put the accuracy as first priority at the cost of losing some interpretability. In this case, I prefer to make a \(log()\) transformation on \(Passenger\) and take \(knot\) into regression to improve forecast performance. When test residuals and forecast, \(log(Passenger)\) would be reversed to the initial format, \(Passenger\).

Seasonal-Trend decomposition

STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess,” while Loess is a method for estimating nonlinear relationships.

According to STL decomposition:

*Trend declined from Jul 2013, but got smoother since Aug 2014. Also, a sharp declination took place from Oct 2019.

*The seasonality period is 12 months.

*Notably, the remainder was led by the seasonality disappeared in 2018, and the occurrence of Covid-19 at the end of 2019.

Relationship between traffic passenger and epidemic

Apply \(log()\) to each variable for plot.

A negative correlation between \(Log(Passenger)\) and \(log(Confirm)\) showed in the plot above.

An interactive 3D scatter plot for visualization:

Choose Models

In this case, a good model should:

  1. Capture the trend, seasonality of the data.

  2. Model the relationship between Passenger and Covid-19 epidemic.

The models I choose:

  1. A harmonic model that describes the seasonality. Such as models using Fourier terms could capture the seasonality: \[Y_t = A_0 + \sum_{i = 1}^{n} [{A_j}\cos(2{\pi}{f_i}t)+{B_j}\sin(2{\pi}{f_i}t)]\]

where, \(Y_t\) is the dependent factor, \(Passenger\) in this case; \(f_i\) is the frequency of seasonality, \(1/12\) in this case; \(j\) is the amount of pairs of Fourier items; \(A_0\), \(A_j\), \(B_j\) are constants which could be obtained by fitting observing data.

  1. A time series multiple linear regression where epidemic predictor and knots could catch the deterministic trend. The general form of a multiple regression model is \[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} + \varepsilon_t\]
  • \(y_t\) is the variable we want to predict: the “Passenger” .
  • Each \(x_{j,t}\) is numerical and is called a ``predictor’’. They are usually assumed to be known for all past and future times.
  • The coefficients \(\beta_1,\dots,\beta_k\) measure the effect of each predictor after taking account of the effect of all other predictors in the model.
  • \(\varepsilon_t\) is a white noise error term
  1. The last thing is about residuals. If the model with two components mentioned above can effectively fit the data, then the residuals should be in independent distribution with 0 \(mean\) and 0 auto-covariance (\(ACF\)), such as a normal distribution. However, when the residuals are not independent or in weak correlation, it is better to build a stochastic model such as \(ARIMA\) model to fit the residuals. ARMA Model:

\[y_{t}= c + \phi_{1}y_{t - 1} + \cdots + \phi_{p}y_{t - p} + \theta_{1}\varepsilon_{t - 1} + \cdots + \theta_{q}\varepsilon_{t - q} + \varepsilon_{t}\]

where \(\varepsilon_t\) is white noise, \(c\) is the mean of observed series, \(\phi\) and \(\theta\) are constant coefficients.

Comparing to ARMA(p,q), ARIMA(p,d,q) has an extra parameter - \(d\) degree of first differencing involved. (See detail at ARIMA)

Find the proper \(j\): Pairs of Fourier terms for Harmonic Model

Try \(j=1,2,3,4,5,6\) with \(ARIMA\) error on \(log(Passenger)\). Set \(ARIMA(p,d,q)(P,D,Q)_s\) \(P=D=Q=0\) to force Fourier describe all the seasonality. Typically, \(j\) with least \(AICc\) indicates a better fit.

We get the least \(AICc=146.53\) when \(K=6\), so \(6\) pairs of Fourier terms should be fitted in the model.

Build Models

Model 1 : TSLM Model with Harmonic Components

\[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} +\sum_{i = 1}^{n} [{A_j}\cos(2{\pi}{f_i}t)+{B_j}\sin(2{\pi}{f_i}t)] +\varepsilon_t\]

The blue curve is the fit curve, 95% confidence interval is also plotted above. For the peak point for each annual, however, forecast is much higher than real data.

See the details of the model:

## Series: Passenger 
## Model: TSLM 
## Transformation: log(Passenger) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4172 -0.2826 -0.1039  0.3176  0.5186 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.245e+01  3.002e-02 414.775  < 2e-16 ***
## Confirm             -9.181e-06  7.297e-06  -1.258   0.2108    
## Recover             -2.537e-05  9.910e-06  -2.560   0.0117 *  
## Break               -7.956e-01  1.074e-01  -7.409 2.15e-11 ***
## knot                -2.516e+00  2.321e-01 -10.842  < 2e-16 ***
## fourier(K = 6)C1_12  3.600e-01  4.127e-02   8.723 2.17e-14 ***
## fourier(K = 6)S1_12 -2.340e-01  4.084e-02  -5.728 8.02e-08 ***
## fourier(K = 6)C2_12  2.175e-01  4.048e-02   5.373 4.00e-07 ***
## fourier(K = 6)S2_12 -3.256e-01  4.162e-02  -7.822 2.54e-12 ***
## fourier(K = 6)C3_12  7.402e-04  4.041e-02   0.018   0.9854    
## fourier(K = 6)S3_12 -3.985e-01  4.171e-02  -9.553 2.47e-16 ***
## fourier(K = 6)C4_12 -2.050e-01  4.055e-02  -5.056 1.60e-06 ***
## fourier(K = 6)S4_12 -3.509e-01  4.156e-02  -8.444 9.59e-14 ***
## fourier(K = 6)C5_12 -3.750e-01  4.118e-02  -9.106 2.76e-15 ***
## fourier(K = 6)S5_12 -2.302e-01  4.082e-02  -5.639 1.21e-07 ***
## fourier(K = 6)C6_12 -2.170e-01  2.896e-02  -7.492 1.40e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3263 on 117 degrees of freedom
## Multiple R-squared: 0.8423,  Adjusted R-squared: 0.8221
## F-statistic: 41.67 on 15 and 117 DF, p-value: < 2.22e-16

\(Pr(>|t|)\) gives the p-value for that t-test (the proportion of the t distribution at that df which is greater than the absolute value of t statistic). It is rational to accept the coefficients since most of \(P-value<0.05\) except two.

The Adjusted R-squared=0.8221, indicates the model is capable of explaining 82.21% variance of the historical data.

Test the residuals:

However, ACF significant above 0, indicates the residual autocorrelation is statistically significant. Thus should construct a stochastic model to fit the residuals. Applying \(ARIMA\) model for residuals in this case.

Model 2 : TSLM Model with Harmonic Components + ARIMA()

Set \(ARIMA(p,d,q)(P,D,Q)_s\) \(P=D=Q=0\) to force Fourier describe all the seasonality. Set d=0 so the differencing will be dealed by linear regression. In this case the model could be written as: \[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} +\sum_{i = 1}^{n} [{A_j}\cos(2{\pi}{f_i}t)+{B_j}\sin(2{\pi}{f_i}t)] +\eta_t\]

\[\eta_{t}= c + \phi_{1}\eta_{t - 1} + \cdots + \phi_{p}\eta_{t - p} + \theta_{1}\varepsilon_{t - 1} + \cdots + \theta_{q}\varepsilon_{t - q} + \varepsilon_{t}\]

The model fit well except the peak on Jan 2016.

Test the residuals :

ACF is not significant different from 0, so cannot find evidence to prove the residual autocorrelation is statistically significant. Thus model 2 is better on its residuals.

Further implement Ljung_Box test on the residuals to buttress the idea.

Ljung_Box test:

\(H_0\): The series is not significant from a white noise.

## # A tibble: 1 x 3
##   .model                                                       lb_stat lb_pvalue
##   <chr>                                                          <dbl>     <dbl>
## 1 "ARIMA(log(Passenger) ~ PDQ(0, 0, 0) + pdq(d = 0) + Confirm~  0.0938     0.759

\(P-value=0.76>0.09\), so cannot reject Null hypothesis. Model 2 indeed fit better.

12-Month-Forecast

Finally we could make a forecast, however, whether the forecast is accuracy depend on the future situaion of Covid-19. Here I make a forecast under 3 different scenarios:

  1. A scenario without Covid-19 breaks in mainland China.

  2. A scenario where Covid-19 breaks, with 1 patient confirm and 1 patient recover everyday.

  3. A scenario where Covid-19 breaks, with 1 patient confirm and 1 patient recover everyday. However, public emergencies, such as wide mandatory isolation, closing city, etc, will arise in the future. The emergency would be reflected as a knot in the model.

View the forecast:

## # A fable: 36 x 9 [1M]
## # Key:     Scenario, .model [3]
##    Scenario  .model      Month      Passenger  .mean Confirm Break  knot Recover
##    <chr>     <chr>       <mth>         <dist>  <dbl>   <dbl> <dbl> <dbl>   <dbl>
##  1 Without_~ TSLM(l~  2021 2月 t(N(12, 0.12)) 2.33e5       0     0     0       0
##  2 Without_~ TSLM(l~  2021 3月 t(N(12, 0.12)) 2.19e5       0     0     0       0
##  3 Without_~ TSLM(l~  2021 4月 t(N(12, 0.12)) 2.06e5       0     0     0       0
##  4 Without_~ TSLM(l~  2021 5月 t(N(12, 0.12)) 2.13e5       0     0     0       0
##  5 Without_~ TSLM(l~  2021 6月 t(N(12, 0.12)) 2.10e5       0     0     0       0
##  6 Without_~ TSLM(l~  2021 7月 t(N(12, 0.12)) 2.24e5       0     0     0       0
##  7 Without_~ TSLM(l~  2021 8月 t(N(12, 0.12)) 2.30e5       0     0     0       0
##  8 Without_~ TSLM(l~  2021 9月 t(N(12, 0.12)) 2.24e5       0     0     0       0
##  9 Without_~ TSLM(l~ 2021 10月 t(N(12, 0.12)) 2.35e5       0     0     0       0
## 10 Without_~ TSLM(l~ 2021 11月 t(N(12, 0.12)) 2.11e5       0     0     0       0
## # ... with 26 more rows

Plot the forecast:

Welcome to view my code for this report on github_ma-Haoran