| Historical Rainfall Data in Ireland and Forecasting Precipitations in Dublin |
Basic statistics
The supplied dataset (“rain”) consists of 49,500 observations and 4 variables. Data are organized according to year (from 1850 to 2014), month, station (25), and total monthly precipitations. Herein, I will first look for the mean annual rainfall values for each station; to do this, I have to create a new file (“StMean”) by manipulating the data using pipelines. The dplyr [1] package is used to create a pipeline (%>%), which finds the mean values of the rainfall data given in the dataset. Then, the ggplot2 [2] package is used to plot the data as histograms with standard deviations.
An alternative look at the data
Geometric bars are partially informative; in particular standard deviations are almost meaningless on annual means. Boxplots may represent an interesting, and more informative, way to visualize the data. Herein, mean values are visualized as black straight horizontal lines within the boxplots, and outliers (black dots) are also visualized
Local annual trends
By filtering the data for each station, differences in trends are even more appreciable. While some stations have consistently rising rainfall levels in the period under investigation, others are very fluctuating.
A map of Ireland with mean annual precipitation
By merging the newly created “StMean” and the provided “stations” files, a new dataset was created (“datarain”). This dataset contains not only the mean precipitation values for each station but also their geographic coordinates. I used this file to create my map with the leaflet [3] package.
Forecasting precipitations: an example.
I will use the historical rainfall data to try to forecast future precipitations on the capital city of the Irish Republic. I will isolate the Dublin (more precisely, the Dublin Airport) rainfall data and then I will convert the data in a format recognized by the xts package [4]. Finally, I will plot the historical data using the dygraphs [5] package, which offers many features, as zooming.
Modeling the time series data.
To model our data to forecast future rain events at the Dublin Airport station, I first have to create a training dataset and a test dataset. Herein, all data until 2010 will form the training dataset (D_train), while all the other data will form the test dataset (D_test). To perform the analyses, 3 different R packages are used: urca [6], forecast [7] ad astsa [8].
I am going to use the Arima model to forecast the data. The Autoregressive integrated moving average (ARIMA) models predict future values based on past values [9]. Before using this model, I need to check the data are not stationary (i.e., the properties of the time series do not depend on the time at which the series is observed). I am going to use two different tests to check it. The KPSS Test (null hypothesis: Data are stationary) and the D-F Test (null hypothesis: Data are non-stationary). The results below show that the KPSS null hypothesis is rejected (p-value>0.5), while the D-F null hypothesis is accepted (p-value<0.005). Data are not stationary.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 11 lags.
##
## Value of test-statistic is: 40.0499
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression none
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.81 -18.19 4.27 30.29 164.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -0.11623 0.01319 -8.809 <2e-16 ***
## z.diff.lag -0.44213 0.02043 -21.644 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.06 on 1928 degrees of freedom
## Multiple R-squared: 0.2794, Adjusted R-squared: 0.2786
## F-statistic: 373.8 on 2 and 1928 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -8.8087
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.58 -1.95 -1.62
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF 0.07 0.07 0.00 0.06 0.1 0.06 0.05 0.06 0.05 0.08 0.04 0.09 0.05
## PACF 0.07 0.06 -0.01 0.05 0.1 0.05 0.04 0.05 0.04 0.06 0.01 0.07 0.02
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF 0.06 0.05 0.05 0.09 0.05 0.06 0.05 0.06 0.08 0.08 0.08 0.03
## PACF 0.03 0.02 0.02 0.07 0.02 0.02 0.02 0.03 0.05 0.04 0.04 0.00
## [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
## ACF 0.06 0.10 0.07 0.07 0.07 0.04 0.04 0.05 0.06 0.08 0.09 0.05
## PACF 0.03 0.06 0.03 0.02 0.03 0.00 0.00 0.01 0.02 0.03 0.04 0.01
## [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
## ACF 0.09 0.04 0.03 0.06 0.02 0.05 0.09 0.01 0.05 0.05 0.05 0.06
## PACF 0.06 -0.01 -0.02 0.02 -0.02 0.00 0.05 -0.04 0.00 0.01 -0.01 0.02
## [,50] [,51] [,52] [,53] [,54]
## ACF 0.06 0.10 0.07 0.03 0.05
## PACF 0.01 0.06 0.03 -0.02 0.01
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,2)(1,0,2)[12] with non-zero mean
## Q* = 40.314, df = 18, p-value = 0.001891
##
## Model df: 6. Total lags used: 24
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1044829 32.27883 25.33405 -50.188385 74.53019 0.7378056
## Test set 9.4849157 41.32536 27.76316 -8.131686 37.09009 0.8085486
## ACF1 Theil's U
## Training set -0.0008529944 NA
## Test set 0.1523061705 0.874273
After investigating the accuracy of the model, now I can run it and then plot it. A time span of 12 months in the future will be used. In the plot, the highlighted area corresponds to the forecasted rainfall data.
References.
[1] Wickham, H., François, R., Henry, L., Müller, K., & Wickham, M. H. (2019). Package ‘dplyr’. A Grammar of Data Manipulation. R package version, 8.
[2] Wickham, H., Chang, W., & Wickham, M. H. (2016). Package ‘ggplot2’. Create elegant data visualisations using the grammar of graphics. Version, 2(1), 1-189.
[3] Cheng, J., Karambelkar, B., Xie, Y., Wickham, H., Russell, K., Johnson, K., … & Agafonkin, V. (2023). Package ‘leaflet’.
[4] Ryan, J. A., Ulrich, J. M., & Ryan, M. J. A. (2023). Package ‘xts’.
[5] Vanderkam, D., Allaire, J. J., Owen, J., Gromer, D., & Thieurmel, B. (2018). dygraphs: Interface to’dygraphs’ interactive time series charting library. R package version 1.1. 1.6. Retrieved August 19, 2021.
[6] Pfaff, B., Zivot, E., Stigler, M., & Pfaff, M. B. (2016). Package ‘urca’. Unit root and cointegration tests for time series data. R package version, 1-2.
[7] Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of statistical software, 27, 1-22.
[8] Stoffer, D., & Stoffer, M. D. (2023). Package ‘astsa’. blood, 8, 1.
[9] Bari, S. H., Rahman, M. T., Hussain, M. M., & Ray, S. (2015). Forecasting monthly precipitation in Sylhet city using ARIMA model. Civil and Environmental Research, 7(1), 69-77.