Forecasting Precipitation on Dublin

The supplied dataset (“rain”) consist of 49,500 observations and 4 variables. Data are organized according to year (from 1850 to 2014), month, station (25), and total monthly precipitations. Herein, I will firstly look for the mean annual rainfall values for each station; to do this, I have to create a new file (“StMean”) by manipulating the data using pipelines. The dplyr package is used to create a pipeline (%>%), which finds the mean values of the rainfall data given in the dataset.Then,the ggplot2 package is used to plot the data has histograms with standard deviations.

title: “Boxplots are more informative than histograms for annual means” output: html_document —

Geometric bars are partially informative; in particular standard deviations is almost meaningless on annual means. Boxplots may represent an interesting, and more informative, way to visualize the data. Herein, mean values are visualized as black straght horizontal lines within the boxplots, and outliers (black dots) are also visualized.

By filtering the data for each station, differences in trends are even more appreciable. While some stations have consistently rising rainfall levels in the period under investigation, others are very fluctuating.

title: “A map of Ireland with mean annual precipitation” output: html_document —

By merging the newly created “StMean” and the provided “stations” files, a new dataset was created (“datarain”). This dataset contains not only the mean precipitation values for each station, but also their geographic coordinates. I used this file to create my map with the leaflet package.

##Map of Irland with Mean Annual Precipitations (1850 - 2014) by station

## Caricamento del pacchetto richiesto: zoo

## 
## Caricamento pacchetto: 'zoo'

## I seguenti oggetti sono mascherati da 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################

## 
## Caricamento pacchetto: 'xts'

## I seguenti oggetti sono mascherati da 'package:dplyr':
## 
##     first, last

## 
## Caricamento pacchetto: 'leaflet'

## Il seguente oggetto è mascherato da 'package:xts':
## 
##     addLegend

## Assuming "Long" and "Lat" are longitude and latitude, respectively

Map of Irland

Now, I will use the historical rainfall data to try to forecast future precipitations on the capital city of the Irish Republic. I will isolate the Dublin (more precisely, the Dublin Airport) rainfall data and then I will convert the data in a format recognized by the xts package. Finally, I will plot the historical data using the dygraphs package, which offers many features, as zooming.

To model our data in order to forecast future raining events at the Dublin Airport station, I firstly have to create a training dataset and a test dataset. Herein, all data until 2010 will form the training dataset (D_train), while all the other data wil form the test dataset (D_test)

I am going to use the Arima model to forecast the data. The Autoregressive integrated moving average (ARIMA) models predict future values based on past values. Before using this model, I need to check the data are not stationary (i.e., the properties of the time series do not depend on the time at which the series is observed). I am going to use two different test to check it. The KPSS Test (null hypothesis: Data are stationary) and the D-F Test (null hypothesis: Data are non-stationary). The results below show that the KPSS null hypothesis is rejected (p-value>0.5), while the D-F null hypothesis is accepted (p-value<0.005). Data are not stationary.

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 11 lags. 
## 
## Value of test-statistic is: 40.0499 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression none 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -90.81 -18.19   4.27  30.29 164.13 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## z.lag.1    -0.11623    0.01319  -8.809   <2e-16 ***
## z.diff.lag -0.44213    0.02043 -21.644   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.06 on 1928 degrees of freedom
## Multiple R-squared:  0.2794, Adjusted R-squared:  0.2786 
## F-statistic: 373.8 on 2 and 1928 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -8.8087 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau1 -2.58 -1.95 -1.62

##      [,1] [,2]  [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF  0.07 0.07  0.00 0.06  0.1 0.06 0.05 0.06 0.05  0.08  0.04  0.09  0.05
## PACF 0.07 0.06 -0.01 0.05  0.1 0.05 0.04 0.05 0.04  0.06  0.01  0.07  0.02
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF   0.06  0.05  0.05  0.09  0.05  0.06  0.05  0.06  0.08  0.08  0.08  0.03
## PACF  0.03  0.02  0.02  0.07  0.02  0.02  0.02  0.03  0.05  0.04  0.04  0.00
##      [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
## ACF   0.06  0.10  0.07  0.07  0.07  0.04  0.04  0.05  0.06  0.08  0.09  0.05
## PACF  0.03  0.06  0.03  0.02  0.03  0.00  0.00  0.01  0.02  0.03  0.04  0.01
##      [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
## ACF   0.09  0.04  0.03  0.06  0.02  0.05  0.09  0.01  0.05  0.05  0.05  0.06
## PACF  0.06 -0.01 -0.02  0.02 -0.02  0.00  0.05 -0.04  0.00  0.01 -0.01  0.02
##      [,50] [,51] [,52] [,53] [,54]
## ACF   0.06  0.10  0.07  0.03  0.05
## PACF  0.01  0.06  0.03 -0.02  0.01

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,2)(1,0,2)[12] with non-zero mean
## Q* = 40.314, df = 18, p-value = 0.001891
## 
## Model df: 6.   Total lags used: 24

##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.1044829 32.27883 25.33405 -50.188385 74.53019 0.7378056
## Test set     9.4849157 41.32536 27.76316  -8.131686 37.09009 0.8085486
##                       ACF1 Theil's U
## Training set -0.0008529944        NA
## Test set      0.1523061705  0.874273

After investigating the accuracy of the model, now I can run it and than plot it.

Forecasting Precipitation on Dublin

Vincenzo Arzillo, 20/12/2023