##Basic statistics
The supplied dataset (“rain”) consist of 49,500 observations and 4 variables. Data are organized according to year (from 1850 to 2014), month, station (25), and total monthly precipitations. Herein, I will firstly look for the mean annual rainfall values for each station; to do this, I have to create a new file (“StMean”) by manipulating the data using pipelines. The dplyr package is used to create a pipeline (%>%), which finds the mean values of the rainfall data given in the dataset.Then,the ggplot2 package is used to plot the data has histograms with standard deviations.
Mean Annual Precipitation (histogram)
title: “Boxplots are more informative than histograms for annual means” output: html_document —
Geometric bars are partially informative; in particular standard deviations is almost meaningless on annual means. Boxplots may represent an interesting, and more informative, way to visualize the data. Herein, mean values are visualized as black straght horizontal lines within the boxplots, and outliers (black dots) are also visualized.
Mean Annual Precipitation (Boxplot)
title: “Local annual trends” output: html_document —
By filtering the data for each station, differences in trends are even more appreciable. While some stations have consistently rising rainfall levels in the period under investigation, others are very fluctuating.
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
title: “A map of Ireland with mean annual precipitation” output: html_document —
By merging the newly created “StMean” and the provided “stations” files, a new dataset was created (“datarain”). This dataset contains not only the mean precipitation values for each station, but also their geographic coordinates. I used this file to create my map with the leaflet package.
##Map of Irland with Mean Annual Precipitations (1850 - 2014) by station
## Caricamento del pacchetto richiesto: zoo
##
## Caricamento pacchetto: 'zoo'
## I seguenti oggetti sono mascherati da 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Caricamento pacchetto: 'xts'
## I seguenti oggetti sono mascherati da 'package:dplyr':
##
## first, last
##
## Caricamento pacchetto: 'leaflet'
## Il seguente oggetto è mascherato da 'package:xts':
##
## addLegend
## Assuming "Long" and "Lat" are longitude and latitude, respectively
Map of Irland
Now, I will use the historical rainfall data to try to forecast future precipitations on the capital city of the Irish Republic. I will isolate the Dublin (more precisely, the Dublin Airport) rainfall data and then I will convert the data in a format recognized by the xts package. Finally, I will plot the historical data using the dygraphs package, which offers many features, as zooming.
To model our data in order to forecast future raining events at the Dublin Airport station, I firstly have to create a training dataset and a test dataset. Herein, all data until 2010 will form the training dataset (D_train), while all the other data wil form the test dataset (D_test)
I am going to use the Arima model to forecast the data. The Autoregressive integrated moving average (ARIMA) models predict future values based on past values. Before using this model, I need to check the data are not stationary (i.e., the properties of the time series do not depend on the time at which the series is observed). I am going to use two different test to check it. The KPSS Test (null hypothesis: Data are stationary) and the D-F Test (null hypothesis: Data are non-stationary). The results below show that the KPSS null hypothesis is rejected (p-value>0.5), while the D-F null hypothesis is accepted (p-value<0.005). Data are not stationary.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 11 lags.
##
## Value of test-statistic is: 40.0499
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression none
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.81 -18.19 4.27 30.29 164.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -0.11623 0.01319 -8.809 <2e-16 ***
## z.diff.lag -0.44213 0.02043 -21.644 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.06 on 1928 degrees of freedom
## Multiple R-squared: 0.2794, Adjusted R-squared: 0.2786
## F-statistic: 373.8 on 2 and 1928 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -8.8087
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.58 -1.95 -1.62
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF 0.07 0.07 0.00 0.06 0.1 0.06 0.05 0.06 0.05 0.08 0.04 0.09 0.05
## PACF 0.07 0.06 -0.01 0.05 0.1 0.05 0.04 0.05 0.04 0.06 0.01 0.07 0.02
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF 0.06 0.05 0.05 0.09 0.05 0.06 0.05 0.06 0.08 0.08 0.08 0.03
## PACF 0.03 0.02 0.02 0.07 0.02 0.02 0.02 0.03 0.05 0.04 0.04 0.00
## [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
## ACF 0.06 0.10 0.07 0.07 0.07 0.04 0.04 0.05 0.06 0.08 0.09 0.05
## PACF 0.03 0.06 0.03 0.02 0.03 0.00 0.00 0.01 0.02 0.03 0.04 0.01
## [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
## ACF 0.09 0.04 0.03 0.06 0.02 0.05 0.09 0.01 0.05 0.05 0.05 0.06
## PACF 0.06 -0.01 -0.02 0.02 -0.02 0.00 0.05 -0.04 0.00 0.01 -0.01 0.02
## [,50] [,51] [,52] [,53] [,54]
## ACF 0.06 0.10 0.07 0.03 0.05
## PACF 0.01 0.06 0.03 -0.02 0.01
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,2)(1,0,2)[12] with non-zero mean
## Q* = 40.314, df = 18, p-value = 0.001891
##
## Model df: 6. Total lags used: 24
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1044829 32.27883 25.33405 -50.188385 74.53019 0.7378056
## Test set 9.4849157 41.32536 27.76316 -8.131686 37.09009 0.8085486
## ACF1 Theil's U
## Training set -0.0008529944 NA
## Test set 0.1523061705 0.874273
After investigating the accuracy of the model, now I can run it and than plot it.