Introduction

Welcome to my Geo Computation journey! In this blog post, we’ll delve into the world of rainfall data analysis using R. I recently completed an assignment that involved processing, cleaning, and visualizing rainfall data from various weather stations in Ireland. Join me as I walk you through the steps I took to transform raw data into insightful visualizations.

Data Explanation

The data used in this analysis consists of monthly rainfall measurements from four weather stations in Ireland: Belfast, Dublin Airport, University College Galway, and Cork Airport. The dataset includes columns for the year, month, station name, and rainfall amount. The goal is to process this data, handle any missing values or anomalies, and create an interactive time series visualization to identify patterns in rainfall.
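
To make that layout concrete, here is a minimal sketch of a data frame with the four columns described above. The column names match the ones used by the code later in this post; the object name rain_toy and all of the values are invented for illustration and are not taken from the real dataset.

```r
# Toy stand-in for the rainfall data: one row per station per month.
# All values are invented for illustration only.
rain_toy <- data.frame(
  Year     = c(2000, 2000, 2000, 2000),
  Month    = c(1, 2, 1, 2),
  Station  = c("Belfast", "Belfast", "Dublin Airport", "Dublin Airport"),
  Rainfall = c(100.5, NA, 60.2, 45.0)   # NA marks a missing measurement
)

str(rain_toy)              # inspect the column types
table(rain_toy$Station)    # number of rows per station
colSums(is.na(rain_toy))   # number of missing values per column
```

Checks like these are a quick way to see how many gaps a station has before any cleaning is applied.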

Libraries and Data Loading

To kick things off, I loaded the libraries needed for data manipulation (dplyr), interactive plotting (dygraphs), and time series handling and cleaning (xts, zoo, imputeTS, forecast):
library(dplyr)
library(dygraphs)
library(xts)
library(zoo)
library(imputeTS)
library(forecast)
Next comes the rainfall data itself, which is read from an .RData file with load() just before the stations are processed. The functions in this post expect that file to provide a data frame named rain.
The first task was to write a function that processes the data for a single weather station.

Processing Data for Each Station

The process_station_data function takes a station name and does four things: it filters rain down to that station, sums the rainfall for each Year and Month, converts Year and Month to numeric, and builds a Date column in YYYY-MM-DD format using the first day of each month. It returns a data frame with two columns, Date and Rainfall.
process_station_data <- function(station_name) {
    ts_data <- rain %>%
        filter(Station == station_name) %>%
        summarise(Rainfall = sum(Rainfall), .by = c(Year, Month)) %>%
        mutate(
            Year = as.numeric(Year),  # Ensure Year is numeric
            Month = as.numeric(Month),  # Ensure Month is numeric
            Date = as.Date(paste(Year, Month, "01", sep = "-"), format = "%Y-%m-%d")  # Create Date column
        ) %>%
        select(Date, Rainfall)
    return(ts_data)
}
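
To see what this step produces, here is a minimal sketch on toy data. It uses a hypothetical variant, process_station, that takes the data frame as an argument instead of reading the global rain object, and it sorts the result by date as an extra precaution. The input values are invented, and the .by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Toy input (invented values); two Belfast readings fall in the same month.
rain_toy <- data.frame(
  Year     = c(2000, 2000, 2000, 2000),
  Month    = c(1, 1, 2, 1),
  Station  = c("Belfast", "Belfast", "Belfast", "Dublin Airport"),
  Rainfall = c(40, 60, 80, 55)
)

# Hypothetical variant of process_station_data() that receives the data
# frame explicitly, which makes it easy to try on small test inputs.
process_station <- function(data, station_name) {
  data %>%
    filter(Station == station_name) %>%
    summarise(Rainfall = sum(Rainfall), .by = c(Year, Month)) %>%
    mutate(
      # Build the first day of each month as a Date
      Date = as.Date(sprintf("%d-%02d-01", as.integer(Year), as.integer(Month)))
    ) %>%
    select(Date, Rainfall) %>%
    arrange(Date)   # keep rows in chronological order
}

bel_toy <- process_station(rain_toy, "Belfast")
print(bel_toy)
```

The two January readings for Belfast are summed into a single row, so the result has one row per month.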

Handling Missing Data and Anomalies

Missing data and anomalies can skew analysis results, so I created a function to clean each station's series in two steps. First, na_interpolation from imputeTS fills in any missing rainfall values, using linear interpolation by default. Then tsclean from forecast identifies outliers in the filled series and replaces them with estimated values.
clean_data <- function(ts_data) {
    ts_data_clean <- ts_data %>%
        mutate(Rainfall = na_interpolation(Rainfall))  # Fill missing values
    ts_data_clean <- ts_data_clean %>%
        mutate(Rainfall = tsclean(ts_data_clean$Rainfall))  # Remove anomalies
    return(ts_data_clean)
}
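
The two cleaning steps can be tried out on a short vector. This is a minimal sketch with invented values, one gap, and one deliberate spike; it only asserts what linear interpolation does to the gap, because which points tsclean flags depends on its own outlier detection.

```r
library(imputeTS)
library(forecast)

# Invented monthly values with one gap (position 3) and one spike (position 10).
x <- c(52, 61, NA, 58, 49, 55, 63, 60, 51, 900, 57, 62,
       54, 59, 50, 64, 56, 53, 61, 58, 52, 60, 55, 57)

# Step 1: linear interpolation (the default option) fills the gap from
# its two neighbours: (61 + 58) / 2 = 59.5.
x_filled <- na_interpolation(x)
x_filled[3]

# Step 2: tsclean() replaces values it flags as outliers with estimates.
# as.numeric() keeps the result a plain vector whatever class comes back.
x_clean <- as.numeric(tsclean(x_filled))

# Compare before and after rather than assuming which values changed.
data.frame(filled = x_filled, cleaned = x_clean)
```

Looking at the two columns side by side is a useful habit with real rainfall data, since a genuinely wet month can look like an anomaly to an automatic cleaner.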

Cleaning Data for Each Station

I then applied process_station_data and clean_data in turn to each of the four stations: Belfast, Dublin Airport, University College Galway, and Cork Airport. The load call reads the data file first so that rain is available when the functions run.
load("C:/Users/Dell/Downloads/rainfall.RData")
bel_ts <- clean_data(process_station_data("Belfast"))
dub_ts <- clean_data(process_station_data("Dublin Airport"))
gal_ts <- clean_data(process_station_data("University College Galway"))
cor_ts <- clean_data(process_station_data("Cork Airport"))

Combining Data

To gain a holistic view, I combined the data from all stations into a single data frame, merged by date. Because every station's data frame has a column called Rainfall, the suffix argument renames them to Rainfall_Belfast, Rainfall_Dublin_Airport, Rainfall_Galway, and Rainfall_Cork_Airport.
all_stations_df <- full_join(full_join(bel_ts, dub_ts, by = "Date", suffix = c("_Belfast", "_Dublin_Airport")),
                             full_join(gal_ts, cor_ts, by = "Date", suffix = c("_Galway", "_Cork_Airport")), by = "Date")
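
A small example shows how full_join and suffix behave. This is a sketch on invented values, with one date deliberately missing from each table, to illustrate that a full join keeps every date and fills the gaps with NA.

```r
library(dplyr)

# Two toy station tables (invented values) with partly different dates.
bel_toy <- data.frame(
  Date     = as.Date(c("2000-01-01", "2000-02-01")),
  Rainfall = c(100, 80)
)
dub_toy <- data.frame(
  Date     = as.Date(c("2000-02-01", "2000-03-01")),
  Rainfall = c(45, 50)
)

# Both tables have a Rainfall column, so suffix disambiguates the names.
joined <- full_join(bel_toy, dub_toy, by = "Date",
                    suffix = c("_Belfast", "_Dublin_Airport"))

names(joined)
# "Date" "Rainfall_Belfast" "Rainfall_Dublin_Airport"

joined   # three dates in total; each station has one NA
```

If the stations cover different periods, NA values introduced here will carry through to the lagged and averaged columns later on.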

Creating xts Object

For compatibility with dygraphs, I converted the combined data frame into an xts time series object. The Date column becomes the index (order.by) and the remaining columns become the data, which is why the first column is dropped with [-1].
all_stations_xts <- xts(all_stations_df[-1], order.by = all_stations_df$Date)
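
Here is a minimal sketch of the same conversion on a toy data frame with invented values. The rows are deliberately out of order to show that xts sorts the data by its index.

```r
library(xts)

# Toy combined data frame (invented values), deliberately unsorted by date.
toy_df <- data.frame(
  Date = as.Date(c("2000-03-01", "2000-01-01", "2000-02-01")),
  Rainfall_Belfast        = c(70, 100, 80),
  Rainfall_Dublin_Airport = c(50, 60, 45)
)

# Drop the Date column from the data and use it as the time index instead.
toy_xts <- xts(toy_df[-1], order.by = toy_df$Date)

index(toy_xts)       # the dates, now in chronological order
coredata(toy_xts)    # the underlying numeric matrix
toy_xts["2000-02"]   # subset one month with a date string
```

Because the xts object is re-sorted, its row order can differ from the original data frame whenever that data frame is not already in date order.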

Feature Engineering: Adding Lagged Variables and Moving Averages

To enrich the analysis, I added two features for each station: a lagged variable holding the previous month’s rainfall, and a 12-month moving average. dplyr::lag shifts each series by one month, and rollapply from zoo computes the mean over a trailing 12-month window (align = "right"), leaving NA until twelve months are available (fill = NA). Since mutate works on data frames, the xts object is converted to a data frame first and back to an xts object afterwards.
all_stations_xts <- all_stations_xts %>%
    as.data.frame() %>%
    mutate(
        Belfast_Lag1 = dplyr::lag(Rainfall_Belfast, 1),
        Dublin_Airport_Lag1 = dplyr::lag(Rainfall_Dublin_Airport, 1),
        Galway_Lag1 = dplyr::lag(Rainfall_Galway, 1),
        Cork_Airport_Lag1 = dplyr::lag(Rainfall_Cork_Airport, 1),
        Belfast_MA = rollapply(Rainfall_Belfast, 12, mean, align = "right", fill = NA),
        Dublin_Airport_MA = rollapply(Rainfall_Dublin_Airport, 12, mean, align = "right", fill = NA),
        Galway_MA = rollapply(Rainfall_Galway, 12, mean, align = "right", fill = NA),
        Cork_Airport_MA = rollapply(Rainfall_Cork_Airport, 12, mean, align = "right", fill = NA)
    ) %>%
    as.xts(order.by = all_stations_df$Date)
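
The behaviour of the two helper functions is easiest to see on a short vector. This sketch uses invented numbers and a window of 3 instead of 12 so that the result fits on one line.

```r
library(dplyr)
library(zoo)

x <- c(10, 20, 30, 40, 50)

# Lag by one step: each position holds the previous value.
lagged <- dplyr::lag(x, 1)
lagged
# NA 10 20 30 40

# Trailing mean over 3 values: align = "right" means the window ends at
# the current position, and fill = NA pads positions without a full window.
moving_avg <- rollapply(x, width = 3, FUN = mean, align = "right", fill = NA)
moving_avg
# NA NA 20 30 40
```

With a 12-month window, the first eleven rows of each moving-average column are NA for the same reason.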

Creating the Dygraph

Finally, I created an interactive time series plot with the dygraphs package. The dygraph function creates the plot, and the dy* functions layer on extras such as a range selector, axis labels, highlighting, and annotations.

In the code below:

1) dygraph creates the interactive time series plot.
2) dyRangeSelector adds a range selector to the bottom of the chart.
3) dyOptions sets the colors for each series.
4) dyLegend controls the display of the legend.
5) dyAxis labels the axes.
6) dyHighlight adds highlighting for selected data points.
7) dyEvent marks a specific date with a labelled vertical line (here the Easter Rising, 24 April 1916).
8) dyShading shades a time period (here World War II, 1 September 1939 to 2 September 1945).
dygraph(all_stations_xts, main = "Monthly Rainfall at Weather Stations") %>%
    dyRangeSelector() %>%
    dyOptions(colors = c("blue", "red", "green", "purple")) %>%
    dyLegend(show = "always", hideOnMouseOut = FALSE) %>%
    dyAxis("y", label = "Rainfall (mm)") %>%
    dyAxis("x", label = "Year") %>%
    dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
    dyEvent("1916-04-24", "Easter Rising", labelLoc = "bottom") %>%
    dyShading(from = "1939-09-01", to = "1945-09-02", color = "#FFE6E6")
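
By this point all_stations_xts also carries the lag and moving-average columns, so one option is to pass only the raw rainfall columns to dygraph, so that each colour lines up with one station. The following is a minimal sketch on toy data, assuming the Rainfall_ column prefix produced by the joins above; the values and the object names toy_xts, rain_cols, and p are invented for illustration.

```r
library(xts)
library(dygraphs)

# Toy series (invented values): two rainfall columns plus one derived column.
dates <- seq(as.Date("2000-01-01"), by = "month", length.out = 6)
toy_xts <- xts(
  cbind(
    Rainfall_Belfast        = c(100, 80, 70, 65, 60, 75),
    Rainfall_Dublin_Airport = c(60, 45, 50, 55, 40, 52),
    Belfast_Lag1            = c(NA, 100, 80, 70, 65, 60)
  ),
  order.by = dates
)

# Keep only the raw rainfall columns, so two colours map to two stations.
rain_cols <- grep("^Rainfall_", colnames(toy_xts), value = TRUE)

p <- dygraph(toy_xts[, rain_cols], main = "Monthly Rainfall (toy data)") %>%
  dyOptions(colors = c("blue", "red")) %>%
  dyAxis("y", label = "Rainfall (mm)") %>%
  dyRangeSelector()

# Typing p at the console, or in an R Markdown chunk, renders the chart.
```

The derived columns could be plotted in a separate chart the same way, by selecting on their _Lag1 or _MA suffix instead.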

Embedded Dygraph

Discussion of Patterns

Conclusion