This interactive essay will import and utilise a dataset containing rainfall data for 25 weather stations located throughout Ireland. It contains a continuous record from 1850 to 2014. The data was initially collected by amateur meteorologists in an early precursor of citizen science, and later came under the remit of Met Éireann. The dataset was provided to the students of AFF624 - (Mapping and Modelling Space and Time module of the MA in Digital Humanities from NUIM) by Professor Chris Brunsdon, director of the National Centre for Geocomputation, who in turn acknowledges Simon Noone and Conor Murphy for their work on the dataset.
First we will load the dataset from the working directory of the R programme:
load('rainfall.RData')
We can load the core ‘tidyverse’ package and make it available in the current R session. The core tidyverse includes the packages that we will use for our analysis of the rainfall dataset. We will need to load them separately as needed. They are ‘ggplot’, and ‘dplyr’.
library("tidyverse");
Next we will check the data has loaded and view the variables contained with the dataset.
head(stations)
## # A tibble: 6 x 9
## Station Elevation Easting Northing Lat Long County Abbr~ Sour~
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Athboy 87 270400 261700 53.6 -6.93 Meath AB Met ~
## 2 Foulksmills 71 284100 118400 52.3 -6.77 Wexford F Met ~
## 3 Mullingar 112 241780 247765 53.5 -7.37 Westmeath M Met ~
## 4 Portlaw 8 246600 115200 52.3 -7.31 Waterford P Met ~
## 5 Rathdrum 131 319700 186000 52.9 -6.22 Wicklow RD Met ~
## 6 Strokestown 49 194500 279100 53.8 -8.10 Roscommon S Met ~
head(rain)
## # A tibble: 6 x 4
## Year Month Rainfall Station
## <dbl> <ord> <dbl> <chr>
## 1 1850 Jan 169 Ardara
## 2 1851 Jan 236 Ardara
## 3 1852 Jan 250 Ardara
## 4 1853 Jan 209 Ardara
## 5 1854 Jan 188 Ardara
## 6 1855 Jan 32.3 Ardara
The Data Frames
The first tible table shows the first six values for the nine variables in the table: Station (Location), Elevation (height above sea level), Easting, Northing (terms easting and northing are geographic Cartesian coordinates for a point), Lat, Long (longitude and latitude), County, Abbr (Location Abbreviation), Sour (Source: Met Éireann)
The second tible table shows the first six values for the four variables in the table: Year, Month, Rainfall, Station.
Load Dplyr
Next we will load dplyr.
‘dplyr’ is a package to manipulate data easily and solve the most common data manipulation challenges:
Using the group_by() function allows any operation to be preformed “by group”. http://dplyr.tidyverse.org/
A very advantageous feature of ‘dpylr’ is the command %>% which act as a pipeline, making the code easier to write and clearer to read.
library("dplyr");
Now we will take the rain data and pipeline it. First by grouping the stations and then using the summarise function to calculate the mean rainfall for each station.
We will use the group() command to group by stations, and get each station’s mean rainfall using the command summarise, which applies a summary function to each rainfall entry and create a new data frame with one rainfall entry for each station. Any R summary function can be used as such; e.g. median,sd,max or a user defined function. We will store that new data frame in a new variable called rain_summary. We will do this by using the -> assignment operator. We can print the results to see if the summary has worked.
rain %>% group_by(Station) %>%
summarise(mrain=mean(Rainfall)) -> rain_summary
head(rain_summary)
## # A tibble: 6 x 2
## Station mrain
## <chr> <dbl>
## 1 Ardara 140
## 2 Armagh 68.3
## 3 Athboy 74.7
## 4 Belfast 87.1
## 5 Birr 70.8
## 6 Cappoquinn 121
To make more sense of the data, it might be useful to locate the weather stations on some maps so as to elucidate the graph data which will follow. For example, to know if a station is in the East, North-East, or the Midlands, etc., is not apparent from the graph visualisations without a prior knowledge of the geography. It could be significant to know if the east has a higher average rainfall than the west, or vice-a-versa. Similarly, it might be significant to know if the midlands have more or less rain than coastal regions.
To display a map we could load a shape file, or we could use a Leaflet map, or indeed a combination of both. For example, the counties could be loaded onto a Leaflet map as shape files provided their projection is latitude / longitude.
Leaflet is an open-source JavaScript library for interactive maps. http://leafletjs.com/ The map base imagery data consists of tiled raster canvases. These can be replaced by themed canvases which are offered by different providers with differing conditions and rights.
Load Leaflet Map
library(leaflet)
Load leaflet Map and display stations
On mouse-over or click, the name of the station will be displayed.
leaflet(data=stations,height=650,width=820) %>% addProviderTiles('CartoDB.Positron') %>%
setView(-8,53.5,7) %>% addCircleMarkers(~Long, ~Lat, popup = ~as.character(Station), label = ~as.character(Station))
We have been tasked with displaying the data in a time series defined by monthly rain fall, so we will now group the rain data by month, gets the mean values and stores the summary in a new variable called rain_months.
rain %>% group_by(Month) %>%
summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)
## # A tibble: 6 x 2
## Month mrain
## <ord> <dbl>
## 1 Jan 113
## 2 Feb 83.2
## 3 Mar 79.5
## 4 Apr 68.7
## 5 May 71.3
## 6 Jun 72.7
Now we can carry out some basic visual analysis.We will start with a simple bar plot.
barplot(rain_months$mrain,names=rain_months$Month,las=1,col='dodgerblue')
With our first visualisation of the data we can see clearly that precipitation is highest on the turn or the year, so that in January, it is only slightly below the maximum which is in December. From January the mean drops sharply, and in February, there is a steadier drop to April. Then there is a gradual rise over the following months back up to maximum precipitation in December.
We will load ggplot2.
ggplot2 is a system for declaratively creating graphics. The user provides the data, tells ggplot2 how to map the variables to aesthetics, what graphical primitives to use, and then ggplot2 will generate graphical plots of the details provided.
library("ggplot2");
There follows some examples of how the data can be visualised.
rain %>% group_by(Year) %>%
summarise(total_rain=sum(Rainfall)) -> rain_years
ggplot(aes(x=Year,y=total_rain, col='indianred'),data=rain_years) + geom_line(stat='identity') + labs(y='Total Annual Rainfall')
Is a trend discernible? It is hard to detect with this type of chart with this scope, but we can apply a gemon_smoth() to the graph. This will average the graph, weighting the smoothing line towards the centre in order to draw a graphic line. The graphic also shows an amount of variability around the average.
So we will apply the Smoothing below.
rain %>% group_by(Year) %>%
summarise(total_rain=sum(Rainfall)) -> rain_years
ggplot(aes(x=Year,y=total_rain, col='indianred'),data=rain_years) + geom_line(stat='identity') + geom_smooth(col='dodgerblue') + labs(y='Total Annual Rainfall')
## `geom_smooth()` using method = 'loess'
Having applied the gemon_smoth() to the graph, we can see a clear upward trend in precipitation levels.
Is this trend the same for all stations? We can use a coplot of the stations to see them all individually but within the same field of view on screen. This will enable us to quickly see if the upward trend is present in all stations.
rain %>% group_by(Year, Station) %>%
summarise(total_rain=sum(Rainfall)) -> rain_years_st
ggplot(aes(x=Year,y=total_rain),data=rain_years_st) +
geom_smooth() + labs(y='Total Annual Rainfall') +
facet_wrap(~Station,nrow=5) + theme_dark()
## `geom_smooth()` using method = 'loess'
Even at this scale, we can discern an upward trend across the stations as represented by the blue line tilting upward from left to right, with only one outlier. The most precipitous station, Ardara, has a fall towards the end representing a drop in precipitation in more recent years. We can also see at a glance that a number of stations, for example, Ardara, Valentia, and Killarney, have a much higher rainfall than most. If we were not familiar with the geography, it might be useful to combine this data with a map to see if these stations present some kind of geographical pattern.
In order to do this we need to link the geographical location of the stations to the rainfall data.
rain %>% group_by(Station) %>% summarise(mrain=mean(Rainfall)) -> rain_summary
rain_summary %>% left_join(stations) -> station_means
## Joining, by = "Station"
left_join links the Station data frame to rain_summary to join the rainfall mean data to the geographical information about station location.
We will Create the colour profile for the station markers on the map.
color_fun <- colorNumeric('Blues',station_means$mrain)
previewColors(color_fun,fivenum(station_means$mrain))
color_fun
fivenum(station_means$mrain)
| 61.432 | |
| 78.62 | |
| 87.11 | |
| 100 | |
| 140.37 |
Create the map
Now we can create a map which indicates the mean rain fall for each station by colour shading of the circular markers which identify the stations’ locations. The darker the shade of blue the greater the mean rain fall at that station. The names of the stations can be displayed by clicking on the circular marker.
leaflet(data=station_means,height=650,width=820) %>% addProviderTiles('CartoDB.Positron') %>%
setView(-8,53.5,7) %>% addCircleMarkers(fillColor=~color_fun(mrain),weight=0,fillOpacity = 0.85, popup = ~as.character(Station)) %>%
addLegend(pal=color_fun,values=~mrain,title="Rainfall",position='bottomleft')
We can observe that the stations registering the highest precipitation levels are on the periphery of island. While not all coastal, they are very close to the coast in the south, southwest, and northwest. The midlands, in particular the eastern region around Dublin, register the least amount of rainfall. Ardara, the station which measures the most rainfall, lies in the North West close to the Atlantic coast. The combination of map and visualisation of mean rainfall is very useful for quickly observing patterns.
In the plot below grouped by year and month, the summary function is used to calculate the total rain fall. We then need to ungroup so that we can mutate the data to display only a specified number of years, filtering out all years beyond our maximum. We will also use a geom smooth to make any trends more discernible.
rain %>% group_by(Year,Month) %>% summarise(rf=sum(Rainfall)) %>% ungroup %>%
mutate(moyr = 1850 + row_number()/12) %>% filter(Year <= 1870) -> monthly_total
ggplot(aes(x=moyr,y=rf),data=monthly_total) + geom_point() +
geom_line(col='indianred') + labs(x='Year',y='Rainfall') +
geom_smooth()
## `geom_smooth()` using method = 'loess'
Multiple groupings - average by month/station combination
rain %>% group_by(Month,Station) %>%
summarise(mean_rain=mean(Rainfall)) -> rain_season_station
The code above groups the Month and Station and combines them into a new dataframe called rain_season_station. Now we can use the reshape2package to rearrange the data to produce a a 2D array (below), which we can use to produce a head map.
library(reshape2)
rain_season_station %>% acast(Station~Month) %>% head
## Jan Feb Mar Apr May Jun
## Ardara 174.82606 126.82303 123.02000 98.79333 96.90727 105.24061
## Armagh 74.57242 55.97182 56.48879 53.67030 59.23182 62.72939
## Athboy 84.94759 62.62133 62.44944 58.97874 62.16260 68.11460
## Belfast 101.20718 74.50206 73.10221 65.90492 69.23426 74.48525
## Birr 79.92074 57.88501 58.42056 54.07187 60.25831 61.96445
## Cappoquinn 153.97159 117.77099 110.02890 94.00365 95.81437 94.86357
## Jul Aug Sep Oct Nov Dec
## Ardara 123.70485 145.24788 152.80727 174.44788 176.45030 186.14182
## Armagh 72.50636 81.92182 69.02576 80.94242 73.73121 79.05939
## Athboy 76.28662 88.84710 76.35077 88.14627 80.95767 87.05997
## Belfast 87.70003 102.44499 87.11622 106.35394 100.31760 102.95073
## Birr 75.10084 86.75669 72.21722 83.80810 77.87266 81.74334
## Cappoquinn 104.09372 125.42245 116.04466 146.70658 141.10012 154.87402
Generate a headmap
library(reshape2)
rain_season_station %>% acast(Station~Month) %>% heatmap(Colv=NA)
## Using mean_rain as value column: use value.var to override.
Creating a dygraph of the weather stations at Belfast, Dublin Airport, University College Galway, and Cork Airport showing time series of rainfall on a monthly basis. Include a RangeSelector control that simultaneously changes the time window on the four time series.
Now we have explored some visualisation techniques and how to customise them, we can begin to answer the task set as outlined above. We will explore how to create a time series and manipulate the visualisations.
The R time-series object
ts - A time series object that is useful for many functions.
Examples of time-series visualisations
Below the monthly total for the years 1850 (Jan) to 1853 (Dec). Specified by concatenating c the years within our specified window,window.
monthly_total$rf %>% ts(freq=12,start=1850) -> rain_ts # Time serioes object
rain_ts %>% window(c(1850,1),c(1853,12)) # window from jan 1850 - dec 1853
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1850 2836.3 2158.9 964.1 3457.2 1492.1 1362.4 2584.8 1906.4 1763.1 1567.8
## 1851 4875.5 1379.9 1997.9 1368.1 1124.2 2537.1 2258.3 2416.2 1234.5 2834.2
## 1852 4036.1 2311.2 1116.0 1110.0 1754.6 4484.9 1716.0 2752.0 1433.1 2217.7
## 1853 3618.6 1073.9 2088.0 1881.9 829.1 1936.7 2512.3 2072.0 1504.1 4627.6
## Nov Dec
## 1850 2828.5 2600.1
## 1851 1134.3 1719.2
## 1852 5581.2 5205.9
## 1853 2651.1 1209.3
The plot below displays the same scope (all stations and all years) as the first bar chart in this essay, but using the total rainfall instead of mean. Its different visualisation technique provides more information than a simple bar chart.
rain_ts %>%
monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)
We will create a new time series variable called rain_ts and ascribe it the values of total rainfall and group it by month and year. Again this needs to be ungrouped to specify a window. First establish the start as being Jan 1850 going up in frequencies of 12 (months). Then specify the window. We will display this as a tible to see if it worked.
rain %>% group_by(Year,Month) %>%
summarise(Rainfall=sum(Rainfall)) %>% ungroup %>% transmute(Rainfall) %>%
ts(start=c(1850,1),freq=12) -> rain_ts
rain_ts %>% window(c(1870,1),c(1871,12))
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1870 2666.2 1975.3 1500.5 1024.8 1862.8 789.2 1038.6 1510.5 2045.5 5177.6
## 1871 3148.3 2343.7 1731.7 2654.5 657.6 2040.1 3705.0 1869.9 2083.4 2774.3
## Nov Dec
## 1870 1733.2 1902.2
## 1871 2000.1 1902.0
Now we can load dygraphs library to create dynamic interactive graphs. Then the desired data frame can be displayed as a dynamic graph. This is achieved by using the %>% command to pipeline the rain_ts into a dygraph.
library(dygraphs) # A dynamic graph library
rain_ts %>% dygraph (width=800,height=500,x='Year') # Try moving the pointer along the curve
This dygraph looks rather cluttered and difficult to read at this scale. It also lacks a controlling interface; therefore, we will add a range selector.
rain_ts %>% dygraph(width=800,height=300) %>% dyRangeSelector
We need to isolate individual stations, let us start with Belfast. First we will pipeline the data for Belfast into a new time series variable called bel_ts.
rain %>% group_by(Year,Month) %>% filter(Station=="Belfast") %>%
summarise(Rainfall=sum(Rainfall)) %>% ungroup %>% transmute(Rainfall) %>%
ts(start=c(1850,1),freq=12) -> bel_ts
Now we will create the dygraph.
bel_ts %>% dygraph(width=800,height=360, main='Belfast') %>% dyRangeSelector
Any station can be queried in this way and the results stored in a new variable, We will do this for the remaining stations of Dublin Airport, University College Galway, and Cork Airport.
rain %>% group_by(Year,Month) %>% filter(Station=="Dublin Airport") %>%
summarise(Rainfall=sum(Rainfall)) %>% ungroup %>% transmute(Rainfall) %>%
ts(start=c(1850,1),freq=12) -> dub_ts
dub_ts %>% dygraph(width=800,height=360, main='Dublin Airport') %>% dyRangeSelector
rain %>% group_by(Year,Month) %>% filter (Station=="University College Galway") %>%
summarise(Rainfall=sum(Rainfall)) %>% ungroup %>% transmute(Rainfall) %>%
ts(start=c(1850,1),freq=12) -> gal_ts
gal_ts %>% dygraph(width=800,height=360, main='University College Galway') %>% dyRangeSelector
rain %>% group_by(Year,Month) %>% filter(Station=="Cork Airport") %>%
summarise(Rainfall=sum(Rainfall)) %>% ungroup %>% transmute(Rainfall) %>%
ts(start=c(1850,1),freq=12) -> cor_ts
cor_ts %>% dygraph(width=800,height=360, main='Cork Airport') %>% dyRangeSelector
We can display all of the above as a four-way comparison dynamic graph. First we need to create a new variable which will store the data of the four grouped stations using the combine command cbind, and, as always, the assignment operator ->
comp4_ts <- cbind(bel_ts,dub_ts, gal_ts, cor_ts)
Now to create a dygraph displaying four stations.
We can use the dyOptions to change attributes of the graph, for example the stroke width and/or colours.
comp4_ts %>% dygraph(width=800,height=360) %>%
dyOptions(strokeWidth = 2, colors = RColorBrewer::brewer.pal(4, "Set2")) %>% dyRangeSelector
It is difficult to draw conclusions as a non meteorologist, however, the rain fall at the four stations seems to correspond closely for the most part, in that if it is a particularly precipitous period, levels tend to be elevated in all, and the converse is true of periods of low precipitation. However, sometimes the stations diverge considerably, as in Oct 1902, while at tother times they correlate very closely as in Oct 2002.
As can be seen from above, this type of graph is very useful for comparing data, however, for the amount of data and the range of years, it can be difficult to read with more than two lines (stations in this case) and becomes more difficult as the scope is increased to include a greater time period. It would be more advantageous to separate out the stations and group them to allow one range control to manipulate all stations. This is also the objective of this assignment.
bel_ts %>% dygraph(width=800,height=130,group="bdgc_four",main='Belfast')
dub_ts %>% dygraph(width=800,height=130,group="bdgc_four",main='Dublin Airport')
gal_ts %>% dygraph(width=800,height=130,group="bdgc_four",main='University College Galway')
cor_ts %>% dygraph(width=800,height=170,group="bdgc_four",main='Cork Airport') %>% dyRangeSelector
We can hide the code in RMarkdown so we can read the charts more easily.
A blog associated with my MA course work can be found here: http://dhprojects.maynoothuniversity.ie/2017/ssourke/