Advanced R and Data Analysis

Overview


  • More on time series
  • Making interactive tools
  • Some new spatial techniques
  • Reproducibility

R studio

Looking at the rainfall data

💧 Main working data set is rainfall.RData


📥 Get it from Moodle in section headed Chris Brunsdon’s Lectures.
load('rainfall.RData')
head(stations)
## # A tibble: 6 x 9
##   Station Elevation Easting Northing   Lat  Long County Abbreviation Source
##   <chr>       <int>   <dbl>    <dbl> <dbl> <dbl> <chr>  <chr>        <chr> 
## 1 Athboy         87  270400   261700  53.6 -6.93 Meath  AB           Met E…
## 2 Foulks…        71  284100   118400  52.3 -6.77 Wexfo… F            Met E…
## 3 Mullin…       112  241780   247765  53.5 -7.37 Westm… M            Met E…
## 4 Portlaw         8  246600   115200  52.3 -7.31 Water… P            Met E…
## 5 Rathdr…       131  319700   186000  52.9 -6.22 Wickl… RD           Met E…
## 6 Stroke…        49  194500   279100  53.8 -8.1  Rosco… S            Met E…

This packages all of the rainfall information


head(rain)
## # A tibble: 6 x 4
##    Year Month Rainfall Station
##   <dbl> <fct>    <dbl> <chr>  
## 1  1850 Jan      169   Ardara 
## 2  1851 Jan      236.  Ardara 
## 3  1852 Jan      250.  Ardara 
## 4  1853 Jan      209.  Ardara 
## 5  1854 Jan      188.  Ardara 
## 6  1855 Jan       32.3 Ardara

  • Its been converted to an R binary data file for convenience.
    • There are no missing values.
    • Data runs from 1850 to 2014 by month.
    • 👍🏻Thanks to Conor Murphy and Simon Noone for supplying.

Introduction to the dplyr package

🎁 A new package dplyr


library(dplyr)
rain %>% group_by(Station) %>% 
  summarise(mrain=mean(Rainfall))  -> rain_summary
head(rain_summary)
## # A tibble: 6 x 2
##   Station    mrain
##   <chr>      <dbl>
## 1 Ardara     140. 
## 2 Armagh      68.3
## 3 Athboy      74.7
## 4 Belfast     87.1
## 5 Birr        70.8
## 6 Cappoquinn 121.

  • The %>% command acts as a ‘pipeline’.
    • x %>% f(y) is the same as f(x,y).
    • x %>% f1(y) %>% f2(z) is the same as f2(f1(x,y),z) but easier to read.

📆 Group by month


rain %>% group_by(Month) %>% 
  summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)
## # A tibble: 6 x 2
##   Month mrain
##   <fct> <dbl>
## 1 Jan   113. 
## 2 Feb    83.2
## 3 Mar    79.5
## 4 Apr    68.7
## 5 May    71.3
## 6 Jun    72.7

summarise applies a summary function to each group and creates a new data frame with one entry for each group. Any R summary function can be used - eg median, sd, max or a user defined function.

📊 Basic graphic investigation


library(ggplot2)
ggplot(rain_months,aes(x=Month,y=mrain)) + geom_col() 

📊 What does that mean?

  • library(ggplot2) Load ggplot2 - this is an R graphics library
  • Enter line: ggplot(rain_months,aes(x=Month,y=mrain)) + geom_col()
  • ggplot is the main command - it creates a graph
    • aes Aesthetics link variables to characteristics of the graph
    • Here x is month and y is mean rainfall
  • geom_col specifies the kind of graph - here it is a column plot
  • you add a geometry to the graph object to specify the actual graph

📊 Modifying the style of plots


ggplot(rain_months,aes(x=Month,y=mrain)) + geom_col(fill='dodgerblue')  

📈 Also for yearly data


Here we use geom_line to get a line graph, rather than column.

rain %>% group_by(Year) %>% 
  summarise(total_rain=sum(Rainfall)) -> rain_years
ggplot(rain_years,aes(x=Year,y=total_rain)) +  geom_line(col='indianred') 

✂️ Filter for individual stations using filter


rain %>% group_by(Year) %>% 
  filter(Station=='Strokestown') %>%
  summarise(total_rain=sum(Rainfall)) -> rain_years_str
ggplot(rain_years_str,aes(x=Year,y=total_rain)) +  geom_line(col='indianred') 

Multiple summary stats are also possible


rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) -> rain_mnsd
head(rain_mnsd)
## # A tibble: 6 x 4
## # Groups:   Month [1]
##   Month Station    mean_rain sd_rain
##   <fct> <chr>          <dbl>   <dbl>
## 1 Jan   Ardara         175.     64.6
## 2 Jan   Armagh          74.6    28.9
## 3 Jan   Athboy          84.9    32.8
## 4 Jan   Belfast        101.     40.6
## 5 Jan   Birr            79.9    33.7
## 6 Jan   Cappoquinn     154.     66.3

mutate - Create new variables in a data frame


rain %>% group_by(Month) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) %>%
  mutate(cv_rain=100 * sd_rain/mean_rain)  -> rain_mnsdcv
head(rain_mnsdcv)
## # A tibble: 6 x 4
##   Month mean_rain sd_rain cv_rain
##   <fct>     <dbl>   <dbl>   <dbl>
## 1 Jan       113.     57.6    51.1
## 2 Feb        83.2    51.5    61.8
## 3 Mar        79.5    44.3    55.7
## 4 Apr        68.7    36.4    52.9
## 5 May        71.3    37.2    52.2
## 6 Jun        72.7    40.9    56.3

  • 📖 CV: Coefficient of variation - standard deviation as a percentage of mean.
    • Shows relative variability of rainfall

Relative variability visualised…


ggplot(rain_mnsdcv,aes(x=Month,y=cv_rain)) + geom_col(fill='darkred')

… and absolute variability


ggplot(rain_mnsdcv,aes(x=Month,y=sd_rain)) + geom_col(fill='seagreen')

🔢 Multiple groupings - eg average by month/station combination


rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall)) -> rain_season_station
head(rain_season_station)
## # A tibble: 6 x 3
## # Groups:   Month [1]
##   Month Station    mean_rain
##   <fct> <chr>          <dbl>
## 1 Jan   Ardara         175. 
## 2 Jan   Armagh          74.6
## 3 Jan   Athboy          84.9
## 4 Jan   Belfast        101. 
## 5 Jan   Birr            79.9
## 6 Jan   Cappoquinn     154.

Note that with is implicit in the dplyr commands such as filter and summarise

Re-arrange using pivot_wider in dplyr package


library(tidyverse)
rain_season_station %>% pivot_wider(names_from=Month,values_from=mean_rain)
## # A tibble: 25 x 13
##    Station   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Ardara  175.  127.  123.   98.8  96.9 105.  124.  145.  153.  174. 
##  2 Armagh   74.6  56.0  56.5  53.7  59.2  62.7  72.5  81.9  69.0  80.9
##  3 Athboy   84.9  62.6  62.4  59.0  62.2  68.1  76.3  88.8  76.4  88.1
##  4 Belfast 101.   74.5  73.1  65.9  69.2  74.5  87.7 102.   87.1 106. 
##  5 Birr     79.9  57.9  58.4  54.1  60.3  62.0  75.1  86.8  72.2  83.8
##  6 Cappoq… 154.  118.  110.   94.0  95.8  94.9 104.  125.  116.  147. 
##  7 Cork A… 138.  103.   92.3  76.7  77.2  71.5  76.9  92.8  91.7 120. 
##  8 Derry   100.   74.6  70.2  62.9  64.0  72.9  83.4  94.1  85.3 103. 
##  9 Drumsna  97.2  74.4  71.9  63.6  68.9  73.0  82.1  97.0  83.1  97.4
## 10 Dublin…  63.7  49.0  51.3  49.2  56.1  56.3  64.1  74.9  61.8  73.8
## # … with 15 more rows, and 2 more variables: Nov <dbl>, Dec <dbl>

  • Now results organised into 2D array

Can also visualise 2D arrays - geom_tile


ggplot(rain_season_station,aes(x=Month,y=Station,fill=mean_rain)) + geom_tile()

Create a heatmap


rain %>% group_by(Station) %>% 
  summarise(mean_rain=sum(Rainfall)) -> rain_station
rain_season_station %>% mutate(Station = reorder(Station,rain_station$mean_rain)) -> rain_season_station
  • Similar to above, but order the station columns so that similar ones are closer together
  • Base this on the overall rainfall per station
  • Create an ordering of stations by amount of rainfall
  • Rainiest places at the top, driest places at the bottom

Create a heatmap


hm <- ggplot(rain_season_station,aes(x=Month,y=Station,fill=mean_rain)) + geom_tile(col='white')
hm

Storing plots to modify is useful (1)


hm + scale_fill_viridis_c(direction=-1) + labs(fill='Mean Rainfall')

Storing plots to modify is useful (2)


hm + scale_fill_distiller(palette = 'Reds',direction=1) + labs(fill='Mean Rainfall')

Alternative aesthetic and geometry


ggplot(rain_season_station,aes(x=Month,y=Station,size=mean_rain)) + geom_point(col='salmon')

Why %>% is useful here


This code takes the rain data frame and runs through the steps to aggregate nationally for 1950.

rain %>% filter(Year==1950) %>% group_by(Station) %>% 
  summarise(total_rain=sum(Rainfall)) -> rain_station_1950

So does this code, in ‘traditional’ R format:

rain_station_1950 <- summarise(
  group_by(
    filter(Station,Year==1950),Station),
    total_rain=sum(Rainfall))

Which do you think is clearer?

Conclusion

💡 New ideas

  • New R ideas
    • dplyr and pipeline-based commands, and related functions
    • ggplot for graphics
  • New time series ideas
    • Graphical tools: heatmaps, monthplots etc.
  • Next lecture
    • Time Series patterns