06 Returning Home

Exploratory Data Analysis in Geosciences

Yannis Markonis

After Exploration Comes the Explanation

So the exploration is over and we are back home. We are eager to share our findings. Graphs are a great way to do so. However, so far we have prepared more than 30 plots and maps. We have to decide which ones to keep and which ones to leave aside.

Every plot we have created as part of Exploratory Data Analysis had an analytical purpose. Therefore, it can be considered an exploratory visual. It helped us to figure out the structure and significance of the variables explored.

Not all exploratory visuals also have explanatory value. The purpose of explanatory visuals is to demonstrate the most important results. Our intention is to communicate them in the most effective way.

Depending on the audience and the size of our presentation, we should narrow down the graphs and perhaps decide to prepare some new ones. In the past, this was considered a completely separate procedure. We would create the new graphs, save them and then prepare the presentations. This would also mean different tools for text reports, web documents and presentations.

Today, we have tools that not only merge the communication alternatives, but most importantly dynamically link the analysis with its presentation. This is what R Markdown does. The closest example is the one you are currently reading on your screens. This course was exclusively built with R Markdown.

Telling the Story

R Markdown is a branch of Markdown. To use its official definition:

Markdown is a text-to-HTML conversion tool for web writers. (Source: the Markdown web page.)

Markdown has become increasingly popular for various reasons:

  1. It is easy. To begin with, all you need to learn is just the first page.

  2. It is fast and clean. As we make fewer mistakes, our efficiency is increased. Here is an example.

  3. It is portable. Your documents can be edited in any text application on any operating system.

  4. It is flexible. Many other platforms/languages are using it, e.g. Dropbox, GitHub and of course R. At the same time it offers a variety of applications (e.g. emails, webpages, presentations, even books!). For example, besides HTML it supports PDF and Word output.

However, R Markdown also allows us to link the code used in the analysis with the results it generates.

After installing the rmarkdown package, it is quite straightforward to create an HTML file from Markdown.

To create a new markdown file: File/New File/R Markdown...

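This is a plain text file with the .Rmd extension. Notice that the file contains three types of content: a YAML header enclosed by lines of three dashes, code chunks enclosed by lines of three backticks, and text with simple formatting.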

To create an HTML (or alternatively a PDF/Word) file, we press the Knit button [or Ctrl+Shift+K].

R Markdown follows some very simple syntax rules, which allow for basic text formatting, links, and equations.
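As a rough illustration, the sketch below writes a minimal .Rmd file from R and knits it. The file contains a YAML header, formatted text with a link and an equation, and one code chunk. The file name `example.Rmd` and its contents are placeholders chosen for this example. `rmarkdown::render()` is the function behind the Knit button; the call is skipped here if rmarkdown or pandoc is not available.

```r
# Build the chunk delimiter programmatically so that it does not clash
# with the fence around this example.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                               # 1. YAML header
  'title: "EDA"',
  "output: html_document",
  "---",
  "",
  "## Runoff summary",                 # 2. text with simple formatting
  "",
  "Some *italic* and **bold** text, a [link](https://rmarkdown.rstudio.com)",
  "and an inline equation $\\bar{x} = \\frac{1}{n} \\sum_{i=1}^{n} x_i$.",
  "",
  paste0(fence, "{r}"),                # 3. code chunk
  "a <- 5 + 5",
  "a / 2",
  fence
)

# Placeholder file in the temporary folder of the R session.
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit from the console instead of pressing the Knit button.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
}
```

Opening the resulting HTML file in a browser should show the header title, the formatted text and the output of the chunk beneath its code.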

To add R scripts to the Markdown document, a few things should be kept in mind. First of all, each code chunk can be run separately. When you tell R to knit your .Rmd file, R Markdown will run all the code chunks and embed their results beneath them. Note that when knitting, R Markdown will not use anything outside the code chunks, e.g., other variables created in our workspace. For example:

```{r}
a <- 5 + 5

a / 2
```

In our EDA project, this is what we have already been doing up to now.

```{r}
library(data.table)
library(ggplot2)

runoff_summary <- readRDS('data/runoff_summary.rds')
head(runoff_summary)
```

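However, some caution is needed when we read files. By default, relative paths are resolved from the folder that contains the .Rmd file, which may differ from the project folder. This can be adjusted by correctly setting the root directory (folder) at the beginning of our script.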

knitr::opts_knit$set(root.dir = '../..')
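To see what `'../..'` refers to, the following sketch builds a hypothetical project layout in a temporary folder and resolves the path two levels above the folder that holds the .Rmd file. The folder names are assumptions made for illustration only, not the layout of the course project.

```r
# Hypothetical layout: project/docs/reports/report.Rmd and project/data/...
rmd_dir  <- file.path(tempdir(), "project", "docs", "reports")
data_dir <- file.path(tempdir(), "project", "data")
dir.create(rmd_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# '../..' relative to the folder of the .Rmd file is the project root.
project_root <- normalizePath(file.path(rmd_dir, "..", ".."))
basename(project_root)

# With the project root as root directory, 'data/...' paths resolve as intended.
dir.exists(file.path(project_root, "data"))

# In the first (setup) chunk of the report one would then write:
# knitr::opts_knit$set(root.dir = '../..')
```

The setting takes effect for the chunks that follow the one in which it is made, so it belongs in a chunk of its own at the top of the document.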

With knitr::opts_chunk$set we can set some global options for our code chunks. The cache option is one of them. When it is enabled, the results of each chunk are stored on disk and, as long as the chunk remains unchanged, it is not re-run at each knit.

Some other options can also be set for each code chunk individually.

If the echo option is set to TRUE, then both the code and its result are presented. If not, then only the result will be printed/plotted.
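A setup chunk along the following lines would set such defaults for the whole document. The particular values are only an example, and `fig.width` and `fig.height` are further standard knitr chunk options that are not discussed in the text.

```r
library(knitr)

# Defaults applied to every chunk that follows the setup chunk.
opts_chunk$set(
  echo = FALSE,    # show only the results, hide the code
  cache = TRUE,    # reuse stored results while a chunk is unchanged
  fig.width = 7,   # figure width in inches
  fig.height = 4   # figure height in inches
)

# Check what is currently set.
opts_chunk$get("echo")
```

A single chunk can still override a default in its own header, for instance by starting it with `{r, echo = TRUE}`.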

The knitr package also offers another useful function: kable, which prints a data frame as a formatted table.

For example:

aa <- data.frame(ID = 1:10,  Value = rnorm(10), Type = sample(c('A', 'B', 'C', 'D'), 10, replace = TRUE))
knitr::kable(aa, caption = "A simple table with kable()")

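A simple table with kable()

| ID | Value      | Type |
|---:|-----------:|:-----|
| 1  | -0.7849509 | A    |
| 2  | -1.5019003 | C    |
| 3  | -0.9221323 | C    |
| 4  | 0.0452651  | C    |
| 5  | -1.1095846 | C    |
| 6  | 0.3757807  | B    |
| 7  | 0.7411421  | D    |
| 8  | -1.0796701 | A    |
| 9  | -0.3650294 | C    |
| 10 | -0.0706510 | D    |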

Or in our analysis, we can have a table showing some information for each station.

runoff_summary <- readRDS('data/runoff_summary.rds')
knitr::kable(runoff_summary[, .(sname, altitude, start, end)], caption = "A better table with kable()", digits = 2)

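A better table with kable()

| sname | altitude | start | end  |
|:------|---------:|------:|-----:|
| DOMA  | 623      | 1899  | 2016 |
| DIER  | 456      | 1919  | 2016 |
| NEUF  | 430      | 1904  | 2016 |
| REKI  | 370      | 1904  | 2016 |
| RHEM  | 310      | 1933  | 2016 |
| BASR  | 294      | 1869  | 2016 |
| RHEI  | 260      | 1930  | 2016 |
| MAXA  | 98       | 1921  | 2016 |
| SPEY  | 89       | 1950  | 2016 |
| WORM  | 84       | 1930  | 2016 |
| MAIN  | 78       | 1930  | 2016 |
| KAUB  | 68       | 1930  | 2016 |
| ANDE  | 51       | 1930  | 2016 |
| KOEL  | 35       | 1816  | 2016 |
| DUES  | 24       | 1900  | 2016 |
| REES  | 8        | 1814  | 2016 |
| LOBI  | 9        | 1901  | 2017 |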

Another feature of R Markdown is that it is not necessary to create code chunks to present our results. We can add them by using the `r some code` formulation.

In our report, we might want to state that the average catchment altitude is 193 with standard deviation 189 and a maximum of 623 at DOMA.

Here, the inline code for estimating the mean would be `r round(mean(runoff_summary$altitude), 0)`.

The true strength of R Markdown is that in this way our report is dynamic. For example, if we want to exclude the DOMA station from our analysis, then we just remove it and re-knit.

runoff_summary <- runoff_summary[sname != "DOMA"]

We don’t change anything in our text.

The average catchment altitude is 166 with standard deviation 158 and a maximum of 456 at DIER.
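The numbers in these sentences can be checked with a sketch like the one below. It does not read `runoff_summary.rds`; instead, the station names and altitudes are retyped from the table above into a stand-in data frame. The names `stations` and `describe_altitude` are hypothetical, and base R is used instead of data.table to keep the example self-contained.

```r
# Stand-in for runoff_summary, retyped from the station table.
stations <- data.frame(
  sname = c("DOMA", "DIER", "NEUF", "REKI", "RHEM", "BASR", "RHEI", "MAXA",
            "SPEY", "WORM", "MAIN", "KAUB", "ANDE", "KOEL", "DUES", "REES",
            "LOBI"),
  altitude = c(623, 456, 430, 370, 310, 294, 260, 98,
               89, 84, 78, 68, 51, 35, 24, 8, 9),
  stringsAsFactors = FALSE
)

# Hypothetical helper that assembles the sentence used in the report.
# In the .Rmd file each number would be a separate piece of inline code.
describe_altitude <- function(x) {
  top <- x[which.max(x$altitude), ]
  paste0("The average catchment altitude is ", round(mean(x$altitude), 0),
         " with standard deviation ", round(sd(x$altitude), 0),
         " and a maximum of ", top$altitude, " at ", top$sname, ".")
}

describe_altitude(stations)

# Dropping DOMA changes the sentence without touching the text.
describe_altitude(stations[stations$sname != "DOMA", ])
```

With the values retyped from the table, the two calls reproduce the figures quoted in the text before and after removing DOMA.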

A last step is selecting style options for our report. We can use the theme option in the header section to specify the theme to use for the page. Available theme names include “default”, “cerulean”, “journal”, “flatly”, “readable”, “spacelab”, “united”, “cosmo”, “lumen”, “paper”, “sandstone”, “simplex”, and “yeti”. For example:

---
title: "EDA"
output:
  html_document:
    theme: united
---

We can also add a table of contents using the toc option and number the section headers using the number_sections option. For example:

---
title: "EDA"
output: 
  html_document:
    toc: true
    number_sections: true
    pandoc_args: 
      ["--number-sections",
      "--number-offset=1"]
---

R Markdown is fully compatible with Dropbox and GitHub. As you have also noticed, RStudio provides some free space at RPubs.

The Adventure Continues

At the time of writing these lines, a new study about changes in Rhine hydrology has just been published. A coincidence? Probably not. As the environmental problems increase and water demand grows, our need to further understand how hydroclimatic systems function also becomes larger.

This short trip was meant as an introduction to Exploratory Data Analysis with a focus on Geosciences. Its objective was not only to present the main methods used in EDA, but most importantly to show that we cannot efficiently explore a dataset if we do not understand how it works. Data themselves are dumb. It is our questions that eventually create an interpretation.

As this course is completed for the first time, I would like to thank all of you for participating. With your work, you have already reshaped it by highlighting which parts work and which need revision. I wish you good luck with your own future explorations.

Further Reading & Assignments

Some useful material on R Markdown:

Assignments