After Exploration Comes the Explanation

So the exploration is over and we are back home. We are eager to share our findings, and graphs are a great way to do so. However, so far we have prepared more than 30 plots and maps, so we have to decide which ones to keep and which ones to leave aside.

Every plot we created as part of Exploratory Data Analysis had an analytical purpose, so it can be considered an exploratory visual. It helped us figure out the structure and significance of the variables explored.

Not all exploratory visuals also have explanatory value. The purpose of explanatory visuals is to demonstrate the most important results and to communicate them in the most effective way.

Depending on the audience and the size of our presentation, we should narrow down the graphs and perhaps prepare some new ones. In the past, this was considered a completely separate procedure: we would create the new graphs, save them, and then prepare the presentations. This would also mean different tools for text reports, web documents and presentations.

Today, we have tools that not only merge these communication alternatives but, most importantly, dynamically link the analysis with its presentation. This is what R Markdown does. The closest example is the one you are currently reading on your screens: this course was built exclusively with R Markdown.

Telling the Story

R Markdown is a branch of Markdown. To quote Markdown's official definition:

Markdown is a text-to-HTML conversion tool for web writers. (Source: Markdown web page.)

Markdown has become increasingly popular for various reasons:

It is easy. To begin with, all you need to learn is just the first page.
It is fast and clean. As we make fewer mistakes, our efficiency increases. Here is an example.
It is portable. Your documents can be edited in any text application on any operating system.
It is flexible. Many other platforms and languages use it, e.g. Dropbox, GitHub and of course R. It also serves a variety of applications (e.g. emails, web pages, presentations, even books!) and, besides HTML, it supports output to PDF and Word.

R Markdown, however, goes one step further: it links the code used in the analysis with the results that the code generates.

After installing R Markdown it is quite straightforward to create a Markdown HTML file.

To create a new markdown file: File/New File/R Markdown...

This is a plain text file with the .Rmd extension. Notice that the file contains three types of content:

A header surrounded by ---.
R code surrounded by ```. We call these chunks, and we can create a new one with the Insert button (shortcut: Ctrl + Alt + I).
Some text mixed with some simple text formatting.

To create an HTML (or alternatively a PDF/Word) file, we press the Knit button [or Ctrl+Shift+K].

R Markdown follows some very simple syntax rules, which allow for basic text formatting, links, and equations.

To add R scripts to the markdown document, a few things should be kept in mind. First of all, each code chunk can be run separately. When you tell R to knit your .Rmd file, R Markdown runs all the code chunks and embeds their results beneath them. (When knitting, R Markdown does not use anything outside the code chunks, e.g. other variables created in our workspace.) For example:

```{r}
a <- 5 + 5

a / 2
```

In our EDA project, this is what we have already been doing up to now.

```{r}
library(data.table)
library(ggplot2)

runoff_summary <- readRDS('data/runoff_summary.rds')
head(runoff_summary)
```
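To see what knitting does with a chunk, here is a minimal sketch that passes the small arithmetic chunk from above to knitr::knit() as text. The names fence, rmd_lines and knitted are made up for this illustration. The Knit button does more than this (it calls rmarkdown::render(), which also converts the result to HTML, PDF or Word), so treat the sketch as a look under the hood, not as the recommended workflow.

```r
library(knitr)

# Build the chunk delimiter programmatically so the example is easy to paste.
fence <- strrep("`", 3)

# A tiny R Markdown document: one line of text and the arithmetic chunk.
rmd_lines <- c(
  "A first chunk:",
  "",
  paste0(fence, "{r}"),
  "a <- 5 + 5",
  "",
  "a / 2",
  fence
)

# knit() runs the chunk and returns Markdown with the output placed below the code.
knitted <- knit(text = rmd_lines, quiet = TRUE, envir = new.env())
cat(knitted, sep = "\n")
```

The printed Markdown should contain the original code followed by its result, `## [1] 5`.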

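However, some caution is needed when we read files, because relative paths such as 'data/runoff_summary.rds' must resolve correctly when the document is knitted. This can be adjusted by setting the root directory (folder) at the beginning of our script.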

knitr::opts_knit$set(root.dir = '../..')

With knitr::opts_chunk$set we can set some global options for our code chunks. The cache option is one of them: the results of each chunk are stored on disk and, as long as the chunk remains unchanged, it is not re-run in each knit.

Some other options, which can also be set for each individual code chunk, are:

echo
include
message
fig.width, fig.height
- small: <4
- medium: 4-8
- big: >8
fig.align

If the echo option is set to TRUE, then both the code and its result are presented. If not, then only the result will be printed/plotted.
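These options are often collected in a setup chunk at the top of the .Rmd file. The sketch below shows one possible set of defaults. The specific values (a centred 6 x 4 figure, caching switched on) are illustrative choices and not the settings used in this course.

```r
library(knitr)

# Global defaults applied to every chunk that does not override them.
opts_chunk$set(
  echo = FALSE,         # show only the results, hide the code
  message = FALSE,      # suppress messages, e.g. from library() calls
  cache = TRUE,         # reuse stored results while a chunk is unchanged
  fig.width = 6,        # medium-sized figure, width in inches
  fig.height = 4,       # figure height in inches
  fig.align = "center"  # centre figures on the page
)

# Inspect a setting to confirm that it was registered.
opts_chunk$get("fig.align")
```

Any single chunk can still override a default in its own header, for example by adding echo = TRUE after the r in the opening line of the chunk. Setting include = FALSE there runs the chunk but keeps both its code and its results out of the report.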

The knitr package also offers another useful function: kable.

For example:

aa <- data.frame(ID = 1:10,  Value = rnorm(10), Type = sample(c('A', 'B', 'C', 'D'), 10, replace = TRUE))
knitr::kable(aa, caption = "A simple table with kable()")

A simple table with kable()

ID	Value	Type
1	-0.7849509	A
2	-1.5019003	C
3	-0.9221323	C
4	0.0452651	C
5	-1.1095846	C
6	0.3757807	B
7	0.7411421	D
8	-1.0796701	A
9	-0.3650294	C
10	-0.0706510	D

Or in our analysis, we can have a table showing some information for each station.

runoff_summary <- readRDS('data/runoff_summary.rds')
knitr::kable(runoff_summary[, .(sname, altitude, start, end)], caption = "A better table with kable()", digits = 2)

A better table with kable()

sname	altitude	start	end
DOMA	623	1899	2016
DIER	456	1919	2016
NEUF	430	1904	2016
REKI	370	1904	2016
RHEM	310	1933	2016
BASR	294	1869	2016
RHEI	260	1930	2016
MAXA	98	1921	2016
SPEY	89	1950	2016
WORM	84	1930	2016
MAIN	78	1930	2016
KAUB	68	1930	2016
ANDE	51	1930	2016
KOEL	35	1816	2016
DUES	24	1900	2016
REES	8	1814	2016
LOBI	9	1901	2017
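Column names such as sname are convenient in code but less so for readers. As a possible refinement, kable() accepts col.names and align arguments. The sketch below uses a three-row excerpt of the table above, typed in by hand as a plain data frame called stations (a name chosen for this example), so it runs without the course data. The header labels are our reading of the columns, not names taken from the dataset.

```r
library(knitr)

# Hand-typed excerpt of the station table above (first three rows).
stations <- data.frame(
  sname = c("DOMA", "DIER", "NEUF"),
  altitude = c(623, 456, 430),
  start = c(1899, 1919, 1904),
  end = c(2016, 2016, 2016)
)

# Reader-friendly headers (assumed labels) and explicit alignment:
# left for the station names, right for the numeric columns.
station_table <- kable(
  stations,
  format = "markdown",
  col.names = c("Station", "Altitude", "First year", "Last year"),
  align = c("l", "r", "r", "r"),
  caption = "Station overview"
)
station_table
```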

Another feature of R Markdown is that it is not necessary to create code chunks to present our results. We can add them inline by using the `r some code` formulation.

In our report, we might want to present that the average catchment altitude is 193 with standard deviation 189 and a maximum of 623 at DOMA.

Here, the inline code for estimating the mean would be `r round(mean(runoff_summary$altitude), 0)`.

The true strength of R Markdown is that in this way our report is dynamic. For example, if we want to exclude the DOMA station from our analysis, then we just remove it and re-knit.

runoff_summary <- runoff_summary[sname != "DOMA"]

We don’t change anything in our text.

The average catchment altitude is 166 with standard deviation 158 and a maximum of 456 at DIER.
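The numbers in the two sentences above can be reproduced with a few expressions, each of which could be placed inline. The sketch below types the altitudes from the station table into a plain data frame (the course itself reads a data.table from data/runoff_summary.rds). The helper describe_altitude() and the object without_doma are hypothetical names introduced only for this example.

```r
# Station names and altitudes copied from the table above.
runoff_summary <- data.frame(
  sname = c("DOMA", "DIER", "NEUF", "REKI", "RHEM", "BASR", "RHEI", "MAXA", "SPEY",
            "WORM", "MAIN", "KAUB", "ANDE", "KOEL", "DUES", "REES", "LOBI"),
  altitude = c(623, 456, 430, 370, 310, 294, 260, 98, 89,
               84, 78, 68, 51, 35, 24, 8, 9)
)

# Hypothetical helper: builds the sentence that the inline expressions would produce.
describe_altitude <- function(x) {
  highest <- which.max(x$altitude)
  paste0(
    "The average catchment altitude is ", round(mean(x$altitude), 0),
    " with standard deviation ", round(sd(x$altitude), 0),
    " and a maximum of ", max(x$altitude),
    " at ", x$sname[highest], "."
  )
}

all_stations <- describe_altitude(runoff_summary)
all_stations

# Dropping DOMA changes the data; the same expressions then give the updated sentence.
without_doma <- runoff_summary[runoff_summary$sname != "DOMA", ]
fewer_stations <- describe_altitude(without_doma)
fewer_stations
```

Because the text is generated from the data, the sentence changes as soon as the data do, which is exactly what happens when the report is knitted again.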

A last step is selecting style options for our report. We can use the theme option in the header section to specify the theme to use for the page. The theme names are “default”, “cerulean”, “journal”, “flatly”, “readable”, “spacelab”, “united”, “cosmo”, “lumen”, “paper”, “sandstone”, “simplex”, and “yeti”. For example:

---
title: "EDA"
output:
  html_document:
    theme: united
---

We can also add a table of contents using the toc option and section numbering to headers using the number_sections option. For example:

---
title: "EDA"
output: 
  html_document:
    toc: true
    number_sections: true
    pandoc_args: 
      ["--number-sections",
      "--number-offset=1"]
---

R Markdown is fully compatible with Dropbox and GitHub. As you have also noticed, RStudio also provides some free space at RPubs.

The Adventure Continues

At the time of writing these lines, a new study about changes in Rhine hydrology has just been published. A coincidence? Probably not. As environmental problems increase and water demand grows, our need to further understand how hydroclimatic systems function also becomes larger.

This short trip was meant as an introduction to Exploratory Data Analysis with a focus on Geosciences. Its objective was not only to present the main methods used in EDA but, most importantly, to show that we cannot efficiently explore a dataset if we do not understand how it works. Data themselves are dumb. It is our questions that eventually create an interpretation.

As this course is completed for the first time, I would like to thank all of you for participating. With your work, you have already reshaped it by highlighting which parts work and which need revision. I wish you good luck with your own future explorations.

06 Returning Home

Exploratory Data Analysis in Geosciences

Yannis Markonis
