Setting the Destination

Columbus did not sail to discover a new continent. His objective was quite different. Still, it was explicitly formulated:

Find a direct route to the Orient.

Our first step will be exactly the same. We will set our destination in a clear and rigid formulation:

Show that the hydroclimatic changes predicted by Middelkoop et al. are happening across the Rhine basin.

This is our analysis hypothesis. Our aim is to verify or falsify it. However, before going any further, it would be a good idea to see which are these changes.11 Middelkoop et al., 2001

“Higher winter river discharge as a result of intensified snow-melt and increased winter precipitation, and lower summer discharge due to the reduced winter snow storage and an increase of evapotranspiration.”

⊕Map of the world by Martin Behaim (1492). Asia lies on the left side, while Europe on the right (Statistical Rethinking, Chapman & Hall/CRC Press, 2015).

Actually Columbus did the same thing. He narrowed down his objective to a more specific one, which was to find the so-called “North-west Passage”. His target became to find the passage that linked Atlantic and Pacific Oceans. To achieve this objective, three assumptions should stand true:

The Earth was round.
The North-west Passage existed.
The trip to Asia had a reasonable duration.

Columbus did not spend much time for the two first ones22 It was also safe to assume that no other “obstacle”, such as a continent, lay in the way.. He considered the available evidence sufficient. However, he put a significant amount of effort to estimate the distance of the trip. Calculating the journey length, was indispensable for designing it correctly. For these calculations, he used the most common model the explorers use – a map.33 All models are in their essence representations of reality.

In EDA, we do the same thing. Our model here is the statistical theory and sometimes our assumptions can be quite shady. It is important to try to determine them in the beginning of the analysis before getting into our ships.

Preparing the Ships

In our case study, and this course in general, our ship is R/RStudio. If you do not already have downloaded and installed them, it is time to do so.

Since we will spend some time in the same boat it would be helpful, if we shared a common language. We should all use the same code practices, as well as folder structure.

In this project, you will get the basic structure from a pre-existing github repository. After forking it, you can create a new project with version control in RStudio.

Now it is time to read the readme file to see if there is any useful information kept there. It is always good to keep a readme file, the very same way that captains kept their logs.

If you run the functions in 00_init.R, the following folder structure will be created. It contains:

code where we will save our scripts
data for data storage
data/raw always have the initial data isolated and read-only
results to keep our results
results/figures it usually helps to keep our figures separately
docs reading material that we might collect along the way

The reason that some of these folders are not found in the git online repository is that they are excluded through the .gitignore file. This is because we are trying to avoid having big size files online.

We will download these files directly from the cloud. They should be stored in data/raw and docs correspondingly.

Studying the Map

Fortunately, in our exploration we will follow the steps of Middelkoop and his colleagues back in 2001. Thus, by reading their work we can get an idea of what to expect ahead. In most cases, though, we will have not such an opportunity. What we can do, is to spend some time to investigate similar projects to ours.

Typical questions we want to answer are:

What type of data did they use?
Which variables are most suitable to verify/falsify the analysis hypothesis?
Did they do any pre-processing in the data? Were there any missing values or outliers?
Did they use some summary statistics?
Was some clustering or classification method applied to the data to reduce their dimensions?
Which are the most representative time scales (daily, monthly etc.)?
What kind of visual representations (figures, maps etc.) did they use?

It is easy to answer these inquiries, by reading their paper.

Middelkoop and colleagues use numerical time series of data at different locations (data type) of daily discharge data (main variable). The discharge is produced by hydrological models, which use temperature and precipitation projections generated by the climate models (secondary variables). There is no pre-processing, since all the data they use are model-generated and thus assumed clean. Monthly means and percentage of change are used to summarize their results, which are plotted in line plots and bar plots. The only classification applied is the aggregation of daily to monthly values, which is the most representative scale for the study. In this context, this is further aggregated to seasonal time scale (winter vs. summer changes).

Before moving forward, we should make a quick recap of these concepts and their utilization in R.

To define numerical time series, we have to start first from variables.

A variable is a quantity, quality, or property that you can measure. Since it varies, it can get different states or values. In R, we can create a variable by assigning a value to it.

runoff <- 130 #m3/s

runoff is a numerical variable and gets a real number value. Other types of variables in R that we will use here are integer for whole numbers, factor for categorical data and character which is pretty self-explanatory.44 In R, variables are stored in data objects, such as vectors, matrices and data frames. If you are not familiar with the basic types of R data objects, have a quick read here.
If there are more than one values ordered in time, then we have a time series.

runoff_ts <- data.frame(time = as.Date(1:90, origin = '2020/12/31'), 
                        value = sample(c(130, 135, 140), 
                                       size = 90, replace = T))
head(runoff_ts)

##         time value
## 1 2021-01-01   135
## 2 2021-01-02   130
## 3 2021-01-03   140
## 4 2021-01-04   130
## 5 2021-01-05   135
## 6 2021-01-06   135

In our analysis, we will use the data.table objects. data.table is an enhanced data.frame, which helps in considerably reducing programming time, as well as computing time. This is especially useful when working with large datasets.

Another efficient way of data handling is through the plyr package and the %>% syntax. We strongly recommend this approach too, even though here we will be using the data.table syntax.

The general syntax of data.tables is

dt[i, j, by]

which means:

Take data.table dt, select rows using i, then calculate j possibly grouped by by.

The examples below offer the basic applications of data.table syntax.

We start with loading the package into the workspace and transforming the data.frame object to data.table.

library(data.table)
runoff_dt <- data.table(runoff_ts)

We can select all the runoff values above 130 m3/s by using only the first part of the data.table syntax.

runoff_dt[value > 130]

We can estimate the mean of all the runoff values above 130 m3/s by using the first two parts of the data.table syntax.

runoff_dt[value > 130, mean(value)]

Or we can use the full syntax of data.table to estimate the monthly mean of all the runoff values above 130 m3/s.

runoff_dt[value > 130, mean(value), by = month(time)]

If we want to store the results of our data manipulation to a new column, then we could simply use the := notation.

runoff_dt[, mon := month(time)]

Now, we can use this new column to create another one.

runoff_dt[, mon_mean := mean(value), by = mon]

Finally, we can create a new data.table with the . notation, which is equal to list() function.

runoff_month <- runoff_dt[, .(mon, mon_mean)] 
runoff_month
unique(runoff_month)

Since we are interested in the individual instances of the monthly means, we remove the duplicates with the unique function. We conclude this example by storing the data.table as an .rds file, which is quicker in reading and thus plays well with the data.table efficiency.

saveRDS(runoff_dt, file = './data/dt_example.rds')

In the examples above, we have used the function mean to estimate mean55 Mean is the most common summary statistic. However, there are several kinds of means in statistics. For a data set, the arithmetic mean, also called the mathematical expectation or average, is the central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values. \[\bar{x} = \frac{1}{n} \sum_{i}{x_i}\] Generally, mean represents the central tendency of the data. However, there are cases it can be misleading. For example, mean is a good descriptor for apple size. If I have a basket with 10 apples that weights 2 kgs, it’s reasonable to expect that if a get an apple at random from the basket then it will weight around 200 gr. However, if my basket contains both apples and mellons then my expectation would be false. As a rule, in natural processes we have both apples and mellons (and many other fruits) so we have to be very cautious when using summary statistics. runoff. We have already noticed that Middelkoop and colleagues aggregated their data to monthly values. As aggregation we define the application of a function (usually the mean or the sum) to merge data into clusters with common characteristics.

Time series aggregation is the aggregation of all data points for a single time series over a specified period (time scale).

All the above cover more or less the “technical” preparations for our project. However, this is not enough. EDA also requires something else.

EDA is Not Just Numbers

In the very same manner that Columbus trip was not about navigating through the ocean, EDA is not just about applying statistical methods on data. A basic, but clear, understanding of the variables and their processes is essential.

In our project, we will investigate the Rhine catchment66 In United States the term watershed is used instead. Another term is drainage basin or just basin. and how water cycle interacts with it. Therefore, we should be familiar with these concepts.

In a secondary level, it is always beneficial to know more about the study area and its main characteristics. We could start with a topographic map of the catchment.

Or one that presents the main sub-catchments.

We can easily dive deeper by collecting information from other online sources. Today we have the unique opportunity to collect information from multiple media:77 Depending on browser, right-clicking and opening in new tab/window might be needed.

02 Sharpening the Axe

Exploratory Data Analysis in Geosciences

Yannis Markonis

Setting the Destination

Preparing the Ships

Studying the Map

EDA is Not Just Numbers

Further Reading & Assignments

Assignments

Navigator’s Tasks

Explorer’s Questions

New Functions & Code Listings

Functions