Give me six hours to chop down a tree and I will spend the first four sharpening the axe.
Columbus did not sail to discover a new continent. His objective was quite different. Still, it was explicitly formulated:
Find a direct route to the Orient.
Our first step will be exactly the same. We will set our destination in a clear and rigid formulation:
Show that the hydroclimatic changes predicted by Middelkoop et al. are happening across the Rhine basin.
This is our analysis hypothesis. Our aim is to verify or falsify it. However, before going any further, it would be a good idea to see which are these changes.1 Middelkoop et al., 2001
“Higher winter river discharge as a result of intensified snow-melt and increased winter precipitation, and lower summer discharge due to the reduced winter snow storage and an increase of evapotranspiration.”
Map of the world by Martin Behaim (1492). Asia lies on the left side, while Europe on the right (Statistical Rethinking, Chapman & Hall/CRC Press, 2015).
Actually Columbus did the same thing. He narrowed down his objective to a more specific one, which was to find the so-called “North-west Passage”. His target became to find the passage that linked Atlantic and Pacific Oceans. To achieve this objective, three assumptions should stand true:
Columbus did not spend much time for the two first ones2 It was also safe to assume that no other “obstacle”, such as a continent, lay in the way.. He considered the available evidence sufficient. However, he put a significant amount of effort to estimate the distance of the trip. Calculating the journey length, was indispensable for designing it correctly. For these calculations, he used the most common model the explorers use – a map.3 All models are in their essence representations of reality.
In EDA, we do the same thing. Our model here is the statistical theory and sometimes our assumptions can be quite shady. It is important to try to determine them in the beginning of the analysis before getting into our ships.
In our case study, and this course in general, our ship is R/RStudio. If you do not already have downloaded and installed them, it is time to do so.
Since we will spend some time in the same boat it would be helpful, if we shared a common language. We should all use the same code practices, as well as folder structure.
In this project, you will get the basic structure from a pre-existing github repository. After forking it, you can create a new project with version control in RStudio.
Now it is time to read the readme file to see if there is any useful information kept there. It is always good to keep a readme file, the very same way that captains kept their logs.
If you run the functions in 00_init.R, the following folder structure will be created. It contains:
code where we will save our scriptsdata for data storagedata/raw always have the initial data isolated and read-onlyresults to keep our resultsresults/figures it usually helps to keep our figures separatelydocs reading material that we might collect along the wayThe reason that some of these folders are not found in the git online repository is that they are excluded through the .gitignore file. This is because we are trying to avoid having big size files online.
We will download these files directly from the cloud. They should be stored in data/raw and docs correspondingly.
Fortunately, in our exploration we will follow the steps of Middelkoop and his colleagues back in 2001. Thus, by reading their work we can get an idea of what to expect ahead. In most cases, though, we will have not such an opportunity. What we can do, is to spend some time to investigate similar projects to ours.
Typical questions we want to answer are:
It is easy to answer these inquiries, by reading their paper.
Middelkoop and colleagues use numerical time series of data at different locations (data type) of daily discharge data (main variable). The discharge is produced by hydrological models, which use temperature and precipitation projections generated by the climate models (secondary variables). There is no pre-processing, since all the data they use are model-generated and thus assumed clean. Monthly means and percentage of change are used to summarize their results, which are plotted in line plots and bar plots. The only classification applied is the aggregation of daily to monthly values, which is the most representative scale for the study. In this context, this is further aggregated to seasonal time scale (winter vs. summer changes).
Before moving forward, we should make a quick recap of these concepts and their utilization in R.
To define numerical time series, we have to start first from variables.
A variable is a quantity, quality, or property that you can measure. Since it varies, it can get different states or values. In R, we can create a variable by assigning a value to it.
runoff <- 130 #m3/s
runoff is a numerical variable and gets a real number value. Other types of variables in R that we will use here are integer for whole numbers, factor for categorical data and character which is pretty self-explanatory.4 In R, variables are stored in data objects, such as vectors, matrices and data frames. If you are not familiar with the basic types of R data objects, have a quick read here.
If there are more than one values ordered in time, then we have a time series.
runoff_ts <- data.frame(time = as.Date(1:90, origin = '2020/12/31'),
value = sample(c(130, 135, 140),
size = 90, replace = T))
head(runoff_ts)
## time value
## 1 2021-01-01 135
## 2 2021-01-02 130
## 3 2021-01-03 140
## 4 2021-01-04 130
## 5 2021-01-05 135
## 6 2021-01-06 135
In our analysis, we will use the data.table objects. data.table is an enhanced data.frame, which helps in considerably reducing programming time, as well as computing time. This is especially useful when working with large datasets.
Another efficient way of data handling is through the plyr package and the %>% syntax. We strongly recommend this approach too, even though here we will be using the data.table syntax.
The general syntax of data.tables is
dt[i, j, by]
which means:
Take data.table
dt, select rows usingi, then calculatejpossibly grouped byby.
The examples below offer the basic applications of data.table syntax.
We start with loading the package into the workspace and transforming the data.frame object to data.table.
library(data.table)
runoff_dt <- data.table(runoff_ts)
We can select all the runoff values above 130 m3/s by using only the first part of the data.table syntax.
runoff_dt[value > 130]
We can estimate the mean of all the runoff values above 130 m3/s by using the first two parts of the data.table syntax.
runoff_dt[value > 130, mean(value)]
Or we can use the full syntax of data.table to estimate the monthly mean of all the runoff values above 130 m3/s.
runoff_dt[value > 130, mean(value), by = month(time)]
If we want to store the results of our data manipulation to a new column, then we could simply use the := notation.
runoff_dt[, mon := month(time)]
Now, we can use this new column to create another one.
runoff_dt[, mon_mean := mean(value), by = mon]
Finally, we can create a new data.table with the . notation, which is equal to list() function.
runoff_month <- runoff_dt[, .(mon, mon_mean)]
runoff_month
unique(runoff_month)
Since we are interested in the individual instances of the monthly means, we remove the duplicates with the unique function. We conclude this example by storing the data.table as an .rds file, which is quicker in reading and thus plays well with the data.table efficiency.
saveRDS(runoff_dt, file = './data/dt_example.rds')
In the examples above, we have used the function mean to estimate mean5 Mean is the most common summary statistic. However, there are several kinds of means in statistics. For a data set, the arithmetic mean, also called the mathematical expectation or average, is the central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values. \[\bar{x} = \frac{1}{n} \sum_{i}{x_i}\] Generally, mean represents the central tendency of the data. However, there are cases it can be misleading. For example, mean is a good descriptor for apple size. If I have a basket with 10 apples that weights 2 kgs, it’s reasonable to expect that if a get an apple at random from the basket then it will weight around 200 gr. However, if my basket contains both apples and mellons then my expectation would be false. As a rule, in natural processes we have both apples and mellons (and many other fruits) so we have to be very cautious when using summary statistics. runoff. We have already noticed that Middelkoop and colleagues aggregated their data to monthly values. As aggregation we define the application of a function (usually the mean or the sum) to merge data into clusters with common characteristics.
Time series aggregation is the aggregation of all data points for a single time series over a specified period (time scale).
All the above cover more or less the “technical” preparations for our project. However, this is not enough. EDA also requires something else.
In the very same manner that Columbus trip was not about navigating through the ocean, EDA is not just about applying statistical methods on data. A basic, but clear, understanding of the variables and their processes is essential.
In our project, we will investigate the Rhine catchment6 In United States the term watershed is used instead. Another term is drainage basin or just basin. and how water cycle interacts with it. Therefore, we should be familiar with these concepts.
In a secondary level, it is always beneficial to know more about the study area and its main characteristics. We could start with a topographic map of the catchment.
Or one that presents the main sub-catchments.
We can easily dive deeper by collecting information from other online sources. Today we have the unique opportunity to collect information from multiple media:7 Depending on browser, right-clicking and opening in new tab/window might be needed.
Interested in the conceptual notions behind statistical models? Have a look at chapters 1-2 of Richard McElreath’s Statistical Rethinking.
If you want to get deeper into using github in R, then you can visit J. Bryan’s guide.
Some useful tips about keeping an efficient workflow and folder structure.
How big is the Rhine catchment (km2)?
If it rained for one full day over the whole catchment area at 0.5mm/hour and all the precipitated water ended up in the river, how much would be the increase in the average river runoff? Write a script that performs the calculation.
(Optional) How much time does a rain drop falling at Alpine Rhine need to reach the ocean? Write a script that performs the calculation.
Now that we have finished with the preparations, we can jump into our ships and sail away.
as.Date() # Create a date object
head() # Display the first lines of a data object
library() # Load packages in memory
mean() # Estimate the mean
month() # Return the month of a date object
sample() # Pick random numbers from a set
saveRDS() # Save one or more data objects to an .rds file
unique() # Return the non-duplicate values of a data object or set
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.