01 The Explorer’s Perspective

Exploratory Data Analysis in Geosciences

Yannis Markonis


Before We Get Started…

Welcome to the Exploratory Data Analysis course, offered as part of the Bachelor of Environmental Science at the Czech University of Life Sciences Prague.

The prerequisite for the course is basic knowledge of R and statistics. In terms of software, RStudio and a GitHub account will suffice.

This intensive course will be held from 6th to 10th April, from 9.00 to 16.00, with a one-hour lunch break at 12.00. I will be available to answer questions until 21.00, and the daily assignments should be submitted by that time.

After successfully completing all the assignments, you will work on a larger individual project, which should be submitted by 1st of June. All assignments and reports will be uploaded to your GitHub repositories.

Preface

So what is Exploratory Data Analysis (EDA)?

Let’s explore some definitions of EDA coming from sources with different backgrounds.

In statistics, Exploratory Data Analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

— Wikipedia

Roughly speaking, Exploratory Data Analysis may be defined as the art of looking at one or more datasets in an effort to understand the underlying structure of the data contained there. (Exploratory Data Analysis Using R, CRC Press, 2018)

— Ronald K. Pearson

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations. (Towards Data Science, 2018)

— Prasad Patil

These three definitions contain the key features of EDA, which are:

Most data analysts would agree on these terms. However, EDA can be a lot more than just a methodological first step before the main analysis.

It can be the main analysis itself.

This is the approach we will take here. We will apply EDA with a focus on the exploration perspective.

The Explorer’s Approach

Think of Christopher Columbus. His exploration project started long before his journey. There were long, careful preparations and a clear objective. The objective was to find the westward route to Chinese markets.

Then, even though the objective was not accomplished and the route to China was not found, the project did not finish at that point. After he and his fellow travelers returned home, there were stories to tell and evidence to present. New maps were drawn and new observations were collected. There was information that would help him and other explorers to repeat this journey or use the knowledge gained for other ambitious endeavors. It was not just the journey, but also the time spent before and after.

When we start our data projects, we usually dive into the datasets and start doing things with them. This comes naturally, in some cases due to the excitement of exploration, in others due to time limitations.

However, can you imagine how Columbus’s expedition would have gone if he had followed this path? If he had skipped the first step of preparation, he would probably have wandered across the Atlantic Ocean until his provisions were spent. (After one day or a year? Who knows, nobody really checked.) If he had not cared for the last step, then after setting foot back in the harbor of Lisbon he would simply have reported that no route to China was found. But he did not do that.

Following Columbus’s approach, a data exploration expedition should consist of three parts: the preparation, the journey itself, and the reporting of what was found.

In this course, you will learn how to effectively take these steps in order to succeed in your own future explorations.

Our expedition

Instead of learning abstract statistical tools and methods, our approach here is to take part in an exploration expedition and to train along the way. By the end of it, you will not only have become familiar with some common methods and procedures in EDA, but you will also have witnessed the mindset behind the decisions made and the actions taken to complete it.

Our expedition will take us into the scientific discipline of hydroclimatology. As the word suggests, it studies the interaction between water and climate. So let’s see what our expedition is about.

Twenty years ago, a milestone study was published in the scientific journal Climatic Change. The research article was titled “Impact of Climate Change on Hydrological Regimes and Water Resources Management in the Rhine Basin” (Middelkoop et al., 2001). By combining climate and hydrological models, the authors presented the future changes in river discharge that are expected over the Rhine basin due to global warming. (River discharge, runoff and streamflow are terms that in most cases are used with the same meaning.)

Their results not only influenced local policy makers but, most importantly, shaped the research path in hydrological projections for the next two decades. Since then, we have collected a vast amount of data that can be used to evaluate the actual changes in the hydroclimatology of the region.

It is left to us to find out whether the changes in the Rhine are actually happening the way Middelkoop and his colleagues predicted.

Further Reading & Assignments

Some books that provided material and inspiration for creating this course:

Assignments

On this first journey, you will not be alone. In this section, you will find tasks and questions from the two people responsible for your training: the Navigator and the Explorer.

Explorer’s Questions

The Explorer is interested in discovering new things. His questions will help you understand how, and will hopefully lead to new questions of your own.

New Functions & Code Listings

At the end of each chapter, you can find a list of the most important functions introduced in the chapter, as well as the code scripts presented.

After this short introduction, it is time to start preparing for our journey.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.