Initial Questions

Now that we have cleared the dataset and preliminary validated its quality we are ready to start with the exploration. We start with some simple questions.

How many records do we have? How many variables?
What are the variable names? Are they meaningful?
What type is each variable; e.g., numeric, categorical, logical?
How many unique values does each variable have?
What value occurs most frequently, and how often does it occur?

As we discussed earlier, when we are working with spatiotemporal data, there are two types of data. The records that describe the process (runoff_day) and the information about these records (runoff_stations).

library(data.table)
library(ggplot2)

colset_4 <-  c("#D35C37", "#BF9A77", "#D6C6B9", "#97B8C2")

runoff_stations <- readRDS('data/runoff_stations.rds')
runoff_day <- readRDS('data/runoff_day.rds')

After our initial data wrangling is quite easy to answer these questions for runoff_stations. For example, str function can provide us a quick overview of the number of variables, their names and types can be seen with table.

str(runoff_stations)
table(runoff_stations$country) # For one variable
apply(X = runoff_stations, MARGIN = 2, FUN = table) # For all variables

In the case of runoff_day, and in timeseries in general, we need to take a different approach.

Summary Statistics

Although plotting is a necessary first step, in most cases visual inspection offers only limited information about our data. Each time series has 30 to 75 thousand values. We need to somehow derive the information encapsulated in our data to meaningful information. The most straightforward way to do this is by estimating some summary statistics.

runoff_stats <- runoff_day[, .(mean_day = round(mean(value), 0),
                               sd_day = round(sd(value), 0),
                             min_day = round(min(value), 0),
                             max_day = round(max(value), 0)), by = sname]
head(runoff_stats, 4)

##    sname mean_day sd_day min_day max_day
## 1:  REES     2251   1112     500   11700
## 2:  DUES     2126   1078     464   11000
## 3:  KOEL     2086   1039     401   10900
## 4:  ANDE     2039   1057     560   10400

By estimating the mean, the standard deviation and the range, we now know how much runoff to expect11 Mean or average is the expected value of the time series. Standard deviation is a measure of spread around the expected value., and how much we shouldn’t22 To be confident about the range, though, we need a really big amount of values.. What we did is to reduce the amount of information from each time series (thousands of values) to four values; mean, standard deviation, minimum and maximum. We must be careful, though. We want to compress and not remove the available information.33 You might have noticed that we have created a new data.table for this information, instead of adding it in runoff_stations. As the volume of analyses grows larger, then we might end up with some huge data objects with “summary information”. We try to avoid this, by keeping relatively small data objects that can be easily combined. To effectively achieve this there should be a common keyword or id between different data objects. In our case, this is sname. Thus, we will keep all information regarding the statistical properties of our dataset in runoff_stats.

For example, looking at the large differences between mean and maximum runoff, one might wonder whether these values are also outliers. A quick way to answer this is by creating a histogram.

ggplot(runoff_day, aes(value)) +
  geom_histogram(fill = "#97B8C2") +
  facet_wrap(~sname, scales = 'free') +
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that all histograms44 More about histograms here. have more or less the same shape. The runoff time series appear to have a skewed distribution. It is now obvious that the larger values follows a pattern and, therefore, we should not remove the maximum values as outliers. They are part of the process, even though they are so far away from the mean.55 The statistical measure to estimate this shape feature of the distribution is skewness and can be estimated with skewness function in moments package.

Data Transformation

Descriptive statistics are useful to summarize information, but most of the times this is not enough. Continuous variables, such as river runoff, can have different values, but at the same time be similar. Therefore, we can try to discretize them into meaningful categories.

In our project, we have already estimated runoff means. How can we divide them in three classes; low, medium and high? The obvious answer is to set some criteria, defined as threshold values. Usually, these are relative and subjective, but sometimes quantiles can help.

runoff_stats[, quantile(mean_day)]

##   0%  25%  50%  75% 100% 
##  117 1031 1276 2039 2251

Here, we can use the 25% and 75% quantiles to define low and high values respectively.

runoff_stats_class <- runoff_stats[, .(sname, 
                                       mean_day)]
runoff_stats_class[, runoff_class := factor('low')]
runoff_stats_class[mean_day >= 1000 & mean_day < 2000, runoff_class := factor('medium')]
runoff_stats_class[mean_day >= 2000, runoff_class := factor('high')]

However, in most of the cases we might need to set the thresholds in an arbitrary way.

runoff_stations[, area_class := factor('small')]
runoff_stations[area >= 10000 & area < 130000, area_class := factor('medium')]
runoff_stations[area >= 130000, area_class := factor('large')]

runoff_stations[, alt_class := factor('low')]
runoff_stations[altitude >= 50 & altitude < 400, alt_class := factor('medium')]
runoff_stations[altitude >= 400, alt_class := factor('high')]

We will merge the runoff classes with the rest of information in runoff_summary.

to_merge <- runoff_stats_class[, .(sname, runoff_class)]
runoff_summary <- runoff_stations[to_merge, on = 'sname']

Now, as we look at our data, it is easier to detect some patterns and see their internal structure. For instance, we observe that there are three subsets in our time series. There are stations in low altitude that belong to large catchments and have high runoff, there are the mid ones, which are the majority and there are the ones that are on or near to the mountains, that belong to smaller catchments and have low runoff. This sounds reasonable, so we can move on.

Discretization is one of the available procedures for transforming our data. Another one, quite necessary in our project is aggregation. Our time series are in daily time scale, but we are interested in changes over monthly, seasonal and annual scales. To do this, we can either sum or average over the coarser scales.66 The choice of function, depends on the process. Summing temperature has no physical meaning. However, summing precipitation has.

In our case, we will use the mean.

runoff_day[, year := year(date)]
runoff_day[, month := month(date)]

runoff_month <- runoff_day[, .(value = mean(value)), by = .(month, year, sname)]
runoff_month[, date := as.Date(paste0(year, '-', month, '-1'))]

By aggregating in other time scales new information can become available. For instance, in runoff data there is seasonality. This is due to the climate cycles in precipitation and temperature (evaporation). We can use boxplots to investigate this.77 More about boxplots here.

ggplot(runoff_month, aes(x = factor(month), y = value)) +
  geom_boxplot(fill = colset_4[4]) +
  facet_wrap(~sname, scales = 'free') + 
  theme_bw()

Another option is to estimate and plot the annual or the seasonal means.

runoff_year <- runoff_day[, .(value = mean(value)), by = .(year, sname)]

ggplot(runoff_year[year > 1980], aes(x = year, y = value)) +
  geom_line(col =  colset_4[1]) +
  geom_point(col = colset_4[1]) + 
  facet_wrap(~sname, scales = 'free') +
  theme_minimal()

runoff_day[month == 12 | month == 1 | month == 2, season := 'winter']
runoff_day[month == 3 | month == 4 | month == 5, season := 'spring']
runoff_day[month == 6 | month == 7 | month == 8, season := 'summer']
runoff_day[month == 9 | month == 10 | month == 11, season := 'autumn']
runoff_day[, season := factor(season, levels = c('winter', 'spring', 'summer', 'autumn'))]

runoff_winter <- runoff_day[season == 'winter', .(value = mean(value)), by = .(sname, year)]
runoff_summer <- runoff_day[season == 'summer', .(value = mean(value)), by = .(sname, year)]

We can see that our time series are becoming quite more readable in this scale. We can also observe a drop in runoff in most of the stations after 2000. Could this be the verification of our EDA hypothesis? It is still too early to tell, but we can check whether this shift is common in all time series using a different transformation. We will normalize them by subtracting the mean and dividing by the standard deviation.88 Alternatively, we may use the scale function, which does exactly the same thing.

runoff_year[, value_norm := (value - mean(value)) / sd(value), by = sname]
n_stations <- nrow(runoff_summary)

ggplot(runoff_year[year > 1980], aes(x = year, y = value_norm, col = sname)) +
  geom_line() +
  geom_point() + 
  scale_color_manual(values = colorRampPalette(colset_4)(n_stations)) +
  theme_bw()

By rescaling, we can easily compare the fluctuations of all time series and detect any similarities. As expected, most locations are strongly correlated.

Where Are We?

After all these transformations, we are now ready to start exploring our data. However, before doing so, it is a good moment to make a brief stop and check the spatial properties of the data. A quick way to make this happen is by drawing a map.

library(mapview) 
library(sf)

## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

station_coords_sf <- st_as_sf(runoff_summary, 
                              coords = c("lon", "lat"), 
                              crs = 4326)
mapview(station_coords_sf, map.types = 'Stamen.TerrainBackground')

We observe that the station locations are well-distributed across the river. We can also get a clear idea of their order as water moves from Alps to Baltic. We add this information both to runoff_summary and runoff records.

runoff_summary <- runoff_summary[order(-altitude)]
runoff_summary <- runoff_summary[c(1:15, 17:16)]
runoff_summary <- cbind(order_id = 1:17, runoff_summary)
runoff_summary[, sname := reorder(sname, order_id)]

runoff_day[, sname := factor(sname, levels = runoff_summary$sname)]
runoff_month[, sname := factor(sname, levels = runoff_summary$sname)]
runoff_summer[, sname := factor(sname, levels = runoff_summary$sname)]
runoff_winter[, sname := factor(sname, levels = runoff_summary$sname)]
runoff_year[, sname := factor(sname, levels = runoff_summary$sname)]

Checking the Compass

Since we made this short stop, it is a good opportunity to see another thing. Looking for outliers and missing values is the first line of analyses regarding data robustness. It is a necessary step taken regardless of data type. A more case-specific approach is checking for consistency. This approach relates to the process(es) represented by our data.

In our project, we have river runoff records and some information about the altitude of the measurements and the catchment area they correspond to. We know that larger catchments are expected to have higher runoff.99 This is the information coming from the process.

Now, let’s check if the runoff is consistently higher as catchment size increases.

dt <- runoff_summary[, .(sname, area, alt_class)]
to_plot <- runoff_stats[dt, on = 'sname']

ggplot(to_plot, aes(x = mean_day, y = area, col = sname, cex = alt_class)) +
  geom_point() +
  scale_color_manual(values = colorRampPalette(colset_4)(n_stations)) +
  theme_bw()

It looks good. We used a scatter plot to determine whether there is any correlation between runoff and catchment area. The results match our prior belief about how the system should function. This makes us more confident about the future steps of our analysis.

We can save our new datasets and move forward. Who knows, perhaps land is close.

saveRDS(runoff_summary, './data/runoff_summary.rds')
saveRDS(runoff_stats, './data/runoff_stats.rds')
saveRDS(runoff_day, './data/runoff_day.rds')
saveRDS(runoff_month, './data/runoff_month.rds')
saveRDS(runoff_summer, './data/runoff_summer.rds')
saveRDS(runoff_winter, './data/runoff_winter.rds')
saveRDS(runoff_year, './data/runoff_year.rds')

Code Listings

`03_transformation.R`

library(data.table)
library(ggplot2)
library(mapview) 
library(sf) 

# Load data and set global variables and ---------------------------

colset_4 <-  c("#D35C37", "#BF9A77", "#D6C6B9", "#97B8C2")

runoff_stations <- readRDS('data/runoff_stations.rds')
runoff_day <- readRDS('data/runoff_day.rds')

# Summary statistics -----------------------------------------------

runoff_stats <- runoff_day[, .(mean_day = round(mean(value), 0),
                               sd_day = round(sd(value), 0),
                               min_day = round(min(value), 0),
                               max_day = round(max(value), 0)), by = sname]
head(runoff_stats, 4)

ggplot(runoff_day, aes(value)) +
  geom_histogram(fill = "#97B8C2") +
  facet_wrap(~sname, scales = 'free') +
  theme_bw()

# Discretization ------------------------------------------------------

runoff_stats_class <- runoff_stats[, .(sname, 
                                       mean_day)]
runoff_stats_class[, runoff_class := factor('low')]
runoff_stats_class[mean_day >= 1000 & mean_day < 2000, runoff_class := factor('medium')]
runoff_stats_class[mean_day >= 2000, runoff_class := factor('high')]

runoff_stations[, area_class := factor('small')]
runoff_stations[area >= 10000 & area < 130000, area_class := factor('medium')]
runoff_stations[area >= 130000, area_class := factor('large')]

runoff_stations[, alt_class := factor('low')]
runoff_stations[altitude >= 50 & altitude < 400, alt_class := factor('medium')]
runoff_stations[altitude >= 400, alt_class := factor('high')]

to_merge <- runoff_stats_class[, .(sname, runoff_class)]
runoff_summary <- runoff_stations[to_merge, on = 'sname']

# Aggregation ------------------------------------------------------

runoff_day[, year := year(date)]
runoff_day[, month := month(date)]

runoff_month <- runoff_day[, .(value = mean(value)), by = .(month, year, sname)]
runoff_month[, date := as.Date(paste0(year, '-', month, '-1'))]

ggplot(runoff_month, aes(x = factor(month), y = value)) +
  geom_boxplot(fill = colset_4[4]) +
  facet_wrap(~sname, scales = 'free') + 
  theme_bw()

runoff_year <- runoff_day[, .(value = mean(value)), by = .(year, sname)]

ggplot(runoff_year[year > 1980], aes(x = year, y = value)) +
  geom_line(col =  colset_4[1]) +
  geom_point(col = colset_4[1]) + 
  facet_wrap(~sname, scales = 'free') +
  theme_minimal()

runoff_day[month == 12 | month == 1 | month == 2, season := 'winter']
runoff_day[month == 3 | month == 4 | month == 5, season := 'spring']
runoff_day[month == 6 | month == 7 | month == 8, season := 'summer']
runoff_day[month == 9 | month == 10 | month == 11, season := 'autumn']
runoff_day[, season := factor(season, levels = c('winter', 'spring', 'summer', 'autumn'))]

runoff_year[, value_norm := (value - mean(value)) / sd(value), by = sname]
n_stations <- nrow(runoff_summary)

ggplot(runoff_year[year > 1980], aes(x = year, y = value_norm, col = sname)) +
  geom_line() +
  geom_point() + 
  scale_color_manual(values = colorRampPalette(colset_4)(n_stations)) +
  theme_bw()

runoff_winter <- runoff_day[season == 'winter', 
                            .(value = mean(value)), 
                            by = .(sname, year)]
runoff_summer <- runoff_day[season == 'summer', 
                            .(value = mean(value)), 
                            by = .(sname, year)]

# Map --------------------------------------------------------------------

station_coords_sf <- st_as_sf(runoff_summary, 
                              coords = c("lon", "lat"), 
                              crs = 4326)
mapview(station_coords_sf, map.types = 'Stamen.TerrainBackground')

runoff_summary <- runoff_summary[order(-altitude)]
runoff_summary <- runoff_summary[c(1:15, 17:16)]
runoff_summary <- cbind(order_id = 1:17, runoff_summary)
runoff_summary[, sname := reorder(sname, order_id)]

runoff_day[, sname := factor(sname, levels = runoff_summary$sname)]
runoff_month[, sname := factor(sname, levels = runoff_summary$sname)]
runoff_summer[, sname := factor(sname, levels = runoff_summary$sname)]
runoff_winter[, sname := factor(sname, levels = runoff_summary$sname)]
runoff_year[, sname := factor(sname, levels = runoff_summary$sname)]

# Data consistency ------------------------------------------------------

dt <- runoff_summary[, .(sname, area, alt_class)]
to_plot <- runoff_stats[dt, on = 'sname']

ggplot(to_plot, aes(x = mean_day, y = area, col = sname, cex = alt_class)) +
  geom_point() +
  scale_color_manual(values = colorRampPalette(colset_4)(n_stations)) +
  theme_bw()

# Saving ------------------------------------------------------

saveRDS(runoff_summary, './data/runoff_summary.rds')
saveRDS(runoff_stats, './data/runoff_stats.rds')
saveRDS(runoff_day, './data/runoff_day.rds')
saveRDS(runoff_month, './data/runoff_month.rds')
saveRDS(runoff_summer, './data/runoff_summer.rds')
saveRDS(runoff_winter, './data/runoff_winter.rds')
saveRDS(runoff_year, './data/runoff_year.rds')

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

04 In Open Sea

Exploratory Data Analysis in Geosciences

Yannis Markonis

Initial Questions

Summary Statistics

Data Transformation

Where Are We?

Checking the Compass

Further Reading & Assignments

Assignments

Navigator’s Tasks

Explorer’s Questions

Code Listings

`03_transformation.R`