So Far, So Good

So far, we have imported and cleared the data, investigated for missing values and outliers, and made all necessary transformations needed for answering the questions related to our analysis hypothesis. During these steps we estimated some descriptive statistics that helped us get a better grasp of our data. In addition, we checked for consistency and now we are quite confident that our data are robust and appropriate for giving some answers to our quest.

It’s the moment that land emerges in the horizon.

library(data.table)
library(ggplot2)

runoff_summary <- readRDS('data/runoff_summary.rds')
runoff_stats <- readRDS('data/runoff_stats.rds')
runoff_month <- readRDS('data/runoff_month.rds')
runoff_summer <- readRDS('data/runoff_summer.rds')
runoff_winter <- readRDS('data/runoff_winter.rds')
runoff_year <- readRDS('data/runoff_year.rds')

We have already discussed that we should start our script with loading all the necessary libraries and datasets. It is also quite common to set some variables that will be used in different parts of the code. For example, the graphical parameters, i.e., colors and formatting.

colset_4 <-  c("#D35C37", "#BF9A77", "#D6C6B9", "#97B8C2")
theme_set(theme_bw())

If these lines code increase in size and start to repeat, then we can save them in a separate file and load them with the source function.

Landing on the Right Beach

Now that we reach closer to our destination and some features of the unknown land start to appear, a decision has to be made. Which are the best shores to disembark?

We cannot explore everything. We need to find the ones that are the most representative. In EDA, this is especially crucial when we handle large volumes of data. One of our best options is to classify our records to some groups that have common features.

There are many ways to achieve this. However, in this project, we will subjectively determine similar groups according to the statistical metrics we have estimated plus one new that is widely-used in time series analysis. This is linear cross-correlation or just correlation.

Correlation11 Correlation is often denoted by the correlation coefficient \(\rho(x, y)\), which is estimated by \[\rho(x, y) = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sigma_x\sigma_y}\] \(\mu\) is the mean and \(\sigma\) the standard deviation. The denominator product of the two spreads will constrain the correlation coefficient within the interval [-1, 1]. is a dimensionless measure of how two variables vary together, or “co-vary”.

A correlation coefficient of 1 indicates exact positive linear association
A correlation coefficient of 0 indicates no linear association at all
A correlation coefficient of -1 indicates exact negative linear association (anti-correlation).

We have already discussed briefly the correlation between river runoff and catchment size. Estimating the correlation coefficient is quite straightforward.

dt <- runoff_summary[, .(sname, area)]
for_cor <- runoff_stats[dt, on = 'sname']
cor(for_cor$mean_day, for_cor$area)

Or alternatively, we can use matrix format.

for_cor_mat <- for_cor[, c('mean_day', 'area')]
cor(for_cor_mat)

which creates the correlation matrix between the variables. This is quite handy when we are dealing with more than two variables.

We can now estimate and plot the correlation matrix for monthly or annual river runoff and detect some groups with similar behaviour.

runoff_month_mat <- dcast(runoff_month, date~sname)
runoff_month_cor <- cor(runoff_month_mat[, -1], use = "pairwise.complete.obs")
to_plot <- melt(runoff_month_cor)

ggplot(data = to_plot, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile(col = 'black') +
  scale_fill_gradient2(low = colset_4[4], 
                       high = colset_4[1], 
                       mid = colset_4[3],
                       midpoint = 0.5,
                       limits = c(-0.1, 1)) +
  geom_text(aes(label = round(value, 1))) +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab(label = "") +
  ylab(label = "")

As expected, river runoff from adjacent stations is strongly correlated and as we move further away the correlation fades out.

To determine similar groups, we have to combine all the information gained so far both about the location characteristics and the runoff properties. All the steps taken up to this point can help.

Starting from the stations, we have already made some intuitive classifications for catchment area and altitude and linked them with runoff22 There are stations in low altitude, that belong to large catchments and have high runoff, there are the mid ones, which are the majority and there are the ones that are on or near to the mountains, that belong to smaller catchments and have low runoff.. We now examine how these groups associate with the map and the correlation matrix.

Groups do not necessarily have the same size. For instance, we might want to keep DOMA and DIER in a separate group as they are in Alps area, located before a large water reservoir that definitely impacts runoff at the succeeding stations.

At the far end of the river we could keep the high runoff stations together. However, if we look at the monthly distribution of runoff, we see that there is a distinct difference in runoff seasonality before and after MAIN.

The number of groups is another, usually arbitrary choice we have to make. Again, it depends on data homogeneity, the process investigated and the analysis hypothesis. In our case, three seem reasonable.

runoff_summary[, category := 'downstream']
runoff_summary$category[3:10] <- ('mid')
runoff_summary$category[1:2] <- ('upstream')
runoff_summary[, category := factor(category, levels = c('upstream', 'mid', 'downstream'))]

We can now redraw our previous plots and see if the classification is meaningful. As there is no such thing as perfect classification, we have always to make some compromises.

runoff_month_mean <- runoff_month[, .(value = mean(value)), .(month, sname)]
to_plot <- runoff_month[runoff_summary[, .(sname, category)], on = 'sname']

ggplot(to_plot, aes(x = factor(month), y = value, fill = category, group = month)) +
  geom_boxplot() +
  facet_wrap(~sname, scales = 'free') +
  scale_fill_manual(values = colset_4[c(1, 2, 4)]) +
  xlab(label = "Month") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

Classification can help us lower down the volume of data, by focusing in representative records of each group. An additional benefit, is that we now need far less computational strength. In our project, we will pick the longer records, trying to avoid the ones with missing values if possible.

key_stations <- c('DOMA', 'BASR', 'KOEL')

runoff_summary_key <- runoff_summary[sname %in% key_stations]
runoff_month_key <- runoff_month[sname %in% key_stations]
runoff_winter_key <- runoff_winter[sname %in% key_stations]
runoff_summer_key <- runoff_summer[sname %in% key_stations]
runoff_year_key <- runoff_year[sname %in% key_stations]

ggplot(runoff_year_key[year > 1950], aes(x = year, y = value_norm, col = sname)) +
  geom_line() +
  geom_point() + 
  scale_color_manual(values = colset_4[c(1, 2, 4)]) +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)")

We can save our script as 04_classification.R, as well as our representative stations.

saveRDS(runoff_summary, './data/runoff_summary.rds')
saveRDS(runoff_summary_key, './data/runoff_summary_key.rds')
saveRDS(runoff_month_key, './data/runoff_month_key.rds')
saveRDS(runoff_winter_key, './data/runoff_winter_key.rds')
saveRDS(runoff_summer_key, './data/runoff_summer_key.rds')
saveRDS(runoff_year_key, './data/runoff_year_key.rds')

In the Jungle

Now that we have selected where we will disembark, we are ready to get off board and explore the new land.

Let’s create a new script and see whether Middelkoop and his colleagues were right back to 2000.

library(data.table)
library(ggplot2)

runoff_summary <- readRDS('data/runoff_summary.rds')
runoff_summary_key <- readRDS('data/runoff_summary_key.rds')
runoff_stats <- readRDS('data/runoff_stats.rds')
runoff_month_key <- readRDS('data/runoff_month_key.rds')
runoff_summer_key <- readRDS('data/runoff_summer_key.rds')
runoff_winter_key <- readRDS('data/runoff_winter_key.rds')
runoff_summer <- readRDS('data/runoff_summer.rds')
runoff_winter <- readRDS('data/runoff_winter.rds')
runoff_year_key <- readRDS('data/runoff_year_key.rds')

colset_4 <-  c("#D35C37", "#BF9A77", "#D6C6B9", "#97B8C2")
theme_set(theme_bw())

Due to our preparation strategy, not much effort is needed to examine this. We can start by simply comparing the statistics of values before and after 2000.

year_thres <- 2000

runoff_winter_key[year < year_thres, period := factor('pre_2000')]
runoff_winter_key[year >= year_thres, period := factor('aft_2000')]
runoff_summer_key[year < year_thres, period := factor('pre_2000')]
runoff_summer_key[year >= year_thres, period := factor('aft_2000')]

to_plot <- rbind(cbind(runoff_winter_key, season = factor('winter')), 
                       cbind(runoff_summer_key, season = factor('summer'))) 

ggplot(to_plot, aes(season, value, fill = period)) +
  geom_boxplot() +
  facet_wrap(~sname, scales = 'free_y') +
  scale_fill_manual(values = colset_4[c(4, 1)]) +
  xlab(label = "Season") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

There is a clear shift in the median and the 25-75% range after 2000 according to the hypothesis of Middelkoop et al. Winter runoff has increased, while summer runoff has decreased in all locations. Changes are more profound in smaller, upstream catchments (DOMA) than the lowland regions near the river mouth (KOEL).

However, we have take under consideration that we compare the last 17 years to 100-183 depending on the location. What if we compare the period after 2000, with the 17 years before 2000.

ggplot(to_plot[year >= 1983], aes(season, value, fill = period)) +
  geom_boxplot() +
  facet_wrap(~sname, scales = 'free_y') +
  scale_fill_manual(values = colset_4[c(4, 1)]) +
  xlab(label = "Season") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

Most of the changes now disappear. The results actually suggest that the shift in runoff came earlier, and the only significant changes since 2000 are in winter precipitation downstream. A new question now appears: When did the change in runoff actually occurred?

To answer this question, we need to examine our records in full length. The best way to achieve improve our visibility, is to find some higher ground.33 Direct comparison of the smaller dataset (after 2000) with the larger one is also feasible, by using appropriate statistical tests or stochastic simulation. We will discuss these approaches in detail in the next semester.

The Path to the Slope

Back in the previous chapter, we have used a scatter plot to show how catchment area is associated with runoff. Then, in this chapter, we have quantified the association with correlation coefficient. Now, we will move one step further. We will estimate the slope.

By estimating the slope, we create an empirical relationship between the two variables. Thus, if only one variable is known, we can derive the other one.

There are several ways to estimate the slope. The most common is a linear least squares fit. A “linear fit” is a line intended to model the relationship between variables.

Thus, our model has a very strong assumption; that is linearity. It can be formulated as \[ y = \alpha x + \beta\] where \(x, y\) are the two variables \(\alpha\) the slope and \(\beta\) the intercept. In ggplot, we can easily fit a line with function geom_smooth. In the arguments, we set the method to lm for linearity and formula to y ~ x to define the relationship between the variables.

ggplot(to_plot, aes(x = mean_day, y = area)) +
  geom_point(aes(col = category), cex = 3) +
  geom_smooth(method = 'lm', formula = y ~ x, se = 0, col = colset_4[1]) +
  scale_color_manual(values = colset_4[c(2, 3, 4)]) +
  xlab(label = "Area (km3)") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

Our linear model suggests that 50 km3 approximately correspond to runoff of 1 m3/s. However, we can also see that most of our measurements do not lie on the fitted line. We could perhaps try to use different models, for different scales.

ggplot() +
  geom_point(data = to_plot, aes(x = mean_day, y = area, col = category), cex = 3) +
  geom_smooth(data = to_plot[c(1:7)], aes(x = mean_day, y = area), 
              method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
    geom_smooth(data = to_plot[c(8:11)], aes(x = mean_day, y = area), 
              method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
  geom_smooth(data = to_plot[c(12:17)], aes(x = mean_day, y = area), 
              method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
  scale_color_manual(values = colset_4[c(2, 3, 4)]) +
  xlab(label = "Area (km3)") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

The fitted lines look significantly better, but now instead of one general, linear model we have three local ones. More effort is required to decide how many linear models should be used, as well as deciding the range of each model. Additionally, these decisions are usually arbitrary.

An alternative approach is to use a different form of relationship. For instance, a polynomial curve44 A polynomial fit is denoted as \[y = \alpha_1 x^2 + \alpha_2 x + \beta\] where \(\alpha_1, \alpha_2\) are the slope coefficients., looks quite promising in our case.

ggplot(to_plot, aes(x = mean_day, y = area)) +
  geom_point(aes(col = category), cex = 3) +
  geom_smooth(method = 'lm', formula = y ~ poly(x, 2), se = 0, col = colset_4[1]) +
  scale_color_manual(values = colset_4[c(2, 3, 4)]) +
  xlab(label = "Area (km3)") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

Finally, an approach that combines local slopes with polynomial fitting is the loess method. An acronym for locally estimated scatterplot smoothing, loess fits simple models to localized subsets of the data.

ggplot(to_plot, aes(x = mean_day, y = area)) +
  geom_point(aes(col = category), cex = 3) +
  geom_smooth(method = 'loess', formula = y ~ x, se = 0, col = colset_4[1]) +
  scale_color_manual(values = colset_4[c(2, 3, 4)]) +
  xlab(label = "Area (km3)") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

All these approaches are regression methods and they are also widely applied in time series analysis. In time series, \(x\) is time and \(y\) is a variable that changes over time. In our case, it is river runoff.

The use of linear slopes is quite common in Geosciences. However, the majority of the implementations are misleading. The reason is that the linearity assumption is rarely satisfied in nature.

Let’s see how this is reflected in our data by comparing linear and loess regression.

ggplot(runoff_summer_key, aes(x = year, y = value)) +
  geom_line(col = colset_4[3])+
  geom_point(col = colset_4[3])+
  facet_wrap(~sname, scales = 'free') +
  geom_smooth(method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
  geom_smooth(method = 'loess', formula = y~x, se = 0, col = colset_4[4]) +
  scale_color_manual(values = colset_4[c(1, 2, 4)]) +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

The linear fit suggests that all three time series have been monotonically decreasing with different rates each. The most abrupt shift is observed at DOMA, where precipitation declines at approximately 10 m3/s per decade.

Loess regression, though, provides a different picture. Runoff at DOMA declined between 1930 – 1980 and since then has remained rather stable. BASR and KOEL there has been a short increase between 1960 – 1970 and a steady decrease since then. This explains our earlier boxplot results.

ggplot(runoff_winter_key, aes(x = year, y = value)) +
  geom_line(col = colset_4[3])+
  geom_point(col = colset_4[3])+
  facet_wrap(~sname, scales = 'free') +
  geom_smooth(method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
  geom_smooth(method = 'loess', formula = y~x, se = 0, col = colset_4[4]) +
  scale_color_manual(values = colset_4[c(1, 2, 4)]) +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

In winter runoff, the discrepancies are becoming even more profound. The most striking discovery is the strange behaviour of runoff at DOMA. In addition, although runoff increased during 1950–1980, there has been a steady decrease since then in all locations.

By checking a smaller number of representative records, we gained some insight on where to focus. To make our final assessment on the analysis hypothesis, we will examine the periods 1950 – 1980 to 1980 – today.

Back to the Ships

So, are the changes predicted by Middelkoop and colleagues occurring since 2000?

Our exploratory data analysis led us to the following plots:

runoff_winter[, value_norm := scale(value), sname]
runoff_summer[, value_norm := scale(value), sname]
n_stations <- nrow(runoff_summary)

ggplot(runoff_winter[year > 1950], aes(x = year, y = value_norm, col = sname)) +
  geom_smooth(method = 'loess', formula = y~x, se = 0) + 
  scale_color_manual(values = colorRampPalette(colset_4)(n_stations)) +
  ggtitle('Winter runoff') +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

ggplot(runoff_summer[year > 1950], aes(x = year, y = value_norm, col = sname)) +
  geom_smooth(method = 'loess', formula = y~x, se = 0) + 
  scale_color_manual(values = colorRampPalette(colset_4)(n_stations)) +
    ggtitle('Summer runoff') +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

year_thres <- 1980
to_plot <- rbind(cbind(runoff_winter, season = factor('winter')), 
                 cbind(runoff_summer, season = factor('summer'))) 

to_plot[year < year_thres, period := factor('1950-1980')]
to_plot[year >= year_thres, period := factor('1981-2016')]
to_plot[year < year_thres, period := factor('1950-1980')]
to_plot[year >= year_thres, period := factor('1981-2016')]

to_plot <- to_plot[year >= 1950]

ggplot(to_plot, aes(season, value, fill = period)) +
  geom_boxplot() +
  facet_wrap(~sname, scales = 'free_y') +
  scale_fill_manual(values = colset_4[c(4, 1)]) +
  xlab(label = "Season") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

What do you think? Do we have a clear answer?

We save our script as 05_comparison.R and prepare our story during our return back home.

Code Listings

`04_Classification.R`

library(data.table)
library(ggplot2)

runoff_summary <- readRDS('data/runoff_summary.rds')
runoff_stats <- readRDS('data/runoff_stats.rds')
runoff_month <- readRDS('data/runoff_month.rds')
runoff_summer <- readRDS('data/runoff_summer.rds')
runoff_winter <- readRDS('data/runoff_winter.rds')
runoff_year <- readRDS('data/runoff_year.rds')

colset_4 <-  c("#D35C37", "#BF9A77", "#D6C6B9", "#97B8C2")
theme_set(theme_bw())

dt <- runoff_summary[, .(sname, area)]
for_cor <- runoff_stats[dt, on = 'sname']
cor(for_cor$mean_day, for_cor$area)


for_cor_mat <- for_cor[, c('mean_day', 'area')]
cor(for_cor_mat)

runoff_month_mat <- dcast(runoff_month, date~sname)
runoff_month_cor <- cor(runoff_month_mat[, -1], use = "pairwise.complete.obs")
to_plot <- melt(runoff_month_cor)

ggplot(data = to_plot, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile(col = 'black') +
  scale_fill_gradient2(low = colset_4[4], 
                       high = colset_4[1], 
                       mid = colset_4[3],
                       midpoint = 0.5,
                       limits = c(-0.1, 1)) +
  geom_text(aes(label = round(value, 1))) +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab(label = "") +
  ylab(label = "")

runoff_summary[, category := 'downstream']
runoff_summary$category[3:10] <- ('mid')
runoff_summary$category[1:2] <- ('upstream')
runoff_summary[, category := factor(category, levels = c('upstream', 'mid', 'downstream'))]

runoff_month_mean <- runoff_month[, .(value = mean(value)), .(month, sname)]
to_plot <- runoff_month[runoff_summary[, .(sname, category)], on = 'sname']

ggplot(to_plot, aes(x = factor(month), y = value, fill = category, group = month)) +
  geom_boxplot() +
  facet_wrap(~sname, scales = 'free') +
  scale_fill_manual(values = colset_4[c(1, 2, 4)]) +
  xlab(label = "Month") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

key_stations <- c('DOMA', 'BASR', 'KOEL')

runoff_summary_key <- runoff_summary[sname %in% key_stations]
runoff_month_key <- runoff_month[sname %in% key_stations]
runoff_winter_key <- runoff_winter[sname %in% key_stations]
runoff_summer_key <- runoff_summer[sname %in% key_stations]
runoff_year_key <- runoff_year[sname %in% key_stations]

ggplot(runoff_year_key[year > 1950], aes(x = year, y = value_norm, col = sname)) +
  geom_line() +
  geom_point() + 
  scale_color_manual(values = colset_4[c(1, 2, 4)]) +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)")

saveRDS(runoff_summary, './data/runoff_summary.rds')
saveRDS(runoff_summary_key, './data/runoff_summary_key.rds')
saveRDS(runoff_month_key, './data/runoff_month_key.rds')
saveRDS(runoff_winter_key, './data/runoff_winter_key.rds')
saveRDS(runoff_summer_key, './data/runoff_summer_key.rds')
saveRDS(runoff_year_key, './data/runoff_year_key.rds')

`05_Comparison.R`

library(data.table)
library(ggplot2)

runoff_summary <- readRDS('data/runoff_summary.rds')
runoff_summary_key <- readRDS('data/runoff_summary_key.rds')
runoff_stats <- readRDS('data/runoff_stats.rds')
runoff_month_key <- readRDS('data/runoff_month_key.rds')
runoff_summer_key <- readRDS('data/runoff_summer_key.rds')
runoff_winter_key <- readRDS('data/runoff_winter_key.rds')
runoff_year_key <- readRDS('data/runoff_year_key.rds')
runoff_summer <- readRDS('data/runoff_summer.rds')
runoff_winter <- readRDS('data/runoff_winter.rds')

colset_4 <-  c("#D35C37", "#BF9A77", "#D6C6B9", "#97B8C2")
theme_set(theme_bw())

year_thres <- 2000

runoff_winter_key[year < year_thres, period := factor('pre_2000')]
runoff_winter_key[year >= year_thres, period := factor('aft_2000')]
runoff_summer_key[year < year_thres, period := factor('pre_2000')]
runoff_summer_key[year >= year_thres, period := factor('aft_2000')]

to_plot <- rbind(cbind(runoff_winter_key, season = factor('winter')), 
                 cbind(runoff_summer_key, season = factor('summer'))) 

ggplot(to_plot, aes(season, value, fill = period)) +
  geom_boxplot() +
  facet_wrap(~sname, scales = 'free_y') +
  scale_fill_manual(values = colset_4[c(4, 1)]) +
  xlab(label = "Season") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

ggplot(to_plot[year >= 1983], aes(season, value, fill = period)) +
  geom_boxplot() +
  facet_wrap(~sname, scales = 'free_y') +
  scale_fill_manual(values = colset_4[c(4, 1)]) +
  xlab(label = "Season") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

dt <- runoff_summary[, .(sname, area, category)]
to_plot <- runoff_stats[dt, on = 'sname']

ggplot(to_plot, aes(x = mean_day, y = area, col = category)) +
  geom_point(cex = 3) +
  scale_color_manual(values = colset_4[c(2, 3, 4)]) +
  xlab(label = "Area (km3)") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

ggplot(to_plot, aes(x = mean_day, y = area)) +
  geom_point(aes(col = category), cex = 3) +
  geom_smooth(method = 'lm', formula = y ~ x, se = 0, col = colset_4[1]) +
  scale_color_manual(values = colset_4[c(2, 3, 4)]) +
  xlab(label = "Area (km3)") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

ggplot() +
  geom_point(data = to_plot, aes(x = mean_day, y = area, col = category), cex = 3) +
  geom_smooth(data = to_plot[c(1:7)], aes(x = mean_day, y = area), 
              method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
  geom_smooth(data = to_plot[c(8:11)], aes(x = mean_day, y = area), 
              method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
  geom_smooth(data = to_plot[c(12:17)], aes(x = mean_day, y = area), 
              method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
  scale_color_manual(values = colset_4[c(2, 3, 4)]) +
  xlab(label = "Area (km3)") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

ggplot(runoff_summer_key, aes(x = year, y = value)) +
  geom_line(col = colset_4[3])+
  geom_point(col = colset_4[3])+
  facet_wrap(~sname, scales = 'free') +
  geom_smooth(method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
  geom_smooth(method = 'loess', formula = y~x, se = 0, col = colset_4[4]) +
  scale_color_manual(values = colset_4[c(1, 2, 4)]) +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

ggplot(runoff_winter_key, aes(x = year, y = value)) +
  geom_line(col = colset_4[3])+
  geom_point(col = colset_4[3])+
  facet_wrap(~sname, scales = 'free') +
  geom_smooth(method = 'lm', formula = y~x, se = 0, col = colset_4[1]) +
  geom_smooth(method = 'loess', formula = y~x, se = 0, col = colset_4[4]) +
  scale_color_manual(values = colset_4[c(1, 2, 4)]) +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

runoff_winter[, value_norm := scale(value), sname]
runoff_summer[, value_norm := scale(value), sname]
n_stations <- nrow(runoff_summary)

ggplot(runoff_winter[year > 1950], aes(x = year, y = value_norm, col = sname)) +
  geom_smooth(method = 'loess', formula = y~x, se = 0) + 
  scale_color_manual(values = colorRampPalette(colset_4)(n_stations)) +
  ggtitle('Winter runoff') +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

ggplot(runoff_summer[year > 1950], aes(x = year, y = value_norm, col = sname)) +
  geom_smooth(method = 'loess', formula = y~x, se = 0) + 
  scale_color_manual(values = colorRampPalette(colset_4)(n_stations)) +
  ggtitle('Summer runoff') +
  xlab(label = "Year") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

year_thres <- 1980
to_plot <- rbind(cbind(runoff_winter, season = factor('winter')), 
                 cbind(runoff_summer, season = factor('summer'))) 

to_plot[year < year_thres, period := factor('1950-1980')]
to_plot[year >= year_thres, period := factor('1981-2016')]
to_plot[year < year_thres, period := factor('1950-1980')]
to_plot[year >= year_thres, period := factor('1981-2016')]

to_plot <- to_plot[year >= 1950]

ggplot(to_plot, aes(season, value, fill = period)) +
  geom_boxplot() +
  facet_wrap(~sname, scales = 'free_y') +
  scale_fill_manual(values = colset_4[c(4, 1)]) +
  xlab(label = "Season") +
  ylab(label = "Runoff (m3/s)") +
  theme_bw()

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

05 Land Ho!

Exploratory Data Analysis in Geosciences

Yannis Markonis

So Far, So Good

Landing on the Right Beach

In the Jungle

The Path to the Slope

Back to the Ships

Further Reading & Assignments

Assignments

Navigator’s Tasks

Explorer’s questions

Code Listings

`04_Classification.R`

`05_Comparison.R`