Introduction

In the beginning of the analysis, raw data from the Rhine river system were imported, cleaned and prepared for further analysis. The dataset includes information about river discharge stations such as location, altitude, and catchment area, as well as daily runoff measurements. Missing values and inconsistent formats were also handled.

library(data.table)
library(ggplot2)

runoff_stations <- readRDS("data/runoff_stations.rds")
runoff_day <- readRDS("data/runoff_day.rds")

head(runoff_day)
##         id  sname       date value  year month season
##     <fctr> <fctr>     <Date> <num> <int> <int> <fctr>
## 1: 6335020   REES 1814-11-01   845  1814    11 autumn
## 2: 6335020   REES 1814-11-02   828  1814    11 autumn
## 3: 6335020   REES 1814-11-03   812  1814    11 autumn
## 4: 6335020   REES 1814-11-04   796  1814    11 autumn
## 5: 6335020   REES 1814-11-05   768  1814    11 autumn
## 6: 6335020   REES 1814-11-06   796  1814    11 autumn
head(runoff_stations)
##     sname      id     station country    lat   lon   area altitude  size
##    <fctr>  <fctr>      <char>  <char>  <num> <num>  <int>    <num> <int>
## 1:   REES 6335020        REES      DE 51.757 6.395 159300        8 73841
## 2:   DUES 6335050 DUESSELDORF      DE 51.226 6.770 147680       24 42430
## 3:   KOEL 6335060       KOELN      DE 50.937 6.963 144232       35 73110
## 4:   ANDE 6335070   ANDERNACH      DE 50.443 7.392 139549       51 31473
## 5:   KAUB 6335100        KAUB      DE 50.085 7.765 103488       68 31473
## 6:   MAIN 6335150       MAINZ      DE 50.004 8.275  98206       78 31473
##    missing start   end
##      <num> <int> <int>
## 1:   0.068  1814  2016
## 2:   0.000  1900  2016
## 3:   0.000  1816  2016
## 4:   0.000  1930  2016
## 5:   0.000  1930  2016
## 6:   0.000  1930  2016

Quick look at data

knitr::kable(
  head(runoff_day[, .(id, sname, date, value, year)])
)
id sname date value year
6335020 REES 1814-11-01 845 1814
6335020 REES 1814-11-02 828 1814
6335020 REES 1814-11-03 812 1814
6335020 REES 1814-11-04 796 1814
6335020 REES 1814-11-05 768 1814
6335020 REES 1814-11-06 796 1814
head(runoff_stations)
##     sname      id     station country    lat   lon   area altitude  size
##    <fctr>  <fctr>      <char>  <char>  <num> <num>  <int>    <num> <int>
## 1:   REES 6335020        REES      DE 51.757 6.395 159300        8 73841
## 2:   DUES 6335050 DUESSELDORF      DE 51.226 6.770 147680       24 42430
## 3:   KOEL 6335060       KOELN      DE 50.937 6.963 144232       35 73110
## 4:   ANDE 6335070   ANDERNACH      DE 50.443 7.392 139549       51 31473
## 5:   KAUB 6335100        KAUB      DE 50.085 7.765 103488       68 31473
## 6:   MAIN 6335150       MAINZ      DE 50.004 8.275  98206       78 31473
##    missing start   end
##      <num> <int> <int>
## 1:   0.068  1814  2016
## 2:   0.000  1900  2016
## 3:   0.000  1816  2016
## 4:   0.000  1930  2016
## 5:   0.000  1930  2016
## 6:   0.000  1930  2016

Data exploration and summary statistics

To better understand the structure of the dataset, we first examined the variables, their types, and basic statistical properties. This step helps us to gain an overview of the data before deeper analysis.

str(runoff_stations)
## Classes 'data.table' and 'data.frame':   17 obs. of  12 variables:
##  $ sname   : Factor w/ 20 levels "ANDE","BASE",..: 15 7 10 1 9 12 19 20 13 17 ...
##  $ id      : Factor w/ 20 levels "6335020","6335050",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ station : chr  "REES" "DUESSELDORF" "KOELN" "ANDERNACH" ...
##  $ country : chr  "DE" "DE" "DE" "DE" ...
##  $ lat     : num  51.8 51.2 50.9 50.4 50.1 ...
##  $ lon     : num  6.39 6.77 6.96 7.39 7.76 ...
##  $ area    : int  159300 147680 144232 139549 103488 98206 53131 68827 50196 34550 ...
##  $ altitude: num  8 24 35 51 68 78 89 84 98 260 ...
##  $ size    : int  73841 42430 73110 31473 31473 31473 24168 31777 35064 31473 ...
##  $ missing : num  0.068 0 0 0 0 0 0 0 0.003 0 ...
##  $ start   : int  1814 1900 1816 1930 1930 1930 1950 1930 1921 1930 ...
##  $ end     : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  - attr(*, ".internal.selfref")=<externalptr>
table(runoff_stations$country)
## 
## CH DE NL 
##  6 10  1
apply(X = runoff_stations, MARGIN = 2, FUN = table)
## $sname
## 
## ANDE BASR DIER DOMA DUES KAUB KOEL LOBI MAIN MAXA NEUF REES REKI RHEI RHEM SPEY 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## WORM 
##    1 
## 
## $id
## 
## 6335020 6335050 6335060 6335070 6335100 6335150 6335170 6335180 6335200 6335400 
##       1       1       1       1       1       1       1       1       1       1 
## 6435060 6935051 6935053 6935054 6935055 6935145 6935500 
##       1       1       1       1       1       1       1 
## 
## $station
## 
##                    ANDERNACH            BASEL, RHEINHALLE 
##                            1                            1 
##      DIEPOLDSAU, RIETBRUECKE                    DOMAT/EMS 
##                            1                            1 
##                  DUESSELDORF                         KAUB 
##                            1                            1 
##                        KOELN                       LOBITH 
##                            1                            1 
##                        MAINZ                        MAXAU 
##                            1                            1 
## NEUHAUSEN, FLURLINGERBRUECKE                         REES 
##                            1                            1 
##                     REKINGEN                  RHEINFELDEN 
##                            1                            1 
##     RHEINFELDEN, MESSSTATION                       SPEYER 
##                            1                            1 
##                        WORMS 
##                            1 
## 
## $country
## 
## CH DE NL 
##  6 10  1 
## 
## $lat
## 
## 46.838 47.383 47.559 47.561 47.570 47.682 49.039 49.324 49.641 50.004 50.085 
##      1      1      1      2      1      1      1      1      1      1      1 
## 50.443 50.937 51.226 51.757 51.840 
##      1      1      1      1      1 
## 
## $lon
## 
## 6.110 6.395 6.770 6.963 7.392 7.617 7.765 7.800 8.275 8.306 8.330 8.376 8.449 
##     1     1     1     1     1     1     1     2     1     1     1     1     1 
## 8.626 9.456 9.641 
##     1     1     1 
## 
## $area
## 
##   3229   6119  11887  14718  34526  34550  35897  50196  53131  68827  98206 
##      1      1      1      1      1      1      1      1      1      1      1 
## 103488 139549 144232 147680 159300 160800 
##      1      1      1      1      1      1 
## 
## $altitude
## 
##   8   9  24  35  51  68  78  84  89  98 260 294 310 370 430 456 623 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 
## $size
## 
## 24168 30681 31473 31777 35064 35795 41274 42430 42734 43099 54056 73110 73841 
##     1     1     4     1     1     1     2     1     1     1     1     1     1 
## 
## $missing
## 
## 0.000 0.003 0.025 0.068 
##    14     1     1     1 
## 
## $start
## 
## 1814 1816 1869 1899 1900 1901 1904 1919 1921 1930 1933 1950 
##    1    1    1    1    1    1    2    1    1    5    1    1 
## 
## $end
## 
## 2016 2017 
##   16    1
runoff_stats <- runoff_day[, .(mean_day = round(mean(value), 0),
                               sd_day = round(sd(value), 0),
                               min_day = round(min(value), 0),
                               max_day = round(max(value), 0)), by = sname]

head(runoff_stats, 4)
##     sname mean_day sd_day min_day max_day
##    <fctr>    <num>  <num>   <num>   <num>
## 1:   REES     2251   1112     500   11700
## 2:   DUES     2126   1078     464   11000
## 3:   KOEL     2086   1039     401   10900
## 4:   ANDE     2039   1057     560   10400

Runoff statistics across stations

To compare runoff statistics across stations, the data was reshaped from wide to long (tidy) format. This allows all four measures — mean, minimum, maximum, and standard deviation — to be visualized together in a single scatter plot, with each statistic represented by a different color and shape.

runoff_tidy <- melt(
  runoff_stats,
  id.vars = "sname",
  measure.vars = c("mean_day", "min_day", "max_day", "sd_day"),
  variable.name = "stat",
  value.name = "runoff"
)


colset_4 <- c("#ff7f0e", "#2ca02c", "#d62728", "#1f77b4")


ggplot(runoff_tidy, aes(x = sname, y = runoff, color = stat, shape = stat)) +
  geom_point(size = 3) +
  scale_color_manual(values = colset_4) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Distribution shape and variability of runoff

To better understand the statistical structure of runoff across stations, we compute two key descriptive measures: skewness and coefficient of variation. Skewness is used to evaluate the symmetry of the distribution. Positive skewness indicates that extreme high runoff values occur more frequently than low values, which is typical for hydrological processes influenced by extreme precipitation or snowmelt events. The coefficient of variation (CV) is used to assess relative variability. It standardizes the standard deviation with respect to the mean, allowing comparison of variability across stations with different magnitude levels.

library(moments)

# Skewness
skew_vals <- runoff_day[, .(skewness_day = round(skewness(value), 3)), by = sname]
to_merge_skew <- skew_vals[, .(sname, skewness_day)]
runoff_stats <- runoff_stats[to_merge_skew, on = 'sname']

saveRDS(skew_vals, './results/assignments/skew_vals.rds')
saveRDS(to_merge_skew, './results/assignments/to_merge_skew.rds')
saveRDS(runoff_stats, './results/assignments/runoff_stats.rds')



#coefficient of variation
runoff_stats[, cv_day := round(sd_day / mean_day, 3)]


#data.table

runoff_extra_stats <- runoff_day[, .(
  skewness_day = round(skewness(value), 3),
  cv_day       = round(sd(value) / mean(value), 3)
), by = sname]

saveRDS(runoff_extra_stats, './results/assignments/runoff_extra_stats.rds')

Interpretation

All stations show positive skewness, confirming that runoff distributions are right-skewed — extreme flood events dominate the upper tail. The upstream Alpine stations DOMA (skewness = 2.19, CV = 0.83) and DIER (1.88, 0.73) show the highest variability, driven by irregular snowmelt. Mid-Rhine stations cluster around skewness ~0.95–0.98 and CV ~0.42, reflecting the smoothing effect of a larger catchment area.

Outliers in Daily Runoff

The skewness values already hinted at the presence of extreme values in the data. To make this more visible, we now plot the daily runoff distribution for each station using boxplots.

n_stations <- length(unique(runoff_day$sname))
ggplot(runoff_day, aes(x = sname, y = value, fill = sname)) +
  geom_boxplot() +
  scale_fill_manual(values = colorRampPalette(colset_4)(n_stations)) +
  theme_bw()

Interpretation

What do outliers values show us? All stations have outliers only on the upper side (high values). The number of outliers is very high Why do you think this happens? Runoff distribution is positively skewed — we observed this in both the histograms and the skewness calculations. Boxplot determines outliers based on IQR, and in positively skewed distributions the upper boundary is frequently exceeded . Therefore these are not actual errors, but the result of extreme natural events such as heavy rainfall and snow melt.

Conclusion

This exploratory analysis of the Rhine river system revealed consistent patterns across stations. Runoff distributions are universally right-skewed, driven by extreme precipitation and snowmelt events. Upstream Alpine stations show the highest variability, while mid-Rhine stations reflect the stabilizing effect of larger catchment areas. These findings provide a solid foundation for further hydroclimatic investigation.