In the beginning of the analysis, raw data from the Rhine river system were imported, cleaned and prepared for further analysis. The dataset includes information about river discharge stations such as location, altitude, and catchment area, as well as daily runoff measurements. Missing values and inconsistent formats were also handled.
library(data.table)
library(ggplot2)
runoff_stations <- readRDS("data/runoff_stations.rds")
runoff_day <- readRDS("data/runoff_day.rds")
head(runoff_day)
## id sname date value year month season
## <fctr> <fctr> <Date> <num> <int> <int> <fctr>
## 1: 6335020 REES 1814-11-01 845 1814 11 autumn
## 2: 6335020 REES 1814-11-02 828 1814 11 autumn
## 3: 6335020 REES 1814-11-03 812 1814 11 autumn
## 4: 6335020 REES 1814-11-04 796 1814 11 autumn
## 5: 6335020 REES 1814-11-05 768 1814 11 autumn
## 6: 6335020 REES 1814-11-06 796 1814 11 autumn
head(runoff_stations)
## sname id station country lat lon area altitude size
## <fctr> <fctr> <char> <char> <num> <num> <int> <num> <int>
## 1: REES 6335020 REES DE 51.757 6.395 159300 8 73841
## 2: DUES 6335050 DUESSELDORF DE 51.226 6.770 147680 24 42430
## 3: KOEL 6335060 KOELN DE 50.937 6.963 144232 35 73110
## 4: ANDE 6335070 ANDERNACH DE 50.443 7.392 139549 51 31473
## 5: KAUB 6335100 KAUB DE 50.085 7.765 103488 68 31473
## 6: MAIN 6335150 MAINZ DE 50.004 8.275 98206 78 31473
## missing start end
## <num> <int> <int>
## 1: 0.068 1814 2016
## 2: 0.000 1900 2016
## 3: 0.000 1816 2016
## 4: 0.000 1930 2016
## 5: 0.000 1930 2016
## 6: 0.000 1930 2016
knitr::kable(
head(runoff_day[, .(id, sname, date, value, year)])
)
| id | sname | date | value | year |
|---|---|---|---|---|
| 6335020 | REES | 1814-11-01 | 845 | 1814 |
| 6335020 | REES | 1814-11-02 | 828 | 1814 |
| 6335020 | REES | 1814-11-03 | 812 | 1814 |
| 6335020 | REES | 1814-11-04 | 796 | 1814 |
| 6335020 | REES | 1814-11-05 | 768 | 1814 |
| 6335020 | REES | 1814-11-06 | 796 | 1814 |
head(runoff_stations)
## sname id station country lat lon area altitude size
## <fctr> <fctr> <char> <char> <num> <num> <int> <num> <int>
## 1: REES 6335020 REES DE 51.757 6.395 159300 8 73841
## 2: DUES 6335050 DUESSELDORF DE 51.226 6.770 147680 24 42430
## 3: KOEL 6335060 KOELN DE 50.937 6.963 144232 35 73110
## 4: ANDE 6335070 ANDERNACH DE 50.443 7.392 139549 51 31473
## 5: KAUB 6335100 KAUB DE 50.085 7.765 103488 68 31473
## 6: MAIN 6335150 MAINZ DE 50.004 8.275 98206 78 31473
## missing start end
## <num> <int> <int>
## 1: 0.068 1814 2016
## 2: 0.000 1900 2016
## 3: 0.000 1816 2016
## 4: 0.000 1930 2016
## 5: 0.000 1930 2016
## 6: 0.000 1930 2016
To better understand the structure of the dataset, we first examined the variables, their types, and basic statistical properties. This step helps us to gain an overview of the data before deeper analysis.
str(runoff_stations)
## Classes 'data.table' and 'data.frame': 17 obs. of 12 variables:
## $ sname : Factor w/ 20 levels "ANDE","BASE",..: 15 7 10 1 9 12 19 20 13 17 ...
## $ id : Factor w/ 20 levels "6335020","6335050",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ station : chr "REES" "DUESSELDORF" "KOELN" "ANDERNACH" ...
## $ country : chr "DE" "DE" "DE" "DE" ...
## $ lat : num 51.8 51.2 50.9 50.4 50.1 ...
## $ lon : num 6.39 6.77 6.96 7.39 7.76 ...
## $ area : int 159300 147680 144232 139549 103488 98206 53131 68827 50196 34550 ...
## $ altitude: num 8 24 35 51 68 78 89 84 98 260 ...
## $ size : int 73841 42430 73110 31473 31473 31473 24168 31777 35064 31473 ...
## $ missing : num 0.068 0 0 0 0 0 0 0 0.003 0 ...
## $ start : int 1814 1900 1816 1930 1930 1930 1950 1930 1921 1930 ...
## $ end : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## - attr(*, ".internal.selfref")=<externalptr>
table(runoff_stations$country)
##
## CH DE NL
## 6 10 1
apply(X = runoff_stations, MARGIN = 2, FUN = table)
## $sname
##
## ANDE BASR DIER DOMA DUES KAUB KOEL LOBI MAIN MAXA NEUF REES REKI RHEI RHEM SPEY
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## WORM
## 1
##
## $id
##
## 6335020 6335050 6335060 6335070 6335100 6335150 6335170 6335180 6335200 6335400
## 1 1 1 1 1 1 1 1 1 1
## 6435060 6935051 6935053 6935054 6935055 6935145 6935500
## 1 1 1 1 1 1 1
##
## $station
##
## ANDERNACH BASEL, RHEINHALLE
## 1 1
## DIEPOLDSAU, RIETBRUECKE DOMAT/EMS
## 1 1
## DUESSELDORF KAUB
## 1 1
## KOELN LOBITH
## 1 1
## MAINZ MAXAU
## 1 1
## NEUHAUSEN, FLURLINGERBRUECKE REES
## 1 1
## REKINGEN RHEINFELDEN
## 1 1
## RHEINFELDEN, MESSSTATION SPEYER
## 1 1
## WORMS
## 1
##
## $country
##
## CH DE NL
## 6 10 1
##
## $lat
##
## 46.838 47.383 47.559 47.561 47.570 47.682 49.039 49.324 49.641 50.004 50.085
## 1 1 1 2 1 1 1 1 1 1 1
## 50.443 50.937 51.226 51.757 51.840
## 1 1 1 1 1
##
## $lon
##
## 6.110 6.395 6.770 6.963 7.392 7.617 7.765 7.800 8.275 8.306 8.330 8.376 8.449
## 1 1 1 1 1 1 1 2 1 1 1 1 1
## 8.626 9.456 9.641
## 1 1 1
##
## $area
##
## 3229 6119 11887 14718 34526 34550 35897 50196 53131 68827 98206
## 1 1 1 1 1 1 1 1 1 1 1
## 103488 139549 144232 147680 159300 160800
## 1 1 1 1 1 1
##
## $altitude
##
## 8 9 24 35 51 68 78 84 89 98 260 294 310 370 430 456 623
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $size
##
## 24168 30681 31473 31777 35064 35795 41274 42430 42734 43099 54056 73110 73841
## 1 1 4 1 1 1 2 1 1 1 1 1 1
##
## $missing
##
## 0.000 0.003 0.025 0.068
## 14 1 1 1
##
## $start
##
## 1814 1816 1869 1899 1900 1901 1904 1919 1921 1930 1933 1950
## 1 1 1 1 1 1 2 1 1 5 1 1
##
## $end
##
## 2016 2017
## 16 1
runoff_stats <- runoff_day[, .(mean_day = round(mean(value), 0),
sd_day = round(sd(value), 0),
min_day = round(min(value), 0),
max_day = round(max(value), 0)), by = sname]
head(runoff_stats, 4)
## sname mean_day sd_day min_day max_day
## <fctr> <num> <num> <num> <num>
## 1: REES 2251 1112 500 11700
## 2: DUES 2126 1078 464 11000
## 3: KOEL 2086 1039 401 10900
## 4: ANDE 2039 1057 560 10400
To compare runoff statistics across stations, the data was reshaped from wide to long (tidy) format. This allows all four measures — mean, minimum, maximum, and standard deviation — to be visualized together in a single scatter plot, with each statistic represented by a different color and shape.
runoff_tidy <- melt(
runoff_stats,
id.vars = "sname",
measure.vars = c("mean_day", "min_day", "max_day", "sd_day"),
variable.name = "stat",
value.name = "runoff"
)
colset_4 <- c("#ff7f0e", "#2ca02c", "#d62728", "#1f77b4")
ggplot(runoff_tidy, aes(x = sname, y = runoff, color = stat, shape = stat)) +
geom_point(size = 3) +
scale_color_manual(values = colset_4) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To better understand the statistical structure of runoff across stations, we compute two key descriptive measures: skewness and coefficient of variation. Skewness is used to evaluate the symmetry of the distribution. Positive skewness indicates that extreme high runoff values occur more frequently than low values, which is typical for hydrological processes influenced by extreme precipitation or snowmelt events. The coefficient of variation (CV) is used to assess relative variability. It standardizes the standard deviation with respect to the mean, allowing comparison of variability across stations with different magnitude levels.
library(moments)
# Skewness
skew_vals <- runoff_day[, .(skewness_day = round(skewness(value), 3)), by = sname]
to_merge_skew <- skew_vals[, .(sname, skewness_day)]
runoff_stats <- runoff_stats[to_merge_skew, on = 'sname']
saveRDS(skew_vals, './results/assignments/skew_vals.rds')
saveRDS(to_merge_skew, './results/assignments/to_merge_skew.rds')
saveRDS(runoff_stats, './results/assignments/runoff_stats.rds')
#coefficient of variation
runoff_stats[, cv_day := round(sd_day / mean_day, 3)]
#data.table
runoff_extra_stats <- runoff_day[, .(
skewness_day = round(skewness(value), 3),
cv_day = round(sd(value) / mean(value), 3)
), by = sname]
saveRDS(runoff_extra_stats, './results/assignments/runoff_extra_stats.rds')
All stations show positive skewness, confirming that runoff distributions are right-skewed — extreme flood events dominate the upper tail. The upstream Alpine stations DOMA (skewness = 2.19, CV = 0.83) and DIER (1.88, 0.73) show the highest variability, driven by irregular snowmelt. Mid-Rhine stations cluster around skewness ~0.95–0.98 and CV ~0.42, reflecting the smoothing effect of a larger catchment area.
The skewness values already hinted at the presence of extreme values in the data. To make this more visible, we now plot the daily runoff distribution for each station using boxplots.
n_stations <- length(unique(runoff_day$sname))
ggplot(runoff_day, aes(x = sname, y = value, fill = sname)) +
geom_boxplot() +
scale_fill_manual(values = colorRampPalette(colset_4)(n_stations)) +
theme_bw()
What do outliers values show us? All stations have outliers only on the upper side (high values). The number of outliers is very high Why do you think this happens? Runoff distribution is positively skewed — we observed this in both the histograms and the skewness calculations. Boxplot determines outliers based on IQR, and in positively skewed distributions the upper boundary is frequently exceeded . Therefore these are not actual errors, but the result of extreme natural events such as heavy rainfall and snow melt.
This exploratory analysis of the Rhine river system revealed consistent patterns across stations. Runoff distributions are universally right-skewed, driven by extreme precipitation and snowmelt events. Upstream Alpine stations show the highest variability, while mid-Rhine stations reflect the stabilizing effect of larger catchment areas. These findings provide a solid foundation for further hydroclimatic investigation.