Analysis and modeling of snowmelt data for the prediction of Glacier lake outburst flood
Arpan Dutta (MD2305)
Endterm Project Presentation
For every \(d\geq 4\), a d dimentional copula \(\mathcal{C} : [0,1]^d\rightarrow[0,1]\) is a multivariate distribution function on \([0,1]^d\) with uniformly distributed marginals.
Let the random vector \((X_1, X_2,\ldots, X_d )\) exist with marginal cdf’s \(F_i(x) = P(X_i\leq x)\). Using the probability integral transformation the vector \((U_1, \ldots,U_d )\) = \((F_1(x_1),\ldots ,F_d(x_d))\) has uniform marginal, Then the copula of \((X1, \ldots, X_d )\) is the joint cdf of \((U_1,\ldots, U_d )\). Namely, it is \(\mathcal{C}_\theta(u_1,\ldots,u_d) = P[U_1\leq u_1,\ldots,U_d\leq u_d ]\). where \(\theta\) is the parameter values of copula.
it has some properties:
Consider a d dimensional CDF, F, with pdf f and with marginals \(F_1, F_2, \ldots, F_d\) Then there exists a copula, \(\mathcal{C}\), such that,
\[F(x_1,x_2,\ldots,x_d)= \mathcal{C}_\theta(F_1(x_1), F_1(x_2),\ldots,F_1(x_d))\] For all \(x_i\in\mathbb{R}\) and \(i = 1,2,\ldots,d\)
If \(F_i\) is continuous for all \(i = 1,2,3,\ldots, d\), then \(\mathcal{C}\) is unique. If any distribution is discrete, then C is uniquely defined on the support of the joint distribution.
On the otherhand, consider a copula, \(\mathcal C\) and univariate CDF’s \(F_1,F_2,\ldots,F_d\). Then \(F\) as defined as before is a multivariate CDF with marginals \(F_1,F_2,\ldots,F_d\).
The joint density function is obtained by,
\[ \begin{aligned} f(x_1,..., x_d) &= \frac{\partial ^d F(x_1, ..., x_d)}{\partial x_1 \cdots \partial x_d} \\ &= \frac{\partial^d C_\theta(F_1(x_1),...,F_d(x_d))}{\partial x_1 \cdots \partial x_d} \\ &= \frac{\partial ^d C_\theta (F_1(x_1),...,F_d(x_d))}{\partial F_1(x_1) \cdots \partial F_d(x_d)} \cdot \frac{\partial F_1(x_1)}{\partial x_1} \cdots \frac{\partial F_d(x_d)}{\partial x_d} \\ &= c_\theta (F_1(x_1),...,F_d(x_d)) \cdot f_1(x_1) \cdots f_d(x_d) \end{aligned} \] Where \(c_\theta\) is multivariate density of the copula.
\[ f_{\mathbf{Y}}(\mathbf{y}) = f_Y(y_1, \dots, y_d) = c_Y(F_{Y_1} (y_1), \dots, F_{Y_d} (y_d)) \prod_{j=1}^{d} f_{Y_j} (y_j). \]
\[ \begin{aligned} log L(\theta_1, ..., \theta_d, \theta_C) &= \log \prod_{i=1}^{n} f_Y(y_i) \\ &= \sum_{i=1}^{n} \Bigg( \log c_Y \Big( F_{Y_1}(y_{i,1} \mid \theta_1), \ldots, F_{Y_d}(y_{i,d} \mid \theta_d) \mid \theta_C \Big) \quad \\ &+ \log f_{Y_1}(y_{i,1} \mid \theta_1) + \dots + \log f_{Y_d}(y_{i,d} \mid \theta_d) \Bigg) \end{aligned} \]
The dataset is sourced from NASA Giovanni with a daily temporal resolution and \(0.01^\circ\) spatial resolution. It combines static topographical variables derived from DEM data at a 30m spatial resolution and dynamic climatic variables that vary over time. The dataset includes Snow Melt data (kg \(m^{-2} s^{-1}\)) recorded from 15th March 2022 to 15th June 2022 (93 days). The data was processed in QGIS, resulting in 2025 spatial points for model analysis.
| Static Variables | Dynamic Variables |
|---|---|
| Elevation | Snow Melt |
| Slope | Specific Humidity |
| Aspect | Surface Air Temperature |
| Surface Radiative Temperature | |
| Surface Pressure |
Scatterplot of variables on 50 th day
| Variable | Assumed Distribution | Skewness | Variable | Assumed Distribution | Skewness |
|---|---|---|---|---|---|
| Snowmelt | Exponential | 1.36 | RadTemp | Weibull | -0.63 |
| Humidity | Normal | 0.16 | Elevation | Gamma | 0.49 |
| Air Temp | Weibull | -0.39 | Slope | Normal | -0.30 |
| Pressure | Normal | -0.24 | Aspect | Normal | 0.11 |
We will examine two elliptical copulas Gaussian copula and the t-copula, while the Archimedean copulas are the Joe copula, the Clayton copula, and the Gumbel copula.
Fitting the data
The table bellow shows the AIC values for the given copula fittted for the data. The copula with least AIC values are considered as the best fit copula for the data.
| Day | Normal | t | Joe | Clayton | Gumbel |
|---|---|---|---|---|---|
| 10 | -22836.39 | -22917.27 | -12.6755 | -611.2854 | -212.1033 |
| 20 | -22099.68 | -22105.65 | -8.9146 | -558.2436 | -170.5951 |
| 30 | -32569.72 | -32734.67 | -60.6516 | -784.6610 | -309.8161 |
| 40 | -27376.37 | -26612.30 | -33.2967 | -654.6281 | -231.0704 |
| 50 | -24961.45 | -25031.70 | -67.2194 | -1136.8968 | -465.9788 |
| 60 | -19693.16 | -19725.62 | -57.3981 | -997.0749 | -409.1935 |
| 70 | -24042.41 | -24129.75 | -63.2418 | -976.6298 | -405.6140 |
| 80 | - | - | -63.8057 | -1030.9356 | -428.8872 |
| 90 | -23313.31 | -24218.83 | -69.2516 | -1205.9109 | -491.2298 |
| Variable Combinations | AIC |
|---|---|
| humidity, Air Temp, pressure, radtemp, Elevation, slope, aspect | -25419.45 |
| humidity, Air Temp, pressure, radtemp, Elevation, aspect | -25210.43 |
| humidity, Air Temp, pressure, radtemp, Elevation, slope | -25181.98 |
| humidity, Air Temp, pressure, radtemp, Elevation | -24977.63 |
| humidity, Air Temp, pressure, radtemp, slope, aspect | -22852.87 |
| humidity, Air Temp, pressure, radtemp, aspect | -22741.06 |
| humidity, Air Temp, pressure, radtemp, slope | -22676.84 |
| humidity, Air Temp, pressure, radtemp | -22562.03 |
| humidity, Air Temp, pressure, Elevation, slope, aspect | -18986.30 |
| humidity, Air Temp, pressure, Elevation, slope | -18823.07 |
| humidity, Air Temp, pressure, Elevation, aspect | -18784.57 |
| Air Temp, pressure, radtemp, Elevation, slope, aspect | -18722.30 |
| humidity, Air Temp, pressure, Elevation | -18629.39 |
| Air Temp, pressure, radtemp, Elevation, aspect | -18527.43 |
| Air Temp, pressure, radtemp, Elevation, slope | -18484.03 |
We first fit an Ordinary Least Squares (OLS) model to predict snowmelt. The figure below shows the predicted versus actual snowmelt values for Day 60.
Actual vs. Predicted Snowmelt for Day 60 Using Standard OLS
\[ \hat{x}_d = E[X_d \mid X_1 = x_1, \ldots, X_{d-1} = x_{d-1}] = \int_0^\infty x_d f(x_d \mid x_1, \ldots, x_{d-1}, R) \, dx_d \]
The copula-based approach involves two key steps:
Modeling Marginal Distributions: For each variable \(X_i\) we assumed specific marginal distributions.
Modeling Dependence with a Copula:
The dependence structure is defined by a correlation matrix \(\mathbf{R}\) estimated from the data.
Rank-order correlations (e.g., Spearman’s \(\rho\), Kendall’s \(\tau\)) are used because they are invariant to marginal distributions
We convert these rank-order correlations into Pearson correlations using transformations.
\[ r_{ij} = \sin\left(\frac{\pi \tau_{ij}}{2}\right). \]
\[ r_{ij} = 2 \sin\left(\frac{\pi \rho_{ij}}{6}\right). \]
\[ f(x_1, \ldots, x_d \mid \mathbf{R}^*) = f_1(x_1) \times \cdots \times f_d(x_d) \times c_N(F_1(x_1), \ldots, F_d(x_d) \mid \mathbf{R}^*). \]
\[ f(x_d \mid x_1 \ldots, x_{d-1}, \mathbf{R}^*) = f_d(x_d) \times \frac{ c_N(F_1(x_1), \ldots, F_d(x_d) \mid \mathbf{R}^*)}{c_N(F_1(x_1), \ldots, F_{d-1}(x_{d-1}) \mid \mathbf{R}^*)} \]
\[ \phi^{(d)}(y_1, \ldots, y_d \mid \mathbf{R}) = \frac{1}{(2\pi)^{d/2} |\mathbf{R}|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{y}^{\mathrm{T}} \mathbf{R}^{-1} \mathbf{y}\right), \]
\[ c_N\left[\Phi(y_1), \ldots, \Phi(y_d) \mid \mathbf{R}^*\right] = \frac{\phi^{(d)}(y_1, \ldots, y_d \mid \mathbf{R})}{\phi(y_1) \times \cdots \times \phi(y_d)}. \]
\[ \begin{align} &\frac{c_N(F_1(x_1), \ldots, F_d(x_d) \mid \mathbf{R}^*)}{c_N(F_1(x_1), \ldots, F_{d-1}(x_{d-1}) \mid \mathbf{R}^*)} \notag \\ &= \frac{\phi^{(d)}\left( \Phi^{-1}[F_1(x_1)], \ldots, \Phi^{-1}[F_d(x_d)] \mid \mathbf{R} \right)}{\phi\left( \Phi^{-1}[F_d(x_d)] \right) \times \phi^{(d-1)}\left( \Phi^{-1}[F_1(x_1)], \ldots, \Phi^{-1}[F_{d-1}(x_{d-1})] \mid \mathbf{R}_{d-1} \right)} \notag \\ &= \exp \left\{ -\frac{1}{2} \left[ \frac{\left( \Phi^{-1}[F_d(x_d)] - \mathbf{r}^{\mathrm{T}} \mathbf{R}_{d-1}^{-1} \mathbf{y}_{d-1} \right)^2}{1 - \mathbf{r}^{\mathrm{T}} \mathbf{R}_{d-1}^{-1} \mathbf{r}} - \left( \Phi^{-1}[F_d(x_d)] \right)^2 \right] \right\} \times \left( 1 - \mathbf{r}^{\mathrm{T}} \mathbf{R}_{d-1}^{-1} \mathbf{r} \right)^{-1/2}. \end{align} \]
\[ \begin{aligned} f(x_d \mid x_1, \ldots, x_{d-1}, \mathbf{R}^*) &= f_d(x_d) \times \exp \left\{ -0.5 \left[ \frac{\left( \Phi^{-1}[F_d(x_d)] - \mathbf{r}^{\mathrm{T}} \mathbf{R}_{d-1}^{-1} \mathbf{y}_{d-1} \right)^2}{1 - \mathbf{r}^{\mathrm{T}} \mathbf{R}_{d-1}^{-1} \mathbf{r}} - \left( \Phi^{-1}[F_d(x_d)] \right)^2 \right] \right\} \\ &\quad \times \left( 1 - \mathbf{r}^{\mathrm{T}} \mathbf{R}_{d-1}^{-1} \mathbf{r} \right)^{-1/2}. \end{aligned} \]
\[ \hat{x}_d = E[X_d \mid X_1 = x_1, \ldots, X_{d-1} = x_{d-1}] = \int_0^\infty x_d f(x_d \mid x_1, \ldots, x_{d-1}, R, \nu) \, dx_d \]
We use numerical integration to evaluate this for each test data point.
Actual vs. Predicted Snowmelt for Day 60 Using Gaussian Copula (Kendall’s Tau)
Actual vs. Predicted Snowmelt for Day 60 Using Gaussian Copula (Spearman’s Rho)
\[ f(x_1, \ldots, x_d \mid R, \nu) = \prod_{i=1}^d f_i(x_i) \times c_t(F_1(x_1), \ldots, F_d(x_d) \mid R, \nu) \]
\[ f(x_d \mid x_1, \ldots, x_{d-1}, R, \nu) = f_d(x_d) \times \frac{c_t(F_1(x_1), \ldots, F_d(x_d) \mid R, \nu)}{c_t(F_1(x_1), \ldots, F_{d-1}(x_{d-1}) \mid R, \nu)} \]
\[ f_t^{(d)}(\mathbf{y} \mid R, \nu) = \frac{\Gamma\left(\frac{\nu + d}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) (\nu \pi)^{d/2} |R|^{1/2}} \left(1 + \frac{1}{\nu} \mathbf{y}^\top R^{-1} \mathbf{y}\right)^{-\frac{\nu + d}{2}} \]
\[ c_t(u_1, \ldots, u_d \mid R, \nu) = \frac{f_t^{(d)}(y_1, \ldots, y_d \mid R, \nu)}{\prod_{i=1}^d f_{t,\nu}(y_i)} \]
where \(y_i = t_\nu^{-1}(u_i)\), and the univariate t-density is:
\[ f_{t,\nu}(y_i) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) \sqrt{\nu \pi}} \left(1 + \frac{y_i^2}{\nu}\right)^{-\frac{\nu + 1}{2}} \]
\[ \frac{c_t(F_1(x_1), \ldots, F_d(x_d) \mid R, \nu)}{c_t(F_1(x_1), \ldots, F_{d-1}(x_{d-1}) \mid R, \nu)} = \frac{f_t^{(d)}(y_1, \ldots, y_d \mid R, \nu)}{f_t^{(d-1)}(y_1, \ldots, y_{d-1} \mid R_{d-1}, \nu) \times f_{t,\nu}(y_d)} \]
Partition: \(\mathbf{y} = (\mathbf{y}_{d-1}, y_d)\), \(R = \begin{bmatrix} R_{d-1} & \mathbf{r} \\ \mathbf{r}^\top & 1 \end{bmatrix}\).
The conditional density is:
\[ f(y_d \mid \mathbf{y}_{d-1}, \nu) = \frac{f_t^{(d)}(\mathbf{y} \mid R, \nu)}{f_t^{(d-1)}(\mathbf{y}_{d-1} \mid R_{d-1}, \nu)} \]
This is a univariate t-distribution with:
Conditional density of \(X_d\):
\[ f(x_d \mid x_1, \ldots, x_{d-1}, R, \nu) = f_d(x_d) \cdot \frac{f_{t,\nu + d - 1}\left( \frac{y_d - \mu_{\text{cond}}}{\sigma_{\text{cond}}} \right) / \sigma_{\text{cond}}}{f_{t,\nu}(y_d)} \]
where \(y_d = t_\nu^{-1}(F_d(x_d))\).
\[ \hat{x}_d = E[X_d \mid X_1 = x_1, \ldots, X_{d-1} = x_{d-1}] = \int_0^\infty x_d f(x_d \mid x_1, \ldots, x_{d-1}, R, \nu) \, dx_d \]
We use numerical integration to evaluate this for each test data point.
Actual vs. Predicted Snowmelt for Day 60 Using t-Copula
The table below compares the Sum of Squared Errors (SSE) for the four methods across the evaluated days. Lower SSE values indicate better predictive performance.
| Day | t-Copula | Gaussian Copula ( \(\tau\)) |
Gaussian Copula ( \(\rho\)) |
GLM (log-link) |
Standard OLS |
|---|---|---|---|---|---|
| 10 | \(1.121\times10^{3}\) | \(1.178\times10^{3}\) | \(1.176\times10^{3}\) | \(1.255\times10^{5}\) | \(2.099\times10^{26}\) |
| 20 | \(7.005\times10^{3}\) | \(6.613\times10^{3}\) | \(6.613\times10^{3}\) | \(6.711\times10^{5}\) | \(1.007\times10^{27}\) |
| 30 | \(9.343\times10^{4}\) | \(1.778\times10^{5}\) | \(1.757\times10^{5}\) | \(1.744\times10^{7}\) | \(9.893\times10^{27}\) |
| 40 | \(5.160\times10^{4}\) | \(7.277\times10^{4}\) | \(7.244\times10^{4}\) | \(7.488\times10^{6}\) | \(7.921\times10^{27}\) |
| 50 | \(3.442\times10^{5}\) | \(1.832\times10^{6}\) | \(1.948\times10^{6}\) | \(2.054\times10^{8}\) | \(1.941\times10^{27}\) |
| 60 | \(2.004\times10^{5}\) | \(1.321\times10^{6}\) | \(1.261\times10^{6}\) | \(1.113\times10^{8}\) | \(1.986\times10^{28}\) |
| 70 | \(2.812\times10^{5}\) | \(8.783\times10^{5}\) | \(8.833\times10^{5}\) | \(9.004\times10^{7}\) | \(2.587\times10^{28}\) |
| 80 | \(5.043\times10^{5}\) | \(1.866\times10^{6}\) | \(1.754\times10^{6}\) | \(1.718\times10^{8}\) | \(8.879\times10^{28}\) |
| 90 | \(2.909\times10^{6}\) | \(4.897\times10^{6}\) | \(4.094\times10^{6}\) | \(3.933\times10^{8}\) | \(4.317\times10^{28}\) |
Correlations between variables (excluding snowmelt) remain relatively stable across days, the correlation between snowmelt and other variables increases progressively
We averaged the correlation matrices over 93 days, capturing the overall dependency structure effectively.
We selected t-copula model for prediction due to its consistently lower SSE compared to other methods.
Parameters are the degrees of freedom, \(\nu = 4\) and the average dependency matrix.
The t-copula model proved most effective for detecting large snowmelt events, outperforming GLM and OLS by accurately capturing tail dependencies and complex variable interactions critical for GLOF risk assessment.
We have data of 93 days(15 th March - 15 th June, 2022), the fitted t‑copula may not capture correlation patterns in other seasons.
Maina Samuel C., Mwigereri D., Weyn J., Mackey L., Ochieng M. (2023).”Evaluation of Dependency Structure for Multivariate Weather Predictors Using Copulas.” Available at: https://doi.org/10.1145/3616384
Parsa, Rahul A., Stuart A. Klugman. (2011). “Copula Regression.” Available at: https://variancejournal.org/article/84912
Nelsen Roger B. (2006). An Introduction to Copulas (2nd ed.), Springer Series in Statistics.
Hofert M., Kojadinovic I., Mächler M.,Yan J. (2018), Elements of Copula Modeling with R., Springer International Publishing AG
Haugh M., (2016). An introduction to copulas, IEOR E4602: Quantitative Risk Management (Spring 2016)
Joe, Harry (2014). . CRC Press.
Salvadori, G., De Michele, C., Kilsby, C. G., Chandler, R. E. (2007). “Estimating Return Periods of Extreme Hydrological Events via the Maxima of Stochastic Processes Selected by a Peaks-Over-Threshold Condition on Another Process.” , 43(11).
Genest, C., Favre, A. C. (2007). “Everything You Always Wanted to Know About Copula Modeling but Were Afraid to Ask.” , 12(4), 347–368.