Analysis and modeling of snowmelt data for the prediction of Glacier lake outburst flood

Arpan Dutta (MD2305)

Under the Supervision of Prof. Sarbani Palit

Midterm Project Presentation

Introduction

  • Glacial Lake Outburst Floods (GLOFs) occur when rapid snowmelt weakens natural lake dams, leading to sudden flooding.
  • This study examines how climatic factors (temperature, humidity, pressure) and topographical features (elevation, slope, aspect) influence snowmelt.
  • Copula regression is used to model complex dependencies between these factors, improving snowmelt prediction.
  • The goal is to develop a more accurate predictive model for flood risk assessment and disaster preparedness.

Basic Theory of Copulas

Definition (Copula)

For every \(d\geq 4\), a d dimentional copula \(\mathcal{C} : [0,1]^d\rightarrow[0,1]\) is a multivariate distribution function on \([0,1]^d\) with uniformly distributed marginals.

Let the random vector \((X_1, X_2,\ldots, X_d )\) exist with marginal cdf’s \(F_i(x) = P(X_i\leq x)\). Using the probability integral transformation the vector \((U_1, \ldots,U_d )\) = \((F_1(x_1),\ldots ,F_d(x_d))\) has uniform marginal, Then the copula of \((X1, \ldots, X_d )\) is the joint cdf of \((U_1,\ldots, U_d )\). Namely, it is \(\mathcal{C}_\theta(u_1,\ldots,u_d) = P[U_1\leq u_1,\ldots,U_d\leq u_d ]\). where \(\theta\) is the parameter values of copula.

it has some properties:

  • \(C(x_1, \ldots, x_d)\) is non-decreasing in each component \(x_i\).
  • \(C(1,\ldots,1, x_i,1,\ldots,1) = x_i\) for all \(i \in \{1,...,n\}\), \(x_i \in [0,1]\).
  • For \(a_i \leq b_i\), the probability \(P[U_1 \in [a_1, b_1] ,...,U_d \in [a_d,b_d]]\) must be non-negetive.

Sklar’s Theorem (1959)

Consider a d dimensional CDF, F, with pdf f and with marginals \(F_1, F_2, \ldots, F_d\) Then there exists a copula, \(\mathcal{C}\), such that,

\[F(x_1,x_2,\ldots,x_d)= \mathcal{C}_\theta(F_1(x_1), F_1(x_2),\ldots,F_1(x_d))\] For all \(x_i\in\mathbb{R}\) and \(i = 1,2,\ldots,d\)

If \(F_i\) is continuous for all \(i = 1,2,3,\ldots, d\), then \(\mathcal{C}\) is unique. If any distribution is discrete, then C is uniquely defined on the support of the joint distribution.

On the otherhand, consider a copula, \(\mathcal C\) and univariate CDF’s \(F_1,F_2,\ldots,F_d\). Then \(F\) as defined as before is a multivariate CDF with marginals \(F_1,F_2,\ldots,F_d\).

Continued

The joint density function is obtained by,

\[ \begin{aligned} f(x_1,..., x_d) &= \frac{\partial ^d F(x_1, ..., x_d)}{\partial x_1 \cdots \partial x_d} \\ &= \frac{\partial^d C_\theta(F_1(x_1),...,F_d(x_d))}{\partial x_1 \cdots \partial x_d} \\ &= \frac{\partial ^d C_\theta (F_1(x_1),...,F_d(x_d))}{\partial F_1(x_1) \cdots \partial F_d(x_d)} \cdot \frac{\partial F_1(x_1)}{\partial x_1} \cdots \frac{\partial F_d(x_d)}{\partial x_d} \\ &= c_\theta (F_1(x_1),...,F_d(x_d)) \cdot f_1(x_1) \cdots f_d(x_d) \end{aligned} \] Where \(c_\theta\) is multivariate density of the copula.

Estimating Copulas (Pseudo-Maximum Likelihood Estimation)

Suppose, \(\mathbf{Y} = (Y_1,\ldots,Y_d)\) is a random vector, and we have parametric models \(F_{Y_1} (\cdot \mid \theta_1), \dots, F_{Y_d} (\cdot \mid \theta_d)\) for the marginal CDFs. We assume that we also have a parametric model \(c_Y(\cdot \mid \theta_C)\) for the copula density of \(\mathbf{Y}\). We see that the density of \(\mathbf{Y}\) is given by:

\[ f_{\mathbf{Y}}(\mathbf{y}) = f_Y(y_1, \dots, y_d) = c_Y(F_{Y_1} (y_1), \dots, F_{Y_d} (y_d)) \prod_{j=1}^{d} f_{Y_j} (y_j). \]

Estimating Copulas (Continued)

We then obtain the log-likelihood as,

\[ \begin{aligned} log L(\theta_1, ..., \theta_d, \theta_C) &= \log \prod_{i=1}^{n} f_Y(y_i) \\ &= \sum_{i=1}^{n} \Bigg( \log c_Y \Big( F_{Y_1}(y_{i,1} \mid \theta_1), \ldots, F_{Y_d}(y_{i,d} \mid \theta_d) \mid \theta_C \Big) \quad \\ &+ \log f_{Y_1}(y_{i,1} \mid \theta_1) + \dots + \log f_{Y_d}(y_{i,d} \mid \theta_d) \Bigg) \end{aligned} \]

Estimating Copulas (Continued)

It has two steps:

Step 1 First estimate the marginal CDFs to obtain \(\hat{F}_{y_j}\) for \(j = 1, \dots, d\). We can do this using either:

  • The empirical CDF of \(Y_{1,j}, \dots, Y_{n,j}\) so that

\[\hat{F}_{Y_j}(y) = \frac{\sum_{i=1}^{n} \mathbf{1}\{ y_{i,j} \leq y \}}{n}\]

  • A parametric model with \(\hat{\theta}_j\) obtained using the usual univariate MLE approach.

Step 2 Then estimate the copula parameters \(\theta_C\) by maximizing

\[ \sum_{i=1}^{n} \log \left[ c_Y \left( \hat{F}_{Y_1}(y_{i,1}), \dots, \hat{F}_{Y_d}(y_{i,d}) \mid \theta_C \right) \right] \]

Steps to Estimate the Joint Distribution

  • Estimate the marginal distributions.

  • Estimate the copula family using dependence structure estimation (Pseudo-Maximum likelihood method) and check goodness of fit using AIC on the joint distribution.

Multicollinearity

Multicollinearity occurs when the independent variables in a regression model are correlated with each other. Generally VIF (Variance inflation factor) is used to test the presence of multicollinearity in a regression model. Suppose our respond variable is \(Y\) and covariates are \(X_1, X_2, \ldots,X_p\). The formula of VIF for the \(i^{th}\) covariate is, \[ VIF_i = \frac{1}{1-{R_i}^2}\hspace{.5cm}i = 1,2,\ldots,p. \] where, \({R_i^2}\) is the coefficient of determination when we regress \(X_i\) by the other predictor variables. A rule of thumb is that, if \(VIF_i > 10\) then we suspect presence of multicollinearity in our model.

Data Description

The dataset is sourced from NASA Giovanni with a daily temporal resolution and \(0.01^\circ\) spatial resolution. It combines static topographical variables derived from DEM data at a 30m spatial resolution and dynamic climatic variables that vary over time. The dataset includes Snow Melt data (kg \(m^{-2} s^{-1}\)) recorded from 15th March 2022 to 15th June 2022 (93 days). The data was processed in QGIS, resulting in 2025 spatial points for model analysis.

Variables Overview

Static Variables Dynamic Variables
Elevation Snow Melt
Slope Specific Humidity
Aspect Surface Air Temperature
Surface Radiative Temperature
Surface Pressure

Exploratory Data Analysis

Histogram

Snowmelt

Specific Humidity

Surface Air Temperature

Surface Pressure

Radiative Temprature

Histogram of 50th day

Assumed Distribution

Variable Assumed Distribution Skewness Variable Assumed Distribution Skewness
Snowmelt Exponential 1.36 RadTemp Weibull -0.63
Humidity Normal 0.16 Elevation Gamma 0.49
Air Temp Weibull -0.39 Slope Normal -0.30
Pressure Normal -0.24 Aspect Normal 0.11

Distribution selection in R can be performed using the model_select() function in UnivariateML package, which fits multiple candidate distributions to the data, estimates their parameters by maximizing the log-likelihood, and selects the best model based on criteria like AIC or log-likelihood.

Correlation

Correlation between Snowmelt and other variables for different days.
Day Humidity AirTemp Pressure Radtemp Elevation Slope Aspect
10 0.287 0.227 0.239 0.239 -0.174 -0.015 -0.014
20 0.362 0.303 0.334 0.319 -0.247 -0.053 -0.012
30 0.606 0.524 0.564 0.491 -0.401 -0.122 -0.003
40 0.468 0.413 0.450 0.394 -0.309 -0.093 0.001
50 0.813 0.810 0.835 0.799 -0.673 -0.191 -0.089
60 0.759 0.756 0.765 0.765 -0.628 -0.162 -0.091
70 0.704 0.690 0.701 0.648 -0.547 -0.126 -0.061
80 0.773 0.787 0.797 0.793 -0.636 -0.171 -0.095
90 0.848 0.827 0.849 0.674 -0.681 -0.201 -0.109

The amount of Snow melt and magnitude of correlation with other variables increases as time progresses.

Correlation for Specific Humidity

Correlation between Specific Humidity and other variables for different days.
Day Snow Melt Temperature Pressure Radtemp Elevation Slope Aspect
10 0.287 0.980 0.972 0.938 -0.775 -0.161 0.015
20 0.362 0.970 0.990 0.887 -0.810 -0.172 0.001
30 0.606 0.991 0.989 0.983 -0.822 -0.176 0.008
40 0.468 0.988 0.981 0.932 -0.817 -0.177 0.006
50 0.813 0.912 0.965 0.894 -0.774 -0.174 0.009
60 0.759 0.915 0.958 0.867 -0.791 -0.143 -0.016
70 0.704 0.930 0.985 0.916 -0.817 -0.175 0.003
80 0.773 0.874 0.975 0.835 -0.785 -0.179 0.010
90 0.848 0.935 0.978 0.883 -0.808 -0.174 -0.003

Correlation for Aspect

Correlation between Aspect and other variables for different days.
Day Snow Melt Humidity Temperature Pressure Radtemp Elevation Slope
10 -0.014 0.015 0.003 0.002 -0.070 -0.097 0.093
20 -0.012 0.001 -0.003 0.002 -0.092 -0.097 0.093
30 -0.003 0.008 0.000 0.002 -0.022 -0.097 0.093
40 0.001 0.006 -0.001 0.002 -0.042 -0.097 0.093
50 -0.089 0.009 -0.010 0.002 -0.090 -0.097 0.093
60 -0.091 -0.016 -0.018 0.002 -0.118 -0.097 0.093
70 -0.061 0.003 -0.019 0.002 -0.099 -0.097 0.093
80 -0.095 0.010 -0.005 0.002 -0.106 -0.097 0.093
90 -0.109 -0.003 -0.026 0.002 -0.055 -0.097 0.093

Scatter plot for 50 th day

Variable Selection

Variance Inflation Factor for 50th day

Suppose we fit a linear regression with response variable Snow melt and covariates with remaining variables for the 50th day. The table is given bellow is the VIF values for the model of 50th day.

Humidity Air Temp Pressure Rad Temp Elevation Slope Aspect
26.801735 70.179792 113.651005 18.760397 3.797529 1.118656 1.167679

Choice of Copulas

We will examine two elliptical copulas and three Archimedean copulas. The elliptical copulas are the Gaussian copula and the t-copula, while the Archimedean copulas are the Joe copula, the Clayton copula, and the Gumbel copula.

The fitCopula(x, data = y, method = "mpl") function, from Copula package in R estimates copula parameters using the Maximum Pseudo-Likelihood (MPL) method. It fits a specified copula model(here we use 5 prespecified copulas) (x) to the data (y). The function maximizes the pseudo log-likelihood for each copula. The output provides the estimated copula parameters, log-likelihood value, and convergence diagnostics.

Fitting the data

The table bellow shows the AIC values for the given copula fittted for the data. The copula with least AIC values are considered as the best fit copula for the data.

Copula AIC
Normal -25419.45468
t -25492.97474
Joe -78.91499
Clayton -945.01009
Gumbel -377.38352

Continued

According to the above AIC values Normal or t-copula is best fitted for the data. we fitted normal copula for the data with 2025 observations with maximum likelihood 12738 and the parameter values (total \(\binom{8}{2} = 28\) parameters) or the estimated correlation structure is given by,

\[ \begin{bmatrix} 0.86420 & 0.90704 & 0.91752 & 0.92660 & -0.75803 & -0.21551 & -0.10143 \\ 0.90591 & 1.00000 & 0.97386 & 0.96914 & -0.83455 & -0.18410 & -0.02047 \\ 0.96618 & 0.97386 & 1.00000 & 0.94845 & -0.82305 & -0.17961 & -0.01318 \\ 0.89682 & 0.96914 & 0.94845 & 1.00000 & -0.80231 & -0.21454 & -0.08833 \\ -0.76963 & -0.83455 & -0.82305 & -0.80231 & 1.00000 & 0.26042 & -0.07999 \\ -0.18589 & -0.18410 & -0.17961 & -0.21454 & 0.26042 & 1.00000 & 0.06749 \\ -0.01171 & -0.02047 & -0.01318 & -0.08833 & -0.07999 & 0.06749 & 1.00000 \end{bmatrix} \]

Continued

Variable Combinations AIC
humidity, Air Temp, pressure, radtemp, Elevation, slope, aspect -25419.45
humidity, Air Temp, pressure, radtemp, Elevation, aspect -25210.43
humidity, Air Temp, pressure, radtemp, Elevation, slope -25181.98
humidity, Air Temp, pressure, radtemp, Elevation -24977.63
humidity, Air Temp, pressure, radtemp, slope, aspect -22852.87
humidity, Air Temp, pressure, radtemp, aspect -22741.06
humidity, Air Temp, pressure, radtemp, slope -22676.84
humidity, Air Temp, pressure, radtemp -22562.03
humidity, Air Temp, pressure, Elevation, slope, aspect -18986.30
humidity, Air Temp, pressure, Elevation, slope -18823.07
humidity, Air Temp, pressure, Elevation, aspect -18784.57
Air Temp, pressure, radtemp, Elevation, slope, aspect -18722.30
humidity, Air Temp, pressure, Elevation -18629.39
Air Temp, pressure, radtemp, Elevation, aspect -18527.43
Air Temp, pressure, radtemp, Elevation, slope -18484.03

Future plan of study

Since the covariates are highly correlated, OLS may not perform well. This study will explore copula regression as an alternative, as it better captures dependencies and handles non-normal distributions. I will implement copula regression and compare its performance with OLS to assess its effectiveness in modeling Snow Melt.

References

Maina Samuel C., Mwigereri D., Weyn J., Mackey L., Ochieng M. (2023).”Evaluation of Dependency Structure for Multivariate Weather Predictors Using Copulas.” Available at: https://doi.org/10.1145/3616384

Parsa, Rahul A., Stuart A. Klugman. (2011). “Copula Regression.” Available at: https://variancejournal.org/article/84912

Nelsen Roger B. (2006). An Introduction to Copulas (2nd ed.), Springer Series in Statistics.

Hofert M., Kojadinovic I., Mächler M.,Yan J. (2018), Elements of Copula Modeling with R., Springer International Publishing AG

Haugh M., (2016). An introduction to copulas, IEOR E4602: Quantitative Risk Management (Spring 2016)

Thank You