Analysis and modeling of snowmelt data for the prediction of Glacier lake outburst flood
Arpan Dutta (MD2305)
Midterm Project Presentation
For every \(d\geq 4\), a d dimentional copula \(\mathcal{C} : [0,1]^d\rightarrow[0,1]\) is a multivariate distribution function on \([0,1]^d\) with uniformly distributed marginals.
Let the random vector \((X_1, X_2,\ldots, X_d )\) exist with marginal cdf’s \(F_i(x) = P(X_i\leq x)\). Using the probability integral transformation the vector \((U_1, \ldots,U_d )\) = \((F_1(x_1),\ldots ,F_d(x_d))\) has uniform marginal, Then the copula of \((X1, \ldots, X_d )\) is the joint cdf of \((U_1,\ldots, U_d )\). Namely, it is \(\mathcal{C}_\theta(u_1,\ldots,u_d) = P[U_1\leq u_1,\ldots,U_d\leq u_d ]\). where \(\theta\) is the parameter values of copula.
it has some properties:
Consider a d dimensional CDF, F, with pdf f and with marginals \(F_1, F_2, \ldots, F_d\) Then there exists a copula, \(\mathcal{C}\), such that,
\[F(x_1,x_2,\ldots,x_d)= \mathcal{C}_\theta(F_1(x_1), F_1(x_2),\ldots,F_1(x_d))\] For all \(x_i\in\mathbb{R}\) and \(i = 1,2,\ldots,d\)
If \(F_i\) is continuous for all \(i = 1,2,3,\ldots, d\), then \(\mathcal{C}\) is unique. If any distribution is discrete, then C is uniquely defined on the support of the joint distribution.
On the otherhand, consider a copula, \(\mathcal C\) and univariate CDF’s \(F_1,F_2,\ldots,F_d\). Then \(F\) as defined as before is a multivariate CDF with marginals \(F_1,F_2,\ldots,F_d\).
The joint density function is obtained by,
\[ \begin{aligned} f(x_1,..., x_d) &= \frac{\partial ^d F(x_1, ..., x_d)}{\partial x_1 \cdots \partial x_d} \\ &= \frac{\partial^d C_\theta(F_1(x_1),...,F_d(x_d))}{\partial x_1 \cdots \partial x_d} \\ &= \frac{\partial ^d C_\theta (F_1(x_1),...,F_d(x_d))}{\partial F_1(x_1) \cdots \partial F_d(x_d)} \cdot \frac{\partial F_1(x_1)}{\partial x_1} \cdots \frac{\partial F_d(x_d)}{\partial x_d} \\ &= c_\theta (F_1(x_1),...,F_d(x_d)) \cdot f_1(x_1) \cdots f_d(x_d) \end{aligned} \] Where \(c_\theta\) is multivariate density of the copula.
Suppose, \(\mathbf{Y} = (Y_1,\ldots,Y_d)\) is a random vector, and we have parametric models \(F_{Y_1} (\cdot \mid \theta_1), \dots, F_{Y_d} (\cdot \mid \theta_d)\) for the marginal CDFs. We assume that we also have a parametric model \(c_Y(\cdot \mid \theta_C)\) for the copula density of \(\mathbf{Y}\). We see that the density of \(\mathbf{Y}\) is given by:
\[ f_{\mathbf{Y}}(\mathbf{y}) = f_Y(y_1, \dots, y_d) = c_Y(F_{Y_1} (y_1), \dots, F_{Y_d} (y_d)) \prod_{j=1}^{d} f_{Y_j} (y_j). \]
We then obtain the log-likelihood as,
\[ \begin{aligned} log L(\theta_1, ..., \theta_d, \theta_C) &= \log \prod_{i=1}^{n} f_Y(y_i) \\ &= \sum_{i=1}^{n} \Bigg( \log c_Y \Big( F_{Y_1}(y_{i,1} \mid \theta_1), \ldots, F_{Y_d}(y_{i,d} \mid \theta_d) \mid \theta_C \Big) \quad \\ &+ \log f_{Y_1}(y_{i,1} \mid \theta_1) + \dots + \log f_{Y_d}(y_{i,d} \mid \theta_d) \Bigg) \end{aligned} \]
It has two steps:
Step 1 First estimate the marginal CDFs to obtain \(\hat{F}_{y_j}\) for \(j = 1, \dots, d\). We can do this using either:
\[\hat{F}_{Y_j}(y) = \frac{\sum_{i=1}^{n} \mathbf{1}\{ y_{i,j} \leq y \}}{n}\]
Step 2 Then estimate the copula parameters \(\theta_C\) by maximizing
\[ \sum_{i=1}^{n} \log \left[ c_Y \left( \hat{F}_{Y_1}(y_{i,1}), \dots, \hat{F}_{Y_d}(y_{i,d}) \mid \theta_C \right) \right] \]
Estimate the marginal distributions.
Estimate the copula family using dependence structure estimation (Pseudo-Maximum likelihood method) and check goodness of fit using AIC on the joint distribution.
Multicollinearity occurs when the independent variables in a regression model are correlated with each other. Generally VIF (Variance inflation factor) is used to test the presence of multicollinearity in a regression model. Suppose our respond variable is \(Y\) and covariates are \(X_1, X_2, \ldots,X_p\). The formula of VIF for the \(i^{th}\) covariate is, \[ VIF_i = \frac{1}{1-{R_i}^2}\hspace{.5cm}i = 1,2,\ldots,p. \] where, \({R_i^2}\) is the coefficient of determination when we regress \(X_i\) by the other predictor variables. A rule of thumb is that, if \(VIF_i > 10\) then we suspect presence of multicollinearity in our model.
The dataset is sourced from NASA Giovanni with a daily temporal resolution and \(0.01^\circ\) spatial resolution. It combines static topographical variables derived from DEM data at a 30m spatial resolution and dynamic climatic variables that vary over time. The dataset includes Snow Melt data (kg \(m^{-2} s^{-1}\)) recorded from 15th March 2022 to 15th June 2022 (93 days). The data was processed in QGIS, resulting in 2025 spatial points for model analysis.
| Static Variables | Dynamic Variables |
|---|---|
| Elevation | Snow Melt |
| Slope | Specific Humidity |
| Aspect | Surface Air Temperature |
| Surface Radiative Temperature | |
| Surface Pressure |
| Variable | Assumed Distribution | Skewness | Variable | Assumed Distribution | Skewness |
|---|---|---|---|---|---|
| Snowmelt | Exponential | 1.36 | RadTemp | Weibull | -0.63 |
| Humidity | Normal | 0.16 | Elevation | Gamma | 0.49 |
| Air Temp | Weibull | -0.39 | Slope | Normal | -0.30 |
| Pressure | Normal | -0.24 | Aspect | Normal | 0.11 |
Distribution selection in R can be performed using the model_select() function in UnivariateML package, which fits multiple candidate distributions to the data, estimates their parameters by maximizing the log-likelihood, and selects the best model based on criteria like AIC or log-likelihood.
| Day | Humidity | AirTemp | Pressure | Radtemp | Elevation | Slope | Aspect |
|---|---|---|---|---|---|---|---|
| 10 | 0.287 | 0.227 | 0.239 | 0.239 | -0.174 | -0.015 | -0.014 |
| 20 | 0.362 | 0.303 | 0.334 | 0.319 | -0.247 | -0.053 | -0.012 |
| 30 | 0.606 | 0.524 | 0.564 | 0.491 | -0.401 | -0.122 | -0.003 |
| 40 | 0.468 | 0.413 | 0.450 | 0.394 | -0.309 | -0.093 | 0.001 |
| 50 | 0.813 | 0.810 | 0.835 | 0.799 | -0.673 | -0.191 | -0.089 |
| 60 | 0.759 | 0.756 | 0.765 | 0.765 | -0.628 | -0.162 | -0.091 |
| 70 | 0.704 | 0.690 | 0.701 | 0.648 | -0.547 | -0.126 | -0.061 |
| 80 | 0.773 | 0.787 | 0.797 | 0.793 | -0.636 | -0.171 | -0.095 |
| 90 | 0.848 | 0.827 | 0.849 | 0.674 | -0.681 | -0.201 | -0.109 |
The amount of Snow melt and magnitude of correlation with other variables increases as time progresses.
| Day | Snow Melt | Temperature | Pressure | Radtemp | Elevation | Slope | Aspect |
|---|---|---|---|---|---|---|---|
| 10 | 0.287 | 0.980 | 0.972 | 0.938 | -0.775 | -0.161 | 0.015 |
| 20 | 0.362 | 0.970 | 0.990 | 0.887 | -0.810 | -0.172 | 0.001 |
| 30 | 0.606 | 0.991 | 0.989 | 0.983 | -0.822 | -0.176 | 0.008 |
| 40 | 0.468 | 0.988 | 0.981 | 0.932 | -0.817 | -0.177 | 0.006 |
| 50 | 0.813 | 0.912 | 0.965 | 0.894 | -0.774 | -0.174 | 0.009 |
| 60 | 0.759 | 0.915 | 0.958 | 0.867 | -0.791 | -0.143 | -0.016 |
| 70 | 0.704 | 0.930 | 0.985 | 0.916 | -0.817 | -0.175 | 0.003 |
| 80 | 0.773 | 0.874 | 0.975 | 0.835 | -0.785 | -0.179 | 0.010 |
| 90 | 0.848 | 0.935 | 0.978 | 0.883 | -0.808 | -0.174 | -0.003 |
| Day | Snow Melt | Humidity | Temperature | Pressure | Radtemp | Elevation | Slope |
|---|---|---|---|---|---|---|---|
| 10 | -0.014 | 0.015 | 0.003 | 0.002 | -0.070 | -0.097 | 0.093 |
| 20 | -0.012 | 0.001 | -0.003 | 0.002 | -0.092 | -0.097 | 0.093 |
| 30 | -0.003 | 0.008 | 0.000 | 0.002 | -0.022 | -0.097 | 0.093 |
| 40 | 0.001 | 0.006 | -0.001 | 0.002 | -0.042 | -0.097 | 0.093 |
| 50 | -0.089 | 0.009 | -0.010 | 0.002 | -0.090 | -0.097 | 0.093 |
| 60 | -0.091 | -0.016 | -0.018 | 0.002 | -0.118 | -0.097 | 0.093 |
| 70 | -0.061 | 0.003 | -0.019 | 0.002 | -0.099 | -0.097 | 0.093 |
| 80 | -0.095 | 0.010 | -0.005 | 0.002 | -0.106 | -0.097 | 0.093 |
| 90 | -0.109 | -0.003 | -0.026 | 0.002 | -0.055 | -0.097 | 0.093 |
Suppose we fit a linear regression with response variable Snow melt and covariates with remaining variables for the 50th day. The table is given bellow is the VIF values for the model of 50th day.
| Humidity | Air Temp | Pressure | Rad Temp | Elevation | Slope | Aspect |
|---|---|---|---|---|---|---|
| 26.801735 | 70.179792 | 113.651005 | 18.760397 | 3.797529 | 1.118656 | 1.167679 |
We will examine two elliptical copulas and three Archimedean copulas. The elliptical copulas are the Gaussian copula and the t-copula, while the Archimedean copulas are the Joe copula, the Clayton copula, and the Gumbel copula.
The fitCopula(x, data = y, method = "mpl") function, from Copula package in R estimates copula parameters using the Maximum Pseudo-Likelihood (MPL) method. It fits a specified copula model(here we use 5 prespecified copulas) (x) to the data (y). The function maximizes the pseudo log-likelihood for each copula. The output provides the estimated copula parameters, log-likelihood value, and convergence diagnostics.
The table bellow shows the AIC values for the given copula fittted for the data. The copula with least AIC values are considered as the best fit copula for the data.
| Copula | AIC |
|---|---|
| Normal | -25419.45468 |
| t | -25492.97474 |
| Joe | -78.91499 |
| Clayton | -945.01009 |
| Gumbel | -377.38352 |
According to the above AIC values Normal or t-copula is best fitted for the data. we fitted normal copula for the data with 2025 observations with maximum likelihood 12738 and the parameter values (total \(\binom{8}{2} = 28\) parameters) or the estimated correlation structure is given by,
\[ \begin{bmatrix} 0.86420 & 0.90704 & 0.91752 & 0.92660 & -0.75803 & -0.21551 & -0.10143 \\ 0.90591 & 1.00000 & 0.97386 & 0.96914 & -0.83455 & -0.18410 & -0.02047 \\ 0.96618 & 0.97386 & 1.00000 & 0.94845 & -0.82305 & -0.17961 & -0.01318 \\ 0.89682 & 0.96914 & 0.94845 & 1.00000 & -0.80231 & -0.21454 & -0.08833 \\ -0.76963 & -0.83455 & -0.82305 & -0.80231 & 1.00000 & 0.26042 & -0.07999 \\ -0.18589 & -0.18410 & -0.17961 & -0.21454 & 0.26042 & 1.00000 & 0.06749 \\ -0.01171 & -0.02047 & -0.01318 & -0.08833 & -0.07999 & 0.06749 & 1.00000 \end{bmatrix} \]
| Variable Combinations | AIC |
|---|---|
| humidity, Air Temp, pressure, radtemp, Elevation, slope, aspect | -25419.45 |
| humidity, Air Temp, pressure, radtemp, Elevation, aspect | -25210.43 |
| humidity, Air Temp, pressure, radtemp, Elevation, slope | -25181.98 |
| humidity, Air Temp, pressure, radtemp, Elevation | -24977.63 |
| humidity, Air Temp, pressure, radtemp, slope, aspect | -22852.87 |
| humidity, Air Temp, pressure, radtemp, aspect | -22741.06 |
| humidity, Air Temp, pressure, radtemp, slope | -22676.84 |
| humidity, Air Temp, pressure, radtemp | -22562.03 |
| humidity, Air Temp, pressure, Elevation, slope, aspect | -18986.30 |
| humidity, Air Temp, pressure, Elevation, slope | -18823.07 |
| humidity, Air Temp, pressure, Elevation, aspect | -18784.57 |
| Air Temp, pressure, radtemp, Elevation, slope, aspect | -18722.30 |
| humidity, Air Temp, pressure, Elevation | -18629.39 |
| Air Temp, pressure, radtemp, Elevation, aspect | -18527.43 |
| Air Temp, pressure, radtemp, Elevation, slope | -18484.03 |
Since the covariates are highly correlated, OLS may not perform well. This study will explore copula regression as an alternative, as it better captures dependencies and handles non-normal distributions. I will implement copula regression and compare its performance with OLS to assess its effectiveness in modeling Snow Melt.
Maina Samuel C., Mwigereri D., Weyn J., Mackey L., Ochieng M. (2023).”Evaluation of Dependency Structure for Multivariate Weather Predictors Using Copulas.” Available at: https://doi.org/10.1145/3616384
Parsa, Rahul A., Stuart A. Klugman. (2011). “Copula Regression.” Available at: https://variancejournal.org/article/84912
Nelsen Roger B. (2006). An Introduction to Copulas (2nd ed.), Springer Series in Statistics.
Hofert M., Kojadinovic I., Mächler M.,Yan J. (2018), Elements of Copula Modeling with R., Springer International Publishing AG
Haugh M., (2016). An introduction to copulas, IEOR E4602: Quantitative Risk Management (Spring 2016)