1 Objective

This exercise aims to explore the concept of linear (in)dependence in the context of multispectral satellite imagery. You will:

  • Download multispectral satellite imagery (Red and Near-Infrared bands) using Google Earth Engine (GEE).

  • Read and process this imagery in Python using libraries like rasterio and numpy.

  • Calculate an unnormalized vegetation index (Difference Vegetation Index- DVI) and a nor malized vegetation index (Normalized Difference Vegetation Index- NDVI).

  • Analyze the set of bands {NIR, Red, DVI} and {NIR, Red, NDVI} for linear dependence using the covariance matrix, its determinant, and its eigenvalues/eigenvectors.

  • Explain the observed differences based on the mathematical properties of the indices and the principles of linear algebra.

2 Methodology and Tasks

2.1 Load libraries

pacman::p_load(bookdown,leaflet, terra,sf, knitr, kableExtra, ggplot2,gridExtra)

2.2 Part 0: Data Acquisition (using Google Earth Engine)

2.2.1 Define a Region of interest (ROI) over a predominantly vegetated area.

The area selected was a one of the most intense wildfire from 2019 on Durango, Dgo. The study area is dominated by conifers of the genera Pinus, Juniperus and Cupressus, as well as grasslands and shrublands dominated by Acacia and Prosopis species (see Figure 2.1). The climate ranges from semi-dry temperate to semi-cold subhumid and subhumid. Altitude ranges from 800 to 3,200 m.

2.2.2 Load region of interes (ROI)

# Load the vector file
cpc <- vect("D:/Centro_geo/C03_APR_Aprendizaje profundo/T01_Linear independence and vegetation indices/insumos/19DUR01_cpc_a_4326.gpkg")

2.2.3 Visualize dinamyc ROI

# Convert to sf for Leaflet compatibility
cpc_sf <- st_as_sf(cpc)

# Get extent (bounding box) of the polygon using terra functions
extent_cpc <- ext(cpc)

# Extract min/max long adn lat for view on leaflet
lng1 <- xmin(extent_cpc)  
lng2 <- xmax(extent_cpc)  
lat1 <- ymin(extent_cpc)  
lat2 <- ymax(extent_cpc)  

# Create leaflet dynamic map
leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data = cpc_sf, 
              fillColor = "blue", fillOpacity = 0.5, 
              color = NA) %>%
  fitBounds(lng1, lat1, lng2, lat2)

2.2.4 Visualize fuel-bed mapping of wildfire

Location of the wildfire from 2019 analyzed in the Mexican province of Sierra Madre Occidental of Durango State, Mexico. AGAF = Aggregated active fire perimeter. Fuelbed types (Scale: 1:250.000): 1.1.GR1 = Warm-humid tall grass, 1.2.GR2 = Temperate-humid short grass and shrublands, 1.3.GR3 = Semiarid short grass and shrublands, 2.1.GS1 = Semiarid grass and shrublands, 2.2.GS2 = Arid grass and shrublands, 3.1.SH1 = Arid tall shrublands, 3.2.SH2 = Arid short shrublands, 4.1.TU1 = Subhumid temperate timber understory with dense shrubs, 4.2.TU2 = Humid temperate-cold timber understory with dense pasture, 4.3.TU3 = Subhumid timber understory with sparse grasslands, 4.4.TU4 = Subhumid warm to semiarid timber understory with sparse grass and shrublands, 5.1.TL1 = Humid temperate short needle coniferous timber litter, 5.2.TL2 = Humid long needle coniferous timber litter, 5.3.TL3 = Subhumid to humid sclerophyll broadleaf timber litter, 5.4.TL4 = Seasonally dry humid broadleaf timber litter, 7.NB = Not burnable (bare ground), 8.NB1 = Not forestry areas (e.g. urban development, agricultural) and 8.1.NB2 = Water bodies.

Figure 2.1: Location of the wildfire from 2019 analyzed in the Mexican province of Sierra Madre Occidental of Durango State, Mexico. AGAF = Aggregated active fire perimeter. Fuelbed types (Scale: 1:250.000): 1.1.GR1 = Warm-humid tall grass, 1.2.GR2 = Temperate-humid short grass and shrublands, 1.3.GR3 = Semiarid short grass and shrublands, 2.1.GS1 = Semiarid grass and shrublands, 2.2.GS2 = Arid grass and shrublands, 3.1.SH1 = Arid tall shrublands, 3.2.SH2 = Arid short shrublands, 4.1.TU1 = Subhumid temperate timber understory with dense shrubs, 4.2.TU2 = Humid temperate-cold timber understory with dense pasture, 4.3.TU3 = Subhumid timber understory with sparse grasslands, 4.4.TU4 = Subhumid warm to semiarid timber understory with sparse grass and shrublands, 5.1.TL1 = Humid temperate short needle coniferous timber litter, 5.2.TL2 = Humid long needle coniferous timber litter, 5.3.TL3 = Subhumid to humid sclerophyll broadleaf timber litter, 5.4.TL4 = Seasonally dry humid broadleaf timber litter, 7.NB = Not burnable (bare ground), 8.NB1 = Not forestry areas (e.g. urban development, agricultural) and 8.1.NB2 = Water bodies.

2.2.5 Search for a Sentinel-2 MSI (or Landsat 8/9 OLI) image that is relatively cloud-free over your ROI.

We develop a GEE-code to download specific bands using Sentinel-2 data (“COPERNICUS/S2_SR_HARMONIZED”). The workflow begins after defining the ROI and extracting wildfire occurrence dates to create post-fire composite bands (see Table 2.1 ). A cloud masking function removes cloud interference using scene classification mask (SCL) filters out unwanted elements like shadows, snow, and water.

Table 2.1: Wildfire event summary and Sentinel-2 image collection analysis.
Attribute Value
Wildfire ID 19DUR01
UTM Zone 13
Occurred between 2019-04-24 to 2019-05-31
Time window (post-fire) 60 days
Post-fire period 2019-05-31 to 2019-07-30
Images found in post-fire 48
Image Collection COPERNICUS/S2_SR_HARMONIZED (48 elements)

2.2.6 Select the Red band (e.g., Sentinel-2 B4, Landsat B4) and Near-Infrared (NIR) band (e.g., Sentinel-2 B8, Landsat B5).

To generate a composite image, exploratory analyses have been conducted using different composite techniques (mean, median, and minimum) and various time windows (1, 15, 30, 60, and 90 days). The time window that best predicted wildfire severity based on field variables was 60 days, using the 25th percentile composite (see Reference. The script selects the Red (B4) and Near-Infrared (B8) bands, applying a percentile 25 reduction.

2.2.7 Clip the selected bands to your ROI.

The download include also the RGB Sentinel bands (B2 and B3) to enabled the color natural visualization and false infrarred color (see Figure 2.2c and d).

GEE composite visualization.

Figure 2.2: GEE composite visualization.

2.2.8 Export the band images as a to Google Drive

Finally the download consist on four bands (B2-Blue, B3-Green, B4-Red, B8-NIR), of 60 day and the percentile 25 composite at 10m spatial resolution and UTM coordinate system (see Figure 2.3).

The acces to the full script created for this homework is shown on GEE-CODE. All the input-data where avilable on drive

GEE composite export parameters.

Figure 2.3: GEE composite export parameters.

2.3 Part 1: Linear Dependence with DVI

2.3.1 Load data

# Load Raster Bands
nir_raster <- rast("D:/Centro_geo/C03_APR_Aprendizaje profundo/T01_Linear independence and vegetation indices/insumos/19DUR01_B8_60d_median_10m_13.tif")
red_raster <- rast("D:/Centro_geo/C03_APR_Aprendizaje profundo/T01_Linear independence and vegetation indices/insumos/19DUR01_B4_60d_median_10m_13.tif")

# Convert raster values into dataframes for ggplot2
nir_df <- as.data.frame(nir_raster, xy = TRUE)
red_df <- as.data.frame(red_raster, xy = TRUE)

# Plot Red & NIR Bands with correct column references
p1 <- ggplot(nir_df, aes(x, y, fill = B8)) + 
  geom_raster() + scale_fill_viridis_c() + 
  ggtitle("NIR Band") + theme_minimal()

p2 <- ggplot(red_df, aes(x, y, fill = B4)) + 
  geom_raster() + scale_fill_viridis_c() + 
  ggtitle("Red Band") + theme_minimal()

# Arrange both plots side by side
gridExtra::grid.arrange(p1, p2, ncol=2)

2.3.2 Prepare data for analysis (flatten, 2D to 1D)

# Convert raster to matrix before flattening
nir_matrix <- as.matrix(nir_raster)
red_matrix <- as.matrix(red_raster)

# Flatten the 2D matrices into 1D arrays
nir_flat <- as.vector(nir_matrix)
red_flat <- as.vector(red_matrix)

# Check the structure
print(length(nir_flat)) 
## [1] 8488656
print(length(red_flat)) 
## [1] 8488656

Linear algebra operations (like covariance, eigenvalues, and determinant) are easier to compute because the data is now structured as column vectors instead of spatial grids.

2.3.3 Calculate DVI

With the equation:

\[ DVI = NIR_{flat} - RED_{flat} \]

# DVI calculation
dvi_flat <- nir_flat - red_flat

# Confirmar dimensiones
cat("Dimensión Red Band:", length(red_flat), "\n")
## Dimensión Red Band: 8488656
cat("Dimensión NIR Band:", length(nir_flat), "\n")
## Dimensión NIR Band: 8488656
cat("Dimensión DVI:", length(dvi_flat), "\n")
## Dimensión DVI: 8488656

2.3.4 Create Composite Data Matrix

# Create composite data matrix
data_dvi <- cbind(nir_flat, red_flat, dvi_flat)

2.3.5 Analyze covariance

Print detdvi. ¿What value do you expect and why?

It should be very close to zero, as the covariance matrix is calculated considering the spectral bands and the spectral index. Therefore, dvi_flat must be correlated with one of the bands (NIR or Red), meaning there is linear dependence.

The determinant indicates the volume of the data space occupied by the variables (spectral bands) in the multidimensional space. If the determinant is approximately zero, the data is contained within a smaller subdimension, meaning one band can be expressed as a function of the others. Since the determinant is approximately zero ( \[ determinant = -2.5 \times 10^{-24}\]), the spectral bands do not span the entire multidimensional space. This indicates that at least one component is a linear combination of the others, meaning the data exists in a lower-dimensional subspace rather than being fully independent.

# Count missing values
sum(is.na(data_dvi))  
## [1] 173397
# Remove rows with NaN values
data_dvi_clean <- na.omit(data.frame(data_dvi))  
sum(is.na(data_dvi_clean))  
## [1] 0
# Calculate covariance matrix with the clean data
covmatrix_dvi_clean <- cov(data_dvi_clean)
print("Covariance matrix:")
## [1] "Covariance matrix:"
covmatrix_dvi_clean
##             nir_flat     red_flat     dvi_flat
## nir_flat 0.001438591  0.000379412  0.001059179
## red_flat 0.000379412  0.001465402 -0.001085990
## dvi_flat 0.001059179 -0.001085990  0.002145169
# Calculate covariance matrix determinant 
det_dvi <- det(covmatrix_dvi_clean)

# Print determinant value
print("Determinant:")
## [1] "Determinant:"
print(det_dvi)
## [1] -2.555457e-24

2.3.6 Eigenanalysis

¿How many are significantly non-zero ? What does a near zero eigen value indicate in this context?

Eigenvectors represent the principal direction of variation in the data, while eigenvalues indicate the magnitude of variance along these directions (how much variability there is). In this case, 2 are the significantly non-zero eigenvalues. This indicate us that at least one of the component matrix (nir, red, dvi) can be expressed as a combination of the others, the data varies in two independent directions rather than three. Also i think it could be help us in a optimizing context,if we use only 2 instead three without losing essential information. Although, when the eigenvalues are near zero, indicates lineal dependency, only the last component was equals to zero: \[\lambda_1 = 3.2 \times 10^{-3}, \quad \lambda_2 = 1.83 \times 10^{-3}, \quad \lambda_3 = 1.73 \times 10^{-18}\].

Examine the eigenvector corresponding to the near-zero eigenvalue. How does this vector relate to the mathematical definition of DVI and the general form of linear dependence?

The eigenvector corresponding to the near-zero eigenvalue for NIR, Red, and DVI is 0.57, -0.57, and -0.57, respectively. The negative values indicate that Red and DVI contribute in the opposite direction, meaning they are inversely related to NIR. Also considerating that the eigenvalue of DVI (\[\lambda_3 = 1.73 \times 10^{-18}\]) confirms that DVI does not introduce new information but is instead dependent on NIR and Red. The eigenvector assigns nearly opposite weights to NIR and Red, which aligns with the mathematical definition of DVI.

# Calculate eigen vectors and values 
eig_dvi <- eigen(covmatrix_dvi_clean)

# Calculate eigen vectors and values 
eigenvalues_dvi <- eig_dvi$values
eigenvectors_dvi <- eig_dvi$vectors

# Print eigenvalues
print("Eigenvalues:")
## [1] "Eigenvalues:"
print(eigenvalues_dvi)
## [1] 3.218143e-03 1.831020e-03 1.734723e-18
# Count how many are different to zero 
non_zero_eigen <- sum(eigenvalues_dvi > 1e-15)
print(paste("Eigen values non-cero: ", non_zero_eigen))
## [1] "Eigen values non-cero:  2"
# Print eigenvectors
print("Eigenvectors:")
## [1] "Eigenvectors:"
print(eigenvectors_dvi)
##            [,1]        [,2]       [,3]
## [1,]  0.3963531 -0.71384232  0.5773503
## [2,] -0.4200290 -0.70017304 -0.5773503
## [3,]  0.8163822 -0.01366928 -0.5773503
# Near zero eigenvecor
near_zero_index <- which.min(eigenvalues_dvi)  
near_zero_eigenvector <- eigenvectors_dvi[, near_zero_index]

print("Eigenvector associated with the near-zero eigenvalue: ")
## [1] "Eigenvector associated with the near-zero eigenvalue: "
print(near_zero_eigenvector)
## [1]  0.5773503 -0.5773503 -0.5773503

2.4 Part2: Exploring a non-linear index (NDVI)

2.4.1 Calculate NDVI:

# Calculate NDVI considerating divided values by zero
ndvi_flat <- (nir_flat - red_flat) / (nir_flat + red_flat + 1e-5)

2.4.2 Create composite data matrix

data_ndvi <- cbind(nir_flat, red_flat, ndvi_flat)

2.4.3 Analyze covariance

Print det_ndvi. How does this value compare to det_dvi?

det_dvi is extremely close to zero, indicating strong linear dependence among the spectral bands (one of them can be expressed as a function of the others. det_ndvi is larger than det_dvi, but still quite small, meaning there is still some dependence, though NDVI introduces slightly more variability compared to DVI.

( \[ det_{dvi} = -2.5 \times 10^{-24}\]) ( \[ det_{ndvi} = 1.65 \times 10^{-9}\])

# Count missing values
sum(is.na(data_ndvi))  
## [1] 173397
# Remove rows with NaN values
data_ndvi_clean <- na.omit(data.frame(data_ndvi))
sum(is.na(data_ndvi_clean))  
## [1] 0
# Calculate covariance matrix with the clean data
covmatrix_ndvi <- cov(data_ndvi_clean)
print("Covariance matrix:")
## [1] "Covariance matrix:"
covmatrix_ndvi
##              nir_flat     red_flat    ndvi_flat
## nir_flat  0.001438591  0.000379412  0.001622547
## red_flat  0.000379412  0.001465402 -0.005095500
## ndvi_flat 0.001622547 -0.005095500  0.025017446
# Calculate covariance matrix determinant 
det_ndvi <- det(covmatrix_ndvi)

# Print determinant value
print("Determinant:")
## [1] "Determinant:"
print(det_ndvi)
## [1] 1.654915e-09

2.4.4 Eigen analysis

Print eigenvalues_ndvi. Are any eigen values close to zero now? How does this compare to the DVI case?

The smallest NDVI eigenvalue 3.68e-05 is very close to zero, but still larger than the smallest eigenvalue in DVI (1.73e-18), that is practicaly zero. Could suggest some dependence among NDVI, NIR, and Red, but not as strict as in DVI.

# Calculate eigen vectors and values 
eig_ndvi <- eigen(covmatrix_ndvi)

# Calculate ndvi eigen vectors and values 
eigenvalues_ndvi <- eig_ndvi$values
eigenvectors_ndvi <- eig_ndvi$vectors

# Print eigenvalues
print("NDVI eigenvalues:")
## [1] "NDVI eigenvalues:"
print(eigenvalues_ndvi)
## [1] 2.616511e-02 1.719550e-03 3.678225e-05
# Count how many are different to zero 
non_zero_eigenndvi <- sum(eigenvalues_ndvi > 1e-15)
print(paste("Eigen values non-cero: ", non_zero_eigenndvi))
## [1] "Eigen values non-cero:  3"
# Print eigenvectors
print("NDVI eigenvectors:")
## [1] "NDVI eigenvectors:"
print(eigenvectors_ndvi)
##             [,1]       [,2]       [,3]
## [1,] -0.06107797 0.88040545  0.4702720
## [2,]  0.20076622 0.47235858 -0.8582367
## [3,] -0.97773330 0.04199537 -0.2056064
# Near zero eigenvecor
near_zero_index_ndvi <- which.min(eigenvalues_ndvi)  
near_zero_eigenvector_ndvi <- eigenvectors_ndvi[, near_zero_index_ndvi]

print("Eigenvector associated with the near-zero eigenvalue: ")
## [1] "Eigenvector associated with the near-zero eigenvalue: "
print(near_zero_eigenvector_ndvi)
## [1]  0.4702720 -0.8582367 -0.2056064
Table 2.2: Comparison of eigen analysis for DVI and NDVI data
Metric DVI_Analysis NDVI_Analysis
Determinant -2.55e-24 1.65e-09
Eigenvalues 3.21e-03, 1.83e-03, 1.73e-18 2.61e-02, 1.71e-03, 3.67e-05
Non-Zero Eigenvalues 2 3
Near-Zero Eigenvector 0.57, -0.57, -0.57 0.47, -0.85, -0.20

2.5 Part 3: Explanation and Discussion

Answer the following questions based on your results and the theoretical background:

1. Why was the determinant of the covariance matrix for the {NIR, Red, DVI} set close to zero, while for the {NIR, Red, NDVI} set it was significantly non-zero? Relate your answer to the mathematical definitions of DVI and NDVI, and the concept of linear independence.

On DVI the determinant is close to zero, indicating strong linear dependence among its components. This could happen because DVI is defined as a simple linear subtraction, is completely determined by NIR and Red, it does not introduce new independent information (the dataset is constrained to a lower-dimensional subspace).

On the other hand, in NDVI equation, the ratio structure allows to capture non-linear variations that are not explicitly present in NIR and Red alone, introduces some unique variability, leading to a slightly higher determinant (-2.55e-24 vs 1.65e-09, see Table 2.2 ).

2. How did the eigenvalues of the covariance matrices differ between the DVI and NDVI cases? Explain what these differences signify about the dimensionality and structure of the data in each case.

DVI case only one eigenvalue is effectively to zero, confirming that at least one component is redundant. The dataset is constrained to a 2D plane or line, meaning it does not span the full 3D space. This tell us that the dimensionality reduction could be possible (one component can be eliminated without losing meaningful information).

The NDVI case, all eigenvalues are close but greater than zero, meaning each component contributes some independent variation. The dataset spans three dimensions, suggesting a higher degree of uniqueness among features (Non-Zero Eigenvalues = 3). Checking the effect on the covariance structure and principal component decomposition (det_dvi ≈ 0 vs det_ndvi > 0), although it is not clear to me, how much is “far”? In terms of how far the determinant must be to consider that given a determinant different from zero suggests the dataset spans a higher-dimensional space.

3. For the DVI case, explain how the eigenvector corresponding to the near-zero eigenvalue confirms the linear dependency. Specifically, show how the components of this eigenvector relate to the coefficients in the DVI formula (Equation 6) when expressed as a linear combination equaling a constant (or zero, after mean centering).

The eigenvector associated with the near-zero eigenvalue confirms linear dependence. The eigenvector calculated to the near-zero eigenvalue was:

\[ v = \begin{bmatrix} 0.577 \\ -0.577 \\ -0.577 \end{bmatrix} \]

The eigenvector assigns equal but opposite weights to NIR and Red, reflecting the structure of DVI. Since the third component (DVI) has the same weight as Red, this proves that DVI is just a linear combination of the nir and red bands.

4. What are the implications of using linearly dependent features (like NIR, Red, and DVI together) as inputs to a machine learning model that is sensitive to multicollinearity (e.g., linear regression)? Would using NIR, Red, and NDVI together pose the exact same degree of problem in terms of strict linear dependence? Explain.

Using linearly dependent features in a machine learning model sensitive to multicollinearity could cause instability in parameter estimation. Since DVI is strictly dependent on NIR and Red, the model will struggle to assign distinct weights, leading to redundancy and unstable coefficients.

To the NDVI case that introduces some independent variation, some correlation remains, which could still affect multicollinearity-sensitive models, causing unreliable predictions due to high correlation among variables.

In most of the literature I have read on regression models, spectral bands are primarily used for exploratory analysis to understand spectral responses to specific phenomena (e.g., wildfires). This helps determine which spectral index might serve best as an independent variable. More information does not always lead to better results, as seen in the DVI case. Therefore, this analysis is essential for improving regression models (including machine learning) by identifying and filtering out spectral bands that do not contribute meaningful information.