library(terra)      # For raster data
library(ggplot2)    # Visualization
library(corrplot)   # Correlation matrices
library(psych)      # KMO test
library(factoextra) # Plots

Introduction

Remote sensing data is multidimensional by nature. Modern earth observation satellites, such as the Sentinel-2 constellation, capture the Earth’s surface not just in the visible spectrum (Red, Green, Blue), but across a wide range of electromagnetic wavelengths, including the “Red Edge” and Short-Wave Infrared (SWIR). While this gives us a lot of information, it creates a problem known as the “curse of dimensionality”. Adjacent spectral bands often carry highly correlated information (e.g., if a pixel is bright in Green, it is often bright in Red), leading to data redundancy.

In this project, I analyze the area of the Bełchatów Lignite Mine, one of the largest open-pit mines in Europe. This area is complex, featuring deep pits, mineral spoil heaps, water bodies, and vegetation.

Instead of discarding bands arbitrarily, which risks losing subtle spectral details, I used Principal Component Analysis (PCA). PCA allows to transform bands into a new set of uncorrelated variables. This technique reduces noise, compresses data, and helps visualize the underlying geological and vegetative structure of the terrain in a single 3-channel composite that are invisible in standard photos.

Data Description

The analysis utilizes the Sentinel-2 Level-2A product. This dataset provides Bottom-Of-Atmosphere reflectance values, meaning the data has already been atmospherically corrected. Sentinel-2 captures images in 13 spectral bands with varying spatial resolutions (10m, 20m, and 60m). Data can be downloaded from here.

To perform a valid Principal Component Analysis, all input variables must share the same spatial domain and resolution. While the Level-2A product provides 10m bands resampled to 20m, not all captured bands are relevant for land surface analysis.

The table below summarizes all 13 available bands and their status in our analysis:

Band Name Resolution Description
B01 Coastal Aerosol 60 m (ocean color/atmosphere)
B02 Blue 10 m Standard visible blue.
B03 Green 10 m Standard visible green.
B04 Red 10 m Standard visible red.
B05 Red Edge 1 20 m A transition zone between Red and NIR. Very sensitive to small changes in plant chlorophyll.
B06 Red Edge 2 20 m Used to assess vegetation health and biomass content.
B07 Red Edge 3 20 m Similar to B06, helps in classifying different types of vegetation.
B08 NIR (Broad) 10 m Similar to B8A
B8A NIR (Narrow) 20 m Near Infrared. Plants reflect this strongly. It is crucial for separating biology (plants) from geology (rocks/soil).
B09 Water Vapour 60 m (atmosphere)
B10 SWIR / Cirrus 60 m (high clouds)
B11 SWIR 1 20 m Short-Wave Infrared. Sensitive to moisture content in soil and plants. Helps differentiate rock types.
B12 SWIR 2 20 m Longer wavelength infrared. Useful for identifying clay minerals and bare soil properties.

I reduced the dataset from 13 to 9 input variables based on the following criteria:

  1. Atmospheric Bands (B01, B09, B10): These bands are primarily designed for atmospheric corrections (detecting aerosols, water vapor, and cirrus clouds) rather than observing land surface features. Although they are provided in the dataset resampled to 20m, their native resolution is 60m, which means they lack the spatial detail required to distinguish small geological structures within the mine.
  2. Spectral Redundancy (B08): The satellite provides two Near-Infrared bands. I selected B8A (Narrow NIR) over B08 (Broad NIR). Since the analysis is standardized to 20m, B8A is the natural choice (native 20m), whereas B08 is a 10m band. Furthermore, B8A offers more precise spectral information for vegetation analysis, avoiding water vapor absorption lines present in the broader B08 channel.
safe_path <- "data/S2B_MSIL2A_20250630T095029_N0511_R079_T33UYS_20250630T120527.SAFE"

get_band <- function(path, band_name) {
  pattern <- paste0("_", band_name, "_20m.jp2$")
  file <- list.files(path, pattern = pattern, recursive = TRUE, full.names = TRUE)
  if(length(file) == 0) stop(paste("Band", band_name, "not found!"))
  return(file[1]) 
}
bands_list <- c("B02", "B03", "B04", "B05", "B06", "B07", "B8A", "B11", "B12")

files <- sapply(bands_list, function(b) get_band(safe_path, b))

sen_stack <- rast(files) # Load rasters as a multi-layer stack
names(sen_stack) <- c("Blue", "Green", "Red", "RE1", "RE2", "RE3", "NIR_N", "SWIR1", "SWIR2")

# Longitude: 19.20 - 19.36 E | Latitude: 51.20 - 51.28 N | For Bełchatów mine
e_geo <- ext(19.20, 19.36, 51.20, 51.28) 
e_proj <- project(e_geo, from = "EPSG:4326", to = crs(sen_stack))

# Crop the raster stack to the area of interest
sen_crop <- crop(sen_stack, e_proj)
summary(sen_crop)
##       Blue          Green           Red            RE1            RE2      
##  Min.   :1113   Min.   :1105   Min.   :1108   Min.   :1074   Min.   :1051  
##  1st Qu.:1303   1st Qu.:1470   1st Qu.:1356   1st Qu.:1743   1st Qu.:2730  
##  Median :1486   Median :1707   Median :1697   Median :2155   Median :3109  
##  Mean   :1661   Mean   :1873   Mean   :1908   Mean   :2288   Mean   :3134  
##  3rd Qu.:1864   3rd Qu.:2077   3rd Qu.:2207   3rd Qu.:2568   3rd Qu.:3497  
##  Max.   :5695   Max.   :6391   Max.   :7036   Max.   :7609   Max.   :7591  
##       RE3           NIR_N          SWIR1          SWIR2     
##  Min.   :1094   Min.   :1094   Min.   :1106   Min.   :1085  
##  1st Qu.:2988   1st Qu.:3170   1st Qu.:2464   1st Qu.:1856  
##  Median :3422   Median :3652   Median :3015   Median :2364  
##  Mean   :3439   Mean   :3646   Mean   :3274   Mean   :2727  
##  3rd Qu.:3870   3rd Qu.:4129   3rd Qu.:3752   3rd Qu.:3120  
##  Max.   :7616   Max.   :7681   Max.   :9066   Max.   :8900

Sentinel-2 data is stored as integers (0–10000). I rescaled this to reflectance values (0–1) for proper interpretation.

sen_crop <- sen_crop / 10000
print(sen_crop)
## class       : SpatRaster 
## size        : 477, 584, 9  (nrow, ncol, nlyr)
## resolution  : 20, 20  (x, y)
## extent      : 792880, 804560, 5680460, 5690000  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 33N (EPSG:32633) 
## source(s)   : memory
## names       :   Blue,  Green,    Red,    RE1,    RE2,    RE3, ... 
## min values  : 0.1103, 0.1105, 0.1108, 0.1074, 0.1051, 0.1024, ... 
## max values  : 0.8888, 0.8961, 0.9599, 0.9410, 0.8126, 0.8671, ...
# True Color Composite (Red, Green, Blue)
plotRGB(sen_crop, r = 3, g = 2, b = 1, stretch = "lin", axes = TRUE)

Before applying dimension reduction, it is crucial to verify the relationships between the spectral bands. Since calculating the correlation matrix for millions of pixels is computationally intensive, I calculated the Pearson correlation coefficients based on a random sample of 50,000 pixels drawn from the cropped image. Also I checked if the dataset is suitable for use of PCA by using Kaiser-Meyer-Olkin test that compares the magnitude of observed correlation coefficients to partial correlation coefficients. A value close to 1 indicates that the sum of partial correlations is small compared to the sum of correlations.

set.seed(123)

pixel_sample <- spatSample(sen_crop, size = 50000, method = "random", na.rm = TRUE, values = TRUE)

cor_matrix <- cor(pixel_sample, method = "pearson")
KMO(cor_matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_matrix)
## Overall MSA =  0.83
## MSA for each item = 
##  Blue Green   Red   RE1   RE2   RE3 NIR_N SWIR1 SWIR2 
##  0.84  0.83  0.88  0.85  0.84  0.75  0.81  0.83  0.83

The value is greater than 0.8, what confirms that the dataset is suitable for the use of PCA.

corrplot(cor_matrix, type = "lower", 
         order = "hclust", 
         addCoef.col = "black",
         tl.col = "black", tl.srt = 45, # Label rotation
         number.cex = 0.7, # Font size
         title = "Spectral Band Correlation Matrix", 
         mar = c(0,0,2,0)) # Margin adjustment for title

Principal Component Analysis (PCA)

Principal Component Analysis is a linear transformation technique. It works by calculating the eigenvectors and eigenvalues of the data’s covariance matrix. Imagine n-dimensional cloud of data points. PCA rotates this cloud in such a way that the first axis (PC1) aligns with the direction of the greatest variation in the data (maximum variance). The second axis (PC2) is perpendicular (orthogonal) to the first and captures the second greatest variance. This process repeats for all dimensions.

The result is a set of new variables (Principal Components) that are uncorrelated. The first few components usually contain most of the total variation, while the later components can be skipped without lose of much information.

Unlike the correlation matrix, here I calculated the PCA model on the entire dataset. This ensures capturing every detail of the terrain. I used the standard prcomp function form package stats on the normalized spectral data. Additionally, since all bands measure the same physical unit (reflectance), I wanted to preserve the natural magnitude of signals (e.g., vegetation is naturally very bright in NIR) - that is why scale = FALSE.

sen_df <- values(sen_crop, dataframe = TRUE, na.rm = TRUE)
pca_model <- prcomp(sen_df, center = TRUE, scale. = FALSE)
summary(pca_model)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6      PC7
## Standard deviation     0.2068 0.1013 0.05625 0.01474 0.01086 0.00701 0.005289
## Proportion of Variance 0.7550 0.1813 0.05589 0.00384 0.00208 0.00087 0.000490
## Cumulative Proportion  0.7550 0.9364 0.99224 0.99607 0.99816 0.99902 0.999520
##                             PC8      PC9
## Standard deviation     0.004249 0.003044
## Proportion of Variance 0.000320 0.000160
## Cumulative Proportion  0.999840 1.000000
fviz_eig(pca_model, addlabels = TRUE, ylim = c(0, 100), main = "Explained Variance")

First three components (PC1, PC2, PC3) explain over 99% of the total variance.

Interpretation of loadings

To understand what these new components represent physically, I analyzed the Loadings (Eigenvectors) by following the methodology described by Richards & Jia (2006):

PC1 (Brightness): In satellite imagery, the first component usually represents overall Albedo (Brightness). It can be spotted when all bands have the same sign (all positive or all negative) in the loadings table.

PC2 & PC3 (Spectral Contrast): These components reveal spectral features. It can be distinguished by looking for bands with opposite signs (e.g., positive NIR vs. negative Red) to identify what the component detects (e.g., Vegetation).

var <- get_pca_var(pca_model)
print(var$coord[, 1:3], digits = 3)
##        Dim.1    Dim.2    Dim.3
## Blue  0.0421 -0.01271 -0.02158
## Green 0.0496 -0.00899 -0.02327
## Red   0.0649 -0.01804 -0.02370
## RE1   0.0659 -0.00584 -0.02037
## RE2   0.0444  0.04244 -0.00722
## RE3   0.0407  0.05672 -0.00208
## NIR_N 0.0436  0.06085  0.00416
## SWIR1 0.1076 -0.00910  0.02556
## SWIR2 0.1134 -0.02937  0.02131
fviz_pca_var(pca_model,
             col.var = "contrib", # Color variables by their contribution
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,
             title = "Variables - PCA Factor Map (PC1 vs PC2)")

  • PC1: All values are positive. This confirms PC1 represents Brightness. It captures the intensity of light reflected by the surface (hills are bright, shadows are dark) and is responsible for most of the total variance (75.5%).

  • PC2: There is a strong contrast. Visible bands are negative, while NIR/RedEdge bands are positive. This is the spectral signature of Vegetation. PC2 separates plants (high NIR) from bare soil/water.

  • PC3: Here, SWIR bands are positive, while others are negative or close to zero. This component highlights Soil/Mineral differences (e.g., moisture or clay content).

pca_image <- predict(sen_crop, pca_model, index = 1:3)
names(pca_image) <- c("PC1", "PC2", "PC3")

Below, there is a map the three new components back to an image. This creates a “False Color Composite” where colors have specific meanings:

  • Red Channel = PC1 (Brightness/Terrain)

  • Green Channel = PC2 (Vegetation)

  • Blue Channel = PC3 (Soils/Minerals)

plotRGB(pca_image, r = 1, g = 2, b = 3, 
        stretch = "lin", 
        axes = TRUE)

It is clear that the mine area is characterized by a palette of light purple colors and differs from the area of the power plant, the ash storage site, or the surrounding fields.

par(mfrow = c(1, 2))
plotRGB(sen_crop, r = 3, g = 2, b = 1, 
        stretch = "lin", 
        axes = TRUE, 
        main = "Standard True Color (RGB)")
plotRGB(pca_image, r = 1, g = 2, b = 3, 
        stretch = "lin", 
        axes = TRUE, 
        main = "PCA False Color (PC1-PC2-PC3)")

par(mfrow = c(1, 1))

Physical Validation of Components

To scientifically validate hypotheses stated above, a cross-validation is performed. Established physical spectral indices: Iron Oxide ratio, NDVI (Vegetation - Band ratio parameter in Rouse et al. 1974) and Mean Brightness were calculated.

If the interpretation above is correct, this outcome is expected:

  • A strong positive correlation between PC1 and Brightness.
  • A strong positive correlation between PC2 and NDVI.
  • A significant correlation between PC3 and Iron Oxide ratio.
# Brightness Index
# Approximate physical brightness as the mean of the visible spectrum bands.
brightness <- (sen_crop$Blue + sen_crop$Green + sen_crop$Red) / 3
names(brightness) <- "Brightness"

# NDVI (Normalized Difference Vegetation Index)
# Standard index for vegetation biomass: (NIR - Red) / (NIR + Red)
ndvi <- (sen_crop$NIR_N - sen_crop$Red) / (sen_crop$NIR_N + sen_crop$Red)
names(ndvi) <- "NDVI"

# Iron Oxide Ratio
# Formula: Red / Blue
iron_oxide <- sen_crop$Red / sen_crop$Blue
names(iron_oxide) <- "Iron_Oxide"
# Stack the new PCA images with the calculated indices to treat them as a single dataset.
validation_stack <- c(pca_image$PC1, pca_image$PC2, pca_image$PC3, 
                      brightness, ndvi, iron_oxide)

set.seed(123)
val_sample <- spatSample(validation_stack, size = 50000, method = "random", na.rm = TRUE, values = TRUE)
val_cor <- cor(val_sample, method = "pearson")
results <- data.frame(
  Hypothesis = c("PC1 = Brightness?", "PC2 = Vegetation?", "PC3 = Iron Oxides?"),
  Standard_Index = c("Mean Vis", "NDVI", "Iron Oxide Ratio (Red/Blue)"),
  Pair = c("PC1 vs Brightness", "PC2 vs NDVI", "PC3 vs Iron Oxide"),
  Correlation = c(val_cor["PC1", "Brightness"],
                  val_cor["PC2", "NDVI"],
                  val_cor["PC3", "Iron_Oxide"])
)

print(results)
##           Hypothesis              Standard_Index              Pair Correlation
## 1  PC1 = Brightness?                    Mean Vis PC1 vs Brightness  0.89106529
## 2  PC2 = Vegetation?                        NDVI       PC2 vs NDVI  0.79271384
## 3 PC3 = Iron Oxides? Iron Oxide Ratio (Red/Blue) PC3 vs Iron Oxide  0.07913265

The correlation table provides quantitative evidence for our component assignment:

PC1 vs. Brightness: The extremely high correlation coefficient confirms that the first component is primarily driven by surface albedo and topography.

PC2 vs. NDVI: The strong positive correlation validates that PC2 effectively captures vegetation biomass, distinguishing reclaimed forest areas from the barren mining pit.

par(mfrow = c(1, 2))
plot(pca_image$PC2, main = "PC2 Component", col = gray.colors(100), axes = FALSE)
plot(ndvi, main = "Standard NDVI Index", col = gray.colors(100), axes = FALSE)

par(mfrow = c(1, 1))

PC3 vs. Iron Oxide Ratio: The correlation is low, but it does not mean that the analysis is wrong, but rather provides an important geological insight. The standard Iron Oxide Ratio (Red/Blue) is designed to detect ferric minerals (hematite/goethite) typical of metallic ore deposits. The Bełchatów mine extracts Lignite (Brown Coal), and its overburden consists mainly of Quaternary deposits (sands, clays, glacial tills), which are not dominant in iron oxides. Furthermore, the PCA False Color image above clearly shows the difference in the color palette in the mine area compared to the surrounding areas.

Conclusion

The objective of this study was to reduce the dimensionality of Sentinel-2 satellite imagery to facilitate the interpretation of the complex landscape of the Bełchatów Lignite Mine. The dataset was successfully reduced from 9 correlated spectral bands to just 3 Principal Components. This reduction retained over 99% of the original information, proving that the original 9 bands contained significant redundancy.

The PCA transformation proved to be highly effective in highlighting geological and vegetative features: PC1 successfully mapped the topographical structure and albedo, PC2 provided a clear separation between vegetation (reclaimed areas) and barren mining pits and PC3 revealed subtle variations in soil composition within the excavation pit, which are often indistinguishable in standard RGB images.

In summary, PCA served as a powerful tool for exploratory data analysis, allowing for a clearer visualization of the mining environment than standard true-color photography.

Bibliography