library(terra) # For raster data
library(ggplot2) # Visualization
library(corrplot) # Correlation matrices
library(psych) # KMO test
library(factoextra) # Plots
Remote sensing data is multidimensional by nature. Modern earth observation satellites, such as the Sentinel-2 constellation, capture the Earth’s surface not just in the visible spectrum (Red, Green, Blue), but across a wide range of electromagnetic wavelengths, including the “Red Edge” and Short-Wave Infrared (SWIR). While this gives us a lot of information, it creates a problem known as the “curse of dimensionality”. Adjacent spectral bands often carry highly correlated information (e.g., if a pixel is bright in Green, it is often bright in Red), leading to data redundancy.
In this project, I analyze the area of the Bełchatów Lignite Mine, one of the largest open-pit mines in Europe. This area is complex, featuring deep pits, mineral spoil heaps, water bodies, and vegetation.
Instead of discarding bands arbitrarily, which risks losing subtle spectral details, I used Principal Component Analysis (PCA). PCA allows to transform bands into a new set of uncorrelated variables. This technique reduces noise, compresses data, and helps visualize the underlying geological and vegetative structure of the terrain in a single 3-channel composite that are invisible in standard photos.
The analysis utilizes the Sentinel-2 Level-2A product. This dataset provides Bottom-Of-Atmosphere reflectance values, meaning the data has already been atmospherically corrected. Sentinel-2 captures images in 13 spectral bands with varying spatial resolutions (10m, 20m, and 60m). Data can be downloaded from here.
To perform a valid Principal Component Analysis, all input variables must share the same spatial domain and resolution. While the Level-2A product provides 10m bands resampled to 20m, not all captured bands are relevant for land surface analysis.
The table below summarizes all 13 available bands and their status in our analysis:
| Band | Name | Resolution | Description |
|---|---|---|---|
| B01 | Coastal Aerosol | 60 m | (ocean color/atmosphere) |
| B02 | Blue | 10 m | Standard visible blue. |
| B03 | Green | 10 m | Standard visible green. |
| B04 | Red | 10 m | Standard visible red. |
| B05 | Red Edge 1 | 20 m | A transition zone between Red and NIR. Very sensitive to small changes in plant chlorophyll. |
| B06 | Red Edge 2 | 20 m | Used to assess vegetation health and biomass content. |
| B07 | Red Edge 3 | 20 m | Similar to B06, helps in classifying different types of vegetation. |
| B08 | NIR (Broad) | 10 m | Similar to B8A |
| B8A | NIR (Narrow) | 20 m | Near Infrared. Plants reflect this strongly. It is crucial for separating biology (plants) from geology (rocks/soil). |
| B09 | Water Vapour | 60 m | (atmosphere) |
| B10 | SWIR / Cirrus | 60 m | (high clouds) |
| B11 | SWIR 1 | 20 m | Short-Wave Infrared. Sensitive to moisture content in soil and plants. Helps differentiate rock types. |
| B12 | SWIR 2 | 20 m | Longer wavelength infrared. Useful for identifying clay minerals and bare soil properties. |
I reduced the dataset from 13 to 9 input variables based on the following criteria:
safe_path <- "data/S2B_MSIL2A_20250630T095029_N0511_R079_T33UYS_20250630T120527.SAFE"
get_band <- function(path, band_name) {
pattern <- paste0("_", band_name, "_20m.jp2$")
file <- list.files(path, pattern = pattern, recursive = TRUE, full.names = TRUE)
if(length(file) == 0) stop(paste("Band", band_name, "not found!"))
return(file[1])
}
bands_list <- c("B02", "B03", "B04", "B05", "B06", "B07", "B8A", "B11", "B12")
files <- sapply(bands_list, function(b) get_band(safe_path, b))
sen_stack <- rast(files) # Load rasters as a multi-layer stack
names(sen_stack) <- c("Blue", "Green", "Red", "RE1", "RE2", "RE3", "NIR_N", "SWIR1", "SWIR2")
# Longitude: 19.20 - 19.36 E | Latitude: 51.20 - 51.28 N | For Bełchatów mine
e_geo <- ext(19.20, 19.36, 51.20, 51.28)
e_proj <- project(e_geo, from = "EPSG:4326", to = crs(sen_stack))
# Crop the raster stack to the area of interest
sen_crop <- crop(sen_stack, e_proj)
summary(sen_crop)
## Blue Green Red RE1 RE2
## Min. :1113 Min. :1105 Min. :1108 Min. :1074 Min. :1051
## 1st Qu.:1303 1st Qu.:1470 1st Qu.:1356 1st Qu.:1743 1st Qu.:2730
## Median :1486 Median :1707 Median :1697 Median :2155 Median :3109
## Mean :1661 Mean :1873 Mean :1908 Mean :2288 Mean :3134
## 3rd Qu.:1864 3rd Qu.:2077 3rd Qu.:2207 3rd Qu.:2568 3rd Qu.:3497
## Max. :5695 Max. :6391 Max. :7036 Max. :7609 Max. :7591
## RE3 NIR_N SWIR1 SWIR2
## Min. :1094 Min. :1094 Min. :1106 Min. :1085
## 1st Qu.:2988 1st Qu.:3170 1st Qu.:2464 1st Qu.:1856
## Median :3422 Median :3652 Median :3015 Median :2364
## Mean :3439 Mean :3646 Mean :3274 Mean :2727
## 3rd Qu.:3870 3rd Qu.:4129 3rd Qu.:3752 3rd Qu.:3120
## Max. :7616 Max. :7681 Max. :9066 Max. :8900
Sentinel-2 data is stored as integers (0–10000). I rescaled this to reflectance values (0–1) for proper interpretation.
sen_crop <- sen_crop / 10000
print(sen_crop)
## class : SpatRaster
## size : 477, 584, 9 (nrow, ncol, nlyr)
## resolution : 20, 20 (x, y)
## extent : 792880, 804560, 5680460, 5690000 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 33N (EPSG:32633)
## source(s) : memory
## names : Blue, Green, Red, RE1, RE2, RE3, ...
## min values : 0.1103, 0.1105, 0.1108, 0.1074, 0.1051, 0.1024, ...
## max values : 0.8888, 0.8961, 0.9599, 0.9410, 0.8126, 0.8671, ...
# True Color Composite (Red, Green, Blue)
plotRGB(sen_crop, r = 3, g = 2, b = 1, stretch = "lin", axes = TRUE)
Before applying dimension reduction, it is crucial to verify the relationships between the spectral bands. Since calculating the correlation matrix for millions of pixels is computationally intensive, I calculated the Pearson correlation coefficients based on a random sample of 50,000 pixels drawn from the cropped image. Also I checked if the dataset is suitable for use of PCA by using Kaiser-Meyer-Olkin test that compares the magnitude of observed correlation coefficients to partial correlation coefficients. A value close to 1 indicates that the sum of partial correlations is small compared to the sum of correlations.
set.seed(123)
pixel_sample <- spatSample(sen_crop, size = 50000, method = "random", na.rm = TRUE, values = TRUE)
cor_matrix <- cor(pixel_sample, method = "pearson")
KMO(cor_matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_matrix)
## Overall MSA = 0.83
## MSA for each item =
## Blue Green Red RE1 RE2 RE3 NIR_N SWIR1 SWIR2
## 0.84 0.83 0.88 0.85 0.84 0.75 0.81 0.83 0.83
The value is greater than 0.8, what confirms that the dataset is suitable for the use of PCA.
corrplot(cor_matrix, type = "lower",
order = "hclust",
addCoef.col = "black",
tl.col = "black", tl.srt = 45, # Label rotation
number.cex = 0.7, # Font size
title = "Spectral Band Correlation Matrix",
mar = c(0,0,2,0)) # Margin adjustment for title
Principal Component Analysis is a linear transformation technique. It works by calculating the eigenvectors and eigenvalues of the data’s covariance matrix. Imagine n-dimensional cloud of data points. PCA rotates this cloud in such a way that the first axis (PC1) aligns with the direction of the greatest variation in the data (maximum variance). The second axis (PC2) is perpendicular (orthogonal) to the first and captures the second greatest variance. This process repeats for all dimensions.
The result is a set of new variables (Principal Components) that are uncorrelated. The first few components usually contain most of the total variation, while the later components can be skipped without lose of much information.
Unlike the correlation matrix, here I calculated the PCA model on the
entire dataset. This ensures capturing every detail of the terrain. I
used the standard prcomp function form package stats on the
normalized spectral data. Additionally, since all bands measure the same
physical unit (reflectance), I wanted to preserve the natural magnitude
of signals (e.g., vegetation is naturally very bright in NIR) - that is
why scale = FALSE.
sen_df <- values(sen_crop, dataframe = TRUE, na.rm = TRUE)
pca_model <- prcomp(sen_df, center = TRUE, scale. = FALSE)
summary(pca_model)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 0.2068 0.1013 0.05625 0.01474 0.01086 0.00701 0.005289
## Proportion of Variance 0.7550 0.1813 0.05589 0.00384 0.00208 0.00087 0.000490
## Cumulative Proportion 0.7550 0.9364 0.99224 0.99607 0.99816 0.99902 0.999520
## PC8 PC9
## Standard deviation 0.004249 0.003044
## Proportion of Variance 0.000320 0.000160
## Cumulative Proportion 0.999840 1.000000
fviz_eig(pca_model, addlabels = TRUE, ylim = c(0, 100), main = "Explained Variance")
First three components (PC1, PC2, PC3) explain over 99% of the total variance.
To understand what these new components represent physically, I analyzed the Loadings (Eigenvectors) by following the methodology described by Richards & Jia (2006):
PC1 (Brightness): In satellite imagery, the first component usually represents overall Albedo (Brightness). It can be spotted when all bands have the same sign (all positive or all negative) in the loadings table.
PC2 & PC3 (Spectral Contrast): These components reveal spectral features. It can be distinguished by looking for bands with opposite signs (e.g., positive NIR vs. negative Red) to identify what the component detects (e.g., Vegetation).
var <- get_pca_var(pca_model)
print(var$coord[, 1:3], digits = 3)
## Dim.1 Dim.2 Dim.3
## Blue 0.0421 -0.01271 -0.02158
## Green 0.0496 -0.00899 -0.02327
## Red 0.0649 -0.01804 -0.02370
## RE1 0.0659 -0.00584 -0.02037
## RE2 0.0444 0.04244 -0.00722
## RE3 0.0407 0.05672 -0.00208
## NIR_N 0.0436 0.06085 0.00416
## SWIR1 0.1076 -0.00910 0.02556
## SWIR2 0.1134 -0.02937 0.02131
fviz_pca_var(pca_model,
col.var = "contrib", # Color variables by their contribution
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE,
title = "Variables - PCA Factor Map (PC1 vs PC2)")
PC1: All values are positive. This confirms PC1 represents Brightness. It captures the intensity of light reflected by the surface (hills are bright, shadows are dark) and is responsible for most of the total variance (75.5%).
PC2: There is a strong contrast. Visible bands are negative, while NIR/RedEdge bands are positive. This is the spectral signature of Vegetation. PC2 separates plants (high NIR) from bare soil/water.
PC3: Here, SWIR bands are positive, while others are negative or close to zero. This component highlights Soil/Mineral differences (e.g., moisture or clay content).
pca_image <- predict(sen_crop, pca_model, index = 1:3)
names(pca_image) <- c("PC1", "PC2", "PC3")
Below, there is a map the three new components back to an image. This creates a “False Color Composite” where colors have specific meanings:
Red Channel = PC1 (Brightness/Terrain)
Green Channel = PC2 (Vegetation)
Blue Channel = PC3 (Soils/Minerals)
plotRGB(pca_image, r = 1, g = 2, b = 3,
stretch = "lin",
axes = TRUE)
It is clear that the mine area is characterized by a palette of light purple colors and differs from the area of the power plant, the ash storage site, or the surrounding fields.
par(mfrow = c(1, 2))
plotRGB(sen_crop, r = 3, g = 2, b = 1,
stretch = "lin",
axes = TRUE,
main = "Standard True Color (RGB)")
plotRGB(pca_image, r = 1, g = 2, b = 3,
stretch = "lin",
axes = TRUE,
main = "PCA False Color (PC1-PC2-PC3)")
par(mfrow = c(1, 1))
To scientifically validate hypotheses stated above, a cross-validation is performed. Established physical spectral indices: Iron Oxide ratio, NDVI (Vegetation - Band ratio parameter in Rouse et al. 1974) and Mean Brightness were calculated.
If the interpretation above is correct, this outcome is expected:
# Brightness Index
# Approximate physical brightness as the mean of the visible spectrum bands.
brightness <- (sen_crop$Blue + sen_crop$Green + sen_crop$Red) / 3
names(brightness) <- "Brightness"
# NDVI (Normalized Difference Vegetation Index)
# Standard index for vegetation biomass: (NIR - Red) / (NIR + Red)
ndvi <- (sen_crop$NIR_N - sen_crop$Red) / (sen_crop$NIR_N + sen_crop$Red)
names(ndvi) <- "NDVI"
# Iron Oxide Ratio
# Formula: Red / Blue
iron_oxide <- sen_crop$Red / sen_crop$Blue
names(iron_oxide) <- "Iron_Oxide"
# Stack the new PCA images with the calculated indices to treat them as a single dataset.
validation_stack <- c(pca_image$PC1, pca_image$PC2, pca_image$PC3,
brightness, ndvi, iron_oxide)
set.seed(123)
val_sample <- spatSample(validation_stack, size = 50000, method = "random", na.rm = TRUE, values = TRUE)
val_cor <- cor(val_sample, method = "pearson")
results <- data.frame(
Hypothesis = c("PC1 = Brightness?", "PC2 = Vegetation?", "PC3 = Iron Oxides?"),
Standard_Index = c("Mean Vis", "NDVI", "Iron Oxide Ratio (Red/Blue)"),
Pair = c("PC1 vs Brightness", "PC2 vs NDVI", "PC3 vs Iron Oxide"),
Correlation = c(val_cor["PC1", "Brightness"],
val_cor["PC2", "NDVI"],
val_cor["PC3", "Iron_Oxide"])
)
print(results)
## Hypothesis Standard_Index Pair Correlation
## 1 PC1 = Brightness? Mean Vis PC1 vs Brightness 0.89106529
## 2 PC2 = Vegetation? NDVI PC2 vs NDVI 0.79271384
## 3 PC3 = Iron Oxides? Iron Oxide Ratio (Red/Blue) PC3 vs Iron Oxide 0.07913265
The correlation table provides quantitative evidence for our component assignment:
PC1 vs. Brightness: The extremely high correlation coefficient confirms that the first component is primarily driven by surface albedo and topography.
PC2 vs. NDVI: The strong positive correlation validates that PC2 effectively captures vegetation biomass, distinguishing reclaimed forest areas from the barren mining pit.
par(mfrow = c(1, 2))
plot(pca_image$PC2, main = "PC2 Component", col = gray.colors(100), axes = FALSE)
plot(ndvi, main = "Standard NDVI Index", col = gray.colors(100), axes = FALSE)
par(mfrow = c(1, 1))
PC3 vs. Iron Oxide Ratio: The correlation is low, but it does not mean that the analysis is wrong, but rather provides an important geological insight. The standard Iron Oxide Ratio (Red/Blue) is designed to detect ferric minerals (hematite/goethite) typical of metallic ore deposits. The Bełchatów mine extracts Lignite (Brown Coal), and its overburden consists mainly of Quaternary deposits (sands, clays, glacial tills), which are not dominant in iron oxides. Furthermore, the PCA False Color image above clearly shows the difference in the color palette in the mine area compared to the surrounding areas.
The objective of this study was to reduce the dimensionality of Sentinel-2 satellite imagery to facilitate the interpretation of the complex landscape of the Bełchatów Lignite Mine. The dataset was successfully reduced from 9 correlated spectral bands to just 3 Principal Components. This reduction retained over 99% of the original information, proving that the original 9 bands contained significant redundancy.
The PCA transformation proved to be highly effective in highlighting geological and vegetative features: PC1 successfully mapped the topographical structure and albedo, PC2 provided a clear separation between vegetation (reclaimed areas) and barren mining pits and PC3 revealed subtle variations in soil composition within the excavation pit, which are often indistinguishable in standard RGB images.
In summary, PCA served as a powerful tool for exploratory data analysis, allowing for a clearer visualization of the mining environment than standard true-color photography.