Abstract

The project aims at utilizing night light data for clustering purpose. Till now as for geo-spatial satellite data utilized in planning purpose, only census data, day-light images and geographical terrain images are considered crucial. But, as for urban planning, we ignore the crucial lens of night light data which can easily be utilized to identify and segregate areas as urban, industrial, rural and un-inhabited.

The results derived from this kind of clustering can be advantageous in cases of designing evacuation plans for natural calamities, identifying industrial belt vicinity to urban households and accordingly using the results for urban planning and for better clarity in segregating forests and green cover in an area from farms and human made artificial green belts (since natural green belts will have near zero night light intensity).

Keywords : night lights, DMSP‑OLS, geospatial clustering, urban mapping, remote sensing, k‑means, DBSCAN

Installing Necessary R Libraries or Packages

library(terra); # library for reading GeoTIFF data
## Warning: package 'terra' was built under R version 4.5.2
library(tidyverse); # library utilized for ggplot2 and other functions
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
library(dbscan);
## Warning: package 'dbscan' was built under R version 4.5.2
# library(sf);
library(ClusterR); # library imported to utilize Optimal_Clusters_KMeans function for silhouette score calculation and also for obtaining elbow point method results
## Warning: package 'ClusterR' was built under R version 4.5.2
#library(clustertend); Depricated and suggestion to utilize hopkins library
library(hopkins); # library imported to utilize hopkins function to check data clusterability
## Warning: package 'hopkins' was built under R version 4.5.2
library(fpc); # library for calinhara function import to calculate Calinski-Harabasz index
## Warning: package 'fpc' was built under R version 4.5.2
# library(factoextra);

Dataset description

Reconstructed Poland Maps Data using NOAA/DMSP-OLS/NIGHTTIME_LIGHTS dataset in Google Earth Engine with all country polygons for Night Light data

Javascript code used to retrieve the dataset and construct maps

The javascript code attached is executed in Google Earth Engine code editor and is utilized for extracting required dataset for 2012 for clustering purpose and other GeoTIFF or tif files showcased in the paper

For the sake of simplicity and to avoid results for getting influenced by nightlights from neighbouring countries, a black mask layer (typically 0 intensity for our nightlight setup) is added in the map output window (code for which can be traced). Also geo-spatial data outside the boundaries of Poland is eliminated for country specific clustering results.

Data is classified at Voivodeship levels (16 Provices which Poland is divided into) also to track the magnitude of industrial and urban activity per entity. The Voivodeship boundary data or tagging for a particular coordinate on the map with respective Province value is obtained from the FAO/GAUL dataset available from the year 2015.

  // Country Boundaries Loaded
  var countries = ee.FeatureCollection("USDOS/LSIB_SIMPLE/2017");
  
  // Poland Boundary Data filtered
  var poland = countries.filter(ee.Filter.eq("country_na", "Poland"));
  
  // Neighboring Country Boundaries filtered
  var neighbors = countries
  .filterBounds(poland.geometry())
  .filter(ee.Filter.neq("country_na", "Poland"));
  
  // Voivodeships Data filtered using GAUL Level 1 Data Set
  var gaul = ee.FeatureCollection("FAO/GAUL/2015/level1");
  var voivodeships = gaul.filter(ee.Filter.eq("ADM0_NAME", "Poland"));
  
  // Setting up width of Poland and respective voivodeship boundaries
  var NEON_BOUNDARY_VALUE = 200;
  var BOUNDARY_WIDTH_PIXELS = 2.25;
  
  var region_of_interest_rect = ee.Geometry.Rectangle([13.0, 48.0, 25.0, 56.0]);
  
  // Loading DMSP‑OLS night lights data from dataset in Google Earth Engine
  var dmspCollection = ee.ImageCollection("NOAA/DMSP-OLS/NIGHTTIME_LIGHTS")
  .filterBounds(region_of_interest_rect);
  
  // Setting up year values for median night light data extraction
  var year_to_process = 2012;
  var startDate = ee.Date(year_to_process + '-01-01');
  var endDate = ee.Date(year_to_process + '-12-31');
  
  var yearImages = dmspCollection.filterDate(startDate, endDate);
  
  if (yearImages.size().getInfo() === 0) {
    print("No DMSP data for " + year_to_process);
  } else {
    // Taking median of the data for entire year
    var nightLightsComposite = yearImages.median()
      .select("stable_lights")
      .clip(poland.geometry());
  
    // Making neighbouring countries night light value 0
    var neighborsMask = ee.Image().byte().paint(neighbors, 1);
    var nightLightsWithBlackNeighbors = nightLightsComposite.where(neighborsMask.eq(1), 0);
  
    // Increasing contrast for better visualization
    var stretchedLights = nightLightsWithBlackNeighbors
      .unitScale(0, 63)   // normalize 0–63 → 0–1
      .multiply(150);     // stretch to 0–150 for stronger contrast
  
    var emptyCanvas = ee.Image(0).rename("stable_lights").byte();
    
    // Including boundaries and night light value
    var polandOutlineRaster = emptyCanvas.paint({
      featureCollection: poland,
      color: NEON_BOUNDARY_VALUE,
      width: BOUNDARY_WIDTH_PIXELS
    });
  
    var voivodeshipsOutlineRaster = emptyCanvas.paint({
      featureCollection: voivodeships,
      color: NEON_BOUNDARY_VALUE,
      width: BOUNDARY_WIDTH_PIXELS
    });
  
    var finalImageForExport = stretchedLights
      .unmask(0)
      .max(polandOutlineRaster)
      .max(voivodeshipsOutlineRaster);
  
    // Color palette and visualization parameters chosen for night lights data
    var finalVisParams = {
      min: 0,
      max: 200,
      palette: [
        'black','blue','purple','cyan','orange','yellow','red','white',
        '#00FF00'
      ]
    };
  
    // Displaying data in Google Earth Engine maps for cross-verification prior to export
    Map.addLayer(finalImageForExport, finalVisParams,
                 "High‑Contrast Night Lights " + year_to_process);
  
    var centerPoint = poland.geometry().centroid();
    Map.setCenter(centerPoint.coordinates().get(0).getInfo(),
                  centerPoint.coordinates().get(1).getInfo(), 6);
  
    // Exporting map images as prepared in previous steps
    Export.image.toDrive({
      image: finalImageForExport,
      description: 'NightLights_' + year_to_process + '_high_contrast',
      folder: 'EarthEngineExports',
      fileNamePrefix: 'nightlights_' + year_to_process + '_high_contrast',
      region: region_of_interest_rect,
      scale: 500,
      maxPixels: 1e13,
      fileFormat: 'GeoTIFF'
    });
    
    var imageWithCoords = finalImageForExport
      .rename("night_light")
      .addBands(ee.Image.pixelLonLat());
    
    
    var samples = imageWithCoords.sampleRegions({
      collection: voivodeships,
      properties: ['ADM1_NAME'],   // province name according to dataset
      scale: 1000,
      geometries: false
    }).map(function (f) {
      return f.set({
        year: year_to_process,
        province: f.get('ADM1_NAME')
      });
    });
    
    // Exporting Longitude, Latitude, Province Name, Night Light intensity data for clustering
    Export.table.toDrive({
      collection: samples,
      description: 'NightLights_CSV_' + year_to_process,
      folder: 'EarthEngineExports',
      fileNamePrefix: 'nightlights_' + year_to_process + '_pixels',
      fileFormat: 'CSV'
    });
  }

Visualizing the YoY data (with a gap of 2 years) for better clarity and validation of dataset.

Plotting the datasets in Google Earth Engine itself with map overlay and observing the same to validate an increase in Night Light intensity over a period from year 2000 to 2012.

During the mentioned period, Poland saw a major growth in the population for core cities like Warsaw, Kraków, Wrocław, Poznań, and the Tri‑City. The population inflows can be attributed to commercial construction, development in infrastructure and rapid industry growth which further contributed to increasing light intensity in these specific regions.

Some other reasons for increased night light intensity can be - - Energy grid reliability due to technical advancements across the globe. - Rapid Highways (A1, A2, A4) expansion during the mentioned decade. - Shopping centers, logistics hubs, and business parks expanded rapidly, especially after 2005 attributing to the IT revolution across the globe.

# Dividing screen into 9 frames
# windows(width = 12, height = 10);
par(mfrow = c(3, 3));

for(i in seq(2000, 2012, 2)){
  rel_path = paste0("./2000-2012GeospatialNightLightData/nightlights_", i, ".tif");
  r <- rast(rel_path);
  plot(r, main = paste("Year", i))
}

Data Import and Cleanup

Importing Dataset extracted for 2012 and observing summary for the same

summary and str can help us understand the dataset structure and accordingly either tweak column data types (for example tweaking Province column datatype from Character to Factor since the same contains repetition of 16 Voivodeship values from Poland).

The same can help in reducing time for producing plots owing to better linked and mastered data. Also columns with repetitive or similar values were dropped.

df <- read_csv("./2000-2012GeospatialNightLightData/nightlights_2012_pixels.csv");
## Rows: 507207 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): system:index, ADM1_NAME, province, .geo
## dbl (4): latitude, longitude, night_light, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(df);
## spc_tbl_ [507,207 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ system:index: chr [1:507207] "000000000000000003c0_0" "000000000000000003c0_1" "000000000000000003c0_2" "000000000000000003c0_3" ...
##  $ ADM1_NAME   : chr [1:507207] "Dolnoslaskie" "Dolnoslaskie" "Dolnoslaskie" "Dolnoslaskie" ...
##  $ latitude    : num [1:507207] 50.1 50.1 50.1 50.1 50.1 ...
##  $ longitude   : num [1:507207] 16.7 16.7 16.7 16.7 16.7 ...
##  $ night_light : num [1:507207] 200 200 200 200 200 200 200 200 200 200 ...
##  $ province    : chr [1:507207] "Dolnoslaskie" "Dolnoslaskie" "Dolnoslaskie" "Dolnoslaskie" ...
##  $ year        : num [1:507207] 2012 2012 2012 2012 2012 ...
##  $ .geo        : chr [1:507207] "{\"type\":\"MultiPoint\",\"coordinates\":[]}" "{\"type\":\"MultiPoint\",\"coordinates\":[]}" "{\"type\":\"MultiPoint\",\"coordinates\":[]}" "{\"type\":\"MultiPoint\",\"coordinates\":[]}" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `system:index` = col_character(),
##   ..   ADM1_NAME = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   night_light = col_double(),
##   ..   province = col_character(),
##   ..   year = col_double(),
##   ..   .geo = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(df$night_light); # data should be in range from 0 to 200
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   19.05   26.19   43.97   47.62  200.00

Data Cleanup

Column Name being dropped Reason for Dropping Column
system:index The same is an ID value added in Google Earth Engine Data Set NOAA/DMSP-OLS/NIGHTTIME_LIGHTS which isn’t of much relevance to us.
ADM1_NAME Redundant data since column value is already added by us in Province column.
.geo Empty and common JSON object for all data set entries.
year Only value is 2012 since data for 2012 is only extracted in CSV format from Google Earth Engine - Remaining years data is just used for visualization and verification of growth trend.
# print(summary(df$year)) # Only value 2012 thus, year column can also be dropped
df <- df[ , !(names(df) %in% c("system:index", ".geo", "ADM1_NAME", "year"))];
print(head(df, 5));
## # A tibble: 5 × 4
##   latitude longitude night_light province    
##      <dbl>     <dbl>       <dbl> <chr>       
## 1     50.1      16.7         200 Dolnoslaskie
## 2     50.1      16.7         200 Dolnoslaskie
## 3     50.1      16.7         200 Dolnoslaskie
## 4     50.1      16.7         200 Dolnoslaskie
## 5     50.1      16.7         200 Dolnoslaskie

Converting Province column value to factor data type since Poland is bifurcated into only 16 Provinces (or voivodeships)

df$province <- as.factor(df$province)
print(head(df, 5));
## # A tibble: 5 × 4
##   latitude longitude night_light province    
##      <dbl>     <dbl>       <dbl> <fct>       
## 1     50.1      16.7         200 Dolnoslaskie
## 2     50.1      16.7         200 Dolnoslaskie
## 3     50.1      16.7         200 Dolnoslaskie
## 4     50.1      16.7         200 Dolnoslaskie
## 5     50.1      16.7         200 Dolnoslaskie

Log transforming the data to reduce skewness and for scaling. Selecting only columns with numerical values eliminating Province (or voivodeship column) since the information for the same can be reconstructed using longitude and latitude column values.

df_scaled <- scale(select(mutate(df, night_light_log = log1p(night_light)), latitude, longitude, night_light_log));

df_scaled <- as.data.frame(df_scaled); # Converting df_scaled from matrix (result of utilization of scale function) to dataframe

df_scaled$original_longitude <- df$longitude;
df_scaled$original_latitude <- df$latitude;
df_scaled$original_night_light <- df$night_light;
df_scaled$province <- df$province;
# print(head(df_scaled, 5));

Accounting to large size of the dataset, we are sampling 11000 random values from the same for ease of clustering.

set.seed(11);
df_sample <- df_scaled[sample(nrow(df_scaled), 11000), ];

df_sample_selected_columns <- df_sample[, c("latitude", "longitude", "night_light_log")];
# print(head(df_sample, 5));

Verifying the new dimensions with PCA

PCA (Principal Component Analysis) being a strong pre-clustering tool is utilized to determine the following facts -

  • The dimensions or columns of data used for clustering (longitude, latitude and night_light_intensity fields) whats is the amount of correlation between them.
  • To determine the amount of noise in data according to variance value. More the noise lesser is the clusterability of the data - or lesser separable is the data.
pca_res <- prcomp(df_sample_selected_columns, scale. = TRUE);

print(summary(pca_res));
## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.1550 0.9796 0.8406
## Proportion of Variance 0.4446 0.3198 0.2355
## Cumulative Proportion  0.4446 0.7645 1.0000

Summarizing the PCA results - - PC1 in general accounts for biggest source of variation in the data or direction along which data is most spread. In current case the general pattern of differentiating brigter from darker data points might be considered for PC1. - PC2 and PC3 consecutively capture the next most important dimension or parameter which PC1 missed. PC2 might be utilizing other indicators like differentiating areas as Industrial, Urban, Rural or Un-Inhabited. - PC1 and PC2 alongside combine to form 44 + 32 = 76 percent indicating high variance along these basic dimensions. - The combined high value of PC1 and PC2 is an indicator of dominant patterns in data. The data is low dimensional and not spread randomly. The same is an ideal indicator for clusterability. - A combined value of PC1 and PC2 less than 10 percent indicate highly noisy data and that clutering results from the same will be unreliable. - A too high PC1 value equivalent to 90 percent also is a bad indicator for clustering since the same indicates that the data is mostly linear.

As for this paper PCA is only being utilized to judge the data quality.

Clustering of data (Approach I - K-Means Clustering Algorithm)

Determining the ideal number of clusters and also validating the clusterability of the dataset

Deciding on ideal number of clusters using Silhouette Score and Elbow Point method

Silhouette is a score assigned to a data point / object for rating its presence in the current cluster based on average distance from data points / objects of current cluster and nearest cluster. a(x) is the average of distance from all points in current cluster and b(x) is the average distance from all points of the nearest cluster.

Formula for the same - Silhouette Score = (b(x) - a(x)) / max(a(x), b(x))

The value of Silhouette Score generally ranges from -1 to 1.

  • Closeness to -1 is an indicator that the data point / object is incorrectly allocated to current cluster.
  • 0 is an indicator that the data point / object is on the borderline of 2 clusters (and might belong to either of them).
  • Closeness to +1 is an indicator that the data point / object fits well in currently allocated cluster and is far enough from other clusters.

Elbow point is another method for locating the ideal number of clusters for a particular dataset. It helps us to identify on how much do we benefit when we increase the number of clusters and accordingly we can select a point that stands out and increasing number of clusters beyond which doesn’t contribute much.

For each number of clusters value, we compute the WCSS (Within-Cluster Sum of Squares) also known as inertia or distortion.

The same is a measure of how tightly the data points / objects in a cluster are grouped around the selected centroid. Low WCSS indicates more tight grouping around the centroid. Before the Elbow Point (or ideal number of cluster selection), WCSS value is reducing dynamically with each subsequent increase in number of clusters. After the elbow point, we just notice minimal benefits with increase in number of clusters and are just overfitting the noise in the model.

res_elbow_point <- Optimal_Clusters_KMeans(
  df_sample_selected_columns,
  max_clusters = 20,
  criterion = "WCSSE"
);

res_silhouette <- Optimal_Clusters_KMeans(
  df_sample_selected_columns,
  max_clusters = 20,
  criterion = "silhouette"
);

par(mfrow = c(1, 2));

plot(res_elbow_point, main = "Elbow Point");
plot(res_silhouette, main = "Silhouette");

As per silhouette score, 4 appears to be the ideal number of clusters (maximum silhouette score of 0.33 - as per visual comparison), post 4, similar silhouette score value (although lower in magnitude - 0.32) is obtained only at 13 (increased number of cluster groups - thus, the same can be discarded).

Plotting the log normalized data set for better visualization

ggplot(df_sample, aes(original_longitude, original_latitude, color = night_light_log)) + geom_point()

Using Hopkins Statistics to determine the clusterability of dataset (passing df_sample accounting to the large size of data set).

hopkins_statistics <- hopkins(df_sample_selected_columns);
print(hopkins_statistics);
## [1] 0.9982452

Hopkins Statistics value of 0.998 indicates high clusterability value for the nightlight dataset.

Proceeding for actual clustering to select the best algorithm and accordingly interpreting the results of clustering

set.seed(11); # to ensure replicability of results

k <- 4; # chosen as per silhouette score 

kmeans_results <- kmeans(df_sample_selected_columns, centers = k, nstart = 11); # nstart indicates the number of random initializations of cluster centers before actually choosing the best solution thus, setting it to 11 for sake of simplicity and speed. The same is a trade-off between speed and stability, higher the value of nstart more stable the model predictions and lower the speed, vice-versa for other combination.

df_sample$cluster <- factor(kmeans_results$cluster);
print(head(df_sample, 5));
##          latitude  longitude night_light_log original_longitude
## 65570  -0.6758266  0.0208703       0.3791067           19.45302
## 459673  0.4869804 -1.1196877      -0.2857201           16.58739
## 281148  0.3490202  1.2007579       1.6085076           22.41746
## 335756  1.6760654 -0.5225931       0.1758876           18.08758
## 291030  0.6906358  1.5475733       0.2841053           23.28882
##        original_latitude original_night_light      province cluster
## 65570           51.19948             40.47619       Lodzkie       3
## 459673          52.78950             16.66667 Wielkopolskie       4
## 281148          52.60085            200.00000     Podlaskie       1
## 335756          54.41545             30.95238     Pomorskie       4
## 291030          53.06798             35.71429     Podlaskie       1

Plotting K-means algorithm clustering results for visualization and comparison of results

ggplot(df_sample, aes(original_longitude, original_latitude)) + geom_point(aes(color = original_night_light, shape = cluster));

Drawing conclusions from results obtained by K-means clustering

Calculating Mean Night Light intensity per cluster to draw conclusions and also intensity per province according to clusters for more details.

aggregate(original_night_light ~ cluster, data = df_sample, mean);
##   cluster original_night_light
## 1       1             38.00939
## 2       2              0.00000
## 3       3             61.51085
## 4       4             45.34555
aggregate(original_night_light ~ province + cluster, data = df_sample, mean);
##               province cluster original_night_light
## 1   Kujawsko-Pomorskie       1             39.77933
## 2              Lodzkie       1             29.49309
## 3            Lubeiskie       1             34.23859
## 4          Mazowieckie       1             43.07585
## 5            Podlaskie       1             34.46574
## 6            Pomorskie       1             48.62637
## 7  Warminsko-Mazurskie       1             34.11403
## 8         Dolnoslaskie       2              0.00000
## 9   Kujawsko-Pomorskie       2              0.00000
## 10             Lodzkie       2              0.00000
## 11           Lubeiskie       2              0.00000
## 12            Lubuskie       2              0.00000
## 13          Malopolske       2              0.00000
## 14         Mazowieckie       2              0.00000
## 15            Opolskie       2              0.00000
## 16        Podkarpackie       2              0.00000
## 17           Podlaskie       2              0.00000
## 18           Pomorskie       2              0.00000
## 19             Slaskie       2              0.00000
## 20      Swietokrzyskie       2              0.00000
## 21 Warminsko-Mazurskie       2              0.00000
## 22       Wielkopolskie       2              0.00000
## 23 Zachodnio-Pomorskie       2              0.00000
## 24        Dolnoslaskie       3             74.14369
## 25             Lodzkie       3             60.44767
## 26           Lubeiskie       3             48.35476
## 27          Malopolske       3             66.14494
## 28         Mazowieckie       3             64.89796
## 29            Opolskie       3             59.53953
## 30        Podkarpackie       3             56.69203
## 31             Slaskie       3             83.04020
## 32      Swietokrzyskie       3             55.21570
## 33       Wielkopolskie       3            154.08163
## 34        Dolnoslaskie       4             49.25080
## 35  Kujawsko-Pomorskie       4             44.09452
## 36             Lodzkie       4             45.10148
## 37            Lubuskie       4             45.94719
## 38         Mazowieckie       4            200.00000
## 39            Opolskie       4             21.42857
## 40           Pomorskie       4             42.57589
## 41       Wielkopolskie       4             46.58170
## 42 Zachodnio-Pomorskie       4             42.19866

Interpretation of results obtained by applying K-means clustering algorithm

Cluster Interpretation
Cluster 1 Semi‑urban or mixed‑development rural areas with moderate night‑light intensity.
Cluster 2 Uninhabited or sparsely populated regions such as forests, farms, lakes, and other natural landscapes.
Cluster 3 Extremely bright zones with continuous activity, including commercial districts and major industrial belts.
Cluster 4 Urban residential and metropolitan areas with high illumination, including major cities and surrounding suburbs.
Province Interpretation
Dolnośląskie Bright urban + high‑intensity zones (Wrocław) mixed with dark rural areas.
Kujawsko‑Pomorskie Moderate brightness with pockets of high‑intensity urban areas (Bydgoszcz, Toruń).
Łódzkie Full spectrum from dark rural to high‑intensity urban core (Łódź).
Lubelskie Moderate to bright regions, fewer extreme peaks; centered around Lublin.
Lubuskie Mostly dark rural areas with moderately bright cities (Zielona Góra, Gorzów).
Małopolskie Bright urban regions (Kraków) with surrounding rural areas.
Mazowieckie Contains Poland’s brightest region (200) — Warsaw; strong mix of all cluster types.
Opolskie Mostly rural with bright urban pockets around Opole.
Podkarpackie Bright regional center (Rzeszów) but no extreme metropolitan peaks.
Podlaskie Moderate brightness with extensive dark forested areas.
Pomorskie Moderate to high brightness; Tri‑City drives the intensity.
Śląskie One of the brightest industrial regions (Katowice–Gliwice belt).
Świętokrzyskie Moderate brightness centered around Kielce; no extreme peaks.
Warmińsko‑Mazurskie Low population density; moderate brightness with large dark lake regions.
Wielkopolskie Very high brightness (154 in Cluster 3) due to Poznań and industrial corridors.
Zachodniopomorskie Dark rural areas with bright coastal and urban centers (Szczecin).

Calculating Calinski-Harabasz index

Calinski-Harabasz index is used to calculate how well clustering results obtained actually classify the data. It is a ration of dispersion between the different cluster points and dispersion within the cluster points.

ch_score <- calinhara(df_sample_selected_columns, kmeans_results$cluster);
print(ch_score);
## [1] 5997.872

The score obtained is high indicating ideal clustering results as a second reference, comparing the KMeans Clustering results with DBSCAN Clustering results.

Clustering of data (Approach II - DBSCAN Clustering Algorithm)

DBSCAN clustering

eps_value <- 0.367; # radius of neighbourhood around each point. If another point lies within the radius value then is considered a neighbour. Smaller epsilon value indicates that only close points become neighbours while if epsilon value is large then, the cluster size grows quickly and may merge.
minPts_value <- 18; # the same indicates the minimum number of points required to form a cluster center. If a point has at least minimum number of neighbours then, the same becomes a cluster center and subsequently can influence other points to form a cluster. Points with lesser than minimum point value of neighbours account as border points or noise.
dbscan_results <- dbscan::dbscan(df_sample_selected_columns, eps = eps_value, minPts = minPts_value); 

# Adding cluster labels to your original df
df_sample$dbscan_cluster <- as.factor(dbscan_results$cluster);

print(head(df_sample, 5));
##          latitude  longitude night_light_log original_longitude
## 65570  -0.6758266  0.0208703       0.3791067           19.45302
## 459673  0.4869804 -1.1196877      -0.2857201           16.58739
## 281148  0.3490202  1.2007579       1.6085076           22.41746
## 335756  1.6760654 -0.5225931       0.1758876           18.08758
## 291030  0.6906358  1.5475733       0.2841053           23.28882
##        original_latitude original_night_light      province cluster
## 65570           51.19948             40.47619       Lodzkie       3
## 459673          52.78950             16.66667 Wielkopolskie       4
## 281148          52.60085            200.00000     Podlaskie       1
## 335756          54.41545             30.95238     Pomorskie       4
## 291030          53.06798             35.71429     Podlaskie       1
##        dbscan_cluster
## 65570               1
## 459673              1
## 281148              1
## 335756              1
## 291030              1

Plotting DBSCAN algorithm clustering results for visualization and comparison of results

ggplot(df_sample, aes(original_longitude, original_latitude)) + geom_point(aes(color = original_night_light, alpha = dbscan_cluster), size = 1) + scale_color_viridis_c() + scale_alpha_discrete(range = c(0.4, 1)) + labs(title = "DBSCAN Clustering Results", x = "Longitude", y = "Latitude", alpha = "DBSCAN Cluster", color = "Night Light")
## Warning: Using alpha for a discrete variable is not advised.

Drawing conclusions from results obtained by DBSCAN clustering

Calculating Mean Night Light intensity per cluster to draw conclusions and also intensity per province according to clusters for more details.

aggregate(original_night_light ~ dbscan_cluster, data = df_sample, mean);
##   dbscan_cluster original_night_light
## 1              0             77.06767
## 2              1             48.09142
## 3              2              0.00000
## 4              3              0.00000
## 5              4              0.00000
## 6              5            195.05494
## 7              6            200.00000
aggregate(original_night_light ~ province + dbscan_cluster, data = df_sample, mean);
##               province dbscan_cluster original_night_light
## 1         Dolnoslaskie              0              0.00000
## 2            Lubeiskie              0             86.79654
## 3           Malopolske              0              0.00000
## 4          Mazowieckie              0             63.09524
## 5         Podkarpackie              0             61.53846
## 6            Podlaskie              0            169.44444
## 7              Slaskie              0              0.00000
## 8  Warminsko-Mazurskie              0            133.65079
## 9         Dolnoslaskie              1             51.38769
## 10  Kujawsko-Pomorskie              1             43.28542
## 11             Lodzkie              1             52.51610
## 12           Lubeiskie              1             42.98993
## 13            Lubuskie              1             45.94719
## 14          Malopolske              1             66.14494
## 15         Mazowieckie              1             47.96059
## 16            Opolskie              1             57.74184
## 17        Podkarpackie              1             55.55240
## 18           Podlaskie              1             30.46682
## 19           Pomorskie              1             43.63345
## 20             Slaskie              1             83.04020
## 21      Swietokrzyskie              1             55.21570
## 22 Warminsko-Mazurskie              1             30.82152
## 23       Wielkopolskie              1             48.12056
## 24 Zachodnio-Pomorskie              1             39.18335
## 25          Malopolske              2              0.00000
## 26        Podkarpackie              2              0.00000
## 27        Dolnoslaskie              3              0.00000
## 28  Kujawsko-Pomorskie              3              0.00000
## 29             Lodzkie              3              0.00000
## 30           Lubeiskie              3              0.00000
## 31         Mazowieckie              3              0.00000
## 32            Opolskie              3              0.00000
## 33           Podlaskie              3              0.00000
## 34           Pomorskie              3              0.00000
## 35             Slaskie              3              0.00000
## 36      Swietokrzyskie              3              0.00000
## 37 Warminsko-Mazurskie              3              0.00000
## 38       Wielkopolskie              3              0.00000
## 39        Dolnoslaskie              4              0.00000
## 40  Kujawsko-Pomorskie              4              0.00000
## 41            Lubuskie              4              0.00000
## 42           Pomorskie              4              0.00000
## 43       Wielkopolskie              4              0.00000
## 44 Zachodnio-Pomorskie              4              0.00000
## 45           Podlaskie              5            192.85714
## 46 Warminsko-Mazurskie              5            200.00000
## 47 Zachodnio-Pomorskie              6            200.00000

Adding a Few Major Cities to trace and interpret the results from DBSCAN Clustering

# Coordinates for major cities in Poland traced from Google Earth Engine
cities <- data.frame( city = c("Warsaw", "Krakow", "Gdansk", "Poznan", "Wroclaw", "Lodz"), lon = c(21.0122, 19.9445, 18.6466, 16.9252, 17.0385, 19.4550), lat = c(52.2297, 50.0647, 54.3520, 52.4064, 51.1079, 51.7592) )

# Adding transparent markers to view the city locations and draw conclusions
ggplot(df_sample, aes(original_longitude, original_latitude)) + geom_point(aes(color = original_night_light, alpha = dbscan_cluster), size = 1) + scale_color_viridis_c() + scale_alpha_discrete(range = c(0.4, 1)) + geom_point(data = cities, aes(x = lon, y = lat), color = "red", alpha = 0.35, size = 12) + geom_text(data = cities, aes(x = lon, y = lat, label = city), color = "red", vjust = -1, fontface = "bold") + labs(title = "DBSCAN Clustering with Major Polish Cities", x = "Longitude", y = "Latitude", alpha = "DBSCAN Cluster", color = "Night Light")
## Warning: Using alpha for a discrete variable is not advised.

Interpretation of results obtained by applying DBSCAN clustering algorithm

DBSCAN Cluster Mean Intensity Interpretation
6 200.00 Ultra‑bright metropolitan cores (Warsaw, Kraków center, Tri‑City core).
5 195.05 High‑intensity urban/industrial belts (Silesia, Poznań corridors, major commercial zones).
0 77.07 Bright urban + suburban regions; dense residential belts and secondary cities.
1 48.09 Semi‑urban / pre‑urban transition zones; small towns and mixed‑density areas.
2, 3, 4 0.00 Rural, forest, lake, and uninhabited regions (low‑density natural landscapes).
Province Geographical Conclusion (DBSCAN)
Dolnośląskie Dominated by rural/forest regions with scattered semi‑urban pockets; Wrocław’s brightest core lies outside sampled bright clusters.
Kujawsko‑Pomorskie Mostly semi‑urban development with extensive rural surroundings; lacks ultra‑bright metropolitan peaks.
Łódzkie Semi‑urban belts around Łódź surrounded by large rural zones; no extreme brightness captured in DBSCAN clusters.
Lubelskie Mix of semi‑urban centers and rural landscapes; moderate illumination without high‑intensity hotspots.
Lubuskie Predominantly rural with modest semi‑urban brightness; no major metropolitan cores.
Małopolskie Semi‑urban Kraków outskirts plus large rural areas; the ultra‑bright Kraków core appears in DBSCAN Cluster 6 in full data.
Mazowieckie Strong mix of rural, semi‑urban, and bright‑urban zones; Warsaw’s ultra‑bright metropolitan core is captured in Cluster 6.
Opolskie Primarily rural with moderate semi‑urban brightness around Opole; no extreme illumination zones.
Podkarpackie Semi‑urban Rzeszów surroundings embedded in large rural regions; lacks ultra‑bright metropolitan peaks.
Podlaskie Wide contrast: rural forests, semi‑urban towns, and isolated ultra‑bright hotspots (Cluster 5).
Pomorskie Semi‑urban Tri‑City outskirts with extensive rural areas; the brightest Tri‑City core appears in Cluster 6.
Śląskie Semi‑urban and rural mix in sampled slice; the high‑intensity Silesian industrial belt appears in Cluster 5 in full data.
Świętokrzyskie Mostly rural with moderate semi‑urban brightness around Kielce; no extreme illumination.
Warmińsko‑Mazurskie Large rural and lake regions with isolated ultra‑bright hotspots (Cluster 5).
Wielkopolskie Semi‑urban Poznań outskirts surrounded by rural areas; Poznań’s bright core appears in Cluster 5 in full data.
Zachodniopomorskie Rural and semi‑urban mix with an ultra‑bright metropolitan core around Szczecin (Cluster 6).

Calculating Calinski-Harabasz index

Calinski-Harabasz index for clustering results obtained using DBSCAN algorithm.

ch_score <- calinhara(df_sample_selected_columns, dbscan_results$cluster);
print(ch_score);
## [1] 966.3889

The score obtained is high too indicating ideal clustering results. But, lower than DBSCAN clustering results.