The project aims at utilizing night light data for clustering purpose. Till now as for geo-spatial satellite data utilized in planning purpose, only census data, day-light images and geographical terrain images are considered crucial. But, as for urban planning, we ignore the crucial lens of night light data which can easily be utilized to identify and segregate areas as urban, industrial, rural and un-inhabited.
The results derived from this kind of clustering can be advantageous in cases of designing evacuation plans for natural calamities, identifying industrial belt vicinity to urban households and accordingly using the results for urban planning and for better clarity in segregating forests and green cover in an area from farms and human made artificial green belts (since natural green belts will have near zero night light intensity).
Keywords : night lights, DMSP‑OLS, geospatial clustering, urban mapping, remote sensing, k‑means, DBSCAN
library(terra); # library for reading GeoTIFF data
## Warning: package 'terra' was built under R version 4.5.2
library(tidyverse); # library utilized for ggplot2 and other functions
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
library(dbscan);
## Warning: package 'dbscan' was built under R version 4.5.2
# library(sf);
library(ClusterR); # library imported to utilize Optimal_Clusters_KMeans function for silhouette score calculation and also for obtaining elbow point method results
## Warning: package 'ClusterR' was built under R version 4.5.2
#library(clustertend); Depricated and suggestion to utilize hopkins library
library(hopkins); # library imported to utilize hopkins function to check data clusterability
## Warning: package 'hopkins' was built under R version 4.5.2
library(fpc); # library for calinhara function import to calculate Calinski-Harabasz index
## Warning: package 'fpc' was built under R version 4.5.2
# library(factoextra);
The javascript code attached is executed in Google Earth Engine code editor and is utilized for extracting required dataset for 2012 for clustering purpose and other GeoTIFF or tif files showcased in the paper
For the sake of simplicity and to avoid results for getting influenced by nightlights from neighbouring countries, a black mask layer (typically 0 intensity for our nightlight setup) is added in the map output window (code for which can be traced). Also geo-spatial data outside the boundaries of Poland is eliminated for country specific clustering results.
Data is classified at Voivodeship levels (16 Provices which Poland is divided into) also to track the magnitude of industrial and urban activity per entity. The Voivodeship boundary data or tagging for a particular coordinate on the map with respective Province value is obtained from the FAO/GAUL dataset available from the year 2015.
// Country Boundaries Loaded
var countries = ee.FeatureCollection("USDOS/LSIB_SIMPLE/2017");
// Poland Boundary Data filtered
var poland = countries.filter(ee.Filter.eq("country_na", "Poland"));
// Neighboring Country Boundaries filtered
var neighbors = countries
.filterBounds(poland.geometry())
.filter(ee.Filter.neq("country_na", "Poland"));
// Voivodeships Data filtered using GAUL Level 1 Data Set
var gaul = ee.FeatureCollection("FAO/GAUL/2015/level1");
var voivodeships = gaul.filter(ee.Filter.eq("ADM0_NAME", "Poland"));
// Setting up width of Poland and respective voivodeship boundaries
var NEON_BOUNDARY_VALUE = 200;
var BOUNDARY_WIDTH_PIXELS = 2.25;
var region_of_interest_rect = ee.Geometry.Rectangle([13.0, 48.0, 25.0, 56.0]);
// Loading DMSP‑OLS night lights data from dataset in Google Earth Engine
var dmspCollection = ee.ImageCollection("NOAA/DMSP-OLS/NIGHTTIME_LIGHTS")
.filterBounds(region_of_interest_rect);
// Setting up year values for median night light data extraction
var year_to_process = 2012;
var startDate = ee.Date(year_to_process + '-01-01');
var endDate = ee.Date(year_to_process + '-12-31');
var yearImages = dmspCollection.filterDate(startDate, endDate);
if (yearImages.size().getInfo() === 0) {
print("No DMSP data for " + year_to_process);
} else {
// Taking median of the data for entire year
var nightLightsComposite = yearImages.median()
.select("stable_lights")
.clip(poland.geometry());
// Making neighbouring countries night light value 0
var neighborsMask = ee.Image().byte().paint(neighbors, 1);
var nightLightsWithBlackNeighbors = nightLightsComposite.where(neighborsMask.eq(1), 0);
// Increasing contrast for better visualization
var stretchedLights = nightLightsWithBlackNeighbors
.unitScale(0, 63) // normalize 0–63 → 0–1
.multiply(150); // stretch to 0–150 for stronger contrast
var emptyCanvas = ee.Image(0).rename("stable_lights").byte();
// Including boundaries and night light value
var polandOutlineRaster = emptyCanvas.paint({
featureCollection: poland,
color: NEON_BOUNDARY_VALUE,
width: BOUNDARY_WIDTH_PIXELS
});
var voivodeshipsOutlineRaster = emptyCanvas.paint({
featureCollection: voivodeships,
color: NEON_BOUNDARY_VALUE,
width: BOUNDARY_WIDTH_PIXELS
});
var finalImageForExport = stretchedLights
.unmask(0)
.max(polandOutlineRaster)
.max(voivodeshipsOutlineRaster);
// Color palette and visualization parameters chosen for night lights data
var finalVisParams = {
min: 0,
max: 200,
palette: [
'black','blue','purple','cyan','orange','yellow','red','white',
'#00FF00'
]
};
// Displaying data in Google Earth Engine maps for cross-verification prior to export
Map.addLayer(finalImageForExport, finalVisParams,
"High‑Contrast Night Lights " + year_to_process);
var centerPoint = poland.geometry().centroid();
Map.setCenter(centerPoint.coordinates().get(0).getInfo(),
centerPoint.coordinates().get(1).getInfo(), 6);
// Exporting map images as prepared in previous steps
Export.image.toDrive({
image: finalImageForExport,
description: 'NightLights_' + year_to_process + '_high_contrast',
folder: 'EarthEngineExports',
fileNamePrefix: 'nightlights_' + year_to_process + '_high_contrast',
region: region_of_interest_rect,
scale: 500,
maxPixels: 1e13,
fileFormat: 'GeoTIFF'
});
var imageWithCoords = finalImageForExport
.rename("night_light")
.addBands(ee.Image.pixelLonLat());
var samples = imageWithCoords.sampleRegions({
collection: voivodeships,
properties: ['ADM1_NAME'], // province name according to dataset
scale: 1000,
geometries: false
}).map(function (f) {
return f.set({
year: year_to_process,
province: f.get('ADM1_NAME')
});
});
// Exporting Longitude, Latitude, Province Name, Night Light intensity data for clustering
Export.table.toDrive({
collection: samples,
description: 'NightLights_CSV_' + year_to_process,
folder: 'EarthEngineExports',
fileNamePrefix: 'nightlights_' + year_to_process + '_pixels',
fileFormat: 'CSV'
});
}
Plotting the datasets in Google Earth Engine itself with map overlay and observing the same to validate an increase in Night Light intensity over a period from year 2000 to 2012.
During the mentioned period, Poland saw a major growth in the population for core cities like Warsaw, Kraków, Wrocław, Poznań, and the Tri‑City. The population inflows can be attributed to commercial construction, development in infrastructure and rapid industry growth which further contributed to increasing light intensity in these specific regions.
Some other reasons for increased night light intensity can be - - Energy grid reliability due to technical advancements across the globe. - Rapid Highways (A1, A2, A4) expansion during the mentioned decade. - Shopping centers, logistics hubs, and business parks expanded rapidly, especially after 2005 attributing to the IT revolution across the globe.
# Dividing screen into 9 frames
# windows(width = 12, height = 10);
par(mfrow = c(3, 3));
for(i in seq(2000, 2012, 2)){
rel_path = paste0("./2000-2012GeospatialNightLightData/nightlights_", i, ".tif");
r <- rast(rel_path);
plot(r, main = paste("Year", i))
}
summary and str can help us understand the dataset structure and accordingly either tweak column data types (for example tweaking Province column datatype from Character to Factor since the same contains repetition of 16 Voivodeship values from Poland).
The same can help in reducing time for producing plots owing to better linked and mastered data. Also columns with repetitive or similar values were dropped.
df <- read_csv("./2000-2012GeospatialNightLightData/nightlights_2012_pixels.csv");
## Rows: 507207 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): system:index, ADM1_NAME, province, .geo
## dbl (4): latitude, longitude, night_light, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(df);
## spc_tbl_ [507,207 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ system:index: chr [1:507207] "000000000000000003c0_0" "000000000000000003c0_1" "000000000000000003c0_2" "000000000000000003c0_3" ...
## $ ADM1_NAME : chr [1:507207] "Dolnoslaskie" "Dolnoslaskie" "Dolnoslaskie" "Dolnoslaskie" ...
## $ latitude : num [1:507207] 50.1 50.1 50.1 50.1 50.1 ...
## $ longitude : num [1:507207] 16.7 16.7 16.7 16.7 16.7 ...
## $ night_light : num [1:507207] 200 200 200 200 200 200 200 200 200 200 ...
## $ province : chr [1:507207] "Dolnoslaskie" "Dolnoslaskie" "Dolnoslaskie" "Dolnoslaskie" ...
## $ year : num [1:507207] 2012 2012 2012 2012 2012 ...
## $ .geo : chr [1:507207] "{\"type\":\"MultiPoint\",\"coordinates\":[]}" "{\"type\":\"MultiPoint\",\"coordinates\":[]}" "{\"type\":\"MultiPoint\",\"coordinates\":[]}" "{\"type\":\"MultiPoint\",\"coordinates\":[]}" ...
## - attr(*, "spec")=
## .. cols(
## .. `system:index` = col_character(),
## .. ADM1_NAME = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. night_light = col_double(),
## .. province = col_character(),
## .. year = col_double(),
## .. .geo = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(df$night_light); # data should be in range from 0 to 200
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 19.05 26.19 43.97 47.62 200.00
| Column Name being dropped | Reason for Dropping Column |
|---|---|
| system:index | The same is an ID value added in Google Earth Engine Data Set NOAA/DMSP-OLS/NIGHTTIME_LIGHTS which isn’t of much relevance to us. |
| ADM1_NAME | Redundant data since column value is already added by us in Province column. |
| .geo | Empty and common JSON object for all data set entries. |
| year | Only value is 2012 since data for 2012 is only extracted in CSV format from Google Earth Engine - Remaining years data is just used for visualization and verification of growth trend. |
# print(summary(df$year)) # Only value 2012 thus, year column can also be dropped
df <- df[ , !(names(df) %in% c("system:index", ".geo", "ADM1_NAME", "year"))];
print(head(df, 5));
## # A tibble: 5 × 4
## latitude longitude night_light province
## <dbl> <dbl> <dbl> <chr>
## 1 50.1 16.7 200 Dolnoslaskie
## 2 50.1 16.7 200 Dolnoslaskie
## 3 50.1 16.7 200 Dolnoslaskie
## 4 50.1 16.7 200 Dolnoslaskie
## 5 50.1 16.7 200 Dolnoslaskie
Converting Province column value to factor data type since Poland is bifurcated into only 16 Provinces (or voivodeships)
df$province <- as.factor(df$province)
print(head(df, 5));
## # A tibble: 5 × 4
## latitude longitude night_light province
## <dbl> <dbl> <dbl> <fct>
## 1 50.1 16.7 200 Dolnoslaskie
## 2 50.1 16.7 200 Dolnoslaskie
## 3 50.1 16.7 200 Dolnoslaskie
## 4 50.1 16.7 200 Dolnoslaskie
## 5 50.1 16.7 200 Dolnoslaskie
Log transforming the data to reduce skewness and for scaling. Selecting only columns with numerical values eliminating Province (or voivodeship column) since the information for the same can be reconstructed using longitude and latitude column values.
df_scaled <- scale(select(mutate(df, night_light_log = log1p(night_light)), latitude, longitude, night_light_log));
df_scaled <- as.data.frame(df_scaled); # Converting df_scaled from matrix (result of utilization of scale function) to dataframe
df_scaled$original_longitude <- df$longitude;
df_scaled$original_latitude <- df$latitude;
df_scaled$original_night_light <- df$night_light;
df_scaled$province <- df$province;
# print(head(df_scaled, 5));
Accounting to large size of the dataset, we are sampling 11000 random values from the same for ease of clustering.
set.seed(11);
df_sample <- df_scaled[sample(nrow(df_scaled), 11000), ];
df_sample_selected_columns <- df_sample[, c("latitude", "longitude", "night_light_log")];
# print(head(df_sample, 5));
PCA (Principal Component Analysis) being a strong pre-clustering tool is utilized to determine the following facts -
pca_res <- prcomp(df_sample_selected_columns, scale. = TRUE);
print(summary(pca_res));
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.1550 0.9796 0.8406
## Proportion of Variance 0.4446 0.3198 0.2355
## Cumulative Proportion 0.4446 0.7645 1.0000
Summarizing the PCA results - - PC1 in general accounts for biggest source of variation in the data or direction along which data is most spread. In current case the general pattern of differentiating brigter from darker data points might be considered for PC1. - PC2 and PC3 consecutively capture the next most important dimension or parameter which PC1 missed. PC2 might be utilizing other indicators like differentiating areas as Industrial, Urban, Rural or Un-Inhabited. - PC1 and PC2 alongside combine to form 44 + 32 = 76 percent indicating high variance along these basic dimensions. - The combined high value of PC1 and PC2 is an indicator of dominant patterns in data. The data is low dimensional and not spread randomly. The same is an ideal indicator for clusterability. - A combined value of PC1 and PC2 less than 10 percent indicate highly noisy data and that clutering results from the same will be unreliable. - A too high PC1 value equivalent to 90 percent also is a bad indicator for clustering since the same indicates that the data is mostly linear.
As for this paper PCA is only being utilized to judge the data quality.
Deciding on ideal number of clusters using Silhouette Score and Elbow Point method
Silhouette is a score assigned to a data point / object for rating its presence in the current cluster based on average distance from data points / objects of current cluster and nearest cluster. a(x) is the average of distance from all points in current cluster and b(x) is the average distance from all points of the nearest cluster.
Formula for the same - Silhouette Score = (b(x) - a(x)) / max(a(x), b(x))
The value of Silhouette Score generally ranges from -1 to 1.
Elbow point is another method for locating the ideal number of clusters for a particular dataset. It helps us to identify on how much do we benefit when we increase the number of clusters and accordingly we can select a point that stands out and increasing number of clusters beyond which doesn’t contribute much.
For each number of clusters value, we compute the WCSS (Within-Cluster Sum of Squares) also known as inertia or distortion.
The same is a measure of how tightly the data points / objects in a cluster are grouped around the selected centroid. Low WCSS indicates more tight grouping around the centroid. Before the Elbow Point (or ideal number of cluster selection), WCSS value is reducing dynamically with each subsequent increase in number of clusters. After the elbow point, we just notice minimal benefits with increase in number of clusters and are just overfitting the noise in the model.
res_elbow_point <- Optimal_Clusters_KMeans(
df_sample_selected_columns,
max_clusters = 20,
criterion = "WCSSE"
);
res_silhouette <- Optimal_Clusters_KMeans(
df_sample_selected_columns,
max_clusters = 20,
criterion = "silhouette"
);
par(mfrow = c(1, 2));
plot(res_elbow_point, main = "Elbow Point");
plot(res_silhouette, main = "Silhouette");
As per silhouette score, 4 appears to be the ideal number of clusters (maximum silhouette score of 0.33 - as per visual comparison), post 4, similar silhouette score value (although lower in magnitude - 0.32) is obtained only at 13 (increased number of cluster groups - thus, the same can be discarded).
Plotting the log normalized data set for better visualization
ggplot(df_sample, aes(original_longitude, original_latitude, color = night_light_log)) + geom_point()
Using Hopkins Statistics to determine the clusterability of dataset (passing df_sample accounting to the large size of data set).
hopkins_statistics <- hopkins(df_sample_selected_columns);
print(hopkins_statistics);
## [1] 0.9982452
Hopkins Statistics value of 0.998 indicates high clusterability value for the nightlight dataset.
set.seed(11); # to ensure replicability of results
k <- 4; # chosen as per silhouette score
kmeans_results <- kmeans(df_sample_selected_columns, centers = k, nstart = 11); # nstart indicates the number of random initializations of cluster centers before actually choosing the best solution thus, setting it to 11 for sake of simplicity and speed. The same is a trade-off between speed and stability, higher the value of nstart more stable the model predictions and lower the speed, vice-versa for other combination.
df_sample$cluster <- factor(kmeans_results$cluster);
print(head(df_sample, 5));
## latitude longitude night_light_log original_longitude
## 65570 -0.6758266 0.0208703 0.3791067 19.45302
## 459673 0.4869804 -1.1196877 -0.2857201 16.58739
## 281148 0.3490202 1.2007579 1.6085076 22.41746
## 335756 1.6760654 -0.5225931 0.1758876 18.08758
## 291030 0.6906358 1.5475733 0.2841053 23.28882
## original_latitude original_night_light province cluster
## 65570 51.19948 40.47619 Lodzkie 3
## 459673 52.78950 16.66667 Wielkopolskie 4
## 281148 52.60085 200.00000 Podlaskie 1
## 335756 54.41545 30.95238 Pomorskie 4
## 291030 53.06798 35.71429 Podlaskie 1
Plotting K-means algorithm clustering results for visualization and comparison of results
ggplot(df_sample, aes(original_longitude, original_latitude)) + geom_point(aes(color = original_night_light, shape = cluster));
Calculating Mean Night Light intensity per cluster to draw conclusions and also intensity per province according to clusters for more details.
aggregate(original_night_light ~ cluster, data = df_sample, mean);
## cluster original_night_light
## 1 1 38.00939
## 2 2 0.00000
## 3 3 61.51085
## 4 4 45.34555
aggregate(original_night_light ~ province + cluster, data = df_sample, mean);
## province cluster original_night_light
## 1 Kujawsko-Pomorskie 1 39.77933
## 2 Lodzkie 1 29.49309
## 3 Lubeiskie 1 34.23859
## 4 Mazowieckie 1 43.07585
## 5 Podlaskie 1 34.46574
## 6 Pomorskie 1 48.62637
## 7 Warminsko-Mazurskie 1 34.11403
## 8 Dolnoslaskie 2 0.00000
## 9 Kujawsko-Pomorskie 2 0.00000
## 10 Lodzkie 2 0.00000
## 11 Lubeiskie 2 0.00000
## 12 Lubuskie 2 0.00000
## 13 Malopolske 2 0.00000
## 14 Mazowieckie 2 0.00000
## 15 Opolskie 2 0.00000
## 16 Podkarpackie 2 0.00000
## 17 Podlaskie 2 0.00000
## 18 Pomorskie 2 0.00000
## 19 Slaskie 2 0.00000
## 20 Swietokrzyskie 2 0.00000
## 21 Warminsko-Mazurskie 2 0.00000
## 22 Wielkopolskie 2 0.00000
## 23 Zachodnio-Pomorskie 2 0.00000
## 24 Dolnoslaskie 3 74.14369
## 25 Lodzkie 3 60.44767
## 26 Lubeiskie 3 48.35476
## 27 Malopolske 3 66.14494
## 28 Mazowieckie 3 64.89796
## 29 Opolskie 3 59.53953
## 30 Podkarpackie 3 56.69203
## 31 Slaskie 3 83.04020
## 32 Swietokrzyskie 3 55.21570
## 33 Wielkopolskie 3 154.08163
## 34 Dolnoslaskie 4 49.25080
## 35 Kujawsko-Pomorskie 4 44.09452
## 36 Lodzkie 4 45.10148
## 37 Lubuskie 4 45.94719
## 38 Mazowieckie 4 200.00000
## 39 Opolskie 4 21.42857
## 40 Pomorskie 4 42.57589
## 41 Wielkopolskie 4 46.58170
## 42 Zachodnio-Pomorskie 4 42.19866
| Cluster | Interpretation |
|---|---|
| Cluster 1 | Semi‑urban or mixed‑development rural areas with moderate night‑light intensity. |
| Cluster 2 | Uninhabited or sparsely populated regions such as forests, farms, lakes, and other natural landscapes. |
| Cluster 3 | Extremely bright zones with continuous activity, including commercial districts and major industrial belts. |
| Cluster 4 | Urban residential and metropolitan areas with high illumination, including major cities and surrounding suburbs. |
| Province | Interpretation |
|---|---|
| Dolnośląskie | Bright urban + high‑intensity zones (Wrocław) mixed with dark rural areas. |
| Kujawsko‑Pomorskie | Moderate brightness with pockets of high‑intensity urban areas (Bydgoszcz, Toruń). |
| Łódzkie | Full spectrum from dark rural to high‑intensity urban core (Łódź). |
| Lubelskie | Moderate to bright regions, fewer extreme peaks; centered around Lublin. |
| Lubuskie | Mostly dark rural areas with moderately bright cities (Zielona Góra, Gorzów). |
| Małopolskie | Bright urban regions (Kraków) with surrounding rural areas. |
| Mazowieckie | Contains Poland’s brightest region (200) — Warsaw; strong mix of all cluster types. |
| Opolskie | Mostly rural with bright urban pockets around Opole. |
| Podkarpackie | Bright regional center (Rzeszów) but no extreme metropolitan peaks. |
| Podlaskie | Moderate brightness with extensive dark forested areas. |
| Pomorskie | Moderate to high brightness; Tri‑City drives the intensity. |
| Śląskie | One of the brightest industrial regions (Katowice–Gliwice belt). |
| Świętokrzyskie | Moderate brightness centered around Kielce; no extreme peaks. |
| Warmińsko‑Mazurskie | Low population density; moderate brightness with large dark lake regions. |
| Wielkopolskie | Very high brightness (154 in Cluster 3) due to Poznań and industrial corridors. |
| Zachodniopomorskie | Dark rural areas with bright coastal and urban centers (Szczecin). |
Calinski-Harabasz index is used to calculate how well clustering results obtained actually classify the data. It is a ration of dispersion between the different cluster points and dispersion within the cluster points.
ch_score <- calinhara(df_sample_selected_columns, kmeans_results$cluster);
print(ch_score);
## [1] 5997.872
The score obtained is high indicating ideal clustering results as a second reference, comparing the KMeans Clustering results with DBSCAN Clustering results.
eps_value <- 0.367; # radius of neighbourhood around each point. If another point lies within the radius value then is considered a neighbour. Smaller epsilon value indicates that only close points become neighbours while if epsilon value is large then, the cluster size grows quickly and may merge.
minPts_value <- 18; # the same indicates the minimum number of points required to form a cluster center. If a point has at least minimum number of neighbours then, the same becomes a cluster center and subsequently can influence other points to form a cluster. Points with lesser than minimum point value of neighbours account as border points or noise.
dbscan_results <- dbscan::dbscan(df_sample_selected_columns, eps = eps_value, minPts = minPts_value);
# Adding cluster labels to your original df
df_sample$dbscan_cluster <- as.factor(dbscan_results$cluster);
print(head(df_sample, 5));
## latitude longitude night_light_log original_longitude
## 65570 -0.6758266 0.0208703 0.3791067 19.45302
## 459673 0.4869804 -1.1196877 -0.2857201 16.58739
## 281148 0.3490202 1.2007579 1.6085076 22.41746
## 335756 1.6760654 -0.5225931 0.1758876 18.08758
## 291030 0.6906358 1.5475733 0.2841053 23.28882
## original_latitude original_night_light province cluster
## 65570 51.19948 40.47619 Lodzkie 3
## 459673 52.78950 16.66667 Wielkopolskie 4
## 281148 52.60085 200.00000 Podlaskie 1
## 335756 54.41545 30.95238 Pomorskie 4
## 291030 53.06798 35.71429 Podlaskie 1
## dbscan_cluster
## 65570 1
## 459673 1
## 281148 1
## 335756 1
## 291030 1
Plotting DBSCAN algorithm clustering results for visualization and comparison of results
ggplot(df_sample, aes(original_longitude, original_latitude)) + geom_point(aes(color = original_night_light, alpha = dbscan_cluster), size = 1) + scale_color_viridis_c() + scale_alpha_discrete(range = c(0.4, 1)) + labs(title = "DBSCAN Clustering Results", x = "Longitude", y = "Latitude", alpha = "DBSCAN Cluster", color = "Night Light")
## Warning: Using alpha for a discrete variable is not advised.
Calculating Mean Night Light intensity per cluster to draw conclusions and also intensity per province according to clusters for more details.
aggregate(original_night_light ~ dbscan_cluster, data = df_sample, mean);
## dbscan_cluster original_night_light
## 1 0 77.06767
## 2 1 48.09142
## 3 2 0.00000
## 4 3 0.00000
## 5 4 0.00000
## 6 5 195.05494
## 7 6 200.00000
aggregate(original_night_light ~ province + dbscan_cluster, data = df_sample, mean);
## province dbscan_cluster original_night_light
## 1 Dolnoslaskie 0 0.00000
## 2 Lubeiskie 0 86.79654
## 3 Malopolske 0 0.00000
## 4 Mazowieckie 0 63.09524
## 5 Podkarpackie 0 61.53846
## 6 Podlaskie 0 169.44444
## 7 Slaskie 0 0.00000
## 8 Warminsko-Mazurskie 0 133.65079
## 9 Dolnoslaskie 1 51.38769
## 10 Kujawsko-Pomorskie 1 43.28542
## 11 Lodzkie 1 52.51610
## 12 Lubeiskie 1 42.98993
## 13 Lubuskie 1 45.94719
## 14 Malopolske 1 66.14494
## 15 Mazowieckie 1 47.96059
## 16 Opolskie 1 57.74184
## 17 Podkarpackie 1 55.55240
## 18 Podlaskie 1 30.46682
## 19 Pomorskie 1 43.63345
## 20 Slaskie 1 83.04020
## 21 Swietokrzyskie 1 55.21570
## 22 Warminsko-Mazurskie 1 30.82152
## 23 Wielkopolskie 1 48.12056
## 24 Zachodnio-Pomorskie 1 39.18335
## 25 Malopolske 2 0.00000
## 26 Podkarpackie 2 0.00000
## 27 Dolnoslaskie 3 0.00000
## 28 Kujawsko-Pomorskie 3 0.00000
## 29 Lodzkie 3 0.00000
## 30 Lubeiskie 3 0.00000
## 31 Mazowieckie 3 0.00000
## 32 Opolskie 3 0.00000
## 33 Podlaskie 3 0.00000
## 34 Pomorskie 3 0.00000
## 35 Slaskie 3 0.00000
## 36 Swietokrzyskie 3 0.00000
## 37 Warminsko-Mazurskie 3 0.00000
## 38 Wielkopolskie 3 0.00000
## 39 Dolnoslaskie 4 0.00000
## 40 Kujawsko-Pomorskie 4 0.00000
## 41 Lubuskie 4 0.00000
## 42 Pomorskie 4 0.00000
## 43 Wielkopolskie 4 0.00000
## 44 Zachodnio-Pomorskie 4 0.00000
## 45 Podlaskie 5 192.85714
## 46 Warminsko-Mazurskie 5 200.00000
## 47 Zachodnio-Pomorskie 6 200.00000
# Coordinates for major cities in Poland traced from Google Earth Engine
cities <- data.frame( city = c("Warsaw", "Krakow", "Gdansk", "Poznan", "Wroclaw", "Lodz"), lon = c(21.0122, 19.9445, 18.6466, 16.9252, 17.0385, 19.4550), lat = c(52.2297, 50.0647, 54.3520, 52.4064, 51.1079, 51.7592) )
# Adding transparent markers to view the city locations and draw conclusions
ggplot(df_sample, aes(original_longitude, original_latitude)) + geom_point(aes(color = original_night_light, alpha = dbscan_cluster), size = 1) + scale_color_viridis_c() + scale_alpha_discrete(range = c(0.4, 1)) + geom_point(data = cities, aes(x = lon, y = lat), color = "red", alpha = 0.35, size = 12) + geom_text(data = cities, aes(x = lon, y = lat, label = city), color = "red", vjust = -1, fontface = "bold") + labs(title = "DBSCAN Clustering with Major Polish Cities", x = "Longitude", y = "Latitude", alpha = "DBSCAN Cluster", color = "Night Light")
## Warning: Using alpha for a discrete variable is not advised.
| DBSCAN Cluster | Mean Intensity | Interpretation |
|---|---|---|
| 6 | 200.00 | Ultra‑bright metropolitan cores (Warsaw, Kraków center, Tri‑City core). |
| 5 | 195.05 | High‑intensity urban/industrial belts (Silesia, Poznań corridors, major commercial zones). |
| 0 | 77.07 | Bright urban + suburban regions; dense residential belts and secondary cities. |
| 1 | 48.09 | Semi‑urban / pre‑urban transition zones; small towns and mixed‑density areas. |
| 2, 3, 4 | 0.00 | Rural, forest, lake, and uninhabited regions (low‑density natural landscapes). |
| Province | Geographical Conclusion (DBSCAN) |
|---|---|
| Dolnośląskie | Dominated by rural/forest regions with scattered semi‑urban pockets; Wrocław’s brightest core lies outside sampled bright clusters. |
| Kujawsko‑Pomorskie | Mostly semi‑urban development with extensive rural surroundings; lacks ultra‑bright metropolitan peaks. |
| Łódzkie | Semi‑urban belts around Łódź surrounded by large rural zones; no extreme brightness captured in DBSCAN clusters. |
| Lubelskie | Mix of semi‑urban centers and rural landscapes; moderate illumination without high‑intensity hotspots. |
| Lubuskie | Predominantly rural with modest semi‑urban brightness; no major metropolitan cores. |
| Małopolskie | Semi‑urban Kraków outskirts plus large rural areas; the ultra‑bright Kraków core appears in DBSCAN Cluster 6 in full data. |
| Mazowieckie | Strong mix of rural, semi‑urban, and bright‑urban zones; Warsaw’s ultra‑bright metropolitan core is captured in Cluster 6. |
| Opolskie | Primarily rural with moderate semi‑urban brightness around Opole; no extreme illumination zones. |
| Podkarpackie | Semi‑urban Rzeszów surroundings embedded in large rural regions; lacks ultra‑bright metropolitan peaks. |
| Podlaskie | Wide contrast: rural forests, semi‑urban towns, and isolated ultra‑bright hotspots (Cluster 5). |
| Pomorskie | Semi‑urban Tri‑City outskirts with extensive rural areas; the brightest Tri‑City core appears in Cluster 6. |
| Śląskie | Semi‑urban and rural mix in sampled slice; the high‑intensity Silesian industrial belt appears in Cluster 5 in full data. |
| Świętokrzyskie | Mostly rural with moderate semi‑urban brightness around Kielce; no extreme illumination. |
| Warmińsko‑Mazurskie | Large rural and lake regions with isolated ultra‑bright hotspots (Cluster 5). |
| Wielkopolskie | Semi‑urban Poznań outskirts surrounded by rural areas; Poznań’s bright core appears in Cluster 5 in full data. |
| Zachodniopomorskie | Rural and semi‑urban mix with an ultra‑bright metropolitan core around Szczecin (Cluster 6). |
Calinski-Harabasz index for clustering results obtained using DBSCAN algorithm.
ch_score <- calinhara(df_sample_selected_columns, dbscan_results$cluster);
print(ch_score);
## [1] 966.3889
The score obtained is high too indicating ideal clustering results. But, lower than DBSCAN clustering results.