Link to download dataset: https://public.opendatasoft.com/explore/assets/openaq/export/
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
openaq_data <- read.csv('openaq.csv', sep = ';')
str(openaq_data)
## 'data.frame': 61177 obs. of 10 variables:
## $ Country.Code : chr "CN" "CN" "CN" "CN" ...
## $ City : chr "" "" "" "" ...
## $ Location : chr "市八十六中" "市农科院" "市发改委" "市委" ...
## $ Coordinates : chr "23.1047, 113.43319999999999" "21.9508, 108.6553" "29.8454, 114.3107" "30.457600000000003, 106.63030000000002" ...
## $ Pollutant : chr "O3" "SO2" "PM2.5" "O3" ...
## $ Source.Name : chr "ChinaAQIData" "ChinaAQIData" "ChinaAQIData" "ChinaAQIData" ...
## $ Unit : chr "µg/m³" "µg/m³" "µg/m³" "µg/m³" ...
## $ Value : num 36 7 26 91 19 113 600 300 98 14 ...
## $ Last.Updated : chr "2021-08-09T13:00:00+02:00" "2020-12-31T17:00:00+01:00" "2021-08-09T13:00:00+02:00" "2021-08-09T13:00:00+02:00" ...
## $ Country.Label: chr "China" "China" "China" "China" ...
I will first cluster the data based on countries, so I will group data by country
openaq_data$Country.Code <- factor(openaq_data$Country.Code)
str(openaq_data)
## 'data.frame': 61177 obs. of 10 variables:
## $ Country.Code : Factor w/ 137 levels "-99","AD","AE",..: 30 30 30 30 30 30 30 30 30 30 ...
## $ City : chr "" "" "" "" ...
## $ Location : chr "市八十六中" "市农科院" "市发改委" "市委" ...
## $ Coordinates : chr "23.1047, 113.43319999999999" "21.9508, 108.6553" "29.8454, 114.3107" "30.457600000000003, 106.63030000000002" ...
## $ Pollutant : chr "O3" "SO2" "PM2.5" "O3" ...
## $ Source.Name : chr "ChinaAQIData" "ChinaAQIData" "ChinaAQIData" "ChinaAQIData" ...
## $ Unit : chr "µg/m³" "µg/m³" "µg/m³" "µg/m³" ...
## $ Value : num 36 7 26 91 19 113 600 300 98 14 ...
## $ Last.Updated : chr "2021-08-09T13:00:00+02:00" "2020-12-31T17:00:00+01:00" "2021-08-09T13:00:00+02:00" "2021-08-09T13:00:00+02:00" ...
## $ Country.Label: chr "China" "China" "China" "China" ...
There is some strange unexpected factor ‘-99’ in Country.Code, so I am going to explore what this is
openaq_data[openaq_data$Country.Code == '-99',]
## Country.Code City Location
## 2306 -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 10362 -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 10890 -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 23473 -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 34321 -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 57862 -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 57863 -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## Coordinates Pollutant
## 2306 34.9923, 33.779 PM10
## 10362 34.9923, 33.779 PM2.5
## 10890 34.9923, 33.779 SO2
## 23473 34.9923, 33.779 NOX
## 34321 34.9923, 33.779 O3
## 57862 34.9923, 33.779 NO
## 57863 34.9923, 33.779 NO2
## Source.Name Unit Value
## 2306 Republic of Cyprus Department of Labor Inspection µg/m³ 22.80000
## 10362 Republic of Cyprus Department of Labor Inspection µg/m³ 11.21000
## 10890 Republic of Cyprus Department of Labor Inspection µg/m³ 6.84391
## 23473 Republic of Cyprus Department of Labor Inspection µg/m³ 13.62056
## 34321 Republic of Cyprus Department of Labor Inspection µg/m³ 99.55051
## 57862 Republic of Cyprus Department of Labor Inspection µg/m³ 0.83549
## 57863 Republic of Cyprus Department of Labor Inspection µg/m³ 12.35798
## Last.Updated Country.Label
## 2306 2025-01-31T10:00:00+01:00
## 10362 2025-01-31T10:00:00+01:00
## 10890 2025-01-31T10:00:00+01:00
## 23473 2025-01-31T10:00:00+01:00
## 34321 2025-01-31T10:00:00+01:00
## 57862 2025-01-31T10:00:00+01:00
## 57863 2025-01-31T10:00:00+01:00
All of the Country.Code entries that have -99 have City Ορμήδια - Βιομηχανικός Σταθμός , which is a city in Cyprus. Let’s check if Cyprus is already in the data
head(openaq_data[openaq_data$Country.Code == 'CY',])
## Country.Code City
## 123 CY Αγία Μαρίνα Ξυλιάτου - Σταθμός Υποβάθρου
## 2440 CY Λεμεσός - Κυκλοφοριακός Σταθμός
## 2441 CY Καλαβασός - Βιομηχανικός Σταθμός
## 2442 CY Πάφος - Κυκλοφοριακός Σταθμός
## 4978 CY Παραλίμνι - Κυκλοφοριακός Σταθμός
## 4979 CY Πάφος - Κυκλοφοριακός Σταθμός
## Location
## 123 Ayia Marina Xyliatou - Background Station
## 2440 Limassol - Traffic Station
## 2441 Kalavasos - Industrial Station
## 2442 Paphos - Traffic Station
## 4978 Paralimni - Traffic Station
## 4979 Paphos - Traffic Station
## Coordinates Pollutant
## 123 35.0380556, 33.05777779999994 O3
## 2440 34.6861111, 33.03555559999995 SO2
## 2441 34.77152999999999, 33.29623 NO2
## 2442 34.7752778, 32.42194440000003 NO
## 4978 35.04569276191925, 33.977734884655774 NO
## 4979 34.7752778, 32.42194440000003 PM2.5
## Source.Name Unit Value
## 123 Republic of Cyprus Department of Labor Inspection µg/m³ 114.8657000
## 2440 Republic of Cyprus Department of Labor Inspection µg/m³ 0.6449241
## 2441 Republic of Cyprus Department of Labor Inspection µg/m³ 7.5563500
## 2442 Republic of Cyprus Department of Labor Inspection µg/m³ 1.9617910
## 4978 Republic of Cyprus Department of Labor Inspection µg/m³ 11.5399600
## 4979 Republic of Cyprus Department of Labor Inspection µg/m³ 14.7685100
## Last.Updated Country.Label
## 123 2025-01-31T10:00:00+01:00 Cyprus
## 2440 2025-01-31T10:00:00+01:00 Cyprus
## 2441 2025-01-31T10:00:00+01:00 Cyprus
## 2442 2025-01-31T10:00:00+01:00 Cyprus
## 4978 2025-01-31T10:00:00+01:00 Cyprus
## 4979 2025-01-31T10:00:00+01:00 Cyprus
Yes, so Cyprus is already in the data so I will just fix -99 to represent CY
openaq_data[openaq_data$Country.Code == '-99',]$Country.Code <- 'CY'
I want to reshape the data to have a Polutants as separate columns that will have the values in each row Let’s check what Pollutants there are in the data and if they all have a value and let’s handle this.
unique(openaq_data$Pollutant)
## [1] "O3" "SO2" "PM2.5" "NO2"
## [5] "CO" "PM10" "NO" "PM1"
## [9] "RELATIVEHUMIDITY" "TEMPERATURE" "NOX" "UM003"
## [13] "BC"
So overall we have the following Pollutants: BC CO NO NO2 NOX O3 PM1 PM10 PM2.5 RELATIVEHUMIDITY SO2 TEMPERATURE UM003 Let’s check what is the proportion of those reported in the data:
print(paste('BC:',count(openaq_data[openaq_data$Pollutant == 'BC',])))
## [1] "BC: 137"
print(paste('CO:',count(openaq_data[openaq_data$Pollutant == 'CO',])))
## [1] "CO: 5968"
print(paste('NO:',count(openaq_data[openaq_data$Pollutant == 'NO',])))
## [1] "NO: 4030"
print(paste('NO2:',count(openaq_data[openaq_data$Pollutant == 'NO2',])))
## [1] "NO2: 11129"
print(paste('NOX:',count(openaq_data[openaq_data$Pollutant == 'NOX',])))
## [1] "NOX: 2094"
print(paste('O3:',count(openaq_data[openaq_data$Pollutant == 'O3',])))
## [1] "O3: 9377"
print(paste('PM1:',count(openaq_data[openaq_data$Pollutant == 'PM1',])))
## [1] "PM1: 159"
print(paste('PM10:',count(openaq_data[openaq_data$Pollutant == 'PM10',])))
## [1] "PM10: 9264"
print(paste('PM2.5:',count(openaq_data[openaq_data$Pollutant == 'PM2.5',])))
## [1] "PM2.5: 10662"
print(paste('RELATIVEHUMIDITY:',count(openaq_data[openaq_data$Pollutant == 'RELATIVEHUMIDITY',])))
## [1] "RELATIVEHUMIDITY: 89"
print(paste('SO2:',count(openaq_data[openaq_data$Pollutant == 'SO2',])))
## [1] "SO2: 8061"
print(paste('TEMPERATURE:',count(openaq_data[openaq_data$Pollutant == 'TEMPERATURE',])))
## [1] "TEMPERATURE: 120"
print(paste('UM003:',count(openaq_data[openaq_data$Pollutant == 'UM003',])))
## [1] "UM003: 87"
I will drop those with too few observations, lets say under 1.000
openaq_data$Pollutant <- factor(openaq_data$Pollutant)
openaq_data <- openaq_data[openaq_data$Pollutant %in% names(which(table(openaq_data$Pollutant) >= 1000)), ]
duplicates <- openaq_data %>%
group_by(Country.Code, City, Location, Coordinates, Source.Name, Unit, Last.Updated, Country.Label, Pollutant) %>%
filter(n() > 1) %>%
arrange(Pollutant, Location)
duplicates
## # A tibble: 128 × 10
## # Groups: Country.Code, City, Location, Coordinates, Source.Name, Unit,
## # Last.Updated, Country.Label, Pollutant [61]
## Country.Code City Location Coordinates Pollutant Source.Name Unit Value
## <fct> <chr> <chr> <chr> <fct> <chr> <chr> <dbl>
## 1 ZA Sekhuk… Dilokong -24.615222… CO South Afri… ppm 0.789
## 2 ZA Sekhuk… Dilokong -24.615222… CO South Afri… ppm 0.856
## 3 ZA Sekhuk… Dilokong -24.615222… CO South Afri… ppm 1.04
## 4 ZA Sekhuk… Dilokong -24.615222… CO South Afri… ppm 0.824
## 5 ZA Sekhuk… Dilokong -24.615222… NO South Afri… ppm 0.0139
## 6 ZA Sekhuk… Dilokong -24.615222… NO South Afri… ppm 0.0152
## 7 ZA Sekhuk… Dilokong -24.615222… NO South Afri… ppm 0.0154
## 8 ZA Sekhuk… Dilokong -24.615222… NO South Afri… ppm 0.0137
## 9 ZA Sekhuk… Dilokong -24.615222… NO2 South Afri… ppm 0.00156
## 10 ZA Sekhuk… Dilokong -24.615222… NO2 South Afri… ppm 0.00139
## # ℹ 118 more rows
## # ℹ 2 more variables: Last.Updated <chr>, Country.Label <chr>
there are some duplicates since some of the air quality report were reported a few times accross differents times at the same location, so I will actually also take median of those cases and keep 1 row
As I want to cluster based on the Pollutant types, I need to reshape the data
openaq_data <- openaq_data %>%
pivot_wider(
id_cols = c(Country.Code, City, Location, Coordinates, Source.Name, Unit, Last.Updated, Country.Label),
names_from = Pollutant,
values_from = Value,
values_fn = list(Value = median) #taking median if any identical rows
)
head(openaq_data)
## # A tibble: 6 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CN "" 市八十六中 23.1047, 113.433… ChinaAQIDa… µg/m³ 2021-08-09T…
## 2 CN "" 市农科院 21.9508, 108.6553 ChinaAQIDa… µg/m³ 2020-12-31T…
## 3 CN "" 市发改委 29.8454, 114.3107 ChinaAQIDa… µg/m³ 2021-08-09T…
## 4 CN "" 市委 30.4576000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 5 CN "" 市委党校 27.7314000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 6 CN "" 市审计局 25.8070999999999… ChinaAQIDa… µg/m³ 2021-08-09T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
head(openaq_data)
## # A tibble: 6 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CN "" 市八十六中 23.1047, 113.433… ChinaAQIDa… µg/m³ 2021-08-09T…
## 2 CN "" 市农科院 21.9508, 108.6553 ChinaAQIDa… µg/m³ 2020-12-31T…
## 3 CN "" 市发改委 29.8454, 114.3107 ChinaAQIDa… µg/m³ 2021-08-09T…
## 4 CN "" 市委 30.4576000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 5 CN "" 市委党校 27.7314000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 6 CN "" 市审计局 25.8070999999999… ChinaAQIDa… µg/m³ 2021-08-09T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
class(openaq_data$SO2)
## [1] "numeric"
Let us check for outliers, as this is very important and I want to do it before scaling or filling in missing data.
cols <- c("O3", "PM2.5", "NO2", "CO", "NO", "NOX", "PM10", "SO2")
boxplot(openaq_data[, cols], main = "Pollutants")
CO has a lot of big outliers, I need to handle them.
summary(openaq_data$CO)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -49051.4 0.4 336.6 5591.4 665.0 3198676.0 19513
Strange that it has min value of -49051 , I need to check all the Pollutants, since neither if them should have negative values
for (i in c("O3", "PM2.5", "NO2", "CO", "NO", "NOX", "PM10", "SO2")) {
n <- nrow(openaq_data[openaq_data[[i]] < 0 & !is.na(openaq_data[[i]]), ])
print(head(openaq_data[openaq_data[[i]] < 0 & !is.na(openaq_data[[i]]), ]))
print(paste(i, ": ", n, "negative values"))
}
## # A tibble: 6 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 LT Karolini ki … "\"Viln… 54.6861114… eea µg/m³ 2025-01-31T…
## 2 LT Petra in sen… "\"Kaun… 54.8951280… eea µg/m³ 2025-01-31T…
## 3 LV Rezekne "Rezekn… 56.5110139… eea µg/m³ 2025-01-31T…
## 4 ZA Gert Sibande "Camden" -26.6226, … South Afri… ppm 2023-05-29T…
## 5 RO ALBA IULIA "AB-1" 46.0642749… eea µg/m³ 2024-09-08T…
## 6 RO RAMNICU VALC… "VL-2" 45.0388309… eea µg/m³ 2024-09-08T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "O3 : 121 negative values"
## # A tibble: 6 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 LT Petra in sen… "\"Kaun… 54.8951280… eea µg/m³ 2025-01-31T…
## 2 NO Oslo "Vahl s… 59.9169440… eea µg/m³ 2025-01-31T…
## 3 PL Biała "Biała,… 52.6025339… eea µg/m³ 2024-12-19T…
## 4 IT Via Amiternu… "AQ - A… 42.3641700… eea µg/m³ 2025-01-11T…
## 5 IT C.so Umberto… "MONTES… 42.5175000… eea µg/m³ 2025-01-11T…
## 6 US Poughkeepsie… "Newbur… 41.4994, -… AirNow µg/m³ 2023-05-31T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "PM2.5 : 217 negative values"
## # A tibble: 6 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 LT Karolini ki … "\"Viln… 54.6861114… eea µg/m³ 2025-01-31T…
## 2 LT Petra in sen… "\"Kaun… 54.8951280… eea µg/m³ 2025-01-31T…
## 3 NL Breda "Breda-… 51.6031, 4… Netherlands µg/m³ 2025-01-31T…
## 4 NL Heerlen "Heerle… 50.888, 5.… Netherlands µg/m³ 2025-01-31T…
## 5 NL Veldhoven "Veldho… 51.4074, 5… Netherlands µg/m³ 2025-01-31T…
## 6 HU Budapest "\"Buda… 47.4919440… eea µg/m³ 2025-01-31T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "NO2 : 216 negative values"
## # A tibble: 6 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 LT Petra in sen… "\"Kaun… 54.8951280… eea µg/m³ 2025-01-31T…
## 2 PT Rede de Qual… "Franci… 41.1644440… eea µg/m³ 2024-11-29T…
## 3 AT 5400 Hallein "Hallei… 47.6825900… eea µg/m³ 2024-12-21T…
## 4 TR Trabzon "Trabzo… 41.014341,… Turkiye µg/m³ 2022-02-18T…
## 5 PE Ilo "Ilo - … -17.634021… peru-oefa µg/m³ 2025-01-07T…
## 6 PE Tupac Amaru … "Pisco … -13.713381… peru-oefa µg/m³ 2025-01-08T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "CO : 44 negative values"
## # A tibble: 6 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NL Wijk aan Zee Wijk aan… 52.4939920… eea µg/m³ 2024-03-25T…
## 2 NL Amsterdam Amsterda… 52.3813310… eea µg/m³ 2024-03-25T…
## 3 NL Breda Breda-Ba… 51.6031, 4… Netherlands µg/m³ 2025-01-31T…
## 4 NL Heerlen Heerlen-… 50.888, 5.… Netherlands µg/m³ 2025-01-31T…
## 5 NL Veldhoven Veldhove… 51.4074, 5… Netherlands µg/m³ 2025-01-31T…
## 6 US Bakersfield Edison 35.345607,… AirNow ppm 2023-05-31T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "NO : 110 negative values"
## # A tibble: 5 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 US Ogden-Clearf… Harrisv… 41.302799,… AirNow ppm 2023-05-31T…
## 2 ZA Gert Sibande Elandsf… -26.245481… South Afri… ppm 2023-05-29T…
## 3 ZA Gert Sibande Camden -26.6226, … South Afri… ppm 2023-05-29T…
## 4 ZA Nkangala Masakha… -25.97255,… South Afri… ppm 2023-05-29T…
## 5 ZA Gert Sibande Grootvl… -26.801033… South Afri… ppm 2023-05-07T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "NOX : 5 negative values"
## # A tibble: 6 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 LT Karolini ki … "\"Viln… 54.6861114… eea µg/m³ 2025-01-31T…
## 2 LT Petra in sen… "\"Kaun… 54.8951280… eea µg/m³ 2025-01-31T…
## 3 NO Oslo "Vahl s… 59.9169440… eea µg/m³ 2025-01-31T…
## 4 PL Biała "Biała,… 52.6025339… eea µg/m³ 2024-12-19T…
## 5 IT Via Amiternu… "AQ - A… 42.3641700… eea µg/m³ 2025-01-11T…
## 6 IT C.so Umberto… "MONTES… 42.5175000… eea µg/m³ 2025-01-11T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "PM10 : 202 negative values"
## # A tibble: 6 × 16
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 FR Nord "FR1002… 50.9959079… EEA France µg/m³ 2024-01-29T…
## 2 LT Karolini ki … "\"Viln… 54.6861114… eea µg/m³ 2025-01-31T…
## 3 LT Petra in sen… "\"Kaun… 54.8951280… eea µg/m³ 2025-01-31T…
## 4 HU Budapest "\"Buda… 47.4919440… eea µg/m³ 2025-01-31T…
## 5 US Houston-Suga… "Housto… 29.6239, -… AirNow ppm 2016-03-06T…
## 6 US ST. LOUIS "Rider … 38.75264, … AirNow ppm 2023-05-31T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "SO2 : 200 negative values"
So around 1% of the data is negative, I think this is not significant. My guess is that this is either disfunctioning sensors or just a mistake in the report. It seems like those negative values come from completely different cities and hence sensors, so I am guessing this is not sensor malfunction but rather reporting issue. So I can do 1 of 3 things: remove those rows, set them to 0, fill in with median. I am in between dropping those values and setting them to 0, but since this is not a significant amount of data I will set those values to NA, so later they could be filled with median.
for (col in cols) {
openaq_data[[col]][openaq_data[[col]] < 0] <- NA
}
I will continue with outliers, will zoom in to see where the positive outliers lie
boxplot(openaq_data[, cols], main = "Pollutants", ylim = c(0, 1000))
Not too many positive outliers that lie above. However I realise that
the Unit for those (even of the same kind) pollutants are different.
for (col in cols){
print(paste(col, ":", nrow(openaq_data[openaq_data[[col]] > 1000,])))
}
## [1] "O3 : 16234"
## [1] "PM2.5 : 15110"
## [1] "NO2 : 14581"
## [1] "CO : 20125"
## [1] "NO : 21561"
## [1] "NOX : 23392"
## [1] "PM10 : 16458"
## [1] "SO2 : 17646"
Let’s check if units are the same, otherwise I will need to normalize the data:
unique(openaq_data$Unit)
## [1] "µg/m³" "ppm" "ppb"
nrow(openaq_data[openaq_data$Unit == 'ppb',])
## [1] 6
nrow(openaq_data[openaq_data$Unit == 'ppm',])
## [1] 6335
nrow(openaq_data[openaq_data$Unit == 'µg/m³',])
## [1] 19138
So µg/m³ has the most data. For the sake of starting clustering I will just assign median to those, otherwise if I have time I will get back to this and convert all of the values to µg/m³
for(p in cols) {
openaq_data[[p]][openaq_data$Unit != "µg/m³"] <- NA
}
openaq_data$Unit <- "µg/m³"
boxplot(openaq_data[, cols], main = "Pollutants", ylim = c(0, 1000))
There is a lot of missing data on some of the pollutants. I will fill them in with the median data to avoid outliers (with mean)
cols <- c("O3", "PM2.5", "NO2", "CO", "NO", "NOX", "PM10", "SO2")
for (col in cols) {
openaq_data[[col]][is.na(openaq_data[[col]])] <- median(openaq_data[[col]], na.rm = TRUE)
}
I need to scale data.
openaq_data[paste0(cols, "_scaled")] <- scale(openaq_data[cols])
There are a lot of columns that give us granularity on location, however for this particular clustering part I will not need such a deep granularity, I will only leave the country code and the neccesary features.
openaq_country <- openaq_data[,c('Country.Code', "O3_scaled", "PM2.5_scaled", "NO2_scaled", "CO_scaled", "NO_scaled", "NOX_scaled", "PM10_scaled", "SO2_scaled") ]
Now, I need to group those countries together, since we have so many rows for the same country. I will take a median to group them, since there are cases when the country is generally clean, but in the capital the air quality is terrible, so we still want to assign this country to clean one. I will be looking at a single cities (or rahter capitals) in another analysis.
openaq_country <- aggregate(. ~ Country.Code, data = openaq_country, FUN = median)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
set.seed(123)
hopkins_stat <- get_clust_tendency(openaq_country[,-1], n = nrow(openaq_country) -1, graph = TRUE)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
hopkins_stat
## $hopkins_stat
## [1] 0.9224853
##
## $plot
Number >0.75 represent a strong clustering tendency, so in my case the number is 0.92 which clearly indicates that I can proceed clustering the data.
Now I need to find the number of cluster suitable for my clustering analysis of countries. I will use Gap Statistic, Elbow method and Silhouette method
library(factoextra)
fviz_nbclust(openaq_country[,-1] , FUN = kmeans, method = "gap_stat")
From this plot I want to choose the number of cluters to be 4, because
there is a big jusmp from 3 to 4 cluters in the Gap statistic, and also
after 4 the curves still goes up but not as fast, so it will be a good
trade off between the quality and the computation speed to stop at 4.
Though math suggestion here is only 1 cluster - which does not make
sense for me because it is just all data in 1 cluster.
Let’s also confirm this with elbow method:
fviz_nbclust(openaq_country[,-1] , FUN = kmeans, method = "wss")
From WSS method the number of clusters it suggests me to pick is 5
And let’s also check that with Silhouette method
fviz_nbclust(openaq_country[, -1] , FUN = kmeans, method = "silhouette")
So 3 methods that suggest different number of clusters. I want to pick
5, becuase the drop in the elbow methods is quite significant and this
is also suppported by the silhouette , which even though has the 2 as
the suggestion and biggest climb, it also has 5 with the largest average
width. Also I would like to explore what 5 clusters could give me.
Because picking just 2 clusters might only suggest “clean” and
“pilluted” cities, and I want to explore more than that.
Now let’s use K-mean clustering to cluster our data.
rownames(openaq_country) <- openaq_country$Country.Code
kmean_country_k2 = eclust(openaq_country[,-1], FUNcluster = "kmeans", k = 2, graph = TRUE)
as.data.frame(kmean_country_k2$cluster)
## kmean_country_k2$cluster
## AD 2
## AE 1
## AF 2
## AG 2
## AJ 2
## AM 2
## AQ 2
## AR 2
## AT 2
## AU 2
## AZ 2
## BA 2
## BD 2
## BE 2
## BF 2
## BG 2
## BH 2
## BK 2
## BM 2
## BR 2
## BZ 2
## CA 2
## CD 2
## CE 2
## CF 2
## CH 2
## CI 2
## CL 2
## CN 1
## CO 2
## CR 2
## CS 2
## CW 2
## CY 1
## CZ 2
## DE 2
## DK 2
## DZ 2
## EC 2
## EE 2
## EG 2
## ES 2
## ET 2
## FI 2
## FR 2
## GA 2
## GB 2
## GH 2
## GI 2
## GN 2
## GR 2
## GT 2
## HK 1
## HR 2
## HU 2
## ID 2
## IE 2
## IL 2
## IN 2
## IQ 2
## IS 2
## IT 2
## IZ 2
## JO 2
## JP 2
## KE 2
## KG 2
## KR 2
## KU 2
## KV 2
## KW 2
## KZ 2
## LA 2
## LK 2
## LT 2
## LU 2
## LV 2
## MA 2
## MD 2
## ME 2
## MF 2
## MG 2
## MK 2
## ML 2
## MM 2
## MN 2
## MT 2
## MX 2
## MY 2
## MZ 2
## NG 2
## NI 2
## NL 2
## NO 2
## NP 2
## NZ 2
## OM 1
## PE 2
## PH 2
## PK 2
## PL 2
## PR 2
## PS 2
## PT 2
## PY 2
## QA 2
## RO 2
## RS 2
## RU 2
## RW 2
## SA 2
## SD 2
## SE 2
## SG 2
## SI 2
## SK 2
## SM 1
## SN 2
## SU 2
## TD 2
## TH 2
## TI 2
## TJ 2
## TM 2
## TR 2
## TT 2
## TW 2
## TX 2
## UC 2
## UG 2
## US 2
## UZ 2
## VM 2
## VN 2
## XK 2
## ZA 2
The 2 different clusters are very clear and interprettable. I can see that in cluster 2 in red we only have the most industrial and with biggest urban traffic and hence polluted countries like China, UAE, Hong Kong as well as the countries which geographically have a high dust level like UAE, Oman etc
The countries in the 2nd blue cluster represent the less affected by urbanism and geographical location countries.
I also want to have a look at 9 clusters, as I clearly have the 2nd biggest drop at 9 in WSS, and also the biggest increase at 9 in Gap Statistic.
rownames(openaq_country) <- openaq_country$Country.Code
kmean_country_k9 = eclust(openaq_country[,-1], FUNcluster = "kmeans", k = 9, graph = TRUE)
as.data.frame(kmean_country_k9$cluster)
## kmean_country_k9$cluster
## AD 9
## AE 7
## AF 6
## AG 9
## AJ 9
## AM 9
## AQ 9
## AR 9
## AT 9
## AU 9
## AZ 9
## BA 9
## BD 6
## BE 9
## BF 6
## BG 9
## BH 9
## BK 9
## BM 9
## BR 9
## BZ 9
## CA 9
## CD 6
## CE 9
## CF 9
## CH 8
## CI 9
## CL 9
## CN 7
## CO 9
## CR 9
## CS 9
## CW 9
## CY 5
## CZ 9
## DE 9
## DK 9
## DZ 9
## EC 9
## EE 1
## EG 6
## ES 9
## ET 9
## FI 9
## FR 9
## GA 9
## GB 9
## GH 9
## GI 9
## GN 9
## GR 9
## GT 9
## HK 3
## HR 9
## HU 9
## ID 9
## IE 9
## IL 9
## IN 2
## IQ 9
## IS 9
## IT 9
## IZ 9
## JO 9
## JP 9
## KE 9
## KG 6
## KR 9
## KU 9
## KV 9
## KW 9
## KZ 6
## LA 9
## LK 9
## LT 9
## LU 9
## LV 6
## MA 9
## MD 9
## ME 9
## MF 4
## MG 9
## MK 9
## ML 9
## MM 9
## MN 6
## MT 9
## MX 9
## MY 9
## MZ 9
## NG 9
## NI 9
## NL 9
## NO 9
## NP 9
## NZ 9
## OM 3
## PE 9
## PH 9
## PK 9
## PL 9
## PR 9
## PS 9
## PT 8
## PY 6
## QA 9
## RO 9
## RS 9
## RU 9
## RW 9
## SA 9
## SD 9
## SE 9
## SG 9
## SI 9
## SK 9
## SM 5
## SN 9
## SU 9
## TD 9
## TH 9
## TI 9
## TJ 9
## TM 9
## TR 9
## TT 9
## TW 9
## TX 9
## UC 9
## UG 9
## US 9
## UZ 9
## VM 9
## VN 9
## XK 8
## ZA 9
This one is difficult to interpret given many dimentions of Pollutants. Perhaps, the India cluster that stand separately at the bottom represent a pollution with a certain pollutant. Same goes with Estonia and Cyprus. The urban countries still fell into the same cluste just like with the k=2 clustering, which does classify them as highly polluted.
To better interpret this data I will check the mean of the pollutants in each cluster to identify what this cluster means.
openaq_country$kmean_cluster <- as.factor(kmean_country_k9$cluster)
cluster_profiles <- openaq_country %>%
group_by(kmean_cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
## ℹ In group 1: `kmean_cluster = 1`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
plot_data <- cluster_profiles %>%
pivot_longer(cols = -kmean_cluster, names_to = "Pollutant", values_to = "Average_Value")
ggplot(plot_data, aes(x = Pollutant, y = Average_Value, fill = kmean_cluster)) +
geom_bar(stat = "identity") +
coord_flip() +
facet_wrap(~kmean_cluster) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Pollutant Signatures by Cluster",
x = "Pollutant Type",
y = "Mean Concentration (Scaled/Standardized)")
Now much better for me to tell what those clusters mean. As I am not good at chemistry, I will use AI to ask for meanings of combinations like “high NO2” , “High O3 and High NO2”, “High O3 and Low NO2”
Cluster 1 - This clusters has very low (almost zero) concentration of pollutants, so it suggest that this cluster represents low pollution countries.
Cluster 2 - ozone (O₃) is strongly negative while other pollutants are close to zero, which matches the known effect where ozone is destroyed by fresh nitric oxide emissions near traffic. This is usually called ozone titration or scavenging. (From source: “Generally, close to the sources in city centres ozone concentrations are lower than those in suburbs and rural areas, mainly as a result of ozone scavenging by nitric oxide from traffic.” https://www.eea.europa.eu/en/analysis/publications/top08-98/page004.html)
Cluster 3 - This cluster has high ozone and moderately positive NO₂, which is a typical signal of photochemical smog formation under sunlight with traffic emissions present.( 1) https://en.wikipedia.org/wiki/Smog#Photochemical_smog 2) “The majority of ground-level ozone formation occurs when nitrogen oxides (NOx), carbon monoxide (CO), and volatile organic compounds (VOCs), react in the atmosphere in the presence of sunlight, specifically the UV spectrum. NOx, CO, and VOCs are considered ozone precursors. Motor vehicle exhaust, industrial emissions, and chemical solvents are the major anthropogenic sources of these ozone precursors.Although the ozone precursors often originate in urban areas, winds can carry NOx hundreds of kilometers, causing ozone formation to occur in less populated regions as well” https://en.wikipedia.org/wiki/Ground-level_ozone )
Cluster 4 - Very high NO and NO₂ dominate this cluster, while other pollutants stay low, which strongly points to direct emissions from heavy traffic or combustion engines. (“NOx gases are usually produced from the reaction between nitrogen and oxygen during combustion of fuels, such as hydrocarbons, in air; especially at high temperatures, such as in car engines.In areas of high motor vehicle traffic, such as in large cities, the nitrogen oxides emitted can be a significant source of air pollution. NOx gases are also produced naturally by lightning.” https://en.wikipedia.org/wiki/NOx)
Cluster 5 - This cluster shows high ozone but low NO₂, which suggests strong sunlight but limited local traffic, allowing ozone to accumulate instead of being destroyed by NO. (“The majority of ground-level ozone formation occurs when nitrogen oxides (NOx), carbon monoxide (CO), and volatile organic compounds (VOCs), react in the atmosphere in the presence of sunlight, specifically the UV spectrum.” https://en.wikipedia.org/wiki/Ground-level_ozone)
Cluster 6 - All pollutants are very close to zero, meaning this cluster represents an average or neutral air quality state without any dominant pollutant.
Cluster 7 - This cluster has extremely high ozone and PM₁₀, which fits environments affected by strong sunlight plus dust, industrial activity, or arid conditions. (https://www.sciencedirect.com/science/article/pii/S026974912101647X)
Cluster 8 - Low ozone combined with moderate PM₁₀ suggests winter conditions, where particulate matter increases due to heating, while ozone stays low because of weak sunlight. (“Due to the photochemical nature, the highest levels of ozone are seen during periods of sunny weather. It is worth mentioning that ozone can also be generated by household equipment, such as portable air cleaners.” https://www.who.int/teams/environment-climate-change-and-health/air-quality-and-health/health-impacts/types-of-pollutants)
Cluster 9 - This cluster is slightly negative/stable for all chemicals, suggesting rural or low-pollution conditions — typical for places with few emissions and clean air.
Now I will explore the same clustering task but with hierarchical clustering
For K=2 I will use Ward.D2 to minimize variations inside clusters to keep countries closer by the poluution.
str(openaq_country)
## 'data.frame': 136 obs. of 10 variables:
## $ Country.Code : Factor w/ 137 levels "-99","AD","AE",..: 2 3 4 5 6 7 8 9 10 11 ...
## $ O3_scaled : num -0.0991 2.1526 -0.0991 -0.0991 -0.0991 ...
## $ PM2.5_scaled : num -0.0543 -0.0543 0.0798 -0.0528 -0.0337 ...
## $ NO2_scaled : num -0.12 -0.158 -0.12 -0.12 -0.12 ...
## $ CO_scaled : num -0.0286 -0.0286 -0.0286 -0.0286 -0.0286 ...
## $ NO_scaled : num -0.0855 -0.0855 -0.0855 -0.0855 -0.0855 ...
## $ NOX_scaled : num -0.0267 -0.0267 -0.0267 -0.0267 -0.0267 ...
## $ PM10_scaled : num -0.0452 0.2358 -0.0452 -0.0452 -0.0452 ...
## $ SO2_scaled : num -0.0362 0.3214 -0.0362 -0.0362 -0.0362 ...
## $ kmean_cluster: Factor w/ 9 levels "1","2","3","4",..: 9 7 6 9 9 9 9 9 9 9 ...
library(factoextra)
fviz_nbclust(openaq_country[,-c(1,10)] , FUN = hcut, method = "gap_stat")
Just like with k-means, there is a very steep jump when it comes to k=2
and k=9.
fviz_nbclust(openaq_country[,-1] , FUN = hcut, method = "wss")
fviz_nbclust(openaq_country[, -1] , FUN = kmeans, method = "silhouette")
All 3 methods are
# install.packages("NbClust")
library(NbClust)
duda_test <- NbClust(openaq_country[, -c(1,10)],
distance = "euclidean",
method = "ward.D2",
index = "duda",
min.nc = 2, max.nc = 2)
# See the result (Look for the "Best number of clusters")
duda_test$Best.nc
## Number_clusters Value_Index
## 2.0000 0.4822
All the methods above suggest k = 2, just like for the k-means. For the sake of the analysis I also want to pick 9 (to later compare it with k-means, and also because in gap and silhouette statistic there is a strong suggestion for this).
rownames(openaq_country) <- openaq_country$Country.Code #I add this line to assign the country code to later see which cluster got assigned
hc_clusters_k2 <- eclust(openaq_country[,-c(1,10)], FUNcluster = "hclust", k = 2, hc_method = "ward.D2", graph = TRUE)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
fviz_cluster(hc_clusters_k2,
geom = "text",
repel = TRUE,
show.clust.cent = FALSE,
main = "Hierarchical Clustering with 2 clusters")
The results are very similar to the K-mean clustering, the split appears
to be between the very industrial countries and less affected
countries.
Now let’s run this method for 9 clusters
rownames(openaq_country) <- openaq_country$Country.Code #I add this line to assign the country code to later see which cluster got assigned
hc_clusters_k9 <- eclust(openaq_country[,-c(1,10)], FUNcluster = "hclust", k = 9, hc_method = "ward.D2", graph = TRUE)
fviz_cluster(hc_clusters_k9,
geom = "text",
# repel = TRUE,
show.clust.cent = FALSE,
main = "Hierarchical Clustering with 9 clusters")
I will not do the deep-dive into the countries within each cluster,
however from the graph above we can clearly see that the results are
very similar to the results of k-mean clustering
For the dbscan I first need to find the optimal value for epsilon which will determine the radius between the individual points to be able to assign them to the same cluster.
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
kNNdistplot(openaq_country[,-c(1,10)], k = 2)
From this graph the line curves up at the point 0.15 , but I will
increase it to 0.20 to be my eps parameter to make the algorithm less
strict when it comes to clustering.
It is also important to choose the minimum points to form a cluster. in the context of this project it makes sense to choose 1 point only , as there could be only 1 country in the world that is exceptionally clean or dirty. However if I pick 1, every point will become a single cluster which is not what I am trying to achieve.
Following the DBSCAN Wikipedia article (https://en.wikipedia.org/wiki/DBSCAN) , the min I will choose is 3. However it is also mentioned that it depends on the number of dimensions, and as a rule of thumb we can use minPts = 2·dim So in my case: 2 x 8 dims = 16
dbscan_country <- dbscan(openaq_country[,-c(1,10)], eps = 0.20, minPts = 16)
dbscan_country$cluster
## [1] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1
## [38] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
rownames(openaq_country) <- openaq_country$Country.Code
fviz_cluster(dbscan_country, data = openaq_country[,-c(1,10)],
geom = "text", repel = TRUE, main = "DBSCAN Clustering")
Interesting results, my interpretation of this is that most countries in
the world are very similar to each other when it comes to air quality so
they were assigned to the “average air quality cluster”. However there
are some countries with either much better air quality (like EE, XK, CH,
PT) or much worse air quality (like IN, CN, AE, OM) which is why they
are not part of the “average air quality cluster”.
With k=9 we have eps 0.4
library(dbscan)
kNNdistplot(openaq_country[,-c(1,10)], k = 9)
dbscan_country <- dbscan(openaq_country[,-c(1,10)], eps = 0.40, minPts = 16)
dbscan_country$cluster
## [1] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1
## [38] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
rownames(openaq_country) <- openaq_country$Country.Code
fviz_cluster(dbscan_country, data = openaq_country[,-c(1,10)],
geom = "text", repel = TRUE, main = "DBSCAN Clustering")
So with eps 0.40 the results are very similar to the eps = 0.20
Now that I have the results of the 3 main methods we covered during this course: K-means, DBSCAN, Hierarchical clustering - I can now compare the results of those , how much they agree with each other. So if all of them agree it means that I have a very good clustering results that can be reliable. It also provides me the insight on how much the certain result happened because of the use of a certain algorithm.
To do this I will use 2 approaches we went through during the course: ARI for comparing the clusters and Jaccard Similarity for checking robustness of each.
library(mclust)
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
library(fpc)
##
## Attaching package: 'fpc'
## The following object is masked from 'package:dbscan':
##
## dbscan
print(adjustedRandIndex(kmean_country_k2$cluster, hc_clusters_k2$cluster))
## [1] 1
This score is 1 which indicates that the clusters are strong and both algorithms agree with each other. And the results are the same no matter the algorithm used. And this is exactly what I observed above from the graph of the clustering assignments.
# install.packages("fossil")
library(fossil)
## Loading required package: sp
## Loading required package: maps
##
## Attaching package: 'maps'
## The following object is masked from 'package:mclust':
##
## map
## Loading required package: shapefiles
## Loading required package: foreign
##
## Attaching package: 'shapefiles'
## The following objects are masked from 'package:foreign':
##
## read.dbf, write.dbf
# print(adjustedRandIndex(kmean_country_k2$cluster, dbscan_country$cluster))
print(
data.frame(
Pair = c("KM vs HC", "KM vs DBSCAN", "HC vs DBSCAN"),
ARI = c(adjustedRandIndex(kmean_country_k2$cluster, hc_clusters_k2$cluster), adjustedRandIndex(kmean_country_k2$cluster, dbscan_country$cluster), adjustedRandIndex(hc_clusters_k2$cluster, dbscan_country$cluster)),
Jaccard = c(jaccard(kmean_country_k2$cluster, hc_clusters_k2$cluster), jaccard(kmean_country_k2$cluster, dbscan_country$cluster), jaccard(hc_clusters_k2$cluster, dbscan_country$cluster))
)
)
## Pair ARI Jaccard
## 1 KM vs HC 1.0000000 1.0000000
## 2 KM vs DBSCAN 0.7691994 0.9338235
## 3 HC vs DBSCAN 0.7691994 0.9338235
The results of the clusters are very close to each other both in ARI and Jaccard, which again proves the correct split of the countries for k=2 Though I would say that this version of Jaccard usage from fossil library is not the most reliable, I would create my own custom function, however here I can clearly see from another supporting metric ARI that the clusters are indeed similar.
print(
data.frame(
Pair = c("KM vs HC", "KM vs DBSCAN", "HC vs DBSCAN"),
ARI = c(adjustedRandIndex(kmean_country_k9$cluster, hc_clusters_k9$cluster), adjustedRandIndex(kmean_country_k9$cluster, dbscan_country$cluster), adjustedRandIndex(hc_clusters_k9$cluster, dbscan_country$cluster)),
Jaccard = c(jaccard(kmean_country_k9$cluster, hc_clusters_k9$cluster), jaccard(kmean_country_k9$cluster, dbscan_country$cluster), jaccard(hc_clusters_k9$cluster, dbscan_country$cluster))
)
)
## Pair ARI Jaccard
## 1 KM vs HC 0.6573443 1.0000000
## 2 KM vs DBSCAN 0.5113288 0.9338235
## 3 HC vs DBSCAN 0.8239581 0.9338235
The results are low for DBSCAN is as expected, since DBSCAN only had 1 cluster created in the end + the data points which were not in any of the clusters. Howeer I am surprised when it comes to the ARI index for KM vs HC, this index is a bit lower than I expected, but 0.66 is still a good result that shows that the clusters are somewhat similar.
Jaccard index from this part is completely not reliable so I would not use it in my next project, at least not from the fossil library.
head(openaq_data)
## # A tibble: 6 × 24
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CN "" 市八十六中 23.1047, 113.433… ChinaAQIDa… µg/m³ 2021-08-09T…
## 2 CN "" 市农科院 21.9508, 108.6553 ChinaAQIDa… µg/m³ 2020-12-31T…
## 3 CN "" 市发改委 29.8454, 114.3107 ChinaAQIDa… µg/m³ 2021-08-09T…
## 4 CN "" 市委 30.4576000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 5 CN "" 市委党校 27.7314000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 6 CN "" 市审计局 25.8070999999999… ChinaAQIDa… µg/m³ 2021-08-09T…
## # ℹ 17 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>, O3_scaled <dbl>,
## # PM2.5_scaled <dbl>, NO2_scaled <dbl>, CO_scaled <dbl>, NO_scaled <dbl>,
## # NOX_scaled <dbl>, PM10_scaled <dbl>, SO2_scaled <dbl>
poland <- openaq_data[openaq_data$Country.Code == 'PL',]
head(poland)
## # A tibble: 6 × 24
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 PL "" Konarskiego 52.1720333… Poland µg/m³ 2017-01-08T…
## 2 PL "" Reja 52.5509402… Poland µg/m³ 2017-01-08T…
## 3 PL "" Tochtermana 51.3990955… Poland µg/m³ 2017-01-08T…
## 4 PL "Łódź" Łódź, ul. Gdań… 51.775378,… GIOS µg/m³ 2025-01-31T…
## 5 PL "Piła" Pila, ul. Kuso… 53.154408,… GIOS µg/m³ 2025-01-02T…
## 6 PL "Ełk" Ełk, ul. Piłsu… 53.8283890… GIOS µg/m³ 2025-01-31T…
## # ℹ 17 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>, O3_scaled <dbl>,
## # PM2.5_scaled <dbl>, NO2_scaled <dbl>, CO_scaled <dbl>, NO_scaled <dbl>,
## # NOX_scaled <dbl>, PM10_scaled <dbl>, SO2_scaled <dbl>
Seems like a lot of City entries are missing
sum(is.na(poland$City))
## [1] 0
No NA entries in City column as the missing entries are just empty strings
nrow(poland[poland$City == "",])
## [1] 95
We have 95 entry out of 524 missing.
summary(poland)
## Country.Code City Location Coordinates
## PL :524 Length:524 Length:524 Length:524
## -99 : 0 Class :character Class :character Class :character
## AD : 0 Mode :character Mode :character Mode :character
## AE : 0
## AF : 0
## AG : 0
## (Other): 0
## Source.Name Unit Last.Updated Country.Label
## Length:524 Length:524 Length:524 Length:524
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## O3 SO2 PM2.5 NO2
## Min. : 2.337 Min. : 0.400 Min. : 0.00 Min. : 0.8519
## 1st Qu.:56.000 1st Qu.: 5.000 1st Qu.: 9.30 1st Qu.:16.4000
## Median :56.000 Median : 5.000 Median : 9.30 Median :16.4000
## Mean :50.584 Mean : 7.317 Mean : 16.18 Mean :17.6918
## 3rd Qu.:56.000 3rd Qu.: 5.000 3rd Qu.: 10.48 3rd Qu.:16.4000
## Max. :83.800 Max. :861.973 Max. :233.00 Max. :64.7987
##
## CO PM10 NO NOX
## Min. : 106.0 Min. : 0.00 Min. :3.000 Min. :20
## 1st Qu.: 570.0 1st Qu.: 20.90 1st Qu.:3.000 1st Qu.:20
## Median : 570.0 Median : 22.00 Median :3.000 Median :20
## Mean : 572.2 Mean : 29.16 Mean :3.004 Mean :20
## 3rd Qu.: 570.0 3rd Qu.: 28.21 3rd Qu.:3.000 3rd Qu.:20
## Max. :3566.0 Max. :262.90 Max. :5.000 Max. :20
##
## O3_scaled PM2.5_scaled NO2_scaled CO_scaled
## Min. :-1.73727 Min. :-0.07396 Min. :-1.12257 Min. :-0.03951
## 1st Qu.:-0.09907 1st Qu.:-0.05428 1st Qu.:-0.11983 1st Qu.:-0.02865
## Median :-0.09907 Median :-0.05428 Median :-0.11983 Median :-0.02865
## Mean :-0.26440 Mean :-0.03971 Mean :-0.03652 Mean :-0.02859
## 3rd Qu.:-0.09907 3rd Qu.:-0.05177 3rd Qu.:-0.11983 3rd Qu.:-0.02865
## Max. : 0.74960 Max. : 0.41911 Max. : 3.00152 Max. : 0.04153
##
## NO_scaled NOX_scaled PM10_scaled SO2_scaled
## Min. :-0.08551 Min. :-0.02673 Min. :-0.12701 Min. :-0.24454
## 1st Qu.:-0.08551 1st Qu.:-0.02673 1st Qu.:-0.04924 1st Qu.:-0.03621
## Median :-0.08551 Median :-0.02673 Median :-0.04515 Median :-0.03621
## Mean :-0.08492 Mean :-0.02673 Mean :-0.01850 Mean : 0.06872
## 3rd Qu.:-0.08551 3rd Qu.:-0.02673 3rd Qu.:-0.02205 3rd Qu.:-0.03621
## Max. : 0.20993 Max. :-0.02673 Max. : 0.85123 Max. :38.77644
##
#install.packages("tidygeocoder")
head(poland)
## # A tibble: 6 × 24
## Country.Code City Location Coordinates Source.Name Unit Last.Updated
## <fct> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 PL "" Konarskiego 52.1720333… Poland µg/m³ 2017-01-08T…
## 2 PL "" Reja 52.5509402… Poland µg/m³ 2017-01-08T…
## 3 PL "" Tochtermana 51.3990955… Poland µg/m³ 2017-01-08T…
## 4 PL "Łódź" Łódź, ul. Gdań… 51.775378,… GIOS µg/m³ 2025-01-31T…
## 5 PL "Piła" Pila, ul. Kuso… 53.154408,… GIOS µg/m³ 2025-01-02T…
## 6 PL "Ełk" Ełk, ul. Piłsu… 53.8283890… GIOS µg/m³ 2025-01-31T…
## # ℹ 17 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## # NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>, O3_scaled <dbl>,
## # PM2.5_scaled <dbl>, NO2_scaled <dbl>, CO_scaled <dbl>, NO_scaled <dbl>,
## # NOX_scaled <dbl>, PM10_scaled <dbl>, SO2_scaled <dbl>
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ readr 2.1.5
## ✔ lubridate 1.9.4 ✔ stringr 1.5.2
## ✔ purrr 1.1.0 ✔ tibble 3.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::map() masks maps::map(), mclust::map()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidygeocoder)
poland_prep <- poland %>%
mutate(Coordinates = as.character(Coordinates)) %>%
separate(Coordinates, into = c("lat_raw", "lon_raw"), sep = ",", remove = FALSE) %>%
mutate(
lat = as.numeric(str_trim(lat_raw)),
lon = as.numeric(str_trim(lon_raw))
)
# filter for unique unknown locations
to_geocode <- poland_prep %>%
filter(is.na(City) | City == "" | City == "NA") %>%
filter(!is.na(lat)) %>%
distinct(lat, lon)
if (nrow(to_geocode) > 0) {
results <- to_geocode %>%
reverse_geocode(lat = lat, long = lon, method = 'osm', full_results = TRUE)
city_lookup <- results %>%
mutate(
extracted_city = coalesce(
if("city" %in% names(.)) city else NULL,
if("town" %in% names(.)) town else NULL,
if("village" %in% names(.)) village else NULL,
if("suburb" %in% names(.)) suburb else NULL
)
) %>%
select(lat, lon, extracted_city)
poland <- poland_prep %>%
left_join(city_lookup, by = c("lat", "lon")) %>%
mutate(
City = ifelse(is.na(City) | City == "" | City == "NA", extracted_city, City)
) %>%
select(-any_of(c("lat_raw", "lon_raw", "lat", "lon", "extracted_city")))
print(head(poland[, c("City", "Location")]))
}
## Passing 71 coordinates to the Nominatim single coordinate geocoder
## Query completed in: 71.7 seconds
## # A tibble: 6 × 2
## City Location
## <chr> <chr>
## 1 Siedlce Konarskiego
## 2 Płock Reja
## 3 Radom Tochtermana
## 4 Łódź Łódź, ul. Gdańska
## 5 Piła Pila, ul. Kusocińskiego
## 6 Ełk Ełk, ul. Piłsudskiego
poland.backup <- poland
poland <- poland.backup
poland <- poland %>% group_by(City) %>% summarise(
O3_scaled = median(O3_scaled, na.rm = TRUE),
PM2.5_scaled = median(PM2.5_scaled, na.rm = TRUE),
NO2_scaled = median(NO2_scaled, na.rm = TRUE),
CO_scaled = median(CO_scaled, na.rm = TRUE),
NO_scaled = median(NO_scaled, na.rm = TRUE),
NOX_scaled = median(NOX_scaled, na.rm = TRUE),
PM10_scaled = median(PM10_scaled, na.rm = TRUE),
SO2_scaled = median(SO2_scaled, na.rm = TRUE),
)
##Outliers
Let’s check for outliers in Poland specific data as outliers in such data are common and happen due to misperformance of the sensors. I alreay removed outliers due to sensors misperformance at the EDA stage, however now I will be looking at the outliers on the city level.
boxplot(poland[-1], main = "Pollutants", ylim = c(-3, 4))
We have quite a lot of outliers, but I do not want to remove them as they represent real data representing real cities, because sometimes it happens that the air quality level is very bad.
I will need to use the clustering methods that are robust to outliers, like DBSCAN, PAM and Hierarchical clustering.
I will limit dataset to pollutants only to cluster on this data.
poland_train <- poland %>% select(-1)
set.seed(123)
hopkins_stat <- get_clust_tendency(poland_train, n = nrow(poland_train) -1, graph = TRUE)
hopkins_stat
## $hopkins_stat
## [1] 0.9226019
##
## $plot
Result is 0.97 –> strongly clusterable data, I can proceed with the
analysis
Now I need to find the number of cluster suitable for my clustering analysis of Poland cities. I will use Gap Statistic, Elbow method and Silhouette method
library(factoextra)
library(cluster)
##
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
##
## votes.repub
fviz_nbclust(poland_train, FUN = pam, method = "gap_stat")
Not so good results with the gap statistic peaking at 1, which is
totally useless for me as having just 1 cluster would put all my data
into that 1 cluster. Let me try to use elbow method instead:
fviz_nbclust(poland_train, FUN = pam, method = "wss")
This result is more descriptive, it tells me that the number of clusters
that is optimal for my data is 4, as there is the biggest drop at that
point and this is a very reasonable number.
Finally, let me try with Silhouette method.
fviz_nbclust(poland_train, FUN = pam, method = "silhouette")
This tells me the number of clusters I should use is 2. Since the Elbow
plot shows that k=2 and k=3 still leave a huge amount of unexplained
variation (high WSS) - I will use k = 4
poland_df <- as.data.frame(poland_train)
rownames(poland_df) <- poland$City
pam_poland = eclust(poland_df, FUNcluster = "pam", k = 4, graph = TRUE)
as.data.frame(pam_poland$cluster)
## pam_poland$cluster
## Augustów 1
## Belsk Duży 1
## Biała 1
## Biała Podlaska 2
## Białystok 1
## Bielsko-Biała 1
## Biskupiec 1
## Boguchwała 1
## Borsukowizna 3
## Borówiec 1
## Brzeg 3
## Bydgoszcz 1
## Bytów 1
## Chełm 1
## Chełmno 1
## Chojnice 1
## Chojnów 1
## Ciechanów 1
## Ciechocinek 4
## Cieszyn 1
## Czerniawa 3
## Częstochowa 1
## Diabla Góra 2
## Dolní Lutyně 1
## Duszniki-Zdrój 1
## Działdowo 1
## Działoszyn 1
## Dzierżoniów 3
## Dąbki 3
## Dąbrowa Górnicza 2
## Dębica 1
## Elbląg 2
## Ełk 2
## Florianka 2
## Gajew 2
## Gdańsk 1
## Gdynia 1
## Giżycko 1
## Gliwice 1
## Goczałkowice-Zdrój 2
## Gorzów Wielkopolski 1
## Gołdap 1
## Gołuchów 2
## Grajewo 3
## Granica 2
## Grudziądz 1
## Gubin 3
## Guty Duże 2
## Głubczyce 1
## Inowrocław 3
## Janów Lubelski 1
## Jarczew 2
## Jarosław 1
## Jastrzębie-Zdrój 1
## Jasło 1
## Jawor 3
## Jedlicze 1
## Jedlina-Zdrój 1
## Jelenia Góra 1
## Jędrzejów 1
## Kalisz 3
## Kamienna Góra 1
## Kamień Pomorski 3
## Kaszów 2
## Katowice 1
## Kazimierza Wielka 1
## Kielce 1
## Kluczbork 1
## Kolbuszowa 1
## Koniczynka 2
## Konin 2
## Konstancin-Jeziorna 1
## Kostrza 1
## Koszalin 1
## Koziegłowy 1
## Kołobrzeg 1
## Końskie 2
## Kościan 1
## Kościerzyna 1
## Kraków 1
## Krapkowice 1
## Krasnobród 3
## Krempna 3
## Krosno 1
## Krotoszyn 1
## Krynica-Zdrój 1
## Krzyżówka 3
## Kudowa-Zdrój 1
## Kutno 1
## Kędzierzyn-Koźle 4
## Kętrzyn 2
## Kłodzko 1
## Latoszyn 1
## Legionowo 4
## Legnica 1
## Liniewko Kościerskie 1
## Lubań 1
## Lublin 4
## Lubliniec 1
## Lubsko 1
## Lwówek Śląski 1
## Lądek-Zdrój 1
## Lębork 1
## Malbork 1
## Marszewo 1
## Małogoszcz 1
## Mielec 2
## Milicz 1
## Międzyrzecz 3
## Mińsk Mazowiecki 1
## Mogilno 3
## Mosina 1
## Mszana Dolna 1
## Muszyna 1
## Myślenice 1
## Nakło nad Notecią 3
## National automated monitoring network (SIS) 1
## Nałęczów 3
## Niepołomice 1
## Nisko 2
## Nowa Ruda 1
## Nowa Sól 3
## Nowiny 4
## Nowy Sącz 1
## Nowy Targ 1
## Nysa 1
## Olesno 2
## Olkusz 1
## Olsztyn 4
## Opatów 1
## Opole 1
## Osieczów 3
## Ostrowiec Świętokrzyski 2
## Ostrołęka 1
## Ostróda 1
## Otwock 4
## Oława 1
## Oświęcim 1
## Pabianice 2
## Parzniewice 2
## Piastów 4
## Piotrków Trybunalski 2
## Piła 1
## Pińczów 1
## Pleszew 1
## Polanica-Zdrój 1
## Polańczyk 1
## Poznań 1
## Połaniec 1
## Połczyn-Zdrój 3
## Prudnik 1
## Przemyśl 2
## Puławy 3
## Pułtusk 1
## Płock 2
## Rabka-Zdrój 1
## Racibórz 1
## Radom 4
## Radomsko 4
## Radzyń Podlaski 1
## Rudnik nad Sanem 1
## Rybnik 2
## Rymanów-Zdrój 1
## Rzeszów 1
## Sandomierz 2
## Sanok 1
## Siedlce 1
## Sierpc 1
## Sinołęka 4
## Skarżysko-Kamienna 1
## Skawina 1
## Smolary Bytnickie 3
## Solec Kujawski 1
## Sopot 1
## Sosnowiec 1
## Starachowice 1
## Strzegom 1
## Strzelce Opolskie 1
## Sucha Beskidzka 1
## Sulechów 3
## Sulęcin 1
## Suwałki 1
## Szamotuły 3
## Szarów 4
## Szczawnica 1
## Szczecin 1
## Szczecinek 3
## Szczytno 3
## Szymbark 3
## Słupsk 1
## Tarnobrzeg 1
## Tarnów 1
## Tomaszów Lubelski 1
## Toruń 1
## Trzebinia 1
## Trzebnica 1
## Tychy 1
## Ustka 1
## Ustroń 1
## Uście Gorlickie Wysowa 1
## Wadowice 1
## Warszawa 2
## Wałbrzych 1
## Wieniec-Zdrój 3
## Wilczopole 2
## Wodzisław Śląski 2
## Wrocław 1
## Wschowa 1
## Wysokie Mazowieckie 1
## Włocławek 1
## Włoszczowa 2
## Zabierzów 1
## Zabrze 2
## Zakopane 1
## Zamość 1
## Zawiercie 1
## Zdzieszowice 3
## Zgierz 4
## Zgorzelec 3
## Zielona Góra 1
## Zielonka 3
## Złockie 1
## Złoty Potok 1
## Český Těšín 4
## Łagów 1
## Łask 3
## Łeba 1
## Łomża 1
## Łuków 1
## Łódź 1
## Środa Śląska 1
## Świebodzin 1
## Świecie 3
## Świnoujście 1
## Żagań 1
## Żary 3
## Żory 1
## Żyrardów 1
## Żywiec 3
Now I will explore the same clustering task but with hierarchical clustering
I will use Ward.D2 to minimize variations inside clusters to keep countries closer by the poluution.
fviz_nbclust(poland_train, FUN = hcut, method = "gap_stat")
fviz_nbclust(poland_train, FUN = hcut, method = "wss")
fviz_nbclust(poland_train, FUN = hcut, method = "silhouette")
d <- dist(poland_df, method = "euclidean")
hc_poland <- hclust(d, method = "ward.D2" )
#plot(hc1, cex = 0.6, hang = -1) #This plotting did not work for me as I have many cities, and it the dendogram was very cluttered and not interpretable.
fviz_dend(hc_poland, k = 4,
cex = 0.3,
type = "circular",
rect = TRUE,
palette = "jco")
hc_poland <- cutree(hc_poland, k = 4)
Looking at the clusters the cities assigned to them are a mix of big and smaller cities, which is unexpected. My interpretation: Blue: These are the cleanest spots, like beach cities (Świnoujście) and health resorts (Ciechocinek, Lądek-Zdrój), where the air stays fresher because of better wind or cleaner heating rules. Yellow: These are middle-sized towns like Kalisz or Iława that get dirty air in the winter when many people use old coal heaters to stay warm. Grey: These are the busiest urban spots like Warsaw where a lot of car exhaust and city smoke cause the worst pollution spikes. Red: This is the most common group for Poland, containing cities like Poznań and Wrocław, where hundreds of towns share the same general smog problems during the cold months.
library(dbscan)
kNNdistplot(poland_train, k = 4)
abline(h = 0.30, lty = 2)
So optimal eps for me is 0.30
library("fpc")
library("dbscan")
set.seed(123)
db <- fpc::dbscan(poland_df, eps = 0.30, MinPts = 3)
fviz_cluster(db, data = poland_df, stand = FALSE,
ellipse = TRUE, show.clust.cent = FALSE,
geom = "text", labelsize = 8, repel = TRUE)
It seems like the data is too connected, so that for any minPts higher
than 2 I only get 1 cluster. For this particular analysis DBSCAN is not
the best method, so I will stick to the above methods like PAM and
hclust.
In my case I will only compare PAM to hclust.
print(adjustedRandIndex(pam_poland$cluster, hc_poland))
## [1] 0.5918604
The ARI of 0.59 mean that my clusters mostly agree, in my opinion they agree on the bigger cluster which includes most of the cities in poland that share very similar air quality.
Since PCA assumes linearity I want to check if it would not fail for my case:
pairs(poland_df, main = "Scatterplot Matrix of Pollutants")
I can see the linearity however some of the interactions are non-linear
and I need to be careful about those.
I need to check the distribution of the data to make sure the data is not skewed.
library(tidyverse)
poland_long <- poland_df %>%
pivot_longer(cols = everything(),
names_to = "pollutant",
values_to = "value")
ggplot(poland_long, aes(x = value, fill = pollutant)) +
geom_histogram(bins = 30, color = "white", alpha = 0.8) +
facet_wrap(~ pollutant, scales = "free") + # 'free' lets each plot have its own axis
theme_minimal() +
labs(title = "Distribution of Pollutants in Polish Cities",
x = "Scaled Value",
y = "Frequency") +
theme(legend.position = "none")
It seems like most of the data is skewed to the right. Also interesting
that NOX has 0 variance which is probably a mistake.
summary(poland_df$NOX_scaled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.02673 -0.02673 -0.02673 -0.02673 -0.02673 -0.02673
summary(openaq_data$NOX_scaled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.69495 -0.02672 -0.02672 0.00000 -0.02672 52.12095
summary(openaq_data$NOX)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 20.00 20.00 20.09 20.00 204.00
unique(poland_df$NOX_scaled)
## [1] -0.02672527
Seems like this feature only container outliers, and the rest of the data has no variance. I will have to remove this feature.
NO pollutant also seems off with just 1 bar. Let me check it.
summary(poland_df$NO_scaled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.08551 -0.08551 -0.08551 -0.08486 -0.08551 0.06221
summary(openaq_data$NO_scaled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.52868 -0.08551 -0.08551 0.00000 -0.08551 55.26648
unique(poland_df$NO_scaled)
## [1] -0.08551228 -0.07812611 0.06221120
unique(openaq_data$NO_scaled)
## [1] -0.0855122800 -0.4121682843 21.6434048014 0.3106821017 -0.1022050336
## [6] -0.2332357631 0.5842659925 -0.4828884496 1.0526971575 -0.2940771569
## [11] -0.3444124492 -0.3809592463 -0.4734025043 -0.3665314630 0.7136717637
## [16] 0.4832231300 2.8200167403 1.9826164838 1.2543397119 -0.4356169350
## [21] 1.9233248438 -0.3459953137 0.3473175255 5.5279800788 1.6871695175
## [26] -0.5286827294 0.0932331346 0.8008286187 0.3576581694 8.0392792919
## [31] -0.3218698530 0.0622112031 1.8348930006 7.7290599774 -0.4991380328
## [36] -0.3366422013 0.2542517312 -0.2036910665 -0.0264228868 1.9087547422
## [41] 4.6564115283 0.0326665065 9.5756035164 0.1951623379 -0.4843656845
## [46] -0.2627804598 4.3757369104 15.3811364033 8.3347262582 -0.4791185463
## [51] -0.0796033407 -0.1532183840 3.1644043488 7.8915558088 6.6462468460
## [56] 0.4201189065 1.5163073238 -0.0008667316 1.8110806015 -0.4715703993
## [61] -0.3720844005 -0.1150569766 1.1202056178 -0.3361588841 -0.3132280440
## [66] 0.2727322683 -0.3997193944 -0.5095224633 0.5866294545 -0.0966654059
## [71] -0.4052612365 -0.1878510659 1.7935304638 0.3998933473 -0.2910503027
## [76] -0.2388801297 0.0178941582 -0.4548209878 0.3990207446 1.4877428153
## [81] -0.3720958373 -0.0042643643 7.4443268857 1.7935304254 0.0536432411
## [86] -0.3666500111 -0.4706356730 -0.2676553347 2.2559049275 -0.4721046354
## [91] 0.9485521019 0.2099346862 1.2439990681 -0.5006152676 0.2690240795
## [96] 0.5053816525 1.5867175490 2.7507785960 1.0962755850 5.2916225058
## [101] 1.3178608097 8.7926690559 -0.1298293250 8.9699372356 1.7758036074
## [106] 0.2247070346 0.3281134727 1.4508119445 1.1996820232 0.1656176413
## [111] -0.0559675834 1.6428524726 0.6826498322 -0.0943756890 -0.4622071620
## [116] 18.3245268046 0.6738322175 -0.4540682303 0.6159482479 3.3121278319
## [121] 9.4618564344 4.5810725519 0.5940157424 1.5394460344 1.1760403569
## [126] 1.3664263820 0.6019928105 0.0044290224 1.0276544418 3.8749543026
## [131] -0.2802236487 -0.1121025070 -0.4370941699 -0.2805072777 -0.3088598459
## [136] 0.4034524491 -0.5153876159 -0.5139103811 -0.0411952351 -0.2923251564
## [141] 10.5948955500 3.2825831353 -0.4105039429 5.5132077305 -0.4400486395
## [146] 2.6473721578 0.2837964278 3.2973554836 12.8550648418 15.8538515493
## [151] 0.4462922592 2.1303399669 11.6732769768 0.0769835514 -0.3070975047
## [156] 12.4857561340 7.3006618763 -0.4252762912 3.9030217644 -0.1239203856
## [161] 2.4257869331 -0.3514145496 -0.4927859230 -0.4918360654 0.3502719952
## [166] -0.0680993951 -0.2242837142 -0.2689848623 0.7999743205 -0.4181427331
## [171] 0.8991255882 0.3556372442 -0.3628727260 -0.1970627227 -0.3536610530
## [176] 0.1828502060 -0.4882435130 -0.4430031121 -0.2732691846 12.9066988613
## [181] -0.2686893991 0.8423334429 -0.3292338686 -0.0939682558 46.7949552511
## [186] 0.2655479480 11.6440277272 0.2483427919 2.2445302193 0.4618032250
## [191] 2.0287359029 13.0041461830 -0.2709517842 3.2175848027 0.7466141004
## [196] -0.3650051101 2.2655733025 3.4834870723 4.4939156969 -0.1545427251
## [201] -0.3473478699 0.4268313153 -0.3661868980 42.5770296468 0.2394793829
## [206] 9.3687906401 4.4643710003 2.2485187534 1.3326331580 8.9994819322
## [211] 5.3802565957 8.6301732244 6.5620444607 7.7438323257 23.8900090314
## [216] -0.0707399317 13.0618777182 -0.5094786766 0.0947103694 -0.2203099584
## [221] -0.1224431508 4.2590353587 1.7149415323 0.9518020185 1.1741110883
## [226] 2.9280467758 -0.3757889244 0.5201540008 11.6282213145 -0.4659002491
## [231] 0.7038636689 -0.1989639151 -0.5267623241 9.1416367069 -0.3039953116
## [236] -0.2169861800 1.7870788607 -0.2834559581 0.0629165827 7.3139569898
## [241] -0.0589220531 -0.3957315946 0.0031218099 0.4758369559 3.9177941127
## [246] 1.0371861917 9.4869694266 -0.2775528081 4.9370861463 -0.2184634148
## [251] -0.1889187182 0.1360729447 1.2292267198 3.8143876745 8.0540516402
## [256] 0.1508452930 0.3428858211 -0.2480081115 5.6940212738 -0.0500586441
## [261] 1.2735437647 19.2108677033 0.0132260961 0.3551143710 3.4598513150
## [266] 8.4824497413 2.2780634500 0.1351215803 0.2607571779 -0.0220410803
## [271] -0.3251049042 -0.2799677325 -0.1652829565 -0.4273543987 -0.0404643923
## [276] -0.0408807273 -0.4457777285 0.6708317955 -0.0189423173 0.7249018565
## [281] 2.8394126859 0.8156009671 0.4853001222 7.2828991512 24.2624265681
## [286] -0.1416472036 -0.3657097511 0.1775832434 1.2602486513 0.3118638896
## [291] 0.9498816132 1.1821029287 0.2391839359 1.0259402127 2.7950956409
## [296] 2.0121611804 6.4734103708 1.1849096749 -0.0116505385 2.5144210230
## [301] 0.9633244502 6.6359062022 0.6531051356 14.6868360326 4.0212005509
## [306] 0.5024271828 0.9374728406 2.3703906270 -0.0485814092 -0.4697632181
## [311] 2.5735104163 12.6187072688 13.9482186170 -0.1446016733 -0.2023960338
## [316] 0.4753889844 -0.3033809473 -0.1187501213 1.1110479333 0.8673041862
## [321] 7.9574628964 2.9813749532 -0.4432985561 -0.4128675186 1.9857186769
## [326] -0.4740250406 -0.3085747395 -0.4042654325 -0.3868681856 -0.2719638940
## [331] -0.3813137797 0.1124371874 6.0006952248 0.6087880907 0.1803899896
## [336] 1.8053483040 11.8210004600 1.3769502029 3.9916558543 0.5644710457
## [341] 3.3269001802 -0.4961835631 0.7056805984 -0.0744330483 -0.4179639832
## [346] 2.4649335646 -0.4411718920 18.3472851376 0.1065282481 3.4889528412
## [351] -0.2864162171 3.9735652170 2.3814698882 27.7012748960 0.7565115738
## [356] 8.3199539099 17.5822163019 -0.1741463699 6.5915891573 0.5496986974
## [361] 0.1213005964 7.9654175504 4.2280134273 3.7552982813 0.2648789586
## [366] -0.1401699688 -0.3499373148 4.2676030992 0.0828924908 -0.4757238607
## [371] -0.3253694947 -0.3076027367 -0.4347237309 0.7517389501 -0.2633713537
## [376] 20.2575594502 -0.0680809090 0.2350476784 -0.1383933132 4.1984687306
## [381] 0.2252129919 -0.2868468310 1.3894549958 -0.5272054946 1.9678441354
## [386] 30.0943953227 0.3872028660 13.2391458980 0.3133411244 8.0983686852
## [391] 2.2632911017 5.3507118990 13.5641375608 2.2928357983 9.2210671569
## [396] 0.4669735469 1.7019418658 -0.0928984542 0.5166086372 -0.1534650823
## [401] -0.2673598877 1.0360044039 3.8587047195 0.9603699805 0.3739077525
## [406] 0.0474388548 -0.2701666339 0.7269668772 5.0995819777 4.4495986520
## [411] 1.2587714164 4.3461922138 7.5640528467 -0.1963048924 -0.2114816928
## [416] 26.8561485536 1.2809299389 3.7422761711 0.3112729957 0.8153055201
## [421] -0.0191339486 1.4064948996 1.3917225512 2.6178274612 -0.4695933361
## [426] 3.3416725285 1.5099013377 1.3621778546 3.3859895735 17.7151674367
## [431] 5.9711505282 -0.5035697373 12.1755368195 3.6666641914 5.6757035619
## [436] 0.3783394570 0.1161339025 -0.4328840565 -0.1918731879 2.7397928733
## [441] 0.0437457677 2.1979705812 3.6022895400 -0.1387666105 -0.2523327297
## [446] 0.2543089563 -0.1602160632 -0.1979308611 -0.0810805755 1.6280801243
## [451] -0.4363555524 5.6195686383 0.5275401750 -0.4743338935 -0.4420930616
## [456] -0.3632324460 2.4390820688 0.5730649554 -0.4552668912 0.6257762912
## [461] -0.4459575788 -0.5124331462 -0.0235633590 5.2679867485 5.6047962900
## [466] 7.5961088425 5.9120611349 2.5291933713 3.5928024498 17.3458587289
## [471] 0.4315199109 2.7360062477 3.6075747981 3.5041683600 -0.2465308766
## [476] 1.2971795220 4.4377807733 -0.4529744443 1.0636956538 -0.2154860702
## [481] -0.3352377395 1.9235270905 -0.1002846283 3.0134309490 -0.1011709692
## [486] -0.4725478058 -0.1043578094 -0.1686205455 1.9791068796 -0.5109559114
## [491] 2.7212338994 -0.1593740216 5.8529717417 1.7462589108 1.0815032367
## [496] 5.8825164383 5.5722971237 0.4019752143 7.1529383932 -0.4936084473
## [501] -0.0704445217 -0.3923339589 -0.3292560272 1.9631169840 2.6769168544
## [506] -0.2902628273 1.5918213082 0.6799985813 -0.0234684171 -0.5080014418
## [511] 0.1386231557 0.1198233615 -0.1325810944 1.7373955018 0.3391927340
## [516] -0.1944583488 3.4450789667 1.1258202816 3.4155342701 2.1451123152
## [521] 3.6518918431 4.5234603935 10.5801232017 1.2144543715 0.0917558997
## [526] 6.1779634045 0.4906093042 0.9928691468 0.8259416109 2.2071561781
## [531] 1.1252293877 0.0668305903 -0.3724654267 1.1657056221 0.6548207740
## [536] -0.5069673774 6.2473934416 7.0068398684 2.1081814444 -0.4341397002
## [541] 7.4483853594 -0.0544903486 -0.1357382643 0.8765369038 0.1493680582
## [546] 6.6263041758 -0.2121113051 -0.4991715660 -0.3676641328 -0.3632324283
## [551] 0.7532616572 -0.0583079828 -0.4016405339 -0.3469828451 0.5703799851
## [556] -0.3478555763 1.0024226771 -0.0443395294 -0.2214178845 2.0860229219
## [561] 0.8894627086 4.2871028205 2.1894293601 0.7417392255 6.0154675731
## [566] 1.5837630793 -0.4474348137 -0.3667777919 -0.1327837946 -0.4814112148
## [571] 0.8692244939 0.1464874340 0.3275705476 0.5127678266 0.2450972301
## [576] -0.1694277184 0.2468655570 -0.2707560772 -0.0507973132 0.7405574376
## [581] 0.4919151798 0.6796953626 11.2105803208 0.6944677109 -0.4592526923
## [586] -0.3271878984 -0.4060722384 -0.4968470631 -0.3868115632 7.9943713531
## [591] 9.9596845726 3.9325664610 4.3609645621 1.9382994388 -0.3942543597
## [596] 55.2664768474 0.9234391097 4.9186207109 -0.1039777154 17.9367526614
## [601] -0.1785780744 -0.0293773564 2.1693389664 -0.2672121643 -0.2486052881
## [606] 0.5408352884 -0.4641139767 0.5811637993 0.7343530513 -0.3159609137
## [611] -0.3702478165 0.6499419548 -0.3485516685 1.5630817917 0.7121945289
## [616] 2.7655509443 5.0700372811 2.5882827646 0.4795248017 -0.1529033652
## [621] -0.2891793892 3.7996153262 -0.0094347069 0.3724305177 0.7801473311
## [626] 0.4493944524 -0.3487850716 1.6728403396 3.5854222496 18.4161574697
## [631] -0.4235988911 -0.2131247783 0.0766670509 -0.4159697118 0.5260629401
## [636] 0.8599180120 9.9744569209 4.9666308429 -0.4858429193 -0.4651616316
## [641] 1.0414701728 3.1791766971 3.3771261645 2.0417058770 3.6223471465
## [646] 0.7860562704 2.5439657196 0.1375500717 0.8045216482 -0.0677115885
## [651] -0.0036177328 1.0888894108 0.9869602075 11.2891959207 0.6265149087
## [656] 0.0363595936 -0.0380930419 1.8452336444 -0.3670732389 -0.3765275418
## [661] -0.4133598810 0.2173208604 0.0181942053 9.5165141232 0.3030004806
## [666] -0.4902746238 3.3328091195 6.7984020337 7.0495319550 0.8746903603
## [671] 3.2087213937 8.8517584491 2.3666975399 -0.4045950036 0.9573859662
## [676] 0.0189282226 3.2347945885 -0.3005001739 2.3499530831 3.6578007824
## [681] 0.1297694758 -0.4478779945 -0.4496506718 -0.3912998901 -0.2946577929
## [686] 0.0078489613 3.5263268824 2.3706860739 0.2586834357 0.2239684171
## [691] 0.3901573357 -0.4823040555 -0.3893443266 -0.2760755732 6.7688573370
## [696] 0.0045990447 -0.0128294235 -0.4291971014 0.2867508975 -0.1638057261
## [701] -0.1682374306 1.1553649782 4.8041350115 5.2768501575 6.3404592360
## [706] 0.8451456637 0.4389060851 3.0166808656 -0.1539821189 -0.0471778914
## [711] -0.2983910623 -0.3874590840 3.7397871013 5.6904759102 3.2382660903
## [716] 11.3630576623 -0.0988073935 1.4375168310 0.4149748808 1.2620034024
## [721] 2.0557396079 -0.2016702093 4.3285835746 -0.4174438444 2.3534024264
## [726] 3.7833657431 -0.3898226553 -0.4755022755 0.4167475626 0.0051775180
## [731] -0.4852233655 -0.3168143935 -0.2652917605 0.7963969143 1.1227180885
## [736] 0.5792433941 2.4095373500 1.2135384859 -0.2661323056 -0.4326624654
## [741] 5.8234270450 0.0112944609 2.3238577298 -0.5065242069 0.9042350569
## [746] 5.2029884159 4.9075414497 2.4848763264 9.8852319371 2.7768905519
## [751] 3.3054802751 -0.2273268238 0.5910612727 1.9501173175 -0.3771603153
## [756] 1.3001990949 -0.2183937632 8.2347174601 -0.2804844456 -0.1075864302
## [761] -0.2366728853 -0.1652829609 1.0224138434 1.7167142141 5.8677440900
## [766] 6.1336463596 0.5349263491 0.9190074052 -0.4784567451 0.6881894629
## [771] 0.3172281193 9.0378900379 -0.4322794184 8.1816847297 1.7314865624
## [776] 3.6778764038 0.1708175079 0.6974221806 1.6486136884 2.8689573825
## [781] -0.4642010627 0.2266739166 -0.2132931298 0.0880628127 -0.2406219373
## [786] 0.8305380271 3.8227336362 -0.1233693682 11.0572700522 0.5866295682
## [791] 2.3962422365 0.0531262089 1.2883161131 0.0164169233 -0.2509625811
## [796] -0.2849389822 -0.4360883472 -0.3497920288 -0.2675850420 0.2567940110
## [801] 1.2086787378 3.5558715790 -0.0633537575 1.4434257703 1.8792100456
## [806] 2.6030551129 1.6576248209 11.9391792465 4.9518584946 -0.4089310601
## [811] 4.6952628044 -0.4031177687 -0.2593274233 -0.3449147164 6.7757381462
## [816] 0.9159424458 2.0231222628 5.8260860677 -0.2952796261 0.4647576946
Looks like NO pollutant feature is not recorded very well in Poland with the number of unique records in Poland being 3, which is certainly not enough to drive any conclusion, so I can remove this fature from the dataset as well.
poland_df <- poland_df %>% select(-NOX_scaled, -NO_scaled)
poland_log <- poland_df %>%
mutate(across(everything(), ~ log(. - min(.) + 0.1)))
poland_long <- poland_log %>%
pivot_longer(cols = everything(),
names_to = "pollutant",
values_to = "value")
ggplot(poland_long, aes(x = value, fill = pollutant)) +
geom_histogram(bins = 30, color = "white", alpha = 0.8) +
facet_wrap(~ pollutant, scales = "free") + # 'free' lets each plot have its own axis
theme_minimal() +
labs(title = "Distribution of Pollutants in Polish Cities",
x = "Scaled Value",
y = "Frequency") +
theme(legend.position = "none")
Great, now it looks much better.
I do not need to standardize the data as I already did this step during EDA.
pca_poland <- prcomp(poland_log, center=FALSE, scale.=FALSE)
pca_poland
## Standard deviations (1, .., p=6):
## [1] 3.6967413 0.5420173 0.3471524 0.2510593 0.2061991 0.1108709
##
## Rotation (n x k) = (6 x 6):
## PC1 PC2 PC3 PC4 PC5
## O3_scaled -0.10509687 -0.18879252 0.97420227 -0.04487156 -0.03667269
## PM2.5_scaled 0.56215298 -0.02865123 0.04430971 0.17749370 0.13982151
## NO2_scaled 0.01504604 0.95789576 0.18950498 -0.10329150 0.18872695
## CO_scaled 0.59516632 -0.16131887 0.05410104 -0.30295320 0.55900517
## PM10_scaled 0.46127120 0.13017347 0.09999128 0.64173605 -0.43999562
## SO2_scaled 0.32514338 0.05480234 -0.01136956 -0.67246383 -0.66136546
## PC6
## O3_scaled 0.029725755
## PM2.5_scaled 0.793817181
## NO2_scaled 0.003193814
## CO_scaled -0.461041134
## PM10_scaled -0.393527870
## SO2_scaled 0.039209062
summary(pca_poland)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 3.6967 0.54202 0.34715 0.25106 0.20620 0.11087
## Proportion of Variance 0.9625 0.02069 0.00849 0.00444 0.00299 0.00087
## Cumulative Proportion 0.9625 0.98321 0.99170 0.99614 0.99913 1.00000
In fact just the 1st PC already explains 96% of the variance which is a really good result and it means that I can reduce the dimention to just 2 dimentions to preserve still some of the data in PCA , even though it only holds 2% of the variance. So all the pollutants are highly correlated with each other.
fviz_pca_var(pca_poland, col.var="steelblue")
# visusalisation of quality
fviz_eig(pca_poland, choice='eigenvalue') # eigenvalues on y-axis
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
fviz_eig(pca_poland) # percentage of explained variance on y-axis
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
This plot exactly proves my conclusion above - 1st dimentions explains
most of the variance (in this case it is 96%)
Let;s check the most significant variables that are in this PC
loading_scores_PC_1<-pca_poland$rotation[,1]
fac_scores_PC_1<-abs(loading_scores_PC_1)
fac_scores_PC_1_ranked<-names(sort(fac_scores_PC_1, decreasing=T))
pca_poland$rotation[fac_scores_PC_1_ranked, 1]
## CO_scaled PM2.5_scaled PM10_scaled SO2_scaled O3_scaled NO2_scaled
## 0.59516632 0.56215298 0.46127120 0.32514338 -0.10509687 0.01504604
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
var<-get_pca_var(pca_poland)
a<-fviz_contrib(pca_poland, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(pca_poland, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
First graph shows the contribution of variables to the PC1 (with 96%
variance) - the significant variables here are CO, PM2.5 and PM10 First
graph shows the contribution of variables to the PC2 (with 2% variance)
- NO2 is the only significant variable here
#Clustering after Dimention reduction
I will use only PC1 and PC2 as they explain 98% of the variance
pca_poland_1_2 <- pca_poland$x[, 1:2]
fviz_nbclust(pca_poland_1_2, FUN = pam, method = "gap_stat")
fviz_nbclust(pca_poland_1_2, FUN = pam, method = "wss")
fviz_nbclust(pca_poland_1_2, FUN = pam, method = "silhouette")
The outputs are different from the before dimention reduction. I will
take k = 3 to capture more granularity as well as k=4 to compare it to
the clusters before dimention reduction.
pam_pca_k4 = eclust(pca_poland_1_2, FUNcluster = "pam", k = 4, graph = TRUE)
This looks very similar to the PAM before dimention reduction, but
clusters look tigher together
pam_pca_k3 = eclust(pca_poland_1_2, FUNcluster = "pam", k = 3, graph = TRUE)
fviz_nbclust(pca_poland_1_2, FUN = hcut, method = "gap_stat")
fviz_nbclust(pca_poland_1_2, FUN = hcut, method = "wss")
fviz_nbclust(pca_poland_1_2, FUN = hcut, method = "silhouette")
Very similar results to pam, methods stringly suggest 1 to 3 clustes, I
choose k=3 for more granularity. As well as k = 4 for comparison with
before simention reduction.
d_reduced <- dist(pca_poland_1_2, method = "euclidean")
hc_poland_reduced <- hclust(d_reduced, method = "ward.D2" )
#plot(hc1, cex = 0.6, hang = -1) #This plotting did not work for me as I have many cities, and it the dendogram was very cluttered and not interpretable.
fviz_dend(hc_poland_reduced, k = 4,
cex = 0.3,
type = "circular",
rect = TRUE,
palette = "jco")
hc_poland_k4 <- cutree(hc_poland_reduced, k = 4)
This looks different to the previous Hierarchical clustering before dimention reduction. I like this clustering more because now it actually gives the split that makes sense when it comes to pollution in big industrial and smaller coastal cities.
My interpretation: Yellow cluster: This big cluster represents a wide range of mid-sized towns and regional centers with moderate air quality that often fluctuates seasonally, cities like Gdańsk, Wrocław, Białystok, Poznań, and Rzeszów. Red cluster: This large cluster contains major metropolitan and industrial centers that historically face significant smog challenges due to traffic and residential heating, including Warsaw, Lublin, Zamość Grey cluster: This minor cluster includes cities that act as transitions between the main groups or have unique outlier data points, represented by locations such as Otwock, Radom. Blue cluster: This small, distinct group appears to represent locations with unique environmental monitoring profiles or typical Polish health resorts, such as Nałęczów, Krasnobród, Szamotuły, and Mogilno.
fviz_dend(hc_poland_reduced, k = 3,
cex = 0.3,
type = "circular",
rect = TRUE,
palette = "jco")
hc_poland_k3 <- cutree(hc_poland_reduced, k = 3)
Split with k=3 makes much more sense than k = 4 as it removed this small cluster which only had 3 cities and was difficult to interpret why they were clustered together. The interpretation is the same as before with k=4, excep the grey cluster in the previoud hclust.
kNNdistplot(pca_poland_1_2, k = 4)
abline(h = 0.25, lty = 2)
So optimal eps for k=4 is 0.25 similar to before dimention
reduction.
kNNdistplot(pca_poland_1_2, k = 3)
abline(h = 0.18, lty = 2)
And eps = 0.18 for k = 3
set.seed(123)
db <- fpc::dbscan(pca_poland_1_2, eps = 0.25, MinPts = 4)
fviz_cluster(db, data = pca_poland_1_2, stand = FALSE,
ellipse = TRUE, show.clust.cent = FALSE,
geom = "text", labelsize = 8, repel = TRUE)
Interesting, now with minPts = 4 I have much more clusters than before
dimention reduction with same minPts , where I only had 1 cluster. Blue
and red clusters are similar to the clusters I got in Hierarchical
clustering and PAM, which proves that those are consistent. In blue, for
example, I have Mogilno, Krasnobród etc, which represent the clean air
cluster.
Now let’s test with eps = 0.18
set.seed(123)
db_reduced <- fpc::dbscan(pca_poland_1_2, eps = 0.18, MinPts = 4)
fviz_cluster(db_reduced, data = pca_poland_1_2, stand = FALSE,
ellipse = TRUE, show.clust.cent = FALSE,
geom = "text", labelsize = 8, repel = TRUE)
Results are very similar or identical to the eps = 0.25, so no change here. It is still consistent with those red and blue clusters.
print(adjustedRandIndex(pam_poland$cluster, pam_pca_k4$cluster))
## [1] 0.3929003
For k = 4 in both the index is quite low.
print(adjustedRandIndex(hc_poland, hc_poland_k4))
## [1] 0.2038499
Same here, the result is low.
In my opinion this does not indicate that dimention reduction did not preserve the data. As I picked PC1 (96% of variance) and PC2 (2% of variance) then preserve alltogether 98% of variance. Also I keep in mind that after the dimention reduction the clusters were much more interpretable made much more sense (having the blue cluster with clean air with all the clean health Polish resorts is not coincidence.) This could mean that the 2% of the variance that I left behind with PC1 and PC2 were actually noice, which were distrupting the clisters
print(adjustedRandIndex(hc_poland_k3, pam_pca_k3$cluster))
## [1] 0.5780577
Index of 0.58 is a good agreement between clusters and also the strongest among the clusters after dimention reduction. So I can be confident that those 3 clusters are not coincidental and they actually exist.
print(adjustedRandIndex(db_reduced$cluster, pam_pca_k3$cluster))
## [1] 0.3939372
This disagreement is expected as dbscan is a desity based algorithm vs PAM which is based on medoids.
print(adjustedRandIndex(db_reduced$cluster, hc_poland_k3))
## [1] 0.2981993
Very poor agreement between those 2 methods.