Link to download dataset: https://public.opendatasoft.com/explore/assets/openaq/export/

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
openaq_data <- read.csv('openaq.csv', sep = ';')
str(openaq_data)
## 'data.frame':    61177 obs. of  10 variables:
##  $ Country.Code : chr  "CN" "CN" "CN" "CN" ...
##  $ City         : chr  "" "" "" "" ...
##  $ Location     : chr  "市八十六中" "市农科院" "市发改委" "市委" ...
##  $ Coordinates  : chr  "23.1047, 113.43319999999999" "21.9508, 108.6553" "29.8454, 114.3107" "30.457600000000003, 106.63030000000002" ...
##  $ Pollutant    : chr  "O3" "SO2" "PM2.5" "O3" ...
##  $ Source.Name  : chr  "ChinaAQIData" "ChinaAQIData" "ChinaAQIData" "ChinaAQIData" ...
##  $ Unit         : chr  "µg/m³" "µg/m³" "µg/m³" "µg/m³" ...
##  $ Value        : num  36 7 26 91 19 113 600 300 98 14 ...
##  $ Last.Updated : chr  "2021-08-09T13:00:00+02:00" "2020-12-31T17:00:00+01:00" "2021-08-09T13:00:00+02:00" "2021-08-09T13:00:00+02:00" ...
##  $ Country.Label: chr  "China" "China" "China" "China" ...

I will first cluster the data based on countries, so I will group data by country

openaq_data$Country.Code <- factor(openaq_data$Country.Code)
str(openaq_data)
## 'data.frame':    61177 obs. of  10 variables:
##  $ Country.Code : Factor w/ 137 levels "-99","AD","AE",..: 30 30 30 30 30 30 30 30 30 30 ...
##  $ City         : chr  "" "" "" "" ...
##  $ Location     : chr  "市八十六中" "市农科院" "市发改委" "市委" ...
##  $ Coordinates  : chr  "23.1047, 113.43319999999999" "21.9508, 108.6553" "29.8454, 114.3107" "30.457600000000003, 106.63030000000002" ...
##  $ Pollutant    : chr  "O3" "SO2" "PM2.5" "O3" ...
##  $ Source.Name  : chr  "ChinaAQIData" "ChinaAQIData" "ChinaAQIData" "ChinaAQIData" ...
##  $ Unit         : chr  "µg/m³" "µg/m³" "µg/m³" "µg/m³" ...
##  $ Value        : num  36 7 26 91 19 113 600 300 98 14 ...
##  $ Last.Updated : chr  "2021-08-09T13:00:00+02:00" "2020-12-31T17:00:00+01:00" "2021-08-09T13:00:00+02:00" "2021-08-09T13:00:00+02:00" ...
##  $ Country.Label: chr  "China" "China" "China" "China" ...

There is some strange unexpected factor ‘-99’ in Country.Code, so I am going to explore what this is

openaq_data[openaq_data$Country.Code == '-99',]
##       Country.Code                           City                     Location
## 2306           -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 10362          -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 10890          -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 23473          -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 34321          -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 57862          -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
## 57863          -99 Ορμήδια - Βιομηχανικός Σταθμός Ormidia - Industrial Station
##           Coordinates Pollutant
## 2306  34.9923, 33.779      PM10
## 10362 34.9923, 33.779     PM2.5
## 10890 34.9923, 33.779       SO2
## 23473 34.9923, 33.779       NOX
## 34321 34.9923, 33.779        O3
## 57862 34.9923, 33.779        NO
## 57863 34.9923, 33.779       NO2
##                                             Source.Name  Unit    Value
## 2306  Republic of Cyprus Department of Labor Inspection µg/m³ 22.80000
## 10362 Republic of Cyprus Department of Labor Inspection µg/m³ 11.21000
## 10890 Republic of Cyprus Department of Labor Inspection µg/m³  6.84391
## 23473 Republic of Cyprus Department of Labor Inspection µg/m³ 13.62056
## 34321 Republic of Cyprus Department of Labor Inspection µg/m³ 99.55051
## 57862 Republic of Cyprus Department of Labor Inspection µg/m³  0.83549
## 57863 Republic of Cyprus Department of Labor Inspection µg/m³ 12.35798
##                    Last.Updated Country.Label
## 2306  2025-01-31T10:00:00+01:00              
## 10362 2025-01-31T10:00:00+01:00              
## 10890 2025-01-31T10:00:00+01:00              
## 23473 2025-01-31T10:00:00+01:00              
## 34321 2025-01-31T10:00:00+01:00              
## 57862 2025-01-31T10:00:00+01:00              
## 57863 2025-01-31T10:00:00+01:00

All of the Country.Code entries that have -99 have City Ορμήδια - Βιομηχανικός Σταθμός , which is a city in Cyprus. Let’s check if Cyprus is already in the data

head(openaq_data[openaq_data$Country.Code == 'CY',])
##      Country.Code                                     City
## 123            CY Αγία Μαρίνα Ξυλιάτου - Σταθμός Υποβάθρου
## 2440           CY          Λεμεσός - Κυκλοφοριακός Σταθμός
## 2441           CY         Καλαβασός - Βιομηχανικός Σταθμός
## 2442           CY            Πάφος - Κυκλοφοριακός Σταθμός
## 4978           CY        Παραλίμνι - Κυκλοφοριακός Σταθμός
## 4979           CY            Πάφος - Κυκλοφοριακός Σταθμός
##                                       Location
## 123  Ayia Marina Xyliatou - Background Station
## 2440                Limassol - Traffic Station
## 2441            Kalavasos - Industrial Station
## 2442                  Paphos - Traffic Station
## 4978               Paralimni - Traffic Station
## 4979                  Paphos - Traffic Station
##                                Coordinates Pollutant
## 123          35.0380556, 33.05777779999994        O3
## 2440         34.6861111, 33.03555559999995       SO2
## 2441           34.77152999999999, 33.29623       NO2
## 2442         34.7752778, 32.42194440000003        NO
## 4978 35.04569276191925, 33.977734884655774        NO
## 4979         34.7752778, 32.42194440000003     PM2.5
##                                            Source.Name  Unit       Value
## 123  Republic of Cyprus Department of Labor Inspection µg/m³ 114.8657000
## 2440 Republic of Cyprus Department of Labor Inspection µg/m³   0.6449241
## 2441 Republic of Cyprus Department of Labor Inspection µg/m³   7.5563500
## 2442 Republic of Cyprus Department of Labor Inspection µg/m³   1.9617910
## 4978 Republic of Cyprus Department of Labor Inspection µg/m³  11.5399600
## 4979 Republic of Cyprus Department of Labor Inspection µg/m³  14.7685100
##                   Last.Updated Country.Label
## 123  2025-01-31T10:00:00+01:00        Cyprus
## 2440 2025-01-31T10:00:00+01:00        Cyprus
## 2441 2025-01-31T10:00:00+01:00        Cyprus
## 2442 2025-01-31T10:00:00+01:00        Cyprus
## 4978 2025-01-31T10:00:00+01:00        Cyprus
## 4979 2025-01-31T10:00:00+01:00        Cyprus

Yes, so Cyprus is already in the data so I will just fix -99 to represent CY

openaq_data[openaq_data$Country.Code == '-99',]$Country.Code <- 'CY'

I want to reshape the data to have a Polutants as separate columns that will have the values in each row Let’s check what Pollutants there are in the data and if they all have a value and let’s handle this.

unique(openaq_data$Pollutant)
##  [1] "O3"               "SO2"              "PM2.5"            "NO2"             
##  [5] "CO"               "PM10"             "NO"               "PM1"             
##  [9] "RELATIVEHUMIDITY" "TEMPERATURE"      "NOX"              "UM003"           
## [13] "BC"

So overall we have the following Pollutants: BC CO NO NO2 NOX O3 PM1 PM10 PM2.5 RELATIVEHUMIDITY SO2 TEMPERATURE UM003 Let’s check what is the proportion of those reported in the data:

print(paste('BC:',count(openaq_data[openaq_data$Pollutant == 'BC',])))
## [1] "BC: 137"
print(paste('CO:',count(openaq_data[openaq_data$Pollutant == 'CO',])))
## [1] "CO: 5968"
print(paste('NO:',count(openaq_data[openaq_data$Pollutant == 'NO',])))
## [1] "NO: 4030"
print(paste('NO2:',count(openaq_data[openaq_data$Pollutant == 'NO2',])))
## [1] "NO2: 11129"
print(paste('NOX:',count(openaq_data[openaq_data$Pollutant == 'NOX',])))
## [1] "NOX: 2094"
print(paste('O3:',count(openaq_data[openaq_data$Pollutant == 'O3',])))
## [1] "O3: 9377"
print(paste('PM1:',count(openaq_data[openaq_data$Pollutant == 'PM1',])))
## [1] "PM1: 159"
print(paste('PM10:',count(openaq_data[openaq_data$Pollutant == 'PM10',])))
## [1] "PM10: 9264"
print(paste('PM2.5:',count(openaq_data[openaq_data$Pollutant == 'PM2.5',])))
## [1] "PM2.5: 10662"
print(paste('RELATIVEHUMIDITY:',count(openaq_data[openaq_data$Pollutant == 'RELATIVEHUMIDITY',])))
## [1] "RELATIVEHUMIDITY: 89"
print(paste('SO2:',count(openaq_data[openaq_data$Pollutant == 'SO2',])))
## [1] "SO2: 8061"
print(paste('TEMPERATURE:',count(openaq_data[openaq_data$Pollutant == 'TEMPERATURE',])))
## [1] "TEMPERATURE: 120"
print(paste('UM003:',count(openaq_data[openaq_data$Pollutant == 'UM003',])))
## [1] "UM003: 87"

I will drop those with too few observations, lets say under 1.000

openaq_data$Pollutant <- factor(openaq_data$Pollutant)
openaq_data <- openaq_data[openaq_data$Pollutant %in% names(which(table(openaq_data$Pollutant) >= 1000)), ]
duplicates <- openaq_data %>%
  group_by(Country.Code, City, Location, Coordinates, Source.Name, Unit, Last.Updated, Country.Label, Pollutant) %>%
  filter(n() > 1) %>%
  arrange(Pollutant, Location)

duplicates
## # A tibble: 128 × 10
## # Groups:   Country.Code, City, Location, Coordinates, Source.Name, Unit,
## #   Last.Updated, Country.Label, Pollutant [61]
##    Country.Code City    Location Coordinates Pollutant Source.Name Unit    Value
##    <fct>        <chr>   <chr>    <chr>       <fct>     <chr>       <chr>   <dbl>
##  1 ZA           Sekhuk… Dilokong -24.615222… CO        South Afri… ppm   0.789  
##  2 ZA           Sekhuk… Dilokong -24.615222… CO        South Afri… ppm   0.856  
##  3 ZA           Sekhuk… Dilokong -24.615222… CO        South Afri… ppm   1.04   
##  4 ZA           Sekhuk… Dilokong -24.615222… CO        South Afri… ppm   0.824  
##  5 ZA           Sekhuk… Dilokong -24.615222… NO        South Afri… ppm   0.0139 
##  6 ZA           Sekhuk… Dilokong -24.615222… NO        South Afri… ppm   0.0152 
##  7 ZA           Sekhuk… Dilokong -24.615222… NO        South Afri… ppm   0.0154 
##  8 ZA           Sekhuk… Dilokong -24.615222… NO        South Afri… ppm   0.0137 
##  9 ZA           Sekhuk… Dilokong -24.615222… NO2       South Afri… ppm   0.00156
## 10 ZA           Sekhuk… Dilokong -24.615222… NO2       South Afri… ppm   0.00139
## # ℹ 118 more rows
## # ℹ 2 more variables: Last.Updated <chr>, Country.Label <chr>

there are some duplicates since some of the air quality report were reported a few times accross differents times at the same location, so I will actually also take median of those cases and keep 1 row

As I want to cluster based on the Pollutant types, I need to reshape the data

openaq_data <- openaq_data %>%
  pivot_wider(
    id_cols = c(Country.Code, City, Location, Coordinates, Source.Name, Unit, Last.Updated, Country.Label),
    names_from = Pollutant,
    values_from = Value, 
    values_fn = list(Value = median) #taking median if any identical rows
  )

head(openaq_data)
## # A tibble: 6 × 16
##   Country.Code City  Location   Coordinates       Source.Name Unit  Last.Updated
##   <fct>        <chr> <chr>      <chr>             <chr>       <chr> <chr>       
## 1 CN           ""    市八十六中 23.1047, 113.433… ChinaAQIDa… µg/m³ 2021-08-09T…
## 2 CN           ""    市农科院   21.9508, 108.6553 ChinaAQIDa… µg/m³ 2020-12-31T…
## 3 CN           ""    市发改委   29.8454, 114.3107 ChinaAQIDa… µg/m³ 2021-08-09T…
## 4 CN           ""    市委       30.4576000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 5 CN           ""    市委党校   27.7314000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 6 CN           ""    市审计局   25.8070999999999… ChinaAQIDa… µg/m³ 2021-08-09T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
head(openaq_data)
## # A tibble: 6 × 16
##   Country.Code City  Location   Coordinates       Source.Name Unit  Last.Updated
##   <fct>        <chr> <chr>      <chr>             <chr>       <chr> <chr>       
## 1 CN           ""    市八十六中 23.1047, 113.433… ChinaAQIDa… µg/m³ 2021-08-09T…
## 2 CN           ""    市农科院   21.9508, 108.6553 ChinaAQIDa… µg/m³ 2020-12-31T…
## 3 CN           ""    市发改委   29.8454, 114.3107 ChinaAQIDa… µg/m³ 2021-08-09T…
## 4 CN           ""    市委       30.4576000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 5 CN           ""    市委党校   27.7314000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 6 CN           ""    市审计局   25.8070999999999… ChinaAQIDa… µg/m³ 2021-08-09T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
class(openaq_data$SO2)
## [1] "numeric"

Let us check for outliers, as this is very important and I want to do it before scaling or filling in missing data.

cols <- c("O3", "PM2.5", "NO2", "CO", "NO", "NOX", "PM10", "SO2")
boxplot(openaq_data[, cols], main = "Pollutants")

CO has a lot of big outliers, I need to handle them.

summary(openaq_data$CO)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##  -49051.4       0.4     336.6    5591.4     665.0 3198676.0     19513

Strange that it has min value of -49051 , I need to check all the Pollutants, since neither if them should have negative values

for (i in c("O3", "PM2.5", "NO2", "CO", "NO", "NOX", "PM10", "SO2")) {
  n <- nrow(openaq_data[openaq_data[[i]] < 0 & !is.na(openaq_data[[i]]), ])
  print(head(openaq_data[openaq_data[[i]] < 0 & !is.na(openaq_data[[i]]), ]))
  print(paste(i, ": ", n, "negative values"))
}
## # A tibble: 6 × 16
##   Country.Code City          Location Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>         <chr>    <chr>       <chr>       <chr> <chr>       
## 1 LT           Karolini ki … "\"Viln… 54.6861114… eea         µg/m³ 2025-01-31T…
## 2 LT           Petra in sen… "\"Kaun… 54.8951280… eea         µg/m³ 2025-01-31T…
## 3 LV           Rezekne       "Rezekn… 56.5110139… eea         µg/m³ 2025-01-31T…
## 4 ZA           Gert Sibande  "Camden" -26.6226, … South Afri… ppm   2023-05-29T…
## 5 RO           ALBA IULIA    "AB-1"   46.0642749… eea         µg/m³ 2024-09-08T…
## 6 RO           RAMNICU VALC… "VL-2"   45.0388309… eea         µg/m³ 2024-09-08T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "O3 :  121 negative values"
## # A tibble: 6 × 16
##   Country.Code City          Location Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>         <chr>    <chr>       <chr>       <chr> <chr>       
## 1 LT           Petra in sen… "\"Kaun… 54.8951280… eea         µg/m³ 2025-01-31T…
## 2 NO           Oslo          "Vahl s… 59.9169440… eea         µg/m³ 2025-01-31T…
## 3 PL           Biała         "Biała,… 52.6025339… eea         µg/m³ 2024-12-19T…
## 4 IT           Via Amiternu… "AQ - A… 42.3641700… eea         µg/m³ 2025-01-11T…
## 5 IT           C.so Umberto… "MONTES… 42.5175000… eea         µg/m³ 2025-01-11T…
## 6 US           Poughkeepsie… "Newbur… 41.4994, -… AirNow      µg/m³ 2023-05-31T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "PM2.5 :  217 negative values"
## # A tibble: 6 × 16
##   Country.Code City          Location Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>         <chr>    <chr>       <chr>       <chr> <chr>       
## 1 LT           Karolini ki … "\"Viln… 54.6861114… eea         µg/m³ 2025-01-31T…
## 2 LT           Petra in sen… "\"Kaun… 54.8951280… eea         µg/m³ 2025-01-31T…
## 3 NL           Breda         "Breda-… 51.6031, 4… Netherlands µg/m³ 2025-01-31T…
## 4 NL           Heerlen       "Heerle… 50.888, 5.… Netherlands µg/m³ 2025-01-31T…
## 5 NL           Veldhoven     "Veldho… 51.4074, 5… Netherlands µg/m³ 2025-01-31T…
## 6 HU           Budapest      "\"Buda… 47.4919440… eea         µg/m³ 2025-01-31T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "NO2 :  216 negative values"
## # A tibble: 6 × 16
##   Country.Code City          Location Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>         <chr>    <chr>       <chr>       <chr> <chr>       
## 1 LT           Petra in sen… "\"Kaun… 54.8951280… eea         µg/m³ 2025-01-31T…
## 2 PT           Rede de Qual… "Franci… 41.1644440… eea         µg/m³ 2024-11-29T…
## 3 AT           5400 Hallein  "Hallei… 47.6825900… eea         µg/m³ 2024-12-21T…
## 4 TR           Trabzon       "Trabzo… 41.014341,… Turkiye     µg/m³ 2022-02-18T…
## 5 PE           Ilo           "Ilo - … -17.634021… peru-oefa   µg/m³ 2025-01-07T…
## 6 PE           Tupac Amaru … "Pisco … -13.713381… peru-oefa   µg/m³ 2025-01-08T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "CO :  44 negative values"
## # A tibble: 6 × 16
##   Country.Code City         Location  Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>        <chr>     <chr>       <chr>       <chr> <chr>       
## 1 NL           Wijk aan Zee Wijk aan… 52.4939920… eea         µg/m³ 2024-03-25T…
## 2 NL           Amsterdam    Amsterda… 52.3813310… eea         µg/m³ 2024-03-25T…
## 3 NL           Breda        Breda-Ba… 51.6031, 4… Netherlands µg/m³ 2025-01-31T…
## 4 NL           Heerlen      Heerlen-… 50.888, 5.… Netherlands µg/m³ 2025-01-31T…
## 5 NL           Veldhoven    Veldhove… 51.4074, 5… Netherlands µg/m³ 2025-01-31T…
## 6 US           Bakersfield  Edison    35.345607,… AirNow      ppm   2023-05-31T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "NO :  110 negative values"
## # A tibble: 5 × 16
##   Country.Code City          Location Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>         <chr>    <chr>       <chr>       <chr> <chr>       
## 1 US           Ogden-Clearf… Harrisv… 41.302799,… AirNow      ppm   2023-05-31T…
## 2 ZA           Gert Sibande  Elandsf… -26.245481… South Afri… ppm   2023-05-29T…
## 3 ZA           Gert Sibande  Camden   -26.6226, … South Afri… ppm   2023-05-29T…
## 4 ZA           Nkangala      Masakha… -25.97255,… South Afri… ppm   2023-05-29T…
## 5 ZA           Gert Sibande  Grootvl… -26.801033… South Afri… ppm   2023-05-07T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "NOX :  5 negative values"
## # A tibble: 6 × 16
##   Country.Code City          Location Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>         <chr>    <chr>       <chr>       <chr> <chr>       
## 1 LT           Karolini ki … "\"Viln… 54.6861114… eea         µg/m³ 2025-01-31T…
## 2 LT           Petra in sen… "\"Kaun… 54.8951280… eea         µg/m³ 2025-01-31T…
## 3 NO           Oslo          "Vahl s… 59.9169440… eea         µg/m³ 2025-01-31T…
## 4 PL           Biała         "Biała,… 52.6025339… eea         µg/m³ 2024-12-19T…
## 5 IT           Via Amiternu… "AQ - A… 42.3641700… eea         µg/m³ 2025-01-11T…
## 6 IT           C.so Umberto… "MONTES… 42.5175000… eea         µg/m³ 2025-01-11T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "PM10 :  202 negative values"
## # A tibble: 6 × 16
##   Country.Code City          Location Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>         <chr>    <chr>       <chr>       <chr> <chr>       
## 1 FR           Nord          "FR1002… 50.9959079… EEA France  µg/m³ 2024-01-29T…
## 2 LT           Karolini ki … "\"Viln… 54.6861114… eea         µg/m³ 2025-01-31T…
## 3 LT           Petra in sen… "\"Kaun… 54.8951280… eea         µg/m³ 2025-01-31T…
## 4 HU           Budapest      "\"Buda… 47.4919440… eea         µg/m³ 2025-01-31T…
## 5 US           Houston-Suga… "Housto… 29.6239, -… AirNow      ppm   2016-03-06T…
## 6 US           ST. LOUIS     "Rider … 38.75264, … AirNow      ppm   2023-05-31T…
## # ℹ 9 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>
## [1] "SO2 :  200 negative values"

So around 1% of the data is negative, I think this is not significant. My guess is that this is either disfunctioning sensors or just a mistake in the report. It seems like those negative values come from completely different cities and hence sensors, so I am guessing this is not sensor malfunction but rather reporting issue. So I can do 1 of 3 things: remove those rows, set them to 0, fill in with median. I am in between dropping those values and setting them to 0, but since this is not a significant amount of data I will set those values to NA, so later they could be filled with median.

for (col in cols) {
  openaq_data[[col]][openaq_data[[col]] < 0] <- NA
}

I will continue with outliers, will zoom in to see where the positive outliers lie

boxplot(openaq_data[, cols], main = "Pollutants", ylim = c(0, 1000))

Not too many positive outliers that lie above. However I realise that the Unit for those (even of the same kind) pollutants are different.

for (col in cols){
  print(paste(col, ":", nrow(openaq_data[openaq_data[[col]] > 1000,])))
}
## [1] "O3 : 16234"
## [1] "PM2.5 : 15110"
## [1] "NO2 : 14581"
## [1] "CO : 20125"
## [1] "NO : 21561"
## [1] "NOX : 23392"
## [1] "PM10 : 16458"
## [1] "SO2 : 17646"

Let’s check if units are the same, otherwise I will need to normalize the data:

unique(openaq_data$Unit)
## [1] "µg/m³" "ppm"   "ppb"
nrow(openaq_data[openaq_data$Unit == 'ppb',])
## [1] 6
nrow(openaq_data[openaq_data$Unit == 'ppm',])
## [1] 6335
nrow(openaq_data[openaq_data$Unit == 'µg/m³',])
## [1] 19138

So µg/m³ has the most data. For the sake of starting clustering I will just assign median to those, otherwise if I have time I will get back to this and convert all of the values to µg/m³

for(p in cols) {
  openaq_data[[p]][openaq_data$Unit != "µg/m³"] <- NA
}

openaq_data$Unit <- "µg/m³"
boxplot(openaq_data[, cols], main = "Pollutants", ylim = c(0, 1000))

There is a lot of missing data on some of the pollutants. I will fill them in with the median data to avoid outliers (with mean)

cols <- c("O3", "PM2.5", "NO2", "CO", "NO", "NOX", "PM10", "SO2")

for (col in cols) {
  openaq_data[[col]][is.na(openaq_data[[col]])] <- median(openaq_data[[col]], na.rm = TRUE)
}

I need to scale data.

openaq_data[paste0(cols, "_scaled")] <- scale(openaq_data[cols])

Clustering by countries

There are a lot of columns that give us granularity on location, however for this particular clustering part I will not need such a deep granularity, I will only leave the country code and the neccesary features.

openaq_country <- openaq_data[,c('Country.Code', "O3_scaled", "PM2.5_scaled", "NO2_scaled", "CO_scaled", "NO_scaled", "NOX_scaled", "PM10_scaled", "SO2_scaled") ]

Now, I need to group those countries together, since we have so many rows for the same country. I will take a median to group them, since there are cases when the country is generally clean, but in the capital the air quality is terrible, so we still want to assign this country to clean one. I will be looking at a single cities (or rahter capitals) in another analysis.

openaq_country <- aggregate(. ~ Country.Code, data = openaq_country, FUN = median)

Is data clustarable?

library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
set.seed(123)
hopkins_stat <- get_clust_tendency(openaq_country[,-1], n = nrow(openaq_country) -1, graph = TRUE)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
hopkins_stat
## $hopkins_stat
## [1] 0.9224853
## 
## $plot

Number >0.75 represent a strong clustering tendency, so in my case the number is 0.92 which clearly indicates that I can proceed clustering the data.

K-Means -> Countries

Optimal number of clusters

Now I need to find the number of cluster suitable for my clustering analysis of countries. I will use Gap Statistic, Elbow method and Silhouette method

library(factoextra)

fviz_nbclust(openaq_country[,-1] , FUN = kmeans, method = "gap_stat") 

From this plot I want to choose the number of cluters to be 4, because there is a big jusmp from 3 to 4 cluters in the Gap statistic, and also after 4 the curves still goes up but not as fast, so it will be a good trade off between the quality and the computation speed to stop at 4. Though math suggestion here is only 1 cluster - which does not make sense for me because it is just all data in 1 cluster.

Let’s also confirm this with elbow method:

fviz_nbclust(openaq_country[,-1] , FUN = kmeans, method = "wss")

From WSS method the number of clusters it suggests me to pick is 5

And let’s also check that with Silhouette method

fviz_nbclust(openaq_country[, -1] , FUN = kmeans, method = "silhouette")

So 3 methods that suggest different number of clusters. I want to pick 5, becuase the drop in the elbow methods is quite significant and this is also suppported by the silhouette , which even though has the 2 as the suggestion and biggest climb, it also has 5 with the largest average width. Also I would like to explore what 5 clusters could give me. Because picking just 2 clusters might only suggest “clean” and “pilluted” cities, and I want to explore more than that.

Now let’s use K-mean clustering to cluster our data.

rownames(openaq_country) <- openaq_country$Country.Code
kmean_country_k2 = eclust(openaq_country[,-1], FUNcluster = "kmeans", k = 2, graph = TRUE)

as.data.frame(kmean_country_k2$cluster)
##    kmean_country_k2$cluster
## AD                        2
## AE                        1
## AF                        2
## AG                        2
## AJ                        2
## AM                        2
## AQ                        2
## AR                        2
## AT                        2
## AU                        2
## AZ                        2
## BA                        2
## BD                        2
## BE                        2
## BF                        2
## BG                        2
## BH                        2
## BK                        2
## BM                        2
## BR                        2
## BZ                        2
## CA                        2
## CD                        2
## CE                        2
## CF                        2
## CH                        2
## CI                        2
## CL                        2
## CN                        1
## CO                        2
## CR                        2
## CS                        2
## CW                        2
## CY                        1
## CZ                        2
## DE                        2
## DK                        2
## DZ                        2
## EC                        2
## EE                        2
## EG                        2
## ES                        2
## ET                        2
## FI                        2
## FR                        2
## GA                        2
## GB                        2
## GH                        2
## GI                        2
## GN                        2
## GR                        2
## GT                        2
## HK                        1
## HR                        2
## HU                        2
## ID                        2
## IE                        2
## IL                        2
## IN                        2
## IQ                        2
## IS                        2
## IT                        2
## IZ                        2
## JO                        2
## JP                        2
## KE                        2
## KG                        2
## KR                        2
## KU                        2
## KV                        2
## KW                        2
## KZ                        2
## LA                        2
## LK                        2
## LT                        2
## LU                        2
## LV                        2
## MA                        2
## MD                        2
## ME                        2
## MF                        2
## MG                        2
## MK                        2
## ML                        2
## MM                        2
## MN                        2
## MT                        2
## MX                        2
## MY                        2
## MZ                        2
## NG                        2
## NI                        2
## NL                        2
## NO                        2
## NP                        2
## NZ                        2
## OM                        1
## PE                        2
## PH                        2
## PK                        2
## PL                        2
## PR                        2
## PS                        2
## PT                        2
## PY                        2
## QA                        2
## RO                        2
## RS                        2
## RU                        2
## RW                        2
## SA                        2
## SD                        2
## SE                        2
## SG                        2
## SI                        2
## SK                        2
## SM                        1
## SN                        2
## SU                        2
## TD                        2
## TH                        2
## TI                        2
## TJ                        2
## TM                        2
## TR                        2
## TT                        2
## TW                        2
## TX                        2
## UC                        2
## UG                        2
## US                        2
## UZ                        2
## VM                        2
## VN                        2
## XK                        2
## ZA                        2

The 2 different clusters are very clear and interprettable. I can see that in cluster 2 in red we only have the most industrial and with biggest urban traffic and hence polluted countries like China, UAE, Hong Kong as well as the countries which geographically have a high dust level like UAE, Oman etc

The countries in the 2nd blue cluster represent the less affected by urbanism and geographical location countries.

I also want to have a look at 9 clusters, as I clearly have the 2nd biggest drop at 9 in WSS, and also the biggest increase at 9 in Gap Statistic.

rownames(openaq_country) <- openaq_country$Country.Code
kmean_country_k9 = eclust(openaq_country[,-1], FUNcluster = "kmeans", k = 9, graph = TRUE)

as.data.frame(kmean_country_k9$cluster)
##    kmean_country_k9$cluster
## AD                        9
## AE                        7
## AF                        6
## AG                        9
## AJ                        9
## AM                        9
## AQ                        9
## AR                        9
## AT                        9
## AU                        9
## AZ                        9
## BA                        9
## BD                        6
## BE                        9
## BF                        6
## BG                        9
## BH                        9
## BK                        9
## BM                        9
## BR                        9
## BZ                        9
## CA                        9
## CD                        6
## CE                        9
## CF                        9
## CH                        8
## CI                        9
## CL                        9
## CN                        7
## CO                        9
## CR                        9
## CS                        9
## CW                        9
## CY                        5
## CZ                        9
## DE                        9
## DK                        9
## DZ                        9
## EC                        9
## EE                        1
## EG                        6
## ES                        9
## ET                        9
## FI                        9
## FR                        9
## GA                        9
## GB                        9
## GH                        9
## GI                        9
## GN                        9
## GR                        9
## GT                        9
## HK                        3
## HR                        9
## HU                        9
## ID                        9
## IE                        9
## IL                        9
## IN                        2
## IQ                        9
## IS                        9
## IT                        9
## IZ                        9
## JO                        9
## JP                        9
## KE                        9
## KG                        6
## KR                        9
## KU                        9
## KV                        9
## KW                        9
## KZ                        6
## LA                        9
## LK                        9
## LT                        9
## LU                        9
## LV                        6
## MA                        9
## MD                        9
## ME                        9
## MF                        4
## MG                        9
## MK                        9
## ML                        9
## MM                        9
## MN                        6
## MT                        9
## MX                        9
## MY                        9
## MZ                        9
## NG                        9
## NI                        9
## NL                        9
## NO                        9
## NP                        9
## NZ                        9
## OM                        3
## PE                        9
## PH                        9
## PK                        9
## PL                        9
## PR                        9
## PS                        9
## PT                        8
## PY                        6
## QA                        9
## RO                        9
## RS                        9
## RU                        9
## RW                        9
## SA                        9
## SD                        9
## SE                        9
## SG                        9
## SI                        9
## SK                        9
## SM                        5
## SN                        9
## SU                        9
## TD                        9
## TH                        9
## TI                        9
## TJ                        9
## TM                        9
## TR                        9
## TT                        9
## TW                        9
## TX                        9
## UC                        9
## UG                        9
## US                        9
## UZ                        9
## VM                        9
## VN                        9
## XK                        8
## ZA                        9

This one is difficult to interpret given many dimentions of Pollutants. Perhaps, the India cluster that stand separately at the bottom represent a pollution with a certain pollutant. Same goes with Estonia and Cyprus. The urban countries still fell into the same cluste just like with the k=2 clustering, which does classify them as highly polluted.

To better interpret this data I will check the mean of the pollutants in each cluster to identify what this cluster means.

openaq_country$kmean_cluster <- as.factor(kmean_country_k9$cluster)

cluster_profiles <- openaq_country %>%
  group_by(kmean_cluster) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
## ℹ In group 1: `kmean_cluster = 1`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
plot_data <- cluster_profiles %>%
  pivot_longer(cols = -kmean_cluster, names_to = "Pollutant", values_to = "Average_Value")

ggplot(plot_data, aes(x = Pollutant, y = Average_Value, fill = kmean_cluster)) +
  geom_bar(stat = "identity") +
  coord_flip() + 
  facet_wrap(~kmean_cluster) +
  theme_minimal() +
  theme(legend.position = "none") + 
  labs(title = "Pollutant Signatures by Cluster",
       x = "Pollutant Type",
       y = "Mean Concentration (Scaled/Standardized)")

Now much better for me to tell what those clusters mean. As I am not good at chemistry, I will use AI to ask for meanings of combinations like “high NO2” , “High O3 and High NO2”, “High O3 and Low NO2”

Final Air Quality Cluster Profiles

Cluster 1 - This clusters has very low (almost zero) concentration of pollutants, so it suggest that this cluster represents low pollution countries.

Cluster 2 - ozone (O₃) is strongly negative while other pollutants are close to zero, which matches the known effect where ozone is destroyed by fresh nitric oxide emissions near traffic. This is usually called ozone titration or scavenging. (From source: “Generally, close to the sources in city centres ozone concentrations are lower than those in suburbs and rural areas, mainly as a result of ozone scavenging by nitric oxide from traffic.” https://www.eea.europa.eu/en/analysis/publications/top08-98/page004.html)

Cluster 3 - This cluster has high ozone and moderately positive NO₂, which is a typical signal of photochemical smog formation under sunlight with traffic emissions present.( 1) https://en.wikipedia.org/wiki/Smog#Photochemical_smog 2) “The majority of ground-level ozone formation occurs when nitrogen oxides (NOx), carbon monoxide (CO), and volatile organic compounds (VOCs), react in the atmosphere in the presence of sunlight, specifically the UV spectrum. NOx, CO, and VOCs are considered ozone precursors. Motor vehicle exhaust, industrial emissions, and chemical solvents are the major anthropogenic sources of these ozone precursors.Although the ozone precursors often originate in urban areas, winds can carry NOx hundreds of kilometers, causing ozone formation to occur in less populated regions as well” https://en.wikipedia.org/wiki/Ground-level_ozone )

Cluster 4 - Very high NO and NO₂ dominate this cluster, while other pollutants stay low, which strongly points to direct emissions from heavy traffic or combustion engines. (“NOx gases are usually produced from the reaction between nitrogen and oxygen during combustion of fuels, such as hydrocarbons, in air; especially at high temperatures, such as in car engines.In areas of high motor vehicle traffic, such as in large cities, the nitrogen oxides emitted can be a significant source of air pollution. NOx gases are also produced naturally by lightning.” https://en.wikipedia.org/wiki/NOx)

Cluster 5 - This cluster shows high ozone but low NO₂, which suggests strong sunlight but limited local traffic, allowing ozone to accumulate instead of being destroyed by NO. (“The majority of ground-level ozone formation occurs when nitrogen oxides (NOx), carbon monoxide (CO), and volatile organic compounds (VOCs), react in the atmosphere in the presence of sunlight, specifically the UV spectrum.” https://en.wikipedia.org/wiki/Ground-level_ozone)

Cluster 6 - All pollutants are very close to zero, meaning this cluster represents an average or neutral air quality state without any dominant pollutant.

Cluster 7 - This cluster has extremely high ozone and PM₁₀, which fits environments affected by strong sunlight plus dust, industrial activity, or arid conditions. (https://www.sciencedirect.com/science/article/pii/S026974912101647X)

Cluster 8 - Low ozone combined with moderate PM₁₀ suggests winter conditions, where particulate matter increases due to heating, while ozone stays low because of weak sunlight. (“Due to the photochemical nature, the highest levels of ozone are seen during periods of sunny weather. It is worth mentioning that ozone can also be generated by household equipment, such as portable air cleaners.” https://www.who.int/teams/environment-climate-change-and-health/air-quality-and-health/health-impacts/types-of-pollutants)

Cluster 9 - This cluster is slightly negative/stable for all chemicals, suggesting rural or low-pollution conditions — typical for places with few emissions and clean air.

hierarchical clustering - countries

Now I will explore the same clustering task but with hierarchical clustering

For K=2 I will use Ward.D2 to minimize variations inside clusters to keep countries closer by the poluution.

str(openaq_country)
## 'data.frame':    136 obs. of  10 variables:
##  $ Country.Code : Factor w/ 137 levels "-99","AD","AE",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ O3_scaled    : num  -0.0991 2.1526 -0.0991 -0.0991 -0.0991 ...
##  $ PM2.5_scaled : num  -0.0543 -0.0543 0.0798 -0.0528 -0.0337 ...
##  $ NO2_scaled   : num  -0.12 -0.158 -0.12 -0.12 -0.12 ...
##  $ CO_scaled    : num  -0.0286 -0.0286 -0.0286 -0.0286 -0.0286 ...
##  $ NO_scaled    : num  -0.0855 -0.0855 -0.0855 -0.0855 -0.0855 ...
##  $ NOX_scaled   : num  -0.0267 -0.0267 -0.0267 -0.0267 -0.0267 ...
##  $ PM10_scaled  : num  -0.0452 0.2358 -0.0452 -0.0452 -0.0452 ...
##  $ SO2_scaled   : num  -0.0362 0.3214 -0.0362 -0.0362 -0.0362 ...
##  $ kmean_cluster: Factor w/ 9 levels "1","2","3","4",..: 9 7 6 9 9 9 9 9 9 9 ...

Optimal number of clusters

Gap Statistic

library(factoextra)

fviz_nbclust(openaq_country[,-c(1,10)] , FUN = hcut, method = "gap_stat") 

Just like with k-means, there is a very steep jump when it comes to k=2 and k=9.

Elbow method - HC

fviz_nbclust(openaq_country[,-1] , FUN = hcut, method = "wss")

fviz_nbclust(openaq_country[, -1] , FUN = kmeans, method = "silhouette")

All 3 methods are

# install.packages("NbClust")
library(NbClust)
duda_test <- NbClust(openaq_country[, -c(1,10)], 
                     distance = "euclidean", 
                     method = "ward.D2", 
                     index = "duda", 
                     min.nc = 2, max.nc = 2)

# See the result (Look for the "Best number of clusters")
duda_test$Best.nc
## Number_clusters     Value_Index 
##          2.0000          0.4822

All the methods above suggest k = 2, just like for the k-means. For the sake of the analysis I also want to pick 9 (to later compare it with k-means, and also because in gap and silhouette statistic there is a strong suggestion for this).

rownames(openaq_country) <- openaq_country$Country.Code #I add this line to assign the country code to later see which cluster got assigned
hc_clusters_k2 <- eclust(openaq_country[,-c(1,10)], FUNcluster = "hclust", k = 2, hc_method = "ward.D2", graph = TRUE)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
fviz_cluster(hc_clusters_k2, 
             geom = "text", 
             repel = TRUE, 
             show.clust.cent = FALSE,
             main = "Hierarchical Clustering with 2 clusters")

The results are very similar to the K-mean clustering, the split appears to be between the very industrial countries and less affected countries.

Now let’s run this method for 9 clusters

rownames(openaq_country) <- openaq_country$Country.Code #I add this line to assign the country code to later see which cluster got assigned
hc_clusters_k9 <- eclust(openaq_country[,-c(1,10)], FUNcluster = "hclust", k = 9, hc_method = "ward.D2", graph = TRUE)

fviz_cluster(hc_clusters_k9, 
             geom = "text", 
             # repel = TRUE, 
             show.clust.cent = FALSE,
             main = "Hierarchical Clustering with 9 clusters")

I will not do the deep-dive into the countries within each cluster, however from the graph above we can clearly see that the results are very similar to the results of k-mean clustering

DBSCAN clustering - countries

k = 2

For the dbscan I first need to find the optimal value for epsilon which will determine the radius between the individual points to be able to assign them to the same cluster.

library(dbscan)
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
kNNdistplot(openaq_country[,-c(1,10)], k =  2)

From this graph the line curves up at the point 0.15 , but I will increase it to 0.20 to be my eps parameter to make the algorithm less strict when it comes to clustering.

It is also important to choose the minimum points to form a cluster. in the context of this project it makes sense to choose 1 point only , as there could be only 1 country in the world that is exceptionally clean or dirty. However if I pick 1, every point will become a single cluster which is not what I am trying to achieve.

Following the DBSCAN Wikipedia article (https://en.wikipedia.org/wiki/DBSCAN) , the min I will choose is 3. However it is also mentioned that it depends on the number of dimensions, and as a rule of thumb we can use minPts = 2·dim So in my case: 2 x 8 dims = 16

dbscan_country <- dbscan(openaq_country[,-c(1,10)], eps = 0.20, minPts = 16)
dbscan_country$cluster
##   [1] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1
##  [38] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
rownames(openaq_country) <- openaq_country$Country.Code

fviz_cluster(dbscan_country, data = openaq_country[,-c(1,10)], 
             geom = "text", repel = TRUE, main = "DBSCAN Clustering")

Interesting results, my interpretation of this is that most countries in the world are very similar to each other when it comes to air quality so they were assigned to the “average air quality cluster”. However there are some countries with either much better air quality (like EE, XK, CH, PT) or much worse air quality (like IN, CN, AE, OM) which is why they are not part of the “average air quality cluster”.

With k=9 we have eps 0.4

library(dbscan)
kNNdistplot(openaq_country[,-c(1,10)], k =  9)

dbscan_country <- dbscan(openaq_country[,-c(1,10)], eps = 0.40, minPts = 16)
dbscan_country$cluster
##   [1] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1
##  [38] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
rownames(openaq_country) <- openaq_country$Country.Code

fviz_cluster(dbscan_country, data = openaq_country[,-c(1,10)], 
             geom = "text", repel = TRUE, main = "DBSCAN Clustering")

So with eps 0.40 the results are very similar to the eps = 0.20

Comparing different algorithms

Now that I have the results of the 3 main methods we covered during this course: K-means, DBSCAN, Hierarchical clustering - I can now compare the results of those , how much they agree with each other. So if all of them agree it means that I have a very good clustering results that can be reliable. It also provides me the insight on how much the certain result happened because of the use of a certain algorithm.

To do this I will use 2 approaches we went through during the course: ARI for comparing the clusters and Jaccard Similarity for checking robustness of each.

Adjusted Random Index (ARI)

k=2

K-means VS Hierarchical clustering

library(mclust)
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
library(fpc)
## 
## Attaching package: 'fpc'
## The following object is masked from 'package:dbscan':
## 
##     dbscan
print(adjustedRandIndex(kmean_country_k2$cluster, hc_clusters_k2$cluster))
## [1] 1

This score is 1 which indicates that the clusters are strong and both algorithms agree with each other. And the results are the same no matter the algorithm used. And this is exactly what I observed above from the graph of the clustering assignments.

K-means VS DBSCAN

# install.packages("fossil")
library(fossil)
## Loading required package: sp
## Loading required package: maps
## 
## Attaching package: 'maps'
## The following object is masked from 'package:mclust':
## 
##     map
## Loading required package: shapefiles
## Loading required package: foreign
## 
## Attaching package: 'shapefiles'
## The following objects are masked from 'package:foreign':
## 
##     read.dbf, write.dbf
# print(adjustedRandIndex(kmean_country_k2$cluster, dbscan_country$cluster))

print(
  data.frame(
  Pair = c("KM vs HC", "KM vs DBSCAN", "HC vs DBSCAN"),
  ARI = c(adjustedRandIndex(kmean_country_k2$cluster, hc_clusters_k2$cluster), adjustedRandIndex(kmean_country_k2$cluster, dbscan_country$cluster), adjustedRandIndex(hc_clusters_k2$cluster, dbscan_country$cluster)),
  Jaccard = c(jaccard(kmean_country_k2$cluster, hc_clusters_k2$cluster), jaccard(kmean_country_k2$cluster, dbscan_country$cluster), jaccard(hc_clusters_k2$cluster, dbscan_country$cluster))
)
)
##           Pair       ARI   Jaccard
## 1     KM vs HC 1.0000000 1.0000000
## 2 KM vs DBSCAN 0.7691994 0.9338235
## 3 HC vs DBSCAN 0.7691994 0.9338235

The results of the clusters are very close to each other both in ARI and Jaccard, which again proves the correct split of the countries for k=2 Though I would say that this version of Jaccard usage from fossil library is not the most reliable, I would create my own custom function, however here I can clearly see from another supporting metric ARI that the clusters are indeed similar.

K = 9

print(
  data.frame(
  Pair = c("KM vs HC", "KM vs DBSCAN", "HC vs DBSCAN"),
  ARI = c(adjustedRandIndex(kmean_country_k9$cluster, hc_clusters_k9$cluster), adjustedRandIndex(kmean_country_k9$cluster, dbscan_country$cluster), adjustedRandIndex(hc_clusters_k9$cluster, dbscan_country$cluster)),
  Jaccard = c(jaccard(kmean_country_k9$cluster, hc_clusters_k9$cluster), jaccard(kmean_country_k9$cluster, dbscan_country$cluster), jaccard(hc_clusters_k9$cluster, dbscan_country$cluster))
)
)
##           Pair       ARI   Jaccard
## 1     KM vs HC 0.6573443 1.0000000
## 2 KM vs DBSCAN 0.5113288 0.9338235
## 3 HC vs DBSCAN 0.8239581 0.9338235

The results are low for DBSCAN is as expected, since DBSCAN only had 1 cluster created in the end + the data points which were not in any of the clusters. Howeer I am surprised when it comes to the ARI index for KM vs HC, this index is a bit lower than I expected, but 0.66 is still a good result that shows that the clusters are somewhat similar.

Jaccard index from this part is completely not reliable so I would not use it in my next project, at least not from the fossil library.

Clustering cities in Poland

head(openaq_data)
## # A tibble: 6 × 24
##   Country.Code City  Location   Coordinates       Source.Name Unit  Last.Updated
##   <fct>        <chr> <chr>      <chr>             <chr>       <chr> <chr>       
## 1 CN           ""    市八十六中 23.1047, 113.433… ChinaAQIDa… µg/m³ 2021-08-09T…
## 2 CN           ""    市农科院   21.9508, 108.6553 ChinaAQIDa… µg/m³ 2020-12-31T…
## 3 CN           ""    市发改委   29.8454, 114.3107 ChinaAQIDa… µg/m³ 2021-08-09T…
## 4 CN           ""    市委       30.4576000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 5 CN           ""    市委党校   27.7314000000000… ChinaAQIDa… µg/m³ 2021-08-09T…
## 6 CN           ""    市审计局   25.8070999999999… ChinaAQIDa… µg/m³ 2021-08-09T…
## # ℹ 17 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>, O3_scaled <dbl>,
## #   PM2.5_scaled <dbl>, NO2_scaled <dbl>, CO_scaled <dbl>, NO_scaled <dbl>,
## #   NOX_scaled <dbl>, PM10_scaled <dbl>, SO2_scaled <dbl>
poland <- openaq_data[openaq_data$Country.Code == 'PL',]
head(poland)
## # A tibble: 6 × 24
##   Country.Code City   Location        Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>  <chr>           <chr>       <chr>       <chr> <chr>       
## 1 PL           ""     Konarskiego     52.1720333… Poland      µg/m³ 2017-01-08T…
## 2 PL           ""     Reja            52.5509402… Poland      µg/m³ 2017-01-08T…
## 3 PL           ""     Tochtermana     51.3990955… Poland      µg/m³ 2017-01-08T…
## 4 PL           "Łódź" Łódź, ul. Gdań… 51.775378,… GIOS        µg/m³ 2025-01-31T…
## 5 PL           "Piła" Pila, ul. Kuso… 53.154408,… GIOS        µg/m³ 2025-01-02T…
## 6 PL           "Ełk"  Ełk, ul. Piłsu… 53.8283890… GIOS        µg/m³ 2025-01-31T…
## # ℹ 17 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>, O3_scaled <dbl>,
## #   PM2.5_scaled <dbl>, NO2_scaled <dbl>, CO_scaled <dbl>, NO_scaled <dbl>,
## #   NOX_scaled <dbl>, PM10_scaled <dbl>, SO2_scaled <dbl>

Seems like a lot of City entries are missing

sum(is.na(poland$City))
## [1] 0

No NA entries in City column as the missing entries are just empty strings

nrow(poland[poland$City == "",])
## [1] 95

We have 95 entry out of 524 missing.

summary(poland)
##   Country.Code     City             Location         Coordinates       
##  PL     :524   Length:524         Length:524         Length:524        
##  -99    :  0   Class :character   Class :character   Class :character  
##  AD     :  0   Mode  :character   Mode  :character   Mode  :character  
##  AE     :  0                                                           
##  AF     :  0                                                           
##  AG     :  0                                                           
##  (Other):  0                                                           
##  Source.Name            Unit           Last.Updated       Country.Label     
##  Length:524         Length:524         Length:524         Length:524        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##        O3              SO2              PM2.5             NO2         
##  Min.   : 2.337   Min.   :  0.400   Min.   :  0.00   Min.   : 0.8519  
##  1st Qu.:56.000   1st Qu.:  5.000   1st Qu.:  9.30   1st Qu.:16.4000  
##  Median :56.000   Median :  5.000   Median :  9.30   Median :16.4000  
##  Mean   :50.584   Mean   :  7.317   Mean   : 16.18   Mean   :17.6918  
##  3rd Qu.:56.000   3rd Qu.:  5.000   3rd Qu.: 10.48   3rd Qu.:16.4000  
##  Max.   :83.800   Max.   :861.973   Max.   :233.00   Max.   :64.7987  
##                                                                       
##        CO              PM10              NO             NOX    
##  Min.   : 106.0   Min.   :  0.00   Min.   :3.000   Min.   :20  
##  1st Qu.: 570.0   1st Qu.: 20.90   1st Qu.:3.000   1st Qu.:20  
##  Median : 570.0   Median : 22.00   Median :3.000   Median :20  
##  Mean   : 572.2   Mean   : 29.16   Mean   :3.004   Mean   :20  
##  3rd Qu.: 570.0   3rd Qu.: 28.21   3rd Qu.:3.000   3rd Qu.:20  
##  Max.   :3566.0   Max.   :262.90   Max.   :5.000   Max.   :20  
##                                                                
##    O3_scaled         PM2.5_scaled        NO2_scaled         CO_scaled       
##  Min.   :-1.73727   Min.   :-0.07396   Min.   :-1.12257   Min.   :-0.03951  
##  1st Qu.:-0.09907   1st Qu.:-0.05428   1st Qu.:-0.11983   1st Qu.:-0.02865  
##  Median :-0.09907   Median :-0.05428   Median :-0.11983   Median :-0.02865  
##  Mean   :-0.26440   Mean   :-0.03971   Mean   :-0.03652   Mean   :-0.02859  
##  3rd Qu.:-0.09907   3rd Qu.:-0.05177   3rd Qu.:-0.11983   3rd Qu.:-0.02865  
##  Max.   : 0.74960   Max.   : 0.41911   Max.   : 3.00152   Max.   : 0.04153  
##                                                                             
##    NO_scaled          NOX_scaled        PM10_scaled         SO2_scaled      
##  Min.   :-0.08551   Min.   :-0.02673   Min.   :-0.12701   Min.   :-0.24454  
##  1st Qu.:-0.08551   1st Qu.:-0.02673   1st Qu.:-0.04924   1st Qu.:-0.03621  
##  Median :-0.08551   Median :-0.02673   Median :-0.04515   Median :-0.03621  
##  Mean   :-0.08492   Mean   :-0.02673   Mean   :-0.01850   Mean   : 0.06872  
##  3rd Qu.:-0.08551   3rd Qu.:-0.02673   3rd Qu.:-0.02205   3rd Qu.:-0.03621  
##  Max.   : 0.20993   Max.   :-0.02673   Max.   : 0.85123   Max.   :38.77644  
## 
#install.packages("tidygeocoder")
head(poland)
## # A tibble: 6 × 24
##   Country.Code City   Location        Coordinates Source.Name Unit  Last.Updated
##   <fct>        <chr>  <chr>           <chr>       <chr>       <chr> <chr>       
## 1 PL           ""     Konarskiego     52.1720333… Poland      µg/m³ 2017-01-08T…
## 2 PL           ""     Reja            52.5509402… Poland      µg/m³ 2017-01-08T…
## 3 PL           ""     Tochtermana     51.3990955… Poland      µg/m³ 2017-01-08T…
## 4 PL           "Łódź" Łódź, ul. Gdań… 51.775378,… GIOS        µg/m³ 2025-01-31T…
## 5 PL           "Piła" Pila, ul. Kuso… 53.154408,… GIOS        µg/m³ 2025-01-02T…
## 6 PL           "Ełk"  Ełk, ul. Piłsu… 53.8283890… GIOS        µg/m³ 2025-01-31T…
## # ℹ 17 more variables: Country.Label <chr>, O3 <dbl>, SO2 <dbl>, PM2.5 <dbl>,
## #   NO2 <dbl>, CO <dbl>, PM10 <dbl>, NO <dbl>, NOX <dbl>, O3_scaled <dbl>,
## #   PM2.5_scaled <dbl>, NO2_scaled <dbl>, CO_scaled <dbl>, NO_scaled <dbl>,
## #   NOX_scaled <dbl>, PM10_scaled <dbl>, SO2_scaled <dbl>
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ stringr   1.5.2
## ✔ purrr     1.1.0     ✔ tibble    3.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::map()    masks maps::map(), mclust::map()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidygeocoder)

poland_prep <- poland %>%
  mutate(Coordinates = as.character(Coordinates)) %>%
  separate(Coordinates, into = c("lat_raw", "lon_raw"), sep = ",", remove = FALSE) %>%
  mutate(
    lat = as.numeric(str_trim(lat_raw)),
    lon = as.numeric(str_trim(lon_raw))
  )

# filter for unique unknown locations
to_geocode <- poland_prep %>%
  filter(is.na(City) | City == "" | City == "NA") %>%
  filter(!is.na(lat)) %>%
  distinct(lat, lon)

if (nrow(to_geocode) > 0) {
  
  results <- to_geocode %>%
    reverse_geocode(lat = lat, long = lon, method = 'osm', full_results = TRUE)

  city_lookup <- results %>%
    mutate(
      extracted_city = coalesce(
        if("city" %in% names(.)) city else NULL,
        if("town" %in% names(.)) town else NULL,
        if("village" %in% names(.)) village else NULL,
        if("suburb" %in% names(.)) suburb else NULL
      )
    ) %>%
    select(lat, lon, extracted_city)

  poland <- poland_prep %>%
    left_join(city_lookup, by = c("lat", "lon")) %>%
    mutate(
      City = ifelse(is.na(City) | City == "" | City == "NA", extracted_city, City)
    ) %>%
    select(-any_of(c("lat_raw", "lon_raw", "lat", "lon", "extracted_city")))
    print(head(poland[, c("City", "Location")]))
}
## Passing 71 coordinates to the Nominatim single coordinate geocoder
## Query completed in: 71.7 seconds
## # A tibble: 6 × 2
##   City    Location               
##   <chr>   <chr>                  
## 1 Siedlce Konarskiego            
## 2 Płock   Reja                   
## 3 Radom   Tochtermana            
## 4 Łódź    Łódź, ul. Gdańska      
## 5 Piła    Pila, ul. Kusocińskiego
## 6 Ełk     Ełk, ul. Piłsudskiego
poland.backup <- poland
poland <- poland.backup
poland <- poland %>% group_by(City) %>% summarise(
    O3_scaled    = median(O3_scaled, na.rm = TRUE),
    PM2.5_scaled = median(PM2.5_scaled, na.rm = TRUE),
    NO2_scaled   = median(NO2_scaled, na.rm = TRUE),
    CO_scaled    = median(CO_scaled, na.rm = TRUE),
    NO_scaled    = median(NO_scaled, na.rm = TRUE),
    NOX_scaled   = median(NOX_scaled, na.rm = TRUE),
    PM10_scaled  = median(PM10_scaled, na.rm = TRUE),
    SO2_scaled   = median(SO2_scaled, na.rm = TRUE),
)

##Outliers

Let’s check for outliers in Poland specific data as outliers in such data are common and happen due to misperformance of the sensors. I alreay removed outliers due to sensors misperformance at the EDA stage, however now I will be looking at the outliers on the city level.

boxplot(poland[-1], main = "Pollutants", ylim = c(-3, 4))

We have quite a lot of outliers, but I do not want to remove them as they represent real data representing real cities, because sometimes it happens that the air quality level is very bad.

I will need to use the clustering methods that are robust to outliers, like DBSCAN, PAM and Hierarchical clustering.

I will limit dataset to pollutants only to cluster on this data.

poland_train <- poland %>% select(-1)

Is data clustarable?

set.seed(123)
hopkins_stat <- get_clust_tendency(poland_train, n = nrow(poland_train) -1, graph = TRUE)
hopkins_stat
## $hopkins_stat
## [1] 0.9226019
## 
## $plot

Result is 0.97 –> strongly clusterable data, I can proceed with the analysis

PAM - Poland

Optimal number of clusters

Now I need to find the number of cluster suitable for my clustering analysis of Poland cities. I will use Gap Statistic, Elbow method and Silhouette method

library(factoextra)
library(cluster)
## 
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
## 
##     votes.repub
fviz_nbclust(poland_train, FUN = pam, method = "gap_stat") 

Not so good results with the gap statistic peaking at 1, which is totally useless for me as having just 1 cluster would put all my data into that 1 cluster. Let me try to use elbow method instead:

fviz_nbclust(poland_train, FUN = pam, method = "wss")

This result is more descriptive, it tells me that the number of clusters that is optimal for my data is 4, as there is the biggest drop at that point and this is a very reasonable number.

Finally, let me try with Silhouette method.

fviz_nbclust(poland_train, FUN = pam, method = "silhouette")

This tells me the number of clusters I should use is 2. Since the Elbow plot shows that k=2 and k=3 still leave a huge amount of unexplained variation (high WSS) - I will use k = 4

poland_df <- as.data.frame(poland_train)
rownames(poland_df) <- poland$City

pam_poland = eclust(poland_df, FUNcluster = "pam", k = 4, graph = TRUE)

as.data.frame(pam_poland$cluster)
##                                             pam_poland$cluster
## Augustów                                                     1
## Belsk Duży                                                   1
## Biała                                                        1
## Biała Podlaska                                               2
## Białystok                                                    1
## Bielsko-Biała                                                1
## Biskupiec                                                    1
## Boguchwała                                                   1
## Borsukowizna                                                 3
## Borówiec                                                     1
## Brzeg                                                        3
## Bydgoszcz                                                    1
## Bytów                                                        1
## Chełm                                                        1
## Chełmno                                                      1
## Chojnice                                                     1
## Chojnów                                                      1
## Ciechanów                                                    1
## Ciechocinek                                                  4
## Cieszyn                                                      1
## Czerniawa                                                    3
## Częstochowa                                                  1
## Diabla Góra                                                  2
## Dolní Lutyně                                                 1
## Duszniki-Zdrój                                               1
## Działdowo                                                    1
## Działoszyn                                                   1
## Dzierżoniów                                                  3
## Dąbki                                                        3
## Dąbrowa Górnicza                                             2
## Dębica                                                       1
## Elbląg                                                       2
## Ełk                                                          2
## Florianka                                                    2
## Gajew                                                        2
## Gdańsk                                                       1
## Gdynia                                                       1
## Giżycko                                                      1
## Gliwice                                                      1
## Goczałkowice-Zdrój                                           2
## Gorzów Wielkopolski                                          1
## Gołdap                                                       1
## Gołuchów                                                     2
## Grajewo                                                      3
## Granica                                                      2
## Grudziądz                                                    1
## Gubin                                                        3
## Guty Duże                                                    2
## Głubczyce                                                    1
## Inowrocław                                                   3
## Janów Lubelski                                               1
## Jarczew                                                      2
## Jarosław                                                     1
## Jastrzębie-Zdrój                                             1
## Jasło                                                        1
## Jawor                                                        3
## Jedlicze                                                     1
## Jedlina-Zdrój                                                1
## Jelenia Góra                                                 1
## Jędrzejów                                                    1
## Kalisz                                                       3
## Kamienna Góra                                                1
## Kamień Pomorski                                              3
## Kaszów                                                       2
## Katowice                                                     1
## Kazimierza Wielka                                            1
## Kielce                                                       1
## Kluczbork                                                    1
## Kolbuszowa                                                   1
## Koniczynka                                                   2
## Konin                                                        2
## Konstancin-Jeziorna                                          1
## Kostrza                                                      1
## Koszalin                                                     1
## Koziegłowy                                                   1
## Kołobrzeg                                                    1
## Końskie                                                      2
## Kościan                                                      1
## Kościerzyna                                                  1
## Kraków                                                       1
## Krapkowice                                                   1
## Krasnobród                                                   3
## Krempna                                                      3
## Krosno                                                       1
## Krotoszyn                                                    1
## Krynica-Zdrój                                                1
## Krzyżówka                                                    3
## Kudowa-Zdrój                                                 1
## Kutno                                                        1
## Kędzierzyn-Koźle                                             4
## Kętrzyn                                                      2
## Kłodzko                                                      1
## Latoszyn                                                     1
## Legionowo                                                    4
## Legnica                                                      1
## Liniewko Kościerskie                                         1
## Lubań                                                        1
## Lublin                                                       4
## Lubliniec                                                    1
## Lubsko                                                       1
## Lwówek Śląski                                                1
## Lądek-Zdrój                                                  1
## Lębork                                                       1
## Malbork                                                      1
## Marszewo                                                     1
## Małogoszcz                                                   1
## Mielec                                                       2
## Milicz                                                       1
## Międzyrzecz                                                  3
## Mińsk Mazowiecki                                             1
## Mogilno                                                      3
## Mosina                                                       1
## Mszana Dolna                                                 1
## Muszyna                                                      1
## Myślenice                                                    1
## Nakło nad Notecią                                            3
## National automated monitoring network (SIS)                  1
## Nałęczów                                                     3
## Niepołomice                                                  1
## Nisko                                                        2
## Nowa Ruda                                                    1
## Nowa Sól                                                     3
## Nowiny                                                       4
## Nowy Sącz                                                    1
## Nowy Targ                                                    1
## Nysa                                                         1
## Olesno                                                       2
## Olkusz                                                       1
## Olsztyn                                                      4
## Opatów                                                       1
## Opole                                                        1
## Osieczów                                                     3
## Ostrowiec Świętokrzyski                                      2
## Ostrołęka                                                    1
## Ostróda                                                      1
## Otwock                                                       4
## Oława                                                        1
## Oświęcim                                                     1
## Pabianice                                                    2
## Parzniewice                                                  2
## Piastów                                                      4
## Piotrków Trybunalski                                         2
## Piła                                                         1
## Pińczów                                                      1
## Pleszew                                                      1
## Polanica-Zdrój                                               1
## Polańczyk                                                    1
## Poznań                                                       1
## Połaniec                                                     1
## Połczyn-Zdrój                                                3
## Prudnik                                                      1
## Przemyśl                                                     2
## Puławy                                                       3
## Pułtusk                                                      1
## Płock                                                        2
## Rabka-Zdrój                                                  1
## Racibórz                                                     1
## Radom                                                        4
## Radomsko                                                     4
## Radzyń Podlaski                                              1
## Rudnik nad Sanem                                             1
## Rybnik                                                       2
## Rymanów-Zdrój                                                1
## Rzeszów                                                      1
## Sandomierz                                                   2
## Sanok                                                        1
## Siedlce                                                      1
## Sierpc                                                       1
## Sinołęka                                                     4
## Skarżysko-Kamienna                                           1
## Skawina                                                      1
## Smolary Bytnickie                                            3
## Solec Kujawski                                               1
## Sopot                                                        1
## Sosnowiec                                                    1
## Starachowice                                                 1
## Strzegom                                                     1
## Strzelce Opolskie                                            1
## Sucha Beskidzka                                              1
## Sulechów                                                     3
## Sulęcin                                                      1
## Suwałki                                                      1
## Szamotuły                                                    3
## Szarów                                                       4
## Szczawnica                                                   1
## Szczecin                                                     1
## Szczecinek                                                   3
## Szczytno                                                     3
## Szymbark                                                     3
## Słupsk                                                       1
## Tarnobrzeg                                                   1
## Tarnów                                                       1
## Tomaszów Lubelski                                            1
## Toruń                                                        1
## Trzebinia                                                    1
## Trzebnica                                                    1
## Tychy                                                        1
## Ustka                                                        1
## Ustroń                                                       1
## Uście Gorlickie Wysowa                                       1
## Wadowice                                                     1
## Warszawa                                                     2
## Wałbrzych                                                    1
## Wieniec-Zdrój                                                3
## Wilczopole                                                   2
## Wodzisław Śląski                                             2
## Wrocław                                                      1
## Wschowa                                                      1
## Wysokie Mazowieckie                                          1
## Włocławek                                                    1
## Włoszczowa                                                   2
## Zabierzów                                                    1
## Zabrze                                                       2
## Zakopane                                                     1
## Zamość                                                       1
## Zawiercie                                                    1
## Zdzieszowice                                                 3
## Zgierz                                                       4
## Zgorzelec                                                    3
## Zielona Góra                                                 1
## Zielonka                                                     3
## Złockie                                                      1
## Złoty Potok                                                  1
## Český Těšín                                                  4
## Łagów                                                        1
## Łask                                                         3
## Łeba                                                         1
## Łomża                                                        1
## Łuków                                                        1
## Łódź                                                         1
## Środa Śląska                                                 1
## Świebodzin                                                   1
## Świecie                                                      3
## Świnoujście                                                  1
## Żagań                                                        1
## Żary                                                         3
## Żory                                                         1
## Żyrardów                                                     1
## Żywiec                                                       3

hierarchical clustering - Poland

Now I will explore the same clustering task but with hierarchical clustering

I will use Ward.D2 to minimize variations inside clusters to keep countries closer by the poluution.

Optimal number of clusters

fviz_nbclust(poland_train, FUN = hcut, method = "gap_stat") 

fviz_nbclust(poland_train, FUN = hcut, method = "wss") 

fviz_nbclust(poland_train, FUN = hcut, method = "silhouette") 

d <- dist(poland_df, method = "euclidean")

hc_poland <- hclust(d, method = "ward.D2" )

#plot(hc1, cex = 0.6, hang = -1) #This plotting did not work for me as I have many cities, and it the dendogram was very cluttered and not interpretable. 

fviz_dend(hc_poland, k = 4,               
          cex = 0.3,              
          type = "circular",     
          rect = TRUE,      
          palette = "jco")

hc_poland <- cutree(hc_poland, k = 4)

Looking at the clusters the cities assigned to them are a mix of big and smaller cities, which is unexpected. My interpretation: Blue: These are the cleanest spots, like beach cities (Świnoujście) and health resorts (Ciechocinek, Lądek-Zdrój), where the air stays fresher because of better wind or cleaner heating rules. Yellow: These are middle-sized towns like Kalisz or Iława that get dirty air in the winter when many people use old coal heaters to stay warm. Grey: These are the busiest urban spots like Warsaw where a lot of car exhaust and city smoke cause the worst pollution spikes. Red: This is the most common group for Poland, containing cities like Poznań and Wrocław, where hundreds of towns share the same general smog problems during the cold months.

DBSCAN clustering - Poland

Optimal eps

library(dbscan)
kNNdistplot(poland_train, k =  4)
abline(h = 0.30, lty = 2)

So optimal eps for me is 0.30

library("fpc")
library("dbscan")

set.seed(123)
db <- fpc::dbscan(poland_df, eps = 0.30, MinPts = 3)

fviz_cluster(db, data = poland_df, stand = FALSE, 
             ellipse = TRUE, show.clust.cent = FALSE,
             geom = "text", labelsize = 8, repel = TRUE)

It seems like the data is too connected, so that for any minPts higher than 2 I only get 1 cluster. For this particular analysis DBSCAN is not the best method, so I will stick to the above methods like PAM and hclust.

Comparing different algorithms with ARI (adjusted random index)

In my case I will only compare PAM to hclust.

print(adjustedRandIndex(pam_poland$cluster, hc_poland))
## [1] 0.5918604

The ARI of 0.59 mean that my clusters mostly agree, in my opinion they agree on the bigger cluster which includes most of the cities in poland that share very similar air quality.

Dimention Reduction

Since PCA assumes linearity I want to check if it would not fail for my case:

pairs(poland_df, main = "Scatterplot Matrix of Pollutants")

I can see the linearity however some of the interactions are non-linear and I need to be careful about those.

I need to check the distribution of the data to make sure the data is not skewed.

library(tidyverse)

poland_long <- poland_df %>%
  pivot_longer(cols = everything(), 
               names_to = "pollutant", 
               values_to = "value")

ggplot(poland_long, aes(x = value, fill = pollutant)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.8) +
  facet_wrap(~ pollutant, scales = "free") + # 'free' lets each plot have its own axis
  theme_minimal() +
  labs(title = "Distribution of Pollutants in Polish Cities",
       x = "Scaled Value",
       y = "Frequency") +
  theme(legend.position = "none") 

It seems like most of the data is skewed to the right. Also interesting that NOX has 0 variance which is probably a mistake.

summary(poland_df$NOX_scaled)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.02673 -0.02673 -0.02673 -0.02673 -0.02673 -0.02673
summary(openaq_data$NOX_scaled)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -5.69495 -0.02672 -0.02672  0.00000 -0.02672 52.12095
summary(openaq_data$NOX)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   20.00   20.09   20.00  204.00
unique(poland_df$NOX_scaled)
## [1] -0.02672527

Seems like this feature only container outliers, and the rest of the data has no variance. I will have to remove this feature.

NO pollutant also seems off with just 1 bar. Let me check it.

summary(poland_df$NO_scaled)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.08551 -0.08551 -0.08551 -0.08486 -0.08551  0.06221
summary(openaq_data$NO_scaled)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.52868 -0.08551 -0.08551  0.00000 -0.08551 55.26648
unique(poland_df$NO_scaled)
## [1] -0.08551228 -0.07812611  0.06221120
unique(openaq_data$NO_scaled)
##   [1] -0.0855122800 -0.4121682843 21.6434048014  0.3106821017 -0.1022050336
##   [6] -0.2332357631  0.5842659925 -0.4828884496  1.0526971575 -0.2940771569
##  [11] -0.3444124492 -0.3809592463 -0.4734025043 -0.3665314630  0.7136717637
##  [16]  0.4832231300  2.8200167403  1.9826164838  1.2543397119 -0.4356169350
##  [21]  1.9233248438 -0.3459953137  0.3473175255  5.5279800788  1.6871695175
##  [26] -0.5286827294  0.0932331346  0.8008286187  0.3576581694  8.0392792919
##  [31] -0.3218698530  0.0622112031  1.8348930006  7.7290599774 -0.4991380328
##  [36] -0.3366422013  0.2542517312 -0.2036910665 -0.0264228868  1.9087547422
##  [41]  4.6564115283  0.0326665065  9.5756035164  0.1951623379 -0.4843656845
##  [46] -0.2627804598  4.3757369104 15.3811364033  8.3347262582 -0.4791185463
##  [51] -0.0796033407 -0.1532183840  3.1644043488  7.8915558088  6.6462468460
##  [56]  0.4201189065  1.5163073238 -0.0008667316  1.8110806015 -0.4715703993
##  [61] -0.3720844005 -0.1150569766  1.1202056178 -0.3361588841 -0.3132280440
##  [66]  0.2727322683 -0.3997193944 -0.5095224633  0.5866294545 -0.0966654059
##  [71] -0.4052612365 -0.1878510659  1.7935304638  0.3998933473 -0.2910503027
##  [76] -0.2388801297  0.0178941582 -0.4548209878  0.3990207446  1.4877428153
##  [81] -0.3720958373 -0.0042643643  7.4443268857  1.7935304254  0.0536432411
##  [86] -0.3666500111 -0.4706356730 -0.2676553347  2.2559049275 -0.4721046354
##  [91]  0.9485521019  0.2099346862  1.2439990681 -0.5006152676  0.2690240795
##  [96]  0.5053816525  1.5867175490  2.7507785960  1.0962755850  5.2916225058
## [101]  1.3178608097  8.7926690559 -0.1298293250  8.9699372356  1.7758036074
## [106]  0.2247070346  0.3281134727  1.4508119445  1.1996820232  0.1656176413
## [111] -0.0559675834  1.6428524726  0.6826498322 -0.0943756890 -0.4622071620
## [116] 18.3245268046  0.6738322175 -0.4540682303  0.6159482479  3.3121278319
## [121]  9.4618564344  4.5810725519  0.5940157424  1.5394460344  1.1760403569
## [126]  1.3664263820  0.6019928105  0.0044290224  1.0276544418  3.8749543026
## [131] -0.2802236487 -0.1121025070 -0.4370941699 -0.2805072777 -0.3088598459
## [136]  0.4034524491 -0.5153876159 -0.5139103811 -0.0411952351 -0.2923251564
## [141] 10.5948955500  3.2825831353 -0.4105039429  5.5132077305 -0.4400486395
## [146]  2.6473721578  0.2837964278  3.2973554836 12.8550648418 15.8538515493
## [151]  0.4462922592  2.1303399669 11.6732769768  0.0769835514 -0.3070975047
## [156] 12.4857561340  7.3006618763 -0.4252762912  3.9030217644 -0.1239203856
## [161]  2.4257869331 -0.3514145496 -0.4927859230 -0.4918360654  0.3502719952
## [166] -0.0680993951 -0.2242837142 -0.2689848623  0.7999743205 -0.4181427331
## [171]  0.8991255882  0.3556372442 -0.3628727260 -0.1970627227 -0.3536610530
## [176]  0.1828502060 -0.4882435130 -0.4430031121 -0.2732691846 12.9066988613
## [181] -0.2686893991  0.8423334429 -0.3292338686 -0.0939682558 46.7949552511
## [186]  0.2655479480 11.6440277272  0.2483427919  2.2445302193  0.4618032250
## [191]  2.0287359029 13.0041461830 -0.2709517842  3.2175848027  0.7466141004
## [196] -0.3650051101  2.2655733025  3.4834870723  4.4939156969 -0.1545427251
## [201] -0.3473478699  0.4268313153 -0.3661868980 42.5770296468  0.2394793829
## [206]  9.3687906401  4.4643710003  2.2485187534  1.3326331580  8.9994819322
## [211]  5.3802565957  8.6301732244  6.5620444607  7.7438323257 23.8900090314
## [216] -0.0707399317 13.0618777182 -0.5094786766  0.0947103694 -0.2203099584
## [221] -0.1224431508  4.2590353587  1.7149415323  0.9518020185  1.1741110883
## [226]  2.9280467758 -0.3757889244  0.5201540008 11.6282213145 -0.4659002491
## [231]  0.7038636689 -0.1989639151 -0.5267623241  9.1416367069 -0.3039953116
## [236] -0.2169861800  1.7870788607 -0.2834559581  0.0629165827  7.3139569898
## [241] -0.0589220531 -0.3957315946  0.0031218099  0.4758369559  3.9177941127
## [246]  1.0371861917  9.4869694266 -0.2775528081  4.9370861463 -0.2184634148
## [251] -0.1889187182  0.1360729447  1.2292267198  3.8143876745  8.0540516402
## [256]  0.1508452930  0.3428858211 -0.2480081115  5.6940212738 -0.0500586441
## [261]  1.2735437647 19.2108677033  0.0132260961  0.3551143710  3.4598513150
## [266]  8.4824497413  2.2780634500  0.1351215803  0.2607571779 -0.0220410803
## [271] -0.3251049042 -0.2799677325 -0.1652829565 -0.4273543987 -0.0404643923
## [276] -0.0408807273 -0.4457777285  0.6708317955 -0.0189423173  0.7249018565
## [281]  2.8394126859  0.8156009671  0.4853001222  7.2828991512 24.2624265681
## [286] -0.1416472036 -0.3657097511  0.1775832434  1.2602486513  0.3118638896
## [291]  0.9498816132  1.1821029287  0.2391839359  1.0259402127  2.7950956409
## [296]  2.0121611804  6.4734103708  1.1849096749 -0.0116505385  2.5144210230
## [301]  0.9633244502  6.6359062022  0.6531051356 14.6868360326  4.0212005509
## [306]  0.5024271828  0.9374728406  2.3703906270 -0.0485814092 -0.4697632181
## [311]  2.5735104163 12.6187072688 13.9482186170 -0.1446016733 -0.2023960338
## [316]  0.4753889844 -0.3033809473 -0.1187501213  1.1110479333  0.8673041862
## [321]  7.9574628964  2.9813749532 -0.4432985561 -0.4128675186  1.9857186769
## [326] -0.4740250406 -0.3085747395 -0.4042654325 -0.3868681856 -0.2719638940
## [331] -0.3813137797  0.1124371874  6.0006952248  0.6087880907  0.1803899896
## [336]  1.8053483040 11.8210004600  1.3769502029  3.9916558543  0.5644710457
## [341]  3.3269001802 -0.4961835631  0.7056805984 -0.0744330483 -0.4179639832
## [346]  2.4649335646 -0.4411718920 18.3472851376  0.1065282481  3.4889528412
## [351] -0.2864162171  3.9735652170  2.3814698882 27.7012748960  0.7565115738
## [356]  8.3199539099 17.5822163019 -0.1741463699  6.5915891573  0.5496986974
## [361]  0.1213005964  7.9654175504  4.2280134273  3.7552982813  0.2648789586
## [366] -0.1401699688 -0.3499373148  4.2676030992  0.0828924908 -0.4757238607
## [371] -0.3253694947 -0.3076027367 -0.4347237309  0.7517389501 -0.2633713537
## [376] 20.2575594502 -0.0680809090  0.2350476784 -0.1383933132  4.1984687306
## [381]  0.2252129919 -0.2868468310  1.3894549958 -0.5272054946  1.9678441354
## [386] 30.0943953227  0.3872028660 13.2391458980  0.3133411244  8.0983686852
## [391]  2.2632911017  5.3507118990 13.5641375608  2.2928357983  9.2210671569
## [396]  0.4669735469  1.7019418658 -0.0928984542  0.5166086372 -0.1534650823
## [401] -0.2673598877  1.0360044039  3.8587047195  0.9603699805  0.3739077525
## [406]  0.0474388548 -0.2701666339  0.7269668772  5.0995819777  4.4495986520
## [411]  1.2587714164  4.3461922138  7.5640528467 -0.1963048924 -0.2114816928
## [416] 26.8561485536  1.2809299389  3.7422761711  0.3112729957  0.8153055201
## [421] -0.0191339486  1.4064948996  1.3917225512  2.6178274612 -0.4695933361
## [426]  3.3416725285  1.5099013377  1.3621778546  3.3859895735 17.7151674367
## [431]  5.9711505282 -0.5035697373 12.1755368195  3.6666641914  5.6757035619
## [436]  0.3783394570  0.1161339025 -0.4328840565 -0.1918731879  2.7397928733
## [441]  0.0437457677  2.1979705812  3.6022895400 -0.1387666105 -0.2523327297
## [446]  0.2543089563 -0.1602160632 -0.1979308611 -0.0810805755  1.6280801243
## [451] -0.4363555524  5.6195686383  0.5275401750 -0.4743338935 -0.4420930616
## [456] -0.3632324460  2.4390820688  0.5730649554 -0.4552668912  0.6257762912
## [461] -0.4459575788 -0.5124331462 -0.0235633590  5.2679867485  5.6047962900
## [466]  7.5961088425  5.9120611349  2.5291933713  3.5928024498 17.3458587289
## [471]  0.4315199109  2.7360062477  3.6075747981  3.5041683600 -0.2465308766
## [476]  1.2971795220  4.4377807733 -0.4529744443  1.0636956538 -0.2154860702
## [481] -0.3352377395  1.9235270905 -0.1002846283  3.0134309490 -0.1011709692
## [486] -0.4725478058 -0.1043578094 -0.1686205455  1.9791068796 -0.5109559114
## [491]  2.7212338994 -0.1593740216  5.8529717417  1.7462589108  1.0815032367
## [496]  5.8825164383  5.5722971237  0.4019752143  7.1529383932 -0.4936084473
## [501] -0.0704445217 -0.3923339589 -0.3292560272  1.9631169840  2.6769168544
## [506] -0.2902628273  1.5918213082  0.6799985813 -0.0234684171 -0.5080014418
## [511]  0.1386231557  0.1198233615 -0.1325810944  1.7373955018  0.3391927340
## [516] -0.1944583488  3.4450789667  1.1258202816  3.4155342701  2.1451123152
## [521]  3.6518918431  4.5234603935 10.5801232017  1.2144543715  0.0917558997
## [526]  6.1779634045  0.4906093042  0.9928691468  0.8259416109  2.2071561781
## [531]  1.1252293877  0.0668305903 -0.3724654267  1.1657056221  0.6548207740
## [536] -0.5069673774  6.2473934416  7.0068398684  2.1081814444 -0.4341397002
## [541]  7.4483853594 -0.0544903486 -0.1357382643  0.8765369038  0.1493680582
## [546]  6.6263041758 -0.2121113051 -0.4991715660 -0.3676641328 -0.3632324283
## [551]  0.7532616572 -0.0583079828 -0.4016405339 -0.3469828451  0.5703799851
## [556] -0.3478555763  1.0024226771 -0.0443395294 -0.2214178845  2.0860229219
## [561]  0.8894627086  4.2871028205  2.1894293601  0.7417392255  6.0154675731
## [566]  1.5837630793 -0.4474348137 -0.3667777919 -0.1327837946 -0.4814112148
## [571]  0.8692244939  0.1464874340  0.3275705476  0.5127678266  0.2450972301
## [576] -0.1694277184  0.2468655570 -0.2707560772 -0.0507973132  0.7405574376
## [581]  0.4919151798  0.6796953626 11.2105803208  0.6944677109 -0.4592526923
## [586] -0.3271878984 -0.4060722384 -0.4968470631 -0.3868115632  7.9943713531
## [591]  9.9596845726  3.9325664610  4.3609645621  1.9382994388 -0.3942543597
## [596] 55.2664768474  0.9234391097  4.9186207109 -0.1039777154 17.9367526614
## [601] -0.1785780744 -0.0293773564  2.1693389664 -0.2672121643 -0.2486052881
## [606]  0.5408352884 -0.4641139767  0.5811637993  0.7343530513 -0.3159609137
## [611] -0.3702478165  0.6499419548 -0.3485516685  1.5630817917  0.7121945289
## [616]  2.7655509443  5.0700372811  2.5882827646  0.4795248017 -0.1529033652
## [621] -0.2891793892  3.7996153262 -0.0094347069  0.3724305177  0.7801473311
## [626]  0.4493944524 -0.3487850716  1.6728403396  3.5854222496 18.4161574697
## [631] -0.4235988911 -0.2131247783  0.0766670509 -0.4159697118  0.5260629401
## [636]  0.8599180120  9.9744569209  4.9666308429 -0.4858429193 -0.4651616316
## [641]  1.0414701728  3.1791766971  3.3771261645  2.0417058770  3.6223471465
## [646]  0.7860562704  2.5439657196  0.1375500717  0.8045216482 -0.0677115885
## [651] -0.0036177328  1.0888894108  0.9869602075 11.2891959207  0.6265149087
## [656]  0.0363595936 -0.0380930419  1.8452336444 -0.3670732389 -0.3765275418
## [661] -0.4133598810  0.2173208604  0.0181942053  9.5165141232  0.3030004806
## [666] -0.4902746238  3.3328091195  6.7984020337  7.0495319550  0.8746903603
## [671]  3.2087213937  8.8517584491  2.3666975399 -0.4045950036  0.9573859662
## [676]  0.0189282226  3.2347945885 -0.3005001739  2.3499530831  3.6578007824
## [681]  0.1297694758 -0.4478779945 -0.4496506718 -0.3912998901 -0.2946577929
## [686]  0.0078489613  3.5263268824  2.3706860739  0.2586834357  0.2239684171
## [691]  0.3901573357 -0.4823040555 -0.3893443266 -0.2760755732  6.7688573370
## [696]  0.0045990447 -0.0128294235 -0.4291971014  0.2867508975 -0.1638057261
## [701] -0.1682374306  1.1553649782  4.8041350115  5.2768501575  6.3404592360
## [706]  0.8451456637  0.4389060851  3.0166808656 -0.1539821189 -0.0471778914
## [711] -0.2983910623 -0.3874590840  3.7397871013  5.6904759102  3.2382660903
## [716] 11.3630576623 -0.0988073935  1.4375168310  0.4149748808  1.2620034024
## [721]  2.0557396079 -0.2016702093  4.3285835746 -0.4174438444  2.3534024264
## [726]  3.7833657431 -0.3898226553 -0.4755022755  0.4167475626  0.0051775180
## [731] -0.4852233655 -0.3168143935 -0.2652917605  0.7963969143  1.1227180885
## [736]  0.5792433941  2.4095373500  1.2135384859 -0.2661323056 -0.4326624654
## [741]  5.8234270450  0.0112944609  2.3238577298 -0.5065242069  0.9042350569
## [746]  5.2029884159  4.9075414497  2.4848763264  9.8852319371  2.7768905519
## [751]  3.3054802751 -0.2273268238  0.5910612727  1.9501173175 -0.3771603153
## [756]  1.3001990949 -0.2183937632  8.2347174601 -0.2804844456 -0.1075864302
## [761] -0.2366728853 -0.1652829609  1.0224138434  1.7167142141  5.8677440900
## [766]  6.1336463596  0.5349263491  0.9190074052 -0.4784567451  0.6881894629
## [771]  0.3172281193  9.0378900379 -0.4322794184  8.1816847297  1.7314865624
## [776]  3.6778764038  0.1708175079  0.6974221806  1.6486136884  2.8689573825
## [781] -0.4642010627  0.2266739166 -0.2132931298  0.0880628127 -0.2406219373
## [786]  0.8305380271  3.8227336362 -0.1233693682 11.0572700522  0.5866295682
## [791]  2.3962422365  0.0531262089  1.2883161131  0.0164169233 -0.2509625811
## [796] -0.2849389822 -0.4360883472 -0.3497920288 -0.2675850420  0.2567940110
## [801]  1.2086787378  3.5558715790 -0.0633537575  1.4434257703  1.8792100456
## [806]  2.6030551129  1.6576248209 11.9391792465  4.9518584946 -0.4089310601
## [811]  4.6952628044 -0.4031177687 -0.2593274233 -0.3449147164  6.7757381462
## [816]  0.9159424458  2.0231222628  5.8260860677 -0.2952796261  0.4647576946

Looks like NO pollutant feature is not recorded very well in Poland with the number of unique records in Poland being 3, which is certainly not enough to drive any conclusion, so I can remove this fature from the dataset as well.

poland_df <- poland_df %>% select(-NOX_scaled, -NO_scaled)

Log scaling

poland_log <- poland_df %>%
  mutate(across(everything(), ~ log(. - min(.) + 0.1)))
poland_long <- poland_log %>%
  pivot_longer(cols = everything(), 
               names_to = "pollutant", 
               values_to = "value")

ggplot(poland_long, aes(x = value, fill = pollutant)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.8) +
  facet_wrap(~ pollutant, scales = "free") + # 'free' lets each plot have its own axis
  theme_minimal() +
  labs(title = "Distribution of Pollutants in Polish Cities",
       x = "Scaled Value",
       y = "Frequency") +
  theme(legend.position = "none") 

Great, now it looks much better.

I do not need to standardize the data as I already did this step during EDA.

PCA - Poland

pca_poland <- prcomp(poland_log, center=FALSE, scale.=FALSE)
pca_poland
## Standard deviations (1, .., p=6):
## [1] 3.6967413 0.5420173 0.3471524 0.2510593 0.2061991 0.1108709
## 
## Rotation (n x k) = (6 x 6):
##                      PC1         PC2         PC3         PC4         PC5
## O3_scaled    -0.10509687 -0.18879252  0.97420227 -0.04487156 -0.03667269
## PM2.5_scaled  0.56215298 -0.02865123  0.04430971  0.17749370  0.13982151
## NO2_scaled    0.01504604  0.95789576  0.18950498 -0.10329150  0.18872695
## CO_scaled     0.59516632 -0.16131887  0.05410104 -0.30295320  0.55900517
## PM10_scaled   0.46127120  0.13017347  0.09999128  0.64173605 -0.43999562
## SO2_scaled    0.32514338  0.05480234 -0.01136956 -0.67246383 -0.66136546
##                       PC6
## O3_scaled     0.029725755
## PM2.5_scaled  0.793817181
## NO2_scaled    0.003193814
## CO_scaled    -0.461041134
## PM10_scaled  -0.393527870
## SO2_scaled    0.039209062
summary(pca_poland)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.6967 0.54202 0.34715 0.25106 0.20620 0.11087
## Proportion of Variance 0.9625 0.02069 0.00849 0.00444 0.00299 0.00087
## Cumulative Proportion  0.9625 0.98321 0.99170 0.99614 0.99913 1.00000

In fact just the 1st PC already explains 96% of the variance which is a really good result and it means that I can reduce the dimention to just 2 dimentions to preserve still some of the data in PCA , even though it only holds 2% of the variance. So all the pollutants are highly correlated with each other.

fviz_pca_var(pca_poland, col.var="steelblue")

# visusalisation of quality
fviz_eig(pca_poland, choice='eigenvalue') # eigenvalues on y-axis
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

fviz_eig(pca_poland) # percentage of explained variance on y-axis
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

This plot exactly proves my conclusion above - 1st dimentions explains most of the variance (in this case it is 96%)

Let;s check the most significant variables that are in this PC

loading_scores_PC_1<-pca_poland$rotation[,1]
fac_scores_PC_1<-abs(loading_scores_PC_1)
fac_scores_PC_1_ranked<-names(sort(fac_scores_PC_1, decreasing=T))
pca_poland$rotation[fac_scores_PC_1_ranked, 1]
##    CO_scaled PM2.5_scaled  PM10_scaled   SO2_scaled    O3_scaled   NO2_scaled 
##   0.59516632   0.56215298   0.46127120   0.32514338  -0.10509687   0.01504604
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
var<-get_pca_var(pca_poland)
a<-fviz_contrib(pca_poland, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(pca_poland, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')

First graph shows the contribution of variables to the PC1 (with 96% variance) - the significant variables here are CO, PM2.5 and PM10 First graph shows the contribution of variables to the PC2 (with 2% variance) - NO2 is the only significant variable here

#Clustering after Dimention reduction

I will use only PC1 and PC2 as they explain 98% of the variance

pca_poland_1_2 <- pca_poland$x[, 1:2]

PAM after Dimention reduction

fviz_nbclust(pca_poland_1_2, FUN = pam, method = "gap_stat") 

fviz_nbclust(pca_poland_1_2, FUN = pam, method = "wss")

fviz_nbclust(pca_poland_1_2, FUN = pam, method = "silhouette")

The outputs are different from the before dimention reduction. I will take k = 3 to capture more granularity as well as k=4 to compare it to the clusters before dimention reduction.

pam_pca_k4 = eclust(pca_poland_1_2, FUNcluster = "pam", k = 4, graph = TRUE)

This looks very similar to the PAM before dimention reduction, but clusters look tigher together

pam_pca_k3 = eclust(pca_poland_1_2, FUNcluster = "pam", k = 3, graph = TRUE)

Hierarchical clustring After dimention reduction

fviz_nbclust(pca_poland_1_2, FUN = hcut, method = "gap_stat") 

fviz_nbclust(pca_poland_1_2, FUN = hcut, method = "wss")

fviz_nbclust(pca_poland_1_2, FUN = hcut, method = "silhouette")

Very similar results to pam, methods stringly suggest 1 to 3 clustes, I choose k=3 for more granularity. As well as k = 4 for comparison with before simention reduction.

d_reduced <- dist(pca_poland_1_2, method = "euclidean")

hc_poland_reduced <- hclust(d_reduced, method = "ward.D2" )

#plot(hc1, cex = 0.6, hang = -1) #This plotting did not work for me as I have many cities, and it the dendogram was very cluttered and not interpretable. 

fviz_dend(hc_poland_reduced, k = 4,               
          cex = 0.3,              
          type = "circular",     
          rect = TRUE,      
          palette = "jco")

hc_poland_k4 <- cutree(hc_poland_reduced, k = 4)

This looks different to the previous Hierarchical clustering before dimention reduction. I like this clustering more because now it actually gives the split that makes sense when it comes to pollution in big industrial and smaller coastal cities.

My interpretation: Yellow cluster: This big cluster represents a wide range of mid-sized towns and regional centers with moderate air quality that often fluctuates seasonally, cities like Gdańsk, Wrocław, Białystok, Poznań, and Rzeszów. Red cluster: This large cluster contains major metropolitan and industrial centers that historically face significant smog challenges due to traffic and residential heating, including Warsaw, Lublin, Zamość Grey cluster: This minor cluster includes cities that act as transitions between the main groups or have unique outlier data points, represented by locations such as Otwock, Radom. Blue cluster: This small, distinct group appears to represent locations with unique environmental monitoring profiles or typical Polish health resorts, such as Nałęczów, Krasnobród, Szamotuły, and Mogilno.

fviz_dend(hc_poland_reduced, k = 3,               
          cex = 0.3,              
          type = "circular",     
          rect = TRUE,      
          palette = "jco")

hc_poland_k3 <- cutree(hc_poland_reduced, k = 3)

Split with k=3 makes much more sense than k = 4 as it removed this small cluster which only had 3 cities and was difficult to interpret why they were clustered together. The interpretation is the same as before with k=4, excep the grey cluster in the previoud hclust.

DBSCAN after dimention reduction - Poland

Optimal eps

kNNdistplot(pca_poland_1_2, k =  4)
abline(h = 0.25, lty = 2)

So optimal eps for k=4 is 0.25 similar to before dimention reduction.

kNNdistplot(pca_poland_1_2, k =  3)
abline(h = 0.18, lty = 2)

And eps = 0.18 for k = 3

set.seed(123)
db <- fpc::dbscan(pca_poland_1_2, eps = 0.25, MinPts = 4)

fviz_cluster(db, data = pca_poland_1_2, stand = FALSE, 
             ellipse = TRUE, show.clust.cent = FALSE,
             geom = "text", labelsize = 8, repel = TRUE)

Interesting, now with minPts = 4 I have much more clusters than before dimention reduction with same minPts , where I only had 1 cluster. Blue and red clusters are similar to the clusters I got in Hierarchical clustering and PAM, which proves that those are consistent. In blue, for example, I have Mogilno, Krasnobród etc, which represent the clean air cluster.

Now let’s test with eps = 0.18

set.seed(123)
db_reduced <- fpc::dbscan(pca_poland_1_2, eps = 0.18, MinPts = 4)

fviz_cluster(db_reduced, data = pca_poland_1_2, stand = FALSE, 
             ellipse = TRUE, show.clust.cent = FALSE,
             geom = "text", labelsize = 8, repel = TRUE)

Results are very similar or identical to the eps = 0.25, so no change here. It is still consistent with those red and blue clusters.

Did reducing dimentions preserved the original structure of the data?

PAM

print(adjustedRandIndex(pam_poland$cluster, pam_pca_k4$cluster))
## [1] 0.3929003

For k = 4 in both the index is quite low.

Hierarchical clustering

print(adjustedRandIndex(hc_poland, hc_poland_k4))
## [1] 0.2038499

Same here, the result is low.

In my opinion this does not indicate that dimention reduction did not preserve the data. As I picked PC1 (96% of variance) and PC2 (2% of variance) then preserve alltogether 98% of variance. Also I keep in mind that after the dimention reduction the clusters were much more interpretable made much more sense (having the blue cluster with clean air with all the clean health Polish resorts is not coincidence.) This could mean that the 2% of the variance that I left behind with PC1 and PC2 were actually noice, which were distrupting the clisters

Comparing different algorithms after dimention reduction with ARI (adjusted random index)

PAM vs Hierarchical clustering for k=3

print(adjustedRandIndex(hc_poland_k3, pam_pca_k3$cluster))
## [1] 0.5780577

Index of 0.58 is a good agreement between clusters and also the strongest among the clusters after dimention reduction. So I can be confident that those 3 clusters are not coincidental and they actually exist.

PAM vs DBSCAN for k=3

print(adjustedRandIndex(db_reduced$cluster, pam_pca_k3$cluster))
## [1] 0.3939372

This disagreement is expected as dbscan is a desity based algorithm vs PAM which is based on medoids.

DBSCAN vs Hierarchical clustering for k=3

print(adjustedRandIndex(db_reduced$cluster, hc_poland_k3))
## [1] 0.2981993

Very poor agreement between those 2 methods.