Cluster analysis is a popular and widely used statistical procedure for identifying structure among observations on the basis of a set of variables, that is, the characteristics those observations possess. Because it is applied across many disciplines, such as biology, sociology, economics and business analytics, it appears in the literature under several names: Q analysis, typology construction, classification analysis and numerical taxonomy. Cluster analysis is comparable to factor analysis in that both methods are used to uncover structure in data. The difference is that cluster analysis identifies structure among objects, whereas factor analysis identifies it among variables. In addition, factor analysis groups on the basis of patterns of variation, while cluster analysis groups on the basis of a distance measure. In cluster analysis, structure is revealed by grouping observations into clusters so that observations within a cluster are similar to one another and observations in different clusters are dissimilar. Grouping is achieved by maximizing the homogeneity of observations within clusters and maximizing the heterogeneity between clusters. Cluster analysis is most commonly used for the purpose of:
Cluster analysis is frequently criticized, for several reasons. First, a clustering algorithm will always find some structure in the data, so an a priori conception of the structure that should exist is necessary to judge whether a grouping is justified. This holds when cluster analysis is used for exploratory purposes, not only when it is used to verify a theory. Cluster analysis is descriptive, atheoretical and non-inferential: there is no statistical basis for drawing conclusions from a sample to a population, and there is no guarantee of a unique solution. A cluster solution cannot be generalized, because it depends on the variables used, and does so to a much greater extent than in other methods. This is why careful selection of variables is a key part of cluster analysis.
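The point that an algorithm always returns clusters can be illustrated with a minimal sketch. It assumes purely artificial, uniformly distributed data (the `noise` matrix is invented for this illustration and has no group structure by construction) and uses base R `kmeans()`.

```r
# Assumption: artificial uniform data with no underlying groups.
set.seed(1)
noise <- matrix(runif(200), ncol = 2)   # 100 points, 2 variables

# k-means still partitions the points into the requested number of clusters
km_noise <- kmeans(noise, centers = 3, nstart = 25)

table(km_noise$cluster)                 # sizes of the three groups
km_noise$betweenss / km_noise$totss     # share of variation attributed to the grouping
```

A non-trivial share of the total sum of squares is attributed to the grouping even though no real structure exists, which is why a prior idea of the expected structure is needed before a solution is accepted.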
Carrying out a cluster analysis in practice involves choosing a similarity measure, a criterion for assigning observations to clusters, and the number of clusters. The first step is to choose a measure of similarity between each pair of objects, based on the characteristics (variables) used in the clustering. Similarity expresses how closely objects correspond to one another; the most commonly used measure is the Euclidean distance between objects (many other distance measures exist, and some algorithms can also group by correlation-based measures). The table and plot of the data for the cluster analysis, together with the computed matrix of Euclidean distances, are shown below.
Table and visualization of the data for cluster analysis.
Matrix of Euclidean distances.
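As a minimal sketch of the first step, the Euclidean distance matrix can be computed with base R `dist()`. The data frame `toy` below is a hypothetical stand-in with illustrative values chosen for this sketch; it is not the table from the original example.

```r
# Hypothetical toy data: 7 observations (A-G) on two variables.
toy <- data.frame(
  V1 = c(3, 4, 4, 2, 6, 7, 6),
  V2 = c(2, 5, 7, 7, 6, 7, 4),
  row.names = LETTERS[1:7]
)

# Pairwise Euclidean distances: sqrt(sum((x_i - x_j)^2))
d <- dist(toy, method = "euclidean")
round(as.matrix(d), 2)

# Manual check for the pair A-B
toy_m <- as.matrix(toy)
sqrt(sum((toy_m["A", ] - toy_m["B", ])^2))
```

Other measures can be requested through the `method` argument of `dist()`, for example `"manhattan"`.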
The next step is to form clusters on the basis of the similarity measure for each observation. Forming clusters essentially comes down to identifying the most similar (closest) observations that are not yet in the same cluster and combining them. The figure below shows the results of the hierarchical clustering procedure. The procedure starts with every observation as its own cluster and then merges two clusters at a time until all observations end up in a single cluster.
Hierarchical clustering procedure.
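The agglomerative procedure described above can be sketched with base R `hclust()`. The `toy` data are hypothetical illustrative values, and single linkage is an assumption made here because the text speaks of combining the closest observations; the original example may have used a different linkage rule.

```r
# Hypothetical toy data: 7 observations (A-G) on two variables.
toy <- data.frame(
  V1 = c(3, 4, 4, 2, 6, 7, 6),
  V2 = c(2, 5, 7, 7, 6, 7, 4),
  row.names = LETTERS[1:7]
)

# Start with 7 singleton clusters and merge the two closest clusters at each step
hc <- hclust(dist(toy), method = "single")

hc$merge    # one row per merge; negative = single observation, positive = earlier merge
hc$height   # distance at which each merge took place
plot(hc)    # dendrogram of the full merge sequence

# Cluster membership when the tree is cut into 3 clusters
cutree(hc, k = 3)
```

With 7 observations there are 6 merges, after which every observation belongs to one cluster.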
The last step is to choose the number of clusters in the final solution. This choice is needed because the algorithm produces a whole series of cluster solutions, from which the best one has to be selected. Each merge, which reduces the number of clusters by one, brings an increase in within-cluster heterogeneity, so the aim is to identify groups of observations with as little heterogeneity as possible. There are several measures of heterogeneity; this example uses the average of all distances between observations within clusters. On that basis, the optimal number of clusters in our case is three or four. The cluster solution is shown in the following plot:
Cluster solution.
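The heterogeneity measure mentioned above, the average distance between observations that share a cluster, can be computed for every solution of a hierarchical procedure. This is a sketch under stated assumptions: the `toy` data are hypothetical, single linkage is assumed, and `avg_within()` is a helper written for this illustration.

```r
# Hypothetical toy data: 7 observations (A-G) on two variables.
toy <- data.frame(
  V1 = c(3, 4, 4, 2, 6, 7, 6),
  V2 = c(2, 5, 7, 7, 6, 7, 4),
  row.names = LETTERS[1:7]
)

d  <- dist(toy)
dm <- as.matrix(d)
hc <- hclust(d, method = "single")

# Average distance over all pairs of observations that fall in the same cluster
avg_within <- function(k) {
  cl   <- cutree(hc, k = k)
  same <- outer(cl, cl, "==") & upper.tri(dm)  # each within-cluster pair once
  if (!any(same)) return(0)                    # only singleton clusters
  mean(dm[same])
}

het <- sapply(1:7, avg_within)
data.frame(k = 1:7, avg_within_dist = round(het, 2))
```

Reading the table from many clusters towards few, a sharp jump in the measure suggests that the preceding solution is a reasonable stopping point.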
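## LOAD PACKAGES
library(tidyverse)   # dplyr, tidyr, purrr, tibble, ggplot2
library(cluster)     # silhouette(), clusGap()
library(factoextra)  # fviz_cluster(), fviz_nbclust(), fviz_gap_stat()
## LOAD THE DATA
url <- "https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.xlsx"
COVID19_dta <- rio::import(url)
## INSPECT THE DATA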
str(COVID19_dta)
## 'data.frame': 9107 obs. of 10 variables:
## $ dateRep : POSIXct, format: "2020-04-06" "2020-04-05" ...
## $ day : num 6 5 4 3 2 1 31 30 29 28 ...
## $ month : num 4 4 4 4 4 4 3 3 3 3 ...
## $ year : num 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ cases : num 29 35 0 43 26 25 27 8 15 16 ...
## $ deaths : num 2 1 0 0 0 0 0 1 1 1 ...
## $ countriesAndTerritories: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ geoId : chr "AF" "AF" "AF" "AF" ...
## $ countryterritoryCode : chr "AFG" "AFG" "AFG" "AFG" ...
## $ popData2018 : num 37172386 37172386 37172386 37172386 37172386 ...
head(COVID19_dta,15)
## dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2020-04-06 6 4 2020 29 2 Afghanistan AF
## 2 2020-04-05 5 4 2020 35 1 Afghanistan AF
## 3 2020-04-04 4 4 2020 0 0 Afghanistan AF
## 4 2020-04-03 3 4 2020 43 0 Afghanistan AF
## 5 2020-04-02 2 4 2020 26 0 Afghanistan AF
## 6 2020-04-01 1 4 2020 25 0 Afghanistan AF
## 7 2020-03-31 31 3 2020 27 0 Afghanistan AF
## 8 2020-03-30 30 3 2020 8 1 Afghanistan AF
## 9 2020-03-29 29 3 2020 15 1 Afghanistan AF
## 10 2020-03-28 28 3 2020 16 1 Afghanistan AF
## 11 2020-03-27 27 3 2020 0 0 Afghanistan AF
## 12 2020-03-26 26 3 2020 33 0 Afghanistan AF
## 13 2020-03-25 25 3 2020 2 0 Afghanistan AF
## 14 2020-03-24 24 3 2020 6 1 Afghanistan AF
## 15 2020-03-23 23 3 2020 10 0 Afghanistan AF
## countryterritoryCode popData2018
## 1 AFG 37172386
## 2 AFG 37172386
## 3 AFG 37172386
## 4 AFG 37172386
## 5 AFG 37172386
## 6 AFG 37172386
## 7 AFG 37172386
## 8 AFG 37172386
## 9 AFG 37172386
## 10 AFG 37172386
## 11 AFG 37172386
## 12 AFG 37172386
## 13 AFG 37172386
## 14 AFG 37172386
## 15 AFG 37172386
summary(COVID19_dta)
## dateRep day month year
## Min. :2019-12-31 00:00:00 Min. : 1.00 Min. : 1.000 Min. :2019
## 1st Qu.:2020-02-02 00:00:00 1st Qu.: 6.00 1st Qu.: 2.000 1st Qu.:2020
## Median :2020-03-09 00:00:00 Median :17.00 Median : 3.000 Median :2020
## Mean :2020-02-28 11:14:22 Mean :15.91 Mean : 2.529 Mean :2020
## 3rd Qu.:2020-03-26 00:00:00 3rd Qu.:24.00 3rd Qu.: 3.000 3rd Qu.:2020
## Max. :2020-04-06 00:00:00 Max. :31.00 Max. :12.000 Max. :2020
##
## cases deaths countriesAndTerritories
## Min. : -9.0 Min. : 0.000 Length:9107
## 1st Qu.: 0.0 1st Qu.: 0.000 Class :character
## Median : 0.0 Median : 0.000 Mode :character
## Mean : 136.6 Mean : 7.574
## 3rd Qu.: 11.0 3rd Qu.: 0.000
## Max. :34272.0 Max. :2004.000
##
## geoId countryterritoryCode popData2018
## Length:9107 Length:9107 Min. :1.000e+03
## Class :character Class :character 1st Qu.:3.731e+06
## Mode :character Mode :character Median :1.063e+07
## Mean :6.489e+07
## 3rd Qu.:4.272e+07
## Max. :1.393e+09
## NA's :36
# Number of countries
length(unique(COVID19_dta$geoId)) # COVID19_dta %>% dplyr::summarise(n_distinct(geoId))
## [1] 205
## PREPARE THE DATA FOR ANALYSIS
COVID19 <- COVID19_dta %>% dplyr::select(dateRep,
cases,
deaths,
countriesAndTerritories) %>%
rename( "Country" = countriesAndTerritories) %>%
mutate_at(.,c("cases", "deaths"), scale)
# Preview
head(COVID19, 15)
## dateRep cases deaths Country
## 1 2020-04-06 -0.10095355 -0.08647484 Afghanistan
## 2 2020-04-05 -0.09532649 -0.10198894 Afghanistan
## 3 2020-04-04 -0.12815098 -0.11750303 Afghanistan
## 4 2020-04-03 -0.08782375 -0.11750303 Afghanistan
## 5 2020-04-02 -0.10376708 -0.11750303 Afghanistan
## 6 2020-04-01 -0.10470492 -0.11750303 Afghanistan
## 7 2020-03-31 -0.10282923 -0.11750303 Afghanistan
## 8 2020-03-30 -0.12064824 -0.10198894 Afghanistan
## 9 2020-03-29 -0.11408334 -0.10198894 Afghanistan
## 10 2020-03-28 -0.11314550 -0.10198894 Afghanistan
## 11 2020-03-27 -0.12815098 -0.11750303 Afghanistan
## 12 2020-03-26 -0.09720218 -0.11750303 Afghanistan
## 13 2020-03-25 -0.12627530 -0.11750303 Afghanistan
## 14 2020-03-24 -0.12252393 -0.10198894 Afghanistan
## 15 2020-03-23 -0.11877256 -0.11750303 Afghanistan
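A side note on the standardization step: `scale()` returns a one-column matrix, not a plain vector, which is why later column headers in this document carry a `[,1]` suffix. The sketch below uses a few of the case counts shown earlier; wrapping the call in `as.numeric()` is one possible way to get ordinary numeric columns, offered here as a suggestion and not as the method used in the original analysis.

```r
# A few daily case counts taken from the preview above
x <- c(29, 35, 0, 43, 26)

z_mat <- scale(x)               # 5 x 1 matrix with centring/scaling attributes
z_vec <- as.numeric(scale(x))   # plain numeric vector with the same values

is.matrix(z_mat)                # TRUE
dim(z_mat)                      # 5 1

# Both agree with the textbook z-score (x - mean) / sd
all.equal(z_vec, (x - mean(x)) / sd(x))
```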
## TIME SERIES CASES
COVID19_cases <- COVID19 %>% dplyr::select(-deaths) %>%
pivot_wider(names_from = dateRep, values_from = cases) %>%
filter(complete.cases(.))
# Preview
head(COVID19_cases,15)
## # A tibble: 15 x 99
## Country `2020-04-06`[,1] `2020-04-05`[,1] `2020-04-04`[,1] `2020-04-03`[,1]
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Austra~ -0.0747 0.00221 0.176 0.104
## 2 Austria 0.0754 0.0979 0.243 0.264
## 3 Belgium 1.05 1.43 1.21 1.17
## 4 Canada 1.39 1.15 1.05 1.44
## 5 China -0.0653 -0.0831 -0.0700 -0.0625
## 6 Czech_~ -0.0203 0.136 0.183 0.124
## 7 Denmark 0.146 0.172 0.220 0.134
## 8 France 1.63 3.87 4.78 1.86
## 9 Germany 3.32 5.44 5.58 5.66
## 10 Iceland -0.0634 -0.0784 -0.0859 -0.0353
## 11 Iran 2.20 4.82 -0.128 2.57
## 12 Italy 3.92 4.38 4.17 4.25
## 13 Japan 0.231 0.187 0.170 0.284
## 14 Norway -0.00623 0.155 0.128 0.125
## 15 Singap~ -0.0156 -0.0578 -0.0672 -0.0822
## # ... with 94 more variables: `2020-04-02`[,1] <dbl>, `2020-04-01`[,1] <dbl>,
## # `2020-03-31`[,1] <dbl>, `2020-03-30`[,1] <dbl>, `2020-03-29`[,1] <dbl>,
## # `2020-03-28`[,1] <dbl>, `2020-03-27`[,1] <dbl>, `2020-03-26`[,1] <dbl>,
## # `2020-03-25`[,1] <dbl>, `2020-03-24`[,1] <dbl>, `2020-03-23`[,1] <dbl>,
## # `2020-03-22`[,1] <dbl>, `2020-03-21`[,1] <dbl>, `2020-03-20`[,1] <dbl>,
## # `2020-03-19`[,1] <dbl>, `2020-03-18`[,1] <dbl>, `2020-03-17`[,1] <dbl>,
## # `2020-03-16`[,1] <dbl>, `2020-03-15`[,1] <dbl>, `2020-03-11`[,1] <dbl>,
## # `2020-03-08`[,1] <dbl>, `2020-03-02`[,1] <dbl>, `2020-03-01`[,1] <dbl>,
## # `2020-02-29`[,1] <dbl>, `2020-02-28`[,1] <dbl>, `2020-02-27`[,1] <dbl>,
## # `2020-02-26`[,1] <dbl>, `2020-02-25`[,1] <dbl>, `2020-02-24`[,1] <dbl>,
## # `2020-02-23`[,1] <dbl>, `2020-02-22`[,1] <dbl>, `2020-02-21`[,1] <dbl>,
## # `2020-02-20`[,1] <dbl>, `2020-02-19`[,1] <dbl>, `2020-02-18`[,1] <dbl>,
## # `2020-02-17`[,1] <dbl>, `2020-02-16`[,1] <dbl>, `2020-02-15`[,1] <dbl>,
## # `2020-02-14`[,1] <dbl>, `2020-02-13`[,1] <dbl>, `2020-02-12`[,1] <dbl>,
## # `2020-02-11`[,1] <dbl>, `2020-02-10`[,1] <dbl>, `2020-02-09`[,1] <dbl>,
## # `2020-02-08`[,1] <dbl>, `2020-02-07`[,1] <dbl>, `2020-02-06`[,1] <dbl>,
## # `2020-02-05`[,1] <dbl>, `2020-02-04`[,1] <dbl>, `2020-02-03`[,1] <dbl>,
## # `2020-02-02`[,1] <dbl>, `2020-02-01`[,1] <dbl>, `2020-01-31`[,1] <dbl>,
## # `2020-01-30`[,1] <dbl>, `2020-01-29`[,1] <dbl>, `2020-01-28`[,1] <dbl>,
## # `2020-01-27`[,1] <dbl>, `2020-01-26`[,1] <dbl>, `2020-01-25`[,1] <dbl>,
## # `2020-01-24`[,1] <dbl>, `2020-01-23`[,1] <dbl>, `2020-01-22`[,1] <dbl>,
## # `2020-01-21`[,1] <dbl>, `2020-01-20`[,1] <dbl>, `2020-01-19`[,1] <dbl>,
## # `2020-01-18`[,1] <dbl>, `2020-01-17`[,1] <dbl>, `2020-01-16`[,1] <dbl>,
## # `2020-01-15`[,1] <dbl>, `2020-01-14`[,1] <dbl>, `2020-01-13`[,1] <dbl>,
## # `2020-01-12`[,1] <dbl>, `2020-01-11`[,1] <dbl>, `2020-01-10`[,1] <dbl>,
## # `2020-01-09`[,1] <dbl>, `2020-01-08`[,1] <dbl>, `2020-01-07`[,1] <dbl>,
## # `2020-01-06`[,1] <dbl>, `2020-01-05`[,1] <dbl>, `2020-01-04`[,1] <dbl>,
## # `2020-01-03`[,1] <dbl>, `2020-01-02`[,1] <dbl>, `2020-01-01`[,1] <dbl>,
## # `2019-12-31`[,1] <dbl>, `2020-03-14`[,1] <dbl>, `2020-03-13`[,1] <dbl>,
## # `2020-03-12`[,1] <dbl>, `2020-03-10`[,1] <dbl>, `2020-03-09`[,1] <dbl>,
## # `2020-03-06`[,1] <dbl>, `2020-03-05`[,1] <dbl>, `2020-03-04`[,1] <dbl>,
## # `2020-03-03`[,1] <dbl>, `2020-03-07`[,1] <dbl>
# The clustering algorithm accepts numeric variables only
COVID19_cases_noGeo <- COVID19_cases %>% dplyr::select(-Country)
# Choose the optimal number of clusters
wss <- map_dbl(1:5,
~{kmeans(COVID19_cases_noGeo, ., nstart=50,iter.max = 15 )$tot.withinss})
n_clust <- 1:5
elbow_df <- as.data.frame(cbind("n_clust" = n_clust, "wss" = wss))
# Visualization (elbow plot)
ggplot(elbow_df) +
geom_line(aes(n_clust,wss ), colour = "blue")
# Run the clustering and extract the centroids
clusters <- kmeans(COVID19_cases_noGeo, centers = 3)
centers <- rownames_to_column(as.data.frame(clusters$centers), "cluster")
# Add cluster labels and countries
COVID19_cases_clust <- COVID19_cases_noGeo %>%
mutate(Cluster = clusters$cluster) %>%
mutate(Country = COVID19_cases$Country)
# Preview
head(COVID19_cases_clust,10)
## # A tibble: 10 x 100
## `2020-04-06`[,1] `2020-04-05`[,1] `2020-04-04`[,1] `2020-04-03`[,1]
## <dbl> <dbl> <dbl> <dbl>
## 1 -0.0747 0.00221 0.176 0.104
## 2 0.0754 0.0979 0.243 0.264
## 3 1.05 1.43 1.21 1.17
## 4 1.39 1.15 1.05 1.44
## 5 -0.0653 -0.0831 -0.0700 -0.0625
## 6 -0.0203 0.136 0.183 0.124
## 7 0.146 0.172 0.220 0.134
## 8 1.63 3.87 4.78 1.86
## 9 3.32 5.44 5.58 5.66
## 10 -0.0634 -0.0784 -0.0859 -0.0353
## # ... with 96 more variables: `2020-04-02`[,1] <dbl>, `2020-04-01`[,1] <dbl>,
## # `2020-03-31`[,1] <dbl>, `2020-03-30`[,1] <dbl>, `2020-03-29`[,1] <dbl>,
## # `2020-03-28`[,1] <dbl>, `2020-03-27`[,1] <dbl>, `2020-03-26`[,1] <dbl>,
## # `2020-03-25`[,1] <dbl>, `2020-03-24`[,1] <dbl>, `2020-03-23`[,1] <dbl>,
## # `2020-03-22`[,1] <dbl>, `2020-03-21`[,1] <dbl>, `2020-03-20`[,1] <dbl>,
## # `2020-03-19`[,1] <dbl>, `2020-03-18`[,1] <dbl>, `2020-03-17`[,1] <dbl>,
## # `2020-03-16`[,1] <dbl>, `2020-03-15`[,1] <dbl>, `2020-03-11`[,1] <dbl>,
## # `2020-03-08`[,1] <dbl>, `2020-03-02`[,1] <dbl>, `2020-03-01`[,1] <dbl>,
## # `2020-02-29`[,1] <dbl>, `2020-02-28`[,1] <dbl>, `2020-02-27`[,1] <dbl>,
## # `2020-02-26`[,1] <dbl>, `2020-02-25`[,1] <dbl>, `2020-02-24`[,1] <dbl>,
## # `2020-02-23`[,1] <dbl>, `2020-02-22`[,1] <dbl>, `2020-02-21`[,1] <dbl>,
## # `2020-02-20`[,1] <dbl>, `2020-02-19`[,1] <dbl>, `2020-02-18`[,1] <dbl>,
## # `2020-02-17`[,1] <dbl>, `2020-02-16`[,1] <dbl>, `2020-02-15`[,1] <dbl>,
## # `2020-02-14`[,1] <dbl>, `2020-02-13`[,1] <dbl>, `2020-02-12`[,1] <dbl>,
## # `2020-02-11`[,1] <dbl>, `2020-02-10`[,1] <dbl>, `2020-02-09`[,1] <dbl>,
## # `2020-02-08`[,1] <dbl>, `2020-02-07`[,1] <dbl>, `2020-02-06`[,1] <dbl>,
## # `2020-02-05`[,1] <dbl>, `2020-02-04`[,1] <dbl>, `2020-02-03`[,1] <dbl>,
## # `2020-02-02`[,1] <dbl>, `2020-02-01`[,1] <dbl>, `2020-01-31`[,1] <dbl>,
## # `2020-01-30`[,1] <dbl>, `2020-01-29`[,1] <dbl>, `2020-01-28`[,1] <dbl>,
## # `2020-01-27`[,1] <dbl>, `2020-01-26`[,1] <dbl>, `2020-01-25`[,1] <dbl>,
## # `2020-01-24`[,1] <dbl>, `2020-01-23`[,1] <dbl>, `2020-01-22`[,1] <dbl>,
## # `2020-01-21`[,1] <dbl>, `2020-01-20`[,1] <dbl>, `2020-01-19`[,1] <dbl>,
## # `2020-01-18`[,1] <dbl>, `2020-01-17`[,1] <dbl>, `2020-01-16`[,1] <dbl>,
## # `2020-01-15`[,1] <dbl>, `2020-01-14`[,1] <dbl>, `2020-01-13`[,1] <dbl>,
## # `2020-01-12`[,1] <dbl>, `2020-01-11`[,1] <dbl>, `2020-01-10`[,1] <dbl>,
## # `2020-01-09`[,1] <dbl>, `2020-01-08`[,1] <dbl>, `2020-01-07`[,1] <dbl>,
## # `2020-01-06`[,1] <dbl>, `2020-01-05`[,1] <dbl>, `2020-01-04`[,1] <dbl>,
## # `2020-01-03`[,1] <dbl>, `2020-01-02`[,1] <dbl>, `2020-01-01`[,1] <dbl>,
## # `2019-12-31`[,1] <dbl>, `2020-03-14`[,1] <dbl>, `2020-03-13`[,1] <dbl>,
## # `2020-03-12`[,1] <dbl>, `2020-03-10`[,1] <dbl>, `2020-03-09`[,1] <dbl>,
## # `2020-03-06`[,1] <dbl>, `2020-03-05`[,1] <dbl>, `2020-03-04`[,1] <dbl>,
## # `2020-03-03`[,1] <dbl>, `2020-03-07`[,1] <dbl>, Cluster <int>,
## # Country <chr>
# Reshape the data to long format
COVID19_cases_clust_long <- COVID19_cases_clust %>%
pivot_longer(cols=c(-Country, -Cluster),
names_to = "Date",
values_to = "Cases")
# Preview
head(COVID19_cases_clust_long,10)
## # A tibble: 10 x 4
## Cluster Country Date Cases[,1]
## <int> <chr> <chr> <dbl>
## 1 2 Australia 2020-04-06 -0.0747
## 2 2 Australia 2020-04-05 0.00221
## 3 2 Australia 2020-04-04 0.176
## 4 2 Australia 2020-04-03 0.104
## 5 2 Australia 2020-04-02 0.124
## 6 2 Australia 2020-04-01 0.0125
## 7 2 Australia 2020-03-31 0.307
## 8 2 Australia 2020-03-30 0.138
## 9 2 Australia 2020-03-29 0.276
## 10 2 Australia 2020-03-28 0.0707
# Reshape the centroids to long format
COVID19_cases_centers_long <- centers %>%
pivot_longer(cols = -cluster,
names_to = "Date",
values_to = "Cases")
# Preview
head(COVID19_cases_centers_long,10)
## # A tibble: 10 x 3
## cluster Date Cases
## <chr> <chr> <dbl>
## 1 1 2020-04-06 23.7
## 2 1 2020-04-05 32.0
## 3 1 2020-04-04 30.3
## 4 1 2020-04-03 26.9
## 5 1 2020-04-02 25.3
## 6 1 2020-04-01 23.3
## 7 1 2020-03-31 20.1
## 8 1 2020-03-30 17.1
## 9 1 2020-03-29 18.6
## 10 1 2020-03-28 17.4
# Visualize
ggplot() +
# geom_line(data = COVID19_cases_clust_long, aes(y = Cases, x = Date, group = Country), colour = "gray") +
facet_wrap(~cluster, nrow = 1) +
geom_line(data = COVID19_cases_centers_long, aes(y = Cases, x = Date, group = cluster), col = "black", size = 1.2)
## CROSS-SECTION CLUSTERING
COVID19_CS <- COVID19 %>% group_by(Country) %>%
summarise(Deaths = sum(deaths),
Cases = sum(cases))
# Preview
head(COVID19_CS,15)
## # A tibble: 15 x 3
## Country Deaths Cases
## <chr> <dbl> <dbl>
## 1 Afghanistan -10.2 -11.0
## 2 Albania -3.08 -3.38
## 3 Algeria -8.57 -10.7
## 4 Andorra -2.54 -2.61
## 5 Angola -1.85 -2.04
## 6 Anguilla -1.29 -1.41
## 7 Antigua_and_Barbuda -2.12 -2.29
## 8 Argentina -2.93 -2.52
## 9 Armenia -10.3 -10.6
## 10 Aruba -1.76 -1.86
## 11 Australia -11.0 -7.17
## 12 Austria -8.35 -1.32
## 13 Azerbaijan -10.6 -11.1
## 14 Bahamas -2.27 -2.54
## 15 Bahrain -11.3 -11.8
# Drop non-numeric variables and scale
COVID19_CS_noGeo <- COVID19_CS %>%
remove_rownames %>%
column_to_rownames(var = "Country") %>%
scale()
# Preview
head(COVID19_CS_noGeo,15)
## Deaths Cases
## Afghanistan -0.39822096 -0.42143953
## Albania -0.11994468 -0.12945007
## Algeria -0.33353410 -0.40929930
## Andorra -0.09888978 -0.09986223
## Angola -0.07196467 -0.07807610
## Anguilla -0.05030596 -0.05391543
## Antigua_and_Barbuda -0.08231884 -0.08786257
## Argentina -0.11399584 -0.09639428
## Armenia -0.40279423 -0.40755334
## Aruba -0.06859903 -0.07136782
## Australia -0.42644301 -0.27485017
## Austria -0.32500205 -0.05061122
## Azerbaijan -0.41194077 -0.42592982
## Bahamas -0.08844630 -0.09718180
## Bahrain -0.44119183 -0.45122784
# Run the clustering
clust2 <- kmeans(COVID19_CS_noGeo, centers = 2, nstart = 25)
# Preview
str(clust2)
## List of 9
## $ cluster : Named int [1:205] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr [1:205] "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ centers : num [1:2, 1:2] -0.129 6.501 -0.11 5.519
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:2] "Deaths" "Cases"
## $ totss : num 408
## $ withinss : num [1:2] 45.3 66.1
## $ tot.withinss: num 111
## $ betweenss : num 297
## $ size : int [1:2] 201 4
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
# Preview
clust2
## K-means clustering with 2 clusters of sizes 201, 4
##
## Cluster means:
## Deaths Cases
## 1 -0.1293637 -0.1098371
## 2 6.5005255 5.5193132
##
## Clustering vector:
## Afghanistan
## 1
## Albania
## 1
## Algeria
## 1
## Andorra
## 1
## Angola
## 1
## Anguilla
## 1
## Antigua_and_Barbuda
## 1
## Argentina
## 1
## Armenia
## 1
## Aruba
## 1
## Australia
## 1
## Austria
## 1
## Azerbaijan
## 1
## Bahamas
## 1
## Bahrain
## 1
## Bangladesh
## 1
## Barbados
## 1
## Belarus
## 1
## Belgium
## 1
## Belize
## 1
## Benin
## 1
## Bermuda
## 1
## Bhutan
## 1
## Bolivia
## 1
## Bonaire, Saint Eustatius and Saba
## 1
## Bosnia_and_Herzegovina
## 1
## Botswana
## 1
## Brazil
## 1
## British_Virgin_Islands
## 1
## Brunei_Darussalam
## 1
## Bulgaria
## 1
## Burkina_Faso
## 1
## Burundi
## 1
## Cambodia
## 1
## Cameroon
## 1
## Canada
## 1
## Cape_Verde
## 1
## Cases_on_an_international_conveyance_Japan
## 1
## Cayman_Islands
## 1
## Central_African_Republic
## 1
## Chad
## 1
## Chile
## 1
## China
## 1
## Colombia
## 1
## Congo
## 1
## Costa_Rica
## 1
## Cote_dIvoire
## 1
## Croatia
## 1
## Cuba
## 1
## Curaçao
## 1
## Cyprus
## 1
## Czech_Republic
## 1
## Democratic_Republic_of_the_Congo
## 1
## Denmark
## 1
## Djibouti
## 1
## Dominica
## 1
## Dominican_Republic
## 1
## Ecuador
## 1
## Egypt
## 1
## El_Salvador
## 1
## Equatorial_Guinea
## 1
## Eritrea
## 1
## Estonia
## 1
## Eswatini
## 1
## Ethiopia
## 1
## Falkland_Islands_(Malvinas)
## 1
## Faroe_Islands
## 1
## Fiji
## 1
## Finland
## 1
## France
## 2
## French_Polynesia
## 1
## Gabon
## 1
## Gambia
## 1
## Georgia
## 1
## Germany
## 1
## Ghana
## 1
## Gibraltar
## 1
## Greece
## 1
## Greenland
## 1
## Grenada
## 1
## Guam
## 1
## Guatemala
## 1
## Guernsey
## 1
## Guinea
## 1
## Guinea_Bissau
## 1
## Guyana
## 1
## Haiti
## 1
## Holy_See
## 1
## Honduras
## 1
## Hungary
## 1
## Iceland
## 1
## India
## 1
## Indonesia
## 1
## Iran
## 1
## Iraq
## 1
## Ireland
## 1
## Isle_of_Man
## 1
## Israel
## 1
## Italy
## 2
## Jamaica
## 1
## Japan
## 1
## Jersey
## 1
## Jordan
## 1
## Kazakhstan
## 1
## Kenya
## 1
## Kosovo
## 1
## Kuwait
## 1
## Kyrgyzstan
## 1
## Laos
## 1
## Latvia
## 1
## Lebanon
## 1
## Liberia
## 1
## Libya
## 1
## Liechtenstein
## 1
## Lithuania
## 1
## Luxembourg
## 1
## Madagascar
## 1
## Malawi
## 1
## Malaysia
## 1
## Maldives
## 1
## Mali
## 1
## Malta
## 1
## Mauritania
## 1
## Mauritius
## 1
## Mexico
## 1
## Moldova
## 1
## Monaco
## 1
## Mongolia
## 1
## Montenegro
## 1
## Montserrat
## 1
## Morocco
## 1
## Mozambique
## 1
## Myanmar
## 1
## Namibia
## 1
## Nepal
## 1
## Netherlands
## 1
## New_Caledonia
## 1
## New_Zealand
## 1
## Nicaragua
## 1
## Niger
## 1
## Nigeria
## 1
## North_Macedonia
## 1
## Northern_Mariana_Islands
## 1
## Norway
## 1
## Oman
## 1
## Pakistan
## 1
## Palestine
## 1
## Panama
## 1
## Papua_New_Guinea
## 1
## Paraguay
## 1
## Peru
## 1
## Philippines
## 1
## Poland
## 1
## Portugal
## 1
## Puerto_Rico
## 1
## Qatar
## 1
## Romania
## 1
## Russia
## 1
## Rwanda
## 1
## Saint_Barthelemy
## 1
## Saint_Kitts_and_Nevis
## 1
## Saint_Lucia
## 1
## Saint_Vincent_and_the_Grenadines
## 1
## San_Marino
## 1
## Saudi_Arabia
## 1
## Senegal
## 1
## Serbia
## 1
## Seychelles
## 1
## Sierra_Leone
## 1
## Singapore
## 1
## Sint_Maarten
## 1
## Slovakia
## 1
## Slovenia
## 1
## Somalia
## 1
## South_Africa
## 1
## South_Korea
## 1
## South_Sudan
## 1
## Spain
## 2
## Sri_Lanka
## 1
## Sudan
## 1
## Suriname
## 1
## Sweden
## 1
## Switzerland
## 1
## Syria
## 1
## Taiwan
## 1
## Thailand
## 1
## Timor_Leste
## 1
## Togo
## 1
## Trinidad_and_Tobago
## 1
## Tunisia
## 1
## Turkey
## 1
## Turks_and_Caicos_islands
## 1
## Uganda
## 1
## Ukraine
## 1
## United_Arab_Emirates
## 1
## United_Kingdom
## 1
## United_Republic_of_Tanzania
## 1
## United_States_of_America
## 2
## United_States_Virgin_Islands
## 1
## Uruguay
## 1
## Uzbekistan
## 1
## Venezuela
## 1
## Vietnam
## 1
## Zambia
## 1
## Zimbabwe
## 1
##
## Within cluster sum of squares by cluster:
## [1] 45.26684 66.06593
## (between_SS / total_SS = 72.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
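The components printed above are linked by a simple identity: the total sum of squares equals the within-cluster plus the between-cluster sum of squares (here roughly 111 + 297 = 408), and for standardized data the total equals (n - 1) × p, which gives 204 × 2 = 408 for 205 countries and 2 variables. The sketch below checks the same identity on the built-in `USArrests` data, used here only as a stand-in assumption so that the example runs without downloading the COVID-19 file.

```r
# Assumption: built-in USArrests data stand in for the COVID-19 table.
X <- scale(USArrests)            # 50 observations, 4 standardized variables

set.seed(123)
km <- kmeans(X, centers = 2, nstart = 25)

# Total variation splits into within-cluster and between-cluster parts
km$totss
km$tot.withinss + km$betweenss

# For standardized data the total is (n - 1) * p
(nrow(X) - 1) * ncol(X)

# Share of total variation captured by the grouping (between_SS / total_SS)
km$betweenss / km$totss
```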
# Visualization 1
fviz_cluster(clust2, data = COVID19_CS_noGeo)
# Visualization 2
COVID19_CS_noGeo %>%
as_tibble() %>%
mutate(Cluster = clust2$cluster,
State = row.names(COVID19_CS_noGeo)) %>%
ggplot(aes(Deaths, Cases, color = factor(Cluster), label = State)) +
geom_text()
# Visualize solutions with more clusters
clust3 <- kmeans(COVID19_CS_noGeo, centers = 3, nstart = 25)
clust4 <- kmeans(COVID19_CS_noGeo, centers = 4, nstart = 25)
clust5 <- kmeans(COVID19_CS_noGeo, centers = 5, nstart = 25)
p1 <- fviz_cluster(clust2, geom = "point", data = COVID19_CS_noGeo) + ggtitle("k = 2")
p2 <- fviz_cluster(clust3, geom = "point", data = COVID19_CS_noGeo) + ggtitle("k = 3")
p3 <- fviz_cluster(clust4, geom = "point", data = COVID19_CS_noGeo) + ggtitle("k = 4")
p4 <- fviz_cluster(clust5, geom = "point", data = COVID19_CS_noGeo) + ggtitle("k = 5")
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)
# Determine the optimal number of clusters (manual computation)
set.seed(123)
# within-cluster sum of square
wss <- function(k) {
kmeans(COVID19_CS_noGeo, k, nstart = 10 )$tot.withinss
}
# wss for k = 1 to k = 15
k.values <- 1:15
# wss for 1-15 clusters
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
# Using the built-in helper function
fviz_nbclust(COVID19_CS_noGeo, kmeans, method = "wss")
# Silhouette
avg_sil <- function(k) {
km.res <- kmeans(COVID19_CS_noGeo, centers = k, nstart = 25)
ss <- silhouette(km.res$cluster, dist(COVID19_CS_noGeo))
mean(ss[, 3])
}
# Compute and plot the average silhouette for k = 2 to k = 15
k.values <- 2:15
# extract avg silhouette for 2-15 clusters
avg_sil_values <- map_dbl(k.values, avg_sil)
plot(k.values, avg_sil_values,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Average Silhouettes")
# Using the built-in helper function
fviz_nbclust(COVID19_CS_noGeo, kmeans, method = "silhouette")
# Compute the gap statistic
set.seed(123)
gap_stat <- clusGap(COVID19_CS_noGeo, FUN = kmeans, nstart = 25,
K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
# FINAL SOLUTION
CS_finalno <- kmeans(COVID19_CS_noGeo, 3, nstart = 25)
#print(CS_finalno)
# Visualization
fviz_cluster(CS_finalno, data = COVID19_CS_noGeo)
# Descriptive statistics by cluster
COVID19_CS %>%
dplyr::select(-Country) %>%
mutate(Cluster = CS_finalno$cluster) %>%
group_by(Cluster) %>%
summarise_all(c("mean", "sd")) %>%
round(1)
## # A tibble: 3 x 5
## Cluster Deaths_mean Cases_mean Deaths_sd Cases_sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 167 144 53.2 110.
## 2 2 -0.9 -0.4 7.9 9.9
## 3 3 -10 -9.8 1.2 1.9