Final project - Analyzing population

RPubs Link: https://rpubs.com/YifeiLiu/AnalyzingPopulation

1. Packages needed

# install.packages("leaflet")
# install.packages("stats")
# install.packages("factoextra")
library(leaflet)
library(raster)
library(factoextra)

2. Get the polygons data

polygons <- getData('GADM', country='Poland', level=1)

3. Prepare statistic datasets

  1. Population: Wiki Link (click to go)
  2. city number : pl gov statistic centre (click to go)
# dataset$NAME_1

voivodeship<-c(
  "Dolnośląskie", "Kujawsko-Pomorskie", "Łódzkie"       , "Lubelskie"          , "Lubuskie"     ,      
  "Małopolskie" , "Mazowieckie"       , "Opolskie"      , "Podkarpackie"       , "Podlaskie"    ,      
  "Pomorskie"   , "Śląskie"           , "Świętokrzyskie", "Warmińsko-Mazurskie", "Wielkopolskie",
  "Zachodniopomorskie"
)

population <- c(2903000,2056000,18219,2104000,995000,3399000,5432000,966000,2098000,1182000,2355000,4501000,1216000,1423000,3486000,1694000)
city_num <- c(26,19,21,20,12,19,37,11,21,14,16,17,13,19,31,18)

dataset<-data.frame(voivodeship, population)
rownames(dataset) <- dataset$voivodeship
dataset$voivodeship <- NULL

polygons$population <- population
polygons$city_num <- city_num

4. k-means

  1. Estimate the optimal number of clusters
  2. compute kmeans
  3. Visualize clusters to help determine
  4. Access to result

4.1 Estimate the optimal number of clusters

dataset.scaled <- scale(dataset)
fviz_nbclust(dataset.scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 2, linetype = 2) +
  geom_vline(xintercept = 4, linetype = 2) +
  geom_vline(xintercept = 5, linetype = 2)

fviz_nbclust(dataset.scaled, kmeans, method = "silhouette") +
  geom_vline(xintercept = 2, linetype = 2) +
  geom_vline(xintercept = 5, linetype = 2)

In “wss” plot, it represents the variance within the clusters. It decreases as k increases, but it can be seen a “elbow” at k = 2 or 4 or 5. This bend indicates that additional clusters beyond the fourth have little value.

In “silhouette” plot, we can see dataset has a obvious peak in k = 2 and k = 5.

So we can try both 2, 4 and 5 and decide at last.

4.2 compute kmeans

set.seed(123)
km.res2 <- kmeans(dataset, 2, nstart = 25)
km.res2
## K-means clustering with 2 clusters of sizes 5, 11
## 
## Cluster means:
##   population
## 1    3944200
## 2    1464293
## 
## Clustering vector:
##        Dolnośląskie  Kujawsko-Pomorskie             Łódzkie           Lubelskie 
##                   1                   2                   2                   2 
##            Lubuskie         Małopolskie         Mazowieckie            Opolskie 
##                   2                   1                   1                   2 
##        Podkarpackie           Podlaskie           Pomorskie             Śląskie 
##                   2                   2                   2                   1 
##      Świętokrzyskie Warmińsko-Mazurskie       Wielkopolskie  Zachodniopomorskie 
##                   2                   2                   1                   2 
## 
## Within cluster sum of squares by cluster:
## [1] 4.114863e+12 4.709757e+12
##  (between_SS / total_SS =  70.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
km.res4 <- kmeans(dataset, 4, nstart = 25)
km.res4
## K-means clustering with 4 clusters of sizes 6, 2, 5, 3
## 
## Cluster means:
##   population
## 1   966703.2
## 2  4966500.0
## 3  2061400.0
## 4  3262666.7
## 
## Clustering vector:
##        Dolnośląskie  Kujawsko-Pomorskie             Łódzkie           Lubelskie 
##                   4                   3                   1                   3 
##            Lubuskie         Małopolskie         Mazowieckie            Opolskie 
##                   1                   4                   2                   1 
##        Podkarpackie           Podlaskie           Pomorskie             Śląskie 
##                   3                   1                   3                   2 
##      Świętokrzyskie Warmińsko-Mazurskie       Wielkopolskie  Zachodniopomorskie 
##                   1                   1                   4                   3 
## 
## Within cluster sum of squares by cluster:
## [1] 1.217132e+12 4.333805e+11 2.243672e+11 1.978247e+11
##  (between_SS / total_SS =  93.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
km.res5 <- kmeans(dataset, 5, nstart = 25)
km.res5
## K-means clustering with 5 clusters of sizes 5, 3, 2, 5, 1
## 
## Cluster means:
##   population
## 1    1156400
## 2    3262667
## 3    4966500
## 4    2061400
## 5      18219
## 
## Clustering vector:
##        Dolnośląskie  Kujawsko-Pomorskie             Łódzkie           Lubelskie 
##                   2                   4                   5                   4 
##            Lubuskie         Małopolskie         Mazowieckie            Opolskie 
##                   1                   2                   3                   1 
##        Podkarpackie           Podlaskie           Pomorskie             Śląskie 
##                   4                   1                   4                   3 
##      Świętokrzyskie Warmińsko-Mazurskie       Wielkopolskie  Zachodniopomorskie 
##                   1                   1                   2                   4 
## 
## Within cluster sum of squares by cluster:
## [1] 137585200000 197824666667 433380500000 224367200000            0
##  (between_SS / total_SS =  96.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Cluster means: a matrix, which rows are cluster number (1 to 4) and columns are variables Clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated

So the detailed display see 4.3 and 4.4.

4.3 Visualize clusters to help determine

To help to see the result more easily, we add one more useless column.

The city_num is very small when compare to population, so it will not influence the result. The result show below also prove this.

When look plot, we only need focus the population dimension, e.g. the distance in x-axis.

vis_dataset<- cbind(dataset, city_num)

4.3.1 k = 2

km.res2.vis <- kmeans(vis_dataset, 2, nstart = 25)
km.res2.vis
## K-means clustering with 2 clusters of sizes 11, 5
## 
## Cluster means:
##   population city_num
## 1    1464293 16.72727
## 2    3944200 26.00000
## 
## Clustering vector:
##        Dolnośląskie  Kujawsko-Pomorskie             Łódzkie           Lubelskie 
##                   2                   1                   1                   1 
##            Lubuskie         Małopolskie         Mazowieckie            Opolskie 
##                   1                   2                   2                   1 
##        Podkarpackie           Podlaskie           Pomorskie             Śląskie 
##                   1                   1                   1                   2 
##      Świętokrzyskie Warmińsko-Mazurskie       Wielkopolskie  Zachodniopomorskie 
##                   1                   1                   2                   1 
## 
## Within cluster sum of squares by cluster:
## [1] 4.709757e+12 4.114863e+12
##  (between_SS / total_SS =  70.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
fviz_cluster(km.res2.vis, vis_dataset,
             palette = c("#2E9FDF", "#00AFBB"), 
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_bw())

  1. The gap between this two cluster is not too far when compare to the gap in the gap, so we can say it is not the best k value.

  2. Comparing to the result in 4.2, the means prove that city_num doesn’t influence the result, so we can visualization by this way.

4.3.2 k = 4

## K-means clustering with 4 clusters of sizes 2, 3, 5, 6
## 
## Cluster means:
##   population city_num
## 1  4966500.0 27.00000
## 2  3262666.7 25.33333
## 3  2061400.0 18.80000
## 4   966703.2 15.00000
## 
## Clustering vector:
##        Dolnośląskie  Kujawsko-Pomorskie             Łódzkie           Lubelskie 
##                   2                   3                   4                   3 
##            Lubuskie         Małopolskie         Mazowieckie            Opolskie 
##                   4                   2                   1                   4 
##        Podkarpackie           Podlaskie           Pomorskie             Śląskie 
##                   3                   4                   3                   1 
##      Świętokrzyskie Warmińsko-Mazurskie       Wielkopolskie  Zachodniopomorskie 
##                   4                   4                   2                   3 
## 
## Within cluster sum of squares by cluster:
## [1] 4.333805e+11 1.978247e+11 2.243672e+11 1.217132e+12
##  (between_SS / total_SS =  93.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

  1. The gap for the most left point is significantly big, we can assume when k = 5, it will divide this point out as single cluster. And next step prove this assumption is true.

  2. Comparing to the result in 4.2, the means prove that city_num doesn’t influence the result, so we can visualization by this way.

4.3.3 k = 5

## K-means clustering with 5 clusters of sizes 2, 5, 3, 5, 1
## 
## Cluster means:
##   population city_num
## 1    4966500 27.00000
## 2    1156400 13.80000
## 3    3262667 25.33333
## 4    2061400 18.80000
## 5      18219 21.00000
## 
## Clustering vector:
##        Dolnośląskie  Kujawsko-Pomorskie             Łódzkie           Lubelskie 
##                   3                   4                   5                   4 
##            Lubuskie         Małopolskie         Mazowieckie            Opolskie 
##                   2                   3                   1                   2 
##        Podkarpackie           Podlaskie           Pomorskie             Śląskie 
##                   4                   2                   4                   1 
##      Świętokrzyskie Warmińsko-Mazurskie       Wielkopolskie  Zachodniopomorskie 
##                   2                   2                   3                   4 
## 
## Within cluster sum of squares by cluster:
## [1] 433380500200 137585200039 197824666739 224367200015            0
##  (between_SS / total_SS =  96.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

  1. It prove the assumption in 4.3.2 is correct, and we can choose k = 5 as optimization.

  2. Comparing to the result in 4.2, the means prove that city_num doesn’t influence the result, so we can visualization by this way.

4.4 Access to result

  • Display results
cbind(dataset, cluster = km.res5$cluster)
##                     population cluster
## Dolnośląskie           2903000       2
## Kujawsko-Pomorskie     2056000       4
## Łódzkie                  18219       5
## Lubelskie              2104000       4
## Lubuskie                995000       1
## Małopolskie            3399000       2
## Mazowieckie            5432000       3
## Opolskie                966000       1
## Podkarpackie           2098000       4
## Podlaskie              1182000       1
## Pomorskie              2355000       4
## Śląskie                4501000       3
## Świętokrzyskie         1216000       1
## Warmińsko-Mazurskie    1423000       1
## Wielkopolskie          3486000       2
## Zachodniopomorskie     1694000       4
  • Cluster size
km.res5$size
## [1] 5 3 2 5 1
  • Cluster means
km.res5$centers
##   population
## 1    1156400
## 2    3262667
## 3    4966500
## 4    2061400
## 5      18219

5. Map (leaflet)

  1. add polygon
  2. give popup for each polygon with
  • name of voivodeship
  • population
  • number of cities in the voivodeship
  1. additional
  • make color depends on population
  • highlight the voivodeship when it is chosen

5.1 Center of view

pl_coordinates <- c(51.93046841624075, 18.913610940888134)

5.3 Color based on population

color_slope = 21302

color_popu = rgb(polygons$population / color_slope / 255, polygons$population / color_slope / 255, 0)

5.4 highlight

highlight = highlightOptions(color='white',weight=2,bringToFront = TRUE)

5.5 Create map

leaflet() %>%
  setView(lng = pl_coordinates[2], lat = pl_coordinates[1], zoom = 6) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  addPolygons(
    data = polygons, 
    weight = 2, 
    opacity = 0.9, 
    fillColor = color_popu, 
    highlightOptions = highlight,
    popup = pop
  )