Final project - Analyzing population
RPubs Link: https://rpubs.com/YifeiLiu/AnalyzingPopulation
1. Packages needed
# install.packages("leaflet")
# install.packages("stats")
# install.packages("factoextra")
library(leaflet)
library(raster)
library(factoextra)2. Get the polygons data
polygons <- getData('GADM', country='Poland', level=1)3. Prepare statistic datasets
- Population: Wiki Link (click to go)
- city number : pl gov statistic centre (click to go)
# dataset$NAME_1
voivodeship<-c(
"Dolnośląskie", "Kujawsko-Pomorskie", "Łódzkie" , "Lubelskie" , "Lubuskie" ,
"Małopolskie" , "Mazowieckie" , "Opolskie" , "Podkarpackie" , "Podlaskie" ,
"Pomorskie" , "Śląskie" , "Świętokrzyskie", "Warmińsko-Mazurskie", "Wielkopolskie",
"Zachodniopomorskie"
)
population <- c(2903000,2056000,18219,2104000,995000,3399000,5432000,966000,2098000,1182000,2355000,4501000,1216000,1423000,3486000,1694000)
city_num <- c(26,19,21,20,12,19,37,11,21,14,16,17,13,19,31,18)
dataset<-data.frame(voivodeship, population)
rownames(dataset) <- dataset$voivodeship
dataset$voivodeship <- NULL
polygons$population <- population
polygons$city_num <- city_num4. k-means
- Estimate the optimal number of clusters
- compute kmeans
- Visualize clusters to help determine
- Access to result
4.1 Estimate the optimal number of clusters
dataset.scaled <- scale(dataset)
fviz_nbclust(dataset.scaled, kmeans, method = "wss") +
geom_vline(xintercept = 2, linetype = 2) +
geom_vline(xintercept = 4, linetype = 2) +
geom_vline(xintercept = 5, linetype = 2)fviz_nbclust(dataset.scaled, kmeans, method = "silhouette") +
geom_vline(xintercept = 2, linetype = 2) +
geom_vline(xintercept = 5, linetype = 2)In “wss” plot, it represents the variance within the clusters. It decreases as k increases, but it can be seen a “elbow” at k = 2 or 4 or 5. This bend indicates that additional clusters beyond the fourth have little value.
In “silhouette” plot, we can see dataset has a obvious peak in k = 2 and k = 5.
So we can try both 2, 4 and 5 and decide at last.
4.2 compute kmeans
set.seed(123)
km.res2 <- kmeans(dataset, 2, nstart = 25)
km.res2## K-means clustering with 2 clusters of sizes 5, 11
##
## Cluster means:
## population
## 1 3944200
## 2 1464293
##
## Clustering vector:
## Dolnośląskie Kujawsko-Pomorskie Łódzkie Lubelskie
## 1 2 2 2
## Lubuskie Małopolskie Mazowieckie Opolskie
## 2 1 1 2
## Podkarpackie Podlaskie Pomorskie Śląskie
## 2 2 2 1
## Świętokrzyskie Warmińsko-Mazurskie Wielkopolskie Zachodniopomorskie
## 2 2 1 2
##
## Within cluster sum of squares by cluster:
## [1] 4.114863e+12 4.709757e+12
## (between_SS / total_SS = 70.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
km.res4 <- kmeans(dataset, 4, nstart = 25)
km.res4## K-means clustering with 4 clusters of sizes 6, 2, 5, 3
##
## Cluster means:
## population
## 1 966703.2
## 2 4966500.0
## 3 2061400.0
## 4 3262666.7
##
## Clustering vector:
## Dolnośląskie Kujawsko-Pomorskie Łódzkie Lubelskie
## 4 3 1 3
## Lubuskie Małopolskie Mazowieckie Opolskie
## 1 4 2 1
## Podkarpackie Podlaskie Pomorskie Śląskie
## 3 1 3 2
## Świętokrzyskie Warmińsko-Mazurskie Wielkopolskie Zachodniopomorskie
## 1 1 4 3
##
## Within cluster sum of squares by cluster:
## [1] 1.217132e+12 4.333805e+11 2.243672e+11 1.978247e+11
## (between_SS / total_SS = 93.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
km.res5 <- kmeans(dataset, 5, nstart = 25)
km.res5## K-means clustering with 5 clusters of sizes 5, 3, 2, 5, 1
##
## Cluster means:
## population
## 1 1156400
## 2 3262667
## 3 4966500
## 4 2061400
## 5 18219
##
## Clustering vector:
## Dolnośląskie Kujawsko-Pomorskie Łódzkie Lubelskie
## 2 4 5 4
## Lubuskie Małopolskie Mazowieckie Opolskie
## 1 2 3 1
## Podkarpackie Podlaskie Pomorskie Śląskie
## 4 1 4 3
## Świętokrzyskie Warmińsko-Mazurskie Wielkopolskie Zachodniopomorskie
## 1 1 2 4
##
## Within cluster sum of squares by cluster:
## [1] 137585200000 197824666667 433380500000 224367200000 0
## (between_SS / total_SS = 96.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Cluster means: a matrix, which rows are cluster number (1 to 4) and columns are variables Clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
So the detailed display see 4.3 and 4.4.
4.3 Visualize clusters to help determine
To help to see the result more easily, we add one more useless column.
The city_num is very small when compare to population, so it will not influence the result. The result show below also prove this.
When look plot, we only need focus the population dimension, e.g. the distance in x-axis.
vis_dataset<- cbind(dataset, city_num)4.3.1 k = 2
km.res2.vis <- kmeans(vis_dataset, 2, nstart = 25)
km.res2.vis## K-means clustering with 2 clusters of sizes 11, 5
##
## Cluster means:
## population city_num
## 1 1464293 16.72727
## 2 3944200 26.00000
##
## Clustering vector:
## Dolnośląskie Kujawsko-Pomorskie Łódzkie Lubelskie
## 2 1 1 1
## Lubuskie Małopolskie Mazowieckie Opolskie
## 1 2 2 1
## Podkarpackie Podlaskie Pomorskie Śląskie
## 1 1 1 2
## Świętokrzyskie Warmińsko-Mazurskie Wielkopolskie Zachodniopomorskie
## 1 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 4.709757e+12 4.114863e+12
## (between_SS / total_SS = 70.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(km.res2.vis, vis_dataset,
palette = c("#2E9FDF", "#00AFBB"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw())The gap between this two cluster is not too far when compare to the gap in the gap, so we can say it is not the best k value.
Comparing to the result in 4.2, the means prove that city_num doesn’t influence the result, so we can visualization by this way.
4.3.2 k = 4
## K-means clustering with 4 clusters of sizes 2, 3, 5, 6
##
## Cluster means:
## population city_num
## 1 4966500.0 27.00000
## 2 3262666.7 25.33333
## 3 2061400.0 18.80000
## 4 966703.2 15.00000
##
## Clustering vector:
## Dolnośląskie Kujawsko-Pomorskie Łódzkie Lubelskie
## 2 3 4 3
## Lubuskie Małopolskie Mazowieckie Opolskie
## 4 2 1 4
## Podkarpackie Podlaskie Pomorskie Śląskie
## 3 4 3 1
## Świętokrzyskie Warmińsko-Mazurskie Wielkopolskie Zachodniopomorskie
## 4 4 2 3
##
## Within cluster sum of squares by cluster:
## [1] 4.333805e+11 1.978247e+11 2.243672e+11 1.217132e+12
## (between_SS / total_SS = 93.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The gap for the most left point is significantly big, we can assume when k = 5, it will divide this point out as single cluster. And next step prove this assumption is true.
Comparing to the result in 4.2, the means prove that city_num doesn’t influence the result, so we can visualization by this way.
4.3.3 k = 5
## K-means clustering with 5 clusters of sizes 2, 5, 3, 5, 1
##
## Cluster means:
## population city_num
## 1 4966500 27.00000
## 2 1156400 13.80000
## 3 3262667 25.33333
## 4 2061400 18.80000
## 5 18219 21.00000
##
## Clustering vector:
## Dolnośląskie Kujawsko-Pomorskie Łódzkie Lubelskie
## 3 4 5 4
## Lubuskie Małopolskie Mazowieckie Opolskie
## 2 3 1 2
## Podkarpackie Podlaskie Pomorskie Śląskie
## 4 2 4 1
## Świętokrzyskie Warmińsko-Mazurskie Wielkopolskie Zachodniopomorskie
## 2 2 3 4
##
## Within cluster sum of squares by cluster:
## [1] 433380500200 137585200039 197824666739 224367200015 0
## (between_SS / total_SS = 96.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
It prove the assumption in 4.3.2 is correct, and we can choose k = 5 as optimization.
Comparing to the result in 4.2, the means prove that city_num doesn’t influence the result, so we can visualization by this way.
4.4 Access to result
- Display results
cbind(dataset, cluster = km.res5$cluster)## population cluster
## Dolnośląskie 2903000 2
## Kujawsko-Pomorskie 2056000 4
## Łódzkie 18219 5
## Lubelskie 2104000 4
## Lubuskie 995000 1
## Małopolskie 3399000 2
## Mazowieckie 5432000 3
## Opolskie 966000 1
## Podkarpackie 2098000 4
## Podlaskie 1182000 1
## Pomorskie 2355000 4
## Śląskie 4501000 3
## Świętokrzyskie 1216000 1
## Warmińsko-Mazurskie 1423000 1
## Wielkopolskie 3486000 2
## Zachodniopomorskie 1694000 4
- Cluster size
km.res5$size## [1] 5 3 2 5 1
- Cluster means
km.res5$centers## population
## 1 1156400
## 2 3262667
## 3 4966500
## 4 2061400
## 5 18219
5. Map (leaflet)
- add polygon
- give popup for each polygon with
- name of voivodeship
- population
- number of cities in the voivodeship
- additional
- make color depends on population
- highlight the voivodeship when it is chosen
5.1 Center of view
pl_coordinates <- c(51.93046841624075, 18.913610940888134)5.2 popup
pop = paste0(
"<b>Voivodoship: </b>", polygons$NAME_1, "<br>",
"<b>population: </b>", polygons$population, "<br>",
"<b>city quantity: </b>", polygons$city_num, "<br>"
)5.3 Color based on population
color_slope = 21302
color_popu = rgb(polygons$population / color_slope / 255, polygons$population / color_slope / 255, 0)5.4 highlight
highlight = highlightOptions(color='white',weight=2,bringToFront = TRUE)5.5 Create map
leaflet() %>%
setView(lng = pl_coordinates[2], lat = pl_coordinates[1], zoom = 6) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addPolygons(
data = polygons,
weight = 2,
opacity = 0.9,
fillColor = color_popu,
highlightOptions = highlight,
popup = pop
)