1. Overview

The objective of this assignment is to to reveal the resident population, sex ratio, married and divorced rate of Chinese population by provinces in 2018 through a series of map visualization done by tmap package. The population data is collected from National Bureau of Statistics of China in csv form. The map information is downloaded from https://data.humdata.org/dataset/china-administrative-boundaries in shapefile form.

1.1. Data description and major challenges

The population data collected from National Bureau of Statistics of China has been grouped and manipulated in Excel and little wrangling is needed in R. Data of 31 Provinces is included the exception of Hongkong, Macao and Taiwan. Although there are many columns, the columns we are going to plot are “Population” (The resident population of each province at the end of year 2018), “SexRatio” (Calculated from Sample Survey with Female=100), “Divorce”(Crude Divorce Rate % calculated using Registered Divorces), “Marriage” (Crude Marriage Rate % calculated using Registered Marriages with the population base same as Crude Divorce Rate).

For map data, several modifications need to be done. As the data we downloaded separate several Provinces like Anhui, Xinjiang into smaller parts, we need to merge these geometry and dissolve the boundary for the same Province. We are using st_union to solve this problem. However, when I try to add tm_text, it turns out the duplicated text will appear as the data is merged from several geometries. Didn’t find any good solutions, I worked around this issue by plotting a shape using st_combine and another shape using st_union together.

1.2. Sketch of proposed design

Sketch

2. Step-by-step preparation

2.1. Install and load packages

packages = c('sf', 'tmap', 'tidyverse', 'ggsci')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2.2. Import the data

Use st_read() of sf package and read_csv() function of readr package (one of the tidyverse package)

mpcn <- st_read(dsn = "chn_admbnda_adm1", 
                layer = "chn_admbnda_adm1_ocha")
popcn <- read_csv("Population.csv")

2.3. Map Data wrangling

Check the data.Though there are 34 divisions, it has 57 features.

head(mpcn)
## Simple feature collection with 6 features and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 92.78866 ymin: 23.58468 xmax: 120.4246 ymax: 42.79352
## geographic CRS: WGS 84
##                  ADM1_EN ADM1_ZH ADM1_PCODE ADM0_EN ADM0_ZH ADM0_PCODE
## 1         Anhui Province  安徽省      CN340   China    中国         CN
## 2         Anhui Province  安徽省      CN341   China    中国         CN
## 3   Beijing Municipality  北京市      CN110   China    中国         CN
## 4 Chongqing Municipality  重庆市      CN500   China    中国         CN
## 5        Fujian Province  福建省      CN350   China    中国         CN
## 6         Gansu province  甘肃省      CN620   China    中国         CN
##                         geometry
## 1 MULTIPOLYGON (((116.9352 34...
## 2 MULTIPOLYGON (((118.3409 31...
## 3 MULTIPOLYGON (((117.3147 40...
## 4 MULTIPOLYGON (((109.546 31....
## 5 MULTIPOLYGON (((117.2327 23...
## 6 MULTIPOLYGON (((104.1044 37...

Do quick map on it. Several provinces are segmentated.

qtm(mpcn)

class(mpcn)
## [1] "sf"         "data.frame"

Create combined geometry using st_combine

mpcn2 = 
  mpcn %>%
  group_by(ADM1_EN) %>%
  summarize(geometry = st_combine(geometry))
summary(mpcn2)
##    ADM1_EN                   geometry 
##  Length:34          MULTIPOLYGON :34  
##  Class :character   epsg:4326    : 0  
##  Mode  :character   +proj=long...: 0

Create combined geometry using st_union

mpcn3 = mpcn %>%
  group_by(ADM1_EN) %>%
  summarize(geometry = st_union(geometry))
summary(mpcn3)
##    ADM1_EN                   geometry 
##  Length:34          MULTIPOLYGON :22  
##  Class :character   POLYGON      :12  
##  Mode  :character   epsg:4326    : 0  
##                     +proj=long...: 0

The boundary has been dissolved by using st_union.

m2 = qtm(mpcn2)
m3 = qtm(mpcn3)
tmap_arrange(m2, m3, asp=1, ncol=2)

As we would like to join map data with population data, we’d like to create a common column to join on.Columns“ADM1_EN”, “ADM1_PCODE” and “Region” are included in a dataframe named code.

ADM1_EN = mpcn$ADM1_EN
ADM1_PCODE = mpcn$ADM1_PCODE
Region = popcn$Region
code = data.frame('ADM1_EN' = ADM1_EN, 'ADM1_PCODE' = ADM1_PCODE)
code = code[!duplicated(code[,c("ADM1_EN")]), ]
code2 = popcn[,0:2]
code = left_join(code, code2, by = c('ADM1_PCODE' = "ADM1_PCODE"))

Append code to map data.

mpcn2 = left_join(mpcn2, code, 
                  by = c("ADM1_EN" = "ADM1_EN"))
mpcn3 = left_join(mpcn3, code, 
                  by = c("ADM1_EN" = "ADM1_EN"))

Join with population data.

mpcn_popcn <- left_join(mpcn3, popcn, 
                              by = c("ADM1_PCODE" = "ADM1_PCODE"))
mpcn_popcn$Divorce = mpcn_popcn$Divorce/100
mpcn_popcn$Marriage = mpcn_popcn$Marriage/1000

2.4. Plot graph

#tmaptools::palette_explorer()

2.4.1 Population Map

Plot the resident population for each Province.

g1 = tm_shape(mpcn_popcn) +
  tm_fill('Population',palette = "Blues") +
  tm_borders()+
  tm_layout(title = "Distribution of Resident Population \n by Province in year 2018",
            title.position =c("center", "top"),
            title.size = 1.2,
            legend.outside = F,
            legend.text.size = 0.7,
            #panel.show = T,
            frame = FALSE) +
  tm_shape(mpcn2) +
    tm_fill(alpha = 0) +
    tm_borders(alpha = 0) +
    tm_text("Region", size = 0.4, remove.overlap = T) +
    tm_layout(frame = FALSE) +
    tm_credits("Source: Population data from National Bureau of Statistics of China", 
               size = 0.5, align = "left")

g1

2.4.2 Sex Ratio Map

g2 = tm_shape(mpcn_popcn) +
  tm_fill('SexRatio',palette = "BuGn") +
  tm_borders()+
  tm_layout(title = "Distribution of Sex Ratio \nby Province in year 2018",
            title.position = c("center","top"),
            title.size = 1.2,
            legend.outside = F,
            frame = FALSE) +
  tm_shape(mpcn2) +
    tm_fill(alpha = 0) +
    tm_borders(alpha = 0) +
    tm_text("Region", size = 0.4, remove.overlap = T) +
    tm_layout(frame = FALSE) +
    tm_credits("Source: Population data from National Bureau of Statistics of China", 
               size = 0.5, align = "left")

g2

Higlight area with high Sex ratio

highratio = mpcn_popcn %>%
  filter(mpcn_popcn$SexRatio > 107)
mapoutline <- st_union(mpcn)
sexRatioMap = tm_shape(mpcn_popcn) +
                tm_fill("white") +
                tm_borders("grey", lwd = 0.5) +
              tm_shape(mapoutline) +
                tm_borders(lwd = 2) +
              tm_shape(highratio) +
                tm_fill("turquoise3") +
                tm_borders() +
              tm_layout(frame = T, title = "Provinces with Sex Ratio higher than 107 in 2018 ", 
                        title.size = 0.9, title.position = c(0.02, "top")) +
              tm_credits("Source: Population data from National Bureau of Statistics of China", 
                        size = 0.5, align = "right")

sexRatioMap

2.4.3 Divorce Rate Map

g3 = tm_shape(mpcn_popcn) +
  tm_fill('Divorce',palette = "GnBu") +
  tm_borders()+
  tm_layout(title = "Distribution of Divorce Rate by \n Province in year 2018",
            title.position = c("center","top"),
            title.size = 1.2,
            legend.outside = F,
            frame = FALSE) +
  tm_shape(mpcn2) +
    tm_fill(alpha = 0) +
    tm_borders(alpha = 0) +
    tm_text("Region", size = 0.4, remove.overlap = T) +
    tm_layout(frame = FALSE) +
    tm_credits("Source: Population data from National Bureau of Statistics of China", 
               size = 0.5, align = "left")

g3

highDivorce = mpcn_popcn %>%
  filter(mpcn_popcn$Divorce > 0.04) 
lowDivorce =  mpcn_popcn %>%
  filter(mpcn_popcn$Divorce < 0.02)
Divorce = rbind(highDivorce, lowDivorce)
divorceMap = tm_shape(mpcn_popcn) +
                tm_fill("white") +
                tm_borders("grey", lwd = 0.5) +
              tm_shape(mapoutline) +
                tm_borders(lwd = 2) +
              tm_shape(highDivorce) +
                tm_fill("skyblue") +
                tm_borders() +
              tm_shape(lowDivorce) +
                tm_fill("pink") +
                tm_borders() +
              tm_layout(frame = T, title = "Provinces with Divorce rate (>4% & <2%) in 2018 ", 
                        title.size = 0.9, title.position = c(0.02, "top")) +
              tm_credits("Source: Population data from National Bureau of Statistics of China", 
                        size = 0.5, align = "right")

Pink: Divorce rate < 2%; Blue: Divorce rate > 4%

divorceMap

2.4.4 Marriage Rate Map

g4 = tm_shape(mpcn_popcn) +
  tm_fill('Marriage',palette = "PuBu") +
  tm_borders()+
  tm_layout(title = "Distribution of Marriage Rate \n by Province in year 2018",
            title.position = c("center","top"),
            title.size = 1.2,
            legend.outside = F,
            frame = FALSE) +
  tm_shape(mpcn2) +
    tm_fill(alpha = 0) +
    tm_borders(alpha = 0) +
    tm_text("Region", size = 0.4, remove.overlap = T) +
    tm_layout(frame = FALSE) +
    tm_credits("Source: Population data from National Bureau of Statistics of China", 
               size = 0.5, align = "left")

g4

highMarriage = mpcn_popcn %>%
  filter(mpcn_popcn$Marriage > 0.08) 
lowMarriage =  mpcn_popcn %>%
  filter(mpcn_popcn$Marriage < 0.06)
marriageMap = tm_shape(mpcn_popcn) +
                tm_fill("white") +
                tm_borders("grey", lwd = 0.5) +
              tm_shape(mapoutline) +
                tm_borders(lwd = 2) +
              tm_shape(highMarriage) +
                tm_fill("pink") +
                tm_borders() +
              tm_shape(lowMarriage) +
                tm_fill("skyblue") +
                tm_borders() +
              tm_layout(frame = T, title = "Provinces with Marriage rate (>8% & <6%) in 2018 ", 
                        title.size = 0.9, title.position = c(0.02, "top")) +
              tm_credits("Source: Population data from National Bureau of Statistics of China", 
                        size = 0.5, align = "right")

Pink: marriage rate >8%; Blue: marriage rate <6%

marriageMap

3. Final visualisation and insights

3.1 Arrange maps together

tmap_arrange(g1, sexRatioMap, divorceMap, marriageMap)

3.2 Interactive map

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(mpcn_popcn) +
  tm_fill('ADM1_EN',palette = "Set3",
          popup.vars=c("Population"="Population","Sex Ratio" = "SexRatio", 
                       "Divorce Rate" = "Divorce", "Marriage rate" = "Marriage"),
          legend.show = F) +
  tm_borders(alpha = 0.5)
## Warning: Number of levels of the variable "ADM1_EN" is 34, which is
## larger than max.categories (which is 30), so levels are combined. Set
## tmap_options(max.categories = 34) in the layer function to show all levels.
tmap_mode("plot")
tmap_style("white")

3.3 Insights

As can be seen from the four graphs, Guangdong and Shandong are most populated provinces in China, while Qinghai, Tibet and Ningxia are less densly populated. Most coastal areas in the south have high sex ratio, which may due to cultural reasons. Though surprisingly, Tianjin and Qinghai also have high sex ratio. Heilongjiang, Jilin, Chonqing and Guizhou have higher divorce rate while big cities like Beijing and Shanghai have moderately high divorce rate. Shandong and Zhejiang have lower marriage rate while Guizhou has the highest marriage rate of 11% in 2018. Details of the map such as layout and style can be improved.Future work can include time series plots and gender differences to better illustrate the situation.