2. DataViz Step-by-step

2.1. Install and Load R packages

packages = c('tidyverse', 'reshape2', 'dplyr', 'grid')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)}

2.2. Dataset Overview

After installing and loading nessary libraries, we can load the dataset, then ascertain the dataframe dimensions, variable datatypes and number of missing values.

data <- read_csv("respopagesextod2011to2019.csv")

data_description <- read_csv("Notes_respopagesextod2011to2019.csv")

print(paste('DataFrame contains ', dim(data)[1], 'rows and ', dim(data)[2], 'columns.'))

## [1] "DataFrame contains  883728 rows and  7 columns."

print(paste('There are ', sum(is.na(data)), ' missing values representing ', 
            round(100*sum(is.na(data))/(nrow(data)*ncol(data)), 5), '% of the total.'))

## [1] "There are  0  missing values representing  0 % of the total."

print('Variable names are as follows:')

## [1] "Variable names are as follows:"

print(names(data))

## [1] "PA"   "SZ"   "AG"   "Sex"  "TOD"  "Pop"  "Time"

head(data)

## # A tibble: 6 x 7
##   PA        SZ               AG    Sex   TOD                           Pop  Time
##   <chr>     <chr>            <chr> <chr> <chr>                       <dbl> <dbl>
## 1 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 1- and 2-Room Flats         0  2011
## 2 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 3-Room Flats               10  2011
## 3 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 4-Room Flats               30  2011
## 4 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 5-Room and Executive F~    50  2011
## 5 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HUDC Flats (excluding thos~     0  2011
## 6 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males Landed Properties               0  2011

head(data_description, 8)

## # A tibble: 8 x 1
##   `Notes for Singapore Residents by Planning Area/Subzone, Age Group, Sex and T~
##   <chr>                                                                         
## 1 Column Headers:                                                               
## 2 PA - Planning Area                                                            
## 3 SZ - Subzone                                                                  
## 4 AG - Age Group                                                                
## 5 Sex - Sex                                                                     
## 6 TOD - Type of Dwelling                                                        
## 7 Pop - Resident Count                                                          
## 8 Time - Time / Period

print('data source: Singapore Department of Statistics https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data')

## [1] "data source: Singapore Department of Statistics https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data"

The dataset contains 7 variables with description of column headers. By looking at the data table, we see only ‘Pop’ and ‘Time’ are numeric while the rest are character variables. In overall, the population of Singapore residence are devided into Planning areas, subzones, age group, gender and type of dwellling from 2011 to 2019. The dataframe is already close to being ready for visualisation since there are no missing value.

We can change colunms name to the full names, which is more intuitive. By checking with the age groups’ levels, we need to unify their format to order age groups in ascending sequence.

names(data) <- c("Planning_Area", "Subzone", "Age", "Gender", "Type_of_Dwelling", "Population", "Year")


print(levels(as.factor(data$Age)))

##  [1] "0_to_4"      "10_to_14"    "15_to_19"    "20_to_24"    "25_to_29"   
##  [6] "30_to_34"    "35_to_39"    "40_to_44"    "45_to_49"    "5_to_9"     
## [11] "50_to_54"    "55_to_59"    "60_to_64"    "65_to_69"    "70_to_74"   
## [16] "75_to_79"    "80_to_84"    "85_to_89"    "90_and_over"

data$Age <- str_replace(as.factor(data$Age),"5_to_9","05_to_09")
data$Age <- str_replace(as.factor(data$Age),"0_to_4","0_to_04")
print('---------------------------------------------')

## [1] "---------------------------------------------"

print('Arrange age groups in ascending order:')

## [1] "Arrange age groups in ascending order:"

print(levels(as.factor(data$Age)))

##  [1] "0_to_04"     "05_to_09"    "10_to_14"    "15_to_19"    "20_to_24"   
##  [6] "25_to_29"    "30_to_34"    "35_to_39"    "40_to_044"   "45_to_49"   
## [11] "50_to_54"    "55_to_59"    "60_to_64"    "65_to_69"    "70_to_74"   
## [16] "75_to_79"    "80_to_84"    "85_to_89"    "90_and_over"

#data$Year <- as.integer(data$Year)

2.3. Data Wrangling

2.3.1. The Trend of Population Distribution by Year

We first calcullate the population proportion of each age group in Singapore in every year, then we can plot the change of age structure in Singapore. It seems that the proportion of each age group didn’t change much from 2011 to 2019, but we can also see most lines are slightly upward or nearly steady, which means the proportion of elder groups of perple are getting larger and Singapore is slowly stepping into the aging society. We can further calculate the difference of proportion between 2011 and 2019 for each age group. From the result in the data frame, the proportion of groups aged under 54 decreased from 2011 to 2019 except the group aged from 25 to 29, while the proportion of the groups aged above 54 are growing. the age group range from 65 to 69 has the largest increase of propotion while the proportion of the age group range from 15 to 19 drop the most.

age_pop_by_year <- data %>% group_by(Age, Year) %>% summarise(Population=sum(Population))
df_by_year <- group_by(.data = age_pop_by_year, Year)
df_summarize <- mutate(.data = df_by_year, Population_percent = Population/sum(Population))

ggplot(data = df_summarize, mapping = aes(x = as.integer(Year), y = Population_percent, fill = Age)) + 
  geom_area(alpha = 0.6) +
  geom_line(colour = 'black', size = 0.2, position = 'stack', alpha = 0.6) + 
  guides(fill = guide_legend(reverse = FALSE)) + 
  theme_minimal()+ 
  ggtitle("Singapore Population Proportion of Age Groups, 2011-2019")+
            xlab("Year")+
            ylab("Population Proportion")+ 
  theme(legend.position='right') + 
  scale_x_continuous(breaks = c(seq(2011, 2019, 1)), 
                     labels = paste0(as.character(c(seq(2011, 2019, 1)))))

overall_change_of_proportion = df_summarize[df_summarize$Year == 2019,]$Population_percent - df_summarize[df_summarize$Year == 2011,]$Population_percent
print(data.frame(levels(as.factor(df_summarize$Age)), overall_change_of_proportion))

##    levels.as.factor.df_summarize.Age.. overall_change_of_proportion
## 1                              0_to_04                 -0.003702793
## 2                             05_to_09                 -0.005760264
## 3                             10_to_14                 -0.011805408
## 4                             15_to_19                 -0.013599204
## 5                             20_to_24                 -0.005466835
## 6                             25_to_29                  0.002237579
## 7                             30_to_34                 -0.005708776
## 8                             35_to_39                 -0.008767543
## 9                            40_to_044                 -0.006201840
## 10                            45_to_49                 -0.008123085
## 11                            50_to_54                 -0.006394510
## 12                            55_to_59                  0.007459401
## 13                            60_to_64                  0.014246610
## 14                            65_to_69                  0.025310339
## 15                            70_to_74                  0.010917796
## 16                            75_to_79                  0.005876446
## 17                            80_to_84                  0.004350054
## 18                            85_to_89                  0.003249161
## 19                         90_and_over                  0.001882873

Now, select the data in 2019 to have a detail look at the age structure of Singapore in 2019.

data2019 = data[data$Year == 2019,]

2.3.2. Singapore Population by Gender and Age Group, 2019

The background bar chart in grey color is the population distribution by age groups, we can see the distribution is skewed to the left that there are more than 85% of population are aged under 65. Comparing age structure of Female and Male, we see female seems to have long life expectancy since women account for a larger proportion in elder age groups than men.

pop_by_age_gender <- data2019 %>% group_by(Age, Gender) %>% summarise(Population=sum(Population))
pop_by_age <- data2019 %>% group_by(Age) %>% summarise(Population=sum(Population))

ggplot() + 
  geom_bar(data = pop_by_age, aes(x = Age, y = Population), stat = 'sum', alpha=0.3, color='grey') +
  geom_bar(data = pop_by_age_gender, aes(x = Age, y = Population, fill = Gender), stat = 'sum') +
  scale_fill_manual(values=c('lightpink2','steelblue3')) + 
  facet_grid(~Gender) +
  scale_x_discrete(labels = NULL) + 
  scale_y_continuous(breaks = c(seq(0, 400000, 50000)), labels = c(seq(0, 400, 50))) + 
  theme_minimal()+ 
  ggtitle("Singapore Population by Gender and Age Group, 2019") +
            xlab("Age groups")+
            ylab("Population in thousands")

The pyramid with a narrow base shows a Constrictive tendency of Singapore population. Lower percentage of younger people, indicating declining birth rates with each succeeding age group getting smaller than the previous one.

data2019$Population <- ifelse(data2019$Gender == "Females",-1*data2019$Population, data2019$Population)

age_cohort <- ggplot(data2019, aes(x = Age, y = Population, fill = Gender))+
  geom_bar(data = subset(data2019, Gender == "Females"), stat = "identity")+
      geom_bar(data = subset(data2019, Gender == "Males"), stat = "identity") +
  scale_y_continuous(breaks = seq(-150000, 150000, 50000), 
                     labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))))) +
  coord_flip()

age_cohort+
  ggtitle("Singapore Residents Pyramid by Age Cohort, 2019")+
            xlab("Age group")+
            ylab("Population in thousands")+
  scale_fill_manual(values=c('lightpink2','steelblue3'))+
  theme_minimal()+ 
  theme(legend.position='right')

2.3.3. Population structure by Planning Areas and Age Groups, 2019

Firstly, summarize the population by planning areas and sort planning areas by population in descending order, we can see the population amounts are greatly difference among different planning areas.

data2019 = data[data$Year == 2019,]

pop_by_pa <- data2019 %>% group_by(Planning_Area) %>% summarise(Population=sum(Population))
pa_list <- pop_by_pa[order(pop_by_pa$Population, decreasing = TRUE),]$Planning_Area

print(pop_by_pa[order(pop_by_pa$Population, decreasing = TRUE),])

## # A tibble: 55 x 2
##    Planning_Area Population
##    <chr>              <dbl>
##  1 Bedok             279970
##  2 Jurong West       265010
##  3 Tampines          257020
##  4 Woodlands         255000
##  5 Sengkang          244910
##  6 Hougang           227110
##  7 Yishun            220810
##  8 Choa Chu Kang     191100
##  9 Punggol           170920
## 10 Ang Mo Kio        164430
## # ... with 45 more rows

Then, we can subset the dataset into 2 different data frame. the first data frame is the population of different age groups in the 9 planning areas with the most population; accordingly, the second one is in the 9 planning areas with the least population(exclude the planning areas of no population). To compare the population structre of different planning areas with different population amounts, we need to calculate and transform population to the population proportion.

pop_by_age_pa <- data2019 %>% group_by(Age, Planning_Area) %>% summarise(Population=sum(Population))
df_by_pa <-  group_by(.data = pop_by_age_pa, Planning_Area)
df_summarize2 <- mutate(.data = df_by_pa, Population_percent = Population/sum(Population))

top9 <- subset(df_summarize2, Planning_Area %in% pa_list[1:9])
bottom9 <- subset(df_summarize2, Planning_Area %in% pa_list[34:42])

From the result distribution of population proportion by age groups in different planning areas, most of the 9 planning areas with the largest population have distributions similar to the overall distribution of population by age groups in Singapore. the Punggol area’s population structure is most different from the other 8 areas that has a much higher proportion of young groups people aged below 10.

ggplot(data = top9, aes(x = Age, y = Population_percent)) + 
  geom_bar(stat = "sum", alpha = 0.7) + 
  facet_wrap(~Planning_Area) +
  scale_x_discrete(labels = NULL) +
  ggtitle("Distribution of Population by Age Cohort in the Top 9 Planning Areas, 2019", subtitle='The 9 planning areas with the largest population')+
  xlab("Age group")+
  ylab("Population Proportion")

Compared to areas with largest population, areas with smallset population show more different patterns of the population distribution by age groups. Some elderly age groups are not contained in these areas like Museum, Seletar, and Western Water Catchment. The Lim Chu Kang area, which has the smallest population equals to 70 people, only have 6 age groups have population, so the distribution is greatly different from other areas.

ggplot(data = bottom9, aes(x = Age, y = Population_percent)) + 
  geom_bar(stat = "sum", alpha = 0.7) + 
  facet_wrap(~Planning_Area) +
  scale_x_discrete(labels = NULL) +
  ggtitle("Distribution of Population by Age Cohort in the Tail 9 Planning Areas, 2019", subtitle='The 9 planning areas with the smallest population') +
  xlab("Age group")+
  ylab("Population Proportion")

2.3.4. Ternary Plot of Singapore Population Structure, 2019

packages = c('ggtern')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)}

Firstly, based on the dataset of 2019, construct another data frame by using spread() function to separate population by age groups into different cohort columns.

age_pop_ternary = data2019 %>%
  mutate(Year = as.character(Year))%>%
  spread(Age, Population) 

print(names(age_pop_ternary))

##  [1] "Planning_Area"    "Subzone"          "Gender"           "Type_of_Dwelling"
##  [5] "Year"             "0_to_04"          "05_to_09"         "10_to_14"        
##  [9] "15_to_19"         "20_to_24"         "25_to_29"         "30_to_34"        
## [13] "35_to_39"         "40_to_044"        "45_to_49"         "50_to_54"        
## [17] "55_to_59"         "60_to_64"         "65_to_69"         "70_to_74"        
## [21] "75_to_79"         "80_to_84"         "85_to_89"         "90_and_over"

head(age_pop_ternary)

## # A tibble: 6 x 24
##   Planning_Area Subzone Gender Type_of_Dwelling Year  `0_to_04` `05_to_09`
##   <chr>         <chr>   <chr>  <chr>            <chr>     <dbl>      <dbl>
## 1 Ang Mo Kio    Ang Mo~ Femal~ Condominiums an~ 2019         40         60
## 2 Ang Mo Kio    Ang Mo~ Femal~ HDB 1- and 2-Ro~ 2019          0          0
## 3 Ang Mo Kio    Ang Mo~ Femal~ HDB 3-Room Flats 2019         10         10
## 4 Ang Mo Kio    Ang Mo~ Femal~ HDB 4-Room Flats 2019         10         20
## 5 Ang Mo Kio    Ang Mo~ Femal~ HDB 5-Room and ~ 2019         20         40
## 6 Ang Mo Kio    Ang Mo~ Femal~ HUDC Flats (exc~ 2019          0          0
## # ... with 17 more variables: `10_to_14` <dbl>, `15_to_19` <dbl>,
## #   `20_to_24` <dbl>, `25_to_29` <dbl>, `30_to_34` <dbl>, `35_to_39` <dbl>,
## #   `40_to_044` <dbl>, `45_to_49` <dbl>, `50_to_54` <dbl>, `55_to_59` <dbl>,
## #   `60_to_64` <dbl>, `65_to_69` <dbl>, `70_to_74` <dbl>, `75_to_79` <dbl>,
## #   `80_to_84` <dbl>, `85_to_89` <dbl>, `90_and_over` <dbl>

Then, further group the original age groups into the ‘Young’, ‘Active’, and ‘Old’ age groups. Young group is aged below 25; Active group is aged above 24 and below 65; Old group is aged above 64.

age_pop_ternary = age_pop_ternary[, c(seq(1, 24, 1))]
age_pop_ternary = age_pop_ternary %>%
  mutate(Young = rowSums(.[6:10]))%>%
  mutate(Active = rowSums(.[11:18]))  %>%
  mutate(Old = rowSums(.[19:24])) %>%
  mutate(Total = rowSums(.[6:24])) %>%
  filter(Total > 0)

Finally, we can draw the ternary plot of singapore population in 3 age dimension: ‘Young’, ‘Active’, and ‘Old’. The population of active groups always the largest population groups in different planning areas and subzones. There are also some points spread near to the lower right corner which means these areas are aging society that elder peple account for the largest proportion. After applying different colors to the points to represent the type of dwelling, we see there are some patterns between age and the housing condition. Most areas with no yound people don’t have a specific type of dwelling. People living in Condominiums or apartments are tend to be in young or active grouops that aged below 65.

ggtern(data = age_pop_ternary, aes(x = Young,y = Active, z = Old, color = Type_of_Dwelling)) +
  geom_point() + 
  ggtitle("Ternary Plot of Singapore Population Structure, 2019") +
  theme_rgbw() +
  theme(plot.title = element_text(hjust=0))

ISSS608 Visual Analytics and Applications | DataViz Makeover 4

Liu Cuiyi

July 24, 2020

1. Overview

1.1. Design Challenges

1.2. Proposed Sketched Design