Assignment 4 Shafeeka Rahama

published at: https://rpubs.com/shafeeka/Assignment_4

Introduction

The aim of my visualization was to reveal the demographic structure of Singapore population by age cohort and by planning area in 2019. Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data (retrieved from the Singapore Department of Statistics)

1.1 Challenges & Solutions

Challenge #1: A design challenge I faced was that the number of variables given under Age Cohort and Planning Area was too large. I was concerned that this will lead to visualizations which are too dense with information, making it overwhelming and hard for readers to carry out clear comparisons. For example, I wanted to plot a stacked bar chart of Age Cohort against Planning Area. However, in such a visualization, it is not easy to compare the length of a bar chart across 19 different age cohorts for each area, even when they are differentiated by colors.

Solution #1: The easiest way to solve this problem will be to group the various variables in Age Cohort and Planning Area into general categories. For example, planning area can be grouped by Regions while Age Cohort can be grouped by {Young, Active, Old}. By reducing the granularity of data in these columns, the visualization created can still potray general trends and thus still satisfy our aim.

Challenge #2: Upon initial loading of the dataset and changing it from a long to wide format, the ordering of certain Age Cohort are wrong. If a visualization is built based on the default ordering of Age Cohort, a misleading plot may be created.

Solution #2: Upon processing of the dataset and prior to creating our plots, reorder the age groups such that they are in ascending order of age.

Challenge #3: Due to the structure of the dataset, it is not flexible enough to be able to plot different types of visualization without having to modify the data structure each time. The dataset is currently split by Sub Zones, Gender, Type of Dwelling etc. Different rows need to be combined or filtered in order to have the right input for visualizations to depict demographic based on Age Cohort and Planning Area.

Solution #3: Instead of undergoing vigorous pre-processing to include all different types of input for visualizations, a solution would be to process the dataset prior to each plot. This also helps to clearly explain how the plots were created.

Challenge #4: A challenge faced was how the demographic structure of Singapore can be analyzed by both Age Cohort and Planning Area to effectively extracting underlying trends.

Solution #4: To tackle this problem, I sketched some possible designs. Below are my proposed sketched design.

Step-by-Step Description of how visualization was prepared

2.1 Load necessary Packages The following code will iterate through the specified packages, installing the packages that are not present.

packages= c('tidyverse', 'ggtern', 'plotly', 'tibble', 'ggplot2', 'ggthemes' ,'data.table', 'reshape2', 'DT')
for(p in packages){
  if(!require(p,character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

## Loading required package: tidyverse

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## Loading required package: ggtern

## Registered S3 methods overwritten by 'ggtern':
##   method           from   
##   grid.draw.ggplot ggplot2
##   plot.ggplot      ggplot2
##   print.ggplot     ggplot2

## --
## Remember to cite, run citation(package = 'ggtern') for further info.
## --

## 
## Attaching package: 'ggtern'

## The following objects are masked from 'package:ggplot2':
## 
##     aes, annotate, ggplot, ggplot_build, ggplot_gtable, ggplotGrob,
##     ggsave, layer_data, theme_bw, theme_classic, theme_dark,
##     theme_gray, theme_light, theme_linedraw, theme_minimal, theme_void

## Loading required package: plotly

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

## Loading required package: ggthemes

## Loading required package: data.table

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

## Loading required package: reshape2

## 
## Attaching package: 'reshape2'

## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

## The following object is masked from 'package:tidyr':
## 
##     smiths

## Loading required package: DT

2.2 Load CSV file Load the data using the function, read_csv(). The file path of the dataset is relative to where the markdown file is saved.

population = read_csv("respopagesextod2011to2020.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   PA = col_character(),
##   SZ = col_character(),
##   AG = col_character(),
##   Sex = col_character(),
##   TOD = col_character(),
##   Pop = col_double(),
##   Time = col_double()
## )

str(population)

## tibble [984,656 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ PA  : chr [1:984656] "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" ...
##  $ SZ  : chr [1:984656] "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" ...
##  $ AG  : chr [1:984656] "0_to_4" "0_to_4" "0_to_4" "0_to_4" ...
##  $ Sex : chr [1:984656] "Males" "Males" "Males" "Males" ...
##  $ TOD : chr [1:984656] "HDB 1- and 2-Room Flats" "HDB 3-Room Flats" "HDB 4-Room Flats" "HDB 5-Room and Executive Flats" ...
##  $ Pop : num [1:984656] 0 10 30 50 0 0 40 0 0 10 ...
##  $ Time: num [1:984656] 2011 2011 2011 2011 2011 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   PA = col_character(),
##   ..   SZ = col_character(),
##   ..   AG = col_character(),
##   ..   Sex = col_character(),
##   ..   TOD = col_character(),
##   ..   Pop = col_double(),
##   ..   Time = col_double()
##   .. )

2.3 Data pre-processing for Ternary Plot

2.3.1 Transform Long table to Wide Table using spread().

population_ternary = population %>%
  mutate(`Year` = as.character(Time))%>%
  spread(AG, Pop)

2.3.2 View how the dataset looks like to get an understanding of the columns and their values that we are going to work with. Here we can see that there are now 25 columns. Furthermore, the category “5_to_9” is not in the right order.

colnames(population_ternary)

##  [1] "PA"          "SZ"          "Sex"         "TOD"         "Time"       
##  [6] "Year"        "0_to_4"      "10_to_14"    "15_to_19"    "20_to_24"   
## [11] "25_to_29"    "30_to_34"    "35_to_39"    "40_to_44"    "45_to_49"   
## [16] "5_to_9"      "50_to_54"    "55_to_59"    "60_to_64"    "65_to_69"   
## [21] "70_to_74"    "75_to_79"    "80_to_84"    "85_to_89"    "90_and_over"

DT::datatable(
  head(population_ternary), extensions = 'FixedColumns',
  options = 
    list(dom = 't',
         columnDefs = list(list(width = '100px', targets = c(1, ncol(population_ternary)))),
         scrollX = TRUE,
         scrollCollapse = TRUE)
)

2.3.3 To solve the earlier problem, we are going to reorder the columns manually. Once that is done, we are doing to prepare the 3 variables needed to plot the Ternary graph: Young, Active and Old. Young would be those who are 0 - 24 years old, Active would be 25 - 64 years old and Old would be 65 years old and above.

We also filter the year to only focus on 2019.

population_ternary = population_ternary[, c(1, 2, 3, 4, 5, 6, 7, 16, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25)]

#deriving the young, active, and old measures. then only use 2019 data
  population_ternary = population_ternary %>%
  mutate(Young = rowSums(.[7:11]))%>%
  mutate(Active = rowSums(.[12:19]))  %>%
  mutate(Old = rowSums(.[20:25])) %>%
  mutate(Total = rowSums(.[7:25])) %>%
  filter(Year == 2019)%>%
  filter(Total > 0)

2.3.4 Plot the Ternary graph, setting variable x, y and z to young, active and old respectively. This is our first vizualisation.

ggtern(data = population_ternary,aes(x=Young,y=Active, z=Old)) +
  geom_point()+
  labs(title="Demographic Structure by Age Group (2019)") +
  theme_rgbw()

2.4 Data Pre-processing for Population Pyramid Plot

2.4.1 We are making reference to the variable which stores the dataset we loaded in steps 2.2. For the following vizualisations, we will just be focusing on 2019. Hence, I am applying the filter on the original variable.

population = population[population$Time == 2019,]

2.4.2 Separate the two genders

male = filter(population, Sex == 'Males')
female = filter(population, Sex == 'Females')

2.4.3 Change the population of female to negative values so that the population of both males and females can be plotted on the same x-axis. AF=fter this, join the male and female data.

female$Pop = -female$Pop
population_pyramid = rbind(male, female)

2.4.4 Now our data is ready and we can move to plotting the population pyramid. The output is our final visualization for population pyramid.

population_pyramid <- ggplot(population_pyramid, 
                             aes(x = AG, 
                                 y = Pop, 
                                 fill = Sex)
                             ) +
                      geom_bar(data = subset(population_pyramid,
                                             Sex == "Females"), 
                               stat = "identity") +
                      geom_bar(data = subset(population_pyramid,
                                             Sex == "Males"), 
                               stat = "identity") +
                      ggtitle("Population Pyramid of Singapore Residents by Age Group in 2019") +
                      xlab("Age group") +
                      ylab("Population (Thousands)") +
                      scale_y_continuous(breaks = seq(-150000, 150000, 50000), 
                                         labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))))
                                         ) + 
                      coord_flip() +
                      scale_fill_manual(values = c('lightpink2','steelblue3'))

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

population_pyramid

2.5 Data Pre-processing for stacked bar chart by Planning Area

2.5.1 First part of data pre-processing will including adding to new column variables: age_groups (age cohort grouped by Young, Active and Old) and Region (Planning Area grouped into North, North East, East, West and Central). The reason for creating this columsn have been explained under Challenges and SOlutions (Challenge #1)

population$age_groups <- with(population, dplyr::case_when(AG %in% c("0_to_4",
                                                     "5_to_9",
                                                     "10_to_14",
                                                     "15_to_19",
                                                     "20_to_24"
                                                     ) ~ 'Young',
                                           AG %in% c("25_to_29",
                                                     "30_to_34",
                                                     "35_to_39",
                                                     "40_to_44",
                                                     "45_to_49",
                                                     "50_to_54",
                                                     "55_to_59",
                                                     "60_to_64"
                                                     ) ~ 'Active',
                                           TRUE ~ 'Old'))

population$Region <- NA
population$Region <- with(population, dplyr::case_when(PA %in% c('Central Water Catchment','Lim Chu Kang','Mandai','Sembawang',
                                                 'Simpang','Sungei Kadult','Woodlands','Yishun') ~'North',
                                       PA %in% c('Ang Mo Kio','Hougang','North-Eastern Islands','Punggol',
                                                 'Seletar','Sengkang','Serangoon')~'N.East',
                                       PA %in% c('Bedok','Changi','Changi Bay', 'Paya Lebar','Pasir Ris', 
                                                 'Tampines') ~ 'East',
                                       PA %in% c('Boon Lay','Bukit Batok','Bukit Panjang','Choo Chu Kang', 
                                                 'Clementi','Jurong East','Jurong West','Pioneer', 
                                                 'Tengah','Tuas','Western Islands', 
                                                 'Western Water Catchment') ~ 'West',
                                       TRUE ~ 'Central'))

2.5.2 Using this dataset, we first plot a stacked bar graph to get a high level idea of the proportion of different age groups present in the various Regions. We convert the population count to percentage to focus on the proportion of population.

ggplot(population, aes(fill=age_groups, y=Pop, x=Region)) + 
    geom_bar(position="fill",stat="identity")+
  labs(x = "Planning Area", y = "% of Population")+
  geom_vline(xintercept=0.5)+
  coord_flip()+
  ggtitle("Proportion of Age Groups in Different Regions")+
  theme(plot.title = element_text(hjust = .5),
                axis.ticks = element_blank())+
  theme_classic()

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

3. Insights gathered

Insight 1: Ageing population Based on the population pyramid, the following trends are observed: The number of residents belonging to the ‘Young’ category is low causing the pyramid plot to have a narrow base. However, the centre portion of the pyramid chart is wide showing that the working class makes up majority of the demographic in Singapore. Thus, we can conclude that Singapore’s population is concentrated within the age group ranging from 25 to 64. The narrow base also suggests that our birth-rate is low, proving the government’s concern about ageing population.

Insight 2: Proportion of Age Groups across different Regions Based on the stacked bar chart, we can conclude that the central region houses a lower percentage of Young group but a higher percentage of those belonging to the ‘Old’ age group. At the same time, the ‘Active’ age group take up the highest portion in all regions.