IS428 Visual Analytics and Applications - Assignment 4
The data was collected from the Singapore Department of Statistics. It shows the distribution of the resident population (rounded to the nearest 10) across the different areas of Singapore for each year from 2011 to 2020. The distribution is broken down by year, age group, gender and type of dwelling in each planning area. The data consists of the following fields:
It would be interesting to look at how the population of the different age groups is distributed across the planning areas and the types of dwelling that residents live in.
There are a few data and design challenges when doing the visualisation:
As the csv file holds too much data for the period 2011 to 2020, we will only use the 2019 data to visualise the demographic structure of Singapore’s population.
Although there are a lot of zeros in the resident count (Pop column), I decided to retain them rather than remove them, as they could be an insight in themselves.
The age cohort (‘AG’ column) is stored as a string (e.g. 0_to_4, 5_to_9, 10_to_14 etc.). This is a challenge because a bar chart based on the age cohort will not be in ascending order of age. The bars will be ordered as follows:
0_to_4, 10_to_14, 15_to_19, …, 45_to_49, 5_to_9, 50_to_54, …
This happens because the age cohorts are sorted alphabetically as text, so the age cohort 5_to_9 only appears in the later part of the bar chart.
There is a total of 19 age cohorts, which is too many categories for readers to look at. This could confuse readers, and it would take them longer to decipher the visualisation due to information overload.
There are too many planning areas to show in the visualisation. Since we are supposed to do a static visualisation, it would be easier for readers to understand the demographics of the different planning areas by region instead.
The y-axis/x-axis in the visualisation kept showing scientific notation (e.g. 1e+00 in ggplot2) when plotting the population count. This will be misleading for viewers who do not understand the notation. The scales library is used to solve this problem.
The x-axis labels overlap with each other when the graphs are too close to each other.
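The age cohort ordering problem described above can be reproduced with a few labels. This is a minimal base R sketch, not the code used for the charts: the `cohorts` vector is a hypothetical subset of the ‘AG’ values. The exact position of later cohorts can vary with the system locale, but 10_to_14 sorts ahead of 5_to_9 as text in any case. Converting the labels to a factor with explicit levels restores the age order.

```r
# Hypothetical subset of the 'AG' labels, listed in true age order
cohorts <- c("0_to_4", "5_to_9", "10_to_14", "15_to_19")

# Sorting as plain text compares character by character,
# so every label starting with "1" comes before "5_to_9"
text_order <- sort(cohorts)
print(text_order)

# A factor with explicit levels keeps the intended age order
age_factor <- factor(cohorts, levels = cohorts)
factor_order <- as.character(sort(age_factor))
print(factor_order)
```

The same idea is applied to the full list of cohorts with `factor(AG, levels = ...)` before plotting.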
Proposed sketched design
Source from Singapore Department of Statistics
Step 1: Install and Load package
tidyverse contains a set of essential packages for data manipulation and exploration.
plotly is used to get findings from the data using a pie chart.
scales is used to solve the problem of the y-axis showing scientific notation such as 1e+00 when it is supposed to show plain numbers.
grid is used to create the combined visualisation.
Important note: the ggplot code requires ggplot2 version 3.3.0 to work, so ensure that you have the latest version installed. ggplot2 is part of the tidyverse, an ecosystem of packages designed with common APIs and a shared philosophy.
packages <- c("tidyverse", "plotly", "scales", "grid")
for (p in packages) {
  # install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
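The axis label problem that scales is loaded for can be illustrated without any plot. The sketch below uses only base R `format()` on a made-up population count (`pop` is a hypothetical value, not taken from the data). It shows the difference between a label in scientific notation and a plain label with thousands separators, which is roughly what `scale_y_continuous(labels = comma)` produces on the chart axes.

```r
# Hypothetical population count used only for illustration
pop <- 100000

# Label in scientific notation, similar to what the axis showed by default
sci_label <- format(pop, scientific = TRUE)
print(sci_label)

# Plain label with a thousands separator, easier for readers to interpret
plain_label <- format(pop, big.mark = ",", scientific = FALSE)
print(plain_label)
```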
Step 2: Load and extract 2019 data
demographics <-read_csv("data/respopagesextod2011to2020.csv")
# only take year 2019
demo_2019 <- filter(demographics, Time==2019)
Step 3: Create new columns in dataframe
3.1. Categorise each age cohort into their Age categories
There are a total of 19 age cohorts (‘AG’ column) in the data. It would be too tedious for readers to compare all of the age cohorts across the different planning areas. Thus, we will create a new column that segments the population into 5 age categories based on the table below.
TABLE 1
| Age Category | Age cohorts (5 year range) |
|---|---|
| Children | ‘0_to_4’,‘5_to_9’,‘10_to_14’ |
| Young Adult | ‘15_to_19’,‘20_to_24’,‘25_to_29’ |
| Adult | ‘30_to_34’, ‘35_to_39’, ‘40_to_44’ |
| Middle age | ‘45_to_49’, ‘50_to_54’, ‘55_to_59’ |
| Elderly | ‘60_to_64’, ‘65_to_69’, ‘70_to_74’, ‘75_to_79’, ‘80_to_84’, ‘85_to_89’, ‘90_and_over’ |
Below is the code for grouping the different age cohorts into their respective age category based on the table above.
#group the age cohort into the different age category
children <- c('0_to_4','5_to_9','10_to_14')
young_adult <- c('15_to_19','20_to_24','25_to_29')
adult <- c('30_to_34', '35_to_39', '40_to_44')
middleage<-c('45_to_49', '50_to_54', '55_to_59')
elderly <- c('60_to_64', '65_to_69', '70_to_74', '75_to_79', '80_to_84', '85_to_89', '90_and_over')
#create a new column in the dataframe
demo_2019_ <- demo_2019 %>% mutate(Age_category =
case_when(`AG` %in% children ~ "Children (0-14)",
`AG` %in% young_adult ~ "Young Adult (15-29)",
`AG` %in% adult ~ "Adult (30-44)",
`AG` %in% middleage ~ "Middle age (45-59)",
`AG` %in% elderly ~ "Elderly (>60)",
TRUE ~ ""))
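Before relying on the new column, it can help to confirm that the five groups cover all 19 cohorts exactly once. The sketch below is an assumption-labelled alternative in base R, not the method used above: it builds a named lookup vector (`lookup` and `age_groups` are hypothetical names) from the same cohort lists and checks its coverage. Unlike the `case_when()` fallback, an unknown cohort returns `NA` here rather than an empty string.

```r
# Same cohort groupings as Table 1, stored as a named list (hypothetical name)
age_groups <- list(
  "Children (0-14)"     = c("0_to_4", "5_to_9", "10_to_14"),
  "Young Adult (15-29)" = c("15_to_19", "20_to_24", "25_to_29"),
  "Adult (30-44)"       = c("30_to_34", "35_to_39", "40_to_44"),
  "Middle age (45-59)"  = c("45_to_49", "50_to_54", "55_to_59"),
  "Elderly (>60)"       = c("60_to_64", "65_to_69", "70_to_74", "75_to_79",
                            "80_to_84", "85_to_89", "90_and_over")
)

# Named vector: names are cohorts, values are age categories
lookup <- setNames(
  rep(names(age_groups), lengths(age_groups)),
  unlist(age_groups, use.names = FALSE)
)

# Coverage checks: 19 cohorts in total and none listed twice
n_cohorts <- length(lookup)
has_duplicates <- anyDuplicated(names(lookup)) > 0

# Look up a known cohort and a cohort that is not in the table
known <- unname(lookup["5_to_9"])
unknown <- unname(lookup["100_and_over"])
print(c(n_cohorts = n_cohorts, has_duplicates = has_duplicates))
```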
3.2. Categorise each Planning Areas based on Region
Since there are many planning areas in Singapore, it would be difficult for readers to take in so much information. Thus, we group the planning areas (PA column) into 5 regions based on the table below.
TABLE 2
| Region | Planning Areas |
|---|---|
| North Region | Sungei Kadut, Yishun, Mandai, Woodlands, Simpang, Sembawang, Central Water Catchment, Lim Chu Kang |
| North-east Region | Hougang, Punggol, Ang Mo Kio, Sengkang, Seletar, North-Eastern Islands, Serangoon |
| Central Region | Kallang, Marina East, Museum, Novena, Singapore River, Rochor, Bukit Timah, Downtown Core, Marina South, Newton, Orchard, Queenstown, Southern Islands, Toa Payoh, Bishan, Bukit Merah, Geylang, Marine Parade, Outram, River Valley, Tanglin, Straits View |
| East Region | Changi, Pasir Ris, Bedok, Changi Bay, Tampines, Paya Lebar |
| West Region | Boon Lay, Bukit Panjang, Clementi, Tengah, Western Islands, Bukit Batok, Jurong East, Western Water Catchment, Choa Chu Kang, Jurong West, Tuas, Pioneer |
If you want to look at all the planning areas from the data file, use the code below.
unique(demo_2019_$PA)
Below is the code for grouping the different planning areas into their respective regions based on the table above.
#group the planning areas into the different regions
north<- c( "Sungei Kadut","Yishun", "Mandai", "Woodlands", "Simpang", "Sembawang","Lim Chu Kang","Central Water Catchment")
northeast<- c("Hougang","Punggol","Ang Mo Kio","Sengkang", "Seletar", "North-Eastern Islands", "Serangoon" )
central<-c("Kallang","Marina East","Museum","Novena","Singapore River","Rochor", "Bukit Timah","Downtown Core", "Marina South", "Newton", "Orchard","Queenstown", "Southern Islands", "Toa Payoh","Bishan", "Bukit Merah","Geylang", "Marine Parade", "Outram" , "River Valley", "Tanglin", "Straits View")
east<-c("Changi","Pasir Ris","Bedok", "Changi Bay","Tampines","Paya Lebar" )
west<-c("Boon Lay" ,"Bukit Panjang", "Clementi","Tengah","Western Islands", "Bukit Batok", "Jurong East", "Western Water Catchment", "Choa Chu Kang","Jurong West", "Tuas","Pioneer")
#create a new column in the dataframe
demo_2019__ <- demo_2019_ %>% mutate(Region =
case_when(`PA` %in% north ~ "North Region",
`PA` %in% northeast ~ "North-east Region",
`PA` %in% central ~ "Central Region",
`PA` %in% east ~ "East Region",
`PA` %in% west ~ "West Region",
TRUE ~ ""))
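Any planning area that is missing from the region vectors, or spelt differently in the data, ends up in the empty-string fallback. A quick way to spot this is to compare the names in the data with the names in the vectors. The sketch below is a minimal base R illustration: `observed` is a made-up vector standing in for `unique(demo_2019_$PA)`, and it deliberately contains one misspelt name.

```r
# Region vector copied from the grouping above (East Region only, to keep it short)
east <- c("Changi", "Pasir Ris", "Bedok", "Changi Bay", "Tampines", "Paya Lebar")

# Hypothetical planning area names as they might appear in the data;
# "Pasir ris" has a deliberate casing slip
observed <- c("Bedok", "Tampines", "Pasir ris", "Changi")

# Names present in the data but not covered by the region vector
unmapped <- setdiff(observed, east)
print(unmapped)
```

With the real data, the second argument would be the combined vector `c(north, northeast, central, east, west)`, and an empty result would mean every planning area is mapped to a region.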
Step 4: Check for unusual data
This step confirms that it is okay to proceed with the grouping done in the previous step. If there are any unusual findings, such as 80% of the age cohort 45_to_49 being female, we might not be able to group the age cohorts into categories, as the grouping could hide that information in the visualisation.
The code below is to check for unusual data in the age cohorts.
#Reorder the age cohort based on the age
demo_2019 %>%
dplyr::mutate(AG = factor(AG,
levels = c('0_to_4','5_to_9','10_to_14','15_to_19','20_to_24', '25_to_29', '30_to_34', '35_to_39', '40_to_44', '45_to_49', '50_to_54', '55_to_59', '60_to_64', '65_to_69', '70_to_74', '75_to_79', '80_to_84', '85_to_89', '90_and_over'))) %>%
#Plot graph
ggplot(aes(x=AG, y=Pop))+geom_col(aes(fill=Sex))+ scale_y_continuous(labels = comma)+theme(axis.text.x = element_text(angle=45, vjust=1,hjust=1))
Based on the chart above, the age cohorts do not have any unusual data, thus it is safe to group them based on their age group (Children, Young Adult etc.).
As mentioned earlier in the data and design challenges, the chart below shows that the visualisation is indeed very cluttered when all planning areas are shown.
The code below shows that there are too many planning areas, which might be confusing for readers.
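The visual check can be backed up with numbers by computing the share of females in each age cohort. The sketch below is a base R illustration on made-up data: the `toy` data frame, its values, the `Males`/`Females` labels and the 0.2 threshold are all assumptions, chosen so that one cohort reproduces the 80% example mentioned above.

```r
# Made-up rows with the same column names as the data (AG, Sex, Pop)
toy <- data.frame(
  AG  = c("0_to_4", "0_to_4", "45_to_49", "45_to_49", "45_to_49"),
  Sex = c("Males", "Females", "Males", "Females", "Females"),
  Pop = c(50, 50, 20, 30, 50)
)

# Total population per cohort and sex
totals <- xtabs(Pop ~ AG + Sex, data = toy)

# Share of females in each cohort
female_share <- totals[, "Females"] / rowSums(totals)
print(round(female_share, 2))

# Flag cohorts whose female share is far from an even split (assumed threshold)
flagged <- names(female_share)[abs(female_share - 0.5) > 0.2]
print(flagged)
```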
# order the planning areas by total population (reorder() uses the mean by default)
ggplot(data=demo_2019_,
aes(x=reorder(PA, -Pop, FUN = sum), y=Pop))+
geom_col(aes(fill=Age_category)) + scale_y_continuous(labels = comma) +
theme(axis.text.x = element_text(angle=90, vjust=1,hjust=1), legend.position = "bottom")
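One detail about `reorder()` is worth knowing when ordering bars: it summarises its second argument with the mean by default, which matches the order of the stacked bar heights (the totals) only when every planning area has the same number of rows. The sketch below uses made-up values (`area` and `pop` are hypothetical) to show how the two orderings can differ.

```r
# Made-up data: area "A" has three small rows, area "B" has one larger row
area <- c("A", "A", "A", "B")
pop  <- c(1, 1, 1, 2)

# Default: ordered by the mean of -pop, so "B" (mean 2) comes first
by_mean <- levels(reorder(area, -pop))
print(by_mean)

# Ordered by the total of -pop, so "A" (total 3) comes first
by_total <- levels(reorder(area, -pop, FUN = sum))
print(by_total)
```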
Step 5: Other findings
The code below shows that the numbers of males and females are almost the same in each age category, except for the adult and elderly categories, which have relatively more females than males.
# convert Age_category to an ordered factor first, then total the population by category and sex
p <- ggplot(demo_2019__ %>%
mutate(Age_category = factor(Age_category,
levels = c('Children (0-14)','Young Adult (15-29)','Adult (30-44)','Middle age (45-59)','Elderly (>60)'))) %>%
group_by(Age_category, Sex) %>%
# .groups = "drop" returns an ungrouped result and silences the regrouping message
summarise(Pop_total = sum(Pop), .groups = "drop"),
aes(x=Age_category, y=Pop_total, fill=Sex))
p+geom_col(position="dodge")+scale_y_continuous(labels = comma)
The following two pie charts are plotted using the plotly library.
The code below shows the pie chart of the population distribution by type of dwelling in 2019.
TOD_distribution <- plot_ly(data=demo_2019, labels = ~TOD, values = ~Pop, type = 'pie')
TOD_distribution <- TOD_distribution %>% layout(title = 'Types of Dwelling population distribution in 2019',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
TOD_distribution
The code below will show the pie chart of the population distribution by age category in 2019.
AG_distribution <- plot_ly(data=demo_2019_, labels = ~Age_category, values = ~Pop, type = 'pie')
AG_distribution <- AG_distribution %>% layout(title = 'Age category population distribution in 2019',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
AG_distribution
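The percentages shown on the pie charts can be cross-checked by totalling the population per category and converting the totals to shares. The sketch below does this in base R on made-up rows: the `toy` data frame and its values are hypothetical, and only the column names `TOD` and `Pop` come from the data.

```r
# Made-up rows; the real data has many rows per type of dwelling
toy <- data.frame(
  TOD = c("HDB 4-Room Flats", "HDB 4-Room Flats", "Landed Properties", "Others"),
  Pop = c(300, 100, 80, 20)
)

# Total population per type of dwelling
shares <- aggregate(Pop ~ TOD, data = toy, FUN = sum)

# Percentage share of each type of dwelling, rounded to one decimal place
shares$Share <- round(100 * shares$Pop / sum(shares$Pop), 1)
print(shares)
```

Replacing `TOD` with `Age_category` gives the equivalent check for the second pie chart.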
Step 6: Plot the charts
6.1 Plot bar chart based on population’s age category by region.
The code below will show the charts on the population distribution of the different age categories by regions in 2019.
#Reorder the age cohort based on the age
demo_2019__ %>%
dplyr::mutate(Age_category = factor(Age_category,
levels = c('Children (0-14)','Young Adult (15-29)','Adult (30-44)','Middle age (45-59)','Elderly (>60)'))) %>%
#Plot the graph
ggplot(
aes(fill=Region, y=Pop, x=Region))+
geom_col() +
theme(axis.text.x = element_blank(), legend.position = "none") + facet_wrap(~Age_category,ncol=5) + scale_y_continuous(labels = comma)+
ggtitle("Demographic structure of Singapore's Population in 2019") ->db1
# "->db1" assigns the plot to the object db1 so that the 2 charts can be combined in the final visualisation
To see what the visualisation looks like, remove “-> db1” and change legend.position to “bottom” in the code above.
6.2 Plot bar chart based on population’s Type of dwelling (TOD) by region.
The code below will show the charts on the population distribution of the types of dwelling by regions in 2019.
#the dplyr mutate function is to reorder the different fields according to how you want it to be ordered
demo_2019__ %>%
dplyr::mutate(Age_category = factor(Age_category,
levels = c('Children (0-14)','Young Adult (15-29)','Adult (30-44)','Middle age (45-59)','Elderly (>60)'))) %>%dplyr::mutate(TOD = factor(TOD,
levels = c("HDB 1- and 2-Room Flats","HDB 3-Room Flats","HDB 4-Room Flats","HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)","Condominiums and Other Apartments", "Landed Properties", "Others"))) %>%
#plot the graph
ggplot(
aes(fill=Region, y=Pop, x=Region))+
geom_col() +
theme(axis.text.x = element_blank(), legend.position = "bottom", legend.text=element_text(size=7)) + facet_wrap(~TOD,ncol=8,labeller = label_wrap_gen(width = 8)) + scale_y_continuous(labels = comma) ->db2
# "->db2" assigns the plot to the object db2 so that the 2 charts can be combined in the final visualisation
To see what the visualisation looks like, remove “-> db2” in the code above.
Step 7: Create the Final visualisation with the code below
The library grid is used to combine the visualisations.
grid.newpage()
pushViewport(viewport(layout=grid.layout(nrow=25, ncol=60)))
define_region<- function(row,col){
viewport(layout.pos.row=row, layout.pos.col=col)
}
print(db1,vp=define_region(row=1:10,col=1:60))
print(db2,vp=define_region(row=11:25,col=1:60))
Source from Singapore Department of Statistics
In general, the final visualisation shows that most regions have a higher population of residents living in HDB 4-Room Flats and HDB 5-Room and Executive Flats, and that no one is living in HUDC Flats (excluding those privatised). There is also a much lower population categorised as living in HDB 1- and 2-Room Flats and Others.
Diving into the specific regions, we can tell that the East and North regions have a lower population than the other regions. In addition, more of the residents in these two regions live in HDB 4-Room Flats and HDB 5-Room and Executive Flats than in other types of dwelling.
Another interesting insight is that the Central Region has the highest elderly population of all the regions. In addition, the Central Region has the highest population living in Landed Properties and in Condominiums and Other Apartments. The relative wealth of the population living in the Central Region may explain why more people there live in private estates. Although there are also more people living in HDB 1- and 2-Room Flats and HDB 3-Room Flats, we cannot infer that there are more poor people in the Central Region, as houses in the Central Region are usually more expensive than in other regions of Singapore.
Lastly, there is a higher population living in HDB 4-Room Flats in the North-east and West regions. We can also tell that there is a relatively higher population of adults aged 30 to 44 living in the North-east region. This could be because, in the past few years, Punggol has become a new planning area where new flats are being built, and Sengkang is also relatively new. Thus, adults in their 30s, usually the age to get married and start a family, are applying for their new homes in the new planning areas.