1. Introduction
- In this assignment, we are required to design a static data visualisation that either:
- reveals the demographic structure of the Singapore population by age cohort (e.g., 0-4, 5-9, ...) and by planning area in 2019, or
- tells any other story in a field of our own interest.
The dataset I will be using for this assignment is the Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 data series.
1.1 Describe the major data and design challenges faced in accomplishing this assignment
Challenge 1: The large number of categories to be visualised in this dataset
Exploratory Data Analysis (EDA) shows a total of 55 planning zones made up of 335 sub zones, and 19 age groups (as shown below). That is too many categories to visualise properly. I will therefore reduce the number of categories by grouping the planning zones into 5 regions and the 19 age groups into 5 main age groups. I believe this will make it easier to draw insights from the dataset.
unique(sgpop$PA)
## [1] Ang Mo Kio Bedok Bishan
## [4] Boon Lay Bukit Batok Bukit Merah
## [7] Bukit Panjang Bukit Timah Central Water Catchment
## [10] Changi Changi Bay Choa Chu Kang
## [13] Clementi Downtown Core Geylang
## [16] Hougang Jurong East Jurong West
## [19] Kallang Lim Chu Kang Mandai
## [22] Marina East Marina South Marine Parade
## [25] Museum Newton North-Eastern Islands
## [28] Novena Orchard Outram
## [31] Pasir Ris Paya Lebar Pioneer
## [34] Punggol Queenstown River Valley
## [37] Rochor Seletar Sembawang
## [40] Sengkang Serangoon Simpang
## [43] Singapore River Southern Islands Straits View
## [46] Sungei Kadut Tampines Tanglin
## [49] Tengah Toa Payoh Tuas
## [52] Western Islands Western Water Catchment Woodlands
## [55] Yishun
## 55 Levels: Ang Mo Kio Bedok Bishan Boon Lay Bukit Batok ... Yishun
head(unique(sgpop$SZ))
## [1] Ang Mo Kio Town Centre Cheng San Chong Boon
## [4] Kebun Bahru Sembawang Hills Shangri-La
## 335 Levels: Admiralty Airport Road Alexandra Hill Alexandra North ... Yunnan
unique(sgpop$AG)
## [1] 0_to_4 5_to_9 10_to_14 15_to_19 20_to_24 25_to_29
## [7] 30_to_34 35_to_39 40_to_44 45_to_49 50_to_54 55_to_59
## [13] 60_to_64 65_to_69 70_to_74 75_to_79 80_to_84 85_to_89
## [19] 90_and_over
## 19 Levels: 0_to_4 10_to_14 15_to_19 20_to_24 25_to_29 30_to_34 ... 90_and_over
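The category counts quoted above can also be obtained in a single call instead of printing every level. The following is a minimal sketch on a small, made-up table (`toy_pop` and its values are hypothetical; only the column names `PA`, `SZ` and `AG` are taken from the dataset):

```r
library(dplyr)

# Hypothetical stand-in for sgpop, using the same column names
toy_pop <- tibble(
  PA = c("Ang Mo Kio", "Ang Mo Kio", "Bedok", "Bedok"),
  SZ = c("Cheng San", "Chong Boon", "Bedok North", "Bedok North"),
  AG = c("0_to_4", "5_to_9", "0_to_4", "5_to_9")
)

# Count the distinct values of each categorical column in one pass
category_counts <- toy_pop %>%
  summarise(across(c(PA, SZ, AG), n_distinct))

category_counts
```

Applied to the full data, the same call would be expected to return one row with the number of planning zones, sub zones and age groups.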
Challenge 2: Each planning zone has a different number of sub zones
Since each planning zone has a different number of sub zones, raw resident counts are not directly comparable across planning zones. I will therefore normalise the resident counts by expressing them as percentages, so that the planning zones can be compared on the same scale.
sgpop %>%
group_by(PA) %>%
summarise(Count = n_distinct(SZ)) %>%
arrange(desc(Count))
## # A tibble: 55 x 2
## PA Count
## <fct> <int>
## 1 Bukit Merah 17
## 2 Queenstown 15
## 3 Downtown Core 13
## 4 Ang Mo Kio 12
## 5 Jurong East 12
## 6 Toa Payoh 12
## 7 Hougang 10
## 8 Rochor 10
## 9 Bukit Batok 9
## 10 Clementi 9
## # ... with 45 more rows
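The normalisation described for this challenge can be sketched as follows. This is only an illustration on made-up numbers: `toy_pop`, `pa_share` and the `Pct` column are hypothetical names, and the choice of the overall total as the denominator is an assumption.

```r
library(dplyr)

# Hypothetical resident counts for a few sub zones
toy_pop <- tibble(
  PA  = c("Bedok", "Bedok", "Tampines", "Yishun"),
  SZ  = c("Bedok North", "Bedok South", "Simei", "Khatib"),
  Pop = c(300, 200, 400, 100)
)

# Sum residents per planning zone, then express each as a share of the total
pa_share <- toy_pop %>%
  group_by(PA) %>%
  summarise(Pop = sum(Pop), .groups = "drop") %>%
  mutate(Pct = round(100 * Pop / sum(Pop), 1)) %>%
  arrange(desc(Pct))

pa_share
```

Because every planning zone is divided by the same total, a zone with many small sub zones and a zone with a few large ones end up on the same percentage scale.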
Challenge 3: Population count of 0
There are a total of 68,193 rows with a population of 0 in 2019. Each row is one combination of sub zone, age group, sex and type of dwelling, so a 0 could mean either that nobody in that combination lives there (for example, in the outskirts of Singapore) or that the data is missing. To keep the dataset relevant, I have decided to exclude all rows with a population of 0.
sgpop %>%
filter(Time == 2019, Pop == 0) %>%
summarise(count = n())
## # A tibble: 1 x 1
## count
## <int>
## 1 68193
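Dropping the zero rows reduces the number of rows but does not change any population total, since a row with 0 residents contributes nothing to a sum. A small sketch on made-up data (`toy_pop` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical rows, some with a population of 0
toy_pop <- tibble(
  Time = c(2019, 2019, 2019, 2019, 2011),
  Pop  = c(0, 10, 0, 30, 20)
)

toy_2019 <- toy_pop %>% filter(Time == 2019)

# Proportion of 2019 rows that have no residents
zero_share <- mean(toy_2019$Pop == 0)

# Totals with and without the zero rows
total_all     <- sum(toy_2019$Pop)
total_nonzero <- toy_2019 %>% filter(Pop > 0) %>% pull(Pop) %>% sum()

c(zero_share = zero_share, total_all = total_all, total_nonzero = total_nonzero)
```

The exclusion therefore mainly affects row counts and any statistic based on the number of rows, not the resident totals or percentages.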
1.2 Proposed sketched design to overcome the challenges

2. Provide step-by-step description on how the data visualization was prepared by using ggplot2 and other related R packages.
Viewing the dataset before proceeding with the data wrangling process
glimpse(sgpop)
## Rows: 984,656
## Columns: 7
## $ PA <fct> Ang Mo Kio, Ang Mo Kio, Ang Mo Kio, Ang Mo Kio, Ang Mo Kio, An...
## $ SZ <fct> Ang Mo Kio Town Centre, Ang Mo Kio Town Centre, Ang Mo Kio Tow...
## $ AG <fct> 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4...
## $ Sex <fct> Males, Males, Males, Males, Males, Males, Males, Males, Female...
## $ TOD <fct> HDB 1- and 2-Room Flats, HDB 3-Room Flats, HDB 4-Room Flats, H...
## $ Pop <dbl> 0, 10, 30, 50, 0, 0, 40, 0, 0, 10, 30, 60, 0, 0, 40, 0, 0, 10,...
## $ Time <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
In 2.1 to 2.3, I address the challenges mentioned above. All of the visualisations are expressed in percentages, as mentioned.
2.1 Filter for Year 2019 and population above 0
# Keep only 2019 rows with residents; toupper() also converts PA and SZ to character
sgpop_2019 <- sgpop %>%
  filter(Time == 2019, Pop > 0) %>%
  mutate(across(c(PA, SZ), toupper))
glimpse(sgpop_2019)
## Rows: 29,999
## Columns: 7
## $ PA <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG M...
## $ SZ <chr> "ANG MO KIO TOWN CENTRE", "ANG MO KIO TOWN CENTRE", "ANG MO KI...
## $ AG <fct> 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4...
## $ Sex <fct> Males, Males, Males, Males, Females, Females, Females, Females...
## $ TOD <fct> HDB 3-Room Flats, HDB 4-Room Flats, HDB 5-Room and Executive F...
## $ Pop <dbl> 10, 10, 20, 50, 10, 10, 20, 40, 10, 10, 50, 60, 10, 20, 40, 60...
## $ Time <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20...
Total Singapore resident population in 2019
total_pop <- sum(sgpop_2019$Pop)
format(total_pop, big.mark=",", scientific=FALSE)
## [1] "4,033,420"
2.2 Group the PA into 5 regions and AG into 5 main age groups
# Each planning area is assigned to exactly one region.
# Names that are not values of PA have no effect on the result.
sgpop_2019$REGION <- fct_collapse(sgpop_2019$PA,
  NORTH = c('LIM CHU KANG', 'MANDAI', 'SEMBAWANG', 'SUNGEI KADUT', 'WOODLANDS', 'YISHUN'),
  'NORTH-EAST' = c('ANG MO KIO', 'HOUGANG', 'PUNGGOL', 'SELETAR', 'SENGKANG', 'SERANGOON'),
  EAST = c('CHANGI', 'PAYA LEBAR', 'TAMPINES', 'PASIR RIS', 'LOYANG', 'SIMEI', 'KATONG', 'MACPHERSON', 'BEDOK'),
  WEST = c('BUKIT BATOK', 'BUKIT PANJANG', 'PIONEER', 'CHOA CHU KANG', 'CLEMENTI', 'JURONG EAST', 'JURONG WEST', 'TENGAH', 'WESTERN ISLANDS', 'WESTERN WATER CATCHMENT', 'BENOI', 'PANDAN GARDENS', 'JURONG ISLAND', 'KENT RIDGE', 'NANYANG'),
  CENTRAL = c('SINGAPORE RIVER', 'MUSEUM', 'DOWNTOWN CORE', 'OUTRAM', 'ROCHOR', 'ORCHARD', 'NEWTON', 'RIVER VALLEY', 'BUKIT TIMAH', 'TANGLIN', 'NOVENA', 'BISHAN', 'BUKIT MERAH', 'GEYLANG', 'KALLANG', 'MARINE PARADE', 'QUEENSTOWN', 'SOUTHERN ISLANDS', 'TOA PAYOH'))
# Collapse the 19 five-year cohorts into 5 broader age groups
sgpop_2019$AGE_GRP <- fct_collapse(sgpop_2019$AG,
  '0-14' = c('0_to_4', '5_to_9', '10_to_14'),
  '15-24' = c('15_to_19', '20_to_24'),
  '25-54' = c('25_to_29', '30_to_34', '35_to_39', '40_to_44', '45_to_49', '50_to_54'),
  '55-64' = c('55_to_59', '60_to_64'),
  'Over 65' = c('65_to_69', '70_to_74', '75_to_79', '80_to_84', '85_to_89', '90_and_over'))
unique(sgpop_2019$REGION)
## [1] NORTH-EAST EAST CENTRAL WEST NORTH
## Levels: NORTH-EAST EAST CENTRAL WEST NORTH
unique(sgpop_2019$AGE_GRP)
## [1] 0-14 15-24 25-54 55-64 Over 65
## Levels: 0-14 15-24 25-54 55-64 Over 65
The data has now been categorised by region and by the defined age groups for visualisation.
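Because `fct_collapse()` leaves any level that is not listed in one of the groups unchanged, it can be worth checking that no original level slipped through. A minimal sketch on a toy factor (the vector `ag`, the shortened group definitions and the `intended` vector are all hypothetical):

```r
library(forcats)

# Toy factor with one level ("40_to_44") deliberately left out of every group
ag <- factor(c("0_to_4", "5_to_9", "15_to_19", "40_to_44", "65_to_69"))

grouped <- fct_collapse(ag,
  "0-14"    = c("0_to_4", "5_to_9"),
  "15-24"   = "15_to_19",
  "Over 65" = "65_to_69"
)

# Any level outside the intended group names was not collapsed
intended <- c("0-14", "15-24", "Over 65")
leftover <- setdiff(levels(grouped), intended)
leftover
```

An empty `leftover` would indicate that every original level was mapped to one of the intended groups.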
2.4 Creating the various graphs for visualisation
2.4.1 Singapore Region
ggplot(data=region_pop,
aes(x=reorder(REGION, -Pop),
y=Pop,
label = Pop))+
stat_summary(geom="bar",
fun="sum")+
geom_label(hjust = 1.5)+
labs(x="REGION", y="SG RESIDENTS (%)")+
coord_flip()+
theme_classic()

The three most populated regions in Singapore are Central, West and North-East.
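The bar order in these charts comes from `reorder()`, which sorts the factor levels by the second argument. With `-Pop`, the largest category becomes the first level, and after `coord_flip()` the first level is drawn at the bottom of the axis. A small sketch with made-up percentages (`toy_region` is hypothetical):

```r
# Hypothetical region shares, not taken from the dataset
toy_region <- data.frame(
  REGION = c("NORTH", "EAST", "CENTRAL"),
  Pop    = c(14, 17, 23)
)

# Levels are sorted by ascending -Pop, i.e. by descending Pop
ordered_regions <- reorder(toy_region$REGION, -toy_region$Pop)
levels(ordered_regions)
```

If the largest bar should appear at the top of a flipped chart instead, `reorder(REGION, Pop)` gives the opposite level order.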
2.4.2 Singapore Age group
ggplot(data=ag_pop,
aes(x=reorder(AGE_GRP, -Pop),
y=Pop,
label = Pop))+
stat_summary(geom="bar",
fun="sum")+
geom_label(hjust = 0.65)+
labs(x="Age Group", y="SG RESIDENTS (%)")+
coord_flip()+
theme_classic()

The three largest age groups in Singapore are 25-54, 0-14 and 55-64, with 55-64 very close to Over 65.
2.4.3 Singapore Type of Dwelling
ggplot(data=dw_pop,
aes(x=reorder(TOD, -Pop),
y=Pop,
label = Pop))+
stat_summary(geom="bar",
fun="sum")+
labs(x="TYPE OF DWELLING", y="SG RESIDENTS (%)")+
geom_label(hjust = 0.6)+
coord_flip()+
theme_classic()

The three most common types of dwelling in Singapore are HDB 4-Room, HDB 5-Room, and Condominiums & Other Apartments, with the last of these very close to HDB 3-Room.
2.4.4 Type of dwelling by Singapore Region
ggplot(data=region_dw_pop,
aes(x=reorder(REGION, -Pop),
y=Pop,
label = Pop))+
stat_summary(geom="bar",
fun="sum")+
labs(x="REGION", y="SG RESIDENTS (%)")+
geom_label()+
coord_flip()+
theme_classic()+
facet_wrap(~TOD)

Within the top 3 types of dwelling (HDB 4-Room, HDB 5-Room, Condominiums & Other Apartments), let’s look in more depth at which parts of Singapore these dwelling types are located in. 4-Room flats are mostly located in the West, North-East and North. 5-Room and Executive flats are mostly located in the West, North-East and East. Finally, condominiums are mostly located in the Central, East and West regions.
2.4.5 Singapore Region detail by Sub zones
2.4.5.1 North
ggplot(data=region_sz_pop_north,
aes(x=reorder(SZ, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="SUB ZONES", y="SG RESIDENTS (%)")+
coord_flip()+
theme_classic()+
facet_wrap(~REGION)+
theme(axis.text.y = element_text(size=6))

2.4.5.2 East
ggplot(data=region_sz_pop_east,
aes(x=reorder(SZ, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="SUB ZONES", y="SG RESIDENTS (%)")+
coord_flip()+
theme_classic()+
facet_wrap(~REGION)

2.4.5.3 North-East
ggplot(data=region_sz_pop_neast,
aes(x=reorder(SZ, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="SUB ZONES", y="SG RESIDENTS (%)")+
coord_flip()+
theme_classic()+
facet_wrap(~REGION)+
theme(axis.text.y = element_text(size=5))

2.4.5.4 West
ggplot(data=region_sz_pop_west,
aes(x=reorder(SZ, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="SUB ZONES", y="SG RESIDENTS (%)")+
coord_flip()+
theme_classic()+
facet_wrap(~REGION)+
theme(axis.text.y = element_text(size=6))

2.4.5.5 Central (Top 10)
ggplot(data=region_sz_pop_central_top10,
aes(x=reorder(SZ, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="SUB ZONES", y="SG RESIDENTS (%)")+
coord_flip()+
theme_classic()+
facet_wrap(~REGION)

Within the three most populated regions (Central, West, North-East), the top 3 sub zones in the Central region are Aljunied, Bendemeer and Balestier. In the West, they are Yunnan, Jurong West Central and Hong Kah. Finally, in the North-East, they are Sengkang, Rivervale and Fernvale. Please refer to the graphs under 2.4.5.
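A "top N" table such as the one used for the Central region can be produced with `slice_max()`. How `region_sz_pop_central_top10` was actually built is not shown here, so the following is only a sketch on made-up sub zones (`toy_sz` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical sub zone shares
toy_sz <- tibble(
  SZ  = c("A", "B", "C", "D"),
  Pop = c(5, 20, 15, 10)
)

# Keep the n sub zones with the largest Pop, sorted from largest to smallest
top2 <- toy_sz %>% slice_max(Pop, n = 2)
top2
```

By default `slice_max()` keeps ties, so the result can contain more than `n` rows when several sub zones share the cut-off value.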
2.4.6 Singapore Region detail by Sex
ggplot(data=region_sex_pop,
aes(x=reorder(REGION, -Pop),
y=Pop,
label = Pop))+
stat_summary(geom="bar",
fun="sum")+
geom_label(hjust = 1.5)+
labs(x="REGION", y="SG RESIDENTS (%)")+
coord_flip()+
theme_classic()+
facet_grid(~Sex)

Although 2.4.1 shows that the West and Central regions are equal in number of residents, the drill-down by sex shows that there are more females than males.
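One way to make the comparison between the sexes explicit is to compute each sex's share within its region, rather than reading it off two separate panels. A minimal sketch, assuming made-up figures (`toy_region_sex` and the `Pct` column are hypothetical):

```r
library(dplyr)

# Hypothetical counts by region and sex
toy_region_sex <- tibble(
  REGION = c("WEST", "WEST", "CENTRAL", "CENTRAL"),
  Sex    = c("Males", "Females", "Males", "Females"),
  Pop    = c(48, 52, 45, 55)
)

# Share of each sex within its own region (each region sums to 100)
sex_share <- toy_region_sex %>%
  group_by(REGION) %>%
  mutate(Pct = 100 * Pop / sum(Pop)) %>%
  ungroup()

sex_share
```

Grouping before `mutate()` makes the denominator the regional total, so the two values for a region can be compared directly.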
2.4.7 Singapore Age group detail by Region
ggplot(data=region_ag_pop,
aes(x=reorder(AGE_GRP, -Pop),
y=Pop,
label = Pop))+
stat_summary(geom="bar",
fun="sum")+
geom_label(hjust = 0.65)+
labs(x="AGE GRP", y="SG RESIDENTS (%)")+
coord_flip()+
theme_classic()+
facet_wrap(~REGION)

Within the top 3 age groups (25-54, 0-14, 55-64), let’s look in more depth at which regions each age group lives in. The 25-54 group mostly lives in the North-East, West and Central regions. The 0-14 group also mostly lives in the North-East, West and Central regions. Finally, the 55-64 group lives mainly in the West, Central and North-East regions.
2.4.8 Singapore Region detail by Age group and Gender
2.4.8.1 North and East Region with Gender
north_ag_sex_pop_plot <- ggplot(data=north_ag_sex_pop,
aes(x=reorder(AGE_GRP, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="AGE GRP", y="SG RESIDENTS (%)")+
geom_label(hjust = 0.65)+
coord_flip()+
theme_classic()+
facet_grid(REGION~Sex)
east_ag_sex_pop_plot <- ggplot(data=east_ag_sex_pop,
aes(x=reorder(AGE_GRP, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="AGE GRP", y="SG RESIDENTS (%)")+
geom_label(hjust = 0.65)+
coord_flip()+
theme_classic()+
facet_grid(REGION~Sex)
ggarrange(north_ag_sex_pop_plot, east_ag_sex_pop_plot, nrow=2)

2.4.8.2 North-East and West Region with Gender
neast_ag_sex_pop_plot <- ggplot(data=neast_ag_sex_pop,
aes(x=reorder(AGE_GRP, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="AGE GRP", y="SG RESIDENTS (%)")+
geom_label(hjust = 0.65)+
coord_flip()+
theme_classic()+
facet_grid(REGION~Sex)
west_ag_sex_pop_plot <- ggplot(data=west_ag_sex_pop,
aes(x=reorder(AGE_GRP, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="AGE GRP", y="SG RESIDENTS (%)")+
geom_label(hjust = 0.65)+
coord_flip()+
theme_classic()+
facet_grid(REGION~Sex)
ggarrange(neast_ag_sex_pop_plot, west_ag_sex_pop_plot, nrow=2)

2.4.8.3 Central Region with Gender
ggplot(data=central_ag_sex_pop,
aes(x=reorder(AGE_GRP, -Pop),
y=Pop,
label = Pop))+
geom_bar(stat = 'identity')+
labs(x="AGE GRP", y="SG RESIDENTS (%)")+
geom_label(hjust = 0.65)+
coord_flip()+
theme_classic()+
facet_grid(REGION~Sex)
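The five plots in 2.4.8 share the same specification and differ only in the data frame they are given, so they could be generated by one small helper. This is a sketch only: `plot_age_by_sex()` and `toy_north` are hypothetical names, and `geom_col()` is used as the equivalent of `geom_bar(stat = 'identity')`.

```r
library(ggplot2)

# Hypothetical helper: one plot specification reused for every region
plot_age_by_sex <- function(df) {
  ggplot(df, aes(x = reorder(AGE_GRP, -Pop), y = Pop, label = Pop)) +
    geom_col() +
    geom_label(hjust = 0.65) +
    labs(x = "AGE GRP", y = "SG RESIDENTS (%)") +
    coord_flip() +
    theme_classic() +
    facet_grid(REGION ~ Sex)
}

# Made-up data with the same column names as the plot data
toy_north <- data.frame(
  REGION  = "NORTH",
  Sex     = c("Males", "Females", "Males", "Females"),
  AGE_GRP = c("0-14", "0-14", "25-54", "25-54"),
  Pop     = c(1.2, 1.1, 3.4, 3.5)
)

p <- plot_age_by_sex(toy_north)
```

Each regional table could then be passed to the helper in turn, which keeps labels and themes consistent across the plots.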
