In order to explore the demographic structure of Singapore Population from different parts, we create three plots to visualize data analytics.
Population Pyramid
Ternary Plot
Bubble Plot
We use the code chunk below to ensure we install and launch packages we need in the following steps.
packages = c('tidyverse','ggthemes','ggtern','plotly','viridis')
for(p in packages){library
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
Data sources from Singstat link, including characters of Singapore population, such as age, geographic distribution and gender.
Use read.csv() function to import data into R
sg <- read.csv("data/respopagesextod2011to2019.csv")
Check the structure of raw data using str() function
str(sg)
## 'data.frame': 883728 obs. of 7 variables:
## $ PA : Factor w/ 55 levels "Ang Mo Kio","Bedok",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ SZ : Factor w/ 323 levels "Admiralty","Airport Road",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ AG : Factor w/ 19 levels "0_to_4","10_to_14",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "Females","Males": 2 2 2 2 2 2 2 2 1 1 ...
## $ TOD : Factor w/ 8 levels "Condominiums and Other Apartments",..: 2 3 4 5 6 7 1 8 2 3 ...
## $ Pop : int 0 10 30 50 0 0 40 0 0 10 ...
## $ Time: int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
Rename those abbreviations to show the full name of variable.
names(sg)<-c("Zone","Subzone","Age","Gender","Type of Dwelling","Population","Year")
str(sg)
## 'data.frame': 883728 obs. of 7 variables:
## $ Zone : Factor w/ 55 levels "Ang Mo Kio","Bedok",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Subzone : Factor w/ 323 levels "Admiralty","Airport Road",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ Age : Factor w/ 19 levels "0_to_4","10_to_14",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "Females","Males": 2 2 2 2 2 2 2 2 1 1 ...
## $ Type of Dwelling: Factor w/ 8 levels "Condominiums and Other Apartments",..: 2 3 4 5 6 7 1 8 2 3 ...
## $ Population : int 0 10 30 50 0 0 40 0 0 10 ...
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
levels(sg$Age)
## [1] "0_to_4" "10_to_14" "15_to_19" "20_to_24" "25_to_29"
## [6] "30_to_34" "35_to_39" "40_to_44" "45_to_49" "5_to_9"
## [11] "50_to_54" "55_to_59" "60_to_64" "65_to_69" "70_to_74"
## [16] "75_to_79" "80_to_84" "85_to_89" "90_and_over"
#the order of factors in age is incorrect, reorder the level of factors
sg$Age <- factor(sg$Age,levels(sg$Age)[c(1,10,2:9,11:19)])
levels(sg$Age)
## [1] "0_to_4" "5_to_9" "10_to_14" "15_to_19" "20_to_24"
## [6] "25_to_29" "30_to_34" "35_to_39" "40_to_44" "45_to_49"
## [11] "50_to_54" "55_to_59" "60_to_64" "65_to_69" "70_to_74"
## [16] "75_to_79" "80_to_84" "85_to_89" "90_and_over"
Because we only want to show data in 2019, so that we should use filter() function to select data of year 2019.
In the pymarid, we want to split gender variable, to compare difference between gender by age cohort, so that we should use formula to reverse the value of female population.
sg_2019 <- sg %>%
filter (Year == 2019)
sg_2019$Population <- ifelse(sg_2019$Gender == "Females",-1*sg_2019$Population,sg_2019$Population)
Create two bar plots to show different distributions of gender.
Reset the labels of y axis, to make the label value of female population is positive and same intervals of two bar plots.
Use coord_flip() function to reverse the y axis, and design the layout and color to make plot more pretty.
Use theme_economist to make the plot backgroung has more fun.
age_cohort <- ggplot(sg_2019,aes(x = Age, y = Population,fill = Gender))+
geom_bar(data = subset(sg_2019,Gender == "Females"), stat = "identity")+
geom_bar(data = subset(sg_2019,Gender == "Males"), stat = "identity") +
scale_y_continuous(breaks = seq(-150000, 150000, 50000),
labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))))) +
coord_flip()
age_cohort+
ggtitle("Singapore Population Pyramid by Age Cohort, 2019")+
xlab("Age Group")+
ylab("Population in Thousands")+
scale_fill_manual(values=c('lightpink2','steelblue3'))+
theme_economist()+
theme(legend.position='right')
For groups aged below 24, the populations of male are higher than that of female. But for groups aged above 24, the opposite is happening,especially for the aged group, the proportion of female is obviously higher than that of male.In the 95 and over group, the figure of female is around 3 times of male.
We use three age groups, young, economically active and old as three components of ternary plot.
Use spread() function to calculate population in different age groups.
Use mutate to create new age groups. The rows number is order according to the new table after spreading.
Filter function to keep only year 2019 of data.
Age_pop <- sg %>%
mutate(`Year` = as.character(Year))%>%
spread(Age, Population)
Age_pop[is.na(Age_pop)]<-0
Age_population <- Age_pop%>%
mutate(Young = rowSums(.[6:10]))%>%
mutate(Economically_Active = rowSums(.[11:18])) %>%
mutate(Old = rowSums(.[19:24])) %>%
mutate(TOTAL = rowSums(.[6:24])) %>%
filter(Year == 2019)%>%
filter(TOTAL > 0)
ggtern is a package designed by R to easier create a ternary plot.
Use yound, economically active and old group as x, y and z variable respectively.
Use theme_rgbw to set the background.
Use hjust=0.5 to make the title in the center.
ternplot<- ggtern(data=Age_population, aes(x=Young,y=Economically_Active, z=Old)) +
geom_point()
ternplot+
ggtitle("Ternary Plot of Population Structure, 2019")+
xlab("Young")+
ylab("Active")+
zlab("Old")+
theme_rgbw()+
theme(plot.title = element_text(hjust=0.5))
For majority of planning areas, economically active group always make up the largest percentage. But compare to part of younger area, has more points fall in the older area. Overall, the tendency of data samples to be more aging.
Proportion of young population and proportion of old population are the pair of variables we want to compare in the bubble plot.
Use group_by() function to calcurate variables in zone dimension.
Use summarise() function to create variable calculated by variables that group by zone.
We should create a new table, select the variables we need in the summarise result table as variables of new table. This step is to make sure the variables we want to show in the bubble plot already existed in the R environment.
group_zone <- Age_population %>%
group_by(Zone)
summary <- group_zone %>%
summarise(total_young=sum(Young),total_old=sum(Old),total=sum(TOTAL),Young_per=total_young/total,Old_per=total_old/total) %>%
select(Zone,total_young,total_old,Young_per,Old_per,total,total)
Bubble plot could be created in R by using scatter plot with shape 21.
Reset the color of points to viridis palette.
Rename the title, x and y labels.
Because zone has too many values, set guide=FALSE to hide the legend.
Use range=c() to set the point size, easily to compare points.
bubbleplot<-ggplot(summary,aes(x=Old_per,y=Young_per,size=total,fill=Zone))+
geom_point(alpha=0.5,shape=21,color="black")
bubbleplot +
scale_x_continuous(labels=scales::percent)+
scale_y_continuous(labels=scales::percent)+
scale_fill_viridis(discrete=TRUE, guide=FALSE, option="A") +
scale_size_continuous(name = "Population",range=c(0.1,12))+
ggtitle("Bubble Plot of Comparison between Young and Old Population, 2019")+
xlab("Proportion of Old")+
ylab("Proportion of Young")+
theme(legend.position="right")
More bubbles have higher proportion of aged population. The values of active group seem always be the biggest compare to that of young group and old group.