Singapore, with a total population of around 5.7 million people and around 4 million of them are residents in 2019, has a growing population trend with a increasing median age since 2011. I’m motivated to explore Singapore’s population sturcure on age groups to have a better understanding of Singapore’s demographic features. The dataset used here contains Singapore residents population by age group, gender, area, and type of dwelling from the year 2011 to 2019.
In this article, I’ll forcus on exploring Singapore’s population structure of age groups in different ways. My analysis consists of 4 part: * Plot the population structure in the years from 2011 to 2019 and identify the overall trend * Identify the population structure by age group combined with Gender in 2019 * Identify the population structure by age group combined with type of dwelling in 2019 * Identify the population structure by age group combined with Planning area in 2019
packages = c('tidyverse', 'reshape2', 'dplyr', 'grid')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)}
After installing and loading nessary libraries, we can load the dataset, then ascertain the dataframe dimensions, variable datatypes and number of missing values.
data <- read_csv("respopagesextod2011to2019.csv")
data_description <- read_csv("Notes_respopagesextod2011to2019.csv")
print(paste('DataFrame contains ', dim(data)[1], 'rows and ', dim(data)[2], 'columns.'))
## [1] "DataFrame contains 883728 rows and 7 columns."
print(paste('There are ', sum(is.na(data)), ' missing values representing ',
round(100*sum(is.na(data))/(nrow(data)*ncol(data)), 5), '% of the total.'))
## [1] "There are 0 missing values representing 0 % of the total."
print('Variable names are as follows:')
## [1] "Variable names are as follows:"
print(names(data))
## [1] "PA" "SZ" "AG" "Sex" "TOD" "Pop" "Time"
head(data)
## # A tibble: 6 x 7
## PA SZ AG Sex TOD Pop Time
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 1- and 2-Room Flats 0 2011
## 2 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 3-Room Flats 10 2011
## 3 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 4-Room Flats 30 2011
## 4 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 5-Room and Executive F~ 50 2011
## 5 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HUDC Flats (excluding thos~ 0 2011
## 6 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males Landed Properties 0 2011
head(data_description, 8)
## # A tibble: 8 x 1
## `Notes for Singapore Residents by Planning Area/Subzone, Age Group, Sex and T~
## <chr>
## 1 Column Headers:
## 2 PA - Planning Area
## 3 SZ - Subzone
## 4 AG - Age Group
## 5 Sex - Sex
## 6 TOD - Type of Dwelling
## 7 Pop - Resident Count
## 8 Time - Time / Period
print('data source: Singapore Department of Statistics https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data')
## [1] "data source: Singapore Department of Statistics https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data"
The dataset contains 7 variables with description of column headers. By looking at the data table, we see only ‘Pop’ and ‘Time’ are numeric while the rest are character variables. In overall, the population of Singapore residence are devided into Planning areas, subzones, age group, gender and type of dwellling from 2011 to 2019. The dataframe is already close to being ready for visualisation since there are no missing value.
We can change colunms name to the full names, which is more intuitive. By checking with the age groups’ levels, we need to unify their format to order age groups in ascending sequence.
names(data) <- c("Planning_Area", "Subzone", "Age", "Gender", "Type_of_Dwelling", "Population", "Year")
print(levels(as.factor(data$Age)))
## [1] "0_to_4" "10_to_14" "15_to_19" "20_to_24" "25_to_29"
## [6] "30_to_34" "35_to_39" "40_to_44" "45_to_49" "5_to_9"
## [11] "50_to_54" "55_to_59" "60_to_64" "65_to_69" "70_to_74"
## [16] "75_to_79" "80_to_84" "85_to_89" "90_and_over"
data$Age <- str_replace(as.factor(data$Age),"5_to_9","05_to_09")
data$Age <- str_replace(as.factor(data$Age),"0_to_4","0_to_04")
print('---------------------------------------------')
## [1] "---------------------------------------------"
print('Arrange age groups in ascending order:')
## [1] "Arrange age groups in ascending order:"
print(levels(as.factor(data$Age)))
## [1] "0_to_04" "05_to_09" "10_to_14" "15_to_19" "20_to_24"
## [6] "25_to_29" "30_to_34" "35_to_39" "40_to_044" "45_to_49"
## [11] "50_to_54" "55_to_59" "60_to_64" "65_to_69" "70_to_74"
## [16] "75_to_79" "80_to_84" "85_to_89" "90_and_over"
#data$Year <- as.integer(data$Year)
We first calcullate the population proportion of each age group in Singapore in every year, then we can plot the change of age structure in Singapore. It seems that the proportion of each age group didn’t change much from 2011 to 2019, but we can also see most lines are slightly upward or nearly steady, which means the proportion of elder groups of perple are getting larger and Singapore is slowly stepping into the aging society. We can further calculate the difference of proportion between 2011 and 2019 for each age group. From the result in the data frame, the proportion of groups aged under 54 decreased from 2011 to 2019 except the group aged from 25 to 29, while the proportion of the groups aged above 54 are growing. the age group range from 65 to 69 has the largest increase of propotion while the proportion of the age group range from 15 to 19 drop the most.
age_pop_by_year <- data %>% group_by(Age, Year) %>% summarise(Population=sum(Population))
df_by_year <- group_by(.data = age_pop_by_year, Year)
df_summarize <- mutate(.data = df_by_year, Population_percent = Population/sum(Population))
ggplot(data = df_summarize, mapping = aes(x = as.integer(Year), y = Population_percent, fill = Age)) +
geom_area(alpha = 0.6) +
geom_line(colour = 'black', size = 0.2, position = 'stack', alpha = 0.6) +
guides(fill = guide_legend(reverse = FALSE)) +
theme_minimal()+
ggtitle("Singapore Population Proportion of Age Groups, 2011-2019")+
xlab("Year")+
ylab("Population Proportion")+
theme(legend.position='right') +
scale_x_continuous(breaks = c(seq(2011, 2019, 1)),
labels = paste0(as.character(c(seq(2011, 2019, 1)))))
overall_change_of_proportion = df_summarize[df_summarize$Year == 2019,]$Population_percent - df_summarize[df_summarize$Year == 2011,]$Population_percent
print(data.frame(levels(as.factor(df_summarize$Age)), overall_change_of_proportion))
## levels.as.factor.df_summarize.Age.. overall_change_of_proportion
## 1 0_to_04 -0.003702793
## 2 05_to_09 -0.005760264
## 3 10_to_14 -0.011805408
## 4 15_to_19 -0.013599204
## 5 20_to_24 -0.005466835
## 6 25_to_29 0.002237579
## 7 30_to_34 -0.005708776
## 8 35_to_39 -0.008767543
## 9 40_to_044 -0.006201840
## 10 45_to_49 -0.008123085
## 11 50_to_54 -0.006394510
## 12 55_to_59 0.007459401
## 13 60_to_64 0.014246610
## 14 65_to_69 0.025310339
## 15 70_to_74 0.010917796
## 16 75_to_79 0.005876446
## 17 80_to_84 0.004350054
## 18 85_to_89 0.003249161
## 19 90_and_over 0.001882873
Now, select the data in 2019 to have a detail look at the age structure of Singapore in 2019.
data2019 = data[data$Year == 2019,]
The background bar chart in grey color is the population distribution by age groups, we can see the distribution is skewed to the left that there are more than 85% of population are aged under 65. Comparing age structure of Female and Male, we see female seems to have long life expectancy since women account for a larger proportion in elder age groups than men.
pop_by_age_gender <- data2019 %>% group_by(Age, Gender) %>% summarise(Population=sum(Population))
pop_by_age <- data2019 %>% group_by(Age) %>% summarise(Population=sum(Population))
ggplot() +
geom_bar(data = pop_by_age, aes(x = Age, y = Population), stat = 'sum', alpha=0.3, color='grey') +
geom_bar(data = pop_by_age_gender, aes(x = Age, y = Population, fill = Gender), stat = 'sum') +
scale_fill_manual(values=c('lightpink2','steelblue3')) +
facet_grid(~Gender) +
scale_x_discrete(labels = NULL) +
scale_y_continuous(breaks = c(seq(0, 400000, 50000)), labels = c(seq(0, 400, 50))) +
theme_minimal()+
ggtitle("Singapore Population by Gender and Age Group, 2019") +
xlab("Age groups")+
ylab("Population in thousands")
The pyramid with a narrow base shows a Constrictive tendency of Singapore population. Lower percentage of younger people, indicating declining birth rates with each succeeding age group getting smaller than the previous one.
data2019$Population <- ifelse(data2019$Gender == "Females",-1*data2019$Population, data2019$Population)
age_cohort <- ggplot(data2019, aes(x = Age, y = Population, fill = Gender))+
geom_bar(data = subset(data2019, Gender == "Females"), stat = "identity")+
geom_bar(data = subset(data2019, Gender == "Males"), stat = "identity") +
scale_y_continuous(breaks = seq(-150000, 150000, 50000),
labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))))) +
coord_flip()
age_cohort+
ggtitle("Singapore Residents Pyramid by Age Cohort, 2019")+
xlab("Age group")+
ylab("Population in thousands")+
scale_fill_manual(values=c('lightpink2','steelblue3'))+
theme_minimal()+
theme(legend.position='right')
Firstly, summarize the population by planning areas and sort planning areas by population in descending order, we can see the population amounts are greatly difference among different planning areas.
data2019 = data[data$Year == 2019,]
pop_by_pa <- data2019 %>% group_by(Planning_Area) %>% summarise(Population=sum(Population))
pa_list <- pop_by_pa[order(pop_by_pa$Population, decreasing = TRUE),]$Planning_Area
print(pop_by_pa[order(pop_by_pa$Population, decreasing = TRUE),])
## # A tibble: 55 x 2
## Planning_Area Population
## <chr> <dbl>
## 1 Bedok 279970
## 2 Jurong West 265010
## 3 Tampines 257020
## 4 Woodlands 255000
## 5 Sengkang 244910
## 6 Hougang 227110
## 7 Yishun 220810
## 8 Choa Chu Kang 191100
## 9 Punggol 170920
## 10 Ang Mo Kio 164430
## # ... with 45 more rows
Then, we can subset the dataset into 2 different data frame. the first data frame is the population of different age groups in the 9 planning areas with the most population; accordingly, the second one is in the 9 planning areas with the least population(exclude the planning areas of no population). To compare the population structre of different planning areas with different population amounts, we need to calculate and transform population to the population proportion.
pop_by_age_pa <- data2019 %>% group_by(Age, Planning_Area) %>% summarise(Population=sum(Population))
df_by_pa <- group_by(.data = pop_by_age_pa, Planning_Area)
df_summarize2 <- mutate(.data = df_by_pa, Population_percent = Population/sum(Population))
top9 <- subset(df_summarize2, Planning_Area %in% pa_list[1:9])
bottom9 <- subset(df_summarize2, Planning_Area %in% pa_list[34:42])
From the result distribution of population proportion by age groups in different planning areas, most of the 9 planning areas with the largest population have distributions similar to the overall distribution of population by age groups in Singapore. the Punggol area’s population structure is most different from the other 8 areas that has a much higher proportion of young groups people aged below 10.
ggplot(data = top9, aes(x = Age, y = Population_percent)) +
geom_bar(stat = "sum", alpha = 0.7) +
facet_wrap(~Planning_Area) +
scale_x_discrete(labels = NULL) +
ggtitle("Distribution of Population by Age Cohort in the Top 9 Planning Areas, 2019", subtitle='The 9 planning areas with the largest population')+
xlab("Age group")+
ylab("Population Proportion")
Compared to areas with largest population, areas with smallset population show more different patterns of the population distribution by age groups. Some elderly age groups are not contained in these areas like Museum, Seletar, and Western Water Catchment. The Lim Chu Kang area, which has the smallest population equals to 70 people, only have 6 age groups have population, so the distribution is greatly different from other areas.
ggplot(data = bottom9, aes(x = Age, y = Population_percent)) +
geom_bar(stat = "sum", alpha = 0.7) +
facet_wrap(~Planning_Area) +
scale_x_discrete(labels = NULL) +
ggtitle("Distribution of Population by Age Cohort in the Tail 9 Planning Areas, 2019", subtitle='The 9 planning areas with the smallest population') +
xlab("Age group")+
ylab("Population Proportion")
packages = c('ggtern')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)}
Firstly, based on the dataset of 2019, construct another data frame by using spread() function to separate population by age groups into different cohort columns.
age_pop_ternary = data2019 %>%
mutate(Year = as.character(Year))%>%
spread(Age, Population)
print(names(age_pop_ternary))
## [1] "Planning_Area" "Subzone" "Gender" "Type_of_Dwelling"
## [5] "Year" "0_to_04" "05_to_09" "10_to_14"
## [9] "15_to_19" "20_to_24" "25_to_29" "30_to_34"
## [13] "35_to_39" "40_to_044" "45_to_49" "50_to_54"
## [17] "55_to_59" "60_to_64" "65_to_69" "70_to_74"
## [21] "75_to_79" "80_to_84" "85_to_89" "90_and_over"
head(age_pop_ternary)
## # A tibble: 6 x 24
## Planning_Area Subzone Gender Type_of_Dwelling Year `0_to_04` `05_to_09`
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Ang Mo Kio Ang Mo~ Femal~ Condominiums an~ 2019 40 60
## 2 Ang Mo Kio Ang Mo~ Femal~ HDB 1- and 2-Ro~ 2019 0 0
## 3 Ang Mo Kio Ang Mo~ Femal~ HDB 3-Room Flats 2019 10 10
## 4 Ang Mo Kio Ang Mo~ Femal~ HDB 4-Room Flats 2019 10 20
## 5 Ang Mo Kio Ang Mo~ Femal~ HDB 5-Room and ~ 2019 20 40
## 6 Ang Mo Kio Ang Mo~ Femal~ HUDC Flats (exc~ 2019 0 0
## # ... with 17 more variables: `10_to_14` <dbl>, `15_to_19` <dbl>,
## # `20_to_24` <dbl>, `25_to_29` <dbl>, `30_to_34` <dbl>, `35_to_39` <dbl>,
## # `40_to_044` <dbl>, `45_to_49` <dbl>, `50_to_54` <dbl>, `55_to_59` <dbl>,
## # `60_to_64` <dbl>, `65_to_69` <dbl>, `70_to_74` <dbl>, `75_to_79` <dbl>,
## # `80_to_84` <dbl>, `85_to_89` <dbl>, `90_and_over` <dbl>
Then, further group the original age groups into the ‘Young’, ‘Active’, and ‘Old’ age groups. Young group is aged below 25; Active group is aged above 24 and below 65; Old group is aged above 64.
age_pop_ternary = age_pop_ternary[, c(seq(1, 24, 1))]
age_pop_ternary = age_pop_ternary %>%
mutate(Young = rowSums(.[6:10]))%>%
mutate(Active = rowSums(.[11:18])) %>%
mutate(Old = rowSums(.[19:24])) %>%
mutate(Total = rowSums(.[6:24])) %>%
filter(Total > 0)
Finally, we can draw the ternary plot of singapore population in 3 age dimension: ‘Young’, ‘Active’, and ‘Old’. The population of active groups always the largest population groups in different planning areas and subzones. There are also some points spread near to the lower right corner which means these areas are aging society that elder peple account for the largest proportion. After applying different colors to the points to represent the type of dwelling, we see there are some patterns between age and the housing condition. Most areas with no yound people don’t have a specific type of dwelling. People living in Condominiums or apartments are tend to be in young or active grouops that aged below 65.
ggtern(data = age_pop_ternary, aes(x = Young,y = Active, z = Old, color = Type_of_Dwelling)) +
geom_point() +
ggtitle("Ternary Plot of Singapore Population Structure, 2019") +
theme_rgbw() +
theme(plot.title = element_text(hjust=0))