Q1. a. The source data covers the years 2011-2019. Our analysis focuses only on 2019, so the data had to be filtered to that year. The column names also had to be renamed to make the visuals interpretable.
Another challenge in visualizing the data is how to interpret the distribution of the sexes in each age group in Singapore. This distribution generally indicates what stage an economy is in. Most advanced countries have a roughly symmetric distribution of both sexes in each age group. Developing and underdeveloped countries usually show some disparity between the two sexes, with a higher number of men than women.
The ultimate task, visualizing the distribution of the population in each planning area, requires a large amount of data to be shown in a small yet interpretable visual. Singapore has many different planning areas, which makes the task even more challenging because some areas house far fewer people than others.
Q2. a. Column names like “PA”, “SZ” and “Time” were changed to “Planning Area”, “Sub-Zone” and “Year” respectively with the help of R’s “rename” function.
We make an age-sex pyramid to visualize the age distribution in both sexes. It is a graphical illustration of the distribution of the age groups in a population, and it takes the shape of a pyramid when the population is growing.
We make use of a heatmap. When applied to a tabular format, heatmaps are useful for cross-examining multivariate data by placing variables in the columns and observations in the rows and colouring the cells within the table.
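As a rough illustration of the tabular layout a heatmap colours, the sketch below cross-tabulates a small long-format table into a planning-area by age-group matrix using base R's xtabs. The data frame `toy` and its values are made up for illustration and are not taken from the Singapore dataset.

```r
# Toy long-format data (hypothetical values, for illustration only)
toy <- data.frame(
  Planning_Area = c("Area A", "Area A", "Area B", "Area B"),
  Age_Group     = c("0_to_4", "5_to_9", "0_to_4", "5_to_9"),
  Pop           = c(120, 150, 80, 60)
)

# Cross-tabulate: one row per planning area, one column per age group.
# Each cell of this matrix is what a heatmap would colour.
heat_matrix <- xtabs(Pop ~ Planning_Area + Age_Group, data = toy)
print(heat_matrix)
```

Each cell holds the population for one planning area and age group, which is the value mapped to the fill colour in a heatmap.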
Q3. Visualization in R:
This code chunk installs any missing packages and loads everything needed to analyse the population data
packages <- c('tidyverse', 'seriation', 'dendextend', 'plotly', 'ggthemes', 'ggpubr')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
We import the data in the form of a “.csv” file with the “read.csv” function
pop_2011_2019<-read.csv("data/sgp_pop_2011to2019.csv")
We first remove the column we do not require, “TOD” (Type of Dwelling)
pop_2011_2019<-select(pop_2011_2019, -(TOD))
We rename the columns to make the visuals more interpretable, using the “rename” function from dplyr
pop_2011_2019<-rename(pop_2011_2019, Year=Time, Planning_Area=PA, Sub_Zone=SZ, Age_Group=AG)
We subset the dataset to the year 2019 only, with the help of dplyr’s “filter” function
pop_2019<-filter(pop_2011_2019, Year=="2019")
head(pop_2019)
## Planning_Area Sub_Zone Age_Group Sex Pop Year
## 1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males 0 2019
## 2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males 10 2019
## 3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males 10 2019
## 4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males 20 2019
## 5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males 0 2019
## 6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males 0 2019
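The filter above compares Year with the string "2019". A minimal base R sketch, assuming Year is stored as an integer (the actual column type in the source file is not shown here), suggests that the string and numeric comparisons select the same rows, because R coerces the number to character before comparing. The data frame `toy` is hypothetical.

```r
# Toy data with an integer Year column (hypothetical values)
toy <- data.frame(Year = c(2018L, 2019L, 2019L), Pop = c(10, 20, 30))

# Base R equivalents of filter(toy, Year == "2019") and filter(toy, Year == 2019)
by_string <- toy[toy$Year == "2019", ]
by_number <- toy[toy$Year == 2019, ]

# Both subsets contain the same two 2019 rows
print(by_string)
print(identical(by_string, by_number))
```

Comparing against the number 2019 avoids relying on this coercion, which may be the safer choice if the column is known to be numeric.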
We aggregate the data by summing the population for each combination of year, age group, sex and planning area, which collapses the sub-zone rows
pop_2019<-aggregate(Pop~Year+Age_Group+Sex+Planning_Area,data=pop_2019,FUN=sum)
head(pop_2019)
## Year Age_Group Sex Planning_Area Pop
## 1 2019 0_to_4 Females Ang Mo Kio 2660
## 2 2019 05_to_09 Females Ang Mo Kio 3110
## 3 2019 10_to_14 Females Ang Mo Kio 3670
## 4 2019 15_to_19 Females Ang Mo Kio 3890
## 5 2019 20_to_24 Females Ang Mo Kio 4390
## 6 2019 25_to_29 Females Ang Mo Kio 5410
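To make the effect of the aggregation concrete, here is a small sketch on made-up data. The data frame `toy` and its values are hypothetical; only the aggregate call mirrors the one used above.

```r
# Toy data: two sub-zone rows share the same area, age group and sex
toy <- data.frame(
  Planning_Area = c("Area A", "Area A", "Area A", "Area B"),
  Age_Group     = c("0_to_4", "0_to_4", "5_to_9", "0_to_4"),
  Sex           = c("Males", "Males", "Males", "Females"),
  Pop           = c(10, 20, 5, 7)
)

# Sum Pop within each combination of the grouping columns
agg <- aggregate(Pop ~ Planning_Area + Age_Group + Sex, data = toy, FUN = sum)
print(agg)
```

The two "Area A", "0_to_4", "Males" rows collapse into a single row with a population of 30, leaving three rows in total.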
To plot the age-sex pyramid, the population axis has to be centred at 0. Hence, one sex must be plotted on the negative side of the axis. We do this with R’s “ifelse” function, converting the population values to negative if the sex is Male
pop_2019$Pop <- ifelse(pop_2019$Sex == "Males", -1*pop_2019$Pop, pop_2019$Pop)
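The sign flip can be checked on a small example. In this sketch, `flip_males` is a hypothetical helper wrapping the same ifelse expression, and `toy` holds made-up values. Applying the helper twice returns the original values, which is why the same expression can later be reused to restore the male counts.

```r
# Toy data (hypothetical values)
toy <- data.frame(Sex = c("Males", "Females", "Males"), Pop = c(100, 90, 40))

# Hypothetical helper: negate Pop for males, leave females untouched
flip_males <- function(df) {
  df$Pop <- ifelse(df$Sex == "Males", -1 * df$Pop, df$Pop)
  df
}

mirrored <- flip_males(toy)       # males become negative
restored <- flip_males(mirrored)  # a second flip undoes the first

print(mirrored$Pop)
print(restored$Pop)
```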
We plot two bar graphs with the help of ggplot. The subset function extracts each sex from the data frame, and stat=“identity” specifies that the bar heights are taken directly from the y variable in the dataset (here the population) rather than from row counts. We manually specify the scale of the population axis in steps of “20K”, which is 20,000. We then flip the coordinates to create the pyramid chart, with population on the x axis and the categorical age groups on the y axis. We set the theme and change the background colour to make the visual more interpretable and appealing.
p1 <- ggplot(pop_2019, aes(x = Age_Group, y = Pop, fill = Sex)) +
  geom_bar(data = subset(pop_2019, Sex == "Females"), stat = "identity") +
  geom_bar(data = subset(pop_2019, Sex == "Males"), stat = "identity") +
  scale_y_continuous(breaks = seq(-200000, 200000, 20000),
                     labels = paste0(as.character(c(seq(200, 0, -20), seq(20, 200, 20))), "K")) +
  scale_fill_economist() +
  theme_bw() +
  theme(axis.title.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.text.x.top = element_text(size = 12),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  coord_flip()
p1 <- p1 + labs(title = "Population Pyramid of Singapore (2019)", subtitle = "Singapore's case of ageing population is a known fact \nbut the population pyramid puts the problem into perspective. \nThe majority of the population is concentrated between the \nage groups of 25-65.")
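The axis labels in the plot are written out by hand and must stay in step with the breaks. As a sketch of one possible alternative, the labels can be derived from the breaks themselves. The helper `label_thousands` is a hypothetical name introduced here, not part of ggplot2 or of the original analysis.

```r
# Breaks used on the population axis
breaks <- seq(-200000, 200000, 20000)

# Labels as written by hand in the plot code
manual_labels <- paste0(as.character(c(seq(200, 0, -20), seq(20, 200, 20))), "K")

# Hypothetical helper: absolute value in thousands, suffixed with "K"
label_thousands <- function(x) paste0(abs(x) / 1000, "K")
derived_labels <- label_thousands(breaks)

# There is one label per break, and both approaches agree
print(length(breaks) == length(manual_labels))
print(identical(manual_labels, derived_labels))
```

Because the labels argument of scale_y_continuous also accepts a function, a helper like this could be passed directly (labels = label_thousands), so that changing the breaks would not require rewriting the labels.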
We convert the values of Male populations back to positive
pop_2019$Pop <- ifelse(pop_2019$Sex == "Males", -1*pop_2019$Pop, pop_2019$Pop)
To make the heatmap interactive, we add a text column to the dataset and define what the tooltip label should look like, using the other columns as variables
pop_2019 <- pop_2019 %>%
mutate(text = paste0("Area:", Planning_Area, "\n", Pop, " people in Age Group ", Age_Group))
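To see what the tooltip string looks like for a single row, the same paste0 expression can be run on a one-row example. The data frame `toy` and its values are made up for illustration.

```r
# One hypothetical row with the same columns used in the label
toy <- data.frame(Planning_Area = "Area A", Age_Group = "0_to_4", Pop = 120)

# Same label template as in the mutate call, written in base R
toy$text <- paste0("Area:", toy$Planning_Area, "\n",
                   toy$Pop, " people in Age Group ", toy$Age_Group)

# cat() renders the "\n" as a line break, as the tooltip would
cat(toy$text, "\n")
```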
We use geom_raster (a faster special case of geom_tile for tiles of equal size) to create the heatmap. We then pass the ggplot object to plotly’s ggplotly function so that the heatmap becomes interactive.
p3 <- ggplot(data = pop_2019, aes(Age_Group, Planning_Area, fill = Pop, text = text)) +
  scale_x_discrete(expand = c(0, 0)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.text.y = element_text(size = 5)) +
  # geom_raster() has no height aesthetic, so the aes(height = 1.5) mapping is dropped
  geom_raster() +
  labs(title = "Heat map showing Singapore's population by planning area")
p4<-ggplotly(p3, tooltip="text")
p1
p4
Data manipulation is much faster with R, as almost anything is possible by means of code, whereas in Tableau a functionality that is not present in the software cannot be done natively. For example, data cleaning might sometimes need to be done in Excel or Tableau Prep.
R is much more visually customizable because of the large number of developers in the community. For example, the Economist theme would have to be recreated manually in Tableau by judging the colours by eye, whereas in R it is available as part of the “ggthemes” package.
Documenting the code is much easier with the help of R Markdown. In case the code has to be approved or worked on by someone else, R Markdown makes it easy to document and share the file in any commonly used format. With Tableau, users might need to document their work in a Word file and share it separately. Otherwise, the new user might have to trace back the steps by viewing each measure one at a time.