The data set contains population data from 2011 to 2019. Its columns are planning area, subzone, age group, gender, type of dwelling, population count and year. To visualise the population by planning area in 2019, we first need to extract the 2019 records into a separate data frame, which we then use to show the demographic structure in a heatmap. This is our first data and design challenge.
In order to show the Young, Active and Old segments of the population, we need to combine the age groups provided in the data into these three broader cohorts. This is a tedious but necessary step to prepare the data for visualisation. This is our second data and design challenge.
To visualise the demographic pattern across all planning areas, we need to aggregate the data by planning area. Each planning area contains multiple subzones, so the subzone records have to be summed, and because the dataset is fairly large this takes some care. This is our third data and design challenge.
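The cohort grouping described in the second challenge can be sketched in base R as follows. This is only an illustration on made-up rows: the age-group label format ("0_to_4", "90_and_over") and the cut-offs (Young 0-24, Active 25-64, Old 65 and over) are assumptions, not values taken from the data set.

```r
# Toy rows standing in for the real data (values are made up).
toy <- data.frame(
  PA  = c("Bedok", "Bedok", "Bedok", "Sengkang", "Sengkang", "Sengkang"),
  AG  = c("0_to_4", "30_to_34", "70_to_74", "5_to_9", "40_to_44", "90_and_over"),
  Pop = c(100, 250, 180, 300, 400, 20)
)

# Lower bound of each age band, e.g. "30_to_34" -> 30 (assumed label format).
lower <- as.numeric(sub("^([0-9]+).*$", "\\1", toy$AG))

# Assumed cut-offs: Young 0-24, Active 25-64, Old 65 and over.
toy$Cohort <- cut(lower,
                  breaks = c(-Inf, 24, 64, Inf),
                  labels = c("Young", "Active", "Old"))

# Sum the population for each planning area and cohort.
cohort_totals <- aggregate(Pop ~ PA + Cohort, data = toy, FUN = sum)
print(cohort_totals)
```

Deriving the cohort from the numeric lower bound avoids listing every age-group label by hand, at the cost of depending on the label format.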
ggplot2
Overview
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. ggplot2 will help us visualise the demographics clearly, while the data preparation is done in the steps that follow.
Installation
The easiest way to get ggplot2 is to install the whole tidyverse:
install.packages("tidyverse")
Alternatively, install just ggplot2:
install.packages("ggplot2")
Or the development version from GitHub:
install.packages("devtools")
devtools::install_github("tidyverse/ggplot2")
Usage
It's hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).
library(ggplot2)
ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
This code chunk checks whether each required package is installed, installs any that are missing, and loads them all into the R session, so we do not have to call library() for each package separately.
# Install any packages that are missing, then load them all
packages <- c('tidyverse', 'ggridges', 'aggregation', 'plotly', 'heatmaply', 'reshape2', 'plyr')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
We will read the data using the read_csv() function from readr. Because the CSV file sits in the working directory, a relative file name is enough and we do not have to key in the full directory path every time.
pop_data <- read_csv("respopagesextod2011to2019.csv")
## Parsed with column specification:
## cols(
## PA = col_character(),
## SZ = col_character(),
## AG = col_character(),
## Sex = col_character(),
## TOD = col_character(),
## Pop = col_double(),
## Time = col_double()
## )
As we need to show the demographic structure of the population by age cohort and planning area in 2019, we will create a new dataframe. This dataframe contains all the variables, restricted to the 2019 records. The code below creates it.
data_2019 <- pop_data %>%
  select(PA, SZ, AG, Sex, TOD, Pop, Time) %>%
  filter(Time == 2019)  # Time is numeric, so compare it with a number
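We start by aggregating the data. First we sum the population by planning area, then we drop the columns that the heatmap does not need.
x <- data_2019$PA
y <- data_2019$AG
z <- data_2019$Pop
# Total population per planning area
z_new <- aggregate(z, by = list(x), FUN = sum)
# Keep only PA, AG and Pop (drops columns 2, 4, 5 and 7: SZ, Sex, TOD and Time)
data_2019new <- data_2019[-c(2, 4, 5, 7)]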
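Then we sum the population for each combination of planning area and age group.
x <- data_2019new$PA
y <- data_2019new$AG
z <- data_2019new$Pop
# One row per (planning area, age group) pair; result columns are Group.1, Group.2 and x
z_new <- aggregate(z, by = list(x, y), FUN = sum)
df <- z_new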
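The same two-way sum can also be written with the formula interface of aggregate(), which keeps the original column names instead of producing Group.1, Group.2 and x. The sketch below runs on a few made-up rows rather than the real data set, so the names and numbers in it are purely illustrative.

```r
# Toy rows: two subzones per planning area (values are made up).
toy <- data.frame(
  PA  = c("Bedok", "Bedok", "Bedok", "Sengkang", "Sengkang"),
  SZ  = c("Subzone A", "Subzone B", "Subzone A", "Subzone C", "Subzone D"),
  AG  = c("0_to_4", "0_to_4", "5_to_9", "0_to_4", "0_to_4"),
  Pop = c(10, 20, 5, 30, 40)
)

# Sum Pop over subzones for every planning area / age group pair.
by_area_age <- aggregate(Pop ~ PA + AG, data = toy, FUN = sum)
print(by_area_age)
```

With this form the later renaming step would only be needed if different column names are preferred.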
The heatmap() function is provided natively in R, in the stats package. It draws a high-quality heatmap of a numeric matrix and offers statistical tools to normalise the input data, run a clustering algorithm and visualise the result with dendrograms. We will create a heatmap of the population by planning area and age group.
names(df)<-c("Area","AGroup","Popul")
names(df)
## [1] "Area" "AGroup" "Popul"
# Reshape to one row per planning area and one column per age group,
# then move Area into the row names (avoids setting row names on a tibble)
df1 <- df %>%
  pivot_wider(names_from = AGroup, values_from = Popul) %>%
  column_to_rownames("Area")
matrix <- data.matrix(df1)
heatmap <- heatmap(matrix,Rowv=NA, Colv=NA)
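As an alternative to reshaping and then converting, the area by age-group matrix can be built in one step with base R's tapply(). This is a sketch on made-up values that reuses the column names Area, AGroup and Popul from above; it is not the approach used in this report.

```r
# Toy long-format data with the same column names as df (values are made up).
long <- data.frame(
  Area   = c("Bedok", "Bedok", "Sengkang", "Sengkang"),
  AGroup = c("0_to_4", "5_to_9", "0_to_4", "5_to_9"),
  Popul  = c(30, 5, 70, 45)
)

# Rows become planning areas, columns become age groups.
m <- with(long, tapply(Popul, list(Area, AGroup), sum))
print(m)
```

The result is already a numeric matrix with row and column names, so it could be passed to heatmap(m, Rowv = NA, Colv = NA) directly.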
heatmaply is an R package for easily creating interactive cluster heatmaps that can be shared online as a stand-alone HTML file. Interactivity includes a tooltip display of values when hovering over cells, as well as the ability to zoom in to specific sections of the figure from the data matrix, the side dendrograms, or annotated labels.
A population pyramid, also called an “age-gender-pyramid”, is a graphical illustration that shows the distribution of various age groups in a population, which forms the shape of a pyramid when the population is growing. In this case, we will be visualising the different age groups of Males and Females of Singapore’s population.
heatmaply(matrix,
          Colv = NA,
          seriate = "none",
          k_row = 6,
          colors = Blues,
          margins = c(NA, 200, 60, NA),
          fontsize_row = 5,
          fontsize_col = 5,
          main = "Pop Area",
          xlab = "Age Group",
          ylab = "Area"
)
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
## dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
## Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/modules/R_X11.so
## Reason: image not found
## Warning: 'heatmap' objects don't have these attributes: 'showlegend'
## Valid attributes include:
## 'type', 'visible', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'z', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'hovertext', 'transpose', 'xtype', 'ytype', 'zsmooth', 'connectgaps', 'xgap', 'ygap', 'zhoverformat', 'hovertemplate', 'zauto', 'zmin', 'zmax', 'zmid', 'colorscale', 'autocolorscale', 'reversescale', 'showscale', 'colorbar', 'coloraxis', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'zsrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'hovertemplatesrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
names(data_2019)
## [1] "PA" "SZ" "AG" "Sex" "TOD" "Pop" "Time"
datapy<-select(data_2019,-PA,-SZ,-TOD,-Time)
names(datapy)<-c("Age","Gender","Pop")
a <- datapy$Age
b <- datapy$Gender
c <- datapy$Pop
y_new = aggregate(c, by=list(a,b), FUN=sum)
df2<- y_new
names(df2)
## [1] "Group.1" "Group.2" "x"
names(df2)<-c("Age","Gender","Pop")
names(df2)
## [1] "Age" "Gender" "Pop"
df3<-df2 %>%
pivot_wider(names_from = Gender, values_from = Pop)
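df <- df3
# aggregate() sorts the gender groups alphabetically, so after pivot_wider()
# the female column comes before the male column (check names(df3) to confirm)
names(df) <- c("Age", "Female", "Male")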
cols <- 2:3
df[,cols] <- apply(df[,cols], 2, function(x) as.numeric(as.character(gsub(",", "", x))))
df <- df[df$Age != 'Total', ]
df$Male <- -1 * df$Male
df$Age <- factor(df$Age, levels = df$Age, labels = df$Age)
df.melt <- melt(df,
value.name='Population',
variable.name = 'Gender',
id.vars='Age' )
df4<-df.melt
n1 <- ggplot(df4, aes(x = Age, y = Population, fill = Gender)) +
  # One layer is enough: male counts are already negative, so they extend to the left
  geom_bar(stat = "identity") +
  # Show absolute counts on the axis instead of negative values
  scale_y_continuous(labels = function(v) format(abs(v), big.mark = ",", scientific = FALSE, trim = TRUE)) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  theme_bw()
n1
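One thing worth checking in the pyramid is the order of the age bands. aggregate() sorts its groups alphabetically, so a label such as "5_to_9" would sort after "10_to_14". The sketch below shows one way to order the factor levels by the lower bound of each band instead. It assumes the labels start with the lower age (for example "0_to_4" and "90_and_over"), which is an assumption about the label format rather than something shown above.

```r
# Age-band labels in alphabetical order (assumed label format).
ages <- c("0_to_4", "10_to_14", "15_to_19", "5_to_9", "90_and_over")

# Extract the leading number of each label, e.g. "10_to_14" -> 10.
lower <- as.numeric(sub("^([0-9]+).*$", "\\1", ages))

# Use the numeric order of the lower bounds as the factor level order.
age_factor <- factor(ages, levels = ages[order(lower)])
print(levels(age_factor))
```

Applied to the Age column before plotting, a factor ordered this way would make the bars run from the youngest to the oldest band.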
In the heatmap, we can see how the age groups are distributed across the planning areas. Sengkang, Woodlands, Punggol and Jurong West have large numbers of young and active people. Sengkang and Punggol have the most children aged 0-14, and Sengkang also has a large number of people aged 30-44. This suggests that Sengkang is popular with young families made up of working adults and small children.
Bedok, on the other hand, has the largest number of older people, with ages ranging from 55 to 90 and over. One possible explanation is that people are less inclined to move when they are older, so many of these residents have probably lived in Bedok for a long time. It may also suggest that Bedok is an attractive area to settle in after retirement or in the final years of a career.
In the population pyramid, we can see that the majority of the population falls in the active age groups, i.e. 25 to 64, for males as well as females. This bodes well for Singapore, as a large share of its population is of working age, which is consistent with its strong economy and high quality of life.
There are a number of visualisation libraries for R and extensive online resources on how to use them, such as the R-bloggers guide. These resources give plenty of examples to draw on when building visuals.
R code is reproducible and easy to export into various presentation formats, including PDFs and full websites. This is important because Tableau has more limited export options than R for visualisations.
We can also visualise data in 3D models and multi-panel charts in R, which Tableau does not support as readily.
As data scientists, or whatever the equivalent job title, we are expected to analyse data statistically, and that includes visualisation. Dynamic (reproducible) documents are part of the R platform, so it is important that we save our code as a project for future changes. When the data changes or is updated, we simply re-run the code and regenerate the results, including the visualisations, with little or no change to the code. There is no need to upload the data again in Tableau and redo the visualisation. We can produce such dynamic documents in R with LaTeX and knitr for PDF documents and slides, or with Markdown for HTML presentations.