The objective of this dataviz is to visualise the distribution of Singapore’s income inequality levels by planning area in a spatial manner. This allows the user to comprehend a large amount of information within a short period of time to draw meaningful insights on Singapore’s income distribution. The dataviz is intended to help policy-makers within the healthcare sector better enhance healthcare policies relating to Singapore’s population.
knitr::include_graphics("image/sketch.jpg")
knitr::include_graphics("image/leaflet_map.jpg")
The dataviz will incorporate interactivity features to enhance the user experience.
Firstly, the dataviz is represented in a geo-spatial manner to allow users to correlate income distribution to a specific planning region immediately. This is because the user (assuming to be a Singaporean) is likely to already be well-orientated to the major regions in Singapore.
Next, the multiple geo-spatial visualisations are shown within a single view ala the image above to provide an integrated perspective of different measures relating to income inequality. This allows the user to immediately coorelate a specific region’s income inequality with its low, middle and high income population levels.
Users will be able to mouse over the dataviz to gain further insights into each planning area. This includes the total population within the specific planning area, the percentage of low, medium and high income levels and the income inequality level - i.e. represented by the Gini Coefficient.
First, we conduct an exploratory data analysis of Singapore’s income distribution and experiment with different ways to visualise both the income distribution and inequality levels.
We use the Gini Coefficient to represent a specific planning area’s level of income inequality. A Lorenz Curve is often used by the economist discipline to visualise the extent of inequality in a region/country.
It is defined as a ratio with values between 0 and 1: the numerator is the area between the Lorenz curve of the distribution and the uniform distribution line; the denominator is the area under the uniform distribution line.
Here, 0 corresponds to perfect income equality (i.e. everyone in the planning area has the same income) and 1 corresponds to perfect income inequality.
Launch the following R packages
packages <- c("leaflet","tidyverse","readxl", "rgdal","reldist","htmltools","viridis","knitr")
for (p in packages){
if (!require (p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
#full list of packages used
#library("leaflet")
#library("tidyverse")
#library("readxl") for reading of excel files
#library("rgdal") for reading of geojson files
#library ("reldist") for calculation of gini coefficient
#library("htmltools") for visualizing multiple leaflet objects in a single view
#library("viridis") viridis color theme
#library("knitr")
Import aspatial data into R Data Source: https://www.singstat.gov.sg/publications/cop2010/census10_stat_release1
df_income1 = read_excel("data/aspatial/income.xlsx")
Import geospatial data into R Data Source: https://data.gov.sg/dataset/master-plan-2019-subzone-boundary-no-sea
df_map = readOGR(dsn = "data/geospatial/master-plan-2019-subzone-boundary-no-sea.geojson")
## OGR data source with driver: GeoJSON
## Source: "C:\Users\jiyee\Documents\SMU Laptop\Visual_Project\data\geospatial\master-plan-2019-subzone-boundary-no-sea.geojson", layer: "master-plan-2019-subzone-boundary-no-sea"
## with 332 features
## It has 2 fields
## Warning in readOGR(dsn = "data/geospatial/master-plan-2019-subzone-boundary-no-
## sea.geojson"): Z-dimension discarded
#use the code below to check the content of the map variable
#head(df_map@data,10)
Compute the gini coefficient for each planning area.
df_gini1 <- df_income1[-1,-2]
df_gini2 <- df_gini1 %>%
gather(Income_Group, Population, `Below $1,000`:`$8,000 & Over`) %>%
mutate(Mean_Income = ifelse(Income_Group=="Below $1,000", 500,
ifelse(Income_Group=="$1,000 - $1,499", 1250,
ifelse(Income_Group=="$1,500 - $1,999", 1750,
ifelse(Income_Group=="$2,000 - $2,499", 2250,
ifelse(Income_Group=="$2,500 - $2,999", 2750,
ifelse(Income_Group=="$3,000 - $3,999", 3500,
ifelse(Income_Group=="$4,000 - $4,999", 4500,
ifelse(Income_Group=="$5,000 - $5,999", 5500,
ifelse(Income_Group=="$6,000 - $6,999", 6500,
ifelse(Income_Group=="$7,000 - $7,999", 7500,
ifelse(Income_Group=="$8,000 & Over", 8500,0 ))))))))))))
df_gini2 <- df_gini2[order(df_gini2$`Planning Area`),]%>%
group_by(`Planning Area`) %>%
mutate(cum_inc_prop = cumsum(Mean_Income/max(cumsum(Mean_Income))), pop_prop = cumsum(Population)/max(cumsum(Population))) %>%
mutate(per_pop = Population/sum(Population))
ggplot(df_gini2, aes(x = pop_prop, y = cum_inc_prop)) +
geom_line() +
geom_ribbon(aes(ymax = pop_prop, ymin = cum_inc_prop), fill = "steelblue", alpha = 0.2) +
geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
labs(x = "Cumulative proportion of population within planning area",
y = "Cumulative proportion of mean income within planning area",
caption = "Source: Singstats - Singapore Census of Population 2010") +
ggtitle("Inequality in Singapore Planning Areas", "Lorenz curve based on binned decile mean income levels") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_wrap(`Planning Area` ~ .)
df_gini3 <- df_gini2 %>%
group_by(`Planning Area`) %>%
summarise(Gini = gini(Mean_Income, w = per_pop)) %>%
mutate_at(.vars = vars(`Planning Area`), .funs = funs(toupper))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
df_income1 = df_income1[-1,]
df_income2 = df_income1 %>%
mutate(LOW_SES = rowSums(.[3:6])) %>%
mutate(MED_SES = rowSums(.[7:10])) %>%
mutate(HIGH_SES = rowSums(.[11:13])) %>%
select(`Planning Area`, Total, LOW_SES, MED_SES, HIGH_SES) %>%
mutate(`Planning Area` = str_to_upper(`Planning Area`)) %>%
mutate(Low_SES_percent = LOW_SES/Total*100) %>%
mutate(Med_SES_percent = MED_SES/Total*100) %>%
mutate(High_SES_percent = HIGH_SES/Total*100) %>%
mutate_at(vars(Low_SES_percent, Med_SES_percent, High_SES_percent), funs(round(.,0))) %>%
select(`Planning Area`, `Total`, `Low_SES_percent`,`Med_SES_percent`,`High_SES_percent`)
df_income2 <- left_join(df_income2, df_gini3, by = c("Planning Area" = "Planning Area"))
We observe that typical heartland areas like Outram, Kallang, Bukit Merah have a more unequal income distribution (i.e. higher levels of inequality) as compared to central areas like River Valley, Tanglin and Singapore River.
This could likely be due to the extent of income disparity between the high income and low income populations within these areas.
df_income2 %>%
mutate(`Planning Area` = fct_reorder(`Planning Area`, Gini)) %>%
ggplot(aes(x = Gini, y = `Planning Area`, colour = `Planning Area`, label = `Planning Area`)) +
scale_x_continuous(breaks = seq(0, 0.5, by = 0.05)) +
geom_text(size = 2) +
labs(colour = "", y = "", x = "Gini coefficient",
caption = "Source: Singapore Census of Population 2010") +
ggtitle("Income Inequality by Planning Area",
"Singapore's Working Population in 2010") +
theme(legend.position = "none")
Next, we plot a static dataviz of Singapore’s income distribution grouped by planning areas.
We create a new variable “stacked_bar” that aggregates the percentage of SES levels in the same column.
Using ggplot and the viridis theme, we create a stacked bar chart of the percentage of SES levels grouped by planning areas.
We observe that areas with low gini coefficient (i.e. River Valley, Tanglin, Singapore River) tend to have higher levels of high SES percentage. This is true conversely for the high gini coefficient areas.
stacked_bar <- df_income2 %>%
select(`Planning Area`, `Low_SES_percent`, `Med_SES_percent`, `High_SES_percent`) %>%
gather(Per_Label,Percentage, Low_SES_percent:High_SES_percent) %>%
mutate(`Planning Area`=factor(`Planning Area`,levels=rev(sort(unique(`Planning Area`)))))
ggplot(stacked_bar, aes(x=`Planning Area`, y=Percentage, fill= factor(Per_Label, levels = c('High_SES_percent','Med_SES_percent', 'Low_SES_percent')))) +
geom_bar(position = "stack", stat="identity") +
coord_flip()+
scale_color_viridis(discrete = TRUE, option = "D")+
scale_fill_viridis(discrete = TRUE)+
guides(fill=guide_legend(title="Percentage by SES Levels"))+
labs(x="",y="", title="Singapore's Income Distribution by Planning Area")
df_map$planning_area = sub(".*PLN_AREA_N</th> <td> *(.*?) *</td>.*", "\\1",df_map$Description)
df_map@data = left_join(df_map@data, df_income2, by = c("planning_area" = "Planning Area"))
df_map@data[is.na(df_map@data)] <- 0
bins <- c(0,0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45)
pal <- colorNumeric("plasma", domain = df_map@data$Gini, na.color = "#808080")
labels1 <- sprintf("<strong>%s</strong><br/>Total Population for Planning Area: %d<br/>Gini Coefficient: %.2f<br/> Proportion of High SES Population: %s %%<br/> Proportion of Medium SES Population: %s %%<br/> Proportion of Low SES Population: %s %%</sup>", df_map@data$planning_area,df_map@data$Total, df_map@data$Gini, df_map@data$High_SES_percent,df_map@data$Med_SES_percent,df_map@data$Low_SES_percent) %>%lapply(htmltools::HTML)
a <- leaflet(df_map) %>%
setView(lng = 103.80835, lat = 1.360365,zoom=11) %>%
addTiles(attribution = "Data Source: Singapore Census of Population 2010") %>%
addProviderTiles(providers$OpenStreetMap) %>%
addPolygons(fillColor=~pal(df_map@data$Gini), stroke=FALSE,color = "white", smoothFactor = 0.3, fillOpacity = 1,dashArray =1,
highlight=highlightOptions(
weight=1,
color="#666",
dashArray =3,
fillOpacity=0.7,
bringToFront=TRUE),
label = labels1,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"
)
)%>%
leaflet::addLegend(pal = pal, values = df_map@data$Gini, opacity = 0.7, title = "Gini Coefficient (2010)", position = "bottomright")
a
labels2 <- sprintf("<strong>%s</strong><br/>Total Population for Planning Area: %d<br/>Gini Coefficient: %.2f<br/> Proportion of High SES Population: %s %%<br/> Proportion of Medium SES Population: %s %%<br/> Proportion of Low SES Population: %s %%</sup>", df_map@data$planning_area,df_map@data$Total, df_map@data$Gini, df_map@data$High_SES_percent,df_map@data$Med_SES_percent,df_map@data$Low_SES_percent) %>%lapply(htmltools::HTML)
pal2 <- colorNumeric("viridis", domain = df_map@data$High_SES_percent)
b <-leaflet(df_map) %>%
addTiles() %>%
setView(103.80835, 1.360365,zoom=11) %>%
addTiles(attribution = "Data Source: Singapore Census of Population 2010") %>%
addPolygons(fillColor=~pal2(df_map@data$High_SES_percent),
stroke = FALSE,
weight = 1,
opacity = 3,
dashArray = "1",
color = "white",
fillOpacity = 3,
label = labels2,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"
)
)%>%
leaflet::addLegend(pal = pal2, values = df_map@data$High_SES_percent, opacity = 2, title = "Percentage of High SES (2010)", position = "bottomright")
b
We create the third leaflet object “c” that visualises Singapore’s medium income distribution by planning area.
As we are using colorNumeric function to create the color gradient, we do not use the same bin approach as in object “a”.
We use the color scheme “YlorRd” to visualise the different levels of income inequality. Areas with a higher proportion of medium SES populations are shaded with a darker orange shade.
pal3 <- colorNumeric("YlOrRd",domain = df_map@data$Med_SES_percent)
c <-leaflet(df_map) %>%
addTiles() %>%
setView(103.80835, 1.360365,zoom=11) %>%
addTiles(attribution = "Data Source: Singapore Census of Population 2010") %>%
addPolygons(fillColor=~pal3(df_map@data$Med_SES_percent),
stroke=TRUE,
weight = 1,
opacity = 3,
dashArray = "1",
color = "orange",
fillOpacity = 3,
label = labels2,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"
)
)%>%
leaflet::addLegend(pal = pal3, values = df_map@data$Med_SES_percent, opacity = 2, title = "Percentage of Medium SES (2010)", position = "bottomright")
c
pal4 <- colorNumeric("RdPu",domain = df_map@data$Low_SES_percent)
d <-leaflet(df_map) %>%
addTiles() %>%
setView(103.80835, 1.360365,zoom=11) %>%
addTiles(attribution = "Data Source: Singapore Census of Population 2010") %>%
addPolygons(fillColor=~pal4(df_map@data$Low_SES_percent),
stroke = TRUE,
weight = 1,
opacity = 3,
color = "pink",
dashArray = "1",
fillOpacity = 3,
label = labels2,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"
)
)%>%
leaflet::addLegend(pal = pal4, values = df_map@data$Low_SES_percent, opacity = 2, title = "Percentage of Low SES (2010)", position = "bottomright")
d
Given the narrow range of Singapore’s gini coefficient (or gini index if multiplied by 100), it can be concluded that the income disparity between planning areas is not too large. This confers some form of homogeneity in Singapore’s income distribution spatially although differences still exist.
One key difference is that areas with higher proportions of high SES levels tend to have lower levels of income inequality (or greater levels of income equality). This means that the income levels of high SES areas tend to be more similar (i.e. rich people congregating in a concentrated geographical cluster) as compared to areas with higher proportion of medium and low SES levels. In other words, the extent of intra-region income disparity in high SES areas is lower as compared to medium and low SES areas.
Areas with higher proportion of medium SES levels tend to be concentrated in sub-urban areas just outside the high SES cluster whereas the north-eastern part of Singapore have a higher proportion of low SES levels.
Government policies could measure the change in geo-spatial distribution of income levels and intra-regional inequality measures over a period of time to determine whether certain policies have been working out or not. For example, whether a policy to ensure equal healthcare access within areas of lower income levels do work or not. This is the stated aim of our project.
Overall, whilst the intra-region income disparity in Singapore isnt too big (confined within a narrow gini range), there are clusters of low, medium and high SES levels. They congregate in the central CBD area, the sub-urban areas outside the CBD and in the north-eastern parts of Singapore respectively.
leaflet_grid <-
tagList(
tags$table(width = "150%",
tags$tr(
tags$td(a),
tags$td(b)
),
tags$tr(
tags$td(c),
tags$td(d)
)
)
)
browsable(leaflet_grid)
An interactive plot contains more information within the same data plot and allows the user to develop more meaningful insights as compared to a static plot. For instance, the user is able to quickly understand not just the distribution of income levels but brushing over, he/she can correlate the distribution measure with other measures such as total population levels, gini coefficient etc.
An interactive plot provides more parameters that allows the user to “play around” with the data to form his/her own conclusions based on different approaches. For example, the user could view the same income distribution issue from different perspectives (low, medium, high SES levels). Whereas for static plots, the information is usually hard-coded and does not allow for simulation of different scenarios.
An integrated view of interactive plots allow on-the-fly comparative analysis between plots of different measures. For example, my dataviz allows the user to visualise the proportion of low, medium and high income percentages in the same plot. This provides a deeper and more meaningful analysis instead of plotting four plots separately.