1 Proposed Data Visualisation

1.1 Objective

The objective of this dataviz is to visualise the distribution of Singapore’s income inequality levels by planning area in a spatial manner. This allows the user to comprehend a large amount of information within a short period of time to draw meaningful insights on Singapore’s income distribution. The dataviz is intended to help policy-makers within the healthcare sector better enhance healthcare policies relating to Singapore’s population.

1.2 Proposed Design

knitr::include_graphics("image/sketch.jpg")

2 Suggestions to enhance user experience

knitr::include_graphics("image/leaflet_map.jpg")

The dataviz will incorporate interactivity features to enhance the user experience.

Firstly, the dataviz is represented in a geo-spatial manner to allow users to correlate income distribution to a specific planning region immediately. This is because the user (assuming to be a Singaporean) is likely to already be well-orientated to the major regions in Singapore.
Next, the multiple geo-spatial visualisations are shown within a single view ala the image above to provide an integrated perspective of different measures relating to income inequality. This allows the user to immediately coorelate a specific region’s income inequality with its low, middle and high income population levels.
Users will be able to mouse over the dataviz to gain further insights into each planning area. This includes the total population within the specific planning area, the percentage of low, medium and high income levels and the income inequality level - i.e. represented by the Gini Coefficient.

3 Step-by-step description

3.1 Exploratory Data Analysis

First, we conduct an exploratory data analysis of Singapore’s income distribution and experiment with different ways to visualise both the income distribution and inequality levels.

We use the Gini Coefficient to represent a specific planning area’s level of income inequality. A Lorenz Curve is often used by the economist discipline to visualise the extent of inequality in a region/country.

It is defined as a ratio with values between 0 and 1: the numerator is the area between the Lorenz curve of the distribution and the uniform distribution line; the denominator is the area under the uniform distribution line.
Here, 0 corresponds to perfect income equality (i.e. everyone in the planning area has the same income) and 1 corresponds to perfect income inequality.

3.1.1 Data Preparation (as part of EDA)

Launch the following R packages

packages <- c("leaflet","tidyverse","readxl", "rgdal","reldist","htmltools","viridis","knitr")

for (p in packages){
  if (!require (p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

#full list of packages used
#library("leaflet")
#library("tidyverse")
#library("readxl") for reading of excel files 
#library("rgdal") for reading of geojson files 
#library ("reldist") for calculation of gini coefficient 
#library("htmltools") for visualizing multiple leaflet objects in a single view
#library("viridis") viridis color theme
#library("knitr")

Import aspatial data into R Data Source: https://www.singstat.gov.sg/publications/cop2010/census10_stat_release1

df_income1 = read_excel("data/aspatial/income.xlsx")

Import geospatial data into R Data Source: https://data.gov.sg/dataset/master-plan-2019-subzone-boundary-no-sea

df_map = readOGR(dsn = "data/geospatial/master-plan-2019-subzone-boundary-no-sea.geojson")

## OGR data source with driver: GeoJSON 
## Source: "C:\Users\jiyee\Documents\SMU Laptop\Visual_Project\data\geospatial\master-plan-2019-subzone-boundary-no-sea.geojson", layer: "master-plan-2019-subzone-boundary-no-sea"
## with 332 features
## It has 2 fields

## Warning in readOGR(dsn = "data/geospatial/master-plan-2019-subzone-boundary-no-
## sea.geojson"): Z-dimension discarded

#use the code below to check the content of the map variable
#head(df_map@data,10)

Compute the gini coefficient for each planning area.

1. Define the variable df_gini1 as the key variable for gini coefficent.
1. Compute a new column “Mean_Income” that takes the mean of each income group.

df_gini1 <- df_income1[-1,-2]

df_gini2 <- df_gini1 %>% 
  gather(Income_Group, Population, `Below $1,000`:`$8,000 & Over`) %>%
  mutate(Mean_Income = ifelse(Income_Group=="Below $1,000", 500, 
                                          ifelse(Income_Group=="$1,000 - $1,499", 1250, 
                                                 ifelse(Income_Group=="$1,500 - $1,999", 1750, 
                                                               ifelse(Income_Group=="$2,000 - $2,499", 2250, 
                                                                      ifelse(Income_Group=="$2,500 - $2,999", 2750, 
                                                                             ifelse(Income_Group=="$3,000 - $3,999", 3500, 
                                                                                    ifelse(Income_Group=="$4,000 - $4,999", 4500, 
                                                                                           ifelse(Income_Group=="$5,000 - $5,999", 5500, 
                                                                                                  ifelse(Income_Group=="$6,000 - $6,999", 6500, 
                                                                                                                ifelse(Income_Group=="$7,000 - $7,999", 7500, 
                                                                                                                       ifelse(Income_Group=="$8,000 & Over", 8500,0 ))))))))))))

1. Using the Mean_Income column, compute the cumulative proportion of population and income levels of each planning area. I use the following group_by and mutate functions to do so.

df_gini2 <- df_gini2[order(df_gini2$`Planning Area`),]%>%
  group_by(`Planning Area`) %>%
  mutate(cum_inc_prop = cumsum(Mean_Income/max(cumsum(Mean_Income))), pop_prop = cumsum(Population)/max(cumsum(Population))) %>%
  mutate(per_pop = Population/sum(Population))

1. Plot the Lorenz Curve for each planning area using the following code chunk. We can observe that central areas of Singapore like Bukit Timah, River Valley, Singapore River and Tanglin have an inverted Lorenz Curve, likely because the income distribution within each of this planning area has a very narrow range - i.e. most of the residents staying there tend to have higher incomes.

ggplot(df_gini2, aes(x = pop_prop, y = cum_inc_prop)) +
  geom_line() +
  geom_ribbon(aes(ymax = pop_prop, ymin = cum_inc_prop), fill = "steelblue", alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(x = "Cumulative proportion of population within planning area",
       y = "Cumulative proportion of mean income within planning area", 
       caption = "Source: Singstats - Singapore Census of Population 2010") +
  ggtitle("Inequality in Singapore Planning Areas", "Lorenz curve based on binned decile mean income levels") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  facet_wrap(`Planning Area` ~ .)

1. Prepare dataframe df_gini3 that collapses all the planning area into its corresponding gini coeffcient. We compute the coefficient by using the reldist package.
refer to link for more info on reldist: https://cran.r-project.org/web/packages/reldist/reldist.pdf

df_gini3 <- df_gini2 %>%
  group_by(`Planning Area`) %>%
  summarise(Gini = gini(Mean_Income, w = per_pop)) %>%
  mutate_at(.vars = vars(`Planning Area`), .funs = funs(toupper))

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

1. Next, using the original dataframe df_income1, we plot the percentage of low, medium and high income levels by creating a new variable df_income2. We use the formal term “Social Economic Status” (SES) to represent each income level.
- High SES: “$6,000 - above $8,000”
- Medium SES: “$3,000 - $5,999”
- Low SES: “$0 - $2,499”

df_income1 = df_income1[-1,]

df_income2 = df_income1 %>%
  mutate(LOW_SES = rowSums(.[3:6])) %>%
  mutate(MED_SES = rowSums(.[7:10])) %>%
  mutate(HIGH_SES = rowSums(.[11:13])) %>%
  select(`Planning Area`, Total, LOW_SES, MED_SES, HIGH_SES) %>%
  mutate(`Planning Area` = str_to_upper(`Planning Area`)) %>%
  mutate(Low_SES_percent = LOW_SES/Total*100) %>%
  mutate(Med_SES_percent = MED_SES/Total*100) %>%
  mutate(High_SES_percent = HIGH_SES/Total*100) %>%
  mutate_at(vars(Low_SES_percent, Med_SES_percent, High_SES_percent), funs(round(.,0))) %>%
  select(`Planning Area`, `Total`, `Low_SES_percent`,`Med_SES_percent`,`High_SES_percent`)

1. Conduct a left join of the variable containing the gini coefficient into the main data frame with percentages of income levels.

df_income2 <- left_join(df_income2, df_gini3, by = c("Planning Area" = "Planning Area"))

3.1.2 Static Data Visualisation (part of EDA)

1. Plot a snapshot of each planning area ordered by it’s gini coefficient in descending order.
- We observe that typical heartland areas like Outram, Kallang, Bukit Merah have a more unequal income distribution (i.e. higher levels of inequality) as compared to central areas like River Valley, Tanglin and Singapore River.
- This could likely be due to the extent of income disparity between the high income and low income populations within these areas.

df_income2 %>%
  mutate(`Planning Area` = fct_reorder(`Planning Area`, Gini)) %>%
  ggplot(aes(x = Gini, y = `Planning Area`, colour = `Planning Area`, label = `Planning Area`)) +
  scale_x_continuous(breaks = seq(0, 0.5, by = 0.05)) +
  geom_text(size = 2) +
  labs(colour = "", y = "", x = "Gini coefficient",
       caption = "Source: Singapore Census of Population 2010") +
  ggtitle("Income Inequality by Planning Area",
          "Singapore's Working Population in 2010") + 
  theme(legend.position = "none")

1. Next, we plot a static dataviz of Singapore’s income distribution grouped by planning areas.
  - We create a new variable “stacked_bar” that aggregates the percentage of SES levels in the same column.
  - Using ggplot and the viridis theme, we create a stacked bar chart of the percentage of SES levels grouped by planning areas.
  - We observe that areas with low gini coefficient (i.e. River Valley, Tanglin, Singapore River) tend to have higher levels of high SES percentage. This is true conversely for the high gini coefficient areas.

stacked_bar <- df_income2 %>%
  select(`Planning Area`, `Low_SES_percent`, `Med_SES_percent`, `High_SES_percent`) %>%
  gather(Per_Label,Percentage, Low_SES_percent:High_SES_percent) %>%
  mutate(`Planning Area`=factor(`Planning Area`,levels=rev(sort(unique(`Planning Area`)))))

ggplot(stacked_bar, aes(x=`Planning Area`, y=Percentage, fill= factor(Per_Label, levels = c('High_SES_percent','Med_SES_percent', 'Low_SES_percent')))) + 
  geom_bar(position = "stack", stat="identity") + 
  coord_flip()+
  scale_color_viridis(discrete = TRUE, option = "D")+
  scale_fill_viridis(discrete = TRUE)+
  guides(fill=guide_legend(title="Percentage by SES Levels"))+
  labs(x="",y="", title="Singapore's Income Distribution by Planning Area")

3.2 Geospatial Analysis by Leaflet

We are now ready to proceed with developing our geospatial analysis to enable the user to visualise income inequality and distribution within a single view.
Firstly, we transform the geojson variable df_map by joining the coordinate columns with the gini and SES percentage columns.
Thereafter, we impute missing values with 0 to ensure that the color gradient can plot even with missing values.

df_map$planning_area = sub(".*PLN_AREA_N</th> <td> *(.*?) *</td>.*", "\\1",df_map$Description)

df_map@data = left_join(df_map@data, df_income2, by = c("planning_area" = "Planning Area"))

df_map@data[is.na(df_map@data)] <- 0

We create the first leaflet object “a” that visualises Singapore’s income inequality by planning area.
In “bins”, we create different bins to guide the colour gradient in categorising the gini coefficient of each planning area.
In “labels1”, we create the specific labels that appear when the user mouses over each planning area.
Thereafter, we create the leaflet object “a” by combining leaflet, bins and labels1 in the same code chunk.
We use the color scheme “plasma” to visualise the different levels of income inequality. Areas with high income inequality have a lighter yellow shade while lower areas have a darker blue shade.

3.2.1 Singapore’s Income Inequality by Planning Area

bins <- c(0,0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45)

pal <- colorNumeric("plasma", domain = df_map@data$Gini, na.color = "#808080")

labels1 <- sprintf("<strong>%s</strong><br/>Total Population for Planning Area: %d<br/>Gini Coefficient: %.2f<br/> Proportion of High SES Population: %s %%<br/> Proportion of Medium SES Population: %s %%<br/> Proportion of Low SES Population: %s %%</sup>", df_map@data$planning_area,df_map@data$Total, df_map@data$Gini, df_map@data$High_SES_percent,df_map@data$Med_SES_percent,df_map@data$Low_SES_percent) %>%lapply(htmltools::HTML)

a <- leaflet(df_map) %>%
  setView(lng = 103.80835, lat = 1.360365,zoom=11) %>%
  addTiles(attribution = "Data Source: Singapore Census of Population 2010") %>%
  addProviderTiles(providers$OpenStreetMap) %>%
  addPolygons(fillColor=~pal(df_map@data$Gini), stroke=FALSE,color = "white", smoothFactor = 0.3, fillOpacity = 1,dashArray =1,
              highlight=highlightOptions(
                weight=1,
                color="#666",
                dashArray =3,
                fillOpacity=0.7,
                bringToFront=TRUE),
              label = labels1,
              labelOptions = labelOptions(
              style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "15px",
                direction = "auto"
              )
      )%>%
      leaflet::addLegend(pal = pal, values = df_map@data$Gini, opacity = 0.7,  title = "Gini Coefficient (2010)", position = "bottomright")

a

We create the second leaflet object “b” that visualises Singapore’s high income distribution by planning area.
As we are using colorNumeric function to create the color gradient, we do not use the same bin approach as in object “a”. The labels2 code chunk is the same as in labels1.
We use the color scheme “viridis” to visualise the different levels of income inequality. Areas with higher income distribution have a lighter yellow green shade while lower areas have a darker blue shade.
We adopt a similar color scheme in objects “a” and “b” to allow the user to quickly compare between regions of high income inequality and high SES areas, to understand that they are actually not the same correlation.

3.2.2 Singapore’s High SES Areas by Planning Area

labels2 <- sprintf("<strong>%s</strong><br/>Total Population for Planning Area: %d<br/>Gini Coefficient: %.2f<br/> Proportion of High SES Population: %s %%<br/> Proportion of Medium SES Population: %s %%<br/> Proportion of Low SES Population: %s %%</sup>", df_map@data$planning_area,df_map@data$Total, df_map@data$Gini, df_map@data$High_SES_percent,df_map@data$Med_SES_percent,df_map@data$Low_SES_percent) %>%lapply(htmltools::HTML)

pal2 <- colorNumeric("viridis", domain = df_map@data$High_SES_percent)


b <-leaflet(df_map) %>% 
  addTiles() %>%
  setView(103.80835, 1.360365,zoom=11) %>%
  addTiles(attribution = "Data Source: Singapore Census of Population 2010") %>%
  addPolygons(fillColor=~pal2(df_map@data$High_SES_percent),
              stroke = FALSE, 
              weight = 1,
              opacity = 3,
              dashArray = "1",
              color = "white",
              fillOpacity = 3,
              label = labels2, 
              labelOptions = labelOptions(
              style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "15px",
                direction = "auto"
              )
      )%>%
      leaflet::addLegend(pal = pal2, values = df_map@data$High_SES_percent, opacity = 2, title = "Percentage of High SES (2010)", position = "bottomright")

b

We create the third leaflet object “c” that visualises Singapore’s medium income distribution by planning area.
As we are using colorNumeric function to create the color gradient, we do not use the same bin approach as in object “a”.
We use the color scheme “YlorRd” to visualise the different levels of income inequality. Areas with a higher proportion of medium SES populations are shaded with a darker orange shade.

3.2.3 Singapore’s Medium SES Areas by Planning Area

pal3 <- colorNumeric("YlOrRd",domain = df_map@data$Med_SES_percent)

c <-leaflet(df_map) %>% 
  addTiles() %>%
  setView(103.80835, 1.360365,zoom=11) %>%
  addTiles(attribution = "Data Source: Singapore Census of Population 2010") %>%
  addPolygons(fillColor=~pal3(df_map@data$Med_SES_percent),
              stroke=TRUE,
              weight = 1,
              opacity = 3,
              dashArray = "1",
              color = "orange",
              fillOpacity = 3,
              label = labels2, 
              labelOptions = labelOptions(
              style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "15px",
                direction = "auto"
              )
      )%>%
      leaflet::addLegend(pal = pal3, values = df_map@data$Med_SES_percent, opacity = 2, title = "Percentage of Medium SES (2010)", position = "bottomright")

c

We create a fourth leaflet object “d” that visualises Singapore’s low income distribution by planning area.
We use the color scheme “RdPu” to visualise the different levels of income inequality. Areas with a higher proportion of low SES populations are shaded with a darker pink shade.
- We adopt a similar color scheme in objects “c” and “d” to allow the user to quickly compare between regions of medium income inequality and low SES areas.

3.2.4 Singapore’s Low SES Areas by Planning Area

pal4 <- colorNumeric("RdPu",domain = df_map@data$Low_SES_percent)

d <-leaflet(df_map) %>% 
  addTiles() %>%
  setView(103.80835, 1.360365,zoom=11) %>%
  addTiles(attribution = "Data Source: Singapore Census of Population 2010") %>%
  addPolygons(fillColor=~pal4(df_map@data$Low_SES_percent),
              stroke = TRUE, 
              weight = 1,
              opacity = 3,
              color = "pink",
              dashArray = "1",
              fillOpacity = 3,
              label = labels2, 
              labelOptions = labelOptions(
              style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "15px",
                direction = "auto"
              )
      )%>%
      leaflet::addLegend(pal = pal4, values = df_map@data$Low_SES_percent, opacity = 2, title = "Percentage of Low SES (2010)", position = "bottomright")

d

Lastly, we combine all four leaflet objects using the tagList function.
The variable leaflet_grid is plotted on the same view using the browsable function. This is found under the “htmltools” package.

4 Final Output: Single View of Singapore’s Inequality and Income Distribution

Given the narrow range of Singapore’s gini coefficient (or gini index if multiplied by 100), it can be concluded that the income disparity between planning areas is not too large. This confers some form of homogeneity in Singapore’s income distribution spatially although differences still exist.
One key difference is that areas with higher proportions of high SES levels tend to have lower levels of income inequality (or greater levels of income equality). This means that the income levels of high SES areas tend to be more similar (i.e. rich people congregating in a concentrated geographical cluster) as compared to areas with higher proportion of medium and low SES levels. In other words, the extent of intra-region income disparity in high SES areas is lower as compared to medium and low SES areas.
Areas with higher proportion of medium SES levels tend to be concentrated in sub-urban areas just outside the high SES cluster whereas the north-eastern part of Singapore have a higher proportion of low SES levels.

This indicates that there is some form of geo-spatial concentration of income levels in Singapore.
We can conclude preliminarily that there might be an interaction effect between income levels and dwelling choices such that clusters of income levels do form.
Examples of interaction variables could be government policies (Proximity Housing Grant), housing prices, lifestyle choices or inter-generational mobility effects i.e. richer families tend to have children who tend to be more well-off and hence stay near their parents.

Government policies could measure the change in geo-spatial distribution of income levels and intra-regional inequality measures over a period of time to determine whether certain policies have been working out or not. For example, whether a policy to ensure equal healthcare access within areas of lower income levels do work or not. This is the stated aim of our project.
Overall, whilst the intra-region income disparity in Singapore isnt too big (confined within a narrow gini range), there are clusters of low, medium and high SES levels. They congregate in the central CBD area, the sub-urban areas outside the CBD and in the north-eastern parts of Singapore respectively.

leaflet_grid <- 
  tagList(
    tags$table(width = "150%",
      tags$tr(
        tags$td(a),
        tags$td(b)
      ),
      tags$tr(
        tags$td(c),
        tags$td(d)
      )
    )
  )

browsable(leaflet_grid)

5 Reflections on the advantages of interactivity

An interactive plot contains more information within the same data plot and allows the user to develop more meaningful insights as compared to a static plot. For instance, the user is able to quickly understand not just the distribution of income levels but brushing over, he/she can correlate the distribution measure with other measures such as total population levels, gini coefficient etc.
An interactive plot provides more parameters that allows the user to “play around” with the data to form his/her own conclusions based on different approaches. For example, the user could view the same income distribution issue from different perspectives (low, medium, high SES levels). Whereas for static plots, the information is usually hard-coded and does not allow for simulation of different scenarios.
An integrated view of interactive plots allow on-the-fly comparative analysis between plots of different measures. For example, my dataviz allows the user to visualise the proportion of low, medium and high income percentages in the same plot. This provides a deeper and more meaningful analysis instead of plotting four plots separately.

DataViz Makeover 9

Yee Jiahao

3/29/2020