A work by Chen Fangqi

fangqi.chen.2017@smu.edu.sg

This document is best viewed in a maximized window

 

1. Overview

Singapore is known for having one of the world’s highest GDP per capita, $66189 in 2017, despite the small size of our country.
This visualization aims to find out how much our Singapore graduates earn compared to the GDP per capita.
I will be using data from the Graduate Employment Survey (GES).
I will be using the 2017 GDP data because 2017 is the only year in the GES data that includes all of the universities.

1.1 Data/Design Challenges

The data challenge is to process the data so that it can be used for an interactive map.
To do this, I’ll be using OneMap’s API to get the latitude and longitude of the schools.

Another challenge is the visualization of the geospatial data. The data only provides the latitude and longitude of the schools, so Leaflet, which plots markers from coordinates, is a good fit.

To visualize the data, I’ll be using leaflet to create an interactive map showing where the universities are located and which universities’ graduates are earning more.

The visualization will also compare the GDP per capita against the graduates’ annual income.

1.2 Packages used

  • formattable is used to render a dataframe as a formatted table
  • ggplot2 is used to create the static visualizations such as bar charts and density plots
  • tidyverse is used for data manipulation; it includes readr, which provides read_csv
  • dplyr is used for piped data manipulation (filter, group_by, summarise_at)
  • leaflet is for the interactive geospatial visualization
  • packrat and rsconnect are needed for publishing R documents
packages = c( 'formattable', 'ggplot2', 'tidyverse', 'dplyr', 'packrat', 'rsconnect', 'leaflet')

# Install any package that is missing, then load it
for(p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

2. Proposed Design Sketch

Sketch of Proposed DataViz Design

3. Data Preparation

I will be using read_csv from the readr package (loaded as part of tidyverse) as it is generally faster than base R’s read.csv

data <- read_csv('data/graduate-employment-survey-ntu-nus-sit-smu-suss-sutd.csv')

3.1 Target Variables

The target variables used are gross_monthly_mean and gross_monthly_median, as these values give a fairer comparison against the GDP per capita. I’ll also be converting these variables to numeric because they are read in as character variables.

df2 <- data.frame(data)
df2$gross_monthly_mean <- as.numeric(as.character(df2$gross_monthly_mean))
df2$gross_monthly_median <- as.numeric(as.character(df2$gross_monthly_median))

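# Drop rows with missing values, including salaries that could not be converted to numbers
remove_na_df <- na.omit(df2) 
remove_na_df
# Average the 2017 salary figures per university;
# summarise_at appends "_mean" to each summarised column name
ges_mutated <- remove_na_df[,c('year', 'university', 'gross_monthly_mean', 'gross_monthly_median')] %>%
  filter(year == 2017) %>%
  group_by(university) %>%
  summarise_at(.vars = vars(gross_monthly_mean, gross_monthly_median),
             .funs = c(mean="mean"))

df_sum <- data.frame(ges_mutated)
# Annualise using the full summarised column names (avoids relying on partial matching in `$`)
df_sum$annual_mean = df_sum$gross_monthly_mean_mean*12
df_sum$annual_median = df_sum$gross_monthly_median_mean*12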
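
As a small illustration of the cleaning and annualising steps above, here is a minimal base R sketch on a made-up three-row extract. It assumes that missing salaries are stored as the string "na" (which would explain why the columns are read as character); the values and the names `csv_text` and `toy` are hypothetical and not taken from the GES file.

```r
# Toy extract with made-up values; "na" as the missing marker is an assumption
csv_text <- "university,gross_monthly_mean,gross_monthly_median
Uni A,3600,3500
Uni B,na,na
Uni C,3300,3200"

# Declaring "na" as a missing-value marker lets the columns parse as numbers directly
toy <- read.csv(text = csv_text, na.strings = c("na", "NA", ""))

# Drop incomplete rows, then convert the monthly figures to annual ones
toy <- na.omit(toy)
toy$annual_mean   <- toy$gross_monthly_mean * 12
toy$annual_median <- toy$gross_monthly_median * 12
toy
```

Declaring the missing-value marker at read time is one possible alternative to converting with as.numeric afterwards; both approaches end with NA in the affected rows.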

3.2 Latitude and Longitude Data Preparation

3.2.1 Table that shows the Latitude and Longitude that will be added

For Singapore Institute of Technology, I’ll be using the latitude and longitude of its future Punggol campus

University                                             Latitude          Longitude
Singapore Management University [SMU]                  1.29780556390712  103.849028435475
Singapore University of Social Sciences [SUSS]         1.32881426960899  103.775709357706
Singapore University of Technology and Design [SUTD]   1.3401716369901   103.962860116421
Singapore Institute of Technology [SIT]                1.41225272368082  103.91027100122
National University of Singapore [NUS]                 1.29670387133402  103.781368628838
Nanyang Technological University [NTU]                 1.34676465758604  103.678818130577

3.2.2 Adding the latitude and longitude data to the university

The following code adds two new columns, lat and lng, and fills in each row based on the university’s name

df_sum$lat <- ifelse(df_sum$university == "Singapore Management University", 1.29780556390712,
                     ifelse(df_sum$university == "Singapore University of Social Sciences", 1.32881426960899, 
                            ifelse(df_sum$university == "Singapore University of Technology and Design", 1.3401716369901,
                                   ifelse(df_sum$university == "Singapore Institute of Technology", 1.41225272368082,
                                          ifelse(df_sum$university == "National University of Singapore", 1.29670387133402,
                                                 ifelse(df_sum$university == "Nanyang Technological University", 1.34676465758604, ""))))))

df_sum$lng <- ifelse(df_sum$university == "Singapore Management University", 103.849028435475,
                     ifelse(df_sum$university == "Singapore University of Social Sciences", 103.775709357706, 
                            ifelse(df_sum$university == "Singapore University of Technology and Design", 103.962860116421,
                                   ifelse(df_sum$university == "Singapore Institute of Technology", 103.91027100122,
                                          ifelse(df_sum$university == "National University of Singapore", 103.781368628838,
                                                 ifelse(df_sum$university == "Nanyang Technological University", 103.678818130577, ""))))))

df_sum$lat <- as.numeric(as.character(df_sum$lat))
df_sum$lng <- as.numeric(as.character(df_sum$lng))

df_sum$lat <- round(df_sum$lat,6)
df_sum$lng <- round(df_sum$lng,6)
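
The nested ifelse calls above could also be expressed as a lookup table. The following is only a sketch of that alternative in base R, using the coordinates listed in 3.2.1; `coords`, `demo` and "Some Other University" are hypothetical names introduced for illustration, with `demo` standing in for df_sum.

```r
# Lookup table built from the coordinates listed in 3.2.1
coords <- data.frame(
  university = c("Singapore Management University",
                 "Singapore University of Social Sciences",
                 "Singapore University of Technology and Design",
                 "Singapore Institute of Technology",
                 "National University of Singapore",
                 "Nanyang Technological University"),
  lat = c(1.29780556390712, 1.32881426960899, 1.3401716369901,
          1.41225272368082, 1.29670387133402, 1.34676465758604),
  lng = c(103.849028435475, 103.775709357706, 103.962860116421,
          103.91027100122, 103.781368628838, 103.678818130577)
)

# Hypothetical stand-in for df_sum, holding only the key column
demo <- data.frame(university = c("National University of Singapore",
                                  "Singapore Management University",
                                  "Some Other University"))

# match() finds each university's row in the lookup; unmatched names give NA
idx <- match(demo$university, coords$university)
demo$lat <- round(coords$lat[idx], 6)
demo$lng <- round(coords$lng[idx], 6)
demo
```

Because the coordinates stay numeric throughout, no conversion back from character is needed, and a university missing from the lookup shows up as NA instead of an empty string.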

3.3 Data Preparation for Annual Income vs GDP Per Capita

I’m changing the university names to their short forms for a cleaner visualization. GDP Per Capita is added as a “University” row so that it can be displayed in the same chart without changing much code.

new_df <- df_sum
new_df <- new_df %>% 
  add_row(university = "GDP Per Capita", gross_monthly_mean_mean = 0, gross_monthly_median_mean = 0, annual_mean = 0, annual_median = 60913.00, lat=0, lng=0)

new_df <- data.frame(new_df)

new_df$university <- ifelse(new_df$university == "Singapore Management University", "SMU", 
                            ifelse(new_df$university == "Singapore Institute of Technology", "SIT",
                                   ifelse(new_df$university == "Singapore University of Social Sciences", "SUSS",
                                          ifelse(new_df$university == "Singapore University of Technology and Design", "SUTD",
                                                 ifelse(new_df$university == "National University of Singapore", "NUS",
                                                        ifelse(new_df$university == "Nanyang Technological University", "NTU", "GDP Per Capita"))))))
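
The renaming above can also be sketched with a named character vector, which keeps the full-name to short-name pairs in one place. This is an illustrative base R alternative, not the code used for the charts; `short_names`, `uni` and `short` are hypothetical names.

```r
# Full name -> short form, as used in this document
short_names <- c(
  "Singapore Management University"               = "SMU",
  "Singapore Institute of Technology"             = "SIT",
  "Singapore University of Social Sciences"       = "SUSS",
  "Singapore University of Technology and Design" = "SUTD",
  "National University of Singapore"              = "NUS",
  "Nanyang Technological University"              = "NTU"
)

# Example input, including the extra "GDP Per Capita" row label
uni <- c("Nanyang Technological University",
         "GDP Per Capita",
         "Singapore Management University")

# Index the named vector by full name; unname() drops the element names
short <- unname(short_names[uni])

# Labels missing from the lookup come back as NA; keep the original label for those
short[is.na(short)] <- uni[is.na(short)]
short
```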

3.4 Data Preparation for Bar Chart [Bottom 10 paying salaries]

The data is sorted in descending order of median salary, so the tail function returns the bottom 10

pay_data <- subset(remove_na_df[,c('year', 'university', 'degree', 'gross_monthly_mean', 'gross_monthly_median')])
mutated_pay <- pay_data %>%
  filter(year == 2017) %>%
  group_by(degree, university) %>%
  summarise_at(.vars = vars(gross_monthly_mean, gross_monthly_median),
             .funs = c(mean="mean")) %>%
  arrange(desc(gross_monthly_median_mean))

pay_df <- data.frame(tail(mutated_pay,10))
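
To show why tail returns the lowest-paid rows here, the sketch below repeats the sort-then-tail idea in base R on five made-up degrees. The data frame `toy` and its values are hypothetical.

```r
# Made-up degrees and median salaries
toy <- data.frame(
  degree     = c("A", "B", "C", "D", "E"),
  median_pay = c(3200, 2800, 4100, 3000, 3600)
)

# Sort from highest to lowest pay, then take the last two rows
sorted_desc <- toy[order(-toy$median_pay), ]
bottom_2 <- tail(sorted_desc, 2)

# The same two degrees, via an ascending sort and head()
bottom_2_alt <- head(toy[order(toy$median_pay), ], 2)

bottom_2
```

The two approaches pick the same rows; they differ only in the order in which those rows are listed.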

3.5 Data Preparation for Density Plot

Same as in 3.3, I’m just changing the university names to their short forms

pay_df_2 <- data.frame(mutated_pay)
pay_df_2$university <- ifelse(pay_df_2$university == "Singapore Management University", "SMU", 
                            ifelse(pay_df_2$university == "Singapore Institute of Technology", "SIT",
                                   ifelse(pay_df_2$university == "Singapore University of Social Sciences", "SUSS",
                                          ifelse(pay_df_2$university == "Singapore University of Technology and Design", "SUTD",
                                                 ifelse(pay_df_2$university == "National University of Singapore", "NUS",
                                                        ifelse(pay_df_2$university == "Nanyang Technological University", "NTU", ""))))))

4. Data Visualization

4.1 Leaflet Map

I will be using leaflet to visualize the Gross Monthly Median and Gross Monthly Mean. The difference between these 2 variables is not very big. I have included the icon of each university in the visualization. The larger the circle, the higher the Gross Monthly Median/Gross Monthly Mean Salary. The circle color follows the Gross Monthly Median, running from red (lowest) through green to blue (highest).

The inner circle represents the Gross Monthly Mean
The outer circle represents the Gross Monthly Median

df_sum$gross_monthly_mean_mean <- as.numeric(as.character(df_sum$gross_monthly_mean_mean))
df_sum$gross_monthly_median_mean<- as.numeric(as.character(df_sum$gross_monthly_median_mean))

pal <- colorNumeric(
  palette = c("red", "green", "blue"),
  domain = df_sum$gross_monthly_median_mean
)

pal2 <- colorNumeric(
  palette = c("red", "green", "blue"),
  domain = df_sum$gross_monthly_median_mean
)

uni_buildings <- icons(
  iconUrl = ifelse(df_sum$university == "Singapore Management University", "https://www.smu.edu.sg/sites/default/files/smu/branding/logo_intro_new.png",
                   
    ifelse(df_sum$university == "Singapore Institute of Technology", "https://upload.wikimedia.org/wikipedia/en/7/7f/SIT_logo_2.png",
           
    ifelse(df_sum$university == "Singapore University of Technology and Design", "https://media.glassdoor.com/sqll/729796/singapore-university-of-technology-and-design-squarelogo-1426146883065.png",
           
    ifelse(df_sum$university == "Singapore University of Social Sciences", "https://www.suss.edu.sg/images/default-source/content/media-centre/1-mediacentre_exclusion_460x460.png?sfvrsn=5bccff4e_2&MaxWidth=400&MaxHeight=400&ScaleUp=false&Quality=High&Method=ResizeFitToAreaArguments&Signature=999986A16B0A1AA2716401EFD4E9614CD4239F92", 
           
   ifelse(df_sum$university == "National University of Singapore", "https://www.nus.edu.sg/images/default-source/identity-images/NUS_logo_full-vertical.jpg", 
          
  ifelse(df_sum$university == "Nanyang Technological University", "https://upload.wikimedia.org/wikipedia/en/thumb/f/f8/Nanyang_Technological_University_coat_of_arms_vector.svg/1200px-Nanyang_Technological_University_coat_of_arms_vector.svg.png","")))))
  ),
  iconWidth = 60, iconHeight = 60
)

leaflet(df_sum) %>%
  addTiles() %>%
  addMarkers(lat = ~lat, lng = ~lng, icon = uni_buildings,
             clusterOptions = markerClusterOptions(zoomToBoundsOnClick = T), 
             popup = ~paste(
               paste('<b>', 'University:', '</b> ', df_sum$university), 
               paste('<b>', 'Gross Monthly Mean Salary:', '</b>', round(df_sum$gross_monthly_mean_mean)), 
               paste('<b>',  'Gross Monthly Median Salary:', '</b>', round(df_sum$gross_monthly_median_mean)),
               sep = '<br/>'),
             popupOptions = popupOptions(closeButton = FALSE)
             ) %>%
  addCircles(lng = ~lng, lat = ~lat, weight = 1, 
             radius = ~df_sum$gross_monthly_mean_mean, color = ~pal(gross_monthly_median_mean), opacity = 1) %>%
  addCircles(lng = ~lng, lat = ~lat, weight = 1, 
             radius = ~df_sum$gross_monthly_median_mean, color = ~pal(gross_monthly_median_mean), opacity = 1) %>%
  addLegend("bottomright", pal = pal2, values = ~df_sum$gross_monthly_median_mean,
  title = "Gross Monthly Median",
  labFormat = labelFormat(prefix = "$"),
  opacity = 1
) 

4.2 Annual Median Income vs GDP Per Capita

I’ll create a new dataframe and add a new row named “GDP Per Capita” as if it were a university. This allows me to compare the Annual Median against the annual GDP per capita in the same chart.

new_df$annual_mean <- as.numeric(as.character(new_df$annual_mean))
new_df$annual_median <- as.numeric(as.character(new_df$annual_median))

check_sum <- subset(new_df, new_df$university == 'GDP Per Capita')
ggplot(new_df, aes(y=annual_median, x=reorder(university, annual_median), fill=university)) + geom_bar(stat="identity") + xlab('Universities vs GDP Per Capita') + ylab('Gross Median Annual Salary in $') + theme_classic() + guides(fill=guide_legend(title='Universities vs GDP Per Capita')) + labs(title = "University Annual Median Salary vs GDP Per Capita [Singapore 2017]",
caption = "Data Source: Data.gov.sg, Worldbank") + theme(plot.title = element_text(hjust = 0.5)) + ylim(0,70000)
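
The comparison in this chart can also be expressed as a ratio of GDP per capita to annual median salary. The sketch below assumes the 60913 figure used for the “GDP Per Capita” row in 3.3; the two universities and their monthly medians are made-up placeholders, not GES results.

```r
# GDP per capita value used for the "GDP Per Capita" row in 3.3
gdp_per_capita <- 60913

# Hypothetical monthly medians, for illustration only
toy <- data.frame(
  university           = c("Uni A", "Uni B"),
  gross_monthly_median = c(3500, 3200)
)

# Annualise, then express GDP per capita as a multiple of the annual median
toy$annual_median <- toy$gross_monthly_median * 12
toy$gdp_ratio     <- gdp_per_capita / toy$annual_median

round(toy$gdp_ratio, 2)  # 1.45 1.59
```

A ratio above 1 means the GDP per capita exceeds the annual median salary for that university.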

4.3 Which degrees pay the least

Since the median fresh graduate’s pay does not come close to the GDP per capita, another insight we can draw from this data is which graduates are paid the least. Knowing the 10 lowest paying degrees also tells us which fields to look beyond when searching for the degrees that pay the most.

ggplot(data=pay_df, aes(x=gross_monthly_median_mean, y=reorder(degree, gross_monthly_median_mean), fill=degree)) + geom_bar(stat="identity") + xlab('Gross Median Salary') + ylab('Degree') + theme_classic() + guides(fill=guide_legend(title='Gross Median Salary')) + labs(title = "Bottom 10 Gross Median Salary by Degree [Singapore 2017]",
caption = "Data Source: Data.gov.sg") + theme(plot.title = element_text(hjust = 0.5))

4.4 Density Plot

One of the best ways to find out which universities produce more graduates in higher paying jobs is through density plots. Even though SMU students have a higher average Gross Monthly Median salary, this number can be skewed upwards by higher paying jobs from faculties such as the law faculty.

# theme_classic() is applied before theme() so that the centred title is not reset
ggplot(pay_df_2, aes(x=gross_monthly_median_mean, fill=university)) +
  geom_density(alpha=0.4) + labs(title = "Density Distribution of Gross Monthly Median Salary [Singapore 2017]",
caption = "Data Source: Data.gov.sg") + xlab('Gross Monthly Median Salary') + ylab('Density') + theme_classic() + theme(plot.title = element_text(hjust = 0.5))
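
The skew described above can be illustrated with a few made-up numbers: a single high-paying degree pulls the mean up noticeably while the median barely moves, which is why the full distribution is more informative than an average. The salary values below are hypothetical.

```r
# Hypothetical monthly median salaries for five degrees at one university
pay <- c(3000, 3100, 3200, 3300, 3400)

# Add one high-paying degree (for example, a law degree)
pay_with_outlier <- c(pay, 5000)

mean(pay)                 # 3200
median(pay)               # 3200
mean(pay_with_outlier)    # 3500
median(pay_with_outlier)  # 3250

# density() computes the kind of kernel density estimate that geom_density draws
d <- density(pay_with_outlier)
range(d$x)
```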

5. Insights

Based on what I found in the data, these are the insights I can derive:
1) Graduates from SMU and SUTD were paid the most in 2017

2) The GDP per capita is far above the annual gross median salary of a fresh graduate [about 1.5x]

3) The lowest paying degrees are generally in the fields of Arts, Food, and Child Care

4) SUTD tends to have more graduates who are paid more than $3000 in Gross Monthly Salary, but loses out at the tail end.