Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.
Source: Mike Bostock (2019), author’s Twitter and code on Observable website
Objective
The original visualisation shows the number of H-1B visa-sponsored jobs offered by employers across sectors in the United States during fiscal year 2019 (October 1st 2018 to September 30th 2019). The data record each H-1B application that was approved or denied (U.S. Citizenship and Immigration Services 2019). The area of each bubble represents the number of visa applications an employer submitted. Bubbles are grouped by state and city (Bostock, M 2019).
H-1B visas are used by US employers seeking to hire nonimmigrants for 3-6 years as workers in speciality occupations that require at least a Bachelor’s degree. These are occupations in which the US has a labour shortage, including many Silicon Valley jobs. In 2018, which overlaps with the fiscal year covered by the data source, demand for foreign talent was high, with 472,000 unfilled software engineering positions (Mukherjee, R. 2018).
The target audience is highly educated non-US nationals in fields where the US has shortages, for example international students with high GPAs who graduated from American universities. It can be difficult to find a company willing to sponsor an H-1B visa, which is expensive and time-consuming to obtain. Knowing which industries are more likely to sponsor is therefore valuable to foreign students who wish to keep working in the US after their student visa expires, and to noncitizen experts seeking short-term employment.
The visualisation chosen had the following three main issues:
Deceptive area - The packed bubble chart displays data as clusters. In this chart type, area, labels and colour can each represent a different data column (Tableau 2021). In this example, area displays the number of submitted visa applications and labels display the employer names. The clusters represent different states and the bubbles are different employers. It is hard for the audience to compare similar values accurately across states and cities because small differences in area are hard to perceive. Area is a visual variable that is read less accurately than position on an axis because there is no aligned baseline (Baglin, J. 2020).
Irresponsible use of colour - Colour should represent another data column (Tableau 2021), but here it has no clear meaning. The colour scheme appears to be a default. The action colour, which should guide the viewer to the bubbles that carry the key message, is yellow. The background is less important than the bubbles and so should be in a muted colour, but it is red, a brighter colour that draws the viewer’s attention more than the yellow. The bubbles all share the same hue, making it hard to compare employers that submitted different numbers of applications. The colours are not legible when printed in black and white (Evergreen, S. et al 2016).
Visual bombardment - The data have no order, the visualisation is not accessible and the audience cannot focus on the key message. There is a large amount of data that could have been faceted. Many labels cannot be read, and some overlap with other bubbles. There are so many yellow circles that the small ones seem to merge with the red background and appear orange (Baglin, J. 2020).
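On the first issue, the weakness of area as an encoding can be shown with simple arithmetic: a circle's area grows with the square of its radius, so the visible radius grows only with the square root of the value. The sketch below uses two made-up application counts (they are not taken from the dataset) to compare the ratio of the values with the ratio of the radii a viewer would see.

```r
# Hypothetical application counts for two employers (illustrative values only)
values <- c(employer_a = 1000, employer_b = 1200)

# In a bubble chart, area is proportional to the value, so radius ~ sqrt(value)
radii <- sqrt(values / pi)

# Ratio of the values versus ratio of the radii the viewer actually sees
value_ratio  <- values[["employer_b"]] / values[["employer_a"]]
radius_ratio <- radii[["employer_b"]] / radii[["employer_a"]]

print(round(c(value_ratio = value_ratio, radius_ratio = radius_ratio), 3))
```

Under these assumptions a 20% difference in value shows up as a difference of under 10% in bubble radius, whereas positions on a common axis would show the full 20%.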
Reference
The following code was used to fix the issues identified in the original.
# loading libraries
library(ggplot2)
library(dplyr)
# loading dataset
data <- read.csv("h-1b-data-export.csv")
# view first few rows of data
head(data)
##   Fiscal.Year                    Employer Initial.Approval Initial.Denial
## 1        2019                                            1              0
## 2        2019           01INTERACTIVE INC                1              0
## 3        2019 0956588 BC LTD DBA PROCOGIA                0              1
## 4        2019          1 800 CONTACTS INC                1              0
## 5        2019     1 HOTEL SOUTH BEACH INC                0              0
## 6        2019              1 WORLD ED INC                1              0
##   Continuing.Approval Continuing.Denial NAICS Tax.ID State             City
## 1                   6                 1    54   3023    CA         MILPITAS
## 2                   1                 0    54   9852    CA CITY OF INDUSTRY
## 3                   0                 1    54    209    WA         BELLEVUE
## 4                   0                 0    42   1643    UT           DRAPER
## 5                   1                 0    72   9513    FL      MIAMI BEACH
## 6                   0                 0    61   1468    CA      LOS ANGELES
##     ZIP
## 1 95035
## 2 91745
## 3 98006
## 4 84020
## 5 33139
## 6 90056
# define factors for meaningful string variables so they can be ordered
data$Employer <- as.factor(data$Employer)
data$State <- as.factor(data$State)
data$NAICS <- as.factor(data$NAICS)
# check classes correctly defined
class(data$Employer)
## [1] "factor"
class(data$State)
## [1] "factor"
class(data$NAICS)
## [1] "factor"
# check levels of NAICS
levels(data$NAICS)
## [1] "11" "21" "22" "23" "31" "32" "33" "42" "44" "45" "48" "49" "51" "52" "53"
## [16] "54" "55" "56" "61" "62" "71" "72" "81" "92" "99"
# relabel the NAICS codes with meaningful sector names. NAICS is a nominal variable indicating industry, so it is treated as a factor; codes that share a label (31-33, 44-45, 48-49) merge into one level
data$NAICS <- data$NAICS %>% factor(
  levels = c(11, 21, 22, 23, 31, 32, 33, 42, 44, 45, 48, 49, 51, 52, 53, 54, 55, 56, 61, 62, 71, 72, 81, 92, 99),
  labels = c("Agriculture, Forestry", "Mining", "Utilities", "Construction",
             "Manufacturing", "Manufacturing", "Manufacturing", "Wholesales",
             "Retail", "Retail", "Transportation, Logistics", "Transportation, Logistics",
             "Information", "Finance", "Real Estate", "STEM", "Management",
             "Administration, Waste Management", "Education", "Health and Social Care",
             "Arts and Recreation", "Accommodation, Food", "Other",
             "Public Administration", "Nonclassifiable"),
  ordered = TRUE)
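The relabelling above relies on `factor()` accepting repeated labels, which merges several codes into a single level (supported in R 3.4.0 and later). A small sketch on made-up codes, not the real dataset, shows the effect:

```r
# Toy vector of NAICS-style two-digit codes (illustrative, not from the data)
codes <- c("31", "32", "33", "54", "54", "61")

sector <- factor(
  codes,
  levels = c("31", "32", "33", "54", "61"),
  labels = c("Manufacturing", "Manufacturing", "Manufacturing",
             "STEM", "Education")
)

# The three manufacturing codes collapse into a single level
print(levels(sector))
print(table(sector))
```

Any code that is absent from `levels` would become `NA`, so it may be worth checking `sum(is.na(sector))` after a mapping like this.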
# assign new name for NAICS column so it is less confusing for viewer
names(data)[names(data) == 'NAICS'] <- 'Job.Sector'
# sum up initial approval, initial denial, continuing approval and continuing denial variables into a new variable
data$Number.of.Visas <- data$Initial.Approval + data$Initial.Denial + data$Continuing.Approval + data$Continuing.Denial
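Adding columns with `+` returns `NA` for a row whenever any of the four counts is missing. If the export did contain missing counts (an assumption; the rows printed above have none), `rowSums()` with `na.rm = TRUE` would be one way to keep those rows. A sketch on a toy data frame with invented values:

```r
# Toy counts; the NA in the third row is deliberate
toy <- data.frame(
  Initial.Approval    = c(1, 0, NA),
  Initial.Denial      = c(0, 1, 0),
  Continuing.Approval = c(6, 0, 2),
  Continuing.Denial   = c(1, 1, 0)
)
count_cols <- c("Initial.Approval", "Initial.Denial",
                "Continuing.Approval", "Continuing.Denial")

# Element-wise addition propagates the missing value
toy$plus_total <- toy$Initial.Approval + toy$Initial.Denial +
  toy$Continuing.Approval + toy$Continuing.Denial

# rowSums() can skip missing values instead
toy$Number.of.Visas <- rowSums(toy[, count_cols], na.rm = TRUE)

print(toy)
```

Whether treating a missing count as zero is appropriate depends on why the value is missing, so this is a choice to make deliberately rather than a default.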
# group employers by State and Job Sector
EmployerStateJob <- data %>% group_by(across(c("State","Job.Sector"))) %>% summarize(Number.of.Visas = sum(Number.of.Visas))
# remove the first 4 rows because they contain blanks (which R does not treat as missing values). these rows account for only a few visas, so removing them should not affect the visualisation
EmployerStateJob <- EmployerStateJob[-c(1,2,3,4), ]
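Dropping rows by position depends on the blank entries sorting first and on there being exactly four of them. A condition-based filter is one alternative. The sketch below assumes the blanks are empty strings in `State` (the source does not say which column is blank) and uses a toy table with invented values:

```r
library(dplyr)

# Toy grouped table with two blank-state rows (illustrative values only)
toy <- data.frame(
  State = factor(c("", "", "AK", "CA")),
  Job.Sector = c("STEM", "Education", "Mining", "STEM"),
  Number.of.Visas = c(1L, 2L, 2L, 7L)
)

cleaned <- toy %>%
  filter(!is.na(State), State != "") %>%  # keep rows with a real state code
  droplevels()                            # drop the now-unused "" level

print(cleaned)
```

Filtering on a condition keeps working if the number of blank rows changes in a later data export.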
# check data correctly grouped
head(EmployerStateJob)
## # A tibble: 6 x 3
## # Groups:   State [1]
##   State Job.Sector                Number.of.Visas
##   <fct> <ord>                               <int>
## 1 AK    Mining                                  2
## 2 AK    Manufacturing                           2
## 3 AK    Wholesales                              1
## 4 AK    Retail                                  1
## 5 AK    Transportation, Logistics               1
## 6 AK    Information                             1
# faceting by State results in visual bombardment, so condense further to Job Sector and Number of Visas only
VisaPerSector <- EmployerStateJob %>% group_by(Job.Sector) %>% summarise(Number.of.Visas = sum(Number.of.Visas))
# order the Job Sector levels by Number of Visas so that sectors appear in descending order from the top of the plot
VisaPerSector$Job.Sector <- factor(VisaPerSector$Job.Sector, levels = VisaPerSector$Job.Sector[order(VisaPerSector$Number.of.Visas)])
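The same level ordering can also be written with base R's `reorder()`, which sorts factor levels by a summary (the mean by default) of another variable. A sketch on made-up totals, not the real figures:

```r
# Toy sector totals (illustrative values only)
toy <- data.frame(
  Job.Sector = factor(c("Education", "STEM", "Finance")),
  Number.of.Visas = c(300, 900, 500)
)

# Levels are sorted by the value in Number.of.Visas, smallest first
toy$Job.Sector <- reorder(toy$Job.Sector, toy$Number.of.Visas)

print(levels(toy$Job.Sector))
```

Because ggplot2 draws the first level at the bottom of a discrete y-axis, ascending levels place the largest sector at the top of the plot.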
# dot plot of Number of Visas by Job Sector
# colour is mapped to Number of Visas on a colour-blind friendly gradient, so STEM (the largest value) stands out from the other sectors
p1 <- ggplot(VisaPerSector, aes(y = Job.Sector, x = Number.of.Visas, colour = Number.of.Visas))
p1 <- p1 + geom_point() +
  geom_segment(aes(x = 0, y = Job.Sector, xend = Number.of.Visas, yend = Job.Sector), linetype = 2, size = 1.4) +
  geom_text(aes(label = round(Number.of.Visas, 2)), hjust = -.2, size = 3, colour = "black") +
  labs(title = "Number of H-1B Visas Sponsored by US\nEmployers Sorted by Job Sector",
       subtitle = "North American scientific firms submitted more\nH-1B visas for foreign experts than all other sectors",
       x = "Number of H-1B Visas",
       y = "Job Sector") +
  scale_colour_gradient(low = "#998ec3", high = "#f1a340")
# use a white background without gridlines and remove the redundant colour legend
p1 <- p1 + theme_classic() + theme(legend.position = "none")
# set the visible x-axis range to 0-300000 so the plot, including the value labels, fits into the RMarkdown webpage
p1 <- p1 + coord_cartesian(xlim = c(0, 300000))
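The fixed upper limit of 300000 suits this dataset but would need editing if the totals changed. One alternative, sketched here on toy data with invented values, is to pad the right-hand side of the axis proportionally with `expansion()` (available in ggplot2 3.3.0 and later) so the value labels still have room:

```r
library(ggplot2)

# Toy sector totals (illustrative values only)
toy <- data.frame(
  Job.Sector = factor(c("Education", "Finance", "STEM"),
                      levels = c("Education", "Finance", "STEM")),
  Number.of.Visas = c(300, 500, 900)
)

p <- ggplot(toy, aes(x = Number.of.Visas, y = Job.Sector)) +
  geom_point() +
  geom_text(aes(label = Number.of.Visas), hjust = -0.2, size = 3) +
  # start the axis at 0 and add 15% padding on the right for the labels
  scale_x_continuous(limits = c(0, NA),
                     expand = expansion(mult = c(0, 0.15))) +
  theme_classic()

print(p)
```

The 15% padding is an arbitrary starting point; longer labels or a larger text size may need more.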
Data Reference
The following plot fixes the main issues in the original.