While surfing the web for data on an unrelated project, I came across this visualization of employment demographics by sector, primarily focused on the male/female split in various occupations.

US Census: Occupational Shares by Sex

It took me some time to figure out exactly what this chart is attempting to show, but the two dimensions being used here are area and color. This was the first red flag for me, as area and color shade are among the visual cues that people judge least precisely, and the color scale here is uneven.

The area of each block in this graphic represents the size of that occupational group. The treetable form of the graphic here does systematically place the more common occupations at the top left and less common occupations at the bottom right, but the differing dimensions of each box make any direct comparison of size fuzzy at best.

The shading of each box is meant to represent the “gains in percent male/female” in each occupational group, or in other words, whether the sector has trended more towards male or female employees from 2000 to 2010. The scale, though, doesn’t provide any threshold values for the shades, and the color scheme provides more gradations for female gains than for male.

Lastly, there is the simple difficulty of knowing what the chart is attempting to say because the plot itself is so busy. The authors included notes and dashed pointers to call out interesting trends in the data. However, the fact that these trends would have been difficult to spot otherwise suggests that improvements could be made.

Potential Modifications

After perusing the raw data (available for download from the Census website here), I became convinced that the data might be better displayed in another type of chart. Not only does the treetable appear jumbled, it also focuses only on the changes in the gender split and the size of each occupational group. Other important context, such as the growth of a given sector over that time period, the raw numbers of men or women in each profession, and the rate of change for the male/female split, is not clearly addressed.

After deciding to include more context for the data if possible, I also resolved to make the graphic interactive. Considering how small some of the sectors were in the area-based chart compared to the large boxes, it would be much simpler to read the chart were it zoomable to differing scales. There is also much information missing from the chart, as labels for certain sectors are abbreviated or dropped once they can no longer fit in the smaller boxes. Both of these issues can be addressed with an interactive plot.

The Attempted Solution

In an attempt to better engage with the data, I redesigned the treetable using the Highcharts functionality from the rCharts package. This first required loading the packages and a good deal of data manipulation…

## load necessary packages
library('rCharts')
library('reshape')

## read in the data from downloaded csv files
SOC_Gp <- read.csv(path.expand("~/Desktop/Census Blog Post/SOC-Gp.csv"), header=TRUE, sep=",", stringsAsFactors = FALSE)
Broad_Gp <- read.csv(path.expand("~/Desktop/Census Blog Post/BroadGp.csv"), header=TRUE, sep=",")
Minor_Gp <- read.csv(path.expand("~/Desktop/Census Blog Post/MinorGp.csv"), header=TRUE, sep=",")
OccID <- read.csv(path.expand("~/Desktop/Census Blog Post/OccupationID.csv"), header=TRUE, sep=",")

Data2000 <- read.csv(path.expand("~/Desktop/Census Blog Post/2000 Occupation Tabulation.csv"), header=TRUE, sep=",")
Data2010 <- read.csv(path.expand("~/Desktop/Census Blog Post/2010 Occupation Tabulation.csv"), header=TRUE, sep=",")

## standardize group names and group identifiers
Data2010$Subject <- gsub("Total, both sexes", "Total", Data2010$Subject)

SOC_Gp[, "SOCID"] <- paste("SOC", substr(SOC_Gp[,1], 1, 2), sep = " ")
Broad_Gp[, "BrdGp_ID"] <- substr(Broad_Gp[,1], 1, 6)
Minor_Gp[, "MnrGp_ID"] <- substr(Minor_Gp[,1], 1, 4)

SOC_Gp = SOC_Gp[-23,]

## initialize target lists
SOClist2000 <- list()
SOCagg2000 <- list()
SOClist2010 <- list()
SOCagg2010 <- list()
SOCFemCnt2000 <- c()
SOCTotCnt2000 <- c()
SOCFemCnt2010 <- c()
SOCTotCnt2010 <- c()

## loop through each group, find all rows associated with that group, and aggregate the employee counts by male/female/total
for (i in 1:(nrow(SOC_Gp))) {
    templist <- grep(SOC_Gp[i,3], Data2000$Occupation)
    SOClist2000[[i]] <- subset(Data2000, rownames(Data2000) %in% templist)
    SOCagg2000[[i]] <- aggregate(SOClist2000[[i]]$Total, list(Sex = SOClist2000[[i]]$Sex), sum)
    
    SOCFemCnt2000 <- append(SOCFemCnt2000, SOCagg2000[[i]][1,2])
    SOCTotCnt2000 <- append(SOCTotCnt2000, SOCagg2000[[i]][3,2])
    
    templist <- grep(SOC_Gp[i,3], Data2010$Occupation.Code)
    SOClist2010[[i]] <- subset(Data2010, rownames(Data2010) %in% templist)
    SOCagg2010[[i]] <- aggregate(SOClist2010[[i]]$Total, list(Sex = SOClist2010[[i]]$Subject), sum)

    SOCFemCnt2010 <- append(SOCFemCnt2010, SOCagg2010[[i]][1,2])
    SOCTotCnt2010 <- append(SOCTotCnt2010, SOCagg2010[[i]][3,2])
}

## assign aggregation results to group IDs
SOCTotals2000 <- cbind(SOC_Gp, SOCFemCnt2000, SOCTotCnt2000-SOCFemCnt2000, "2000")
SOCTotals2010 <- cbind(SOC_Gp, SOCFemCnt2010, SOCTotCnt2010-SOCFemCnt2010, "2010")

colnames(SOCTotals2000)[4] <- "Women"
colnames(SOCTotals2000)[5] <- "Men"
colnames(SOCTotals2000)[6] <- "Year"
colnames(SOCTotals2010)[4] <- "Women"
colnames(SOCTotals2010)[5] <- "Men"
colnames(SOCTotals2010)[6] <- "Year"

## Combine two study periods into single data frame
SOCTotals <- rbind(SOCTotals2000, SOCTotals2010)
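The loop above reads the female count from row 1 and the total from row 3 of each aggregate, which works because aggregate() returns its groups in sorted order (presumably Female, Male, Total here). As a rough sketch of that step on made-up data, where the occupation codes and counts are illustrative assumptions rather than Census values, selecting rows by name avoids depending on that ordering:

```r
## toy stand-in for one Census tabulation; every value here is made up
toy <- data.frame(
  Occupation = rep(c("11-1011", "11-2020", "13-1011"), each = 3),
  Sex        = rep(c("Total", "Male", "Female"), times = 3),
  Total      = c(100, 60, 40, 50, 20, 30, 80, 30, 50),
  stringsAsFactors = FALSE
)

## find the rows whose occupation code starts with the (hypothetical) prefix "11"
rows <- grep("^11", toy$Occupation)

## sum the counts by sex; groups come back sorted as Female, Male, Total
agg <- aggregate(toy$Total[rows], list(Sex = toy$Sex[rows]), sum)
print(agg)

## pick values by label rather than by row position
female <- agg[agg$Sex == "Female", "x"]
total  <- agg[agg$Sex == "Total", "x"]
male   <- total - female
```

If a group ever lacked one of the three labels, positional indexing such as agg[3,2] would silently return the wrong row or NA, whereas the label lookup returns an empty vector that is easier to catch.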

Finally, we get to the rCharts portion and use Highcharts to produce the chart. In this case, I created a line chart that plots the raw number of men against the raw number of women for each occupational sector, and then connects the 2000 and 2010 data points with a line. This lets us view the raw employment numbers (not only changes in the percentage gender split), the growth of each sector over time, and most critically the starting gender split in each sector. The following code produces this plot:

## add additional values to plot a y=x line where the male/female split is 50/50
SOCTotals <- rbind(SOCTotals, list("xx-xxxx","-50% Male/Female Line-","SOC xx",0,0,2000), list("xx-xxxx","-50% Male/Female Line-","SOC xx",10000000,10000000, 2010))

## plot male vs female employment numbers for each occupation, connecting the 2000 and 2010 data with a line. Then format chart and add zoom capability
chart <- hPlot(x = "Men", y = "Women", data = SOCTotals, type = "line", group = "SOC.Grouping")
chart$yAxis(title = list(text ="Female Workers"), min = 0, max = 16000000)
chart$xAxis(title = list(text = "Male Workers"), min = 0, max = 10000000)
chart$chart(zoomType = 'xy', height = 700)
chart$legend(layout = 'vertical')

## show final chart
chart$print('chart', include_assets=TRUE)

Highcharts does a great job of providing the desired interactivity in this view. Hovering over a given line or point tells you immediately which segment you are looking at, whereas previously we had to deal with abbreviations or dropped segment titles for the smaller industries. We’ve enabled zoom, so highlighting any region of the graph zooms in on that portion for a closer look. And lastly, we can select or deselect any of the industries individually to focus only on specific industries as needed.

Beyond the interactivity gains, this view also provides greater context for the employment numbers than the original treetable. Previously, we only had two dimensions available to us: sector size (box area) and gains in percent male/female (shade). In the new chart, we retain sector size (point placement) and gains in percent male/female (slope of each line), but we also gain insight into the male/female split in 2000 vs 2010, raw employment counts, sector growth (length of line), and distance from the 50-50 equality line.

This plot puts some of the results from the first chart in a new light. The original chart told us that from 2000 to 2010 men saw gains in both Office Support and Computer occupations, but only in this chart do we see that this shift is making the Office Support occupation more diverse (as it is currently dominated by women) but the Computer occupation more homogeneous (as that occupation already skewed male). The original chart indicated that Production occupations were shifting heavily male while the Education split remained relatively unchanged, but this new chart reveals that both industries accounted for roughly the same number of new jobs for women between 2000 and 2010 due to differing industry growth rates.
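To see how a sector can shift toward men and still add as many jobs for women as a sector whose split stays flat, here is a small worked example. The sector names and counts are hypothetical and chosen only to make the arithmetic easy to follow. The column names mirror those of SOCTotals, and the derived columns (Share, ShareChange, SectorGrowth, NewJobsWomen) are names introduced for this sketch.

```r
## hypothetical counts for two sectors; these are not Census figures
toy <- data.frame(
  SOC.Grouping = rep(c("Sector P", "Sector E"), times = 2),
  Women = c(200, 300, 260, 360),
  Men   = c(300, 100, 540, 120),
  Year  = rep(c("2000", "2010"), each = 2),
  stringsAsFactors = FALSE
)

## female share of each sector in each year
toy$Share <- toy$Women / (toy$Women + toy$Men)

## line up the 2010 rows with the 2000 rows by sector
y2000 <- toy[toy$Year == "2000", ]
y2010 <- toy[toy$Year == "2010", ]
y2010 <- y2010[match(y2000$SOC.Grouping, y2010$SOC.Grouping), ]

## change in share, overall sector growth, and raw gain in jobs for women
change <- data.frame(
  SOC.Grouping = y2000$SOC.Grouping,
  ShareChange  = y2010$Share - y2000$Share,
  SectorGrowth = (y2010$Women + y2010$Men) / (y2000$Women + y2000$Men) - 1,
  NewJobsWomen = y2010$Women - y2000$Women
)
print(change)
```

In this toy data, Sector P's female share falls from 40% to 32.5% while Sector E's stays at 75%, yet both add 60 jobs for women, because Sector P grows by 60% and Sector E by only 20%.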

The additional context that becomes available in the modified version helps place the gender statistics for various occupations in a more complete frame of reference. If the goal of the original chart is to explore workplace diversity, the chart can only tell part of the story in its current form.