Comparing Visualizations
Intro
This post is inspired by a LinkedIn discussion about the good and the ugly of boxplots versus scatter plots. It started with a post by Paul van der Laken, PhD, citing an article by Nick Desbarats titled “I’ve stopped using boxplots. Should you?”, a very thought-provoking piece that you can read here.
I don’t want to prove Nick and Paul wrong; that’s not the goal of this post. I actually loved Nick’s article. During the discussion in the comments I shared an example where a boxplot was useful, and I was asked to make a violin plot of it, which is why I’m writing this post. It’s fair to mention that I don’t consider myself an expert. I’m an enthusiastic People Analytics practitioner, and from time to time it’s nice to have a debate in which ideas can be challenged.
I agree with the general idea of Nick’s article: simpler is better, and you always have to think first about your audience and whether they will understand a plot. Still, in my opinion no type of plot is best in every case. A plot should be used with a purpose.
Another point against boxplots is that they are complex to understand, and you need people with statistical knowledge to read them. That is true. I work mainly with HR professionals, statistics is not our main strength, and boxplots condense a lot of statistical information, so most of my audience won’t understand them. So, why bother?
In my opinion, even if you are making a pie chart or a bar chart, you always have to explain what the visualization is saying, especially if you are using a type of chart people are not familiar with. Don’t assume your audience will interpret the plot the way you intended. Pay attention to Hans Rosling’s TED Talks and how he always explains how to interpret his visualizations.
Even though boxplots are hard to understand, especially for HR professionals, I believe they can be useful for people who work in Compensation and Benefits, for instance, since they constantly work with medians and salary bands.
And I believe every plot serves a purpose. For instance, in commercial presentations I’d always include a Sankey chart for the wow effect, but I’ve barely used one in production because my clients wouldn’t know how to read it. So my point is that you always have to choose a visualization that the user and the audience will understand.
So, given all that, let’s load some data and compare some plots! 🤓
The data
The data we are going to use comes from an open salary survey developed by a Latin American R User Group (RUG) called R4HR Club de R para RRHH, a community for Spanish-speaking people who work, or want to work, in Human Resources and want to learn R. You can find the raw data and our full analysis in this link (it’s in Spanish 🤷🏻).
So first, let’s load some libraries and a subset of the data. If you don’t want to see the whole data preparation process, simply skip to the Comparing visualizations section.
Data preparation
I’m loading the data from the original source, which is why it requires some cleaning. The data is filtered to show only the results from Argentina, and I made some calculations to estimate part-time salaries as their full-time equivalent. That’s the data I’m going to use for the visualizations. You can download a clean version of this data from this GitHub repository.
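To make the part-time adjustment concrete, here is a minimal sketch on made-up numbers. The data frame toy and its values are invented for illustration. The 1.5 multiplier is the one used in this post's preparation code, and the sketch is written in base R so it runs without any package.

```r
# Toy data, invented for illustration (not survey responses)
toy <- data.frame(
  work_hours   = c("Full time", "Part time", "Full time"),
  gross_salary = c(80000, 40000, 65000)
)

# Assumption taken from the post's own code: part-time pay is scaled by 1.5
toy$multiplier <- ifelse(toy$work_hours == "Part time", 1.5, 1)

# Full-time equivalent salary
toy$ft_salary <- toy$gross_salary * toy$multiplier

toy
```

Only the part-time row changes, from 40000 to its full-time equivalent. The real preparation code follows.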
# Libraries & Data ----
library(tidyverse) # For data wrangling and cleansing
library(funModeling) # For EDA and some data cleansing... and much more
library(scales) # For adjustments on the axis display of the plots
library(googlesheets4) # Reading files from Google Sheets
library(gargle) # Handling special characters from Spanish
# Data
salaries <- read_sheet("1aeuu9dVfN42EjyvbmhEcsf0ilSz2DiXU-0MpnF896ss") %>%
select(gender = "Género",
role = "¿En qué puesto trabajás?",
gross_salary = "¿Cuál es tu remuneración BRUTA MENSUAL en tu moneda local? (antes de impuestos y deducciones)",
country = "País en el que trabajas",
work_type = "Trabajo",
work_hours = "Tipo de contratación") %>%
filter(country == "Argentina",
work_type == "Relación de Dependencia",
gender %in% c("Femenino", "Masculino")) %>%
select(-country, -work_type)
## Clean data (you can't hide from it) ----
salaries <- salaries %>%
mutate(gross_salary = as.numeric(unlist(gross_salary)))
# Add a column to estimate full time salary for part time workers
salaries <- salaries %>%
mutate(multiplier = if_else(work_hours == "Part time", 1.5, 1),
ft_salary = gross_salary * multiplier) %>%
select(-work_hours, -multiplier, -gross_salary)
# Filter and unify roles
salaries <- salaries %>%
filter(role != "Juzgado Civil y Comercial",
role != "Programador",
role != "Cuidado",
role != "Asesor",
role != "Jefe de Proyecto") %>%
mutate(role = str_trim(role, side = "both"), # Remove leading and trailing whitespace
role = fct_collapse(role, "Gerente" = "Superintendente"),
role = fct_collapse(role, "Director" = "Director ( escalafón municipal)"),
role = fct_collapse(role, "HRBP" = c("Senior Consultoría", "specialist", "especialista",
"Especialista en selección IT", "Recruiter")),
role = fct_collapse(role, "Responsable" = c("Coordinación", "Coordinador de Payroll",
"Encargado", "Supervisor")),
role = fct_collapse(role, "Administrativo" = c("Asistente", "Asistente RRHH", "Aux",
"Capacitador", "Consultor Ejecutivo",
"consultor jr")),
role = fct_collapse(role, "Analista" = c("Asesoramiento", "Consultor", "Generalista",
"Reclutadora", "Selectora", "Senior")))
# Filter roles to analyze
salaries <- salaries %>%
filter(role %in% c("Analista", "HRBP", "Responsable",
"Jefe", "Gerente"))
# Write a csv file to share
write_delim(salaries, file = "hr_salaries_arg.csv",
delim = ";")
I typically like to customize my charts, so I usually make these modifications.
options(scipen = 999) # Modifies the scientific notations of plots to nominal values
extrafont::loadfonts(quiet = TRUE) # Loads different fonts into R
# Clean style with grey horizontal lines
styleh <- theme(panel.grid = element_blank(),
plot.background = element_rect(fill = "#FBFCFC"),
panel.background = element_blank(),
panel.grid.major.y = element_line(color = "#CACFD2"),
axis.ticks.y = element_blank(),
text = element_text(family = "Poppins"),
plot.title.position = "plot")
stylev <- theme(panel.grid = element_blank(),
plot.background = element_rect(fill = "#FBFCFC"),
panel.background = element_blank(),
panel.grid.major.x = element_line(color = "#CACFD2"),
axis.ticks.x = element_blank(),
text = element_text(family = "Poppins"),
plot.title.position = "plot")
# Modify the way the x and y axes are displayed
axis_x_n <- scale_x_continuous(labels = comma_format(big.mark = ".", decimal.mark = ","))
axis_y_n <- scale_y_continuous(labels = comma_format(big.mark = ".", decimal.mark = ","))
# Colors
gender_colors <- genero <- c("#8624F5", "#1FC3AA") # Purple and green (sort of :p)
# Data source for plot's caption
fuente <- "Data Source: Encuesta KIWI de Sueldos de RRHH LATAM 2020\nR4HR Club de R para RRHH"
The data was collected in Spanish, so let’s translate the values into English first and summarise the information:
# Translate values
salaries <- salaries %>%
mutate(gender = fct_recode(gender, "Female" = "Femenino",
"Male" = "Masculino"),
role = fct_recode(role, "Analyst" = "Analista",
"Supervisor" = "Responsable",
"Head" = "Jefe",
"Manager" = "Gerente"),
role = fct_relevel(role, c("Analyst", "HRBP", "Supervisor",
"Head", "Manager")))
# Let's make a summary analysis of the data
summary(salaries)
##     gender                role       ft_salary
##  Female:360   Analyst       :223   Min.   :      2
##  Male  :176   Supervisor    :136   1st Qu.:  56000
##               Head          : 72   Median :  75000
##               HRBP          : 57   Mean   :  93288
##               Manager       : 48   3rd Qu.: 105250
##               Administrativo:  0   Max.   :2140000
##               (Other)       :  0
There are a couple of unusual values. First the minimum value, which is clearly a mistake (or someone answering in bad faith), and then the maximum value, which is possible but highly unusual for the Argentinean market. If we make a histogram, the result looks awkward.
ggplot(salaries, aes(x = ft_salary)) +
geom_histogram() +
labs(title = paste0("Gross Salary Distribution in HR ", emo::ji("scream")),
subtitle = "Data from Argentina | In AR$",
x = NULL, y = NULL,
caption = fuente) +
axis_x_n +
styleh
This is where funModeling comes in handy. The profiling_num function delivers a table with a lot of descriptive statistics for numerical variables. The filtering code further down reads its result from an object called numerical.
##    variable     mean  std_dev variation_coef  p_01  p_05  p_25  p_50   p_75
## 1 ft_salary 93287.88 104693.9       1.122267 217.5 35000 56000 75000 105250
##     p_95   p_99 skewness kurtosis   iqr        range_98        range_80
## 1 195500 290000 14.41845  275.362 49250 [217.5, 290000] [44500, 150000]
Since I want to analyze the central values of the salaries, I’ll keep only the values between the 5th and 95th percentiles.
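To show what that trimming does, here is a minimal base R sketch on made-up numbers. The vector toy_salary and its values are assumptions for illustration, not survey data, and the sketch uses quantile() directly instead of the funModeling table used in the post's own code.

```r
# Toy salaries, invented for illustration; the first and last values
# mimic the implausible extremes seen in the summary
toy_salary <- c(2, 35000, 48000, 56000, 60000, 75000, 82000,
                105000, 150000, 195000, 2140000)

# 5th and 95th percentiles (quantile() interpolates with type 7 by default)
limits <- quantile(toy_salary, probs = c(0.05, 0.95), names = FALSE)

# Keep values inside the closed interval [p05, p95]
trimmed <- toy_salary[toy_salary >= limits[1] & toy_salary <= limits[2]]

range(trimmed)
```

Both extreme values drop out while the bulk of the distribution stays. The post's own code, next, applies the same idea to the real data using the stored percentiles.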
# Store percentiles 5 and 95 to filter
p05 <- numerical[1,6] # Column 6 of the table above is p_05
p95 <- numerical[1,10] # Column 10 of the table above is p_95
# Keep values between p05 and p95 (both ends included)
salaries <- salaries %>%
filter(between( # Supporting function
ft_salary, # Variable to filter
p05, # Minimum threshold
p95 # Maximum threshold
))
# Now I can remove the objects I will no longer use
rm(numerical, p05, p95)
Now that we have a cleaner version of the data we can start comparing the visualizations:
ggplot(salaries, aes(x = ft_salary)) +
geom_histogram() +
labs(title = "Gross Salary Distribution in HR | Clean Data",
subtitle = "Data from Argentina | In AR$",
x = NULL, y = NULL,
caption = fuente) +
axis_x_n +
styleh
Comparing visualizations
Why would I use a boxplot instead of a simple bar plot? Let’s try a bar chart to compare the median salary of men and women in each role:
salaries %>%
group_by(role, gender) %>%
summarise(median_salary = median(ft_salary)) %>%
ggplot(aes(x = role, y = median_salary, fill = gender)) +
geom_col(position = "dodge") +
scale_fill_manual(values = gender_colors) +
axis_y_n +
styleh +
labs(title = "Median Salary per Role and Gender in HR",
subtitle = "Data from Argentina | In AR$",
y = "Median Salary in AR$",
x = NULL,
fill = "Gender",
caption = fuente)
If we look at each bar, we can see that the pay gap for Analysts and Supervisors is larger than in the other roles, with male HR people earning more than their female colleagues. But when we look at HRBPs and Managers, the median salary for women is slightly higher than for men, and in the Head role the medians are quite close. So with this evidence we could say that the gender salary gap in Human Resources in Argentina is not an issue, but…
Boxplots
In the discussion on LinkedIn, Nick Desbarats said that he struggles to come up with use cases in which boxplots would be the best choice, so I shared this plot:
ggplot(salaries, aes(x = role, y = ft_salary, fill = gender)) +
geom_boxplot() +
scale_fill_manual(values = gender_colors) +
axis_y_n +
styleh +
labs(title = "Salary Distribution in HR Roles in Argentina",
subtitle = "In AR$",
y = "Gross Salary in AR$",
x = NULL,
fill = "Gender",
caption = fuente)
What I like about this visualization is that we can read the shape of the salary distribution from the size of the two halves of each box. Let’s take the Head position, for instance. The medians are similar, but for women the bottom half of the box is larger, which means the salaries between the first quartile and the median are more spread out. That tells us there are women in Head positions with salaries well below the median.
The opposite happens with male professionals in the Head position. The top half of the box is larger, meaning that there are men in Head positions with salaries well above the median.
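The half-box reading above can be checked with numbers. A minimal base R sketch, using two invented salary vectors (women and men are toy values, not survey data) and a hypothetical helper box_halves(), computes the height of each half of the box as the distance from the median to the first and third quartiles:

```r
# Toy salaries, invented for illustration: same median, different skew
women <- c(60000, 70000, 80000, 100000, 103000, 106000, 110000)
men   <- c(90000, 94000, 97000, 100000, 120000, 135000, 150000)

# Hypothetical helper: heights of the lower and upper halves of the box
box_halves <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(lower_half = q[2] - q[1],  # median minus first quartile
    upper_half = q[3] - q[2])  # third quartile minus median
}

halves <- rbind(women = box_halves(women), men = box_halves(men))
halves
```

With these toy values both medians are equal, yet the lower half is the taller one for the first group and the upper half for the second, which is the pattern described for the Head role.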
But Nick has a point. How many data points do we have? 3, 15, 300? We can’t tell from this plot. So he suggested trying a violin plot. Let’s see what happens.
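One way to answer the sample-size question without leaving the boxplot is to count the observations behind each box and report them next to the chart. A minimal base R sketch on an invented data frame (toy and its rows are assumptions for illustration):

```r
# Toy rows, invented for illustration (not survey responses)
toy <- data.frame(
  role   = c("Analyst", "Analyst", "Analyst", "Head", "Head", "Manager"),
  gender = c("Female", "Female", "Male", "Female", "Male", "Male")
)

# Cross-tabulate role by gender: one cell per box in the plot
counts <- table(role = toy$role, gender = toy$gender)
counts
```

A cell with only a handful of cases is a warning that the corresponding box summarizes very little data. The same idea applies to the role and gender columns of salaries.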
Violin plot
Violin plots have become an alternative to boxplots. Their length shows the range of the values, and their width shows how concentrated the data is at each value. The widest section marks where observations are most concentrated, which is often, but not always, close to the median.
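The width of a violin comes from a kernel density estimate, so its widest point is the peak of that estimate. A minimal base R sketch on simulated, right-skewed values (the log-normal parameters are assumptions chosen for illustration, not fitted to the survey) compares that peak with the median:

```r
set.seed(42)  # Reproducible simulated values, not survey data

# Right-skewed toy "salaries" drawn from a log-normal distribution
x <- rlnorm(2000, meanlog = log(75000), sdlog = 0.5)

d    <- density(x)              # Gaussian kernel density estimate
peak <- d$x[which.max(d$y)]     # Value where a violin would be widest
med  <- median(x)

c(widest_point = peak, median = med)
```

For right-skewed data such as salaries, the peak of the density tends to sit below the median, so the widest part of a violin is better read as the most common value than as the median.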
Let’s switch our original boxplot into a violin plot.
ggplot(salaries, aes(x = role, y = ft_salary, fill = gender)) +
geom_violin() +
scale_fill_manual(values = gender_colors) +
axis_y_n +
styleh +
labs(title = "Salary Distribution in HR Roles in Argentina",
subtitle = "In AR$",
y = "Gross Salary in AR$",
x = NULL,
fill = "Gender",
caption = fuente)
With this many roles it’s hard to appreciate the value of this kind of chart. So let’s repeat it with only Analysts and Managers.
salaries %>%
filter(role %in% c("Analyst", "Manager")) %>%
ggplot(aes(x = role, y = ft_salary, fill = gender)) +
geom_violin() +
scale_fill_manual(values = gender_colors) +
axis_y_n +
styleh +
labs(title = "Salary Distribution in HR Roles in Argentina",
subtitle = "In AR$",
y = "Gross Salary in AR$",
x = NULL,
fill = "Gender",
caption = fuente)
The wider a section of the plot, the more cases fall in that salary region. So, for male managers we can see that most of them are close to the median. The length of each plot indicates the range of the values. In the case of female managers, the range goes from around AR$ 50.000 up to close to AR$ 200.000, and the width is quite even along the whole range.
In the case of the Analysts, for women there is a wider section around AR$ 50.000 that then starts to narrow towards the top. For men the widest part sits higher than for women, and the range extends to greater values.
Perhaps for this dataset the violin plot is not the most suitable choice for seeing all the roles together. Let’s try a scatter plot.
Scatter plot
Another way to see the distribution of the data points is the scatter plot. We usually see scatter plots used to visualize relationships between two numerical variables, but we can use them with nominal variables as well.
ggplot(salaries, aes(x = role, y = ft_salary, color = gender)) +
geom_point(size = 3,
alpha = 0.2,
position = position_jitter(0.3)) +
scale_color_manual(values = gender_colors) +
styleh +
axis_y_n +
labs(title = "Salary Distribution per Gender and Role in HR",
subtitle = "HR Professionals in Argentina | In AR$",
y = "Gross Salary in AR$",
x = NULL,
color = "Gender",
caption = fuente)
Again, for this dataset the scatter plot can be more confusing, because some roles, such as Analyst and Supervisor, are so populated that it’s hard to tell the differences by color. But for Managers, for instance, we can appreciate the range of the salaries and also the concentration for men.
Perhaps we could split the chart into smaller charts, to see if that helps clarify the data and its interpretation.
ggplot(salaries, aes(x = gender, y = ft_salary, color = gender)) +
geom_point(size = 3,
alpha = 0.2,
position = position_jitter(0.22)) +
scale_color_manual(values = gender_colors) +
styleh +
axis_y_n +
labs(title = "Salary Distribution per Gender and Role in HR",
subtitle = "HR Professionals in Argentina | In AR$",
y = "Gross Salary in AR$",
x = NULL,
color = "Gender",
caption = fuente) +
facet_wrap(~role, nrow = 1)
Now we can better appreciate the positions of all the data points, where the data is more concentrated, and the different salary ranges for men and women in each role. So it’s easier to compare and analyze the outputs and to see the number of observations.
Since I’m designing all of these visualizations I might be biased, but in my opinion, seeing all the roles together in one visualization increases the cognitive load of interpreting the salary situation for both genders and all the roles at once.
Conclusions
My perception of the plots is biased by my experience. For this specific dataset, an easier-to-read plot like the last scatter plot might make it harder to draw conclusions. I acknowledge that violin plots are more sophisticated from a design point of view, and they provide a wow effect that makes you react with “Wow, this looks cool”, but you can end up having no clue how to interpret what you are seeing.
I believe it’s Alberto Cairo who says visualizations are a tool to represent a complex reality in a visual encoding. So every visualization has some ups (simplicity for bar plots, a lot of information for boxplots) and downs (oversimplification and hiding reality for bar plots; too much information and “hiding” the number of data points for boxplots), and we have to make conscious decisions when we use visualizations to communicate results:
What’s the best visualization to represent the story of the data?
What’s my purpose? To shock and awe? To educate? How much time do I have to explain the visualization?
How much does the audience know about data and visualization, and how much do they really want to improve their capabilities?
Do we need a simple visualization that everyone understands, or should we use a more complex visualization for experts in the field?
For this particular example and this specific dataset, if you want to display the results of all roles in one plot, the boxplot is the best choice because, even with its complexity, it condenses enough information to draw conclusions and understand the story behind the data. I think that in this case its consistent shape across roles and genders reduces the cognitive load and helps users compare results once they understand how to interpret the chart.
Anyway, don’t take my word for it. As I said before, I am not an expert in visualization but an enthusiastic People Analytics practitioner. So you might agree with me or not, and that’s okay. I hope this post helps you consider new angles the next time you need to use visualizations to communicate results. Ideally, I should have done this exercise with different datasets, but who has the time? I guess that reinforces my initial statement: no plot is better than another per se. There are plots that serve better in specific situations, for a specific audience, with a specific purpose, and depending on the data you have.
If you want to reach out, contact me through my LinkedIn profile, on Twitter, by Telegram, or send me an email at sergio@d4hr.com. You can access this repo on GitHub to reproduce the results. Follow Nightingaledvs.com for amazing content on data visualization; it was great to navigate through their content.
Thanks for reading.
🤘 Sergio
PS: What’s your favorite plot?
# Boxplot
salaries %>%
filter(role == "Analyst") %>%
ggplot(aes(y = ft_salary, fill = gender)) +
geom_boxplot() +
scale_fill_manual(values = gender_colors) +
axis_y_n +
styleh +
labs(title = "Boxplot",
subtitle = "In AR$",
y = "Gross Salary in AR$",
x = NULL,
fill = "Gender",
caption = fuente)
# Violin plot
salaries %>%
filter(role == "Analyst") %>%
ggplot(aes(x = role, y = ft_salary, fill = gender)) +
geom_violin() +
scale_fill_manual(values = gender_colors) +
axis_y_n +
styleh +
labs(title = "Violin Plot",
y = "Gross Salary in AR$",
x = NULL,
fill = "Gender",
caption = fuente)
# Scatter plot
salaries %>%
filter(role == "Analyst") %>%
ggplot(aes(x = gender, y = ft_salary, color = gender)) +
geom_point(size = 3,
alpha = 0.2,
position = position_jitter(0.22)) +
scale_color_manual(values = gender_colors) +
styleh +
axis_y_n +
labs(title = "Scatter Plot",
y = "Gross Salary in AR$",
x = NULL,
color = "Gender",
caption = fuente)
R packages used to make this post
These are the R packages I used to make this post:
rmdformats: Julien Barnier (2021). rmdformats: HTML Output Formats and Templates for ‘rmarkdown’ Documents. R package version 1.0.3. https://CRAN.R-project.org/package=rmdformats
funModeling: Pablo Casas (2020). funModeling: Exploratory Data Analysis and Data Preparation Tool-Box. R package version 1.9.4. https://CRAN.R-project.org/package=funModeling
tidyverse: Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
scales: Hadley Wickham and Dana Seidel (2020). scales: Scale Functions for Visualization. R package version 1.1.1. https://CRAN.R-project.org/package=scales
googlesheets4: Jennifer Bryan (2021). googlesheets4: Access Google Sheets using the Sheets API V4. R package version 1.0.0. https://CRAN.R-project.org/package=googlesheets4
gargle: Jennifer Bryan, Craig Citro and Hadley Wickham (2021). gargle: Utilities for Working with Google APIs. R package version 1.2.0. https://CRAN.R-project.org/package=gargle
extrafont: Winston Chang (2014). extrafont: Tools for using fonts. R package version 0.17. https://CRAN.R-project.org/package=extrafont
emo: Hadley Wickham, Romain François and Lucy D’Agostino McGowan (2021). emo: Easily Insert ‘Emoji’. R package version 0.0.0.9000. https://github.com/hadley/emo
rmarkdown: JJ Allaire, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.11. https://rmarkdown.rstudio.com