Box-and-whisker plots give a visual indication of the spread of data point values for a continuous numerical variable.
The central box has as its upper margin (on the \(y\)-axis) the third quartile value and as its lower margin the first quartile value. A central bar in the box shows the median.
The whiskers stretch up and down from the ends of the box and can terminate at the minimum and maximum values. Some plots, though, terminate their whiskers at the most extreme values that lie within \(1.5\) times the interquartile range (IQR, the third quartile minus the first quartile) above the third quartile and below the first quartile, so that any potential outliers beyond them are shown as individual points.
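The \(1.5\) times IQR rule can be checked by hand. The sketch below uses base R only and a small, hypothetical vector `x` (not part of the tutorial data). It relies on R's default quantile method; other software, Plotly included, may compute quartiles slightly differently, so the values need not match a plot exactly.

```r
# Hypothetical sample values, chosen only for illustration
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 25)

q1 <- unname(quantile(x, 0.25)) # first quartile (R's default method)
q3 <- unname(quantile(x, 0.75)) # third quartile
iqr <- q3 - q1                  # interquartile range

# Fences at 1.5 times the IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are the potential outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers end at the most extreme values still inside the fences
whisker_low <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])
```

Here the value 25 falls above the upper fence, so the upper whisker stops at 10 and 25 would be drawn as a separate point.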
In this tutorial we will be working with some simulated numerical and categorical variable data point values. In the code chunk below we create these and add them to a tibble.
A tibble is a more modern version of the built-in R data.frame. We can display it as an interactive table when we render this R Markdown file with knitr. This is done with the datatable function from the DT package.
The numerical variable will be that of income. We will also create two categorical variables, one for career stage and one for country.
# Load the packages used in this tutorial
library(tibble) # tibble()
library(DT)     # datatable()
library(plotly) # plot_ly(), layout() and the %>% pipe
# Seeding the pseudo-random number generator for reproducible results
set.seed(1234)
# Create three variables
income <- round(rnorm(500,          # 500 random data point values
                      mean = 10000, # mean of 10000
                      sd = 1000),   # standard deviation of 1000
                digits = 2)         # round the random values to two decimal places
stage <- sample(c("Early",
                  "Mid",
                  "Late"),          # sample space of the stage variable
                500,                # 500 random data point values
                replace = TRUE)     # sample with replacement
country <- sample(c("USA",
                    "Canada"),      # sample space of the country variable
                  500,              # 500 random data point values
                  replace = TRUE)   # sample with replacement
# Create tibble
df <- tibble(Income = income,   # create an Income variable for the income data point values
             Stage = stage,     # create a Stage variable for the stage data point values
             Country = country) # create a Country variable for the country data point values
# Print a data table
datatable(df)
In the code chunk below is a simple box-and-whisker plot of the income for all the simulated survey members.
p1 <- plot_ly(type = "box",
              y = ~Income,
              data = df,
              name = "All income") %>%
  layout(title = "Overall income",
         xaxis = list(title = "",
                      zeroline = FALSE),
         yaxis = list(title = "Income",
                      zeroline = FALSE))
p1
Horizontal box plots are easy to create: we simply map the variable to the x-axis instead of the y-axis.
p2 <- plot_ly(type = "box",
              x = ~Income,
              data = df,
              name = "All income") %>%
  layout(title = "Overall income",
         yaxis = list(title = "",
                      zeroline = FALSE),
         xaxis = list(title = "Income",
                      zeroline = FALSE))
p2
We can add all the data point values. Since there are many, with potential overlap, we add a bit of jitter. The points can also be offset so that they do not overlay the box.
p3 <- plot_ly(type = "box",
              y = ~Income,
              data = df,
              name = "All income",
              boxpoints = "all",
              jitter = 0.3,
              pointpos = -2) %>%
  layout(title = "Overall income",
         xaxis = list(title = "",
                      zeroline = FALSE),
         yaxis = list(title = "Income",
                      zeroline = FALSE))
p3
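The boxpoints argument accepts other values besides "all". As a minimal sketch, the code below draws only the points beyond the whiskers (as far as we know, this is Plotly's default) and then hides the points entirely. The `df_demo` data frame and the plot names are stand-ins created here so that the example runs on its own.

```r
library(plotly)

# Simulated incomes, assumed to mirror the tutorial's Income variable
set.seed(1234)
df_demo <- data.frame(Income = round(rnorm(500, mean = 10000, sd = 1000),
                                     digits = 2))

# Show only the points that lie beyond the whiskers
p_outliers <- plot_ly(df_demo,
                      y = ~Income,
                      type = "box",
                      name = "Outliers only",
                      boxpoints = "outliers")

# Hide the individual points altogether
p_none <- plot_ly(df_demo,
                  y = ~Income,
                  type = "box",
                  name = "No points",
                  boxpoints = FALSE)
```

Printing `p_outliers` or `p_none` renders the corresponding plot.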
As mentioned, the central box shows a line for the median. The mean can also be added. This is done by adding the boxmean = TRUE argument. If the argument is set to boxmean = "sd", then both the mean and the standard deviation are indicated.
p4 <- plot_ly(type = "box",
              y = ~Income,
              data = df,
              name = "All income",
              boxmean = "sd") %>%
  layout(title = "Overall income",
         xaxis = list(title = "",
                      zeroline = FALSE),
         yaxis = list(title = "Income",
                      zeroline = FALSE))
p4
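To have numbers to compare against the mean and standard deviation markings, the same statistics can be computed in base R. This is only a sketch: `income_demo` is regenerated here with the tutorial's seed and parameters so the chunk stands alone, and R's sd uses the sample (n - 1) denominator, which may differ slightly from what Plotly draws.

```r
# Regenerate incomes with the same seed and parameters as the tutorial
set.seed(1234)
income_demo <- round(rnorm(500, mean = 10000, sd = 1000), digits = 2)

income_mean <- mean(income_demo)     # arithmetic mean
income_sd <- sd(income_demo)         # sample standard deviation (n - 1)
income_median <- median(income_demo) # the bar inside the box

# Collect the three statistics in one named vector
c(mean = income_mean, sd = income_sd, median = income_median)
```

For roughly symmetric data such as these, the mean and the median should lie close to each other.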
We can use the color = argument to split the numerical variable according to one of the categorical variables.
p5 <- plot_ly(df,
              y = ~Income,
              color = ~Stage,
              type = "box") %>%
  layout(title = "Income by career stage",
         xaxis = list(title = "Stage",
                      zeroline = FALSE),
         yaxis = list(title = "Income",
                      zeroline = FALSE))
p5
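Because Stage holds character values, the groups will probably appear in alphabetical order (Early, Late, Mid) and not in career order. One possible remedy, assuming Plotly follows factor level order as it usually does for categorical variables, is to convert the variable to a factor with explicit levels. The small `df_demo` data frame below is hypothetical and only stands in for the tutorial's data.

```r
# Hypothetical small data frame standing in for the tutorial's df
df_demo <- data.frame(
  Income = c(9500, 10200, 11000, 9800, 10500, 10900),
  Stage = c("Late", "Early", "Mid", "Early", "Late", "Mid")
)

# Character values sort alphabetically: Early, Late, Mid
alphabetical <- sort(unique(df_demo$Stage))

# Declaring the levels explicitly records the career order
df_demo$Stage <- factor(df_demo$Stage, levels = c("Early", "Mid", "Late"))
levels(df_demo$Stage)
```

The same conversion applied to the Stage variable before plotting should leave the rest of the plotting code unchanged.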
We can even categorize the x-axis by the Country variable and then split each country by the Stage variable using the color = argument.
p6 <- plot_ly(df,
              x = ~Country,
              y = ~Income,
              color = ~Stage,
              type = "box") %>%
  layout(boxmode = "group",
         title = "Income by career stage",
         xaxis = list(title = "Country",
                      zeroline = FALSE),
         yaxis = list(title = "Income",
                      zeroline = FALSE))
p6
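A grouped plot like this one has six boxes, one per combination of country and career stage. As a rough numerical companion, the medians for each combination can be tabulated with base R's aggregate function. The `df_demo` data frame is rebuilt here as a plain data.frame, with the same seed and parameters as the tutorial, purely so the sketch is self-contained.

```r
# Rebuild the simulated data as a plain data.frame
set.seed(1234)
df_demo <- data.frame(
  Income = round(rnorm(500, mean = 10000, sd = 1000), digits = 2),
  Stage = sample(c("Early", "Mid", "Late"), 500, replace = TRUE),
  Country = sample(c("USA", "Canada"), 500, replace = TRUE)
)

# Median income for every Country and Stage combination
group_medians <- aggregate(Income ~ Country + Stage,
                           data = df_demo,
                           FUN = median)
group_medians
```

Each row of `group_medians` corresponds to the central bar of one box in a grouped plot of the same data.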
There are many different marker symbols to choose from. Below we draw the outliers as squares with a central dot.
p7 <- plot_ly(type = "box",
              y = ~Income,
              data = df,
              name = "All income",
              marker = list(symbol = "square-dot")) %>%
  layout(title = "Overall income",
         xaxis = list(title = "",
                      zeroline = FALSE),
         yaxis = list(title = "Income",
                      zeroline = FALSE))
p7
The color of the box and its outline can be selected.
p8 <- plot_ly(type = "box",
              y = ~Income,
              data = df,
              name = "All income",
              marker = list(symbol = "square-dot"),
              fillcolor = "pink",
              line = list(color = "gray",
                          width = 2)) %>%
  layout(title = "Overall income",
         xaxis = list(title = "",
                      zeroline = FALSE),
         yaxis = list(title = "Income",
                      zeroline = FALSE))
p8
There are many color sets to choose from. Here we use the ColorBrewer palette Set3.
p9 <- plot_ly(type = "box",
              y = ~Income,
              color = ~Country,
              data = df,
              marker = list(symbol = "square-dot"),
              colors = "Set3") %>%
  layout(title = "Income by country",
         xaxis = list(title = "Country",
                      zeroline = FALSE),
         yaxis = list(title = "Income",
                      zeroline = FALSE))
p9
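Instead of a palette name, the colors argument can also take a vector of colors. The sketch below pairs each country with a specific hex color through a named vector. The data frame `df_demo`, the vector `country_colors`, and the two hex values are illustrative choices and not part of the tutorial.

```r
library(plotly)

# Small simulated data set standing in for the tutorial's df
set.seed(1234)
df_demo <- data.frame(
  Income = round(rnorm(200, mean = 10000, sd = 1000), digits = 2),
  Country = sample(c("USA", "Canada"), 200, replace = TRUE)
)

# Named vector: each level of Country is paired with a hex color
country_colors <- c(USA = "#1b9e77", Canada = "#d95f02")

p_custom <- plot_ly(df_demo,
                    y = ~Income,
                    color = ~Country,
                    colors = country_colors,
                    type = "box")
```

Printing `p_custom` renders the plot. Naming the vector keeps the pairing between level and color explicit, so it does not depend on the order in which the levels appear.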
Creating box-and-whisker plots is easy using Plotly for R. They are a good way to show the spread of numerical data point values, either overall or grouped by a categorical variable.