Boxplots for Exploratory Analysis.
This chapter will provide an overview of box and whiskers plots, describing what they are, why we use them and how to create them.
When we work with data, whether to explore it, use it to test hypotheses, or to present it to other people, we need to summarize all of our observations in a smaller number of summary statistics or visualizations. How we do this depends on the type of data that we have, the size of the data set, and the audience.
When it comes to types of variables, remember that we typically divide variables up into nominal/categorical, ordinal and interval/ratio. Some people add more types, for example calculated variables such as the murder rate or the poverty rate, but for this chapter we are going to group those with the interval/ratio variables. What makes these variables distinct is that they have an underlying unit that represents a fixed interval. Examples might be the number of children, the hours worked in a single day, or the percent of a state’s households that have income below the poverty level.
Describing numeric data
When we work with numerical variables and want to summarize them there are three basic characteristics that we usually focus on:
- Center, also called central tendency: what is an “average” or “typical” value for an observation.
- Variation, or how different observations are from each other.
- Shape, which can include anything from whether the distribution has one or more modes, to whether it is symmetrical (if you folded it in half both sides would match), to whether it is more peaked or less peaked than a normal distribution (sometimes called kurtosis).
Each of these is at least somewhat related to the other two.
In trying to understand our data or to present the data to others we have a lot of choices.
Let’s use the iris data set as an example. We can use a set of basic statistics to describe Sepal.Length.
First a quick review about the median and quartiles. The median or 50th percentile is the value which half the observations are below and half the observations are above. To find it, we sort our data and then find the middle value.
Sepal.Length.Sorted <- sort(iris$Sepal.Length)
Sepal.Length.Sorted
[1] 4.3 4.4 4.4 4.4 4.5 4.6 4.6 4.6 4.6 4.7 4.7 4.8 4.8 4.8 4.8 4.8 4.9 4.9 4.9 4.9 4.9 4.9 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
[32] 5.0 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.2 5.2 5.2 5.2 5.3 5.4 5.4 5.4 5.4 5.4 5.4 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.6 5.6 5.6
[63] 5.6 5.6 5.6 5.7 5.7 5.7 5.7 5.7 5.7 5.7 5.7 5.8 5.8 5.8 5.8 5.8 5.8 5.8 5.9 5.9 5.9 6.0 6.0 6.0 6.0 6.0 6.0 6.1 6.1 6.1 6.1
[94] 6.1 6.1 6.2 6.2 6.2 6.2 6.3 6.3 6.3 6.3 6.3 6.3 6.3 6.3 6.3 6.4 6.4 6.4 6.4 6.4 6.4 6.4 6.5 6.5 6.5 6.5 6.5 6.6 6.6 6.7 6.7
[125] 6.7 6.7 6.7 6.7 6.7 6.7 6.8 6.8 6.8 6.9 6.9 6.9 6.9 7.0 7.1 7.2 7.2 7.2 7.3 7.4 7.6 7.7 7.7 7.7 7.7 7.9
Since there are 150 irises, the middle value is between the 75th and 76th observations in the sorted data (150/2 = 75). In this case:
Sepal.Length.Sorted[75]
[1] 5.8
Sepal.Length.Sorted[76]
[1] 5.8
So our median is 5.8.
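As a quick check, here is a minimal sketch that repeats the by-hand calculation and compares it with R's built-in median() function. The variable names middle_by_hand and middle_builtin are only illustrative.

```r
# Re-create the sorted vector so this sketch runs on its own
Sepal.Length.Sorted <- sort(iris$Sepal.Length)
n <- length(Sepal.Length.Sorted)   # number of observations

# With an even number of observations, average the two middle values
middle_by_hand <- mean(Sepal.Length.Sorted[c(n / 2, n / 2 + 1)])

# The built-in function should agree with the by-hand value
middle_builtin <- median(iris$Sepal.Length)

middle_by_hand
middle_builtin
```

Averaging the two middle values matters only when they differ; here both are 5.8, so either one gives the median.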
To get the first quartile or 25th percentile, we know that 150/4 = 37.5. Rounding up that is 38. Thirty-seven observations are below observation 38 and 37 observations are above observation 38 and below the space between observations 75 and 76.
Sepal.Length.Sorted[38]
[1] 5.1
On the other side, 75.5 + 37.5 = 113. There are 37 observations starting at observation 76 and stopping at 112, and 37 observations starting just above 113 and going to 150.
Sepal.Length.Sorted[113]
[1] 6.4
You can get the same values by counting through the list of observations.
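The counting rule above can also be compared with R's quantile() function. This is only a sketch: quantile() interpolates between neighboring observations by default, so it happens to agree with the counted positions for this variable but is not guaranteed to do so for every data set.

```r
sorted_lengths <- sort(iris$Sepal.Length)

# Quartiles found by counting positions, as in the text
q1_by_position <- sorted_lengths[38]
q3_by_position <- sorted_lengths[113]

# Quartiles from R's default (interpolating) quantile method
quartiles <- quantile(iris$Sepal.Length, probs = c(0.25, 0.75))

q1_by_position
q3_by_position
quartiles
```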
Or we can take a shortcut.
summary(iris$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
This set of values, the so-called five number summary plus the mean, gives us a fair amount of information about the Sepal.Length values. For example, we see the highest and lowest values, which we could use to calculate a measure of variation, the range (in this case, 7.9 - 4.3 = 3.6). We also see the first and third quartiles, which we could use to calculate the interquartile range (in this case, 6.4 - 5.1 = 1.3).
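The two measures of variation just described can be computed directly. This is a small sketch using base R functions; the variable names are illustrative.

```r
x <- iris$Sepal.Length

# Range: distance from the minimum to the maximum
value_range <- diff(range(x))

# Interquartile range, first by hand and then with the built-in IQR()
iqr_by_hand <- unname(quantile(x, 0.75) - quantile(x, 0.25))
iqr_builtin <- IQR(x)

value_range
iqr_by_hand
iqr_builtin
```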
We also see the mean and the median. For exploratory analysis we usually use the median as our basic measure of central tendency, but having both the mean and the median is useful because it often gives us a clue about skewness. If a mean is greater than the median the skew is often to the right. If it is less, the skew is usually to the left. If they are about the same there is generally no meaningful skew.
In the case of our Sepal.Length data we can see that the mean and median are almost the same. But we can also see that the maximum is a bit further from the median than the minimum is (7.9 - 5.8 = 2.1 versus 5.8 - 4.3 = 1.5). Once we start trying to make those kinds of comparisons, things quickly become complex. Using a data visualization can help with that, and a box and whiskers plot is one of the basic ways to do that. Let’s try yet another way to get our five number summary.
dplyr::summarize(iris, Minimum=min(Sepal.Length),
"1st Quartile" = quantile(Sepal.Length, .25),
Median = median(Sepal.Length),
"3rd Quartile" = quantile(Sepal.Length, .75),
Maximum = max(Sepal.Length))
This is great. However, now suppose we want to compare the five number summary for the three different species of irises to see if they are different. By the way, the sepal is the green part that encloses the petals of the flower.
iris.grouped<-dplyr::group_by(iris, Species)
dplyr::summarize(iris.grouped, Minimum=min(Sepal.Length),
"1st Quartile" = quantile(Sepal.Length, .25),
Median = median(Sepal.Length),
"3rd Quartile" = quantile(Sepal.Length, .75),
Maximum = max(Sepal.Length))
Table: Five Number Summary by Species
This is still a nice, compact summary but it is a bit difficult to see what is going on. For example we can see that the minimum for two species is the same, but their maximums and medians are different. Is this meaningful? Using a visual representation can help with making the comparisons.
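If dplyr is not available, the grouped five number summary can be sketched in base R as well. The object name five_num and the row labels are chosen here for illustration; quantile() with its default settings returns the minimum, the three quartiles and the maximum.

```r
# Split the sepal lengths into one vector per species
lengths_by_species <- split(iris$Sepal.Length, iris$Species)

# quantile() with default probs gives the 0%, 25%, 50%, 75% and 100% points
five_num <- sapply(lengths_by_species, quantile)
rownames(five_num) <- c("Minimum", "1st Quartile", "Median",
                        "3rd Quartile", "Maximum")

# Transpose so that each species is a row, as in the table above
five_num <- t(five_num)
five_num
```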
Creating a Box Plot using ggplot2
The statistician and mathematician John Tukey developed box plots as a way to summarize this information visually. He was particularly interested in how to accurately summarize and interpret data without making any assumptions about the data, and in the detection of unusual or influential data points, often called outliers. He wrote a groundbreaking book called Exploratory Data Analysis that presented many of his ideas about this.
library(ggplot2)
ggplot(iris, aes(y=Sepal.Length, x=Species)) +
geom_boxplot() +
ggtitle("Sepal Length by Species in Anderson's Iris Data")

This plot shows the same information as the table, and even more. Let’s break it down. Each of the three species is represented by a box and whiskers, the lines sticking out of the boxes. One species, virginica, also has a point by itself. Each of the boxes has a dark line in the middle. For setosa this is at 5.0, for versicolor this is at 5.9, and for virginica this is at 6.5. Compare those numbers to the table, and you will see that these represent the medians for each of the species. If you look at the bottoms of the boxes, you will see they are at 4.8, 5.6, and 6.2. That is, they represent the first quartile. The tops of the boxes are at 5.2, 6.3, and 6.9. They represent the third quartile.
That means that the boxes represent the middle 50% of the observations for each species, those between the first and the third quartile (25th and 75th percentiles).
If we look more closely at these boxes the visualization tells us some interesting things. For example there is no overlap between the setosa species middle half and those of the other two species. There is a tiny amount of overlap between the middle 50% of the versicolor and virginica species. The overlap is small enough that the median of versicolor is below the 25th percentile of virginica and the median of virginica is above the 75th percentile of versicolor.
If we took the distance from the top of each box to the bottom, that would be the interquartile range. We can see that the IQR for setosa is much smaller than that for the other two species. We can also see that the median of setosa is in the middle of its box, while for the other two species the median is closer to the bottom of the box than to the top.
Make sure you read over the previous two paragraphs really carefully and understand all of the sentences. If percentiles are new to you, it is totally normal that you will have to read it and look at the graphs several times.
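The comparisons of the boxes described above can also be checked numerically. The sketch below is one way to do it in base R; the object names are illustrative.

```r
lengths_by_species <- split(iris$Sepal.Length, iris$Species)

# Height of each box: the interquartile range per species
iqr_by_species <- sapply(lengths_by_species, IQR)

# Bottom, middle line and top of each box
quartiles <- sapply(lengths_by_species, quantile, probs = c(0.25, 0.5, 0.75))

# Is the versicolor median below the bottom of the virginica box?
versicolor_median_below <- quartiles["50%", "versicolor"] < quartiles["25%", "virginica"]

# Is the virginica median above the top of the versicolor box?
virginica_median_above <- quartiles["50%", "virginica"] > quartiles["75%", "versicolor"]

iqr_by_species
quartiles
versicolor_median_below
virginica_median_above
```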
Now let’s take a look at the whiskers. For setosa the bottom whisker extends to 4.3 and the top one to 5.8. Those correspond to the minimum and maximum values. Similarly, the lower whisker for versicolor extends to the minimum value, 4.9, and the top whisker extends to the maximum, 7.0. So for these two species, the whiskers cover the lowest and highest quarters of the observations. All five numbers from the five number summary are shown.
We can see that all of the setosa values fall below the medians of both the versicolor and virginica values.
Virginica is different. Notice that in addition to the lower whisker there is a single dot. Remember that in our table the minimum value for virginica was the same as that for versicolor, 4.9. This is very strange considering that the median and first quartile of virginica are so much higher than for versicolor. The box and whiskers plot highlights the strangeness of this and shows just how far away that point is from the rest of the virginica observations. This kind of point is called an outlier because it is so far away from the other values. It’s such an odd plant for that species that we might even wonder if an example from another species was mislabelled or an error was made when the measurements were taken or copied by hand. On the other hand, maybe it is just a small flower because it was growing in the shade, was crowded by other plants, or had a random mutation. Many articles and books have been written using the iris data, and some errors have been found, but not for this flower.
This highlights an important point. Part of the purpose of a box and whiskers plot is to help identify outliers. That is why whiskers do not always go to the minimum or maximum value. There is a lot of debate in statistics about how to know if an observation is an outlier. Basically, this is an art and not a science, but there are some rules of thumb that people use. In the case of box and whiskers plots, this is how our graph would be done if we were going to do it by hand rather than using convenient functions in our software.
- Calculate the Interquartile range.
third_quartile<-6.9
first_quartile<-6.225
iqr <- third_quartile - first_quartile
iqr
[1] 0.675
- Multiply the IQR by 1.5.
iqr1.5 <- 1.5* iqr
iqr1.5
[1] 1.0125
- Take this value and subtract it from the first quartile and add it to the third quartile
first_quartile - iqr1.5
[1] 5.2125
third_quartile + iqr1.5
[1] 7.9125
Any values below 5.2125 or above 7.9125 would be considered outliers. Therefore the observation that is 4.9 is an outlier, and it is the only virginica observation below the lower cutoff. The lower whisker extends only to the lowest value that is at or above 5.2125. The maximum, 7.9, falls just inside the upper cutoff of 7.9125, so it is not an outlier; this is why we keep the cutoffs unrounded. The original data was rounded to the nearest tenth when the plants were measured, and we do not want to round an already rounded number.
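The steps above can be sketched starting from the data rather than from typed-in quartiles. This assumes the default quantile() method and the 1.5 multiplier described in the text; the names lower_fence, upper_fence and whisker_ends are illustrative.

```r
virginica <- iris$Sepal.Length[iris$Species == "virginica"]

# Quartiles and interquartile range for virginica
q1 <- unname(quantile(virginica, 0.25))
q3 <- unname(quantile(virginica, 0.75))
step <- 1.5 * (q3 - q1)

# Cutoffs beyond which a value is treated as an outlier
lower_fence <- q1 - step
upper_fence <- q3 + step

# Observations outside the cutoffs are drawn as separate points
outliers <- virginica[virginica < lower_fence | virginica > upper_fence]

# The whiskers stop at the most extreme observations inside the cutoffs
inside <- virginica[virginica >= lower_fence & virginica <= upper_fence]
whisker_ends <- range(inside)

outliers
whisker_ends
```

Keeping the cutoffs unrounded matters here: the maximum of 7.9 lies just inside the upper cutoff of 7.9125, so the upper whisker reaches it.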
Some crucial points
Here’s a crucial point. As a researcher and data analyst you have the ability to make many choices. In our box plots we used 1.5 times the IQR to define the whiskers. Not every box plot you see does this; in fact there are many variations on box plots. For example, sometimes a researcher may use 2 times the IQR or sometimes they may choose to extend the whiskers to the maximum.
If you learned to make a boxplot in elementary school you probably learned to extend the whiskers to the maximum. You could read the Help page for geom_boxplot() and learn how to change 1.5 to some other value.
If you are using a box plot for your own information, you can do what makes sense, but if you are presenting it to others you should always explain what approach you took to this.
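As one illustration of such a choice, the sketch below redraws the plot with whiskers at 2 times the IQR. It relies on the coef argument described on the geom_boxplot() help page; the object name p_wide and the title text are made up for this example.

```r
library(ggplot2)

# coef sets the whisker length as a multiple of the IQR (the default is 1.5)
p_wide <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(coef = 2) +
  ggtitle("Sepal Length by Species, whiskers at 2 x IQR")

p_wide
```

With this setting the lower cutoff for virginica becomes 6.225 - 2 * 0.675 = 4.875, so the 4.9 observation is no longer drawn as a separate point. The data have not changed, only the rule, which is why the rule should be reported.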
Here’s a second crucial point. Just because you identified a point as an outlier does not mean you know what to do about it. As an analyst you have to consider what the outlier means and how its presence should impact your analysis. Whatever you do you should record the decision and reasons for it completely and make sure to share that information with consumers of your work. This is an essential element of doing reproducible research.
Now that you have seen how a box plot is created, you can understand why sometimes people are very confused by them. They are jam-packed with information and you could write many paragraphs just summarizing what our box plot shows us about these species of irises.
Here are ten statements based on our box plot. Make sure you understand why each one of them is true.
- More than 75% of virginica have longer sepal lengths than the median versicolor.
- All setosa have sepal lengths shorter than 75% of virginica.
- The longest setosa is about equal to the median versicolor.
- The shortest versicolor and shortest virginica are the same length.
- All of the irises with sepal length greater than 7 are virginica.
- All of the irises with sepal length less than 4.9 are setosa.
- The interquartile ranges for virginica and versicolor overlap but neither group’s median is in the overlapping area.
- The virginica species sepal length has one possible outlier.
- The variation of setosa is less than that of the other two species.
- The virginica and versicolor distributions (centers, variation and shape) are more similar to each other than either is to the setosa.
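Several of the statements above can be checked directly against the data. The sketch below tests five of them; the names in checks are illustrative, and the remaining statements are better judged by looking at the plot.

```r
by_species <- split(iris$Sepal.Length, iris$Species)
setosa     <- by_species$setosa
versicolor <- by_species$versicolor
virginica  <- by_species$virginica

checks <- c(
  # More than 75% of virginica are longer than the median versicolor
  virginica_above_versicolor_median = mean(virginica > median(versicolor)) > 0.75,
  # Every setosa is shorter than the first quartile of virginica
  setosa_below_virginica_q1 = max(setosa) < unname(quantile(virginica, 0.25)),
  # The shortest versicolor and the shortest virginica are the same length
  same_minimum = min(versicolor) == min(virginica),
  # Irises longer than 7 are all virginica
  above_7_only_virginica = all(iris$Species[iris$Sepal.Length > 7] == "virginica"),
  # Irises shorter than 4.9 are all setosa
  below_4.9_only_setosa = all(iris$Species[iris$Sepal.Length < 4.9] == "setosa")
)

checks
```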
