Understanding Boxplot

Sameer Mathur

Boxplot

Regression Diagnostics

---

Understanding Box-plot

Different Parts of Box-plot

Boxplots are a standardized way of displaying the distribution of data based on a five-number summary:

  1. Minimum
  2. First Quartile (Q1)
  3. Median (Q2)
  4. Third Quartile (Q3)
  5. Maximum




  • Median (Q2 / 50th Percentile): The middle value of the dataset.




  • First Quartile (Q1 / 25th Percentile): The middle number between the smallest number (not the “minimum”) and the median of the dataset.




  • Third Quartile (Q3 / 75th Percentile): The middle value between the median and the highest value (not the “maximum”) of the dataset.



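  • Interquartile range (IQR): The distance between the 25th and the 75th percentile, IQR = Q3 - Q1.

  • “Maximum”: Q3 + 1.5*IQR. This is the upper limit for the whisker, not necessarily the largest value in the data.

  • “Minimum”: Q1 - 1.5*IQR. This is the lower limit for the whisker, not necessarily the smallest value in the data.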
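The quantities listed above can be computed directly in R. The following is a minimal sketch on a small made-up vector `x` (hypothetical data, not the airline data). It uses `quantile()` with R's default method; note that `boxplot()` itself is based on Tukey's hinges (`fivenum()`), which can differ slightly from these quartiles on small samples.

```r
# Hypothetical sample, used only for illustration
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 12, 30)

# Quartiles with R's default quantile method (type 7)
q   <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
q1  <- q[1]   # 4.25
med <- q[2]   # 6.5
q3  <- q[3]   # 8.75

# Interquartile range and the 1.5 * IQR limits
iqr         <- q3 - q1          # 4.5
lower_limit <- q1 - 1.5 * iqr   # -2.5
upper_limit <- q3 + 1.5 * iqr   # 15.5

# The whiskers stop at the most extreme observations inside the limits
whisker_low  <- min(x[x >= lower_limit])   # 2
whisker_high <- max(x[x <= upper_limit])   # 12

# Observations beyond the limits are drawn as individual points
beyond <- x[x < lower_limit | x > upper_limit]   # 30
```

Here the value 30 lies above the upper limit of 15.5, so the upper whisker ends at 12 and 30 would be plotted as a separate point.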

Simple Box-plot

The simplest possible box plot displays the full range of variation (from min to max), the likely range of variation (the IQR), and a typical value (the median).

Real datasets often contain surprisingly high or surprisingly low values, called outliers.

John Tukey has provided a precise definition for two types of outliers:

  • Outliers are either \( 3 \times IQR \) or more above the third quartile or \( 3 \times IQR \) or more below the first quartile.

  • Suspected Outliers are slightly more central versions of outliers: either \( 1.5 \times IQR \) or more above the third quartile or \( 1.5 \times IQR \) or more below the first quartile.
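The two rules above can be applied mechanically. The sketch below is one possible way to do it; `classify_outliers()` is a hypothetical helper written for this illustration, not a base R function, and it assumes the IQR is strictly positive.

```r
# Hypothetical helper: label each value using the 1.5 * IQR and 3 * IQR rules
classify_outliers <- function(x) {
  q1  <- quantile(x, 0.25, names = FALSE)
  q3  <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1
  stopifnot(is.numeric(x), iqr > 0)   # assumption: non-degenerate data

  # Distance outside the box: positive only below Q1 or above Q3
  dist <- pmax(q1 - x, x - q3)

  label <- rep("regular", length(x))
  label[dist >= 1.5 * iqr] <- "suspected outlier"
  label[dist >= 3 * iqr]   <- "outlier"   # overrides the weaker label
  label
}

# Made-up data: Q1 = 4.25, Q3 = 8.75, IQR = 4.5
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 16, 30)
labels <- classify_outliers(x)
data.frame(x = x, label = labels)
```

With this sample, 16 is at least 1.5 × IQR above Q3 (the threshold is 15.5) and is labelled a suspected outlier, while 30 is more than 3 × IQR above Q3 (the threshold is 22.25) and is labelled an outlier.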

Box-plot on Normal distribution

The previous image compares a boxplot of a nearly normal distribution with the probability density function (pdf) of a normal distribution.

The image is shown because most readers are more used to looking at a statistical distribution than at a box plot.
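The link between the two views can also be checked numerically. As a small sketch for the standard normal distribution, the quartiles and the 1.5 × IQR limits follow from `qnorm()`, and `pnorm()` gives the share of the distribution that falls between those limits.

```r
# Quartiles of the standard normal distribution
q1  <- qnorm(0.25)   # about -0.674
q3  <- qnorm(0.75)   # about  0.674
iqr <- q3 - q1       # about  1.349

# 1.5 * IQR limits, in units of standard deviations
lower_limit <- q1 - 1.5 * iqr   # about -2.698
upper_limit <- q3 + 1.5 * iqr   # about  2.698

# Probability mass between the limits and beyond them
inside  <- pnorm(upper_limit) - pnorm(lower_limit)   # about 0.993
outside <- 1 - inside                                # about 0.007
```

So for normally distributed data only about 0.7% of observations are expected beyond the whisker limits; a much larger share in a sample is a hint that the data are not close to normal.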

Example

Outliers in Price Variable in Airline Data

# read the data into R
airline.df <- read.csv("BOMDELBOM.csv")
# attach the data frame so that its columns (e.g. Price) can be referenced by name
attach(airline.df)

Boxplot of Variable Price

[Figure: boxplot of Price]
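The chunk that drew this plot is not shown in the slides. A boxplot like it could be produced along the following lines; this is only a sketch, and the vector `price` is simulated right-skewed data standing in for the `Price` column, so the numbers will not match the airline data.

```r
# Simulated right-skewed values; a stand-in for the Price column (assumption)
set.seed(1)
price <- round(rlnorm(200, meanlog = 8.5, sdlog = 0.3))

# Draw the boxplot and keep the statistics it is based on
bp <- boxplot(price, horizontal = TRUE,
              main = "Boxplot of Ticket Price", xlab = "Price")

bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # values plotted individually beyond the whiskers
```

With the real data, `boxplot(Price, ...)` would be used instead of the simulated vector.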

Histogram of Variable Price

[Figure: histogram of Price]

We can see that our data is right-skewed. We conclude that outliers are present in our dataset.

Density Plot

# plot density
plot(density(Price), frame = TRUE, 
     main = "Density Plot of Ticket Price") 
polygon(density(Price), col = "black")

[Figure: density plot of ticket price]

Normality Plot

# normality plot using qqnorm and qqline
qqnorm(Price)
qqline(Price)

[Figure: normal Q-Q plot of Price]

Quartiles of Variable Price

# quantiles
quantile(Price)
   0%   25%   50%   75%  100% 
 2607  4051  4681  5725 18015 

where

  • 0% is minimum
  • 25% is first quartile (Q1)
  • 50% is median
  • 75% is third quartile (Q3)
  • 100% is maximum
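As a worked check, the quantiles printed above can be plugged into the 1.5 × IQR and 3 × IQR rules. The sketch below hard-codes the values from the `quantile(Price)` output so that it runs without the data file; the variable names are chosen for this illustration.

```r
# Values copied from the quantile(Price) output above
min_price <- 2607
q1        <- 4051
q3        <- 5725
max_price <- 18015

iqr <- q3 - q1                   # 1674

lower_limit <- q1 - 1.5 * iqr    # 1540
upper_limit <- q3 + 1.5 * iqr    # 8236
outer_upper <- q3 + 3 * iqr      # 10747

max_price > upper_limit   # TRUE: at least one value beyond the upper whisker
max_price > outer_upper   # TRUE: the largest price is beyond 3 * IQR above Q3
min_price < lower_limit   # FALSE: no values beyond the lower whisker
```

The largest price (18015) lies far above both 8236 and 10747, while the smallest price (2607) is still above 1540. This agrees with the right skew seen in the plots: the outliers are all on the high side.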