Sameer Mathur
Boxplot
Regression Diagnostics
---
Boxplots are a standardized way of displaying the distribution of data based on a five-number summary:
Minimum
First Quartile (Q1)
Median (Q2)
Third Quartile (Q3)
Maximum
Median (Q2 / 50th Percentile): The middle value of the dataset.
First Quartile (Q1 / 25th Percentile): The middle number between the smallest number (not the “minimum”) and the median of the dataset.
Third Quartile (Q3 / 75th Percentile): The middle value between the median and the highest value (not the “maximum”) of the dataset.
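Interquartile range (IQR): The distance from the 25th to the 75th percentile, i.e. \( Q3 - Q1 \).
“Maximum”: Q3 + 1.5*IQR (the upper whisker limit, not necessarily the largest value in the data).
“Minimum”: Q1 - 1.5*IQR (the lower whisker limit, not necessarily the smallest value in the data).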
This simplest possible box plot displays the full range of variation (from min to max), the likely range of variation (the IQR), and a typical value (the median).
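As a minimal sketch of the definitions above, the quartiles, the IQR, and the 1.5*IQR limits can be computed by hand in R. The vector `x` is made-up illustrative data, not the airline dataset, and `quantile()` is used with its default settings (R's own `boxplot()` computes hinges slightly differently, so results may differ a little on small samples).

```r
# Illustrative data only (hypothetical values, not from the airline dataset)
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 30)

# Quartiles using quantile()'s default method
q   <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
q1  <- q[1]
med <- q[2]
q3  <- q[3]

# Interquartile range: Q3 - Q1 (equivalent to IQR(x))
iqr <- q3 - q1

# Whisker limits as defined above
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Whiskers are drawn to the most extreme observations inside the limits
whisker_low  <- min(x[x >= lower_limit])
whisker_high <- max(x[x <= upper_limit])

# Observations outside the limits would be plotted as individual points
beyond <- x[x < lower_limit | x > upper_limit]
```

In this example only the value 30 falls outside the limits, so the upper whisker stops at 10 rather than at the largest observation.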
Real datasets often contain surprisingly high maximums or surprisingly low minimums, called outliers.
John Tukey has provided a precise definition for two types of outliers:
Outliers are either \( 3 \times IQR \) or more above the third quartile or \( 3 \times IQR \) or more below the first quartile.
Suspected Outliers are slightly more central versions of outliers: either \( 1.5 \times IQR \) or more above the third quartile or \( 1.5 \times IQR \) or more below the first quartile.
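The two definitions above can be sketched as a small R helper. The function name `classify_tukey` and the sample vector are hypothetical and chosen only for illustration; quartiles come from `quantile()` with its default method.

```r
# Hypothetical helper: label each value using the 1.5*IQR and 3*IQR rules
classify_tukey <- function(x) {
  q1  <- quantile(x, 0.25, names = FALSE)
  q3  <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1

  # 1.5*IQR or more beyond a quartile: suspected outlier
  suspected <- x >= q3 + 1.5 * iqr | x <= q1 - 1.5 * iqr
  # 3*IQR or more beyond a quartile: outlier
  outlier   <- x >= q3 + 3 * iqr | x <= q1 - 3 * iqr

  label <- rep("inside", length(x))
  label[suspected] <- "suspected outlier"
  label[outlier]   <- "outlier"   # overrides the weaker label
  label
}

# Illustrative data only (not from the airline dataset)
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 18, 30)
labels <- classify_tukey(x)
table(labels)
```

Here Q1 is 4.5 and Q3 is 9.5, so the IQR is 5; the value 18 lies past Q3 + 1.5*IQR = 17 and 30 lies past Q3 + 3*IQR = 24.5.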
The previous image compares a boxplot of a nearly normal distribution with the probability density function (pdf) of a normal distribution.
The reason for showing this image is that looking at a statistical distribution is more commonplace than looking at a box plot.
# reading data into R
airline.df <- read.csv("BOMDELBOM.csv")
# attaching data columns of the dataframe
attach(airline.df)
We can see that our data is right-skewed. We conclude that outliers are present in our dataset.
# plot density
plot(density(Price), frame = TRUE,
     main = "Density Plot of Ticket Price")
polygon(density(Price), col = "black")
# normality plot using qqnorm and qqline
qqnorm(Price)
qqline(Price)
# quantiles
quantile(Price)
0% 25% 50% 75% 100%
2607 4051 4681 5725 18015
where
0% is the minimum
50% is the median
100% is the maximum
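As a rough check, the 1.5*IQR and 3*IQR limits can be worked out from the quantiles printed above. This sketch takes the printed values at face value (they may be rounded) and hard-codes them, so it runs without the CSV file; the variable names are illustrative.

```r
# Values copied from the quantile(Price) output above
price_min <- 2607
q1        <- 4051
q3        <- 5725
price_max <- 18015

iqr <- q3 - q1                    # 5725 - 4051 = 1674

lower_limit <- q1 - 1.5 * iqr     # 4051 - 2511 = 1540
upper_limit <- q3 + 1.5 * iqr     # 5725 + 2511 = 8236
outer_upper <- q3 + 3 * iqr       # 5725 + 5022 = 10747

# Is the largest price an outlier under the 3*IQR rule?
max_is_outlier <- price_max >= outer_upper

# Is the smallest price below the lower 1.5*IQR limit?
min_is_suspect <- price_min <= lower_limit
```

The maximum price of 18015 is well above 10747, while the minimum of 2607 stays above 1540, which is consistent with a right-skewed distribution whose outliers sit on the high side.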