Box Plot

Author

Manoj

Program: Generate a basic box plot with ggplot2, enhanced with notches and highlighted outliers and grouped by a categorical variable, using a built-in R dataset.

Steps

  • Step 1: Load required packages and library - ggplot2
  • Step 2: Use and explore built-in dataset - iris
  • Step 3: Visualize box plot with notches and outliers - step by step

Step 1: Load required packages and library - ggplot2

We use the ggplot2 package for data visualization. If it is not already installed, you can install it using:

install.packages("ggplot2")

library(ggplot2)
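
If the script is meant to run on machines where ggplot2 may or may not be present, a guarded install avoids reinstalling the package on every run. The following is a minimal sketch, assuming a CRAN mirror is already configured; the helper name `ensure_package` is hypothetical and not part of any package.

```r
# Hypothetical helper: install a package only when it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_package("ggplot2")
library(ggplot2)
```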

Step 2: Use and explore built-in dataset - iris

We will use the built-in iris dataset. This dataset contains measurements of sepal and petal dimensions for three species of iris flowers:

  • Setosa
  • Versicolor
  • Virginica

data <- iris
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
table(data$Species)

    setosa versicolor  virginica 
        50         50         50 
tail(iris)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
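Because the box plot below is grouped by Species, it can help to look at the same statistics per group rather than for the whole dataset. A minimal sketch using only base R; the variable names `by_species` and `species_medians` are illustrative.

```r
# Summary statistics of Sepal.Length, computed separately for each species.
by_species <- tapply(iris$Sepal.Length, iris$Species, summary)
by_species

# Medians only: these are the center lines the box plot will draw.
species_medians <- tapply(iris$Sepal.Length, iris$Species, median)
species_medians
```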

Step 3: Visualize box plot with notches and outliers - step by step

We now create a box plot of Sepal.Length, grouped by Species. We'll enhance the plot using:

  • Notches to show the confidence interval around the median
  • Outlier highlighting using color and shape
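
To make the two enhancements concrete, the numbers behind them can be computed by hand. The sketch below assumes the rules described in the ggplot2 documentation for geom_boxplot(): notches extend 1.58 * IQR / sqrt(n) on either side of the median, and points more than 1.5 * IQR beyond the box edges are drawn as outliers. The helper names `notch_limits` and `find_outliers` are hypothetical.

```r
# Notch limits: median +/- 1.58 * IQR / sqrt(n).
notch_limits <- function(x) {
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower  = median(x) - half_width,
    median = median(x),
    upper  = median(x) + half_width)
}

# One column per species, with rows lower / median / upper.
notches <- sapply(split(iris$Sepal.Length, iris$Species), notch_limits)
notches

# Outliers: values more than 1.5 * IQR below the first quartile
# or above the third quartile.
find_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * IQR(x)
  x[x < q[1] - fence | x > q[2] + fence]
}

outliers <- lapply(split(iris$Sepal.Length, iris$Species), find_outliers)
outliers
```

When the notches of two boxes do not overlap, this is usually read as evidence that the two medians differ.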

# Base R: a single box plot of all sepal lengths, ignoring species
boxplot(iris$Sepal.Length)
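
For comparison with the ggplot2 version, base R can also draw one notched box per species through its formula interface. This is a sketch for reference only; the colors and label text are assumptions chosen to resemble the ggplot2 plot, and `bp` is an illustrative variable name.

```r
# Base R formula interface: one notched box per species.
bp <- boxplot(
  Sepal.Length ~ Species,
  data = iris,
  notch = TRUE,
  col = "skyblue",
  main = "Sepal Length by Species",
  xlab = "Species",
  ylab = "Sepal Length (cm)"
)

# boxplot() invisibly returns the statistics it plotted.
bp$stats  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$conf   # lower and upper notch limits per group
bp$out    # values drawn as outlier points
```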

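# ggplot2: data and aesthetic mappings only; this draws empty axes because no geom has been added yet
ggplot(iris, aes(x = Species, y = Sepal.Length))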

# Add the box plot layer: one box per species
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()

# notch = TRUE draws a notch around each median
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(
  notch = TRUE
)

# notchwidth sets the width of the notch relative to the box (0.5 is the default)
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(
  notch = TRUE,
  notchwidth = 0.5
)

# Draw outlier points in red so they stand out
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(
  notch = TRUE,
  notchwidth = 0.5,
  outlier.color = "red"
)

# Shape 21 is a circle with a separate outline and fill; outlier.color sets the outline
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(
  notch = TRUE,
  notchwidth = 0.5,
  outlier.color = "red",
  outlier.shape = 21
)

# Fill the boxes with a light blue
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(
  notch = TRUE,
  notchwidth = 0.5,
  outlier.color = "red",
  outlier.shape = 21,
  fill = 'skyblue'
)

# alpha = 0.1 makes the fill nearly transparent
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(
  notch = TRUE,
  notchwidth = 0.5,
  outlier.color = "red",
  outlier.shape = 21,
  fill = 'skyblue',
  alpha = 0.1
)

# Add a title and axis labels
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(
  notch = TRUE,
  notchwidth = 0.5,
  outlier.color = "red",
  outlier.shape = 21,
  fill = 'skyblue',
  alpha = 0.1
) + labs(
  title = 'Sepal Length Distribution by Iris Species',
  x = 'Iris Species',
  y = 'Sepal Length (cm)'
)

# Apply a minimal theme for a cleaner background
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(
  notch = TRUE,
  notchwidth = 0.5,
  outlier.color = "red",
  outlier.shape = 21,
  fill = 'skyblue',
  alpha = 0.1
) + labs(
  title = 'Sepal Length Distribution by Iris Species',
  x = 'Iris Species',
  y = 'Sepal Length (cm)'
) + theme_minimal()
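
The step-by-step build above repeats the full call each time. Because ggplot2 plots are ordinary R objects, the same result can be assembled incrementally by storing each stage in a variable. This is one possible arrangement, not the only one; the names `p`, `p_box`, and `p_final` are illustrative.

```r
library(ggplot2)

# Base layer: data and aesthetic mappings only.
p <- ggplot(iris, aes(x = Species, y = Sepal.Length))

# Add the styled box plot layer.
p_box <- p + geom_boxplot(
  notch = TRUE,
  notchwidth = 0.5,
  outlier.color = "red",
  outlier.shape = 21,
  fill = "skyblue",
  alpha = 0.1
)

# Add labels and the theme last, so they are easy to change.
p_final <- p_box +
  labs(
    title = "Sepal Length Distribution by Iris Species",
    x = "Iris Species",
    y = "Sepal Length (cm)"
  ) +
  theme_minimal()

# Printing the object renders the plot.
p_final
```

Keeping the stages separate makes it easy to print `p_box` and `p_final` side by side, or to pass `p_final` to `ggsave()` to write the figure to a file.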