Introduction to data analysis using R, R Studio and R Markdown
Dee Chiluiza, PhD
Northeastern University
Boston, Massachusetts

 
Short manual series: Box plots
# {r library_data, message=FALSE, warning=FALSE}

# Libraries
library(dplyr) 
library(readxl)
library(RColorBrewer)  # Colors, check display.brewer.all()
library(e1071)         # For skewness and kurtosis

# Data sets
sample1bp = c(rnorm(1000, mean = 120, sd = 25),
            rnorm(500, mean = 250, sd = 10),
            rnorm(500, mean = 550, sd = 60),
            1025, 1200)

box2 = c(2, 3, 5, 6, 7, 7, 8, 9, 10, 10, 11, 11, 11, 11, 
         12, 12, 13, 14, 15, 16, 16, 16, 17, 19, 20, 21, 
         21, 22, 23, 27, 30, 32, 32, 33, 37, 37, 38)

# For class data containing 16 values, same min and max.
box3 = c(4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64)
box4 = c(4,6,8,10,12,14,16,18,22,28,34,40,46,52,58,64)
box5 = c(4,10,16,22,28,34,40,46,50,52,54,56,58,60,62,64)

carSales <- read_excel("DataSets/carSales.xlsx")


Topics


Introduction

Box plots are graphical tools that display the distribution of continuous data based on quartiles (see section: Basic descriptive statistics).



The number of obsevations is equally divided into four equal parts. This means, if the data set contains 100 observations, 25 observations will be located into each quartile, in order of their values. If the data set contains 16 observations, 4 observations will be located into each quartile.

Observe the following three data sets, they have the same number of observations (n=16), the same minimum and maximum values, but different data distribution. Observe the values and positions of quartiles limits. In each case, the data values are divided in 4 quartiles containing the same number of observations (n=4). Quartiles allow to understand the data distribution and behavior.

set1 = c(4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64)
set2 = c(4,6,8,10,12,14,16,18,22,28,34,40,46,52,58,64)
set3 = c(4,10,16,22,28,34,40,46,50,52,54,56,58,60,62,64)

Basic box plot code
Notice the sequence of events.
1. Basic code.
2. Add las=1 to change y-axis labels orientation.
3. Add a color.
4. Change Y-axis limits to improve.

par(mfrow=c(2,2), mai=c(0.2,1,0.2,0.4))
# 1
boxplot(box2,
        main="1")
# 2
boxplot(box2,
        main="2",
        las=1)
# 3
boxplot(box2,
        las=1,
        main="3",
        col = "#FFCCFF")
# 4
boxplot(box2,
        las=1,
        main="4",
        col = "#FFCCFF",
        ylim = c(0,40))

Box plot statistics

The code boxplot.stats(), library(grDevices), will produce the following information:
1. stats: The limits of the four quartiles: min, Q1, Q2, Q3, max.
2. n: The n value.
3. Conf. Confidence intervals.
4. Out: List of any outliers present in the data set.

boxplot.stats(box2)
## $stats
## [1]  2 10 15 22 38
## 
## $n
## [1] 37
## 
## $conf
## [1] 11.883 18.117
## 
## $out
## numeric(0)

Adding data values

The values of stats can be added to the box plot using code text().

  1. Enter y value, which indicate their positions.
  2. Labels indicate the actual values.
  3. For 1 and 2, use the values of boxplot.stats()$stats.
  4. Enter the position on x-axis where labels will be placed. x=1 is the center of the graph.
  5. Use cex=1 to change size of the numbers.
  6. Use col = to add colors to the numbers.

Using the previous box plot from last R chunk.

boxplot(box2,
        las=1,
        col = "#FFCCFF",
        ylim = c(0,40))

text(y = boxplot.stats(box2)$stats,
     labels = boxplot.stats(box2)$stats,
     x=1.23,
     cex = 0.8,
     col = "#A11515")

Outliers

Analyze the codes on the R Chunk below, they are listed using # and following the numbers below.

  1. The box plot on the left display a data set that contains outliers on the top (sample1bp, observe library R Chunk).
  2. For the box plot on the right, notice the use of code the code text() to add the values of these outliers.
    In this case, use boxplot.stats()$out.
  3. Observe how the value for the mean has been added using a pch symbol.
  4. Observe how the value for the mean has been added using a horizontal line.

par(mfrow=c(2,2),mai=c(0.2,0.8,0.2,0.2))

# 1 Left side
boxplot(sample1bp,
        las = 1,
        main = "1",
        col = "#FDEFEF",
        ylim = c(-100,1400),
        yaxp = c(0,1400,7),
        notch=FALSE,
        boxwex = 0.8)

# 2 Right side
boxplot(sample1bp,
        las = 1,
        main = "2",
        col = "#FDEFEF",
        ylim = c(-100,1400),
        yaxp = c(0,1400,7),
        notch=FALSE,
        boxwex = 0.8)

text(x = 1.2,
     y = boxplot.stats(sample1bp)$out,
     labels = round(boxplot.stats(sample1bp)$out,0),
     cex = 0.8,
     col ="#A11515")

# 3 Mean using a horizontal line
boxplot(sample1bp,
        las = 1,
        main = "3",
        col = "#FDEFEF",
        ylim = c(-100,1400),
        yaxp = c(0,1400,7),
        notch=FALSE,
        boxwex = 0.8)

abline(h = mean(sample1bp),
       col = "#7127D1",
       lwd = 1)

#4 Mean using a pch code
boxplot(sample1bp,
        las = 1,
        main = "4",
        col = "#FDEFEF",
        ylim = c(-100,1400),
        yaxp = c(0,1400,7),
        notch=FALSE,
        boxwex = 0.8)

points(mean(sample1bp), 
       pch = 18, 
       col = "#A11515", 
       lwd = 10)

Complete Box plot with all options

Use the non-coding label text inside the R Chunk to understand the use of the codes.
The main code is the boxplot(), the other codes, text(), abline() and mtext() are always placed after the main graph code. Their order is not important as long as they are after the main graph code.

# Change margins for better fitting of the graph
par(mai=c(0.6,1.5,0.4,1.5))

# Main code in this case is boxplot()
boxplot(sample1bp,
        las = 1,
        col = "pink",
        ylim = c(-100,1400),
        yaxp = c(0,1400,7),
        notch=FALSE,
        boxwex = 0.8)

# Add a horizontal line for the mean
abline(h = mean(sample1bp),
       col = "#7127D1",
       lwd = 1)

# Add the text for the mean. over the mean line
text(y=mean(sample1bp)+50,
     x=0.5,
     paste("Mean:", round(mean(sample1bp),1)),
     col = "#7127D1",
     cex = 0.8,
     pos = 4)

# Add text labels for quartiles
text(y=boxplot.stats(sample1bp)$stats,
     labels = round(boxplot.stats(sample1bp)$stats,0),
     x = 1.25,
     cex = 0.8,
     col ="red")

# Add text labels for outliers
text(y=boxplot.stats(sample1bp)$out,
     labels = round(boxplot.stats(sample1bp)$out,0),
     x = 1.1,
     cex = 0.8,
     col ="blue")

# Add a new Y-axis label 
mtext("Values",
     side = 2,
     line = 3,
     cex=1,
     col = "#000000")

Skewness and Kurtosis

To obtain a numerical value indicating skewness and Kurtosis on your data, use the following codes from library(e1071):
skewness()
kurtosis()

Check the codes below:

sample1bpSK = skewness(sample1bp)
sample1bpKR = kurtosis(sample1bp)

Sample sample1bp:
Skewness: 1
Kurtosis: -0.17

Data set carSales

The codes to import the data set is located in the Library Data Rc hunk.

First, check the structure of the data set using either code str() or dplyr::glimpse()
## Rows: 5,844
## Columns: 11
## $ Location     <chr> "Caracas", "Montevideo", "Bogota", "Quito", "PanamaCity",~
## $ Year         <dbl> 2014, 2013, 2015, 2012, 2010, 2014, 2014, 2011, 2013, 201~
## $ FuelType     <chr> "Diesel", "Diesel", "Gasoline", "Gasoline", "Gasoline", "~
## $ Transmission <chr> "Manual", "Manual", "Manual", "Automatic", "Manual", "Man~
## $ Owner        <chr> "First", "First", "First", "First", "First", "First", "Fi~
## $ Efficiency   <dbl> 18, 18, 22, 16, 16, 22, 26, 22, 16, 16, 16, 16, 12, 16, 1~
## $ Engine_cc    <dbl> 1500, 2000, 1500, 2000, 1500, 1500, 1500, 1500, 2000, 150~
## $ Power_bhp    <dbl> 150, 150, 100, 200, 100, 100, 100, 100, 150, 100, 150, 15~
## $ Seats        <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 5, 5, 5, 5, 5, 5, ~
## $ Km           <dbl> 108533, 110315, 105434, 114721, 105233, 102039, 119700, 1~
## $ Price        <dbl> 23029, 21384, 20208, 21179, 19980, 19113, 19443, 19926, 2~

1. Let’s start the analysis of the variable price.
2. Notice in the left box plot that the y-axis values are long numbers, since the values are in the thousands of dollars.
3. On the right, notice that the name of the variable Price has been divided by 1,000, this strategy reduces the size of the values without affecting the data distribution.

par(mfrow=c(1,2), mai=c(0.3,0.8,0.3,0.2))

# 1 Left plot
boxplot(carSales$Price,
        col="#B6AFEE",
        las=1)

# 2 Right plot, price divided by 1,000
boxplot(carSales$Price/1000,
        col="#B6AFEE",
        las=1)
text(x = 1.3,
     y = boxplot.stats(carSales$Price/1000)$stats,
     labels = round(boxplot.stats(carSales$Price/1000)$stats,1),
     cex = 0.7,
     col = "blue")

Using Box plot to combine numerical and categorical variables

The previous box plot show the data distribution of the whole variable Price, this does not provide much information about the variable. A better way to analyze the variable is to ask the relationship with another variable, in this case, the distribution of price per owner.

The two variables are inserted in the boxplot() code, first the numerical variable, then the categorical variable, separated by a wavy dash ~.

par(mai=c(0.6,1.2,0.4,1.5))

ownerOrd = ordered(carSales$Owner,
              levels = c("First",
                           "Second",
                           "Third",
                           "Fourth"))

boxplot(carSales$Price/1000 ~ ownerOrd,
        ylab = "Price (in US$ x 1,000)",
        las = 1,
        col = brewer.pal(4,"Set2"),
        notch=FALSE,
        boxwex = 0.8)

Using dot plots to present quartiles

Plot quartiles as a dot chart and then as a horizontal box plot.

par(mfrow=c(1,2), mai=c(0.8,0.8,0.6,0.2))

dotchart((quantile(carSales$Price/1000)),
         xlab="Price (US$ x 1,000)",
         ylab = "Percentiles")

dotchart((quantile(carSales$Km/1000)),
         xlab="Kilometers (x 1,000)",
         ylab = "")

Repeat using different numerical and categorical variables

Let’s observe efficiency per location.
For boxes width use code: boxwex = number
To add remove or keep frame: frame.plot=TRUE
To change color of axes labels: col.lab = “color”
To change size of title text: cex.main = number
To change size of labels text: cex.lab = number

Exercise: Can you add data labels?

par(las=1)
boxplot(carSales$Efficiency ~ carSales$Location,
        main="Box plot 7. Car efficiency per location",
        col.main = "red",
        cex.main=1, 
        xlab="Location",
        ylab="Efficiency",
        ylim=c(0,40),
        cex.lab=1, 
        col.lab = "blue",
        col = c("#DA7B7B", "#D11515", "#AE33B2", "#9D8FF3", "#3113F3", "#3BBD59"),
        las=1,
        frame.plot=FALSE,
        boxwex = 0.8,
        notch = FALSE,
        outline = TRUE)

Data set used in class example

Create a new vector with the following information.
Since the box plot is horizontal, we need to make a change on the text() code. Compare its x= and y= values with the code on box plots 4 and 6.

box2 = c(2, 3, 5, 6, 7, 7, 8, 9, 10, 10, 11, 11, 11, 11, 
         12, 12, 13, 14, 15, 16, 16, 16, 17, 19, 20, 21, 
         21, 22, 23, 27, 30, 32, 32, 33, 37, 37, 38)

boxplot(box2, horizontal = TRUE,
        main="Box plot 9. Class example",
        xlab="Data",
        ylab="Data values",
        ylim=c(0,50),
        las=1,
        col=c("lightblue"),
        frame.plot=FALSE,
        boxwex=0.6,
        staplewex = 1,
        plot=TRUE)

text(x = boxplot.stats(box2)$stats, 
     labels = boxplot.stats(box2)$stats, 
     y = 1.25)

Add the mean value

Add the mean as a point in the graph.
Use code: points() to indicate the mean.
In text code, observe change on x and y, compare with previous codes. Hint: horizontal versus vertical box plot.

mean_box2 = round(mean(box2), 1)

boxplot(box2, 
        main = "Box plot 10",
        ylim = c(0,40),
        col = "#09F2B4", 
        las =1)
text(y = boxplot.stats(box2)$stats, 
     labels = boxplot.stats(box2)$stats, 
     x = 1.25)
text(y = mean_box2,
     labels = mean_box2,
     x = 1.05,
     col = "red")
points(mean_box2, 
       pch = 18, 
       col = "#F20909", 
       lwd = 7)




Disclaimer: This short series manual project is a work in progress. Until otherwise clearly stated, this material is considered to be draft version.


Dee Chiluiza, PhD
January 2021
Last update: 07 October, 2021
Boston, Massachusetts, USA

Bruno Dog