|
Dee Chiluiza, PhD Northeastern University Boston, Massachusetts |
| Short manual series: Box plots |
# {r library_data, message=FALSE, warning=FALSE}
# Libraries
library(dplyr)
library(readxl)
library(RColorBrewer) # Colors, check display.brewer.all()
library(e1071) # For skewness and kurtosis
# Data sets
sample1bp = c(rnorm(1000, mean = 120, sd = 25),
rnorm(500, mean = 250, sd = 10),
rnorm(500, mean = 550, sd = 60),
1025, 1200)
box2 = c(2, 3, 5, 6, 7, 7, 8, 9, 10, 10, 11, 11, 11, 11,
12, 12, 13, 14, 15, 16, 16, 16, 17, 19, 20, 21,
21, 22, 23, 27, 30, 32, 32, 33, 37, 37, 38)
# For class data containing 16 values, same min and max.
box3 = c(4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64)
box4 = c(4,6,8,10,12,14,16,18,22,28,34,40,46,52,58,64)
box5 = c(4,10,16,22,28,34,40,46,50,52,54,56,58,60,62,64)
carSales <- read_excel("DataSets/carSales.xlsx")
Topics
Introduction
The number of obsevations is equally divided into four equal parts. This means, if the data set contains 100 observations, 25 observations will be located into each quartile, in order of their values. If the data set contains 16 observations, 4 observations will be located into each quartile.
Observe the following three data sets, they have the same number of observations (n=16), the same minimum and maximum values, but different data distribution. Observe the values and positions of quartiles limits. In each case, the data values are divided in 4 quartiles containing the same number of observations (n=4). Quartiles allow to understand the data distribution and behavior.
set1 = c(4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64)
set2 = c(4,6,8,10,12,14,16,18,22,28,34,40,46,52,58,64)
set3 = c(4,10,16,22,28,34,40,46,50,52,54,56,58,60,62,64)
Basic box plot code
Notice the sequence of events.
1. Basic code.
2. Add las=1 to change y-axis labels orientation.
3. Add a color.
4. Change Y-axis limits to improve.
par(mfrow=c(2,2), mai=c(0.2,1,0.2,0.4))
# 1
boxplot(box2,
main="1")
# 2
boxplot(box2,
main="2",
las=1)
# 3
boxplot(box2,
las=1,
main="3",
col = "#FFCCFF")
# 4
boxplot(box2,
las=1,
main="4",
col = "#FFCCFF",
ylim = c(0,40))Box plot statistics
The code boxplot.stats(), library(grDevices), will produce the following information:
1. stats: The limits of the four quartiles: min, Q1, Q2, Q3, max.
2. n: The n value.
3. Conf. Confidence intervals.
4. Out: List of any outliers present in the data set.
## $stats
## [1] 2 10 15 22 38
##
## $n
## [1] 37
##
## $conf
## [1] 11.883 18.117
##
## $out
## numeric(0)
Adding data values
The values of stats can be added to the box plot using code text().
Using the previous box plot from last R chunk.
boxplot(box2,
las=1,
col = "#FFCCFF",
ylim = c(0,40))
text(y = boxplot.stats(box2)$stats,
labels = boxplot.stats(box2)$stats,
x=1.23,
cex = 0.8,
col = "#A11515")Outliers
Analyze the codes on the R Chunk below, they are listed using # and following the numbers below.
par(mfrow=c(2,2),mai=c(0.2,0.8,0.2,0.2))
# 1 Left side
boxplot(sample1bp,
las = 1,
main = "1",
col = "#FDEFEF",
ylim = c(-100,1400),
yaxp = c(0,1400,7),
notch=FALSE,
boxwex = 0.8)
# 2 Right side
boxplot(sample1bp,
las = 1,
main = "2",
col = "#FDEFEF",
ylim = c(-100,1400),
yaxp = c(0,1400,7),
notch=FALSE,
boxwex = 0.8)
text(x = 1.2,
y = boxplot.stats(sample1bp)$out,
labels = round(boxplot.stats(sample1bp)$out,0),
cex = 0.8,
col ="#A11515")
# 3 Mean using a horizontal line
boxplot(sample1bp,
las = 1,
main = "3",
col = "#FDEFEF",
ylim = c(-100,1400),
yaxp = c(0,1400,7),
notch=FALSE,
boxwex = 0.8)
abline(h = mean(sample1bp),
col = "#7127D1",
lwd = 1)
#4 Mean using a pch code
boxplot(sample1bp,
las = 1,
main = "4",
col = "#FDEFEF",
ylim = c(-100,1400),
yaxp = c(0,1400,7),
notch=FALSE,
boxwex = 0.8)
points(mean(sample1bp),
pch = 18,
col = "#A11515",
lwd = 10)Complete Box plot with all options
Use the non-coding label text inside the R Chunk to understand the use of the codes.
The main code is the boxplot(), the other codes, text(), abline() and mtext() are always placed after the main graph code. Their order is not important as long as they are after the main graph code.
# Change margins for better fitting of the graph
par(mai=c(0.6,1.5,0.4,1.5))
# Main code in this case is boxplot()
boxplot(sample1bp,
las = 1,
col = "pink",
ylim = c(-100,1400),
yaxp = c(0,1400,7),
notch=FALSE,
boxwex = 0.8)
# Add a horizontal line for the mean
abline(h = mean(sample1bp),
col = "#7127D1",
lwd = 1)
# Add the text for the mean. over the mean line
text(y=mean(sample1bp)+50,
x=0.5,
paste("Mean:", round(mean(sample1bp),1)),
col = "#7127D1",
cex = 0.8,
pos = 4)
# Add text labels for quartiles
text(y=boxplot.stats(sample1bp)$stats,
labels = round(boxplot.stats(sample1bp)$stats,0),
x = 1.25,
cex = 0.8,
col ="red")
# Add text labels for outliers
text(y=boxplot.stats(sample1bp)$out,
labels = round(boxplot.stats(sample1bp)$out,0),
x = 1.1,
cex = 0.8,
col ="blue")
# Add a new Y-axis label
mtext("Values",
side = 2,
line = 3,
cex=1,
col = "#000000")Skewness and Kurtosis
To obtain a numerical value indicating skewness and Kurtosis on your data, use the following codes from library(e1071):
skewness()
kurtosis()
Check the codes below:
Sample sample1bp:
Skewness: 1
Kurtosis: -0.17
Data set carSales
The codes to import the data set is located in the Library Data Rc hunk.
First, check the structure of the data set using either code str() or dplyr::glimpse()## Rows: 5,844
## Columns: 11
## $ Location <chr> "Caracas", "Montevideo", "Bogota", "Quito", "PanamaCity",~
## $ Year <dbl> 2014, 2013, 2015, 2012, 2010, 2014, 2014, 2011, 2013, 201~
## $ FuelType <chr> "Diesel", "Diesel", "Gasoline", "Gasoline", "Gasoline", "~
## $ Transmission <chr> "Manual", "Manual", "Manual", "Automatic", "Manual", "Man~
## $ Owner <chr> "First", "First", "First", "First", "First", "First", "Fi~
## $ Efficiency <dbl> 18, 18, 22, 16, 16, 22, 26, 22, 16, 16, 16, 16, 12, 16, 1~
## $ Engine_cc <dbl> 1500, 2000, 1500, 2000, 1500, 1500, 1500, 1500, 2000, 150~
## $ Power_bhp <dbl> 150, 150, 100, 200, 100, 100, 100, 100, 150, 100, 150, 15~
## $ Seats <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 5, 5, 5, 5, 5, 5, ~
## $ Km <dbl> 108533, 110315, 105434, 114721, 105233, 102039, 119700, 1~
## $ Price <dbl> 23029, 21384, 20208, 21179, 19980, 19113, 19443, 19926, 2~
1. Let’s start the analysis of the variable price.
2. Notice in the left box plot that the y-axis values are long numbers, since the values are in the thousands of dollars.
3. On the right, notice that the name of the variable Price has been divided by 1,000, this strategy reduces the size of the values without affecting the data distribution.
par(mfrow=c(1,2), mai=c(0.3,0.8,0.3,0.2))
# 1 Left plot
boxplot(carSales$Price,
col="#B6AFEE",
las=1)
# 2 Right plot, price divided by 1,000
boxplot(carSales$Price/1000,
col="#B6AFEE",
las=1)
text(x = 1.3,
y = boxplot.stats(carSales$Price/1000)$stats,
labels = round(boxplot.stats(carSales$Price/1000)$stats,1),
cex = 0.7,
col = "blue")
Using Box plot to combine numerical and categorical variables
The previous box plot show the data distribution of the whole variable Price, this does not provide much information about the variable. A better way to analyze the variable is to ask the relationship with another variable, in this case, the distribution of price per owner.
The two variables are inserted in the boxplot() code, first the numerical variable, then the categorical variable, separated by a wavy dash ~.
par(mai=c(0.6,1.2,0.4,1.5))
ownerOrd = ordered(carSales$Owner,
levels = c("First",
"Second",
"Third",
"Fourth"))
boxplot(carSales$Price/1000 ~ ownerOrd,
ylab = "Price (in US$ x 1,000)",
las = 1,
col = brewer.pal(4,"Set2"),
notch=FALSE,
boxwex = 0.8)Using dot plots to present quartiles
Plot quartiles as a dot chart and then as a horizontal box plot.
par(mfrow=c(1,2), mai=c(0.8,0.8,0.6,0.2))
dotchart((quantile(carSales$Price/1000)),
xlab="Price (US$ x 1,000)",
ylab = "Percentiles")
dotchart((quantile(carSales$Km/1000)),
xlab="Kilometers (x 1,000)",
ylab = "")Repeat using different numerical and categorical variables
Let’s observe efficiency per location.
For boxes width use code: boxwex = number
To add remove or keep frame: frame.plot=TRUE
To change color of axes labels: col.lab = “color”
To change size of title text: cex.main = number
To change size of labels text: cex.lab = number
Exercise: Can you add data labels?
par(las=1)
boxplot(carSales$Efficiency ~ carSales$Location,
main="Box plot 7. Car efficiency per location",
col.main = "red",
cex.main=1,
xlab="Location",
ylab="Efficiency",
ylim=c(0,40),
cex.lab=1,
col.lab = "blue",
col = c("#DA7B7B", "#D11515", "#AE33B2", "#9D8FF3", "#3113F3", "#3BBD59"),
las=1,
frame.plot=FALSE,
boxwex = 0.8,
notch = FALSE,
outline = TRUE)Data set used in class example
Create a new vector with the following information.
Since the box plot is horizontal, we need to make a change on the text() code. Compare its x= and y= values with the code on box plots 4 and 6.
box2 = c(2, 3, 5, 6, 7, 7, 8, 9, 10, 10, 11, 11, 11, 11,
12, 12, 13, 14, 15, 16, 16, 16, 17, 19, 20, 21,
21, 22, 23, 27, 30, 32, 32, 33, 37, 37, 38)
boxplot(box2, horizontal = TRUE,
main="Box plot 9. Class example",
xlab="Data",
ylab="Data values",
ylim=c(0,50),
las=1,
col=c("lightblue"),
frame.plot=FALSE,
boxwex=0.6,
staplewex = 1,
plot=TRUE)
text(x = boxplot.stats(box2)$stats,
labels = boxplot.stats(box2)$stats,
y = 1.25)Add the mean value
Add the mean as a point in the graph.
Use code: points() to indicate the mean.
In text code, observe change on x and y, compare with previous codes. Hint: horizontal versus vertical box plot.
mean_box2 = round(mean(box2), 1)
boxplot(box2,
main = "Box plot 10",
ylim = c(0,40),
col = "#09F2B4",
las =1)
text(y = boxplot.stats(box2)$stats,
labels = boxplot.stats(box2)$stats,
x = 1.25)
text(y = mean_box2,
labels = mean_box2,
x = 1.05,
col = "red")
points(mean_box2,
pch = 18,
col = "#F20909",
lwd = 7)
Disclaimer: This short series manual project is a work in progress. Until otherwise clearly stated, this material is considered to be draft version.