26-04-2025                       >eR-BioStat


Visualizing Data and Exploratory Data analysis using ggplot2 in R

Ziv Shkedy and Thi Huyen Nguyen

## load/install libraries
.libPaths(c("./Rpackages",.libPaths()))
library(knitr)
library(tidyverse)
library(deSolve)
library(minpack.lm)
library(ggpubr)
library(readxl)
library(gamlss)
library(data.table)
library(grid)
library(png)
library(nlme)
library(gridExtra)
library(mvtnorm)
library(e1071)
library(lattice)
library(ggplot2)
library(dslabs)
library(NHANES)
library(plyr)
library(dplyr)
library(nasaweather)
library(ggplot2)
library(gganimate)
library(av)
library(gifski)
library(foreach)
library("DAAG")
library(DT)
library(TeachingDemos)
library(gridExtra)

1. Introduction

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone - as the first step”

— John W. Tukey (1977)

Location, Spread and Shape in univariate data

In this course, we focus on descriptive measures, numerical and graphical, to characterize and visualize the features of a particular univariate distribution. The following three main concepts are usually used to specify a particular distribution:

  • Location
  • Spread
  • Shape

Each of these control different characteristics of a distribution.

R datasets for illustraions

In order to simplify the usage of slides, the data we used for illustrations are R datasets. We give a short description of each data in the relevant slides. * More details can be found with help(dataset) or in

Basic R functions for illustraions

The following basic graphical functions are used for visualization:

  • plot()
  • lines()
  • hist()
  • boxplot()
  • qqnorm()

The lattice R package

The following lattice graphical functions are used for visualization:

  • dotplot()
  • histogram()
  • bwplot()
  • qqmath()

The ggplot2 R package

The following ggplot2 graphical functions are used for visualization:

  • ggplot()
  • geom_point()
  • theme_bw()
  • geom_smooth()
  • geom_histogram()
  • geom_boxplot()
  • geom=“density”
  • geom_violin()
  • facet_wrap()
  • qplot()
  • stat_summary()
  • geom_density_ridges() + theme_ridges()

Online references

  • Basic Skills in Visualizing Data and Exploratory Data Analysis Using R - An interactive online book for the course: BookVD.
  • Book: R Graphics Cookbook, 2nd edition by Winston Chang: RGraphics.
  • Website: From Data to Viz: Viz.

2. Working with the ggplot2 R package for vizualization

The layers of a ggplot2 figure

A key idea behind ggplot2 is that it allows to easily building up a complex plot layer by layer. Each layer adds an extra level of information to the plot. In that way we can build sophisticated plots tailored to the problem at hand.

The mtcars data

The data gives information about fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

For our example, we focus on mpg, hp and cyl.

The mtcars data

Bivariate data (miles/(US) per gallon, horsepower). We wish to produce a basic scatterplot (see Figure 1): mgp Vs. hp using the R function plot().

plot(mtcars$hp,mtcars$mpg)
Miles/(US) per gallon vs. Horsepower

Figure 1: Miles/(US) per gallon vs. Horsepower

Linear regression

We consider a simple linear regression model of the form \[\mbox{mpg}_{i}=\beta_{0}+\beta_{1} \times \mbox{hp}_{i} + \varepsilon_{i}\]. In R, we can fit the model using the lm() function using the call lm(response ~ predictor).

fit.lm<-lm(mtcars$mpg~mtcars$hp)

The fitted model

summary(fit.lm)
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## mtcars$hp   -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

First layer: basic plot

A basic plot using ggplot2, shown in Figure 2, present a scatterplot of the variables.

library(ggplot2)
gg <- ggplot(mtcars, aes(hp, mpg)) +
        geom_point() +
        labs(x = "Horsepower",y= "Miles Per Gallon")
gg
Miles/(US) per gallon vs. Horsepower

Figure 2: Miles/(US) per gallon vs. Horsepower

What is aes() function?

The aes() function is a generate aesthetic mappings that describe how variables in the data are mapped to visual properties (Aesthetics) of geoms. The aes() is a quoting function. This means that its inputs are quoted to be evaluated in the context of the data. For example, in Figure 2 we use aes(hp, mpg). This means that we specify the variables mpg and hp to be used in the plot. For more details use the ggplot2 Cheat Sheet: ggplot2.

Second layer: add Cylinders information

We wish to add, in Figure 3, information about the cylinders using different colors to the data points (by Cylinders number), we use aes(color=as.factor(cyl)), size=5.

library(ggplot2)
gg <- ggplot(mtcars, aes(hp, mpg)) +
        geom_point(aes(color=as.factor(cyl)), size=5) +
        labs(x = "Horsepower",y= "Miles Per Gallon",color= "# of Cylinders")
gg
Miles/(US) per gallon vs. Horsepower

Figure 3: Miles/(US) per gallon vs. Horsepower

The geom() function

A ggplot2 geom “tells” R how do we want to display the data. For example, geom_point(aes(color=as.factor(cyl)), size=5) produces a scatterplot with points. Note that aes(color=as.factor(cyl)): uses different colors according to the factor (cyl) levels.

Third layer: change the background

In Figure 4, in order to to make the figure clearer, we use theme_bw() to change from the gray background in Figure 3 to the black and white background in Figure 4.

library(ggplot2)
gg <- ggplot(mtcars, aes(hp, mpg)) +
        geom_point(aes(color=as.factor(cyl)), size=5) + labs(x = "Horsepower",y= "Miles Per Gallon",color= "# of Cylinders")+
        theme_bw()
gg
Miles/(US) per gallon vs. Horsepower

Figure 4: Miles/(US) per gallon vs. Horsepower

Fourth layer: add the regression line

To add a regression line we use geom_smooth(method=“lm”, se=FALSE). To produce Figure 5 we use

  • ggplot(data,aes(variables)).
  • geom_point(size).
  • aes(color the data points).
  • geom_smooth(add regression line).
  • labs(add labels).
library(ggplot2)
gg <- ggplot(mtcars, aes(hp, mpg)) +
        geom_point(aes(color=as.factor(cyl)), size=5) + geom_smooth(method="lm", se=FALSE) +
        labs(x = "Horsepower",y= "Miles Per Gallon", color= "# of Cylinders") +  theme_bw()
gg
Miles/(US) per gallon vs. Horsepower

Figure 5: Miles/(US) per gallon vs. Horsepower

Fourth layer: add confidence intervals to the plot

To add confidence internal around the regression line in Figure 6 we use geom_smooth(method=“lm”, se=TRUE).

library(ggplot2)
gg <- ggplot(mtcars, aes(hp, mpg)) + geom_point(aes(color=as.factor(cyl)), size=5) +
        geom_smooth(method="lm", se=TRUE) +
        labs(x = "Horsepower",y= "Miles Per Gallon",color= "# of Cylinders") +theme_bw()
gg
Miles/(US) per gallon vs. Horsepower

Figure 6: Miles/(US) per gallon vs. Horsepower

Fifth layer: add a smoother

A loess smother was added to Figure 7 using method=“loess” in geom_smooth(method=“loess”, se=TRUE).

library(ggplot2)
gg <- ggplot(mtcars, aes(hp, mpg)) +geom_point(aes(color=as.factor(cyl)), size=5) +
        geom_smooth(method="lm", se=TRUE) + geom_smooth(method="loess",colour = "blue", size = 1.5)+
        labs(x = "Horsepower",y= "Miles Per Gallon",color= "# of Cylinders") + theme_bw()
gg
Miles/(US) per gallon vs. Horsepower

Figure 7: Miles/(US) per gallon vs. Horsepower

Visualizing correlation patterns across cylinders

As shown in Figure 8, we can visualize how the dependence between mpg and hp changes with Cylinder numbers.

qplot(hp,mpg,data = mtcars, colour = factor(cyl))+
geom_smooth(method = "lm",se = F)
Miles/(US) per gallon vs. Horsepower

Figure 8: Miles/(US) per gallon vs. Horsepower

3. The 5-number summary and stem-and-leaf

Introduction

In his 1977 book Exploratory Data Analysis, John W. Tukey (page 32) proposed the five number summary - the median, the minimum and maximum and the upper and lower quartiles - as a numerical summary of the data. The extremes (the minimum and maximum) give information about the depth of the data (i.e., spread), the median gives information about the center of the distribution (location) and the lower and upper quartile give information on both spread (at the center of the distribution of the data) and shape. As John W. Tukey pointed out, in a given dataset we want to know

  • Which values seems to be typical (Location) ?
  • How much variation there is in the data (Spread) ?
  • How the distribution of the data looks like (Shape) ?

Turkey’s idea was to use the five number summary to summarize a single distribution and to compare between distributions.

Sorting and ranking the sample

Consider a sample of size 5 from \(N(0,2^{2})\), : \(x_{1},x_{2},x_{3},x_{4},x_{5}\). In R we can use the R function rnorm() to draw the sample in the following way rnorm(sample size, mean, SD) :

x <- rnorm(5, 0, 2)
x
## [1]  3.1675255 -1.8656206  0.3845279 -0.7028048  1.9157203

The Ordered sample is the sample sorted from the lowest to the highest value: \[ x_{(1)},x_{(2)},x_{(3)},x_{(4)},x_{(5)} \]

sort(x)
## [1] -1.8656206 -0.7028048  0.3845279  1.9157203  3.1675255

Numerical summaries for sample distribution

The Median (M)

The median is the center of the sample in terms of counting: \(\frac{1}{2}\) of the observations are smaller or equals to the median and \(\frac{1}{2}\) are greater than the median. In our example,

median(x) #note that the sample size is 5
## [1] 0.3845279

Note that if the sample size is even, for example,

y <- c(10, 2, 50, 6, 100, 25)
y
## [1]  10   2  50   6 100  25

The median is the average of the two central values (10 and 25).

sort(y)
## [1]   2   6  10  25  50 100
median(y) #=(10+25)/2
## [1] 17.5

The extremes

The maximum, \(x_{(n)}\), the largest value in the data,

max(y)
## [1] 100

The minimum, \(x_{(1)}\), the smallest value in the data,

max(y)
## [1] 100

The range of the data, that gives indication about the spread of the data:

range<-max(y)-min(y)
range
## [1] 98

Lower and Upper fourth (the lower and upper quartiles)

The lower fourth (\(Q_{1}\)) is the value that \(\frac{1}{4}\) of the observations are smaller or equals to and \(\frac{3}{4}\) of the observations are greater than \(Q_{1}\). The upper forth (\(Q_{3}\)) is the value that \(\frac{3}{4}\) of the observations are smaller or equals to and $ of the observations are greater than \(Q_{3}\).

The 5-number summary

The idea is to summarize the information in the data with 5 numbers: The median, the fourths and the extremes, \(x_{1},Q_{1},M,Q_{3},x_{n}\). In R, The 5-number summary of a sample \(x=(x_{1},\dots,x_{n})\) can be produced using the function quantile().

Example: the old faithful geyser data

The Old faithful geyser dataset, the R object faithful, gives information about the Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. The duration of the eruption are listed below.

faithful$eruptions 
##   [1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600 1.950 4.350 1.833 3.917
##  [13] 4.200 1.750 4.700 2.167 1.750 4.800 1.600 4.250 1.800 1.750 3.450 3.067
##  [25] 4.533 3.600 1.967 4.083 3.850 4.433 4.300 4.467 3.367 4.033 3.833 2.017
##  [37] 1.867 4.833 1.833 4.783 4.350 1.883 4.567 1.750 4.533 3.317 3.833 2.100
##  [49] 4.633 2.000 4.800 4.716 1.833 4.833 1.733 4.883 3.717 1.667 4.567 4.317
##  [61] 2.233 4.500 1.750 4.800 1.817 4.400 4.167 4.700 2.067 4.700 4.033 1.967
##  [73] 4.500 4.000 1.983 5.067 2.017 4.567 3.883 3.600 4.133 4.333 4.100 2.633
##  [85] 4.067 4.933 3.950 4.517 2.167 4.000 2.200 4.333 1.867 4.817 1.833 4.300
##  [97] 4.667 3.750 1.867 4.900 2.483 4.367 2.100 4.500 4.050 1.867 4.700 1.783
## [109] 4.850 3.683 4.733 2.300 4.900 4.417 1.700 4.633 2.317 4.600 1.817 4.417
## [121] 2.617 4.067 4.250 1.967 4.600 3.767 1.917 4.500 2.267 4.650 1.867 4.167
## [133] 2.800 4.333 1.833 4.383 1.883 4.933 2.033 3.733 4.233 2.233 4.533 4.817
## [145] 4.333 1.983 4.633 2.017 5.100 1.800 5.033 4.000 2.400 4.600 3.567 4.000
## [157] 4.500 4.083 1.800 3.967 2.200 4.150 2.000 3.833 3.500 4.583 2.367 5.000
## [169] 1.933 4.617 1.917 2.083 4.583 3.333 4.167 4.333 4.500 2.417 4.000 4.167
## [181] 1.883 4.583 4.250 3.767 2.033 4.433 4.083 1.833 4.417 2.183 4.800 1.833
## [193] 4.800 4.100 3.966 4.233 3.500 4.366 2.250 4.667 2.100 4.350 4.133 1.867
## [205] 4.600 1.783 4.367 3.850 1.933 4.500 2.383 4.700 1.867 3.833 3.417 4.233
## [217] 2.400 4.800 2.000 4.150 1.867 4.267 1.750 4.483 4.000 4.117 4.083 4.267
## [229] 3.917 4.550 4.083 2.417 4.183 2.217 4.450 1.883 1.850 4.283 3.950 2.333
## [241] 4.150 2.350 4.933 2.900 4.583 3.833 2.083 4.367 2.133 4.350 2.200 4.450
## [253] 3.567 4.500 4.150 3.817 3.917 4.450 2.000 4.283 4.767 4.533 1.850 4.250
## [265] 1.983 2.250 4.750 4.117 2.150 4.417 1.817 4.467

The 5 number summary for the duration of the eruption is

quantile(faithful$eruptions) 
##      0%     25%     50%     75%    100% 
## 1.60000 2.16275 4.00000 4.45425 5.10000

The 5 number summary indicates that in \(50\%\) of the observations the duration is more than 4.00 (location) and that in \(50\%\) of the observations, of the times the eruption is between 2.16 to 4.45 (spread). The range of the data is between 1.6 to 5.1 (spread).

Stem-and-leaf

Stem-and-leaf enables to organize the data graphically in a way that directs our attention to the main pattern in the data. This is a basic and simple but very effective tool to visualize the pattern in the data.

Constructing stem-and-leaf

We split the data values for a stem and leaf,

  • data value \(\rightarrow\) split \(\rightarrow\) stem and leaf
  • 10.5 \(\rightarrow\) 10|5 \(\rightarrow\) 10 and 5.

We use the R function stem() to construct a stem-and-leaf. For the Old faithful geyser data we have

stem(faithful$eruptions) 
## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   16 | 070355555588
##   18 | 000022233333335577777777888822335777888
##   20 | 00002223378800035778
##   22 | 0002335578023578
##   24 | 00228
##   26 | 23
##   28 | 080
##   30 | 7
##   32 | 2337
##   34 | 250077
##   36 | 0000823577
##   38 | 2333335582225577
##   40 | 0000003357788888002233555577778
##   42 | 03335555778800233333555577778
##   44 | 02222335557780000000023333357778888
##   46 | 0000233357700000023578
##   48 | 00000022335800333
##   50 | 0370

Note that \(16|0=1.6\) and \(16|7=1.67\) etc.

Online tutorials

  • For a short online YouTube introduction, by Simple Learning Pro, about the five numbers Boxplots, and Outliers with R see
    YTV2.

  • For a short online YouTube introduction, by Ed Boone, stem-and-leaf using R see YTV3.

The 5-number summary and samples comparison

The five number summery can be used for a comparisons between two samples. The relationship between the quantiles of the two samples can give information about the difference between the two samples in terms of location and spread.

The beaver data

The beaver dataset (the R objects beaver1 and beaver2), published by Reynolds (1994), describes a small part of a study of the long-term temperature dynamics of beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by telemetry every 10 minutes for four females.For illustration we use data obtained for two beavers and focused on the body temperature (the variable temp) in a specific day (346). A part of the data for one beaver is listed below.

head(beaver1[beaver1$day==346& beaver1$activ==0,])
##   day time  temp activ
## 1 346  840 36.33     0
## 2 346  850 36.34     0
## 3 346  900 36.35     0
## 4 346  910 36.42     0
## 5 346  920 36.55     0
## 6 346  930 36.69     0

A comparison between the two beavers: stem-and-leaf

The stem-and-leaf plot and 5 number summary are shown below

temp1<-beaver1[beaver1$day==346& beaver1$activ==0,]$temp
stem(temp1)
## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   363 | 345
##   364 | 2
##   365 | 04559
##   366 | 224577999
##   367 | 145567789
##   368 | 011234555567777889999999
##   369 | 111222334445567788999
##   370 | 0001259
##   371 | 08
##   372 | 00013
quantile(temp1)
##      0%     25%     50%     75%    100% 
## 36.3300 36.7525 36.8800 36.9575 37.2300

For the second beaver we have

temp2<- beaver2[beaver2$day==307 & beaver2$activ==0,]$temp
stem(temp2)
## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   364 | 8
##   366 | 3
##   368 | 90035577899
##   370 | 0114472223445577
##   372 | 34488
##   374 | 411
##   376 | 4
quantile(temp2)
##      0%     25%     50%     75%    100% 
## 36.5800 36.9725 37.0950 37.1700 37.6400

A comparison between the two beavers: the 5 number summary

We can see that the 5 number summary values obtained for the second beaver are higher than those obtained for the first beaver, indicating that the temperature distribution of the second beaver located to the right compare to the temperature distribution of the first beaver. This can be seen in Figure 9 that compares the 5 number summaries of the two beavers and shows that the quantiles obtained for the second beaver are above the \(45^{o}\) line.

plot(quantile(temp1),quantile(temp2))
abline(0,1)
The 5 number summary of two beavers.

Figure 9: The 5 number summary of two beavers.

4. Location

Two normal densities

Location is the center of the distribution. Figure 10 shows presents distributions with different locations, two normal densities with mean equal to 0 (the black line) and mean equal to 2 (the red line).

x<-seq(from=-5,to=5,length=1000)
dx1<-dnorm(x,0,1)
dx2<-dnorm(x,2,1)
plot(x,dx1,type="l")
lines(x,dx2,col=2)
Two normal densities

Figure 10: Two normal densities

Two random samples

Figure 11 shows a histograms for two \(random\) \(samples\) drawn from normal distribution with the same variance but different mean. Both histograms show that the data are symmetric around the sample mean but the histogram of \(x_{2}\) is located to the right relative to the histogram of \(x_{1}\).

x1 <- rnorm(10000, 0, 1)
x2 <- rnorm(10000, 2, 1)
par(mfrow = c(2, 1))
hist(x1, col = 0, nclass = 50, xlim = c(-4, 6))
hist(x2, col = 0, nclass = 50, xlim = c(-4, 6))
Two normal densities

Figure 11: Two normal densities

5. Graphical displays for location

We focus of the following questions:

  • How can we visualize the distribution ?
  • How can we visualize a shift in location (as shown in Figure 11) ?
  • Numerical summaries for location: mean, median, trimmed mean…
x<-seq(from=-5,to=5,length=1000)
dx1<-dnorm(x,0,1)
dx2<-dnorm(x,2,1)
plot(x,dx1,type="l")
lines(x,dx2,col=2)
Two normal densities

Figure 12: Two normal densities

The singer dataset

The singer dataset (the R object singer) is a data frame giving the heights of singers in the New York Choral Society. The variables are named height (inches) and voice.part which is the voice group of the singer

  • Alto.
  • Soprano.
  • Tenor.
  • Bass.

Each voice group is subdivided into two groups, high voice and low voice (for example Bass1 and Bass2 and the lower and higher Bass voices, respectively). Our main focus: the distribution of the singers’ height.

The singer dataset

The first few lines of the data are shown below.

head(singer)
##   height voice.part
## 1     64  Soprano 1
## 2     62  Soprano 1
## 3     66  Soprano 1
## 4     65  Soprano 1
## 5     60  Soprano 1
## 6     61  Soprano 1

Stripplot (using the lattice package)

In a Stripplot we plot the data of each voice group in a different strip. We use the syntax voice.part~height. In Figure 13 we can see a clear main pattern in the data: It is easy to distinguish between women (Sopranos and Altos) and men (Tenors and Basses). Women are clearly shorter than men.

dotplot(singer$voice.part~singer$height, 
        aspect=1,
        xlab="Mean Height (inches)")
Stripplot

Figure 13: Stripplot

Stripplot (using ggplot2): first layer

An equivalent stripplot, shown in Figure 14 and Figure 15, can be produced using the ggplot() package. Note that using the function geom_jitter (in Figure 14) implies that observations with the same values will be plotted side by side and will not overlap so sample size would be seen as well in the plot.

ggplot(singer, 
       aes(voice.part,height)) + 
       geom_point()
Dotplot

Figure 14: Dotplot

ggplot(singer, 
       aes(voice.part,height)) + 
       geom_jitter(position = position_jitter(width = .05))
Dotplot with jitter

Figure 15: Dotplot with jitter

Mean by voice group

In order to get a better insight of other patterns, we can summarize the distribution of each group with the sample mean. In R this can be done using the function tapply(). The means of the singers’ height by the voice group are shown in the panel below.

attach(singer)
tapply(singer$height,singer$voice.part,mean)
##    Bass 2    Bass 1   Tenor 2   Tenor 1    Alto 2    Alto 1 Soprano 2 Soprano 1 
##  71.38462  70.71795  69.90476  68.90476  66.03704  64.88571  63.96667  64.25000

Mean by voice group

The group means point on the pattern that was already detected: on average men are taller than women. In addition, we can see that within each gender group, singers with lower voice are taller than singers with higher voice.For example, the average of the two bass groups (71.38 and 70.71 for Bass 1 and Bass 2 respectively) are higher than the average in the tenor groups (69.90 and 68.90 for Tenor 1 and Tenor 2 respectively).

Mean by voice group

Among women, the sopranos are shorter, on average, than the altos. Within each voice group (all except the sopranos), the singers with lower voices (the second voice group Bass 2, Tenor 2 and Alto 2) are taller than the singers with the higher voices (the first group Bass 1, Tenor 1 and Alto 1). For example the mean of the Bass 2 group (71.38) is higher than the mean of the Bass 1 group (70.71)

Stripplot (ggplot2): second layer

Figure 16 shows the same information as before with the addition of the mean for each voice group. Note that we use the function stat_summary() with the option fun.y = “mean” to calculate the mean.

ggplot(singer, aes(voice.part,height)) +
geom_point() +
stat_summary(geom = "point", fun.y = "mean", colour = "red", size = 4)
Stripplot

Figure 16: Stripplot

Boxplot (lattice)

A boxplot is a graphical display of the location of a distribution. The location of each group is summarized by the median (the dot inside the box). Other aspects of this plot will be discussed in later chapters). Figure 17 was produced using the function bwplot() which is a part of the lattice package.

bwplot(as.factor(singer$voice.part)~ singer$height,
        data=singer,
        aspect=1,
        xlab="Height (inches)")
Boxplot

Figure 17: Boxplot

Multiway histogram (lattice)

The multiway histogram, shown in Figure 17, presents the distributions of heights across the voice groups. Note how the distribution of height is shifted from left to right across the voice levels. The function histogram() which was used to produce Figure 18 is a part of the lattice package.

histogram(~ singer$height | singer$voice.part,
        data=singer, layout = c(2, 4), 
          aspect = 0.5,xlab = "height")
Multiway histogram

Figure 18: Multiway histogram

Multiway histogram (ggplot2)

An equivalent multiway histogram, shown in Figure 19, can be produced with the ggplot2 package. We have additional two layers in the basic plot: in the first layer we specify the plot type geom_histogram() and the second layer indicates the factor for the plot splitting. The function facet_wrap(factor).is used to produce the histograms by voice group.

ggplot(singer, aes(height)) +
geom_histogram() +
facet_wrap(~voice.part,ncol = 2)
Multiway histogram

Figure 19: Multiway histogram

In Figure 20, we add the colors (by group) using the option fill = voice.part in the aes() function.

ggplot(singer, aes(height,fill = voice.part)) +
geom_histogram() +
facet_wrap(~voice.part,ncol = 2)
Multiway histogram

Figure 20: Multiway histogram

6. Spread

Two normal densities

Figure 21 shows two normal densities that have different variability (or spread). * The density with the black line has variance 1 and density with the green line has variance 2.

dx3<-dnorm(x,0,2)
plot(x,dx1,type="l")
lines(x,dx3,col=3)
Two normal densities

Figure 21: Two normal densities

Spread in random samples

Figure 22 shows two samples that were drawn from normal distribution with mean equal to 0 but with different variance. The two distributions have the same shape, both histograms are symmetric around 0 as expected. The spread in the histogram of \(x_{2}\) is much higher than the spread in the histogram of \(x_{1}\).

x1 <- rnorm(10000, 0, 0.75)
x2 <- rnorm(10000, 0, 2)
par(mfrow = c(2, 1))
hist(x1, col = 0, nclass = 25, xlim = c(-4, 6), ylim = c(0, 1000))
hist(x2, col = 0, nclass = 50, xlim = c(-4, 6), ylim = c(0, 1000))
Two normal densities

Figure 22: Two normal densities

Main concepts

Up till now we summarized the distribution of the data with location estimators. In this chapter we focus on the spread. We want to measure how close the data are to each other and how concentrate the data around the center of the distribution. Numerical summaries for spread that often used are:

* The sample variance.
* The fourth-spared.

Both sample variance and forth spread can be explore using a boxplot or a violin plot which are graphical displays for spread.

7. Boxplot: A graphical display for spread and location

A Boxplot is a graphical display which shows the location, the spread and the shape of the distribution. The location is summarized by the median, the spread is summarized by the fourth-spread which is simply the length of the box in the boxplot (see the right panels in Figure 23). Note that the length of the box represents the variability of \(50\%\) of the data in the center of the distribution.

x1<-rnorm(1000,0,1)
par(mfrow=c(2,2))
hist(x1,main="random sample from N(0,1)",xlim=c(-10,10))
boxplot(x1,ylim=c(-10,10))
x2<-rnorm(1000,0,3)
hist(x2,main="random sample from N(0,9)",xlim=c(-10,10))
boxplot(x2,ylim=c(-10,10))
Histograms and boxplots

Figure 23: Histograms and boxplots

The adjacent values in a boxplot

The upper and lower adjacent values in the boxplot are given by

\[\mbox{Upper adjacent value} = \textit{Min} \left \{ max(X),Q_{3}+1.5(Q_{3}-Q_{1}) \right \}\]

\[\mbox{Lower adjacent value} = \textit{max} \left \{ min(X),Q_{1}-1.5(Q_{3}-Q_{1}) \right \}\]

The upper and lower adjacent values are used to identify extreme values. Observations higher than the upper adjacent value or smaller than the lower adjacent value are considered to be outliers.

Example: the airquality data

Daily air quality measurements in New York, May to September 1973. The histogram and boxplot for the Ozone level are shown in Figure 24 and reveal a skewed distribution with few outliers at the upper tail (histogram). In the boxplot these outliers can be identified above the upper adjacent value.

par(mfrow=c(1,2))
airquality1<-na.omit(airquality)
hist(airquality1$Ozone)
boxplot(airquality1$Ozone)
Histogram and Boxplot for the Ozone level in New York

Figure 24: Histogram and Boxplot for the Ozone level in New York

Online tutorials

YouTube tutorial: Boxplot in R

For a short online YouTube tutorials:

  • by Data Science Tutorials, about boxplot using the ggplot2 package see YTVD8.

  • by LawrenceStats, about boxplot using the ggplot2 package see YTVD9

Web tutorial: Advanced boxplots in R

  • Example for advanced boxplots in R using the ggplot2 package and code to produce the plots can be found in the R Graph Gallery website here WAVD2.

Example: boxplot for the singer dataset

Figure 25 shows a boxplot for the singers’ height by voice group that was produced using the function geom_boxplot().

ggplot(singer, aes(voice.part,height)) + geom_boxplot()
Boxplot for the singers' height

Figure 25: Boxplot for the singers’ height

Boxplot for the singer dataset: adding color group

The same boxplot in which colors (by group) are added to the boxplot using the argument fill=voice.part in the aes is shown in Figure 26. Note that the object voice.part is a factor.

ggplot(singer, aes(voice.part,height,fill=voice.part)) + geom_boxplot()
Boxplot for the singers' height

Figure 26: Boxplot for the singers’ height

Boxplot for the singer dataset: adding the data to the figure

In Figure 27, the data are added to the boxplot using the argument geom = c(“boxplot”, “jitter”).

qplot(voice.part, height, data = singer, geom = c("boxplot", "jitter"))
Boxplot for the singers' height

Figure 27: Boxplot for the singers’ height

Violin plot for the singer dataset

When the argument geom_violin() is used instead of geom_boxplot() the boxplot become a violin plot shown in Figure 28.

ggplot(singer, aes(voice.part,height)) + geom_violin()
Violin plot for the singers' height

Figure 28: Violin plot for the singers’ height

A violin plot and a boxplot in the same figure

In Figure 29 we combined the information that was displayed in Figure 27 and Figure 28. This can be done when we use both geom_violin() and <geom_boxplot() in the same plot.

ggplot(singer, aes(x = voice.part, y = height, fill = voice.part)) +
  geom_violin(alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", alpha = 0.5) +
  theme_minimal()
Combined violin and boxplot

Figure 29: Combined violin and boxplot

8. Shape: histograms and density edtimates

The shape of a distribution

Two densities

Figure 30 shows two beta densities having different shapes. Black line: \(Beta(2,2)\), red line: \(Beta(2,4)\).

x<-seq(from=0,to=1,length=1000)
dx1 <- dbeta(x, 2, 2)
dx2 <- dbeta(x, 2, 4)
plot(x,dx1,type="l",ylim=c(0,2.5))
lines(x,dx2,col=2)
Shape

Figure 30: Shape

Four random samples

Four samples (each with 10000 observations) that were drawn from different distributions are shown in Figure 31: \(x_{1}\) and \(x_{2}\): samples were drawn from symmetric distributions. The distributions of \(x_{3}\) is skewed to the left and the distribution of \(x_{4}\) to the right.

x1 <- rbeta(10000, 1, 1)
x2 <- rbeta(10000, 2, 2)
x3 <- rbeta(10000, 8, 3)
x4 <- rbeta(10000, 3, 8)
par(mfrow = c(2, 2))
hist(x1, col = 0, nclass = 50)
hist(x2, col = 0, nclass = 50)
hist(x3, col = 0, nclass = 50)
hist(x4, col = 0, nclass = 50)
Four histograms of random samples

Figure 31: Four histograms of random samples

Density and density estimate

So far we used histogram to visualize the shape of the distribution of the observations in the sample. In this chapter we discuss density estimates as a method to estimate and visualized the distribution in the population.

Density and density estimate

Figure 32 shows a density function of \(N(0,1)\) that represents the distribution of a random variable in the population. Suppose that we draw a random sample of size \(n\) from the population. The histogram in Figure 32 can be used to visualize the shape of the distribution. It is an estimate for the density in the population.

x<-seq(from=-3,to=3,length=1000)
dx1<-dnorm(x,0,1)
par(mfrow=c(1,2))
plot(x,dx1,type="l")
title("a")
x1 <- rnorm(1000, 0, 1)
hist(x1, col = 0, nclass = 50, xlim = c(-3.5,3.5),main=" ")
title("b")
Density and histogram

Figure 32: Density and histogram

Density and density estimate

A second approach to estimate the distribution of the population is to use a smooth version of the histogram, i.e., a density estimate. The density estimate for our example is shown (in red line) in Figure 33.

x<-seq(from=-3,to=3,length=1000)
dx1<-dnorm(x,0,1)
par(mfrow=c(1,2))
#plot(x,dx1,type="l")
#title("a")
#x1 <- rnorm(1000, 0, 1)
#hist(x1, col = 0, nclass = 50, xlim = c(-3.5,3.5),main=" ")
#title("b")
dx2<-density(x1)
hist(x1, col = 0, nclass = 50, xlim = c(-3.5,3.5),main=" ",probability =T)
lines(dx2$x,dx2$y,col=2,lwd=2)
title("c")
plot(x,dx1,type="l",ylim=c(0,0.55))
lines(dx2$x,dx2$y,col=2,lwd=2)
title("d")
Two normal densities

Figure 33: Two normal densities

Online tutorials

YouTube tutorial: Creating density plots and enhancing it with the ggplot2 package

A short online YouTube tutorial by LawrenceStats, about density plot using the ggplot2 package see YTVD10.

Web tutorial: the ridgeline chart

  • A Web tutorial about the ridgeline chart using the ggplot2 and ggridges package is given in the the R Graph Gallery website WAVD4.

Example: the old faithful data

The dataset, the R object faithful, gives information about waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. We focus on two variables:

  • eruptions: numeric, Eruption time in mins.
  • waiting: numeric, Waiting time to next eruption (in mins)

The first few lines on the data are presented in the panel below.

head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

Figure 34 shows a Histogram of eruptions time and reveals a A bi-model. It was produced using the hist() function, a basic graphical function in R.

hist(faithful$eruptions,nclass=20,col=2)
The Old faithful data

Figure 34: The Old faithful data

A boxplot for the old faithful data

Figure 35 is an example for a failure of the boxplot to capture the bi-model feature of the data .

qplot(rep(1,length(eruptions)),eruptions, data=faithful, geom = c("boxplot"))
The Old faithful data

Figure 35: The Old faithful data

Adding the data to the boxplot for the old faithful data

In Figure 36, we add the data to the boxplot, using the option geom = c(“boxplot”, “jitter”) and we can identify the two parts of the eruptions time distribution.

qplot(rep(1,length(eruptions)),eruptions, data=faithful, geom = c("boxplot", "jitter"))
The fathful data

Figure 36: The fathful data

Histogram for the eruptions time using <ggplot2

The histogram presented in Figure 37 is able to capture the shape of the distribution.

qplot(eruptions, data=faithful, geom="histogram")
The fathful data

Figure 37: The fathful data

Density estimate for the eruptions time using <ggplot2

Figure 38 presents a density estimate for the distribution of the eruptions that was produced using the option geom=“density”).

qplot(eruptions, data=faithful, geom="density")
The fathful data

Figure 38: The fathful data

Density estimate and histogram for the eruptions time using <ggplot2

As shown in Figure 38, we can plot both histogram and density estimate in the same plot. Note that Figure 39 was produced using basic graphical functions in R: hist and lines().

hist(faithful$eruptions,nclass=15,probability = TRUE)
dx<-density(faithful$eruptions)
lines(dx$x,dx$y,lwd=2,col=2)
The fathful data

Figure 39: The fathful data

Example: the singer data

Multiway histogram

We use multiway histogram to visualize the shit of the distribution of the singers’ height across the voice part groups. The histograms in Figure 40 ravels the shift within and between the voice groups.

ggplot(singer, aes(height,fill = voice.part)) +
geom_histogram() +
facet_wrap(~voice.part,ncol = 2)
Multiway histogram for the singers' height

Figure 40: Multiway histogram for the singers’ height

Density plots

Using the density plots presented in Figure 41, the difference between the sopranos and altos (women singers, the densities in the left) and the tenors and basses (men singers, densities in the right) is clearly seen. The difference within each group is more difficult to detect.

qplot(height, data=singer, geom="density", xlim = c(50,80),
fill = voice.part, alpha = I(0.2))
Density estimates for the singers' height

Figure 41: Density estimates for the singers’ height

Ridgeline charts for the singers’ height

A ridgeline charts, presented in Figure 42, visualizes the difference between the groups and within each voice group. Note that the R package ggridges should be installed to produce the plot.

library(ggridges)
ggplot(singer, aes(x=height,y=voice.part,fill = voice.part)) +
  geom_density_ridges() +
  theme_ridges() + 
  theme(legend.position = "none")
ridgeline charts

Figure 42: ridgeline charts

9. Shape: the normal probability plot

Normal probability plot

A normal probability plot is a plot in which the quantile of the samples are plotted versus the corresponding quantiles of a standard normal distribution \(N(0,1)\).In this chapter we discuss the normal probability plot as a graphical tools to visualize the shape of a distribution. Using histograms and boxplots we are able to investigate the shape of the distribution focusing on the following issues:

  • How nearly symmetric the distribution of the data is.
  • Whether the distribution of the data is single-peaked, or whether it is multi-peaked.
  • Whether it is skewed.
  • How far we from a Normal distribution ?

Quantile of \(N(\mu,\sigma^{2})\): Definition and a simple example

A qq normal plot is a graphical display to investigate how now nearly is the sample to a normal distribution. Let \(q_{\mu,\sigma}(f)\) be a quantile of \(N(\mu,\sigma^{2})\), it can be expressed as

\[q_{\mu,\sigma}(f)=\mu+\sigma q_{0,1}(f).\]

For example, the $2.5 % $ quantile of the standard normal distribution is -1.96.

qnorm(0.025,0,1)
## [1] -1.959964

For \(N(2,5^{2})\) we have

\[q_{2,5}(2.5 \% )=2+ 5 \times - 1.96=-7.8\]

qnorm(0.025,2,5)
## [1] -7.79982

YouTube tutorial: QQ-plots in RStudio

For a short online YouTube tutorials by UTSSC, about normal probability plot using R studio see YTVD12.

A sample from N(0,1)

We use the R function qqnorm() to produce the normal probability plot shown in Figure 43. Data were sampled from \(N(0,1)\). We expect that all points in the normal probability plot will lay on the on the \(45^o\) line.

x <- rnorm(1000, 0, 1)
par(mfrow = c(2, 2))
hist(x, nclass = 25, col = 0)
boxplot(x, boxcol = 2, medcol = 1)
qqnorm(x)
abline(0, 1)
Normal probability plot for a sample from $N(0,1)$.

Figure 43: Normal probability plot for a sample from \(N(0,1)\).

A Sample from N(2,1)

A sample from \(N(2,1)\), i.e., it represents a shift model with the same variability compared to \(N(0,1)\). In this case we expect that all points in the normal probability plot will lay above and parallel to the \(45^{o}\) lines as shown in Figure 44.

par(mfrow = c(2, 2))
x <- rnorm(1000, 2, 1)
hist(x, nclass = 25, col = 0)
boxplot(x, boxcol = 2, medcol = 1)
qqnorm(x)
abline(0, 1)
Normal probability plot for a sample from $N(2,1)$.

Figure 44: Normal probability plot for a sample from \(N(2,1)\).

A Sample from N(0,2)

The sample was drawn from \(N(0,2^2)\) which implies that the mean is the same as \(N(0,1)\) but the variability is higher. As illustrated in Figure 45, we expect the points in the normal probability plot to form a straight line with higher slope than 1.

par(mfrow = c(2, 2))
x <- rnorm(1000, 0, 2)
hist(x, nclass = 25, col = 0)
boxplot(x, boxcol = 2, medcol = 1)
qqnorm(x)
abline(0, 1)
Normal probability plot for a sample from $N(0,2).

Figure 45: Normal probability plot for a sample from $N(0,2).

A Sample from \(t_{(3)}\)

Figure 46 presents two densities: a \(t_{(3)}\) distribution (red line) has the same mean as \(N(0,1)\) (black line) but longer tails. Note that the two distribution and centered around zero.

par(mfrow = c(1, 1))
qx <- seq(from = -7, to = 7, length = 1000)
xn <- dnorm(qx, mean = 0, sd = 1)
xt <- dt(qx, 3)
plot(qx, xn, xlim = c(-7, 7), type = "l")
lines(qx, xt, col=2)
Two densities: $N(0,1)$ and $t_{(3)}$.

Figure 46: Two densities: \(N(0,1)\) and \(t_{(3)}\).

A Sample from \(t_{(3)}\)

Figure 47 shows the normal probability plot for a random sample from \(t_{(3)}\). we expect that the points will lay on the \(45^o\) line in the center but with more extreme values .

par(mfrow = c(2, 2))
x <- rt(1000, 3)
hist(x, nclass = 25, col = 0)
boxplot(x, boxcol = 2, medcol = 1)
qqnorm(x)
abline(0, 1)
Normal probability plot for a random sample from $t_{(3)}$.

Figure 47: Normal probability plot for a random sample from \(t_{(3)}\).

A Sample from \(U(-3,3)\)

The data of this example are uniformly distributed between the minimum and maximum values. In Figure 48, we expect the points in the normal probability plot to cross the \(45^o\) line and to lay relatively far from the line (to represent the larger variability in the sample compare to \(N(0,1)\)).

x <- runif(1000, -3, 3)
par(mfrow = c(2, 2))
hist(x, nclass = 25, col = 0)
boxplot(x, boxcol = 2, medcol = 1)
qqnorm(x)
abline(0, 1)
Normal probability plot for a sample from $U(-3,3)$.

Figure 48: Normal probability plot for a sample from \(U(-3,3)\).

Example: The mtcars dataset

Normal probability plot for mpg

Figure 49 shows the Normal probability plot for the variable mpg for 32 cars. We use this figure to investigate if the data represent a sample from

  • \(N(0,1)\) ?
  • \(N(\mu,\sigma^{2})\) ?
qqnorm(mtcars$mpg)
Normal probability plot for variable Miles per galon (<tt>mpg</tt>.

Figure 49: Normal probability plot for variable Miles per galon (mpg.

With mean and variance equal to the \(45^o\) line.

mean(mtcars$mpg)
## [1] 20.09062
var(mtcars$mpg)
## [1] 36.3241

we do not expect that the data points will be close to the

Normal probability plot for the Z-score

We define a standardize variable \((X_{i}=mpg_{i})\), \[ Z_{i}=\frac{X_{i}-\bar{X}}{SD_{X}}. \] Figure 50 shows the Normal probability plot for for the Z-score of mpg.

m.mpg<-mean(mtcars$mpg)
sd.mpg<-sqrt(var(mtcars$mpg))
z<-(mtcars$mpg-m.mpg)/sd.mpg
qqnorm(z)
abline(0,1,col=2)
Normal probability plot for the Z-score

Figure 50: Normal probability plot for the Z-score

For comparison Figure 51 shows a Normal probability plot for a random sample \(n=32\) from \(N(0,1)\).

z<-rnorm(32,0,1)
qqnorm(z)
abline(0,1,col=2)
Normal probability plot for a random sample from $N(0,1)$,

Figure 51: Normal probability plot for a random sample from \(N(0,1)\),

Example: The signer dataset

Normal probability plot for the singers’ height

Figure 52 presents a Normal probability plot for the singers’ height. We use the function qqmath() to produce the figure and the option distribution = qnorm was used to specify a Normal distribution as the reference distribution for the plot. The function qqmath() is a part of the lattice package.

qqmath(~ height,
        distribution = qnorm,
        data=singer,
           layout=c(1,1), 
             prepanel = prepanel.qqmathline,
            panel = function(x, ...) {
            panel.grid()
            panel.qqmathline(x, ...)
            panel.qqmath(x, ...)
            },
        aspect=1,
        xlab = "f-value",
        ylab="height")
Normal probability plot for the singers' height

Figure 52: Normal probability plot for the singers’ height

Normal probability plot for the singers’ height by voice group

We can investigate the distribution of the singers’ height across the voice groups (the factor levels) using a Normal probability plot for the height conditioned on the voice group presented in Figure 53. This can be done when we use the code qqmath(~ height | voice.part).

qqmath(~ height | voice.part,
        distribution = qnorm,
        data=singer,
           layout=c(4,2), 
             prepanel = prepanel.qqmathline,
            panel = function(x, ...) {
            panel.grid()
            panel.qqmathline(x, ...)
            panel.qqmath(x, ...)
            },
        aspect=1,
        xlab = "f-value",
        ylab="height")
Normal probability plot of the singers' height by voice group

Figure 53: Normal probability plot of the singers’ height by voice group

10. The quantile plot

Definitions and examples

recall from Section 3 that \(X_{(1)},X_{(2)} \dots, X_{(n)}\) is the sorted sample (i.e., the order statistics) so \(X_{(1)}\) is the minimum and \(X_{(n)}\) is the maximum. The \(f\) quantile of the sample is order statistic \(X_{(i)}\) that a proportion \(f\) of the observations is less than or equal to. For example, the median is the \(0.5\) quantile since \(50\%\) of the observations are less than or equal to the median. The lower fourth, \(Q_{1}\), is the value that \(25\%\) of the observations are less than or equal to. Hence, the lower forth is the 0.25 quantile of the sample.

The f-value

The \(f_{i}\) value is the proportion of observations in the sample that are less than or equal to the \(i\)′th order statistic, \(X_{(i)}\). Hence, for a sample of since \(n\),

\[ f_{i}=\frac{i}{n}. \]

This implies that \(X_{(i)}\) is the \(f_{i}\) quantile of the sample.

Quantile plot

A quantile plot is a figure in which \(X_{(i)}\) is plotted versus \(f_{i}\). Let us focus on the second Bass voice group in the singer dataset. The panel below presents the singers’ height and the corresponding \(f\) value. The minimum is equal to 66 with \(f-value=0.0384= \frac{1}{n}\) and the maximum is equal to 75 with \(f-value=1 =\frac{n}{n}\).

x<-sort(singer$height[singer$voice.part=="Bass 2"])
n<-length(x)
f.value<-c(1:n)/n
cbind(x,f.value)
##        x    f.value
##  [1,] 66 0.03846154
##  [2,] 67 0.07692308
##  [3,] 67 0.11538462
##  [4,] 68 0.15384615
##  [5,] 68 0.19230769
##  [6,] 69 0.23076923
##  [7,] 70 0.26923077
##  [8,] 70 0.30769231
##  [9,] 70 0.34615385
## [10,] 70 0.38461538
## [11,] 71 0.42307692
## [12,] 72 0.46153846
## [13,] 72 0.50000000
## [14,] 72 0.53846154
## [15,] 72 0.57692308
## [16,] 72 0.61538462
## [17,] 72 0.65384615
## [18,] 72 0.69230769
## [19,] 74 0.73076923
## [20,] 74 0.76923077
## [21,] 74 0.80769231
## [22,] 74 0.84615385
## [23,] 75 0.88461538
## [24,] 75 0.92307692
## [25,] 75 0.96153846
## [26,] 75 1.00000000

Figure 54 presents the quantile plot for the Bass2 group. Note that the median (72) is the value that cross the Horizontal line of \(f-value=0.5\).

plot(x,f.value,type="s")
abline(0.5,0,col=2)
Quantile plot for the singer data for the Bass2 voice group.

Figure 54: Quantile plot for the singer data for the Bass2 voice group.

To produce the quantile plot for the second Bass group, shown in Figure 55, we can use the function qqmath() of the R package lattice with the argument distribution = qunif in the following way

qqmath(~ height[voice.part=="Bass 2"] | voice.part[voice.part=="Bass 2"] , aspect = "xy",
       data = singer,layout=c(1,1),distribution = qunif,
       prepanel = prepanel.qqmathline,
       panel = function(x, ...) {
       panel.qqmathline(x, ...)
        panel.qqmath(x, ...)
       })
Quantile plot for the singer data for the Bass2 voice group using the lattice R package.

Figure 55: Quantile plot for the singer data for the Bass2 voice group using the lattice R package.

The argument distribution = qunif implies that quantile for a uniform distribution will be calculated, i.e, \(f_{i}=i/n\). Figure 56 shows the quantile plot for all voice groups.

qqmath(~ height | voice.part, aspect = 0.25, data = singer,layout=c(2,4),
       distribution = qunif,
       prepanel = prepanel.qqmathline,
       panel = function(x, ...) {
       panel.qqmathline(x, ...)
       panel.qqmath(x, ...)
       })
Quantile plot for all voice group in the singer data.

Figure 56: Quantile plot for all voice group in the singer data.

Comparing distributions with QQ plots

Example : The chicks dataset

In this section we will discuss ways to compare two (or more) distributions. We will use the chicks dataset as an example. The data contains data from an experiment that was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens.The data set is available as data frame in R (chickwts) and consists two variables: the weight (weight) of the chicken and the diet type (feed).

In this section, we will compare two diet groups: the Sunflower diet and the Linseed diet. The panel below shows that the mean of the Sunflower (328.91) is higher than the mean of the Linseed diet (218.75).

weight.med <- tapply(chickwts$weight,chickwts$feed,median)  
weight.mean <- tapply(chickwts$weight,chickwts$feed,mean)  
weight.sd <- sqrt(tapply(chickwts$weight,chickwts$feed,var))
chick.n <- tapply(chickwts$weight,chickwts$feed,length)
data.frame(weight.med,weight.mean,weight.sd,chick.n)  
##           weight.med weight.mean weight.sd chick.n
## casein         342.0    323.5833  64.43384      12
## horsebean      151.5    160.2000  38.62584      10
## linseed        221.0    218.7500  52.23570      12
## meatmeal       263.0    276.9091  64.90062      11
## soybean        248.0    246.4286  54.12907      14
## sunflower      328.0    328.9167  48.83638      12

The difference between the two diet groups can be clearly seen from the dotplot for the group medians presented in Figure 57.

dotplot(names(weight.med)~weight.med,cex=1.25,xlab="Median") 
Dotplot for the medians of the chicks' weights.

Figure 57: Dotplot for the medians of the chicks’ weights.

Both the group means and the dotplot rely on the same method to compare the two samples. First, we summarize the distributions with the location estimator (mean or median), and then the comparison between the two diets is based on the summaries. Alternatively, we can use stripplot (shown in Figure 58) which presents both the data and the location summary.

ggplot(chickwts, aes(feed,weight)) +
geom_point()+ 
stat_summary(geom = "point", fun.y = "median", colour = "red", size = 4)
Chicks weights with groups mean.

Figure 58: Chicks weights with groups mean.

The QQ plot

An alternative approach is to compare the quantiles of the two distributions using a quantile-quantile plot (QQ plot).Let \(x\) and and \(y\) be two datasets, \(x=(x_{1},x_{2},\dots,x_{n})\) and \(y=(y_{1},y_{2},\dots,y_{m})\). Let \(x_{(1)},x_{(2)},\dots,x_{(n)}\) and \(y_{(1)},y_{(2)},\dots,y_{(m)}\) be the sorted samples. A QQ plot is a figure in which \(x_{(i)}\) is plotted against \(y_{(i)}\). If \(y_{(i)}=x_{(i)}\) for all \(i\), then we expect the QQ plot to be a straight line and the quantiles \((x_{(i)},y_{(i)})\) to be on the \(45^{o}\) line. If the distribution of the \(y\)’s is a shift of the distribution of the \(x\)’s, that is \[ y_{i}=x_{i}+\Delta, \] then if \(\Delta > 0\), we expect that points in the QQ plot will be a straight line above the \(45^{o}\) line (and the opposite, of course, for \(\Delta <0\)). Let us look at the sorted samples of Sunflower and Linseed diets. We can see from the panel below that, for all \(i\), \(y_{(i)}>x_{(i)}\), all the quantiles of the sunflower sample are higher than the corresponding quantiles of the linseed sample.

cbind(sort(chickwts$weight[chickwts$feed=="sunflower"]),
      sort(chickwts$weight[chickwts$feed=="linseed"]))
##       [,1] [,2]
##  [1,]  226  141
##  [2,]  295  148
##  [3,]  297  169
##  [4,]  318  181
##  [5,]  320  203
##  [6,]  322  213
##  [7,]  334  229
##  [8,]  339  244
##  [9,]  340  257
## [10,]  341  260
## [11,]  392  271
## [12,]  423  309

Figure 59 shows the QQ plot for the two diet groups. The fact that all the points lie above the \(45^{o}\) line and that the plot is almost a straight line and (almost) parallel to the \(45^{o}\) indicates that the Sunflower distribution is a shift of the Linseed distribution. Figure XXXX was produced using the qq() function of the lattice package.

qq(chickwts$feed ~ chickwts$weight,data=chickwts,
        subset=chickwts$feed=="sunflower" | chickwts$feed=="linseed",
        aspect=1, 
        xlab = "weight of the chicken (linseed)",
        ylab = "weight of the chicken (sunflower)")
A qqplot between the sunflower and the linseed diet groups.

Figure 59: A qqplot between the sunflower and the linseed diet groups.

Example : the singer dataset

Figure 60 shows a QQ plot for the Bass 2 and the Tenor 1. It is clear from this plot that the singers in the Bass 2 group are taller than the singers from the Tenor 1 group. Once again the plot is almost a straight line parallel to the 45o line which is indicating that the distribution of the Bass 2 group is a shift of the distribution of the Tenor 1 group. This is the reason why all the points (all except one) are below the straight line.

qq(singer$voice.part ~ singer$height,data=singer,
        subset=voice.part=="Bass 2" | voice.part=="Tenor 1",
        aspect=1, 
        ylab = "Tenor 1 Height (inches)",
        xlab = "Base 2 Height (inches)")
A qqplot between the second bass and first tenor voice groups.

Figure 60: A qqplot between the second bass and first tenor voice groups.

Figure 61 shows a different relationships between the Tenor groups. This figure suggests the following pattern: for small values of heights, the Tenor 2 singers are taller than the Tenor 1 singers . The situation is reversed for large values of height where the Tenor 1 singers are taller than the Tenor 2 singers.

qq(singer$voice.part ~ singer$height,data=singer,
        subset=voice.part=="Tenor 2" | voice.part=="Tenor 1",
        aspect=1, 
        xlab = "Tenor 1 Height (inches)",
        ylab = "Tenor 2 Height (inches)")
A qqplot between the first and second tenor groups.

Figure 61: A qqplot between the first and second tenor groups.

qq(singer$voice.part ~ singer$height,data=singer,
        subset=voice.part=="Alto 2" | voice.part=="Tenor 1",
        aspect=1, 
        xlab = "Tenor 1 Height (inches)",
        ylab = "Alto Height (inches)")
A qqplot between the first and second tenor groups.

Figure 62: A qqplot between the first and second tenor groups.

11. Vizualizing caterogical data

The Boston data

In this part , we focus on the Boston dataset which is a part of the MASS R package. The Boston dataset contains information about various attributes for suburbs in Boston, Massachusetts. To access the data you need to install the package. More information can be found in https://www.statology.org/boston-dataset-r/. We use the code below to access the data.

library(MASS)
data(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Categorical data

The variable crim is a numerical variable that give information about the per capita crime rate by town. We define a new categorical variable crim_cat by Re-codeing the variable crim into three categories:

crim <5: Low. crim 5-15: Medium. crim >15: High.

Boston3=Boston %>% 
  mutate(crim_cat=cut(crim,
             breaks = c(0, 5, 15, Inf),
             labels = c("Low",
                        "Medium",
                        "High")))
table(Boston3$crim_cat)
## 
##    Low Medium   High 
##    400     76     30

ridges density plot for age

The variable age is the proportion of owner-occupied units built prior to 1940. Similar to Section XX, Figure 63 visualizes the distribution of age across the categories of crim_cat.

library(ggridges)
ggplot(Boston3, aes(x=age,y=crim_cat,fill = crim_cat)) +
  geom_density_ridges() +
  theme_ridges() + 
  theme(legend.position = "none")
Density plot for <tt>age</tt> across the crime rate categories.

Figure 63: Density plot for age across the crime rate categories.

Pie chart for categorical data

The distribution of the crime rate is presented in the panel below.

counttab=as.data.frame(table(Boston3$crim_cat))
colnames(counttab)=c("Category", "Freq")
counttab
##   Category Freq
## 1      Low  400
## 2   Medium   76
## 3     High   30

One way to visualize the distribution is to use a pie chart, shown in Figure 64. Note that the function geom_bar() defines the figure’s type while the function coord_polar() is used to specify the pie chart.

plot1=ggplot(counttab, aes(x="", y=Freq, fill=Category)) +
  geom_bar(stat="identity", width=1, color="black") +
  coord_polar("y", start=0)+theme_void()

plot1
Pie chart for the crime rate categories.

Figure 64: Pie chart for the crime rate categories.

Adding numerical information to the plot

The counts of the crime rate categories in Figure 65 can be added to Figure 54 using the function geom_text(aes(label = Freq). The variable Freq is the count.

counttab=as.data.frame(table(Boston3$crim_cat))
colnames(counttab)=c("Category", "Freq")
plot2=ggplot(counttab, aes(x="", y=Freq, fill=Category)) +
  geom_bar(stat="identity", width=1, color="black") +
  coord_polar("y", start=0)+theme_void()+
  geom_text(aes(label = Freq),
            position = position_stack(vjust = 0.5))
plot2
Pie chart for the crime rate categories.

Figure 65: Pie chart for the crime rate categories.

Barplot for categorical data

The pie charts in the previous section are a special case of a barplot. Figure 66 shows barplot for the crime categoris without and with the categories’ counts. Note that Figure 56 presents a figure with two panels. This was produced using the function grid.arrange(figure 1, figure 2, ncol=2) of the package gridExtra.

plot31=ggplot(counttab, aes(x=Category, y=Freq, fill=Category)) + 
  geom_bar(stat = "identity", color="black")+
  theme_void()

plot32=ggplot(counttab, aes(x=Category, y=Freq, fill=Category)) + 
  geom_bar(stat = "identity", color="black")+
  theme_void()+ 
  geom_text(aes(label = Freq),vjust=-1)

library(gridExtra)
grid.arrange(plot31, plot32, ncol=2)
Barplot for the crime rate categories.

Figure 66: Barplot for the crime rate categories.

Visualzing a \(J \times I\) frequency table

The panel below presents the distribution of the crime rate across the categories of the factor chas (the distance to the Charles river).

counttab1=as.data.frame(table(Boston3$crim_cat,Boston3$chas))
colnames(counttab1)=c("Category","Chas1", "Freq")
counttab1
##   Category Chas1 Freq
## 1      Low     0  370
## 2   Medium     0   71
## 3     High     0   30
## 4      Low     1   30
## 5   Medium     1    5
## 6     High     1    0

As shown in Figure 67, we can visualize the distribution of the crime rate cross the chas levels using a two panels barplot. In Figure 57, the two panels were created using the function facet_wrap(~as.factor(Chas1)….).

plot31a=ggplot(counttab1, aes(x=Category, y=Freq, fill=Category)) + 
   geom_bar(stat = "identity", color="black")+
   geom_text(aes(label = Freq),vjust=-1)+
   facet_wrap(~as.factor(Chas1),labeller = as_labeller(c("1" = "Chas=1", "0" = "Chas=0")))+
   theme_void()
plot31a
Barplot for the crime rate categories scross the levels of the variable <tt>chas</tt>.

Figure 67: Barplot for the crime rate categories scross the levels of the variable chas.