Introduction:

For a new data analyst, the amount of information and the number of packages available in R can be a little daunting. Not knowing what to use to visualise data distributions during the early exploratory phases of an analysis, or which package to use to plot a graph, can cause some early headaches.

ggplot2 makes visualising data a little simpler, providing a set of built-in functions to present data distributions in many different ways.

In this vignette I explore using ggplot2 to visualise data distributions with histograms, density curves, facets and box plots. The main focus is to showcase a range of options available within the ggplot2 package for displaying your data as an aid to statistical analysis. Think of it as my take on a mini guide to data distribution visualisation tools.

Goal:

The question we would like to answer is: “How can we use ggplot2 to visualise the distribution of data?”

Creating some sample data:

For this demonstration we create some random data consisting of a grouping variable (a factor with 2 levels) and a numeric rating variable to use in our plots.

set.seed(95) # fix the random seed so the results can be reproduced

# creating a data frame with 2 variables: "type", which alternates between A and B every 50 records
# (the 100-value vector is recycled to fill all 400 rows, so each type holds draws from both normal distributions), and "rating"
data <- data.frame(type = factor(rep(c("A","B"), each = 50)), 
                   rating = c(rnorm(200),rnorm(200, mean=.6))) # rnorm = normal random number generator 

#View top rows
head(data)
##   type      rating
## 1    A -1.02912040
## 2    A -1.61552578
## 3    A -0.02787948
## 4    A -0.32112762
## 5    A  1.88037134
## 6    A  0.69680689
#install.packages("ggplot2")
library(ggplot2)
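
Before plotting, it can help to confirm what the sample data actually looks like. The following is a minimal sketch using only base R; it rebuilds the same data frame and checks the number of rows and how many records fall into each type. The names n_rows and type_counts are just illustrative. Because the 100-value type vector is recycled across the 400 ratings, each type should end up with 200 records.

```r
# Rebuild the sample data so this snippet runs on its own
set.seed(95)
data <- data.frame(type = factor(rep(c("A", "B"), each = 50)),
                   rating = c(rnorm(200), rnorm(200, mean = .6)))

n_rows <- nrow(data)             # total number of records
type_counts <- table(data$type)  # number of records per type

str(data)             # column types: a 2-level factor and a numeric
print(n_rows)
print(type_counts)
summary(data$rating)  # min, quartiles, median, mean and max of rating
```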

Types of distribution graphs:

1. Histogram

A histogram is a plot that uses bars to show the frequency distribution of a set of continuous data, with each bar counting the observations that fall in one of a series of successive intervals (bins).

Generally, a histogram can be used to count and display the distribution of a variable; however, histograms can be misleading because their appearance depends on the number of classes chosen (Everitt & Hothorn 2010).

Using ggplot we can produce a basic histogram of the variable “rating” for our sample data. Note that we can use ‘binwidth’ to set how wide each bin is.

ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5)
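
The dependence on the number of classes mentioned above is easy to see by drawing the same data with two different bin widths. This is only a sketch: the widths .1 and 1 are arbitrary choices, and p_narrow, p_wide, n_narrow and n_wide are illustrative names. layer_data() is used to count how many bins ggplot2 computed for each plot.

```r
library(ggplot2)

# Rebuild the sample data so this snippet runs on its own
set.seed(95)
data <- data.frame(type = factor(rep(c("A", "B"), each = 50)),
                   rating = c(rnorm(200), rnorm(200, mean = .6)))

# Same data, two very different bin widths
p_narrow <- ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .1)  # many thin bars, noisy shape
p_wide <- ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = 1)   # few wide bars, detail is smoothed away

# layer_data() returns one row per computed bin
n_narrow <- nrow(layer_data(p_narrow))
n_wide <- nrow(layer_data(p_wide))
print(c(narrow = n_narrow, wide = n_wide))

p_narrow
p_wide
```

Comparing the two plots side by side is a quick way to judge whether a feature of the histogram is real or just a consequence of the bin width.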

To display the same histogram as an outline we can use:

ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "blue", fill = "white")

2. Density Curve

A density curve is a graph that shows how probability is distributed over the values of a variable. The area under the curve is equal to 100 percent of all probabilities; since probabilities are usually expressed as decimals, we can also say the area is equal to 1.

With ggplot, we can use geom_density to display the same data as a density curve.

ggplot(data, aes(x = rating)) +
  geom_density()
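
The claim that the area under a density curve equals 1 can be checked numerically. The sketch below uses base R's density(), a kernel density estimator of the same kind that geom_density draws (the exact curve ggplot2 plots may differ slightly at the edges), and approximates the area with the trapezoid rule. The name area is illustrative.

```r
# Rebuild the sample data so this snippet runs on its own
set.seed(95)
data <- data.frame(type = factor(rep(c("A", "B"), each = 50)),
                   rating = c(rnorm(200), rnorm(200, mean = .6)))

# Kernel density estimate of rating on a grid of x values
d <- density(data$rating)

# Trapezoid rule: width of each interval times the average height at its ends
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(area)  # should be very close to 1
```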

3. Histogram with density curve overlay

We can combine both the histogram and density curve using ggplot

ggplot(data, aes(x = rating)) +
  geom_histogram(aes(y = after_stat(density)), # the histogram will display "density" on its y-axis (after_stat() replaces the deprecated ..density.. notation)
                 binwidth = .5, colour = "blue", fill = "white") +
  geom_density(alpha = .2, fill="#FF6655") #overlay with a transparent (alpha value) density plot

Want to add the mean?

Additional lines, such as the mean, can be added to existing plots using geom_vline, as in the example below, which re-uses the previous plot.

ggplot(data, aes(x = rating)) +
  geom_histogram(aes(y = after_stat(density)), # the histogram will display "density" on its y-axis
                 binwidth = .5, colour = "blue", fill = "white") +
  geom_density(alpha = .2, fill="#FF6655") +
  geom_vline(aes(xintercept = mean(rating, na.rm = TRUE)),
             colour = "red", linetype ="longdash", linewidth = .8) # linewidth replaces size for lines in ggplot2 3.4.0 and later
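
The same approach extends to other summary values. As a sketch, the code below puts the mean and the median into a small helper data frame and lets geom_vline draw one line per row, with the line type distinguishing the two. The data frame centre and its columns measure and value are hypothetical names introduced only for this example.

```r
library(ggplot2)

# Rebuild the sample data so this snippet runs on its own
set.seed(95)
data <- data.frame(type = factor(rep(c("A", "B"), each = 50)),
                   rating = c(rnorm(200), rnorm(200, mean = .6)))

# Helper data frame (hypothetical names): one row per summary value
centre <- data.frame(measure = c("mean", "median"),
                     value = c(mean(data$rating), median(data$rating)))

p_centre <- ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "blue", fill = "white") +
  # one vertical line per row of centre; linetype is mapped so a legend is drawn
  geom_vline(data = centre, aes(xintercept = value, linetype = measure),
             colour = "red")
p_centre
```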

Visualising multiple groups:

1. Overlapping Histograms:

In our sample data, we can use our factor variable “type” to produce overlaid histograms, one per “type”. The sample code below produces this output.

ggplot(data, aes(x = rating, fill = type)) +
  geom_histogram(binwidth = .5, alpha =.5, position = "identity")

2. Interleaved Histogram:

Interleaved histograms display the groups as regularly alternating (per type) bars placed side by side within each bin instead of overlapping. The code is similar to the previous plot but changes how the bars are positioned within each bin.

ggplot(data, aes(x = rating, fill = type)) +
  geom_histogram(binwidth = .5, alpha =.5, position = "dodge")

3. Density plots

Similarly to the previous density plot, we can use our factor variable to overlay density based on “type” as in the sample code below:

ggplot(data, aes(x = rating, fill = type)) +
  geom_density(alpha = .3) #alpha sets the transparency of the density fill

Adding means to individual “types”

Using the plyr package, we can compute the mean (or other summary values) for each individual type and overlay it on the plot. It is simple to use, as in the sample below:

#using plyr to produce means for each type
library(plyr)
means <- ddply(data, "type", summarise, rating.mean = mean(rating))
means
##   type rating.mean
## 1    A   0.2297555
## 2    B   0.3461643
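
If you would rather not load plyr, the same per-type means can be computed with base R. This is an alternative sketch, not the method used in the rest of this vignette; means_base is an illustrative name, and the second column is renamed so that it matches the rating.mean column produced by ddply.

```r
# Rebuild the sample data so this snippet runs on its own
set.seed(95)
data <- data.frame(type = factor(rep(c("A", "B"), each = 50)),
                   rating = c(rnorm(200), rnorm(200, mean = .6)))

# aggregate() applies FUN to rating within each level of type
means_base <- aggregate(rating ~ type, data = data, FUN = mean)

# Rename the summary column to match the ddply output
names(means_base)[2] <- "rating.mean"
print(means_base)
```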

Using the computed means and the previously used geom_vline, we can overlay the means on our last plot:

ggplot(data, aes(x = rating, fill = type)) +
  geom_density(alpha = .3) + #alpha sets the transparency of the density fill
  geom_vline(data = means, aes(xintercept = rating.mean, colour = type),
             linetype = "longdash", linewidth = 1) # linewidth replaces size for lines in ggplot2 3.4.0 and later

4. Using Facets to split data distribution display:

Within ggplot2, the facet_grid() function splits the data by one or more variables and plots the subsets together. The panels can be laid out in a horizontal or vertical direction.

facet_grid() can also be used with other graphic displays, such as scatterplots.

In the graph below, the histograms have been separated and are displayed per “type” value.

ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "blue", fill = "white") +
  facet_grid(type ~ .)

If you would like to display the panels in a horizontal direction, just change the order of the formula in facet_grid():

ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "blue", fill = "white") +
  facet_grid(. ~ type)
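
Facets also combine with the per-type mean lines shown earlier. In the sketch below, the means are computed with base R's aggregate() (an assumption made to keep the example self-contained; the ddply result would work the same way). Because the helper data frame contains a type column, facet_grid() places each mean line in the panel of its own type. p_facet is an illustrative name.

```r
library(ggplot2)

# Rebuild the sample data so this snippet runs on its own
set.seed(95)
data <- data.frame(type = factor(rep(c("A", "B"), each = 50)),
                   rating = c(rnorm(200), rnorm(200, mean = .6)))

# One mean per type; the column keeps the name "rating"
means <- aggregate(rating ~ type, data = data, FUN = mean)

p_facet <- ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "blue", fill = "white") +
  # the type column in means decides which panel each line is drawn in
  geom_vline(data = means, aes(xintercept = rating),
             colour = "red", linetype = "longdash") +
  facet_grid(type ~ .)
p_facet
```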

If you would like to explore facet_grid() further, visit: http://www.cookbook-r.com/Graphs/Facets_(ggplot2)

5. Box plots

A box plot is a type of graph used to visualise patterns of quantitative data. It splits the dataset into quartiles. The body of the box plot consists of a “box” which goes from the first quartile (Q1) to the third quartile (Q3). Within the box, a line is drawn at Q2, the median of the dataset.

ggplot(data, aes(x=type, y=rating, fill=type)) + #fill allows for colouring the box plots by type
  geom_boxplot() +
  guides(fill = "none") #this line removes the boxplot legend (redundant in this graph); "none" replaces the deprecated FALSE
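
To see the numbers behind the box, the quartiles described above can be computed directly with base R. This is a minimal sketch; quartiles and iqr are illustrative names, and quantile() is used with its default settings, so the values should correspond closely to the edges and middle line of each box.

```r
# Rebuild the sample data so this snippet runs on its own
set.seed(95)
data <- data.frame(type = factor(rep(c("A", "B"), each = 50)),
                   rating = c(rnorm(200), rnorm(200, mean = .6)))

# Q1, Q2 (median) and Q3 of rating for each type; rows are quartiles, columns are types
quartiles <- sapply(split(data$rating, data$type),
                    quantile, probs = c(.25, .5, .75))
print(quartiles)

# Height of each box: the interquartile range Q3 - Q1
iqr <- quartiles["75%", ] - quartiles["25%", ]
print(iqr)
```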

The axes of the box plot can be flipped by using coord_flip():

ggplot(data, aes(x=type, y=rating, fill=type)) + #fill allows for colouring the box plots by type
  geom_boxplot() +
  guides(fill = "none") + #this line removes the boxplot legend (redundant in this graph)
  coord_flip()

You can also add the mean to the box plot by using stat_summary:

ggplot(data, aes(x=type, y=rating, fill=type)) + #fill allows for colouring the box plots by type
  geom_boxplot() +
  guides(fill = "none") +
  stat_summary(fun = mean, geom = "point", shape = 5, size = 4) #fun replaces the deprecated fun.y argument

I hope this helps a little on your data exploration journey.

References:

Boxplot: Definition n.d., viewed 29 March 2018, http://stattrek.com/statistics/dictionary.aspx?definition=boxplot.

Cookbook for R n.d., viewed 27 March 2018, http://www.cookbook-r.com/.

Everitt, B. & Hothorn, T. 2010, A handbook of statistical analyses using R, 2nd ed., CRC Press, Boca Raton.

Grolemund, G. & Wickham, H. n.d., R for Data Science, viewed 28 March 2018, http://r4ds.had.co.nz/.

Histogram Definition & Example | InvestingAnswers n.d., viewed 28 March 2018, http://www.investinganswers.com/financial-dictionary/investing/histogram-5986.

Histograms - Understanding the properties of histograms, what they show, and when and how to use them | Laerd Statistics n.d., viewed 28 March 2018, https://statistics.laerd.com/statistical-guides/understanding-histograms.php.

How to use R to display distributions of data and statistics n.d., viewed 29 March 2018, http://influentialpoints.com/Critiques/displaying_distributions_using_R.htm.

Stephanie n.d., ‘Density Curve Examples’, Statistics How To, viewed 28 March 2018, http://www.statisticshowto.com/density-curve-examples/.