The first, and arguably the most crucial, step of any data analysis is data visualization. As the saying goes, a picture is worth a thousand words. A graph can reveal the shape of the data, expose patterns, and identify extreme values, missing values, relationships between variables and much more.
In this tutorial we discuss some of the basic graphs used to display data, and which graph is appropriate for a given type of data. The graphs are made using R functions, and we go through the details of the commands used to create them.
This tutorial explains the commands used to create graphs with the ggplot2 package in R. ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a data set, a set of geoms (visual marks that represent data points), and a coordinate system.
To display data values, map variables in the data set to aesthetic properties of the geom, like size, color, and x and y locations.
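As a minimal sketch of these components, the snippet below uses the mpg data set that ships with ggplot2. The particular choice of variables (displ, hwy, class, cyl) is only for illustration and is not prescribed by the tutorial.

```r
library(ggplot2)

# Data set: mpg (bundled with ggplot2). Geom: points.
# Aesthetic mappings: displ -> x, hwy -> y, class -> colour, cyl -> size.
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class, size = cyl)) +
  geom_point() +
  coord_cartesian()  # the default coordinate system, written out explicitly

p
```

Changing any one component (another geom, another mapping, another coordinate system) gives a different graph from the same data.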
The ggplot() command initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
ggplot() is typically used to construct a plot incrementally, using the + operator to add layers to the existing ggplot object. This is advantageous in that the code is explicit about which layers are added and the order in which they are added. For complex graphics with multiple layers, initialization with ggplot() is recommended.
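A short sketch of incremental construction, again assuming the bundled mpg data set; the object names base, with_points and with_trend are hypothetical and chosen only for this example.

```r
library(ggplot2)

# Step 1: declare the data and the shared aesthetics; no layer is drawn yet.
base <- ggplot(mpg, aes(x = displ, y = hwy))

# Step 2: add a layer of points.
with_points <- base + geom_point()

# Step 3: add a second layer, a straight-line fit drawn on top of the points.
with_trend <- with_points + geom_smooth(method = "lm", se = FALSE)

with_trend
```

Because each step returns a new ggplot object, the intermediate plots can be printed or reused on their own.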
There are three common ways to invoke ggplot():
- ggplot(df, aes(x, y,
- ggplot(df)
- ggplot()
The first method is recommended if all layers use the same data and the same set of aesthetics, although this method can also be used to add a layer using data from another data frame. See the first example below. The second method specifies the default data frame to use for the plot, but no aesthetics are defined up front. This is useful when one data frame is used predominantly as layers are added, but the aesthetics may vary from one layer to another. The third method initializes a skeleton ggplot object which is fleshed out as layers are added. This method is useful when multiple data frames are used to produce different layers, as is often the case in complex graphics.
df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),
                 y = rnorm(30))
# Compute sample mean and standard deviation in each group
library(plyr)
ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
# Declare the data frame and common aesthetics.
# The summary data frame ds is used to plot
# larger red points in a second geom_point() layer.
# If the data = argument is not specified, it uses the
# declared data frame from ggplot(); ditto for the aesthetics.
ggplot(df, aes(x = gp, y = y)) +
  geom_point() +
  geom_point(data = ds, aes(y = mean),
             colour = 'red', size = 3)
# Same plot as above, declaring only the data frame in ggplot().
# Note how the x and y aesthetics must now be declared in
# each geom_point() layer.
ggplot(df) +
  geom_point(aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
             colour = 'red', size = 3)
# Set up a skeleton ggplot object and add layers:
ggplot() +
  geom_point(data = df, aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
             colour = 'red', size = 3) +
  geom_errorbar(data = ds, aes(x = gp, y = mean,
                               ymin = mean - sd, ymax = mean + sd),
                colour = 'red', width = 0.4)
The remaining examples use the mpg dataset, which ships with ggplot2. It contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008; this was used as a proxy for the popularity of the car. It is a data frame with 234 rows and 11 variables.
Let’s see the structure of the above dataset.
library(ggplot2)
str(mpg)
## 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
head(mpg)
## manufacturer model displ year cyl trans drv cty hwy fl class
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
A barplot is created for a single variable when it is categorical (qualitative or discrete) in nature. Let’s create the plot.
# The following command initiates the creation of the plot for the fl variable
# of the mpg dataset with default aesthetics.
b <- ggplot(mpg, aes(fl))
# Now that we have initiated the creation of the plot, the command below
# draws the barplot.
b + geom_bar()
Let us now create a barplot when one variable is discrete and the other is continuous.
c <- ggplot(mpg, aes(class,hwy))
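# With `stat = "identity"`, `geom_bar()` draws a barplot where the height of
# each bar represents values in the data (the hwy values within each class are
# stacked, so each bar shows their sum). The default, `stat = "count"`, counts
# the rows in each category instead (older ggplot2 releases used `stat = "bin"`).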
c + geom_bar(stat="identity")
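Since the stacked bars above show a sum per class, a variant that is often easier to read plots one summary value per class. The sketch below is an illustration rather than part of the original example: it aggregates with base R's aggregate() and draws the result with geom_col(), which uses the values in the data as bar heights. The name avg_hwy is hypothetical.

```r
library(ggplot2)

# One row per class, holding the mean highway mileage for that class.
avg_hwy <- aggregate(hwy ~ class, data = mpg, FUN = mean)

# geom_col() takes bar heights directly from the y aesthetic.
ggplot(avg_hwy, aes(x = class, y = hwy)) +
  geom_col()
```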
We now play around with the color aesthetics of the barplot.
d <- b + geom_bar(aes(fill=fl))
d
# One can do the same manually using the command `scale_fill_manual()`.
library(RColorBrewer)
d + scale_fill_brewer(palette = "Blues")
d + scale_fill_grey(start = 0.2, end = 0.6, na.value = "red")
# The `start` and `end` arguments give the grey levels at the two ends of the
# palette (0 is black, 1 is white); missing values are displayed in red.
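A minimal sketch of the manual route with scale_fill_manual(). The colour choices and the name fuel_colours are assumptions made for this example; any valid R colour names would do. Naming the vector by factor level ties each colour to a specific fuel type regardless of level order.

```r
library(ggplot2)

# One colour per level of fl (c, d, e, p, r); the colours are arbitrary.
fuel_colours <- c(c = "grey40", d = "black", e = "darkgreen",
                  p = "steelblue", r = "orange")

ggplot(mpg, aes(fl, fill = fl)) +
  geom_bar() +
  scale_fill_manual(values = fuel_colours)
```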
Now we turn our attention to the coordinate systems of the created plot. Observe how the chart changes with each of the commands written below.
t <- b + geom_bar()
t + coord_cartesian(xlim = c(0,5))
t + coord_fixed(ratio = 1/5)
t + coord_flip()
t + coord_trans(y = "sqrt")
The position of the geoms in the chart can also be adjusted. Let us see how.
s <- ggplot(mpg, aes(fl, fill = drv))
s + geom_bar(position = "dodge") # Arrange elements side by side
s + geom_bar(position = "fill") # Stacks items on top of one another and normalizes height of the bars
s + geom_bar(position = "stack") # Just stacks the items on top of each other
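To see what the normalized bars of position = "fill" represent, the within-bar shares can be computed directly. This is a sketch using base R's table() and prop.table(); the names counts and shares are hypothetical.

```r
library(ggplot2)  # only needed here for the mpg data set

# Cross-tabulate fuel type (rows) against drive type (columns).
counts <- table(fl = mpg$fl, drv = mpg$drv)

# Divide each row by its total: the share of each drv within one fl bar.
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```

Each row of shares sums to 1, which is why every bar drawn with position = "fill" has the same height.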
The themes of the chart control the overall look of the chart area, such as the background and the grid lines.
t + theme_bw() # White background with grid lines.
t + theme_grey() # This is the default theme
t + theme_classic() # White background without gridlines
t + theme_minimal() # Minimal theme
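A complete theme can be adjusted further by adding a theme() call after it. The sketch below is one possible tweak, not a recommendation; the specific settings (a rotated axis label and hidden minor grid lines) and the name styled are chosen only for illustration.

```r
library(ggplot2)

styled <- ggplot(mpg, aes(fl)) +
  geom_bar() +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # rotate x-axis labels
    panel.grid.minor = element_blank()                  # hide minor grid lines
  )

styled
```

The order matters: theme() is added after theme_minimal() so that its settings are not overwritten by the complete theme.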