Bar Plot

The first, and without doubt the most crucial, step of any data analysis is data visualization. As the saying goes, a picture is worth a thousand words. A graph can reveal the shape of the data, expose patterns, and identify extreme values, missing values, relationships between variables, and much more.

In this tutorial we will discuss some of the basic graphs useful for displaying data. We will also learn which graph is appropriate for a given type of data. The graphs will be made using R functions, and we will look at the intricacies of the commands used to create them.

This tutorial will explain the commands used to create graphs with the ggplot2 package in R.

ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a data set, a set of geoms (visual marks that represent data points), and a coordinate system.

To display data values, map variables in the data set to aesthetic properties of the geom, such as size, color, and x and y locations.
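
As a small sketch of this mapping, the snippet below uses the mpg data set that ships with ggplot2 (it is described later in this tutorial). The choice of displ, hwy and class as the mapped variables, and the name p_aes, are assumptions made only for illustration; any columns of a suitable type would do.

```r
library(ggplot2)

# Map three variables of mpg to three aesthetic properties of the point geom:
# displ -> x location, hwy -> y location, class -> colour.
p_aes <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

# The mapping is stored in the plot object and can be inspected.
names(p_aes$mapping)

# Printing the object draws the plot.
p_aes
```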

The ggplot() command initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

ggplot() is typically used to construct a plot incrementally, using the + operator to add layers to the existing ggplot object. This is advantageous in that the code is explicit about which layers are added and the order in which they are added. For complex graphics with multiple layers, initialization with ggplot is recommended.
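
The incremental style can be sketched as follows. The variables displ and hwy from the mpg data set, the straight-line smoother, and the object name p are illustrative assumptions, not part of the original example.

```r
library(ggplot2)

# Start with an object that holds only the data and the common aesthetics.
p <- ggplot(mpg, aes(x = displ, y = hwy))

# Add layers one at a time; each `+` returns a new ggplot object.
p <- p + geom_point()
p <- p + geom_smooth(method = "lm", formula = y ~ x)

# The layers are kept in the order in which they were added.
length(p$layers)

# Printing the object draws the points first and the fitted line on top.
p
```

Because every step yields a complete plot object, an intermediate version can be printed at any point to check what a single layer contributes.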

There are three common ways to invoke ggplot:
- ggplot(df, aes(x, y, other aesthetics))
- ggplot(df)
- ggplot()

The first method is recommended if all layers use the same data and the same set of aesthetics, although this method can also be used to add a layer using data from another data frame. See the first example below. The second method specifies the default data frame to use for the plot, but no aesthetics are defined up front. This is useful when one data frame is used predominantly as layers are added, but the aesthetics may vary from one layer to another. The third method initializes a skeleton ggplot object which is fleshed out as layers are added. This method is useful when multiple data frames are used to produce different layers, as is often the case in complex graphics.

df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),
                 y = rnorm(30))
# Compute sample mean and standard deviation in each group
library(plyr)
library(ggplot2)  # provides ggplot() and the geoms used below
ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))

# Declare the data frame and common aesthetics.
# The summary data frame ds is used to plot
# larger red points in a second geom_point() layer.
# If the data = argument is not specified, it uses the
# declared data frame from ggplot(); ditto for the aesthetics.
ggplot(df, aes(x = gp, y = y)) +
   geom_point() +
   geom_point(data = ds, aes(y = mean),
              colour = 'red', size = 3)

# Same plot as above, declaring only the data frame in ggplot().
# Note how the x and y aesthetics must now be declared in
# each geom_point() layer.
ggplot(df) +
  geom_point(aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
             colour = 'red', size = 3)

# Set up a skeleton ggplot object and add layers.
# geom_errorbar() needs only x, ymin and ymax, so y is not mapped in that layer.
ggplot() +
  geom_point(data = df, aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
             colour = 'red', size = 3) +
  geom_errorbar(data = ds, aes(x = gp,
                               ymin = mean - sd, ymax = mean + sd),
                colour = 'red', width = 0.4)

The Dataset - Fuel economy data from 1999 and 2008 for 38 popular models of car

This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008; this was used as a proxy for the popularity of the car. It is a data frame with 234 rows and 11 variables.

Details

manufacturer. manufacturer name
model. model name
displ. engine displacement, in litres
year. year of manufacture
cyl. number of cylinders
trans. type of transmission
drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd
cty. city miles per gallon
hwy. highway miles per gallon
fl. fuel type
class. type of car

Let’s see the structure of the above dataset.

library(ggplot2)
str(mpg)

## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...

head(mpg)

##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact

A barplot of a single variable is appropriate when that variable is categorical (qualitative, discrete) in nature; the height of each bar is the number of observations in that category. Let’s create the plot.

# The following command initiates the creation of the plot for the fl variable
# of the mpg dataset with default aesthetics.
b <- ggplot(mpg, aes(fl))

# Now that we have initiated the creation of the plot, the command below makes
# the barplot.
b + geom_bar()
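
To make explicit what the bars show, the sketch below compares a hand count of the fl categories with the values ggplot2 computes for the same plot. The names fl_counts and bar_data are hypothetical, and layer_data() is used here only as a convenient way to look at the computed values.

```r
library(ggplot2)

# Count the observations in each fuel-type category by hand.
fl_counts <- table(mpg$fl)
fl_counts

# Extract the data that geom_bar() computed for the same plot.
bar_data <- layer_data(ggplot(mpg, aes(fl)) + geom_bar())
bar_data$count

# Check that the bar heights are exactly the category counts.
all(bar_data$count == as.vector(fl_counts))
```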

Let us now create a barplot when one variable is discrete and the other is continuous.

c <- ggplot(mpg, aes(class,hwy))

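# stat = "identity" makes the height of each bar represent values in the data
# (rows that share a class are stacked, so each bar is the sum of hwy for that
# class). The default stat counts rows instead: it was "bin" in ggplot2 releases
# before 2.0.0 and is "count" from 2.0.0 onwards.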
c + geom_bar(stat="identity")
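
Because mpg has many rows per class, the bars in the plot above are stacked sums of hwy. If one bar per class showing a single summary value is wanted, one possible approach is to summarise the data first. The mean as the summary and the name class_means are assumptions made for illustration.

```r
library(ggplot2)

# One row per class: the mean highway mileage (base R, no extra packages).
class_means <- aggregate(hwy ~ class, data = mpg, FUN = mean)
class_means

# With one value per category, stat = "identity" draws that value directly.
ggplot(class_means, aes(class, hwy)) +
  geom_bar(stat = "identity")
```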

We now play around with the color aesthetics of the barplot.

d <- b + geom_bar(aes(fill=fl))
d

# One can do the same manually using the command `scale_fill_manual()`.

library(RColorBrewer)
d + scale_fill_brewer(palette = "Blues")

d + scale_fill_grey(start = 0.2, end = 0.6, na.value = "red")

# The `start` and `end` arguments give the grey values at the two ends of the
# palette (0 is black, 1 is white), and missing values are displayed in red.
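
A minimal sketch of the manual approach with scale_fill_manual(), assuming the five fuel-type codes in mpg are c, d, e, p and r. The colour values and the name fuel_colours are arbitrary choices for illustration.

```r
library(ggplot2)

# A named vector ties each category to a colour, independent of level order.
fuel_colours <- c(c = "#1b9e77", d = "#d95f02", e = "#7570b3",
                  p = "#e7298a", r = "#66a61e")

# Same filled barplot as before, with the hand-picked palette.
ggplot(mpg, aes(fl, fill = fl)) +
  geom_bar() +
  scale_fill_manual(values = fuel_colours)
```

Naming the colours is a deliberate choice here: an unnamed vector is matched to categories by position, which silently changes if the set of categories changes.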

Now we turn our attention to the coordinate system of the plot. Observe how the chart changes with each of the commands below.

t <- b + geom_bar()
t + coord_cartesian(xlim = c(0,5))

t + coord_fixed(ratio = 1/5)

t + coord_flip()

# Older ggplot2 releases spelled this argument `ytrans`; newer ones use `y`.
t + coord_trans(y = "sqrt")

The position of the geoms in the chart can also be adjusted. Let us see how.

s <- ggplot(mpg, aes(fl, fill = drv))
s + geom_bar(position = "dodge") # Arrange elements side by side

s + geom_bar(position = "fill") # Stacks items on top of one another and normalizes height of the bars

s + geom_bar(position = "stack") # Just stacks the items on top of each other
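
The normalization done by position = "fill" can be reproduced by hand, which may help when reading the chart. This is a sketch only; the names fl_by_drv and shares are hypothetical.

```r
library(ggplot2)

# Cross-tabulate fuel type (rows) against drive train (columns).
fl_by_drv <- table(mpg$fl, mpg$drv)

# Share of each drive train within each fuel type; every row sums to 1.
shares <- prop.table(fl_by_drv, margin = 1)
round(shares, 2)

# position = "fill" draws these shares as the segment heights.
ggplot(mpg, aes(fl, fill = drv)) +
  geom_bar(position = "fill")
```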

Themes control the non-data appearance of the chart, such as the background, the grid lines and the fonts.

t + theme_bw() # White background with grid lines.

t + theme_grey() # This is the default theme

t + theme_classic() # White background without gridlines

t + theme_minimal() # Minimal theme
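
A complete theme can also be combined with individual adjustments. The sketch below is one possible combination; the base font size of 14, the legend position and the name p_theme are arbitrary assumptions chosen for illustration.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(fl, fill = drv)) +
  geom_bar() +
  theme_minimal(base_size = 14) +     # complete theme with a larger base font
  theme(legend.position = "bottom")   # then override a single element

p_theme
```

The order matters: a complete theme such as theme_minimal() replaces all earlier theme settings, so single-element tweaks with theme() should come after it.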

Basic Data Visualization using R

Moonis Shakeel, Ph.D.

5 September 2015
