Data Analysis and Visualization Using R: Lesson 2

David Robinson
1/29/14

Introduction to ggplot2

ggplot2 is a third-party package that produces attractive visualizations of data easily and intuitively.

plot of chunk example

Installing ggplot2

ggplot2 is a third-party package: code that doesn't come built into R. You therefore have to install it once before you can use it. The easiest way is to run the line:

install.packages("ggplot2")

You can also go to the Tools->Install Packages… menu in RStudio.

Loading the ggplot2 library

Installation only happens once, but every time you reopen R you need to load a package using library() before using it:

library(ggplot2)

Diamond data

data(diamonds)

The diamonds dataset contains information on the weight, price, size and quality of ~54,000 diamonds.

head(diamonds, 5)

  carat     cut color clarity depth table price    x    y    z
1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29 Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31    Good     J     SI2  63.3    58   335 4.34 4.35 2.75

Some columns of the diamond data

head(diamonds$cut)

[1] Ideal     Premium   Good      Premium   Good     
[6] Very Good
5 Levels: Fair < Good < Very Good < ... < Ideal

head(diamonds$color)

[1] E E E I J J
Levels: D < E < F < G < H < I < J
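The < signs in the Levels lines indicate that cut and color are ordered factors. As a small sketch (assuming ggplot2 is installed, so that the diamonds data is available), you can inspect the ordering and compare values against a level:

```r
library(ggplot2)  # provides the diamonds data

# The levels of an ordered factor, listed from lowest to highest
levels(diamonds$cut)
is.ordered(diamonds$cut)

# Ordered factors can be compared against a level name
at_least_very_good = diamonds$cut >= "Very Good"
table(at_least_very_good)
```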

Aesthetics

An aesthetic is one attribute that we can perceive visually. For a scatter plot, some aesthetics are:

x
y
color
size
shape

ggplot call

To build a plot in ggplot2, we use four components:

ggplot(data, aesthetics) + geom_[type of graph]() + [extra options]

data: The data frame we're working from
aesthetics: which attributes (columns) of the data are represented by what visual qualities (x, y, color, size, shape…), given inside a call to aes()
type of graph: geom_point, geom_histogram, geom_boxplot…
extra options: custom title or axis labels, background color, whether to make axes on log scale…
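One way to see the four components is to add them one at a time. This is only a sketch: the variable name p and the title and label text are illustrative choices, not part of the lesson.

```r
library(ggplot2)

# 1. the data and 2. the aesthetics
p = ggplot(diamonds, aes(x=carat, y=price))

# 3. the type of graph
p = p + geom_point()

# 4. extra options: here a title and axis labels
p = p + ggtitle("Diamond price by weight") + xlab("Carat") + ylab("Price (US dollars)")

# Typing the name of the plot displays it
p
```

Because a plot is an ordinary R object, it can be stored in a variable and extended later with further + calls.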

Basic scatter plot

ggplot(diamonds, aes(x=carat, y=price)) + geom_point()

plot of chunk diamonds_carat_price

Additional aesthetic: color

ggplot(diamonds, aes(x=carat, y=price, color=color)) + geom_point()

plot of chunk diamonds_withcol

Additional aesthetic: shape

ggplot(diamonds, aes(x=carat, y=price, color=color, shape=cut)) + geom_point()

plot of chunk diamonds_withcolshape

Additional aesthetic: size

ggplot(diamonds, aes(x=carat, y=price, color=color, shape=cut, size=depth)) + geom_point()

plot of chunk diamonds_withcolshapesize

Plotting a subset of the dataset

ggplot(diamonds[1:100, ], aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point()

plot of chunk diamonds_subset

Pre-filtering the data frame based on one column

ggplot(diamonds[diamonds$carat < 2, ], aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point()

plot of chunk diamonds_filtered
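The filtering happens before ggplot ever sees the data, so it can also be done as a separate step. A minimal sketch, assuming the variable names small1 and small2, that checks the bracket form against the equivalent subset() call:

```r
library(ggplot2)

# Two equivalent ways to keep only the diamonds under 2 carats
small1 = diamonds[diamonds$carat < 2, ]
small2 = subset(diamonds, carat < 2)

nrow(diamonds)  # number of rows before filtering
nrow(small1)    # number of rows after filtering

# The filtered data frame is then plotted like any other
p = ggplot(small2, aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point()
```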

Plot multiple plots in one, divided into "facets"

ggplot(diamonds, aes(x=carat, y=price, color=color, shape=cut)) + geom_point() + facet_wrap(~ clarity)

plot of chunk diamonds_facets

Additional plot options

Other options get added on to your call to ggplot:

ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + ...

X- and Y- axis limits

ggplot(diamonds, aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point() + xlim(0, 2) + ylim(0, 15000)

plot of chunk diamonds_axis_lim
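Note that xlim() and ylim() drop the points that fall outside the limits (R prints a warning about removed rows). If you only want to zoom in while keeping all of the data, for example so that a smoothing curve is still fit to every point, coord_cartesian() is an alternative. A brief sketch:

```r
library(ggplot2)

p = ggplot(diamonds, aes(x=carat, y=price)) + geom_point()

# Zoom in on the same region without removing any rows from the data
zoomed = p + coord_cartesian(xlim=c(0, 2), ylim=c(0, 15000))
zoomed
```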

X- or Y-axis on log scale

ggplot(diamonds, aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point() + scale_y_log10(breaks=c(330, 1000, 3300, 10000))

plot of chunk diamonds_logy
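The X-axis works the same way through scale_x_log10(). As a sketch, both axes can be put on a log scale at once (the breaks argument is optional and omitted here):

```r
library(ggplot2)

# Log scale on both the X- and the Y-axis
p = ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
    geom_point() + scale_x_log10() + scale_y_log10()
p
```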

Adding a smoothing curve

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + stat_smooth()

plot of chunk diamonds_smooth

Removing geom_point() removes points

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + stat_smooth()

plot of chunk diamonds_smooth_nopoints

Histograms

Scatter plots are just one kind of plot. Histograms show the density of a 1-dimensional variable.

Use geom_histogram() for histograms

ggplot(diamonds, aes(x=price)) + geom_histogram()

plot of chunk diamonds_histogram

Customize the width of the bins with the binwidth argument to geom_histogram

ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=2000)

plot of chunk diamonds_histogram_binwidth
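Bin width and number of bins are two views of the same choice. The sketch below estimates roughly how many bins a width of 2000 implies, then asks for a number of bins directly with the bins argument; the value 10 is just an illustrative choice.

```r
library(ggplot2)

# Rough number of bins implied by binwidth=2000
price_range = diff(range(diamonds$price))
ceiling(price_range / 2000)

# Alternatively, request a number of bins directly
p = ggplot(diamonds, aes(x=price)) + geom_histogram(bins=10)
p
```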

Divide histograms into facets in the same way

ggplot(diamonds, aes(x=price)) + geom_histogram() + facet_wrap(~ cut)

plot of chunk diamonds_histogram_facets

Create stacked histogram using fill aesthetic

ggplot(diamonds, aes(x=price, fill=cut)) + geom_histogram()

plot of chunk diamonds_histogram_fill

geom_freqpoly can also compare densities

ggplot(diamonds, aes(x=price, col=cut)) + geom_freqpoly()

plot of chunk diamonds_freqpoly

Boxplots and violin plots

Boxplots and violin plots can compare multiple 1-dimensional densities side by side.

geom_boxplot creates simple boxplots:

ggplot(diamonds, aes(x=cut, y=price)) + geom_boxplot()

plot of chunk diamonds_boxplot

geom_violin shows density in more detail

ggplot(diamonds, aes(x=cut, y=price)) + geom_violin()

plot of chunk diamonds_violin

ggplot shortcut

If you have just one vector and want to create a histogram, or have two vectors and want to make a scatter plot, it may not be worth constructing a data frame to contain them and then calling ggplot with geom_histogram() or geom_point(). The qplot function is a shortcut for these cases.

qplot for simple histograms

qplot can be used to make a histogram out of a single vector:

x = log(1:1000)
qplot(x)

plot of chunk qplot_hist

qplot for simple scatter plots

When given two arguments, qplot shows a scatter plot:

x = log(1:1000)
y = rnorm(1000)
qplot(x, y)
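For comparison, here is a sketch of the same scatter plot written as a full ggplot call. The data frame name df is an arbitrary choice. Newer versions of ggplot2 mark qplot as deprecated, so the long form is worth knowing.

```r
library(ggplot2)

x = log(1:1000)
y = rnorm(1000)

# Put the two vectors into a data frame, one column each
df = data.frame(x=x, y=y)

# The same kind of plot as qplot(x, y), built with the full ggplot call
p = ggplot(df, aes(x=x, y=y)) + geom_point()
p
```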

Input and output for ggplot2

The plots so far were easy because the data came with one observation per row, which we call “long” (or “tall”) format. But many datasets come in “wide” format, where several observations share a row, so it's not immediately clear how to ggplot them.
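To make the distinction concrete, here is a tiny made-up example (the column names and values are hypothetical) showing the same four numbers in both layouts, using only base R:

```r
# Hypothetical wide data: one row per year, one column per group
wide = data.frame(year=c(2001, 2002), a=c(1, 2), b=c(3, 4))

# The same values in long format: one row per (year, group) observation
long = data.frame(
    year  = rep(wide$year, times=2),
    group = rep(c("a", "b"), each=nrow(wide)),
    value = c(wide$a, wide$b)
)
long
```

In the long version, each row can be mapped directly to one point on a plot, with group available as a color aesthetic.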

Wide data

data(WorldPhones)
WorldPhones

     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076

Getting data into long format

We use melt from the reshape2 package to restructure:

library(reshape2)
WorldPhones.m = melt(WorldPhones)
head(WorldPhones.m)

  Var1   Var2 value
1 1951 N.Amer 45939
2 1956 N.Amer 60423
3 1957 N.Amer 64721
4 1958 N.Amer 68484
5 1959 N.Amer 71799
6 1960 N.Amer 76036

Getting data into long format

Since names like Var1 and value aren't helpful, we rename them:

colnames(WorldPhones.m) = c("Year", "Region", "Phones")

Melted data is "long" and therefore appropriate for ggplot2:

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Region)) + geom_line()

plot of chunk plot_world_phones
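If reshape2 is not available, a similar long data frame can be built with base R alone. This is a sketch of one possible alternative, not the approach used in the lesson; the variable name phones is an assumption.

```r
# Treat the matrix as a table, then convert it to a data frame
phones = as.data.frame(as.table(WorldPhones), stringsAsFactors=FALSE)
colnames(phones) = c("Year", "Region", "Phones")

# The years arrive as text, so convert them back to numbers
phones$Year = as.numeric(phones$Year)
head(phones)
```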

Saving figures

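To save a plot to a file, open a graphics device such as pdf(), print the plot, then close the device with dev.off(). Note the use of print around the call to ggplot: inside a script, function or loop, plots are not displayed automatically, so the explicit print is needed.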

pdf("myplot.pdf")
print(ggplot(diamonds, aes(x=carat, y=price, color=color, shape=cut)) + geom_point() + facet_wrap(~ clarity))
dev.off()

We can use jpeg() or png() in place of pdf() to create files of those formats as well.

Alternatively, you can go to “Export->Save Plot As Image” above the plot view in RStudio.
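ggplot2 also provides ggsave, which saves a plot object without opening and closing a device by hand. A minimal sketch, assuming a temporary output path and an arbitrary 6 by 4 inch size:

```r
library(ggplot2)

p = ggplot(diamonds[1:100, ], aes(x=carat, y=price)) + geom_point()

# Hypothetical output path; the .pdf extension selects the file format
out = file.path(tempdir(), "myplot.pdf")
ggsave(out, plot=p, width=6, height=4)  # width and height are in inches by default

file.exists(out)
```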

ggplot2 FAQ

What advantages does ggplot2 have over built-in R plotting?
- Built-in R plotting is powerful, but some things are much easier in ggplot2:
  - Legends are easier to create
  - Faceting a graph into multiple graphs takes one function call; with built-in plotting it requires dividing up your data and then performing a loop to make each plot separately
  - More complex plots, such as stacked histograms, are made much easier
  - The default output looks better

Functions

We've used a few built-in functions so far, such as sum and log. You can recognize a function by the use of parentheses:

cos(pi)

[1] -1

log(1.5)

[1] 0.4055

You can write your own functions as well.

Functions: "black box" with input and output

A function is a black box with inputs, called arguments, and an output, called a return value.

myfunc = function(a, b) {
    return(a + b / 2)
}

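In this case, a and b are the two arguments, and the value a + b / 2 is returned. Because division is performed before addition, this is a plus half of b, not the average of a and b.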
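The order of operations matters here. As a sketch, compare myfunc with a hypothetical variant, myavg, that adds parentheses so the sum is computed first:

```r
myfunc = function(a, b) {
    return(a + b / 2)    # b is halved first, then added to a
}

# Hypothetical variant: parentheses force the addition to happen first
myavg = function(a, b) {
    return((a + b) / 2)
}

myfunc(3, 10)  # 3 + 5, which is 8
myavg(3, 10)   # 13 / 2, which is 6.5
```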

Calling a function

You call a function using the name of the function, parentheses, and the arguments separated by commas:

myfunc(3, 10)

[1] 8
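Arguments can also be passed by name, in which case their order no longer matters. A short sketch using the same myfunc:

```r
myfunc = function(a, b) {
    return(a + b / 2)
}

by_position = myfunc(3, 10)      # a is 3, b is 10
by_name = myfunc(b=10, a=3)      # same call, arguments matched by name
swapped = myfunc(10, 3)          # by position, so now a is 10 and b is 3

by_position
by_name
swapped
```

The first two calls return the same value, while the third differs because the positions were swapped.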

Calling a function

You can save the return value of a function call to a new variable:

result = myfunc(6, 8)
result

[1] 10

In other words, the above line is equivalent to result = 6 + 8/2.

Functions in practice

Functions are meant to package a complicated task so that others can perform it easily and concisely. For example, it is easier to type

mean(x)

than

sum(x) / length(x)

every time.

Functions can be as long and complicated as you want: possibly taking up hundreds of lines and calling several other functions. What they have in common is taking a particular input, processing it, and then returning the new value.
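As a slightly longer sketch, here is a hypothetical function, standardize, that calls two other functions (mean and sd) to rescale a vector:

```r
# Hypothetical helper: rescale a vector to have mean 0 and standard deviation 1
standardize = function(x) {
    centered = x - mean(x)       # shift so the mean is 0
    return(centered / sd(x))     # scale so the standard deviation is 1
}

z = standardize(c(2, 4, 6, 8))
mean(z)
sd(z)
```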

Lists

Lists are a very flexible type of variable that can store multiple other variables within them. The difference between a list and a vector is that different values in a list can be different data types. A list can even contain other lists.

List example

Here we create a list with four elements of different types:

mylist = list(50, 1:10, "apple", c("hello", "world"))
mylist

[[1]]
[1] 50

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[3]]
[1] "apple"

[[4]]
[1] "hello" "world"

List example

You can find the length of a list with length, just like a vector:

length(mylist)

[1] 4

To retrieve one of the values in a list, use double square brackets, like so:

mylist[[2]]

 [1]  1  2  3  4  5  6  7  8  9 10
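Single square brackets behave differently from double ones: they return a shorter list rather than the element itself. A brief sketch of the difference:

```r
mylist = list(50, 1:10, "apple", c("hello", "world"))

inner = mylist[[2]]   # the element itself: a vector of the numbers 1 to 10
outer = mylist[2]     # a list of length 1 that contains that vector

is.list(inner)
is.list(outer)

# The element can be used like any vector
sum(mylist[[2]])
# sum(mylist[2]) would fail, because sum cannot add up a list
```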

Named list

Lists can also be named:

nlist = list(a=1:10, b="hello", c=2)

Its elements can still be accessed by position, as in nlist[[1]], but also by using a name in double square brackets or after a $:

nlist[['b']]

[1] "hello"

nlist$c

[1] 2
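Names also make it easy to inspect a list and to add new elements, including another list. A small sketch (the element name d and its contents are made up for illustration):

```r
nlist = list(a=1:10, b="hello", c=2)

# See which names the list has
names(nlist)

# Add a new element by assigning to a new name; here it is itself a list
nlist$d = list(x=TRUE, y="inner")

# Reach into the nested list by chaining $
nlist$d$y
length(nlist)
```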

Lists in practice

In practice, lists are often used to “package” together multiple variables (for example, the results of a comprehensive statistical analysis) so they can easily be moved around a program (for example, returned from a function).
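As a sketch of this pattern, the hypothetical function describe below returns several results at once by packaging them in a named list:

```r
# Hypothetical helper: return several summary values as one named list
describe = function(x) {
    return(list(n=length(x), mean=mean(x), range=range(x)))
}

result = describe(c(3, 7, 11))

# Each piece is retrieved by name
result$n
result$mean
result$range
```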