David Robinson
1/29/14
ggplot2 is a third-party package that produces attractive visualizations of data easily and intuitively.
Because ggplot2 is a third-party package (code that doesn't come built in to R), you have to install it before you can use it. The easiest way is to run the line:
install.packages("ggplot2")
You can also go to the Tools->Install Packages… menu in RStudio.
Every time you restart R, you need to load the package with library() before using it:
library(ggplot2)
data(diamonds)
The diamonds dataset contains information on the weight, price, size and quality of ~54,000 diamonds.
head(diamonds, 5)
  carat     cut color clarity depth table price    x    y    z
1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29 Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31    Good     J     SI2  63.3    58   335 4.34 4.35 2.75
Some columns of the diamonds data, such as cut and color, are ordered factors:
head(diamonds$cut)
[1] Ideal Premium Good Premium Good
[6] Very Good
5 Levels: Fair < Good < Very Good < ... < Ideal
head(diamonds$color)
[1] E E E I J J
Levels: D < E < F < G < H < I < J
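The < signs in the Levels lines show that cut and color are ordered factors. As a minimal sketch of how such a column behaves, here is a small hand-made vector (the name quality and its values are made up for illustration and are not taken from the diamonds data):

```r
# Hypothetical example vector, using the same level names as diamonds$cut
quality = factor(c("Good", "Ideal", "Fair", "Good"),
                 levels=c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                 ordered=TRUE)

levels(quality)       # all five levels, from lowest to highest
is.ordered(quality)   # TRUE for an ordered factor
quality >= "Good"     # comparisons follow the level order
table(quality)        # counts per level, including levels with no values
```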
An aesthetic is one attribute that we can perceive visually. For a scatter plot, some aesthetics are the x position, y position, color, shape and size of each point.
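To build a plot in ggplot2, we combine four components: the data, the aesthetics, the type of graph (a geom), and any extra options:
ggplot(data, aesthetics) + geom_<type of graph>() + <extra options>
Types of graph include geom_point, geom_histogram, geom_boxplot… For example, a scatter plot of price against carat:
ggplot(diamonds, aes(x=carat, y=price)) + geom_point()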
ggplot(diamonds, aes(x=carat, y=price, color=color)) + geom_point()
ggplot(diamonds, aes(x=carat, y=price, color=color, shape=cut)) + geom_point()
ggplot(diamonds, aes(x=carat, y=price, color=color, shape=cut, size=depth)) + geom_point()
ggplot(diamonds[1:100, ], aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point()
ggplot(diamonds[diamonds$carat < 2, ], aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point()
ggplot(diamonds, aes(x=carat, y=price, color=color, shape=cut)) + geom_point() + facet_wrap(~ clarity)
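Because the components are joined with +, a plot can also be built up one piece at a time and stored in a variable. The sketch below shows one way to do this; the variable names p, p_points and p_facets are chosen only for this example.

```r
library(ggplot2)

# Data and aesthetics: stored in a variable, nothing is drawn yet
p = ggplot(diamonds, aes(x=carat, y=price, color=clarity))

# Type of graph: adding a geom gives a complete plot
p_points = p + geom_point()

# Extra options: further pieces are added the same way
p_facets = p_points + facet_wrap(~ cut)

# Typing the name of a plot, or calling print on it, displays it
print(p_facets)
```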
Other options get added on to your call to ggplot:
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + ...
ggplot(diamonds, aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point() + xlim(0, 2) + ylim(0, 15000)
ggplot(diamonds, aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point() + scale_y_log10(breaks=c(330, 1000, 3300, 10000))
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + stat_smooth()
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + stat_smooth()
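One detail worth knowing when combining axis limits with stat_smooth: xlim() and ylim() discard the points that fall outside the limits before the curve is fitted, while coord_cartesian() fits the curve to all the data and then zooms in. A short sketch of the difference (the variable names are only for this example):

```r
library(ggplot2)

base = ggplot(diamonds, aes(x=carat, y=price)) + stat_smooth()

# xlim() removes diamonds above 2 carats before the curve is fitted
p_limits = base + xlim(0, 2)

# coord_cartesian() fits the curve to all diamonds, then zooms in
p_zoom = base + coord_cartesian(xlim=c(0, 2))
```

Printing p_limits and p_zoom side by side shows whether the heavier diamonds influence the fitted curve.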
Scatter plots are just one kind of plot. Histograms show the density of a 1-dimensional variable.
ggplot(diamonds, aes(x=price)) + geom_histogram()
ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=2000)
ggplot(diamonds, aes(x=price)) + geom_histogram() + facet_wrap(~ cut)
ggplot(diamonds, aes(x=price, fill=cut)) + geom_histogram()
ggplot(diamonds, aes(x=price, col=cut)) + geom_freqpoly()
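To see what a histogram with binwidth=2000 is counting, the same kind of binning can be done by hand with the built-in hist function. This is only a sketch: the bin edges chosen here start at 0, and geom_histogram may align its bins differently by default, so the bars need not match these counts exactly.

```r
library(ggplot2)   # provides the diamonds data

# Bin edges every 2000 dollars, wide enough to cover every price
edges = seq(0, 20000, by=2000)

# plot=FALSE returns the counts instead of drawing the histogram
h = hist(diamonds$price, breaks=edges, plot=FALSE)

# One row per bin: where it starts, where it ends, how many diamonds it holds
data.frame(from=head(edges, -1), to=edges[-1], count=h$counts)
```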
Boxplots and violin plots can compare multiple 1-dimensional densities side by side.
ggplot(diamonds, aes(x=cut, y=price)) + geom_boxplot()
ggplot(diamonds, aes(x=cut, y=price)) + geom_violin()
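The numbers summarized by each box can be computed directly, which helps when reading the plot. A minimal sketch using tapply from base R (the variable names medians and quartiles are chosen for this example):

```r
library(ggplot2)   # provides the diamonds data

# Median price within each cut: the middle line of each box
medians = tapply(diamonds$price, diamonds$cut, median)
medians

# First and third quartiles within each cut: the lower and upper box edges
quartiles = tapply(diamonds$price, diamonds$cut, quantile, probs=c(0.25, 0.75))
quartiles
```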
If you have just one vector and want a histogram, or two vectors and want a scatter plot, it may not be worth constructing a data frame to hold them just to call geom_histogram() or geom_point(). The qplot function is a shortcut for these cases (note that it is deprecated as of ggplot2 3.4.0).
qplot can be used to make a histogram out of a single vector:
x = log(1:1000)
qplot(x)
When given two arguments, qplot shows a scatter plot:
x = log(1:1000)
y = rnorm(1000)
qplot(x, y)
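For comparison, here is a sketch of the longer route that qplot saves: wrapping the vectors in a data frame and calling ggplot directly. The names df, p_hist and p_scatter are chosen only for this example.

```r
library(ggplot2)

x = log(1:1000)
y = rnorm(1000)

# Wrap the vectors in a data frame so ggplot can map them to aesthetics
df = data.frame(x=x, y=y)

p_hist = ggplot(df, aes(x=x)) + geom_histogram()       # like qplot(x)
p_scatter = ggplot(df, aes(x=x, y=y)) + geom_point()   # like qplot(x, y)
```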
The plots so far were easy because the data was given with one observation per row, which we call “tall” format. But many datasets come in “wide” format, so it's not immediately clear how to plot them with ggplot2.
data(WorldPhones)
WorldPhones
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076
We use melt from the reshape2 package to restructure:
library(reshape2)
WorldPhones.m = melt(WorldPhones)
head(WorldPhones.m)
  Var1   Var2 value
1 1951 N.Amer 45939
2 1956 N.Amer 60423
3 1957 N.Amer 64721
4 1958 N.Amer 68484
5 1959 N.Amer 71799
6 1960 N.Amer 76036
Since names like Var1 and value aren't helpful, we rename them:
colnames(WorldPhones.m) = c("Year", "Region", "Phones")
ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Region)) + geom_line()
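If reshape2 is not available, a similar tall table can be produced with base R alone. This is an alternative sketch, not the approach used above; the name phones_long is chosen for this example.

```r
# WorldPhones ships with R in the datasets package
data(WorldPhones)

# as.table + as.data.frame gives one row per (year, region) pair
phones_long = as.data.frame(as.table(WorldPhones), responseName="Phones")
colnames(phones_long)[1:2] = c("Year", "Region")

# The years come back as a factor; convert them to numbers for plotting
phones_long$Year = as.integer(as.character(phones_long$Year))

head(phones_long, 3)
```

The resulting data frame has the same three columns (Year, Region, Phones) and can be passed to ggplot in the same way.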
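To save a plot to a file, open a graphics device such as pdf(), print the plot, and close the device with dev.off(). Note the use of print around the call to ggplot: inside a function, a loop, or a file run with source(), a plot is not drawn unless it is printed explicitly.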
pdf("myplot.pdf")
print(ggplot(diamonds, aes(x=carat, y=price, color=color, shape=cut)) + geom_point() + facet_wrap(~ clarity))
dev.off()
We can use jpeg() or png() in place of pdf() to create files of those formats as well.
Alternatively, you can go to “Export->Save Plot As Image” above the plot view in RStudio.
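ggplot2 also provides its own ggsave function, which opens and closes the graphics device for you. A minimal sketch, where the file name and the size are arbitrary choices for this example:

```r
library(ggplot2)

p = ggplot(diamonds, aes(x=carat, y=price, color=color)) + geom_point()

# The file type is taken from the extension; width and height are in inches
ggsave("myplot.pdf", plot=p, width=6, height=4)
```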
What advantages does ggplot2 have over built-in R plotting?
We've used a few built-in functions so far, such as sum and log. You can recognize a function by the use of parentheses:
cos(pi)
[1] -1
log(1.5)
[1] 0.4055
You can write your own functions as well.
A function is a black box with inputs, called arguments, and an output, called the return value.
myfunc = function(a, b) {
    return(a + b / 2)
}
In this case, a and b are the two arguments, and the value a + b / 2 is returned.
You call a function using the name of the function, parentheses, and the arguments separated by commas:
myfunc(3, 10)
[1] 8
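Note that myfunc(3, 10) gives 8 rather than 6.5, because division is performed before addition. The sketch below contrasts myfunc with a second function that averages its arguments; the name average is made up for this example.

```r
myfunc = function(a, b) {
    return(a + b / 2)      # division happens first: a + (b / 2)
}

# Hypothetical variant for comparison
average = function(a, b) {
    return((a + b) / 2)    # parentheses force the addition first
}

myfunc(3, 10)    # 3 + 5
average(3, 10)   # 13 / 2
```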
You can save the return value of a function call to a new variable:
result = myfunc(6, 8)
result
[1] 10
In other words, the above line is equivalent to result = 6 + 8/2.
Functions are meant to package a complicated task so that others can perform it easily and concisely. For example, it is easier to type
mean(x)
than
sum(x) / length(x)
every time.
Functions can be as long and complicated as you want: possibly taking up hundreds of lines and calling several other functions. What they have in common is taking a particular input, processing it, and then returning the new value.
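As a small illustration of input, processing and return value, here is a sketch of how a function like mean could be written by hand. The name my_mean is hypothetical, and the real mean function handles more cases than this.

```r
# my_mean is a hypothetical name chosen for this example
my_mean = function(x) {
    total = sum(x)       # add up the values
    n = length(x)        # count them
    return(total / n)    # the return value
}

x = c(2, 4, 6, 8)
my_mean(x)               # same result as mean(x)
```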
Lists are a very flexible type of variable that can store multiple other variables within them. The difference between a list and a vector is that different values in a list can be different data types. A list can even contain other lists.
Here we create a list with four elements of different types:
mylist = list(50, 1:10, "apple", c("hello", "world"))
mylist
[[1]]
[1] 50
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10
[[3]]
[1] "apple"
[[4]]
[1] "hello" "world"
You can find the length of a list with length, just like a vector:
length(mylist)
[1] 4
To retrieve one of the values in a list, use double square brackets, like so:
mylist[[2]]
[1] 1 2 3 4 5 6 7 8 9 10
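Single square brackets also work on a list, but they return a smaller list rather than the value itself. A short sketch of the difference (the names element and sublist are chosen for this example):

```r
mylist = list(50, 1:10, "apple", c("hello", "world"))

element = mylist[[2]]   # the value itself: an integer vector
sublist = mylist[2]     # a list of length 1 that contains the value

class(element)
class(sublist)
sum(element)            # works because element is a vector
```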
Lists can also be named:
nlist = list(a=1:10, b="hello", c=2)
Its elements can still be accessed by position, as in nlist[[1]], but also by name, either in double square brackets or after a $:
nlist[['b']]
[1] "hello"
nlist$c
[1] 2
In practice, lists are often used to “package” together multiple variables (for example, the results of a comprehensive statistical analysis) so they can easily be moved around a program, such as when they are returned from a function.
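A minimal sketch of this pattern: a function that returns several results at once as a named list. The function name describe and the values passed to it are made up for this example.

```r
# describe is a hypothetical helper written for this example
describe = function(x) {
    return(list(n=length(x), mean=mean(x), range=range(x)))
}

res = describe(c(3, 7, 8, 12))
res$n            # number of values
res$mean         # their mean
res[["range"]]   # smallest and largest value
```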