David Lee
November 16, 2016
Pretty graphics (especially compared to SAS)
Systematic framework for manipulating & visualizing data
Based on the “Grammar of Graphics” by Leland Wilkinson
Wilkinson describes graphics in “layers”
For further details, see Hadley Wickham’s article “A Layered Grammar of Graphics”.
Let’s load ggplot2 and the diamonds dataset that ships with it.
library(ggplot2)
data(diamonds)
summary(diamonds)
##      carat               cut        color        clarity
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##
##        y                z
##  Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.710   Median : 3.530
##  Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :58.900   Max.   :31.800
##
We’ll start with the most basic function, “ggplot”.
What is the first element we pick after choosing the dataset?
“Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms.”
That is a fancy way of saying “how we encode data as visual elements”.
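To make the idea of a mapping concrete, here is a minimal sketch that builds aes() objects on their own and looks at which aesthetics they name. The object names m and m2 are illustrative only.

```r
library(ggplot2)

# An aesthetic mapping is a named set of "visual property = variable" pairs.
m <- aes(x = carat, y = price)
names(m)

# aes() accepts the spelling "color" and stores it under the name "colour".
m2 <- aes(x = carat, y = price, color = cut)
names(m2)
```

Nothing is drawn at this point: a mapping only says which variable drives which visual property.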
Let’s start with no aesthetic mapping.
g <- ggplot(diamonds)
summary(g)
## data: carat, cut, color, clarity, depth, table, price, x, y, z
## [53940x10]
## faceting: facet_null()
g
What are we looking at in this box?
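One way to answer that question is to inspect the plot object itself. The sketch below assumes only that the mapping and the layers can be read off the object with $; the box is an empty panel because nothing has been mapped and there is no layer to draw.

```r
library(ggplot2)

g <- ggplot(diamonds)

# No aesthetics have been mapped yet ...
length(g$mapping)   # 0
# ... and no geoms have been added, so there is nothing to draw.
length(g$layers)    # 0
```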
Let’s add one variable at a time, starting with carat.
g <- ggplot(diamonds, aes(x=carat))
summary(g)
## data: carat, cut, color, clarity, depth, table, price, x, y, z
## [53940x10]
## mapping: x = carat
## faceting: facet_null()
g
Now we add price.
g <- ggplot(diamonds, aes(x=carat,y=price))
summary(g)
## data: carat, cut, color, clarity, depth, table, price, x, y, z
## [53940x10]
## mapping: x = carat, y = price
## faceting: facet_null()
Next, we’ll add geoms. How do we add points to this graph?
g <- ggplot(diamonds, aes(x=carat,y=price)) + geom_point()
g
summary(g)
## data: carat, cut, color, clarity, depth, table, price, x, y, z
## [53940x10]
## mapping: x = carat, y = price
## faceting: facet_null()
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
Note that I literally added the point geom to the plot with the + operator. This is what we mean by adding layers.
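A small sketch of what “adding a layer” does to the plot object. The names base and with_points are illustrative.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price))
with_points <- base + geom_point()

# `+` returns a new plot; the original object is left unchanged.
length(base$layers)         # 0
length(with_points$layers)  # 1
```

Because base is untouched, the same starting plot can be reused to try different geoms.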
How do we include another geom, a smoothed mean (geom_smooth)?
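A minimal sketch of one answer: add geom_smooth() as a second layer after the points. The smoother is left at its defaults here, so ggplot2 picks the method and reports it in a message.

```r
library(ggplot2)

# Each `+` stacks one more layer on the same data and mappings.
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()   # default smoother; the chosen method is printed as a message

p
```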
How do we include another aesthetic, coloring by the diamond’s cut?
ggplot(diamonds, aes(x=carat,y=price,color=cut)) +
geom_point() +
geom_smooth()
Since the aesthetic mappings were defined in the ggplot function itself, both the points and the smoothed means inherit them, so each cut gets its own color and its own smoothed mean.
How do we change this so that we have a single smoothed mean?
ggplot(diamonds, aes(x=carat,y=price)) +
geom_point(aes(color=cut)) +
geom_smooth()
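An alternative sketch, in case you would rather keep color = cut in the ggplot() call: tell the smoothing layer not to inherit the plot-level mappings and give it its own. inherit.aes is a standard layer argument; whether this reads better than moving the mapping into geom_point() is a matter of taste.

```r
library(ggplot2)

p_single <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  # This layer ignores the plot-level aes(), so it is not split by cut.
  geom_smooth(aes(x = carat, y = price), inherit.aes = FALSE)

p_single
```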
Next, let’s add labels and change the theme.
ggplot(diamonds, aes(x=carat,y=price)) +
geom_point(aes(color=cut)) +
geom_smooth() +
labs(title = "Scatterplot",
x = "Carat",
y = "Price") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank())+
theme(axis.line.y = element_line(colour = "black", size=.5),
axis.line.x = element_line(colour = "black", size=.5))
Themes look like a pain to configure. How do we save theme settings?
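Two possible answers, sketched under the assumption that the settings from the plot above are the ones worth keeping. The name my_theme is illustrative, and the line size is left at its default.

```r
library(ggplot2)

# Store the theme tweaks once in an ordinary R object.
my_theme <- theme(panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  panel.background = element_blank(),
                  axis.line.x = element_line(colour = "black"),
                  axis.line.y = element_line(colour = "black"))

# Option 1: add the saved theme to any plot, just like a layer.
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +
  my_theme

# Option 2: make it the session default; theme_set() returns the previous theme.
old <- theme_set(theme_gray() + my_theme)
# ... plots created here pick up the new default ...
theme_set(old)  # restore the earlier default
```

Option 1 keeps the choice visible in each plot's code; option 2 is shorter when every plot in a script should look the same.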
Let’s talk about facets.
Sometimes I’m exploring and I just want to compare the levels of a factor side-by-side in separate graphs. Instead of mapping the diamond’s cut to a color, I use facets.
ggplot(diamonds, aes(x=carat,y=price)) +
geom_point(size=1) +
geom_smooth(size=1) +
labs(title = "Scatterplot by Cut",
x = "Carat",
y = "Price") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank())+
theme(axis.line.y = element_line(colour = "black", size=.5),
axis.line.x = element_line(colour = "black", size=.5)) +
facet_grid(cut ~ .)
Other times, I want to look at levels across two variables, such as cut and clarity.
ggplot(diamonds, aes(x=carat,y=price)) +
geom_point(size=.5) +
geom_smooth(size=1) +
labs(title = "Scatterplot by Cut and Clarity",
x = "Carat",
y = "Price") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank())+
theme(axis.line.y = element_line(colour = "black", size=.5),
axis.line.x = element_line(colour = "black", size=.5)) +
facet_grid(cut ~ clarity)
## Warning: Computation failed in `stat_smooth()`:
## x has insufficient unique values to support 10 knots: reduce k.
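The warning above comes from the default smoother, a spline fit that needs enough distinct x values in every panel. A rough way to check this, and one possible workaround, is sketched below. Swapping in a straight-line fit with method = "lm" is a substitution made here for illustration, not what the original plot used, and the names cells and p_lm are illustrative.

```r
library(ggplot2)

# Count the distinct carat values in each cut x clarity panel.
cells <- aggregate(carat ~ cut + clarity, data = diamonds,
                   FUN = function(v) length(unique(v)))
head(cells[order(cells$carat), ])   # sparsest panels first

# A linear fit needs no spline knots, so sparse panels can still be fitted.
p_lm <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm") +
  facet_grid(cut ~ clarity)

p_lm
```

A straight line is a much cruder summary than the smoothed mean, so this trades flexibility for a fit in every panel.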
Stacked barplot (which I generally don’t recommend)
ggplot(diamonds, aes(clarity)) +
geom_bar(aes(fill=cut)) +
scale_fill_brewer() +
theme_minimal() +
theme(legend.position="top") +
labs(title = "Stacked barplot",
x = "Clarity",
y = "# of Diamonds")
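One common alternative to stacking, sketched here as a suggestion rather than as the author’s preferred chart, is to place the bars for each cut side by side with position = "dodge", so that every bar starts from the same baseline.

```r
library(ggplot2)

p_dodge <- ggplot(diamonds, aes(clarity)) +
  geom_bar(aes(fill = cut), position = "dodge") +   # bars side by side
  scale_fill_brewer() +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(title = "Dodged barplot",
       x = "Clarity",
       y = "# of Diamonds")

p_dodge
```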
Histogram of diamond prices
ggplot(diamonds) +
geom_histogram(aes(price),color="red",fill="white") +
theme_minimal() +
labs(title = "Histogram of diamond prices",
x = "Price",
y = "# of Diamonds") +
coord_flip() +
geom_vline(xintercept=3944,alpha=.8,linetype=2) +
annotate("text", x = 4411, y = 10115, label = "National avg cost of wedding stone (not fact)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
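The message above asks for an explicit bin width. A minimal sketch, assuming a width of 500 (in the units of price) purely for illustration:

```r
library(ggplot2)

p_hist <- ggplot(diamonds, aes(price)) +
  # binwidth is in the units of price; 500 is an arbitrary illustrative choice.
  geom_histogram(binwidth = 500, color = "red", fill = "white") +
  theme_minimal() +
  labs(title = "Histogram of diamond prices",
       x = "Price",
       y = "# of Diamonds")

p_hist
```

Setting binwidth (or bins) explicitly also silences the stat_bin() message.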
Boxplot of Price per Carat by Color, from A. Meliji
ggplot(diamonds, aes(factor(color), (price/carat), fill=color)) +
geom_boxplot() +
ggtitle("Diamond Price per Carat according to Color") +
xlab("Color") +
ylab("Diamond Price per Carat (US$)")