What is ggplot2?

Grammar

Other visualization options

Lattice and ggplot2 tend to compete. Lattice is strongly rooted in common approaches to visualization in statistics. ggplot2 attempts to use all those ideas “without the bad stuff” (its author, Hadley Wickham, is also a statistician).

ggvis and ggplot2 were both created by Wickham. D3.js is a low-level interactive visualization tool that can do much more than ggplot2. In practice, a data scientist might use ggplot2 or ggvis to prototype a visualization, then hand it to a professional front-end developer or UX expert to put into production using D3.js. Here is a great example.

A histogram with base R plotting

hist(sestates$price)

Tailor the plot by passing arguments to the hist function.

hist(sestates$price,
     breaks = 30,
     main = "Housing Prices",
     xlab = "price",
     col = "blue")

Note that hist does not play well with the “piping” used in dplyr: select returns a data frame, but hist expects a numeric vector, so the following won’t work.

sestates %>%
  filter(city == "SACRAMENTO") %>%
  select(price) %>%
  hist
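One way to make the pipeline work is to extract the column as a plain vector before it reaches hist. The sketch below assumes a dplyr version that provides pull(); the data frame toy and its values are made up for illustration and only borrow the city and price column names from sestates.

```r
library(dplyr)

# Hypothetical stand-in for sestates; the values are invented for illustration
toy <- data.frame(
  city  = c("SACRAMENTO", "SACRAMENTO", "ELK GROVE", "SACRAMENTO"),
  price = c(120000, 250000, 310000, 185000)
)

# pull() returns the column as a numeric vector, which hist() accepts
h <- toy %>%
  filter(city == "SACRAMENTO") %>%
  pull(price) %>%
  hist(main = "Housing Prices", xlab = "price")
```

hist returns the histogram object invisibly, so h can be inspected afterwards (for example, h$counts).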

A basic histogram with ggplot2

Start by using “aes” to create an “aesthetic mapping”, which describes how variables in the data are mapped to visual properties.

library(ggplot2)

# Map the x axis to price
p <- ggplot(sestates, aes(x = price))
p  # draws an empty canvas: no layer has been added yet

Then add the histogram as a layer.

p + geom_histogram(bins = 30)

Unlike default plotting, where you add details by passing arguments to a single function call, ggplot2 has you build the plot incrementally by adding “layers”.

p + 
  geom_histogram(bins = 30, fill = "blue") + 
  ggtitle("Housing Prices")  # add a title
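The incremental style works because each `+` returns a new plot object and leaves the original untouched. Here is a minimal sketch of that behaviour, using the built-in mtcars data as a stand-in for sestates. The names base, with_hist and with_title are illustrative, and inspecting the layers element is just one convenient way to peek at the object.

```r
library(ggplot2)

# mtcars ships with R; it stands in for sestates here
base <- ggplot(mtcars, aes(x = mpg))

# Each `+` returns a new plot object; `base` itself is left unchanged
with_hist  <- base + geom_histogram(bins = 10, fill = "blue")
with_title <- with_hist + ggtitle("Miles per gallon")

length(base$layers)        # no layers on the bare mapping
length(with_hist$layers)   # one layer after adding the histogram
length(with_title$layers)  # still one: the title is a label, not a geom layer
```

Because base is never modified, the same mapping can be reused to try out different layers side by side.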

Comparing default plotting to ggplot: Overlaying two histograms

Looking at square feet, let’s define a house of 1850 square feet or more as “big”.

sestates <- mutate(sestates, 
                size = ifelse(sq__ft >= 1850, "big", "small"),
                size = factor(size))
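To check how the threshold and the factor conversion behave, the same recoding can be run on a few made-up values. The data frame toy is hypothetical; only the sq__ft column name comes from sestates.

```r
library(dplyr)

# Hypothetical square footages, chosen to include the boundary value 1850
toy <- data.frame(sq__ft = c(900, 1850, 2400, 1200))

# Same recoding as above, collapsed into a single expression
toy <- mutate(toy, size = factor(ifelse(sq__ft >= 1850, "big", "small")))

table(toy$size)   # counts per category
levels(toy$size)  # factor() orders levels alphabetically: "big", then "small"
```

A house of exactly 1850 square feet lands in the “big” group because the comparison is `>=`.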

Using default plotting methods you create a first histogram and then “add” a second by setting the add argument to TRUE.

hist(sestates$price[sestates$size == "big"],
     main = "Housing Prices", xlab = "Price",
     col = rgb(1, 0, 0, 0.5), breaks = 15, ylim = c(0, 175))
hist(sestates$price[sestates$size == "small"],
     col = rgb(0, 0, 1, 0.5), breaks = 15, add = TRUE)

In ggplot2 you don’t create two different plots and combine them. Rather, you map a variable to a visual property of the plot; in this case, the size variable is mapped to the fill color (“fill = size”). You then add a single histogram layer.

ggplot(sestates, aes(x = price, fill = size)) + # Map price and size
  geom_histogram(bins = 30, alpha = .5,
                 position = "identity") + # Overlay the groups instead of stacking them (the default)
  ggtitle("Housing Prices")  # Add a title

Building a scatter plot

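Map price to the X axis, square feet to the Y axis, and type (Condo, Multi-Family, Residential) to color.

p <- ggplot(sestates, aes(price, sq__ft, col = type))

Plot points. Transparency is a fixed value rather than a data mapping, so it is set in the layer, outside aes().

p <- p + geom_point(alpha = .2)
p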

Plot a smoothing function.

p <- p + geom_smooth()
p

Compare across the number of bathrooms by faceting on baths.

p <- p + facet_grid(. ~ baths)
p

Oops! That looks kind of ugly. ggplot2 does not substitute for good visualization practice. ggplot2 is a tool. Creating informative visualizations of data is a skill you must build.

If you know what you are doing, ggplot2 can make sophisticated visualizations.

Data on crimes in the USA

## Source: local data frame [50 x 5]
## 
##          state Murder Assault UrbanPop  Rape
##         (fctr)  (dbl)   (int)    (int) (dbl)
## 1      alabama   13.2     236       58  21.2
## 2       alaska   10.0     263       48  44.5
## 3      arizona    8.1     294       80  31.0
## 4     arkansas    8.8     190       50  19.5
## 5   california    9.0     276       91  40.6
## 6     colorado    7.9     204       78  38.7
## 7  connecticut    3.3     110       77  11.1
## 8     delaware    5.9     238       72  15.8
## 9      florida   15.4     335       80  31.9
## 10     georgia   17.4     211       60  25.8
## ..         ...    ...     ...      ...   ...
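The rows printed above match R's built-in USArrests dataset, so the objects used in this section can plausibly be rebuilt as sketched below. This is an assumption about how crimes, crimesm and states_map were prepared, not the original code; the long-format step uses base R here, although the column names variable and value suggest a melt-style reshape.

```r
# Assumption: crimes is USArrests with lower-case state names in a column
crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)

# Long format: one row per state and variable, for faceting by variable
crimesm <- data.frame(
  state    = rep(crimes$state, times = ncol(USArrests)),
  variable = rep(names(USArrests), each = nrow(USArrests)),
  value    = unlist(USArrests, use.names = FALSE)
)

# State outlines; map_data() needs the maps package to be installed
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("maps", quietly = TRUE)) {
  states_map <- ggplot2::map_data("state")
}
```

The lower-case state names matter because map_id is matched against the region names in the map data, which are lower-case.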

Plot a map of crime with ggplot.

ggplot(crimes, aes(map_id = state)) +
  geom_map(aes(fill = Murder), map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  coord_map()

Add a ‘facet’ layer to contrast the different variables.

ggplot(crimesm, aes(map_id = state)) +
  geom_map(aes(fill = value), map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  facet_wrap(~ variable)

Geographical maps tend to be overused. Of course there are more murders in New York than Montana – New York has more people. ggplot2 can produce a variety of alternative visualizations.

In a famous presentation, Hans Rosling, a physician and professor of international health, showed that the x-y scatterplot, with population mapped to the size of each point, is a powerful yet simple way of conveying population-based information.

[Figures: three stills from Rosling’s presentation (hans1, hans2, hans3)]

You can create this easily in ggplot2 (if you want the animation, you need ggvis).

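ggplot(crimes, aes(Murder, Assault, size = UrbanPop, label = state)) +
  geom_point(colour = "red") +
  geom_text(size = 3) +
  # The values shown for crimes match R's USArrests data, whose rates are arrests per 100,000 residents
  xlab("Murder arrests per 100,000 population") +
  ylab("Assault arrests per 100,000")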