Notes on RFDS 2

Load Packages

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Facets

Bill Cleveland called these graphics trellises and implemented them in the original S language in the package Trellis.

Trellis was essentially re-created in R under the name lattice by Deepayan Sarkar. See http://lmdvr.r-forge.r-project.org/figures/figures.html. See figures 2.8 - 2.11 on Titanic survivors.

Trellis and lattice are strongly focused on this type of graphic and have declined in popularity since the arrival of the grammar-of-graphics approach in ggplot2.

Tufte refers to this type of graphic as ‘small multiples.’

facet_wrap

To produce separate graphs for each value of a single categorical variable, use facet_wrap() as a layer. The syntax requires a one-sided formula: a single variable preceded by a ~.

Example

d = ggplot(data=mpg, aes(x=cty,y=hwy)) + geom_point()

d + facet_wrap(~class)

The layout can be controlled with the arguments ncol or nrow.

d + facet_wrap(~class,nrow=2) + ggtitle("nrow = 2")

d + facet_wrap(~class,ncol=2) + ggtitle("ncol=2")

Exercise

Create two graphics showing the relationship between displ and hwy broken down by the categorical variable drv in the dataframe mpg. In one version, use a single-row arrangement. In the other, use a single column. Which do you prefer?

Answer

e = ggplot(data=mpg,aes(x=displ,y=hwy)) + geom_point()
e

e + facet_wrap(~drv,nrow=1) + ggtitle("nrow=1")

e + facet_wrap(~drv,ncol=1) + ggtitle("ncol=1")

facet_grid

To produce separate graphs for each combination of values of two categorical variables use facet_grid() as a layer. The syntax requires a two-variable formula (two variables separated by a ~). The first variable designates the rows and the second designates the columns.

Example

f = ggplot(mpg,aes(x=cty,y=hwy)) + geom_point()
f

f + facet_grid(drv~class) + ggtitle("drv in rows")

f + facet_grid(class~drv) + ggtitle("drv in columns")
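
The two-variable formula also has a one-variable form. As a minimal sketch (the names f_rows and f_cols are made up for illustration), a "." can stand in for the side of the formula that has no variable, which gives a single column or a single row of panels:

```r
library(ggplot2)

f = ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()

# A "." on one side of the formula means "no variable on this side".
f_rows = f + facet_grid(drv ~ .)   # drv in rows only
f_cols = f + facet_grid(. ~ drv)   # drv in columns only

f_rows
f_cols
```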

Many Choices

Review the variables in mpg.

glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "...
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 qua...
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0,...
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1...
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6...
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)...
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4",...
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 1...
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 2...
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
## $ class        <chr> "compact", "compact", "compact", "compact", "comp...

What are the quantitative variables of interest?

* displ
* cty
* hwy

In displaying relationships among these using a scatterplot we can put two on the axes and map the third to either color or size. If we use a smoother alone, we can only use two variables.
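
A minimal sketch of these options, with displ and hwy on the axes and cty as the third variable. The object names and the alpha value are illustrative choices, not part of the original notes:

```r
library(ggplot2)

# Two variables on the axes, the third (cty) mapped to color.
p_color = ggplot(data = mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

# Same axes, with cty mapped to size instead.
p_size = ggplot(data = mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point(alpha = 0.4)

# A smoother alone shows only the relationship between x and y.
p_smooth = ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

p_color
p_size
p_smooth
```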

What are the categorical variables of interest?

How can we extend the relationship among quantitative variables to include one or more categorical variables?

Map to:

Exercise

Using the mpg dataframe, try something we haven’t done yet.

g = ggplot(data=mpg,aes(x=displ,y=hwy,color=year)) + geom_point() 
g

g + facet_grid(drv~trans)

g + facet_wrap(~manufacturer)

End of hour 1

Bar Charts

The primary use of the bar chart is to display the counts of values of one or more categorical variables.

Example of just one variable.

ggplot(data = mpg,aes(x= class)) + geom_bar()

Two Categorical Variables

Map the second categorical to fill.

ggplot(data = mpg,aes(x= class,fill=drv)) + geom_bar()

Do this in the opposite order: map drv to x and class to fill.

ggplot(data = mpg,aes(fill= class,x=drv)) + geom_bar()

Position

The default value of the position parameter is “stack”. Two common alternatives are “dodge” and “fill”.

Dodge Example

ggplot(data = mpg,aes(x= class,fill=drv)) + geom_bar(position = "dodge")

This is known as a side-by-side bar chart.

Fill Example

ggplot(data = mpg,aes(x= class,fill=drv)) + geom_bar(position = 'fill')

All bars are of height 1.0 and the counts are lost. What we have displayed are proportions.
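
The proportions can also be computed directly. The sketch below assumes dplyr is loaded (it comes with the tidyverse) and uses the made-up name class_drv. It divides each class/drv count by its class total, which gives the segment heights drawn by position = 'fill':

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Count each class/drv combination, then divide by the class total.
class_drv = mpg %>%
  count(class, drv) %>%
  group_by(class) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

class_drv
```

Within each class the prop values sum to 1, which is why every bar reaches the same height.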

Consider faceting for the second variable.

ggplot(data = mpg,aes(x= class)) + 
  geom_bar() +
  facet_wrap(~drv)

Fix the crowded class labels on the x axis by stacking the panels with ncol = 1.

ggplot(data = mpg,aes(x= class)) + 
  geom_bar() +
  facet_wrap(~drv,ncol=1)

Or try coord_flip(), which puts the class labels on the vertical axis.

ggplot(data = mpg,aes(x= class)) + 
  geom_bar() + coord_flip() +
  facet_wrap(~drv)

This looks dull, so add some color with fill.

ggplot(data = mpg,aes(x= class,fill=drv)) + 
  geom_bar() + coord_flip() +
  facet_wrap(~drv)

Recall that what we have been doing is visualizing a contingency table, which we can examine numerically with CrossTable from the gmodels package.

library(gmodels)
CrossTable(mpg$class,mpg$drv)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  234 
## 
##  
##              | mpg$drv 
##    mpg$class |         4 |         f |         r | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##      2seater |         0 |         0 |         5 |         5 | 
##              |     2.201 |     2.265 |    37.334 |           | 
##              |     0.000 |     0.000 |     1.000 |     0.021 | 
##              |     0.000 |     0.000 |     0.200 |           | 
##              |     0.000 |     0.000 |     0.021 |           | 
## -------------|-----------|-----------|-----------|-----------|
##      compact |        12 |        35 |         0 |        47 | 
##              |     3.649 |     8.828 |     5.021 |           | 
##              |     0.255 |     0.745 |     0.000 |     0.201 | 
##              |     0.117 |     0.330 |     0.000 |           | 
##              |     0.051 |     0.150 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##      midsize |         3 |        38 |         0 |        41 | 
##              |    12.546 |    20.321 |     4.380 |           | 
##              |     0.073 |     0.927 |     0.000 |     0.175 | 
##              |     0.029 |     0.358 |     0.000 |           | 
##              |     0.013 |     0.162 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##      minivan |         0 |        11 |         0 |        11 | 
##              |     4.842 |     7.266 |     1.175 |           | 
##              |     0.000 |     1.000 |     0.000 |     0.047 | 
##              |     0.000 |     0.104 |     0.000 |           | 
##              |     0.000 |     0.047 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##       pickup |        33 |         0 |         0 |        33 | 
##              |    23.497 |    14.949 |     3.526 |           | 
##              |     1.000 |     0.000 |     0.000 |     0.141 | 
##              |     0.320 |     0.000 |     0.000 |           | 
##              |     0.141 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##   subcompact |         4 |        22 |         9 |        35 | 
##              |     8.445 |     2.382 |     7.401 |           | 
##              |     0.114 |     0.629 |     0.257 |     0.150 | 
##              |     0.039 |     0.208 |     0.360 |           | 
##              |     0.017 |     0.094 |     0.038 |           | 
## -------------|-----------|-----------|-----------|-----------|
##          suv |        51 |         0 |        11 |        62 | 
##              |    20.598 |    28.085 |     2.891 |           | 
##              |     0.823 |     0.000 |     0.177 |     0.265 | 
##              |     0.495 |     0.000 |     0.440 |           | 
##              |     0.218 |     0.000 |     0.047 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |       103 |       106 |        25 |       234 | 
##              |     0.440 |     0.453 |     0.107 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 
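
If gmodels is not installed, base R gives the same counts and proportions. This is a sketch using table() and prop.table(); the name counts is chosen here for illustration:

```r
library(ggplot2)  # provides the mpg data

# Counts: the N values that CrossTable() prints first in each cell.
counts = table(mpg$class, mpg$drv)
counts

# Row proportions (N / Row Total), the quantity drawn by position = "fill"
# when class is on the x axis.
round(prop.table(counts, margin = 1), 3)

# Column proportions (N / Col Total).
round(prop.table(counts, margin = 2), 3)
```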

Exercise

Create a visualization of the relationship between drv and cyl in the mpg dataframe.
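
One possible approach, offered as a sketch rather than the intended answer. Because cyl is stored as an integer, it is wrapped in factor() so that ggplot treats it as categorical. The name h and the choice of a dodged bar chart are assumptions made here:

```r
library(ggplot2)

# Convert cyl to a factor so it is treated as a categorical variable.
h = ggplot(data = mpg, aes(x = drv, fill = factor(cyl))) +
  geom_bar(position = "dodge") +
  labs(fill = "cyl")
h

# Numeric check of the same relationship.
drv_cyl = table(mpg$drv, mpg$cyl)
drv_cyl
```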

Distribution of One Quantitative Variable

Three primary ways:

* Histogram
* Boxplot
* Density

Use the diamonds dataset and look at price.

Histogram

ggplot(data = diamonds,aes(x=price)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Try a different binwidth or two.

ggplot(data = diamonds,aes(x=price)) + geom_histogram(binwidth=1000)

ggplot(data = diamonds,aes(x=price)) + geom_histogram(binwidth=2000)

Try a logarithmic scale. The binwidth is then measured in log10 units, so 0.1 means one tenth of a factor of ten.

ggplot(data = diamonds,aes(x=price)) + 
  geom_histogram(binwidth=.1) +
  scale_x_log10()

Boxplot

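Boxplots in ggplot2 have a peculiarity: in the version used for these notes, geom_boxplot() requires both an x and a y aesthetic. More recent ggplot2 releases accept y alone.

# This fails in the ggplot2 version used for these notes.
# ggplot(data = diamonds,aes(y=price)) + geom_boxplot()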

Fix this by using any constant for x.

ggplot(data = diamonds,aes(y=price,x='Whatever')) + geom_boxplot()

Use coord_flip() to make it horizontal and align visually with the other visualizations of one quantitative variable.

ggplot(data = diamonds,aes(y=price,x="Whatever")) + 
  geom_boxplot() +
  coord_flip()
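
The numbers behind the boxplot can be inspected with base R. This is a sketch: boxplot.stats() uses Tukey's hinges, which may differ slightly from the quartiles that geom_boxplot() draws, and the name price_stats is chosen here for illustration.

```r
library(ggplot2)  # provides the diamonds data

# Minimum, quartiles, and maximum of price.
quantile(diamonds$price)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end,
# as computed by base R.
price_stats = boxplot.stats(diamonds$price)
price_stats$stats

# Number of points beyond the whiskers, which a boxplot draws individually.
length(price_stats$out)
```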

Density

A density plot is essentially a smoothed histogram.

ggplot(data = diamonds,aes(x=price)) + geom_density()
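
To see the connection, the two can be overlaid. This sketch rescales the histogram so that its total area is 1, matching the density curve. It assumes ggplot2 3.3.0 or later for after_stat(); the binwidth, the colors, and the name k are arbitrary choices:

```r
library(ggplot2)

# Put the histogram on the density scale (total area 1) so the two
# layers are comparable, then draw the density curve on top.
k = ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 500,
                 fill = "grey70") +
  geom_density(color = "blue")
k
```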

Play with the adjust parameter.

It multiplies the default smoothing bandwidth: values above 1 give a smoother curve, and values below 1 show more detail.

ggplot(data = diamonds,aes(x=price)) + geom_density(adjust = 5)

ggplot(data = diamonds,aes(x=price)) + geom_density(adjust = 1/5)
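
The effect of adjust on the bandwidth can be checked with base R's density(). This sketch assumes the default bandwidth rule is left unchanged; the names bw_default and bw_smooth are chosen here for illustration:

```r
library(ggplot2)  # provides the diamonds data

# Bandwidth chosen by the default rule.
bw_default = density(diamonds$price)$bw

# With adjust = 5 the bandwidth is five times larger, so the curve is smoother.
bw_smooth = density(diamonds$price, adjust = 5)$bw

bw_default
bw_smooth / bw_default
```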

Exercise

Create visualizations of the variable carat in the diamonds dataframe.

Answers

ggplot(diamonds,aes(x=carat)) + geom_histogram(binwidth=.1) + facet_wrap(~cut,ncol=1)

ggplot(diamonds,aes(x=carat,fill=cut)) + geom_density(adjust=2) + facet_wrap(~cut,ncol=1)

ggplot(diamonds,aes(y=carat,x=cut)) + geom_boxplot() + coord_flip()