Libraries

library(tidyverse)

## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

If we need to be explicit about where a function (or dataset) comes from, we’ll use the special form package::function(). For example, ggplot2::ggplot() tells you explicitly that we’re using the ggplot() function from the ggplot2 package.
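As a minimal sketch of the package::function() form, the following calls dplyr::filter() and ggplot2::ggplot() without attaching either package. The object names four_cyl and p are illustrative and do not come from the original notes.

```r
# Namespace-qualified calls work even when the packages are not attached
# with library(); they only need to be installed.
four_cyl <- dplyr::filter(ggplot2::mpg, cyl == 4)

# Writing dplyr::filter() makes it explicit that the masked
# stats::filter() is not the function being called.
p <- ggplot2::ggplot(four_cyl, ggplot2::aes(x = displ, y = hwy)) +
  ggplot2::geom_point()
```

This form is mostly useful in scripts where two attached packages export the same name, as in the conflicts listed above.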

The mpg data frame

Do cars with big engines use more fuel than cars with small engines? What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?

glimpse(mpg)

## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "...
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 qua...
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0,...
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1...
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6...
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)...
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4",...
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 1...
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 2...
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
## $ class        <chr> "compact", "compact", "compact", "compact", "comp...

summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
## 2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
## 3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
## 4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
## 5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
## 6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~

Creating a ggplot

The X-variable is called the explanatory or predictor variable, while the Y-variable is called the response or dependent variable.

#x-variable - displ (engine displacement, in litres)
#y-variable - hwy (highway miles per gallon)
#geom_point(mapping = NULL, data = NULL, stat = "identity", position = "identity", ..., na.rm = FALSE, show.legend = NA,inherit.aes = TRUE)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

The plot shows a negative relationship between engine size and fuel efficiency.

cor(mpg$displ,mpg$hwy)

## [1] -0.76602

Exercises

Run ggplot(data = mpg). What do you see?

ggplot(data = mpg)

This code creates an empty plot. The ggplot() function creates the background of the plot, but since no layers were specified with a geom function, nothing is drawn.

How many rows are in mpg? How many columns?

cat("\n","Number of Rows in the MPG Data Set","\n")

## 
##  Number of Rows in the MPG Data Set

nrow(mpg)

## [1] 234

cat("\n","Number of Columns in the MPG Data Set","\n")

## 
##  Number of Columns in the MPG Data Set

ncol(mpg)

## [1] 11

What does the drv variable describe?

The drv variable is a categorical variable that classifies cars by drivetrain: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.

Make a scatterplot of hwy vs cyl

ggplot(data = mpg, aes(x = cyl, y = hwy)) +
  geom_point()

What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

ggplot(data = mpg, aes(x = class, y = drv)) +
  geom_point()

A scatter plot is not a useful display of these variables since both drv and class are categorical variables. Since categorical variables typically take a small number of values, there are a limited number of unique combinations of (x, y) values that can be displayed.

#count(x, ..., wt = NULL, sort = FALSE)
count(mpg,class)

## # A tibble: 7 x 2
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62

count(mpg,drv)

## # A tibble: 3 x 2
##   drv       n
##   <chr> <int>
## 1 4       103
## 2 f       106
## 3 r        25
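One possible alternative, sketched here under the assumption that the goal is to see how many cars fall into each (class, drv) pair, is to count the combinations directly and let point size carry the count. The names combos and p are illustrative.

```r
library(ggplot2)
library(dplyr)

# Tabulate how many cars share each observed (class, drv) pair.
combos <- count(mpg, class, drv)

# geom_count() encodes the same counts as point size, so points that
# would otherwise sit on top of each other remain visible.
p <- ggplot(data = mpg, aes(x = class, y = drv)) +
  geom_count()
```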

Aesthetic mappings

An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the class variable to reveal the class of each car. ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.

#mapping color to the class variable
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

#mapping size to the class variable
ggplot(data = mpg, aes(x = displ, y = hwy, size = class)) +
  geom_point()

## Warning: Using size for a discrete variable is not advised.

#mapping alpha to the class variable
#alpha controls the transparency of the points
ggplot(data = mpg, aes(x = displ, y = hwy, alpha = class)) +
  geom_point()

## Warning: Using alpha for a discrete variable is not advised.

#mapping shape to the class variable
#suv class went unplotted b/c there is a maximum of 6 discrete values
ggplot(data = mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()

## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.

## Warning: Removed 62 rows containing missing values (geom_point).

#to set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes outside of aes()
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

Exercises

What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

The argument colour = “blue” is included within the mapping argument, and as such, it is treated as an aesthetic, which is a mapping between a variable and a value. In the expression, color=“blue”, “blue” is interpreted as a categorical variable which only takes a single value “blue”. If this is confusing, consider how colour = 1:234 and colour = 1 are interpreted by aes().
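The difference can be checked directly. This is a small sketch that uses layer_data() to look at the colours ggplot2 actually draws; the object names are illustrative.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as data, a one-level categorical
# variable, and the colour scale assigns it a colour from its palette.
mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = "blue"))

# Outside aes(): the colour is set directly on the layer.
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")

# layer_data() returns the values each layer is drawn with.
mapped_colour <- unique(layer_data(mapped)$colour)
fixed_colour <- unique(layer_data(fixed)$colour)  # "blue"
```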

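Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

Categorical: manufacturer, model, trans, drv, fl, class

Continuous: displ, year, cyl, cty, hwy

When you run mpg, the tibble prints each column’s type under its name: <chr> marks the categorical (character) variables, while <int> and <dbl> mark the numeric ones.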

Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

ggplot(data = mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

Instead of using discrete colors, the continuous variable uses a scale that varies from a light to dark blue color.

ggplot(data = mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point()

When mapped to size, the sizes of the points vary continuously with the value of cty. When mapped to shape, ggplot2 raises an error stating that a continuous variable cannot be mapped to shape.
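The shape case can be reproduced without stopping a script. The sketch below assumes the error is raised when the plot is built rather than when it is created, and catches it with tryCatch(); the names p and outcome are illustrative.

```r
library(ggplot2)

# Creating the plot object succeeds; the scales are only set up when
# the plot is built or printed.
p <- ggplot(data = mpg, aes(x = displ, y = hwy, shape = cty)) +
  geom_point()

# Build the plot explicitly and record whether it failed.
outcome <- tryCatch(
  {
    ggplot_build(p)
    "built"
  },
  error = function(e) "error"
)
```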

What happens if you map the same variable to multiple aesthetics?

ggplot(data = mpg, aes(x = displ, y = hwy, color = hwy)) +
  geom_point()

The code works and produces a plot, even if it is a bad one. Mapping a single variable to multiple aesthetics is redundant. Because it is redundant information, in most cases avoid mapping a single variable to multiple aesthetics.

What does the stroke aesthetic do? What shapes does it work with? (Hint: use ? geom_point)

Stroke changes the size of the border for shapes (21-25). These are filled shapes in which the color and size of the border can differ from that of the filled interior of the shape.

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(shape = 21, color = "black", fill = "white",
             size = 5, stroke = 2)

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?

ggplot(data = mpg, aes(x = displ, y = hwy, color = displ < 5)) +
  geom_point()

Aesthetics can also be mapped to expressions like displ < 5. The ggplot() function behaves as if a temporary variable were added to the data with values equal to the result of the expression. In this case, the result of displ < 5 is a logical variable which takes values of TRUE or FALSE.
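A short sketch of that equivalence: the first plot maps colour to the expression, the second adds the column explicitly with dplyr::mutate(). The column name small_engine is made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Expression evaluated inside aes().
p_expr <- ggplot(data = mpg, aes(x = displ, y = hwy, colour = displ < 5)) +
  geom_point()

# Same logical values stored as an explicit (hypothetical) column.
mpg_flag <- mutate(mpg, small_engine = displ < 5)
p_flag <- ggplot(data = mpg_flag, aes(x = displ, y = hwy, colour = small_engine)) +
  geom_point()

# Both plots draw every point with the same colour.
same_colours <- identical(layer_data(p_expr)$colour, layer_data(p_flag)$colour)
```

The only visible difference between the two plots is the legend title, which is taken from the expression or the column name.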

Facets

Facets are subplots that each display one subset of the data. The variable that you pass to facet_wrap should be discrete.

#facet_wrap wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid() because most displays are roughly rectangular.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

To facet your plot on the combination of two variables, add facet_grid() to your plot call.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

Exercises

What happens if you facet on a continuous variable?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cty)

The continuous variable is converted to a categorical variable, and the plot contains a facet for each distinct value.
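The panel count can be confirmed from the built plot. The sketch below also shows one way to keep the number of panels manageable by binning first with cut_width(); the bin width of 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cty)

# One panel per distinct value of cty.
n_panels <- length(unique(layer_data(p)$PANEL))
n_values <- length(unique(mpg$cty))

# Binning the variable first gives one panel per bin instead.
p_binned <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_width(cty, 5))
n_binned <- length(unique(layer_data(p_binned)$PANEL))
```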

What do the empty cells in the plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cty)) +
  facet_grid(drv ~ cyl)

The empty cells (facets) in this plot are combinations of drv and cyl that have no observations. These are the same locations in the scatter plot of drv and cyl that have no points.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = cyl))

What plots does the following code make? What does . do?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

The symbol . ignores that dimension when faceting. For example, drv ~ . facets by values of drv on the y-axis (one row per value), while . ~ cyl facets by values of cyl on the x-axis (one column per value).

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

Take the first faceted plot in this section:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~class, nrow = 2)

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Advantages of encoding class with facets instead of color include the ability to encode more distinct categories. For me, it is difficult to distinguish between the colors of “midsize” and “minivan”.

Given human visual perception, the max number of colors to use when encoding unordered categorical (qualitative) data is nine, and in practice, often much less than that. Displaying observations from different categories on different scales makes it difficult to directly compare values of observations across categories. However, it can make it easier to compare the shape of the relationship between the x and y variables across categories.

Disadvantages of encoding the class variable with facets instead of the color aesthetic include the difficulty of comparing the values of observations between categories since the observations for each category are on different plots. Using the same x- and y-scales for all facets makes it easier to compare values of observations across categories, but it is still more difficult than if they had been displayed on the same plot. Since encoding class within color also places all points on the same plot, it visualizes the unconditional relationship between the x and y variables; with facets, the unconditional relationship is no longer visualized since the points are spread across multiple plots.

The benefits of encoding a variable through faceting rather than color grow as either the number of points or the number of categories increases. In the former case, as the number of points increases, there is likely to be more overlap.

It is difficult to handle overlapping points with color. Jittering will still work with color, but only if there are few points and the classes do not overlap much; otherwise, the colors of areas will no longer be distinct, and it will be hard to pick out the patterns of different categories visually. Transparency (alpha) does not work well with colors since the mixing of overlapping transparent colors will no longer represent the colors of the categories. Binning methods already use color to encode density, so color cannot also be used to encode categories.

As noted before, as the number of categories increases, the difference between colors decreases, to the point that the color of categories will no longer be visually distinct.

Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

The arguments nrow and ncol determine the number of rows and columns to use when laying out the facets. They are necessary since facet_wrap() only facets on one variable.

The nrow and ncol arguments are unnecessary for facet_grid() since the number of unique values of the variables specified in the function determines the number of rows and columns.

When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

There will be more space for columns if the plot is laid out horizontally (landscape).

Geometric Objects

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses.

#scatterplot
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Here geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drivetrain.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to drv.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(mapping = aes(linetype = drv))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

To display multiple geoms in the same plot, add multiple geom functions to ggplot().

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can avoid this type of repetition by passing a set of mappings to ggplot()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can use the same idea to specify different data for each layer. Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercises

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

line chart: geom_line()

box plot: geom_boxplot()

histogram: geom_histogram()

area chart: geom_area()

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

This code produces a scatter plot with displ on the x-axis, hwy on the y-axis, and the points colored by drv. There will be a smooth line, without standard errors, fit through each drv group.

#se display confidence interval around smooth? (TRUE by default, see level to control)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What does show.legend = FALSE do? What happens if you remove it?

The layer argument show.legend = FALSE hides the legend for that layer. If you remove it, the legend is drawn.

#show.legend = FALSE
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#show.legend = TRUE
#the default is show.legend = NA, which draws a legend whenever an aesthetic is mapped
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = TRUE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Why do you think I used it earlier in the chapter?

In the chapter, the legend is suppressed because with three plots, adding a legend to only the last plot would make the sizes of plots different. Different sized plots would make it more difficult to see how arguments change the appearance of the plots. The purpose of those plots is to show the difference between no groups, using a group aesthetic, and using a color aesthetic, which creates implicit groups. In that example, the legend isn’t necessary since looking up the values associated with each color isn’t necessary to make that point.

What does the se argument to geom_smooth() do?

It adds a confidence band, based on the standard error, around each line.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

By default se = TRUE

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

No. Both geom_point() and geom_smooth() use the same data and mappings in each plot. In the first plot they inherit them from the ggplot() object, so they don’t need to be specified again; in the second, the same data and mappings are repeated in every layer.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
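One way to check that the two versions are the same is to compare the data each layer computes. This sketch assumes layer 1 holds the points and layer 2 the smooth, matching the order in which the layers were added; the object names are illustrative.

```r
library(ggplot2)

# Data and mappings declared once, globally.
global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# The same data and mappings repeated in each layer.
local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare what each layer would draw; suppressMessages() hides the
# geom_smooth() method message emitted while building.
same_points <- isTRUE(all.equal(
  suppressMessages(layer_data(global, 1)),
  suppressMessages(layer_data(local, 1))
))
same_smooth <- isTRUE(all.equal(
  suppressMessages(layer_data(global, 2)),
  suppressMessages(layer_data(local, 2))
))
```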

Recreate the R code necessary to generate the following graphs (code in book)

ggplot(data = mpg, mapping=aes(x = displ, hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, aes(x = displ, hwy)) +
  geom_smooth(mapping = aes(group = drv), se = FALSE) +
  geom_point()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, aes(x = displ, hwy, color = drv)) +
  geom_smooth(se = FALSE) +
  geom_point()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, aes(x = displ, hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Statistical Transformations

glimpse(diamonds)

## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

summary(diamonds)

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
##

head(diamonds)

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

#cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

You can recreate the previous plot using stat_count() instead of geom_bar()

ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

demo <- tribble(
  ~cut, ~freq,
  "Fair", 1610,
  "Good", 4906,
  "Very Good", 12082,
  "Premium", 13791,
  "Ideal", 21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

You might want to display a bar chart of proportion, rather than count:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing

ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut,y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

Exercises

What is the default geom associated with stat_summary() ? How could you rewrite the previous plot to use that geom function instead of the stat function?

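The default geom for stat_summary() is geom_pointrange(). The default stat for geom_pointrange() is “identity”, but we can add the argument stat = “summary” to use stat_summary() instead of stat_identity(). To reproduce the previous plot exactly, the min, max, and median summary functions must also be supplied; without them, stat_summary() falls back to mean_se(), as the message below shows.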

ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary"
  )

## No summary function supplied, defaulting to `mean_se()
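To match the earlier stat_summary() plot, the summary functions can be passed through geom_pointrange() as well. The sketch below assumes that ggplot2 3.3.0 renamed fun.y, fun.ymin, and fun.ymax to fun, fun.min, and fun.max, and picks the argument names based on the installed version; summary_args and pointrange_layer are illustrative names.

```r
library(ggplot2)

# Assumption: from ggplot2 3.3.0 the arguments are fun/fun.min/fun.max;
# earlier versions (such as the 3.1.0 used here) use fun.y/fun.ymin/fun.ymax.
summary_args <- if (utils::packageVersion("ggplot2") >= "3.3.0") {
  list(fun = median, fun.min = min, fun.max = max)
} else {
  list(fun.y = median, fun.ymin = min, fun.ymax = max)
}

# Same summary as the stat_summary() call, written from the geom side.
pointrange_layer <- do.call(
  geom_pointrange,
  c(list(mapping = aes(x = cut, y = depth), stat = "summary"), summary_args)
)

p <- ggplot(data = diamonds) + pointrange_layer
```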

What does geom_col() do? How is it different to geom_bar() ?

The geom_col() function has a different default stat than geom_bar(). The default stat of geom_col() is stat_identity(), which leaves the data as is. The geom_col() function expects that the data contains x values and y values which represent the bar height.

The default stat of geom_bar() is stat_count(). The geom_bar() function only expects an x variable. The stat, stat_count(), preprocesses the input data by counting the number of observations for each value of x. The y aesthetic uses the values of these counts.
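The relationship between the two geoms can be sketched by counting the rows by hand and giving the result to geom_col(). The name cut_counts is illustrative.

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows for each value of x itself.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# geom_col() expects the bar heights to be in the data already,
# so the counting is done beforehand.
cut_counts <- count(diamonds, cut)
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))

# Both layers end up with the same bar heights.
bar_heights <- as.numeric(layer_data(p_bar)$y)
col_heights <- as.numeric(layer_data(p_col)$y)
```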

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

see attached document

What variables does stat_smooth() compute? What parameters control its behaviour?

y: predicted value

ymin: lower value of the confidence interval

ymax: upper value of the confidence interval

se: standard error

The “Computed Variables” section of the stat_smooth() documentation contains these variables.

The parameters that control the behavior of stat_smooth() include

method: the smoothing method used to fit the data, for example “lm”, “glm”, “gam”, or “loess”

formula: the formula passed to the smoothing method, for example y ~ x

se: whether to compute and display the confidence interval

na.rm: whether missing values are removed silently; with the default, FALSE, they are removed with a warning

In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?

If group = 1 is not included, then all the bars in the plot will have the same height, a height of 1. The function geom_bar() assumes that the groups are equal to the x values since the stat computes the counts within the group.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))

The following code produces bar charts whose heights vary. In the second chart, the proportions are computed within each color group:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop.., group = color))
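The effect of group = 1 can also be seen in the numbers the stat computes. This sketch reads the prop column from layer_data() rather than plotting it; the object names are illustrative.

```r
library(ggplot2)

# Default grouping: each value of cut is its own group, so every
# proportion is computed relative to itself.
no_group <- layer_data(
  ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
)

# group = 1: all rows belong to one group, so proportions are
# relative to the whole dataset.
one_group <- layer_data(
  ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, group = 1))
)

no_group$prop        # every bar has proportion 1
sum(one_group$prop)  # proportions add up to 1 across the bars
```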

Position Adjustments

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: “identity”, “dodge” or “fill”.

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")

ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
  geom_bar(fill = NA, position = "identity")

position = “fill” works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
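As a quick check of the "same height" claim, the top of each stack can be read from the built layer. This is a sketch; filled and stack_tops are illustrative names.

```r
library(ggplot2)

filled <- layer_data(
  ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
)

# ymax is the upper edge of each rectangle; the largest ymax within
# each x position is the top of that stack.
stack_tops <- tapply(filled$ymax, as.integer(filled$x), max)
```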

position = “dodge” places overlapping objects directly beside one another. This makes it easier to compare individual values.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Overplotting makes it hard to see where the mass of the data is. Values are rounded so the points appear on a grid and many points overlap each other. You can avoid this gridding by setting the position adjustment to “jitter”; position = “jitter” adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

Exercises

What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

There is overplotting because there are multiple observations for each combination of cty and hwy values.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = "jitter")

What parameters to geom_jitter() control the amount of jittering?

width: controls the amount of horizontal displacement

height: controls the amount of vertical displacement

Jittering shows the locations where there are more observations.

Note that the height and width arguments are in the units of the data. Thus height = 1 (width = 1) corresponds to different relative amounts of jittering depending on the scale of the y (x) variable. By default, height and width are each 40% of the resolution() of the data, which is the smallest non-zero distance between adjacent values of a variable. Since the jitter moves points in both positive and negative directions, the jittered values span 80% of the implied bins. When x and y are discrete variables, their resolutions are both equal to 1, so height = 0.4 and width = 0.4.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 20)

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(height = 0)

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(height = 15)
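The roles of the two arguments can be verified on the jittered coordinates. The sketch below uses position_jitter() with a fixed seed (the seed value 42 is arbitrary) so that the result is reproducible; d, y_unchanged, and max_shift are illustrative names.

```r
library(ggplot2)

# Jitter horizontally by at most 0.4 and not at all vertically.
p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.4, height = 0, seed = 42))

d <- layer_data(p)

# height = 0 leaves the y values exactly as they were.
y_unchanged <- identical(sort(as.numeric(d$y)), sort(as.numeric(mpg$hwy)))

# cty is a whole number, so rounding recovers the original x value and
# the difference is the horizontal shift applied to each point.
max_shift <- max(abs(d$x - round(d$x)))
```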

Compare and contrast geom_jitter() with geom_count().

The geom geom_jitter() adds random variation to the locations of the points in the graph. In other words, it “jitters” the locations of points slightly. This method reduces overplotting since two points with the same location are unlikely to have the same random variation.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()

The geom geom_count() sizes the points relative to the number of observations. Combinations of (x, y) values with more observations will be larger than those with fewer observations.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
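The counts that geom_count() maps to point size can be inspected directly. This is a small sketch; the name counts is illustrative.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

# One row per distinct (cty, hwy) pair; n is the number of cars at
# that location.
counts <- layer_data(p)[, c("x", "y", "n")]
```

Unlike jittering, this keeps every point at its true location and moves the overplotting information into the size legend.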

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_jitter()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_count()

As that example shows, unfortunately, there is no universal solution to overplotting. The costs and benefits of different approaches will depend on the structure of the data and the goal of the data scientist.

What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

The default position for geom_boxplot() is “dodge2”, which is a shortcut for position_dodge2(). This position adjustment does not change the vertical position of a geom but moves the geom horizontally to avoid overlapping other geoms. See the documentation for position_dodge2() for additional discussion on how it works.

ggplot(data = mpg, aes(x = drv, y = hwy, color = class)) +
  geom_boxplot()

If position_identity() is used instead, the boxplots for the different classes are drawn on top of one another:

ggplot(data = mpg, aes(x = drv, y = hwy, color = class)) +
  geom_boxplot(position = "identity")
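One way to confirm the default is to inspect the position object attached to a freshly created boxplot layer. The sketch below assumes the layer exposes its position as default_layer$position, which holds for the ggproto layers that ggplot2 creates. The names default_layer and p_dodge2 are illustrative.

```r
library(ggplot2)

# The position object attached to a default boxplot layer
default_layer <- geom_boxplot()
class(default_layer$position)

# Writing the default out explicitly should give the same dodged layout
p_dodge2 <- ggplot(data = mpg, aes(x = drv, y = hwy, color = class)) +
  geom_boxplot(position = position_dodge2())
p_dodge2
```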

Coordinate Systems

coord_flip() switches the x and y axes. This is useful, for example, if you want horizontal boxplots. It’s also useful for long labels, which are hard to fit on the x-axis without overlapping.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()

coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2.

coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()

bar + coord_polar()

Exercises

Turn a stacked bar chart into a pie chart using coord_polar().

A pie chart is a stacked bar chart with the addition of polar coordinates. Take this stacked bar chart with a single category.

ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar()

Now add coord_polar(theta = "y") to create a pie chart.

ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

The argument theta = "y" maps y to the angle of each section. If coord_polar() is specified without theta = "y", the default theta = "x" maps x to the angle and y to the radius, and the resulting plot is called a bulls-eye chart.

ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1) +
  coord_polar()
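With theta = "y", the angle of each pie slice is proportional to the count in that slice. To read the slices as numbers, the shares can be computed directly. This is a small sketch with dplyr, and drv_share and share are illustrative names.

```r
library(ggplot2)
library(dplyr)

# Count of cars per drive type and each type's share of the whole pie
drv_share <- mpg %>%
  count(drv) %>%
  mutate(share = n / sum(n))

drv_share
```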

What does labs() do? Read the documentation.

The labs() function adds axis titles, plot titles, and a caption to the plot. The arguments to labs() are optional, so you can add as many or as few of these as are needed. The labs() function is not the only function that adds titles to plots. The xlab(), ylab(), and x- and y-scale functions can add axis titles. The ggtitle() function adds plot titles.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    y = "Highway MPG",
    x = "Class",
    title = "Highway MPG by Car Class",
    subtitle = "1999-2008",
    caption = "Source: http://fueleconomy.gov"
  )
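For comparison, here is a sketch of a similar plot built with the helper functions mentioned above instead of labs(). It assumes the x aesthetic (class) should be labelled "Class". The caption is left out because xlab(), ylab(), and ggtitle() do not set one, and p_helpers is an illustrative name.

```r
library(ggplot2)

# Same plot, labelled with xlab(), ylab(), and ggtitle() instead of labs()
p_helpers <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Class") +
  ylab("Highway MPG") +
  ggtitle("Highway MPG by Car Class", subtitle = "1999-2008")
p_helpers
```

Note that xlab() still refers to the x aesthetic even though coord_flip() draws it on the vertical axis.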

What’s the difference between coord_quickmap() and coord_map()?

The coord_map() function uses map projections to project the three-dimensional Earth onto a two-dimensional plane. By default, coord_map() uses the Mercator projection. This projection is applied to all the geoms in the plot. The coord_quickmap() function uses an approximate but faster map projection. This approximation ignores the curvature of Earth and adjusts the map for the latitude/longitude ratio. The coord_quickmap() projection is faster than coord_map() both because the projection is computationally easier, and because, unlike coord_map(), the coordinates of the individual geoms do not need to be transformed.
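The idea behind a latitude/longitude ratio adjustment can be sketched with a little arithmetic: away from the equator, one degree of longitude covers less ground than one degree of latitude, by a factor of roughly the cosine of the latitude. The helper lon_lat_aspect() below is hypothetical and not part of ggplot2. It only illustrates the approximation, and the exact calculation inside coord_quickmap() may differ in detail.

```r
# Hypothetical helper (not a ggplot2 function): approximate y/x aspect ratio
# that makes one degree of latitude and one degree of longitude cover
# comparable ground distances near a given latitude
lon_lat_aspect <- function(lat_degrees) {
  1 / cos(lat_degrees * pi / 180)
}

lon_lat_aspect(0)   # at the equator
lon_lat_aspect(60)  # at 60 degrees north or south
```

A single ratio like this is cheap to compute, which is why the approach works best for small regions where the latitude does not vary much.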

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

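With no arguments, geom_abline() draws a line with intercept 0 and slope 1, which is the line on which city and highway MPG are equal. The function coord_fixed() ensures that this line is drawn at a 45-degree angle by giving one unit on the x-axis the same length as one unit on the y-axis. A 45-degree line makes it easy to compare the observed mileages to the case in which city and highway MPG are equal: points above the line are cars whose highway mileage is better than their city mileage.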

p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline()
p + coord_fixed()

If we didn’t include coord_fixed(), then the line would no longer have an angle of 45 degrees.
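The comparison against the reference line can also be made numerically. The sketch below computes the share of cars whose highway mileage exceeds their city mileage, then redraws the plot with the geom_abline() and coord_fixed() defaults written out. The names share_above and p_explicit are illustrative.

```r
library(ggplot2)

# Share of cars with hwy > cty, i.e. the share of points above the line
share_above <- mean(mpg$hwy > mpg$cty)
share_above

# Same plot with the default intercept, slope, and aspect ratio made explicit
p_explicit <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  coord_fixed(ratio = 1)
p_explicit
```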

R for Data Science: Data Visualization

Brian Liles

February 6, 2019
