library(tidyverse)
## ─ Attaching packages ──────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ─ Conflicts ───────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Creating a ggplot

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy))

With ggplot2, you begin a plot with the function ggplot(). It creates a coordinate system that you can add layers to.

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.
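One way to see what the mapping does is to look at the data frame ggplot2 computes for the layer. The sketch below uses layer_data() for that. The object names p and pts are chosen only for this illustration.

```r
library(ggplot2)

# Build the scatterplot from the text without drawing it.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# layer_data() returns the data ggplot2 computed for layer 1.
# `pts` is a hypothetical name used only in this sketch.
pts <- layer_data(p, 1)

# The x and y columns hold the values of displ and hwy.
head(pts[, c("x", "y")])
```

The x and y columns of pts should line up row by row with mpg$displ and mpg$hwy, which is the mapping that aes() describes.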

Aesthetic mappings

An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy,
                           color = class))

To map an aesthetic to a variable, associate the name of the aesthetic with the name of the variable inside aes().

Once you map an aesthetic, ggplot2 constructs a legend that explains the mapping between levels and values. For the x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend: it explains the mapping between locations and values.
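The mapping between levels and values can also be inspected directly. The following sketch pairs each value of class with the colour that ggplot2 assigned to it. The names pts and key are made up for this example.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Computed data for the point layer; `colour` holds the assigned colours.
pts <- layer_data(p, 1)

# One row per distinct (class, colour) pair.
# `key` is a hypothetical name; it plays the role of the legend.
key <- unique(data.frame(class = mpg$class, colour = pts$colour))
key
```

Each level of class should appear exactly once in key, with its own colour, which is the information the legend displays.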

Facets

Another way to add variables to a plot, particularly useful for categorical variables, is to split it into facets: subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap(). The first argument is a formula, written as ~ followed by a variable name. The variable that you pass to facet_wrap() should be discrete.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy)) +
  facet_wrap(~ class, nrow = 2)

To facet your plot on the combination of two variables, add facet_grid() to your plot call.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy)) +
  facet_grid(drv ~ cyl)   

If you prefer not to facet in the rows or columns dimension, use a . instead of a variable name.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy)) +
  facet_grid(. ~ cyl)
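To check that each facet really holds one subset of the data, you can count the panels that receive points. This is a rough sketch: the names p_wrap, p_grid, panels_wrap and panels_grid are chosen here for illustration.

```r
library(ggplot2)

p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

# PANEL records which facet each point is drawn in.
panels_wrap <- length(unique(layer_data(p_wrap, 1)$PANEL))
panels_grid <- length(unique(layer_data(p_grid, 1)$PANEL))

panels_wrap
panels_grid
```

With facet_wrap() there should be one panel per value of class. With facet_grid() only the drv and cyl combinations that occur in mpg receive points, so some cells of the grid stay empty.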

Geometric objects

A geom is the geometric object that a plot uses to represent data. You can use different geoms to plot the same data.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, 
                           y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ,
                            y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Every geom function in ggplot2 takes a mapping argument. However, not every aesthetic works with every geom. You can set the shape of a point, but you can’t set the “shape” of a line. On the other hand, you can set the linetype of a line.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ,
                            y = hwy,
                            linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Many geoms use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 also groups the data automatically whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this automatic grouping because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ,
                            y = hwy,
                            group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, 
                  y = hwy, 
                  color = drv),
    show.legend = FALSE
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
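The two plots above split the data in the same way. A small sketch can confirm this by counting the groups in the computed layer data. The method and formula arguments are spelled out here only to silence the message, and the object names are hypothetical.

```r
library(ggplot2)

# Explicit grouping: no legend, no visual distinction.
p_group <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv),
              method = "loess", formula = y ~ x)

# Implicit grouping: mapping color to a discrete variable also groups.
p_color <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv),
              method = "loess", formula = y ~ x,
              show.legend = FALSE)

smooth_group <- layer_data(p_group, 1)
smooth_color <- layer_data(p_color, 1)

groups_explicit <- length(unique(smooth_group$group))
groups_implicit <- length(unique(smooth_color$group))
```

Both counts should equal the number of distinct drv values. Only the second plot varies the line colour, so only there can the curves be told apart.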

To display multiple geoms in the same plot, add multiple geom functions to ggplot().

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy)) +
  geom_smooth(mapping = aes(x = displ,
                            y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This, however, introduces some duplication in our code. Imagine you want to change the y-axis to display cty instead of hwy. You would need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to ggplot(). ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code.

ggplot(data = mpg,
       mapping = aes(x = displ,
                     y = hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
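The claim that both versions produce the same plot can be checked by comparing the computed positions layer by layer. This is only a sketch, with hypothetical names p_local and p_global, and with method and formula written out to avoid the message.

```r
library(ggplot2)

# Mappings repeated in every layer.
p_local <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy),
              method = "loess", formula = y ~ x)

# Mappings declared once, globally.
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Compare the x and y positions of layer 1 (points) and layer 2 (smooth).
same_points <- isTRUE(all.equal(layer_data(p_local, 1)[c("x", "y")],
                                layer_data(p_global, 1)[c("x", "y")]))
same_smooth <- isTRUE(all.equal(layer_data(p_local, 2)[c("x", "y")],
                                layer_data(p_global, 2)[c("x", "y")]))
```

With the global version, switching the y-axis to cty means editing a single aes() call.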

If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only.

ggplot(data = mpg,
       mapping = aes(x = displ,
                     y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

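You can use the same idea to specify different data for each layer. Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.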

ggplot(data = mpg, 
       mapping = aes(x = displ, 
                     y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, 
                            class == "subcompact"), 
              se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
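A quick way to confirm that each layer uses its own data is to inspect what each layer computed. The sketch below makes one simplifying assumption: it fits the smooth with method = "lm" instead of the default loess, so that the check stays simple. The names subcompact, n_point_rows and smooth_range are hypothetical.

```r
library(ggplot2)

# Subset used only by the smooth layer, as in the text.
subcompact <- mpg[mpg$class == "subcompact", ]

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  # Assumption: method = "lm" replaces the default loess smoother here.
  geom_smooth(data = subcompact, method = "lm", formula = y ~ x, se = FALSE)

# The point layer still sees every row of mpg.
n_point_rows <- nrow(layer_data(p, 1))

# The fitted line only spans the displ values of the subset.
smooth_range <- range(as.numeric(layer_data(p, 2)$x))
```

The point layer should report one row per car in mpg, while the smooth layer should only cover the displ range of the subcompact cars.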

Statistical transformations

Bar charts seem simple, but they are interesting because they reveal something subtle about plots.

The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

On the y-axis, it displays count, but count is not a variable in diamonds. Where does count come from?

Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot.

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The figure below describes how this process works with geom_bar().

geom_bar() uses stat_count() by default. You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar().

ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

This works because every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation.
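To see where count comes from, you can compare the values that stat_count() computes with a count done by hand. A minimal sketch, assuming dplyr is available for count(); the names bars and manual are chosen only for this example.

```r
library(ggplot2)
library(dplyr)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Data computed by stat_count(): one row per bar.
bars <- layer_data(p, 1)

# The same table computed by hand.
manual <- count(diamonds, cut)

bars[, c("x", "count")]
manual
```

The count column of bars should match the n column of manual, one row per level of cut.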

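There are cases where you might need to use a stat explicitly.

You might want to override the default stat. The code below changes the stat of geom_bar() from count (the default) to identity, which maps the height of the bars to the raw values of a y variable.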

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, 
                         y = freq), 
           stat = "identity")

You might want to display a bar chart of proportion, rather than count.

# group = 1 makes ..prop.. a share of the whole dataset rather than of each bar
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         y = ..prop..,
                         group = 1))
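The role of group = 1 becomes clearer when you look at the prop variable that stat_count() computes, with and without it. The sketch below inspects that column directly instead of mapping it to y. The object names are hypothetical.

```r
library(ggplot2)

# Without group = 1, every level of cut forms its own group.
p_default <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# With group = 1, all rows belong to one group.
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, group = 1))

prop_default <- layer_data(p_default, 1)$prop
prop_grouped <- layer_data(p_grouped, 1)$prop

# Proportions computed by hand, in the order of the factor levels.
manual <- as.numeric(table(diamonds$cut)) / nrow(diamonds)
```

Without group = 1 each bar is compared only with itself, so every proportion is 1. With group = 1 the proportions are shares of the whole dataset and should agree with manual.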

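With the fill aesthetic, ..prop.. no longer gives the desired result, because proportions are computed within each group. Instead, the heights of the bars need to be normalized by the total count.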

ggplot(data = diamonds) +
  geom_bar(aes(x = cut,
               y = ..count.. / sum(..count..),
               fill = color))

Position adjustments

You can color a bar chart using either the color aesthetic or, more usefully, fill.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut,
                         color = cut))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         fill = cut))

If you map the fill aesthetic to another variable, like clarity, the bars are automatically stacked.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         fill = clarity))

position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         fill = clarity),
           position = "fill")

position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         fill = clarity),
           position = "dodge")
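The difference between stacking and position = "fill" shows up in the top edge of each stack. The following sketch, with hypothetical names base, top_stack and top_fill, reads that edge from the computed layer data.

```r
library(ggplot2)

# Shared mappings; only the position adjustment changes below.
base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

bars_stack <- layer_data(base + geom_bar(), 1)
bars_fill  <- layer_data(base + geom_bar(position = "fill"), 1)

# Highest ymax per value of x is the top of each stack.
top_stack <- tapply(bars_stack$ymax, as.integer(bars_stack$x), max)
top_fill  <- tapply(bars_fill$ymax, as.integer(bars_fill$x), max)

top_stack
top_fill
```

With the default stacking, each stack should reach the total count for its cut. With position = "fill", every stack should reach 1.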

There’s one other type of adjustment that’s not useful for bar charts, but it can be very useful for scatterplots.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy))

Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset? The values of hwy and displ are rounded, so the points appear on a grid and many points overlap each other. This problem is known as overplotting.

You can avoid this problem by setting the position adjustment to "jitter". position = "jitter" adds a small amount of random noise to each point, which spreads the points out because no two points are likely to receive the same amount of random noise.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy),
             position = "jitter")

Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().
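The amount of overplotting can be measured by counting distinct point positions before and after jittering. This is a rough sketch. The seed is set only to make the random noise reproducible, and all object names are hypothetical.

```r
library(ggplot2)

set.seed(1)  # jitter is random; fix the seed for a repeatable result

n_obs <- nrow(mpg)

# Distinct (displ, hwy) positions in the raw data.
n_positions <- nrow(unique(mpg[, c("displ", "hwy")]))

p_jitter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

# Distinct positions after the jitter has been applied.
jittered <- layer_data(p_jitter, 1)
n_jittered <- nrow(unique(jittered[, c("x", "y")]))

c(observations = n_obs, raw = n_positions, jittered = n_jittered)
```

The raw data should have fewer distinct positions than observations, and jittering should separate points that previously overlapped.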

Compare and contrast geom_jitter() with geom_count().

geom_jitter() adds random variation to the locations of the points in the graph. This method reduces overplotting, since two points with the same location are unlikely to have the same random variation.

ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy)) +
  geom_jitter()

geom_count() sizes the points relative to the number of observations. Combinations of (x, y) values with more observations will be larger than those with fewer observations.

ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy)) + 
  geom_count()
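The number behind each point size is available in the computed layer data as the variable n. A minimal sketch, with the hypothetical names p_count and sizes:

```r
library(ggplot2)

p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

# One row per distinct (cty, hwy) position; n is the number of
# observations at that position.
sizes <- layer_data(p_count, 1)

head(sizes[, c("x", "y", "n")])
```

Unlike jittering, this keeps every point at its true location: the n values should add up to the number of rows in mpg.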

The default position for geom_boxplot() is "dodge2", which is a shortcut for position_dodge2(). This position adjustment does not change the vertical position of a geom, but moves the geom horizontally to avoid overlapping other geoms.

ggplot(data = mpg, 
       aes(x = drv, 
           y = hwy, 
           color = class)) +
  geom_boxplot()

If position_identity() is used, the boxplots overlap.

ggplot(data = mpg,
       aes(x = drv,
           y = hwy,
           color = class)) +
  geom_boxplot(position = "identity")
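The horizontal shift can be read from the left edge (xmin) of each box in the computed layer data. The sketch below compares the two position adjustments. The names base, box_dodge, box_ident, left_dodge and left_ident are made up for this illustration.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class))

# One row per box, for each position adjustment.
box_dodge <- layer_data(base + geom_boxplot(), 1)
box_ident <- layer_data(base + geom_boxplot(position = "identity"), 1)

# Distinct left edges of the boxes.
left_dodge <- unique(as.numeric(box_dodge$xmin))
left_ident <- unique(as.numeric(box_ident$xmin))

length(left_dodge)
length(left_ident)
```

With "identity", all boxes for the same drv should share one left edge, which is why they overlap. With the default "dodge2", each box should get its own.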

Coordinate systems

coord_flip() switches the x and y axes. This is useful if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.

ggplot(data = mpg, 
       mapping = aes(x = class, 
                     y = hwy)) + 
  geom_boxplot() +
  coord_flip()

coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_polar()

coord_fixed() ensures that the line produced by geom_abline() is drawn at a 45-degree angle, because it forces one unit on the x-axis to be the same length as one unit on the y-axis.

p <- ggplot(data = mpg, 
            mapping = aes(x = cty, 
                          y = hwy)) +
  geom_point() +
  geom_abline()

p + coord_fixed()