library(tidyverse)
## ─ Attaching packages ──────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ─ Conflicts ───────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy))
With ggplot2, you begin a plot with the function ggplot(). It creates a coordinate system that you can add layers to.
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.
An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy,
                           color = class))
To map an aesthetic to a variable, associate the name of the aesthetic with the name of the variable inside aes().
Once you map an aesthetic, ggplot2 constructs a legend that explains the mapping between levels and values. For the x and y aesthetics, ggplot2 does not create a legend; instead it creates an axis line with tick marks and a label. The axis line acts as a legend: it explains the mapping between locations and values.
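The same pattern applies to the other aesthetics mentioned above, not only color. As a small sketch, the code below maps the drv variable (chosen here only for illustration, because it has three levels) to the shape of the points; the object name p_shape is also just an illustrative choice.

```r
library(ggplot2)

# Map drive type (drv) to point shape instead of color.
# ggplot2 should pick one shape per level and add a matching legend.
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy,
                           shape = drv))
p_shape
```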
Another way to add variables to a plot, particularly useful for categorical variables, is to split your plot into facets: subplots that each display one subset of the data.
To facet your plot by a single variable, use facet_wrap(). The variable that you pass to facet_wrap() should be discrete.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy)) +
  facet_wrap(~ class, nrow = 2)
To facet your plot on the combination of two variables, add facet_grid() to your plot call.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy)) +
  facet_grid(drv ~ cyl)
If you prefer not to facet in the rows or columns dimension, use a . instead of a variable name.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy)) +
  facet_grid(. ~ cyl)
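The mirror image of the plot above puts the . on the right-hand side of the formula, so the plot is faceted in the rows only. This is a minimal sketch; the choice of drv as the row variable and the name p_rows are illustrative.

```r
library(ggplot2)

# Facet by drive type in the rows only; the "." leaves the columns unfaceted.
p_rows <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy)) +
  facet_grid(drv ~ .)
p_rows
```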
A geom is the geometric object that a plot uses to represent data. You can use different geoms to plot the same data.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy))
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ,
                            y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Every geom function in ggplot2 takes a mapping argument. However, not every aesthetic works with every geom. You can set the shape of a point, but you can’t set the “shape” of a line. On the other hand, you can set the linetype of a line.
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ,
                            y = hwy,
                            linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Many geoms use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 also groups the data automatically whenever you map an aesthetic such as linetype or color to a discrete variable. It is convenient to rely on this automatic grouping, because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ,
                            y = hwy,
                            group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ,
                  y = hwy,
                  color = drv),
    show.legend = FALSE
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
To display multiple geoms in the same plot, add multiple geom functions to ggplot().
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy)) +
  geom_smooth(mapping = aes(x = displ,
                            y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This, however, introduces some duplication in our code. Imagine you want to change the y-axis to display cty instead of hwy: you would need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to ggplot(). ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code.
ggplot(data = mpg,
       mapping = aes(x = displ,
                     y = hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only.
ggplot(data = mpg,
       mapping = aes(x = displ,
                     y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
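One way to see the split between global and local mappings is to inspect the plot object itself. The sketch below assumes that the mapping and layers components of a ggplot object are accessible with $, as they are in ggplot2 3.x; the name p is illustrative.

```r
library(ggplot2)

p <- ggplot(data = mpg,
            mapping = aes(x = displ,
                          y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

# Global mappings, shared by every layer
names(p$mapping)

# Local mappings of the first layer (geom_point); ggplot2 stores the
# aesthetic under its British spelling, "colour"
names(p$layers[[1]]$mapping)

# The second layer (geom_smooth) has no local mappings of its own,
# so this length is zero
length(p$layers[[2]]$mapping)
```

Only the point layer carries the color mapping, which is why the smooth line is drawn once for the whole dataset rather than once per class.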
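You can use the same idea to specify different data for each layer. Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.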
ggplot(data = mpg,
       mapping = aes(x = displ,
                     y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg,
                            class == "subcompact"),
              se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Bar charts seem simple, but they are interesting because they reveal something subtle about plots.
The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
On the y-axis, it displays count, but count is not a variable in diamonds. Where does count come from?
Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot.
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The figure below describes how this process works with geom_bar().
By default, geom_bar() uses stat_count(). You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar().
ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
Every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation.
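To make the counting step concrete, the same summary can be computed by hand. This is a sketch using count() from dplyr rather than ggplot2's own internals; the name cut_counts is illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Count the rows for each level of cut. This is the same kind of summary
# that stat_count() computes before the bars are drawn.
cut_counts <- count(diamonds, cut)
cut_counts
```

The n column holds one count per cut, and those counts add up to the number of rows in diamonds.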
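There are cases where you might need to use a stat explicitly.
For example, you can change the stat of geom_bar() from count (the default) to identity, which maps the height of the bars to the raw values of a y variable.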
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut,
                         y = freq),
           stat = "identity")
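You can also display a bar chart of proportions, rather than counts. Setting group = 1 makes ggplot2 compute the proportions across all bars together rather than separately within each bar.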
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         y = ..prop..,
                         group = 1))
When you also use the fill aesthetic, the heights of the bars need to be normalized explicitly, by dividing each count by the total count.
ggplot(data = diamonds) +
  geom_bar(aes(x = cut,
               y = ..count.. / sum(..count..),
               fill = color))
You can color a bar chart using either the color aesthetic or, more usefully, fill.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         color = cut))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         fill = cut))
If you map the fill aesthetic to another variable, like clarity, the bars are automatically stacked.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         fill = clarity))
position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         fill = clarity),
           position = "fill")
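The proportions that the filled bars display can also be computed directly. The sketch below uses dplyr to calculate, for each cut, the share of diamonds in each clarity level; the names clarity_share and share are illustrative, and this is a hand-rolled equivalent rather than what ggplot2 runs internally.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

clarity_share <- diamonds %>%
  count(cut, clarity) %>%          # rows per (cut, clarity) combination
  group_by(cut) %>%
  mutate(share = n / sum(n)) %>%   # normalize within each cut
  ungroup()
clarity_share
```

Within each cut the shares add up to 1, which is why every bar in the filled chart has the same height.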
position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
                         fill = clarity),
           position = "dodge")
There’s one other type of adjustment that’s not useful for bar charts, but it can be very useful for scatterplots.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy))
Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset? The values of hwy and displ are rounded, so the points appear on a grid and many points overlap each other. This problem is known as overplotting.
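The amount of overlap can be checked by comparing the number of rows with the number of distinct (displ, hwy) positions. This is a minimal sketch using distinct() from dplyr; the variable names are illustrative.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

n_obs <- nrow(mpg)

# Each distinct (displ, hwy) pair is drawn as a single visible point
n_positions <- nrow(distinct(mpg, displ, hwy))

c(observations = n_obs, distinct_positions = n_positions)
```

The second number is smaller than the first, and the difference is the number of observations hidden behind another point.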
You can avoid this problem by setting the position adjustment to "jitter", which adds a small amount of random noise to each point.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ,
                           y = hwy),
             position = "jitter")
Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().
Compare and contrast geom_jitter() with geom_count().
geom_jitter() adds random variation to the locations of the points in the graph. This reduces overplotting, since two points with the same location are unlikely to receive the same random variation.
ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy)) +
  geom_jitter()
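The trade-off between accuracy and readability can be tuned with the width and height arguments of geom_jitter(), which control how far points may be moved horizontally and vertically. The values below are illustrative assumptions, not recommendations, and the name p_jitter is hypothetical.

```r
library(ggplot2)

# Points are shifted by up to 0.2 data units in either direction on each axis.
# Smaller values stay closer to the true positions but leave more overlap.
p_jitter <- ggplot(data = mpg,
                   mapping = aes(x = cty,
                                 y = hwy)) +
  geom_jitter(width = 0.2, height = 0.2)
p_jitter
```

Because the noise is random, the exact point positions will differ from one run to the next.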
geom_count() sizes the points relative to the number of observations. Combinations of (x, y) values with more observations are drawn larger than those with fewer observations.
ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy)) +
  geom_count()
The default position for geom_boxplot() is "dodge2", which is a shortcut for position_dodge2(). This position adjustment does not change the vertical position of a geom, but moves it horizontally to avoid overlapping other geoms.
ggplot(data = mpg,
       aes(x = drv,
           y = hwy,
           colour = class)) +
  geom_boxplot()
If position_identity() is used, the boxplots overlap.
ggplot(data = mpg,
       aes(x = drv,
           y = hwy,
           color = class)) +
  geom_boxplot(position = "identity")
coord_flip() switches the x and y axes. This is useful if you want horizontal boxplots. It is also useful for long labels, which are hard to fit on the x-axis without overlapping.
ggplot(data = mpg,
       mapping = aes(x = class,
                     y = hwy)) +
  geom_boxplot() +
  coord_flip()
coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)
bar + coord_polar()
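coord_fixed() ensures that the line produced by geom_abline() is drawn at a 45-degree angle, because by default it makes one unit on the x-axis the same length as one unit on the y-axis.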
p <- ggplot(data = mpg,
            mapping = aes(x = cty,
                          y = hwy)) +
  geom_point() +
  geom_abline()
p + coord_fixed()