library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)
An empty gray panel with no axes or points, because no aesthetics are mapped and no geoms have been added.
How many rows are in mpg? How many columns?
dim(mpg)
## [1] 234 11
234 rows, 11 columns.
What does the drv variable describe? Read the help for ?mpg to find out.
?mpg
“drv: the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd”
Make a scatterplot of hwy vs cyl.
ggplot(data = mpg) + geom_point(aes(x = hwy, y = cyl))
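What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(mpg) + geom_point(aes(class, drv))
Both class and drv are categorical, so the points pile up on the few possible combinations and the scatterplot cannot show how many cars fall into each one. A bar plot would be better.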
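As a rough sketch of the bar plot suggested above, the counts for each class can be stacked by drv. The object names class_drv_counts and p_bar are only illustrative.

```r
library(ggplot2)

# Tabulate how many cars fall into each class/drv combination.
class_drv_counts <- table(mpg$class, mpg$drv)
class_drv_counts

# Stacked bars: one bar per class, with segments filled by drive train.
p_bar <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()
p_bar
```

Unlike the scatterplot, the bar heights show how many cars sit behind each combination.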
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
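To set one color for every point (rather than mapping color to a variable), the color
argument must go outside the aes() function. Inside aes(), "blue" is treated as a variable with the single value "blue", so the points get the first default color and a legend.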
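A minimal sketch of the corrected call, with color moved outside aes(). The object name p_blue is only illustrative.

```r
library(ggplot2)

# color is set outside aes(), so it is a fixed property of the layer
# rather than a mapping to a variable.
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p_blue
```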
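Which variables in mpg are categorical? Which variables
are continuous? (Hint: type ?mpg to read the documentation
for the dataset). How can you see this information when you run
mpg?
manufacturer, model, trans, drv, fl and class are categorical. displ,
year, cty, and hwy are continuous. cyl is stored as an integer but takes only a few values, so it is often treated as categorical. When you print mpg, the type shown under each column name tells you which is which: <chr> columns are categorical, while <int> and <dbl> columns are numeric.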
mpg
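The column types can also be pulled out programmatically. This is a small sketch using base R only; col_types is an illustrative name.

```r
library(ggplot2)

# Storage class of every column in mpg.
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# Names of the character (categorical) columns.
names(col_types)[col_types == "character"]
```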
Map a continuous variable to color, size,
and shape. How do these aesthetics behave differently for
categorical vs. continuous variables?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = year))
Categorical variables get a distinct color or size for each level, whereas continuous variables are assigned values along a gradient. Shapes cannot be ordered on a continuous scale, so mapping a continuous variable to shape raises an error.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = trans, shape = trans))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 10. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 96 rows containing missing values (geom_point).
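The variable is encoded in each of the aesthetics at once. For example, auto(l4) is green and square. The encoding is redundant, and because the shape palette has only 6 values while trans has 10 levels, the 96 rows in the remaining levels are dropped from the plot.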
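The warning suggests specifying shapes manually. A hedged sketch of that, assuming shapes 0 to 9 as an arbitrary choice of one shape per level (p_shapes and n_trans are illustrative names):

```r
library(ggplot2)

# trans has more levels than the default shape palette can show.
n_trans <- length(unique(mpg$trans))
n_trans

# Assumption: shapes 0 to 9 are an arbitrary pick, one per level of trans.
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = trans)) +
  scale_shape_manual(values = 0:9)
p_shapes
```

With one shape per level no rows are dropped, although ten shapes are still hard to tell apart.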
What does the stroke aesthetic do? What shapes does it
work with? (Hint: use ?geom_point)
stroke changes the thickness of the border drawn around a point. It is most useful with shapes that have a border, such as the filled shapes 21 to 25.
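A short sketch of stroke on a bordered shape. The size and stroke values are arbitrary, and p_stroke is an illustrative name.

```r
library(ggplot2)

# Shape 21 is a filled circle: colour is the border, fill is the inside,
# and stroke sets the border width.
p_stroke <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 3, stroke = 2)
p_stroke
```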
What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note,
you’ll also need to specify x and y.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
If you map an aesthetic to a logical condition, ggplot2 evaluates it for every row and assigns one aesthetic value to TRUE and another to FALSE.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ cty)
It makes a panel for every unique value of the continuous variable.
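One possible way to keep the number of panels manageable is to bin the continuous variable before faceting. The sketch below assumes four bins, an arbitrary choice, using ggplot2's cut_number(); p_binned is an illustrative name.

```r
library(ggplot2)

# cut_number() splits cty into 4 bins with roughly equal numbers of cars.
p_binned <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_number(cty, 4))
p_binned
```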
What do the empty cells in the plot with facet_grid(drv ~ cyl) mean? How do they relate to this
plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
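The empty cells are combinations of drv and cyl that have no observations. Because both are discrete variables, all of the points for a given
combination fall on the same spot in the scatterplot, so each point on this graph represents
all of the points in the corresponding panel of the facet_grid(drv ~ cyl) plot.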
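A cross-tabulation makes the link between the two plots concrete. This is a small base R sketch; drv_cyl is an illustrative name.

```r
library(ggplot2)

# Cross-tabulate drv and cyl; a zero marks a combination with no cars,
# which shows up as an empty facet panel and a missing scatterplot point.
drv_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_cyl

# Number of empty combinations.
sum(drv_cyl == 0)
```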
What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
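It makes a facet grid with only one variable. The .
is a placeholder meaning no variable on that side of the formula: drv ~ . facets into rows only, and . ~ cyl facets into columns only. It is like facet_wrap(), but the panels stay in
a single row or column instead of wrapping.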
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
It is much easier to see the distributions of the individual classes, but harder to see the overall trend across all classes. With a larger dataset, the overall chart becomes increasingly crowded. Faceting can help separate the data into more readable plots.
Read ?facet_wrap. What does nrow do? What
does ncol do? What other options control the layout of the
individual panels? Why doesn’t facet_grid() have
nrow and ncol arguments?
nrow and ncol set the number of rows and
columns that the faceted panels are displayed in. Other arguments such as dir and as.table also affect the layout.
facet_grid() doesn’t have nrow and ncol because its rows
and columns are determined by the number of unique values of the
variables used to facet.
When using facet_grid() you should usually put the
variable with more unique levels in the columns. Why?
This makes the plot easier to read because plots and computer screens are usually wider than they are tall, so putting more levels in the columns squashes each panel less.
Line: geom_line(), Boxplot: geom_boxplot(),
Histogram: geom_histogram(), Area:
geom_area().
Scatter plot of displacement against hwy mpg, with dots colored according to drive train. Overlaid is a smooth regression line of the data.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
I was wrong. I forgot that aesthetics set in ggplot(), here color = drv, are inherited by every geom, so geom_smooth() draws a separate line for each drive train.
What does show.legend = FALSE do? What happens if you
remove it?
show.legend = FALSE removes the legend at the side of
the graph that labels aesthetics like color and
shape. If you remove the argument, the legend is drawn again.
I would guess it was used so that the graphs would all be the same size when placed next to
each other.
What does the se argument to geom_smooth()
do?
It controls whether the semi-transparent confidence band around the line is drawn.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
They will not look different, because the data and mapping
set in ggplot() carry over to the geoms, which is equivalent to repeating them in each geom.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point(size = 5, color = 'white', alpha = 0.5) +
geom_point(aes(color = drv))
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to
use that geom function instead of the stat function?
geom_pointrange().
ggplot(diamonds, aes(cut, depth)) +
geom_pointrange(
stat = "summary",
fun.min = min,
fun.max = max,
fun = median
)
What does geom_col() do? How is it different to
geom_bar()?
geom_col() uses stat_identity() by default,
whereas geom_bar() uses stat_count(). This means
you need to provide a y variable for geom_col() but not for
geom_bar().
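To illustrate the difference, the sketch below counts the diamonds per cut by hand and passes the result to geom_col(), which should give the same bars that geom_bar() computes on its own. The names cut_counts, p_col and p_bar are illustrative.

```r
library(ggplot2)

# Count the diamonds in each cut by hand.
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")
cut_counts

# geom_col() plots the precomputed n as the bar height (stat_identity).
p_col <- ggplot(cut_counts, aes(x = cut, y = n)) +
  geom_col()

# geom_bar() counts the rows itself (stat_count), so no y is supplied.
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()
```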
| geom | stat |
|---|---|
| geom_bar() | stat_count() |
| geom_bin2d() | stat_bin_2d() |
| geom_boxplot() | stat_boxplot() |
| geom_contour_filled() | stat_contour_filled() |
| geom_contour() | stat_contour() |
| geom_count() | stat_sum() |
| geom_density_2d() | stat_density_2d() |
| geom_density() | stat_density() |
| geom_dotplot() | stat_bindot() |
| geom_function() | stat_function() |
| geom_sf() | stat_sf() |
| geom_smooth() | stat_smooth() |
| geom_violin() | stat_ydensity() |
| geom_hex() | stat_bin_hex() |
| geom_qq_line() | stat_qq_line() |
| geom_qq() | stat_qq() |
| geom_quantile() | stat_quantile() |
Their names are often identical apart from the geom_ or stat_ prefix.
What variables does stat_smooth() compute? What
parameters control its behavior?
It computes the predicted value y, the lower
bound of the confidence interval ymin, the upper bound
ymax, and the standard error se. Its behavior can be
controlled with method, formula, se, level, span
and other arguments.
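The computed variables can be inspected directly. This sketch uses layer_data(), which returns a layer's data after its stat has run; the method and formula are spelled out only to match the defaults reported in the messages above, and p_smooth and smooth_data are illustrative names.

```r
library(ggplot2)

p_smooth <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x)

# layer_data() returns the data for layer 1 after the stat has been computed.
smooth_data <- layer_data(p_smooth, 1)
names(smooth_data)
head(smooth_data[, c("x", "y", "ymin", "ymax", "se")])
```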
In the proportion bar chart, we need to set group = 1.
Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
Without group = 1 (or any other constant), geom_bar()
groups by x by default. This is fine
normally, but after_stat(prop) is computed within each group, so each level of cut is
only compared to itself. In this case the proportion of
Fair in Fair is 100%, which makes all the bars
the same height. group = 1 fixes this by putting every row in one group, so the
proportion for each cut is computed relative to all levels of cut.
ggplot(data = diamonds) +
geom_bar(aes(x = cut, y = after_stat(prop), fill = color, group = 1))
But hey, where’s the color gone? Let’s try removing the grouping.
ggplot(data = diamonds) +
geom_bar(aes(x = cut, y = after_stat(prop), fill = color))
Yeah… that’s not right.
ggplot(data = diamonds) +
geom_bar(aes(x = cut, y = after_stat(count) / sum(after_stat(count)), fill = color))
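This took a second to get but makes sense. With fill = color, every
combination of cut and color is its own group, so after_stat(prop) is 1 within each group.
Dividing after_stat(count) by sum(after_stat(count)) instead computes each segment as a share of all diamonds,
so the stacked segments for a cut add up to that cut’s overall proportion.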
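The arithmetic behind that last plot can be checked by hand. This is a base R sketch, not what ggplot2 runs internally; cut_color, segment_share and bar_height are illustrative names.

```r
library(ggplot2)

# Share of all diamonds in each cut/color combination:
# the height of one stacked segment.
cut_color <- table(diamonds$cut, diamonds$color)
segment_share <- cut_color / sum(cut_color)

# Adding the segments within a cut gives the total bar height,
# which recovers the overall proportion of that cut.
bar_height <- rowSums(segment_share)
bar_height
```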
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
Many of the dots overlap, which makes the dataset look smaller than it actually is. We can either lower alpha to reveal the overlap, or add jitter to move overlapping points slightly apart so they are all visible.
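Both fixes can be sketched as follows. The alpha, width and height values are arbitrary illustrations, and n_pairs, p_alpha and p_jitter are illustrative names.

```r
library(ggplot2)

# Number of distinct (cty, hwy) pairs; anything below nrow(mpg)
# means some points are drawn on top of each other.
n_pairs <- nrow(unique(mpg[, c("cty", "hwy")]))
c(distinct_pairs = n_pairs, rows = nrow(mpg))

# Option 1: transparency, so darker spots mark heavier overplotting.
p_alpha <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.3)

# Option 2: jitter, adding a little random noise to each point.
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)
```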
What parameters to geom_jitter() control the amount of
jittering?
width and height.
Compare and contrast geom_jitter() with
geom_count().
geom_count() shows overlapping points as a single larger
dot, whereas geom_jitter() spreads them out as a small cloud of
same-sized dots.
What’s the default position adjustment for geom_boxplot()? Create a visualization of the
mpg dataset that demonstrates it.
By default for geom_boxplot(),
position = "dodge2".
ggplot(mpg, aes(hwy, class)) +
geom_boxplot(aes(fill = fl))
We can see that, within each class, the boxplot for each fl
is dodged from the others.
coord_polar().
ggplot(mpg) +
geom_bar(aes(x = manufacturer, fill = manufacturer)) +
coord_polar()
What does labs() do? Read the documentation.
It lets you set the plot title, subtitle, caption, axis labels, and legend titles of a ggplot.
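A brief sketch of labs() in use. All of the label text is made up for illustration, and p_labs is an illustrative name.

```r
library(ggplot2)

p_labs <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    title = "Highway mileage by engine size",  # plot title
    subtitle = "Illustrative labels only",     # line under the title
    caption = "Data: ggplot2::mpg",            # text at the bottom right
    x = "Engine displacement (litres)",        # x axis label
    y = "Highway miles per gallon",            # y axis label
    color = "Car class"                        # legend title
  )
p_labs
```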
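What’s the difference between coord_quickmap() and
coord_map()?
coord_map() applies a full map projection, which in general does not preserve straight lines and so requires considerable computation.
coord_quickmap() is a quick approximation that does preserve straight lines, which makes
it significantly faster. It works best for smaller areas closer to the equator.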
Why is coord_fixed() important? What
does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
We can see the relationship is linear. coord_fixed() is
important because it sets the aspect ratio to 1, so one unit on the x axis
is the same length as one unit on the y axis. This lets us see
the trend clearly and compare it with the reference line at its true 45-degree angle. geom_abline() adds a reference line that is separate from the
data. In this case, because no arguments are passed, it defaults to slope 1 and intercept 0, that is y =
x.
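The same plot can be written with the defaults of geom_abline() spelled out, which makes the reference line explicit. The share computed at the end is an extra check added here, not part of the exercise; p_ref and share_above are illustrative names.

```r
library(ggplot2)

# Same reference line, with the default slope and intercept written out.
p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_fixed()
p_ref

# Share of cars whose highway mileage exceeds their city mileage,
# i.e. the share of points lying above the y = x line.
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```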