In this class, we will start to learn how to visualize data with the ggplot2 package in R. Again, to activate all functions in ggplot2, we need to load the package. Usually we simply load tidyverse, which contains ggplot2.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.4
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
As we see, ggplot2 is part of the tidyverse package.
Categorical (or qualitative) variable: takes values that are not numerical (not numbers)
Numeric (or quantitative) variable: takes values that are numeric (numbers)
Discrete variable: A numeric variable whose possible values can be listed.
Continuous variable: A numeric variable whose possible values come from an interval of real numbers.
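As a small illustration of these definitions, the sketch below checks how each column of mpg is stored. It assumes that columns stored as numbers are numeric variables and columns stored as text are categorical; the helper names (col_classes, numeric_cols, and so on) are made up for this example.

```r
library(ggplot2)  # provides the mpg data set

# Storage class of every column in mpg
col_classes <- sapply(mpg, class)
print(col_classes)

# Split the columns into numeric and non-numeric (categorical) ones
is_num <- sapply(mpg, is.numeric)
numeric_cols <- names(mpg)[is_num]
categorical_cols <- names(mpg)[!is_num]

# A numeric column with only a few distinct values (such as cyl)
# behaves like a discrete variable
n_distinct_values <- sapply(mpg[numeric_cols], function(v) length(unique(v)))
print(n_distinct_values)
```

Whether a numeric column is best treated as discrete or continuous is a judgment call; the count of distinct values is only a hint.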
Why do we have this many plot types? One reason is that we need different plots to best illustrate the relationship between (usually one or two) variables of different types.
Bar plots: (usually) for one categorical variable
Histograms: for one numeric variable
Box plots: for one continuous variable
Scatter plots: (usually) for two numeric variables
Multiple box plots: for one continuous variable and one categorical/discrete variable
Stacked bar plots: for two categorical variables.
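Histograms appear in the list above but are not demonstrated elsewhere in these notes, so here is a minimal sketch for one numeric variable. The choice of hwy and of binwidth = 2 is an assumption made for illustration only.

```r
library(ggplot2)

# Histogram of highway mileage; each bar counts the cars whose hwy
# value falls into a bin that is 2 mpg wide
p_hist <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2)
p_hist
```

A different binwidth can change the apparent shape of the distribution, so it is worth trying a few values.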
Again, let’s use the fuel economy data mpg as the first data set to work on. To recall what we learned from last class, answer the following questions using R.
Now let’s start by creating scatter plots, which are among the most commonly used graphs in scientific research. Let’s plot the cty and hwy variables in the mpg data set against each other, with cty on the x-axis and hwy on the y-axis, using the code given below.
cty vs hwy scatter plot
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy))
ggplot() creates a coordinate system that you can add layers to.
The first argument of ggplot() is the data set to use in the graph.
geom_point() adds a scatter plot (which is called a layer) to the current plot.
geom_point() function
mapping argument: defines how variables in your data set are mapped to visual properties.
The mapping argument is always paired with the aes() function, which constructs aesthetic mappings.
The x and y arguments of aes() specify which variables to map to the x and y axes. You don’t need quotation marks around variable names.
ggplot2 looks for the mapped variables in the data argument, in this case, mpg.
Create a scatter plot of displ vs hwy from the mpg data set.
The reason why we see fewer points than the number of samples is that the values of displ and hwy are rounded, so some points overlap with each other. This problem is known as overplotting.
For example, the fifth and the sixth samples share the same displ and hwy values.
mpg[5:6, c('displ', 'hwy')] # This code shows the values from "hwy" and "displ" for the 5th and 6th sample
## # A tibble: 2 × 2
## displ hwy
## <dbl> <int>
## 1 2.8 26
## 2 2.8 26
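To see how widespread the overlap is, one option is to compare the number of rows with the number of distinct (displ, hwy) pairs. This is only a rough sketch; the variable names n_rows and n_unique_points are invented for the example.

```r
library(ggplot2)

# Total number of samples in mpg
n_rows <- nrow(mpg)

# Number of distinct (displ, hwy) combinations, i.e. the number of
# visibly different points in the un-jittered scatter plot
n_unique_points <- nrow(unique(mpg[, c("displ", "hwy")]))

c(rows = n_rows, unique_points = n_unique_points)
```

If the second number is smaller than the first, some points are drawn exactly on top of each other.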
We can add the option position = "jitter" to the geom_point function to avoid the overplotting problem. Doing this adds a small amount of random noise to each point, which spreads the points out.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
Here the position argument controls position adjustments, which determine how to arrange geoms that would otherwise occupy the same space.
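Because the jitter is random, the plot looks slightly different on every run. If a reproducible figure is needed, a possible approach is to pass position_jitter() with a seed instead of the string "jitter". The width, height and seed values below are arbitrary choices for illustration.

```r
library(ggplot2)

# Jitter with an explicit amount of noise and a fixed random seed,
# so the same picture is produced every time the code is run
p_jitter <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.05, height = 0.3, seed = 123)
  )
p_jitter
```

Smaller width and height values keep the points closer to their true positions, at the cost of more overlap.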
Next, let’s learn how to create a bar plot for one variable with ggplot2. The code template is very similar to that for scatter plots, but the variable must be categorical and we don’t need the y variable in the mapping. Now let’s draw the bar plot for the variable drv in the mpg data set.
ggplot(data = mpg) +
geom_bar(mapping = aes(x = drv))
Here we use the function geom_bar to create a bar plot.
A bar plot summarizes the count (or frequency) of each category in the data set.
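The counts behind the bars can be checked with table(). The sketch below compares them with the bar heights that ggplot2 computes; the object names drv_counts, p_bar and bar_heights are made up for this example.

```r
library(ggplot2)

# Frequency of each drive train type, computed directly
drv_counts <- table(mpg$drv)
print(drv_counts)

# The same counts as computed by geom_bar for the bar heights
p_bar <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = drv))
bar_heights <- layer_data(p_bar)$count
print(bar_heights)
```

Both outputs should list the same numbers, one per category of drv.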
Create proper bar plots to answer the following questions:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
This template contains the most basic information needed to create a graph:
The data set: <DATA>.
The geom function: <GEOM_FUNCTION>.
The aesthetic mappings: <MAPPINGS>.
We will expand this template to more complicated cases in future classes. More details can be found at
https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf
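As one possible way to fill in the template, the sketch below uses the built-in diamonds data set as <DATA>, geom_bar as <GEOM_FUNCTION>, and x = cut as <MAPPINGS>. These particular choices are assumptions for illustration and are not taken from the exercises.

```r
library(ggplot2)

# Template filled in:
#   <DATA>          -> diamonds
#   <GEOM_FUNCTION> -> geom_bar
#   <MAPPINGS>      -> x = cut
p_template <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
p_template
```

Swapping any one of the three pieces while keeping the other two gives a different graph from the same template.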
We can use the fill argument in the aes function to color a bar plot.
ggplot(data = mpg) +
geom_bar(mapping = aes(x = drv, fill = drv))
Here we use the same variable for the fill and x arguments, which means each bar is filled with a color determined by its x value. If we use different variables, it becomes a stacked bar plot.
ggplot(data = mpg) +
geom_bar(mapping = aes(x = drv, fill = class))
A stacked bar plot shows the distribution across combinations of two categorical variables by breaking down each bar into smaller colored bars. Observe the graph on the last page and answer the following questions.
Another way to show the distribution across combinations of two categorical variables is to use a dodged bar plot.
ggplot(data = mpg) +
geom_bar(mapping = aes(x = drv, fill = class), position = "dodge")
Again, we use the position argument in the geom function to adjust the position of the bars.
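Both the stacked and the dodged bar plots display the same two-way table of counts. A quick sketch of that table, with invented names drv_class_counts and p_dodge, is:

```r
library(ggplot2)

# Two-way frequency table: rows are drive train types, columns are car classes
drv_class_counts <- table(drv = mpg$drv, class = mpg$class)
print(drv_class_counts)

# The dodged bar plot draws one bar for every non-empty cell of this table
p_dodge <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = drv, fill = class), position = "dodge")
p_dodge
```

Cells with a count of zero produce no bar, which is why some drv groups show fewer bars than others.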
Use dodged bar plots to answer the following question:
Change x in the aes function into y and reproduce the plot. What did you see?
Next, let’s learn how to create a box plot. A box plot summarizes key information about the center, spread and potential outliers of a numeric variable.
First, let’s review how to read a box plot.
We use the function geom_boxplot to create a box plot.
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = displ)) +
scale_y_discrete(breaks = NULL) # Remove the y scales
ggplot(data = mpg) +
geom_boxplot(mapping = aes(y = displ)) + # The plot can also be vertical
scale_x_discrete(breaks = NULL) # Remove the x scales
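The quantities that the box plot draws can also be computed by hand. The sketch below assumes the usual 1.5 × IQR rule for whiskers and outliers, which is the default used by geom_boxplot; all variable names are made up for this example.

```r
library(ggplot2)

# Quartiles of displ: edges of the box and the median line
q <- quantile(mpg$displ, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: points beyond these are drawn as potential outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whisker_low <- min(mpg$displ[mpg$displ >= lower_fence])
whisker_high <- max(mpg$displ[mpg$displ <= upper_fence])

# Observations outside the fences (possibly none)
outliers <- mpg$displ[mpg$displ < lower_fence | mpg$displ > upper_fence]

print(q)
print(c(whisker_low = whisker_low, whisker_high = whisker_high))
print(outliers)
```

Comparing these numbers with the plot is a useful check that the box, whiskers and outlier points are being read correctly.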
Boxplots created by ggplot2 do not have horizontal caps at the ends of the whisker lines. There is a trick to add them to our plot:
ggplot(data = mpg, mapping = aes(y = displ)) +
stat_boxplot(geom = "errorbar", width = 0.5) + # "width" controls how wide the whisker end caps are
geom_boxplot() +
scale_x_discrete(breaks = NULL)
Remove geom_boxplot() from the code above and see what you get.
Put geom_boxplot() before the stat_boxplot line and see what you get.
More often, multiple boxplots are used to compare the effect of a categorical variable on a numeric variable. It’s very easy to do this with ggplot2. We simply use both the x and y arguments.
For example, suppose we want to study the effect of drive train types on fuel economy measured by hwy. We can create the plots with the following code.
ggplot(data = mpg, mapping = aes(x = drv, y = hwy)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot()
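The comparison shown by the side-by-side boxplots can be backed up with a numeric summary per group. A minimal sketch using base R, with the invented name hwy_median_by_drv, is:

```r
library(ggplot2)

# Median highway mileage for each drive train type;
# these are the middle lines of the three boxes
hwy_median_by_drv <- tapply(mpg$hwy, mpg$drv, median)
print(hwy_median_by_drv)

# Full five-number summary (plus mean) of hwy within each drv group
print(tapply(mpg$hwy, mpg$drv, summary))
```

Reading the medians next to the plot makes it easier to say which drive train type tends to have better highway fuel economy.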
Create multiple boxplots for the variables manufacturer and cty, and answer the following question:
How are these two plots similar?
Both plots use the same data set, but different visual objects (which we call geoms).
# left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) # point geom
# right
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy)) # smooth geom
The function geom_smooth creates a smoothed conditional means curve to fit the data. The shaded region represents the 95% confidence interval (the level can be adjusted).
There is statistical modeling behind this plot, so we must learn statistical methods to fully understand the details.
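As a sketch of how the confidence level might be adjusted, the code below uses the level and se arguments of geom_smooth. The explicit method = "loess" and formula = y ~ x are assumptions chosen to make the example self-describing; the 99% level is arbitrary.

```r
library(ggplot2)

# Default 95% confidence band
p_smooth_95 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x, level = 0.95)

# Wider 99% confidence band
p_smooth_99 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x, level = 0.99)

# Curve only, with the shaded band switched off
p_smooth_no_band <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p_smooth_99
```

A higher confidence level gives a wider shaded region around the same curve.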
What if we want to combine the two plots into one? It is very simple to do with ggplot2: we just apply two geom functions, which add two layers to the same plot.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
We can put mapping into the ggplot function to avoid redundant code:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() + geom_smooth()
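A mapping placed in ggplot() is shared by every layer, while a mapping placed inside a geom function applies to that layer only. The sketch below illustrates the difference; coloring the points by drv is an assumption added for illustration and is not part of the original example.

```r
library(ggplot2)

# x and y are shared by both layers; color is mapped in geom_point only,
# so the points are colored by drv while a single smooth curve is drawn
p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(method = "loess", formula = y ~ x)
p_layers
```

Moving color = drv into the ggplot() call instead would make geom_smooth draw one curve per drive train type.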
For the following two questions, submit your plot and answer to Canvas (go to Discussions tab and reply to the post)
Use the built-in diamonds data set in ggplot2 to create a scatter plot and a smooth line plot (in the same graph) with price on the y-axis and carat on the x-axis.
What conclusions can you draw from your figure?
(self-study) Find out how the function geom_count() works. Create a plot with the mpg data set using geom_count().
In this class, we learned some basics of data visualization with ggplot2 in R. You are required to
geom
functions.ggplot2