Topics covered in these notes:
Note: I do not give comprehensive definitions for many of the functions used in this notebook (or the ones to follow). If the use of a function seems unclear, I encourage you to use the
? help functionality in R (e.g., ?mean) to learn more about what the function is doing. Alternatively, refer to the Posit Cloud Cheat Sheets, or use Google!
Basic Setup:
Our purposes for using R stem from the need to process data so we can do statistics. So, it makes sense to start our introduction to R with the data frame, the manner in which R stores data (tables).
Technically, every data table in R is stored as a collection of
vectors (columns), which is then given special
ease-of-use methods when stored as a data.frame. When the
data frame is then stored as/converted to a tibble,
data manipulation and management become cleaner, easier, and faster.
So, for the remainder of this course, we will ensure all data is stored
as a tibble (using as_tibble, if needed), and the terms
“tibble” and “data frame” will be used interchangeably.
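As a small sketch of that conversion (using `cars`, a plain data frame that ships with base R; the name `cars_tbl` is just one I made up for illustration):

```r
library(tibble)

# `cars` is a base R data frame, used here only as a stand-in
class(cars)                   # a plain data.frame

# convert it to a tibble
cars_tbl <- as_tibble(cars)
class(cars_tbl)               # a tibble is still a data.frame underneath

is_tibble(cars)               # FALSE
is_tibble(cars_tbl)           # TRUE
```

Notice that a tibble keeps the data.frame class, which is why functions written for data frames generally accept tibbles as well.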
Before we get started, we will import the libraries needed to run these notes.
library(tidyverse) # we will see more on this package in a bit
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
A convenient data set which we can continue to use throughout this
walk-through is the mpg data set.
(This should already have been made available with ggplot2
when we ran the above code block.)
# load the data so we can view it in RStudio
data(mpg)
Once the data is loaded, find it in the Environment
panel. Click the arrow (on the left) to view the
structure of the data (a.k.a. str(mpg)),
and click the small data table icon (on the right) to
View a rendering of the table (a.k.a.
View(mpg), with a capital V).
head() and tail() are also helpful functions
for getting a glimpse of a data frame or tibble.
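Here is a quick sketch of those console-based views, for when you aren't clicking around in RStudio. (`glimpse` comes from dplyr and is not mentioned above; I include it as one more option.)

```r
library(ggplot2)  # provides the mpg data set
library(dplyr)    # provides glimpse()

head(mpg, 3)   # the first 3 rows
tail(mpg, 3)   # the last 3 rows
str(mpg)       # the structure: one line per column
glimpse(mpg)   # a tidier version of str() for tibbles
dim(mpg)       # number of rows, then number of columns
```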
R is an “object oriented” language. This means that when we type a
string of characters without quotations around it (e.g.,
mean, or mpg), we are referencing an R object
stored in memory. An R object could be a function (like
mean), or a data frame (like mpg), or a
“variable” that we define. When I say variable in this context, I simply
mean an R object with a name. We use the <- symbol to
define R variables:
# code preceded by hashtag is a non-executable *comment*
# watch the autocomplete as you type a new object name ... make sure it's not taken
numbers <- c(1, 2, 3, 4)
To distinguish between this kind of variable and a column of data, I’ll use “object” or “R-variable” to denote the former, and simple “variable” to denote the latter.
Note: If you think you might have overwritten a built-in R-variable, use
rm(var) to revert.
When you view the structure of the data, you should see next to the name of the column some abbreviation (e.g., “chr” or “int”). These are data types, or the kind of value that R is storing for that column vector. There are eight kinds of value worth knowing about in R: Double, Integer, Character, Logical, Date, Complex, Raw, and NULL. (Strictly speaking, a Date is a class stored on top of a double rather than a basic type of its own.)
# I use "_" to keep from overwriting existing R objects
double_ <- pi # <- assigns new values to objects
integer_ <- 3L # notice the L
character_ <- "pi"
logical_ <- TRUE
null_ <- NULL # this is essentially a non-existent value
date_ <- as.Date("03-14-1593",
"%m-%d-%Y") # month-day-year, to match the string
# these will not be used in this course
complex_ <- 3i + 14 # notice the "i"
raw_ <- charToRaw("pi") # this stores raw bytes (we won't use it)
typeof(double_)
## [1] "double"
Importantly, a vector (singularly defined using
c(val1, val2, …)) can only contain data of one datatype (or
NA, i.e., missing, values). Since any column of a data frame is a vector, this holds
true for columns of data as well.
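To see the one-datatype rule in action, here is a short sketch. The object names (`mixed`, `nums`, `with_na`) are made up for illustration. When you mix types inside `c()`, R quietly converts everything to a single common type.

```r
# mixing a number, a string, and a logical: everything becomes character
mixed <- c(1, "two", TRUE)
typeof(mixed)
mixed

# mixing an integer and a double: everything becomes double
nums <- c(1L, 2.5)
typeof(nums)

# NA marks a missing value, and the vector keeps its type
with_na <- c(1, NA, 3)
typeof(with_na)

# NULL, on the other hand, is simply dropped
length(c(1, NULL, 3))
```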
Alternatively, a list (defined using
list(val1, val2, …)) can contain multiple datatypes.
Technically, you can create list-columns,
but these (and lists in general) are mostly useful for more advanced
data cleaning and manipulation, and we won’t be using either of them in
this course.
If we want to be pedantic, the “vectors” and “lists” above are both a type of R vector. The former is non-recursive, and the latter is recursive. Since we are only using non-recursive vectors in this course, we will refer to them simply as “vectors”.
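For the curious, a brief sketch of that distinction (the names `plain_vector` and `mixed_list` are just for illustration; we won't need any of this later):

```r
plain_vector <- c(1, 2, 3)
mixed_list   <- list(1.5, "a", TRUE)

# each element of a list keeps its own type
sapply(mixed_list, typeof)

# both count as "vectors" as far as R is concerned ...
is.vector(plain_vector)
is.vector(mixed_list)

# ... but only the list is recursive
is.recursive(plain_vector)
is.recursive(mixed_list)
```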
Instantiate a few different R-variables; one of each data type which will be used in this course. Try running different mathematical operations on each variable (or, combinations of variables).
# your code here ...
Supposing we would like to save this data frame (or one like it) as a
CSV file, we can do so with the tibble-optimized
write_delim, that is:
write_delim(mpg, "./mpg_data.csv", delim = ",")
Similarly, you can load CSV data with
mpg2 <- read_delim("./mpg_data.csv", delim = ",")
## Rows: 234 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): manufacturer, model, trans, drv, fl, class
## dbl (5): displ, year, cyl, cty, hwy
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
(“delim” is short for “delimiter”, the character defining columns in the file.)
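As a sketch of the same round trip, readr also has the shortcuts `write_csv` and `read_csv`, which assume a comma delimiter. I write to a temporary file here (via `tempfile`) only to avoid cluttering your working directory; the names `path` and `mpg_back` are made up.

```r
library(readr)
library(ggplot2)  # provides the mpg data set

# a throwaway file location, just for this example
path <- tempfile(fileext = ".csv")

write_csv(mpg, path)
mpg_back <- read_csv(path, show_col_types = FALSE)

dim(mpg_back)          # same shape as mpg

# one thing to watch: integer columns come back as doubles
class(mpg$year)
class(mpg_back$year)
```

This matches the column specification printed above, where the five numeric columns are read back as dbl.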
Much of what we are expected to do as data practitioners is navigate
the chaos of our data. With such large amounts of data with unbounded
dimensions (columns), we need to summarize our data into numbers that we
can understand. We’ll use the term statistic to denote
a number that describes some data. A few helpful “summary statistics” we
use often are mean, median, mode,
max, min, or count. Another
helpful summary statistic is the quantile. E.g., the 1st
and 3rd quartiles (notice the “r”) mark partitions for 25% and
75% of the data. So, between the 1st and 3rd quartile, we have 50% of
the data. We can get a glimpse of our data and many of these summary
statistics using summary:
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
We will see later how to summarize character columns.
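To connect the quartiles in that output to the quantile function, here is a small sketch on the hwy column (the name `hwy_quartiles` is made up):

```r
library(ggplot2)  # provides the mpg data set

# the 1st quartile, median, and 3rd quartile of highway mileage
hwy_quartiles <- quantile(mpg$hwy, probs = c(0.25, 0.5, 0.75))
hwy_quartiles   # 18, 24, 27 (matching the summary output)

# the distance between the 1st and 3rd quartiles
IQR(mpg$hwy)    # 27 - 18 = 9
```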
Another important statistic that is not included above is the
variance (or, the standard deviation). The variance (var)
is, roughly, the average squared distance of each number from the mean
(R’s var divides by n - 1 rather than n),
and it represents a crucial measure of “statistical dispersion”. The
standard deviation (sd) is the square root of this.
Note: Standard Deviation is just one measure of dispersion. Quantiles (
quantile) can also provide a good look at how spread out the data is, and the Gini Coefficient (not discussed in this class) provides another.
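Here is a sketch of the variance calculation done by hand, on a small made-up vector `x`. One detail worth seeing for yourself: R's `var` divides the sum of squared distances by n - 1, not n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up numbers; their mean is 5
n <- length(x)

squared_distances <- (x - mean(x))^2

sum(squared_distances) / n         # dividing by n gives 4
sum(squared_distances) / (n - 1)   # dividing by n - 1 gives about 4.57

var(x)                             # matches the n - 1 version
sd(x)                              # the square root of var(x)
```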
Using the mpg data, calculate some of the summary
statistics (individually) for each of the numeric columns. Consider the
weighted.mean(). Is there a column or situation where this
might make sense? Investigate different quantiles of data (i.e., use the
quantile function with multiple quantile partitions).
# your code here
The tidyverse is a curated collection of R packages, specifically selected with data science (and statistics) in mind. The packages are designed to work especially well with one another, and seamlessly with base R.
| Term | Definition |
|---|---|
| variable | a quantity, quality, or property that you can measure |
| value | the state of a variable when it was measured |
| observation | a set of measurements under the same conditions for a single entity |
Tabular data is a table of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own cell, each variable in its own column, and each observation in its own row. In this course, we aim to find data sets that are as tidy as possible.
Functions are R objects that take an input within parentheses, and execute some operation. The values within the parentheses for a function are called arguments. Typically, functions in the tidyverse have a data frame or column (vector) as their first argument, then subsequent arguments are either named in the function or named by the user.
# sorting (arranging) data based on the value in a column
arrange(mpg, displ)
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.6 1999 4 manu… f 28 33 r subc…
## 2 honda civic 1.6 1999 4 auto… f 24 32 r subc…
## 3 honda civic 1.6 1999 4 manu… f 25 32 r subc…
## 4 honda civic 1.6 1999 4 manu… f 23 29 p subc…
## 5 honda civic 1.6 1999 4 auto… f 24 32 r subc…
## 6 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 7 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 honda civic 1.8 2008 4 manu… f 26 34 r subc…
## # ℹ 224 more rows
In the documentation for arrange (and many functions
like it), … denotes variable names. Notice that we call
displ, a column name in mpg, as an R object
itself. When we pass a data frame in a function, often the columns
are vectors themselves, so we can refer to them in that
way.
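A quick sketch of arguments passed by position versus by name (the object names here are made up for illustration):

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# arguments matched by position: the data frame first, then the rest
first_three <- head(mpg, 3)

# the same call with every argument named explicitly
first_three_named <- head(x = mpg, n = 3)

identical(first_three, first_three_named)

# column names can also be wrapped in helper functions,
# e.g., desc() to sort from largest to smallest
largest_first <- arrange(mpg, desc(displ))
head(largest_first, 3)
```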
filter(mpg, model == "a4")
## # A tibble: 7 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
## 7 audi a4 3.1 2008 6 auto(av) f 18 27 p compa…
In the case of filter, we pass the data frame and a
logical vector which is determined by running a logical operator (e.g.,
== referencing a particular value “a4”) on a vector (e.g.,
the model column). In other words, arguments can also be
equations themselves.
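To make the logical vector idea concrete, here is a sketch that builds the vector on its own first (`is_a4` and `a4_2008` are made-up names):

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# the comparison by itself is just a TRUE/FALSE vector, one value per row
is_a4 <- mpg$model == "a4"
length(is_a4)   # one entry per row of mpg
sum(is_a4)      # TRUE counts as 1, so this counts the matching rows (7)

# multiple conditions separated by commas must all be TRUE
a4_2008 <- filter(mpg, model == "a4", year == 2008)
nrow(a4_2008)   # the three 2008 rows from the table above

# use | for "or"
filter(mpg, model == "a4" | model == "civic")
```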
# add a deviation column (scroll to the right)
mutate(mpg, cty_deviation = cty - mean(cty))
## # A tibble: 234 × 12
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: cty_deviation <dbl>
Here, we are naming (creating) a new column
(cty_deviation) by defining a transformation with a
function. Deviation columns like this calculate how far a variable
(e.g., City Mileage) for each row deviates from the mean across all
rows.
# explicit function for the $ operation
pluck(mpg, "cty")
## [1] 18 21 20 21 16 18 18 18 16 20 19 15 17 17 15 15 17 16 14 11 14 13 12 16 15
## [26] 16 15 15 14 11 11 14 19 22 18 18 17 18 17 16 16 17 17 11 15 15 16 16 15 14
## [51] 13 14 14 14 9 11 11 13 13 9 13 11 13 11 12 9 13 13 12 9 11 11 13 11 11
## [76] 11 12 14 15 14 13 13 13 14 14 13 13 13 11 13 18 18 17 16 15 15 15 15 14 28
## [101] 24 25 23 24 26 25 24 21 18 18 21 21 18 18 19 19 19 20 20 17 16 17 17 15 15
## [126] 14 9 14 13 11 11 12 12 11 11 11 12 14 13 13 13 21 19 23 23 19 19 18 19 19
## [151] 14 15 14 12 18 16 17 18 16 18 18 20 19 20 18 21 19 19 19 20 20 19 20 15 16
## [176] 15 15 16 14 21 21 21 21 18 18 19 21 21 21 22 18 18 18 24 24 26 28 26 11 13
## [201] 15 16 17 15 15 15 16 21 19 21 22 17 33 21 19 22 21 21 21 16 17 35 29 21 19
## [226] 20 20 21 18 19 21 16 18 17
Lastly, sometimes a function needs the name of a particular
column (or value itself) in the data frame. In these cases, you’ll use a
string (rather than an object) in the argument. Above, we use the
function pluck to extract the values from a
column, to return a vector. This is different from select
which returns a column as a tibble. Think about
selecting one of the “special” tibble-vectors
vs. plucking out just the data values.
select(mpg, cty)
## # A tibble: 234 × 1
## cty
## <int>
## 1 18
## 2 21
## 3 20
## 4 21
## 5 16
## 6 18
## 7 18
## 8 18
## 9 16
## 10 20
## # ℹ 224 more rows
Note: in either case, you could also use an integer
k to reference the k-th column instead of
"cty".
Explore these functions. What is the highest or lowest city mileage? Highway mileage? What about this same question for one particular class of vehicle? Consider extracting just a single column from a subset of the data. Come up with questions of your own, and try to answer them using the above functions.
# your code here
Arguably, the most foundational element of tidyverse coding in R is
using the pipe operator. Since R version 4.1, the native pipe operator
is |> (a bar followed by the greater than symbol). This
is an improvement on the earlier pipe operator
(%>%), but as we will see below, there is one particular
use case in which the older version is actually preferred.
The idea is to start with a data frame, and sort of pipe your way
through a routine. You can read each line as “do this” and each
|> as “and then …”.
mpg |> # get data frame
filter(year >= 2000) |> # then, filter it by the year column
pluck("cty") |> # then, extract the "cty" column as a vector
mean() # then, calculate its mean
## [1] 16.70085
The output of each line is passed on to the first argument of the following line (so, you only need to type any second argument and on). Recall, the first argument of almost all tidyverse functions is a data frame or vector!
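To see what the pipe is doing for us, here is a sketch of the same routine written without it (`nested` and `piped` are made-up names). The nested version has to be read from the inside out, which is exactly what the pipe lets us avoid.

```r
library(dplyr)
library(purrr)
library(ggplot2)  # provides the mpg data set

# without the pipe: read from the innermost call outward
nested <- mean(pluck(filter(mpg, year >= 2000), "cty"))

# with the pipe: read from top to bottom
piped <- mpg |>
  filter(year >= 2000) |>
  pluck("cty") |>
  mean()

identical(nested, piped)
```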
Here, we are grouping the data by the class variable,
and for each category in that column, we are calculating (and naming)
two new aggregate columns for the cty and hwy
variables (i.e., their means). The summarise function then returns
a new data frame with one row per group.
mpg |>
group_by(class) |>
summarise(mean_cty = mean(cty),
mean_hwy = mean(hwy))
## # A tibble: 7 × 3
## class mean_cty mean_hwy
## <chr> <dbl> <dbl>
## 1 2seater 15.4 24.8
## 2 compact 20.1 28.3
## 3 midsize 18.8 27.3
## 4 minivan 15.8 22.4
## 5 pickup 13 16.9
## 6 subcompact 20.4 28.1
## 7 suv 13.5 18.1
In general, if you are doing a grouping operation, use
group_by. Below, we have an example of another method, aggregate, but only for demonstration purposes.
Try a few different group-by operations on the mpg
dataset. Can you think of questions you might ask regarding the
available discrete variables? How can you summarize those discrete
variables using group_by-summarise? Try a few
different versions, and try to ask a question which might require you to use
mutate.
# your code here
If the object being passed to the next pipe is meant for an argument
other than the first one, use the alternate pipe operator
%>% in combination with a period . to
signify which argument should use the incoming pipe object.
mpg %>% # -> v <- notice the "."
aggregate(cty ~ class, data = ., FUN = mean)
## class cty
## 1 2seater 15.40000
## 2 compact 20.12766
## 3 midsize 18.75610
## 4 minivan 15.81818
## 5 pickup 13.00000
## 6 subcompact 20.37143
## 7 suv 13.50000
The above routine creates a formula
using the tilde ~ operator. The idea is to define what lies
on the left side in terms of what lies on the right side. In
this case, cty ~ class describes cty in terms of class, and
FUN = mean is applied to the cty values within each
class group. The aggregate function here is
mirroring the mean_cty portion of the (preferred)
group_by operation, above.
Note: The native pipe operator
|> also has placeholder capability (i.e., using _), but it is limited, so we avoid using it.
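Just so you can recognize it if you see it elsewhere, here is a sketch of the native placeholder. This assumes R version 4.2 or later, and the `_` must be attached to a named argument (here, `data`). The object names are made up.

```r
library(ggplot2)  # provides the mpg data set

# the native pipe with its placeholder (R >= 4.2; must be a named argument)
with_placeholder <- mpg |>
  aggregate(cty ~ class, data = _, FUN = mean)

# the same call written without any pipe
without_pipe <- aggregate(cty ~ class, data = mpg, FUN = mean)

identical(with_placeholder, without_pipe)
```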
Sometimes, you might want to apply the same transformation across
multiple columns. We can use across
in a few different ways to do just that. It takes a selection of columns
and one or more functions, and applies each function to each selected
column.
mpg |>
group_by(class) |>
summarise(across(where(is.numeric), mean))
## # A tibble: 7 × 6
## class displ year cyl cty hwy
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2seater 6.16 2004. 8 15.4 24.8
## 2 compact 2.33 2003. 4.60 20.1 28.3
## 3 midsize 2.92 2004. 5.32 18.8 27.3
## 4 minivan 3.39 2003. 5.82 15.8 22.4
## 5 pickup 4.42 2004. 7.03 13 16.9
## 6 subcompact 2.66 2003. 5.03 20.4 28.1
## 7 suv 4.46 2004. 6.97 13.5 18.1
Above, we use across together with where, which
specifically selects columns based on some criterion (usually an
is.* function, such as is.numeric).
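As one more sketch of across, we can select columns by name and hand it a named list of functions. The result gets one column per column-function pair, named (by default) as column_function. The name `class_stats` is made up.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

class_stats <- mpg |>
  group_by(class) |>
  summarise(across(c(cty, hwy),                   # which columns
                   list(mean = mean, sd = sd)))   # which functions

names(class_stats)
class_stats
```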
We use ggplot2 for plotting in R. Below is the basic code blocking for ggplot visualizations:
Data Bindings: This line of code doesn’t really plot anything. We should think of this line of code as the framework for the connection between data and visual elements. In other words, this creates the basis for the coordinate system.
Layers: R adds layers of visual elements
to the coordinate plane. After having determined the data bindings, we
map each datum to a visual element (e.g., a row of data to a filled in
circle). The mapping argument is always paired with the
aes (aesthetic) function to link column names or other
vectors with the same length as the data (e.g., logical outcomes).
An aesthetic is a property for a visual element
bound to a data point. This could be the location of the visual element
(e.g., its x-y coordinates), or it could be color,
size, shape, etc. Aesthetics are
mapped to visual elements.
Outside of the mapping argument, you can
assign a single fixed value to every visual element in a layer (e.g., using
color as its own argument, outside of
aes).
The data argument in ggplot() binds the
same data set across the whole visualization (unless otherwise
specified). In the same way this argument could be moved down to
specific layers, we could also place mapping arguments into
this ggplot() to apply globally across the
visualization.
Formatting: This block contains any formatting we want to apply to the plot. For example, scale, axis labels, etc.
All of these together form the “grammar of graphics” (ggplot).
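The three blocks described above can be sketched as follows. This is my own minimal example (the columns, colors, and labels are arbitrary choices, and `plot_sketch` is a made-up name), not the only way to arrange a ggplot call.

```r
library(ggplot2)

plot_sketch <-
  # 1. Data Bindings: connect a data set to the plot
  ggplot(data = mpg) +
  # 2. Layers: map columns to aesthetics inside aes();
  #    fixed values (like a single color) go outside aes()
  geom_point(mapping = aes(x = displ, y = hwy),
             color = "darkblue") +
  # 3. Formatting: labels, themes, scales, etc.
  labs(title = "Highway Mileage vs. Displacement",
       x = "Displacement (L)", y = "Highway Mileage") +
  theme_minimal()

plot_sketch
```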
(Optional: For each of the plots below, create your own executable R cell, and try to make your own version of the same plot with different data.)
In a visualization, a solid bar does a great job at communicating
“amount” for categorical variables (e.g., a sum or a count). And, after
a quick glance at the structure of this data, a question that comes to
mind is how the manufacturers differ in the number of vehicles
they make in each class.
n_distinct(mpg$manufacturer)
## [1] 15
Before I plot anything, I’m seeing that there are actually more manufacturers than I should be putting on a plot (avoid plotting any more than 7 categorical values). Why don’t we choose the manufacturers with a decent variety of class (i.e., 3 or more).
manufacturers <- mpg |>
group_by(manufacturer) |>
summarize(num_classes = n_distinct(class)) |>
arrange(desc(num_classes)) |>
filter(num_classes >= 3) |> # verify the data before continuing
pluck("manufacturer")
manufacturers
## [1] "toyota" "chevrolet" "dodge" "ford" "nissan"
## [6] "subaru" "volkswagen"
p <- mpg |>
filter(manufacturer %in% manufacturers) |>
ggplot() + # recall, layers/formatting are added (+) to the plot
geom_bar(mapping = aes(x = manufacturer, fill = class)) +
theme_minimal() +
scale_fill_brewer(palette = 'Dark2')
p
Let’s walk through what’s happening in this code …
- The %in% operator returns a boolean (TRUE/FALSE) vector based on whether a value in the first vector is in the second vector.
- We don’t need to write data = mpg in ggplot(), since it’s already instantiated in the pipe.
- The layer is geom_bar, so each data point (count of data with a manufacturer-class combination) is bound to a single bar. Each bar’s aesthetics are mapped to an attribute of its data point.
- By default, geom_bar maps the count of data to the y-axis.
- Each bar has an outline: color would color that outline, and fill is the color of the inside.
- theme_minimal() is a particular theme I like when the grid lines matter.
- We save the plot as an R object, p, to refer to later. You don’t need to do this every time, and you certainly don’t need a different object for each plot, but it’s good to keep in mind.

Which manufacturers are we excluding? For the sake of
comprehensiveness, it would be a good idea to report on these as well.
We use the ! symbol to represent “not”.
mpg |>
filter(!(manufacturer %in% manufacturers)) |>
select(manufacturer) |>
distinct()
## # A tibble: 8 × 1
## manufacturer
## <chr>
## 1 audi
## 2 honda
## 3 hyundai
## 4 jeep
## 5 land rover
## 6 lincoln
## 7 mercury
## 8 pontiac
You might include this list in your report for this particular analysis.
Recall the summary from earlier gives us a very
oversimplified view of how our non-discrete data are distributed.
Histograms give us a full visualization of the distribution from minimum
to maximum.
(Again, we need a question.) Suppose we are particularly interested in the ratio of a typical car’s highway mileage to its city mileage.
mean_ratio <- mean(mpg$hwy / mpg$cty)
mpg |>
mutate(hwy_to_cty = hwy / cty) |>
ggplot() +
geom_histogram(mapping = aes(x = hwy_to_cty), color = 'white') +
geom_vline(xintercept = mean_ratio, color = 'orange') +
annotate("text", # the type of annotation
x = 1.425, y = 24.5, label = "Average", color = 'orange') +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It looks like typically, a car’s highway mileage is about 40% more
than (1.4 times) its city mileage. Notice our use of the annotate
function.
We might also be interested in how this changes between automatic and
manual transmissions. Note, the grepl function returns
TRUE/FALSE values based on whether a string (e.g., “auto”) shows up in
the value(s) in a column.
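A tiny sketch of grepl on its own, using a few transmission values copied from the tables above (`trans_examples` is a made-up name):

```r
trans_examples <- c("auto(l5)", "manual(m5)", "auto(av)")

# one TRUE/FALSE per value: does the string contain "auto"?
grepl("auto", trans_examples)   # TRUE FALSE TRUE
```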
mpg |>
mutate(is_automatic = grepl("auto", trans),
hwy_to_cty = hwy / cty) |>
ggplot() +
geom_density(mapping = aes(x = hwy_to_cty, color = is_automatic)) +
theme_classic() +
scale_color_brewer(palette = "Dark2")
We’ll dive deeper into density plots later in the course, but for now we can think of it as a “smoothed” histogram, helping to give us a sense for the relative (proportions, not counts) distribution of our data. In this plot, we can see that the highway-to-city ratio remains pretty consistent between automatic and manual transmissions.
Suppose we’re interested in the difference in city mileage between the different manufacturers. Box plots (or “box and whisker plots”) are a great way to visualize multiple distributions at once.
mpg |>
filter(manufacturer %in% manufacturers) |>
ggplot() +
geom_boxplot(mapping = aes(x = manufacturer, y = cty)) +
ggtitle("City Mileage by Manufacturer") +
theme_minimal()
For each manufacturer, we have a box with “whiskers” extending from above and below:
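As a rough sketch of what one of these boxes represents, we can compute the pieces by hand for a single manufacturer. This assumes the default settings described in the ggplot2 documentation for geom_boxplot: the box spans the 1st to 3rd quartiles with a line at the median, each whisker reaches the most extreme value within 1.5 times the IQR of the box, and anything beyond is drawn as an individual point. All object names here are made up.

```r
library(dplyr)
library(purrr)
library(ggplot2)  # provides the mpg data set

toyota_cty <- mpg |>
  filter(manufacturer == "toyota") |>
  pluck("cty")

# the box: 1st quartile, median, 3rd quartile
quartiles  <- quantile(toyota_cty, probs = c(0.25, 0.5, 0.75))
box_height <- IQR(toyota_cty)

# the furthest a whisker is allowed to reach
lower_fence <- quartiles[["25%"]] - 1.5 * box_height
upper_fence <- quartiles[["75%"]] + 1.5 * box_height

# whiskers end at the most extreme values inside the fences
inside       <- toyota_cty[toyota_cty >= lower_fence & toyota_cty <= upper_fence]
whisker_ends <- range(inside)

# anything outside the fences would be plotted as its own point
outliers <- toyota_cty[toyota_cty < lower_fence | toyota_cty > upper_fence]

quartiles
whisker_ends
outliers
```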
geom_violin is sometimes a prettier alternative to the box plot, but usually it tends to overcomplicate the visualization.
Now, we might be interested in the relationship between Highway fuel mileage and the displacement (cylinder volume, in liters) of a car’s engine. We’d expect that larger engines have lower fuel mileage.
mpg |>
ggplot() +
geom_point(mapping = aes(x = displ, y = hwy),
color = 'darkblue') +
labs(title = "Highway vs. Displacement",
x = "Cylinder Volume (L)", y = "Highway Mileage") +
theme_classic()
We may want to figure out what is going on with the points that stand out (six points on the far right, and two on the far left).
For this, we are going to import the library ggrepel.
library(ggrepel)
# add a new column for these abnormally high highway mileages
mpg <- mutate(mpg, high_hwy = (hwy > 40 & displ < 2) |
(hwy > 20 & displ > 5))
# start a new ggplot for multiple data set bindings
ggplot(mapping = aes(x = displ, y = hwy)) +
geom_point(data = filter(mpg, !high_hwy), color = 'darkblue') +
geom_point(data = filter(mpg, high_hwy), color = 'darkred') +
geom_text_repel(data = filter(mpg, high_hwy),
mapping = aes(label = model)) +
theme_classic() +
scale_color_brewer(palette = "Set1")
For large data sets, the box plot gives us an interesting view of
relationships between two continuous variables. That is, we can use the
cut_width function to group
a continuous variable into sections, and plot a box plot for each
partition.
mpg |>
ggplot() +
geom_boxplot(mapping = aes(x = displ, y = hwy,
group = cut_width(displ, width = 0.5))) +
labs(title = "Mileage by Displacement") +
theme_minimal()
cut_width here converts our continuous variable into something discrete (i.e., partitions of width 0.5), and group plots the discrete groups over the continuous axis.
This technique is particularly useful for very large data sets where
scatterplots are harder to read. Also, notice our use of the ggplot labs
function for adding different labels to our plot.
Line plots are particularly (and in a way, specifically)
useful for situations where there is a meaningful “movement” along an
x-axis continuum. E.g., time, distance, or points along some process.
And, since our mpg data set only contains two years of
data, we’ll use the economics data set to illustrate this
example. (Take a look at the data with View.)
With line plots, I prefer a particular theme that comes with the
ggthemes package. We’ll need to import that.
library(ggthemes)
economics |>
ggplot() +
geom_line(mapping = aes(x = date, y = unemploy)) +
expand_limits(y = 0) + # sometimes you need to force zero lines
theme_hc() + # this theme is great for line plots
labs(title = "Unemployment in the US Over Time",
subtitle = "Collected on the first of every month",
x = "Year", y = "Unemployment (in thousands)") +
theme(plot.subtitle = element_text(colour = "darkgray"))
Let’s continue with this economics data, and let’s
investigate how the number of unemployed in the US changes with the
country’s median duration of unemployment. But, now let’s overlay a
smooth curve over its scatterplot.
economics |>
# reduce duplicate code by defining the mapping in `ggplot`
ggplot(mapping = aes(x = uempmed, y = unemploy)) +
geom_point(color = "gray") +
geom_smooth(se = FALSE) +
theme_classic() +
labs(title = "Unemployment as Duration Increases",
x = "Median Duration of Unemployment",
y = "Number of Unemployed (in thousands)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Curves like this give us a clean visual understanding for complex
relationships. Feel free to take a look at the ?geom_smooth
documentation to learn about what it’s doing, but please keep in mind
that the methods described will not be covered in this course.
Returning to the mpg data, suppose we are interested in
the relationship between highway mileage and the number of cylinders in
the vehicle, while also taking into consideration the year of the
vehicle. We can convert the integer cylinder and year variables to
factors using as_factor, and plot a special sort of
scatterplot which avoids over-plotting (i.e., overlapping visual
elements).
mpg |>
mutate(year = as_factor(year),
cyl = as_factor(cyl)) |>
ggplot() +
geom_jitter(mapping = aes(x = cyl, y = hwy, color = year),
width = 0.2, height = 0) +
labs(title = "Highway Mileage for Different Cylinder Counts",
x = "Cylinder Count",
y = "Highway Mileage") +
scale_color_brewer(palette = "Dark2") +
theme_minimal()
If you ever decide to use a jitter plot like this, any
jittering must be done along a non-continuous axis. In this
case, we jitter along the x-axis (width), so we set the
vertical jitter (height) to zero. Otherwise, the plot
becomes deceptive.
A quick note on factors:
A factor is a specially coded column in R to represent categories.
Here, we used the as_factor function to convert a column of
categorical values into an official “factor” so R will plot it
correctly. In some cases, you’ll have a column with numbers (factor
levels) indicating categories (factor labels), and we can convert such a
column using the factor function.
cyl_levels <- unique(mpg$cyl)
cyl_labels <- paste(cyl_levels, "cylinders")
mpg |>
mutate(cyl_factor = factor(cyl,
levels = cyl_levels,
labels = cyl_labels)) |>
select(c(cyl, cyl_factor)) |>
sample_n(5)
## # A tibble: 5 × 2
## cyl cyl_factor
## <int> <fct>
## 1 6 6 cylinders
## 2 6 6 cylinders
## 3 6 6 cylinders
## 4 8 8 cylinders
## 5 4 4 cylinders
Lastly, we might be interested in visualizing the same plot for multiple subsets of the data, or for multiple facets of a categorical dimension. Suppose we want to see how the relationship between City mileage and Displacement changes between the different classes of vehicle.
mpg |>
ggplot(mapping = aes(displ, cty)) +
# background
geom_point(data = mutate(mpg, class = NULL), colour = "grey85") +
# foreground
geom_point(color = "darkblue") +
facet_wrap(vars(class)) +
theme_classic() +
labs(title = "City Mileage vs. Displacement Over Class",
x = "Cylinder Displacement (in liters)",
y = "City Mileage")
To get an overlaid plot like this (adapted from a ggplot example) we can just remove the facet column from the “background” plot using
mutate, seen before.
To view this a bit better, you can toy with the number of columns
(the ncol argument of facet_wrap), or click the little icon to view it in a separate
window.
Come up with 3-5 different questions about Miles Per Gallon which
might be addressed with the mpg data. For each question,
scope your data set, and build a plot using the |>
piping code framework.
# your code here
# your code here
pluck vs select
Throughout this notebook, you’ve seen the functions
pluck and select used almost interchangeably.
However, remember, there is an important difference:
pluck returns the vector form of a
column from a data frame (or tibble).
select returns the tibble form of one or
more columns from a data frame (or tibble).
Most functions in the tidyverse require a tibble as input (so
select should usually work), but some functions (built-in
or tidyverse) require a vector, and in these cases pluck is
what you want.
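To round this off, here is a sketch comparing the two side by side (`cty_vec` and `cty_tbl` are made-up names). I also include `pull`, a dplyr function not used elsewhere in these notes, which returns a vector much like pluck does.

```r
library(dplyr)
library(purrr)
library(tibble)
library(ggplot2)  # provides the mpg data set

cty_vec <- pluck(mpg, "cty")   # just the values
cty_tbl <- select(mpg, cty)    # a one-column tibble

is.numeric(cty_vec)   # TRUE: a plain vector of numbers
is_tibble(cty_tbl)    # TRUE: still a table
is.numeric(cty_tbl)   # FALSE: a table is not itself a number vector

# functions like mean() want the vector
mean(cty_vec)

# equivalent ways to get the same vector
identical(cty_vec, mpg$cty)
identical(cty_vec, pull(mpg, cty))

# either function also accepts a column position (cty is the 8th column)
identical(cty_vec, pluck(mpg, 8L))
identical(cty_tbl, select(mpg, 8))
```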