Data and R

Topics covered in these notes:

Note: I do not give comprehensive definitions for many of the functions used in this notebook (or the ones to follow). If the use of a function seems unclear, I encourage you to use the ? functionality in R to learn more about what these functions are doing. Alternatively, refer to the posit Cloud Cheat Sheets, or use Google!

Basic Setup:

  1. Set your working directory to a specific folder. Navigate to a preferred folder in the pane on the lower right, click the “More” button, and select “Set as Working Directory”. Now, any time we want to reference a file in this directory, we only need to use the shortcut ./.
  2. Type “Cmd + ,” or navigate to “Edit” -> “Settings” in RStudio’s main top menu. Select the “Code” tab, and check the box for “use native pipe operator”. Feel free to adjust any other themes and settings if you like.

Data Frames and Tibbles

Our purpose for using R stems from the need to process data so we can do statistics. So, it makes sense to start our introduction to R with the data frame, the structure R uses to store tabular data.

Technically, every data table in R is stored as a collection of vectors (columns), which is then given special ease-of-use methods when stored as a data.frame. When the data frame is then stored as/converted to a tibble, data manipulation and management becomes cleaner, easier, and faster. So, for the remainder of this course, we will ensure all data is stored as a tibble (using as_tibble, if needed), and the terms “tibble” and “data frame” will be used interchangeably.
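As a quick illustration (a minimal sketch; as_tibble comes from the tibble package, which is loaded as part of the tidyverse below), converting a plain data.frame to a tibble looks like this:

# a plain data.frame ...
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
# ... converted to a tibble, which prints more cleanly and works smoothly with tidyverse functions
tbl <- as_tibble(df)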

Loading Data

Before we get started, we will import the libraries needed to run these notes.

library(tidyverse)  # we will see more on this package in a bit
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

A convenient data set which we can continue to use throughout this walk-through is the mpg data set. (This should already have been made available by ggplot2 when we ran the above code block.)

# load the data so we can view it in RStudio
data(mpg)

Once the data is loaded, find it in the Environment panel. Click the arrow (on the left) to view the structure of the data (a.k.a. str(mpg)), and click the small data table icon (on the right) to View a rendering of the table (a.k.a. View(mpg), with a capital V). head() and tail() are also helpful functions for getting a glimpse of a data frame or tibble.

R Objects

R is an “object-oriented” language. This means that when we type a string of characters without quotations around it (e.g., mean, or mpg), we are referencing an R object stored in memory. An R object could be a function (like mean), or a data frame (like mpg), or a “variable” that we define. When I say variable in this context, I simply mean an R object with a name. We use the <- symbol to define R variables:

# code preceded by hashtag is a non-executable *comment*

# watch as you type new object names ... make sure the name isn't already taken
numbers <- c(1, 2, 3, 4)

To distinguish between this kind of variable and a column of data, I’ll use “object” or “R-variable” to denote the former, and simply “variable” to denote the latter.

Note: If you think you might have overwritten a built-in R-variable, use rm(var) to revert.
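For example (a quick sketch using the built-in constant pi):

pi <- 3      # oops, this masks the built-in constant
pi           # returns 3
rm(pi)       # remove our version to "revert"
pi           # back to 3.141593...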

Data Types and Vectors

When you view the structure of the data, you should see an abbreviation (e.g., “chr” or “int”) next to each column name. These are data types, the kind of value that R is storing for that column vector. The basic data types in R are Double, Integer, Character, Logical, Date, Complex, and Raw, plus the special NULL value.

# I use "_" to keep from overwriting existing R objects
double_ <- pi                     # <- assigns new values to objects
integer_ <- 3L                    # notice the L
character_ <- "pi"
logical_ <- TRUE
null_ <- NULL                     # this is essentially a non-existent value
date_ <- as.Date("03-14-1593", 
                 "%m-%d-%Y")      # format is month-day-year to match the string

# these will not be used in this course
complex_ <- 3i + 14               # notice the "i"
raw_ <- charToRaw("pi")           # this stores raw bytes (we won't use it)

typeof(double_)
## [1] "double"

Importantly, a vector (defined using c(val1, val2, …)) can only contain data of one data type (plus NA for missing values); mixing types forces R to coerce everything to a single type. Since any column of a data frame is a vector, this holds true for columns of data as well.

Alternatively, a list (defined using list(val1, val2, ...)) can contain multiple data types. Technically, you can create list-columns, but these (and lists in general) are mostly useful for more advanced data cleaning and manipulation, and we won’t be using either of them in this course.

If we want to be pedantic, the “vectors” and “lists” above are both a type of R vector. The former is non-recursive, and the latter is recursive. Since we are only using non-recursive vectors in this course, we will refer to them simply as “vectors”.
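For example (a small sketch of these rules):

# mixing types in c() coerces everything to one type (here, character)
mixed_vector <- c(1, "two", TRUE)
typeof(mixed_vector)    # "character"

# a list keeps each element's original type
mixed_list <- list(1, "two", TRUE)
str(mixed_list)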

EXERCISE

Instantiate a few different R-variables: one of each data type that will be used in this course. Try running different mathematical operations on each variable (or combinations of variables).

# your code here ...

Working with CSV Files

Supposing we would like to save this data frame (or one like it) as a CSV file, we can do so with the tibble-optimized write_delim, that is:

write_delim(mpg, "./mpg_data.csv", delim = ",")

Similarly, you can load CSV data with

mpg2 <- read_delim("./mpg_data.csv", delim = ",")
## Rows: 234 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): manufacturer, model, trans, drv, fl, class
## dbl (5): displ, year, cyl, cty, hwy
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(“delim” is short for “delimiter”, the character defining columns in the file.)
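Since commas are the most common delimiter, readr also provides write_csv and read_csv as shortcuts, essentially equivalent to the calls above with delim = ",":

write_csv(mpg, "./mpg_data.csv")
mpg2 <- read_csv("./mpg_data.csv")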

Summarizing Data

Much of what we are expected to do as data practitioners is navigate the chaos of our data. With large amounts of data spread across many dimensions (columns), we need to summarize the data into numbers that we can understand. We’ll use the term statistic to denote a number that describes some data. A few helpful “summary statistics” we use often are the mean, median, mode, max, min, and count. Another helpful summary statistic is the quantile. E.g., the 1st and 3rd quartiles (notice the “r”) mark partitions for 25% and 75% of the data. So, between the 1st and 3rd quartile, we have 50% of the data. We can get a glimpse of our data and all of these summary statistics using summary:

summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

We will see later how to summarize character columns.

Another important statistic that is not included above is the variance (or, the standard deviation). The variance (var) is the average squared distance of each number from the mean, and it represents a crucial measure of “statistical dispersion”. The standard deviation (sd) is the square root of this.

Note: Standard Deviation is just one measure of dispersion. Quantiles (quantile) can also provide a good look at how spread out the data is, and the Gini Coefficient (not discussed in this class) provides another.
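For instance, we can compute these dispersion measures directly on one column (here, cty):

var(mpg$cty)                             # variance
sd(mpg$cty)                              # standard deviation (square root of the variance)
quantile(mpg$cty, c(0.25, 0.5, 0.75))    # 1st quartile, median, 3rd quartile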

EXERCISE

Using the mpg data, calculate some of the summary statistics (individually) for each of the numeric columns. Consider weighted.mean(). Is there a column or situation where this might make sense? Investigate different quantiles of the data (i.e., use the quantile function with multiple probability cutoffs).

# your code here

Working in the Tidyverse

The tidyverse is a curated collection of R packages, specifically selected with data science (and statistics) in mind. The packages are designed to work especially well with one another, and seamlessly with base R.

Vocabulary

Term Definition
variable a quantity, quality, or property that you can measure
value the state of a variable when it was measured
observation a set of measurements under the same conditions for a single entity

Tabular data is a table of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own cell, each variable in its own column, and each observation in its own row. In this course, we aim to find data sets that are as tidy as possible.

Useful Functions

Functions are R objects that take an input within parentheses, and execute some operation. The values within the parentheses for a function are called arguments. Typically, functions in the tidyverse have a data frame or column (vector) as their first argument, then subsequent arguments are either named in the function or named by the user.

Examples

# sorting (arranging) data based on the value in a column
arrange(mpg, displ)
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 honda        civic        1.6  1999     4 manu… f        28    33 r     subc…
##  2 honda        civic        1.6  1999     4 auto… f        24    32 r     subc…
##  3 honda        civic        1.6  1999     4 manu… f        25    32 r     subc…
##  4 honda        civic        1.6  1999     4 manu… f        23    29 p     subc…
##  5 honda        civic        1.6  1999     4 auto… f        24    32 r     subc…
##  6 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  7 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 honda        civic        1.8  2008     4 manu… f        26    34 r     subc…
## # ℹ 224 more rows

In the documentation for arrange (and many functions like it), the ... argument denotes variable (column) names. Notice that we refer to displ, a column name in mpg, as if it were an R object itself. When we pass a data frame to a tidyverse function, we can often refer to its columns directly by name in this way.

filter(mpg, model == "a4")
## # A tibble: 7 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
## 7 audi         a4      3.1  2008     6 auto(av)   f        18    27 p     compa…

In the case of filter, we pass the data frame and a logical vector, which is determined by running a logical operator (e.g., == comparing against the particular value “a4”) on a vector (e.g., the model column). In other words, arguments can also be expressions themselves.

# add a deviation column (scroll to the right)
mutate(mpg, cty_deviation = cty - mean(cty))
## # A tibble: 234 × 12
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: cty_deviation <dbl>

Here, we are naming (creating) a new column (cty_deviation) by defining a transformation with a function. Deviation columns like this calculate how far a variable (e.g., City Mileage) for each row deviates from the mean across all rows.

# explicit function for the $ operation
pluck(mpg, "cty")
##   [1] 18 21 20 21 16 18 18 18 16 20 19 15 17 17 15 15 17 16 14 11 14 13 12 16 15
##  [26] 16 15 15 14 11 11 14 19 22 18 18 17 18 17 16 16 17 17 11 15 15 16 16 15 14
##  [51] 13 14 14 14  9 11 11 13 13  9 13 11 13 11 12  9 13 13 12  9 11 11 13 11 11
##  [76] 11 12 14 15 14 13 13 13 14 14 13 13 13 11 13 18 18 17 16 15 15 15 15 14 28
## [101] 24 25 23 24 26 25 24 21 18 18 21 21 18 18 19 19 19 20 20 17 16 17 17 15 15
## [126] 14  9 14 13 11 11 12 12 11 11 11 12 14 13 13 13 21 19 23 23 19 19 18 19 19
## [151] 14 15 14 12 18 16 17 18 16 18 18 20 19 20 18 21 19 19 19 20 20 19 20 15 16
## [176] 15 15 16 14 21 21 21 21 18 18 19 21 21 21 22 18 18 18 24 24 26 28 26 11 13
## [201] 15 16 17 15 15 15 16 21 19 21 22 17 33 21 19 22 21 21 21 16 17 35 29 21 19
## [226] 20 20 21 18 19 21 16 18 17

Lastly, sometimes a function needs the name of a particular column (or value itself) in the data frame. In these cases, you’ll use a string (rather than an object) in the argument. Above, we use the function pluck to extract the values from a column, to return a vector. This is different from select which returns a column as a tibble. Think about selecting one of the “special” tibble-vectors vs. plucking out just the data values.

select(mpg, cty)
## # A tibble: 234 × 1
##      cty
##    <int>
##  1    18
##  2    21
##  3    20
##  4    21
##  5    16
##  6    18
##  7    18
##  8    18
##  9    16
## 10    20
## # ℹ 224 more rows

Note: in either case, you could also use an integer k to reference the k-th column instead of "cty".
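For example (cty happens to be the 8th column of mpg):

pluck(mpg, 8)     # the same vector as pluck(mpg, "cty")
select(mpg, 8)    # the same one-column tibble as select(mpg, cty)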

EXERCISE

Explore these functions. What is the highest or lowest city mileage? Highway mileage? What about this same question for one particular class of vehicle? Consider extracting just a single column from a subset of the data. Come up with questions of your own, and try to answer them using the above functions.

# your code here

Piping

Arguably, the most foundational element of tidyverse coding in R is the pipe operator. Since R version 4.1, R has included the native pipe operator |> (a vertical bar followed by the greater-than symbol). This is an improvement on the earlier magrittr pipe operator (%>%), but as we will see below, there is one particular use case in which the older version is actually preferred.

The idea is to start with a data frame, and sort of pipe your way through a routine. You can read each line as “do this” and each |> as “and then …”.

mpg |>                     # get data frame
  filter(year >= 2000) |>  # then, filter it by the year column
  pluck("cty") |>          # then, select the "cty" column
  mean()                   # then, calculate its mean
## [1] 16.70085

The output of each line is passed on to the first argument of the following line (so, you only need to type any second argument and on). Recall, the first argument of almost all tidyverse functions is a data frame or vector!
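In other words, the pipeline above is just a cleaner way of writing the equivalent nested call (read inside-out):

# same result as the piped version above (16.70085)
mean(pluck(filter(mpg, year >= 2000), "cty"))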

Pipe Example: Grouping

Here, we are grouping the data by the class variable, and for each category in that column, we are calculating (and naming) two new aggregate columns: the means of the cty and hwy variables. The summarise function then collapses each group into a single row of the resulting data frame.

mpg |>
  group_by(class) |>
  summarise(mean_cty = mean(cty),
            mean_hwy = mean(hwy))
## # A tibble: 7 × 3
##   class      mean_cty mean_hwy
##   <chr>         <dbl>    <dbl>
## 1 2seater        15.4     24.8
## 2 compact        20.1     28.3
## 3 midsize        18.8     27.3
## 4 minivan        15.8     22.4
## 5 pickup         13       16.9
## 6 subcompact     20.4     28.1
## 7 suv            13.5     18.1

In general, if you are doing a grouping operation, use group_by. Below, we have an example of another method, aggregate, but only for demonstration purposes.

EXERCISE

Try a few different group-by operations on the mpg dataset. Can you think of questions you might ask regarding the available discrete variables? How can you summarize those discrete variables using group_by-summarise? Try a few different versions, and try to ask a question which might require you to use mutate.

# your code here

Pipe Example: Placeholders

If the object being passed to the next pipe is meant for an argument other than the first one, use the alternate pipe operator %>% in combination with a period . to signify which argument should use the incoming pipe object.

mpg %>%  #                  ->  v  <- notice the "."
  aggregate(cty ~ class, data = ., FUN = mean)
##        class      cty
## 1    2seater 15.40000
## 2    compact 20.12766
## 3    midsize 18.75610
## 4    minivan 15.81818
## 5     pickup 13.00000
## 6 subcompact 20.37143
## 7        suv 13.50000

The above routine creates a formula using the tilde ~ operator. The idea is to define what lies on the left side in terms of what lies on the right side; here, cty is aggregated (using the mean function) within each class group. The aggregate call here mirrors the mean_cty portion of the (preferred) group_by operation above.

Note: The native pipe operator |> also has placeholder capability (i.e., using _), but it is limited, so we avoid using it.
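For completeness, here is what the native-pipe placeholder looks like (this requires R 4.2 or later, and the _ must be passed to a named argument); we will stick with %>% and . for this use case:

mpg |>
  aggregate(cty ~ class, data = _, FUN = mean)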

Pipe Example: Across

Sometimes, you might want to apply the same transformation across multiple columns. We can use across in a few different ways to do just that. It takes a selection of columns and (optionally) one or more functions, and applies those functions to each of the selected columns.

mpg |>
  group_by(class) |>
  summarise(across(where(is.numeric), mean))
## # A tibble: 7 × 6
##   class      displ  year   cyl   cty   hwy
##   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2seater     6.16 2004.  8     15.4  24.8
## 2 compact     2.33 2003.  4.60  20.1  28.3
## 3 midsize     2.92 2004.  5.32  18.8  27.3
## 4 minivan     3.39 2003.  5.82  15.8  22.4
## 5 pickup      4.42 2004.  7.03  13    16.9
## 6 subcompact  2.66 2003.  5.03  20.4  28.1
## 7 suv         4.46 2004.  6.97  13.5  18.1

Above, we pair across with where, which selects columns based on some criterion (usually an is.* function, such as is.numeric).
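across can also take an explicit set of columns and a named list of functions. For example (a quick sketch):

mpg |>
  group_by(class) |>
  summarise(across(c(cty, hwy), list(mean = mean, sd = sd)))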

Basic Plotting

We use ggplot2 for plotting in R. Below is the basic structure of a ggplot visualization:
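(A minimal sketch, using the mpg data and a scatterplot purely for illustration; the three pieces are described below.)

ggplot(data = mpg) +                               # Data Bindings
  geom_point(mapping = aes(x = displ, y = hwy)) +  # Layers
  theme_minimal()                                  # Formatting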

Data Bindings: The first line (the ggplot() call) doesn’t really plot anything. We should think of it as the framework for the connection between data and visual elements. In other words, it creates the basis for the coordinate system.

Layers: Each subsequent layer adds visual elements to the coordinate plane. After having determined the data bindings, we map each datum to a visual element (e.g., a row of data to a filled-in circle). The mapping argument is always paired with the aes (aesthetic) function to link column names, or other vectors with the same length as the data (e.g., logical outcomes), to visual properties.

  • An aesthetic is a property of a visual element bound to a data point. This could be the location of the visual element (e.g., its x-y coordinates), or it could be color, size, shape, etc. Data values are mapped to these aesthetics.

  • Outside of the mapping argument, you can globally define a single value to each visual element (e.g., using color as its own argument, outside of aes.)

  • The data argument in ggplot() binds the same data set across the whole visualization (unless otherwise specified). In the same way this argument could be moved down to specific layers, we could also place mapping arguments into this ggplot() to apply globally across the visualization.

Formatting: This block contains any formatting we want to apply to the plot. For example, scale, axis labels, etc.

All of these together form the “grammar of graphics” (the “gg” in ggplot).

Asking Questions

  1. Start with a question. Before any sort of analysis, we need to start with a well-formed question.
  2. Then, we need to scope our data to specifically capture the necessary information to answer that question.
  3. Finally, we try many different visualizations, making sure to pick the one that seems to present the data most efficiently, without distorting any meaning.

(Optional: For each of the plots below, create your own executable R cell, and try to make your own version of the same plot with different data.)

Bar Plots

In a visualization, a solid bar does a great job of communicating “amount” for categorical variables (e.g., a count or sum). And, after a quick glance at the structure of this data, a question that comes to mind is how manufacturers differ in the mix of vehicle classes they produce.

n_distinct(mpg$manufacturer)
## [1] 15

Before I plot anything, I see that there are actually more manufacturers than I should be putting on a plot (avoid plotting any more than about 7 categorical values). Why don’t we choose the manufacturers with a decent variety of classes (i.e., 3 or more)?

manufacturers <- mpg |>
  group_by(manufacturer) |>
  summarize(num_classes = n_distinct(class)) |>
  arrange(desc(num_classes)) |>
  filter(num_classes >= 3) |>  # verify the data before continuing
  pluck("manufacturer")

manufacturers
## [1] "toyota"     "chevrolet"  "dodge"      "ford"       "nissan"    
## [6] "subaru"     "volkswagen"
p <- mpg |>
  filter(manufacturer %in% manufacturers) |>  
  ggplot() +  # recall, layers/formatting are added (+) to the plot
  geom_bar(mapping = aes(x = manufacturer, fill = class)) +
  theme_minimal() +
  scale_fill_brewer(palette = 'Dark2')

p

Let’s walk through what’s happening in this code …

  1. The %in% operator returns a boolean (TRUE/FALSE) vector based on whether a value in the first vector is in the second vector.
  2. Remember that the first argument of these functions is typically the dataset, so we don’t need to define data = mpg..., since it’s already instantiated in the pipe.
  3. A geom is a “geometric object” in ggplot, and we should think about it as a single visual element bound to each data point. In this case, the geom is a bar, so each data point (the count of data with a manufacturer-class combination) is bound to a single bar. Each bar’s aesthetics are mapped to an attribute of its data point.
    • By default, the geom_bar maps count of data to the y-axis
    • Each bar is actually an outline for an empty rectangle. color would color that outline, and fill is the color of the inside.
  4. I believe formatting each plot is a good habit to get into, even minimally. In this case, theme_minimal() is a particular theme I like when the grid lines matter.
  5. There are scales for color, date, axes, etc. In this case, we’re using a particularly eye-friendly discrete scale that I prefer for situations like this.
  6. Lastly, we store this plot as an object, p, to refer to later. You don’t need to do this every time, and you certainly don’t need a different object for each plot, but it’s good to keep in mind.

Which manufacturers are we excluding? For the sake of comprehensiveness, it would be a good idea to report on these as well. We use the ! symbol to represent “not”.

mpg |>
  filter(!(manufacturer %in% manufacturers)) |>
  select(manufacturer) |>
  distinct()
## # A tibble: 8 × 1
##   manufacturer
##   <chr>       
## 1 audi        
## 2 honda       
## 3 hyundai     
## 4 jeep        
## 5 land rover  
## 6 lincoln     
## 7 mercury     
## 8 pontiac

You might include this list in your report for this particular analysis.

Histograms

Recall the summary from earlier gives us a very oversimplified view of how our non-discrete data are distributed. Histograms give us a full visualization of the distribution from minimum to maximum.

(Again, we need a question.) Suppose we are particularly interested in the ratio between the typical car’s highway mileage to their city mileage.

mean_ratio <- mean(mpg$hwy / mpg$cty)

mpg |>
  mutate(hwy_to_cty = hwy / cty) |>
  ggplot() +
  geom_histogram(mapping = aes(x = hwy_to_cty), color = 'white') +
  geom_vline(xintercept = mean_ratio, color = 'orange') +
  annotate("text",  # the type of annotation
           x = 1.425, y = 24.5, label = "Average", color = 'orange') +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It looks like typically, a car’s highway mileage is about 40% more than (1.4 times) its city mileage. Notice our use of the annotate function.

We might also be interested in how this changes between automatic and manual transmissions. Note, the grepl function returns TRUE/FALSE values based on whether a string (e.g., “auto”) shows up in the value(s) in a column.

mpg |>
  mutate(is_automatic = grepl("auto", trans),
         hwy_to_cty = hwy / cty) |>
  ggplot() +
  geom_density(mapping = aes(x = hwy_to_cty, color = is_automatic)) +
  theme_classic() +
  scale_color_brewer(palette = "Dark2")

We’ll dive deeper into density plots later in the course, but for now we can think of it as a “smoothed” histogram, helping to give us a sense for the relative (proportions, not counts) distribution of our data. In this plot, we can see that the highway-to-city ratio remains pretty consistent between automatic and manual transmissions.

Box Plots

Suppose we’re interested in the difference in city mileage between the different manufacturers. Box plots (or “box and whisker plots”) are a great way to visualize multiple distributions at once.

mpg |>
  filter(manufacturer %in% manufacturers) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = manufacturer, y = cty)) +
  ggtitle("City Mileage by Manufacturer") +
  theme_minimal()

For each manufacturer, we have a box with “whiskers” extending from above and below:

  • The bottom of the box represents the 25th percentile, and the top represents the 75th percentile. So, 50% of the data lies “within” the box. This range is the Interquartile Range, or IQR (we compute it for one manufacturer right after this list).
  • The whiskers extend from the box to the maximum or minimum value, but no further than \(1.5 \times \text{IQR}\) from the top or bottom of the box, respectively.
  • Any points past the whiskers are typically defined as outliers.
  • The horizontal line within the box represents the median value for the data. The lower half of the data lies below this value, and the upper half above it.
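As a quick numeric check, we can compute the quartiles and IQR behind one of these boxes directly (toyota, in this case):

# quartiles and IQR for one manufacturer's cty values (one box in the plot above)
toyota_cty <- mpg |> filter(manufacturer == "toyota") |> pluck("cty")
quantile(toyota_cty, c(0.25, 0.5, 0.75))    # box bottom, median line, box top
IQR(toyota_cty)                             # height of the box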

geom_violin is sometimes a prettier alternative to the box plot, but usually it tends to overcomplicate the visualization.

Scatterplots

Now, we might be interested in the relationship between Highway fuel mileage and the displacement (cylinder volume, in liters) of a car’s engine. We’d expect that larger engines have lower fuel mileage.

mpg |>
  ggplot() +
  geom_point(mapping = aes(x = displ, y = hwy),
             color = 'darkblue') +
  labs(title = "Highway vs. Displacement",
       x = "Cylinder Volume (L)", y = "Highway Mileage") +
  theme_classic()

We may want to figure out what is going on with the points that stand out (six points on the far right, and two on the far left).

For this, we are going to import the library ggrepel.

library(ggrepel)
# add a new column for these abnormally high highway mileages
mpg <- mutate(mpg, high_hwy = (hwy > 40 & displ < 2) |
                              (hwy > 20 & displ > 5))

# start a new ggplot for multiple data set bindings
ggplot(mapping = aes(x = displ, y = hwy)) +
  geom_point(data = filter(mpg, !high_hwy), color = 'darkblue') +
  geom_point(data = filter(mpg, high_hwy), color = 'darkred') +
  geom_text_repel(data = filter(mpg, high_hwy),
                   mapping = aes(label = model)) +
  theme_classic() +
  scale_color_brewer(palette = "Set1")

For large data sets, the box plot gives us an interesting view of relationships between two continuous variables. That is, we can use the cut_width function to group a continuous variable into sections, and plot a box plot for each partition.

mpg |>
  ggplot() +
  geom_boxplot(mapping = aes(x = displ, y = hwy, 
                             group = cut_width(displ, width = 0.5))) +
  labs(title = "Mileage by Displacement") +
  theme_minimal()

cut_width here converts our continuous variable into something discrete (i.e., partitions of width 0.5), and group plots the discrete groups over the continuous axis.

This technique is particularly useful for very large data sets where scatterplots are harder to read. Also, notice our use of the ggplot labs function for adding different labels to our plot.

Line Plots

Line plots are particularly (and in a way, specifically) useful for situations where there is a meaningful “movement” along an x-axis continuum. E.g., time, distance, or points along some process. And, since our mpg data set only contains two years of data, we’ll use the economics data set to illustrate this example. (Take a look at the data with View.)

With line plots, I prefer a particular theme that comes with the ggthemes package. We’ll need to import that.

library(ggthemes)
economics |>
  ggplot() +
  geom_line(mapping = aes(x = date, y = unemploy)) +
  expand_limits(y = 0) +  # sometimes you need to force zero lines 
  theme_hc() +  # this theme is great for line plots 
  labs(title = "Unemployment in the US Over Time",
       subtitle = "Collected on the first of every month",
       x = "Year", y = "Unemployment (in thousands)") +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Smooth Curve Fits

Let’s continue with this economics data, and let’s investigate how the number of unemployed in the US changes with the country’s median duration of unemployment. But, now let’s overlay a smooth curve over its scatterplot.

economics |>
  # reduce duplicate code by defining the mapping in `ggplot`
  ggplot(mapping = aes(x = uempmed, y = unemploy)) +
  geom_point(color = "gray") +
  geom_smooth(se = FALSE) +
  theme_classic() +
  labs(title = "Unemployment as Duration Increases",
       x = "Median Duration of Unemployment",
       y = "Number of Unemployed (in thousands)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Curves like this give us a clean visual understanding for complex relationships. Feel free to take a look at the ?geom_smooth documentation to learn about what it’s doing, but please keep in mind that the methods described will not be covered in this course.

Jittering

Returning to the mpg data, suppose we are interested in the relationship between highway mileage and the number of cylinders in the vehicle, while also taking into consideration the year of the vehicle. We can convert the integer cylinder and year variables to factors using as_factor, and plot a special sort of scatterplot which avoids over-plotting (i.e., overlapping visual elements).

mpg |>
  mutate(year = as_factor(year),
         cyl = as_factor(cyl)) |>
  ggplot() +
  geom_jitter(mapping = aes(x = cyl, y = hwy, color = year),
              width = 0.2, height = 0) +
  labs(title = "Highway Mileage for Different Cylinder Counts",
       x = "Cylinder Count",
       y = "Highway Mileage") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

If you ever decide to use a jitter plot like this, any jittering must be done along a non-continuous axis. In this case, we jitter along the x-axis (width), so we set the vertical jitter (height) to zero. Otherwise, the plot becomes deceptive.

A quick note on factors:

A factor is a specially coded column in R to represent categories. Here, we used the as_factor function to convert a column of categorical values into an official “factor” so R will plot it correctly. In some cases, you’ll have a column with numbers (factor levels) indicating categories (factor labels), and we can convert such a column using the factor function.

cyl_levels <- unique(mpg$cyl)
cyl_labels <- paste(cyl_levels, "cylinders")

mpg |>
  mutate(cyl_factor = factor(cyl,
                             levels = cyl_levels,
                             labels = cyl_labels)) |>
  select(c(cyl, cyl_factor)) |>
  sample_n(5)
## # A tibble: 5 × 2
##     cyl cyl_factor 
##   <int> <fct>      
## 1     6 6 cylinders
## 2     6 6 cylinders
## 3     6 6 cylinders
## 4     8 8 cylinders
## 5     4 4 cylinders

Faceting

Lastly, we might be interested in visualizing the same plot for multiple subsets of the data, or for multiple facets of a categorical dimension. Suppose we want to see how the relationship between City mileage and Displacement changes between the different classes of vehicle.

mpg |>
  ggplot(mapping = aes(displ, cty)) +
  # background
  geom_point(data = mutate(mpg, class = NULL), colour = "grey85") +
  # foreground
  geom_point(color = "darkblue") +
  facet_wrap(vars(class)) +
  theme_classic() +
  labs(title = "City Mileage vs. Displacement Over Class",
       x = "Cylinder Displacement (in liters)",
       y = "City Mileage")

To get an overlaid plot like this (adapted from a ggplot example), we can just remove the facet column from the “background” plot using mutate, as seen before.

To view this a bit better, you can toy with the number of columns ncol, or click the little icon to view it in a separate window.

EXERCISE

Come up with 3-5 different questions about Miles Per Gallon which might be addressed with the mpg data. For each question, scope your data set, and build a plot using the |> piping code framework.

# your code here

pluck vs select

Throughout this notebook, you’ve seen the functions pluck and select used almost interchangeably. However, remember, there is an important difference:

  • pluck returns the vector form of a column from a data frame (or tibble).
    • Think about it as “plucking” the actual data (i.e., the vector) from the column itself.
  • select returns the tibble form of 1 or more columns from a data frame (or tibble).
    • This can “select” multiple columns from a tibble as a tibble.

Most functions in the tidyverse require a tibble as input (so select should usually work), but some functions (built-in or tidyverse) require a vector, and in these cases pluck is what you want.
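A quick way to see the difference:

class(pluck(mpg, "cty"))    # "integer" -- a plain vector
class(select(mpg, cty))     # "tbl_df" "tbl" "data.frame" -- still a tibble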