Introduction

Tidyverse

The Tidyverse is a collection of R packages that are designed to work well together. Their fundamental assumption is that data is represented in a so-called “tidy” format. This basically means that it can be represented by a table in which a row corresponds to an observation and a column corresponds to a variable (more details are available in the “Tidying data” chapter in this workshop and this article).
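
As a rough illustration of the tidy format, the following base R sketch stores the same measurements first in a wide layout and then in a tidy layout with one row per observation. The table, its column names, and its values are made up for this example and are not part of the workshop data:

```r
# Hypothetical example data: case counts per country and year (wide layout).
# The years are spread across columns, so each row holds two observations.
wide = data.frame(country=c("A", "B"), y2019=c(10, 20), y2020=c(11, 21))

# Tidy layout: each row is one observation (a country in a given year),
# each column is one variable (country, year, cases).
tidy = data.frame(
    country=rep(wide$country, times=2),
    year=rep(c(2019, 2020), each=2),
    cases=c(wide$y2019, wide$y2020)
)
tidy
```

The tidy version has four rows (two countries times two years) and three columns. The manual construction here only serves to show the target layout.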

Here’s what a typical data science workflow looks like:

Data science workflow (adapted from here).

The Tidyverse offers packages that cover all aspects of this workflow. In this workshop, we will focus on data wrangling (the green box). In addition, we will also discuss importing and visualizing data. Modeling data and communicating results are outside the scope of this workshop.

Prerequisites

To get started with the Tidyverse, we need to install both R and RStudio. Although RStudio is technically not required, I strongly recommend it because it makes working with R much more pleasant.

Once you have these programs on your computer, start RStudio and install the tidyverse package, which is basically a meta-package consisting of the most important packages from the Tidyverse. We will also need the nycflights13 and palmerpenguins packages, because they provide nice datasets that we will explore.

We will now go through some basic R commands and workflows. Ideally, you should already be familiar with most of these topics. If not, this hopefully serves as a quick tour and should get you up to speed. Note that we show R code in gray boxes and output/results in white boxes throughout the document, for example:

mean(c(1, 2, 3))

[1] 2

You can copy code from a gray box and paste it into the R console (try it out with the previous example).

This workshop is based on selected chapters from the book “R for Data Science” by Hadley Wickham and Garrett Grolemund. An online version is freely available at https://r4ds.had.co.nz/.

R for Data Science.

R basics

Packages

A package contains additional functions and/or datasets that extend the capabilities of R. We install a package with the install.packages() function. In this workshop, we are going to use the tidyverse, nycflights13, and palmerpenguins packages. To install them, run the following commands in the R console (note that package names must be surrounded by single or double quotes):

install.packages("tidyverse")
install.packages("nycflights13")
install.packages("palmerpenguins")

Alternatively, you can use the “Packages” pane in RStudio (bottom right pane in the default layout) to install/update/uninstall R packages.

Once installed, we need to activate a package with the library() function in each R session. If we don’t activate a package, we do not have (direct) access to the objects it contains. Here’s how we activate the packages that we’ve just installed:

library(tidyverse)
library(nycflights13)
library(palmerpenguins)

Note that we can also use objects contained in a package without activating it by prefixing the object name with the package name and two colons. For example, nycflights13::flights accesses the flights data frame without having to activate the package first. In contrast, library(nycflights13) enables us to access this data frame with flights directly (no nycflights13:: prefix necessary).
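
The double-colon syntax works for any installed package. Here is a minimal sketch using the tools package, which ships with R but is normally not activated in a fresh session (the file name analysis.R is just an illustrative string):

```r
# Call a function through its package prefix, no library() call needed
tools::file_ext("analysis.R")  # returns the file extension "R"

# search() lists everything that is currently activated (attached);
# using tools:: above does not add the package to this list
search()
```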

Working directory

When R runs commands, it uses the so-called working directory as its reference location: relative file paths are resolved against it, and R expects all data files that you want to import in this directory (if not otherwise specified). The working directory can be any directory on your computer, and there are several options to change it.

The function getwd() returns the current working directory. The subtitle of the “Console” pane in RStudio (bottom left in the default layout) also shows the working directory.

The function setwd() sets the working directory to the folder passed as an argument, for example setwd("C:/Users/myuser/R") or setwd("~/R"). The tilde symbol is short for the current user’s home directory; note that on Windows, R usually expands it to the user’s Documents folder, so the two calls are not necessarily equivalent there. Directories need to be separated with a forward slash / even on Windows (which normally uses a backslash \). Alternative methods to change the working directory with RStudio include the “Session – Set Working Directory” menu and the “More” icon in the “Files” pane (bottom right). Also, if you double-click an R script in Windows Explorer or macOS Finder, RStudio will open it and automatically set the working directory to the corresponding file location.

Never ever set the working directory in a script (scripts are introduced below). Instead, always refer to files with paths relative to the location of the script. This guarantees portability of the analysis, because it is very unlikely that another person has the exact same directory that you are trying to set.
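
Here is a short sketch of how relative paths interact with the working directory. The folder data and the file flights.csv are hypothetical names used only for illustration, so file.exists() will most likely return FALSE on your machine:

```r
# Where does the tilde point to on this machine?
path.expand("~")

# Build a relative path to a (hypothetical) data file;
# file.path() joins the parts with forward slashes
path = file.path("data", "flights.csv")
path

# A relative path is resolved against the current working directory
file.path(getwd(), path)

# Check whether the file exists before trying to import it
file.exists(path)
```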

R code

Typically, we enter R commands in the console. A prompt symbol > indicates that R is ready for our input (note that we do not show the prompt symbol in the gray code boxes). Once we hit the ⏎ (enter or return) key, R will immediately evaluate what we just typed and print the result. This workflow is called REPL (read-eval-print loop), and it is a convenient way to interactively work with R and try out new things. Here’s an example of some commands with their outputs (try typing these commands into your console):

1 + 9  # this is a comment

[1] 10

x = 1:10  # the value of x is not printed in an assignment
sum(x)  # function call

[1] 55

Notice that when creating a new object with the assignment operator = (or equivalently <-), R does not automatically print its value. However, it is often useful to assign a new object and then immediately inspect its value. We could do this with two lines of code:

x = 1:10
x

 [1]  1  2  3  4  5  6  7  8  9 10

A more convenient way is to enclose the assignment in parentheses, which will both create the object and print its value:

(x = 1:10)

 [1]  1  2  3  4  5  6  7  8  9 10

Whenever R prints a vector, it automatically prepends the index of the first element to each line in square brackets. We only saw [1] in the previous outputs, because the values fit on one line. If you print a longer vector, each line will show the index of its first element:

(y = 1:100)

  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26
 [27]  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52
 [53]  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78
 [79]  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

The console is nice for trying out commands, fixing errors, and playing around with code. However, if we want to store a sequence of R commands for later use, we can put them into so-called R scripts. An R script is a plain-text file (ending in .R) containing R commands, usually one command per line. RStudio includes an editor (top left pane in the default layout), which can be used to write, edit, and run (parts of) a script.

Importantly, a data analysis project stored as an R script can be run over and over again. This means if another person wants to reproduce your analysis, all you need to do is share your script and data files. The person then runs your entire script, for example by clicking on the “Source” icon, which fully reproduces the entire analysis and all results.
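
To sketch what running a script means, the following example writes a tiny two-line script to a temporary file and runs it with the source() function, which is essentially what the “Source” icon does. The temporary file is only a stand-in for a real .R file that you would create in the editor:

```r
# Create a small script in a temporary file (stand-in for a real .R file)
script = tempfile(fileext=".R")
writeLines(c("x = 1:10", "total = sum(x)"), script)

# Run all commands in the script from top to bottom
source(script)

# Objects created by the script are now available in the session
total
```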

RStudio keyboard shortcuts

RStudio includes many useful keyboard shortcuts. It really pays off to remember some of them, because your workflow will become faster and more efficient. An overview of all keyboard shortcuts is available in the “Help – Keyboard Shortcuts Help” menu item.

Here are four important shortcuts that I think everyone should use:

The ↑ and ↓ arrow keys access the command history in the console. You can also edit any command before running it again.
If you are searching for a previously entered command starting with specific characters, enter the characters in the console and press ⌘+↑ on macOS or Ctrl+↑ on Windows and Linux.
Hitting ⌘+⏎ (macOS) or Ctrl+⏎ (Windows and Linux) in the editor runs the command under the cursor in the console.
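The shortcut for the pipe operator |> (more on that later) is ⌘+⇧+m (macOS) or Ctrl+⇧+m (Windows and Linux). By default, RStudio inserts the older %>% pipe with this shortcut; to get |>, enable the “Use native pipe operator” option in the “Tools – Global Options – Code” settings.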

Vectors

The most basic (atomic) data type in R is a vector. A vector is a collection of objects which all have the same type. Even a scalar number like 1 is really a vector in R. The c() function creates vectors consisting of one or more elements (c is short for “concatenate”). The length() function returns the number of elements in a vector.

Important data types include numeric vectors, character vectors, logical vectors, factors, and date vectors. We can use the class() function to determine the type of a given object. Here are some examples:

x = 1
class(x)

[1] "numeric"

length(x)

[1] 1

y = c(4, 5.6, -7)
class(y)

[1] "numeric"

length(y)

[1] 3

c("Hello", "world!")  # character

[1] "Hello"  "world!"

c(TRUE, FALSE)  # logical

[1]  TRUE FALSE

y > 4  # a comparison evaluates to a logical vector

[1] FALSE  TRUE FALSE

factor(c("A", "A", "B", "A", "C", "C", "A", "B"))  # factor with three levels

[1] A A B A C C A B
Levels: A B C

as.Date(c("17.3.2020", "22.5.2020", "3.3.2021"), format="%d.%m.%Y")  # date

[1] "2020-03-17" "2020-05-22" "2021-03-03"
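
Because all elements of a vector must have the same type, R silently converts the elements to a common type when we mix different types in c(). The following brief sketch shows this behavior:

```r
class(c(1, 2.5))        # "numeric"
class(c(1, TRUE))       # "numeric" (TRUE is converted to 1)
class(c(1, "a", TRUE))  # "character" (everything is converted to text)

c(1, "a", TRUE)         # note the quotes around every element in the output
```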

We can access individual elements of a vector using square brackets containing the indexes of all elements we would like to access:

x = 1:10
x[5]  # fifth element

[1] 5

x[c(7, 1, 4)]  # elements with index 7, 1, and 4

[1] 7 1 4

x[x >= 5]  # all elements >= 5

[1]  5  6  7  8  9 10
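
Square brackets support a few more indexing variants that are worth knowing. Here is a small sketch with the same vector:

```r
x = 1:10

x[2:4]          # a range of positions: 2 3 4
x[-1]           # negative index: everything except the first element
x[-c(1, 2)]     # everything except the first two elements
x[x %% 2 == 0]  # logical condition: all even numbers (2 4 6 8 10)
```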

Data frames

A data frame is a list of vectors (with identical lengths). In other words, it represents a table consisting of rows and columns for storing rectangular data.

(df = data.frame(x=1:5, y=c(6, -9.5, 166, 8.8, 0.112), z=c("A", "X", "X", "B", "A")))

  x       y z
1 1   6.000 A
2 2  -9.500 X
3 3 166.000 X
4 4   8.800 B
5 5   0.112 A
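
Since a data frame is a list of vectors, we can pull out individual columns and work with them like any other vector. A minimal sketch using the data frame from above (recreated here so that the example runs on its own):

```r
df = data.frame(x=1:5, y=c(6, -9.5, 166, 8.8, 0.112), z=c("A", "X", "X", "B", "A"))

is.list(df)  # TRUE, because a data frame is a list of columns
names(df)    # column names: "x" "y" "z"
df$y         # extract column y as a vector
df[["z"]]    # same idea using double square brackets
df$y[3]      # third element of column y: 166
```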

The Tidyverse package tibble provides an improved data frame type called tibble. A tibble is a drop-in replacement for a data frame, so we can use tibbles (almost) everywhere data frames are expected.

tibble::tibble(x=1:5, y=c(6, -9.5, 166, 8.8, 0.112), z=c("A", "X", "X", "B", "A"))

# A tibble: 5 × 3
      x       y z    
  <int>   <dbl> <chr>
1     1   6     A    
2     2  -9.5   X    
3     3 166     X    
4     4   8.8   B    
5     5   0.112 A

Note how data frames and tibbles print differently in the previous examples. Tibbles are more readable and include their dimensions (# A tibble: 5 × 3) as well as column data types (<int>, <dbl>, and <chr>, which are short for integer, double, and character). The str() function shows a convenient summary of the structure of a given data frame, which also contains the column data types:

str(df)

'data.frame':   5 obs. of  3 variables:
 $ x: int  1 2 3 4 5
 $ y: num  6 -9.5 166 8.8 0.112
 $ z: chr  "A" "X" "X" "B" ...

RStudio offers a nice integrated data frame viewer: the View() function visualizes any data frame or tibble in a spreadsheet-like table. For example, the previously created data frame df can be viewed with View(df).

Functions

A function performs some pre-defined computations on (optional) input arguments and (optionally) returns a result. We routinely call functions that have been defined elsewhere, for example the c(), class(), and length() functions. A pair of parentheses () after a function name indicates that we are calling that function. We can also define our own functions, but this is outside the scope of this workshop.

Here are some examples of function calls:

c(1, 2, 3)  # 3 arguments

[1] 1 2 3

class("A")  # 1 argument

[1] "character"

length(c(4, 5, 6))  # two (nested) function calls

[1] 3

The last example shows two nested function calls. First, we call the c() function with three arguments; its result then serves as the argument in the length() function call. R tries to reduce all expressions to a value, so it works its way from the innermost layer to the outermost one. Therefore, a nested function call is really two function calls in the following order:

(tmp = c(4, 5, 6))

[1] 4 5 6

length(tmp)

[1] 3
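
For comparison, the same computation can be written with the pipe operator |> mentioned in the keyboard shortcuts section. This is only a brief sketch and assumes R 4.1 or later, where this operator is available. The pipe passes the value on its left as the first argument to the function call on its right:

```r
# Nested call: read from the inside out
length(c(4, 5, 6))

# Piped call: read from left to right
c(4, 5, 6) |> length()
```

Both lines return 3.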

Missing values

R represents missing values as NA (“not available”). Missing values are contagious, which means that calculations involving missing values generally result in NA. This makes sense if you think of a missing value as “unknown” (we don’t know what the value is). Here are some examples:

NA + 1

[1] NA

NA > 0

[1] NA

1 == NA

[1] NA

NA / 2

[1] NA

Even comparing NA with NA is again NA:

NA == NA

[1] NA
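
There are a few exceptions to this rule: if the result is the same no matter what the unknown value is, R returns that result instead of NA. A brief sketch:

```r
NA | TRUE   # TRUE: "unknown or TRUE" is TRUE either way
NA & FALSE  # FALSE: "unknown and FALSE" is FALSE either way
NA^0        # 1: any number to the power of 0 is 1
NA & TRUE   # NA: here the result depends on the unknown value
```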

Let’s compute the mean of some numbers involving one missing value:

(x = c(25, 50, NA, 100))

[1]  25  50  NA 100

mean(x)

[1] NA

The mean is also unknown, because it cannot be computed while one of the values is missing. However, almost all aggregation functions support the na.rm argument. Its default value is FALSE, which means that missing values are not removed. If we set na.rm=TRUE, all missing values are removed before the result is computed:

mean(x, na.rm=TRUE)

[1] 58.33333
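
The na.rm argument works the same way in other aggregation functions. Here is a short sketch with the same vector:

```r
x = c(25, 50, NA, 100)

sum(x)                 # NA, because of the missing value
sum(x, na.rm=TRUE)     # 175
max(x, na.rm=TRUE)     # 100
median(x, na.rm=TRUE)  # 50
```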

To find out which values in a vector are missing, we can use the is.na() function:

is.na(x)

[1] FALSE FALSE  TRUE FALSE
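
The logical vector returned by is.na() is useful for counting, locating, and removing missing values. The following sketch relies on the fact that TRUE counts as 1 and FALSE as 0 in a sum:

```r
x = c(25, 50, NA, 100)

sum(is.na(x))    # number of missing values: 1
which(is.na(x))  # position of the missing value: 3
x[!is.na(x)]     # keep only the non-missing values: 25 50 100
```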

Help

You can view the documentation for any object by prepending a ? to the object name. For example, ?length shows the help page for the length() function. You can also press F1 to display help for the object at the current cursor location.

Exercises

Install the tidyverse, nycflights13, and palmerpenguins packages. After that, check if you have the packages readr, dplyr, ggplot2, and tidyr installed.
What is your current working directory? Create a new directory called tidyverse-workshop in your home directory (use Windows Explorer, macOS Finder or the “Files” pane in RStudio to navigate and create the directory). Then set the working directory to this folder. Check again if the current working directory now points to the correct location.
Compute the areas of circles with radii 5, 7, 19, and \(\pi^{-\frac{1}{2}}\). Put all radii into a vector r and compute the corresponding areas with one command!
Create a vector with 100 random numbers drawn from a standard normal distribution using the function rnorm(). Then extract all positive numbers from this random vector – how many elements are positive?
How many rows and columns does the built-in mtcars data frame have? What are the column data types? What does the drat column represent?