The Tidyverse is a collection of R packages that are designed to work well together. Their fundamental assumption is that data is represented in a so-called “tidy” format. This basically means that it can be represented by a table in which a row corresponds to an observation and a column corresponds to a variable (more details are available in the “Tidying data” chapter in this workshop and this article).
Here’s what a typical data science workflow looks like:
Data science workflow (adapted from here).
The Tidyverse offers packages that cover all aspects of this workflow. In this course, we will focus on data wrangling (the green box). In addition, we will also discuss importing and visualizing data. Modeling data and communicating results are outside the scope of this workshop.
To get started with the Tidyverse, we need to install both R and RStudio. Although RStudio is technically not required, I strongly recommend it because it makes working with R much more pleasant.
Once you have these programs on your computer, start RStudio and
install the tidyverse package, which is basically a
meta-package consisting of the most important packages from the
Tidyverse. We will also need the nycflights13 and
palmerpenguins packages, because they provide nice datasets
that we will explore.
We will now go through some basic R commands and workflows. Ideally, you should already be familiar with most of these topics. If not, this hopefully serves as a quick tour and should get you up to speed. Note that we show R code in gray boxes and output/results in white boxes throughout the document, for example:
mean(c(1, 2, 3))
[1] 2
You can copy code from a gray box and paste it into the R console (try it out with the previous example).
This workshop is based on selected chapters from the book “R for Data Science” by Hadley Wickham and Garrett Grolemund. An online version is freely available at https://r4ds.had.co.nz/.
R for Data Science.
A package contains additional functions and/or datasets that extend
the capabilities of R. We install a package with the
install.packages() function. In this workshop, we are going
to use the tidyverse, nycflights13, and
palmerpenguins packages. To install them, run the following
commands in the R console (note that package names must be surrounded by
single or double quotes):
install.packages("tidyverse")
install.packages("nycflights13")
install.packages("palmerpenguins")
Alternatively, you can use the “Packages” pane in RStudio (bottom right pane in the default layout) to install/update/uninstall R packages.
Once installed, we need to activate a package with the
library function in each R session. If we don’t activate a
package, we do not have (direct) access to the objects it contains.
Here’s how we activate the packages that we’ve just installed:
library(tidyverse)
library(nycflights13)
library(palmerpenguins)
Note that we can also use functions contained in a package without
activating by prepending the function name with the package name and two
colons. For example, nycflights13::flights would access the
flights data frame without having to activate the package
first. In contrast, library(nycflights13) enables us to
access this data frame with flights directly (no
nycflights13:: prefix necessary).
When R runs commands, it performs all computations in the so-called working directory. R expects all data files that you want to import in this directory (if not otherwise specified). The working directory can be any directory on your computer, and there are several options to change it.
The function getwd() returns the current working
directory. The subtitle of the “Console” pane in RStudio (bottom left in
the default layout) also shows the working directory.
The function setwd() sets the working directory to the
folder passed as an argument, for example
setwd("C:/Users/myuser/R") or equivalently
setwd("~/R") (the tilde symbol is short for the current
user’s home directory). Note that directories need to be separated with
a forward slash / even on Windows (which normally uses a
backslash \). Alternative methods to change the working
directory with RStudio include the “Session – Set Working Directory”
menu and the “More” icon in the “Files” pane (bottom right). Also, if
you double-click an R script in Windows Explorer or macOS Finder,
RStudio will open it and automatically set the working directory to the
corresponding file location.
Never ever set the working directory in a script (see below). Instead, always refer to files with relative paths relative to the script. This guarantees portability of the analysis, because it is very unlikely that another person has the exact same directory that you are trying to set.
Typically, we enter R commands in the console. A prompt symbol
> indicates that R is ready for our input (note that we
do not show the prompt symbol in the gray code boxes). Once we hit the
⏎ (enter or return) key, R will immediately evaluate what we
just typed and print the result. This workflow is called REPL
(read-eval-print loop), and it is a convenient way to interactively work
with R and try out new things. Here’s an example of some commands with
their outputs (try typing these commands into your console):
1 + 9 # this is a comment
[1] 10
x = 1:10 # the value of x is not printed in an assignment
sum(x) # function call
[1] 55
Notice that when creating a new object with the assignment operator
= (or equivalently <-), R does not
automatically print its value. However, it is often useful to assign a
new object and then immediately inspect its value. We could do this with
two lines of code:
x = 1:10
x
[1] 1 2 3 4 5 6 7 8 9 10
A more convenient way is to enclose the assignment in parentheses, which will both create the object and print its value:
(x = 1:10)
[1] 1 2 3 4 5 6 7 8 9 10
Whenever R prints a vector, it automatically prepends the index of
the first element to each line in square brackets. We only saw
[1] in the previous outputs, because the values fitted into
one line. If you print a longer vector, each line will show the index of
its first element:
(y = 1:100)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
[53] 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
[79] 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
The console is nice for trying out commands, fixing errors, and
playing around with code. However, if we want to store a sequence of R
commands for later use, we can put them into so-called R scripts. An R
script is a plain-text file (ending in .R) containing R
commands, usually one command per line. RStudio includes an editor (top
left pane in the default layout), which can be used to write, edit, and
run (parts of) a script.
Importantly, a data analysis project stored as an R script can be run over and over again. This means if another person wants to reproduce your analysis, all you need to do is share your script and data files. The person then runs your entire script, for example by clicking on the “Source” icon, which fully reproduces the entire analysis and all results.
RStudio includes many useful keyboard shortcuts. It really pays off to remember some of them, because your workflow will become faster and more efficient. An overview of all keyboard shortcuts is available in the “Help – Keyboard Shortcuts Help” menu item.
Here are four important shortcuts that I think everyone should use:
|> (more on that
later) is ⌘+⇧+m (macOS) or
Ctrl+⇧+m (Windows and Linux).The most basic (atomic) data type in R is a vector. A vector is a
collection of objects which all have the same type. Even a scalar number
like 1 is really a vector in R. The c()
function creates vectors consisting of one or more elements
(c is short for “concatenate”). The length()
function returns the number of elements in a vector.
Important data types include numeric vectors, character vectors,
logical vectors, factors, and datetime vectors. We can use the
class() function to determine the type of a given object.
Here are some examples:
x = 1
class(x)
[1] "numeric"
length(x)
[1] 1
y = c(4, 5.6, -7)
class(y)
[1] "numeric"
length(y)
[1] 3
c("Hello", "world!") # character
[1] "Hello" "world!"
c(TRUE, FALSE) # logical
[1] TRUE FALSE
y > 4 # a comparison evaluates to a logical vector
[1] FALSE TRUE FALSE
factor(c("A", "A", "B", "A", "C", "C", "A", "B")) # factor with three levels
[1] A A B A C C A B
Levels: A B C
as.Date(c("17.3.2020", "22.5.2020", "3.3.2021"), format="%d.%m.%Y") # datetime
[1] "2020-03-17" "2020-05-22" "2021-03-03"
We can access individual elements of a vector using square brackets containing the indexes of all elements we would like to access:
x = 1:10
x[5] # fifth element
[1] 5
x[c(7, 1, 4)] # elements with index 7, 1, and 4
[1] 7 1 4
x[x >= 5] # all elements >= 5
[1] 5 6 7 8 9 10
A data frame is a list of vectors (with identical lengths). In other words, it represents a table consisting of rows and columns for storing rectangular data.
(df = data.frame(x=1:5, y=c(6, -9.5, 166, 8.8, 0.112), z=c("A", "X", "X", "B", "A")))
x y z
1 1 6.000 A
2 2 -9.500 X
3 3 166.000 X
4 4 8.800 B
5 5 0.112 A
The Tidyverse package tibble provides an improved data
frame type called tibble. A tibble is a drop-in replacement for a data
frame, so we can use tibbles (almost) everywhere data frames are
expected.
tibble::tibble(x=1:5, y=c(6, -9.5, 166, 8.8, 0.112), z=c("A", "X", "X", "B", "A"))
# A tibble: 5 × 3
x y z
<int> <dbl> <chr>
1 1 6 A
2 2 -9.5 X
3 3 166 X
4 4 8.8 B
5 5 0.112 A
Note how data frames and tibbles print differently in the previous
examples. Tibbles are better readable and include their dimension
(# A tibble: 5 x 3) as well as column data types
(<int>, <dbl>, and
<chr>, which is short for integer, double, and
character). The str() function shows a convenient summary
of the structure of a given data frame, which also contains the column
data types:
str(df)
'data.frame': 5 obs. of 3 variables:
$ x: int 1 2 3 4 5
$ y: num 6 -9.5 166 8.8 0.112
$ z: chr "A" "X" "X" "B" ...
RStudio offers a nice integrated data frame viewer: the
View() function visualizes any data frame or tibble in a
spreadsheet-like table. For example, the previously created data frame
df can be viewed with View(df).
A function performs some pre-defined computations on (optional) input
arguments and (optionally) returns a result. We routinely call functions
that have been defined elsewhere, for example the c(),
class(), and length() functions. A pair of
parentheses () after a function name indicates that we are
calling that function. We can also define our own functions, but this is
outside the scope of this workshop.
Here are some examples for function calls:
c(1, 2, 3) # 3 arguments
[1] 1 2 3
class("A") # 1 argument
[1] "character"
length(c(4, 5, 6)) # two (nested) function calls
[1] 3
The last example shows two nested function calls. First, we call the
c() function with three arguments, which we use as an
argument in the length() function call. R tries to reduce
all expressions to a value, so it works its way from the innermost layer
to the outermost one. Therefore, a nested function call is really two
function calls in the following order:
(tmp = c(4, 5, 6))
[1] 4 5 6
length(tmp)
[1] 3
R represents missing values as NA (“not available”).
Missing values are contagious, which means that calculations involving
missing values will result in NA. This makes sense if you
think of a missing value as “unknown” (we don’t know what the value is).
Here are some examples:
NA + 1
[1] NA
NA > 0
[1] NA
1 == NA
[1] NA
NA / 2
[1] NA
Even comparing NA with NA is again
NA:
NA == NA
[1] NA
Let’s compute the mean of some numbers involving one missing value:
(x = c(25, 50, NA, 100))
[1] 25 50 NA 100
mean(x)
[1] NA
The mean is also unknown, because we cannot compute it because of the
missing value. However, almost all aggregation functions support the
na.rm argument, which by default (FALSE) does
not remove missing values. If we set na.rm=TRUE,
all missing values are removed before the actual value is computed:
mean(x, na.rm=TRUE)
[1] 58.33333
To find out which values in a vector are missing, we can use the
is.na() function:
is.na(x)
[1] FALSE FALSE TRUE FALSE
You can view the documentation for any object by prepending a
? to the object name. For example, ?length
shows the help page for the length function. You can also
press F1 to display help for the object at the current cursor
location.
tidyverse, nycflights13, and
palmerpenguins packages. After that, check if you have the
packages readr, dplyr, ggplot2,
and tidyr installed.tidyverse-workshop in your home directory (use
Windows Explorer, macOS Finder or the “Files” pane in RStudio to
navigate and create the directory). Then set the working directory to
this folder. Check again if the current working directory now points to
the correct location.r and compute the corresponding areas with one
command!rnorm(). Then extract all
positive numbers from this random vector – how many elements are
positive?mtcars data
frame have? What are the column data types? What does the
drat column represent? ©
2022, Clemens Brunner