\[\\[0.25in]\]
Tidyverse is a large family of packages used to enhance R’s capacity for data science.
This is the main website: https://www.tidyverse.org/
\[\\[0.1in]\]
This workthru focuses on the two packages used in data wrangling,
tidyr and dplyr (pronounced ‘d plier’ like the
hardware tool), and one for importing/exporting your data
(readr). Note that plyr is the full and more
expanded (and slower) version of dplyr.
The other tidyverse packages that not included in this demo are:
ggplot (for data visualization), Tibble (small
package for working better with data frames), forcats
(helps with factors and categorical data), stringr (for
working with strings), purrr (functional programming,
e.g. functions and vectors), and some other smaller ones designed for
more specific uses.
\[\\[0.1in]\]
There are many cheatsheets, I recommend these Posit ones: https://rstudio.github.io/cheatsheets/
Or this great website: https://r4ds.had.co.nz/introduction.html
\[\\[0.25in]\]
You can install the whole tidyverse:
install.packages("tidyverse")
Or alternatively, install just the ones you want:
install.packages("tidyr")
install.packages("dplyr")
install.packages("readr")
And then library them:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.4 ✔ purrr 1.0.2
## ✔ tibble 3.2.1 ✔ dplyr 1.1.3
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 1.4.0 ✔ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# OR #
library(tidyr)
library(dplyr)
library(readr)
\[\\[0.25in]\]
readr):You can import many file types using tidyr, including: tab delimited
(read_tsv), comma delimited (read_csv),
semicolon delimited (read_csv2). Use
read_delim for any delimiter - you can either tell it which
one to use or it’ll guess.
\[\\[0.05in]\]
(these will give errors if you run them, they’re only examples):
read_csv("MyData.csv")read_csv("MyData.csv", col_names = FALSE)read_csv("MyData.csv", col_names = c("A", "B", "C"))\[\\[0.1in]\]
Importing can get more complicated if you have the need, for example you can use general expressions while importing too:
Import <- read.csv("MyData.csv", header = F, col.names = paste("V",1:50))
\[\\[0.05in]\]
Here’s a fun trick to not hard-code your file or file path by opening your native file directory:
MyChoice <- read.csv(file.choose())
\[\\[0.05in]\]
One way to read many files at once:
filepath <- "/FakePath/Samples/"
file_list <- dir("/FakePath/Samples/", pattern = "*.csv")
all_csv <- lapply(paste0(filepath, file_list), read.csv, header = FALSE)
\[\\[0.1in]\]
Likewise, to save your data you can use: write_csv,
write_csv2, write_tsv, and
write_delim.
For example,
write.csv(x = output_dataframe, file = "Output.csv", row.names = FALSE)
Where x is the R object you’re saving, file
is the name of your exported file (note you can include a file path if
you want it in a different directory), and then there are other
arguments you can include.
\[\\[0.25in]\]
Why is it called ‘tidy’? Tidyverse packages make a few assumptions about your data, namely that each variable is it’s own column and each sample is it’s own row. If you have your data set up like this, then the relationships will always be preserved no matter what transformations happen. This may seem like basic data hygiene, but it’s worth keeping in mind when working with large an/or unusual data.
\[\\[0.1in]\]
The tidyverse has some unique syntax and operations, although there are really only two unique things you need to know beyond just how the function exppects things defined.
When working with tidyverse you’ll often need to use
%>%, which is called a pipe. This character is used
during a sequence of actions. It’s best not to involve more than one
object, and if there are intermediate objects they should be named
individually. The pipe should always have a space in front of it, and
most folks prefer to have white space afterwards (e.g. starting a new
line) if it’s part of a large code block.
\[\\[0.1in]\]
We’ll go over the different functions (like separate
here) later on, but here is an example of using a pipe.
NewDF <- OldDF %>% separate(col = MyColumn, into = c('A', 'B'), sep= "_")
The line above takes a dataframe, OldDF, and splits the
column MyColumn by an underscore into two new columns
(A and B) - and saves it as a new dataframe
(NewDF).
\[\\[0.1in]\]
You can use the %>% or including the data in the
function can be equivalent, these next two lines produce the exact same
result (we’ll cover the weird dplyr:: next and the
count later). It’s best you use the pipe for longer
scripts.
MyDataset %>% dplyr::count(variable)
dplyr::count(MyDataset, variable)
\[\\[0.1in]\]
The tidyverse is large, and there are sometimes overlapping function
names both within it and among other packages. When this happens, you’ll
need to specify which package contains the function you’re trying to
call. To do this you write package::functionname.
Here is an example where some unknown package and dplyr
both have a function called group_by, but I specifically
want dplyr’s version:
OldDF %>% dplyr::group_by(MyVariable) %>% summarise
\[\\[0.1in]\]
(ignore this for the most part)
names1 <- c("A_1", "B_1", "C_1", "D_1", "E_1")
names2 <- c("W_2", "X_2", "Y_2", "Z_2", "ZZ_2")
numbers1 <- c(1, 2, 3, 4, 5)
numbers2 <- c(1, 2, 3, 4, 6)
simple1 <- t(rbind(numbers1, names1))
simple1 <- as.data.frame(simple1)
simple1$numbers1 <- as.numeric(simple1$numbers1)
simple2 <- t(rbind(numbers2, names2))
simple2 <- as.data.frame(simple2)
simple2$numbers2 <- as.numeric(simple2$numbers2)
colnames(simple1) <- c("numbers", "letters")
colnames(simple2) <- c("numbers", "letters")
\[\\[0.5in]\]
tidyr and dplyr):There are quite a few ways to view your data - especially if you’re using RSTudio. Below are a few lines you can run to take a peak:
#Base R:
str(simple1) # Look at the structure
## 'data.frame': 5 obs. of 2 variables:
## $ numbers: num 1 2 3 4 5
## $ letters: chr "A_1" "B_1" "C_1" "D_1" ...
# Using tidyverse:
dplyr::glimpse(simple1)
## Rows: 5
## Columns: 2
## $ numbers <dbl> 1, 2, 3, 4, 5
## $ letters <chr> "A_1", "B_1", "C_1", "D_1", "E_1"
\[\\[0.1in]\]
Combining data sets is an important but tricky step in any workflow. Tidyverse is great for it - I often defer to the Posit cheatsheet which has it all color-coded to make sure I choose the right one.
For this exercise, I’ve made two simple data sets with just a number column and a letter column.
head(simple1)
head(simple2)
\[\\[0.1in]\]
Let’s treat the numbers as the same sample across the two sheets and the letters as different variables, and explore how we can combine them.
Note that the by function was the variable they were
indexed and combined by - if I had put letters instead of
numbers nothing would have combined. The .x
and .y are to keep track which dataset they came from
automatically - if the columns had unique names which would not have
happened.
dplyr::left_join(simple1, simple2, by = "numbers") # Matches 2 to 1
dplyr::right_join(simple1, simple2, by = "numbers") # Matches 1 to 2
dplyr::inner_join(simple1, simple2, by = "numbers") # Only keeps rows that are shared
dplyr::full_join(simple1, simple2, by = "numbers") # Keeps all data
\[\\[0.1in]\]
Let’s treat each one as different parts of the same set, and go over different ways to combine them:
dplyr::intersect(simple1, simple2) # rows in both 1 and 2
dplyr::setdiff(simple1, simple2) # rows in 1 but not 2
dplyr::union(simple1, simple2) # all rows
\[\\[0.1in]\] And here is how to bind both sets together, regardless if they match or not:
dplyr::bind_rows(simple1, simple2) # appends 2 to 1 at the bottom
dplyr::bind_cols(simple1, simple2) # appends 2 to 1 but as new columns based on position
## New names:
## • `numbers` -> `numbers...1`
## • `letters` -> `letters...2`
## • `numbers` -> `numbers...3`
## • `letters` -> `letters...4`
\[\\[0.1in]\]
Let’s use the full_join version for the rest of the walkthru.
PlayData <- dplyr::full_join(simple1, simple2, by = "numbers")
\[\\[0.1in]\]
Now that we have our data, we should clean it up a bit. Here we can remove NAs/0s, reorder the rows, and tranform how our data is shown in the sheet.
For many reasons you may want to remove NAs or 0s from your data:
PlayData %>% drop_na() # Drop all rows with NA values
PlayData %>% drop_na(letters.y) # Specify which columns
# There isn't a specific function to drop zeros, but you can get creative:
PlayData %>% select(where(~(is.numeric(.) && all(. > 0)) || !is.numeric(.)))
The drop zero line is a little odd, but basically it’s saying in
PlayData, select only the rows that are above zero if they
are numeric and delete everything else. Let’s review the syntax: -
select: filtering function - where: the
argument - is.numeric: ignores characters/factors/etc -
. : the entire object - all: all of the
values
\[\\[0.1in]\]
You may want to rearrange your datasheets (for example if you’re
adding a new index and would like it to sort alphabetically) - for that
you’d use arrange with or without desc
(descending). This can be done with multiple levels, too.
dplyr::arrange(PlayData, letters.x) # Low to high
dplyr::arrange(PlayData, desc(letters.x)) # High to low
dplyr::arrange(PlayData, -numbers) # For high to low with numbers you can just add -
dplyr::arrange(PlayData, numbers, letters.x) # By the number and letters
\[\\[0.1in]\]
Adding columns or splitting them apart can be really helpful - these
can be done with unite and seperate. Many
times I’ll have a column that is a long string of words seperated by
either a space or underscore - you can leverage seperate to
create different columns instead of manually redoing your
spreadsheet.
tidyr::unite(PlayData, New, letters.x, letters.y, sep = "|")
tidyr::unite(PlayData, New, letters.x, letters.y, sep = "|", na.rm = TRUE) # Ignoring NA
separate(PlayData, col = letters.x, into = c('Let', 'Num'), sep = '_')
\[\\[0.1in]\]
The functions gather and spread transform
your data from shortform to longform and vice versa (if that doesn’t
make sense yet hopefully the examples will help). An alternative to
using tidyr for this is the reshape2 package’s
melt and cast functions.
For the sake of the next two, I’ll make new play data for each. Run but ignore this next code chunk:
GatherDF <- data.frame(matrix(NA, nrow = 3, ncol = 4))
colnames(GatherDF) <- c("ID", "A", "B", "C")
GatherDF$ID <- 1:nrow(GatherDF)
GatherDF$A <- round(rnorm(n = 3, mean = 20, sd = 12))
GatherDF$B <- round(rnorm(n = 3, mean = 10, sd = 12))
GatherDF$C <- round(rnorm(n = 3, mean = 15, sd = 12))
SpreadDF <- data.frame(matrix(NA, nrow = 9, ncol = 3))
colnames(SpreadDF) <- c("ID", "Types", "Data")
SpreadDF$ID <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
SpreadDF$Types <- c("Type1", "Type1", "Type1", "Type2", "Type2", "Type2", "Type3", "Type3", "Type3")
SpreadDF$Data <- round(rnorm(n = 9, mean = 100, sd = 5))
Let’s say we have a dataframe that has columns we would like to
transform into rows - this is especially handy when working with data
visualization, so each variable type can be accessed in one column (like
in ggplot). The syntax here has the dataset in the first
position (GatherDF), then the names of the two columns
you’re making (Letters and Data), and finally
the 2:4 specifies that I want to transform my data by the
columns 2-4 and leave my first ID column alone.
head(GatherDF)
tidyr::gather(GatherDF, "Letters", "Data", 2:4)
\[\\[0.1in]\]
Now let’s look at a reverse case where you have categorical data as
different rows and would like to spread it into different columns.
Again, the syntax follows that the first place shows the dataframe I’m
working with, but then the next two words (Type and
Data) show which columns I’m spreading.
head(SpreadDF)
tidyr::spread(SpreadDF, "Types", "Data", 2:4)
## Warning in if (!is.na(fill)) {: the condition has length > 1 and only the first
## element will be used
\[\\[0.5in]\]
For these next two sections (subsetting and analysis) we’ll use the
mpg dataset - this is automatically included in base R to
help standardize things (like if you wanted to ask someone how to do
something without giving them your data). It is the EPA reported fuel
economy data from 1999-2008 for several models of cars. Let’s take a
look:
dplyr::glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
\[\\[0.1in]\]
Subsetting rows is useful for many reasons, such as keeping analyses separate from each other.
dplyr::filter(mpg, year > 2005) # Filtering rows older than 2005
dplyr::distinct(mpg, manufacturer) # Unique manufacturer values
dplyr::sample_n(mpg, 5, replace = FALSE) # Random 5 rows
dplyr::slice(mpg, 50:55) # Specific rows
dplyr::top_n(mpg, 5, cty) # Order by a variable and pull the top n rows
\[\\[0.1in]\]
Grouping data is helpful when doing an analysis - such as calculating the mean of something per value. There isn’t much to show as a stand alone operation, but we’ll use it in the next section. Ungrouping your data is used to bring the groups back out to the full set.
dplyr::group_by(mpg, manufacturer)
dplyr::ungroup(mpg)
\[\\[0.5in]\]
dplyr offers some great tools to do relatively simple,
but specific, calculations.
Count counts the unique values of a given value.
dplyr::count(mpg, manufacturer)
\[\\[0.1in]\]
Summarize is a neat way to do calculations on your
dataset. In this example, I’m asking the mean, standard
deviation (sd), and minimum (min) city mpg per
car manufacturer.
mpg %>% dplyr::group_by(manufacturer) %>% summarise(mean = mean(cty), sd = sd(cty), min = min(cty))
\[\\[0.1in]\]
The function mutate is used to make a calculation or
transformation and save it as a column. Here I capitalized the
manufacturer and model values.
dplyr::mutate(mpg, manufacturer = str_to_title(manufacturer), model = str_to_title(model))
Where mutate just adds a new column or overwrites one,
transmute removes all of the originals unless specified.
The example here calculates the combined fuel efficiency, note that all
but the model and new CombinedFuelEfficiency
column were dropped.
dplyr::transmute(mpg, model, CombinedFuelEfficiency = 0.55*cty + 0.45*hwy)
\[\\[0.5in]\]
Here’s a simple question that could be difficult to code: what is the mean fuel efficiency for each transmission type, and which model has the highest city mpg?
Using the tidyverse we can do this in just 5 lines! We’ll first use
the separate function to break out the transmission data by
the ( into two columns (e.g. from auto(l5) to
auto and 15)), then using
group_by to parse the dataframe into each
manufacturer, summarise the mean
for city and highway mpg, then report the statistics back out with
ungroup.
mpg %>%
tidyr::separate(trans, into = c('Transm', 'SubType'), sep = '\\(') %>% #the \\ is used for symbols
group_by(manufacturer, Transm) %>%
summarise(avg_city = mean(cty), avg_hwy = mean(hwy)) %>%
dplyr::arrange(avg_city) %>%
ungroup()
## `summarise()` has grouped output by 'manufacturer'. You can override using the
## `.groups` argument.
\[\\[0.1in]\]