\[\\[0.25in]\]

Tidyverse is a large family of packages used to enhance R’s capacity for data science.

This is the main website: https://www.tidyverse.org/

\[\\[0.1in]\]

This workthru focuses on the two packages used in data wrangling, tidyr and dplyr (pronounced ‘d plier’ like the hardware tool), and one for importing/exporting your data (readr). Note that plyr is the full and more expanded (and slower) version of dplyr.

The other tidyverse packages that not included in this demo are: ggplot (for data visualization), Tibble (small package for working better with data frames), forcats (helps with factors and categorical data), stringr (for working with strings), purrr (functional programming, e.g. functions and vectors), and some other smaller ones designed for more specific uses.

\[\\[0.1in]\]

There are many cheatsheets, I recommend these Posit ones: https://rstudio.github.io/cheatsheets/

Or this great website: https://r4ds.had.co.nz/introduction.html

\[\\[0.25in]\]

Package Installation:

You can install the whole tidyverse:

install.packages("tidyverse")

Or alternatively, install just the ones you want:

install.packages("tidyr")

install.packages("dplyr")

install.packages("readr")

And then library them:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.4     ✔ purrr   1.0.2
## ✔ tibble  3.2.1     ✔ dplyr   1.1.3
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   1.4.0     ✔ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# OR #
library(tidyr)
library(dplyr)
library(readr)

\[\\[0.25in]\]

Data Import and Export (readr):

You can import many file types using tidyr, including: tab delimited (read_tsv), comma delimited (read_csv), semicolon delimited (read_csv2). Use read_delim for any delimiter - you can either tell it which one to use or it’ll guess.

\[\\[0.05in]\]

(these will give errors if you run them, they’re only examples):

  • The simplest is just read_csv("MyData.csv")
  • To ignore the header: read_csv("MyData.csv", col_names = FALSE)
  • Renaming the columns: read_csv("MyData.csv", col_names = c("A", "B", "C"))
  • Skipping specific rows: `read_csv(“MyData.csv”, skip = c(2, 10))``

\[\\[0.1in]\]

Importing can get more complicated if you have the need, for example you can use general expressions while importing too:

Import <- read.csv("MyData.csv", header = F, col.names = paste("V",1:50))

\[\\[0.05in]\]

Here’s a fun trick to not hard-code your file or file path by opening your native file directory:

MyChoice <- read.csv(file.choose())

\[\\[0.05in]\]

One way to read many files at once:

filepath <- "/FakePath/Samples/"

file_list <- dir("/FakePath/Samples/", pattern = "*.csv")

all_csv <- lapply(paste0(filepath, file_list), read.csv, header = FALSE)

\[\\[0.1in]\]

Likewise, to save your data you can use: write_csv, write_csv2, write_tsv, and write_delim.

For example, write.csv(x = output_dataframe, file = "Output.csv", row.names = FALSE)

Where x is the R object you’re saving, file is the name of your exported file (note you can include a file path if you want it in a different directory), and then there are other arguments you can include.

\[\\[0.25in]\]

Tidyverse basics:

Why is it called ‘tidy’? Tidyverse packages make a few assumptions about your data, namely that each variable is it’s own column and each sample is it’s own row. If you have your data set up like this, then the relationships will always be preserved no matter what transformations happen. This may seem like basic data hygiene, but it’s worth keeping in mind when working with large an/or unusual data.

\[\\[0.1in]\]

The tidyverse has some unique syntax and operations, although there are really only two unique things you need to know beyond just how the function exppects things defined.

When working with tidyverse you’ll often need to use %>%, which is called a pipe. This character is used during a sequence of actions. It’s best not to involve more than one object, and if there are intermediate objects they should be named individually. The pipe should always have a space in front of it, and most folks prefer to have white space afterwards (e.g. starting a new line) if it’s part of a large code block.

\[\\[0.1in]\]

We’ll go over the different functions (like separate here) later on, but here is an example of using a pipe.

NewDF <- OldDF %>% separate(col = MyColumn, into = c('A', 'B'), sep= "_")

The line above takes a dataframe, OldDF, and splits the column MyColumn by an underscore into two new columns (A and B) - and saves it as a new dataframe (NewDF).

\[\\[0.1in]\]

You can use the %>% or including the data in the function can be equivalent, these next two lines produce the exact same result (we’ll cover the weird dplyr:: next and the count later). It’s best you use the pipe for longer scripts.

MyDataset %>% dplyr::count(variable)

dplyr::count(MyDataset, variable)

\[\\[0.1in]\]

The tidyverse is large, and there are sometimes overlapping function names both within it and among other packages. When this happens, you’ll need to specify which package contains the function you’re trying to call. To do this you write package::functionname.

Here is an example where some unknown package and dplyr both have a function called group_by, but I specifically want dplyr’s version:

OldDF %>% dplyr::group_by(MyVariable) %>% summarise

\[\\[0.1in]\]

Making some play data:

(ignore this for the most part)

names1 <- c("A_1", "B_1", "C_1", "D_1", "E_1") 
names2 <- c("W_2", "X_2", "Y_2", "Z_2", "ZZ_2")
numbers1 <- c(1, 2, 3, 4, 5)
numbers2 <- c(1, 2, 3, 4, 6)

simple1 <- t(rbind(numbers1, names1))
simple1 <- as.data.frame(simple1)
simple1$numbers1 <- as.numeric(simple1$numbers1)

simple2 <- t(rbind(numbers2, names2))
simple2 <- as.data.frame(simple2)
simple2$numbers2 <- as.numeric(simple2$numbers2)

colnames(simple1) <- c("numbers", "letters")
colnames(simple2) <- c("numbers", "letters")

\[\\[0.5in]\]

Data Wrangling (tidyr and dplyr):

Viewing your data:

There are quite a few ways to view your data - especially if you’re using RSTudio. Below are a few lines you can run to take a peak:

#Base R:
str(simple1) # Look at the structure
## 'data.frame':    5 obs. of  2 variables:
##  $ numbers: num  1 2 3 4 5
##  $ letters: chr  "A_1" "B_1" "C_1" "D_1" ...
# Using tidyverse:
dplyr::glimpse(simple1)
## Rows: 5
## Columns: 2
## $ numbers <dbl> 1, 2, 3, 4, 5
## $ letters <chr> "A_1", "B_1", "C_1", "D_1", "E_1"

\[\\[0.1in]\]

Combining datasets:

Combining data sets is an important but tricky step in any workflow. Tidyverse is great for it - I often defer to the Posit cheatsheet which has it all color-coded to make sure I choose the right one.

For this exercise, I’ve made two simple data sets with just a number column and a letter column.

head(simple1)
head(simple2)

\[\\[0.1in]\]

Let’s treat the numbers as the same sample across the two sheets and the letters as different variables, and explore how we can combine them.

Note that the by function was the variable they were indexed and combined by - if I had put letters instead of numbers nothing would have combined. The .x and .y are to keep track which dataset they came from automatically - if the columns had unique names which would not have happened.

dplyr::left_join(simple1, simple2, by = "numbers") # Matches 2 to 1
dplyr::right_join(simple1, simple2, by = "numbers") # Matches 1 to 2
dplyr::inner_join(simple1, simple2, by = "numbers") # Only keeps rows that are shared
dplyr::full_join(simple1, simple2, by = "numbers") # Keeps all data

\[\\[0.1in]\]

Let’s treat each one as different parts of the same set, and go over different ways to combine them:

dplyr::intersect(simple1, simple2) # rows in both 1 and 2
dplyr::setdiff(simple1, simple2) # rows in 1 but not 2
dplyr::union(simple1, simple2) # all rows

\[\\[0.1in]\] And here is how to bind both sets together, regardless if they match or not:

dplyr::bind_rows(simple1, simple2) # appends 2 to 1 at the bottom
dplyr::bind_cols(simple1, simple2) # appends 2 to 1 but as new columns based on position
## New names:
## • `numbers` -> `numbers...1`
## • `letters` -> `letters...2`
## • `numbers` -> `numbers...3`
## • `letters` -> `letters...4`

\[\\[0.1in]\]

Let’s use the full_join version for the rest of the walkthru.

PlayData <- dplyr::full_join(simple1, simple2, by = "numbers")

\[\\[0.1in]\]

Manipulating datasets:

Now that we have our data, we should clean it up a bit. Here we can remove NAs/0s, reorder the rows, and tranform how our data is shown in the sheet.

For many reasons you may want to remove NAs or 0s from your data:

PlayData %>% drop_na() # Drop all rows with NA values
PlayData %>% drop_na(letters.y) # Specify which columns
# There isn't a specific function to drop zeros, but you can get creative:
PlayData %>% select(where(~(is.numeric(.) && all(. > 0)) || !is.numeric(.)))

The drop zero line is a little odd, but basically it’s saying in PlayData, select only the rows that are above zero if they are numeric and delete everything else. Let’s review the syntax: - select: filtering function - where: the argument - is.numeric: ignores characters/factors/etc - . : the entire object - all: all of the values

\[\\[0.1in]\]

You may want to rearrange your datasheets (for example if you’re adding a new index and would like it to sort alphabetically) - for that you’d use arrange with or without desc (descending). This can be done with multiple levels, too.

dplyr::arrange(PlayData, letters.x) # Low to high
dplyr::arrange(PlayData, desc(letters.x)) # High to low
dplyr::arrange(PlayData, -numbers) # For high to low with numbers you can just add -
dplyr::arrange(PlayData, numbers, letters.x) # By the number and letters

\[\\[0.1in]\]

Adding columns or splitting them apart can be really helpful - these can be done with unite and seperate. Many times I’ll have a column that is a long string of words seperated by either a space or underscore - you can leverage seperate to create different columns instead of manually redoing your spreadsheet.

tidyr::unite(PlayData, New, letters.x, letters.y, sep = "|")
tidyr::unite(PlayData, New, letters.x, letters.y, sep = "|", na.rm = TRUE) # Ignoring NA
separate(PlayData, col = letters.x, into = c('Let', 'Num'), sep = '_')

\[\\[0.1in]\]

The functions gather and spread transform your data from shortform to longform and vice versa (if that doesn’t make sense yet hopefully the examples will help). An alternative to using tidyr for this is the reshape2 package’s melt and cast functions.

For the sake of the next two, I’ll make new play data for each. Run but ignore this next code chunk:

GatherDF <- data.frame(matrix(NA, nrow = 3, ncol = 4))
colnames(GatherDF) <- c("ID", "A", "B", "C")
GatherDF$ID <- 1:nrow(GatherDF)
GatherDF$A <- round(rnorm(n = 3, mean = 20, sd = 12))
GatherDF$B <- round(rnorm(n = 3, mean = 10, sd = 12))
GatherDF$C <- round(rnorm(n = 3, mean = 15, sd = 12))


SpreadDF <- data.frame(matrix(NA, nrow = 9, ncol = 3))
colnames(SpreadDF) <- c("ID", "Types", "Data")
SpreadDF$ID <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
SpreadDF$Types <- c("Type1", "Type1", "Type1", "Type2", "Type2", "Type2", "Type3", "Type3", "Type3")
SpreadDF$Data <- round(rnorm(n = 9, mean = 100, sd = 5))

Let’s say we have a dataframe that has columns we would like to transform into rows - this is especially handy when working with data visualization, so each variable type can be accessed in one column (like in ggplot). The syntax here has the dataset in the first position (GatherDF), then the names of the two columns you’re making (Letters and Data), and finally the 2:4 specifies that I want to transform my data by the columns 2-4 and leave my first ID column alone.

head(GatherDF)
tidyr::gather(GatherDF, "Letters", "Data", 2:4)

\[\\[0.1in]\]

Now let’s look at a reverse case where you have categorical data as different rows and would like to spread it into different columns. Again, the syntax follows that the first place shows the dataframe I’m working with, but then the next two words (Type and Data) show which columns I’m spreading.

head(SpreadDF)
tidyr::spread(SpreadDF, "Types", "Data", 2:4)
## Warning in if (!is.na(fill)) {: the condition has length > 1 and only the first
## element will be used

\[\\[0.5in]\]

Subsetting datasets:

For these next two sections (subsetting and analysis) we’ll use the mpg dataset - this is automatically included in base R to help standardize things (like if you wanted to ask someone how to do something without giving them your data). It is the EPA reported fuel economy data from 1999-2008 for several models of cars. Let’s take a look:

dplyr::glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

\[\\[0.1in]\]

Subsetting rows is useful for many reasons, such as keeping analyses separate from each other.

dplyr::filter(mpg, year > 2005) # Filtering rows older than 2005
dplyr::distinct(mpg, manufacturer) # Unique manufacturer values
dplyr::sample_n(mpg, 5, replace = FALSE) # Random 5 rows
dplyr::slice(mpg, 50:55) # Specific rows
dplyr::top_n(mpg, 5, cty) # Order by a variable and pull the top n rows

\[\\[0.1in]\]

Grouping data is helpful when doing an analysis - such as calculating the mean of something per value. There isn’t much to show as a stand alone operation, but we’ll use it in the next section. Ungrouping your data is used to bring the groups back out to the full set.

dplyr::group_by(mpg, manufacturer) 
dplyr::ungroup(mpg) 

\[\\[0.5in]\]

Data analysis:

dplyr offers some great tools to do relatively simple, but specific, calculations.

Count counts the unique values of a given value.

dplyr::count(mpg, manufacturer) 

\[\\[0.1in]\]

Summarize is a neat way to do calculations on your dataset. In this example, I’m asking the mean, standard deviation (sd), and minimum (min) city mpg per car manufacturer.

mpg %>% dplyr::group_by(manufacturer) %>% summarise(mean = mean(cty), sd = sd(cty), min = min(cty))

\[\\[0.1in]\]

The function mutate is used to make a calculation or transformation and save it as a column. Here I capitalized the manufacturer and model values.

dplyr::mutate(mpg, manufacturer = str_to_title(manufacturer), model = str_to_title(model))

Where mutate just adds a new column or overwrites one, transmute removes all of the originals unless specified. The example here calculates the combined fuel efficiency, note that all but the model and new CombinedFuelEfficiency column were dropped.

dplyr::transmute(mpg, model, CombinedFuelEfficiency = 0.55*cty + 0.45*hwy)

\[\\[0.5in]\]

Tying it all together:

Here’s a simple question that could be difficult to code: what is the mean fuel efficiency for each transmission type, and which model has the highest city mpg?

Using the tidyverse we can do this in just 5 lines! We’ll first use the separate function to break out the transmission data by the ( into two columns (e.g. from auto(l5) to auto and 15)), then using group_by to parse the dataframe into each manufacturer, summarise the mean for city and highway mpg, then report the statistics back out with ungroup.

mpg %>% 
  tidyr::separate(trans, into = c('Transm', 'SubType'), sep = '\\(') %>%  #the \\ is used for symbols
  group_by(manufacturer, Transm) %>% 
  summarise(avg_city = mean(cty), avg_hwy = mean(hwy)) %>% 
  dplyr::arrange(avg_city) %>% 
  ungroup()
## `summarise()` has grouped output by 'manufacturer'. You can override using the
## `.groups` argument.

\[\\[0.1in]\]