Data wrangling with tidyverse

.title[
# Data wrangling with <code>tidyverse</code>
]
.subtitle[
## Hannah Owens, adapted from Maria Novosolov
]
.date[
### 07-03-2024
]

---

# Tidyverse is a collection of packages
.center[
<img src="img/tidyverse_core.png" width=50%>
]

---

# The advantages

- tibble/data.frame in, tibble out

- neat code

---

# Tidy data

|Film                       |Gender |Race   | Words|
|:--------------------------|:------|:------|-----:|
|The Fellowship Of The Ring |Female |Elf    |  1229|
|The Fellowship Of The Ring |Male   |Elf    |   971|
|The Fellowship Of The Ring |Female |Hobbit |    14|
|The Fellowship Of The Ring |Male   |Hobbit |  3644|
|The Fellowship Of The Ring |Female |Man    |     0|
|The Fellowship Of The Ring |Male   |Man    |  1995|
|The Two Towers             |Female |Elf    |   331|
|The Two Towers             |Male   |Elf    |   513|
|The Two Towers             |Female |Hobbit |     0|
|The Two Towers             |Male   |Hobbit |  2463|

---

# Does your code resemble this?

```r
starwars_human_subset <- subset(starwars,species == "Human")
starwars_human_subset$bmi <- starwars_human_subset$mass / 
  (0.01 * starwars_human_subset$height)^2
fattest_human_from_each_planet <- aggregate(bmi ~ homeworld,data = 
      starwars_human_subset, FUN = "max")
fattest_human_from_each_planet <- merge(
  x=fattest_human_from_each_planet,
  y=starwars_human_subset,by = c("homeworld","bmi"))
fattest_human_from_each_planet <- fattest_human_from_each_planet [,1:5]
```

![](https://jamesskemp.github.io/gits-matrix/images/green6.jpg)

---
# Code should be pleasant to read

![](https://media.giphy.com/media/OWyYSmZT43pxm/giphy.gif)
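
For comparison, here is a rough tidyverse sketch of the base R code from the previous slide. It assumes the `starwars` data set that ships with `dplyr`. The object name `highest_bmi_per_planet` is made up for this example, and the explicit `is.na()` filter is an assumption meant to mimic how `aggregate()` drops missing values.

```r
library(dplyr)

highest_bmi_per_planet <- starwars %>%
  filter(species == "Human") %>%                # keep humans only
  mutate(bmi = mass / (0.01 * height)^2) %>%    # height is in cm, so convert to m
  filter(!is.na(bmi), !is.na(homeworld)) %>%    # drop missing values explicitly
  group_by(homeworld) %>%
  slice_max(bmi, n = 1) %>%                     # highest BMI per homeworld (ties kept)
  ungroup() %>%
  select(homeworld, bmi, name, height, mass)    # same five columns as the base R version

highest_bmi_per_planet
```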

---
# Tibbles

- **They do less** (i.e. don't change variable names or types, don't do partial matching)

- **They complain more** (e.g. when a variable does not exist).

- Force you to confront problems earlier, typically leading to cleaner, more expressive code.
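
A minimal sketch of the first two points, using made-up column names (`gonad size`, `species`) rather than anything from the course data:

```r
library(tibble)

# data.frame() rewrites names that are not syntactic; tibble() leaves them alone
names(data.frame(`gonad size` = 1))   # "gonad.size"
names(tibble(`gonad size` = 1))       # "gonad size"

df <- data.frame(species = c("squid", "octopus"), stringsAsFactors = FALSE)
tb <- as_tibble(df)

df$spec   # partial matching: silently returns the species column
tb$spec   # NULL, plus a warning about an unknown column
```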

---
# `data.frame`

```r
iris
```

```
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica
```

---
# Tibbles print nicely!

```r
library(tidyverse)
as_tibble(iris)
```

```
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows
```

---
class: exercise, middle

# Let's practice!

1. Load tidyverse with `library(tidyverse)`

2. Run the function `as_tibble()` on your squid data

---

# Pipe ("then")

```r
do_another_thing(do_something(data))

# versus

data %>% 
    do_something() %>% 
    do_another_thing() 
```
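
As a concrete, runnable version of the pattern above, this sketch uses the built-in `iris` data and two real functions, `arrange()` and `head()`, in place of the placeholder names. The object names `nested` and `piped` are only for illustration.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- head(arrange(iris, Sepal.Length), 3)

# Piped calls: read top to bottom as "take iris, then sort, then keep 3 rows"
piped <- iris %>%
  arrange(Sepal.Length) %>%
  head(3)

identical(nested, piped)   # TRUE
```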

---

# `readr` package

---

# `read_xxx()` functions

* Neater import than `read.table` and `read.csv`

* Checks the data on import and prints a report of the columns it read

* Character columns are not converted to factors

* The most useful are `read_csv()`, `read_table()`, and `read_delim()`

* Compatible with pipe workflow
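
A small sketch of these points that runs without a file on disk. It assumes `readr` 2.0 or newer, where literal text wrapped in `I()` is read as data; the three rows of values are invented for the example and are not the course data.

```r
library(readr)

# Literal CSV text wrapped in I() so read_csv() treats it as data, not a file path
csv_text <- "Sample,Sex,GSI\n1,Female,10.4\n2,Male,5.3\n"

squid_demo <- read_csv(I(csv_text), show_col_types = FALSE)

class(squid_demo$Sex)   # "character", not a factor
squid_demo
```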

---
# Example

```r
library(readr)
mydata <- read_csv("data/Data_Squid.csv")
```

```
## Rows: 2644 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (5): Sample, Year, Month, Location, GSI
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

---

```r
mydata
```

```
## # A tibble: 2,644 × 6
##    Sample  Year Month Location Sex      GSI
##     <dbl> <dbl> <dbl>    <dbl> <chr>  <dbl>
##  1      1     1     1        1 Female 10.4 
##  2      2     1     1        3 Female  9.83
##  3      3     1     1        1 Female  9.74
##  4      4     1     1        1 Female  9.31
##  5      5     1     1        1 Female  8.99
##  6      6     1     1        1 Female  8.77
##  7      7     1     1        1 Female  8.26
##  8      8     1     1        3 Female  7.40
##  9      9     1     1        3 Female  7.22
## 10     10     1     2        1 Female  6.84
## # ℹ 2,634 more rows
```

---
class: center, middle

# `janitor` package

<img src="img/janitor.png" width=40%>
---
# `clean_names()` function

* Cleans the column names into something more computer-friendly

* For example, converts all column names to lowercase and puts underscores between words
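
A quick sketch with made-up, messy column names (they are not from the squid file), assuming the `janitor` package is installed:

```r
library(tibble)

messy <- tibble(`Sample ID` = 1, Sex = "Female", `Mean GSI` = 10.4)

names(messy)                         # "Sample ID" "Sex" "Mean GSI"
names(janitor::clean_names(messy))   # "sample_id" "sex" "mean_gsi"
```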
---

# Regular column names

```r
mydata
```

---
# Clean column names

```r
mydata %>% 
* janitor::clean_names()
```

```
## # A tibble: 2,644 × 6
##    sample  year month location sex      gsi
##     <dbl> <dbl> <dbl>    <dbl> <chr>  <dbl>
##  1      1     1     1        1 Female 10.4 
##  2      2     1     1        3 Female  9.83
##  3      3     1     1        1 Female  9.74
##  4      4     1     1        1 Female  9.31
##  5      5     1     1        1 Female  8.99
##  6      6     1     1        1 Female  8.77
##  7      7     1     1        1 Female  8.26
##  8      8     1     1        3 Female  7.40
##  9      9     1     1        3 Female  7.22
## 10     10     1     2        1 Female  6.84
## # ℹ 2,634 more rows
```
---

# Can also be:

```r
mydata <- read_csv("data/Data_Squid.csv") %>% 
  janitor::clean_names()
mydata
```

---
class: exercise, middle

# Let's practice!

## Load the data and change all the column names to upper case

**Hint:** check out the help in `clean_names()`

---

# `select()`

![](img/xls-select.PNG)

---
# Select "sample", "sex", and "gsi" columns only

```r
mydata %>% 
*   select(sample, sex, gsi)
```

```
## # A tibble: 2,644 × 3
##    sample sex      gsi
##     <dbl> <chr>  <dbl>
##  1      1 Female 10.4 
##  2      2 Female  9.83
##  3      3 Female  9.74
##  4      4 Female  9.31
##  5      5 Female  8.99
##  6      6 Female  8.77
##  7      7 Female  8.26
##  8      8 Female  7.40
##  9      9 Female  7.22
## 10     10 Female  6.84
## # ℹ 2,634 more rows
```

---

# Let's practice!

## Select "year", "sex", "location", and "gsi".
---

# `mutate()`
.center[
<img src="img/dplyr_mutate.png" width=60%>
]
---
![](img/xls-mutate.PNG)
---
# Add a gsi_log column

```r
mydata %>% 
    select(sample, sex, gsi) %>% 
*   mutate(gsi_log = log10(gsi))
```

```
## # A tibble: 2,644 × 4
##    sample sex      gsi gsi_log
##     <dbl> <chr>  <dbl>   <dbl>
##  1      1 Female 10.4    1.02 
##  2      2 Female  9.83   0.993
##  3      3 Female  9.74   0.988
##  4      4 Female  9.31   0.969
##  5      5 Female  8.99   0.954
##  6      6 Female  8.77   0.943
##  7      7 Female  8.26   0.917
##  8      8 Female  7.40   0.869
##  9      9 Female  7.22   0.858
## 10     10 Female  6.84   0.835
## # ℹ 2,634 more rows
```

---

# Let's practice!

## Add a new column with gsi multiplied by 10

---

# `filter()`
Works similarly to `subset()`
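
A side-by-side sketch of the two, on a tiny invented table (`squid_demo` and its values are placeholders, not the course data):

```r
library(dplyr)

squid_demo <- tibble(
  sample = 1:4,
  sex    = c("Female", "Male", "Male", NA),
  gsi    = c(10.4, 5.3, 2.9, 3.1)
)

base_males  <- subset(squid_demo, sex == "Male")
dplyr_males <- squid_demo %>% filter(sex == "Male")

# Both keep rows where the condition is TRUE and drop rows where it is NA
nrow(base_males)    # 2
nrow(dplyr_males)   # 2
```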

---

# Filter the data to have only males

```r
mydata %>% 
    select(sample, sex, gsi) %>% 
    mutate(gsi_log = log10(gsi)) %>% 
*   filter(sex == "Male")
```

```
## # A tibble: 1,402 × 4
##    sample sex     gsi gsi_log
##     <dbl> <chr> <dbl>   <dbl>
##  1     24 Male   5.30   0.724
##  2     48 Male   4.30   0.633
##  3     58 Male   3.50   0.544
##  4     60 Male   3.25   0.512
##  5     61 Male   3.23   0.509
##  6     62 Male   3.23   0.509
##  7     63 Male   3.18   0.503
##  8     65 Male   2.97   0.473
##  9     66 Male   2.95   0.470
## 10     67 Male   2.94   0.468
## # ℹ 1,392 more rows
```

---
# Filter the data to have only males with a GSI greater than 4

```r
mydata %>% 
    select(sample, sex, gsi) %>% 
    mutate(gsi_log = log10(gsi)) %>% 
*   filter(sex == "Male",gsi > 4)
```

```
## # A tibble: 7 × 4
##   sample sex     gsi gsi_log
##    <dbl> <chr> <dbl>   <dbl>
## 1     24 Male   5.30   0.724
## 2     48 Male   4.30   0.633
## 3    763 Male   4.21   0.625
## 4    765 Male   4.19   0.622
## 5   1671 Male   4.55   0.658
## 6   1676 Male   4.33   0.636
## 7   1679 Male   4.01   0.603
```

---

# Let's practice!

1. Filter the data to have only males from year 1
2. Talk to your neighbor: do you have the same number of rows?

---

# `arrange()`

![](img/xls-arrange.PNG)
---

# Sort the data based on the gsi_log
- `desc()` puts things in descending order

```r
mydata %>% 
    select(sample, sex, gsi) %>% 
    mutate(gsi_log = log10(gsi)) %>% 
*   arrange(desc(gsi_log))
```

```
## # A tibble: 2,644 × 4
##    sample sex      gsi gsi_log
##     <dbl> <chr>  <dbl>   <dbl>
##  1    546 Female  14.6    1.16
##  2    547 Female  13.3    1.12
##  3   1520 Female  11.9    1.07
##  4    548 Female  11.2    1.05
##  5    549 Female  11.2    1.05
##  6   1521 Female  10.8    1.03
##  7    550 Female  10.7    1.03
##  8    551 Female  10.7    1.03
##  9    552 Female  10.6    1.03
## 10   2284 Female  10.6    1.03
## # ℹ 2,634 more rows
```

---

# Let's practice!

## Arrange the data based on year.

---
# `group_by()`, `summarize()`

![](img/xls-summary.PNG)

---

## Create a summary with the average GSI for each combination of year and location, and sort it by the average GSI.

```r
mydata %>% 
  select(location, year, sex, gsi) %>% 
* group_by(location,year) %>%
* summarise(avg_gsi = mean(gsi, na.rm = TRUE)) %>%
  arrange(desc(avg_gsi)) %>% 
* ungroup()
```

```
## `summarise()` has grouped output by 'location'. You can override using the
## `.groups` argument.
```

```
## # A tibble: 11 × 3
##    location  year avg_gsi
##       <dbl> <dbl>   <dbl>
##  1        3     3   3.78 
##  2        3     4   3.65 
##  3        4     2   3.06 
##  4        1     2   2.84 
##  5        1     4   2.60 
##  6        3     2   2.38 
##  7        1     3   2.24 
##  8        1     1   1.46 
##  9        3     1   1.11 
## 10        2     2   0.491
## 11        2     3   0.209
```

---
.center[
### **Remember:** When using `group_by()`, always add `ungroup()` at the end to convert the result back to a regular, ungrouped tibble
<img src="img/group_by_ungroup.png" width=80%>
]
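
A small sketch of why this matters, using an invented three-row table: any later step on a grouped tibble is still computed within each group.

```r
library(dplyr)

squid_demo <- tibble(location = c(1, 1, 2), gsi = c(2, 4, 6))

grouped <- squid_demo %>% group_by(location)

# Still grouped: the mean is computed within each location
still_grouped <- grouped %>%
  mutate(mean_gsi = mean(gsi))   # 3, 3, 6

# Ungrouped first: the mean is computed over the whole table
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(mean_gsi = mean(gsi))   # 4, 4, 4
```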
---
class: exercise, middle

# Let's practice!

Work with your neighbor.

1. What is the number of samples in each year?
2. How many females and males are there in each location?
3. What is the maximum GSI for males and females in each month?

---
class: center, middle

# 5 Minute Break

---
# Rename columns with `rename()`

```r
mydata %>% 
  select(location, year, sex, gsi) %>% 
*   rename(gonad_size = gsi)
```

```
## # A tibble: 2,644 × 4
##    location  year sex    gonad_size
##       <dbl> <dbl> <chr>       <dbl>
##  1        1     1 Female      10.4 
##  2        3     1 Female       9.83
##  3        1     1 Female       9.74
##  4        1     1 Female       9.31
##  5        1     1 Female       8.99
##  6        1     1 Female       8.77
##  7        1     1 Female       8.26
##  8        3     1 Female       7.40
##  9        3     1 Female       7.22
## 10        1     1 Female       6.84
## # ℹ 2,634 more rows
```
---

# `rename_all()`
* Works similarly to `janitor::clean_names()`

Change all the column names to upper case

```r
mydata %>% 
  select(location, year, sex, gsi) %>% 
*   rename_all(toupper)
```

```
## # A tibble: 2,644 × 4
##    LOCATION  YEAR SEX      GSI
##       <dbl> <dbl> <chr>  <dbl>
##  1        1     1 Female 10.4 
##  2        3     1 Female  9.83
##  3        1     1 Female  9.74
##  4        1     1 Female  9.31
##  5        1     1 Female  8.99
##  6        1     1 Female  8.77
##  7        1     1 Female  8.26
##  8        3     1 Female  7.40
##  9        3     1 Female  7.22
## 10        1     1 Female  6.84
## # ℹ 2,634 more rows
```
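
In `dplyr` 1.0.0 and later, `rename_all()` still works but is marked as superseded by `rename_with()`. A minimal sketch of the newer form, on a one-row invented table (`squid_demo` is a placeholder name):

```r
library(dplyr)

squid_demo <- tibble(location = 1, year = 1, sex = "Female", gsi = 10.4)

# Apply toupper() to every column name
all_upper <- squid_demo %>% rename_with(toupper)
names(all_upper)    # "LOCATION" "YEAR" "SEX" "GSI"

# Or only to the columns you pick
some_upper <- squid_demo %>% rename_with(toupper, .cols = c(sex, gsi))
names(some_upper)   # "location" "year" "SEX" "GSI"
```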

---
class: inverse, center, middle

# Some more useful functions in the tidyverse family

---
class: center, middle
<img src="img/relocate.jpg" width=70%>

---
class: center, middle
<img src="img/accross.jpg" width=70%>

.footnote[
* Syntax has changed slightly: across(<font color="red">where(</font>is.numeric<font color="red">)</font>, .f)
]
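
A short sketch of the current syntax on the built-in `iris` data; the object name `species_means` is made up for the example.

```r
library(dplyr)

# Take the mean of every numeric column, separately for each species
species_means <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))

species_means
```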

---
class: center, middle
<img src="img/case_when.jpg" width=70%>
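
A minimal `case_when()` sketch. The cut-offs (1 and 10) and the labels are arbitrary values chosen for illustration, not biologically meaningful thresholds.

```r
library(dplyr)

sizes <- tibble(gsi = c(0.5, 3, 12)) %>%
  mutate(size_class = case_when(
    gsi < 1  ~ "small",    # conditions are checked in order
    gsi < 10 ~ "medium",   # only reached when gsi >= 1
    TRUE     ~ "large"     # fallback for everything else
  ))

sizes$size_class   # "small" "medium" "large"
```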

---
class: exercise, middle

# Let's practice!

Work with your neighbor.

Using `across()` and `case_when()`:

Group the squid data by location, calculate

the maximum of each numeric column except sample,

and add a new column that makes a note of whether

at a given location, sampling was done all year,

or if it stopped in the spring.

---
class: exercise

# If you want more practice

Open `swirl`

### For practicing manipulating data with `tidyverse`:

Download the course "Getting and Cleaning Data"
`swirl::install_course("Getting and Cleaning Data")`

Work on sections 1-3

### If you want a challenge, try `purrr`:

Download the course "Advanced R Programming"
`swirl::install_course("Advanced R Programming")`

Work on sections 2 and 3

---

# Congratulations!

## You now know the basics of `tidyverse`