
class: center, middle, inverse, title-slide

.title[
# R workshop
]
.subtitle[
## Data manipulation & Rmarkdown
]
.author[
### Bolin Wu
]
.institute[
### NEAR, Aging Research Center
]
.date[
### 2023/01/26 (updated: 2023-04-14)
]

---

## About me

.pull-left[
![](https://ki-su-arc.se/wp-content/uploads/2021/10/Bolin-Wu_foto-Maria-Yohuang-683x1024.jpg)
]

.footnote[Photo: Maria Yohuang]

.pull-right[
- Statistician at [NEAR](https://www.near-aging.se/), KI.
- MSc in Statistics, Uppsala University.
- Interests: 
  - R package dev
  - Statistics
  - Data manipulation
  - guitar
]

???

In my daily work, I deal with data from many cohorts and follow-ups.

---
## `workshopr` package

### Install

```r
install.packages("remotes")
remotes::install_github("Bolin-Wu/workshopr",
  subdir = "rpackage",
  force = TRUE
)
```

### Load

```r
library(workshopr)
library(tidyverse)
library(here)
```

???

The slides can be downloaded from my GitHub page.

---
class: center, middle

# Data manipulation

### tidyverse, assign

---
## Introduction

This session shares data manipulation skills that are useful in daily data harmonization work. My main goal is to follow the "don't repeat yourself" (DRY) principle.

It makes our code more readable and reduces the chance of making mistakes.

Suppose you want to change the class of several columns. The repetitive way looks like this:

```r
df$cohort1_edu <- as.factor(df$cohort1_edu)
df$cohort2_edu <- as.factor(df$cohort2_edu)
# and so on...
```
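
A DRY sketch of the same conversion, using `mutate()` and `across()` (both covered later in this session). The toy `df` and its `_edu` suffix are made up for illustration; adapt the selection helper to your own column names.

```r
library(dplyr)

# Toy data: hypothetical education columns from two cohorts
df <- tibble(
  id = 1:4,
  cohort1_edu = c(1, 2, 2, 3),
  cohort2_edu = c(3, 1, 2, 2)
)

# Convert every column whose name ends in "_edu" to a factor in one step
df <- df %>%
  mutate(across(ends_with("_edu"), as.factor))

sapply(df, class)
```

If a third cohort is added later, no new line of code is needed as long as its column follows the same naming pattern.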

???

For example, you change a variable name in one place in the code but forget to change it in other places.

---

### Get the code

*Note: Data frames may not fit well in the PDF. I do not use any HTML widgets since the code will be pulled out for tutorial purposes. Attendees can run the code on their own machine to get a better view.*

```r
workshopr::get_code_2023(session = "tidyverse")
```

---

## Content

The content is selected based on data manipulation tasks in real work scenarios.

By the end of the workshop, I hope you will have these in your toolbox:

- `%>%` syntax
- **join** data frames (*join function*)
- **transform** data shape (*pivot_longer*)
- **select** variables based on name pattern (*select*)
- **extract** the label from DTA and SPSS in R (*filter & sjlabelled*)
- **check** missing values (*summarise & across*)
- **mutate data** based on column types (*mutate & across*)
- **bin** variables by percentiles (*case_when*)
- **assign** function

---

background-image: url(https://1.bp.blogspot.com/-UHpM76ze7Dk/YFVs4iGMWuI/AAAAAAAAA0s/3EpcYcl-PLce7rcdcD4YOUXntpdUbc3aACLcBGAsYHQ/s16000/7b5f79a7e88560bb863c6bc8bcfb7146.jpg)

.footnote[picture credit: https://www.redarmy.in/]

---

# Tidyverse

Many times we just run `library(tidyverse)`. In fact, the [Tidyverse](https://www.tidyverse.org/packages/) is a large umbrella that consists of several powerful visualization and data manipulation packages. For example:

- [magrittr](https://magrittr.tidyverse.org/): pipeline operator `%>%`.
- [ggplot2](https://ggplot2.tidyverse.org/): `ggplot()`.
- [dplyr](https://dplyr.tidyverse.org/): `select(), filter(), mutate()`.
- [stringr](https://stringr.tidyverse.org/): `str_detect(), str_subset()`.

### Notes

- Advantage: everything is loaded at once.
- Disadvantage: 
  - Loading the whole collection is slower. 
  - Potential conflicts of function names with other packages. Example [here](https://tidyverse.tidyverse.org/reference/tidyverse_conflicts.html).

???

Just a heads-up: if in the future you find that a function does not work as expected, it could be because of a name conflict with the tidyverse.

---

background-image: url(https://media.licdn.com/dms/image/C4E22AQHzxOsNkxPj-g/feedshare-shrink_800/0/1667950242189?e=1684368000&v=beta&t=MCGoT60B0dPjoZeWWdlGiARrnEbg5J_obake7sT5cT8)
background-size: 80% auto
background-position: center
class: center, bottom, inverse

.footnote[From Matt Dancho's LinkedIn]

---

background-image: url(https://upload.wikimedia.org/wikipedia/commons/f/fa/Hadley-wickham2016-02-04.jpg)
background-size: 40% auto
background-position: center
class: center, bottom, inverse

.footnote[picture credit: [Wikipedia of Hadley Wickham](https://www.redarmy.in/)]

---

## Pipeline operator `%>%`
The pipeline gives a beautiful syntax: steps snap together just like playing LEGO.

![](pic/legos.jpg)

---

### Example

```r
fake_data %>%
  select(Lopnr, Date_wave1)
```

```
# A tibble: 3,000 × 2
   Lopnr Date_wave1
   <dbl> <date>    
 1     1 2001-06-11
 2     2 2002-04-08
 3     3 2001-09-12
 4     4 2001-06-13
 5     5 2001-02-12
 6     6 2001-07-05
 7     7 2001-12-28
 8     8 2002-09-17
 9     9 2001-09-14
10    10 2001-09-21
# … with 2,990 more rows
```

---

- In addition, filter on the date column

```r
fake_data %>%
  select(Lopnr, Date_wave1) %>%
  filter(Date_wave1 > "2002-01-01") %>%
  slice(1:5)
```

```
# A tibble: 5 × 2
  Lopnr Date_wave1
  <dbl> <date>    
1     2 2002-04-08
2     8 2002-09-17
3    13 2002-10-13
4    15 2002-04-28
5    19 2002-12-25
```

- You can stack further steps such as `group_by()`, `filter()`, `mutate()`, etc., with the pipeline operator.
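
To see what the pipeline buys us, here is a small sketch comparing nested calls with piped calls. The `visits` data is made up for illustration; both versions are meant to give the same result.

```r
library(dplyr)

# Toy data, made up for illustration
visits <- tibble(
  id = 1:4,
  visit_date = as.Date(c("2001-06-11", "2002-04-08", "2001-09-12", "2002-09-17"))
)

# Nested calls: must be read from the inside out
nested <- filter(
  select(visits, id, visit_date),
  visit_date > as.Date("2002-01-01")
)

# Piped calls: read from top to bottom, one step per line
piped <- visits %>%
  select(id, visit_date) %>%
  filter(visit_date > as.Date("2002-01-01"))

all.equal(nested, piped)
```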

???
This syntax makes data manipulation convenient and easy to read.

---
## Join data frames

* With the `*_join()` functions: there are 4 common types of joins.

![](pic/join_tabel.png)

---

- Take `left_join()` as an example:

```r
# From the `dplyr` documentation:
df1 <- tibble(x = 1:3)
df2 <- tibble(
  x = c(1, 1, 2),
  y = c("first", "second", "third")
)
# State the key explicitly instead of relying on the automatic match
df1 %>% left_join(df2, by = "x")
```

```
# A tibble: 4 × 2
      x y     
  <dbl> <chr> 
1     1 first 
2     1 second
3     2 third 
4     3 <NA>  
```
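
For comparison, here is a sketch that counts the rows returned by each join type on the same two tibbles. It assumes the four common types meant above are dplyr's `inner_join()`, `left_join()`, `right_join()` and `full_join()`.

```r
library(dplyr)

df1 <- tibble(x = 1:3)
df2 <- tibble(
  x = c(1, 1, 2),
  y = c("first", "second", "third")
)

# Number of rows returned by each join type
join_rows <- c(
  inner = nrow(inner_join(df1, df2, by = "x")),
  left  = nrow(left_join(df1, df2, by = "x")),
  right = nrow(right_join(df1, df2, by = "x")),
  full  = nrow(full_join(df1, df2, by = "x"))
)
join_rows
```

The row counts differ because `x = 3` exists only in `df1`: joins that keep all rows of `df1` carry it along with `NA` in `y`, the others drop it.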

*By the way, do you know why `tibble()` is better than `data.frame()`?*

???
Please note, in this example, df1 is on the left side.

---
## Transform data shape
Reshaping data is a frequent step in data cleaning. However, many people, including me, find it troublesome. In R, the relevant functions have also evolved over time.

In the beginning (2019), I used [spread()](https://tidyr.tidyverse.org/reference/spread.html) and [gather()](https://tidyr.tidyverse.org/reference/gather.html). Every time I used `spread()` and `gather()`, it took me a while to figure out how to fill in 'key' and 'value'. As you can see from their documentation, their lifecycle status is now 'superseded'.

---
## Transform data shape

Now I only use `pivot_longer()` and `pivot_wider()` for transforming data. You can find their comprehensive documentation [here](https://tidyr.tidyverse.org/articles/pivot.html).

They come with better documentation, are more powerful, and integrate better with tidyverse syntax.

---

### Example

Let's assume we received a data set in wide format:

```r
head(fake_data, n = 5)
```

```
# A tibble: 5 × 28
  Lopnr Date_wave1 age_wave1 mmse_w…¹ demen…² Date_wave2 age_w…³ mmse_…⁴ demen…⁵
  <dbl> <date>         <dbl>    <dbl>   <dbl> <date>       <dbl>   <dbl>   <dbl>
1     1 2001-06-11      77.6       29       0 2003-06-03    79.5      29       0
2     2 2002-04-08      72.6       18       0 2004-03-07    74.5      15       0
3     3 2001-09-12      79.9       22       0 NA            NA        NA      NA
4     4 2001-06-13      79.9       26       0 2003-05-12    81.8      24       0
5     5 2001-02-12      82.6       NA       0 2003-04-20    84.8      22       0
# … with 19 more variables: Date_wave3 <date>, age_wave3 <dbl>,
#   mmse_wave3 <dbl>, dementia_wave3 <dbl>, Date_wave4 <date>, age_wave4 <dbl>,
#   mmse_wave4 <dbl>, dementia_wave4 <dbl>, Date_wave5 <date>, age_wave5 <dbl>,
#   mmse_wave5 <dbl>, dementia_wave5 <dbl>, Date_wave6 <date>, age_wave6 <dbl>,
#   mmse_wave6 <dbl>, dementia_wave6 <dbl>, age_base <dbl>, sex <dbl+lbl>,
#   education <dbl+lbl>, and abbreviated variable names ¹mmse_wave1,
#   ²dementia_wave1, ³age_wave2, ⁴mmse_wave2, ⁵dementia_wave2
```

???
Please run the code in RStudio to get a better view.
---

### Example

The column names are:

```r
sort(colnames(fake_data))
```

```
 [1] "age_base"       "age_wave1"      "age_wave2"      "age_wave3"     
 [5] "age_wave4"      "age_wave5"      "age_wave6"      "Date_wave1"    
 [9] "Date_wave2"     "Date_wave3"     "Date_wave4"     "Date_wave5"    
[13] "Date_wave6"     "dementia_wave1" "dementia_wave2" "dementia_wave3"
[17] "dementia_wave4" "dementia_wave5" "dementia_wave6" "education"     
[21] "Lopnr"          "mmse_wave1"     "mmse_wave2"     "mmse_wave3"    
[25] "mmse_wave4"     "mmse_wave5"     "mmse_wave6"     "sex"           
```

Now assume that for some reason, e.g. to merge it with another data set, we want to transform it into long format.

Three prefixed variables should be reshaped: 'age', 'Date' and 'dementia'. For beginners, I recommend starting small.

---
### Example
- Select the columns of interest

```r
fake_data %>%
  select(contains("Date")) %>%
  slice(1:5)
```

```
# A tibble: 5 × 6
  Date_wave1 Date_wave2 Date_wave3 Date_wave4 Date_wave5 Date_wave6
  <date>     <date>     <date>     <date>     <date>     <date>    
1 2001-06-11 2003-06-03 2005-05-27 2007-04-22 2009-04-10 2011-04-01
2 2002-04-08 2004-03-07 2006-02-08 2008-02-19 2010-03-26 2012-04-23
3 2001-09-12 NA         NA         NA         NA         NA        
4 2001-06-13 2003-05-12 2005-04-20 2007-03-28 2009-04-10 2011-04-27
5 2001-02-12 2003-04-20 2005-04-06 2007-03-14 2009-03-17 2011-02-22
```

---
### Example

Read the documentation and try to fill in the arguments.

```r
?tidyr::pivot_longer
```

- The 3 basic arguments are:
  * `cols`: tells R which columns to pivot.
  * `names_to`: the name of the new column that stores the names of the columns in `cols`. 
  * `values_to`: the name of the new column that stores the **values** of the columns in `cols`.

---

### Example
Let's give a first try:

```r
fake_data %>%
  select(Lopnr, contains("Date")) %>%
  pivot_longer(
    cols = contains("Date"),
    names_to = "wave", values_to = "date",
    names_prefix = "Date"
  )
```

```
# A tibble: 18,000 × 3
   Lopnr wave   date      
   <dbl> <chr>  <date>    
 1     1 _wave1 2001-06-11
 2     1 _wave2 2003-06-03
 3     1 _wave3 2005-05-27
 4     1 _wave4 2007-04-22
 5     1 _wave5 2009-04-10
 6     1 _wave6 2011-04-01
 7     2 _wave1 2002-04-08
 8     2 _wave2 2004-03-07
 9     2 _wave3 2006-02-08
10     2 _wave4 2008-02-19
# … with 17,990 more rows
```

---

### Example

The result above looks good, but the 'wave' column looks a bit strange. I will leave it to the audience to fix this column.

--
* Do the same with 'dementia' columns

```r
fake_data %>%
  select(Lopnr, contains("dementia")) %>%
  pivot_longer(
    cols = contains("dementia"),
    names_to = "wave", names_prefix = "dementia",
    values_to = "dementia"
  )
```

```
# A tibble: 18,000 × 3
   Lopnr wave   dementia
   <dbl> <chr>     <dbl>
 1     1 _wave1        0
 2     1 _wave2        0
 3     1 _wave3        0
 4     1 _wave4        0
 5     1 _wave5        0
 6     1 _wave6        0
 7     2 _wave1        0
 8     2 _wave2        0
 9     2 _wave3        0
10     2 _wave4        0
# … with 17,990 more rows
```

???
Pay attention to which arguments changed and which did not.

---

### Data transform exercise (5 - 10 min)

- Read the documentation. Change the arguments in the `pivot_longer()` function to get a proper `wave` column.

*Example result (first 5 rows):*

```
# A tibble: 5 × 3
  Lopnr wave  date      
  <dbl> <fct> <date>    
1     1 1     2001-06-11
2     1 2     2003-06-03
3     1 3     2005-05-27
4     1 4     2007-04-22
5     1 5     2009-04-10
```

- Merge the two long pivot data sets together.

*Consider: is it enough to only join on one column? Why or why not?*
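
One possible approach to both tasks, sketched on a tiny made-up data set (`wide` is hypothetical and only mimics the structure of `fake_data`). It uses the `names_prefix` and `names_transform` arguments of `pivot_longer()`. Other solutions are equally valid.

```r
library(dplyr)
library(tidyr)

# Toy wide data, made up to mimic the structure of `fake_data`
wide <- tibble(
  Lopnr = 1:2,
  Date_wave1 = as.Date(c("2001-06-11", "2002-04-08")),
  Date_wave2 = as.Date(c("2003-06-03", "2004-03-07")),
  dementia_wave1 = c(0, 0),
  dementia_wave2 = c(0, 1)
)

long_date <- wide %>%
  select(Lopnr, starts_with("Date")) %>%
  pivot_longer(
    cols = starts_with("Date"),
    names_to = "wave", values_to = "date",
    names_prefix = "Date_wave",               # strip the whole prefix
    names_transform = list(wave = as.factor)  # store wave as a factor
  )

long_dementia <- wide %>%
  select(Lopnr, starts_with("dementia")) %>%
  pivot_longer(
    cols = starts_with("dementia"),
    names_to = "wave", values_to = "dementia",
    names_prefix = "dementia_wave",
    names_transform = list(wave = as.factor)
  )

# Join on both keys: in long format, Lopnr alone no longer identifies a row
long_all <- left_join(long_date, long_dementia, by = c("Lopnr", "wave"))
long_all
```

Joining on `Lopnr` only would pair every wave of a person with every other wave of the same person, which is why `wave` is included as a second key.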

---

## Select variables by name pattern

Sometimes the variable names in your data follow specific patterns, e.g.:

- Each wave has its own specific prefix.
- Each questionnaire/in-person test has its own prefix.
- ...

In this case, the `select()` function can help you. Some useful **selection helpers** are

- `starts_with()`
- `contains()`
- `matches()`
- ...

For more details, please see the documentation:

```r
?dplyr::select
```

???
Explain them. 90% of the time I just use these four.

---

### starts_with

```r
fake_data %>%
  select(starts_with("Date"))
```

### contains

```r
fake_data %>%
  select(contains("Date") & contains("1"))
```

- How would we select variables containing 'Date' or 'age'? Let's try it.
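
One possible answer, sketched on a made-up `toy` tibble: selection helpers can be combined with `|` (union), just as `&` (intersection) is used above.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  Lopnr = 1:2,
  Date_wave1 = as.Date(c("2001-06-11", "2002-04-08")),
  age_wave1 = c(77.6, 72.6),
  mmse_wave1 = c(29, 18)
)

# `|` takes the union of two selections, `&` the intersection
date_or_age <- toy %>%
  select(contains("Date") | contains("age"))

names(date_or_age)
```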

---

### matches

- `matches()` is more advanced since it uses a [regular expression (regex)](https://en.wikipedia.org/wiki/Regular_expression).

- There are many regex cheatsheets online, for example [this one](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf).
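
A small sketch of how a regex narrows a selection with `matches()`. The `toy` tibble and the pattern are made up for illustration.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  Lopnr = 1,
  Date_wave1 = as.Date("2001-06-11"),
  Date_wave2 = as.Date("2003-06-03"),
  age_wave1 = 77.6,
  age_base = 77.6
)

# ^ anchors the start, (Date|age) allows either prefix,
# \\d matches exactly one digit, $ anchors the end
wave_cols <- toy %>%
  select(matches("^(Date|age)_wave\\d$"))

names(wave_cols)
```

`age_base` is left out because it does not end in `_wave` followed by a digit.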

???

Please note that regex in R can be slightly different from regex in other programming languages, e.g. Python.

Regex is a powerful tool to help with finding the patterns in a string.

- One useful website to test regex is [here](https://regex101.com).

```r
fake_data %>%
  select(matches("Date_wave\\d"))
```

---> Live demo

???

With `matches()` you can select essentially any pattern of variable names you want.

I can show how to test a pattern on the online test website.

---

## Get labels from datasets

- When we get SPSS or Stata data sets, they usually come with labels. E.g. in SPSS, one can check them in the "Variable View" tab.

- In R, one can use the `view()` function. In the example data, we have:

```r
view(fake_data)
```

![](pic/df_label.png)

- A natural question to ask: how to extract the labels in R?

???
This can be useful when you have hundreds of variables in a file.

In one project with 200+ variables, I needed to select the variables related to smoking.

---

If you run the `str()` function, you will find the labels in the "label" attribute. There are multiple ways to extract them.

```r
str(fake_data)
```

The one I use is the [sjlabelled](https://strengejacke.github.io/sjlabelled/articles/labelleddata.html) R package.

```r
label_char <- sjlabelled::get_label(fake_data)
label_char
```

```
                        Lopnr                    Date_wave1 
                         "ID"  "date examination 2011-2012" 
                    age_wave1                    mmse_wave1 
     "age at visit 2011-2012"     "mmse at visit 2011-2012" 
               dementia_wave1                    Date_wave2 
"dementia at visit 2011-2012"  "date examination 2011-2012" 
                    age_wave2                    mmse_wave2 
     "age at visit 2011-2012"     "mmse at visit 2011-2012" 
               dementia_wave2                    Date_wave3 
"dementia at visit 2011-2012"  "date examination 2011-2012" 
                    age_wave3                    mmse_wave3 
     "age at visit 2011-2012"     "mmse at visit 2011-2012" 
               dementia_wave3                    Date_wave4 
"dementia at visit 2011-2012"  "date examination 2011-2012" 
                    age_wave4                    mmse_wave4 
     "age at visit 2011-2012"     "mmse at visit 2011-2012" 
               dementia_wave4                    Date_wave5 
"dementia at visit 2011-2012"  "date examination 2011-2012" 
                    age_wave5                    mmse_wave5 
     "age at visit 2011-2012"     "mmse at visit 2011-2012" 
               dementia_wave5                    Date_wave6 
"dementia at visit 2011-2012"  "date examination 2011-2012" 
                    age_wave6                    mmse_wave6 
     "age at visit 2011-2012"     "mmse at visit 2011-2012" 
               dementia_wave6                      age_base 
"dementia at visit 2011-2012"                            "" 
                          sex                     education 
                     "gender"                   "education" 
```

|variable       |label                       |
|:--------------|:---------------------------|
|Lopnr          |ID                          |
|Date_wave1     |date examination 2011-2012  |
|age_wave1      |age at visit 2011-2012      |
|mmse_wave1     |mmse at visit 2011-2012     |
|dementia_wave1 |dementia at visit 2011-2012 |
|Date_wave2     |date examination 2011-2012  |
|age_wave2      |age at visit 2011-2012      |
|mmse_wave2     |mmse at visit 2011-2012     |
|dementia_wave2 |dementia at visit 2011-2012 |
|Date_wave3     |date examination 2011-2012  |
|age_wave3      |age at visit 2011-2012      |
|mmse_wave3     |mmse at visit 2011-2012     |
|dementia_wave3 |dementia at visit 2011-2012 |
|Date_wave4     |date examination 2011-2012  |
|age_wave4      |age at visit 2011-2012      |
|mmse_wave4     |mmse at visit 2011-2012     |
|dementia_wave4 |dementia at visit 2011-2012 |
|Date_wave5     |date examination 2011-2012  |
|age_wave5      |age at visit 2011-2012      |
|mmse_wave5     |mmse at visit 2011-2012     |
|dementia_wave5 |dementia at visit 2011-2012 |
|Date_wave6     |date examination 2011-2012  |
|age_wave6      |age at visit 2011-2012      |
|mmse_wave6     |mmse at visit 2011-2012     |
|dementia_wave6 |dementia at visit 2011-2012 |
|age_base       |                            |
|sex            |gender                      |
|education      |education                   |

---

The result is hard to read, so let's fix it by transforming it into a tibble:

```r
label_df <- tibble::rownames_to_column(
  as.data.frame(label_char),
  "variable"
)
label_df <- tibble::as_tibble(label_df)
label_df
```

```
# A tibble: 28 × 2
   variable       label_char                 
   <chr>          <chr>                      
 1 Lopnr          ID                         
 2 Date_wave1     date examination 2011-2012 
 3 age_wave1      age at visit 2011-2012     
 4 mmse_wave1     mmse at visit 2011-2012    
 5 dementia_wave1 dementia at visit 2011-2012
 6 Date_wave2     date examination 2011-2012 
 7 age_wave2      age at visit 2011-2012     
 8 mmse_wave2     mmse at visit 2011-2012    
 9 dementia_wave2 dementia at visit 2011-2012
10 Date_wave3     date examination 2011-2012 
# … with 18 more rows
```
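
A shorter alternative, as a sketch: `tibble::enframe()` converts a named vector into a two-column tibble in one call. The small `label_char` vector below is made up and only stands in for the output of `sjlabelled::get_label()`.

```r
library(tibble)

# Toy named character vector, standing in for the output of get_label()
label_char <- c(
  Lopnr = "ID",
  sex = "gender",
  education = "education"
)

# enframe() turns a named vector into a two-column tibble
label_df <- enframe(label_char, name = "variable", value = "label_char")
label_df
```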

---

The result looks much better! Now we can do lots of things with the pipeline.

- Keep the rows whose label contains 'dementia':

```r
label_df %>%
  filter(grepl("dementia", label_char))
```

```
# A tibble: 6 × 2
  variable       label_char                 
  <chr>          <chr>                      
1 dementia_wave1 dementia at visit 2011-2012
2 dementia_wave2 dementia at visit 2011-2012
3 dementia_wave3 dementia at visit 2011-2012
4 dementia_wave4 dementia at visit 2011-2012
5 dementia_wave5 dementia at visit 2011-2012
6 dementia_wave6 dementia at visit 2011-2012
```

---

- Keep the rows whose variable name contains 'wave':

```r
label_df %>%
  filter(grepl("wave", variable))
```

```
# A tibble: 24 × 2
   variable       label_char                 
   <chr>          <chr>                      
 1 Date_wave1     date examination 2011-2012 
 2 age_wave1      age at visit 2011-2012     
 3 mmse_wave1     mmse at visit 2011-2012    
 4 dementia_wave1 dementia at visit 2011-2012
 5 Date_wave2     date examination 2011-2012 
 6 age_wave2      age at visit 2011-2012     
 7 mmse_wave2     mmse at visit 2011-2012    
 8 dementia_wave2 dementia at visit 2011-2012
 9 Date_wave3     date examination 2011-2012 
10 age_wave3      age at visit 2011-2012     
# … with 14 more rows
```

---

## Missing values

### Count the NA

- To count the NA values in one column, we could do something like:

```r
sum(is.na(fake_data$Date_wave1))
```

```
[1] 0
```
--

- What if there are multiple columns?

---

### Count NA in multiple columns

```r
fake_data %>%
  summarise(across(
    where(lubridate::is.Date),
    ~ sum(is.na(.))
  ))
```

```
# A tibble: 1 × 6
  Date_wave1 Date_wave2 Date_wave3 Date_wave4 Date_wave5 Date_wave6
       <int>      <int>      <int>      <int>      <int>      <int>
1          0        207        489        748        983       1172
```

- You may wonder: what is this `~ sum(is.na(.))`?

- It is a purrr-style [lambda function](https://en.wikipedia.org/wiki/Anonymous_function).

- You may also wonder: what is `across()`?

- It tells `summarise()` which operation to apply to which columns.
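
To make the lambda less mysterious, here is a sketch showing the same NA count written two ways on a made-up `toy` tibble. Both forms should give identical results.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  a = c(1, NA, 3),
  b = c(NA, NA, 6)
)

# purrr-style lambda: `.x` (or `.`) stands for the current column
na_formula <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# The same operation written as an ordinary anonymous function
na_function <- toy %>%
  summarise(across(everything(), function(x) sum(is.na(x))))

na_formula
```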

???
The Wikipedia explanation is not straightforward. My own understanding is that a lambda function is a small function you define yourself on the spot, without giving it a name, as opposed to a standard named function like `mean()` or `sd()`.

- One can also use `colSums(is.na(fake_data))`.

???
Explain why I prefer the pipeline version

---

## Missing value exercise (5 - 10 min)

- Read the documentation of the `across()` function.
- Count the NA values in all numeric/integer columns.
- Count the NA values in columns whose names contain a certain string, e.g. 'edu'.

*Example result*

```
# A tibble: 1 × 22
  Lopnr age_wa…¹ mmse_…² demen…³ age_w…⁴ mmse_…⁵ demen…⁶ age_w…⁷ mmse_…⁸ demen…⁹
  <int>    <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
1     0        0     135       0     207     323     207     489     605     489
# … with 12 more variables: age_wave4 <int>, mmse_wave4 <int>,
#   dementia_wave4 <int>, age_wave5 <int>, mmse_wave5 <int>,
#   dementia_wave5 <int>, age_wave6 <int>, mmse_wave6 <int>,
#   dementia_wave6 <int>, age_base <int>, sex <int>, education <int>, and
#   abbreviated variable names ¹age_wave1, ²mmse_wave1, ³dementia_wave1,
#   ⁴age_wave2, ⁵mmse_wave2, ⁶dementia_wave2, ⁷age_wave3, ⁸mmse_wave3,
#   ⁹dementia_wave3
```

```
# A tibble: 1 × 1
  education
      <int>
1         0
```
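
One possible way to approach both tasks, sketched on a made-up `toy` tibble rather than `fake_data`:

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  Lopnr = 1:3,
  age_wave1 = c(77.6, NA, 79.9),
  Date_wave1 = as.Date(c("2001-06-11", NA, NA)),
  education = c(1, 2, NA)
)

# NA count in all numeric columns (integer columns count as numeric,
# Date columns do not)
na_numeric <- toy %>%
  summarise(across(where(is.numeric), ~ sum(is.na(.x))))

# NA count in columns whose name contains "edu"
na_edu <- toy %>%
  summarise(across(contains("edu"), ~ sum(is.na(.x))))

na_numeric
na_edu
```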

---

## Mutation based on column type

- Now you should have some understanding of the `where()` function.

- When you combine it with the `mutate()` function, you can do many manipulations, e.g. round the digits of numeric columns.

```r
fake_data %>%
  mutate(across(
    where(is.numeric),
    ~ round(., digits = 2)
  ))
```

```
# A tibble: 3,000 × 28
   Lopnr Date_wave1 age_wave1 mmse_…¹ demen…² Date_wave2 age_w…³ mmse_…⁴ demen…⁵
   <dbl> <date>         <dbl>   <dbl>   <dbl> <date>       <dbl>   <dbl>   <dbl>
 1     1 2001-06-11      77.6      29       0 2003-06-03    79.5      29       0
 2     2 2002-04-08      72.6      18       0 2004-03-07    74.5      15       0
 3     3 2001-09-12      80.0      22       0 NA            NA        NA      NA
 4     4 2001-06-13      79.9      26       0 2003-05-12    81.8      24       0
 5     5 2001-02-12      82.6      NA       0 2003-04-20    84.8      22       0
 6     6 2001-07-05      79.3      25       0 2003-07-20    81.4      24       0
 7     7 2001-12-28      86.4      NA       0 2003-12-11    88.4      NA       1
 8     8 2002-09-17      77.1      26       0 2004-09-01    79.0      25       0
 9     9 2001-09-14      79.8      30       0 2003-08-30    81.8      30       0
10    10 2001-09-21      69.0      29       0 2003-09-27    71.0      29       0
# … with 2,990 more rows, 19 more variables: Date_wave3 <date>,
#   age_wave3 <dbl>, mmse_wave3 <dbl>, dementia_wave3 <dbl>, Date_wave4 <date>,
#   age_wave4 <dbl>, mmse_wave4 <dbl>, dementia_wave4 <dbl>, Date_wave5 <date>,
#   age_wave5 <dbl>, mmse_wave5 <dbl>, dementia_wave5 <dbl>, Date_wave6 <date>,
#   age_wave6 <dbl>, mmse_wave6 <dbl>, dementia_wave6 <dbl>, age_base <dbl>,
#   sex <dbl>, education <dbl>, and abbreviated variable names ¹mmse_wave1,
#   ²dementia_wave1, ³age_wave2, ⁴mmse_wave2, ⁵dementia_wave2
```

???
I often use this for data visualization.

---

## Bin variables

- The `case_when()` function does this work easily, and it fits well with the pipeline syntax.

- For example, suppose we want to bin the MMSE score at wave 1:

```r
fake_data %>%
  transmute(mmse_wave1,
    mmse_wave1_bin = case_when(
      between(mmse_wave1, 0, 9) ~ "Severe dementia",
      between(mmse_wave1, 10, 18) ~ "Moderate dementia",
      between(mmse_wave1, 19, 23) ~ "Mild dementia",
      mmse_wave1 >= 24 ~ "No dementia",
      TRUE ~ NA_character_
    )
  )
```

```
# A tibble: 3,000 × 2
   mmse_wave1 mmse_wave1_bin   
        <dbl> <chr>            
 1         29 No dementia      
 2         18 Moderate dementia
 3         22 Mild dementia    
 4         26 No dementia      
 5         NA <NA>             
 6         25 No dementia      
 7         NA <NA>             
 8         26 No dementia      
 9         30 No dementia      
10         29 No dementia      
# … with 2,990 more rows
```
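
The example above uses fixed cut-offs. To bin by percentiles instead, one option is to compute the cut points with `quantile()` first. This is a sketch on made-up MMSE scores; the column and label names are hypothetical.

```r
library(dplyr)

# Toy MMSE scores, made up for illustration
toy <- tibble(mmse = c(12, 18, 22, 25, 26, 28, 29, 30, NA))

# Quartile cut points computed from the data itself
q <- unname(quantile(toy$mmse, probs = c(0.25, 0.5, 0.75), na.rm = TRUE))

binned <- toy %>%
  mutate(
    mmse_quartile = case_when(
      mmse <= q[1] ~ "Q1",
      mmse <= q[2] ~ "Q2",
      mmse <= q[3] ~ "Q3",
      mmse > q[3] ~ "Q4",
      TRUE ~ NA_character_ # missing scores stay missing
    )
  )
binned
```

`dplyr::ntile()` is another option when equally sized groups are all you need.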

---

### Assign function

I would like to spend some time introducing the `assign()` function in R.

- It is not part of the `tidyverse` family, but it is really useful, for example when you bulk import data sets into the R environment or want to bulk rename the imported data frames.

- A simple example (run it in RStudio):

```r
df_names <- character(0)
set.seed(2023)
for (i in 1:4) {
  # build the object name, e.g. "edu_1"
  df_names[i] <- paste0("edu_", i)
  # create an object with that name holding 10 random draws from 1:3
  assign(df_names[i], sample.int(3, 10, replace = TRUE))
  # print the result
  cat(
    "The df name is: ", df_names[i], "\n",
    "its value is: ", get(df_names[i]), "\n"
  )
}
```

---

### Messy dataframe name

- Imagine you have these data frames in your environment. 
  - Chances are they were named this way when you got the data files from someone. They have spaces and "-". Horrible!

```r
objects_name <- c(
  "Cohort1_Baseline_BMI", "Cohort1_FU1_BMI", "Cohort1_FU2_BMI",
  "Cohort1FU3_Cohort2FU2", "Cohort2_Baseline_BMI", "envir_name",
  "Gender data request_20230111", "Gender data request_20230116",
  "index", "NEAR_BMI-mortality", "SNAC-K C1 B-F6 cohort file",
  "SNAC-K C1_B", "SNAC-K C1_F1", "SNAC-K C1_F2", "SNAC-K C1_F3",
  "SNAC-K C1_F4", "SNAC-K C1_F5"
)

for (i in seq_along(objects_name)) {
  # assign some values to these objects
  assign(objects_name[i], sample(100, 10))
}
```

---

- Look at your 'Environment' panel, or use `ls()` to retrieve all object names in your R environment. Does it look familiar?

---

### Let's fix it with the `assign()` function (one possible way)

```r
clean_names <- gsub(" |-", "_", objects_name)

for (i in seq_along(clean_names)) {
  assign(clean_names[i], get(objects_name[i]))
}

# just show the first few
clean_names[1:5]
```

```
[1] "Cohort1_Baseline_BMI"  "Cohort1_FU1_BMI"       "Cohort1_FU2_BMI"      
[4] "Cohort1FU3_Cohort2FU2" "Cohort2_Baseline_BMI" 
```

- Check your 'Environment' panel again.
--

- The `gsub()` pattern can be customized with regex syntax, so you can fix essentially any kind of messy data frame name.
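
Note that copying an object to a clean name leaves the messy-named original in place. A sketch of one way to also remove the originals, run in a separate environment (`env`, a name chosen here for illustration) so that it does not touch your workspace:

```r
# A dedicated environment keeps the sketch away from your workspace
env <- new.env()

# Two messy names taken from the list above, filled with random values
messy_names <- c("SNAC-K C1_B", "Gender data request_20230111")
for (nm in messy_names) {
  assign(nm, sample(100, 10), envir = env)
}

clean_names <- gsub(" |-", "_", messy_names)

for (i in seq_along(messy_names)) {
  # copy the object to its clean name
  assign(clean_names[i], get(messy_names[i], envir = env), envir = env)
}

# drop the originals whose names actually changed
rm(list = setdiff(messy_names, clean_names), envir = env)

ls(env)
```

`setdiff()` protects objects whose names were already clean, since for those the old and new name are the same object.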

???

Another application of assign is to use it in ggplot.

---
class: center, middle

# R markdown

## Basics and daily work uses

---

## Introduction

- The complete Rmarkdown documentation can be found [here](https://rmarkdown.rstudio.com/lesson-2.html).
--

- The key is to understand these three pillars:

**Rmarkdown = markdown + yaml heading + code chunk options.**

---

## Markdown

* Markdown is a lightweight markup language for creating formatted text using a plain-text editor.<sup>*</sup>
--

* Advantage: 
  * You do not need to worry about the format. Just focus on the writing itself. It can be super fast.
  * Little chance of inconsistent formatting across the whole document.
--

* Disadvantage:
  * A steep learning curve in the beginning.
--
  * However, this disadvantage is greatly reduced by many user-friendly editors, e.g. Overleaf.

Let me give a live example via an [online markdown editor](https://stackedit.io/app#).

.footnote[[*] https://en.wikipedia.org/wiki/Markdown]

???
This slide is written with markdown syntax.

Before, I had to use a LaTeX editor and compile and debug all the time.

With Overleaf, life became much easier. I wrote my master's thesis with it.

No need to worry about the font size, color, logo, etc. Just write!

---

## Code chunk options

* Code chunk options in RMD help you control how your code is executed and shown in the output file. 
--

* Some useful ones are as follows:<sup>*</sup>

- `include = FALSE` prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
- `echo = FALSE` prevents code, but not the results, from appearing in the finished file. This is a useful way to embed figures.
- `message = FALSE` prevents messages that are generated by code from appearing in the finished file.
- `warning = FALSE` prevents warnings that are generated by code from appearing in the finished file.

.footnote[[*] https://rmarkdown.rstudio.com/lesson-3.html]

---

background-image: url(https://d33wubrfki0l68.cloudfront.net/22b5294b277483dd102640b43b5e70d013f677de/08292/resources/rstudioconf-2018/create-and-maintain-websites-with-r-markdown-and-blogdown/thumbnail_hubae0b01a3d61d8344b113ddc4adfe671_69897_1000x0_resize_q75_box.jpg)

.footnote[picture credit: https://www.redarmy.in/]

---

<div class="figure" style="text-align: center">
<img src="pic/local_set.png" alt="local option" width="569" />
<p class="caption">local option</p>
</div>

<div class="figure" style="text-align: center">
<img src="pic/global_set.png" alt="global option" width="1023" />
<p class="caption">global option</p>
</div>
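
In code, the two levels look roughly as follows. This is a sketch that assumes the screenshots above show a chunk-header option and a `knitr::opts_chunk$set()` call; the specific option values are chosen only for illustration.

```r
library(knitr)

# Global setting: usually placed in a setup chunk at the top of the .Rmd file.
# It becomes the default for every chunk that follows.
opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

# Local setting: written in the header of a single chunk, e.g.
#   {r, echo = TRUE}
# It overrides the global default for that chunk only.

# Check the current global default
opts_chunk$get("echo")
```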

---

### RMD chunk option exercises (1-3 minutes)

* Print the column names of `fake_data`, but do not show the code itself.

* Print the code above, but do not show the result.

* Show neither the code nor the result, but still run the code so that its results are available for further analysis.

---

## YAML

```yaml
---
title: "A Cool Presentation"
output: html_document
---
```

```yaml
---
title: "A Cool Presentation"
output:
  word_document:
    toc: yes
    toc_depth: 2
---
```

* The YAML heading basically defines the type of output file, layout, etc.

---

### Live demo in RStudio

* Get the templates of html or word output:

```r
workshopr::get_rmd_2023(
  name = "pretty_template",
  output_file = "word"
)
workshopr::get_rmd_2023(
  name = "pretty_template",
  output_file = "html"
)
```

* Let's take a look together!

---

# Final exercise

Please check this link [here](https://github.com/Bolin-Wu/workshopr/blob/main/material/2023_intermediate/tidyverse_RMD_session/slides/rmd/final_exercise.pdf).

---

## Wrap up

* Avoiding repetitive code often helps you, and also helps others who review your code.
--

* Rmarkdown has a steep learning curve in the beginning, but once you are familiar with it, your life will be easier.
--

* This whole slide deck is written in Rmarkdown. The source code is [here](https://github.com/Bolin-Wu/workshopr/blob/main/material/2023_intermediate/tidyverse_RMD_session/slides/index.Rmd). The R example code is pulled from the Rmarkdown file, so I do not need to copy and paste.
--

- The materials are collected in the GitHub repo: https://github.com/Bolin-Wu/workshopr

- I hope today's content was enlightening. If you have any questions, please contact me: bolin.wu@ki.se