R beyond - Automatization

You might know R from data analysis. But R can do much more than that for you.

Adrian Stanciu

Faculty of Humanities, Education and Social Sciences (FHSE), University of Luxembourg

Previously

A brief reminder.

The universe

R, RStudio, and GitHub are the basic tools for an integrated workflow.
Preferably, we work with R objects.
Preferably, we save code in R script files.
clone, push, pull, and commit are the basic git verbs.
Calling a function has the structure {function name}(), whereas writing a custom function needs the structure function(){}.
Packages simplify the workflow but may not always be maintained.
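
A minimal sketch of the two function structures from the reminder: calling an existing function and writing a custom one. The name `describe_scores` and its input values are made up for illustration.

```r
# Calling an existing function: {function name}()
mean(c(2, 4, 6))

# Writing a custom function: function(){}
# `describe_scores` is a hypothetical name used only for this example
describe_scores <- function(x) {
  # return the mean and standard deviation, ignoring missing values
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}

# Calling the custom function follows the same pattern as any other function
scores_summary <- describe_scores(c(3, 4, NA, 5))
scores_summary
```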

Work routine template

Back-and-forth code writing in the console
Save working code in an R script file
Create an .Rproj to take care of path dependencies
Create and/or clone an online repository on GitHub
git commit and git push the R script file to the online GitHub repository
git pull committed updates from the online repository

Automatization

Towards an integrated and reproducible workflow.

Good for

Repetitive tasks can be reduced through the use of code and a designated work environment.

Among the domains where this is an asset are:

Research: Data analysis and results interpretation, writing manuscripts, and adhering to open science.
Applied sector: Writing of repetitive reports.
Education: Transparent homework.

Elements and structure

Rmarkdown is an enhanced document type that supports plain text, code and formatting, and from which output file formats can be rendered: PDF, DOCX, HTML are the most common formats.

.Rmd: editable markdown document in R
knitr: engine (jupyter is the engine for Python)
md: simplified markdown document in a markup language
pandoc: document converter

Elements and structure

What we need to focus on is the .Rmd (or the new generation .qmd addressed in the next sessions).

flowchart LR
  A(Plain text) --> Z{.Rmd}
  B(Code) --> Z{.Rmd}
  C(Images) --> Z{.Rmd}
  D(External source file) --> Z{.Rmd}
  E(Formatting source file) --> Z{.Rmd}
  F(...) --> Z{.Rmd}

Elements that an .Rmd file can integrate

Elements and structure

When knitting to an HTML output format, we don’t need to install anything else.

But, when knitting to a PDF output format we’d need a latex distribution!

latex distribution needed

Install the tinytex package now.

# to install tinytex distribution 
install.packages('tinytex')
tinytex::install_tinytex()
# to uninstall TinyTeX, run tinytex::uninstall_tinytex()

Live/enhanced documents

Documents that are coded to retrieve data and/or information from external source material (e.g., datasets or metadata such as Excel sheets).

This is the building block for creating all sorts of automatized reports.

Path dependencies

Working with an .Rproj takes care of that.

Otherwise, paths to external source files need to be specified in full and correctly.

Path dependencies inside projects

To benefit from established path dependencies, work with subfolders inside your project repository.

When calling a file that is stored inside a subfolder, make sure to call the subfolder first, followed by the file itself.

SUBFOLDER/FILE.FORMAT TYPE

When calling a file from the root repository folder (where the .Rproj extension is stored) simply call the file itself.
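
A minimal sketch of the two cases, assuming a hypothetical project layout in which `report.Rmd` sits next to the .Rproj file and the data sit in a `data` subfolder. The file and folder names are illustrative only.

```r
# Hypothetical project layout, with the .Rproj file in the root folder:
#   my-project.Rproj
#   report.Rmd
#   data/sample.sav

# A file in a subfolder: subfolder first, then the file itself
path_in_subfolder <- file.path("data", "sample.sav")
path_in_subfolder

# A file in the root folder: the file name alone is enough
path_in_root <- "report.Rmd"

# file.exists() is a quick way to check a path before importing anything
file.exists(path_in_subfolder)
```

By default, knitr evaluates code chunks relative to the folder that contains the .Rmd, so keeping the .Rmd in the project root keeps both kinds of path valid.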

The set up

Let us set it all up.

.Rmd inside an .Rproj

.Rmd elements

The yaml header defines the parameters of the entire rmarkdown document.

The pandoc document converter uses these parameters.

Examples are the output file format, title, author, or date of the document version.

See at lines 1 – 5.

Plain text is simple text: no editing, no hyperlinks, no enhanced fields.

Special formatting is possible. See next slide for a brief guide.

See at lines 13 – 15.

A code chunk is an enhanced field where code is integrated. Various programming languages can be integrated: R, Python, SQL, Julia, and so on.

Useful when the goal is, for example, integration of output figures and tables.

See at lines 17 – 19.

Inline code is an enhanced field where code is integrated seamlessly with plain text.

Keep inline code simple. Use code chunks to prepare output before integrating code with plain text.

`r code here`

See next slides.
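
A minimal sketch of the advice to keep inline code simple: the values are prepared in a code chunk, so the inline code only has to print an object. The data frame `movies` and the object names are made up for illustration.

```r
# Made-up stand-in for an imported metadata sheet
movies <- data.frame(
  Movie = c("Movie A", "Movie B", "Movie C"),
  Grade = c(10, 9, 6)
)

# Prepare the values in a code chunk ...
n_movies   <- nrow(movies)
top_movies <- paste(movies$Movie[movies$Grade >= 9], collapse = " and ")

# ... so that the inline code in the text only has to print an object, e.g.
# The list has `r n_movies` movies; the best graded are `r top_movies`.
n_movies
top_movies
```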

Rmarkdown basics

Text formatting
Links
Headings
Lists
Math

*abc* italics abc

**abc** bold abc

`backstick`

code backstick

H~2~O subscript H₂O

R^2^ superscript R²

[https://www.r-project.org/](https://www.r-project.org/)

https://www.r-project.org/

![Image from local repository](img/logo.jpg)

![Image from the Internet](https://www.r-project.org/Rlogo.png)

# First level.

## Second level.

### Third level.

Watch the indentation!

Unordered

* Item 1
* Item 2
    + Item 2a
    + Item 2b

Item 1
Item 2
- Item 2a
- Item 2b

Ordered

1. Item 1
2. Item 2
3. Item 3
    + Item 3a
    + Item 3b

Item 1
Item 2
Item 3
- Item 3a
- Item 3b

Surround with $ for inline mathematics and with $$ for displayed equations. Leave no empty space between the $ and the expression!

$x = y$ becomes $x = y$

$\left(\int_{a}^{b} f(x) \; dx\right)$ becomes $\left(\int_{a}^{b} f(x) \; dx\right)$

Include Greek letters too.

$\alpha A$ becomes $\alpha A$

Equations

$$\sum_{n=1}^{10} n^2$$ becomes \[\sum_{n=1}^{10} n^2\]

Some yaml parameters

Copy-paste from Stanciu et al. (2024)

title: "Can human values explain one’s interest in cryptocurrencies? An explorative study in Germany"
author: 
  - name: "Adrian Stanciu"
    affiliation_number: 1
  - name: "Melanie Partsch"
    affiliation_number: 1
  - name: "Clemens Lechner"
    affiliation_number: 1
affiliations:
  - "GESIS-Leibniz Institute for the Social Sciences, Mannheim"
  - "University of Bremen, Bremen"
shorttitle: "Values and cryptocurrencies"
authors_note: "For correspondence contact Dr. Adrian Stanciu, Data and Research on Society, GESIS-Leibniz Institute for the Social Sciences, PO Box 12215, 68072 Mannheim, Germany. Email: adrian.stanciu[at]gesis.org"
abstract: "Write abstract here"
keywords: "Values, Cryptocurrencies, Germany"
date: "`r format(Sys.time(), '%d. %B, %Y')`"
doctype: doc
header-includes:
  - \usepackage{subfig}
output: 
  bookdown::pdf_document2:
    toc: False
    number_sections: False
    template: "style/template.tex"
csl: style/apa.csl
bibliography: reference.bib

Some r code chunk attributes

echo=TRUE # whether the code is displayed in the output file

eval=TRUE # whether the code is run and the outcome generated

include=TRUE # whether the code and its outcome are included in the output document

# sets attributes for entire document
knitr::opts_chunk$set(echo = TRUE,eval=FALSE,warning = FALSE,message = FALSE)

Code chunk attributes

It may be simpler to set code chunk attributes for the entire document at the beginning of the document.

Use this first code chunk to install and load all the relevant packages, too.

Familiarize yourself

Inspect the newly created .Rmd document and identify the discussed elements.

Play with the attributes and/or use online search engines to identify new attributes of code chunks and yaml parameters.

Packages

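Using pacman, we install and load the packages tidyverse, readxl (for reading Excel sheets), haven (for reading SPSS files), sjlabelled (for dealing with labelled dataframes), knitr and kableExtra (for creating tables; kable() is a function of knitr, not a package of its own).

# install pacman only if it is missing, then install (if needed) and load the other packages
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(tidyverse, readxl, haven, sjlabelled, knitr, kableExtra)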

Illustrative example

From here on, we build automatized reports, websites and books, and shiny apps using real data.

Data

We use the subsample data from Stanciu et al. (2017) and the movies.xlsx metadata.

Download
Import
Inspect sample.sav
Inspect movies.xlsx

Download from the R beyond data analysis book.

# create an object dataframe example `dfex` and assign to it the .sav file `sample.sav` that was introduced previously
dfex<-haven::read_sav("data/sample.sav")

# create an object movies metadata `dfmv` and assign to it the .xlsx file `movies.xlsx`
# note the different paths to these files
# note that we specify which sheet to read too; here only sheet 1 is imported
dfmv<-readxl::read_excel("mat/movies.xlsx",1)

# check if the source material was imported successfully 
# by observing the first lines in the tables
head(dfex)

# A tibble: 6 × 9
    ppn gen          age res       res_other men_warm men_comp wom_warm wom_comp
  <dbl> <dbl+lbl>  <dbl> <dbl+lbl> <chr>     <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
1   459 1 [Female]    24  5 [Iasi] -99        3 [Und…  4 [Agr… 3 [Unde… 4 [Agre…
2   592 2 [Male]      21  5 [Iasi] -99        3 [Und…  4 [Agr… 3 [Unde… 3 [Unde…
3   634 2 [Male]      21 NA        petrosani  4 [Agr…  5 [Str… 4 [Agre… 4 [Agre…
4   369 1 [Female]    30  8 [Gala… -99       NA       NA       4 [Agre… 4 [Agre…
5   121 1 [Female]    21  4 [Timi… -99        4 [Agr…  3 [Und… 3 [Unde… 4 [Agre…
6   127 1 [Female]    20  4 [Timi… -99        4 [Agr…  4 [Agr… 4 [Agre… 2 [Disa…

head(dfmv)

# A tibble: 4 × 6
  Movie                       Actor                 Like  Why     Grade Wikilink
  <chr>                       <chr>                 <chr> <chr>   <dbl> <chr>   
1 John Wick                   Keanu Reeves          Yes   Fight …    10 https:/…
2 Call me by your name        Timothee Chalamet     Yes   Beauti…    10 https:/…
3 Terminator                  Arnold Schwarzenegger Yes   Arnold      9 https:/…
4 4 months 3 weeks and 2 days <NA>                  Yes   Portra…     8 https:/…

Plain vs. enhanced text

The .Rmd is already a step toward automatization in that it retrieves external source material.

On its own this is not too helpful, because those contents are displayed statically, as plain information.

Inline coding can integrate enhanced text with plain text through the knitr engine.

This is powerful for repetitive reports or for a quick inspection of data collection progress.

Plain vs. enhanced text

What to look for:

what is repetitive
what can be integrated from external source material
what vector contains the desired information (character strings and numeric vectors behave differently)
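
A minimal sketch of why the vector type matters before its content is placed in text. The vectors `grades` and `titles` are made up for illustration.

```r
# Made-up vectors standing in for columns of an imported data set
grades <- c(10, 9, 8, 9)                   # numeric vector
titles <- c("Film A", "Film B", "Film C")  # character vector

# Numeric vectors can be summarised before they are printed
mean_grade <- round(mean(grades), 1)

# Character vectors cannot be averaged; they are counted or pasted together
n_titles   <- length(titles)
title_list <- paste(titles, collapse = ", ")

# Mixing the two types silently converts the numbers to text
mixed <- c(grades[1], titles[1])
class(mixed)
```

Checking `class()` of a column before using it inline is a cheap way to catch numbers that were imported as text.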

Plain vs. enhanced text

Enhancing plain text
What’s happening
Output enhanced text

This is an example of how automatization can be implemented in the workflow. 
My list of movies includes `r nrow(dfmv)` entries. 
The titles of those movies are `r dfmv$Movie`. 
Is there a movie on that list that I actually don't like? Well, the answer is that I dislike exactly 
`r dfmv %>% filter(Like %in% c("No","no","NO")) %>% nrow()`
movies on that list.

nrow(dfmv) # My list of movies includes...entries

[1] 4

dfmv$Movie # Title of those movies are...

[1] "John Wick"                   "Call me by your name"       
[3] "Terminator"                  "4 months 3 weeks and 2 days"

dfmv %>% filter(Like %in% c("No","no","NO")) %>% nrow() # ...I dislike exactly...

[1] 0

This is an example of how automatization can be implemented in the workflow. My list of movies includes 4 entries. The titles of those movies are John Wick, Call me by your name, Terminator, 4 months 3 weeks and 2 days. Is there a movie on that list that I actually don’t like? Well, the answer is that I dislike exactly 0 movies on that list.

DIY – Enhanced text

Modify the movies.xlsx or create your own metadata (.xlsx sheet) and then write an enhanced text in Rmarkdown.

Import .xlsx

Remember to import the .xlsx file using readxl::read_excel().

Watch out for the right path dependency.

Automated graphs and tables

Tables and graphs can be automatically updated with new data.

Graphs

n = 15
n = 60
n = 100

dfex_n15<-haven::read_sav("data/tmpdf1.sav") %>% 
  sjlabelled::remove_all_labels() %>% 
  mutate(gen=factor(gen),
         res=factor(res))

ggplot(dfex_n15, aes(x=gen, y=wom_warm)) + 
  labs(x="Gender",
       y="Stereotype of warmth") +
  geom_boxplot() + 
  theme_light()

dfex_n60<-haven::read_sav("data/tmpdf2.sav") %>% 
  sjlabelled::remove_all_labels() %>% 
  mutate(gen=factor(gen),
         res=factor(res))

ggplot(dfex_n60, aes(x=gen, y=wom_warm)) + 
  labs(x="Gender",
       y="Stereotype of warmth") +
  geom_boxplot() + 
  theme_light()

dfex<-haven::read_sav("data/sample.sav") %>% 
  sjlabelled::remove_all_labels() %>% 
  mutate(gen=factor(gen),
         res=factor(res))

ggplot(dfex, aes(x=gen, y=wom_warm)) + 
  labs(x="Gender",
       y="Stereotype of warmth") +
  geom_boxplot() + 
  theme_light()
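
The three plots above repeat the same ggplot code with a different dataset each time. One possible way to avoid the copy-paste is a small helper function, sketched here with simulated data. The name `plot_warmth` and the data frame `fake_df` are hypothetical and not part of the original material.

```r
library(ggplot2)

# Hypothetical helper: wraps the plotting code that is otherwise repeated per dataset
plot_warmth <- function(df) {
  ggplot(df, aes(x = gen, y = wom_warm)) +
    labs(x = "Gender",
         y = "Stereotype of warmth") +
    geom_boxplot() +
    theme_light()
}

# Simulated data with the same column names as in the slides (not the real data)
set.seed(1)
fake_df <- data.frame(
  gen      = factor(sample(1:2, 30, replace = TRUE)),
  wom_warm = sample(1:5, 30, replace = TRUE)
)

# The same call works for any dataset that has the columns gen and wom_warm
p <- plot_warmth(fake_df)
p
```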

Tables

Knit .xlsx sheet directly
Knit custom table

dfmv %>% knitr::kable(caption="Simple table using knitr::kable()",format = "pipe")

Simple table using knitr::kable()
Movie	Actor	Like	Why	Grade	Wikilink
John Wick	Keanu Reeves	Yes	Fight scenes	10	https://en.wikipedia.org/wiki/John_Wick_(film)
Call me by your name	Timothee Chalamet	Yes	Beautiful love story	10	https://en.wikipedia.org/wiki/Call_Me_by_Your_Name_(film)
Terminator	Arnold Schwarzenegger	Yes	Arnold	9	https://en.wikipedia.org/wiki/The_Terminator
4 months 3 weeks and 2 days	NA	Yes	Portrayal of life in communist Romania	8	https://en.wikipedia.org/wiki/4_Months%2C_3_Weeks_and_2_Days

# does some data manipulation to retrieve the required information
tmptbl<-dfmv %>% 
  filter(Actor %in% c("Keanu Reeves", "Alec Baldwin"))

# builds the summary table that we want to include
# in the final output document
extbl<-tibble(
  
  like=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Like,
  name=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Actor,
  movie=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Movie,
  wiki=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Wikilink
  
)
  
extbl %>% knitr::kable(caption="Movies graded 8 or more from liked and least liked actors", format="pipe")

Movies graded 8 or more from liked and least liked actors
like	name	movie	wiki
Yes	Keanu Reeves	John Wick	https://en.wikipedia.org/wiki/John_Wick_(film)

DIY – Edit the `.xlsx` sheet

Open movies.xlsx in Microsoft Excel and add one or more movies with the actor Alec Baldwin, while pretending you dislike the actor.

Or, modify the table code: replace the two actors with actors you like and dislike, and update the Excel sheet accordingly, making sure you maintain the sheet structure.

Re-knit the tables.

Knit with parameters

Knitting with parameters simplifies the work routine even further.

It uses a user-friendly interface, the shiny interface.

Parameters

Define the parameters in the yaml header.

Parameters

Characteristics of the document that are repetitive, both throughout the document and across the various versions of the document.

Progress illustrative example

The names of the actors in the Excel sheet movies.xlsx.

Which of the stereotype evaluations from the Stanciu et al. (2017) subsample we want to use for graph creation.

Also, which dataset we use.

Set up – yaml header

title: "example"
output: html_document
date: "2025-03-31"
params:
  actor:
    label: "Actor"
    value: "Keanu Reeves"
    input: select
    choices: ["Keanu Reeves", "Alec Baldwin","Arnold Schwarzenegger"]
    multiple: yes
  stereotype:
    label: "Stereotype evaluation"
    value: wom_warm
    input: select
    choices: [wom_warm,wom_comp,men_warm,men_comp]
    multiple: no
  sampledf:
    label: "Dataset version"
    value: sample.sav
    input: select
    choices: [sample.sav,tmpdf1.sav,tmpdf2.sav]
    multiple: no

Using parameters in code

We can either use parameters directly or assign them to an object.

actor<-params$actor
st<-params$stereotype

Calling parameters

Remember to always call parameters as such: params${parameter name}, using the name defined in the yaml header (for example, params$actor), not its label.
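
A minimal sketch of how parameters are accessed. While knitting, rmarkdown creates the `params` list from the yaml header. Outside of knitting the list does not exist, so this sketch builds a stand-in by hand, which is an assumption made only so that the example can be run in the console.

```r
# Stand-in for the list that rmarkdown creates from the yaml header while knitting
params <- list(
  actor      = c("Keanu Reeves", "Alec Baldwin"),
  stereotype = "wom_warm"
)

# Use a parameter directly ...
length(params$actor)

# ... or assign it to an object first and reuse the object
st <- params$stereotype
plot_title <- paste0("Evaluation based on ", st)
plot_title
```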

Using parameters in code

# 1 - imports dataset into object tempdf
tempdf<-haven::read_sav("data/tmpdf1.sav") %>% 
  sjlabelled::remove_all_labels() %>% 
  pivot_longer(contains("warm") | contains("comp")) %>% 
  filter(name %in% st)

# 2 - applies the ggplot to the dataset
ggplot(tempdf, aes(x=factor(gen), y=value)) + 
  labs(title=paste0("Evaluation based on ",st), 
       x="Gender",
       y="Stereotype") +
  geom_boxplot() + 
  theme_light()

Using parameters in code

# does some data manipulation to retrieve the required information
tmptbl<-dfmv %>% 
  filter(Actor %in% actor)

# builds the summary table that we want to include
# in the final output document
extbl<-tibble(
  
  like=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Like,
  name=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Actor,
  movie=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Movie,
  wiki=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Wikilink
  
)

Where is the difference

Without parameter

tmptbl<-dfmv %>% 
  filter(Actor %in% c("Keanu Reeves", "Alec Baldwin"))

Parameter defined

tmptbl<-dfmv %>% 
  filter(Actor %in% actor)

Knit with parameters

Even more…

One other way to work with parameterized reports is to code the document such that it creates tables (or anything else for that matter) using a specific dataset.

sampledf<-paste0("data/",params$sampledf) # assigns parameter
abc<-haven::read_sav(sampledf) # uses parameter in code

abc %>% 
  sjlabelled::remove_all_labels() %>% 
  pivot_longer(contains("warm") | contains("comp")) %>% 
  group_by(name) %>% 
  summarise(mean=mean(value, na.rm = TRUE), # we use missing remove TRUE (na.rm=TRUE) to make sure r gives an output
            sd=sd(value, na.rm = TRUE),
            min=min(value, na.rm = TRUE),
            max=max(value, na.rm = TRUE))
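
Besides the shiny interface, parameters can also be passed from code through the `params` argument of `rmarkdown::render()`. The sketch below renders one report per dataset. The file name `example.Rmd` and the output file names are assumptions, and the loop only runs if that file exists.

```r
# Hypothetical file name: replace with the path to your own parameterized .Rmd
rmd_file <- "example.Rmd"

# Dataset versions, as listed under `choices` in the yaml header
datasets <- c("sample.sav", "tmpdf1.sav", "tmpdf2.sav")

# One output file name per dataset, e.g. "report_sample.html"
out_files <- paste0("report_", tools::file_path_sans_ext(datasets), ".html")

# Render only if the .Rmd and the rmarkdown package are available
if (file.exists(rmd_file) && requireNamespace("rmarkdown", quietly = TRUE)) {
  for (i in seq_along(datasets)) {
    rmarkdown::render(
      input       = rmd_file,
      output_file = out_files[i],
      params      = list(sampledf = datasets[i])  # overrides the yaml default
    )
  }
}
```

Parameters that are not listed in `params` keep the default `value` from the yaml header.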

DIY – parameterized reports

Download the Examples .Rmd from the R beyond data analysis book and think of new parameters to add to the document.

https://adrian-stanciu.quarto.pub/r-beyond-data-analysis/

Reference list

Stanciu, A., Cohrs, C. J., Hanke, K., & Gavreliuc, A. (2017). Within-culture variation in the content of stereotypes: Application and development of the stereotype content model in an Eastern European culture. The Journal of Social Psychology, 157(5), 611–628. https://doi.org/10.1080/00224545.2016.1262812

Stanciu, A., Partsch, M. V., & Lechner, C. M. (2024). Basic human values and the adoption of cryptocurrency. Frontiers in Psychology, 15. https://doi.org/10.3389/fpsyg.2024.1395674
