Automatization
Faculty of Humanities, Education and Social Sciences (FHSE), University of Luxembourg
A brief reminder.
R, RStudio, and GitHub are the basic tools for an integrated workflow.
Preferably, we work with R objects.
Preferably, we save code in R script files.
clone, push, pull, and commit are the basic git verbs.
Functions are called with the structure {function name}(). Writing custom functions, however, requires the structure function(){}.
Packages simplify the workflow but may not always be maintained.
Back-and-forth code writing in the console
Save working code in an R script file
Create an R project (.Rproj) to ensure path dependencies
Create and/or clone an online repository on GitHub
git commit and git push the R script file to the GitHub online repository
git pull committed updates from the online repository
Towards an integrated and reproducible workflow.
Repetitive tasks can be reduced through the use of code and a designated work environment.
Among the domains where this is an asset are:
Rmarkdown is an enhanced document type that supports plain text, code and formatting, and from which output file formats can be rendered: PDF, DOCX, HTML are the most common formats.
.Rmd: editable markdown document in R
knitr: engine (jupyter is the engine for python)
.md: simplified markdown document in a markup language
pandoc: document converter
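The first step of this chain, from .Rmd to .md, can be tried out in the console. What follows is a minimal sketch, assuming the knitr package is installed. The object names `rmd_lines` and `md` and the toy content are made up for illustration. pandoc would then take the resulting markdown and convert it to the final output format.

```r
library(knitr)

# a tiny .Rmd document held in memory (toy content, for illustration only):
# one line of plain text with inline code, followed by one code chunk
rmd_lines <- c(
  "The sum of 1 and 1 is `r 1 + 1`.",
  "",
  "```{r}",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "```"
)

# knitr executes the code and returns plain markdown as text
md <- knit(text = rmd_lines, quiet = TRUE)

# inspect the markdown that pandoc would receive
cat(md)
```

In everyday work the Knit button in RStudio runs both steps in one go, so this sketch is only meant to make the intermediate .md visible.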
What we need to focus on is the .Rmd (or the new generation .qmd addressed in the next sessions).
flowchart LR
A(Plain text) --> Z{.Rmd}
B(Code) --> Z{.Rmd}
C(Images) --> Z{.Rmd}
D(External source file) --> Z{.Rmd}
E(Formatting source file) --> Z{.Rmd}
F(...) --> Z{.Rmd}
When knitting to an HTML output format, we don’t need to install anything else.
But when knitting to a PDF output format, we need a LaTeX distribution!
In this seminar we knit to both HTML and PDF output formats.
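Whether a LaTeX engine is already available can be checked from the console before knitting to PDF. The sketch below only looks for `pdflatex` on the system path, which is an assumption about how LaTeX was installed, so a distribution installed elsewhere may go undetected. The object names `latex_path` and `has_latex` are made up for illustration.

```r
# look for a LaTeX engine on the system path;
# Sys.which() returns an empty string when nothing is found
latex_path <- unname(Sys.which("pdflatex"))
has_latex <- nzchar(latex_path)

if (has_latex) {
  message("LaTeX found at: ", latex_path)
} else {
  # one commonly used option is the TinyTeX distribution,
  # installed from R with tinytex::install_tinytex()
  message("No LaTeX found on the system path. ",
          "One option is tinytex::install_tinytex().")
}
```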
Documents that are coded to retrieve data and/or information from external source material (e.g., datasets or metadata such as from Excel sheets).
This is the building block for creating all sorts of automatized reports.
Working with .Rproj takes care of that.
Otherwise, paths to external source files need to be specified correctly.
Path dependencies inside projects
To benefit from established path dependencies, work with subfolders inside your project repository.
When calling a file stored inside a subfolder, specify the subfolder first, followed by the file itself.
When calling a file from the root repository folder (where the .Rproj file is stored), simply call the file itself.
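The two cases above can be sketched as follows. The subfolder `data` and the file `sample.sav` are the ones used later in this session. The object names are made up for illustration, and the sketch assumes that the working directory is the project root, which is what opening the .Rproj file sets up.

```r
# file stored inside a subfolder: subfolder first, then the file itself
path_in_subfolder <- file.path("data", "sample.sav")
path_in_subfolder

# file stored in the project root: the file name alone is enough
path_in_root <- "sample.sav"

# quick checks before importing anything
getwd()                         # should point to the project root
file.exists(path_in_subfolder)  # TRUE only if the file is really there
```

`file.path()` is used instead of typing the separator by hand. Writing "data/sample.sav" directly gives the same path.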
Let us set it all up.
Defines the parameters of the entire Rmarkdown document.
The pandoc document converter uses these parameters.
For example, the output file format, title, author, or date of the document version.
See lines 1–5.
Simple text. No editing, no hyperlinks, no enhanced fields.
Special formatting is possible. See next slide for a brief guide.
See lines 13–15.
Enhanced field where code is integrated. Various programming languages can be integrated: R, Python, SQL, Julia, and so on.
Useful when the goal is, for example, integration of output figures and tables.
See lines 17–19.
Enhanced field where code is integrated seamlessly with plain text.
Keep inline code simple. Use code chunks to prepare output before integrating code with plain text.
See this guide https://rmarkdown.rstudio.com/authoring_basics.html
*abc* italics abc
**abc** bold abc
`code` code (backticks)
H~2~O subscript H2O
R^2^ superscript R2
[https://www.r-project.org/](https://www.r-project.org/)


# First level.
## Second level.
### Third level.
Watch the indentation!
See this guide https://rpruim.github.io/s341/S19/from-class/MathinRmd.html
$ for inline mathematics and $$ for displayed equations. Without empty space!
$x = y$ becomes \(x = y\)
$\left(\int_{a}^{b} f(x) \; dx\right)$ becomes \(\left(\int_{a}^{b} f(x) \; dx\right)\)
$\alpha A$ becomes \(\alpha A\)
$$\sum_{n=1}^{10} n^2$$ becomes \[\sum_{n=1}^{10} n^2\]
Copy-paste from Stanciu et al. (2024)
title: "Can human values explain one’s interest in cryptocurrencies? An explorative study in Germany"
author:
  - name: "Adrian Stanciu"
    affiliation_number: 1
  - name: "Melanie Partsch"
    affiliation_number: 1
  - name: "Clemens Lechner"
    affiliation_number: 1
affiliations:
  - "GESIS-Leibniz Institute for the Social Sciences, Mannheim"
  - "University of Bremen, Bremen"
shorttitle: "Values and cryptocurrencies"
authors_note: "For correspondence contact Dr. Adrian Stanciu, Data and Research on Society, GESIS-Leibniz Institute for the Social Sciences, PO Box 12215, 68072 Mannheim, Germany. Email: adrian.stanciu[at]gesis.org"
abstract: "Write abstract here"
keywords: "Values, Cryptocurrencies, Germany"
date: "`r format(Sys.time(), '%d. %B, %Y')`"
doctype: doc
header-includes:
  - \usepackage{subfig}
output:
  bookdown::pdf_document2:
    toc: False
    number_sections: False
    template: "style/template.tex"
csl: style/apa.csl
bibliography: reference.bib
Code chunk attributes
It may be simpler to set code chunk attributes for the entire document at the beginning of the document.
In this first code chunk also install all the relevant packages.
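A first code chunk of this kind could look like the sketch below, assuming knitr is installed. The chosen attributes (`echo`, `message`, `warning`, `fig.align`) and their values are only one possible selection, not the settings used in the seminar material.

```r
library(knitr)

# defaults applied to every code chunk that follows in the document;
# a single chunk can still override them in its own chunk header
opts_chunk$set(
  echo = FALSE,       # do not print the code in the output document
  message = FALSE,    # hide messages, e.g. from loading packages
  warning = FALSE,    # hide warnings
  fig.align = "center"
)

# check one of the defaults that is now active
opts_chunk$get("echo")
```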
Inspect the newly created .Rmd document and identify the discussed elements.
Play with the attributes and/or use online search engines to identify new attributes of code chunks and YAML parameters.
Using pacman, we install the packages tidyverse, readxl (for reading Excel sheets), haven (for reading SPSS files), sjlabelled (for dealing with labelled dataframes), knitr (for the kable() function) and kableExtra (for creating tables).
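The package step can be sketched as follows. The check itself uses base R only, so that nothing is installed without you noticing. The pacman call is left as a comment to run once you are ready. knitr is listed because kable() is a function of that package. The object names `pkgs`, `installed` and `not_installed` are made up for illustration.

```r
# packages used in this session
pkgs <- c("tidyverse", "readxl", "haven", "sjlabelled", "knitr", "kableExtra")

# which of them are already available on this machine?
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Not yet installed: ", paste(not_installed, collapse = ", "))
}

# with pacman, installing what is missing and loading everything is one call:
# pacman::p_load(char = pkgs)
```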
From here on, we build automatized reports, websites and books, and shiny apps using real data.
We use the subsample data from Stanciu et al. (2017) and the movies.xlsx metadata.
Download from the R beyond data analysis book. https://adrian-stanciu.quarto.pub/r-beyond-data-analysis/
Once the data is downloaded, make sure it is stored in the project folder. Then import it into the R environment.
# create an object dataframe example `dfex` and assign to it the .sav file `sample.sav` that was introduced previously
dfex<-haven::read_sav("data/sample.sav")
# create an object movies metadata `dfmv` and assign to it the .xlsx file `movies.xlsx`
# note the different paths to these files
# note that we specify which sheet to read too; here only sheet 1 is imported
dfmv<-readxl::read_excel("mat/movies.xlsx",1)
Once available for use in the R environment, we can perform actions on the data.
# check if the source material was imported successfully
# by observing the first lines in the tables
head(dfex)
# A tibble: 6 × 9
ppn gen age res res_other men_warm men_comp wom_warm wom_comp
<dbl> <dbl+lbl> <dbl> <dbl+lbl> <chr> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
1 459 1 [Female] 24 5 [Iasi] -99 3 [Und… 4 [Agr… 3 [Unde… 4 [Agre…
2 592 2 [Male] 21 5 [Iasi] -99 3 [Und… 4 [Agr… 3 [Unde… 3 [Unde…
3 634 2 [Male] 21 NA petrosani 4 [Agr… 5 [Str… 4 [Agre… 4 [Agre…
4 369 1 [Female] 30 8 [Gala… -99 NA NA 4 [Agre… 4 [Agre…
5 121 1 [Female] 21 4 [Timi… -99 4 [Agr… 3 [Und… 3 [Unde… 4 [Agre…
6 127 1 [Female] 20 4 [Timi… -99 4 [Agr… 4 [Agr… 4 [Agre… 2 [Disa…
head(dfmv)
# A tibble: 4 × 6
Movie Actor Like Why Grade Wikilink
<chr> <chr> <chr> <chr> <dbl> <chr>
1 John Wick Keanu Reeves Yes Fight … 10 https:/…
2 Call me by your name Timothee Chalamet Yes Beauti… 10 https:/…
3 Terminator Arnold Schwarzenegger Yes Arnold 9 https:/…
4 4 months 3 weeks and 2 days <NA> Yes Portra… 8 https:/…
The .Rmd is already a step toward automatization in that it retrieves external source material.
On its own this is not too helpful, because those contents are displayed statically, as plain information.
Inline coding can integrate enhanced text with plain text through the knitr engine.
This is powerful for repetitive reports or for quick inspection of data collection progress.
What to look for:
what is repetitive
what can be integrated from external source material
what vector contains the desired information (character strings and numeric vectors behave differently)
This is an example of how automatization can be implemented in the workflow.
My list of movies includes `r nrow(dfmv)` entries.
The titles of those movies are `r dfmv$Movie`.
Is there a movie that I actually don't like on that list? Well, the answer is that I dislike exactly
`r dfmv %>% filter(Like %in% c("No","no","NO")) %>% nrow()`
movies on that list.
[1] "John Wick" "Call me by your name"
[3] "Terminator" "4 months 3 weeks and 2 days"
This is an example of how automatization can be implemented in the workflow. My list of movies includes 4 entries. The titles of those movies are John Wick, Call me by your name, Terminator, 4 months 3 weeks and 2 days. Is there a movie that I actually don’t like on that list? Well, the answer is that I dislike exactly 0 movies on that list.
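In line with the advice to keep inline code simple, the values used in the text above can be prepared in a code chunk first. The sketch below rebuilds a small copy of the movies metadata by hand, with only the two columns it needs, so that it runs without the Excel file. The object names `n_movies`, `movie_titles` and `n_disliked` are made up for illustration.

```r
# toy copy of the movies metadata, typed in by hand for this sketch
# (values as printed for movies.xlsx in this section)
dfmv <- data.frame(
  Movie = c("John Wick", "Call me by your name",
            "Terminator", "4 months 3 weeks and 2 days"),
  Like = c("Yes", "Yes", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# numeric vector of length 1
n_movies <- nrow(dfmv)

# character vector collapsed into a single string
movie_titles <- paste(dfmv$Movie, collapse = ", ")

# counts "No" regardless of how it was typed in the sheet
n_disliked <- sum(tolower(dfmv$Like) == "no")
```

The text then only needs short inline calls such as `r n_movies`, `r movie_titles` and `r n_disliked`. Collapsing the character vector by hand also makes explicit how several titles end up in one sentence.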
Modify the movies.xlsx or create your own metadata (.xlsx sheet) and then write an enhanced text in Rmarkdown.
Import .xlsx
Remember to import the .xlsx file using readxl::read_excel().
Watch out for the right path dependency.
Tables and graphs can be automatically updated with new data.
A series of three graphs follows.
This series reproduces a scenario whereby a dataset is progressively updated during fieldwork.
Each week there are new observations collected, and for each week we’d need to prepare a field report.
Knit .xlsx sheet directly. No modifications made to the original Excel sheet.
| Movie | Actor | Like | Why | Grade | Wikilink |
|---|---|---|---|---|---|
| John Wick | Keanu Reeves | Yes | Fight scenes | 10 | https://en.wikipedia.org/wiki/John_Wick_(film) |
| Call me by your name | Timothee Chalamet | Yes | Beautiful love story | 10 | https://en.wikipedia.org/wiki/Call_Me_by_Your_Name_(film) |
| Terminator | Arnold Schwarzenegger | Yes | Arnold | 9 | https://en.wikipedia.org/wiki/The_Terminator |
| 4 months 3 weeks and 2 days | NA | Yes | Portrayal of life in communist Romania | 8 | https://en.wikipedia.org/wiki/4_Months%2C_3_Weeks_and_2_Days |
Adjustments and modifications made before the final table is reported.
# does some data manipulation to retrieve the required information
tmptbl<-dfmv %>%
  filter(Actor %in% c("Keanu Reeves", "Alec Baldwin"))
# creates the summary table that we'd
# want to include in the final output document
extbl<-tibble(
  like=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Like,
  name=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Actor,
  movie=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Movie,
  wiki=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Wikilink
)
extbl %>% knitr::kable(caption="Movies graded 8 or more from liked and least liked actors", format="pipe")
| like | name | movie | wiki |
|---|---|---|---|
| Yes | Keanu Reeves | John Wick | https://en.wikipedia.org/wiki/John_Wick_(film) |
.xlsx sheet
Open movies.xlsx in Microsoft Excel and add one or more movies by actor Alec Baldwin while pretending you dislike the actor.
Alternatively, modify the table code and replace the two actors with actors you like and dislike, then update the Excel sheet accordingly, making sure you maintain the sheet structure.
Re-knit the tables.
Knitting with parameters simplifies the work routine even more.
It uses a friendly user interface, the shiny interface.
Define the parameters in the YAML header.
Parameters
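Besides the Knit menu in RStudio, a parameterized document can also be rendered from the console. The sketch below assumes that rmarkdown is installed, that a file called `example.Rmd` exists in the project root, and that its YAML header defines a parameter named `actor`. The file name and the output file name are hypothetical. Nothing happens if the file is not there.

```r
rmd_file <- "example.Rmd"  # hypothetical file name

if (requireNamespace("rmarkdown", quietly = TRUE) && file.exists(rmd_file)) {
  if (interactive()) {
    # opens the shiny interface and asks for the parameter values
    rmarkdown::render(rmd_file, params = "ask")
  } else {
    # sets the values in code; parameters not listed keep their YAML defaults
    rmarkdown::render(
      rmd_file,
      params = list(actor = "Alec Baldwin"),
      output_file = "example-alec-baldwin.html"
    )
  }
}
```

Setting the values in code is handy when the same report has to be produced for several versions in a row, for instance once per week of fieldwork.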
Characteristics of the document that are repetitive both throughout the document and along the iteration of various versions of the document.
Name of actors in the Excel sheet movies.xlsx.
Which of the stereotype evaluations from the Stanciu et al. (2017) subsample we’d want to use for graph creation.
Also, which dataset we use.
title: "example"
output: html_document
date: "2025-03-31"
params:
  actor:
    label: "Actor"
    value: "Keanu Reeves"
    input: select
    choices: ["Keanu Reeves", "Alec Baldwin","Arnold Schwarzenegger"]
    multiple: yes
  stereotype:
    label: "Stereotype evaluation"
    value: wom_warm
    input: select
    choices: [wom_warm,wom_comp,men_warm,men_comp]
    multiple: no
  sampledf:
    label: "Dataset version"
    value: sample.sav
    input: select
    choices: [sample.sav,tmpdf1.sav,tmpdf2.sav]
    multiple: no
We can either use the parameters directly or assign them to an object.
Calling parameters
Remember to always call parameters as such: params${name of the defined parameter}, for example params$actor.
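Both ways of calling parameters can be sketched as follows. Inside a knitted document the `params` object is created automatically from the YAML header. It is typed in by hand here only so that the sketch runs in the console. The object names `st`, `actor` and `data_path` are choices made for this illustration, and the subfolder `data` is assumed from the import code used earlier.

```r
# stand-in for the params object that knitting creates from the YAML header
params <- list(
  actor = c("Keanu Reeves", "Alec Baldwin"),
  stereotype = "wom_warm",
  sampledf = "sample.sav"
)

# option 1: use a parameter directly
paste0("Evaluation based on ", params$stereotype)

# option 2: assign parameters to objects and work with those
st <- params$stereotype
actor <- params$actor

# a parameter can also be used to build the path to a dataset version
data_path <- file.path("data", params$sampledf)
data_path
```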
# 1 - imports dataset into object tempdf
# (`st` is assumed to hold params$stereotype, assigned beforehand)
tempdf<-haven::read_sav("data/tmpdf1.sav") %>%
  sjlabelled::remove_all_labels() %>%
  pivot_longer(contains("warm") | contains("comp")) %>%
  filter(name %in% st)
# 2 - applies the ggplot to the dataset
ggplot(tempdf, aes(x=factor(gen), y=value)) +
  labs(title=paste0("Evaluation based on ",st),
       x="Gender",
       y="Stereotype") +
  geom_boxplot() +
  theme_light()
# does some data manipulation to retrieve the required information
# (`actor` is assumed to hold params$actor, assigned beforehand)
tmptbl<-dfmv %>%
  filter(Actor %in% actor)
# creates the summary table that we'd
# want to include in the final output document
extbl<-tibble(
  like=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Like,
  name=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Actor,
  movie=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Movie,
  wiki=tmptbl[ tmptbl$Grade >= 8 & tmptbl$Like %in% c("Yes","No"), ]$Wikilink
)
One other way to work with parameterized reports is to code the document such that it creates tables (or anything else for that matter) using a specific dataset.
abc %>%
  sjlabelled::remove_all_labels() %>%
  pivot_longer(contains("warm") | contains("comp")) %>%
  group_by(name) %>%
  summarise(mean=mean(value, na.rm = TRUE), # na.rm = TRUE removes missing values so that R returns a value instead of NA
            sd=sd(value, na.rm = TRUE),
            min=min(value, na.rm = TRUE),
            max=max(value, na.rm = TRUE))
Download the Examples .Rmd from the R beyond data analysis book and think of new parameters to add to the document.
Rmarkdown file to output file formats like PDF
Create project
Create an empty .Rmd
Figure 3: Elements of an .Rmd document
Image from local repository
Image from the Internet