This article demonstrates how RStudio, Zotero and related tools can produce scholarly citations in a data science workflow. Data scientists typically code and write in tools like RStudio or Jupyter, where citations, references and bibliographies are distant afterthoughts. Word processors like Microsoft Word or Google Docs support scholarly citations and references but are not typically used by data scientists. This article was written in Rmarkdown, combining R code with APA 7th edition citations. It begins by describing the technology components used to produce this document. I then outline the workflow and demonstrate it with a mock literature review that cites actual works.
For paper writing, my OS environment is a Mac mini running macOS Big Sur 11.5. These tools should also be available for Windows. I use the following tools:
The technology tools listed above are free to download. The stated versions or their successors should work. The rest of this document assumes those tools are installed and working.
Let’s assume you have correctly installed the above tools, and you have a hypothetical research project. I’ll assume you set up a working directory WORKDIR where your Rmarkdown files and .bib files will be saved.
You will need to know the citation style that you plan to use in your project. I am using APA 7th edition citation style for this document. Each citation style is encoded in CSL (Citation Style Language), an XML-based format. Commonly used citation styles are available in the Zotero Style Repository. Search for your style on the webpage https://www.zotero.org/styles shown below and download the matching .csl file to your local WORKDIR. In my case, the APA 7th edition style is stored in a file named apa.csl.
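The download can also be scripted from R. The sketch below is only an illustration: the helper names csl_url and download_csl are hypothetical, and it assumes the repository serves each style at https://www.zotero.org/styles/ followed by the style id (apa for APA 7th edition).

```r
# Assumption: the Zotero Style Repository serves each style at
# https://www.zotero.org/styles/<style id>
csl_url <- function(style_id) {
  paste0("https://www.zotero.org/styles/", style_id)
}

# Hypothetical helper: save the style as <style id>.csl inside workdir
download_csl <- function(style_id, workdir = ".") {
  dest <- file.path(workdir, paste0(style_id, ".csl"))
  download.file(csl_url(style_id), destfile = dest, mode = "wb")
  invisible(dest)
}

# Example call (needs network access, so it is left commented out):
# download_csl("apa", workdir = "WORKDIR")
```

Downloading manually from the webpage gives the same file; the script is only convenient when a project needs to be set up repeatedly.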
CSL Repository
Find the works that you wish to cite in your project and store their metadata in Zotero. Each type of citable item has an item type in Zotero. This sample document uses Book, Journal Article, Conference Paper, Report, Encyclopedia Article, Newspaper Article and Web Page. There are many other types.
How do you store the information into Zotero? The three most common ways that I’ve tried are:
Manually typing the details in Zotero desktop for each work. This is the most laborious. It is flexible but can be error prone.
Using Zotero Connector to scrape a webpage for bibliographic metadata. This is powerful and surprisingly fast. It can produce items with errors.
Importing an external BibTeX format file.
You may need to add more works to the Zotero collection after setting up your project. That’s fine. You’ll simply repeat the next step.
Zotero Main Screen with metadata
Now you need to export the collected items of research interest to a BibTeX format file and place it in WORKDIR. For example, I exported RStudioZotero.bib so that it can be used when writing my document.
This step may need to be repeated if the Zotero collection is modified, for example when amending metadata or adding extra works discovered during the research process.
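After each export it can help to check which citation keys actually ended up in the .bib file. The following is a minimal sketch, not part of the original workflow: bib_keys is a hypothetical helper, and the two entries are made-up stand-ins for a real Zotero export.

```r
# Hypothetical two-entry .bib file standing in for the Zotero export
bib_path <- file.path(tempdir(), "example.bib")
writeLines(c(
  "@book{doe2020,",
  "  title = {An Example Book},",
  "  author = {Doe, Jane},",
  "  year = {2020}",
  "}",
  "",
  "@article{roe2021,",
  "  title = {An Example Article},",
  "  author = {Roe, Richard},",
  "  year = {2021}",
  "}"
), bib_path)

# Extract the citation key from every line that opens an entry, e.g. "@book{key,"
# Note: this simple pattern would also pick up @comment or @string blocks.
bib_keys <- function(path) {
  lines <- readLines(path, warn = FALSE)
  entry_lines <- grep("^\\s*@\\w+\\s*\\{", lines, value = TRUE)
  sub("^\\s*@\\w+\\s*\\{\\s*([^,[:space:]]+).*$", "\\1", entry_lines)
}

print(bib_keys(bib_path))
```

Pointing bib_keys at the exported file in WORKDIR lists the keys that can later be typed after the @ sign.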
The next essential step is setting up the preamble (i.e. the YAML) section of your RMarkdown document. I find the following three entries useful. Note that the RMarkdown file should be placed in WORKDIR for the path references to work.
bibliography: RStudioZotero.bib
link-citations: true
csl: apa.csl
You will also need to mark the place where the References or Bibliography will be generated. The following Rmarkdown snippet goes at the end of your Rmd document, or wherever you want the list to appear, to tell RStudio where to generate it.
::: {#refs}
:::
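To see how the YAML entries and the refs placeholder fit together, the sketch below writes a minimal Rmd skeleton from R. The title, the output format, the file name paper.Rmd and the References heading are assumptions added for illustration; only the three YAML entries and the refs snippet come from this article.

```r
# Lines of a minimal Rmd file: YAML header, body, and the refs placeholder
rmd_lines <- c(
  "---",
  "title: \"Draft paper\"",            # assumption: placeholder title
  "output: html_document",              # assumption: any supported format works
  "bibliography: RStudioZotero.bib",
  "link-citations: true",
  "csl: apa.csl",
  "---",
  "",
  "Body text with citations goes here.",
  "",
  "# References",                       # assumption: heading text is your choice
  "",
  "::: {#refs}",
  ":::"
)

# Written to a temporary folder here; in a real project this would be WORKDIR
rmd_path <- file.path(tempdir(), "paper.Rmd")
writeLines(rmd_lines, rmd_path)
cat(readLines(rmd_path), sep = "\n")
```

The header must be enclosed by the two --- lines, and the .bib and .csl paths are resolved relative to the Rmd file, which is why everything lives in one directory.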
The next and fun step is to cite your works. The visual mode in RStudio makes this easy and you should enable it. Allaire (2020) explains how to enable visual mode. As you type the @ in the text window, a dialog box pops up on the fly. It shows the works available to cite from your .bib file. In this case, the works in RStudioZotero.bib are listed by Citation Key. Press Tab or select the work you want to cite. Here is a screenshot of that effect. Note that the cursor is at the upper right corner of the pop-up window.
Popup as you type a citation
RStudio officially announced the citation features and visual editing capabilities in a blog post in September 2020 (Allaire, 2020). Calster & Vanderhaeghe (2021) provide useful advanced examples of citations and document-level YAML arguments.
When you render your document, RStudio and its underlying packages build the References, parse the citation markdown and replace it with the correct labels. This process works for both report and presentation slide output formats. Dunnington (2020) has additional helpful information on the workflow.
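Rendering can also be triggered from the R console instead of the Knit button. The sketch below assumes the rmarkdown package and Pandoc are installed; render_report_and_slides is a hypothetical helper, and html_document and ioslides_presentation are just two example formats chosen to cover a report and a slide deck.

```r
# Hypothetical helper: render one Rmd file as an HTML report and as slides
render_report_and_slides <- function(rmd_path) {
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("The rmarkdown package is required.")
  }
  # Give the slides their own file name so they do not overwrite the report
  slides_file <- paste0(
    tools::file_path_sans_ext(basename(rmd_path)), "_slides.html"
  )
  c(
    report = rmarkdown::render(rmd_path, output_format = "html_document",
                               quiet = TRUE),
    slides = rmarkdown::render(rmd_path, output_format = "ioslides_presentation",
                               output_file = slides_file, quiet = TRUE)
  )
}

# Example call, assuming paper.Rmd, the .bib file and the .csl file sit together:
# render_report_and_slides("paper.Rmd")
```

Both calls read the same bibliography and csl entries from the YAML header, so the citations should be formatted consistently across the two outputs.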
It’s helpful to see a mock literature review to test out the most useful features and examples of citations. This section cites actual papers, books and online resources. I’ve grouped them by type of work but tried to illustrate the common ways in which citations arise.
Time series analysis is useful in data science and ARIMA models are one important class of time series models. “A distinctive feature of the data which suggests the appropriateness of an ARIMA model is the slowly decaying positive sample autocorrelation function…” (Brockwell & Davis, 1996, p. 179).
If you are visiting New York City, a travel guide can provide informative tips. Fodor’s guidebook shows the city in its pre-COVID-19 glory (Fodor’s, 2004). Did you know that a Diego Rivera mural with communist themes depicting Joseph Stalin was removed from the GE Building in Rockefeller Center (Fodor’s, 2004, p. 128)?
Here are some peer-reviewed journal articles and conference proceedings. While their citations look identical to those of less formal publications, their metadata entry in Zotero differs.
Crime prediction in urban settings is one area where machine learning methods have been fruitfully applied. Mohler, Carter, et al. (2018) suggest a modulated Hawkes process to simulate social harm at dynamically changing hotspots. On the other hand, Richardson et al. (2019) argue that predictive policing leads to socially unfair outcomes because the machine learning models are fed biased or inaccurate data.
Consequently, attempts to rectify the biased outcomes have been made. One quantitative approach to counteract unfairness is using a penalty function to balance accuracy and fairness (Mohler, Rajeev, et al., 2018). The jury is still out on the success of these types of measures.
Two related papers by the same author (Mohler, 2014a) and (Mohler, 2014b) demonstrate what happens when the same author and year combination appears multiple times. RStudio and the underlying R packages automatically convert the citations into author-year form with a, b, c suffixes.
Various studies in other fields also use statistical methods (Achor et al., 2009; Mohler, Rajeev, et al., 2018; Mohler, Carter, et al., 2018). This illustrates the inclusion of multiple works in one citation.
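The citation forms shown so far map onto a small set of Pandoc markdown patterns. The sketch below builds those patterns as strings in R so they can be inspected. The helper names and the keys doe2020 and roe2021 are hypothetical; real keys come from your .bib file.

```r
# Narrative citation, rendered like "Author (Year)"
cite_narrative <- function(key) {
  paste0("@", key)
}

# Parenthetical citation, rendered like "(Author, Year)".
# Several keys are separated by semicolons; an optional locator such as
# "p. 179" is attached to the last key.
cite_parenthetical <- function(keys, locator = NULL) {
  body <- paste0("@", keys, collapse = "; ")
  if (!is.null(locator)) {
    body <- paste0(body, ", ", locator)
  }
  paste0("[", body, "]")
}

cite_narrative("doe2020")                     # "@doe2020"
cite_parenthetical("doe2020")                 # "[@doe2020]"
cite_parenthetical("doe2020", "p. 179")       # "[@doe2020, p. 179]"
cite_parenthetical(c("doe2020", "roe2021"))   # "[@doe2020; @roe2021]"
```

In practice you type these patterns directly into the Rmd text. Writing the locator as p. 179 with a plain number lets the citation style format the page reference.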
Working papers are not peer-reviewed publications but may be widely available work with important content. Panhans & Singleton (2016) discuss econometric identification and the attempts to use quasi-experimental methods to address such statistical challenges. They quote an interesting anecdote by Leamer. We illustrate the citation of a block quote below.
The applied econometrician is like a farmer who notices that the yield is somewhat higher under trees where birds roost, and he uses this as evidence that bird droppings increase yields. However, when he presents this finding at the annual meeting of the American Ecological Association, another farmer in the audience objects that he used the same data but came up with the conclusion that moderate amounts of shade increase yields. A bright chap in the back of the room then observes that these two hypotheses are indistinguishable given the available data. He mentions the phrase “identification problem,” which, though no one knows quite what he means, is said with such authority that it is totally convincing. (Leamer, 1983, p. 31)
Wikipedia articles are no problem to cite. Zotero Connector does a good job of scraping the required metadata. Kernel density estimation is one technique used for crime mapping (Mohler, 2014b) but if you want to learn the basics of KDE, “Kernel Density Estimation” (2021) may be a good start.
An interesting online article in The New York Times Magazine reported on the role of machine learning in deciding if coffee is healthy for us (Tingley, 2021).
Meanwhile, Bergman & Fassihi (2021) describe how a nuclear scientist was assassinated by secret service agents using an AI-assisted sniper machine gun. This illustrates citing a newspaper article.
I wrote this article to learn the RStudio-Zotero workflow for a CUNY MS Data Science Capstone project but the same ideas are applicable elsewhere. The references and a sample appendix follow below. Happy scholarly writing.
Appendices can go after the References. By default, references are generated at the end of the document, but their position can be controlled by placing the aforementioned refs snippet where the references section should appear.
Here I demonstrate that R code can be executed in the same document where citations are used.
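library(dplyr)    # provides the %>% pipe
library(ggplot2)  # plotting
set.seed(42)      # make the random draw reproducible
# 100 standard normal draws indexed 1 to 100
rdata <- data.frame(index = 1:100, val = rnorm(100))
rdata %>%
  ggplot() +
  geom_line(aes(x = index, y = val), colour = "red") +
  labs(title = "Random Noise or Real Data?")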