This article demonstrates how RStudio, Zotero and related tools can produce scholarly citations in a data science workflow. Data scientists typically code and write in tools like RStudio or Jupyter, where citations, references and bibliographies are distant afterthoughts. Word processors like Microsoft Word or Google Docs support scholarly citations and references but are not typically used by data scientists. This document was written in R Markdown and shows R code alongside citations in APA 7th edition style. The article begins by describing the technology components used to produce this document. I then outline the workflow and demonstrate it with a mock literature review that cites actual works.

Technology

List of Software

For paper writing, my OS environment is a Mac-Mini running macOS Big Sur 11.5. These tools should be widely available for Windows. I use the following tools:

R version 4.0.4 (Lost Library Book) runs the packages and data science algorithms.
RStudio 1.4.1106 Desktop for Mac is the IDE to author documents and render output (pdf, html).
Zotero Desktop for Mac 5.0.96.3 stores your references and notes.
Zotero Connector Extension for Firefox (5.0.91) links a web browser to Zotero desktop to gather metadata.
Better BibTeX plugin for Zotero (5.4.29) exports the bibliography to RStudio and manages citation keys.

The technology tools listed above are free to download. The stated versions or their successors should work. The rest of this document assumes those tools are installed and working.
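A quick check from the R console can confirm that the R side of the toolchain is in place. The sketch below is an illustration rather than part of the original setup: the helper name check_r_setup is hypothetical, and the package list (rmarkdown and knitr) is an assumption about what a typical project needs. Zotero and its plugins have to be checked in their own applications.

```r
# Report the R version and whether the rendering packages are installed.
# Assumption: rmarkdown and knitr are the packages needed to knit;
# your project may require others.
check_r_setup <- function(pkgs = c("rmarkdown", "knitr")) {
  installed <- vapply(
    pkgs,
    function(p) requireNamespace(p, quietly = TRUE),
    logical(1)
  )
  list(
    r_version = as.character(getRversion()),
    r_ok      = getRversion() >= "4.0.4",  # version used in this article
    packages  = installed
  )
}

setup <- check_r_setup()
print(setup)
```

Any package reported as FALSE can be installed with install.packages() before continuing.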

Workflow

Let’s assume you have correctly installed the above tools, and you have a hypothetical research project. I’ll assume you set up a working directory WORKDIR where your Rmarkdown files and .bib files will be saved.

Set Up Your Citation Style

You will need to know the citation style that you plan to use in your project. I am using APA 7th edition for this document. Each citation style is encoded in an XML-based format called Citation Style Language (CSL). All commonly used citation styles are available in the Zotero Style Repository. Search for your style at https://www.zotero.org/styles, shown below, and download the matching .csl file to your local WORKDIR. In my case, the APA 7th edition style is stored in a file named apa.csl.

CSL Repository
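The download can also be scripted from R. This is a minimal sketch under two assumptions that are not stated in the article: that the repository serves each style at https://www.zotero.org/styles/ followed by the style id, and that apa is the id for APA 7th edition. The helper names csl_url, looks_like_csl and fetch_csl are hypothetical.

```r
# Build the download URL for a style in the Zotero Style Repository.
# Assumption: each style is served at /styles/<style id>.
csl_url <- function(style_id) {
  paste0("https://www.zotero.org/styles/", style_id)
}

# Cheap sanity check: a CSL file is XML with a <style> root element
# that declares the CSL namespace.
looks_like_csl <- function(path) {
  if (!file.exists(path)) return(FALSE)
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  grepl("<style[[:space:]>]", txt) &&
    grepl("purl.org/net/xbiblio/csl", txt, fixed = TRUE)
}

# Download a style into the working directory and verify the result.
fetch_csl <- function(style_id, workdir = ".") {
  dest <- file.path(workdir, paste0(style_id, ".csl"))
  download.file(csl_url(style_id), destfile = dest, quiet = TRUE)
  if (!looks_like_csl(dest)) {
    stop("Downloaded file does not look like CSL: ", dest)
  }
  dest
}

# Usage (needs network access; WORKDIR is your project directory):
# fetch_csl("apa", workdir = "WORKDIR")
```

The check after the download guards against saving an HTML error page under a .csl name, which would otherwise only surface when the document is rendered.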

Gather Your Research

Find the works that you wish to cite in your project and store their metadata in Zotero. Each type of citable item has an item type in Zotero. This sample document uses Book, Journal Article, Conference Paper, Report, Encyclopedia Article, Newspaper Article and Web Page. There are many other types.

How do you store the information into Zotero? The three most common ways that I’ve tried are:

Manually typing the details in Zotero desktop for each work. This is the most laborious. It is flexible but can be error-prone.
Using Zotero Connector to scrape a webpage for bibliographic metadata. This is powerful and surprisingly fast. It can produce items with errors.
Importing an external BibTeX file.

You may need to add more works to the Zotero collection after setting up your project. That’s fine. You’ll simply repeat the next step.

Zotero Main Screen with metadata

Export a Bib File

Now you need to export the collected items of research interest to a BibTeX file and place it in WORKDIR. For example, I exported RStudioZotero.bib so that it can be used when writing my document.

This step may need to be repeated if the Zotero collection is modified, for example after amending metadata or adding works discovered during the research process.
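After each export it can help to confirm which citation keys actually landed in the .bib file. The following is a minimal base R sketch, not a full BibTeX parser: it assumes every entry starts on its own line in the form @type{key, and the helper name bib_keys and the keys in the demo file are made up for illustration.

```r
# Extract citation keys and entry types from a BibTeX file.
# Assumption: each entry opens with a line like "@article{key,".
bib_keys <- function(path) {
  lines <- readLines(path, warn = FALSE)
  pattern <- "^\\s*@(\\w+)\\s*\\{\\s*([^,\\s]+)\\s*,"
  hits <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  hits <- hits[lengths(hits) == 3]  # keep only lines that matched
  data.frame(
    key  = vapply(hits, `[`, character(1), 3),
    type = tolower(vapply(hits, `[`, character(1), 2)),
    stringsAsFactors = FALSE
  )
}

# Demo on a throwaway file; these keys are hypothetical.
demo_bib <- tempfile(fileext = ".bib")
writeLines(c(
  "@book{brockwell1996,",
  "  title = {Introduction to Time Series and Forecasting},",
  "  year = {1996}",
  "}",
  "@article{mohler2014b,",
  "  year = {2014}",
  "}"
), demo_bib)

keys <- bib_keys(demo_bib)
print(keys)
```

Running bib_keys("RStudioZotero.bib") in WORKDIR lists the keys that the citation pop-up in RStudio should offer.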

Set Up Your Document

The next essential step is setting up the preamble (i.e. the YAML) section of your RMarkdown document. I find the following three entries useful. Note that the RMarkdown file should be placed in WORKDIR for the path references to work.

bibliography: RStudioZotero.bib
link-citations: true
csl: apa.csl

You will also need to set up a place where the References or Bibliography list will be generated. The following block, a Pandoc fenced div with the id refs, goes at the end of your Rmd document and marks where the list is generated.

::: {#refs}
:::
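To see how the YAML entries and the refs block fit together in one file, the sketch below writes a skeleton Rmd from R. The function name write_skeleton_rmd, the title and the output format are placeholders I have assumed; only the bibliography, link-citations and csl entries and the refs block come from this article.

```r
# Write a minimal R Markdown file wired up for citations.
# File names default to the ones used in this article; the title and
# output format are placeholders.
write_skeleton_rmd <- function(path,
                               bib = "RStudioZotero.bib",
                               csl = "apa.csl") {
  lines <- c(
    "---",
    'title: "Untitled"',
    "output: html_document",
    paste0("bibliography: ", bib),
    "link-citations: true",
    paste0("csl: ", csl),
    "---",
    "",
    "Body text with citations goes here.",
    "",
    "# References",
    "",
    "::: {#refs}",
    ":::"
  )
  writeLines(lines, path)
  invisible(path)
}

# Demo: write to a temporary directory and show the result.
rmd_path <- write_skeleton_rmd(file.path(tempdir(), "skeleton.Rmd"))
cat(readLines(rmd_path), sep = "\n")
```

Because the bibliography and csl paths are relative, the generated file only renders once it sits in the same directory as the .bib and .csl files.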

Cite Your Works

The next step, and the fun one, is to cite your works. The visual mode in RStudio makes this easy and you should enable it. Allaire (2020) tells you how to enable visual mode. As you type the @ in the text window, a dialog box pops up on the fly. It shows the available works to cite from your .bib file. In this case, the works in RStudioZotero.bib are listed by citation key. Press Tab or select the work you want to cite. Here is a screenshot of that effect. Note that the cursor is at the upper right corner of the pop-up window.

Popup as you type a citation

RStudio officially announced the citation features and visual editing capabilities in a blog post in November 2020 (Allaire, 2020). Calster & Vanderhaeghe (2021) have provided useful advanced examples of citations and YAML document-level arguments.
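Behind the pop-up, a citation is just Pandoc markup in the Rmd source. The sketch below composes the common forms as strings so they can be compared side by side. The helper cite_md and all citation keys are hypothetical; your keys are whatever Better BibTeX generated. Note that a page locator is written once, as p. 179, inside the brackets.

```r
# Compose Pandoc citation markup for one or more citation keys.
#   in_text = FALSE -> "[@key]"          renders like (Author, Year)
#   in_text = TRUE  -> "@key"            renders like Author (Year)
#   locator         -> "[@key, p. 179]"  renders like (Author, Year, p. 179)
cite_md <- function(keys, locator = NULL, in_text = FALSE) {
  stopifnot(is.character(keys), length(keys) >= 1)
  if (in_text) {
    stopifnot(length(keys) == 1)
    if (is.null(locator)) return(paste0("@", keys))
    return(paste0("@", keys, " [", locator, "]"))
  }
  refs <- paste0("@", keys)
  if (!is.null(locator)) {
    # The locator is attached to the last key in the group.
    refs[length(refs)] <- paste0(refs[length(refs)], ", ", locator)
  }
  paste0("[", paste(refs, collapse = "; "), "]")
}

# Hypothetical keys, for illustration only.
examples <- c(
  parenthetical = cite_md("brockwell1996"),
  with_page     = cite_md("brockwell1996", locator = "p. 179"),
  narrative     = cite_md("allaire2020", in_text = TRUE),
  multiple      = cite_md(c("achor2009", "mohler2018a"))
)
print(examples)
```

In practice you type these forms directly into the document; the helper only makes the patterns explicit.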

Knit Your Document

When you render your document, RStudio and its underlying packages build the References, parse the citation markdown and replace it with the correct labels. This process works for both report and presentation slide output formats. Dunnington (2020) has additional helpful information on the workflow.
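The Knit button has a console equivalent in rmarkdown::render(). The sketch below wraps it to produce several formats in one call. It assumes the rmarkdown package and Pandoc are installed (pdf_document also needs a LaTeX installation); the wrapper name render_formats and the file name in the usage line are hypothetical.

```r
# Render one R Markdown file to several output formats.
# Assumption: rmarkdown and Pandoc are installed; pdf_document also
# needs LaTeX. A slide format such as "ioslides_presentation" can be
# passed in the same way.
render_formats <- function(input,
                           formats = c("html_document", "pdf_document")) {
  if (!file.exists(input)) stop("Input file not found: ", input)
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("Package 'rmarkdown' is not installed.")
  }
  # render() returns the path of each output file it creates.
  vapply(
    formats,
    function(fmt) rmarkdown::render(input, output_format = fmt, quiet = TRUE),
    character(1)
  )
}

# Usage, with a hypothetical file name:
# render_formats("MyPaper.Rmd")
```

Checking for the input file first gives a clearer error than the one raised deep inside the rendering pipeline.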

Literature Review

It’s helpful to see a mock literature review to test out the most useful features and examples of citations. This section cites actual papers, books and online resources. I’ve grouped them by type of work but tried to illustrate the common ways in which citations arise.

Books

Time series analysis is useful in data science and ARIMA models are one important class of time series models. “A distinctive feature of the data which suggests the appropriateness of an ARIMA model is the slowly decaying positive sample autocorrelation function…” (Brockwell & Davis, 1996, p. 179).

If you are visiting New York City, a travel guide can provide informative tips. Fodor’s guidebook shows the city in its pre-Covid-19 glory (Fodor’s, 2004). Did you know that a Diego Rivera mural with communist themes depicting Vladimir Lenin was removed from the GE Building in Rockefeller Center (Fodor’s, 2004, p. 128)?

Journal Articles

Here are some peer-reviewed journal articles and conference proceedings. While their citations look identical to those of less formal publications, their metadata entry in Zotero differs.

Crime prediction in urban settings is one area where machine learning methods have been fruitfully applied. Mohler, Carter, et al. (2018) suggest a modulated Hawkes process to simulate social harm at dynamically changing hotspots. On the other hand, Richardson et al. (2019) argue that predictive policing leads to socially unfair outcomes because the machine learning models are fed biased or inaccurate data.

Consequently, attempts to rectify the biased outcomes have been made. One quantitative approach to counteract unfairness is using a penalty function to balance accuracy and fairness (Mohler, Rajeev, et al., 2018). The jury is still out on the success of these types of measures.

Two related papers by the same author, (Mohler, 2014a) and (Mohler, 2014b), demonstrate what happens when the same author-year combination appears more than once. RStudio and the underlying R packages automatically convert the citations into author-year form with a, b, c suffixes.
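The effect can be mimicked in a few lines of R. This is an illustration of the visible result only, not the code that the citation processor runs. It assumes that works sharing an author and year are ordered by title before letters are assigned, which matches the reference list in this article; the function name add_year_suffix and the shortened titles are mine.

```r
# Illustration only: assign a, b, c suffixes to works that share the
# same author and year. Assumption: ties are ordered by title.
add_year_suffix <- function(works) {
  works <- works[order(works$author, works$year, works$title), ]
  group <- paste(works$author, works$year)
  n_in_group <- ave(seq_along(group), group, FUN = length)
  position   <- ave(seq_along(group), group, FUN = seq_along)
  # Only groups with more than one work receive a letter.
  suffix <- ifelse(n_in_group > 1, letters[position], "")
  works$label <- paste0(works$author, ", ", works$year, suffix)
  rownames(works) <- NULL
  works
}

works <- data.frame(
  author = c("Mohler", "Leamer", "Mohler"),
  year   = c(2014, 1983, 2014),
  title  = c("Marked point process hotspot maps",
             "Let's Take the Con Out of Econometrics",
             "Learning convolution filters"),
  stringsAsFactors = FALSE
)

labelled <- add_year_suffix(works)
print(labelled$label)
```

The single Leamer entry keeps its plain year, while the two Mohler entries from 2014 are separated by suffix.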

Various studies in other fields also use statistical methods (Achor et al., 2009; Mohler, Rajeev, et al., 2018; Mohler, Carter, et al., 2018). This illustrates the inclusion of multiple works in one citation.

Working Papers

Working papers are not peer-reviewed publications but may be widely available work with important content. Panhans & Singleton (2016) discuss econometric identification and the attempts to use quasi-experimental methods to address such statistical challenges. The paper quotes an interesting anecdote by Leamer. We illustrate the citation of a block quote below.

The applied econometrician is like a farmer who notices that the yield is somewhat higher under trees where birds roost, and he uses this as evidence that bird droppings increase yields. However, when he presents this finding at the annual meeting of the American Ecological Association, another farmer in the audience objects that he used the same data but came up with the conclusion that moderate amounts of shade increase yields. A bright chap in the back of the room then observes that these two hypotheses are indistinguishable given the available data. He mentions the phrase “identification problem,” which, though no one knows quite what he means, is said with such authority that it is totally convincing. (Leamer, 1983, p. 31)

Webpages and Encyclopedia Articles

Wikipedia articles are no problem to cite. Zotero Connector does a good job of scraping the required metadata. Kernel density estimation is one technique used for crime mapping (Mohler, 2014b) but if you want to learn the basics of KDE, “Kernel Density Estimation” (2021) may be a good start.

An interesting online article in The New York Times Magazine reported on the role of machine learning in deciding if coffee is healthy for us (Tingley, 2021).

Meanwhile, Bergman & Fassihi (2021) describe how a nuclear scientist was assassinated by secret service agents using an AI-assisted sniper machine gun. This illustrates citing a newspaper article.

Conclusion

I wrote this article to learn the RStudio-Zotero workflow for a CUNY MS Data Science Capstone project but the same ideas are applicable elsewhere. The references and a sample appendix follow below. Happy scholarly writing.

References

Achor, E. E., Imoko, B. I., & Uloko, E. S. (2009). Effect of ethnomathematics teaching approach on senior secondary students’ achievement and retention in locus. Educational Research and Review, 4(8), 385–390.

Allaire, J. J. (2020). RStudio 1.4 Preview: Citations. https://blog.rstudio.com/2020/11/09/rstudio-1-4-preview-citations/.

Bergman, R., & Fassihi, F. (2021). The Scientist and the A.I.-Assisted, Remote-Control Killing Machine. The New York Times.

Brockwell, P. J., & Davis, R. A. (1996). Introduction to Time Series and Forecasting. Springer-Verlag.

Calster, H. V., & Vanderhaeghe, F. (2021). Citations in R Markdown. In INBO Tutorials. https://inbo.github.io/tutorials/tutorials/r_citations_markdown/.

Dunnington, D. (2020). Getting started with Zotero, Better BibTeX, and RMarkdown. In Fish & Whistle. https://fishandwhistle.net/post/2020/getting-started-zotero-better-bibtex-rmarkdown/.

Fodor’s. (2004). Fodor’s See It New York City, 1st Edition (First). Fodor’s Travel Publications.

Kernel density estimation. (2021). Wikipedia.

Leamer, E. E. (1983). Let’s Take the Con Out of Econometrics. The American Economic Review, 73(1), 31–43.

Mohler, G. (2014a). Learning convolution filters for inverse covariance estimation of neural network connectivity. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (Vol. 27). Curran Associates, Inc.

Mohler, G. (2014b). Marked point process hotspot maps for homicide and gun crime prediction in Chicago. International Journal of Forecasting, 30(3), 491–497. https://doi.org/10.1016/j.ijforecast.2014.01.004

Mohler, G., Carter, J., & Raje, R. (2018). Improving social harm indices with a modulated Hawkes process. International Journal of Forecasting, 34(3), 431–439. https://doi.org/10.1016/j.ijforecast.2018.01.006

Mohler, G., Rajeev, R., Carter, J., Valasik, M., & Brantingham, J. (2018). A Penalized Likelihood Method for Balancing Accuracy and Fairness in Predictive Policing. 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2454–2459.

Panhans, M. T., & Singleton, J. D. (2016). The Empirical Economist’s Toolkit: From Models to Methods (Working Paper No. 2015-03). Center for the History of Political Economy (CHOPE).

Richardson, R., Schultz, J., & Crawford, K. (2019). Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, And Justice. New York University Law Review Online, 94(May 2019), 192–233.

Tingley, K. (2021). Is Coffee Good for Us? Maybe Machine Learning Can Help Figure It Out. The New York Times Magazine, 16.

Appendix A

Appendices can go after the References. By default, references are generated at the end of the document, but this can be changed by placing the refs block from the Set Up Your Document section at the point where the reference list should appear.

Appendix B

Here I demonstrate that R code can be executed in the same document where citations are used.

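library(dplyr)    # provides the %>% pipe
library(ggplot2)  # provides ggplot(), geom_line() and labs()

# 100 draws of standard normal noise, indexed 1 to 100
rdata <- data.frame(index = 1:100, val = rnorm(100))

# Plot the series as a red line
rdata %>%
  ggplot() +
  geom_line(aes(x = index, y = val), colour = "red") +
  labs(title = "Random Noise or Real Data?")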

Citations and References for Data Scientists in RStudio and Zotero

Alexander Ng

28 September 2021