About

Project Overview

This dashboard is the final project for the Spring 2023 course ANTH630: Statistics for Anthropology at the University of Maryland, taught by Madeline Brown. The project was completed by Britney Bibeault and Ia Bull, students in the PhD in Information Studies program. Quantitative approaches are as common in translation studies and lexicology as they are in literary studies, though each field has its own niches and blind spots (Guo 2022). We hope to contribute to the growing trend of interdisciplinary studies while showing how useful archival data can be when reused.

Data Source

This dashboard highlights the literary translation dataset collected and described by Erlin et al. (2022) in “The TRANSCOMP Dataset of Literary Translations from 120 Languages and a Parallel Collection of English-Language Originals,” a set of over 10,000 literary works that were translated into English in the early 1900s, together with English-origin works of the same time period. This data was pulled from a larger dataset (hereafter referred to as NovelTM) of nearly 200,000 literary works published since the eighteenth century and held in the HathiTrust Digital Library (Underwood et al. 2020). The limitations of this data source should be acknowledged. The HathiTrust Digital Library is not a comprehensive representation of past literature. It is primarily composed of books from academic libraries in the United States, resulting in a bias towards famous authors and weaker coverage of popular culture and juvenile fiction, particularly in non-Anglo-American contexts. Furthermore, the collection is incomplete, containing only slightly more than half of the 19th century fiction titles mentioned in Publishers Weekly and less than a quarter of 20th century titles (Underwood et al. 2020, 4-5).

The dataset we are using can be found on the Journal of Open Humanities Data [add link here].

Research Questions

By applying statistics to this dataset, we hope to answer two questions: (1) What are the general trends in this dataset in relation to which languages were translated? (2) Is there a correlation between time and language, in terms of when different languages were translated?

Expected Impacts

One of the main impacts we expect from this project is greater awareness of the reuse of open data, especially humanities data. We hope to encourage other researchers to submit their data to open access resources and to run subsequent analyses on existing data, so that more questions can be answered without collecting new data. This research project also shines a spotlight on previously unexplored areas of research identified by Erlin et al. (2022), to encourage others to engage with this data critically and more thoroughly. They identified two areas of research that could benefit from access to their data: investigating the structural asymmetries in the global flow of translations, and linking these asymmetries to differences in the linguistic, stylistic, or thematic features of translations. The current project cannot engage directly with these specific areas because of its scope and limitations. Nonetheless, its findings can still contribute to the broader field of translation studies and provide insight into the patterns and characteristics of translated texts in the global context. By building on the work of Erlin et al. and other scholars in the field, this project can help expand our understanding of how translations reflect and shape cultural biases and power dynamics in the “world republic of letters” (Erlin et al. 2022), and of the ways in which the written record overshadows other forms of influence (Donald 2023).

Research Question 1

Top Translations Including English


To begin our analysis, we wanted to see the overall counts of translations in relation to each other. English was the final language for the vast majority of the texts, as expected given that the dataset was drawn from a collection focused on translations into English.

Top Translations with the ability to Exclude English


After creating the first visual, we were interested in gaining better insight into the non-English translations. To accomplish this, we removed English from the final_lang column and recreated the chart of the top 50 languages translated.
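The filtering and counting step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the dashboard's actual code: only the final_lang column name comes from the text, while the function name `top_languages`, the sample rows, and the language codes such as "eng" are hypothetical.

```python
from collections import Counter

def top_languages(rows, n=50, exclude=("eng",)):
    """Count rows per final_lang value, skip excluded codes, return the n most common."""
    counts = Counter(
        row["final_lang"] for row in rows if row["final_lang"] not in exclude
    )
    return counts.most_common(n)

# Hypothetical sample rows; the language codes are assumptions for illustration.
sample = [
    {"final_lang": "eng"},
    {"final_lang": "fre"},
    {"final_lang": "ger"},
    {"final_lang": "fre"},
    {"final_lang": "eng"},
]

print(top_languages(sample))              # English excluded
print(top_languages(sample, exclude=()))  # English included
```

Passing an empty `exclude` reproduces the first chart's counts, so one function could serve both views.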

Research Question 2

Translations by Year Plot


As this visualization shows, the number of translations remained low (between 0 and 5) before the mid-20th century. In the 1950s a significant jump occurred, followed by a steady rise until the late 20th century, when there was a small decrease before translations rose once more ahead of the 21st century.

Time by Language


We have two interactive line charts where each language is represented by a separate line. The first line chart shows the number of translations (y-axis) over time (x-axis) for the different languages in the dataset (lines). You can hover over the lines to see the count of translations for each year and compare the trends between different languages. This visualization is useful for comparing translation counts both across languages and across the time period, giving researchers a better understanding of translation trends over time for an array of languages. If a dataset of modern translations becomes available, it could quickly be compared to this chart to see whether the trends have changed. Additionally, if historical context is provided alongside this data, the chart would be useful for showing trends in relation to historical events.
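The data behind a chart of this kind is one series of yearly counts per language. The sketch below shows one way such series could be assembled in plain Python. The field names `language` and `year`, the function name, and the sample rows are all hypothetical and are not taken from the dataset or the dashboard code.

```python
from collections import defaultdict

def counts_by_language_and_year(rows):
    """Return {language: {year: count}}, one series per language."""
    series = defaultdict(lambda: defaultdict(int))
    for row in rows:
        series[row["language"]][row["year"]] += 1
    # Sort years so each series is ready to draw from left to right.
    return {lang: dict(sorted(years.items())) for lang, years in series.items()}

# Hypothetical rows, one per translated volume.
sample = [
    {"language": "fre", "year": 1962},
    {"language": "fre", "year": 1960},
    {"language": "ger", "year": 1960},
    {"language": "fre", "year": 1962},
]

for language, yearly in counts_by_language_and_year(sample).items():
    print(language, yearly)
```

Each inner dictionary maps directly onto one line of the chart, with years on the x-axis and counts on the y-axis.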

Translations Printed Over Time


Translations Over Time


As the scatterplot with regression line shows, the number of translations printed into English rises steadily, on average, over the period of data collection, with peaks in the mid-1960s (440 translations in 1968) and the 1990s (502 translations in 1992) and low points in the mid-1950s and the 1980s (255 translations in 1981). The regression line indicates a general positive correlation between the number of translations and the year, which could be explained by higher connectivity between different language groups as a result of globalization and the eventual creation and common use of the internet. As with the other visualizations answering Research Question 2, this plot should be analyzed with historical context in mind to better understand the trends shown.
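For readers unfamiliar with how a regression line of this kind is obtained, the following sketch fits a straight line to yearly counts by ordinary least squares and reports the Pearson correlation. It assumes a simple linear model, which may differ from the method used for the dashboard plot. The function name `fit_line` is hypothetical, and the toy values are invented for illustration and are not figures from the dataset.

```python
import math

def fit_line(years, counts):
    """Ordinary least-squares fit; returns (slope, intercept, pearson_r)."""
    n = len(years)
    if n < 2 or n != len(counts):
        raise ValueError("need two or more paired observations")
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    syy = sum((y - mean_y) ** 2 for y in counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    if sxx == 0 or syy == 0:
        raise ValueError("years and counts must each vary")
    slope = sxy / sxx                      # change in count per year
    intercept = mean_y - slope * mean_x    # fitted count at year 0
    r = sxy / math.sqrt(sxx * syy)         # Pearson correlation coefficient
    return slope, intercept, r

# Invented toy values, NOT taken from the TRANSCOMP data.
toy_years = [1950, 1960, 1970, 1980, 1990]
toy_counts = [120, 260, 250, 330, 480]

slope, intercept, r = fit_line(toy_years, toy_counts)
print(f"about {slope:.1f} more translations per year, r = {r:.2f}")
```

A positive slope with r close to 1 corresponds to the kind of upward trend described above, although a single line cannot capture the peaks and dips between decades.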

Overview and Future Work

Overview

The dataset used in this dashboard is the compiled metadata derived from the “TRANSCOMP Dataset of Literary Translations from 120 Languages and a Parallel Collection of English-Language Originals” by Erlin et al. (2022). The dataset consists of over 10,000 literary works translated into English in the 1900s, along with English-origin works from the same time period. The data was extracted from the larger NovelTM dataset, which contains nearly 200,000 literary works published since the eighteenth century and is located in the HathiTrust Digital Library (Underwood et al. 2020).

It is important to note the limitations of the data source. The HathiTrust digital library primarily includes books from academic libraries in the United States, resulting in a bias towards well-known authors and a weaker representation of popular culture and juvenile fiction, especially outside Anglo-American contexts. Additionally, the collection is incomplete, covering only a fraction of the titles mentioned in Publishers Weekly for the 19th and 20th centuries.

From the analyses we ran on this data, we found the following points interesting and worthy of future research: (1) Texts published in different languages received different amounts of attention from English translators. The historical context in which those spikes or drops occur could offer more insight into the trends in translations. (2) There are a number of opportunities to evaluate the state of literary works over time through the TRANSCOMP dataset, developed by Erlin et al. (2022) from the data collected by Underwood et al. (2020). One visualization that we did not attempt was a map of the global presence of the languages being translated into English. One way to do this is with a heatmap overlaid on a global choropleth in leaflet, in combination with, or instead using, the R packages lingtypology (Moroz et al. 2023) or glottospace (Norder 2023).

Research Question 1 focuses on the top translations, including English, in the dataset. A bar chart displays the counts of translations for each language, with English having the highest count due to the dataset’s focus on translations into English. Research Question 2 further explores translations by year and language. A line chart shows the number of translations over time, allowing for comparisons between languages. The dashboard also presents a plot of translations over time, showing fluctuations in the number of translations into English.

Overall, this dashboard provides an initial exploration of the TRANSCOMP dataset and offers avenues for further investigation into translation trends and their historical context. This work highlights the need for examining historical contexts to gain deeper insights into translation trends and emphasizes the potential for comparing this dataset with modern translation data.

Publication

It is our plan to write a paper on the process of creating this project and submit it to the Journal of Open Humanities Data in the coming year. The paper will share how we reused humanities data and applied statistical methods to it, in hopes of encouraging more creative uses of existing datasets.

References

Cascio, M. Ariel, Eunlye Lee, Nicole Vaudrin, and Darcy A. Freedman. “A Team-Based Approach to Open Coding: Considerations for Creating Intercoder Consensus.” Field Methods 31, no. 2 (May 1, 2019): 116–30. https://doi.org/10.1177/1525822X19838237.

Creswell, John W., and J. David Creswell. Research Design: Qualitative, Quantitative, and Mixed Methods Approaches. SAGE Publications, 2017.

Dell’Oro, Francesca. “From Static to Interactive Maps: Drawing Diachronic Maps of (Latin) Modality with Pygmalion.” Journal of Open Humanities Data 8, no. 0 (January 12, 2022): 2. https://doi.org/10.5334/johd.58.

Donald, Rhonda Lucas. “Language Diversity Index,” January 31, 2023. https://education.nationalgeographic.org/resource/language-diversity-index-map.

Erlin, Matt, Andrew Piper, Douglas Knox, and Stephen Knox. “The TRANSCOMP Dataset of Literary Translations from 120 Languages and a Parallel Collection of English-Language Originals.” Journal of Open Humanities Data 8, no. 0 (December 26, 2022): 29. https://doi.org/10.5334/johd.94.

Guo, Jia. “Deep Learning Approach to Text Analysis for Human Emotion Detection from Big Data.” Journal of Intelligent Systems 31, no. 1 (January 1, 2022): 113–26. https://doi.org/10.1515/jisys-2022-0001.

Hiles, David. “Transparency.” In The SAGE Encyclopedia of Qualitative Research Methods, edited by Lisa Given. Thousand Oaks, CA: SAGE Publications, 2008. https://doi.org/10.4135/9781412963909.n467.

Moroz, G., Forlinguistics, Koncha, K., Ustera, Sanya, Ooms, J., Ram, K., & Timelyportfolio. (2023). ropensci/lingtypology: V 1.1.12 (v1.1.12). Zenodo. https://doi.org/10.5281/ZENODO.815028

Norder, S. (2023). glottospace: Language Mapping and Geospatial Analysis of Linguistic and Cultural Data [R]. https://github.com/SietzeN/glottospace/blob/69282cdcc6b72ff895f92074b5d419edd0060b33/README.md (Original work published 2021)

Ronen, Shahar, Bruno Gonçalves, Kevin Z. Hu, Alessandro Vespignani, Steven Pinker, and César A. Hidalgo. “Links That Speak: The Global Language Network and Its Association with Global Fame.” Proceedings of the National Academy of Sciences 111, no. 52 (December 30, 2014): E5616–22. https://doi.org/10.1073/pnas.1410931111.

Trisovic, Ana, Matthew K. Lau, Thomas Pasquier, and Mercè Crosas. “A Large-Scale Study on Research Code Quality and Execution.” Scientific Data 9, no. 1 (February 21, 2022): 60. https://doi.org/10.1038/s41597-022-01143-6.

Underwood, Ted, Patrick Kimutis, and Jessica Witte. “NovelTM Datasets for English-Language Fiction, 1700-2009.” Journal of Cultural Analytics 5, no. 2 (May 28, 2020). https://doi.org/10.22148/001c.13147.