4th October 2020
Plenary talk @ 46th Conference of the Japan Association for English Corpus Studies (JAECS)

Today’s talk: An Outline

  • Introduction
  • The many faces of NLP
  • Personal experiences with the different faces
  • What NLP research can learn from the other faces
  • Concluding Thoughts
    • NLP and Corpus Linguistics
    • NLP education

Note: Images without attribution are taken from our book (http://www.practicalnlp.ai/).

These slides can be accessed at: rpubs.com/vbsowmya/jaecs2020talk.

What is NLP?



NLP is all about understanding and modeling human language using computational methods.

Where is NLP useful in day-to-day life?

  • general purpose applications: search, email, voice based assistants on phones etc.

  • domain specific applications in e-commerce, legal, finance, health care etc.

  • educational technology: language teaching, learning, assessment tools

  • language revitalization software

  • disaster management tools

What are the many faces of NLP?

Three broad groups:

  • NLPers: NLP researchers in academia and industry
  • Other researchers who use NLP methods in their research
  • Industry professionals developing NLP based applications



  • What do NLPers and others do?

  • What are the differences between their practices?

  • What can NLP learn from the many faces?

  • Does Corpus Linguistics figure in this setup at all?

The rest of this talk is a personal take on these questions.


NLP research - an overview

A snapshot of NLP research topics

Trends in NLP Research

Contemporary NLP Research - 1

Contemporary NLP Research - 2

There is also a lot of work on a range of other topics from human language comprehension to mental health:

Finding syntax in human encephalography with beam search

Probabilistic Typology: Deep Generative Models of Vowel Inventories

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Feuding Families and Former Friends: Unsupervised Learning for Dynamic Fictional Relationships

Depression and Self-Harm Risk Assessment in Online Forums

NLP Research in 2020: Introspection

NLP Research: Language Learning/Teaching

Two examples:

  • FeedBook - an interactive workbook for English foreign language teaching

  • Research Writing Tutor - a tool that offers feedback on academic writing

  • There are also focused events within the NLP community for educational applications, e.g., the BEA workshop series …

NLP Research: Summary

  • generally heavy on algorithms/methods

  • relatively less focus on corpus creation, evaluation beyond standard tests, and rule engineering (an opportunity for corpus linguists?)

  • recent interest in bias in models, ethics, interpretability etc.

  • a lot of introspection in 2020

NLP in Industry - an overview

What does NLP in Industry look like?

  • language learning/teaching/assessment software, which is familiar to you

  • large R&D teams building NLP-focused products, for their own use as well as for third parties

  • software teams where NLP contributes to existing product functionalities

  • speech to text/text to speech software, transcription tools etc.

[What’s in it for you? Ideally, any job that involves working with large corpora needs people with expertise in building and analyzing them.]

NLP in Industry - Language learning/teaching

Tools for specific classroom/assessment scenarios:

General purpose tools such as Grammarly, Duolingo etc.

NLP in Industry - Specific Examples

  • Glemser uses the technology from arria.com to generate clinical trial reports automatically.
  • Lawdroid makes chatbots for law firms to perform various functions (e.g., paralegal bot, reception bot etc.).
  • Bloomberg uses sentiment analysis on news articles about companies to support stock market decisions.

NLP in Industry - Summary

  • We saw a few use cases so far. NLP is useful in many other industry scenarios too.

  • Companies that build software involving local, non-English NLP are also growing in many countries.

  • There are also companies that primarily do annotation for NLP and other Machine Learning projects (e.g., Appen Ltd).

  • To conclude, industry NLP:

    • involves a wide range of applications, and
    • requires people from diverse backgrounds, such as linguists, software developers, product managers etc.

NLP in other disciplines: An overview

Where is NLP used in other disciplines?

  • NLP is used as a method to answer research questions in many disciplines.
  • NLP sometimes plays a major role in discipline specific challenges, going beyond being just a research method.
  • On Google Scholar, I saw mentions of NLP methods in journals ranging from Asian studies and history to clinical oncology.
  • I will show a sample of work taken from a few disciplines that may interest you.
  • [Again, what’s in it for you? Wherever there is a role for NLP, I believe there is a role for a corpus linguist too!]

NLP in Applied Linguistics Research

Chukharev-Hudilainen, E., & Saricaoglu, A. (2016). Causal discourse analyzer: Improving automated feedback on academic ESL writing. Computer Assisted Language Learning, 29(3), 494-516.

  • used Stanford CoreNLP software + linguistic rule engineering to identify cause and effect discourse in non-native writing.
  • Causal markers were first identified by a manual, functional linguistic analysis of a corpus, and were then used to develop the above rules.
  • evaluated in terms of precision and recall, on manually annotated essays by 17 students.
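As a rough illustration of this kind of evaluation, precision and recall for a rule-based detector can be computed by comparing its output against manual annotations. This is a minimal sketch and not the paper's code: the sentence IDs and the `precision_recall` helper are hypothetical.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of item IDs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: IDs of sentences marked as containing causal discourse.
system_output = {1, 4, 5, 9}           # flagged by the rules
manual_annotation = {1, 2, 4, 9, 10}   # flagged by human annotators

p, r = precision_recall(system_output, manual_annotation)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision tells us how many of the flagged sentences were correct; recall tells us how many of the annotated sentences the rules found.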

NLP in Language Acquisition Research

Chen, X., Alexopoulou, T., & Tsimpli, I. (2020). Automatic extraction of subordinate clauses and its application in second language acquisition research. Behavior Research Methods, 1-15.

  • Built a tool to extract subordinate clauses using Stanford dependency parser followed by several hand crafted rules.
  • Validated the tool through an evaluation with annotated test set and manual inspection.
  • Used this tool to analyze a large-scale learner corpus and investigate the effects of first language (L1) on the acquisition of subordination in second language (L2) English.
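To give a feel for the "parser output + hand crafted rules" approach, here is a minimal sketch that picks out subordinate clauses from a dependency parse. It is not the authors' tool: the parse is written by hand rather than produced by a parser, and the relation labels in `SUBORDINATE_RELATIONS` are an assumed, simplified rule set in the style of Universal Dependencies.

```python
from collections import defaultdict

# Assumption: these dependency labels signal the head of a subordinate clause.
SUBORDINATE_RELATIONS = {"advcl", "ccomp", "csubj", "acl:relcl"}


def extract_subordinate_clauses(tokens):
    """tokens: list of (index, form, head_index, relation), indices 1-based."""
    children = defaultdict(list)
    for index, _form, head, _relation in tokens:
        children[head].append(index)

    def subtree(index):
        # Collect a token and everything that depends on it.
        nodes = [index]
        for child in children[index]:
            nodes.extend(subtree(child))
        return nodes

    forms = {index: form for index, form, _head, _relation in tokens}
    clauses = []
    for index, _form, _head, relation in tokens:
        if relation in SUBORDINATE_RELATIONS:
            span = sorted(subtree(index))
            clauses.append((relation, " ".join(forms[i] for i in span)))
    return clauses


# Hand-written parse of "She left because she was tired" (illustrative only).
parse = [
    (1, "She", 2, "nsubj"),
    (2, "left", 0, "root"),
    (3, "because", 6, "mark"),
    (4, "she", 6, "nsubj"),
    (5, "was", 6, "cop"),
    (6, "tired", 2, "advcl"),
]

clauses = extract_subordinate_clauses(parse)
print(clauses)
```

A real tool needs many more rules, and its output still has to be validated against annotated data, as the authors did.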

NLP in Corpus Linguistics

Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28-54.

  • Proposed an approach to control for annotation bias in learner language parse annotations.
  • Evaluated multiple NLP parsers on learner English.
  • Identified and quantified the influence of learner writing errors on parser’s efficiency.

A few more examples:

Medical Informatics: Chen, L., Gu, Y., Ji, X., Sun, Z., Li, H., Gao, Y., & Huang, Y. (2020). Extracting medications and associated adverse drug events using a natural language processing system combining knowledge base and deep learning. Journal of the American Medical Informatics Association, 27(1), 56–64.

Plant Science: Braun, I. R., & Lawrence-Dill, C. J. (2019). Automated methods enable direct computation on phenotypic descriptions for novel candidate gene prediction. Frontiers in Plant Science, 10, 1629.

Civil Engineering: Le, T., & David Jeong, H. (2017). NLP-based approach to semantic classification of heterogeneous transportation asset data terminology. Journal of Computing in Civil Engineering, 31(6), 04017057.

Economics: Hansen, S., McMahon, M., & Prat, A. (2018). Transparency and deliberation within the FOMC: a computational linguistics approach. The Quarterly Journal of Economics, 133(2), 801-870.

Political Science: Benoit, K., Munger, K., & Spirling, A. (2019). Measuring and explaining political sophistication through textual complexity. American Journal of Political Science, 63(2), 491-508.

Urban Planning: Plunz, R. A., Zhou, Y., Vintimilla, M. I. C., Mckeown, K., Yu, T., Uguccioni, L., & Sutto, M. P. (2019). Twitter sentiment in New York City parks as measure of well-being. Landscape and Urban Planning, 189, 235-246.

Cultural Heritage: Machidon, O. M., Tavčar, A., Gams, M., & Duguleană, M. (2020). CulturalERICA: A conversational agent improving the exploration of European cultural heritage. Journal of Cultural Heritage, 41, 152-165.

NLP in other disciplines: Summary

  • Clearly, there are many more. I just sampled a few examples, from even fewer disciplines!

  • Existing NLP tools + rules is a commonly used approach in some disciplines.

  • Doing user studies and using a small set of manually annotated documents for validation of approaches is also a common method.

  • In some fields (e.g., medical informatics), we also see state of the art deep learning and NLP.

How are the faces of NLP different from each other?

  • NLPers focus on developing new methods, using standard corpora/evaluation procedures, and comparing against SOTA.

  • Industry professionals focus on end users, end to end system development and maintenance.

“If you think Machine Learning will give you a 100% boost, then a heuristic will get you 50% of the way there” - Martin Zinkevich, Google

  • Researchers in other disciplines are concerned with how to use NLP methods to address their own research questions.


My watershed moments as an NLPer - 1

  • We typically take corpora as a given gold standard, although many of them are compiled from the web without clear information on how they were created.
  • In 2018/19, we did a study where participants read texts annotated with reading levels assigned by the authors of those texts (Vajjala & Lucic, 2019).
  • It turns out that these annotations did not correspond with readers’ comprehension (as per our definition, of course!).
  • Considering that such corpora have regularly been used for building NLP models, it made me question our practices around corpus collection, annotation, and validation.

My watershed moments as an NLPer - 2

  • Like many others (e.g., Cockburn et al., 2020), I thought sharing code, data, and other details was enough to make experiments replicable and reproducible.
  • All four (1, 2, 3, 4) could reproduce several results, but also ran into many different issues.
  • Challenges with repeatability, replicability, and reproducibility were all extensively discussed in these reports.
  • This made me question NLP’s definition of a good model.

Personal Experiences: Industry R&D

  • Available NLP tools are brittle with new texts.
  • Issues such as data format (e.g., PDF, scanned files etc.) are non-trivial problems in an application scenario.
  • Deployed models or data are not static.
  • Privacy concerns exist while using customer texts for training NLP systems.
  • Evaluation is done using extrinsic measures, live data, and manual inspection, apart from standard test sets.

….

Inter-disciplinary Experiences - 1

Berendes, K., Vajjala, S., Meurers, D., Bryant, D., Wagner, W., Chinkina, M., & Trautwein, U. (2018). Reading demands in secondary school: Does the linguistic complexity of textbooks increase with grade level and the academic orientation of the school track? Journal of Educational Psychology, 110(4), 518–543.

As in typical NLP research, we built multiple machine learning models and evaluated them with a train/test split. However, following the methods of educational psychology, we also:

  • extensively analyzed which features show significant differences across different groups
  • built multilevel models with a limited set of theoretically grounded variables
  • discussed practical implications for textbook content development in more detail than NLPish papers usually do.

Inter-disciplinary Experiences - 2

Vajjala, S., Meurers, D., Eitel, A., & Scheiter, K. (2016). Towards grounding computational linguistic approaches to readability: Modeling reader-text interaction for easy and difficult texts. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC) (pp. 38-48).

  • We searched for cognitive correlates of text readability by conducting an eye-tracking study and using mixed effects models.
  • We took texts used in NLP for modeling text readability, which are annotated by teachers, and sought to understand what these annotations mean in terms of the reading and comprehension processes of the readers.
  • The methods used in this project are largely unrelated to NLP methods, although the goal is to improve NLP approaches to readability assessment.


First things first: All is not bad!

Some cool things about NLPers:

  • Almost every publication is open access.
  • Many people share their code and/or corpora publicly (which makes criticism and scrutiny possible!)
  • In my experience, NLPers criticize themselves and introspect more frequently compared to others.
  • We are generally more open to new methods, new application areas, working with other disciplines etc. than other faces.

What can NLPers learn from Other Disciplines?

  • What are some good practices of building and annotating corpora? (e.g., from Corpus Linguistics)

  • How can we validate our features/models beyond train/test data? (e.g., user studies, statistical techniques such as correlation etc.)

  • How can we perform a more detailed analysis of the model predictions? (e.g., mixed methods research)

  • How can we leverage theoretical insights from other disciplines? (e.g., for Computational Social Science research in NLP)
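One of the validation ideas in the list above, correlating model output with an external measure from a user study, could look like the sketch below. The numbers are made up for illustration, and `pearson` is a small helper written here rather than a library call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical data: a model's difficulty score for five texts, and the
# average comprehension score readers achieved on the same texts.
predicted_difficulty = [1.0, 2.0, 3.0, 4.0, 5.0]
comprehension_score = [0.9, 0.8, 0.6, 0.5, 0.3]

r = pearson(predicted_difficulty, comprehension_score)
print(f"correlation between model output and comprehension: {r:.2f}")
```

A strong negative correlation here would suggest that texts the model calls difficult are indeed harder for readers, which is evidence a held-out test set alone cannot give.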

What can NLPers learn from Industry professionals?

  • Focusing on optimal solutions, not necessarily on complex ones with a 0.01% improvement over simpler ones.

  • Stress on testing the code and evaluating beyond standard test data.

  • Reusable code and understandable documentation

(Some of these observations hold for other disciplines too, including corpus linguistics.)


NLP and Corpus Linguistics (CL)

What is Corpus Linguistics (to me)?

  • Systematic approaches to:

    - corpus construction/compilation (if needed: annotation?)

    - analyzing the linguistic characteristics of corpora

    - understanding different language varieties

    - applying such analyses for pedagogy, literary studies, translation etc.

Corpus Linguistics:
“Opportunities in the new decade”

IJCL Editorial, 2020. Issue: 25 (1)

  • “Corpus linguistics has the potential to provide methods and approaches for applied humanities at scale.”

  • “Corpus linguistics offers vast opportunities to better understand how disciplines communicate and to consider how cross-disciplinary discourse might work.”

  • “We need to broaden our view beyond more easily retrievable data sets, such as newspaper articles or canonical literary texts, in favour of an inclusive and diverse approach to data”

Opportunities/Challenges in NLP

(across all faces)

  • Approaches that work on different varieties of language

  • Ways of evaluating the NLP systems for linguistic coverage

  • Interpretable computational models

  • Approaches that can be ported easily to new languages

  • Ways to manage non-static, constantly updating corpora/datasets

  • Stuff that works on the device without sending potentially private texts to some cloud location

……

What can NLP learn from CL?

  • corpus construction with methodical selection and annotation of texts
  • exploratory analysis of the linguistic characteristics of the corpus, to inform the computational models
  • using linguistic analyses to understand the coverage/limitations of computational models

What can CL learn from NLP?

  • Expand scope by supplementing existing corpus analysis methods with NLP methods, which will broaden the possible language analyses.
  • “Scale” to other languages beyond the dominant ones with multilingual NLP software.
  • improved (and open) access to code/data/literature? [No one can criticize/scrutinize/improve stuff they can’t access!]

Working together: new directions

  • best practices for issues related to corpora: store and share, manage constantly updating corpora, ownership of user produced texts, etc.
  • how to address the issues of ethics, bias, fairness in language based corpora and systems including those used for language teaching/assessment/learning.
  • develop methods to probe the computational models for the coverage of language variety.

How to work together?

“for a true interdisciplinary collaboration, both sides need to understand each other’s specialized terminology and together develop the definition of success for the project. We ourselves must be willing to acquire at least apprentice-level expertise in the domain at hand to develop the data and knowledge discovery process necessary for achieving success.” - Rudin & Wagstaff, 2014

NLP Education

  • Many different groups of people are interested in learning and applying NLP methods for their work now.
  • Yet, textbooks are typically written with NLPers (and engineers/programmers) in mind.
  • There is a need for books/courses that cater to the different faces of NLP - NLPers, industry professionals, and other researchers.
  • There is also a need to incorporate more non-model focused aspects into a regular NLP course.

Thank you

contact: sowmya.vajjala @ nrc-cnrc.gc.ca

to cite: Vajjala, S. (2020). NLP Beyond NLPers: the many faces of NLP in academia and real-world [Plenary Talk]. 46th Conference of the Japan Association for English Corpus Studies (JAECS), Virtual Event, Japan. https://rpubs.com/vbsowmya/jaecs2020talk

References

  1. “Why computing belongs within the social sciences” (Connolly, 2020)
  2. “Threats of a replication crisis in empirical computer science” (Cockburn et al., 2020)
  3. “Data-Centricity: A Challenge and Opportunity for Computing Education” (Krishnamurthi & Fisler, 2020)
  4. “Machine Learning Production Pipeline: Project Flow and Landscape” (Huyen, 2020)
  5. “Some Advice for Psychologists Who Want to Work With Computer Scientists on Big Data” (König et al., 2020)
  6. “Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)” (Pineau et al., 2019)
  7. “Data statements for natural language processing: Toward mitigating system bias and enabling better science” (Bender & Friedman, 2018)
  8. “On the Challenges of Translating NLP Research into Commercial Products” (Dahlmeier, 2017)
  9. “Building better open-source tools to support fairness in automated scoring” (Madnani et al., 2017)
  10. “Why Big Data Industrial Systems Need Rules and What We Can Do About It” (Suganthan et al., 2015)
  11. “Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!” (Chiticariu et al., 2013)