NLP Beyond NLPers: the many faces of NLP in academia and real-world

04th October 2020
Plenary talk @ 46th Conference of the Japan Association for English Corpus Studies (JAECS)

Today’s talk: An Outline

Introduction
The many faces of NLP
Personal experiences with the different faces
What NLP research can learn from the other faces
Concluding Thoughts
- NLP and Corpus Linguistics
- NLP education

Note: Images without attribution are taken from our book (http://www.practicalnlp.ai/).

These slides can be accessed at: rpubs.com/vbsowmya/jaecs2020talk.

What is NLP?

NLP is all about understanding and modeling human language using computational methods.

Where is NLP useful in day-to-day life?

general-purpose applications: search, email, voice-based assistants on phones, etc.
domain-specific applications in e-commerce, legal, finance, health care, etc.
educational technology: language teaching, learning, assessment tools
language revitalization software
disaster management tools

…

What are the many faces of NLP?

Three broad groups:

NLPers: NLP researchers in academia and industry
Other researchers who use NLP methods in their research
Industry professionals developing NLP based applications

What do NLPers and others do?
What are the differences between their practices?
What can NLP learn from the many faces?
Does Corpus Linguistics figure in this setup at all?

The rest of this talk is a personal take on these questions.

NLP research - an overview

A snapshot of NLP research topics

Trends in NLP Research

How can one quickly showcase contemporary NLP research to others?

Some paper titles from best paper awards over the past 5 years may give a picture.

source: https://aclweb.org/aclwiki/Best_paper_awards
(everything is open access!)

Contemporary NLP Research - 1

A lot of research focuses on what can perhaps be called core NLP tasks and applications:

“Bridging the Gap between Training and Inference for Neural Machine Translation”

“Improving Evaluation of Machine Translation Quality Estimation”

“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”

“Linguistically-Informed Self-Attention for Semantic Role Labeling”

“Beyond Accuracy: Behavioral Testing of NLP Models with CheckList” (2020)

Contemporary NLP Research - 2

There is also a lot of work on a range of other topics from human language comprehension to mental health:

“Finding syntax in human encephalography with beam search”

“Probabilistic Typology: Deep Generative Models of Vowel Inventories”

“Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints”

“Feuding Families and Former Friends: Unsupervised Learning for Dynamic Fictional Relationships”

“Depression and Self-Harm Risk Assessment in Online Forums”

NLP Research in 2020: Introspection

Some papers from the ACL 2020 theme, “Taking Stock of Where We’ve Been and Where We’re Going”:

“The State and Fate of Linguistic Diversity and Inclusion in the NLP World”

“How Can We Accelerate Progress Towards Human-like Linguistic Generalization?”

“A Call for More Rigor in Unsupervised Cross-lingual Learning”

“Automated Evaluation of Writing – 50 Years and Counting”

“Speech Translation and the End-to-End Promise: Taking Stock of Where We Are”

NLP Research: Language Learning/Teaching

Two examples:

FeedBook - an interactive workbook for English foreign language teaching
Research Writing Tutor - a tool that offers feedback on academic writing
There are also focused events within the NLPer community for educational applications, e.g., the BEA workshop series …

NLP Research: Summary

generally heavy on algorithms/methods
relatively less focus on corpus creation, evaluation beyond standard tests, and rule engineering (an opportunity for corpus linguists?)
recent interest in bias in models, ethics, interpretability, etc.
a lot of introspection in 2020

NLP in Industry - an overview

What does NLP in Industry look like?

language learning/teaching/assessment software, which is familiar to you
large R&D teams building NLP-focused products, for their own use as well as for third parties
software teams where NLP contributes to existing product functionalities
speech-to-text/text-to-speech software, transcription tools, etc.

[What’s in it for you?: Ideally, any job that involves working with huge corpora needs folks with expertise in building and analyzing them.]

NLP in Industry - Language learning/teaching

Tools for specific classroom/assessment scenarios:

Criterion writing support tool
SpeechRater spoken response assessment tool
Linguatorium tools for second language vocabulary and pronunciation

General purpose tools such as Grammarly, Duolingo etc.

NLP in Industry - Specific Examples

Glemser uses the technology from arria.com to generate clinical trial reports automatically.

Lawdroid makes chatbots for law firms to perform various functions (e.g., paralegal bot, reception bot, etc.).

Bloomberg uses sentiment analysis on news articles about companies to support stock market decisions.

Pharma company Pfizer uses IBM Watson for cancer treatment drug discovery.

NLP in Industry - Summary

We saw a few use cases so far. NLP is useful in many other industry scenarios too.
Companies that build software involving local, non-English NLP are also growing in many countries.
There are also companies that primarily do annotation for NLP and other machine learning projects (e.g., Appen Ltd).
To conclude, industry NLP
- involves a wide range of applications, and
- requires people from diverse backgrounds: linguists, software developers, product managers, etc.

NLP in other disciplines: An overview

Where is NLP used in other disciplines?

NLP is used as a method to answer research questions in many disciplines.

NLP sometimes plays a major role in discipline specific challenges, going beyond being just a research method.

In Google Scholar, I saw mentions of NLP methods in journals ranging from Asian studies and history to clinical oncology.

I will show a sample of work taken from a few disciplines that may interest you.

[Again, what’s in it for you?: Wherever there is a role for NLP, I believe there is a role for a corpus linguist too!]

NLP in Applied Linguistics Research

Chukharev-Hudilainen, E., & Saricaoglu, A. (2016). Causal discourse analyzer: Improving automated feedback on academic ESL writing. Computer Assisted Language Learning, 29(3), 494-516.

Used Stanford CoreNLP software plus linguistic rule engineering to identify cause-and-effect discourse in non-native writing.

Causal markers were first identified by a manual, functional linguistic analysis of a corpus, and were then used to develop the above rules.

Evaluated in terms of precision and recall on manually annotated essays by 17 students.
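
For readers less familiar with these two measures, here is a minimal sketch of how precision and recall are computed when system output is compared against manual annotation. The sentence IDs below are made up for illustration and are not taken from the paper.

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for two collections of item identifiers."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: what share of the system's decisions were correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what share of the annotated items did the system find?
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical IDs of sentences that annotators marked as causal discourse.
gold_causal = {1, 4, 7, 9}
# Hypothetical IDs of sentences that a rule-based system flagged as causal.
system_causal = {1, 4, 5, 9, 12}

precision, recall = precision_recall(gold_causal, system_causal)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the system finds three of the four annotated sentences (recall 0.75), but two of its five decisions are wrong (precision 0.60).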

NLP in Language Acquisition Research

Chen, X., Alexopoulou, T., & Tsimpli, I. (2020). Automatic extraction of subordinate clauses and its application in second language acquisition research. Behavior Research Methods, 1-15.

Built a tool to extract subordinate clauses using the Stanford dependency parser followed by several hand-crafted rules.

Validated the tool through an evaluation with an annotated test set and manual inspection.

Used this tool to analyze a large-scale learner corpus and investigate the effects of first language (L1) on the acquisition of subordination in second language (L2) English.
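
To give a flavour of what "parser output plus hand-crafted rules" can look like, here is a minimal sketch. It is not the authors' tool: the parse is written by hand rather than produced by a parser, the relation labels follow Universal Dependencies conventions (an assumption), and the names `Token`, `SUBORDINATE_RELATIONS`, and `extract_subordinate_clauses` are hypothetical.

```python
from collections import namedtuple

# One token of a dependency parse: position, word form, head position, relation.
Token = namedtuple("Token", "index form head deprel")

# Assumed rule: these relations mark the head word of a subordinate clause.
SUBORDINATE_RELATIONS = {"advcl", "ccomp", "csubj", "acl:relcl"}


def subtree_indices(tokens, root_index):
    """Collect the positions of root_index and all of its descendants."""
    children = {}
    for token in tokens:
        children.setdefault(token.head, []).append(token.index)
    collected, stack = [], [root_index]
    while stack:
        node = stack.pop()
        collected.append(node)
        stack.extend(children.get(node, []))
    return sorted(collected)


def extract_subordinate_clauses(tokens):
    """Return (relation, clause text) for every clause matched by the rule."""
    by_index = {token.index: token for token in tokens}
    clauses = []
    for token in tokens:
        if token.deprel in SUBORDINATE_RELATIONS:
            span = subtree_indices(tokens, token.index)
            text = " ".join(by_index[i].form for i in span)
            clauses.append((token.deprel, text))
    return clauses


# Hand-written parse of "I stayed home because it rained" (head 0 = root).
sentence = [
    Token(1, "I", 2, "nsubj"),
    Token(2, "stayed", 0, "root"),
    Token(3, "home", 2, "advmod"),
    Token(4, "because", 6, "mark"),
    Token(5, "it", 6, "nsubj"),
    Token(6, "rained", 2, "advcl"),
]

print(extract_subordinate_clauses(sentence))
```

A real tool would need many more rules and a careful validation step, which is the point the paper makes.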

NLP in Corpus Linguistics

Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28-54.

Proposed an approach to control for annotation bias in learner language parse annotations.

Evaluated multiple NLP parsers on learner English.

Identified and quantified the influence of learner writing errors on the parsers’ efficiency.
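
Dependency parsers are commonly compared using unlabeled and labeled attachment scores (UAS and LAS). The sketch below shows how these are computed; it is an illustration only, with a made-up learner sentence and made-up parser output, and it is not a claim about the exact evaluation setup of the paper.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two parses given as lists of (head, label) pairs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("parses must be non-empty and of equal length")
    total = len(gold)
    # UAS: share of tokens attached to the correct head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: share of tokens with both the correct head and the correct label.
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / total, labeled_hits / total


# Hypothetical learner sentence: "He go to school yesterday" (head 0 = root).
gold_parse = [(2, "nsubj"), (0, "root"), (4, "case"), (2, "obl"), (2, "obl:tmod")]
# Hypothetical parser output with one wrong label and one wrong attachment.
parser_output = [(2, "nsubj"), (0, "root"), (4, "case"), (2, "obj"), (4, "nmod")]

uas, las = attachment_scores(gold_parse, parser_output)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Running the same comparison separately on sentences with and without learner errors is one simple way to see how much the errors affect a parser.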

A few more examples:

Medical Informatics: Chen, L., Gu, Y., Ji, X., Sun, Z., Li, H., Gao, Y., & Huang, Y. (2020). Extracting medications and associated adverse drug events using a natural language processing system combining knowledge base and deep learning. Journal of the American Medical Informatics Association: JAMIA, 27(1), 56–64.

Plant Science: Braun, I. R., & Lawrence-Dill, C. J. (2019). Automated methods enable direct computation on phenotypic descriptions for novel candidate gene prediction. Frontiers in Plant Science, 10, 1629.

Civil Engineering: Le, T., & David Jeong, H. (2017). NLP-based approach to semantic classification of heterogeneous transportation asset data terminology. Journal of Computing in Civil Engineering, 31(6), 04017057.

Economics: Hansen, S., McMahon, M., & Prat, A. (2018). Transparency and deliberation within the FOMC: a computational linguistics approach. The Quarterly Journal of Economics, 133(2), 801-870.

Political Science: Benoit, K., Munger, K., & Spirling, A. (2019). Measuring and explaining political sophistication through textual complexity. American Journal of Political Science, 63(2), 491-508.

Urban Planning: Plunz, R. A., Zhou, Y., Vintimilla, M. I. C., Mckeown, K., Yu, T., Uguccioni, L., & Sutto, M. P. (2019). Twitter sentiment in New York City parks as measure of well-being. Landscape and Urban Planning, 189, 235-246.

Cultural Heritage: Machidon, O. M., Tavčar, A., Gams, M., & Duguleană, M. (2020). CulturalERICA: A conversational agent improving the exploration of European cultural heritage. Journal of Cultural Heritage, 41, 152-165.

NLP in other disciplines: Summary

Clearly, there are many more. I just sampled a few examples, from even fewer disciplines!
Combining existing NLP tools with rules is a commonly used approach in some disciplines.
User studies and validation on a small set of manually annotated documents are also common.
In some fields (e.g., medical informatics), we also see state-of-the-art deep learning and NLP.

How are the faces of NLP different from each other?

NLPers focus on developing new methods, using standard corpora/evaluation procedures, and comparing against SOTA.
Industry professionals focus on end users, end-to-end system development, and maintenance.

“If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.” - Martin Zinkevich, Google

Researchers in other disciplines are concerned with how to use NLP methods to address their own research questions.

My watershed moments as an NLPer - 1

We typically take corpora as a given gold standard, although a lot of them are compiled from the web, without clear information on how they were created.

In 2018/19, we did a study where participants read texts annotated with reading levels assigned by authors of those texts (Vajjala & Lucic, 2019).

It turned out that these annotations did not correspond with readers’ comprehension (as per our definition, of course!).

Considering that such corpora have regularly been used for building NLP models, it made me question our practices with corpora collection/annotation/validation.

My watershed moments as an NLPer - 2

Like many others (e.g., Cockburn et al., 2020), I thought sharing code, data, and other details was enough to make experiments replicable and reproducible.

One of the papers I co-authored (Vajjala & Rama, 2018) was reproduced by four teams in the REPROLANG 2020 challenge.

All four teams could reproduce several results, but also ran into many different issues.

Challenges with repeatability, replicability, and reproducibility were all extensively discussed in their reports.

This made me question NLP’s definition of a good model.

Personal Experiences: Industry R&D

Available NLP tools are brittle with new texts.

Issues such as data format (e.g., PDF, scanned files, etc.) are non-trivial problems in an application scenario.

Deployed models or data are not static.

Privacy concerns exist while using customer texts for training NLP systems.

Evaluation is done using extrinsic measures, live data, and manual inspection, apart from standard test sets.

….

Inter-disciplinary Experiences - 1

Berendes, K., Vajjala, S., Meurers, D., Bryant, D., Wagner, W., Chinkina, M., & Trautwein, U. (2018). Reading demands in secondary school: Does the linguistic complexity of textbooks increase with grade level and the academic orientation of the school track? Journal of Educational Psychology, 110(4), 518–543.

Like typical NLP research, we built multiple machine learning models and evaluated them with a train/test split. However, following the methods of educational psychology, we also:

extensively analyzed which features show significant differences across different groups
built multilevel models with a limited set of theoretically grounded variables
discussed practical implications for textbook content development in more detail than typical NLP papers do.

Inter-disciplinary Experiences - 2

Vajjala, S., Meurers, D., Eitel, A., & Scheiter, K. (2016). Towards grounding computational linguistic approaches to readability: Modeling reader-text interaction for easy and difficult texts. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC) (pp. 38-48).

We searched for cognitive correlates of text readability by conducting an eye-tracking study and using mixed effects models.

We took texts used in NLP for modeling text readability, which are annotated by teachers, and sought to understand what these annotations mean in terms of the reading and comprehension processes of the readers.

The methods used in this project are largely unrelated to NLP methods, although the goal is to improve NLP approaches to readability assessment.

First things first: All is not bad!

Some cool things about NLPers:

Almost every publication is open access.

Many people share their code and/or corpora publicly (which makes criticism and scrutiny possible!)

In my experience, NLPers criticize themselves and introspect more frequently compared to others.

We are generally more open to new methods, new application areas, working with other disciplines, etc. than the other faces.

What can NLPers learn from Other Disciplines?

What are some good practices for building and annotating corpora? (e.g., from Corpus Linguistics)
How can we validate our features/models beyond train/test data? (e.g., user studies, statistical techniques such as correlation etc.)
How can we perform a more detailed analysis of the model predictions? (e.g., mixed methods research)
How can we leverage theoretical insights from other disciplines? (e.g., for Computational Social Science research in NLP)
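
As one small example of validation beyond train/test data, model scores can be correlated with human judgments. The sketch below computes Spearman's rank correlation using only the standard library. The scores and ratings are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical difficulty scores predicted by a readability model for 5 texts.
model_scores = [0.2, 0.5, 0.9, 0.4, 0.7]
# Hypothetical difficulty ratings given by readers for the same 5 texts.
reader_ratings = [1, 4, 5, 2, 3]

rho = spearman(model_scores, reader_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A high correlation on a handful of texts proves little by itself, so this kind of check complements, rather than replaces, user studies and held-out evaluation.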

What can NLPers learn from Industry professionals?

Focusing on optimal solutions, not necessarily on complex ones with a 0.01% improvement over simpler ones.
Stress on testing the code and evaluating beyond standard test data.
Reusable code and understandable documentation

(Some of these observations hold good for other disciplines too, including corpus linguistics.)

NLP and Corpus Linguistics (CL)

What is Corpus Linguistics (to me)?

Systematic approaches to:

- corpus construction/compilation (if needed: annotation?)
- analyzing the linguistic characteristics of corpora
- understanding different language varieties
- applying such analyses for pedagogy, literary studies, translation etc.

Corpus Linguistics: “Opportunities in the new decade”

IJCL Editorial, 2020, 25(1)

“Corpus linguistics has the potential to provide methods and approaches for applied humanities at scale.”
“Corpus linguistics offers vast opportunities to better understand how disciplines communicate and to consider how cross-disciplinary discourse might work.”
“We need to broaden our view beyond more easily retrievable data sets, such as newspaper articles or canonical literary texts, in favour of an inclusive and diverse approach to data”

Opportunities/Challenges in NLP

(across all faces)

Approaches that work on various varieties of language
Ways of evaluating the NLP systems for linguistic coverage
Interpretable computational models
Approaches that can be ported easily to new languages
Ways to manage non-static, constantly updating corpora/datasets
Stuff that works on the device without sending potentially private texts to some cloud location

……

What can NLP learn from CL?

corpora construction with methodical selection and annotation of texts

exploratory analysis of the linguistic characteristics of the corpus, to inform the computational models

using linguistic analyses to understand the coverage/limitations of computational models

What can CL learn from NLP?

Expand scope by supplementing existing corpus analysis methods with NLP methods, which will broaden the possible language analyses.

“Scale” to other languages beyond the dominant ones with multilingual NLP software.

Improved (and open) access to code/data/literature? [No one can criticize/scrutinize/improve stuff they can’t access!]

Working together: new directions

best practices for issues related to corpora: storing and sharing, managing constantly updating corpora, ownership of user-produced texts, etc.

how to address the issues of ethics, bias, and fairness in language-based corpora and systems, including those used for language teaching/assessment/learning.

develop methods to probe the computational models for the coverage of language variety.

create challenge sets for evaluating NLP systems and develop new evaluation methods (e.g., Sampson, 2000, IJCL 5(1); Murakami et al., 2018, IJCL 23(1), etc.)

How to work together?

“for a true interdisciplinary collaboration, both sides need to understand each other’s specialized terminology and together develop the definition of success for the project. We ourselves must be willing to acquire at least apprentice-level expertise in the domain at hand to develop the data and knowledge discovery process necessary for achieving success.” - Rudin & Wagstaff, 2014

NLP Education

Many different groups of people are interested in learning and applying NLP methods for their work now.

Yet, textbooks are typically written with NLPers (and engineers/programmers) in mind.

There is a need for books/courses that cater to the different faces of NLP - NLPers, industry professionals, and other researchers.

There is also a need to incorporate more non-model-focused aspects into a regular NLP course.

Thank you

contact: sowmya.vajjala @ nrc-cnrc.gc.ca

to cite: Vajjala, S. (2020). NLP Beyond NLPers: the many faces of NLP in academia and real-world [Plenary Talk]. 46th Conference of the Japan Association for English Corpus Studies (JAECS), Virtual Event, Japan. https://rpubs.com/vbsowmya/jaecs2020talk

References

“Why computing belongs within the social sciences” (Connolly, 2020)
“Threats of a replication crisis in empirical computer science” (Cockburn et al., 2020)
“Data-Centricity: A Challenge and Opportunity for Computing Education” (Krishnamurthi & Fisler, 2020)
“Machine Learning Production Pipeline: Project Flow and Landscape” (Huyen, 2020)
“Some Advice for Psychologists Who Want to Work With Computer Scientists on Big Data” (König et al., 2020)
“Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)” (Pineau et al., 2019)
“Data statements for natural language processing: Toward mitigating system bias and enabling better science” (Bender & Friedman, 2018)
“On the Challenges of Translating NLP Research into Commercial Products” (Dahlmeier, 2017)
“Building better open-source tools to support fairness in automated scoring” (Madnani et al., 2017)
“Why Big Data Industrial Systems Need Rules and What We Can Do About It” (Suganthan et al., 2015)
“Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!” (Chiticariu et al., 2013)
