November 16, 2017

Text-as-Data

Text-as-Data

  • How to Quantify Text & High Dimensionality
  • Bag of Words Model
  • Text-as-Data model

tidytext

  • Introduction to tidytext
  • Dictionary-based Sentiment Analysis
  • Word Counts, TF-IDF

Text Analysis Methods

Language Technology

Why is analyzing text hard?

Question: How to quantify Text?

The Problem of High Dimensionality

"A sample of 30-word Twitter messages that use only the 1,000 most common words in the English language, for example, has roughly as many dimensions as there are atoms in the universe."

Gentzkow, Kelly and Taddy (2017)
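
The arithmetic behind this quote can be checked directly. The following is a minimal sketch, assuming that each of the 30 word positions can hold any of the 1,000 words independently, and using the commonly cited rough estimate of 10^80 atoms in the observable universe (an assumption, not a figure from the source).

```python
# Count the distinct 30-word messages over a 1,000-word vocabulary.
VOCAB_SIZE = 1_000
MESSAGE_LENGTH = 30
ATOMS_IN_UNIVERSE = 10 ** 80  # rough, commonly cited estimate (assumption)

# Each position can independently take any word, so the count is V ** L.
distinct_messages = VOCAB_SIZE ** MESSAGE_LENGTH

# Number of decimal digits minus one gives the power of ten.
print(f"distinct messages: 10^{len(str(distinct_messages)) - 1}")
print("at least as many as atoms:", distinct_messages >= ATOMS_IN_UNIVERSE)
```

Representing each possible message as its own dimension is therefore infeasible, which is what motivates simplified representations such as bag of words.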

Bag of Words: Simplest Approach

Credit: Chris Manning

  • Count how many times each word occurs in each document.
  • Trades correctness for simplicity: word order is ignored.
  • Works well for classification; poor at capturing semantic meaning.
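
The points above can be illustrated with a minimal sketch using only the Python standard library. The `bag_of_words` helper, its simple regular-expression tokenizer, and the two example sentences are hypothetical choices made for illustration; they are not part of the original slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Two sentences with opposite meanings but the same words.
doc_a = "The dog bit the man."
doc_b = "The man bit the dog."

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

print(bag_a)
# Word order is discarded, so both documents get the same representation.
print("identical bags:", bag_a == bag_b)
```

Both sentences map to the same counts even though they mean different things, which is the loss of semantic information the bullets describe.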

Document Term Matrix
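
A document-term matrix has one row per document and one column per vocabulary term; each cell holds the number of times that term occurs in that document. The following is a minimal sketch using only the Python standard library. The toy corpus, the `tokenize` helper, and the variable names are hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus: document name -> raw text.
documents = {
    "doc1": "text is data",
    "doc2": "data about text is still data",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Per-document word counts (one bag of words per document).
counts = {name: Counter(tokenize(text)) for name, text in documents.items()}

# Columns: every distinct term in the corpus, in alphabetical order.
vocabulary = sorted(set().union(*counts.values()))

# Rows: one list of counts per document, aligned with the vocabulary.
dtm = [[counts[name][term] for term in vocabulary] for name in documents]

print(vocabulary)
for name, row in zip(documents, dtm):
    print(name, row)
```

With a realistic corpus most cells are zero, so such matrices are usually stored in a sparse format.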

Text as Data Paradigm

Four Principles of Text as Data Methods

  1. All quantitative models of language are wrong – but some are useful.

  2. Quantitative methods for text amplify human abilities; they do not replace them.

  3. There is no globally best method for text analysis.

  4. Validate, validate, validate.

Grimmer and Stewart (2013)

Text as Data Methods

Costs and Benefits of Methods

Quinn et al. (2010)