This activity will enable you to effectively work with and analyze text in your data science projects. Text data can capture incredibly expressive human sentiments. In data science, we typically analyze collections of text to answer questions.

In this activity, you will learn:

How to acquire text from files and APIs.
How to analyze individual textual documents.
How to compare and explore collections of textual documents.

Suggested reading: Text Mining with R: Tidy Text Mining

Text Acquisition

To perform text analysis, you need text! There are many ways to get it:

String Literals

us_dec_sentence <- 'We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.'

# Show the number of characters in the sentence.
nchar(us_dec_sentence)

## [1] 209

# Show the sentence itself.
us_dec_sentence

## [1] "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness."

Reading `.txt` Files

Very often we can find examples of text files online. A good way to find a text file on a topic is to add filetype:txt to a Google search.

library(readr)
us_dec <- read_file('https://ia800305.us.archive.org/29/items/unitedstatesdecl00001gut/when12.txt')
nchar(us_dec)

## [1] 24863

strtrim(us_dec, 200)

## [1] "The Project Gutenberg EBook of The Declaration of Independence\r\n\r\nCopyright laws are changing all over the world. Be sure to check the\r\ncopyright laws for your country before downloading or redistributing\r\nth"

Via Web APIs

Finally, there are a variety of online Web APIs that expose text for particular topics. This function gets the Wikipedia text for a specific article.

library(httr)  # provides GET() and content()

GetArticleText <- function(langCode, titles) {
  # Given a langCode ("en", "de", etc.) and a vector of article titles,
  # returns a data frame with the text of the specified articles in
  # the specified language.
  texts <- sapply(titles, function(title) {
    # Ask the Wikipedia API for a plain-text extract of one article
    resp <- GET(
      paste0("https://", langCode, ".wikipedia.org/w/api.php"),
      query = list(
        action      = "query",
        prop        = "extracts",
        format      = "json",
        explaintext = "",
        titles      = title
      )
    )

    # Parse the JSON response and pull out the first page's extract
    js <- content(resp, "parsed")
    return(js$query$pages[[1]]$extract)
  })

  return(data.frame(title = titles, text = texts, stringsAsFactors = FALSE, row.names = NULL))
}

Exercise: Use the GetArticleText function to create a data frame containing texts for a group of related articles. Shilad will use ‘Macalester College’, ‘Carleton College’, and ‘University of Minnesota’ but you are welcome to use anything. Hint: use c to create a vector of the article titles you would like to analyze.

# Get the text for 
# https://en.wikipedia.org/wiki/Macalester_College,
# https://en.wikipedia.org/wiki/Carleton_College, and 
# https://en.wikipedia.org/wiki/University_of_Minnesota in English ("en").
# We could also get the text for the Spanish article ("es"), or German article ("de")

school_wiki_titles <- c('Macalester College', 'Carleton College', 'University of Minnesota')
school_df <- GetArticleText('en', school_wiki_titles)

title	text
Macalester College	Macalester College () is a private liberal arts college in Saint Paul, Minnesota. Founded in 1874, Macalester is exclusively an undergraduate four-year institution and enrolled 2,174 students in the f…
Carleton College	Carleton College ( KARL-tin) is a private liberal arts college in Northfield, Minnesota. Founded in 1866, it had 2,105 undergraduate students and 269 faculty members in fall 2016. The 200-acre main ca…
University of Minnesota	The University of Minnesota, formally the University of Minnesota, Twin Cities, (UMN, the U of M, or Minnesota) is a public land-grant research university in the Twin Cities of Minneapolis and Saint P…

We’ll analyze these documents further below.

Analyzing Single Documents

The simplest starting place for text analysis is a single document. The document could be a single Tweet, article, book, or anything else you’d like to analyze.

Exercise: Create a data frame with just the text for Macalester.

mac_df <- GetArticleText("en", c("Macalester College"))

title	text
Macalester College	Macalester College () is a private liberal arts college in Saint Paul, Minnesota. Founded in 1874, Macalester is exclusively an undergraduate four-year institution and enrolled 2,174 students in the f…

We need to convert the text to a “tidy” representation. You may remember that a tidy representation must have a separate column for every variable, a separate row for every observation, and a single value in each cell. Why isn’t mac_df tidy?

Convert to tidy representation:

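library(dplyr)     # provides %>%, count(), anti_join()
library(tidytext)  # provides unnest_tokens(), stop_words, bind_tf_idf()

tidy_mac <- mac_df %>%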
  unnest_tokens(word, text)

head(tidy_mac)

##                title       word
## 1 Macalester College macalester
## 2 Macalester College    college
## 3 Macalester College         is
## 4 Macalester College          a
## 5 Macalester College    private
## 6 Macalester College    liberal

nrow(tidy_mac)

## [1] 3611
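To see what unnest_tokens does on a small scale, here is a minimal sketch on a made-up two-document data frame. The names toy_df and toy_tidy and the sample sentences are illustrative assumptions, not part of the activity's data.

```r
library(dplyr)
library(tidytext)

# A made-up data frame with one row per document (hypothetical data).
toy_df <- data.frame(
  title = c("doc_a", "doc_b"),
  text  = c("Text data is fun.", "Tidy text has one word per row!"),
  stringsAsFactors = FALSE
)

# Split the text column into one lowercase word per row.
# Punctuation is dropped and the title column is repeated for each word.
toy_tidy <- toy_df %>%
  unnest_tokens(word, text)

toy_tidy
nrow(toy_tidy)
```

Each row of toy_tidy now holds a single observation (one word in one document), which is what makes the later count and join steps straightforward.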

We can also find the most frequently used words by using dplyr’s count function, which creates a frequency table for (in our case) words:

# Create and display frequency count table
all_mac_counts <- tidy_mac %>%
  count(word, sort = TRUE) 
all_mac_counts %>% head(5)

##         word   n
## 1        the 203
## 2        and 136
## 3         of 109
## 4         in  98
## 5 macalester  82

Stop Words

Notice that the most common words, like “the” and “and”, are not informative at all! In text analysis we typically handle this by removing very common words, called stop words.

# Load stop words dataset and display it
data(stop_words)
head(stop_words)

## # A tibble: 6 × 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART

dim(stop_words)

## [1] 1149    2

# Create and display frequency count table after removing stop words from the dataset
mac_counts <- tidy_mac %>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)

## Joining, by = "word"

head(mac_counts)

##         word  n
## 1 macalester 82
## 2   students 33
## 3    college 28
## 4     campus 24
## 5    student 21
## 6  minnesota 19
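The built-in list will not always match your corpus. As a rough sketch, you could extend stop_words with your own entries before the anti_join. The extra words chosen here ("college", "campus"), the "custom" lexicon label, and the toy word list are assumptions made up for illustration.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# Hypothetical corpus-specific stop words appended to the built-in list.
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("college", "campus"), lexicon = "custom")
)

# A tiny made-up tidy word list to filter.
toy_words <- tibble(word = c("the", "college", "students", "campus", "students"))

# Naming the join column explicitly avoids the "Joining, by" message.
toy_counts <- toy_words %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)

toy_counts
```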

Word Clouds

One great visualization for analyzing single documents is the word cloud. It’s a little painful to fine-tune, but very useful.

Exercise: Use the wordcloud package to create a word cloud for Macalester.

library(wordcloud)

# Show a word cloud with some customized options

wordcloud(mac_counts$word,             # column of words
          mac_counts$n,                # column of frequencies
          scale=c(5,0.2),                 # range of font sizes of words
          min.freq = 2,                   # minimum word frequency to show
          max.words=200,                  # show the 200 most frequent words
          random.order=FALSE,             # position the most popular words first
          colors=brewer.pal(8, "Dark2"))  # Color palette

Comparing the Text in Two (or More) Documents

We often wish to know what makes a particular document unique within a collection. What are the most interesting words for a specific tweet, chapter, or article? Let’s work towards this. We are going to start by looking at some candidates for interesting words.

Let’s now create a TidyText data frame with the three Wikipedia documents we collected above via the API. Remember that the TidyText data frame has one row for each word.

title	text
Macalester College	Macalester College () is a private liberal arts college in Saint Paul, Minnesota. Founded in 1874, Macalester is exclusively an undergraduate four-year institution and enrolled 2,174 students in the f…
Carleton College	Carleton College ( KARL-tin) is a private liberal arts college in Northfield, Minnesota. Founded in 1866, it had 2,105 undergraduate students and 269 faculty members in fall 2016. The 200-acre main ca…
University of Minnesota	The University of Minnesota, formally the University of Minnesota, Twin Cities, (UMN, the U of M, or Minnesota) is a public land-grant research university in the Twin Cities of Minneapolis and Saint P…

# Generate counts
school_counts <-
  school_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(title, word, sort=TRUE)

## Joining, by = "word"

interesting_words <- c(
  "liberal",
  "education",
  "research",
  "teaching",
  "lgbtq",
  "football"
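)

library(tidyr)  # provides spread()

# Keep only the candidate words and show one column of counts per school
school_counts %>%
  filter(word %in% interesting_words) %>%
  spread(title, n)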

##        word Carleton College Macalester College University of Minnesota
## 1 education                5                  8                      10
## 2  football                2                  5                       9
## 3     lgbtq               NA                  4                      NA
## 4   liberal               13                  9                       1
## 5  research                3                  4                      19
## 6  teaching                4                  1                       2
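The NA entries in the table mean the word never appears in that article. As an optional sketch, tidyr's pivot_wider can build the same kind of wide table and fill those gaps with 0 instead. The toy_counts data below is made up for illustration and only loosely mirrors the table above.

```r
library(dplyr)
library(tidyr)

# Made-up long-format counts: one row per (title, word) pair.
toy_counts <- tibble(
  title = c("School A", "School A", "School B"),
  word  = c("research", "lgbtq", "research"),
  n     = c(4, 4, 19)
)

# One row per word, one column per title; missing pairs become 0.
toy_wide <- toy_counts %>%
  pivot_wider(names_from = title, values_from = n, values_fill = 0)

toy_wide
```

Whether 0 or NA is the better display is a judgment call: 0 is easier to compare at a glance, while NA makes absent words stand out.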

Discuss: Which words are most interesting for each school? Why? Can you think about mathematical formulas that emulate this intuition?

Natural Language Processing (NLP) experts have coalesced on one popular measure for identifying distinctive words: term frequency-inverse document frequency (tf-idf). The tidytext package implements it for us as bind_tf_idf. Let’s see it in action:

with_tf_idf <-
  school_counts %>%
  bind_tf_idf(word, title, n) %>%
  arrange(desc(tf_idf))

with_tf_idf %>% filter(title=='Macalester College') %>% head(10)

##                 title         word  n          tf       idf      tf_idf
## 1  Macalester College   macalester 82 0.037272727 1.0986123 0.040948276
## 2  Macalester College          mac 13 0.005909091 1.0986123 0.006491800
## 3  Macalester College   engagement  7 0.003181818 1.0986123 0.003495585
## 4  Macalester College        civic  6 0.002727273 1.0986123 0.002996215
## 5  Macalester College       fossil  6 0.002727273 1.0986123 0.002996215
## 6  Macalester College        scots  6 0.002727273 1.0986123 0.002996215
## 7  Macalester College        house 16 0.007272727 0.4054651 0.002948837
## 8  Macalester College   commitment  5 0.002272727 1.0986123 0.002496846
## 9  Macalester College     friendly  5 0.002272727 1.0986123 0.002496846
## 10 Macalester College macalester's  5 0.002272727 1.0986123 0.002496846

TF-IDF has two components:

Term Frequency

Term frequency (TF) captures the popularity of a word in a document. It is measured as the number of times a particular word appears in a particular document, typically normalized by document length. In the table above, macalester accounts for 3.7% of the words that remain in the Macalester College Wikipedia article after stop words are removed, so the TF for macalester in that article is 0.037.
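As a small illustration of the normalization, the sketch below computes TF by hand as each word's count divided by the total count in its document. The toy_counts data and the doc_a/doc_b names are made up; bind_tf_idf does this step for you.

```r
library(dplyr)

# Made-up word counts for two short documents.
toy_counts <- tibble(
  title = c("doc_a", "doc_a", "doc_a", "doc_b", "doc_b"),
  word  = c("campus", "students", "research", "campus", "football"),
  n     = c(6, 3, 1, 2, 2)
)

# TF = count of the word in the document / total words counted in that document.
toy_tf <- toy_counts %>%
  group_by(title) %>%
  mutate(tf = n / sum(n)) %>%
  ungroup()

toy_tf
```

Within each document the tf values sum to 1, so a word's TF can be read directly as its share of that document.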

Inverse Document Frequency

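Inverse Document Frequency (IDF) captures the uniqueness of a word across all the documents in a corpus. It is typically calculated as log(total number of documents / number of documents containing the word), using the natural logarithm. IDF has low values for “overly-general” words to penalize them, and higher values for unique words. In the table above, a word that appears in only one of the three articles has an IDF of log(3/1) ≈ 1.099. The tf_idf score is the product of the two components, TF × IDF.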
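The idf and tf_idf columns in the table above can be checked by hand. This sketch assumes IDF is the natural log of (number of documents / number of documents containing the word), which is consistent with the values shown for our three-article corpus; the variable names are illustrative.

```r
n_docs <- 3  # Macalester, Carleton, and University of Minnesota articles

# IDF for a word found in 1, 2, or all 3 of the documents.
idf_one_doc    <- log(n_docs / 1)  # matches idf = 1.0986123 in the table
idf_two_docs   <- log(n_docs / 2)  # matches idf = 0.4054651 (the "house" row)
idf_three_docs <- log(n_docs / 3)  # 0: a word in every document scores nothing

# tf_idf is the product of the tf column and the idf value.
tf_idf_macalester <- 0.037272727 * idf_one_doc
tf_idf_house      <- 0.007272727 * idf_two_docs

c(idf_one_doc, idf_two_docs, idf_three_docs)
c(tf_idf_macalester, tf_idf_house)
```

This also explains why house ranks below words with smaller counts: its IDF of about 0.405 suggests it appears in two of the three articles, so its higher term frequency is discounted.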

Exercise: Choose a collection of 10 Wikipedia articles in some category of interest to you. Calculate the tf-idf scores for all words in each document, and print out the highest scoring tf_idf words for each document.
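One possible shape for the last step of this exercise, shown on made-up counts rather than real articles: compute tf-idf, then keep the top-scoring word per document with group_by and slice_max. The toy_counts data and the choice of one word per document are assumptions for illustration; with real articles you would start from your own school_counts-style table and likely keep more words.

```r
library(dplyr)
library(tidytext)

# Made-up (title, word, n) counts for three tiny documents.
toy_counts <- tibble(
  title = c("doc_a", "doc_a", "doc_a", "doc_b", "doc_b", "doc_c", "doc_c"),
  word  = c("campus", "students", "research", "campus", "football", "campus", "hockey"),
  n     = c(6, 3, 1, 2, 2, 1, 3)
)

# Score every word, then keep the single highest tf_idf word per document.
toy_top <- toy_counts %>%
  bind_tf_idf(word, title, n) %>%
  group_by(title) %>%
  slice_max(tf_idf, n = 1, with_ties = FALSE) %>%
  ungroup()

toy_top
```

Note that campus appears in every toy document, so its IDF is 0 and it is never selected, even though it is the most frequent word in doc_a.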

Projects in Data Science: Intro to Text Analysis