This activity will enable you to effectively work with and analyze text in your data science projects. Text data can capture incredibly expressive human sentiments. In data science, we typically analyze collections of text to answer questions.
In this activity, you will learn:
Suggested reading: Text Mining with R: Tidy Text Mining
To perform text analysis, you need text! There are many ways to get it:
us_dec_sentence <- 'We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.'
# Show the number of characters in the sentence.
nchar(us_dec_sentence)
## [1] 209
# Show the sentence itself.
us_dec_sentence
## [1] "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness."
.txt Files
Very often we can find examples of text files online. A good way to find a text file on a topic is to search Google using the filetype:txt operator.
library(readr)
us_dec <- read_file('https://ia800305.us.archive.org/29/items/unitedstatesdecl00001gut/when12.txt')
nchar(us_dec)
## [1] 24863
strtrim(us_dec, 200)
## [1] "The Project Gutenberg EBook of The Declaration of Independence\r\n\r\nCopyright laws are changing all over the world. Be sure to check the\r\ncopyright laws for your country before downloading or redistributing\r\nth"
Finally, there are a variety of online Web APIs that expose text for particular topics. This function gets the Wikipedia text for a specific article.
library(httr)  # provides GET() and content()

GetArticleText <- function(langCode, titles) {
  # Given a langCode ("en", "de", etc.) and a vector of article titles,
  # returns a data frame with the text of the specified articles in
  # the specified language.
  texts <- sapply(titles, function(t) {
    # Ask the Wikipedia API for the plain-text extract of one article
    resp <- GET(
      paste0("https://", langCode, ".wikipedia.org/w/api.php"),
      query = list(
        action = "query",
        prop = "extracts",
        format = "json",
        explaintext = "",
        titles = t
      )
    )
    js <- content(resp, "parsed")
    js$query$pages[[1]]$extract
  })
  data.frame(title = titles, text = texts, stringsAsFactors = FALSE, row.names = NULL)
}
Exercise: Use the GetArticleText function to create a data frame containing the texts of a group of related articles. Shilad will use ‘Macalester College’, ‘Carleton College’, and ‘University of Minnesota’, but you are welcome to use anything. Hint: use c() to create a vector of the article titles you would like to analyze.
# Get the text for
# https://en.wikipedia.org/wiki/Macalester_College,
# https://en.wikipedia.org/wiki/Carleton_College, and
# https://en.wikipedia.org/wiki/University_of_Minnesota in English ("en").
# We could also get the text for the Spanish article ("es"), or German article ("de")
school_wiki_titles <- c('Macalester College', 'Carleton College', 'University of Minnesota')
school_df <- GetArticleText('en', school_wiki_titles)
| title | text |
|---|---|
| Macalester College | Macalester College () is a private liberal arts college in Saint Paul, Minnesota. Founded in 1874, Macalester is exclusively an undergraduate four-year institution and enrolled 2,174 students in the f… |
| Carleton College | Carleton College ( KARL-tin) is a private liberal arts college in Northfield, Minnesota. Founded in 1866, it had 2,105 undergraduate students and 269 faculty members in fall 2016. The 200-acre main ca… |
| University of Minnesota | The University of Minnesota, formally the University of Minnesota, Twin Cities, (UMN, the U of M, or Minnesota) is a public land-grant research university in the Twin Cities of Minneapolis and Saint P… |
We’ll analyze these documents further below.
The simplest way to start analyzing text is to look at a single document. The document could be a single Tweet, article, book, or anything else you’d like to analyze.
Exercise: Create a data frame with just the text for Macalester.
mac_df <- GetArticleText("en", c("Macalester College"))
| title | text |
|---|---|
| Macalester College | Macalester College () is a private liberal arts college in Saint Paul, Minnesota. Founded in 1874, Macalester is exclusively an undergraduate four-year institution and enrolled 2,174 students in the f… |
We need to convert the text to a “tidy” representation. You may remember that a tidy representation must have a separate column for every variable, a separate row for every observation, and a single value in each cell. Why isn’t mac_df tidy?
Convert to tidy representation:
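library(dplyr)     # provides %>% and count()
library(tidytext)  # provides unnest_tokens() and stop_words

tidy_mac <- mac_df %>%
  unnest_tokens(word, text)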
head(tidy_mac)
## title word
## 1 Macalester College macalester
## 2 Macalester College college
## 3 Macalester College is
## 4 Macalester College a
## 5 Macalester College private
## 6 Macalester College liberal
nrow(tidy_mac)
## [1] 3611
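The same conversion can be tried on a tiny example to see exactly what unnest_tokens does. This is only a sketch: toy_df and its contents are made up for illustration and are not part of the Wikipedia data.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one document with a title and its raw text.
toy_df <- data.frame(
  title = "Toy document",
  text = "Text data is fun. Text is everywhere!",
  stringsAsFactors = FALSE
)

# unnest_tokens() splits the text column into one lowercase word per row,
# dropping punctuation and keeping the other columns (here, title).
tidy_toy <- toy_df %>%
  unnest_tokens(word, text)

tidy_toy$word
nrow(tidy_toy)
```

Each row of tidy_toy now holds a single value per cell, which is what makes functions like count usable on it.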
We can also find the most frequently used words by using dplyr’s
count function, which creates a frequency table for (in our
case) words:
# Create and display frequency count table
all_mac_counts <- tidy_mac %>%
  count(word, sort = TRUE)
all_mac_counts %>% head(5)
## word n
## 1 the 203
## 2 and 136
## 3 of 109
## 4 in 98
## 5 macalester 82
Notice that the most common words like “the” and “and” are not informative at all! In text analysis we typically manage this by removing very common words called stopwords.
# Load stop words dataset and display it
data(stop_words)
head(stop_words)
## # A tibble: 6 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
dim(stop_words)
## [1] 1149 2
# Create and display frequency count table after removing stop words from the dataset
mac_counts <- tidy_mac %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
head(mac_counts)
## word n
## 1 macalester 82
## 2 students 33
## 3 college 28
## 4 campus 24
## 5 student 21
## 6 minnesota 19
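To see how the anti_join step works in isolation, here is a minimal sketch on a made-up word list (toy_words is hypothetical). It also passes by = "word" explicitly, which should suppress the "Joining, by" message shown above.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy tidy data frame with one word per row.
toy_words <- data.frame(
  word = c("the", "college", "is", "in", "minnesota", "the", "college"),
  stringsAsFactors = FALSE
)

# anti_join() keeps only the rows of toy_words whose word does NOT appear
# in stop_words; count() then tallies what is left.
toy_counts <- toy_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

toy_counts
```

Only "college" and "minnesota" survive, because "the", "is", and "in" all appear in the stop_words dataset.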
One great visualization for analyzing single documents is the word cloud. It’s a little painful to fine-tune, but very useful.
Exercise: Use the word cloud package to create a word cloud for Macalester.
library(wordcloud)
# Show a word cloud with some customized options
wordcloud(mac_counts$word,                  # column of words
          mac_counts$n,                     # column of frequencies
          scale = c(5, 0.2),                # range of font sizes of words
          min.freq = 2,                     # minimum word frequency to show
          max.words = 200,                  # show the 200 most frequent words
          random.order = FALSE,             # position the most popular words first
          colors = brewer.pal(8, "Dark2"))  # color palette
We often wish to know what makes a particular document unique within a collection. What are the most interesting words for a specific tweet, chapter, or article? Let’s work towards this. We are going to start by looking at some candidates for interesting words.
Let’s now create a tidy text data frame with the three Wikipedia documents we collected above via the API. Remember that a tidy text data frame has one row for each word.
| title | text |
|---|---|
| Macalester College | Macalester College () is a private liberal arts college in Saint Paul, Minnesota. Founded in 1874, Macalester is exclusively an undergraduate four-year institution and enrolled 2,174 students in the f… |
| Carleton College | Carleton College ( KARL-tin) is a private liberal arts college in Northfield, Minnesota. Founded in 1866, it had 2,105 undergraduate students and 269 faculty members in fall 2016. The 200-acre main ca… |
| University of Minnesota | The University of Minnesota, formally the University of Minnesota, Twin Cities, (UMN, the U of M, or Minnesota) is a public land-grant research university in the Twin Cities of Minneapolis and Saint P… |
# Generate counts
school_counts <-
  school_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(title, word, sort = TRUE)
## Joining, by = "word"
library(tidyr)  # provides spread()

interesting_words <- c(
  "liberal",
  "education",
  "research",
  "teaching",
  "lgbtq",
  "football"
)
# Show one row per word and one column of counts per school
school_counts %>%
  filter(word %in% interesting_words) %>%
  spread(title, n)
## word Carleton College Macalester College University of Minnesota
## 1 education 5 8 10
## 2 football 2 5 9
## 3 lgbtq NA 4 NA
## 4 liberal 13 9 1
## 5 research 3 4 19
## 6 teaching 4 1 2
Discuss: Which words are most interesting for each school? Why? Can you think about mathematical formulas that emulate this intuition?
Natural Language Processing (NLP) experts have converged on one popular measure for identifying the words that make a document distinctive: term frequency-inverse document frequency (tf-idf). The tidytext package implements it for us. Let’s see it in action:
with_tf_idf <-
  school_counts %>%
  bind_tf_idf(word, title, n) %>%
  arrange(desc(tf_idf))
with_tf_idf %>% filter(title=='Macalester College') %>% head(10)
## title word n tf idf tf_idf
## 1 Macalester College macalester 82 0.037272727 1.0986123 0.040948276
## 2 Macalester College mac 13 0.005909091 1.0986123 0.006491800
## 3 Macalester College engagement 7 0.003181818 1.0986123 0.003495585
## 4 Macalester College civic 6 0.002727273 1.0986123 0.002996215
## 5 Macalester College fossil 6 0.002727273 1.0986123 0.002996215
## 6 Macalester College scots 6 0.002727273 1.0986123 0.002996215
## 7 Macalester College house 16 0.007272727 0.4054651 0.002948837
## 8 Macalester College commitment 5 0.002272727 1.0986123 0.002496846
## 9 Macalester College friendly 5 0.002272727 1.0986123 0.002496846
## 10 Macalester College macalester's 5 0.002272727 1.0986123 0.002496846
Tf-idf includes two components:
Term frequency (TF) captures the popularity of a word in a document. It is measured as the frequency of a particular word in a particular document, typically normalized by document length. So in the table above, macalester accounts for 3.7% of the words in the Macalester College Wikipedia article (after stop words are removed). Thus the TF for macalester in that article is 0.037.
Inverse document frequency (IDF) captures the uniqueness of a word across all the documents in a corpus. A common definition, and the one that matches the idf column above, is ln(total number of documents / number of documents containing the word). With three documents, a word found in only one of them gets ln(3/1) ≈ 1.099, while a word found in two of them gets ln(3/2) ≈ 0.405. IDF has low values for “overly-general” words to penalize them, and higher values for unique words. The tf_idf score is the product of the two components.
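The numbers in the table can be reproduced by hand. The sketch below assumes the Macalester article contains 2,200 words after stop-word removal; that total is not printed anywhere above but is inferred from 82 / 0.037272727. The idf values in the table are consistent with ln(number of documents / number of documents containing the word), which is how tidytext's bind_tf_idf defines it.

```r
# Values read from the tf-idf table above for the word "macalester".
n_word         <- 82    # occurrences in the Macalester article
n_total        <- 2200  # ASSUMPTION: total words, inferred from 82 / 0.037272727
n_docs         <- 3     # Macalester, Carleton, University of Minnesota
docs_with_word <- 1     # "macalester" appears in only one of the three articles

tf     <- n_word / n_total               # term frequency
idf    <- log(n_docs / docs_with_word)   # natural log, ln(3 / 1)
tf_idf <- tf * idf

# "house" occurs 16 times and appears in two of the three articles,
# so its idf is lower: ln(3 / 2).
idf_house    <- log(n_docs / 2)
tf_idf_house <- (16 / n_total) * idf_house

round(c(tf = tf, idf = idf, tf_idf = tf_idf), 6)
round(c(idf_house = idf_house, tf_idf_house = tf_idf_house), 6)
```

This explains why house ranks below words such as civic even though it occurs more often: its lower idf outweighs its higher term frequency.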
Exercise: Choose a collection of 10 Wikipedia articles in some category of interest to you. Calculate the tf-idf scores for all words in each document, and print out the highest scoring tf_idf words for each document.