Ch. 1 - Jumping into text mining with bag of words

What is text mining?


Understanding text mining

  • Text mining is an algorithm that takes unstructured text and organizes it.
  • [*] Text mining is the process of distilling actionable insights from text.
  • Text mining is an evaluation metric used in data science for assessing machine learning algorithms on text.

Quick taste of text mining

# Load qdap, which provides freq_terms()
library(qdap)

# Print new_text to the console
new_text
## [1] "DataCamp is the first online learning platform that focuses on building the best learning experience specifically for Data Science. We have offices in New York, London, and Belgium, and to date, we trained over 3.8 million (aspiring) data scientists in over 150 countries. These data science enthusiasts completed more than 185 million exercises. You can take free beginner courses, or subscribe for $29/month to get access to all premium courses."
# Find the 10 most frequent terms: term_count
term_count <- freq_terms(new_text, 10)

# Plot term_count
plot(term_count)
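
For intuition, the counting behind a frequent-terms table can be approximated in base R. This is a minimal sketch, not qdap's implementation: `sample_text` and `count_terms` are hypothetical names, and the tokenizer is an assumption (lowercase, then split on anything that is not a letter or an apostrophe).

```r
# Hypothetical sample text; it stands in for new_text
sample_text <- "Data science is fun. Learning data science takes practice, and practice takes time."

# Count terms: lowercase, split on non-letters, tabulate, sort
count_terms <- function(x, top = 10) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]                  # drop empty strings
  counts <- sort(table(tokens), decreasing = TRUE)  # most frequent first
  head(counts, top)
}

term_counts <- count_terms(sample_text, top = 3)
term_counts
```

Word order is discarded and only the counts survive, which is the core idea of the bag-of-words approach.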

Getting started


Load some text

# Import text data from CSV, no factors
tweets <- read.csv(coffee_data_file, stringsAsFactors = FALSE)

# View the structure of tweets
str(tweets)
## 'data.frame':    1000 obs. of  15 variables:
##  $ num         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ text        : chr  "@ayyytylerb that is so true drink lots of coffee" "RT @bryzy_brib: Senior March tmw morning at 7:25 A.M. in the SENIOR lot. Get up early, make yo coffee/breakfast"| __truncated__ "If you believe in #gunsense tomorrow would be a very good day to have your coffee any place BUT @Starbucks Guns"| __truncated__ "My cute coffee mug." ...
##  $ favorited   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ replyToSN   : chr  "ayyytylerb" NA NA NA ...
##  $ created     : chr  "8/9/2013 2:43" "8/9/2013 2:43" "8/9/2013 2:43" "8/9/2013 2:43" ...
##  $ truncated   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ replyToSID  : num  3.66e+17 NA NA NA NA ...
##  $ id          : num  3.66e+17 3.66e+17 3.66e+17 3.66e+17 3.66e+17 ...
##  $ replyToUID  : int  1637123977 NA NA NA NA NA NA 1316942208 NA NA ...
##  $ statusSource: chr  "<a href=\"\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"\" rel=\"nofollow\">Twitter for iPhone</a>" "web" "<a href=\"\" rel=\"nofollow\">Twitter for Android</a>" ...
##  $ screenName  : chr  "thejennagibson" "carolynicosia" "janeCkay" "AlexandriaOOTD" ...
##  $ retweetCount: int  0 1 0 0 2 0 0 0 1 2 ...
##  $ retweeted   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ longitude   : logi  NA NA NA NA NA NA ...
##  $ latitude    : logi  NA NA NA NA NA NA ...
# Isolate text from tweets
coffee_tweets <- tweets$text

Make the vector a VCorpus object (1)

# Load tm, which provides VectorSource() and VCorpus()
library(tm)

# Make a vector source from coffee_tweets
coffee_source <- VectorSource(coffee_tweets)

Make the vector a VCorpus object (2)

# Make a volatile corpus from coffee_source
coffee_corpus <- VCorpus(coffee_source)

# Print out coffee_corpus
coffee_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1000
# Print the 15th tweet in coffee_corpus
coffee_corpus[[15]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 111
# Print the contents of the 15th tweet in coffee_corpus
coffee_corpus[[15]][1]
## $content
## [1] "@HeatherWhaley I was about 2 joke it takes 2 hands to hold hot coffee...then I read headline! #Don'tDrinkNShoot"
# Now use content() to review the plain text of the 10th tweet
content(coffee_corpus[[10]])
## [1] "RT @Dorkv76: I can't care before coffee."

Make a VCorpus from a data frame

# Create a DataframeSource from the example text
df_source <- DataframeSource(example_text)

# Convert df_source to a volatile corpus
df_corpus <- VCorpus(df_source)

# Examine df_corpus
df_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 2
## Content:  documents: 3
# Examine df_corpus metadata
meta(df_corpus)
##    author       date
## 1 Author1 1514953399
## 2 Author2 1514866998
## 3 Author3 1514780598
# Compare the number of documents in the vector source
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
# Compare metadata in the vector corpus
## data frame with 0 columns and 3 rows
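
`DataframeSource()` expects a particular layout: in recent versions of tm (0.7-2 and later), the first column must be named `doc_id`, the second `text`, and any remaining columns become document-level metadata. The data frame below is only an assumption about what an input like `example_text` could look like; `example_df` and its contents are made up for illustration, and the tm calls run only if the package is installed.

```r
# Hypothetical stand-in for example_text (contents are made up)
example_df <- data.frame(
  doc_id = c("1", "2", "3"),
  text   = c("Text mining is fun.",
             "Coffee helps.",
             "Bag of words ignores word order."),
  author = c("Author1", "Author2", "Author3"),
  stringsAsFactors = FALSE
)

# Check the column layout that DataframeSource() requires
stopifnot(identical(names(example_df)[1:2], c("doc_id", "text")))

if (requireNamespace("tm", quietly = TRUE)) {
  example_corpus <- tm::VCorpus(tm::DataframeSource(example_df))
  print(tm::meta(example_corpus))  # the extra column (author) becomes metadata
}
```

A corpus built from a plain vector has no such extra columns, which is why its metadata data frame is empty.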

Cleaning and preprocessing text


Common cleaning functions from tm

# Create the object: text
text <- "<b>She</b> woke up at       6 A.M. It\'s so early!  She was only 10% awake and began drinking coffee in front of her computer."

# Make lowercase
tolower(text)
## [1] "<b>she</b> woke up at       6 a.m. it's so early!  she was only 10% awake and began drinking coffee in front of her computer."
# Remove punctuation
removePunctuation(text)
## [1] "bSheb woke up at       6 AM Its so early  She was only 10 awake and began drinking coffee in front of her computer"
# Remove numbers
removeNumbers(text)
## [1] "<b>She</b> woke up at        A.M. It's so early!  She was only % awake and began drinking coffee in front of her computer."
# Remove whitespace
stripWhitespace(text)
## [1] "<b>She</b> woke up at 6 A.M. It's so early! She was only 10% awake and began drinking coffee in front of her computer."
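
Each of these cleaning steps is essentially a pattern substitution. The sketch below shows rough base R counterparts using POSIX character classes; treat them as approximations for intuition rather than tm's exact code. `txt` and the result names are hypothetical.

```r
# Shortened sample sentence (hypothetical; modeled on the text object above)
txt <- "<b>She</b> woke up at       6 A.M. It's so early!  She was only 10% awake."

# Rough base R counterparts (approximations, not tm's code)
no_punct   <- gsub("[[:punct:]]+", "", txt)   # similar to removePunctuation()
no_numbers <- gsub("[[:digit:]]+", "", txt)   # similar to removeNumbers()
squeezed   <- gsub("[[:space:]]+", " ", txt)  # similar to stripWhitespace()

no_punct
no_numbers
squeezed
```

Note that stripping punctuation also eats the angle brackets and slash of the HTML tags, leaving fragments such as "bSheb", so tags are better removed before this step.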

Cleaning with qdap

# Remove text within brackets
bracketX(text)
## [1] "She woke up at 6 A.M. It's so early! She was only 10% awake and began drinking coffee in front of her computer."
# Replace numbers with words
replace_number(text)
## [1] "<b>She</b> woke up at six A.M. It's so early! She was only ten% awake and began drinking coffee in front of her computer."
# Replace abbreviations
replace_abbreviation(text)
## [1] "<b>She</b> woke up at 6 AM It's so early! She was only 10% awake and began drinking coffee in front of her computer."
# Replace contractions
replace_contraction(text)
## [1] "<b>She</b> woke up at 6 A.M. it is so early! She was only 10% awake and began drinking coffee in front of her computer."
# Replace symbols with words
replace_symbol(text)
## [1] "<b>She</b> woke up at 6 A.M. It's so early! She was only 10 percent awake and began drinking coffee in front of her computer."

All about stop words

# List standard English stop words
stopwords("en")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
# Print text without standard stop words
removeWords(text, stopwords("en"))
## [1] "<b>She</b> woke         6 A.M. It's  early!  She   10% awake  began drinking coffee  front   computer."
# Add "coffee" and "bean" to the list: new_stops
new_stops <- c("coffee", "bean", stopwords("en"))

# Remove stop words from text
removeWords(text, new_stops)
## [1] "<b>She</b> woke         6 A.M. It's  early!  She   10% awake  began drinking   front   computer."
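
In the output above, "She" and "It's" survive even though "she" and "it's" are on the stop list, because the match is case-sensitive; lowercasing first avoids this. The base R sketch below illustrates the idea at the token level. It is not how `removeWords()` is implemented, and `my_stops` is a tiny hand-picked list rather than `stopwords("en")`.

```r
# Tiny hand-picked stop list for illustration (not tm's stopwords("en"))
my_stops <- c("she", "was", "and", "of", "her", "in", "coffee")

sentence <- "She was only half awake and began drinking coffee in front of her computer"

# Lowercase first so "She" matches "she", then drop stop words token by token
tokens <- unlist(strsplit(tolower(sentence), " +"))
kept   <- tokens[!tokens %in% my_stops]

paste(kept, collapse = " ")
## [1] "only half awake began drinking front computer"
```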

Intro to word stemming and stem completion

# Create complicate
complicate <- c("complicated", "complication", "complicatedly")

# Perform word stemming: stem_doc
stem_doc <- stemDocument(complicate)

# Create the completion dictionary: comp_dict
comp_dict <- "complicate"

# Perform stem completion: complete_text 
complete_text <- stemCompletion(stem_doc, comp_dict)

# Print complete_text
complete_text
##      complic      complic      complic 
## "complicate" "complicate" "complicate"
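
Stem completion can be pictured as a prefix lookup: each stem is matched against the dictionary words that begin with it. The sketch below takes the first match, which is a simplifying assumption; `stemCompletion()` offers several selection strategies and is not implemented this way. `stems`, `dictionary`, and `complete_stem` are hypothetical names.

```r
# Stems as a stemmer might produce them, plus a completion dictionary
stems      <- c("complic", "hast", "rush")
dictionary <- c("complicate", "haste", "rush")

# For each stem, return the first dictionary word that starts with it
complete_stem <- function(stem, dict) {
  hits <- dict[startsWith(dict, stem)]
  if (length(hits) > 0) hits[1] else NA_character_
}

completed <- vapply(stems, complete_stem, character(1), dict = dictionary)
completed
```

With a one-word dictionary such as `"complicate"`, every stem that is a prefix of that word completes to it, which matches the three identical results above.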

Word stemming and stem completion on a sentence

# Remove punctuation: rm_punc
rm_punc <- removePunctuation(text_data)

# Create character vector: n_char_vec
n_char_vec <- unlist(strsplit(rm_punc, split = " "))

# Perform word stemming: stem_doc
stem_doc <- stemDocument(n_char_vec)

# Print stem_doc
stem_doc
##  [1] "In"      "a"       "complic" "hast"    "Tom"     "rush"    "to"     
##  [8] "fix"     "a"       "new"     "complic" "too"     "complic"
# Re-complete stemmed document: complete_doc
complete_doc <- stemCompletion(stem_doc, comp_dict)

# Print complete_doc
complete_doc
##           In            a      complic         hast          Tom         rush 
##         "In"          "a" "complicate"      "haste"        "Tom"       "rush" 
##           to          fix            a          new      complic          too 
##         "to"        "fix"          "a"        "new" "complicate"        "too" 
##      complic 
## "complicate"

Apply preprocessing steps to a corpus

# Define a reusable cleaning function for a corpus
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "coffee", "mug"))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}

# Apply your customized function to the tweet_corp: clean_corp
clean_corp <- clean_corpus(tweet_corp)

# Print out a cleaned up tweet
## [1] "also dogs arent smart enough dip donut eat part thats dipped ladyandthetramp"
# Print out the same tweet in the original form
## [1] "Also, dogs aren't smart enough to dip the donut in the coffee and then eat the part that's been dipped. #ladyandthetramp"



Understanding TDM and DTM

When should you use the term-document matrix instead of the document-term matrix?

  • When you have big data.
  • When you want the documents as rows and words as columns.
  • [*] When you want the words as rows and documents as columns.
  • When you need to store it on disk.

Make a document-term matrix

# Create the document-term matrix from the corpus
coffee_dtm <- DocumentTermMatrix(clean_corp)

# Print out coffee_dtm data
coffee_dtm
## <<DocumentTermMatrix (documents: 1000, terms: 3075)>>
## Non-/sparse entries: 7384/3067616
## Sparsity           : 100%
## Maximal term length: 27
## Weighting          : term frequency (tf)
# Convert coffee_dtm to a matrix
coffee_m <- as.matrix(coffee_dtm)

# Print the dimensions of coffee_m
dim(coffee_m)
## [1] 1000 3075
# Review a portion of the matrix to get some Starbucks
coffee_m[25:35, c("star", "starbucks")]
##     Terms
## Docs star starbucks
##   25    0         0
##   26    0         1
##   27    0         1
##   28    0         0
##   29    0         0
##   30    0         0
##   31    0         0
##   32    0         0
##   33    0         0
##   34    0         1
##   35    0         0

Make a term-document matrix

# Create a term-document matrix from the corpus
coffee_tdm <- TermDocumentMatrix(clean_corp)

# Print coffee_tdm data
coffee_tdm
## <<TermDocumentMatrix (terms: 3075, documents: 1000)>>
## Non-/sparse entries: 7384/3067616
## Sparsity           : 100%
## Maximal term length: 27
## Weighting          : term frequency (tf)
# Convert coffee_tdm to a matrix
coffee_m <- as.matrix(coffee_tdm)

# Print the dimensions of the matrix
dim(coffee_m)
## [1] 3075 1000
# Review a portion of the matrix
coffee_m[c("star", "starbucks"), 25:35]
##            Docs
## Terms       25 26 27 28 29 30 31 32 33 34 35
##   star       0  0  0  0  0  0  0  0  0  0  0
##   starbucks  0  1  1  0  0  0  0  0  0  1  0
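
The two matrices hold the same counts and differ only in orientation. The base R sketch below builds a tiny document-term matrix by hand and transposes it to get the term-document matrix. It is for intuition only: tm stores these matrices in a sparse format, and `docs`, `vocab`, `dtm`, and `tdm` are hypothetical names.

```r
# Three toy documents (made up for illustration)
docs <- c(d1 = "coffee cup", d2 = "coffee shop coffee", d3 = "morning cup")

tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Term-document matrix: the transpose, with terms as rows
tdm <- t(dtm)

dtm
tdm
```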

Ch. 2 - Word clouds and more interesting visuals

Common text mining visuals


Test your understanding of text mining

What is the best business reason to create a text mining visual like a word cloud?

  • [*] Word clouds help decision-makers come to quick conclusions.
  • Visuals can be manipulated so you can lead your audience.
  • Visuals are pretty and people like colorful things.
  • Millions of words can be put into a word cloud, so it’s faster.

Frequent terms with tm

# Convert coffee_tdm to a matrix
coffee_m <- as.matrix(coffee_tdm)

# Calculate the row sums of coffee_m
term_frequency <- rowSums(coffee_m)

# Sort term_frequency in decreasing order
term_frequency <- sort(term_frequency, decreasing = TRUE)

# View the top 10 most common words
term_frequency[1:10]
##     like      cup     shop     just      get  morning     want drinking 
##      111      103       69       66       62       57       49       47 
##      can    looks 
##       45       45
# Plot a barchart of the 10 most common words
barplot(term_frequency[1:10], col = "tan", las = 2)
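
The row-sum step is easy to check by hand on a small matrix. In the sketch below, `toy_tdm` is a made-up term-document matrix with terms as rows, so summing across each row gives the total count of that term over all documents.

```r
# Toy term-document matrix: rows are terms, columns are documents
toy_tdm <- matrix(
  c(1, 2, 0,
    1, 0, 1,
    0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("coffee", "cup", "shop"), Docs = c("1", "2", "3"))
)

# Total occurrences per term, most frequent first
toy_frequency <- sort(rowSums(toy_tdm), decreasing = TRUE)
toy_frequency
## coffee    cup   shop 
##      3      2      1
```

For a document-term matrix the terms are columns, so `colSums()` would be the counterpart.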

Frequent terms with qdap

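# Create frequency
# (assumes the raw tweet text loaded earlier, tweets$text, as the input;
#  Top200Words is a word list that ships with qdap, passed as an object)
frequency <- freq_terms(
  tweets$text,
  top = 10, 
  at.least = 3, 
  stopwords = Top200Words
)

# Make a frequency bar chart
plot(frequency)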

# Create frequency
# (same assumed input, this time with tm's English stop word list)
frequency <- freq_terms(
  tweets$text,
  top = 10, 
  at.least = 3, 
  stopwords = stopwords("english")
)

# Make a frequency bar chart
plot(frequency)

Intro to word clouds


A simple word cloud

# Load wordcloud package
library(wordcloud)

# Print the first 10 entries in term_frequency
term_frequency[1:10]
##     like      cup     shop     just      get  morning     want drinking 
##      111      103       69       66       62       57       49       47 
##      can    looks 
##       45       45
# Vector of terms
terms_vec <- names(term_frequency)

# Create a wordcloud for the values in term_frequency
wordcloud(terms_vec, term_frequency, 
          max.words = 50, colors = "red")

Stop words and word clouds

# Review a "cleaned" tweet
## [1] "I brought some Marvin Gaye and Chardonnay."
# Add to stopwords
stops <- c(stopwords(kind = 'en'), 'chardonnay')

# Review last 6 stopwords
tail(stops)
## [1] "same"       "so"         "than"       "too"        "very"      
## [6] "chardonnay"
# Apply to a corpus
cleaned_chardonnay_corp <- tm_map(chardonnay_corp, removeWords, stops)

# Review a "cleaned" tweet again
## [1] "I brought  Marvin Gaye  Chardonnay."

Plot the better word cloud

# Sort the chardonnay_words in descending order
sorted_chardonnay_words <- sort(chardonnay_words, decreasing = TRUE)

# Print the 6 most frequent chardonnay terms
head(sorted_chardonnay_words)
## marvin   gaye   just   like bottle    lol 
##    104     76     75     55     47     43
# Get a terms vector
terms_vec <- names(chardonnay_words)

# Create a wordcloud for the values in chardonnay_words
wordcloud(terms_vec, chardonnay_words, 
          max.words = 50, colors = "red")

Improve word cloud colors

# Print the list of colors
colors()

# Print the wordcloud with the specified colors
wordcloud(chardonnay_freqs$term, chardonnay_freqs$num, 
          max.words = 100, 
          colors = c("grey80", "darkgoldenrod1", "tomato"))

Use prebuilt color palettes

# Load viridisLite, which provides cividis()
library(viridisLite)

# Select 5 colors
color_pal <- cividis(n = 5)

# Examine the palette output
color_pal
## [1] "#00204DFF" "#414D6BFF" "#7C7B78FF" "#BCAF6FFF" "#FFEA46FF"
# Create a wordcloud with the selected palette
wordcloud(chardonnay_freqs$term, chardonnay_freqs$num, 
          max.words = 100, colors = color_pal)

Other word clouds and word networks


About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a CliftonStrengths “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

