Chapter 1. Intro to Data Text Mining

It is estimated that about 70% of business information is unstructured and in the form of text data.
Text mining techniques allow business analysts (BA) and data analysts (DA) to look closely at actionable insights in the text data.

(1) Text Mining Workflow (Ref. DataCamp Lecture)

Step 1. Problem Definition & Specific goals
Step 2. Identify text to be collected
Step 3. Text Organization
Step 4. Feature Extraction
Step 5. Analysis
Step 6. Reach an insight, recommendation, or output
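
The six steps above can be sketched end to end in a few lines of base R. This is only a toy illustration: the two sample sentences and the variable names (docs, tokens, counts, ranked) are assumptions made here, and a real project would rely on the tm functions introduced in the later chapters.

```r
# Step 1. Goal: find which terms dominate a small set of comments
# Step 2. Identify the text (hypothetical sample sentences)
docs <- c("Dreamers deserve protection.",
          "Congress should protect Dreamers now.")

# Step 3. Organize: lowercase, strip punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(docs)), "\\s+")

# Step 4. Feature extraction: count how often each term occurs
counts <- table(unlist(tokens))

# Step 5. Analysis: rank terms by frequency
ranked <- sort(counts, decreasing = TRUE)

# Step 6. Output: report the most frequent terms
print(head(ranked, 3))
```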

Chapter 2. R Packages

library(qdap) # install.packages("qdap")

## Loading required package: qdapDictionaries

## Loading required package: qdapRegex

## Loading required package: qdapTools

## Loading required package: RColorBrewer

## 
## Attaching package: 'qdap'

## The following object is masked from 'package:base':
## 
##     Filter

library(tm) # install.packages("tm")

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:qdap':
## 
##     ngrams

## 
## Attaching package: 'tm'

## The following objects are masked from 'package:qdap':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix

library(wordcloud) # install.packages("wordcloud")
library(xtable) # install.packages("xtable")

freq_terms() comes from the “qdap” package.
The cleaning and preprocessing functions come from the “tm” package.
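
Before running the library() calls above, it may help to check which packages still need to be installed. A minimal sketch, assuming the four package names used in this chapter; the variable names (pkgs, installed, missing_pkgs) are illustrative, and nothing is installed automatically.

```r
# Packages loaded in this chapter
pkgs <- c("qdap", "tm", "wordcloud", "xtable")

# requireNamespace() returns FALSE instead of stopping when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```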

Chapter 3. Quick Sample of Text Mining

Text data comes from Zuckerberg’s comment on the decision to end DACA (Sep 5th, 2017). (Ref. https://www.facebook.com/zuck)

Zuckerberg_comment <- "This is a sad day for our country. The decision to end DACA is not just wrong. It is particularly cruel to offer young people the American Dream, encourage them to come out of the shadows and trust our government, and then punish them for it. The young people covered by DACA are our friends and neighbors. They contribute to our communities and to the economy. I've gotten to know some Dreamers over the past few years, and I've always been impressed by their strength and sense of purpose. They don't deserve to live in fear. DACA protects 800,000 Dreamers -- young people brought to this country by their parents. Six months from today, new DACA recipients will start to lose their ability to work legally and will risk immediate deportation every day. It's time for Congress to act to pass the bipartisan Dream Act or another legislative solution that gives Dreamers a pathway to citizenship. For years, leaders from both parties have been talking about protecting Dreamers. Now it's time to back those words up with action. Show us that you can lead. No bill is perfect, but inaction now is unacceptable. Our team at FWD.us has been working alongside Dreamers in this fight, and we'll be doing even more in the weeks ahead to make sure Dreamers have the protections they deserve. If you live in the US, call your members of Congress and tell them to do the right thing. We have always been a nation of immigrants, and immigrants have always made our nation stronger. You can learn more and get connected at Dreamers.FWD.us."

# Find the 10 most frequent terms: word_count
word_count <- freq_terms(Zuckerberg_comment, 10)

# plot the word_count
plot(word_count)

Chapter 4. Problem of Text Mining and Solution

In most cases, prepositions and be-verbs are counted far more often than other word groups. That might be worthwhile for linguistic research, but not for business analysis.
A sentence consists of two groups: a noun phrase and a verb phrase.
Researchers usually want to know which content verbs and nouns (not be-verbs) are used, and how often.
To find out, they must first clean the text to fit the target of their research, using the “tm” package.
Let’s see how the tm functions work.
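
The problem can be seen with a short base R sketch on three sentences taken from the comment. The hand-written stop list (toy_stopwords) is an assumption for illustration only; it is far smaller than the list that tm provides.

```r
text <- paste("This is a sad day for our country.",
              "The decision to end DACA is not just wrong.",
              "They contribute to our communities and to the economy.")

# Lowercase, drop punctuation, and split into words
words <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]

# Raw counts: function words such as "to" rise to the top
raw_counts <- sort(table(words), decreasing = TRUE)
head(raw_counts, 4)

# Filtering with a small, hypothetical stop list leaves the content words
toy_stopwords <- c("this", "is", "a", "for", "our", "the", "to",
                   "not", "just", "they", "and")
content_words <- words[!words %in% toy_stopwords]
content_words
```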

(1) Preprocessing

Preprocessing takes several steps, shown together in the code below.
Each step is then explained in turn.

# Step 1. Make a vector source: comment_source
comment_source <- VectorSource(Zuckerberg_comment)

# Step 2. Make a volatile corpus: comment_corpus
comment_corpus <- VCorpus(comment_source)

# Step 3. Apply various preprocessing functions
# Wrap the cleaning steps in one reusable function
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords,
                   stopwords("en")) # Remove English stop words
  return(corpus)
}

clean_comment <- clean_corpus(comment_corpus)

# Step 4. Create TermDocumentMatrix
comment_tdm <- TermDocumentMatrix(clean_comment)

# Step 5. Convert to a plain matrix
comment_m <- as.matrix(comment_tdm)

# Step 6. Sum each term's counts across documents
comment_words <- rowSums(comment_m)

# Step 7. Sorting
comment_words <- sort(comment_words, decreasing = TRUE)

# Step 8. data_frame for frequency
comment_freqs <- data.frame(term = names(comment_words), 
                            num = comment_words)

head(comment_freqs, 10)

##              term num
## dreamers dreamers   6
## daca         daca   4
## always     always   3
## people     people   3
## young       young   3
## act           act   2
## can           can   2
## congress congress   2
## country   country   2
## day           day   2

Step 1. VectorSource

VectorSource accepts only (character) vectors and treats each element of the vector as one document. Most other sources implemented in tm can take connections as input (a character string is interpreted as a file path). (Ref. https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf)
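
A small sketch of that behavior, assuming the tm package is installed. The two-element vector (sentences) is a hypothetical example, not part of the original data.

```r
library(tm)

# Hypothetical character vector with two elements
sentences <- c("Dreamers deserve protection.", "Congress should act.")

src <- VectorSource(sentences)
corpus <- VCorpus(src)

length(corpus)        # one document per vector element
content(corpus[[2]])  # text of the second document
```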

Step 2. VCorpus vs PCorpus

PCorpus is the permanent corpus. VCorpus is the volatile corpus. The difference between them lies in how the collection of documents is stored on the computer. A VCorpus is held entirely in memory, so it is gone once the R object is destroyed, while a PCorpus stores the documents outside R (for example, in a database). For a small text like this one, VCorpus is the simpler choice. The VCorpus object is a nested list, or list of lists.
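
The nested-list structure can be explored with ordinary list indexing. A minimal sketch, assuming tm is installed; toy_corpus and its two documents are made up for illustration.

```r
library(tm)

toy_corpus <- VCorpus(VectorSource(c("first toy document",
                                     "second toy document")))

# Double brackets select a single document, which is itself a list
doc <- toy_corpus[[1]]
doc$content   # the raw text of the document
meta(doc)     # the metadata stored alongside the text

# Single brackets return a smaller corpus rather than a document
sub <- toy_corpus[2]
class(sub)
```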

Step 3. clean_corpus Function Explanation.

This function consists of five tm_map() calls, each of which applies one cleaning function to the corpus. You could call tm_map() with each function separately, but wrapping the calls in a single function makes the cleaning easy and reusable. Mapping these functions over an entire corpus also makes scaling the cleaning steps very easy.

removePunctuation(): Remove all punctuation marks
stripWhitespace(): Remove excess whitespace
removeNumbers(): Remove numbers
content_transformer(tolower): Make all characters lowercase
removeWords & stopwords: Remove “I”, “she’ll”, “the”, etc.
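
These functions also work on plain character strings, which makes it easy to try them one at a time. A hedged sketch, assuming tm is installed; sample_text is a made-up string. The last two lines illustrate that the order of the steps can matter: once the apostrophe has been stripped, a contraction no longer matches its entry in the stop-word list (this assumes the list contains "i've").

```r
library(tm)

sample_text <- "DACA protects 800,000   Dreamers!"

removePunctuation(sample_text)   # drops "," and "!"
removeNumbers(sample_text)       # drops the digits
stripWhitespace(sample_text)     # collapses the run of spaces
removeWords(tolower("The young people"), stopwords("en"))  # drops "the"

# Order matters for contractions
contraction <- "I've"
tolower(contraction) %in% stopwords("en")                     # apostrophe kept
tolower(removePunctuation(contraction)) %in% stopwords("en")  # apostrophe removed
```

If contractions matter for the analysis, one option is to run removeWords before removePunctuation.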

Step 4. TermDocumentMatrix

The TermDocumentMatrix, with terms as rows and documents as columns, is a common starting point for text analysis. The layout is popular because you are likely to have more terms than authors or documents, and life is generally easier when you have more rows than columns. However, it is not the final step before analysis.
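
A toy example of the layout, assuming tm is installed. The two short documents (toy_docs) are invented for illustration and are already lowercase and free of punctuation.

```r
library(tm)

toy_docs <- c("dreamers deserve protection",
              "congress must protect dreamers")
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(toy_docs)))

dim(toy_tdm)                      # number of terms, number of documents
as.matrix(toy_tdm)["dreamers", ]  # counts of one term in each document

# Transposing gives the document-by-term layout instead
toy_dtm <- t(toy_tdm)
class(toy_dtm)
```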

Step 5. as.matrix()

To start analyzing the text, the easiest approach is to convert the TermDocumentMatrix to a plain matrix with as.matrix().
The remaining steps are optional, depending on how you want to use or analyze the data.
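
Steps 6 to 8 are ordinary base R once the matrix exists. A sketch on a hand-built matrix (toy_m, with invented counts) that mimics the shape of as.matrix() output for two documents:

```r
# Hypothetical term-by-document counts
toy_m <- matrix(c(2, 2,
                  0, 3,
                  1, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(Terms = c("dreamers", "daca", "congress"),
                                Docs = c("1", "2")))

# Sum across documents, then sort from most to least frequent
toy_counts <- sort(rowSums(toy_m), decreasing = TRUE)

# Collect terms and counts in a data frame
toy_freqs <- data.frame(term = names(toy_counts), num = toy_counts)
toy_freqs
```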

(2) Visualization

There are many ways to display the data. The examples below are simple; the focus is on two of them: (1) wordcloud and (2) freq_terms().

# Method 1. barplot
barplot(comment_words[1:10], col = "tan", las = 2)

This graph shows the 10 words Mark Zuckerberg used most in his comment on the decision to end the DACA program.
The word ‘dreamers’ stands out: he used it 6 times in this comment.

# Method 2. wordcloud
purple_green <- brewer.pal(10, "PiYG")
purple_green <- purple_green[-(3:7)]

wordcloud(words = comment_freqs$term, 
          freq = comment_freqs$num, 
          max.words = 100,
          min.freq = 1,
          colors = purple_green, random.order = FALSE)

This word cloud is drawn with the wordcloud package, using the custom color palette defined above.
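
The palette trimming in Method 2 can be inspected on its own. A small sketch, assuming RColorBrewer is installed; the names full_palette and trimmed are illustrative. Dropping positions 3 to 7 removes the pale middle of the diverging palette, presumably so that no word is drawn in a near-white shade (that motivation is an assumption here).

```r
library(RColorBrewer)

full_palette <- brewer.pal(10, "PiYG")  # 10 colors of the diverging PiYG palette
trimmed <- full_palette[-(3:7)]         # keep positions 1, 2, 8, 9, 10

length(full_palette)
length(trimmed)
```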

# Method 3. Using freq_terms()
# Quick way. using freq_terms()
frequency <- freq_terms(Zuckerberg_comment, top = 10, at.least = 2, stopwords = tm::stopwords("english"))
plot(frequency)

This plot uses freq_terms() from the qdap package. It is a good method for a quick look at a dataset but, in my opinion, the resulting graph should not be used as a final visualization.

Chapter 5. With Bill Gates, On the same issue.

Let’s see how the two tech giants talk differently about the decision to end DACA.

(1) Bill Gates

Comments, “I’m very disappointed with today’s decision to end DACA. Hundreds of thousands of young people who have been educated in the United States and have played by the rules their whole lives will be forced to live under the threat that they will be separated from their families, friends, and communities. Melinda and I have been incredibly impressed by the Dreamers we have come across in our work with Microsoft, the foundation, and other programs we have supported over the years. They have been raised as Americans and have taken that responsibility seriously. Dreamers represent the best instincts of this country and the tradition that the great experiment of the United States is made better by people from other places coming here to dedicate their talents and commitment to continuing to move our country forward. I hope that Congress will quickly pass a permanent fix to allow these young people to stay in the country without the destructive fear of deportation.”

(2) Mark Zuckerberg

Comments, “This is a sad day for our country. The decision to end DACA is not just wrong. It is particularly cruel to offer young people the American Dream, encourage them to come out of the shadows and trust our government, and then punish them for it. The young people covered by DACA are our friends and neighbors. They contribute to our communities and to the economy. I’ve gotten to know some Dreamers over the past few years, and I’ve always been impressed by their strength and sense of purpose. They don’t deserve to live in fear. DACA protects 800,000 Dreamers – young people brought to this country by their parents. Six months from today, new DACA recipients will start to lose their ability to work legally and will risk immediate deportation every day. It’s time for Congress to act to pass the bipartisan Dream Act or another legislative solution that gives Dreamers a pathway to citizenship. For years, leaders from both parties have been talking about protecting Dreamers. Now it’s time to back those words up with action. Show us that you can lead. No bill is perfect, but inaction now is unacceptable. Our team at FWD.us has been working alongside Dreamers in this fight, and we’ll be doing even more in the weeks ahead to make sure Dreamers have the protections they deserve. If you live in the US, call your members of Congress and tell them to do the right thing. We have always been a nation of immigrants, and immigrants have always made our nation stronger. You can learn more and get connected at Dreamers.FWD.us.”
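
One possible way to compare the two comments is to put them in a single corpus and build a term-document matrix with one column per speaker. The sketch below is not part of the original analysis. It assumes tm is installed and uses short excerpts from the two comments to stay compact; the variable names are illustrative. comparison.cloud() from the wordcloud package is called only if that package is available.

```r
library(tm)

# Short excerpts from the two comments above
gates_excerpt <- paste("I hope that Congress will quickly pass a permanent fix",
                       "to allow these young people to stay in the country",
                       "without the destructive fear of deportation.")
zuck_excerpt <- paste("The young people covered by DACA are our friends and",
                      "neighbors. They contribute to our communities and to",
                      "the economy.")

# One corpus, two documents, cleaned with the same tm functions as before
two_corpus <- VCorpus(VectorSource(c(gates_excerpt, zuck_excerpt)))
two_corpus <- tm_map(two_corpus, content_transformer(tolower))
two_corpus <- tm_map(two_corpus, removePunctuation)
two_corpus <- tm_map(two_corpus, removeWords, stopwords("en"))

# Term-by-speaker matrix
two_m <- as.matrix(TermDocumentMatrix(two_corpus))
colnames(two_m) <- c("Gates", "Zuckerberg")

# Terms used by both speakers
shared_terms <- rownames(two_m)[rowSums(two_m > 0) == 2]
shared_terms

# Optional: plot the terms that distinguish each speaker
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::comparison.cloud(two_m, max.words = 50,
                              colors = c("steelblue", "tomato"))
}
```

With the full comments in place of the excerpts, the same matrix would give per-speaker counts for every term.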

Project 1. Creating a Basic Word Cloud in R, on the Decision to End DACA Comments by Two Tech Giants

Evan Jung

September 6, 2017