Representing and Mining Text

This presentation focuses on one particular kind of data: text. In principle, text is just another form of data, but it is unstructured. As the example I use the text of Martin Luther King Jr.'s I Have a Dream speech, saved as a plain-text file.

Libraries

I used several packages built for natural language processing, data tidying, and text mining. Tidy tools such as tidytext and dplyr make the text mining process much easier.

library(readr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
library(extrafont)
## Registering fonts with R
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)

Read in data from a text file

My first step was to find the I Have a Dream speech by Martin Luther King Jr. online and save it as a text file. The main purpose of this presentation is to use text data for analysis, so the text first goes through a tidying process that cleans it up for text mining.

# Read in data
MLK_Speech <- read_lines("I_have_a_dream_MLK.txt")

# Build a data frame with one row per line of the file
MLK_text <- data.frame(line = seq_along(MLK_Speech), text = MLK_Speech, stringsAsFactors = FALSE)

# Drop blank lines (in this file they fall on the even rows), then renumber
MLK_text <- MLK_text %>% filter(line %% 2 == 1) %>%
  mutate(line = 1:nrow(.))

# Split the text into one word per row
MLK_text <- MLK_text %>%
  unnest_tokens(word, text)

# Now remove all the stop words using "anti_join"
tidy_MLK <- MLK_text %>%
  anti_join(stop_words)
## Joining, by = "word"
# count the frequency of each word
tidy_MLK %>%
  count(word, sort = TRUE)
## # A tibble: 214 x 2
##    word         n
##    <chr>    <int>
##  1 freedom     13
##  2 ring        12
##  3 dream       11
##  4 day          9
##  5 negro        8
##  6 free         5
##  7 white        5
##  8 faith        4
##  9 hundred      4
## 10 mountain     4
## # ... with 204 more rows
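
The tidying steps above can be tried on a tiny input before running them on the whole speech. The following is a minimal sketch, not the code used for the results in this presentation: the two sample lines and the names toy_lines, toy_text, toy_words and toy_tidy are made up for illustration. It also drops blank lines by testing for empty text instead of relying on row position, which is an alternative that does not assume the blank lines fall exactly on the even rows.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: two lines of text separated by blank lines
toy_lines <- c("I have a dream today", "", "Let freedom ring", "")

toy_text <- data.frame(line = seq_along(toy_lines), text = toy_lines,
                       stringsAsFactors = FALSE) %>%
  filter(nzchar(trimws(text))) %>%   # keep only non-empty lines
  mutate(line = row_number())        # renumber the remaining lines

# One lower-cased word per row
toy_words <- toy_text %>%
  unnest_tokens(word, text)

# Drop stop words; naming the join column avoids the "Joining, by" message
toy_tidy <- toy_words %>%
  anti_join(stop_words, by = "word")

toy_tidy %>%
  count(word, sort = TRUE)
```

Words such as dream, freedom and ring survive the stop-word filter here, just as they do in the full speech.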

Bar Graph to show highest frequency words

Once the text is tidied and counted, we can visualize the most frequent words in the I Have a Dream speech. Here a bar graph shows the results. The words freedom, ring and dream sit at the top of the chart as the most frequently used words in the speech.

# Plot the words used more than twice, ordered from most to least frequent
tidy_MLK %>%
  count(word, sort = TRUE) %>%
  filter(n > 2) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Words used more than 2 times", y = "Word frequency",
       title = "I Have a Dream Speech, MLK Jr. (1963)")

Wordcloud with highest frequency words

A word cloud is another way to show our bag of words: the more often a word occurs, the larger it is drawn.

# Store the word counts under a new name so tidy_MLK is not overwritten
MLK_counts <- tidy_MLK %>%
  count(word, sort = TRUE)
wordcloud(words = MLK_counts$word, freq = MLK_counts$n, min.freq = 2, max.words = 200,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

Further Analysis

You can look at the frequent terms in a term-document matrix built from MLK_Speech as follows. We want to find words that occur at least three times:

library(NLP)
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(tm)

MLK_Speech_corpus <- Corpus(VectorSource(MLK_Speech))

tdm <- TermDocumentMatrix(MLK_Speech_corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE,
                                         removeNumbers = TRUE,
                                         tolower = TRUE))

findFreqTerms(tdm, lowfreq = 3)
##  [1] "freedom"     "join"        "nation"      "today"       "will"       
##  [6] "american"    "hope"        "injustice"   "negro"       "years"      
## [11] "free"        "hundred"     "later"       "one"         "still"      
## [16] "brotherhood" "children"    "justice"     "now"         "ring"       
## [21] "rise"        "time"        "must"        "new"         "white"      
## [26] "dream"       "day"         "men"         "able"        "together"   
## [31] "state"       "little"      "black"       "every"       "mountain"   
## [36] "shall"       "faith"       "let"         "last"
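
To make the idea of a term-document matrix concrete, here is a small base R sketch that builds one by hand and keeps the terms whose total count reaches a threshold. It only illustrates the idea behind a frequency filter and does not reproduce the internals of tm. The three sample documents and the names docs, tokens, vocab, tdm_toy and frequent are made up for illustration.

```r
# Hypothetical mini corpus: each string is one document
docs <- c("let freedom ring",
          "let freedom ring from every hill",
          "free at last")

# Lower-case and split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are counts
tdm_toy <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = vocab))),
                  integer(length(vocab)))
rownames(tdm_toy) <- vocab
colnames(tdm_toy) <- paste0("doc", seq_along(docs))

# Keep the terms that occur at least twice across all documents
term_totals <- rowSums(tdm_toy)
frequent <- names(term_totals[term_totals >= 2])
frequent
```

In this toy corpus only freedom, let and ring occur twice, so they are the only terms returned.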

We can also analyze the association between frequent terms using the findAssocs() function. The R code below identifies the words associated with the word freedom in the I Have a Dream speech, keeping those with a correlation of at least 0.3.

findAssocs(tdm, terms = "freedom", corlimit = 0.3)
## $freedom
##          let         ring  alleghenies      friends    hampshire 
##         0.92         0.87         0.66         0.66         0.66 
##     hilltops       mighty    mountains pennsylvania   prodigious 
##         0.66         0.66         0.66         0.66         0.66 
##          say         york          new  mississippi     molehill 
##         0.66         0.66         0.57         0.43         0.42 
## mountainside        every 
##         0.42         0.31
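
The association scores above are correlations between term counts across documents. The sketch below shows that idea in base R on a made-up matrix. It assumes a Pearson correlation between rows, rounded to two digits and cut off at a lower limit, which is my understanding of what findAssocs() reports and not a copy of its implementation. The matrix tdm_toy and its counts are hypothetical.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents
tdm_toy <- rbind(
  freedom = c(2, 0, 1, 0),
  ring    = c(2, 0, 1, 0),
  day     = c(1, 0, 1, 1),
  justice = c(0, 1, 0, 1)
)
colnames(tdm_toy) <- paste0("doc", 1:4)

target <- "freedom"
others <- setdiff(rownames(tdm_toy), target)

# Pearson correlation of the target term's counts with every other term's counts
assoc <- sapply(others, function(term) cor(tdm_toy[target, ], tdm_toy[term, ]))

# Keep associations at or above the limit, strongest first
assoc <- sort(round(assoc[assoc >= 0.3], 2), decreasing = TRUE)
assoc
```

Here ring appears in exactly the same documents as freedom and scores 1, day overlaps partly and scores 0.52, and justice is negatively correlated and is dropped.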

The next step is a sentiment analysis of the speech using get_sentiments(), which comes with the tidytext package. It returns a specific sentiment lexicon in a tidy format, with one row per word. For this example I am using the general-purpose bing lexicon.

Joining the words of the speech to the lexicon labels each matched word as positive or negative, and the counts are then sorted by frequency.

library(tidyr)
bing <- get_sentiments("bing")
MLK_Wordcount <- MLK_text %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
MLK_Wordcount
## # A tibble: 53 x 3
##    word      sentiment     n
##    <chr>     <chr>     <int>
##  1 freedom   positive     13
##  2 free      positive      5
##  3 faith     positive      4
##  4 injustice negative      3
##  5 destiny   positive      2
##  6 great     positive      2
##  7 mighty    positive      2
##  8 slaves    negative      2
##  9 struggle  negative      2
## 10 beautiful positive      1
## # ... with 43 more rows
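
The inner join keeps only the words that appear in the lexicon, which is why the sentiment table is much shorter than the full word count. A minimal sketch on a handful of words shows this. The sample words and the names toy_words and toy_sentiment are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample: four distinct words, one of them repeated
toy_words <- data.frame(
  word = c("freedom", "freedom", "injustice", "mountain", "faith"),
  stringsAsFactors = FALSE
)

# Words missing from the bing lexicon are dropped by the inner join
toy_sentiment <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

toy_sentiment
```

As in the output above, freedom and faith are labelled positive and injustice negative, while mountain has no entry in the lexicon and disappears from the result.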

Now it is easy to graph the most common positive and negative words in this speech. One advantage of having a data frame with both sentiment and word is that we can analyze how much each word contributes to each sentiment.

MLK_Wordcount %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  theme_minimal()+
  labs(y = "Contribution to sentiment")
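
The balance between positive and negative words can also be summarised as numbers. The sketch below totals the counts per sentiment using only the ten rows printed earlier, so the totals cover those rows and not the full 53-row table. The names top_rows and sentiment_totals are made up for illustration. The same group_by() and summarise() calls could be applied to MLK_Wordcount itself.

```r
library(dplyr)

# The ten rows shown in the printed output above, not the full table
top_rows <- data.frame(
  word = c("freedom", "free", "faith", "injustice", "destiny",
           "great", "mighty", "slaves", "struggle", "beautiful"),
  sentiment = c("positive", "positive", "positive", "negative", "positive",
                "positive", "positive", "negative", "negative", "positive"),
  n = c(13L, 5L, 4L, 3L, 2L, 2L, 2L, 2L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Total occurrences and number of distinct words per sentiment
sentiment_totals <- top_rows %>%
  group_by(sentiment) %>%
  summarise(total = sum(n), distinct_words = n_distinct(word))

sentiment_totals
```

For these ten rows the positive words account for 29 occurrences and the negative words for 7.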

Conclusion

This visualization shows that MLK Jr. used more positive words than negative words in his speech. Maybe this is one of the reasons why the speech is so memorable and had such a large impact on the civil rights movement of the 1960s. His positive word choice may have helped make the “I Have a Dream” speech so powerful, even after more than 50 years.