The objective of this task is to identify the most frequent words that Trump used in his speeches. This is the code associated with the project_1 presentation for DATA 607 in the CUNY MSDS program. The full presentation is here
Start by saving your text files in a folder titled “texts”. This will be the “corpus” (body) of texts you are mining.
Note: The texts used in this example are a few of Donald Trump’s speeches that were copied and pasted into individual text documents.
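The directory listing below was produced with something along these lines (a minimal sketch; the “~/Desktop/texts” path and the object name cname are assumptions based on the output shown):

cname <- file.path("~", "Desktop", "texts")  # assumed location of the "texts" folder
cname                                        # print the path
dir(cname)                                   # list the speech files that will form the corpus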
## [1] "~/Desktop/texts"
## [1] "Trump Black History Month Speech.txt"
## [2] "Trump CIA Speech.txt"
## [3] "Trump Congressional Address.txt"
## [4] "Trump CPAC Speech.txt"
## [5] "Trump Florida Rally 2-18-17.txt"
## [6] "Trump Immigration Speech 8-31-16.txt"
## [7] "Trump Inauguration Speech.txt"
## [8] "Trump National Prayer Breakfast.txt"
## [9] "Trump Nomination Speech.txt"
## [10] "Trump Police Chiefs Speech.txt"
## [11] "Trump Response to Healthcare Bill Failure.txt"
Load the R package for text mining and then load your texts into R.
VCorpus in tm refers to a “volatile” corpus, meaning the corpus is stored in memory and is destroyed when the R object containing it is destroyed.
Contrast this with PCorpus (a permanent corpus), which is stored outside of memory, for example in a database.
To create a VCorpus with tm, we need to pass a “Source” object as a parameter to the VCorpus() method. You can list the available sources with getSources().
I referred to this post on Stack Overflow.
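The corpus itself was presumably built along these lines (a sketch; the object name docs is taken from the later code, cname is from the sketch above, and loading tm pulls in NLP, which explains the message below):

library(tm)                        # text mining framework; loads NLP as a dependency
docs <- VCorpus(DirSource(cname))  # one PlainTextDocument per .txt file in the folder
summary(docs)                      # list the documents in the corpus
# inspect(docs[1]) or writeLines(as.character(docs[1])) shows the first document's content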
## Loading required package: NLP
## Length Class
## Trump Black History Month Speech.txt 2 PlainTextDocument
## Trump CIA Speech.txt 2 PlainTextDocument
## Trump Congressional Address.txt 2 PlainTextDocument
## Trump CPAC Speech.txt 2 PlainTextDocument
## Trump Florida Rally 2-18-17.txt 2 PlainTextDocument
## Trump Immigration Speech 8-31-16.txt 2 PlainTextDocument
## Trump Inauguration Speech.txt 2 PlainTextDocument
## Trump National Prayer Breakfast.txt 2 PlainTextDocument
## Trump Nomination Speech.txt 2 PlainTextDocument
## Trump Police Chiefs Speech.txt 2 PlainTextDocument
## Trump Response to Healthcare Bill Failure.txt 2 PlainTextDocument
## Mode
## Trump Black History Month Speech.txt list
## Trump CIA Speech.txt list
## Trump Congressional Address.txt list
## Trump CPAC Speech.txt list
## Trump Florida Rally 2-18-17.txt list
## Trump Immigration Speech 8-31-16.txt list
## Trump Inauguration Speech.txt list
## Trump National Prayer Breakfast.txt list
## Trump Nomination Speech.txt list
## Trump Police Chiefs Speech.txt list
## Trump Response to Healthcare Bill Failure.txt list
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 3974
## list(list(content = c("Well, the election, it came out really well. Next time we’ll triple the number or quadruple it. We want to get it over 51, right? At least 51.", "", "Well this is Black History Month, so this is our little breakfast, our little get-together. Hi Lynn, how are you? Just a few notes. During this month, we honor the tremendous history of African-Americans throughout our country. Throughout the world, if you really think about it, right? And their story is one of unimaginable sacrifice, hard work, and faith in America. I’ve gotten a real glimpse—during the campaign, I’d go around with Ben to a lot of different places I wasn’t so familiar with. They’re incredible people. And I want to thank Ben Carson, who’s gonna be heading up HUD. That’s a big job. That’s a job that’s not only housing, but it’s mind and spirit. Right, Ben? And you understand, nobody’s gonna be better than Ben.",
## "", "Last month, we celebrated the life of Reverend Martin Luther King, Jr., whose incredible example is unique in American history. You read all about Dr. Martin Luther King a week ago when somebody said I took the statue out of my office. It turned out that that was fake news. Fake news. The statue is cherished, it’s one of the favorite things in the—and we have some good ones. We have Lincoln, and we have Jefferson, and we have Dr. Martin Luther King. But they said the statue, the bust of Martin Luther King, was taken out of the office. And it was never even touched. So I think it was a disgrace, but that’s the way the press is. Very unfortunate.",
## "", "I am very proud now that we have a museum on the National Mall where people can learn about Reverend King, so many other things. Frederick Douglass is an example of somebody who’s done an amazing job and is being recognized more and more, I noticed. Harriet Tubman, Rosa Parks, and millions more black Americans who made America what it is today. Big impact.", "", "I’m proud to honor this heritage and will be honoring it more and more. The folks at the table in almost all cases have been great friends and supporters. Darrell—I met Darrell when he was defending me on television. And the people that were on the other side of the argument didn’t have a chance, right? And Paris has done an amazing job in a very hostile CNN community. He’s all by himself. You’ll have seven people, and Paris. And I’ll take Paris over the seven. But I don’t watch CNN, so I don’t get to see you as much as I used to. I don’t like watching fake news. But Fox has treated me very nice. Wherever Fox is, thank you.",
## "", "We’re gonna need better schools and we need them soon. We need more jobs, we need better wages, a lot better wages. We’re gonna work very hard on the inner city. Ben is gonna be doing that, big league. That’s one of the big things that you’re gonna be looking at. We need safer communities and we’re going to do that with law enforcement. We’re gonna make it safe. We’re gonna make it much better than it is right now. Right now it’s terrible, and I saw you talking about it the other night, Paris, on something else that was really—you did a fantastic job the other night on a very unrelated show.",
## "", "I’m ready to do my part, and I will say this: We’re gonna work together. This is a great group, this is a group that’s been so special to me. You really helped me a lot. If you remember I wasn’t going to do well with the African-American community, and after they heard me speaking and talking about the inner city and lots of other things, we ended up getting—and I won’t go into details—but we ended up getting substantially more than other candidates who had run in the past years. And now we’re gonna take that to new levels. I want to thank my television star over here—Omarosa’s actually a very nice person, nobody knows that. I don’t want to destroy her reputation but she’s a very good person, and she’s been helpful right from the beginning of the campaign, and I appreciate it. I really do. Very special.",
## "", "So I want to thank everybody for being here."), meta = list(author = character(0), datetimestamp = list(sec = 28.9621739387512, min = 20, hour = 15, mday = 24, mon = 8, year = 119, wday = 2, yday = 266, isdst = 0), description = character(0), heading = character(0), id = "Trump Black History Month Speech.txt", language = "en", origin = character(0))))
## list()
## list()
I used tm_map() to remove all punctuation from every document in the corpus.
docs <- tm_map(docs, removePunctuation)
# writeLines(as.character(docs[1])) # Check to see if it worked.
# The 'writeLines()' function is commented out to save space.
docs
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 11
I also used gsub() to replace a few special characters, including a stray Unicode line separator, with spaces:
for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
  docs[[j]] <- gsub("\u2028", " ", docs[[j]]) # U+2028 is a Unicode line separator that did not translate, so it had to be removed.
}
#writeLines(as.character(docs[1])) # You can check a document (in this case
# the first) to see if it worked.
As before, we want a word to appear exactly the same every time it appears. We therefore change everything to lowercase.
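A minimal sketch of the lowercase conversion, using tm’s content_transformer() wrapper (assuming the same docs object as above):

docs <- tm_map(docs, content_transformer(tolower))  # convert every document to lowercase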
Next, remove “stop words” (common words) that usually have no analytic value. In every text there are a lot of common, uninteresting words (a, and, also, the, etc.). Such words are frequent by their nature and will confound your analysis if they remain in the text.
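A sketch of that step; the use of tm’s built-in English stop word list and the whitespace cleanup afterwards are assumptions consistent with the frequency tables further down:

docs <- tm_map(docs, removeWords, stopwords("english"))  # drop common English stop words
docs <- tm_map(docs, stripWhitespace)                    # collapse the extra spaces left behind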
If you wish to preserve a concept that is only apparent as a collection of two or more words, you can combine the words or reduce them to a meaningful acronym before you begin the analysis. Here, I am using examples that are particular to qualitative data analysis; a hypothetical sketch follows.
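For example, the stemmed output further down contains the token fake_new, which suggests a substitution like the following was applied (a hypothetical sketch; combineWords is a name chosen here, written with content_transformer() so the documents stay PlainTextDocuments):

combineWords <- content_transformer(function(x, pattern, replacement) gsub(pattern, replacement, x))
docs <- tm_map(docs, combineWords, "fake news", "fake_news")  # treat the two-word concept as one token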
Next, remove common word endings (e.g., “ing”, “es”, “s”); this is referred to as “stemming” the documents. We stem the documents so that a word is recognizable to the computer regardless of which of its possible endings appeared in the original text.
Note: The “stem completion” function is currently problematic, and stemmed words are often annoying to read. For now, I have that section commented out, but you are welcome to try those functions (by removing the hash mark from the beginning of the line) if they interest you. Just don’t expect them to operate smoothly.
This procedure has been a little unreliable in the recent past, so I change the name of the data object when I stem, to keep from overwriting what I have done to this point.
docs_st <- tm_map(docs, stemDocument)
docs_st <- tm_map(docs_st, PlainTextDocument)
writeLines(as.character(docs_st[1])) # Check to see if it worked.
## list(list(content = c("well elect came realli well next time ’ll tripl number quadrupl want get right least", "", "well black histori month littl breakfast littl gettogeth hi lynn just note month honor tremend histori africanamerican throughout countri throughout world realli think right stori one unimagin sacrific hard work faith america ’ve gotten real glimpse— campaign ’d go around ben lot differ place wasn’t familiar ’re incred peopl want thank ben carson ’s gonna head hud ’s big job ’s job ’s hous ’s mind spirit right ben understand nobody’ gonna better ben",
## "", "last month celebr life reverend martin luther king jr whose incred exampl uniqu american histori read dr martin luther king week ago somebodi said took statu offic turn fake_new fake_new statu cherish ’s one favorit thing — good one lincoln jefferson dr martin luther king said statu bust martin luther king taken offic never even touch think disgrac ’s way press unfortun", "", "proud now museum nation mall peopl can learn reverend king mani thing frederick douglass exampl somebodi ’s done amaz job recogn notic harriet tubman rosa park million black american made america today big impact",
## "", "’m proud honor heritag will honor folk tabl almost case great friend support darrell— met darrel defend televis peopl side argument didn’t chanc right pari done amaz job hostil cnn communiti ’s ’ll seven peopl pari ’ll take pari seven don’t watch cnn don’t get see much use don’t like watch fake_new fox treat nice wherev fox thank", "", "’re gonna need better school need soon need job need better wage lot better wage ’re gonna work hard inner-c ben gonna big leagu ’s one big thing ’re gonna look need safer communiti ’re go law enforc ’re gonna make safe ’re gonna make much better right now right now ’s terribl saw talk night pari someth els really— fantast job night unrel show",
## "", "’m readi part will say ’re gonna work togeth great group group ’s special realli help lot rememb wasn’t go well africanamerican communiti heard speak talk inner-c lot thing end getting— won’t go details— end get substanti candid run past year now ’re gonna take new level want thank televis star —omarosa’ actual nice person nobodi know don’t want destroy reput ’s good person ’s help right begin campaign appreci realli special", "", "want thank everybodi"), meta = list(
## author = character(0), datetimestamp = list(sec = 29.7225689888, min = 20, hour = 15, mday = 24, mon = 8, year = 119, wday = 2, yday = 266, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
To proceed, create a document term matrix. This is what you will be using from this point on.
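A sketch of that step (the object name dtm is assumed from the later code):

dtm <- DocumentTermMatrix(docs)  # rows = documents, columns = terms
dtm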
## <<DocumentTermMatrix (documents: 11, terms: 3659)>>
## Non-/sparse entries: 8364/31885
## Sparsity : 79%
## Maximal term length: 19
## Weighting : term frequency (tf)
You’ll also need a transpose of this matrix. Create it using:
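A sketch, assuming the object name tdm:

tdm <- TermDocumentMatrix(docs)  # rows = terms, columns = documents
tdm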
## <<TermDocumentMatrix (terms: 3659, documents: 11)>>
## Non-/sparse entries: 8364/31885
## Sparsity : 79%
## Maximal term length: 19
## Weighting : term frequency (tf)
Organize terms by their frequency:
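A sketch of the frequency bookkeeping implied by the two outputs below (the total number of terms, then the dimensions of the document-term matrix); the object names freq and m are assumptions:

freq <- colSums(as.matrix(dtm))  # total count of each term across all speeches
length(freq)                     # number of distinct terms
m <- as.matrix(dtm)
dim(m)                           # documents x terms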
## [1] 3659
## [1] 11 3659
The ‘removeSparseTerms()’ function removes the infrequently used words, leaving only the most frequently used words in the corpus.
# Start by removing sparse terms:
dtms <- removeSparseTerms(dtm, 0.2) # Keep only terms absent from at most 20% of documents, so the matrix is at most 20% empty space.
dtms
## <<DocumentTermMatrix (documents: 11, terms: 87)>>
## Non-/sparse entries: 848/109
## Sparsity : 11%
## Maximal term length: 11
## Weighting : term frequency (tf)
There are a lot of terms, so for now, just check out some of the most and least frequently occurring words.
Check out the frequency of frequencies. The colSums() function gives how often each word occurs; applying table() to that vector reports how many words occur at each frequency. Using the head() function, below, we can see the distribution of the least-frequently used words.
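A sketch of that step, reusing the freq vector from above:

head(table(freq), 20)  # how many terms occur once, twice, three times, ...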
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1638 626 326 212 123 103 81 60 59 45 38 33 29 25 26
## 16 17 18 19 20
## 16 9 13 16 11
The resulting output is two rows of numbers: the top number is the frequency with which words appear and the bottom number reflects how many words appear that frequently. Here, considering only the 20 lowest word frequencies, we can see that 1638 terms appear only once. There are also a lot of others that appear very infrequently.
For a look at the most frequently used terms, we can use the ‘tail()’ function.
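The corresponding sketch for the high end of the distribution:

tail(table(freq), 20)  # the 20 largest word frequencies and how many terms reach each one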
## freq
## 77 79 83 88 89 100 101 102 105 107 111 122 127 139 140 163 174 265
## 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 278 428
## 1 1
Considering only the 20 greatest frequencies, we can see that there is a huge disparity in how frequently some terms appear.
For a less fine-grained look at term frequency, we can view a table of the terms we selected when we removed sparse terms, above.
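A sketch of that table, using a separate name (freq_s, chosen here) so the full-matrix freq vector is not overwritten:

freq_s <- colSums(as.matrix(dtms))  # term counts for the sparse-reduced matrix
freq_s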
## also always america american another back
## 54 24 122 107 22 75
## bad believe big came can care
## 35 60 45 20 100 37
## come country day different done enforcement
## 55 174 36 16 24 43
## even ever every get getting give
## 55 42 49 79 23 25
## going good great group happen job
## 265 58 163 20 36 38
## just know last law let life
## 88 127 44 59 40 27
## like little long look lot love
## 79 24 36 52 44 45
## made many much must nation need
## 32 101 68 53 48 32
## never new now office one people
## 83 69 111 24 139 278
## president put really remember right safe
## 44 35 57 27 102 35
## said say see seen something special
## 83 66 48 34 25 26
## states take tell thank things think
## 65 64 50 105 40 77
## time today together totally truly understand
## 76 33 34 18 16 29
## united want way well will work
## 64 140 71 51 428 64
## world year years
## 56 47 54
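The ranked list below, showing the most frequent terms overall, was presumably produced by sorting the frequency vector, along these lines:

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)  # highest counts first
head(freq, 14)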
## will people going country great want one know
## 428 278 265 174 163 140 139 127
## america now american thank right many
## 122 111 107 105 102 101
An alternate view of term frequency: This will identify all terms that appear frequently (in this case, 50 or more times).
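A sketch using tm’s findFreqTerms():

findFreqTerms(dtm, lowfreq = 50)  # every term that appears 50 or more times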
## [1] "’re" "also" "america" "american" "back"
## [6] "believe" "can" "come" "country" "dont"
## [11] "even" "get" "going" "good" "great"
## [16] "immigration" "jobs" "just" "know" "law"
## [21] "like" "look" "make" "many" "much"
## [26] "must" "never" "new" "now" "one"
## [31] "people" "really" "right" "said" "say"
## [36] "states" "take" "tell" "thank" "theyre"
## [41] "think" "time" "united" "want" "way"
## [46] "well" "will" "work" "world" "years"
View as a table:
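The wf data frame used by the plot below can be built like this (a sketch; wf holds one row per term with its frequency):

wf <- data.frame(word = names(freq), freq = freq)
head(wf)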
Plot words that appear at least 50 times.
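Loading ggplot2 for the bar chart (the package-attach message below comes from this call):

library(ggplot2)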
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
p <- ggplot(subset(wf, freq>50), aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity", fill = "#FF6666") +
theme(axis.text.x=element_text(angle=45, hjust=1))
p
Term Correlations
If we have a term in mind that we have found to be particularly meaningful to our analysis, then we may find it helpful to identify the words that most highly correlate with that term.
If words always appear together, then correlation=1.0.
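A sketch using tm’s findAssocs(); the correlation limits below are assumptions chosen to roughly match the output shown:

findAssocs(dtm, "country", corlimit = 0.85)   # terms whose per-document counts correlate with "country"
findAssocs(dtm, "american", corlimit = 0.85)
findAssocs(dtm, "think", corlimit = 0.70)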
## $country
## nothing cities countries jobs come biggest donors
## 0.95 0.94 0.94 0.92 0.91 0.90 0.90
## second begin border plan crimes globe meant
## 0.90 0.88 0.88 0.88 0.87 0.87 0.87
## thousands means workers also despite take
## 0.87 0.86 0.86 0.85 0.85 0.85
##
## $american
## restore task fair budget cycle new promises dollars
## 0.97 0.93 0.92 0.91 0.89 0.89 0.89 0.88
## finally millions national tens foreign middle justice program
## 0.88 0.88 0.88 0.88 0.87 0.87 0.86 0.86
## break joining united
## 0.85 0.85 0.85
## $think
## really well care lot happen see good
## 0.89 0.88 0.86 0.76 0.72 0.72 0.71
Humans are generally strong at visual analytics; that is part of the reason word clouds have become so popular. What follows are a variety of alternatives for constructing word clouds with your text.
But first you will need to load the package that makes word clouds in R.
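A sketch of the load step (the wordcloud package pulls in RColorBrewer, which is what produces the message below):

library(wordcloud)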
## Loading required package: RColorBrewer
Plot the 100 most frequently used words.
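A plain, uncolored version can be drawn like this (a sketch, since the original chunk is not shown):

set.seed(142)  # fix the layout so the cloud is reproducible
wordcloud(names(freq), freq, max.words = 100)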
Add some color and plot words occurring at least 20 times.
set.seed(142)
wordcloud(names(freq), freq, min.freq=20, scale = c(4, 0.2), colors=brewer.pal(6, "Dark2"))
Plot the 100 most frequently occurring words.
set.seed(142)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freq), freq, max.words=100, scale = c(4, 0.2), colors=dark2)
To cluster terms by their similarity, we should always first remove a lot of the uninteresting or infrequent words.
dtmss <- removeSparseTerms(dtm, 0.15) # Keep only terms absent from at most 15% of documents, so the matrix is at most 15% empty space.
dtmss
## <<DocumentTermMatrix (documents: 11, terms: 43)>>
## Non-/sparse entries: 452/21
## Sparsity : 4%
## Maximal term length: 9
## Weighting : term frequency (tf)
First calculate the distance between words and then cluster them according to similarity.
library(cluster)
d <- dist(t(dtmss), method="euclidean")
fit <- hclust(d=d, method="complete") # for a different look try substituting: method="ward.D"
fit
##
## Call:
## hclust(d = d, method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 43
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=6) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=6, border="red") # draw the dendrogram with red borders around the 6 clusters