The objective of this task is to identify the most frequent words that Trump used in his speeches. This is the code associated with the project_1 presentation for DATA 607 in the CUNY MSDS program. The full presentation is here
Start by saving your text files in a folder titled “texts”. This will be the “corpus” (body) of texts you are mining.
Note: The texts used in this example are a few of Donald Trump’s speeches that were copied and pasted into individual text documents.
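The directory listing below was produced with something along these lines (a minimal sketch; the “~/Desktop/texts” path and the object name cname are assumptions based on the output shown):

cname <- file.path("~", "Desktop", "texts")  # assumed location of the "texts" folder
cname                                        # print the path
dir(cname)                                   # list the speech files that will form the corpus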
## [1] "~/Desktop/texts"
## [1] "Trump Black History Month Speech.txt"
## [2] "Trump CIA Speech.txt"
## [3] "Trump Congressional Address.txt"
## [4] "Trump CPAC Speech.txt"
## [5] "Trump Florida Rally 2-18-17.txt"
## [6] "Trump Immigration Speech 8-31-16.txt"
## [7] "Trump Inauguration Speech.txt"
## [8] "Trump National Prayer Breakfast.txt"
## [9] "Trump Nomination Speech.txt"
## [10] "Trump Police Chiefs Speech.txt"
## [11] "Trump Response to Healthcare Bill Failure.txt"
Load the R package for text mining and then load your texts into R.
VCorpus in tm refers to a “volatile” corpus, meaning the corpus is stored in memory and is destroyed when the R object containing it is destroyed.
Contrast this with PCorpus (a permanent corpus), which is stored outside of memory, for example in a database.
To create a VCorpus with tm, we need to pass a “Source” object as a parameter to the VCorpus() method. You can list the available sources with getSources().
I referred to this post on Stack Overflow.
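The corpus itself was presumably built along these lines (a sketch; the object name docs is taken from the later code, cname is from the sketch above, and loading tm pulls in NLP, which explains the message below):

library(tm)                        # text mining framework; loads NLP as a dependency
docs <- VCorpus(DirSource(cname))  # one PlainTextDocument per .txt file in the folder
summary(docs)                      # list the documents in the corpus
# inspect(docs[1]) or writeLines(as.character(docs[1])) shows the first document's content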
## Loading required package: NLP
## Length Class
## Trump Black History Month Speech.txt 2 PlainTextDocument
## Trump CIA Speech.txt 2 PlainTextDocument
## Trump Congressional Address.txt 2 PlainTextDocument
## Trump CPAC Speech.txt 2 PlainTextDocument
## Trump Florida Rally 2-18-17.txt 2 PlainTextDocument
## Trump Immigration Speech 8-31-16.txt 2 PlainTextDocument
## Trump Inauguration Speech.txt 2 PlainTextDocument
## Trump National Prayer Breakfast.txt 2 PlainTextDocument
## Trump Nomination Speech.txt 2 PlainTextDocument
## Trump Police Chiefs Speech.txt 2 PlainTextDocument
## Trump Response to Healthcare Bill Failure.txt 2 PlainTextDocument
## Mode
## Trump Black History Month Speech.txt list
## Trump CIA Speech.txt list
## Trump Congressional Address.txt list
## Trump CPAC Speech.txt list
## Trump Florida Rally 2-18-17.txt list
## Trump Immigration Speech 8-31-16.txt list
## Trump Inauguration Speech.txt list
## Trump National Prayer Breakfast.txt list
## Trump Nomination Speech.txt list
## Trump Police Chiefs Speech.txt list
## Trump Response to Healthcare Bill Failure.txt list
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 3974
## list(list(content = c("Well, the election, it came out really well. Next time we’ll triple the number or quadruple it. We want to get it over 51, right? At least 51.", "", "Well this is Black History Month, so this is our little breakfast, our little get-together. Hi Lynn, how are you? Just a few notes. During this month, we honor the tremendous history of African-Americans throughout our country. Throughout the world, if you really think about it, right? And their story is one of unimaginable sacrifice, hard work, and faith in America. I’ve gotten a real glimpse—during the campaign, I’d go around with Ben to a lot of different places I wasn’t so familiar with. They’re incredible people. And I want to thank Ben Carson, who’s gonna be heading up HUD. That’s a big job. That’s a job that’s not only housing, but it’s mind and spirit. Right, Ben? And you understand, nobody’s gonna be better than Ben.",
## "", "Last month, we celebrated the life of Reverend Martin Luther King, Jr., whose incredible example is unique in American history. You read all about Dr. Martin Luther King a week ago when somebody said I took the statue out of my office. It turned out that that was fake news. Fake news. The statue is cherished, it’s one of the favorite things in the—and we have some good ones. We have Lincoln, and we have Jefferson, and we have Dr. Martin Luther King. But they said the statue, the bust of Martin Luther King, was taken out of the office. And it was never even touched. So I think it was a disgrace, but that’s the way the press is. Very unfortunate.",
## "", "I am very proud now that we have a museum on the National Mall where people can learn about Reverend King, so many other things. Frederick Douglass is an example of somebody who’s done an amazing job and is being recognized more and more, I noticed. Harriet Tubman, Rosa Parks, and millions more black Americans who made America what it is today. Big impact.", "", "I’m proud to honor this heritage and will be honoring it more and more. The folks at the table in almost all cases have been great friends and supporters. Darrell—I met Darrell when he was defending me on television. And the people that were on the other side of the argument didn’t have a chance, right? And Paris has done an amazing job in a very hostile CNN community. He’s all by himself. You’ll have seven people, and Paris. And I’ll take Paris over the seven. But I don’t watch CNN, so I don’t get to see you as much as I used to. I don’t like watching fake news. But Fox has treated me very nice. Wherever Fox is, thank you.",
## "", "We’re gonna need better schools and we need them soon. We need more jobs, we need better wages, a lot better wages. We’re gonna work very hard on the inner city. Ben is gonna be doing that, big league. That’s one of the big things that you’re gonna be looking at. We need safer communities and we’re going to do that with law enforcement. We’re gonna make it safe. We’re gonna make it much better than it is right now. Right now it’s terrible, and I saw you talking about it the other night, Paris, on something else that was really—you did a fantastic job the other night on a very unrelated show.",
## "", "I’m ready to do my part, and I will say this: We’re gonna work together. This is a great group, this is a group that’s been so special to me. You really helped me a lot. If you remember I wasn’t going to do well with the African-American community, and after they heard me speaking and talking about the inner city and lots of other things, we ended up getting—and I won’t go into details—but we ended up getting substantially more than other candidates who had run in the past years. And now we’re gonna take that to new levels. I want to thank my television star over here—Omarosa’s actually a very nice person, nobody knows that. I don’t want to destroy her reputation but she’s a very good person, and she’s been helpful right from the beginning of the campaign, and I appreciate it. I really do. Very special.",
## "", "So I want to thank everybody for being here."), meta = list(author = character(0), datetimestamp = list(sec = 28.9621739387512, min = 20, hour = 15, mday = 24, mon = 8, year = 119, wday = 2, yday = 266, isdst = 0), description = character(0), heading = character(0), id = "Trump Black History Month Speech.txt", language = "en", origin = character(0))))
## list()
## list()
I used tm_map() to remove all punctuation from every document in the corpus.
docs <- tm_map(docs, removePunctuation)
# writeLines(as.character(docs[1])) # Check to see if it worked.
# The 'writeLines()' function is commented out to save space.
docs
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 11
I also used gsub() to replace a few special characters, including a stray Unicode line separator, with spaces:
for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
  docs[[j]] <- gsub("\u2028", " ", docs[[j]]) # U+2028 is a Unicode line separator that did not translate, so it had to be removed.
}
#writeLines(as.character(docs[1])) # You can check a document (in this case
# the first) to see if it worked.
As before, we want a word to appear exactly the same every time it appears. We therefore change everything to lowercase.
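A minimal sketch of the lowercase conversion, using tm’s content_transformer() wrapper (assuming the same docs object as above):

docs <- tm_map(docs, content_transformer(tolower))  # convert every document to lowercase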
Next, remove “stop words” (common words) that usually have no analytic value. In every text there are a lot of common, uninteresting words (a, and, also, the, etc.). Such words are frequent by their nature and will confound your analysis if they remain in the text.
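A sketch of that step; the use of tm’s built-in English stop word list and the whitespace cleanup afterwards are assumptions consistent with the frequency tables further down:

docs <- tm_map(docs, removeWords, stopwords("english"))  # drop common English stop words
docs <- tm_map(docs, stripWhitespace)                    # collapse the extra spaces left behind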
If you wish to preserve a concept that is only apparent as a collection of two or more words, you can combine the words or reduce them to a meaningful acronym before you begin the analysis. Here, I am using examples that are particular to qualitative data analysis; a hypothetical sketch follows.
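For example, the stemmed output further down contains the token fake_new, which suggests a substitution like the following was applied (a hypothetical sketch; combineWords is a name chosen here, written with content_transformer() so the documents stay PlainTextDocuments):

combineWords <- content_transformer(function(x, pattern, replacement) gsub(pattern, replacement, x))
docs <- tm_map(docs, combineWords, "fake news", "fake_news")  # treat the two-word concept as one token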
Next, remove common word endings (e.g., “ing”, “es”, “s”); this is referred to as “stemming” the documents. We stem the documents so that a word is recognizable to the computer regardless of which of its possible endings appeared in the original text.
Note: The “stem completion” function is currently problematic, and stemmed words are often annoying to read. For now, I have that section commented out, but you are welcome to try those functions (by removing the hash mark from the beginning of the line) if they interest you. Just don’t expect them to operate smoothly.
This procedure has been a little unreliable in the recent past, so I change the name of the data object when I stem, to keep from overwriting what I have done to this point.
docs_st <- tm_map(docs, stemDocument)
docs_st <- tm_map(docs_st, PlainTextDocument)
writeLines(as.character(docs_st[1])) # Check to see if it worked.
## list(list(content = c("well elect came realli well next time ’ll tripl number quadrupl want get right least", "", "well black histori month littl breakfast littl gettogeth hi lynn just note month honor tremend histori africanamerican throughout countri throughout world realli think right stori one unimagin sacrific hard work faith america ’ve gotten real glimpse— campaign ’d go around ben lot differ place wasn’t familiar ’re incred peopl want thank ben carson ’s gonna head hud ’s big job ’s job ’s hous ’s mind spirit right ben understand nobody’ gonna better ben",
## "", "last month celebr life reverend martin luther king jr whose incred exampl uniqu american histori read dr martin luther king week ago somebodi said took statu offic turn fake_new fake_new statu cherish ’s one favorit thing — good one lincoln jefferson dr martin luther king said statu bust martin luther king taken offic never even touch think disgrac ’s way press unfortun", "", "proud now museum nation mall peopl can learn reverend king mani thing frederick douglass exampl somebodi ’s done amaz job recogn notic harriet tubman rosa park million black american made america today big impact",
## "", "’m proud honor heritag will honor folk tabl almost case great friend support darrell— met darrel defend televis peopl side argument didn’t chanc right pari done amaz job hostil cnn communiti ’s ’ll seven peopl pari ’ll take pari seven don’t watch cnn don’t get see much use don’t like watch fake_new fox treat nice wherev fox thank", "", "’re gonna need better school need soon need job need better wage lot better wage ’re gonna work hard inner-c ben gonna big leagu ’s one big thing ’re gonna look need safer communiti ’re go law enforc ’re gonna make safe ’re gonna make much better right now right now ’s terribl saw talk night pari someth els really— fantast job night unrel show",
## "", "’m readi part will say ’re gonna work togeth great group group ’s special realli help lot rememb wasn’t go well africanamerican communiti heard speak talk inner-c lot thing end getting— won’t go details— end get substanti candid run past year now ’re gonna take new level want thank televis star —omarosa’ actual nice person nobodi know don’t want destroy reput ’s good person ’s help right begin campaign appreci realli special", "", "want thank everybodi"), meta = list(
## author = character(0), datetimestamp = list(sec = 29.7225689888, min = 20, hour = 15, mday = 24, mon = 8, year = 119, wday = 2, yday = 266, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
To proceed, create a document term matrix. This is what you will be using from this point on.
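A sketch of that step (the object name dtm is assumed from the later code):

dtm <- DocumentTermMatrix(docs)  # rows = documents, columns = terms
dtm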
## <<DocumentTermMatrix (documents: 11, terms: 3659)>>
## Non-/sparse entries: 8364/31885
## Sparsity : 79%
## Maximal term length: 19
## Weighting : term frequency (tf)
You’ll also need a transpose of this matrix. Create it using:
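A sketch, assuming the object name tdm:

tdm <- TermDocumentMatrix(docs)  # rows = terms, columns = documents
tdm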
## <<TermDocumentMatrix (terms: 3659, documents: 11)>>
## Non-/sparse entries: 8364/31885
## Sparsity : 79%
## Maximal term length: 19
## Weighting : term frequency (tf)
Organize terms by their frequency:
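A sketch of the frequency bookkeeping implied by the two outputs below (the total number of terms, then the dimensions of the document-term matrix); the object names freq and m are assumptions:

freq <- colSums(as.matrix(dtm))  # total count of each term across all speeches
length(freq)                     # number of distinct terms
m <- as.matrix(dtm)
dim(m)                           # documents x terms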
## [1] 3659
## [1] 11 3659
The ‘removeSparseTerms()’ function removes the infrequently used words, leaving only the most frequently used words in the corpus.
# Start by removing sparse terms:
dtms <- removeSparseTerms(dtm, 0.2) # Keep only terms absent from at most 20% of documents, so the matrix is at most 20% empty space.
dtms
## <<DocumentTermMatrix (documents: 11, terms: 87)>>
## Non-/sparse entries: 848/109
## Sparsity : 11%
## Maximal term length: 11
## Weighting : term frequency (tf)
There are a lot of terms, so for now, just check out some of the most and least frequently occurring words.
Check out the frequency of frequencies. The colSums() function gives how often each word occurs; applying table() to that vector reports how many words occur at each frequency. Using the head() function, below, we can see the distribution of the least-frequently used words.
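A sketch of that step, reusing the freq vector from above:

head(table(freq), 20)  # how many terms occur once, twice, three times, ...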
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1638 626 326 212 123 103 81 60 59 45 38 33 29 25 26
## 16 17 18 19 20
## 16 9 13 16 11
The resulting output is two rows of numbers: the top number is the frequency with which words appear and the bottom number reflects how many words appear that frequently. Here, considering only the 20 lowest word frequencies, we can see that 1638 terms appear only once. There are also a lot of others that appear very infrequently.
For a look at the most frequently used terms, we can use the ‘tail()’ function.
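The corresponding sketch for the high end of the distribution:

tail(table(freq), 20)  # the 20 largest word frequencies and how many terms reach each one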
## freq
## 77 79 83 88 89 100 101 102 105 107 111 122 127 139 140 163 174 265
## 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 278 428
## 1 1
Considering only the 20 greatest frequencies, we can see that there is a huge disparity in how frequently some terms appear.
For a less fine-grained look at term frequency, we can view a table of the terms we selected when we removed sparse terms, above.
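A sketch of that table, using a separate name (freq_s, chosen here) so the full-matrix freq vector is not overwritten:

freq_s <- colSums(as.matrix(dtms))  # term counts for the sparse-reduced matrix
freq_s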
## also always america american another back
## 54 24 122 107 22 75
## bad believe big came can care
## 35 60 45 20 100 37
## come country day different done enforcement
## 55 174 36 16 24 43
## even ever every get getting give
## 55 42 49 79 23 25
## going good great group happen job
## 265 58 163 20 36 38
## just know last law let life
## 88 127 44 59 40 27
## like little long look lot love
## 79 24 36 52 44 45
## made many much must nation need
## 32 101 68 53 48 32
## never new now office one people
## 83 69 111 24 139 278
## president put really remember right safe
## 44 35 57 27 102 35
## said say see seen something special
## 83 66 48 34 25 26
## states take tell thank things think
## 65 64 50 105 40 77
## time today together totally truly understand
## 76 33 34 18 16 29
## united want way well will work
## 64 140 71 51 428 64
## world year years
## 56 47 54
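The ranked list below, showing the most frequent terms overall, was presumably produced by sorting the frequency vector, along these lines:

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)  # highest counts first
head(freq, 14)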
## will people going country great want one know
## 428 278 265 174 163 140 139 127
## america now american thank right many
## 122 111 107 105 102 101
An alternate view of term frequency: This will identify all terms that appear frequently (in this case, 50 or more times).
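A sketch using tm’s findFreqTerms():

findFreqTerms(dtm, lowfreq = 50)  # every term that appears 50 or more times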
## [1] "’re" "also" "america" "american" "back"
## [6] "believe" "can" "come" "country" "dont"
## [11] "even" "get" "going" "good" "great"
## [16] "immigration" "jobs" "just" "know" "law"
## [21] "like" "look" "make" "many" "much"
## [26] "must" "never" "new" "now" "one"
## [31] "people" "really" "right" "said" "say"
## [36] "states" "take" "tell" "thank" "theyre"
## [41] "think" "time" "united" "want" "way"
## [46] "well" "will" "work" "world" "years"
View as a table:
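The wf data frame used by the plot below can be built like this (a sketch; wf holds one row per term with its frequency):

wf <- data.frame(word = names(freq), freq = freq)
head(wf)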
Plot words that appear at least 50 times.
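Loading ggplot2 for the bar chart (the package-attach message below comes from this call):

library(ggplot2)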
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
p <- ggplot(subset(wf, freq>50), aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity", fill = "#FF6666") +
theme(axis.text.x=element_text(angle=45, hjust=1))
p
Term Correlations
If we have a term in mind that we have found to be particularly meaningful to our analysis, then we may find it helpful to identify the words that most highly correlate with that term.
If words always appear together, then correlation=1.0.
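A sketch using tm’s findAssocs(); the correlation limits below are assumptions chosen to roughly match the output shown:

findAssocs(dtm, "country", corlimit = 0.85)   # terms whose per-document counts correlate with "country"
findAssocs(dtm, "american", corlimit = 0.85)
findAssocs(dtm, "think", corlimit = 0.70)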
## $country
## nothing cities countries jobs come biggest donors
## 0.95 0.94 0.94 0.92 0.91 0.90 0.90
## second begin border plan crimes globe meant
## 0.90 0.88 0.88 0.88 0.87 0.87 0.87
## thousands means workers also despite take
## 0.87 0.86 0.86 0.85 0.85 0.85
##
## $american
## restore task fair budget cycle new promises dollars
## 0.97 0.93 0.92 0.91 0.89 0.89 0.89 0.88
## finally millions national tens foreign middle justice program
## 0.88 0.88 0.88 0.88 0.87 0.87 0.86 0.86
## break joining united
## 0.85 0.85 0.85
## $think
## really well care lot happen see good
## 0.89 0.88 0.86 0.76 0.72 0.72 0.71
Humans are generally strong at visual analytics; that is part of the reason word clouds have become so popular. What follows are a variety of alternatives for constructing word clouds with your text.
But first you will need to load the package that makes word clouds in R.
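A sketch of the load step (the wordcloud package pulls in RColorBrewer, which is what produces the message below):

library(wordcloud)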
## Loading required package: RColorBrewer
Plot the 100 most frequently used words.
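A plain, uncolored version can be drawn like this (a sketch, since the original chunk is not shown):

set.seed(142)  # fix the layout so the cloud is reproducible
wordcloud(names(freq), freq, max.words = 100)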
Add some color and plot words occurring at least 20 times.
set.seed(142)
wordcloud(names(freq), freq, min.freq=20, scale = c(4, 0.2), colors=brewer.pal(6, "Dark2"))
Plot the 100 most frequently occurring words.
set.seed(142)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freq), freq, max.words=100, scale = c(4, 0.2), colors=dark2)
To cluster terms by their similarity, we should always first remove a lot of the uninteresting or infrequent words.
dtmss <- removeSparseTerms(dtm, 0.15) # Keep only terms absent from at most 15% of documents, so the matrix is at most 15% empty space.
dtmss
## <<DocumentTermMatrix (documents: 11, terms: 43)>>
## Non-/sparse entries: 452/21
## Sparsity : 4%
## Maximal term length: 9
## Weighting : term frequency (tf)
First calculate the distance between words and then cluster them according to similarity.
library(cluster)
d <- dist(t(dtmss), method="euclidean")
fit <- hclust(d=d, method="complete") # for a different look try substituting: method="ward.D"
fit
##
## Call:
## hclust(d = d, method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 43
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=6) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=6, border="red") # draw the dendrogram with red borders around the 6 clusters