“January 20th, 2017, will be remembered as the day the people became the rulers of this nation again,” said Donald Trump. It was an unforgettable day for the American people.
With this in mind, I decided to analyze Donald Trump's inauguration speech, and R seemed a natural tool for the job. I use text mining, a branch of data mining, to create a word cloud that summarizes the main words spoken by the 45th President of the United States. Text mining is one of the more interesting techniques for exploring data.
#Loading packages
library(RColorBrewer) #color palettes
library(tm) #package text mining (tm)
library(wordcloud)
library(DT) #interactive tables
First of all, I load the data (the speech), which was stored in a text file.
text<-readLines("C:/text.txt",encoding ='UTF-8')
text[1:10]
## [1] "<U+FEFF>Chief Justice Roberts, President Carter, President Clinton, President Bush, fellow Americans and people of the world thank you."
## [2] ""
## [3] "We the citizens of America have now joined a great national effort to rebuild our county and restore its promise for all our people. "
## [4] ""
## [5] "Together we will determine the course of America for many, many years to come."
## [6] ""
## [7] "Together we will face challenges. We will confront hardships. But we will get the job done."
## [8] ""
## [9] "Every four years we gather on these steps to carry out the orderly and peaceful transfer of power."
## [10] ""
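The `<U+FEFF>` at the start of the first line is a byte order mark (BOM) left over from how the file was saved. As a minimal sketch, assuming the file is UTF-8 with a BOM, reading through a connection opened with the "UTF-8-BOM" encoding drops that mark. The temporary file and its one-line content are made up for illustration:

```r
# Build a small UTF-8 file that starts with a BOM (bytes EF BB BF).
# The path and content are hypothetical stand-ins for the speech file.
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("Chief Justice Roberts\n")), tmp)

# Reading through a connection with encoding "UTF-8-BOM" strips the mark.
con <- file(tmp, encoding = "UTF-8-BOM")
clean <- readLines(con)
close(con)

print(clean)
```

The same pattern should apply to the speech file by replacing `tmp` with its path.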
The next step is to clean the data.
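One detail to keep in mind for the cleaning code: `removeWords()` matches words case-sensitively and the English stop word list is lowercase, so a capitalized word such as "We" is only caught once the text has been lowercased. The sketch below illustrates the effect in base R. Here `drop_words()` and `stop_list` are hypothetical stand-ins, not part of tm:

```r
# Hypothetical short stop list; tm's stopwords("en") is much longer.
stop_list <- c("we", "the", "will", "and")

# Case-sensitive whole-word removal, similar in spirit to tm::removeWords().
drop_words <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

line <- "We will get the job done."

# Stop words removed first: the capitalized "We" does not match and stays.
as_written <- drop_words(line, stop_list)

# Lowercasing first: "we" now matches and is removed as well.
lower_first <- drop_words(tolower(line), stop_list)

print(as_written)
print(lower_first)
```

Lowercasing before the first stop word pass is therefore the safer order. The results shown in this post come from the order used in the code here.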
#Removing the stopwords
text<-removeWords(text,stopwords("en"))
#Removing the punctuations
text<-removePunctuation(text)
#Removing the empty lines
text<-text[text!=""]
#Making all text lowercase
text<-tolower(text)
#Choosing the words to be removed
text<-removeWords(text,c("the","there","this","'ve","it's","their","and"))
A corpus is a collection of text documents; here each remaining line of the speech becomes one document.
doc<-Corpus(VectorSource(text))
The term-document matrix is a table containing the frequency of each word in each document of the speech.
#Building a matrix of words
(tdm<-TermDocumentMatrix(doc))
## <<TermDocumentMatrix (terms: 425, documents: 61)>>
## Non-/sparse entries: 609/25316
## Sparsity : 98%
## Maximal term length: 14
## Weighting : term frequency (tf)
dim(tdm)
## [1] 425 61
We can conclude that the text contains 425 distinct terms and 61 documents, one per paragraph.
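To make the shape of such a matrix concrete, here is a minimal base R sketch on three made-up documents. It is not how tm builds the matrix internally, and tm applies extra defaults (for example a minimum word length), so the counts are only illustrative. The names `docs` and `tdm_toy` are hypothetical:

```r
# Three toy "documents" (invented for illustration, not taken from the speech).
docs <- c("we will rebuild", "we will restore", "america first")

# Split each document into words and collect the distinct terms.
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# One row per term, one column per document, cells hold term counts.
tdm_toy <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
colnames(tdm_toy) <- paste0("doc", seq_along(docs))

# Summing each row gives the overall frequency of every term.
freq <- sort(rowSums(tdm_toy), decreasing = TRUE)

print(tdm_toy)
print(freq)
```

The row sums play the same role as `rowSums(m)` in the code that follows: they collapse the per-document counts into one frequency per word.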
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
Next, I display every word in the speech in an attractive way using the DT package. The result is a table that shows each word and its frequency.
d <- data.frame(word = names(v),freq=v)
datatable(d,class='compact',options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
"}")
))
Creating the word cloud
#The "Paired" palette provides at most 12 colors
wordcloud(words = d$word, freq = d$freq, min.freq = 1,random.order=FALSE,max.words=200,
rot.per=0.35,colors=brewer.pal(12, "Paired"))
Note: Feel free to ask me about anything that is unclear!