In this project exploring R Markdown and its beauty, I chose to practice text analysis in its most basic form: a word cloud. I use the historical database of the United States Presidents’ State of the Union (SOTU) addresses, from George Washington to the current resident of the White House (as of 2020). The goal of this project is to get a handle on the versatility of R Markdown and to learn some text analysis along the way.
The SOTU speech data can be found in the GitHub repository of Peter Aldhous at this link. Rather than downloading the data file, I have asked R to read it directly from the GitHub repository. The text has been pre-processed (the text-analysis way of saying data cleaning). What you will see below are Word Clouds of the words Presidents used most frequently in their SOTU addresses. I also show separate Word Clouds for the last three Presidents: Pres. Bush, Pres. Obama, and the current resident.
The three Word Clouds show how the most frequent words differ from one President to the next: Bush focuses on American, security, and Iraq; Obama on American, jobs, and work; and the President as of 2020 on America, Americans, and taxes.
sou <- read.csv("https://raw.githubusercontent.com/BuzzFeedNews/2018-01-trump-state-of-the-union/master/data/sou.csv")
sou <- sou[-233, ] ## Drop row 233, leaving the 232 addresses shown below
str(sou) ## This produces the structure of our data
## 'data.frame': 232 obs. of 5 variables:
## $ link : Factor w/ 233 levels " because you can. He heard those words. He took out a picture of his wife and their four kids. Then, he went ho"| __truncated__,..: 52 53 54 55 56 57 58 59 60 61 ...
## $ president: Factor w/ 43 levels "","Abraham Lincoln",..: 15 15 15 15 15 15 15 15 25 25 ...
## $ message : Factor w/ 55 levels "","7th Annual Message",..: 25 36 51 30 23 43 41 21 25 36 ...
## $ date : Factor w/ 233 levels "","1790-01-08",..: 2 3 4 5 6 7 8 9 10 11 ...
## $ text : Factor w/ 233 levels "","Fellow-Citizens of the Senate and House of Representatives: \"In vain may we expect peace with the Indians on o"| __truncated__,..: 10 16 2 22 30 44 13 19 78 81 ...
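The plotting code later in this post relies on two objects whose construction is not shown: `matrix`, with one row per word and one column per address (columns named after the president), and `df`, with `word` and `freq` columns. Below is a minimal base-R sketch of how objects of that shape could be built, using a toy stand-in for `sou`. The helper name `build_term_matrix`, the tokenization rule, and the tiny stop-word list are all assumptions made for illustration; the original analysis may well have used a text-mining package instead.

```r
# Toy stand-in for `sou` (assumption: only `president` and `text` are needed here).
toy <- data.frame(
  president = c("George Washington", "George Washington", "John Adams"),
  text = c("The people of the United States",
           "Congress and the people",
           "The government of the people"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case the text, replace non-letters with spaces,
# split on whitespace, drop stop words, then count each word per document.
build_term_matrix <- function(texts, doc_names,
                              stop_words = c("the", "of", "and")) {
  tokens <- lapply(texts, function(x) {
    x <- gsub("[^a-z ]", " ", tolower(as.character(x)))
    words <- strsplit(x, " +")[[1]]
    words[nzchar(words) & !(words %in% stop_words)]
  })
  vocab <- sort(unique(unlist(tokens)))
  counts <- sapply(tokens, function(w) {
    as.vector(table(factor(w, levels = vocab)))
  })
  # Force a words-by-documents matrix even when sapply() returns a vector.
  matrix(counts, nrow = length(vocab), dimnames = list(vocab, doc_names))
}

# The post calls this object `matrix`; `tdm` is used here to avoid
# reusing the name of the base function.
tdm <- build_term_matrix(toy$text, toy$president)

# Total each word across all documents, most frequent first.
word_totals <- sort(rowSums(tdm), decreasing = TRUE)
df <- data.frame(word = names(word_totals), freq = word_totals)
```

On the real data the same call would take `sou$text` and `sou$president`, and a much longer stop-word list would be needed before the cloud becomes informative.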
We see that words like government, congress, people, united, and states are used frequently across Presidents.
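library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal()
set.seed(1234) # for reproducibility
## df holds one row per word, with columns word and freq
wordcloud(words = df$word, freq = df$freq, min.freq = 10,
          max.words = 150, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"),
          scale = c(3.5, 0.25))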
Here is another package, wordcloud2, that draws a different type of Word Cloud, this time in a pentagon shape (seems apropos of the scenario!)
library(wordcloud2) # wordcloud2()
wordcloud2(data = df, size = 0.5, minSize = 10,
           shape = 'pentagon')
unique(colnames(matrix)) ## Presidents in our data
## [1] "George Washington" "John Adams"
## [3] "Thomas Jefferson" "James Madison"
## [5] "James Monroe" "John Quincy Adams"
## [7] "Andrew Jackson" "Martin van Buren"
## [9] "John Tyler" "James K. Polk"
## [11] "Zachary Taylor" "Millard Fillmore"
## [13] "Franklin Pierce" "James Buchanan"
## [15] "Abraham Lincoln" "Andrew Johnson"
## [17] "Ulysses S. Grant" "Rutherford B. Hayes"
## [19] "Chester A. Arthur" "Grover Cleveland"
## [21] "Benjamin Harrison" "William McKinley"
## [23] "Theodore Roosevelt" "William Howard Taft"
## [25] "Woodrow Wilson" "Warren G. Harding"
## [27] "Calvin Coolidge" "Herbert Hoover"
## [29] "Franklin D. Roosevelt" "Harry S. Truman"
## [31] "Dwight D. Eisenhower" "John F. Kennedy"
## [33] "Lyndon B. Johnson" "Richard Nixon"
## [35] "Gerald R. Ford" "Jimmy Carter"
## [37] "Ronald Reagan" "George Bush"
## [39] "William J. Clinton" "George W. Bush"
## [41] "Barack Obama" "Donald J. Trump"
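## Slice each president's addresses out of the matrix by column position
bush <- matrix[, 215:222]   # George W. Bush
obama <- matrix[, 223:230]  # Barack Obama
trump <- matrix[, 231:232]  # Donald J. Trump
## Total each word across a president's addresses, most frequent first
bushwords <- sort(rowSums(bush), decreasing = TRUE)
bushdf <- data.frame(word = names(bushwords), freq = bushwords)
obamawords <- sort(rowSums(obama), decreasing = TRUE)
obamadf <- data.frame(word = names(obamawords), freq = obamawords)
trumpwords <- sort(rowSums(trump), decreasing = TRUE)
trumpdf <- data.frame(word = names(trumpwords), freq = trumpwords)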
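Since the column names of `matrix` are the presidents' names (as the `unique(colnames(matrix))` output shows), the same subsets could also be taken by name rather than by hard-coded positions such as 215:222, which would silently point at the wrong addresses if the data gained or lost a row. A small sketch on a toy matrix; `columns_for` is a hypothetical helper and not part of the original analysis.

```r
# Toy term-document matrix: rows are words, columns are addresses,
# and each column is named after the president who delivered it.
tdm <- matrix(
  c(1, 1, 6, 2,
    3, 4, 0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("jobs", "security"),
                  c("George W. Bush", "George W. Bush",
                    "Barack Obama", "Barack Obama"))
)

# Hypothetical helper: keep every column whose name equals `president`.
# drop = FALSE keeps the result a matrix even if only one column matches.
columns_for <- function(m, president) {
  m[, colnames(m) == president, drop = FALSE]
}

bush_toy  <- columns_for(tdm, "George W. Bush")
obama_toy <- columns_for(tdm, "Barack Obama")

# Same kind of frequency table the post builds, without column positions.
bush_words <- sort(rowSums(bush_toy), decreasing = TRUE)
bush_df <- data.frame(word = names(bush_words), freq = bush_words)
```

The comparison is an exact match, so "George Bush" and "George W. Bush" stay separate, which matters because both names appear in the data.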
## Word Cloud: George W. Bush
set.seed(1234) # for reproducibility
wordcloud(words = bushdf$word, freq = bushdf$freq, min.freq = 30,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"),
          scale = c(3.5, 0.25))
## Word Cloud: Barack Obama
set.seed(1234) # for reproducibility
wordcloud(words = obamadf$word, freq = obamadf$freq,
          min.freq = 30,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"),
          scale = c(2.5, 0.25))
## Word Cloud: Donald J. Trump
set.seed(1234) # for reproducibility
wordcloud(words = trumpdf$word, freq = trumpdf$freq,
          min.freq = 10,
          max.words = 300, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"),
          scale = c(2.5, 0.25))
It would be interesting to create an animated Word Cloud, in GIF form, that shows how the most frequent words shift from George Washington onwards.
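One possible first step toward such an animation, sketched in base R under stated assumptions: compute the top words for each president in order of first appearance, giving one frame of data per president. The helper `top_words_by_president` and the toy matrix are hypothetical; drawing each frame and stitching the images into a GIF would need additional packages and is left out here.

```r
# Toy term-document matrix with columns named after presidents.
tdm <- matrix(
  c(2, 1, 5,
    4, 3, 1,
    0, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("government", "people", "states"),
                  c("George Washington", "George Washington", "John Adams"))
)

# Hypothetical helper: one data frame of the n most frequent words
# per president, in the order the presidents first appear.
top_words_by_president <- function(m, n = 3) {
  presidents <- unique(colnames(m))  # unique() keeps first-appearance order
  frames <- lapply(presidents, function(p) {
    totals <- rowSums(m[, colnames(m) == p, drop = FALSE])
    totals <- head(sort(totals, decreasing = TRUE), n)
    data.frame(word = names(totals), freq = as.vector(totals),
               stringsAsFactors = FALSE)
  })
  names(frames) <- presidents
  frames
}

frames <- top_words_by_president(tdm, n = 2)
```

Each element of `frames` has the `word` and `freq` columns that `wordcloud()` expects. Grouping by name merges non-consecutive terms (Grover Cleveland, for example), so a per-address or per-decade grouping might suit an animation better.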
As I made this file, I had Tchaikovsky’s Nutcracker Pas de Deux playing in the background on repeat. I never imagined that an episode of The Simpsons would make me fall in love with this orchestral magnum opus! Go check out the episode I am talking about, available on YouTube.