Introduction

In this project, which explores R Markdown and its beauty, I chose to practice text analysis in its most basic form: the word cloud. I use the historical database of United States Presidents’ State of the Union (SOTU) addresses, from George Washington all the way to the current resident of the White House (as of now). The goal is to get a handle on the versatility of R Markdown and to learn some text analysis along the way.

Setup

The data for the SOTU speeches can be found in the GitHub repository of Peter Aldhous (the URL appears in the loading code further down). Rather than downloading the data file, I have asked R to read it directly from the GitHub repository. The text has been pre-processed (the text-analysis term for data cleaning). What you will see below are Word Clouds of the words Presidents have used most frequently in their SOTU addresses. I also show separate Word Clouds for the last three Presidents: Pres. Bush, Pres. Obama, and the current resident.

Findings

The three Word Clouds show how the Presidents’ most frequent words differ: Bush focuses on American, security, and Iraq; Obama on American, jobs, and work; and the President as of 2020 on America, Americans, and taxes.

  • Have fun reading on!

Historical To Present State of The Union Address

Loading Data

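  • Here we load the dataset, which is publicly available on GitHub.
# Read the SOTU data straight from the raw GitHub URL
sou <- read.csv("https://raw.githubusercontent.com/BuzzFeedNews/2018-01-trump-state-of-the-union/master/data/sou.csv")
# Drop row 233, which looks like a malformed record rather than an address
sou <- sou[-233, ]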
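The structure output later in this document shows every column stored as a factor, including the speech text. A minimal sketch of an alternative, assuming a small inline CSV with placeholder rows in place of the real file: reading with `stringsAsFactors = FALSE` keeps the text as character strings (this is already the default from R 4.0 onwards), and filtering on an empty `president` field drops a malformed row without hard-coding its position. The names `csv_text` and `sample_sou` are invented for this illustration.

```r
# Placeholder rows standing in for sou.csv: two speeches plus one malformed row
csv_text <- 'link,president,message,date,text
http://example.org/1,George Washington,Placeholder Message,1790-01-08,"Placeholder text one"
http://example.org/2,John Adams,Placeholder Message,1800-01-01,"Placeholder text two"
" stray fragment",,,,
'

# Keep text columns as character vectors instead of factors
sample_sou <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Drop any row without a president instead of removing a fixed row number
sample_sou <- sample_sou[sample_sou$president != "", ]

nrow(sample_sou)          # rows that remain
class(sample_sou$text)    # storage type of the speech text
```

Filtering on content rather than position keeps working if the source file gains or loses rows.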

Using Inline R Code

  • The number of SOTU addresses from George Washington onwards is: 232
  • The number of columns (variables) is: 5
  • The names of the columns (variables) are: link, president, message, date, text
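The rendered values above come from inline R code, which R Markdown writes as `r expression` inside the prose. The expressions actually used are not visible in the rendered output, so the following is only a sketch of what they might look like, run against a tiny stand-in data frame (`sou_demo`, a name invented here, with placeholder values).

```r
# Tiny stand-in for the `sou` data frame: same column names, placeholder values
sou_demo <- data.frame(
  link      = c("a", "b"),
  president = c("George Washington", "John Adams"),
  message   = c("m1", "m2"),
  date      = c("d1", "d2"),
  text      = c("t1", "t2"),
  stringsAsFactors = FALSE
)

# Expressions that could sit between `r and ` in the R Markdown prose
n_addresses <- nrow(sou_demo)                         # number of addresses
n_columns   <- ncol(sou_demo)                         # number of columns
column_list <- paste(names(sou_demo), collapse = ", ") # column names as one string
```

In the .Rmd source, a bullet would then read something like: The number of columns is: `r ncol(sou)`.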

Structure/Summary Stats of Our SOTU Text Document

  • Rather than presenting summary statistics, I present the structure of the data, which is a more concise overview.
  • Since this is text data, summary statistics are impractical: the output is too large.
str(sou)  ## This produces the structure of our data
## 'data.frame':    232 obs. of  5 variables:
##  $ link     : Factor w/ 233 levels " because you can. He heard those words. He took out a picture of his wife and their four kids. Then, he went ho"| __truncated__,..: 52 53 54 55 56 57 58 59 60 61 ...
##  $ president: Factor w/ 43 levels "","Abraham Lincoln",..: 15 15 15 15 15 15 15 15 25 25 ...
##  $ message  : Factor w/ 55 levels "","7th Annual Message",..: 25 36 51 30 23 43 41 21 25 36 ...
##  $ date     : Factor w/ 233 levels "","1790-01-08",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ text     : Factor w/ 233 levels "","Fellow-Citizens of the Senate and House of Representatives: \"In vain may we expect peace with the Indians on o"| __truncated__,..: 10 16 2 22 30 44 13 19 78 81 ...

Clean the text data

  • To do any analysis, we first need to clean the data.
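The cleaning code itself is not shown in the rendered output. As a rough sketch only, and not necessarily what was done here, the usual steps (lower-casing, stripping digits and punctuation, removing stop words, then counting) can be written in base R. The speeches, the short stop-word list, and the names `tokenize`, `tdm`, and `df_demo` are all assumptions made for illustration.

```r
# Placeholder speeches; the real input would be the `text` column of `sou`
speeches <- c(
  "George Washington" = "The Congress and the People of the United States, in 1790!",
  "John Adams"        = "The people expect the Government to serve the people."
)

# A short hand-picked stop-word list (an assumption; real lists are much longer)
stop_words <- c("the", "and", "of", "in", "to", "a")

tokenize <- function(x) {
  x <- tolower(x)                    # normalise case
  x <- gsub("[^a-z ]", " ", x)       # replace digits and punctuation with spaces
  words <- strsplit(x, " +")[[1]]    # split on runs of spaces
  words[nchar(words) > 0 & !(words %in% stop_words)]
}

tokens <- lapply(speeches, tokenize)
vocab  <- sort(unique(unlist(tokens)))

# Term-by-document matrix: one row per word, one column per speech
tdm <- vapply(
  tokens,
  function(w) as.integer(table(factor(w, levels = vocab))),
  integer(length(vocab))
)
rownames(tdm) <- vocab

# Overall word frequencies in the word/freq layout used by the word cloud calls
totals  <- sort(rowSums(tdm), decreasing = TRUE)
df_demo <- data.frame(word = names(totals), freq = totals)
```

The resulting data frame has the same `word` and `freq` columns that the word cloud code further down reads from `df`.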

Generate the word cloud

  • The wordcloud package is the most classic way to generate a word cloud.
  • The following shows the general form of Word Cloud using the default package.

We see that words like government, congress, people, united, states have been uttered frequently across Presidents.

library(wordcloud)     # provides wordcloud()
library(RColorBrewer)  # provides brewer.pal()
set.seed(1234) # for reproducibility
wordcloud(words = df$word, freq = df$freq, min.freq = 10,
          max.words = 150, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"),
          scale = c(3.5, 0.25))

A fancier way to visualise a Word Cloud

This uses another package to demonstrate a different type of Word Cloud, this time in a pentagon shape (seems apropos of the scenario!).

A Pentagon-Shaped Word Cloud
library(wordcloud2)  # provides wordcloud2()
wordcloud2(data = df, size = 0.5, minSize = 10,
           shape = 'pentagon')

Word Clouds of SOTU Addresses by Bush, Obama, and the Current Resident

unique(colnames(matrix)) ## Presidents in our data
##  [1] "George Washington"     "John Adams"           
##  [3] "Thomas Jefferson"      "James Madison"        
##  [5] "James Monroe"          "John Quincy Adams"    
##  [7] "Andrew Jackson"        "Martin van Buren"     
##  [9] "John Tyler"            "James K. Polk"        
## [11] "Zachary Taylor"        "Millard Fillmore"     
## [13] "Franklin Pierce"       "James Buchanan"       
## [15] "Abraham Lincoln"       "Andrew Johnson"       
## [17] "Ulysses S. Grant"      "Rutherford B. Hayes"  
## [19] "Chester A. Arthur"     "Grover Cleveland"     
## [21] "Benjamin Harrison"     "William McKinley"     
## [23] "Theodore Roosevelt"    "William Howard Taft"  
## [25] "Woodrow Wilson"        "Warren G. Harding"    
## [27] "Calvin Coolidge"       "Herbert Hoover"       
## [29] "Franklin D. Roosevelt" "Harry S. Truman"      
## [31] "Dwight D. Eisenhower"  "John F. Kennedy"      
## [33] "Lyndon B. Johnson"     "Richard Nixon"        
## [35] "Gerald R. Ford"        "Jimmy Carter"         
## [37] "Ronald Reagan"         "George Bush"          
## [39] "William J. Clinton"    "George W. Bush"       
## [41] "Barack Obama"          "Donald J. Trump"
# Each column of `matrix` is one address; these ranges pick out each President's speeches
bush  <- matrix[, 215:222]
obama <- matrix[, 223:230]
trump <- matrix[, 231:232]

# Total each word across a President's addresses, most frequent first
bushwords <- sort(rowSums(bush), decreasing = TRUE)
bushdf <- data.frame(word = names(bushwords), freq = bushwords)

obamawords <- sort(rowSums(obama), decreasing = TRUE)
obamadf <- data.frame(word = names(obamawords), freq = obamawords)

trumpwords <- sort(rowSums(trump), decreasing = TRUE)
trumpdf <- data.frame(word = names(trumpwords), freq = trumpwords)
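The column ranges above are hard-coded, so they would silently point at the wrong speeches if the data changed. Since the column names are the Presidents' names, one alternative is to select by name. The sketch below assumes a toy matrix `m` with placeholder counts and a hypothetical helper `president_freq()`; neither is part of the original analysis.

```r
# Toy term-by-speech matrix with placeholder counts; as in the real data,
# a President's name repeats once per address
m <- matrix(
  c(3, 1, 0, 2,
    0, 1, 5, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("america", "jobs"),
    c("George W. Bush", "George W. Bush", "Barack Obama", "Barack Obama")
  )
)

# Hypothetical helper: word frequencies for one President, selected by column name
president_freq <- function(m, name) {
  cols  <- m[, colnames(m) == name, drop = FALSE]  # drop = FALSE keeps a matrix even for one speech
  words <- sort(rowSums(cols), decreasing = TRUE)  # total per word, most frequent first
  data.frame(word = names(words), freq = words)
}

bush_demo  <- president_freq(m, "George W. Bush")
obama_demo <- president_freq(m, "Barack Obama")
```

The same helper could replace the three near-identical sort/data.frame pairs, one call per President.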

Bush

set.seed(1234) # for reproducibility 
wordcloud(words = bushdf$word, freq = bushdf$freq, min.freq = 30,         
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"),
          scale=c(3.5,0.25))

Barack Obama

set.seed(1234) # for reproducibility 
wordcloud(words = obamadf$word, freq = obamadf$freq, 
          min.freq = 30,         
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"),
          scale=c(2.5,0.25))

45th President

set.seed(1234) # for reproducibility 
wordcloud(words = trumpdf$word, freq = trumpdf$freq, 
          min.freq = 10,         
          max.words=300, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"),
          scale=c(2.5,0.25))

Future Iterations

It would be interesting to create an animated Word Cloud, in GIF form, that shows how the most frequent words change from George Washington onwards.

Aside Thought (irrelevant to the above work)

As I made this file, I had Tchaikovsky’s Nutcracker Pas de Deux playing in the background on repeat. I never imagined that an episode of The Simpsons would make me fall in love with this orchestral magnum opus! Go check out the episode I am talking about, available on YouTube.