Introduction

Every four years the whole world focuses on the presidential election in the United States. The international significance of the USA long ago turned this event from a domestic affair into a worldwide one. The rivalry is especially intense because for many years it has been a fight between two political parties: the Republicans and the Democrats. Every change of the American head of state triggers speculation about its impact on the forthcoming years. Politicians and the media try to outdo each other in predictions about the marvelous or disastrous future awaiting people around the world once the new president replaces the previous one. Although only US citizens have the right to vote in these elections, rallies and displays of support for favoured candidates are organised by activists in many other countries, especially in Europe. But is all this fuss justified? Do the presidential elections in the United States really concern such significant issues, and is their outcome a matter of life and death? Do the Republicans and Democrats differ so much?

Those are very complex questions and we’re not going to argue for or against any particular answer. However, we’ve treated them as an inspiration for our project. In the following analysis we use selected text mining techniques to compare former US presidents from the two main parties, Republicans and Democrats, based on their State of the Union Addresses. Additionally, we try to predict presidents’ party affiliations from those texts. The State of the Union Address is a message delivered by the serving president to the U.S. Congress at the beginning of each year and can be seen as a fairly accurate presentation of the president’s official priorities.

Database description

The database consists of 224 State of the Union Addresses delivered by 42 former US presidents. The data is retrieved from Kaggle and can be accessed here.

Libraries used

library(readtext)
library(tm)
library(tidytext)
library(tidyverse)
library(topicmodels)
library(Rmisc)
library(SnowballC)
library(wordcloud)
library(caret)
library(reshape2)
library(e1071)
library(rpart)
library(rattle)

Data preparation

data <- readtext("data/addresses/*")

glimpse(data)
## Rows: 224
## Columns: 2
## $ doc_id <chr> "Abraham Lincoln [December 01, 1862].txt", "Abraham Lincoln ...
## $ text   <chr> "\n Fellow-Citizens of the Senate and House of Representativ...

We’ll start by extracting the president names from doc_id.

data$pres <- sub(" \\[.*", "", data$doc_id)   # strip the bracketed date and the file extension

unique(data$pres)
##  [1] "Abraham Lincoln"       "Andrew Jackson"        "Andrew Johnson"       
##  [4] "Barack Obama"          "Benjamin Harrison"     "Calvin Coolidge"      
##  [7] "Chester A. Arthur"     "Donald J. Trump"       "Dwight D. Eisenhower" 
## [10] "Franklin D. Roosevelt" "Franklin Pierce"       "George Bush"          
## [13] "George W. Bush"        "George Washington"     "Gerald R. Ford"       
## [16] "Grover Cleveland"      "Harry S. Truman"       "Herbert Hoover"       
## [19] "James Buchanan"        "James K. Polk"         "James Madison"        
## [22] "James Monroe"          "Jimmy Carter"          "John Adams"           
## [25] "John F. Kennedy"       "John Quincy Adams"     "John Tyler"           
## [28] "Lyndon B. Johnson"     "Martin van Buren"      "Millard Fillmore"     
## [31] "Richard Nixon"         "Ronald Reagan"         "Rutherford B. Hayes"  
## [34] "Theodore Roosevelt"    "Thomas Jefferson"      "Ulysses S. Grant"     
## [37] "Warren G. Harding"     "William Howard Taft"   "William J. Clinton"   
## [40] "William McKinley"      "Woodrow Wilson"        "Zachary Taylor"
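Should we ever need the delivery dates as well, the bracketed part of doc_id could be extracted in a similar way (an illustrative base R snippet; the resulting variable is not used anywhere in this analysis):

dates <- sub(".*\\[(.*)\\].*", "\\1", data$doc_id)   # e.g. "December 01, 1862"
head(dates)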

Having a proper variable with names, we can prepare another variable with party affiliation. Since this information is not included in the data, we need to add it manually. Later we will focus only on the two major US parties: the Republican Party and the Democratic Party. For the few presidents who came from other parties or were unaffiliated, we will leave the variable as NA.

rep_pres <- c("Abraham Lincoln", "Benjamin Harrison", "Calvin Coolidge", "Chester A. Arthur",
              "Donald J. Trump", "Dwight D. Eisenhower", "George W. Bush", "George Bush",
              "Gerald R. Ford", "Herbert Hoover", "Richard Nixon", "Ronald Reagan",
              "Rutherford B. Hayes", "Theodore Roosevelt", "Ulysses S. Grant",
              "Warren G. Harding", "William Howard Taft", "William McKinley")
dem_pres <- c("Andrew Jackson", "Andrew Johnson", "Barack Obama", "Franklin D. Roosevelt",
              "Franklin Pierce", "Grover Cleveland", "Harry S. Truman", "James Buchanan",
              "James K. Polk", "Jimmy Carter", "John F. Kennedy", "Lyndon B. Johnson",
              "Martin van Buren", "William J. Clinton", "Woodrow Wilson")

data$party <- ifelse(data$pres %in% rep_pres, "rep",
                     ifelse(data$pres %in% dem_pres, "dem", NA))


table(data$party)
## 
## dem rep 
##  92  85
sum(is.na(data$party))
## [1] 47

As a result, we end up with 92 texts from presidents affiliated with the Democratic Party and 85 texts from presidents affiliated with the Republican Party, so in this respect the data is well balanced. The remaining 47 texts come from the other presidents, mostly from the years before the Republican Party was founded in 1854. We will omit them in further analysis.

data <- na.omit(data)
head(data)
## readtext object consisting of 6 documents and 2 docvars.
## # Description: df[,4] [6 x 4]
##   doc_id                                text                pres           party
## * <chr>                                 <chr>               <chr>          <chr>
## 1 Abraham Lincoln [December 01, 1862].~ "\"\n Fellow-C\"..~ Abraham Linco~ rep  
## 2 Abraham Lincoln [December 03, 1861].~ "\"\nFellow-Ci\"..~ Abraham Linco~ rep  
## 3 Abraham Lincoln [December 06, 1864].~ "\"\nFellow-Ci\"..~ Abraham Linco~ rep  
## 4 Abraham Lincoln [December 08, 1863].~ "\"\nFellow-Ci\"..~ Abraham Linco~ rep  
## 5 Andrew Jackson [December 01, 1834].t~ "\"\nFellow Ci\"..~ Andrew Jackson dem  
## 6 Andrew Jackson [December 03, 1833].t~ "\"\nFellow Ci\"..~ Andrew Jackson dem

Term frequency

At the beginning we’ll check the most frequent words in the texts. To do this we need to build a term-document matrix.

docs <- Corpus(VectorSource(data$text))

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)   # total count of each term across all documents
d <- data.frame(word = names(v), freq = v)
head(d, 20)
##        word   freq
## the     the 138761
## and     and  57729
## that   that  19931
## for     for  18560
## our     our  16713
## this   this  11689
## have   have  11110
## with   with  10835
## which which  10375
## will   will   9105
## are     are   8647
## has     has   8271
## not     not   7931
## been   been   7520
## their their   6999
## from   from   6706
## all     all   5838
## its     its   5251
## but     but   5220
## was     was   5078

Unsurprisingly, the most frequent terms are the least informative ones. We will deal with them later.

frequent <- findFreqTerms(dtm, 200)
length(frequent)
## [1] 867

State of the Union Addresses are pretty long documents, so although we are left with only 177 texts, a huge number of words appears at least 200 times.
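If we ever wanted to shrink the matrix, rare terms could also be dropped by sparsity rather than by raw counts (an alternative shown purely for illustration, not applied here):

dtm_small <- removeSparseTerms(dtm, 0.95)   # keep terms present in at least ~5% of documents
dim(dtm_small)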

Document cleaning

As we’ve just seen, some cleaning is needed in order to obtain meaningful results. The texts are certainly well written and neat, but a few standard steps will bring them into a form much better suited to text analysis. We will get rid of dashes, extra whitespace and punctuation, and also of numbers, which appear often in the State of the Union Addresses but aren’t very informative from our point of view. Next, we’ll convert all capital letters to lower case and remove stopwords, as mentioned before. Apart from the default words provided by the stopwords() function, we’ll remove words which are part of the opening expressions of State of the Union Addresses and therefore don’t characterise particular texts and are not distinctive. At the end of the process we’ll perform stemming.

docs <- Corpus(VectorSource(data$text))

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))   # replace a pattern with a space

docs <- tm_map(docs, toSpace, "-") %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removeWords, c("fellow", "citizens", "senate", "house", "representatives", "speaker", "president", "members", "congress", "vice", "gentlemen")) %>%
  tm_map(stemDocument)
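To see what the stemming step actually does to individual words, we can call SnowballC’s wordStem directly (tm’s stemDocument relies on it under the hood):

wordStem(c("governing", "nations", "countries"))
## [1] "govern"  "nation"  "countri"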


dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 20)
##              word freq
## will         will 9544
## state       state 7797
## govern     govern 7618
## year         year 6641
## nation     nation 5884
## unit         unit 4313
## can           can 4240
## peopl       peopl 4212
## countri   countri 4116
## upon         upon 3651
## american american 3617
## law           law 3317
## must         must 3237
## new           new 3199
## time         time 3192
## great       great 3184
## now           now 3160
## work         work 2995
## public     public 2934
## made         made 2909

The list of the most popular words now looks much better and reflects the socio-political character of the documents.

Topic modelling

Now we’ll perform topic modelling with the LDA (Latent Dirichlet Allocation) method. We’ll look at each party separately and try to find out the main issues brought up in the State of the Union Addresses.

rep_idx <- which(data$party == 'rep')
dem_idx <- which(data$party == 'dem')

docs_rep <- docs[rep_idx]
docs_dem <- docs[dem_idx]

Republican Party

We will use a slightly modified version of a function provided during classes, which creates a document-term matrix from the supplied corpus, fits LDA for a selected number of topics and plots the most probable terms for each topic. We’ll start with the Republican part of the corpus and plot three main topics.

top_terms_by_topic_LDA <- function(corpus,
                                   plot = TRUE,
                                   number_of_topics = 4) {
  DTM <- DocumentTermMatrix(corpus)

  # drop empty documents, which LDA() cannot handle
  unique_indexes <- unique(DTM$i)
  DTM <- DTM[unique_indexes, ]

  # fit the LDA model and extract the per-topic word probabilities (beta)
  lda <- LDA(DTM, k = number_of_topics, control = list(seed = 1234))
  topics <- tidy(lda, matrix = "beta")

  # keep the 10 most probable terms for each topic
  top_terms <- topics %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

  if (plot) {
    top_terms %>%
      mutate(term = reorder(term, beta)) %>%
      ggplot(aes(term, beta, fill = factor(topic))) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ topic, scales = "free") +
      labs(x = NULL, y = "Beta") +
      coord_flip()
  } else {
    return(top_terms)
  }
}

top_terms_by_topic_LDA(docs_rep, number_of_topics = 3)

It seems that some of the most common words in the corpus dominate every topic and appear frequently in each of them. We should try removing these words in order to obtain more diversified results.

docs_rep <- tm_map(docs_rep, removeWords, c("will", "govern", "year", "nation", "state"))

top_terms_by_topic_LDA(docs_rep, number_of_topics = 3)

Now the result is much better. The words are still pretty similar across topics, but we can distinguish direct references to America and Americans in the 3rd topic, and modal verbs like must and can alongside stemmed words such as develop and increas in the 1st topic. However, it’s hard to say there is anything special about those topics. Only the references to America and greatness expose the sublime style with which Americans are associated by many.

Democratic Party

It’s time to check the Democratic Party. We’ll use the same function as before.

top_terms_by_topic_LDA(docs_dem, number_of_topics = 3)

Again, the dominant words are the ones we removed in the Republican part, so we’ll do the same here.

docs_dem <- tm_map(docs_dem, removeWords, c("will", "govern", "year", "nation", "state"))


top_terms_by_topic_LDA(docs_dem, number_of_topics = 3)

The results are not identical, but we can’t say that the difference is striking. Again, one of the topics concentrates on the words America and Americans, so this is nothing restricted to the Republican Party. The 2nd topic here looks like an equivalent of the 1st one from the Republican part, although the occurrence of the word war in it is quite interesting. Apart from that, it is hard to distinguish anything special about those topics that would help, for example, to match the words with the party affiliation of their authors.

Sentiment analysis

We haven’t found anything distinctive about the topics raised in the State of the Union Addresses, so we will move on to the emotional character of the words used. This time we have to tokenize our data; since documents from presidents unaffiliated with the Republicans or Democrats have already been removed, we only need to apply cleaning similar to the previous one to the tokenized data.

data_tidy <- unnest_tokens(data, word, text)

custom_stop_words <- tribble(
  ~word, ~lexicon,
  "fellow", "CUSTOM",
  "citizens", "CUSTOM",
  "senate", "CUSTOM",
  "house", "CUSTOM",
  "representatives", "CUSTOM",
  "speaker", "CUSTOM",
  "president", "CUSTOM",
  "members", "CUSTOM",
  "congress", "CUSTOM",
  "vice", "CUSTOM",
  "gentlemen", "CUSTOM",
)


custom_stop_words_add <- tribble(
  ~word, ~lexicon,
  "will", "CUSTOM",
  "govern", "CUSTOM",
  "year", "CUSTOM",
  "nation", "CUSTOM",
  "state", "CUSTOM"
)

stop_words2 <- stop_words %>%
  bind_rows(custom_stop_words)


data_tidy_cleaned <- data_tidy %>%
  anti_join(stop_words2) %>%            # drop standard and opening-formula stopwords
  mutate(word = wordStem(word)) %>%     # stem the remaining tokens
  anti_join(custom_stop_words_add)      # drop the dominant stems identified during topic modelling
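As a quick sanity check, we can list the most common tokens that survived the cleaning; the top of this list should resemble the earlier stemmed frequency table, minus the five dominant stems we just dropped:

data_tidy_cleaned %>%
  dplyr::count(word, sort = TRUE) %>%
  head(10)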

Positivity/negativity - AFINN lexicon

sen_afinn <- data_tidy_cleaned %>% 
  inner_join(get_sentiments("afinn")) %>% 
  dplyr::group_by(party) %>% 
  dplyr::summarise(sentiment = sum(value),
                   avg_sentiment = sentiment/n()) %>% 
  mutate(method = "AFINN")

sen_afinn
## # A tibble: 2 x 4
##   party sentiment avg_sentiment method
##   <chr>     <dbl>         <dbl> <chr> 
## 1 dem        5646         0.200 AFINN 
## 2 rep        7369         0.278 AFINN

The AFINN lexicon assigns each word a score between -5 and +5. From the summarized results we can see that Republicans obtained a slightly higher overall score, despite the slightly lower number of texts in our set. On average, the words they use are also rated more positively.
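For illustration, the raw AFINN scores of individual words can be looked up directly in the lexicon, e.g. for two words that feature prominently in the counts below:

get_sentiments("afinn") %>%
  filter(word %in% c("war", "hope"))   # war is scored -2, hope +2 (cf. the tables below)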

afinn_counts <- data_tidy_cleaned %>%
  inner_join(get_sentiments("afinn")) %>%
  dplyr::group_by(party) %>% 
  dplyr::count(word, value, sort = TRUE) %>%
  ungroup()

# Democrats:
afinn_counts[afinn_counts$party == 'dem',]
## # A tibble: 454 x 4
##    party word      value     n
##    <chr> <chr>     <dbl> <int>
##  1 dem   war          -2  1595
##  2 dem   recommend     2   809
##  3 dem   protect       1   800
##  4 dem   support       2   695
##  5 dem   hope          2   622
##  6 dem   free          1   605
##  7 dem   care          2   549
##  8 dem   progress      2   513
##  9 dem   demand       -1   430
## 10 dem   prevent      -1   415
## # ... with 444 more rows
# Republicans:
afinn_counts[afinn_counts$party == 'rep',]
## # A tibble: 449 x 4
##    party word      value     n
##    <chr> <chr>     <dbl> <int>
##  1 rep   war          -2  1020
##  2 rep   recommend     2  1002
##  3 rep   protect       1   760
##  4 rep   progress      2   601
##  5 rep   free          1   592
##  6 rep   hope          2   576
##  7 rep   matter        1   478
##  8 rep   support       2   462
##  9 rep   care          2   452
## 10 rep   benefit       2   431
## # ... with 439 more rows

A closer look at the most frequent emotionally charged words emphasizes once again how similar the documents of the presidents from both parties are. The top three words are even placed in the same order, and the majority of the common words are the same. This time war is the absolute “winner” in both cases, although Democrats used it more frequently. Nevertheless, we need to keep in mind that presidents from the respective parties were heads of state at different times, so there are no firm conclusions we could draw from this word’s occurrence without a more detailed examination of the circumstances.

The plots below show the distribution of words with different sentiment values. The shape is very similar for the two parties; the only noticeable differences are for words with value -2 (more frequent among Democrats) and words with value -1 (more frequent among Republicans). Surprisingly, the most neutral words, those scored -1 and +1, are not the most popular, especially the former.

ggplot(afinn_counts, aes(x=as.factor(value), y=n, fill = party)) + 
  geom_col() +
  scale_fill_manual(values=c("blue", "red")) +
  labs(
    title = "Distribution of words with different sentiment values",
    x = "values",
    y = "Number of words"
  ) +
  facet_wrap(~party)

Four types of emotions: NRC lexicon

nrc <- data_tidy_cleaned %>%
  inner_join(get_sentiments("nrc"))

nrc %>%
  dplyr::count(sentiment) %>%
  arrange(desc(n))
## readtext object consisting of 10 documents and 0 docvars.
## # Description: df[,3] [10 x 3]
##   sentiment        n text     
##   <chr>        <int> <chr>    
## 1 positive     72537 "\"\"..."
## 2 trust        48189 "\"\"..."
## 3 anticipation 36718 "\"\"..."
## 4 negative     34990 "\"\"..."
## 5 joy          20716 "\"\"..."
## 6 fear         19223 "\"\"..."
## # ... with 4 more rows

We will also take a look at the NRC lexicon, which assigns words to many different emotional categories. From the result above we can see that the main categories present in the documents are positive, trust, anticipation and negative. We will focus on those four.
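For reference, the complete set of categories can be pulled straight from the lexicon; NRC distinguishes ten of them:

get_sentiments("nrc") %>%
  dplyr::distinct(sentiment)
# anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust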

nrc2 <- nrc %>%
  filter(sentiment %in% c("positive", "trust", 'anticipation', 'negative'))

nrc_counts <- nrc2 %>%
  dplyr::count(word, sentiment) %>%
  dplyr::group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(
    word2 = fct_reorder(word, n)
  )

ggplot(nrc_counts, aes(x=word2, y=n, fill = sentiment)) + 
  geom_col(show.legend=FALSE) +
  scale_fill_manual(values=c("red", "orange", "green", "blue")) +
  facet_wrap(~sentiment, scales="free") +
  coord_flip() +
  labs(
    title = "Sentiment Word Counts",
    x = "Words"
  )

In every category there are one or two words that are almost twice as frequent as the runner-up. Anticipation is led by time and public, negative by the already discussed war, positive is dominated by public, and trust by law.

nrc_counts_rep <- nrc2[nrc2$party == 'rep',] %>%
  dplyr::count(word, sentiment) %>%
  dplyr::group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(
    word2 = fct_reorder(word, n)
  )

ggplot(nrc_counts_rep, aes(x=word2, y=n, fill = sentiment)) + 
  geom_col(show.legend=FALSE) +
  scale_fill_manual(values=c("red", "orange", "green", "blue")) +
  facet_wrap(~sentiment, scales="free") +
  coord_flip() +
  labs(
    title = "Sentiment Word Counts - Republicans",
    x = "Words"
  )

The words and their order in the Republican subgroup are similar, and the same words top every category, but in all categories except trust the dominance of the most frequent word over the second one is less spectacular.

nrc_counts_dem <- nrc2[nrc2$party == 'dem',] %>%
  dplyr::count(word, sentiment) %>%
  dplyr::group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(
    word2 = fct_reorder(word, n)
  )

ggplot(nrc_counts_dem, aes(x=word2, y=n, fill = sentiment)) + 
  geom_col(show.legend=FALSE) +
  scale_fill_manual(values=c("red", "orange", "green", "blue")) +
  facet_wrap(~sentiment, scales="free") +
  coord_flip() +
  labs(
    title = "Sentiment Word Counts - Democrats",
    x = "Words"
  )

It is safe to say that, whether we look at all the documents or at a single party, the general pattern is always the same. Finally, we can also plot word clouds, for Republicans and Democrats respectively, to visualise the slight nuances in word usage across the four main sentiments.

# republican presidents
nrc2[nrc2$party == 'rep',] %>%
  dplyr::count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "orange", "green", "blue"),
                   max.words = 100)

# democratic presidents
nrc2[nrc2$party == 'dem',] %>%
  dplyr::count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "orange", "green", "blue"),
                   max.words = 100)

Classification

Preparation

In the final part of our analysis we will perform a simple text classification and try to predict party affiliation based on document content. First we need to split the data into training and testing samples.

set.seed(123456789)
training_obs <- createDataPartition(data$pres,   # stratify the split by president
                                    p = 0.7, 
                                    list = FALSE)

train0 <- data[training_obs,]
test0  <- data[-training_obs,]

train <- docs[training_obs]
test  <- docs[-training_obs]

Now we will limit the vocabulary to frequent terms, based on the training sample.

data_train <- DocumentTermMatrix(train, list(bounds = list(global = c(2, Inf)), dictionary = frequent))
dict <- findFreqTerms(data_train, lowfreq = 50)

data_train <- DocumentTermMatrix(train, list(dictionary=dict))

data_test <- DocumentTermMatrix(test, list(dictionary=dict))

Finally, we transform our data into a binary present/absent form, suitable for modelling.

convert_counts <- function(x) {
  ifelse(x > 0, 1, 0)   # 1 if the word appears in the document, 0 otherwise
}

data_train <- data_train %>% apply(MARGIN=2, FUN=convert_counts)
data_test <- data_test %>% apply(MARGIN=2, FUN=convert_counts)


data_train <- as.data.frame(data_train)
data_test <- as.data.frame(data_test)

data_train1 <- cbind(label=factor(train0$party), data_train)
data_test1 <- cbind(label=factor(test0$party), data_test)

data_train1 <- as.data.frame(data_train1)
data_test1 <- as.data.frame(data_test1)

str(data_train1)
## 'data.frame':    136 obs. of  356 variables:
##  $ label     : Factor w/ 2 levels "dem","rep": 2 2 2 1 1 1 1 1 1 1 ...
##  $ account   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ across    : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ act       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ action    : num  1 1 1 1 1 1 0 1 1 1 ...
##  $ actual    : num  1 0 1 1 0 0 1 1 1 1 ...
##  $ agreement : num  1 0 1 0 0 0 1 0 1 1 ...
##  $ aid       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ along     : num  1 0 1 0 0 0 0 0 0 0 ...
##  $ also      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ america   : num  1 0 1 1 1 0 1 0 1 1 ...
##  $ american  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ among     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ amount    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ annual    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ ask       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ back      : num  1 1 1 0 0 0 0 1 0 0 ...
##  $ bank      : num  1 0 1 1 1 1 1 1 1 0 ...
##  $ benefit   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ best      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ better    : num  1 1 0 1 1 1 1 1 1 0 ...
##  $ beyond    : num  1 1 1 1 1 1 1 1 0 0 ...
##  $ bring     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ britain   : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ british   : num  1 1 1 1 1 1 1 0 0 1 ...
##  $ brought   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ build     : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ call      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ can       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ canal     : num  1 0 0 0 0 1 1 0 0 0 ...
##  $ care      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ case      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cent      : num  1 1 1 1 0 0 0 0 0 1 ...
##  $ certain   : num  1 1 1 1 1 1 1 1 1 0 ...
##  $ chief     : num  1 1 1 0 0 1 1 1 1 0 ...
##  $ children  : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ china     : num  1 1 1 0 0 0 1 0 0 0 ...
##  $ civil     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ class     : num  1 1 1 0 0 1 1 1 1 1 ...
##  $ clear     : num  1 1 1 1 1 1 0 1 1 1 ...
##  $ close     : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ come      : num  1 1 1 1 1 1 0 1 1 1 ...
##  $ commission: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ common    : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ conduct   : num  1 1 1 1 0 1 1 1 1 0 ...
##  $ consider  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cost      : num  1 1 0 0 1 0 0 0 0 1 ...
##  $ credit    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cut       : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ day       : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ deal      : num  1 0 1 1 1 0 1 0 0 0 ...
##  $ debt      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ demand    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ develop   : num  1 1 1 1 0 1 0 1 0 1 ...
##  $ difficult : num  1 1 1 1 0 1 0 1 1 0 ...
##  $ direct    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ district  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ done      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ doubt     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ due       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ effect    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ either    : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ end       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ equal     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ establish : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ european  : num  1 0 1 1 1 0 0 1 0 1 ...
##  $ even      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ ever      : num  1 1 1 1 1 1 1 1 1 0 ...
##  $ except    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ extend    : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ extent    : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ face      : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ fact      : num  1 1 1 1 1 1 0 1 1 1 ...
##  $ faith     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ far       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ favor     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ field     : num  1 1 1 1 1 0 0 1 0 1 ...
##  $ find      : num  1 1 0 1 1 1 1 1 1 0 ...
##  $ first     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ fiscal    : num  1 1 1 1 0 0 0 1 1 1 ...
##  $ foreign   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ form      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ former    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ found     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ free      : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ freedom   : num  1 0 1 0 1 1 0 1 1 0 ...
##  $ full      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ general   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ give      : num  1 1 1 1 1 1 1 1 1 0 ...
##  $ good      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ great     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ greater   : num  1 1 1 1 1 1 1 0 1 1 ...
##  $ half      : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ health    : num  1 1 1 1 0 1 1 1 1 0 ...
##  $ high      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ hold      : num  1 0 0 0 1 1 1 1 1 1 ...
##  $ home      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ honor     : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ hope      : num  1 1 1 1 1 1 1 1 1 1 ...
##   [list output truncated]

We have only 136 documents and, even after the reduction, 356 variables (355 word indicators plus the label). This makes our case rather high-dimensional. We will use a classification tree and an SVM, as those methods should deal pretty smoothly with data of this kind.

Classification Tree

model_tree <- 
  rpart(label~.,
        data = data_train1,
        method = "class")

fancyRpartPlot(model_tree, cex = 0.6, palettes = 'OrRd')

model_tree_pred  <- predict(model_tree,  data_test1, type = 'class')
summary(model_tree_pred)
## dem rep 
##  19  22
confusionMatrix(model_tree_pred, data_test1$label, positive="rep")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction dem rep
##        dem  12   7
##        rep  10  12
##                                           
##                Accuracy : 0.5854          
##                  95% CI : (0.4211, 0.7368)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : 0.3206          
##                                           
##                   Kappa : 0.1751          
##                                           
##  Mcnemar's Test P-Value : 0.6276          
##                                           
##             Sensitivity : 0.6316          
##             Specificity : 0.5455          
##          Pos Pred Value : 0.5455          
##          Neg Pred Value : 0.6316          
##              Prevalence : 0.4634          
##          Detection Rate : 0.2927          
##    Detection Prevalence : 0.5366          
##       Balanced Accuracy : 0.5885          
##                                           
##        'Positive' Class : rep             
## 

The results of the simple classification tree with default parameters are disappointing. Mediocre values of accuracy, sensitivity and specificity show that we didn’t manage to predict the party efficiently. Curiously, our model also assigns more texts to the Republican class, even though it is the less numerous one.

barplot(rev(model_tree$variable.importance),
        col = "#E15D44",
        main = "Importance of variables",
        horiz = TRUE,
        las = 1,
        cex.names = 0.6)

We can also check the variable importance in this model. It could be another way of directing our attention to particular words that might distinguish the two parties, but its significance is debatable given the imperfect results.
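Before moving on to the SVM, one natural follow-up (sketched here but not run; the settings are our assumptions, not part of the original analysis) would be to tune the tree’s complexity parameter with cross-validation via caret, which is already loaded, instead of relying on rpart’s defaults:

set.seed(123456789)
cv_ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
model_tree_cv <- train(label ~ .,
                       data = data_train1,
                       method = "rpart",
                       trControl = cv_ctrl,
                       tuneLength = 10)              # evaluate 10 candidate cp values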

SVM

model_svm <- svm(label~., data=data_train1)

summary(model_svm)
## 
## Call:
## svm(formula = label ~ ., data = data_train1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  135
## 
##  ( 66 69 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  dem rep
model_svm_pred <- predict(model_svm, data_test1)

summary(model_svm_pred)
## dem rep 
##  27  14

We obtain 27 democratic and 14 republican texts from our predictions.

confusionMatrix(model_svm_pred, data_test1$label, positive="rep")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction dem rep
##        dem  18   9
##        rep   4  10
##                                           
##                Accuracy : 0.6829          
##                  95% CI : (0.5191, 0.8192)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : 0.04115         
##                                           
##                   Kappa : 0.3508          
##                                           
##  Mcnemar's Test P-Value : 0.26726         
##                                           
##             Sensitivity : 0.5263          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7143          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.4634          
##          Detection Rate : 0.2439          
##    Detection Prevalence : 0.3415          
##       Balanced Accuracy : 0.6722          
##                                           
##        'Positive' Class : rep             
## 

The results aren’t stunning, but they aren’t that disappointing either. The accuracy is acceptable and the specificity is good, but the sensitivity is far below expectations. Overall, it’s quite satisfactory for such a simple model, especially considering all the previous steps of our analysis and the difficulty of finding any features which could help determine the party affiliation.
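Similarly, the SVM above was fitted with default cost and gamma; a small grid search with e1071’s tune.svm (a sketch with arbitrarily chosen grids, not run as part of this analysis) could be a reasonable next step, especially for improving the weak sensitivity:

set.seed(123456789)
svm_grid <- tune.svm(label ~ .,
                     data = data_train1,
                     cost = c(0.1, 1, 10),
                     gamma = c(0.001, 0.01, 0.1))   # 10-fold cross-validation by default
summary(svm_grid)
model_svm_tuned <- svm_grid$best.model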

Final remarks

It is not our role to judge the uniqueness of US presidents, and we are aware of how much happens far away from official, formal messages like the annual State of the Union Addresses. Surprising or not, our analysis leaves the impression that, more often than not, the most important representatives of the two main, long-competing political parties of the USA speak with one voice. Nonetheless, it is still possible, to some extent, to distinguish party affiliation based on characteristic words hidden in the State of the Union Addresses.