Every four years the whole world focuses on the presidential election in the United States. The international significance of the USA long ago turned this event from a domestic affair into a worldwide one. The rivalry is especially intense because for many years it has been a fight between two political parties: Republicans and Democrats. Every change of the American head of state triggers speculation about its impact on the years to come. Politicians and the media try to outdo each other in predictions about the marvelous or disastrous future awaiting people around the world once the new president replaces the previous one. Although only US citizens have the right to vote in these elections, demonstrations and displays of support for favoured candidates are organised by activists in many other countries, especially in Europe. But is all this attention justified? Are the presidential elections in the United States really such a significant issue, and is their outcome a matter of life and death? Do the Republicans and Democrats differ that much?
Those are very complex questions and we’re not going to argue for or against any of these positions. However, we’ve treated them as an inspiration for our project. In the following analysis we use selected text mining techniques to compare previous US presidents from the two main parties, Republicans and Democrats, based on their State of the Union Addresses. Additionally, we try to predict presidents’ party affiliations from those texts. The State of the Union Address is a message delivered by the serving president to the U.S. Congress, traditionally at the beginning of each year, and can be seen as a fairly accurate presentation of the president’s official priorities.
The dataset consists of 224 State of the Union Addresses delivered by 42 former US presidents. The data was retrieved from Kaggle and can be accessed here.
library(readtext)
library(tm)
library(tidytext)
library(tidyverse)
library(topicmodels)
library(Rmisc)
library(SnowballC)
library(wordcloud)
library(caret)
library(reshape2)
library(e1071)
library(rpart)
library(rattle)
data <- readtext("data/addresses/*")
glimpse(data)
## Rows: 224
## Columns: 2
## $ doc_id <chr> "Abraham Lincoln [December 01, 1862].txt", "Abraham Lincoln ...
## $ text <chr> "\n Fellow-Citizens of the Senate and House of Representativ...
We’ll start by extracting the president names from doc_id.
data$pres <- sub(" \\[.*", "", data$doc_id)
unique(data$pres)
## [1] "Abraham Lincoln" "Andrew Jackson" "Andrew Johnson"
## [4] "Barack Obama" "Benjamin Harrison" "Calvin Coolidge"
## [7] "Chester A. Arthur" "Donald J. Trump" "Dwight D. Eisenhower"
## [10] "Franklin D. Roosevelt" "Franklin Pierce" "George Bush"
## [13] "George W. Bush" "George Washington" "Gerald R. Ford"
## [16] "Grover Cleveland" "Harry S. Truman" "Herbert Hoover"
## [19] "James Buchanan" "James K. Polk" "James Madison"
## [22] "James Monroe" "Jimmy Carter" "John Adams"
## [25] "John F. Kennedy" "John Quincy Adams" "John Tyler"
## [28] "Lyndon B. Johnson" "Martin van Buren" "Millard Fillmore"
## [31] "Richard Nixon" "Ronald Reagan" "Rutherford B. Hayes"
## [34] "Theodore Roosevelt" "Thomas Jefferson" "Ulysses S. Grant"
## [37] "Warren G. Harding" "William Howard Taft" "William J. Clinton"
## [40] "William McKinley" "Woodrow Wilson" "Zachary Taylor"
Having a proper variable with names, we can prepare another variable with party affiliation. Since this information is not included in the dataset, we’ll need to add it manually. Later we will focus only on the two major US parties: the Republican Party and the Democratic Party. For the few presidents who came from other parties or were unaffiliated, we will set the variable to NA.
data$party <- ifelse(data$pres %in% c("Abraham Lincoln", "Benjamin Harrison", "Calvin Coolidge", "Chester A. Arthur", "Donald J. Trump", "Dwight D. Eisenhower", "George W. Bush", "George Bush", "Gerald R. Ford", "Herbert Hoover", "Richard Nixon", "Ronald Reagan", "Rutherford B. Hayes", "Theodore Roosevelt", "Ulysses S. Grant", "Warren G. Harding", "William Howard Taft", "William McKinley"), "rep",
ifelse(data$pres %in% c("Andrew Jackson", "Andrew Johnson", "Barack Obama", "Franklin D. Roosevelt", "Franklin Pierce", "Grover Cleveland", "Harry S. Truman", "James Buchanan", "James K. Polk", "Jimmy Carter", "John F. Kennedy", "Lyndon B. Johnson", "Martin van Buren", "William J. Clinton", "Woodrow Wilson"), "dem", NA))
table(data$party)
##
## dem rep
## 92 85
sum(is.na(data$party))
## [1] 47
As a result, we end up with 92 texts from presidents affiliated with the Democratic Party and 85 texts from presidents of the Republican Party, so in this respect the data is well balanced. The remaining 47 texts come from the other presidents, mostly from the years before the Republican Party’s founding in 1854. We will omit them in the further analysis.
data <- na.omit(data)
head(data)
## readtext object consisting of 6 documents and 2 docvars.
## # Description: df[,4] [6 x 4]
## doc_id text pres party
## * <chr> <chr> <chr> <chr>
## 1 Abraham Lincoln [December 01, 1862].~ "\"\n Fellow-C\"..~ Abraham Linco~ rep
## 2 Abraham Lincoln [December 03, 1861].~ "\"\nFellow-Ci\"..~ Abraham Linco~ rep
## 3 Abraham Lincoln [December 06, 1864].~ "\"\nFellow-Ci\"..~ Abraham Linco~ rep
## 4 Abraham Lincoln [December 08, 1863].~ "\"\nFellow-Ci\"..~ Abraham Linco~ rep
## 5 Andrew Jackson [December 01, 1834].t~ "\"\nFellow Ci\"..~ Andrew Jackson dem
## 6 Andrew Jackson [December 03, 1833].t~ "\"\nFellow Ci\"..~ Andrew Jackson dem
First we’ll check the most frequent words in the texts. To do this we need to build a term-document matrix.
# Build a corpus (one document per address) and a term-document matrix
docs <- Corpus(VectorSource(data$text))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
# Sum each term's frequency across all documents and sort
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 20)
## word freq
## the the 138761
## and and 57729
## that that 19931
## for for 18560
## our our 16713
## this this 11689
## have have 11110
## with with 10835
## which which 10375
## will will 9105
## are are 8647
## has has 8271
## not not 7931
## been been 7520
## their their 6999
## from from 6706
## all all 5838
## its its 5251
## but but 5220
## was was 5078
Unsurprisingly, the most frequent terms are the ones that are not very informative. We will deal with them later.
frequent <- findFreqTerms(dtm, 200)
length(frequent)
## [1] 867
State of the Union addresses are pretty long documents, so although we have only 224 texts, a large number of words appear at least 200 times.
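To get a feel for what ended up in this list, we can peek at a few of the terms (a quick check; at this stage the list still contains plenty of stopwords, as the frequency table above suggests):
# Inspect a handful of the terms appearing at least 200 times
head(frequent, 10)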
As we’ve just seen, some cleaning is needed to obtain meaningful results. The texts are certainly well written and tidy, but several important steps will bring them into a form much better suited to text analysis. We will get rid of dashes, extra whitespace and punctuation, and also of numbers, which often appear in the State of the Union but aren’t very informative from our point of view. Next, we’ll convert all capital letters to lower case and remove stopwords, as mentioned before. Apart from the default words provided by the stopwords() function, we’ll remove words that belong to the opening formulas of State of the Union Addresses and therefore don’t characterise particular texts. At the end of the process we’ll perform stemming.
docs <- Corpus(VectorSource(data$text))
# Helper: replace a matched pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "-") %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removeWords, c("fellow", "citizens", "senate", "house", "representatives", "speaker", "president", "members", "congress", "vice", "gentlemen")) %>%
  tm_map(stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 20)
## word freq
## will will 9544
## state state 7797
## govern govern 7618
## year year 6641
## nation nation 5884
## unit unit 4313
## can can 4240
## peopl peopl 4212
## countri countri 4116
## upon upon 3651
## american american 3617
## law law 3317
## must must 3237
## new new 3199
## time time 3192
## great great 3184
## now now 3160
## work work 2995
## public public 2934
## made made 2909
This list of the most frequent words is much more informative and reflects the socio-political character of the documents.
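As a quick visual aside (optional; figure not shown here), the wordcloud package we loaded at the start can render this frequency table directly:
# Optional: visualise the cleaned term frequencies as a word cloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, max.words = 100, random.order = FALSE)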
Now we’ll perform topic modelling with the LDA method. We’ll treat each party separately and try to identify the main issues brought up in the State of the Union Addresses.
rep_idx <- which(data$party == 'rep')
dem_idx <- which(data$party == 'dem')
docs_rep <- docs[rep_idx]
docs_dem <- docs[dem_idx]
We will use a slightly modified version of a function provided during classes: it creates a document-term matrix from the supplied corpus, fits an LDA model with a selected number of topics, and plots the most frequent terms for each topic. We’ll start with the Republican part of the documents and plot three main topics.
top_terms_by_topic_LDA <- function(corpus, plot = TRUE, number_of_topics = 4) {
  # Build a document-term matrix and drop documents with no terms
  # (LDA cannot handle empty rows)
  DTM <- DocumentTermMatrix(corpus)
  unique_indexes <- unique(DTM$i)
  DTM <- DTM[unique_indexes, ]
  # Fit the LDA model with a fixed seed for reproducibility
  lda <- LDA(DTM, k = number_of_topics, control = list(seed = 1234))
  # Extract per-topic word probabilities (beta) and keep the top 10 terms per topic
  topics <- tidy(lda, matrix = "beta")
  top_terms <- topics %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)
  if (plot) {
    top_terms %>%
      mutate(term = reorder(term, beta)) %>%
      ggplot(aes(term, beta, fill = factor(topic))) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ topic, scales = "free") +
      labs(x = NULL, y = "Beta") +
      coord_flip()
  } else {
    return(top_terms)
  }
}
top_terms_by_topic_LDA(docs_rep, number_of_topics = 3)
It seems that some of the most common words are strongly tied to the issues raised in the texts and appear frequently in every topic. We should try removing these words to obtain more differentiated results.
docs_rep <- tm_map(docs_rep, removeWords, c("will", "govern", "year", "nation", "state"))
top_terms_by_topic_LDA(docs_rep, number_of_topics = 3)
Now it’s much better. The words are still pretty similar across topics, but we can distinguish direct references to America and Americans in the 3rd topic, and modal verbs like must and can alongside the stemmed words develop and increas in the 1st topic. However, it’s hard to say that there is anything special about these topics. Only the references to America and greatness reflect the lofty style that many associate with American political rhetoric.
It’s time to check the Democratic Party. We’ll use the same function as before.
top_terms_by_topic_LDA(docs_dem, number_of_topics = 3)
Again, the dominant words are the ones we removed in the Republican part. We’ll do the same here.
docs_dem <- tm_map(docs_dem, removeWords, c("will", "govern", "year", "nation", "state"))
top_terms_by_topic_LDA(docs_dem, number_of_topics = 3)
The results are not identical, but we can’t say the difference is striking. Again, one of the topics concentrates on the words America and Americans, so this is not restricted to the Republican Party. The 2nd topic here looks like the equivalent of the 1st one in the Republican part, although the occurrence of the word war there is quite interesting. Apart from that, it is hard to identify anything distinctive about these topics that would help, for example, to match the words with the party affiliation of their authors.
We haven’t found anything distinctive about the topics raised in the State of the Union addresses, so we will move on to the emotional character of the words used. For this analysis we have to tokenize our data and apply cleaning similar to before on the tokenized words.
data_tidy <- unnest_tokens(data, word, text)
custom_stop_words <- tribble(
~word, ~lexicon,
"fellow", "CUSTOM",
"citizens", "CUSTOM",
"senate", "CUSTOM",
"house", "CUSTOM",
"representatives", "CUSTOM",
"speaker", "CUSTOM",
"president", "CUSTOM",
"members", "CUSTOM",
"congress", "CUSTOM",
"vice", "CUSTOM",
"gentlemen", "CUSTOM",
)
custom_stop_words_add <- tribble(
~word, ~lexicon,
"will", "CUSTOM",
"govern", "CUSTOM",
"year", "CUSTOM",
"nation", "CUSTOM",
"state", "CUSTOM"
)
stop_words2 <- stop_words %>%
bind_rows(custom_stop_words)
data_tidy_cleaned <- data_tidy %>%
anti_join(stop_words2) %>%
mutate(word = wordStem(word)) %>%
anti_join(custom_stop_words_add)
sen_afinn <- data_tidy_cleaned %>%
inner_join(get_sentiments("afinn")) %>%
dplyr::group_by(party) %>%
dplyr::summarise(sentiment = sum(value),
avg_sentiment = sentiment/n()) %>%
mutate(method = "AFINN")
sen_afinn
## # A tibble: 2 x 4
## party sentiment avg_sentiment method
## <chr> <dbl> <dbl> <chr>
## 1 dem 5646 0.200 AFINN
## 2 rep 7369 0.278 AFINN
The AFINN lexicon assigns each word a score between -5 and +5. From the summarised results we can see that the Republicans obtained a somewhat higher overall score, despite the slightly lower number of texts in our set. On average, the words they use are also rated more positively.
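To make the scoring mechanics concrete, here is a minimal sketch on a made-up sentence of our own (not from the corpus), aggregating AFINN values the same way as above:
# Toy example: tokenize an invented sentence and sum its AFINN scores
toy <- tibble(text = "we hope for progress but fear war")
toy %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  dplyr::summarise(sentiment = sum(value), avg_sentiment = sentiment / n())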
afinn_counts <- data_tidy_cleaned %>%
inner_join(get_sentiments("afinn")) %>%
dplyr::group_by(party) %>%
dplyr::count(word, value, sort = TRUE) %>%
ungroup()
# Democrats:
afinn_counts[afinn_counts$party == 'dem',]
## # A tibble: 454 x 4
## party word value n
## <chr> <chr> <dbl> <int>
## 1 dem war -2 1595
## 2 dem recommend 2 809
## 3 dem protect 1 800
## 4 dem support 2 695
## 5 dem hope 2 622
## 6 dem free 1 605
## 7 dem care 2 549
## 8 dem progress 2 513
## 9 dem demand -1 430
## 10 dem prevent -1 415
## # ... with 444 more rows
# Republicans:
afinn_counts[afinn_counts$party == 'rep',]
## # A tibble: 449 x 4
## party word value n
## <chr> <chr> <dbl> <int>
## 1 rep war -2 1020
## 2 rep recommend 2 1002
## 3 rep protect 1 760
## 4 rep progress 2 601
## 5 rep free 1 592
## 6 rep hope 2 576
## 7 rep matter 1 478
## 8 rep support 2 462
## 9 rep care 2 452
## 10 rep benefit 2 431
## # ... with 439 more rows
A closer look at the most frequent emotionally charged words emphasizes once again how similar the documents from presidents of both parties are. The top three words even appear in the same order, and the majority of the common words are shared. This time war is the absolute “winner” in both cases, although Democrats used it more frequently. Nevertheless, we need to keep in mind that presidents from the respective parties led the country at different times, so there are no firm conclusions we can draw from this word’s frequency without a more detailed examination of the circumstances.
The plots below show the distribution of words across sentiment values. The shape for the two parties is very similar; the only notable differences are for words with value -2 (more frequent among Democrats) and words with value -1 (more frequent among Republicans). Surprisingly, the most neutral scores, -1 and +1, are not the most common ones, especially the former.
ggplot(afinn_counts, aes(x=as.factor(value), y=n, fill = party)) +
geom_col() +
scale_fill_manual(values=c("blue", "red")) +
labs(
title = "Distribution of words with different sentiment values",
x = "values",
y = "Number of words"
) +
facet_wrap(~party)
nrc <- data_tidy_cleaned %>%
inner_join(get_sentiments("nrc"))
nrc %>%
dplyr::count(sentiment) %>%
arrange(desc(n))
## readtext object consisting of 10 documents and 0 docvars.
## # Description: df[,3] [10 x 3]
## sentiment n text
## <chr> <int> <chr>
## 1 positive 72537 "\"\"..."
## 2 trust 48189 "\"\"..."
## 3 anticipation 36718 "\"\"..."
## 4 negative 34990 "\"\"..."
## 5 joy 20716 "\"\"..."
## 6 fear 19223 "\"\"..."
## # ... with 4 more rows
We will also take a look at the NRC lexicon, which assigns words to a number of emotional categories. From the result above we can see that the dominant categories in the documents are positive, trust, anticipation and negative. We will focus on these four categories.
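For a single word, the lexicon simply lists every category it belongs to; for example, a quick lookup of law (a word we will meet again below):
# All NRC categories assigned to the word "law"
get_sentiments("nrc") %>% filter(word == "law")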
nrc2 <- nrc %>%
filter(sentiment %in% c("positive", "trust", 'anticipation', 'negative'))
nrc_counts <- nrc2 %>%
dplyr::count(word, sentiment) %>%
dplyr::group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
ggplot(nrc_counts, aes(x=word2, y=n, fill = sentiment)) +
geom_col(show.legend=FALSE) +
scale_fill_manual(values=c("red", "orange", "green", "blue")) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts",
x = "Words"
)
In every category there are one or two words almost twice as numerous as the next one. Anticipation is dominated by time and public, negative by the already discussed war, positive by public, and trust by law.
nrc_counts_rep <- nrc2[nrc2$party == 'rep',] %>%
dplyr::count(word, sentiment) %>%
dplyr::group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
ggplot(nrc_counts_rep, aes(x=word2, y=n, fill = sentiment)) +
geom_col(show.legend=FALSE) +
scale_fill_manual(values=c("red", "orange", "green", "blue")) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts - Republicans",
x = "Words"
)
The words and their order in the Republican subset are similar: the same words top every category, but in all categories except trust the dominance of the most frequent word over the second one is less pronounced.
nrc_counts_dem <- nrc2[nrc2$party == 'dem',] %>%
dplyr::count(word, sentiment) %>%
dplyr::group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
ggplot(nrc_counts_dem, aes(x=word2, y=n, fill = sentiment)) +
geom_col(show.legend=FALSE) +
scale_fill_manual(values=c("red", "orange", "green", "blue")) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts - Democrats",
x = "Words"
)
It is safe to say that, whether we look at all the documents or at a single party, the general pattern is always the same. To finish, we can also plot word clouds for Republicans and Democrats respectively, to visualise the slight nuances in word usage across the four main sentiments.
# Republican presidents
nrc2[nrc2$party == 'rep',] %>%
dplyr::count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "orange", "green", "blue"),
max.words = 100)
# Democratic presidents
nrc2[nrc2$party == 'dem',] %>%
dplyr::count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "orange", "green", "blue"),
max.words = 100)
In the final part of our analysis we will perform a simple text classification and try to predict the party affiliation based on the document content. First we need to split the data into training and testing samples.
set.seed(123456789)
training_obs <- createDataPartition(data$pres,
p = 0.7,
list = FALSE)
train0 <- data[training_obs,]
test0 <- data[-training_obs,]
train <- docs[training_obs]
test <- docs[-training_obs]
Now we will limit the number of words used to the frequent terms, based on the training sample.
data_train <- DocumentTermMatrix(train, list(bounds = list(global = c(2, Inf)), dictionary = frequent))
dict <- findFreqTerms(data_train, lowfreq=50)
data_train <- DocumentTermMatrix(train, list(dictionary=dict))
data_test <- DocumentTermMatrix(test, list(dictionary=dict))
Finally, we transform our data into a binary presence/absence form, suitable for modelling.
# Convert term counts into binary presence/absence indicators
convert_counts <- function(x) {
  ifelse(x > 0, 1, 0)
}
data_train <- data_train %>% apply(MARGIN=2, FUN=convert_counts)
data_test <- data_test %>% apply(MARGIN=2, FUN=convert_counts)
data_train <- as.data.frame(data_train)
data_test <- as.data.frame(data_test)
data_train1 <- cbind(label=factor(train0$party), data_train)
data_test1 <- cbind(label=factor(test0$party), data_test)
data_train1 <- as.data.frame(data_train1)
data_test1 <- as.data.frame(data_test1)
str(data_train1)
## 'data.frame': 136 obs. of 356 variables:
## $ label : Factor w/ 2 levels "dem","rep": 2 2 2 1 1 1 1 1 1 1 ...
## $ account : num 1 1 1 1 1 1 1 1 1 1 ...
## $ across : num 1 1 1 0 0 0 0 0 0 0 ...
## $ act : num 1 1 1 1 1 1 1 1 1 1 ...
## $ action : num 1 1 1 1 1 1 0 1 1 1 ...
## $ actual : num 1 0 1 1 0 0 1 1 1 1 ...
## $ agreement : num 1 0 1 0 0 0 1 0 1 1 ...
## $ aid : num 1 1 1 1 1 1 1 1 1 1 ...
## $ along : num 1 0 1 0 0 0 0 0 0 0 ...
## $ also : num 1 1 1 1 1 1 1 1 1 1 ...
## $ america : num 1 0 1 1 1 0 1 0 1 1 ...
## $ american : num 1 1 1 1 1 1 1 1 1 1 ...
## $ among : num 1 1 1 1 1 1 1 1 1 1 ...
## $ amount : num 1 1 1 1 1 1 1 1 1 1 ...
## $ annual : num 1 1 1 1 1 1 1 1 1 1 ...
## $ ask : num 1 1 1 0 1 1 1 1 1 1 ...
## $ back : num 1 1 1 0 0 0 0 1 0 0 ...
## $ bank : num 1 0 1 1 1 1 1 1 1 0 ...
## $ benefit : num 1 1 1 1 1 1 1 1 1 1 ...
## $ best : num 1 1 1 1 1 1 1 1 1 1 ...
## $ better : num 1 1 0 1 1 1 1 1 1 0 ...
## $ beyond : num 1 1 1 1 1 1 1 1 0 0 ...
## $ bring : num 1 1 1 1 1 1 1 1 1 1 ...
## $ britain : num 1 0 1 1 1 1 1 1 1 1 ...
## $ british : num 1 1 1 1 1 1 1 0 0 1 ...
## $ brought : num 1 1 1 1 1 1 1 1 1 1 ...
## $ build : num 1 0 1 1 1 1 1 1 0 0 ...
## $ call : num 1 1 1 1 1 1 1 1 1 1 ...
## $ can : num 1 1 1 1 1 1 1 1 1 1 ...
## $ canal : num 1 0 0 0 0 1 1 0 0 0 ...
## $ care : num 1 1 1 1 1 1 1 1 1 1 ...
## $ case : num 1 1 1 1 1 1 1 1 1 1 ...
## $ cent : num 1 1 1 1 0 0 0 0 0 1 ...
## $ certain : num 1 1 1 1 1 1 1 1 1 0 ...
## $ chief : num 1 1 1 0 0 1 1 1 1 0 ...
## $ children : num 1 0 0 0 0 1 0 0 0 0 ...
## $ china : num 1 1 1 0 0 0 1 0 0 0 ...
## $ civil : num 1 1 1 1 1 1 1 1 1 1 ...
## $ class : num 1 1 1 0 0 1 1 1 1 1 ...
## $ clear : num 1 1 1 1 1 1 0 1 1 1 ...
## $ close : num 1 1 1 1 0 1 1 1 1 1 ...
## $ come : num 1 1 1 1 1 1 0 1 1 1 ...
## $ commission: num 1 1 1 1 1 1 1 1 1 1 ...
## $ common : num 1 1 1 1 1 1 0 1 1 0 ...
## $ conduct : num 1 1 1 1 0 1 1 1 1 0 ...
## $ consider : num 1 1 1 1 1 1 1 1 1 1 ...
## $ cost : num 1 1 0 0 1 0 0 0 0 1 ...
## $ credit : num 1 1 1 1 1 1 1 1 1 1 ...
## $ cut : num 1 0 0 0 0 0 0 0 0 0 ...
## $ day : num 1 1 1 1 0 1 1 1 1 1 ...
## $ deal : num 1 0 1 1 1 0 1 0 0 0 ...
## $ debt : num 1 1 1 1 1 1 1 1 1 1 ...
## $ demand : num 1 1 1 1 1 1 1 1 1 1 ...
## $ develop : num 1 1 1 1 0 1 0 1 0 1 ...
## $ difficult : num 1 1 1 1 0 1 0 1 1 0 ...
## $ direct : num 1 1 1 1 1 1 1 1 1 1 ...
## $ district : num 1 1 1 1 1 1 1 1 1 1 ...
## $ done : num 1 1 1 1 1 1 1 1 1 1 ...
## $ doubt : num 1 1 1 1 1 1 1 1 1 1 ...
## $ due : num 1 1 1 1 1 1 1 1 1 1 ...
## $ effect : num 1 1 1 1 1 1 1 1 1 1 ...
## $ either : num 1 1 1 1 0 1 1 1 1 1 ...
## $ end : num 1 1 1 1 1 1 1 1 1 1 ...
## $ equal : num 1 1 1 1 1 1 1 1 1 1 ...
## $ establish : num 1 1 1 1 1 1 1 1 1 1 ...
## $ european : num 1 0 1 1 1 0 0 1 0 1 ...
## $ even : num 1 1 1 1 1 1 1 1 1 1 ...
## $ ever : num 1 1 1 1 1 1 1 1 1 0 ...
## $ except : num 1 1 1 1 1 1 1 1 1 1 ...
## $ extend : num 1 0 1 1 1 1 1 1 1 1 ...
## $ extent : num 1 0 1 1 1 1 1 1 1 1 ...
## $ face : num 1 0 0 0 0 0 0 0 1 0 ...
## $ fact : num 1 1 1 1 1 1 0 1 1 1 ...
## $ faith : num 1 1 1 1 1 1 1 1 1 1 ...
## $ far : num 1 1 1 1 1 1 1 1 1 1 ...
## $ favor : num 1 1 1 1 1 1 1 1 1 1 ...
## $ field : num 1 1 1 1 1 0 0 1 0 1 ...
## $ find : num 1 1 0 1 1 1 1 1 1 0 ...
## $ first : num 1 1 1 1 1 1 1 1 1 1 ...
## $ fiscal : num 1 1 1 1 0 0 0 1 1 1 ...
## $ foreign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ form : num 1 1 1 1 1 1 1 1 1 1 ...
## $ former : num 1 1 1 1 1 1 1 1 1 1 ...
## $ found : num 1 1 1 1 1 1 1 1 1 1 ...
## $ free : num 1 1 1 0 1 1 1 1 1 1 ...
## $ freedom : num 1 0 1 0 1 1 0 1 1 0 ...
## $ full : num 1 1 1 1 1 1 1 1 1 1 ...
## $ general : num 1 1 1 1 1 1 1 1 1 1 ...
## $ give : num 1 1 1 1 1 1 1 1 1 0 ...
## $ good : num 1 1 1 1 1 1 1 1 1 1 ...
## $ great : num 1 1 1 1 1 1 1 1 1 1 ...
## $ greater : num 1 1 1 1 1 1 1 0 1 1 ...
## $ half : num 1 0 1 0 0 0 1 0 0 1 ...
## $ health : num 1 1 1 1 0 1 1 1 1 0 ...
## $ high : num 1 1 1 1 1 1 1 1 1 1 ...
## $ hold : num 1 0 0 0 1 1 1 1 1 1 ...
## $ home : num 1 1 1 1 1 1 1 1 1 1 ...
## $ honor : num 1 1 1 1 0 1 1 1 1 1 ...
## $ hope : num 1 1 1 1 1 1 1 1 1 1 ...
## [list output truncated]
We have only 136 documents and, even after the reduction, as many as 356 variables. This makes our case rather high-dimensional. We will use a classification tree and an SVM, as these methods should handle data of this kind reasonably well.
model_tree <-
rpart(label~.,
data = data_train1,
method = "class")
fancyRpartPlot(model_tree, cex = 0.6, palettes = 'OrRd')
model_tree_pred <- predict(model_tree, data_test1, type = 'class')
summary(model_tree_pred)
## dem rep
## 19 22
confusionMatrix(model_tree_pred, data_test1$label, positive="rep")
## Confusion Matrix and Statistics
##
## Reference
## Prediction dem rep
## dem 12 7
## rep 10 12
##
## Accuracy : 0.5854
## 95% CI : (0.4211, 0.7368)
## No Information Rate : 0.5366
## P-Value [Acc > NIR] : 0.3206
##
## Kappa : 0.1751
##
## Mcnemar's Test P-Value : 0.6276
##
## Sensitivity : 0.6316
## Specificity : 0.5455
## Pos Pred Value : 0.5455
## Neg Pred Value : 0.6316
## Prevalence : 0.4634
## Detection Rate : 0.2927
## Detection Prevalence : 0.5366
## Balanced Accuracy : 0.5885
##
## 'Positive' Class : rep
##
The results of the simple classification tree with default parameters are disappointing. Mediocre accuracy, sensitivity and specificity show that we didn’t manage to predict the party effectively. Moreover, the model assigns more texts to the Republican class, the less numerous group.
barplot(rev(model_tree$variable.importance),
col = "#E15D44",
main = "Imporatnce of variables",
horiz = T,
las = 1,
cex.names = 0.6)
We can also check the variable importance in this model. This is another way to direct our attention to words that may distinguish the two parties, although its significance is debatable given the imperfect results.
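As a complement to the plot, we can also list the top-scoring words directly from the fitted model (a small sketch; rpart stores the scores as a named numeric vector):
# Ten variables with the highest importance scores
head(sort(model_tree$variable.importance, decreasing = TRUE), 10)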
model_svm <- svm(label~., data=data_train1)
summary(model_svm)
##
## Call:
## svm(formula = label ~ ., data = data_train1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 135
##
## ( 66 69 )
##
##
## Number of Classes: 2
##
## Levels:
## dem rep
model_svm_pred <- predict(model_svm, data_test1)
summary(model_svm_pred)
## dem rep
## 27 14
Our predictions assign 27 texts to the Democrats and 14 to the Republicans.
confusionMatrix(model_svm_pred, data_test1$label, positive="rep")
## Confusion Matrix and Statistics
##
## Reference
## Prediction dem rep
## dem 18 9
## rep 4 10
##
## Accuracy : 0.6829
## 95% CI : (0.5191, 0.8192)
## No Information Rate : 0.5366
## P-Value [Acc > NIR] : 0.04115
##
## Kappa : 0.3508
##
## Mcnemar's Test P-Value : 0.26726
##
## Sensitivity : 0.5263
## Specificity : 0.8182
## Pos Pred Value : 0.7143
## Neg Pred Value : 0.6667
## Prevalence : 0.4634
## Detection Rate : 0.2439
## Detection Prevalence : 0.3415
## Balanced Accuracy : 0.6722
##
## 'Positive' Class : rep
##
The results aren’t stunning, but they aren’t that disappointing either. The accuracy is acceptable and the specificity is good, but the sensitivity falls well below expectations. Overall, this is quite satisfying for such a simple model, especially considering all the previous steps of our analysis and the difficulty of finding any features that could help determine party affiliation.
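A natural next step, which we only sketch here without running it, would be to tune the SVM’s cost parameter with e1071’s built-in cross-validation; the cost grid below is our own arbitrary choice:
# Hypothetical tuning sketch: cross-validated grid search over the cost parameter
set.seed(123456789)
svm_tune <- tune.svm(label ~ ., data = data_train1, cost = c(0.1, 1, 10))
summary(svm_tune)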
It is not our role to judge the uniqueness of US presidents, and we are aware of how much happens away from official, formal messages like the annual State of the Union. Surprising or not, our analysis leaves the impression that, more often than not, the most important representatives of the two long-rivalling major US parties speak with one voice. Nonetheless, it is still possible to distinguish party affiliation, to some extent, based on characteristic words hidden in the State of the Union addresses.