State of the Union Text Clustering

Overview
The President’s State of the Union address has changed considerably over time, reflecting the era in which it is delivered more than the individual president or that president’s political affiliation. There is one major exception: President George W. Bush, whose style and content mark a sharp departure from both his predecessors and his contemporaries.

This analysis takes all Presidential State of the Union addresses from Harry S Truman to Barack Obama and clusters them by the content of the text in order to better understand how they have changed over the course of history.

Environment Setup & Pre-processing
Load necessary libraries and set random seed (necessary to make K-Means deterministic & replicable).

library(stringr)
library(plyr)
library(dplyr)
library(magrittr)
library(tm)
library(proxy)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)

set.seed(1300)
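
As a minimal sketch of why the seed matters, the example below runs K-Means twice on a small made-up data set (the toy matrix is hypothetical, not the speech data). K-Means picks its starting centroids at random, so resetting the seed before each run replays the same starting point.

```r
# Toy data: two well-separated groups of points (hypothetical, not the SOTU data)
toy <- rbind(matrix(c(0, 0.1, 0.2, 0, 0.1, 0.2), ncol = 2),
             matrix(c(5, 5.1, 5.2, 5, 5.1, 5.2), ncol = 2))

set.seed(1300)
run_a <- kmeans(toy, centers = 2)$cluster

set.seed(1300)  # resetting the seed replays the same random starting centroids
run_b <- kmeans(toy, centers = 2)$cluster

identical(run_a, run_b)
```

Without the second set.seed call, the two runs may still agree on data this clean, but nothing guarantees it.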

Gather list of all document files to be processed.

files <- list.files(path = './data/sotu/')

Define a parse function to be applied to each file. Each call opens a given file, parses its contents, and returns the result as a single-row DataFrame. The DataFrames are then bound into a single object. Finally, the output object is mutated to extract the four-digit year from the file name, which may be useful for modeling the time dimension in Exploratory Analysis.

parse_sotu_text <- function(file_name) {
    full_path <- paste('./data/sotu/', file_name, sep = '')
    raw <- scan(file = full_path, what = character(), sep = '\n')
    raw <- raw[nchar(raw) > 1]  # delete empty lines, single whitespace char
    sotu_content <- raw[3:length(raw)]  # first 2 lines are headers
    
    output <- data_frame(file_name = file_name,
                         head1 = raw[1],
                         head2 = raw[2],
                         char_content = sum(nchar(sotu_content)),
                         content = paste(sotu_content, collapse = ' '))
    return(output)
}

# Parse every file and bind the one-row data frames into a single data frame
sotu <- bind_rows(lapply(files, parse_sotu_text))

sotu <- sotu %>%
    mutate(year = as.integer(str_extract(string = file_name, pattern = '[0-9]{4}')))
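
The year extraction can be tried in isolation. The sketch below uses base R rather than stringr, and the file names are hypothetical stand-ins, since the actual naming scheme in ./data/sotu/ is not shown here.

```r
# Hypothetical file names; the real naming scheme in ./data/sotu/ is assumed
toy_files <- c('truman_1947.txt', 'obama_2014.txt')

# regexpr() locates the first run of four digits; regmatches() extracts it
toy_years <- as.integer(regmatches(toy_files, regexpr('[0-9]{4}', toy_files)))
toy_years
```

One difference worth noting: regmatches drops elements with no match, whereas str_extract returns NA for them, so str_extract is the safer choice inside mutate where the output length must equal the number of rows.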

Data Cleaning
To clean the data, the tm (Text Mining) library is used. It provides functions for the main text pre-processing steps. Chaining them with the magrittr pipe operator %>% lets the whole series of transformations be assigned to a single variable. The result is a Corpus object in which each document is one element.

sotu_corpus <- Corpus(VectorSource(sotu$content)) %>%
    tm_map(x = ., FUN = PlainTextDocument) %>%
    tm_map(x = ., FUN = removePunctuation) %>%
    tm_map(x = ., FUN = removeNumbers) %>%
    tm_map(x = ., FUN = removeWords, stopwords(kind = 'en')) %>%
    tm_map(x = ., FUN = stripWhitespace)

Model: TF-IDF
The first step in modeling is to create a Document Term Matrix, with the documents as the rows, the individual words along the columns, and a frequency count as the content. The Corpus object populates the term meta-data automatically during the tokenization process. However, it will be necessary to manually import the document names into the meta-data of the matrix.

doc_term <- DocumentTermMatrix(sotu_corpus)
doc_term$dimnames$Docs <- sotu$file_name

From the Document Term Matrix, it is possible to create a Term Frequency - Inverse Document Frequency (TF-IDF) matrix. This object is a matrix of the same dimensions as the Document Term Matrix above, except each frequency has been weighted by how rare the term is across the entire corpus. This gives additional weight to a term that is common in a given document but comparatively rare in the corpus. Conversely, terms that are common across many or all documents are penalized in any single document, as they provide less unique information about it. A native matrix version is also created and stored to be more compatible with further processing.

tf_idf <- weightTfIdf(m = doc_term, normalize = TRUE)
tf_idf_mat <- as.matrix(tf_idf)
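
To make the weighting concrete, here is a small worked example in base R. It assumes the common form of the weight, length-normalized term frequency multiplied by log2(N / document frequency), which is believed to match what weightTfIdf computes with normalize = TRUE. The counts are made up for illustration.

```r
# Toy counts: 3 documents x 3 terms (made-up numbers, not from the speeches)
counts <- matrix(c(2, 1, 0,
                   1, 0, 3,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c('d1', 'd2', 'd3'),
                                 c('nation', 'world', 'applause')))

tf  <- counts / rowSums(counts)                  # term share within each document
idf <- log2(nrow(counts) / colSums(counts > 0))  # rarer across documents = larger
toy_tf_idf <- sweep(tf, MARGIN = 2, STATS = idf, FUN = '*')
round(toy_tf_idf, 3)
```

In this toy matrix "nation" appears in every document, so its inverse document frequency is log2(1) = 0 and its weight vanishes everywhere, while "applause", which appears in only one document, receives the largest weight.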

Once the weights are calculated, a square matrix of pairwise distances between all documents is computed as a lookup for clustering. The documents have a large number of terms, so the distances exist in a very high-dimensional space. For this reason, distance is computed as cosine distance (one minus cosine similarity) rather than ordinary Euclidean distance.

tf_idf_dist <- dist(tf_idf_mat, method = 'cosine')
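
For readers who want to see what the cosine method measures, the following is a base R sketch of the same idea. The helper cosine_dist is hypothetical and written for illustration only; it is not the implementation used by the proxy library.

```r
# Cosine distance between the rows of a numeric matrix (hypothetical helper)
cosine_dist <- function(m) {
    norms <- sqrt(rowSums(m^2))                # Euclidean length of each row
    sim <- (m %*% t(m)) / outer(norms, norms)  # pairwise cosine similarity
    as.dist(1 - sim)                           # dissimilarity: 0 = same direction
}

toy <- rbind(a = c(1, 0, 0),
             b = c(3, 0, 0),   # same direction as a, different length
             c = c(0, 2, 0))   # orthogonal to a and b
toy_d <- as.matrix(cosine_dist(toy))
toy_d
```

Rows a and b differ in length but not direction, so their cosine distance is 0, while a and c share no terms and sit at distance 1. This insensitivity to length is what makes the measure suitable for comparing speeches of different sizes.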

Model: Hierarchical Clustering
The first model is a hierarchical clustering, which shows the cohesion among clusters at all levels because the hierarchical method preserves every intermediate cluster. Ward’s method is used because it places a heavy weight on the cohesiveness of the cluster formed at each step. The goal of clustering is to learn implicit features about groups of documents, so the most interpretable clustering should merge, at each step, the documents whose combination changes the resulting group the least.

clust_h <- hclust(d = tf_idf_dist, method = 'ward.D2')
plot(clust_h,
    main = 'Cluster Dendrogram: Ward Cosine Distance',
    xlab = '', ylab = '', sub = '')

From this cluster dendrogram the first patterns are apparent:
1. Speeches made by the same president are almost always the first to cluster.
2. Speeches from the same Political Party and Era are commonly next to cluster (Obama & Clinton, Reagan & Bush Sr., Eisenhower & Truman)
3. There is a set of potential outliers in 6 of George W. Bush’s speeches that are the last to merge with any others, though the balance of his speeches cluster as would be expected with Bush Sr. and Reagan.

Cutting the Tree: to find an optimal number of clusters, it is useful to investigate how cluster cohesiveness changes across the levels at which the tree can be cut to define separate clusters. To produce this curve, a loop cycles through the possible cut levels (from 1 to number-of-documents minus 1). For each cut level, the mean number of documents per cluster and the mean pairwise distance between documents within each cluster are stored.

dist_mat <- as.matrix(tf_idf_dist)

df_clust_cuts <- data_frame(cut_level = 1:length(sotu$file_name),
                            avg_size = 0,
                            avg_dist = 0)

for (i in 1:(nrow(df_clust_cuts) - 1)) {
    df_clust_cuts[df_clust_cuts$cut_level == i, 'avg_size'] <- mean(table(cutree(tree = clust_h, k = i)))

    df_dist <- data_frame(doc_name = doc_term$dimnames$Docs,
                          clust_cut = cutree(tree = clust_h, k = i)) %>%
        inner_join(x = ., y = ., by = 'clust_cut') %>%
        filter(doc_name.x != doc_name.y)
    df_dist$cos_dist <- NA
    for (t in 1:nrow(df_dist)) {
        df_dist$cos_dist[t] <- dist_mat[df_dist$doc_name.x[t], df_dist$doc_name.y[t]]
    }
    df_dist <- df_dist %>%
        group_by(clust_cut) %>%
        summarise(cos_dist = mean(cos_dist))

    df_clust_cuts[df_clust_cuts$cut_level == i, 'avg_dist'] <- mean(df_dist$cos_dist)
}
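
The per-cut calculation inside the loop can be illustrated on a toy example. The sketch below uses five hypothetical 2-D points and Euclidean distance (the report itself uses cosine distance on TF-IDF rows), cuts the tree once at k = 3, and averages the pairwise distances inside each cluster.

```r
# Toy 2-D points forming two tight pairs plus one distant point (hypothetical)
pts <- rbind(a = c(0, 0), b = c(0, 1),
             c = c(10, 0), d = c(10, 1),
             e = c(50, 50))
d <- dist(pts)                        # Euclidean here; the report uses cosine
tree <- hclust(d, method = 'ward.D2')
groups <- cutree(tree, k = 3)

d_mat <- as.matrix(d)

# Mean pairwise distance inside each cluster; singletons have no pairs
intra <- sapply(sort(unique(groups)), function(g) {
    sub <- d_mat[groups == g, groups == g, drop = FALSE]
    pairs <- sub[upper.tri(sub)]      # each unordered pair counted once
    if (length(pairs) == 0) NA else mean(pairs)
})
mean_intra <- mean(intra, na.rm = TRUE)
mean_intra
```

Single-document clusters contribute no pairs and are skipped here with NA, which mirrors how the self-join in the loop above drops them through its doc_name.x != doc_name.y filter.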

With the df_clust_cuts data frame populated, it is straightforward to graph the space of possibilities.

ggplot(data = df_clust_cuts, aes(x = cut_level, y = avg_dist)) +
    geom_line(color = 'steelblue', size = 2) +
    labs(title = 'Cosine Distance of Hierarchical Clusters by Tree Cut',
         x = 'Tree Cut Level / Number of Final Clusters',
         y = 'Intra-Cluster Mean Cosine Distance')

The above graph shows an expected gradual decline as the number of clusters grows from about 20 toward the number of documents (i.e. each document in its own cluster). Ideally, there should be as few clusters as possible while the Mean Cosine Distance within each cluster stays low (indicating high cluster cohesiveness). There is a local minimum at about 5 clusters that is not surpassed in cohesiveness until closer to 40 clusters are computed. The smaller number of clusters is far preferable.

ggplot(data = df_clust_cuts, aes(x = avg_size, y = avg_dist, color = cut_level)) +
    geom_point(size = 4) +
    labs(title = 'Mean Cluster Size & Cosine Distance by Tree Cut',
         x = 'Mean Number of Documents per Cluster',
         y = 'Intra-Cluster Mean Cosine Distance')

This is confirmed when looking at the balance of cluster size for the chosen cut level. A cut that yields about 5 clusters results in about 15-18 documents per cluster, which appears to be a solid balance against higher cut levels.

Model: K-Means Cluster
To confirm the hierarchical cluster analysis, a K-Means cluster analysis is computed, which will also provide a more visual representation of the cluster space. Since K-Means relies on Euclidean distance rather than cosine dissimilarity, it is first necessary to normalize each row of the TF-IDF matrix to unit length; for unit-length vectors, squared Euclidean distance equals twice the cosine distance, so the two measures rank document pairs identically. The K-Means process itself will cluster around 5 centroids, with the maximum iterations increased from the default of 10 to 25.

tf_idf_norm <- tf_idf_mat / apply(tf_idf_mat, MARGIN = 1, FUN = function(x) sum(x^2)^0.5)
km_clust <- kmeans(x = tf_idf_norm, centers = 5, iter.max = 25)
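
The reason row normalization reconciles the two distance measures can be checked numerically. For unit-length vectors a and b, ||a - b||^2 = 2 - 2(a . b) = 2(1 - cos(a, b)). The sketch below verifies this on a small hypothetical matrix using base R only.

```r
# Toy rows with different lengths (hypothetical values)
toy <- rbind(a = c(3, 4, 0),
             b = c(1, 0, 1),
             c = c(0, 2, 2))

# Scale every row to unit Euclidean length, as done for tf_idf_norm
norms <- sqrt(rowSums(toy^2))
toy_norm <- toy / norms

# Cosine distance computed from the original, unscaled rows
cos_d <- 1 - (toy %*% t(toy)) / outer(norms, norms)

# Squared Euclidean distance between the unit-length rows
euc_sq <- as.matrix(dist(toy_norm))^2

all.equal(euc_sq, 2 * cos_d, check.attributes = FALSE)
```

Because the relationship is monotonic, K-Means on the normalized matrix groups documents by the same notion of closeness that the hierarchical model used.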

The data contains thousands of dimensions, a dimension for each term. Although the clusters are built using the full dimensional feature space, it will not be practical to visualize this many dimensions. To make visualization more palatable, Principal Components Analysis is performed and the 2 most important components are mapped to a plot along with meta-data for markup purposes.

pca_comp <- prcomp(tf_idf_norm)
pca_rep <- data_frame(sotu_name = sotu$file_name,
                      pc1 = pca_comp$x[,1],
                      pc2 = pca_comp$x[,2],
                      clust_id = as.factor(km_clust$cluster))

ggplot(data = pca_rep, mapping = aes(x = pc1, y = pc2, color = clust_id)) +
    scale_color_brewer(palette = 'Set1') +
    geom_text(mapping = aes(label = sotu_name), size = 2.5, fontface = 'bold') +
    labs(title = 'K-Means Cluster: 5 clusters on PCA Features',
         x = 'Principal Component Analysis: Factor 1',
         y = 'Principal Component Analysis: Factor 2') +
    theme_grey() +
    theme(legend.position = 'right',
          legend.title = element_blank())
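
A two-component plot keeps only part of the total variance, so it can be useful to check how much. The sketch below shows the calculation on randomly generated stand-in data (hypothetical, not the TF-IDF matrix); the same expression could be applied to pca_comp$sdev.

```r
set.seed(1300)
# Hypothetical data: 20 observations, 5 features, first two features correlated
toy <- matrix(rnorm(100), nrow = 20)
toy[, 2] <- toy[, 1] + rnorm(20, sd = 0.1)

toy_pca <- prcomp(toy)                # centers the columns by default, no scaling
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cumsum(var_share)[2]                  # share of variance kept by a 2-D plot
```

A low share would suggest reading the 2-D picture with caution, since clusters that overlap on the plot may still be well separated in the full feature space.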

Many of the same patterns from the dendrogram are apparent in this plot. The same 6 George W. Bush speeches sit starkly apart from the rest of the speeches of the same era and party. It is also easy to see the same early-level clusters of party and era combined. However, this visualization takes that context a step further. With the exception of George W. Bush’s speeches, the balance of the speeches largely lies along a kind of spectrum that roughly orders them in time, despite the fact that no date data is present in the model (it was stripped out in the loading step). Thus, it appears that the content of the State of the Union address is driven more by the era it reflects than by the political association of the president giving it.

Analysis
In order to dig deeper into underlying drivers, examining the most common terms may help shed light on those factors that mean the most to each cluster. To produce a word cloud specifically for this purpose, it is necessary to produce a Term Document Matrix. This matrix is largely similar to the Document Term matrix created earlier, but due to the needs of the library, will have its rows and columns switched. At the same time, a larger set of common stop words will be scrubbed using the Cornell SMART list.

term_doc <- TermDocumentMatrix(sotu_corpus)
term_doc$dimnames$Docs <- sotu$file_name
td_mat <- as.matrix(term_doc)
td_mat <- td_mat[!row.names(td_mat) %in% stopwords(kind = 'SMART'),]
commonality.cloud(term.matrix = td_mat)
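
The row filter above, and the column selection by cluster used in the word clouds of this section, both rely on logical indexing of a matrix. A minimal sketch on a made-up term-document matrix follows; toy_stop and toy_cluster are hypothetical stand-ins for the SMART stop word list and km_clust$cluster.

```r
# Toy term-document matrix: terms in rows, documents in columns (made-up counts)
toy_td <- matrix(c(5, 4, 6,
                   2, 0, 1,
                   0, 7, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c('the', 'nation', 'applause'),
                                 c('doc1', 'doc2', 'doc3')))
toy_stop <- c('the', 'and')           # stand-in for stopwords(kind = 'SMART')
toy_cluster <- c(1, 2, 1)             # stand-in for km_clust$cluster

# Drop rows whose term is a stop word
kept <- toy_td[!rownames(toy_td) %in% toy_stop, , drop = FALSE]
# Keep only the documents assigned to cluster 1
kept_c1 <- kept[, toy_cluster == 1, drop = FALSE]
kept_c1
```

The drop = FALSE argument is a defensive choice: without it, a selection that leaves a single row or column collapses to a plain vector and loses the matrix structure.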

The terms common across all speeches are no surprise. They are the staple terms of presidential patriotism and political populism.

commonality.cloud(term.matrix = td_mat[,km_clust$cluster != 3], max.words = 50)

Spectrum vs. Outliers: Inspecting the words most common across the large spectrum shows a much larger emphasis on the word “world”, with many of the other terms occurring in a similar capacity.

commonality.cloud(term.matrix = td_mat[,km_clust$cluster == 3], max.words = 50)

The potential outlier cluster housing 6 of George W. Bush’s speeches is heavily influenced by the frequent use of the word “applause”. If this term is temporarily removed from the set, it may provide more information as to secondary drivers.

td_mat <- td_mat[!row.names(td_mat) %in% c('applause'),]
commonality.cloud(term.matrix = td_mat[,km_clust$cluster == 3], max.words = 50)

With “applause” gone, other differences immediately crop up. Where the large spectrum commonly used the term “nation”, these 6 speeches far more often use other terms such as “america” and “country”. Additionally, terms like “freedom” and “security” show up with higher frequency.

Within the Spectrum: In order to understand the differences within the spectrum, the post-WWII set (cluster 4) is compared to the most modern set (cluster 2).

commonality.cloud(term.matrix = td_mat[,km_clust$cluster == 4], max.words = 50)

The post-WWII/Korean War era speeches largely emphasize the “nation” and related terms, and reflect a concern for the recovering economy fresh from the scars of the Great Depression.

commonality.cloud(term.matrix = td_mat[,km_clust$cluster == 2], max.words = 50)

The tone is markedly different at the other end of the spectrum in the modern era. Multiple terms related to time are apparent, like “year/years” and “time”. There is additionally a stark contrast between the use of “nation” in the earlier era and the explicit “america” in multiple forms.

Conclusion
The presidential State of the Union speech largely reflects the time in which it is delivered. Though stylistic differences are notable and detectable for a single president, as well as for a combination of political party and era, their magnitude is secondary. On its own, political affiliation is not a characteristic driver compared to position in time. Democrat Jimmy Carter’s 1978 State of the Union has more in common with Republican Ronald Reagan’s 1981 speech than it does with fellow Democrat Bill Clinton’s 1993 speech. This pattern is pervasive and leads to the conclusion that, with the exception of George W. Bush, the content of the State of the Union is driven mostly by the passage of time.

Frank D. Evans