I’ve been interested in playing around with text mining in R for a while now. Specifically, I wanted to try out some of the methods outlined here.

The other week I checked my email and saw a new issue of Data is Plural that linked to the UN General Debate Corpus.

Every year since 1947, representatives of UN member states gather at the annual sessions of the United Nations General Assembly. The centrepiece of each session is the General Debate. This is a forum at which leaders and other senior officials deliver statements that present their government’s perspective on the major issues in world politics. These statements are akin to the annual legislative state-of-the-union addresses in domestic politics. This new dataset, the UN General Debate Corpus (UNGDC), introduces the corpus of texts of General Debate statements from 1970 (Session 25) to 2016 (Session 71).

I will use this data to perform [1] a term frequency analysis and [2] a sentiment analysis.

[1] Term frequency analysis

I am going to use this data to compare the content of speeches by the five permanent members of the UN Security Council via tf-idf (a statistic that reflects how important a word is to one document relative to the rest of a corpus). Think of each country as a document and the collection of all speeches as the corpus. The permanent members are the US, Britain, France, China, and Russia.
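To make the statistic concrete, here is a minimal base R sketch on made-up counts. The countries A, B, C and all numbers are hypothetical, and it assumes the natural-log form of idf, which is the convention tidytext's bind_tf_idf() uses.

```r
# Toy counts: one row per (document, word) pair. All values are made up.
counts <- data.frame(
  country = c("A", "A", "B", "B", "C"),
  word    = c("peace", "trade", "peace", "oil", "peace"),
  n       = c(3, 1, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Term frequency: the share of a document's words accounted for by this word.
counts$tf <- counts$n / ave(counts$n, counts$country, FUN = sum)

# Inverse document frequency: ln(number of documents / documents containing the word).
n_docs   <- length(unique(counts$country))
doc_freq <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$word]))

# tf-idf is the product of the two.
counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

A word that appears in every document, like "peace" here, gets an idf of zero and so a tf-idf of zero, no matter how often it is used. That is what lets the measure surface words that are distinctive to one country.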

The tf-idf measure has been used to examine phrases used by GOP candidates and to find the most important words for each character in the TV show Seinfeld, to name just two examples.

I downloaded the txt files for those countries’ speeches into a folder named sec_council. Let’s now import the data.

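# Assumes the readtext package, whose readtext() takes the arguments below
library(readtext)
# Read every .txt file in sec_council; country, speech number, and year come from the file names
speeches <- readtext("sec_council/*.txt",
         docvarsfrom = "filenames", 
         docvarnames = c("country", "speech_num", "year"),
         dvsep = "_", 
         encoding = "ISO-8859-1")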
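The arguments above ask for document variables to be taken from the file names, split on the underscore. As a rough base R sketch of that idea (the file name USA_25_1970.txt is a hypothetical example, and this is not how the import function is implemented internally):

```r
# Hypothetical file name of the form country_session_year.txt
fname <- "USA_25_1970.txt"

# Drop the extension, then split on the separator
parts <- strsplit(tools::file_path_sans_ext(fname), "_", fixed = TRUE)[[1]]

# Attach the variable names used in the import call
docvars <- setNames(as.list(parts), c("country", "speech_num", "year"))
docvars$year <- as.integer(docvars$year)
str(docvars)
```

Each imported speech then carries its country, session number, and year as ordinary columns, which is what makes grouping by country possible later on.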

Let’s make some basic changes to the text.

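# Drop the possessive 's so that "China's" is counted with "China"
speeches$text <- gsub("'s", "", speeches$text)
# Strip leftover encoding debris from the source files
speeches$text <- gsub("â", "", speeches$text)
speeches$text <- gsub("92s", "s", speeches$text)
# Correct a misspelling of France found in the source text
speeches$text <- gsub("Prance", "France", speeches$text)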
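To see what substitutions like these do, here is a small sketch on a made-up sentence (the sample text is hypothetical and not taken from the corpus):

```r
# A made-up sentence containing two of the problems being cleaned up
sample_text <- "Prance welcomes the Council's decision."

# Remove the possessive 's, then fix the misspelled country name
cleaned <- gsub("'s", "", sample_text)
cleaned <- gsub("Prance", "France", cleaned)
cleaned
```

Note that gsub() treats its pattern as a regular expression by default. None of the patterns used here contain special characters, so that makes no difference, but passing fixed = TRUE is a safe habit when a literal match is intended.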

Now we follow the steps outlined in the Text Mining manual.

library(tidytext); library(dplyr); library(tidyr)
# Count how often each word appears in each country's speeches
country_words <- speeches %>%
  unnest_tokens(word, text) %>%
  count(country, word, sort = TRUE)
# Total number of words per country
total_words <- country_words %>% 
  group_by(country) %>% 
  summarize(total = sum(n))
country_words1 <- left_join(country_words, total_words, by = "country")
country_words2 <- country_words1 %>%
  bind_tf_idf(word, country, n)
country_words2 %>%
  select(-total) %>%