Purpose

This R Markdown document does some exploratory text mining of the evidence given before the select committee on the gold standard of 1932. It is for Prof. Eichengreen.

It starts by looking at the words used most often across all the witnesses, then turns to the words that characterise each individual witness. Finally, it runs a sentiment analysis on the evidence provided by each witness.

Data
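
The data-loading step is not shown in this document. As a rough sketch only, the code below assumes the transcribed evidence sits in a data frame with one row per page image and hypothetical columns `witness`, `page` and `text`, and shows how `unnest_tokens()` from tidytext could turn it into the one-word-per-row shape that the later chunks expect from `text_df`. The toy rows are invented for illustration and are not taken from the evidence.

```r
library(dplyr)
library(tidytext)

# Hypothetical raw input: one row per page image (toy text, not real evidence)
raw_df <- tibble(
  witness = c("Witness A", "Witness A", "Witness B"),
  page    = c(1, 2, 1),
  text    = c("Wool exports fell sharply.",
              "Farmers could not repay loans.",
              "Wages on the mines were cut.")
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
toy_text_df <- raw_df %>%
  unnest_tokens(word, text)

toy_text_df
```

The result keeps `witness` and `page` alongside each `word`, which is what the counting and sentiment chunks rely on.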

What are the most common words?

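# load the packages used throughout
library(dplyr)
library(ggplot2)
library(tidytext)
library(ggforce)

# remove common words (stop_words ships with tidytext)
text_df <- text_df %>%
  anti_join(stop_words, by = "word")

# filter out words common to every page
text_df <- text_df %>%
  filter(!word %in% c("ould",
                      "gold",
                      "standard",
                      "south",
                      "africa",
                      "cent",
                      "1932"))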

# count words, keeping those longer than 3 characters that appear more than 200 times
text_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 200,
         nchar(word) > 3) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = n)) +
  geom_col() +
  theme(legend.position = "none") +
  labs(title = "Most frequently used words in evidence",
       x = "Number of word uses")

What words are used by each witness?

We see from the figures below that there is variation in what the witnesses provide evidence about, as well as some common trends.

Farmers are the subject of evidence from Arndt, Norval and Schumann.

Wages and workers are dealt with by Leslie, Merks, Moore and Mushet.

Merks speaks about diamonds, and Molteno about fruit.

Wool, an important export, is a priority for Fourie, Galbraith, Gundelfinger, Norval and Mushet.

Van der Horst and Roy speak about the mechanics of trade, dealing with ships, ports, bunkers, exchange and trade in hides.

In my head I have a nice network plot that has witnesses in different colours and they are linked by the common topics that they share. I have yet to find a way to produce this plot - but I am learning many things!
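
One possible route to the network plot described above, offered as a sketch rather than a finished figure: treat witnesses and words as two kinds of node and draw an edge whenever a word is among a witness's top terms. It assumes the igraph and ggraph packages are installed. The witness-word pairs are a hand-picked subset of the topic lists above, not counts computed from `text_df`, and the column name `kind` is made up for this example.

```r
library(dplyr)
library(ggplot2)
library(igraph)
library(ggraph)

# Hand-picked witness-word pairs from the topic lists above (illustrative only);
# in practice these would come from the per-witness top words in text_df
edges <- tibble(
  witness = c("Arndt", "Norval", "Schumann", "Norval", "Fourie", "Mushet", "Mushet"),
  word    = c("farmers", "farmers", "farmers", "wool", "wool", "wool", "wages")
)

# Node table: every witness and every word, tagged by kind for colouring
nodes <- bind_rows(
  tibble(name = unique(edges$witness), kind = "witness"),
  tibble(name = unique(edges$word),    kind = "word")
)

# Undirected graph; the first two columns of `edges` are taken as the endpoints
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

set.seed(1932)  # the force-directed layout is random, so fix the seed
network_plot <- ggraph(g, layout = "fr") +
  geom_edge_link(colour = "grey60") +
  geom_node_point(aes(colour = kind), size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

network_plot
```

Here the colour separates witnesses from words; giving each witness its own colour would mean mapping `colour` to `name` for the witness nodes instead.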

Four pages of plots follow.

Page 1

# 12 most frequent words (longer than 3 characters) per witness, page 1 of 4
text_df %>%
  group_by(witness) %>%
  count(word, sort = TRUE) %>%
  filter(nchar(word) > 3) %>%
  # reorder_within() orders the words separately inside each witness facet
  mutate(word = reorder_within(word, n, witness)) %>%
  top_n(12, n) %>%
  ungroup() %>%
  ggplot(aes(n, word, fill = n)) +
  geom_col() +
  # strips the suffix that reorder_within() appends to each word
  scale_y_reordered() +
  scale_fill_gradient2(low = "blue",
                       high = "red",
                       mid = "blue",
                       midpoint = 15) +
  theme(legend.position = "none") +
  # 3 x 2 grid of witnesses per page
  facet_wrap_paginate(~ witness, ncol = 3, nrow = 2, scales = "free", page = 1) +
  labs(title = "12 most frequently used words by witnesses in giving evidence",
       x = "Number of word uses",
       y = "")

Page 2

# same pipeline as page 1; only the `page` argument changes
text_df %>%
  group_by(witness) %>%
  count(word, sort = TRUE) %>%
  filter(nchar(word) > 3) %>%
  mutate(word = reorder_within(word, n, witness)) %>%
  top_n(12, n) %>%
  ungroup() %>%
  ggplot(aes(n, word, fill = n)) +
  geom_col() +
  scale_y_reordered() +
  scale_fill_gradient2(low = "blue",
                       high = "red",
                       mid = "blue",
                       midpoint = 15) +
  theme(legend.position = "none") +
  facet_wrap_paginate(~ witness, ncol = 3, nrow = 2, scales = "free", page = 2) +
  labs(title = "12 most frequently used words by witnesses in giving evidence",
       x = "Number of word uses",
       y = "")

Page 3

# same pipeline as page 1; only the `page` argument changes
text_df %>%
  group_by(witness) %>%
  count(word, sort = TRUE) %>%
  filter(nchar(word) > 3) %>%
  mutate(word = reorder_within(word, n, witness)) %>%
  top_n(12, n) %>%
  ungroup() %>%
  ggplot(aes(n, word, fill = n)) +
  geom_col() +
  scale_y_reordered() +
  scale_fill_gradient2(low = "blue",
                       high = "red",
                       mid = "blue",
                       midpoint = 15) +
  theme(legend.position = "none") +
  facet_wrap_paginate(~ witness, ncol = 3, nrow = 2, scales = "free", page = 3) +
  labs(title = "12 most frequently used words by witnesses in giving evidence",
       x = "Number of word uses",
       y = "")

Page 4

# same pipeline as page 1; only the `page` argument changes
text_df %>%
  group_by(witness) %>%
  count(word, sort = TRUE) %>%
  filter(nchar(word) > 3) %>%
  mutate(word = reorder_within(word, n, witness)) %>%
  top_n(12, n) %>%
  ungroup() %>%
  ggplot(aes(n, word, fill = n)) +
  geom_col() +
  scale_y_reordered() +
  scale_fill_gradient2(low = "blue",
                       high = "red",
                       mid = "blue",
                       midpoint = 15) +
  theme(legend.position = "none") +
  facet_wrap_paginate(~ witness, ncol = 3, nrow = 2, scales = "free", page = 4) +
  labs(title = "12 most frequently used words by witnesses in giving evidence",
       x = "Number of word uses",
       y = "")
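
The four chunks above are identical apart from the `page` argument, so they could be wrapped in a small helper. The sketch below is one way to do that, not the code used to make the plots in this document: the function name `plot_witness_page()`, its `n_words` argument and the toy data are all made up for illustration, and `slice_max()` stands in for `top_n()`.

```r
library(dplyr)
library(ggplot2)
library(tidytext)
library(ggforce)

# Build one page of the per-witness word plots.
# `df` needs one row per word use, with columns `witness` and `word`.
plot_witness_page <- function(df, page, n_words = 12) {
  df %>%
    count(witness, word) %>%
    filter(nchar(word) > 3) %>%
    group_by(witness) %>%
    slice_max(order_by = n, n = n_words) %>%  # keeps ties, like top_n()
    ungroup() %>%
    mutate(word = reorder_within(word, n, witness)) %>%
    ggplot(aes(n, word, fill = n)) +
    geom_col() +
    scale_y_reordered() +
    scale_fill_gradient2(low = "blue", high = "red",
                         mid = "blue", midpoint = 15) +
    theme(legend.position = "none") +
    facet_wrap_paginate(~ witness, ncol = 3, nrow = 2,
                        scales = "free", page = page) +
    labs(title = paste(n_words, "most frequently used words by witnesses"),
         x = "Number of word uses",
         y = "")
}

# Toy data so the sketch runs on its own (not real evidence)
toy_df <- tibble(
  witness = rep(c("Witness A", "Witness B"), each = 4),
  word    = c("wool", "wool", "farmers", "wages",
              "wages", "wages", "diamonds", "wool")
)

toy_plot <- plot_witness_page(toy_df, page = 1)
toy_plot

# With the real text_df, the four pages would be:
# plots <- lapply(1:4, function(p) plot_witness_page(text_df, p))
```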

Sentiment analysis

The evidence of each witness is broken up into half-page sections and scored for positive or negative sentiment based on the words used. The BING sentiment lexicon is used to get a positive or negative score for each word.
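
To make the scoring concrete, here is a small sketch with a made-up four-word lexicon standing in for the BING lexicon. The words, labels and section numbers are invented for illustration. The steps mirror the scoring described above: match words to the lexicon, count positive and negative words per section, and subtract.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for get_sentiments("bing"): one label per word
toy_lexicon <- tibble(
  word      = c("loss", "crisis", "benefit", "improve"),
  sentiment = c("negative", "negative", "positive", "positive")
)

# Toy tokens for two sections of one witness's evidence
toy_tokens <- tibble(
  witness = "Witness A",
  index   = c(1, 1, 1, 2, 2),
  word    = c("loss", "crisis", "benefit", "improve", "farmers")
)

toy_scores <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%   # words not in the lexicon drop out
  count(witness, index, sentiment) %>%       # positive/negative counts per section
  spread(sentiment, n, fill = 0) %>%         # one column per sentiment
  mutate(sentiment = positive - negative)    # net score per section

toy_scores
```

In this toy case section 1 scores -1 (one positive word against two negative) and section 2 scores +1. The word "farmers" carries no label, so it does not affect either score.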

We see that Laite, Judson, Moore, Mushet, Norval, Schumann and Shiel have markedly negative sentiment in the evidence they provide throughout their testimony.

Evans, Gundelfinger and Martin, J have at least some positive sentiment amongst their testimony.

This seems a useful complement to further reading, or a starting point for grouping the evidence by topic and reanalysing it.

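library(tidyr)

sentiment <- text_df %>%
  # keep only words that appear in the BING lexicon
  inner_join(get_sentiments("bing"), by = "word") %>%
  # page %/% 1 is integer division by 1, so index is the whole-number part of page
  count(witness, index = page %/% 1, sentiment) %>%
  # one column per sentiment, with 0 where a section has no such words
  spread(sentiment, n, fill = 0) %>%
  # net score: positive words minus negative words
  mutate(sentiment = positive - negative)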

Sentiment visualization

Page 1

sentiment %>% 
  ggplot(aes(index, sentiment, fill = witness)) +
  geom_col(show.legend = FALSE) +
  facet_wrap_paginate(~ witness, ncol = 3, nrow = 3, page = 1, scales = "free") +
  labs(title = "Sentiment of evidence provided by witness",
       subtitle = "Scored with BING sentiment lexicon",
       y = "Sentiment score by section",
       x = "Image number")

Page 2

sentiment %>% 
  ggplot(aes(index, sentiment, fill = witness)) +
  geom_col(show.legend = FALSE) +
  facet_wrap_paginate(~ witness, ncol = 3, nrow = 3, page = 2, scales = "free") +
  labs(title = "Sentiment of evidence provided by witness",
       subtitle = "Scored with BING sentiment lexicon",
       y = "Sentiment score by section",
       x = "Image number")

Page 3

sentiment %>% 
  ggplot(aes(index, sentiment, fill = witness)) +
  geom_col(show.legend = FALSE) +
  facet_wrap_paginate(~ witness, ncol = 3, nrow = 3, page = 3, scales = "free") +
  labs(title = "Sentiment of evidence provided by witness",
       subtitle = "Scored with BING sentiment lexicon",
       y = "Sentiment score by section",
       x = "Image number")