Library

Part 1: Counting Key Terms

Summary of findings:

The term “inequality” is appearing more frequently in UNGA speeches over time (Figure 1). This trend is present across all UN regional groupings, but it is most pronounced in Latin America (Figure 2). Additionally, the trend starts earlier in Latin America (~1990) than in other regions (mid-2000s) (Figure 2).

Figures 1 and 2 show this trend in absolute terms and do not account for certain confounding factors. For example, the absolute frequency of ‘inequ’ may increase over time simply because the number of states, and thus speeches, has increased; Figure 3 confirms that the number of speeches does rise over time. Variation in the length of texts could also influence term frequency; Figure 4 shows that aggregate text length (measured in number of unigrams) varies considerably over time. However, Figure 5 shows that the trend of increased mentions of ‘inequ’ persists when accounting for these potential confounders. Taken together, the evidence indicates that mentions of inequality are becoming more frequent in general, especially in Latin America, and that this trend increased sharply around the time of the 2008 financial crisis. The upward trend in Latin America that began around 1990 also suggests a less pronounced moment of interest during the early 1990s.

Figure 1:

This figure depicts the term frequency of ‘inequality’ by year with base and stemmed tokenization. The stem tokenizer increases the frequency, meaning that the stem (‘inequ’) captures more than the word (‘inequality’). Upon manual inspection, I also noticed that ‘inequ’ captures the related term ‘inequity’.
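The gap between the two counts can be illustrated with a small sketch. It uses simple prefix matching as a stand-in for stemming, which is an assumption: a real stemmer may treat some forms differently, and the tokens below are invented rather than drawn from the corpus.

```r
# Toy tokens (hypothetical; not drawn from the UNGA corpus)
tokens <- c("inequality", "inequalities", "inequity", "inequitable",
            "equality", "inequality", "poverty")

# Exact-word match: counts only the surface form "inequality"
exact_count <- sum(tokens == "inequality")

# Prefix match: a rough approximation of counting the stem "inequ"
prefix_count <- sum(startsWith(tokens, "inequ"))

# Distinct surface forms picked up by the prefix
captured_forms <- sort(unique(tokens[startsWith(tokens, "inequ")]))

print(exact_count)
print(prefix_count)
print(captured_forms)
```

The prefix count is at least as large as the exact count by construction, which is why the stemmed line in Figure 1 sits above the base line.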

Absolute term frequency remains stable until the mid-2000s, when it sharply increases. This could signal increased attention to general matters of inequality in the UNGA, but it provides little indication regarding the quality of attention. Term frequency does not map directly onto concept salience, so I don’t think we can draw any conclusions about how important inequality is relative to other issues. Nevertheless, it is interesting that this visualization depicts a sharp rise in the term frequency of inequality around the time of the 2008 financial crisis.

inequality_frequency_aggregate %>%
  ggplot(mapping = aes(x = year, y = count, color = tokenization)) + 
  geom_smooth() + 
  labs(title = "Term Frequency of 'inequality' by year", subtitle = "base v. stemmed tokenization", caption = "Figure 1", x = "year", y = "term frequency")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 2:

This figure depicts the term frequency of inequality (stemmed to ‘inequ’) over time, disaggregated by UN Region. Latin America accounts for the highest frequency and has been on a steady upward trend since 1990, with acceleration around 2008. After Latin America, the change seems steepest in WEOG and Africa, although the difference from the remaining regions is not clear-cut.

inequality_frequency_UNREGION[ which(inequality_frequency_UNREGION$tokenization=="inequ"),] %>%
  ggplot(mapping = aes(x = year, y = count, color = UN_REGION)) + 
  geom_smooth() + 
  labs(title = "Term Frequency of 'inequality' by year", subtitle = "stemmed tokenization", caption = "Figure 2", x = "year", y = "term frequency")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

Figures 3, 4, and 5:

Figures 3, 4, and 5 test whether the increase in mentions of ‘inequ’ could result from the growth of the corpus over time or from arbitrary variations in document length.

Figure 3 demonstrates that the number of speeches per year increases over time.

# Documents per year: 
ungdc18[ which(ungdc18$year != 1970),] %>%
  group_by(year) %>%
  count(doc_id) %>%
  summarise(documents = sum(n)) %>%
  ggplot(mapping = aes(x = year, y = documents)) +
  geom_point() +
  geom_line() +
  labs(title = "UNGA Speeches per Year", subtitle = "Excluding 1970, which has incomplete data", caption = "Figure 3")

## `summarise()` ungrouping output (override with `.groups` argument)

Figure 4 shows the number of unigrams per year, which varies considerably over time. Although the number of speeches increased consistently, the number of unigrams per year did not follow the same pattern. Therefore, we cannot say that the increase in mentions of inequality is a function of increased document length, as document length has not followed a stable trend.

# Unigram tokens per year: 
ungdc_tokens[ which(ungdc_tokens$year != 1970),] %>%
  group_by(year) %>%
  count(word) %>%
  summarise(unigrams = sum(n)) %>%
  ggplot(mapping = aes(x = year, y = unigrams)) +
  geom_point() +
  geom_line() +
  labs(title = "UNGA Unigrams per Year", subtitle = "Excluding 1970", caption = "Figure 4")

## `summarise()` ungrouping output (override with `.groups` argument)

Figure 5 depicts the relative prevalence of ‘inequ’ per year. I computed this measure by dividing the number of mentions of ‘inequ’ per year by the total number of unigrams per year. When smoothed (using loess), the upward inflection point appears to occur around 1990, and the trend clearly accelerates towards the mid-to-late 2000s. This demonstrates that the upward trend in mentions of inequality remains present when accounting for the potential confounders of document length and number.

inequ_year %>%
  ggplot(mapping = aes(x = year, y = inequ_prevalence)) +
  geom_point() +
  geom_line() +
  geom_smooth() +
  labs(title = "Relative Prevalence of 'inequ' per Year", subtitle = "Frequency of 'inequ' / Total Unigrams", caption = "Figure 5", x = "year", y = "Relative Prevalence of 'inequ'")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
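The construction of `inequ_year` is not shown in this report. As a minimal sketch of the ratio described for Figure 5 (mentions of ‘inequ’ divided by total unigrams, per year), the following uses an invented token table; the `toy_` objects and their values are assumptions for illustration only.

```r
# Hypothetical token table: one row per token occurrence, with the speech year
toy_tokens <- data.frame(
  year = c(1990, 1990, 1990, 1990, 2008, 2008, 2008, 2008, 2008),
  word = c("inequ", "peac", "nation", "develop",
           "inequ", "inequ", "crisi", "nation", "develop"),
  stringsAsFactors = FALSE
)

# Denominator: total unigrams per year
total_by_year <- tapply(toy_tokens$word, toy_tokens$year, length)

# Numerator: mentions of the stem per year
inequ_by_year <- tapply(toy_tokens$word == "inequ", toy_tokens$year, sum)

# Both tapply() results are ordered by year, so they align element-wise
toy_prevalence <- data.frame(
  year = as.integer(names(total_by_year)),
  inequ_count = as.vector(inequ_by_year),
  unigrams = as.vector(total_by_year),
  inequ_prevalence = as.vector(inequ_by_year) / as.vector(total_by_year)
)
print(toy_prevalence)
```

Because the denominator is the yearly token total, the measure adjusts for both the number of speeches and their length at once.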

Part 2: Inequality in Context

The findings from part 1 pertain to the word ‘inequality’, but they do not indicate the type of inequality. Inequality has intersectional dimensions that can be inferred from context. Part 2 contains analyses of the context surrounding mentions of inequality.

I think that the best way to get an intimate sense of the context is through manual inspection. However, there are several computational techniques that can be used to highlight exemplary documents and to gain a sense of the sentiments expressed in the context of inequality. To begin, I again located each mention of ‘inequ’ in the corpus and then extracted the surrounding context words. I did so with a window of 15, meaning that the 15 tokens on either side of ‘inequ’ are included. I purposely set this window wide in case a longer, more complex expression occurred near the term.
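The extraction code is not included in this report. One way a context window could be implemented is sketched below; `context_window()` is a hypothetical helper, and dropping the keyword itself and clipping at document boundaries are assumptions rather than the documented procedure.

```r
# Return the tokens within `window` positions on either side of each keyword hit.
# `tokens` is a character vector holding one document's tokens in order.
context_window <- function(tokens, keyword_prefix = "inequ", window = 15) {
  hits <- which(startsWith(tokens, keyword_prefix))
  lapply(hits, function(i) {
    left  <- max(1, i - window)               # clip at the start of the document
    right <- min(length(tokens), i + window)  # clip at the end of the document
    idx <- setdiff(left:right, i)             # drop the keyword itself
    tokens[idx]
  })
}

# Toy speech (invented), with a small window so the result is easy to read
toy_speech <- c("we", "must", "address", "rising", "inequality",
                "between", "and", "within", "nations")
contexts <- context_window(toy_speech, window = 2)
print(contexts)
```

With `window = 15` the same function would return up to 30 context tokens per mention, fewer when the mention falls near the start or end of a speech.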

After extracting inequality in context, I ran a sentiment analysis on all of the words in the context window. The NRC sentiment dictionary matches terms with sentiments across 10 dimensions. Of these dimensions, two are ambiguous (anticipation, surprise), while the others are distinctly positive or negative.
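Dictionary matching of this kind amounts to joining the context words against a word-to-sentiment table. The sketch below uses an invented mini-lexicon, not actual NRC entries, and all `toy_` names are hypothetical; it also reports the share of context words the lexicon covers.

```r
# Hypothetical mini-lexicon in a word/sentiment layout (NOT actual NRC entries)
toy_lexicon <- data.frame(
  word = c("poverty", "poverty", "hope", "crisis"),
  sentiment = c("negative", "sadness", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical context words from around a mention of inequality
toy_context <- data.frame(
  word = c("poverty", "hope", "nations", "crisis", "between"),
  stringsAsFactors = FALSE
)

# Inner join: keeps only context words found in the lexicon;
# a word tagged with several sentiments yields several rows
toy_matched <- merge(toy_context, toy_lexicon, by = "word")

# Share of context words that the lexicon covers at all
toy_coverage <- mean(toy_context$word %in% toy_lexicon$word)

print(table(toy_matched$sentiment))
print(toy_coverage)
```

Words absent from the lexicon simply drop out of the join, so the sentiment proportions describe only the covered share of the context window.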

Findings from figures 6 and 7:

The sentiment analysis suggests that aggregate sentiments in the context window for inequality maintain a slight negative edge that increases over time. This trend is most pronounced and consistent in Latin America and Africa. The other UN regions demonstrate greater volatility over time, making these regions appear less coherent.

Figure 6

Figure 6 depicts the aggregate sentiments across time. The sentiments associated with the context window of inequality are fairly consistent. Apart from a brief window around 1990, the positively inflected sentiments do not approach 50%. Overall, the context window surrounding inequality has a slight negative inflection that increases slightly through time.

# Figure 6

inequ_nrc %>% 
  group_by(year) %>%
  count(sentiment) %>%
  mutate(total_sentiment = sum(n)) %>%
  mutate(proportion_sentiment = n/total_sentiment) %>%
  ggplot(mapping = aes(x = year, y = proportion_sentiment, fill = sentiment)) +
  geom_area(colour = "black") +
  scale_fill_brewer(palette = "Paired") +  # sentiment is mapped to fill, so the palette must be a fill scale
  labs(title = "Sentiment Analysis of 'inequ' at Context Window 15", subtitle = "Using NRC Sentiment Dictionary", caption = "Figure 6", x = "year", y = "Proportion of Terms by Sentiment")

Figure 7

Figure 7 depicts a similar sentiment analysis, disaggregated by UN Region. GRULAC and AFRICA have the most consistently negative sentiment scores across time. WEOG, ASIAPAC, and EASTEUROPE demonstrate much more volatility, and each displays clear periods of positivity. There is a global burst in positivity surrounding the end of the Cold War, and this is most pronounced in WEOG and EASTEUROPE. AFRICA seems to be the least affected by the end of the Cold War.

inequ_nrc[ which(inequ_nrc$UN_REGION != "OTHER"),] %>% 
  group_by(year,UN_REGION) %>%
  count(sentiment) %>%
  mutate(total_sentiment = sum(n)) %>%
  mutate(proportion_sentiment = n/total_sentiment) %>%
  ggplot(mapping = aes(x = year, y = proportion_sentiment, fill = sentiment)) +
  geom_area(colour = "black") +
  facet_wrap(vars(UN_REGION)) +
  scale_fill_brewer(palette = "Paired") +  # sentiment is mapped to fill, so the palette must be a fill scale
  labs(title = "Sentiment Analysis of 'inequ' at Context Window 15", subtitle = "Using NRC Sentiment Dictionary", caption = "Figure 7", x = "year", y = "Proportion of Terms by Sentiment")

Figure 8

Figure 8 depicts the summation of binary-coded sentiments over time. Each positive term scores a 1 and each negative term scores a -1, and the scores are then aggregated per region and year. The smoothed line depicts a slight downward slope, a trend that accelerates around the 2008 financial crisis. Also note how the dispersion of these points increases over time. As the spread widens, the sentiments of the various UN regions become more distinct from one another.

# Figure 8
inequ_nrc_bin %>% 
  group_by(year, UN_REGION) %>%
  summarise(sent_sum = sum(sent_binary)) %>%
  ggplot(mapping = aes(x = year, y = sent_sum)) + 
  geom_point() +
  geom_smooth() +
  geom_vline(xintercept = 2008) +
  labs(title = "Summation of Binary Sentiment Values for 'inequ' at Context Window 15", subtitle = "Using NRC Sentiment Dictionary", caption = "Figure 8", x = "year", y = "Summation of Binary Sentiment Terms")

## `summarise()` regrouping output by 'year' (override with `.groups` argument)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
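The construction of `inequ_nrc_bin` is not shown above. A minimal sketch of the binary coding described for Figure 8 follows, on invented data; it assumes that only the ‘positive’ and ‘negative’ categories are scored, which this report does not state explicitly.

```r
# Hypothetical matched sentiment rows (toy data, not from the corpus)
toy_nrc <- data.frame(
  year = c(2007, 2007, 2007, 2009, 2009, 2009, 2009),
  UN_REGION = c("GRULAC", "GRULAC", "WEOG", "GRULAC", "GRULAC", "WEOG", "WEOG"),
  sentiment = c("negative", "positive", "positive",
                "negative", "negative", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Keep only the two polarity categories and code them as +1 / -1
toy_bin <- toy_nrc[toy_nrc$sentiment %in% c("positive", "negative"), ]
toy_bin$sent_binary <- ifelse(toy_bin$sentiment == "positive", 1, -1)

# Sum the signed scores per year and region
toy_sums <- aggregate(sent_binary ~ year + UN_REGION, data = toy_bin, FUN = sum)
print(toy_sums)
```

Note that a raw sum scales with the number of matched terms, so regions or years with more mentions of inequality can reach larger absolute values than those with fewer.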

Figure 9

Figure 9 is the same plot as Figure 8 with added colour for UN regions. This figure makes the consistent expression of negative sentiments by Latin America much clearer. Again, Africa has the second-lowest sentiment scores. The other UN regions are also slightly negative, but the trend is less pronounced.

# Figure 9
inequ_nrc_bin[ which(inequ_nrc_bin$UN_REGION != "OTHER"),] %>% 
  group_by(year, UN_REGION) %>%
  summarise(sent_sum = sum(sent_binary)) %>%
  ggplot(mapping = aes(x = year, y = sent_sum, colour = UN_REGION, fill = UN_REGION)) + 
  geom_point() +
  geom_smooth() +
  geom_vline(xintercept = 2008) + 
  labs(title = "Summation of Binary Sentiment Values for 'inequ' at Context Window 15", subtitle = "Using NRC Sentiment Dictionary", caption = "Figure 9", x = "year", y = "Summation of Binary Sentiment Terms")

## `summarise()` regrouping output by 'year' (override with `.groups` argument)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

OVERALL FINDINGS:

Parts 1 and 2 present the findings from dictionary-based methods for text analysis.

Latin America has the most direct, persistent, and sustained engagement with inequality relative to other UN regions. The region is not without volatility, but it explicitly mentions inequality much more frequently. This finding is robust when accounting for the relative proportion of speeches delivered by Latin American states.
There is a global increase in explicit mentions of inequality over time, especially after the 2008 financial crisis. This global increase was first reflected in Latin America.
The sentiments expressed in the context of inequality maintain a slight negative inflection over time, with several periods of volatility. Since the 2008 financial crash, the global sentiments expressed in this context have grown slightly more critical.
Among the different UN Regions, Latin America expresses the most consistently negative sentiments surrounding inequality. Africa also expresses consistently negative sentiments, and these sentiments are becoming increasingly negative at a similar rate to that of Latin America.

Weaknesses:

Substantialist ontology: unlike relational methods (word embedding), the methods used in Parts 1 and 2 assume that words have an inherent meaning that is not contextually bound.
Coverage of sentiment dictionary: not all of the context words register in the NRC sentiment dictionary.
Both of these issues can be addressed through replication with word embeddings.
