Overview:

This file details my progress on the project.

Library

This section loads the relevant R packages. The object ungdc18 holds the speech corpus. set.seed() fixes the state of the random number generator, so that any step involving random draws returns the same results on every run.
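As a quick illustration of why the seed matters, here is a minimal base R sketch. The seed value 2018 is arbitrary and is not taken from the project.

```r
# Hypothetical seed value; any fixed integer works.
set.seed(2018)
first_draw <- runif(3)

# Resetting the same seed reproduces the same "random" numbers.
set.seed(2018)
second_draw <- runif(3)

identical(first_draw, second_draw)  # TRUE
```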

Tokenizing

This step tokenizes the corpus into individual words (tokens), the units on which every later step operates. It has to run before any of the analyses below.
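To make the output of this step concrete, here is a rough base R sketch of tokenizing into per-document word counts. The toy speeches and the tokenize() helper are made up for illustration, and the project's own tokenizer may differ. The result has the same doc_id / word / n shape that ungdc_unnest is used with below.

```r
# Toy corpus; doc_id and text are hypothetical stand-ins for the speech data.
speeches <- data.frame(
  doc_id = c("A_2001", "B_2001"),
  text   = c("The environment matters. Protect the environment!",
             "Sustainable growth needs a healthy environment."),
  stringsAsFactors = FALSE
)

# Lower-case, replace everything except letters and spaces, split on spaces.
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)
  tokens <- strsplit(x, " +")[[1]]
  tokens[tokens != ""]
}

# One row per (doc_id, word) pair with its count n.
counts <- do.call(rbind, lapply(seq_len(nrow(speeches)), function(i) {
  tab <- table(tokenize(speeches$text[i]))
  data.frame(doc_id = speeches$doc_id[i],
             word   = names(tab),
             n      = as.integer(tab),
             stringsAsFactors = FALSE)
}))

counts[counts$word == "environment", ]
```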

Dictionary Methods

This step demonstrates basic word counts over time, coloured by UN_REGION. I don’t expect this section to be particularly relevant for the analysis, as we are primarily using the relational ‘word embeddings’ approach, as opposed to the substantialist ‘dictionary methods’ displayed below.

# Mentions of Environment ----

ungdc_unnest %>%
  filter(word == "environment") %>%
  group_by(year, UN_REGION) %>%
  summarise(mentions = sum(n)) %>%
  ggplot(aes(x = year, y = mentions, colour = UN_REGION)) + 
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(title = "Mentions of `environment` per year")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Mentions of Sustainable ----

ungdc_unnest %>%
  filter(word == "sustainable") %>%
  group_by(year, UN_REGION) %>%
  summarise(mentions = sum(n)) %>%
  ggplot(aes(x = year, y = mentions, colour = UN_REGION)) + 
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(title = "Mentions of `sustainable` per year")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Word Embeddings

In this step, I compute the ‘Concept Mover’s Distance’ (CMD) values for key terms of interest: environment, sustainable, and sustainable development. It is easy to replicate this step for different words, so I can add additional concepts if you can think of any interesting ones.

In short, Concept Mover’s Distance measures the computational ‘work’ needed to move the word embeddings of a given document onto the word embeddings of the concept of interest. Moving between semantically close words takes less work than moving between semantically distant ones, because semantic similarity appears as proximity in the embedding space. A document with a high concentration of terms close to the target concept therefore takes less ‘work’ to move onto it, and this cost is the document’s ‘Concept Mover’s Distance’ to that concept.
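The intuition can be sketched with a toy example in base R. This is a simplified illustration, not what CMDist does internally: the 2-d embeddings are invented, and moving_cost() is a hypothetical helper that weights each word's cosine distance to a single concept word by the word's share of the document.

```r
# Toy 2-d embeddings; real embeddings have hundreds of dimensions.
wv <- rbind(
  environment = c(1.0, 0.1),
  climate     = c(0.9, 0.3),
  pollution   = c(0.8, 0.2),
  trade       = c(0.1, 1.0),
  tariff      = c(0.2, 0.9)
)

cosine_dist <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Simplified "moving cost": each word's share of the document times its
# distance to the concept word, summed over the document.
moving_cost <- function(word_counts, concept, wv) {
  weights <- word_counts / sum(word_counts)
  dists <- sapply(names(word_counts),
                  function(w) cosine_dist(wv[w, ], wv[concept, ]))
  sum(weights * dists)
}

doc_green <- c(climate = 3, pollution = 2, trade = 1)
doc_trade <- c(trade = 3, tariff = 2, climate = 1)

cost_green <- moving_cost(doc_green, "environment", wv)
cost_trade <- moving_cost(doc_trade, "environment", wv)
cost_green < cost_trade  # the environment-heavy document is cheaper to move
```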

The code below computes the CMD values of every document for each concept of interest, then binds the scores to the original data frame as new columns.

# Build the document-term matrix once and reuse it for every concept.
ungdc_dtm <- ungdc_unnest %>%
  cast_dtm(term = word,
           document = doc_id,
           value = n,
           weighting = tm::weightTf) %>%
  removeSparseTerms(.999)

# Column 2 of the CMDist output holds the scores.
# Note: these assignments assume the DTM rows are in the same order as ungdc18.

# Environment
ungdc18$cmd_environment <- ungdc_dtm %>%
  CMDist(cw = c("environment"), wv = my.wv) %>%
  select(2) %>% unlist() %>% as.numeric()

# Sustainable
ungdc18$cmd_sustainable <- ungdc_dtm %>%
  CMDist(cw = c("sustainable"), wv = my.wv) %>%
  select(2) %>% unlist() %>% as.numeric()

# Sustainable Development
ungdc18$cmd_sustdev <- ungdc_dtm %>%
  CMDist(cw = c("sustainable development"), wv = my.wv) %>%
  select(2) %>% unlist() %>% as.numeric()
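The assignments above attach scores by position, which only works if the score rows come back in the same order as ungdc18. A more defensive option is to align on the document id. The sketch below uses made-up stand-in data frames (meta for ungdc18, scores for the CMDist output) and base R's match().

```r
# Hypothetical stand-ins: `meta` plays the role of ungdc18 and `scores`
# the role of the CMDist output (document id plus one score column).
meta <- data.frame(doc_id = c("A_2001", "B_2001", "C_2001"),
                   year   = c(2001, 2001, 2001),
                   stringsAsFactors = FALSE)
scores <- data.frame(doc_id = c("C_2001", "A_2001", "B_2001"),
                     cmd    = c(0.7, -1.2, 0.3),
                     stringsAsFactors = FALSE)

# match() finds each meta row's position in scores, so the scores are
# aligned by id; a document missing from scores would get NA.
meta$cmd_environment <- scores$cmd[match(meta$doc_id, scores$doc_id)]
meta
```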

Initial CMD Plots: pre-defined covariates:

Interpretation: This section shows the degree of engagement with three focal concepts: environment, sustainable, and sustainable development. Overall, the plots show a general trend of increasing engagement over time. The trend is clearest in the growing density of points above the horizontal line at y = 1, since scores in that region indicate less ambiguous engagement.
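One way to put a number on the "density above y = 1" reading is the share of speeches per year that score above 1. The sketch below uses invented scores. In the report, the inputs would be ungdc18$year and one of the cmd_ columns.

```r
# Hypothetical scores for two years.
toy <- data.frame(
  year = c(1990, 1990, 1990, 1990, 2010, 2010, 2010, 2010),
  cmd  = c(-0.8, 0.2, 1.3, -1.4, 0.9, 1.6, 1.1, -0.2)
)

# Proportion of speeches per year scoring above the y = 1 reference line.
share_above <- tapply(toy$cmd > 1, toy$year, mean)
share_above
```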

I also made visualizations after subsetting the data by the following metadata covariates: UN_REGION, oecd, P5, and nam. I chose these mostly because they were on hand, and I wanted to see whether they would reveal clusters of actors with distinct degrees of engagement. However, the subset plots didn’t align with any distinct narrative. There may be metadata out there that could segment groups by the quality of their engagement, but I don’t yet have it. The best course of action is therefore probably to cluster actors directly on their degrees of concept engagement, and then classify the resulting groups.
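A rough sketch of what that clustering could look like, using k-means from base R's stats package. The per-country mean scores are invented, and k-means with two clusters is only one possible choice. I haven't settled on a method or a number of clusters.

```r
set.seed(42)  # k-means starts from random centres

# Hypothetical per-country mean scores on two concepts.
engagement <- data.frame(
  country         = c("A", "B", "C", "D", "E", "F"),
  cmd_environment = c(1.4, 1.2, 1.5, -0.9, -1.1, -0.7),
  cmd_sustainable = c(1.1, 1.3, 0.9, -1.0, -0.8, -1.2)
)

# Two clusters, several random starts to avoid a poor local solution.
fit <- kmeans(engagement[, c("cmd_environment", "cmd_sustainable")],
              centers = 2, nstart = 10)
engagement$cluster <- fit$cluster

# Countries grouped by cluster label.
split(engagement$country, engagement$cluster)
```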

# Environment ----
ungdc18 %>%
 ggplot(mapping = aes(x = year, y = cmd_environment)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'environment'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(UN_REGION) %>%
 ggplot(mapping = aes(x = year, y = cmd_environment, colour = UN_REGION)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'environment'; subset by UN Region")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(oecd) %>%
 ggplot(mapping = aes(x = year, y = cmd_environment, colour = as.factor(oecd))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'environment'; subset by 'oecd'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(P5) %>%
 ggplot(mapping = aes(x = year, y = cmd_environment, colour = as.factor(P5))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'environment'; subset by 'P5'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(nam) %>%
 ggplot(mapping = aes(x = year, y = cmd_environment, colour = as.factor(nam))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'environment'; subset by 'nam'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# Sustainable ----

ungdc18 %>%
 ggplot(mapping = aes(x = year, y = cmd_sustainable)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'sustainable'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(UN_REGION) %>%
 ggplot(mapping = aes(x = year, y = cmd_sustainable, colour = UN_REGION)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'sustainable'; subset by UN Region")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(oecd) %>%
 ggplot(mapping = aes(x = year, y = cmd_sustainable, colour = as.factor(oecd))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'sustainable'; subset by 'oecd'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(P5) %>%
 ggplot(mapping = aes(x = year, y = cmd_sustainable, colour = as.factor(P5))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'sustainable'; subset by 'P5'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(nam) %>%
 ggplot(mapping = aes(x = year, y = cmd_sustainable, colour = as.factor(nam))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'sustainable'; subset by 'nam'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# Sustainable Development ----

ungdc18 %>%
 ggplot(mapping = aes(x = year, y = cmd_sustdev)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'Sustainable Development'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(UN_REGION) %>%
 ggplot(mapping = aes(x = year, y = cmd_sustdev, colour = UN_REGION)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'Sustainable Development'; subset by UN Region")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(oecd) %>%
 ggplot(mapping = aes(x = year, y = cmd_sustdev, colour = as.factor(oecd))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'Sustainable Development'; subset by 'oecd'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(P5) %>%
 ggplot(mapping = aes(x = year, y = cmd_sustdev, colour = as.factor(P5))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'Sustainable Development'; subset by 'P5'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  group_by(nam) %>%
 ggplot(mapping = aes(x = year, y = cmd_sustdev, colour = as.factor(nam))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "red") +
  geom_hline(yintercept = -1, linetype = "dashed", colour = "red") +
  labs(title = "Concept Engagement:'Sustainable Development'; subset by 'nam'")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Cosine Similarity: TBD

Vector Algebra: TBD

What is the distance between environment and economy? Following the GloVe procedure on the text2vec site.

Vector Algebra:

Word embeddings represent each word as a vector, i.e. a 1 x n matrix. Each value in the vector is the word’s coordinate along one dimension of the vector space. Word vectors are real-valued, so they can be manipulated with the ordinary operations of linear algebra, such as addition and subtraction.
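For concreteness, a tiny base R sketch of these operations on two invented 1 x 3 word vectors (real embeddings would have far more dimensions):

```r
# Toy 1 x 3 word vectors; the values are invented for illustration.
economy     <- matrix(c(0.2, 0.9, 0.4), nrow = 1)
environment <- matrix(c(0.7, 0.3, 0.4), nrow = 1)

dim(economy)  # 1 row, n = 3 columns

total      <- economy + environment       # element-wise addition
difference <- economy - environment       # element-wise subtraction
dot        <- economy %*% t(environment)  # 1 x 1 inner product
```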

In this section, I use linear algebra to transform the word vector for ‘economy’. This rests on intuition rather than firm theory. Environmental policies are often attacked for being too costly or economically damaging. Some governments (perhaps those that rely on hydrocarbon exports) may find environmental protection threatening to their economy. How does this manifest in political speech?

Word embeddings represent individual words in relation to other words. In vector space, semantically related words occupy a similar ‘neighborhood’, together with the context words with which they tend to co-occur. My wager is that for countries with an oppositional view of ‘economy’ and ‘environment’, there will be less overlap between the two terms in their vector space, because the economy is treated as a separate issue from the environment. Conversely, states that understand the links between economy and environment may use similar words to describe each, or at least there should be more overlap in the contextual signifiers for each term. Basically, I’m trying to capture the distance between environment and economy in a single measure.
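A minimal sketch of how this ‘overlap’ could be quantified, using cosine similarity between the two word vectors. The vectors are invented, and the sketch assumes a separate set of embeddings per group of countries, which I have not built. It only shows the shape of the comparison.

```r
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Invented vectors for two hypothetical sets of embeddings.
# In "linked", economy and environment point in similar directions.
linked <- list(economy = c(0.8, 0.5, 0.1), environment = c(0.7, 0.6, 0.2))
# In "separate", they point in nearly orthogonal directions.
separate <- list(economy = c(0.9, 0.1, 0.0), environment = c(0.1, 0.2, 0.9))

sim_linked   <- cosine_sim(linked$economy, linked$environment)
sim_separate <- cosine_sim(separate$economy, separate$environment)
c(linked = sim_linked, separate = sim_separate)
```

A higher cosine similarity means more overlap between the two terms, so the ‘linked’ case should come out well above the ‘separate’ one.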

To do so, I subtract the vector for ‘environment’ from the vector for ‘economy’. The result is a transformed economy vector with its contextual overlap with ‘environment’ subtracted out. Then, I measure the CMD of each speech to this transformed vector. The resulting score serves as a proxy for engagement with ‘pure economy’, largely stripped of associations with environment.
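The subtraction can be sketched in base R as follows. This is an illustration with invented 3-d vectors, not a reproduction of what get_direction does internally. unit() and cosine_sim() are helper functions defined here.

```r
unit <- function(v) v / sqrt(sum(v^2))
cosine_sim <- function(a, b) sum(unit(a) * unit(b))

# Invented 3-d vectors for illustration only.
wv <- rbind(
  economy     = c(0.9, 0.4, 0.1),
  environment = c(0.3, 0.5, 0.8),
  growth      = c(0.8, 0.3, 0.0),
  forest      = c(0.1, 0.4, 0.9)
)

# Direction pointing from 'environment' towards 'economy'.
eco_minus_env <- unit(wv["economy", ] - wv["environment", ])

# Words on the economy side score positive, environment side negative.
word_scores <- c(growth = cosine_sim(wv["growth", ], eco_minus_env),
                 forest = cosine_sim(wv["forest", ], eco_minus_env))
word_scores
```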

# Economy - Environment ----
targetpole <- c("economy")
antipole <- c("environment")
pair <- cbind(targetpole,antipole)

cv_ecoenv <- get_direction(pair,my.wv)

rm(targetpole,antipole)

ungdc18$ecoenv <- ungdc_unnest %>% 
  cast_dtm(term = word, 
           document = doc_id, 
           value = n, 
           weighting = tm::weightTf) %>%
           removeSparseTerms(.999) %>%
  CMDist(cv = cv_ecoenv,
         wv = my.wv,
         method = "cosine") %>% select(2)

ungdc18$ecoenv <- as.numeric(unlist(ungdc18$ecoenv))

rm(cv_ecoenv,pair)

Economy - Environment

ungdc18 %>%
  ggplot(mapping = aes(x = year, y = ecoenv)) + 
  geom_point() + 
  geom_smooth() +
  labs(title = "Engagement w/ Transformed Concept Vector for Economy - Environment")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ungdc18 %>%
  ggplot(mapping = aes(x = year, y = ecoenv, colour = UN_REGION)) + 
  geom_point() + 
  geom_smooth() +
  labs(title = "Engagement w/ Transformed Concept Vector for Economy - Environment; subset by UN Region")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
