EDU identification using latent semantic analysis

Jens Roeser

Compiled Jan 12 2024

Aim

In this report we identify elementary discourse units (EDUs) on the basis of latent semantic analysis. As input units, we use sentences.

Setup

We need the following R packages:

```r
library(tidyverse)
library(tm)
library(lsa)
library(NLP)
library(openNLP)
library(udpipe)
```

For lemmatisation we need the udpipe language model.

```r
# Download and load the English model for udpipe
ud_model <- udpipe_download_model(language = "english", model_dir = getwd())
ud_model <- udpipe_load_model(ud_model$file_model)
```
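
As a quick check that lemmatisation works as intended, we can annotate a single sentence and pull out the lemmas (the output in the comment is illustrative, not verified):

```r
# Lemmatise one sentence; udpipe returns one row per token
udpipe_annotate(ud_model, x = "Dogs are often loyal companions.") %>% 
  as.data.frame() %>% 
  pull(lemma)
# Illustrative result: "dog" "be" "often" "loyal" "companion" "."
```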

Functions for EDU identification

The following functions implement the analysis steps and will be applied below.

```r
# Function to split text into sentences
split_into_sentences <- function(text) {
  # Collapse the input into a single string
  text <- str_c(text, collapse = " ")
  
  # Replace line breaks with spaces and squish repeated whitespace
  text <- str_squish(gsub("[\r\n]", " ", text))
  
  # Create a sentence-boundary annotator
  annotator <- Maxent_Sent_Token_Annotator()
  
  # Annotate the sentence boundaries
  text <- as.String(text)
  sentence_annotations <- NLP::annotate(text, annotator)
  
  # Extract the sentences as a character vector
  sentences <- as.character(text[sentence_annotations])
  
  return(sentences)
}

# Performs latent semantic analysis
get_lsa <- function(sentences) {
  # Lemmatize each sentence with udpipe and re-join the lemmas
  lemmatized_sentences <- map_vec(sentences, 
                                  ~udpipe_annotate(ud_model, x = .) %>% 
                                    as.data.frame() %>% 
                                    pull(lemma) %>% 
                                    str_c(collapse = " "))

  # Preprocess the text: lowercase; remove punctuation, numbers,
  # stopwords, and excess whitespace
  preprocessed_sentences <- tolower(lemmatized_sentences)
  preprocessed_sentences <- removePunctuation(preprocessed_sentences)
  preprocessed_sentences <- removeNumbers(preprocessed_sentences)
  preprocessed_sentences <- removeWords(preprocessed_sentences, stopwords("english"))
  preprocessed_sentences <- str_squish(preprocessed_sentences)
  
  # Create a document-term matrix (rows: sentences, columns: terms)
  corpus <- Corpus(VectorSource(preprocessed_sentences))
  dtm <- DocumentTermMatrix(corpus)
  
  # Perform latent semantic analysis; because the input is a
  # document-term matrix, $tk holds the sentence (document) loadings
  # and $dk the term loadings
  lsa_space <- lsa(dtm)
  return(lsa_space)
}
  
# Finds keywords for topics
get_topics <- function(lsa_space, threshold = .4){
  # Flip the sign of each dimension so that its largest-magnitude
  # loading is positive
  dk_matrix <- apply(lsa_space$dk, 2, function(col) col * sign(col[which.max(abs(col))]))
  
  # Set negative loadings to zero
  dk_matrix <- pmax(dk_matrix, 0)

  topics <- dk_matrix %>% 
    as.data.frame() %>% 
    rownames_to_column("key_words") %>% 
    as_tibble() %>% 
    pivot_longer(-key_words, names_to = "topic", values_to = "weight") %>% 
    filter(weight >= threshold) %>% 
    mutate(across(topic, ~str_remove(., "V"))) %>% 
    arrange(topic, weight)
  
  return(topics)
}

# Orders the sentences by topic
get_sentences_by_topics <- function(lsa_space, threshold = .4){
  # Flip the sign of each dimension so that its largest-magnitude
  # loading is positive
  tk_matrix <- apply(lsa_space$tk, 2, function(col) col * sign(col[which.max(abs(col))]))
  
  # Set negative loadings to zero
  tk_matrix <- pmax(tk_matrix, 0)
  
  # Identify related sentences
  tk_fin <- tk_matrix %>% 
    as.data.frame() %>% 
    rownames_to_column("sentence_id") %>% 
    pivot_longer(-sentence_id, names_to = "topic", values_to = "weight") %>% 
    filter(weight >= threshold) %>% 
    mutate(across(topic, ~str_remove(., "V"))) %>% 
    arrange(topic, weight)
  
  return(tk_fin)  
}
```

Toy example

We will use the following text as example input. This text contains some sentences that are semantically related, e.g. those about a lazy dog, and some sentences that are isolated, like the one about rainbows.

texts <- "The quick brown fox jumps over the lazy dog. The lazy dog barks loudly in response. 
Foxes are known for their agility and speed. Dogs are often loyal companions to humans. 
In the forest I saw a fox. Rainbows are pretty."

As a first step we separate the text into sentences, which we will use as input units for the latent semantic analysis.

```r
sentences <- split_into_sentences(texts)
```
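
This yields a character vector of six sentences (shown here as comments; cf. the overview table below):

```r
sentences
# [1] "The quick brown fox jumps over the lazy dog."
# [2] "The lazy dog barks loudly in response."
# [3] "Foxes are known for their agility and speed."
# [4] "Dogs are often loyal companions to humans."
# [5] "In the forest I saw a fox."
# [6] "Rainbows are pretty."
```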

The get_lsa function performs the latent semantic analysis.

```r
lsa_space <- get_lsa(sentences)
```
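
The returned LSA space holds the components of the truncated singular value decomposition: because the input was a document-term matrix, $tk contains the sentence (document) loadings, $dk the term loadings, and $sk the singular values. A quick way to inspect it:

```r
# Inspect the top-level components of the LSA space
str(lsa_space, max.level = 1)
```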

The get_topics function extracts the topics (EDUs) that the latent semantic analysis has identified as being represented in more than one sentence, keeping only keywords whose loading on a topic is at or above the threshold (default .4).

```r
topics <- get_topics(lsa_space) %>% 
  select(-weight) %>% 
  summarise(across(key_words, ~str_c(., collapse = ", ")), .by = topic)

topics
```

```
# A tibble: 2 × 2
  topic key_words
  <chr> <chr>    
1 1     lazy, dog
2 2     fox      
```

The get_sentences_by_topics function returns an overview of sentences associated with topics.

```r
sent_by_topic <- get_sentences_by_topics(lsa_space)
```

Finally, we combine everything to get an overview of topics (EDUs) with their keywords and the associated sentences. The analysis found that sentences 1 and 2 are related to topic 1 and sentences 3 and 5 are related to topic 2. Topic 3 has only one associated sentence, and sentence 6 is unrelated to the remaining sentences.

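The combining code is not shown in this report; a minimal sketch that should reproduce the table below, assuming the objects sentences, topics, and sent_by_topic created above, is:

```r
# Start from all sentences so that unassigned ones are kept, then join
# the topic assignments and the topic keywords; the sentence ids come
# back from get_sentences_by_topics as character row names, hence the
# as.character() conversion
overview <- tibble(sentence_id = as.character(seq_along(sentences)),
                   sentences = sentences) %>% 
  left_join(select(sent_by_topic, sentence_id, topic), by = "sentence_id") %>% 
  left_join(topics, by = "topic") %>% 
  select(topic, key_words, sentence_id, sentences) %>% 
  arrange(topic, sentence_id)
overview
```
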
```
  topic key_words sentence_id                                    sentences
1     1 lazy, dog           1 The quick brown fox jumps over the lazy dog.
2     1 lazy, dog           2       The lazy dog barks loudly in response.
3     2       fox           3 Foxes are known for their agility and speed.
4     2       fox           5                   In the forest I saw a fox.
5     3      <NA>           4   Dogs are often loyal companions to humans.
6  <NA>      <NA>           6                         Rainbows are pretty.
```

Application to source text

Lastly, we apply the functions from above to the prompt and the two source texts for the writing task.

```r
prompt <- "Summarize the different viewpoints regarding genetically modified (GM) food based on the two provided texts. Include references to the texts and share your own perspective on the topic."

text_1 <- "TEXT 1: Genetically modified food: saving lives, or lining  corporate pockets? (Tom Chivers - The Telegraph, 2014) Genetically modified (GM) crops have not overcome widespread resistance mostly because the industry is tightly controlled by 
biotech companies. That is, the real problem is that genetic engineering is hurting the poor. It makes cotton cheaper to grow 
for highly subsidized American producers, further undercutting the price of cotton and forcing West African producers out of business. Major biotech companies have no financial interest in developing them for African crops and tightly control the technology. The 
methods of transferring genes were developed by universities, but companies now hold the licenses. The companies permit others to do 
research with the technologies but want control over any product commercialized as a result. Several poor nations are trying to 
develop improved versions of local crops, but these efforts have been damaged by the companies' control over the technology. In fact, the companies which develop GM technology will have unprecedented power over the food chain. They have a clear 
battle-plan to achieve their goal of 'consolidation of the entire food chain': an aggressive patenting regime, patenting technologies 
and genetic material. Academic work has shown that how well people are fed is less to do with the actual quantity of food available in 
the world, and more to do with who controls the food chain, and how well the food is distributed. GM, and the ability to patent GM 
technology, place far more power in the hands of major companies. As a result, there would be fewer competitors in the market. These biotech companies might also have more political power and 
might be able to influence safety and health standards."

text_2 <- 'TEXT 2: The Deadly Opposition to Genetically Modified Food (Bjorn Lomborg – Project Syndicate, 2013) Three billion people depend on rice as their staple food, with 
10% at risk for vitamin A deficiency, which causes 250,000 to 500,000 children to go blind each year. Of these, half die within a 
year. A British medical study estimates that, in total, vitamin A deficiency kills 668,000 children under the age of 5 each year. Yet, despite the cost in human lives, anti-GM campaigners have denied efforts to use golden rice to avoid vitamin A deficiency. 
Indian environmental activist, Vandana Shiva, called golden rice "a hoax" that is "creating hunger and malnutrition, not solving it." The NY Times Magazine reported in 2001 that one would need to 
"eat 15 pounds of cooked golden rice a day" to get enough vitamin A. However, two recent studies in the American Journal of Clinical 
Nutrition show that just 50 grams (roughly two ounces) of golden rice can provide 60 percent of the recommended daily intake of 
vitamin A. They show that golden rice is even better than spinach in providing vitamin A to children. Opponents maintain that there are better ways to deal with 
vitamin A deficiency, saying that golden rice is "neither needed nor necessary." They call for supplementation (vitamin pills) and 
fortification (adding vitamin A to staple products), which are described as "cost-effective." However, this is not a sustainable 
solution to vitamin A deficiency. And, while it is cost-effective, recent published estimates indicate that golden rice is much more 
so. Supplementation programs costs $4,300 for every life they save in India, whereas fortification programs cost about $2,700 for each 
life saved. Meanwhile, golden rice would cost just $100 for every life saved from vitamin A deficiency."'

texts <- c(prompt, text_1, text_2)

# Split text into sentences
sentences <- split_into_sentences(texts)

# Get latent semantic analysis
lsa_space <- get_lsa(sentences)

# Get topics (keywords)
topics <- get_topics(lsa_space, threshold = .3) %>% 
  select(-weight) %>% 
  summarise(across(key_words, ~str_c(., collapse = ", ")), .by = topic)

# Get the sentences ordered by topic
sent_by_topic <- get_sentences_by_topics(lsa_space, threshold = .3)
```

The results of this latent semantic analysis, combined with the topic keywords as in the toy example, are shown below:

```
   topic             key_words sentence_id
1      1 golden, rice, vitamin          22
2      2               vitamin          21
3      3                  food          13
4      3                  food          17
5      4      save, cost, life          20
6      4      save, cost, life          27
7      4      save, cost, life          28
8      5               company           4
9      5               company           7
10     5               company          10
11     5               company          27
12     6      cotton, producer           6
13     7          power, might          16
```

TODO

  • Apply this so that EDUs are dynamic and change depending on what participants look at
  • Show how this changes before and after writing onset
  • Also compare to a full EDU identification based on source texts rather than looks to source texts
  • The input unit might not be the sentence but, for example, words or phrases
  • To use sentences produced by participants, I need to extract the final sentence (and words) from the produced text