Aim
In this report we identify EDUs by means of latent semantic analysis (LSA), using sentences as the input unit.
Setup
We need the following R packages:
library(tidyverse)
library(tm)
library(lsa)
library(NLP)
library(openNLP)
library(udpipe)
For lemmatisation we need the udpipe language model.
# Download and load the English model for udpipe
ud_model <- udpipe_download_model(language = "english", model_dir = getwd())
ud_model <- udpipe_load_model(ud_model$file_model)
Functions for EDU identification
The following functions were created and will be applied below.
```r
# Split text into sentences
split_into_sentences <- function(text) {
  # Collapse the input into a single string
  text <- str_c(text, collapse = " ")
  # Remove line breaks
  text <- gsub("[\r\n]", "", text)
  # Create a sentence annotator (openNLP)
  annotator <- Maxent_Sent_Token_Annotator()
  # Annotate the sentence boundaries
  sentence_annotations <- NLP::annotate(text, list(sent_token_annotator = annotator))
  # Split the text into single characters
  text_split <- str_split_1(text, pattern = "")
  # Extract each sentence from its start and end character positions
  sentences <- map_vec(
    seq(nrow(as_tibble(sentence_annotations))),
    ~ str_c(text_split[sentence_annotations$start[.]:sentence_annotations$end[.]], collapse = "")
  )
  return(sentences)
}
# Perform latent semantic analysis
# Note: relies on the udpipe model `ud_model` loaded in the setup
get_lsa <- function(sentences) {
  # Lemmatise each sentence
  lemmatized_sentences <- map_vec(
    sentences,
    ~ udpipe_annotate(ud_model, x = .) %>%
      as.data.frame() %>%
      pull(lemma) %>%
      str_c(collapse = " ")
  )
  # Preprocess the text: convert to lowercase, remove punctuation, numbers,
  # stopwords and superfluous whitespace
  preprocessed_sentences <- tolower(lemmatized_sentences)
  preprocessed_sentences <- removePunctuation(preprocessed_sentences)
  preprocessed_sentences <- removeNumbers(preprocessed_sentences)
  preprocessed_sentences <- removeWords(preprocessed_sentences, stopwords("english"))
  preprocessed_sentences <- str_trim(preprocessed_sentences)
  preprocessed_sentences <- gsub("\\s{2,}", " ", preprocessed_sentences, perl = TRUE)
  # Create a document-term matrix (rows = sentences, columns = terms)
  corpus <- Corpus(VectorSource(preprocessed_sentences))
  dtm <- DocumentTermMatrix(corpus)
  # Perform latent semantic analysis
  lsa_space <- lsa(dtm)
  return(lsa_space)
}
# Find the keywords for each topic
get_topics <- function(lsa_space, threshold = .4) {
  # Because lsa() received a document-term matrix, $dk holds the term loadings.
  # Normalise signs so that the largest absolute loading per topic is positive
  dk_matrix <- apply(lsa_space$dk, 2, function(col) col * sign(col[which.max(abs(col))]))
  # Set remaining negative values to zero
  dk_matrix <- pmax(dk_matrix, 0)
  # Keep the terms whose weight reaches the threshold
  topics <- dk_matrix %>%
    as.data.frame() %>%
    rownames_to_column("key_words") %>%
    as_tibble() %>%
    pivot_longer(-key_words, names_to = "topic", values_to = "weight") %>%
    filter(weight >= threshold) %>%
    mutate(across(topic, ~ str_remove(., "V"))) %>%
    arrange(topic, weight)
  return(topics)
}
# Order the sentences by topic
get_sentences_by_topics <- function(lsa_space, threshold = .4) {
  # Because lsa() received a document-term matrix, $tk holds the sentence loadings.
  # Normalise signs so that the largest absolute loading per topic is positive
  tk_matrix <- apply(lsa_space$tk, 2, function(col) col * sign(col[which.max(abs(col))]))
  # Set remaining negative values to zero
  tk_matrix <- pmax(tk_matrix, 0)
  # Keep the sentences whose weight reaches the threshold
  tk_fin <- tk_matrix %>%
    as.data.frame() %>%
    rownames_to_column("sentence_id") %>%
    pivot_longer(-sentence_id, names_to = "topic", values_to = "weight") %>%
    filter(weight >= threshold) %>%
    mutate(across(topic, ~ str_remove(., "V"))) %>%
    arrange(topic, weight)
  return(tk_fin)
}
```
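The sign normalisation and thresholding steps shared by get_topics and get_sentences_by_topics can be illustrated on a small matrix. The following is a minimal base R sketch, not the code used for the results in this report: the loadings are invented for illustration and flip_sign is a hypothetical helper name. The motivation is that the columns returned by a singular value decomposition are only determined up to their sign, so each column is flipped such that its largest absolute entry is positive before weights below the threshold are dropped.

```r
# Invented loadings: rows are terms, columns are latent dimensions (topics).
loadings <- matrix(
  c(-0.70,  0.10,
    -0.65, -0.05,
     0.05,  0.90,
     0.20, -0.30),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("lazy", "dog", "fox", "rainbow"), c("V1", "V2"))
)

# Hypothetical helper: flip a column so that its largest absolute entry is positive.
flip_sign <- function(col) col * sign(col[which.max(abs(col))])
normalised <- apply(loadings, 2, flip_sign)

# Set the remaining negative weights to zero.
normalised <- pmax(normalised, 0)

# Keep only the term/topic pairs whose weight reaches the threshold.
threshold <- 0.4
keep <- which(normalised >= threshold, arr.ind = TRUE)
keywords <- data.frame(
  key_words = rownames(normalised)[keep[, "row"]],
  topic     = sub("V", "", colnames(normalised)[keep[, "col"]]),
  weight    = normalised[keep],
  stringsAsFactors = FALSE
)
print(keywords)
```

In this made-up example the first column is flipped because its largest absolute entry (-0.70) is negative, so "lazy" and "dog" end up as keywords of topic 1, while "rainbow" falls below the threshold in both columns.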
Toy example
We use the following text as example input. It contains some sentences that are related to each other (e.g. those about a lazy dog) and some that are isolated, like the one about rainbows.
texts <- "The quick brown fox jumps over the lazy dog. The lazy dog barks loudly in response.
Foxes are known for their agility and speed. Dogs are often loyal companions to humans.
In the forest I saw a fox. Rainbows are pretty."
As a first step we separate the text into sentences, which we will use as input units for the latent semantic analysis.
sentences <- split_into_sentences(texts)
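For readers who want to try the sentence-splitting step without an openNLP (Java) installation, a rough stand-in is sketched below. It is not the Maxent annotator used in this report: it simply splits at whitespace that follows a full stop, question mark or exclamation mark, so it will fail on abbreviations and similar cases. The function name split_by_regex is hypothetical.

```r
# Toy text from this report, typed as one string per sentence pair.
toy_text <- c(
  "The quick brown fox jumps over the lazy dog. The lazy dog barks loudly in response.",
  "Foxes are known for their agility and speed. Dogs are often loyal companions to humans.",
  "In the forest I saw a fox. Rainbows are pretty."
)

# Hypothetical stand-in for split_into_sentences(): a regex-based splitter.
split_by_regex <- function(text) {
  # Collapse the input into a single string and replace line breaks by spaces
  text <- paste(text, collapse = " ")
  text <- gsub("[\r\n]+", " ", text)
  # Split at whitespace that is preceded by sentence-final punctuation
  trimws(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]])
}

toy_sentences <- split_by_regex(toy_text)
print(toy_sentences)
```

For the toy text this yields six sentences, which matches the six sentence IDs used in the remainder of this section.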
The get_lsa function performs the latent semantic analysis.
lsa_space <- get_lsa(sentences)
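At its core, latent semantic analysis is a singular value decomposition of the sentence-by-term count matrix, of which only the first few dimensions are kept. The base R sketch below illustrates this idea under some assumptions: the four input strings are lemmatised and stripped of stopwords by hand (udpipe may produce different lemmas), and the number of dimensions k is fixed at 2 for illustration, whereas lsa() chooses the number of dimensions itself.

```r
# Hand-lemmatised, stopword-free versions of four toy sentences (an assumption).
docs <- c(
  "quick brown fox jump lazy dog",
  "lazy dog bark loudly response",
  "fox know agility speed",
  "forest see fox"
)

# Build a sentence-by-term count matrix (rows = sentences, columns = terms).
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
dimnames(dtm) <- list(seq_along(docs), vocab)

# Singular value decomposition: dtm = U %*% diag(d) %*% t(V)
decomp <- svd(dtm)

# Keep the first k dimensions (k = 2 is chosen for illustration only).
k <- 2
sentence_space <- decomp$u[, 1:k]  # one row per sentence
term_space     <- decomp$v[, 1:k]  # one row per term
rownames(sentence_space) <- rownames(dtm)
rownames(term_space)     <- colnames(dtm)

# Rank-k approximation of the original count matrix.
rank_k <- sentence_space %*% diag(decomp$d[1:k]) %*% t(term_space)
```

Because a matrix with sentences in rows is decomposed, the sentence loadings correspond to what lsa() returns as tk and the term loadings to what it returns as dk.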
The get_topics function extracts the topics (EDUs) that the latent semantic analysis has identified as being represented in more than one sentence.
topics <- get_topics(lsa_space) %>%
  select(-weight) %>%
  summarise(across(key_words, ~str_c(., collapse = ", ")), .by = topic)
topics
# A tibble: 2 × 2
  topic key_words
  <chr> <chr>
1 1     lazy, dog
2 2     fox
The get_sentences_by_topics function returns an overview of sentences associated with topics.
sent_by_topic <- get_sentences_by_topics(lsa_space)
Finally, we combine everything to get an overview of topics (EDUs) with their keywords and associated sentences. The analysis relates sentences 1 and 2 to topic 1 and sentences 3 and 5 to topic 2. Topic 3 has only one associated sentence (sentence 4) and no keywords above the threshold, and sentence 6 is unrelated to the remaining sentences.
  topic key_words sentence_id sentences
1 1     lazy, dog 1           The quick brown fox jumps over the lazy dog.
2 1     lazy, dog 2           The lazy dog barks loudly in response.
3 2     fox       3           Foxes are known for their agility and speed.
4 2     fox       5           In the forest I saw a fox.
5 3     <NA>      4           Dogs are often loyal companions to humans.
6 <NA>  <NA>      6           Rainbows are pretty.
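The code that produced this overview is not shown in the report. One way such a table could be assembled is sketched below in base R, under the assumption that the three inputs look like the objects shown above; here they are re-typed by hand from the printed output (without the weight column), so the sketch runs without the LSA packages.

```r
# Inputs re-typed from the output shown in this report (weights omitted).
sentences <- c(
  "The quick brown fox jumps over the lazy dog.",
  "The lazy dog barks loudly in response.",
  "Foxes are known for their agility and speed.",
  "Dogs are often loyal companions to humans.",
  "In the forest I saw a fox.",
  "Rainbows are pretty."
)
topics <- data.frame(
  topic     = c("1", "2"),
  key_words = c("lazy, dog", "fox"),
  stringsAsFactors = FALSE
)
sent_by_topic <- data.frame(
  sentence_id = c("1", "2", "3", "5", "4"),
  topic       = c("1", "1", "2", "2", "3"),
  stringsAsFactors = FALSE
)

# Lookup table that maps sentence IDs to the sentence text.
sentence_tbl <- data.frame(
  sentence_id = as.character(seq_along(sentences)),
  sentences   = sentences,
  stringsAsFactors = FALSE
)

# Add keywords to each sentence/topic pair; topics without keywords get NA.
overview <- merge(sent_by_topic, topics, by = "topic", all.x = TRUE)
# Add the sentence text; sentences without a topic are kept with NA.
overview <- merge(overview, sentence_tbl, by = "sentence_id", all = TRUE)

# Order by topic, then by sentence ID, and arrange the columns.
overview <- overview[
  order(overview$topic, as.integer(overview$sentence_id)),
  c("topic", "key_words", "sentence_id", "sentences")
]
rownames(overview) <- NULL
print(overview)
```

Keeping unmatched rows in both merges is what makes topic 3 (no keywords) and sentence 6 (no topic) appear with missing values rather than being dropped.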
Application to source text
Lastly, we apply the functions from above to the full text input for the writing task (prompt and source texts).
```r
prompt <- "Summarize the different viewpoints regarding genetically modified (GM) food based on the two provided texts. Include references to the texts and share your own perspective on the topic."
text_1 <- "TEXT 1: Genetically modified food: saving lives, or lining corporate pockets? (Tom Chivers - The Telegraph, 2014) Genetically modified (GM) crops have not overcome widespread resistance mostly because the industry is tightly controlled by
biotech companies. That is, the real problem is that genetic engineering is hurting the poor. It makes cotton cheaper to grow
for highly subsidized American producers, further undercutting the price of cotton and forcing West African producers out of business. Major biotech companies have no financial interest in developing them for African crops and tightly control the technology. The
methods of transferring genes were developed by universities, but companies now hold the licenses. The companies permit others to do
research with the technologies but want control over any product commercialized as a result. Several poor nations are trying to
develop improved versions of local crops, but these efforts have been damaged by the companies' control over the technology. In fact, the companies which develop GM technology will have unprecedented power over the food chain. They have a clear
battle-plan to achieve their goal of 'consolidation of the entire food chain': an aggressive patenting regime, patenting technologies
and genetic material. Academic work has shown that how well people are fed is less to do with the actual quantity of food available in
the world, and more to do with who controls the food chain, and how well the food is distributed. GM, and the ability to patent GM
technology, place far more power in the hands of major companies. As a result, there would be fewer competitors in the market. These biotech companies might also have more political power and
might be able to influence safety and health standards."
text_2 <- 'TEXT 2: The Deadly Opposition to Genetically Modified Food (Bjorn Lomborg – Project Syndicate, 2013) Three billion people depend on rice as their staple food, with
10% at risk for vitamin A deficiency, which causes 250,000 to 500,000 children to go blind each year. Of these, half die within a
year. A British medical study estimates that, in total, vitamin A deficiency kills 668,000 children under the age of 5 each year. Yet, despite the cost in human lives, anti-GM campaigners have denied efforts to use golden rice to avoid vitamin A deficiency.
Indian environmental activist, Vandana Shiva, called golden rice "a hoax" that is "creating hunger and malnutrition, not solving it. The NY Times Magazine reported in 2001 that one would need to
"eat 15 pounds of cooked golden rice a day" to get enough vitamin A. However, two recent studies in the American Journal of Clinical
Nutrition show that just 50 grams (roughly two ounces) of golden rice can provide 60 percent of the recommended daily intake of
vitamin A. They show that golden rice is even better than spinach in providing vitamin A to children. Opponents maintain that there are better ways to deal with
vitamin A deficiency, saying that golden rice is "neither needed nor necessary." They call for supplementation (vitamin pills) and
fortification (adding vitamin A to staple products), which are described as "cost-effective." However, this is not a sustainable
solution to vitamin A deficiency. And, while it is cost-effective, recent published estimates indicate that golden rice is much more
so. Supplementation programs costs $4,300 for every life they save in India, whereas fortification programs cost about $2,700 for each
life saved. Meanwhile, golden rice would cost just $100 for every life saved from vitamin A deficiency."'
texts <- c(prompt, text_1, text_2)
```
# Split text into sentences
sentences <- split_into_sentences(texts)
# Get latent semantic analysis
lsa_space <- get_lsa(sentences)
# Get topics (keywords)
topics <- get_topics(lsa_space, threshold = .3) %>%
  select(-weight) %>%
  summarise(across(key_words, ~str_c(., collapse = ", ")), .by = topic)
# Sort topic by sentence
sent_by_topic <- get_sentences_by_topics(lsa_space, threshold = .3)
The results of this latent semantic analysis are shown below:
   topic key_words             sentence_id
1  1     golden, rice, vitamin 22
2  2     vitamin               21
3  3     food                  13
4  3     food                  17
5  4     save, cost, life      20
6  4     save, cost, life      27
7  4     save, cost, life      28
8  5     company               4
9  5     company               7
10 5     company               10
11 5     company               27
12 6     cotton, producer      6
13 7     power, might          16
TODO
- Apply this so that EDUs are dynamic and change depending on what participants look at
- Show how this changes before and after writing onset
- Also compare to a full EDU identification based on source texts rather than looks to source texts
- Input unit might not be the sentence but for example words or phrases
- For using sentences produced by participant, I need to get the final sentence (also words) from the produced text