# A tibble: 49,343 × 7
   document_document_id  source_name title publication_date    text  probability
   <chr>                 <chr>       <chr> <dttm>              <chr>       <dbl>
 1 7V2B-8WR1-2PBV-B4HN-… The Associ… Sovi… 1991-04-02 19:00:00 "Whe…       0.517
 2 3TDD-S850-0031-V23G-… Agence Fra… Firs… 1991-09-08 20:00:00 "The…       0.883
 3 3SJ4-DY20-0007-W197-… RusData Di… BALT… 1991-09-17 20:00:00 "WAS…       0.929
 4 49R6-55B0-01VR-924M-… CNN.com     Afgh… 1991-11-10 19:00:00 "A d…       0.966
 5 3SJ4-D9W0-0008-C28H-… The Associ… Afgh… 1991-11-14 19:00:00 "The…       0.904
 6 4D07-C0T0-009F-R125-… The Associ… paki… 1991-11-15 19:00:00 "pak…       0.708
 7 3TDD-S2B0-0031-V1N2-… Agence Fra… Afgh… 1991-11-17 19:00:00 "Afg…       0.664
 8 549W-G371-JBTF-631F-… The Nation  Kabu… 1991-12-20 19:00:00 "The…       0.691
 9 41FB-23C0-0041-709H-… Official K… Kabu… 1991-12-20 19:00:00 "The…       0.690
10 3TDD-RYF0-0031-V1N5-… Agence Fra… Mosc… 1991-12-21 19:00:00 "In …       0.674
# ℹ 49,333 more rows
# ℹ 1 more variable: dispute <chr>
Whittle this broad corpus down
You now need to work out which of these articles actually include information on your events and actors.
We are going to ask GPT to tell us the following:
Identify with ‘yes’ or ‘no’ whether the following article mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries.
Building your prompt
Your prompt needs to be:
Specific,
Concise,
Simple.
Working with a single article
Let’s head over to ChatGPT and see this classification task in action:
How did it go?
Working with many articles
This is great, but what if you have many, many articles that you need to code?
Let’s use R to help us out!
For each article, we want to:
Build a prompt,
Run a GPT model of our choice against that prompt,
Record its response in a data frame.
Building your prompt
First, we need to start with our base prompt:
base_prompt <- "Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>{article_body}</article>"

base_prompt
[1] "Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>{article_body}</article>"
Next, we need to include our article text:
article_body <- all_articles |>
  # Select the first article
  slice(1) |>
  # Pull out the text
  pull(text)

article_body
[1] "When Yelena Khanga first visited the United States five years ago, she told people she was Russian.\n\n \n\n \"Everyone was very polite,\" Khanga told a small gathering of students and teachers Tuesday at the University of South Florida (USF). \"They treated me like a guest.\"\n\n But for her next visit, Khanga, who is black, decided to keep quiet.\n\n \n\n \"Now . . . I just try to pass like a black American,\" said Khanga, 25. She said: \"I think I know what it is to be black American in United States.\"\n\n \n\n Khanga, a journalist from the Russianrepublic of the Soviet Union, lives in New York City. She spoke Tuesday at the University of Tampa about her life. In addition, Khanga spoke at USF and filmed a segment of The Bridge, a WUSF-TV talk show.\n\n \n\n She spoke with zeal about many aspects of Russian glasnost, feminism and the KGB among them.\n\n \n\n She also spoke about race. In the United States, she said, she had the feeling of being treated differently because of her race. \"I won't get into specifics,\" she said.\n\n \n\n On the other hand, in her native land, Khanga said, she thinks few race relations problems surface because, unlike the United States, Russia does not have a history of African slavery.\n\n \n\n Although she talked about serious tensions between different Z ethnic groups, Khanga said the country, particularly Moscow, has many interracial families. She said a number of Russians are of African descent.\n\n \n\n Some at the lecture said they find it hard to conceive of such a racially integrated society.\n\n \n\n \"I'm not so sure,\" said Troy Collier, an associate dean of students at USF. \"I suspect there is a race problem in Russia.\" Collier said he thinks color makes a difference, especially to the people who control a country.\n\n \n\n Khanga is the great-grandaughter of an ex-slave who lived in Mississippi. 
Her grandfather attended Tuskegee Institute in Alabama and moved to New York where he met and married Khanga's grandmother, a Jewish woman.\n\n \n\n The couple was among a group of 16 black families who left the United States in the 1930s to escape the Depression and find what they hoped would be fairer treatment under communism.\n\n \n\n The families settled in Uzbekistan,near the border with Afghanistan, primarily because the Moslems in the area were people of color. Some of the black families remained in the area and some eventually returned to the United States.\n\n \n\n Khanga's mother and father, who was born in Tanzania, lived in Uzbekistan. Khanga attended Moscow State University. Since graduating, she has worked six years for the Moscow News, a major Soviet Union newspaper, and she has visited the United States eight times.\n\n \n\n Two months ago, after extensive research, Khanga visited the site of her great-grandfather's home in Mississippi.\n\n \n\n \"It was so exciting to know that she was . . . Russian,\" said Wanda Lewis Campbell, an associate dean of students at USF, \"but at the same time she looks like one of my cousins.\""
Finally, we need to add this article body into our prompt:
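One way to do this is sketched below, assuming the glue package (the same package the article_classification() function uses later on). glue() replaces the {article_body} placeholder with the value of the R object of that name. The short article_body here is a made-up stand-in so the sketch runs on its own; in practice you would use the article text pulled out in the previous step.

```r
library(glue)

# The base prompt, with an {article_body} placeholder for glue to fill in
base_prompt <- "Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>{article_body}</article>"

# Made-up stand-in text, for illustration only
article_body <- "A short example article."

# glue() looks up article_body in the current environment and substitutes it
article_prompt <- glue(base_prompt)

article_prompt
```

Run with the real article_body, this gives the full prompt: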
Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>When Yelena Khanga first visited the United States five years ago, she told people she was Russian.
"Everyone was very polite," Khanga told a small gathering of students and teachers Tuesday at the University of South Florida (USF). "They treated me like a guest."
But for her next visit, Khanga, who is black, decided to keep quiet.
"Now . . . I just try to pass like a black American," said Khanga, 25. She said: "I think I know what it is to be black American in United States."
Khanga, a journalist from the Russianrepublic of the Soviet Union, lives in New York City. She spoke Tuesday at the University of Tampa about her life. In addition, Khanga spoke at USF and filmed a segment of The Bridge, a WUSF-TV talk show.
She spoke with zeal about many aspects of Russian glasnost, feminism and the KGB among them.
She also spoke about race. In the United States, she said, she had the feeling of being treated differently because of her race. "I won't get into specifics," she said.
On the other hand, in her native land, Khanga said, she thinks few race relations problems surface because, unlike the United States, Russia does not have a history of African slavery.
Although she talked about serious tensions between different Z ethnic groups, Khanga said the country, particularly Moscow, has many interracial families. She said a number of Russians are of African descent.
Some at the lecture said they find it hard to conceive of such a racially integrated society.
"I'm not so sure," said Troy Collier, an associate dean of students at USF. "I suspect there is a race problem in Russia." Collier said he thinks color makes a difference, especially to the people who control a country.
Khanga is the great-grandaughter of an ex-slave who lived in Mississippi. Her grandfather attended Tuskegee Institute in Alabama and moved to New York where he met and married Khanga's grandmother, a Jewish woman.
The couple was among a group of 16 black families who left the United States in the 1930s to escape the Depression and find what they hoped would be fairer treatment under communism.
The families settled in Uzbekistan,near the border with Afghanistan, primarily because the Moslems in the area were people of color. Some of the black families remained in the area and some eventually returned to the United States.
Khanga's mother and father, who was born in Tanzania, lived in Uzbekistan. Khanga attended Moscow State University. Since graduating, she has worked six years for the Moscow News, a major Soviet Union newspaper, and she has visited the United States eight times.
Two months ago, after extensive research, Khanga visited the site of her great-grandfather's home in Mississippi.
"It was so exciting to know that she was . . . Russian," said Wanda Lewis Campbell, an associate dean of students at USF, "but at the same time she looks like one of my cousins."</article>
Run a GPT model against this prompt
The end goal:
library(httr2)

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers(
    "Content-Type" = "application/json",
    "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_KEY_NSF"))
  ) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo-0613",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)
Building your API request
We are going to take advantage of the fantastic httr2 R package to work with the OpenAI API.
First, we need to build our request to the API. You need:
The API endpoint,
The content type with which you would like to work,
Your authorization to use the API,
Your chosen GPT model,
Your model parameters.
The API endpoint
The endpoint depends on the type of GPT model you want to use.
Sometimes, your request will fail. You can ask R to retry your request using httr2::req_retry().
req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers(
    "Content-Type" = "application/json",
    "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_PERSONAL"))
  ) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  # Retry up to three times if the request fails
  req_retry(max_tries = 3)
Run the GPT model for one article
We are ready to make that API request!
We have our prompt:
article_prompt
Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>When Yelena Khanga first visited the United States five years ago, she told people she was Russian.
"Everyone was very polite," Khanga told a small gathering of students and teachers Tuesday at the University of South Florida (USF). "They treated me like a guest."
But for her next visit, Khanga, who is black, decided to keep quiet.
"Now . . . I just try to pass like a black American," said Khanga, 25. She said: "I think I know what it is to be black American in United States."
Khanga, a journalist from the Russianrepublic of the Soviet Union, lives in New York City. She spoke Tuesday at the University of Tampa about her life. In addition, Khanga spoke at USF and filmed a segment of The Bridge, a WUSF-TV talk show.
She spoke with zeal about many aspects of Russian glasnost, feminism and the KGB among them.
She also spoke about race. In the United States, she said, she had the feeling of being treated differently because of her race. "I won't get into specifics," she said.
On the other hand, in her native land, Khanga said, she thinks few race relations problems surface because, unlike the United States, Russia does not have a history of African slavery.
Although she talked about serious tensions between different Z ethnic groups, Khanga said the country, particularly Moscow, has many interracial families. She said a number of Russians are of African descent.
Some at the lecture said they find it hard to conceive of such a racially integrated society.
"I'm not so sure," said Troy Collier, an associate dean of students at USF. "I suspect there is a race problem in Russia." Collier said he thinks color makes a difference, especially to the people who control a country.
Khanga is the great-grandaughter of an ex-slave who lived in Mississippi. Her grandfather attended Tuskegee Institute in Alabama and moved to New York where he met and married Khanga's grandmother, a Jewish woman.
The couple was among a group of 16 black families who left the United States in the 1930s to escape the Depression and find what they hoped would be fairer treatment under communism.
The families settled in Uzbekistan,near the border with Afghanistan, primarily because the Moslems in the area were people of color. Some of the black families remained in the area and some eventually returned to the United States.
Khanga's mother and father, who was born in Tanzania, lived in Uzbekistan. Khanga attended Moscow State University. Since graduating, she has worked six years for the Moscow News, a major Soviet Union newspaper, and she has visited the United States eight times.
Two months ago, after extensive research, Khanga visited the site of her great-grandfather's home in Mississippi.
"It was so exciting to know that she was . . . Russian," said Wanda Lewis Campbell, an associate dean of students at USF, "but at the same time she looks like one of my cousins."</article>
Which we can insert into our fully fleshed out request:
req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers(
    "Content-Type" = "application/json",
    "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_PERSONAL"))
  ) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = article_prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)
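# Perform your request
resp <- req_perform(req)

# Pull the model's answer out of the response
pred <- resp_body_json(resp)$choices[[1]]$message$content

# Save the answer alongside the article text
df <- tibble(
  body = article_body,
  gpt_pred = pred
)

df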
# A tibble: 1 × 2
body gpt_pred
<chr> <chr>
1 "When Yelena Khanga first visited the United States five years ago, … No
Congratulations!
You have now used an advanced large language model to identify whether an event is referenced in a news article.
Let’s set you up to do that across all 49,343 articles.
Building your article reader function
End goal:
article_classification <- function(article_body) {

  # Create your prompt
  article_prompt <- glue::glue("Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>{article_body}</article>")

  # Build your request
  req <- request("https://api.openai.com/v1/chat/completions") |>
    req_headers(
      "Content-Type" = "application/json",
      "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_PERSONAL"))
    ) |>
    req_body_json(
      list(
        "model" = "gpt-3.5-turbo",
        "messages" = list(
          list(
            "role" = "system",
            "content" = "You are a helpful assistant."
          ),
          list(
            "role" = "user",
            "content" = article_prompt
          )
        ),
        "temperature" = 0,
        "max_tokens" = 1
      )
    ) |>
    req_retry(max_tries = 3)

  # Perform your request
  resp <- req_perform(req)

  # Clean up the response
  pred <- resp_body_json(resp)$choices[[1]]$message$content

  # Save the response
  df <- tibble(
    body = article_body,
    gpt_pred = pred
  )

  return(df)
}
# A tibble: 5 × 2
body gpt_pred
<chr> <chr>
1 "When Yelena Khanga first visited the United States five years ago, … No
2 "The first Soviet delegation to visit Afghanistan since the failed h… Yes
3 "WASHINGTON - President George Bush met Tuesday with the presidents … Yes
4 "A delegation of Afghan mujahedeen rebels met Russian Vice-President… Yes
5 "The chief guerrilla delegate at talks on ending Afghanistan's civil… Yes
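One way to produce a table like the one above is to apply the function to each article in turn and stack the one-row results. The sketch below uses only base R. The names classify_articles() and fake_classifier() are hypothetical helpers introduced here for illustration, not part of the original workflow. The stand-in classifier lets the sketch run without an API key; a failed article is recorded as NA rather than stopping the whole run.

```r
# Hypothetical helper: apply a classifier to each article and stack the results.
# `classify` is any function that takes one article body and returns a one-row
# data frame with columns `body` and `gpt_pred`, such as article_classification().
classify_articles <- function(article_bodies, classify) {
  results <- lapply(article_bodies, function(body) {
    tryCatch(
      classify(body),
      # If one article fails, keep going and record a missing prediction
      error = function(e) data.frame(body = body, gpt_pred = NA_character_)
    )
  })

  # Bind the one-row data frames into a single data frame
  do.call(rbind, results)
}

# Stand-in classifier (made up for this sketch) so the code runs offline
fake_classifier <- function(body) {
  if (nchar(body) == 0) stop("Empty article.")
  data.frame(
    body = body,
    gpt_pred = if (grepl("met", body, fixed = TRUE)) "Yes" else "No"
  )
}

demo <- classify_articles(
  c("Two presidents met in Geneva.", "", "A profile of a journalist."),
  fake_classifier
)

demo
```

With the real function, the call would look something like classify_articles(head(all_articles$text, 5), article_classification).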
Some tips: token limits
There are (very large) limits on the length of your prompts. These limits are measured in tokens, not characters.
For gpt-3.5-turbo, the limit is 4,096 tokens.
A token is roughly four characters (in English).
To make sure your prompts do not go over this limit, add this:
article_classification <- function(article_body) {

  # Create your prompt
  article_prompt <- glue::glue("Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>{article_body}</article>")

  # Only send the request if the estimated token count (roughly four
  # characters per token) is within the model's limit
  if (nchar(article_prompt) / 4 <= 4096) {

    # Build your request
    req <- request("https://api.openai.com/v1/chat/completions") |>
      req_headers(
        "Content-Type" = "application/json",
        "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_PERSONAL"))
      ) |>
      req_body_json(
        list(
          "model" = "gpt-3.5-turbo",
          "messages" = list(
            list(
              "role" = "system",
              "content" = "You are a helpful assistant."
            ),
            list(
              "role" = "user",
              "content" = article_prompt
            )
          ),
          "temperature" = 0,
          "max_tokens" = 1
        )
      ) |>
      req_retry(max_tries = 3)

    # Perform your request
    resp <- req_perform(req)

    # Clean up the response
    pred <- resp_body_json(resp)$choices[[1]]$message$content

    # Save the response
    df <- tibble(
      body = article_body,
      gpt_pred = pred
    )

    return(df)

  } else {
    stop("Prompt exceeds token limit.")
  }
}
Some tips: clean your input
You will often have junk in your text (for example, paragraph delimiters or news agency bylines). Removing this:
Increases the likelihood that you won’t reach the token limit,
Reduces your use costs,
Can produce more accurate results.
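As a rough illustration of the cleaning step, the sketch below collapses the runs of newlines and spaces that separate paragraphs in the raw article text. clean_article() is a hypothetical helper name. It does not attempt to remove bylines, which would need rules specific to each news source.

```r
# Hypothetical helper: strip paragraph delimiters and extra whitespace
clean_article <- function(text) {
  # Collapse any run of whitespace (spaces, tabs, newlines) into one space
  text <- gsub("[[:space:]]+", " ", text)

  # Drop whitespace left at the start or end of the text
  trimws(text)
}

# A fragment in the same shape as the raw article text shown earlier
raw <- "She also spoke about race.\n\n \n\n \"I won't get into specifics,\" she said.\n\n "

cleaned <- clean_article(raw)

cleaned

# The cleaned text is shorter, so it uses fewer tokens
nchar(raw) - nchar(cleaned)
```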
Some tips: evaluating your model
You should check whether or not the model is correctly identifying relevant articles.
Select some labelled articles at random and hand code them.
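Once you have hand coded a sample, you can compare your codes with the model's. The sketch below is one possible way to do that in base R. evaluate_predictions() is a hypothetical helper, and the two label vectors are made up purely to show the calculation. It reports overall accuracy, precision (of the articles GPT labelled "yes", the share you also coded "yes"), recall (of the articles you coded "yes", the share GPT found), and a cross-tabulation.

```r
# Hypothetical helper: compare GPT labels with hand-coded labels
evaluate_predictions <- function(gpt_pred, hand_code) {
  # Normalise case and whitespace so "Yes", "yes " and "YES" all match
  pred <- tolower(trimws(gpt_pred))
  truth <- tolower(trimws(hand_code))

  # Articles both GPT and the hand coder labelled "yes"
  true_pos <- sum(pred == "yes" & truth == "yes")

  list(
    accuracy = mean(pred == truth),
    precision = true_pos / sum(pred == "yes"),
    recall = true_pos / sum(truth == "yes"),
    confusion = table(gpt = pred, hand = truth)
  )
}

# Made-up labels for illustration only
gpt_pred <- c("No", "Yes", "Yes", "Yes", "Yes", "No")
hand_code <- c("no", "yes", "yes", "no", "yes", "yes")

evaluation <- evaluate_predictions(gpt_pred, hand_code)

evaluation
```

To draw the random sample in the first place, base R's sample() can pick row numbers, for example sample(nrow(df), 100) for 100 articles from a data frame of results called df.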